Math1005 Notes

Lecture‌‌1‌ ‌

Controlled‌‌Experiment:‌ ‌
● We‌‌want‌‌to‌‌study‌‌whether‌‌the‌t‌reatment‌‌‌causes‌‌the‌r‌ esponse‌ ‌
● The‌‌response‌‌that‌‌we‌‌want‌‌may‌‌be‌‌caused‌‌by‌‌other‌‌factors/variable‌ ‌
● Hence,‌‌optimally,‌‌we‌‌conduct‌‌2‌‌parallel‌‌experiments‌‌which‌o ‌ nly‌‌‌differ‌‌in‌‌whether‌
the‌‌treatment‌‌is‌‌administered‌ ‌or‌‌not‌ ‌
● This‌‌is‌‌called‌‌controlled‌‌experiment‌‌i.e.‌‌we‌c ‌ ontrol‌‌‌the‌‌effects‌‌of‌‌the‌‌other‌‌
variables‌‌on‌‌the‌‌treatment‌ ‌
Confounding‌ ‌
● Confounding‌‌occurs‌‌when‌‌the‌‌effect‌‌of‌‌one‌‌variable‌‌(X)‌‌on‌‌another‌‌variable‌‌(Y)‌‌is‌‌
clouded‌‌by‌‌the‌‌influence‌‌of‌‌another‌‌variable‌‌(Z)‌ ‌









Such a diagram is called a "causal graph"

Bias‌ ‌
● Means‌‌that‌‌the‌‌quantity‌‌of‌‌interest‌‌is‌‌systematically‌‌under‌‌or‌‌overestimated‌ ‌
● Bias‌‌is‌‌often‌‌caused‌‌by‌‌a‌‌confounding‌‌variable‌‌but‌‌it‌‌can‌‌also‌‌have‌‌other‌‌causes‌‌
and‌‌sometimes‌‌it‌‌can‌‌even‌‌be‌‌desired‌ ‌
● In‌‌this‌‌module‌‌we‌‌will‌‌only‌‌consider‌‌bias‌‌due‌‌to‌‌confounding‌‌variables.‌‌This‌‌is‌‌
bias‌‌we‌‌want‌‌to‌‌avoid‌ ‌

Types‌‌of‌‌Bias:‌ ‌
● Selection Bias: If the treatment group is not comparable to the control group, then the differences between the two groups can confound the effect of the treatment
● Observer‌‌Bias:‌ ‌
○ If‌‌the‌‌subjects‌‌or‌‌investigators‌‌are‌‌aware‌‌of‌‌the‌‌identity‌‌of‌‌the‌‌two‌‌groups,‌‌
we‌‌can‌‌get‌‌bias‌‌in‌‌either‌‌the‌‌responses‌‌or‌‌evaluations,‌‌as‌‌they‌‌may‌‌
deliberately‌‌or‌‌subconsciously‌‌report‌‌more‌‌or‌‌less‌‌favourable‌‌results‌ ‌
○ In‌‌fact,‌‌the‌‌subject‌‌may‌‌even‌‌respond‌‌to‌‌the‌‌idea‌‌of‌‌the‌‌treatment‌‌-‌‌this‌‌is‌‌
called‌p‌ lacebo‌‌effect‌ ‌


● The‌‌placebo‌‌‌is‌‌a‌‌pretend‌‌treatment.‌‌It‌‌is‌‌designed‌‌to‌‌be‌‌neutral‌‌and‌‌
indistinguishable‌‌from‌‌the‌‌treatment‌ ‌
The‌‌placebo‌‌effect‌‌‌is‌‌an‌‌effect‌‌which‌‌occurs‌‌from‌‌the‌‌subject‌‌thinking‌‌they‌‌have‌‌
had‌‌the‌‌treatment‌
● Consent‌‌Bias:‌‌‌Can‌‌occur‌‌when‌‌subjects‌‌choose‌‌whether‌‌or‌‌not‌‌they‌‌take‌‌part‌‌in‌‌
the‌‌experiment‌ ‌
● This‌‌quickly‌‌raises‌‌many‌‌ethical‌‌questions‌ ‌
● How‌‌can‌‌we‌‌avoid‌‌consent‌‌bias?‌ ‌
● Who‌‌determines‌‌who‌‌is‌‌part‌‌of‌‌each‌‌group‌ ‌
● It‌‌may‌‌be‌‌unethical‌‌to‌‌withhold‌‌treatment‌‌for‌‌those‌‌in‌‌the‌‌control‌‌group‌‌or‌‌enforce‌‌
treatment‌‌for‌‌those‌‌in‌‌the‌‌treatment‌‌group‌ ‌

Solution‌‌for‌‌Selection‌‌and‌‌Observer‌‌Bias:‌ ‌
● We‌‌need‌‌to‌‌conduct‌‌a‌R ‌ andomised‌‌Controlled‌‌Double-Blind‌‌‌Trial‌‌where‌‌both‌‌
the‌‌subjects‌‌(“single‌‌blind”)‌‌and‌‌investigators‌‌(“double‌‌blind”)‌‌are‌‌not‌‌aware‌‌of‌‌the‌‌
identity‌‌of‌‌the‌‌groups‌ ‌
● In addition, we need to control the patient's expectations (i.e. their response) and the investigator's observations (evaluation of the response).
● To‌‌do‌‌so‌‌we‌‌usually:‌ ‌
● Have‌‌a‌‌3rd‌‌party‌‌administrator‌‌of‌‌the‌‌treatment‌‌and‌‌placebo‌ ‌
● Design‌‌the‌‌placebo‌‌to‌‌mimic‌‌the‌‌treatment‌‌as‌‌much‌‌as‌‌possible‌ ‌

Summary‌ ‌
● The‌‌design‌‌of‌‌a‌‌statistical‌‌study‌‌is‌‌critical‌‌in‌‌order‌‌to‌‌obtain‌‌results‌‌that‌‌can‌‌be‌‌
generalised.‌‌The‌‌best‌‌method‌‌for‌‌comparison‌‌is‌‌a‌‌controlled‌‌randomised‌‌
double-blind‌‌trial,‌‌but‌‌this‌‌is‌‌often‌‌not‌‌possible‌ ‌

Lecture‌‌2‌ ‌
The‌‌Need‌‌for‌‌Observational‌‌Studies‌ ‌
● In observational studies, the assignment of subjects into treatment and control groups is outside the control of the investigator
● Many‌‌research‌‌questions‌‌require‌‌an‌o ‌ bservational‌‌study‌,‌‌rather‌‌than‌‌a‌‌
controlled‌‌experiment‌ ‌
● The‌‌conclusions‌‌of‌‌observational‌‌studies‌‌require‌‌great‌‌care‌ ‌
● An‌‌observational‌‌study‌‌is‌‌one‌‌in‌‌which‌‌the‌‌investigator‌‌has‌‌no‌‌control‌‌over‌‌the‌‌
subjects‌‌or‌‌qualities‌‌of‌‌interest;‌‌she‌‌is‌‌just‌‌an‌‌observer.‌‌In‌‌particular‌‌the‌‌
investigator‌‌cannot‌‌use‌‌randomisation‌‌for‌‌allocation‌‌into‌‌groups‌ ‌

Precautions‌ ‌


● It‌‌is‌‌very‌‌difficult‌‌to‌‌establish‌‌causation‌ ‌
○ It‌‌is‌‌rather‌‌easy‌‌to‌‌establish‌‌association‌‌(that‌‌one‌‌thing‌‌is‌‌linked‌‌to‌‌
another)‌ ‌
■ Association‌‌may‌s ‌ uggest‌‌‌causation‌ ‌
■ But‌‌association‌‌does‌‌not‌p ‌ rove‌‌‌causation‌ ‌
○ Observational‌‌Studies‌‌can‌‌have‌‌misleading‌‌hidden‌‌confounders‌ ‌
■ Confounders‌‌can‌‌be‌‌hard‌‌to‌‌find,‌‌and‌‌can‌‌mislead‌‌about‌‌a‌‌cause‌‌
and‌‌effect‌‌relationship‌ ‌
● Observational‌‌studies‌‌with‌‌a‌‌confounding‌‌variable‌‌can‌‌lead‌‌to‌‌Simpson’s‌‌
Paradox‌ ‌
○ Simpson’s‌‌Paradox‌‌‌(or‌‌the‌‌reversing‌‌paradox)‌‌was‌‌first‌‌mentioned‌‌by‌‌
British‌‌statistician‌‌Udny‌‌Yule‌‌in‌‌1903.‌‌It‌‌was‌‌named‌‌after‌‌Edward‌‌H.‌‌
Simpson‌ ‌
○ Sometimes‌‌there‌‌is‌‌a‌‌clear‌‌trend‌‌in‌i‌ndividual‌‌‌groups‌‌of‌‌data‌‌that‌‌
reverses‌‌when‌‌groups‌‌are‌p ‌ ooled‌‌‌together‌ ‌
■ It‌‌occurs‌‌when‌‌relationships‌‌between‌‌percentages‌‌in‌‌subgroups‌‌are‌‌
reversed‌‌when‌‌the‌‌subgroups‌‌are‌‌combined,‌‌because‌‌of‌‌a‌‌
confounding‌‌or‌‌lurking‌‌variable‌ ‌
■ The‌‌association‌‌between‌‌a‌‌pair‌‌of‌‌variables‌‌(X,‌‌Y)‌‌reverses‌‌sign‌‌
upon‌‌conditioning‌‌of‌‌a‌‌third‌‌variable‌‌Z,‌‌regardless‌‌of‌‌the‌‌value‌‌
taken‌‌by‌‌Z.‌‌ ‌
● Historical‌‌control‌ ‌
○ Some‌‌studies‌‌present‌‌themselves‌‌as‌‌a‌‌controlled‌‌experiment,‌‌but‌‌on‌‌
further‌‌examination,‌‌there‌‌is‌‌a‌‌historical‌‌control‌‌‌and‌t‌ ime‌‌‌is‌‌a‌‌
confounding‌‌variable.‌‌(Note:‌‌This‌‌is‌‌partly‌‌observational‌‌and‌‌partly‌‌an‌‌
experiment)‌ ‌
○ Investigators‌‌might‌‌compare‌‌the‌‌effect‌‌of‌‌a‌‌new‌‌medication‌‌on‌‌current‌‌
patients,‌‌with‌‌an‌‌old‌‌medication‌‌on‌p ‌ ast‌‌‌patients.‌‌The‌‌Treatment‌‌Group‌‌
(new‌‌drug)‌‌and‌‌the‌‌historical‌‌Control‌‌Group‌‌(old‌‌drug)‌‌may‌‌differ‌‌in‌‌
aspects‌‌beside‌‌the‌‌treatment‌ ‌
○ Controlled‌‌experiments‌‌need‌‌to‌‌be‌‌performed‌‌in‌‌the‌‌same‌‌time‌‌period‌‌
(contemporaneously)‌ ‌
Uses‌‌for‌‌the‌‌word‌‌‘Control’‌ ‌
● A‌‌control‌‌=‌‌a‌‌subject‌‌who‌‌did‌‌not‌‌get‌‌the‌‌treatment‌ ‌
● A‌‌controlled‌‌experiment‌‌=‌‌a‌‌study/experiment‌‌where‌‌the‌‌investigators‌‌allocate‌‌
subjects‌‌into‌‌different‌‌groups‌ ‌
● Controlling‌‌for‌‌confounders‌‌=‌‌trying‌‌to‌‌reduce‌‌the‌‌influence‌‌of‌‌confounding‌‌
variables‌ ‌
Summary‌ ‌


● Many statistical studies involve observational data, and so we need to be very careful with interpretation errors such as confusing association with causation, misleading confounders, Simpson's Paradox and historical controls

Lecture‌‌3‌ ‌
What‌‌is‌‌data?‌ ‌
● Data‌‌is‌‌information‌‌about‌‌the‌‌set‌‌of‌s ‌ ubjects‌‌‌being‌‌studied‌‌(like‌‌road‌‌fatalities)‌ ‌
○ Most‌‌commonly,‌‌data‌‌refers‌‌to‌‌the‌‌sample‌‌not‌‌the‌‌population‌ ‌
Different‌‌types‌‌of‌‌data‌ ‌
● There‌‌are‌‌different‌‌types‌‌of‌‌data,‌‌in‌‌different‌‌formats,‌‌for‌‌example:‌ ‌
○ Survey‌‌data‌ ‌
○ Spreadsheet‌‌type‌‌data‌ ‌
○ MRI‌‌image‌‌data‌ ‌
Big‌‌data‌ ‌
● Big‌‌data‌‌refers‌‌to‌‌the‌‌massive‌‌amounts‌‌of‌‌data‌‌being‌‌collected‌ ‌
● Big data is commonly high dimensional, which means that there are more variables p than subjects n
○ For example, genomics data can have 3 billion variables, as a person's DNA sequence is 3 billion base pairs long
○ Measurements taken every millisecond
○ Image‌‌data‌‌or‌‌video‌‌data‌ ‌
● Big‌‌data‌‌requires‌‌more‌‌complex‌‌visualisations‌ ‌
Initial‌‌Data‌‌Analysis‌‌(IDA)‌ ‌
● Initial‌‌Data‌‌Analysis‌‌is‌‌a‌‌first‌‌general‌‌look‌‌at‌‌the‌‌data,‌‌without‌‌formally‌‌answering‌‌
the‌‌research‌‌questions‌ ‌
○ IDA helps you to see whether the data can answer your research questions
○ IDA‌‌may‌‌pose‌‌other‌‌research‌‌questions‌ ‌
○ IDA‌‌can‌ ‌
■ Identify the data's main qualities;
■ Suggest‌‌the‌‌populations‌‌from‌‌which‌‌a‌‌sample‌‌derives‌ ‌
What’s‌‌involved‌‌in‌‌IDA?‌ ‌
● Initial‌‌Data‌‌Analysis‌‌commonly‌‌involves:‌ ‌
○ Data‌‌background:‌‌checking‌‌the‌‌quality‌‌and‌‌integrity‌‌of‌‌the‌‌data‌ ‌
○ Data‌‌structure:‌‌what‌‌information‌‌has‌‌been‌‌collected?‌ ‌
○ Data wrangling: scraping, cleaning, tidying, reshaping, splitting, combining
○ Data‌‌summaries:‌‌graphical‌‌and‌‌numerical‌ ‌
● Here‌‌we‌‌focus‌‌on‌‌structure‌‌‌&‌‌graphical‌‌summaries‌‌‌for‌‌qualitative‌‌and‌‌
quantitative‌‌data‌ ‌


Variables‌ ‌
● A‌v‌ ariable‌‌‌measures‌‌or‌‌describes‌‌some‌‌attribute‌‌of‌‌the‌‌subjects‌ ‌
○ Data‌‌with‌‌p‌‌(explanatory)‌‌variables‌‌is‌‌said‌‌to‌‌have‌d ‌ imension‌‌p ‌‌‌
● Number‌‌of‌‌variables‌ ‌
○ Univariate‌‌(1‌‌[explanatory]‌‌variable)‌ ‌
○ Bivariate (2 [explanatory] variables)
○ Multivariate‌‌(above‌‌2‌‌[explanatory]‌‌variables)‌ ‌
● Types‌‌of‌‌variables‌ ‌
○ Qualitative‌‌or‌‌Categorical‌‌(Categories)‌‌R:Factor‌ ‌
■ Ordinal‌‌(Ordered)‌ ‌
● Binary‌‌(2‌‌categories)‌ ‌
● 3+‌‌categories‌ ‌
■ Nominal‌‌(Non-ordered)‌ ‌
● Binary‌‌(2‌‌categories)‌ ‌
● 3+‌‌categories‌ ‌
○ Quantitative‌‌or‌‌Numerical‌‌(Measurements)‌‌R:Numeric‌ ‌
■ Discrete‌‌(Separated)‌‌R:Integer‌‌(int)‌ ‌
■ Continuous‌‌(Continuum)‌‌R:Double‌ ‌
Choosing‌‌a‌‌graphical‌‌summary‌ ‌
● The‌‌aim‌‌of‌‌a‌‌graphical‌‌summary‌‌is‌‌to‌‌best‌‌highlight‌‌features‌‌of‌‌this‌‌data‌ ‌
○ To‌‌some‌‌extent‌‌we‌‌use‌‌trial‌‌and‌‌error‌ ‌
○ While‌‌the‌‌pie‌‌chart‌‌may‌‌be‌‌popular,‌‌it‌‌is‌‌usually‌‌not‌‌informative‌ ‌
Summary‌ ‌
● The‌‌type‌‌of‌‌variables‌‌determines‌‌what‌‌type‌‌of‌‌graphical‌‌summary‌‌is‌‌most‌
appropriate‌ ‌

Lecture‌‌4‌ ‌
Overview‌‌of‌‌histogram‌ ‌
● We‌‌use‌‌a‌‌histogram‌‌for‌‌quantitative‌‌data‌ ‌
● A‌‌histogram‌‌highlights‌‌the‌‌percentage‌‌of‌‌data‌‌in‌‌one‌‌class‌‌interval‌‌compared‌‌to‌‌
another‌ ‌
○ It‌‌consists‌‌of‌‌a‌‌set‌‌of‌‌blocks‌‌which‌‌represent‌‌the‌‌percentages‌‌by‌‌area‌ ‌
○ The‌‌area‌‌of‌‌the‌‌histogram‌‌is‌‌100%‌ ‌
○ The‌‌horizontal‌‌scale‌‌is‌‌divided‌‌into‌c ‌ lass‌‌intervals‌ ‌
○ The‌‌area‌‌of‌‌each‌‌block‌‌‌represents‌‌the‌‌percentage‌‌of‌‌subjects‌‌in‌‌that‌‌
particular‌‌class‌‌interval‌ ‌
● Density Scale:
○ Height of each block = (% in the block) ÷ (length of the class interval)
○ Height of each block = average percentage per horizontal unit
● For‌‌continuous‌‌data,‌‌we‌‌need‌‌an‌e ‌ ndpoint‌‌convention‌‌‌for‌‌data‌‌points‌‌that‌‌fall‌‌
on‌‌the‌‌border‌‌of‌‌two‌‌class‌‌intervals‌ ‌
○ If‌‌an‌‌interval‌‌contains‌‌the‌‌left‌‌endpoint‌‌but‌‌excludes‌‌the‌‌right‌‌endpoint,‌‌
then‌‌an‌‌18‌‌year‌‌old‌‌would‌‌be‌‌counted‌‌in‌‌[18,25)‌‌not‌‌[0,18)‌ ‌
○ We‌‌call‌‌this‌‌left-closed‌‌and‌‌right-opened‌ ‌
● Number‌‌of‌‌class‌‌intervals‌ ‌
Common Mistakes with Histograms
● The block heights are set equal to the percentages or total numbers
○ Here we wrongly use the total numbers (or percentages) as the heights
○ Unless the class intervals are the same size, in both cases this makes larger class intervals look like a larger overall %
○ Solution: Use density as the height, especially if class intervals are not the same size. Don't use percentages or total numbers
● Use‌‌too‌‌many‌‌or‌‌too‌‌few‌‌class‌‌intervals‌ ‌
○ This‌‌can‌‌hide‌‌the‌‌true‌‌pattern‌‌in‌‌the‌‌data.‌‌As‌‌a‌‌rule‌‌of‌‌thumb,‌‌use‌‌between‌‌
10-15‌‌class‌‌intervals‌ ‌
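● A minimal R sketch of a density-scale histogram with unequal, left-closed class intervals (the ages and break points are invented for illustration, not from the lecture):

    ages <- c(18, 19, 22, 24, 30, 35, 41, 47, 52, 60, 68, 74, 80, 91)
    # unequal class intervals, left-closed right-open, density on the vertical axis
    hist(ages, breaks = c(0, 18, 25, 70, 105), right = FALSE, freq = FALSE,
         main = "Density-scale histogram", xlab = "Age")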
Strategy‌ ‌
● Only‌‌count‌‌those‌‌deaths‌‌where‌‌person‌‌is‌‌driving‌ ‌
● Find data on registered driving licences with age information
● Combine‌‌information‌‌and‌‌derive‌‌a‌‌death‌‌rate‌‌per‌‌driving‌‌licence‌‌for‌‌different‌‌age‌‌
groups‌ ‌
● Conclusion:‌‌Death‌‌rate‌‌per‌‌licence‌‌is‌‌approximately‌‌the‌‌same‌‌for‌‌age‌‌group‌‌[18,‌‌
25)‌‌and‌‌[70,‌‌105).‌‌Both‌‌rates‌‌are‌‌approximately‌‌three‌‌times‌‌higher‌‌than‌‌the‌‌death‌‌
rate‌‌for‌‌age‌‌groups‌‌[25,‌‌70)‌ ‌
Simple‌‌box‌‌plot‌ ‌
● The‌‌boxplot‌‌plots‌‌the‌‌median‌‌(‘middle’‌‌data‌‌point),‌‌the‌‌middle‌‌50%‌‌of‌‌the‌‌data‌‌in‌‌
a‌‌box,‌‌the‌‌maximum‌‌and‌‌minimum,‌‌and‌‌determines‌‌any‌‌outliers‌ ‌
● We‌‌will‌‌consider‌‌how‌‌to‌‌draw‌‌the‌‌box‌‌plot‌‌when‌‌we‌‌learn‌‌about‌‌the‌‌interquartile‌‌
range‌‌(IQR)‌‌in‌‌a‌‌later‌‌lecture‌ ‌
Comparative‌‌box‌‌plots‌ ‌
● A‌‌comparative‌‌boxplots‌‌splits‌‌up‌‌a‌‌quantitative‌‌variable‌‌by‌‌a‌‌qualitative‌‌variable‌ ‌
Heatmap‌ ‌
● A‌‌heatmap‌‌might‌‌be‌‌a‌‌good‌‌choice‌‌here.‌‌A‌‌heatmap‌‌is‌‌especially‌‌useful‌‌when‌‌a‌‌
contingency‌‌table‌‌is‌‌not‌‌practical‌‌due‌‌to‌‌too‌‌many‌‌different‌‌values‌ ‌
Summary‌ ‌
● The‌‌histogram‌‌is‌‌a‌‌graphical‌‌summary‌‌for‌‌quantitative‌‌data‌‌which‌‌shows‌‌the‌‌
percentage‌‌of‌‌subjects‌‌per‌‌class‌‌interval.‌‌The‌‌boxplot‌‌shows‌‌the‌‌middle‌‌50%‌‌of‌‌
the‌‌data‌‌and‌‌it’s‌‌spread.‌‌The‌‌scatterplot‌‌shows‌‌the‌‌relationship‌‌between‌‌two‌‌
variables.‌‌A‌‌heatmap‌‌is‌‌a‌‌‘contingency‌‌table’‌‌for‌‌numerical/continuous‌‌data‌ ‌


Lecture‌‌5‌ ‌
Advantages‌‌of‌‌numerical‌‌summaries‌ ‌
● A numerical summary reduces all the data to one simple number (a "statistic")
○ This loses a lot of information
○ However it allows easy communication and comparisons
● Major‌‌features‌‌that‌‌we‌‌can‌‌summarise‌‌numerically‌‌are:‌ ‌
○ Maximum‌ ‌
○ Minimum‌ ‌
○ Centre‌‌‌[sample‌‌mean,‌‌median]‌ ‌
○ Spread‌‌‌[standard‌‌deviation,‌‌range,‌‌IQR]‌ ‌
Which might be useful for talking about Newtown house prices?
● It‌‌depends‌ ‌
● Reporting‌‌the‌‌centre‌‌without‌‌the‌‌spread‌‌can‌‌be‌‌misleading‌ ‌
Useful‌‌notation‌‌for‌‌data‌‌(Ext)‌ ‌
● In this course, we intentionally focus on statistical concepts in words. This is vital for collaborating with people from different fields. The mathematics is introduced in 2nd year. However, here some simple mathematical notation is helpful.
● Observations of a single variable of size n can be represented by:
○ x_1, x_2, ..., x_n
● The ranked observations (ordered from smallest to largest) are:
○ x_(1), x_(2), ..., x_(n)
● The sum of the observations is:
○ Σ x_i (summed over i = 1, ..., n)
Sample Mean
● The sample mean is the average of the data:
○ Sample Mean = (Sum of data) ÷ (Size of data), or
○ x̄ = (1/n) Σ x_i

Sample‌‌mean‌‌as‌‌a‌‌balancing‌‌point‌
● The sample mean is the unique point at which the data is balanced, i.e. the gaps to the higher readings and the gaps to the lower readings all cancel each other out. For example, when the mean is 1407.143 (thousands):
○ 19‌‌Watkin‌‌St‌‌sold‌‌for‌‌$1950‌‌(thousands)‌ ‌
■ This‌‌gives‌‌a‌‌gap‌‌of‌‌(1950-1407.143)‌ ‌
■ This‌‌is‌‌$542.857‌‌(thousands)‌a ‌ bove‌‌‌the‌‌sample‌‌mean‌‌price‌ ‌


○ 30‌‌Pearl‌‌St‌‌sold‌‌$1250‌‌(thousands)‌ ‌
■ This‌‌gives‌‌a‌‌gap‌‌of‌‌(1250-1407.143)‌ ‌
■ This‌‌is‌‌$157.143‌b ‌ elow‌‌‌the‌‌sample‌‌mean‌‌price‌ ‌
Sample‌‌Median‌ ‌
● The sample median x̃ is the middle data point, when the observations are ordered from smallest to largest
○ For an odd number of observations:
■ Sample Median = the unique middle point = x_((n+1)/2)
○ For an even number of observations:
■ Sample Median = average of the two middle points = ( x_(n/2) + x_(n/2 + 1) ) ÷ 2
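● A small R sketch of the two centre summaries (the price vector, in thousands, is invented for illustration):

    prices <- c(370, 690, 830, 905, 1100, 1250, 1407, 1600, 1950, 3100)
    mean(prices)    # sample mean, pulled up by the expensive properties
    median(prices)  # sample median, robust to the extreme values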
Statistical‌‌Thinking‌ ‌
● If‌‌you‌‌had‌‌to‌‌choose‌‌between‌‌reporting‌‌the‌‌sample‌‌mean‌‌or‌‌sample‌‌median‌‌for‌
Newtown‌‌properties,‌‌which‌‌would‌‌you‌‌choose‌‌and‌‌why?‌ ‌
■ For‌‌the‌‌full‌‌property‌‌portfolio,‌‌the‌‌sample‌‌mean‌‌and‌‌the‌‌sample‌‌
median‌‌are‌‌fairly‌‌similar‌ ‌
■ For‌‌the‌‌4‌‌bedroom‌‌houses,‌‌the‌‌sample‌‌mean‌‌is‌‌higher‌‌than‌‌the‌‌
sample‌‌median‌‌because‌‌it‌‌is‌‌being‌‌“pulled‌‌up”‌‌by‌‌some‌‌very‌‌
expensive‌‌houses‌ ‌
○ For‌‌the‌‌average‌‌buyer,‌‌the‌‌sample‌‌median‌‌would‌‌be‌‌more‌‌useful‌‌as‌‌an‌‌
indication‌‌of‌‌the‌‌sort‌‌of‌‌price‌‌needed‌‌to‌‌get‌‌into‌‌the‌‌market‌ ‌
○ For‌‌any‌‌agent‌‌selling‌‌houses‌‌in‌‌the‌‌area,‌‌the‌‌sample‌‌mean‌‌might‌‌be‌‌more‌‌
useful‌‌in‌‌order‌‌to‌‌predict‌‌their‌‌average‌‌commissions‌ ‌
○ In‌‌practise,‌‌we‌‌can‌‌report‌‌both‌ ‌

Robustness‌ ‌
● The‌‌sample‌‌median‌‌is‌‌said‌‌to‌‌be‌r‌ obust‌‌‌and‌‌is‌‌a‌‌good‌‌summary‌‌for‌‌skewed‌‌data‌‌
as‌‌it‌‌is‌‌not‌‌affected‌‌by‌o‌ utliers‌ ‌
● Suppose‌‌there‌‌was‌‌a‌‌data‌‌entry‌‌mistake,‌‌and‌‌the‌‌lowest‌‌property‌‌recorded‌‌as‌‌
370‌‌was‌‌in‌‌fact‌‌the‌‌highest‌‌sold‌‌at‌‌3700.‌‌How‌‌would‌‌the‌‌sample‌‌mean‌‌change?‌‌
How‌‌would‌‌the‌‌sample‌‌median‌‌change?‌ ‌
○ The‌‌sample‌‌mean‌‌would‌‌be‌‌higher,‌‌as‌‌we‌‌have‌‌replaced‌‌the‌‌smallest‌‌
reading‌‌by‌‌now‌‌maximum‌ ‌
○ The‌‌median‌‌would‌‌shift‌‌up,‌‌from‌‌the‌‌average‌‌x(28) ‌and‌‌x(29) ‌to‌‌the‌‌

average‌‌of‌‌x(29) ‌and‌‌x(30) ‌
Comparing‌‌the‌‌sample‌‌mean‌‌and‌‌the‌‌median‌ ‌


● The difference between the sample mean and the median can be an indication of the shape of the data
○ For symmetric data, we expect the sample mean and sample median to be about the same: x̄ = x̃
○ For left skewed data, we expect the sample mean to be smaller than the sample median: x̄ < x̃
○ For right skewed data, we expect the sample mean to be larger than the sample median: x̄ > x̃
Which‌‌is‌‌optimal‌‌for‌‌describing‌‌the‌‌centre?‌ ‌
● Both have strengths and weaknesses depending on the nature of the data
● Sometimes neither gives a sensible sense of location, for example if the data is bimodal
● As‌‌the‌s ‌ ample‌‌median‌‌is‌‌robust‌,‌‌it‌‌is‌‌preferable‌‌for‌‌data‌‌which‌‌is‌‌skewed‌‌or‌‌has‌‌
many‌‌outliers,‌‌like‌‌Sydney‌‌house‌‌prices‌ ‌
● The sample mean is helpful for data which is basically symmetric, with not too many outliers, and for theoretical analysis
Limitations‌‌of‌‌both?‌ ‌
● Both the sample mean and sample median allow very easy comparisons, and are easily understandable
● However,‌‌they‌‌need‌‌to‌‌be‌‌paired‌‌with‌‌a‌‌measure‌‌of‌‌spread‌ ‌
● Note‌‌in‌‌the‌‌following‌‌example,‌‌the‌‌sample‌‌means‌‌are‌‌the‌‌same,‌‌but‌‌the‌‌data‌‌are‌‌
very‌‌different‌ ‌
Summary‌ ‌
● Both the sample mean and sample median summarise the centre of the data. The sample median is robust, making it a better choice for skewed data or where there are outliers. Both need to be paired with a measure of spread.

Lecture‌‌6‌ ‌
1st attempt: The mean gap
● Mean gap = sample mean(data − sample mean(data))
● Note: It will always be 0
○ From the definition, the mean gap must be 0, as the mean is the balancing point of the gaps
○ Or for those who like algebra, the mean gap is (1/n) Σ (x_i − x̄) = (1/n) Σ x_i − (n x̄)/n = x̄ − x̄ = 0
Better‌‌option:‌‌Standard‌‌deviation‌ ‌


● First, define the Root Mean Square (RMS)
○ The RMS measures the average size of a set of numbers, regardless of the signs
○ The steps are: Square the numbers, then take the Mean of the results, then take the Root of the result
■ RMS(numbers) = sqrt( mean(numbers²) )
○ So effectively, the Square and the Root operations "reverse" each other
● Applying the RMS to the gaps: RMS(gaps) = sqrt( mean(gaps²) ) = sqrt( (1/n) Σ (gap_i)² )
● To avoid the cancellation of the gaps, another possible method is to consider the average of the absolute values of the gaps: (1/n) Σ |gap_i|. However, this is harder to handle algebraically
Standard deviation in terms of RMS
● The standard deviation measures the spread of the data
○ SD_pop = RMS of (gaps from the mean)
○ Formally, SD_pop = sqrt( mean of (gaps from the mean)² ) = sqrt( (1/n) Σ (x_i − x̄)² )
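● A short R sketch of the RMS-of-gaps idea, using base R only (so the population SD is computed directly rather than with popsd() from the multicon package mentioned later):

    x <- c(2, 4, 4, 4, 5, 5, 7, 9)
    gaps <- x - mean(x)
    sqrt(mean(gaps^2))   # RMS of the gaps = population SD
    sd(x)                # note: R's sd() divides by n - 1 (sample SD), so it differs slightly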
How‌‌to‌‌tell‌‌the‌‌difference‌‌when‌‌the‌‌data‌‌is‌‌a‌‌population‌‌or‌‌a‌‌sample?‌ ‌
● It‌‌can‌‌be‌‌tricky‌‌to‌‌work‌‌out‌‌whether‌‌your‌‌data‌‌is‌‌a‌‌population‌‌or‌‌sample‌ ‌
● Look‌‌a‌‌the‌‌information‌‌about‌‌the‌‌data‌‌story‌‌and‌‌the‌‌research‌‌questions‌ ‌
Standard‌‌Units‌‌(“Z‌‌score”)‌ ‌
● Standard units of a data point = how many standard deviations it is below or above the mean
○ Standard units = (data point − mean) ÷ SD
● This means: data point = mean + SD × standard units
IQR‌ ‌
● IQR = Range of the middle 50% of the data
○ More formally IQR = Q3 − Q1, where
■ Q1 is the 25th percentile (1st quartile) and Q3 is the 75th percentile (3rd quartile)
■ The median is the 50th percentile or 2nd quartile: x̃ = Q2
Quantile,‌‌quartile,‌‌percentile‌ ‌
● The‌‌set‌‌of‌‌q-‌quantiles‌‌‌divides‌‌the‌‌data‌‌into‌‌q‌‌equal‌‌sets‌‌(in‌‌terms‌‌of‌‌percentage‌‌
of‌‌data)‌ ‌
● Percentile‌‌‌is‌‌100-quantile‌ ‌
● The‌‌set‌‌of‌q ‌ uartiles‌‌‌divides‌‌the‌‌data‌‌into‌‌4‌‌quarters‌ ‌
● So the range of the middle 50% of properties sold is almost a million dollars
IQR‌‌on‌‌the‌‌boxplot‌ ‌
● The‌‌IQR‌‌is‌‌the‌‌length‌‌of‌‌the‌‌box‌‌in‌‌the‌‌box‌‌plot.‌‌It‌‌represents‌‌the‌‌span‌‌of‌‌the‌‌
middle‌‌50%‌‌of‌‌the‌‌houses‌‌sold‌ ‌
● The lower and upper thresholds are a distance of 1.5 × IQR from the quartiles (by convention)
○ LT = Q1 − 1.5 × IQR
○ UT = Q3 + 1.5 × IQR
● Data‌‌outside‌‌these‌‌thresholds‌‌is‌‌considered‌‌an‌o ‌ utlier‌‌‌(“extreme‌‌reading”)‌ ‌
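● A quick R sketch of the quartiles, IQR and outlier thresholds (the price vector is the same invented example as above):

    prices <- c(370, 690, 830, 905, 1100, 1250, 1407, 1600, 1950, 3100)
    q <- quantile(prices, c(0.25, 0.75))
    iqr <- IQR(prices)
    lower <- q[1] - 1.5 * iqr                 # lower threshold
    upper <- q[2] + 1.5 * iqr                 # upper threshold
    prices[prices < lower | prices > upper]   # readings flagged as outliers
    boxplot(prices)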
Coefficient‌‌of‌‌Variation‌ ‌
● The Coefficient of Variation (CV) combines the mean and standard deviation into one summary: CV = SD ÷ mean
● The CV is used in:
○ Analytical chemistry, to express the precision and repeatability of an assay
○ Engineering and the physical sciences, for quality assurance studies
○ Economics, for describing the volatility of a security

Lecture‌‌7‌ ‌
Normal‌‌Curve:‌‌Origins‌ ‌
● The‌‌normal‌‌curve‌‌was‌‌discovered‌‌around‌‌1720‌‌by‌‌Abraham‌‌de‌‌Moivre,‌‌also‌‌
famous‌‌for‌‌the‌‌beautiful‌‌de‌‌Moivre’s‌‌formula‌ ‌
Why‌‌is‌‌the‌‌Normal‌‌curve‌‌famous?‌ ‌
● The‌‌Normal‌‌curve‌‌approximates‌‌many‌‌natural‌‌phenomena‌ ‌
● The‌‌Normal‌‌curve‌‌can‌‌model‌‌data‌‌caused‌‌by‌‌combining‌‌a‌‌large‌‌number‌‌of‌‌
independent‌‌observations.‌ ‌
General‌‌&‌‌Standard‌‌Normal‌‌curves‌ ‌
● The‌‌Standard‌‌‌Normal‌‌Curve‌‌(‌Z )‌‌has‌‌mean‌‌=‌‌0‌‌and‌‌SD‌‌=‌‌1.‌‌Short:‌‌N(0,‌‌1)‌ ‌
● The‌‌General‌‌‌Normal‌‌Curve‌‌(‌X )‌‌has‌‌any‌‌mean‌‌and‌‌SD.‌‌Caution:‌‌It‌‌is‌‌denoted‌‌by‌‌
N(mean,‌‌SD2 )‌‌ ‌
The‌‌Normal‌‌curve‌‌formula‌ ‌
● It turns out the Normal curve has a simple formula, although you won't need to use it directly
● The formula for the General Normal curve is
f(x) = 1/sqrt(2πσ²) × e^( −(x − μ)² / (2σ²) ) for x ∈ (−∞, ∞)
where μ and σ are the (population) mean and SD respectively
Finding‌‌the‌‌area‌‌under‌‌the‌‌Standard‌‌Normal‌‌curve‌
● Method 1: Integration
○ Mathematically, we could use integration: area = ∫ from −∞ to 0.7 of (1/sqrt(2π)) e^(−y²/2) dy
○ But this does not have a closed form
● Method‌‌2:‌‌Normal‌‌Tables‌ ‌
● Method‌‌3:‌‌Use‌‌R‌ ‌
○ The‌‌pnorm‌‌command‌‌works‌‌out‌‌the‌‌lower‌‌tail‌‌area‌ ‌
○ The‌‌pnorm(x,‌‌lower.tail‌ ‌=‌‌F)‌‌works‌‌out‌‌the‌‌upper‌‌tail‌‌area‌ ‌
Finding‌‌the‌‌area‌‌under‌‌the‌‌Standard‌‌Normal‌‌curve‌
● In‌‌R‌ ‌
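○ For example (a minimal sketch; the value 0.7 matches the area example above):

    pnorm(0.7)                      # lower tail area below z = 0.7, about 0.758
    pnorm(0.7, lower.tail = FALSE)  # upper tail area above z = 0.7
    pnorm(1) - pnorm(-1)            # area within 1 SD of the mean, about 0.68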
Properties‌‌of‌‌the‌‌Normal‌‌curve‌ ‌
● All‌‌Normal‌‌curves‌‌satisfy‌‌the‌‌“68%‌‌-‌‌95%‌‌-‌‌99.7%‌‌Rule”‌ ‌
○ The‌‌area‌‌1‌‌SD‌‌out‌‌from‌‌the‌‌mean‌‌in‌‌both‌‌directions‌‌is‌‌0.68‌‌(68%)‌ ‌
○ The‌‌area‌‌2‌‌SD‌‌out‌‌from‌‌the‌‌mean‌‌in‌‌both‌‌directions‌‌is‌‌0.95‌‌(95%)‌ ‌
○ The‌‌area‌‌3‌‌SD‌‌out‌‌from‌‌the‌‌mean‌‌in‌‌both‌‌directions‌‌is‌‌0.997‌‌(99.7%)‌ ‌
● Any‌‌General‌‌Normal‌‌can‌‌be‌‌rescaled‌‌into‌‌the‌‌Standard‌‌Normal‌ ‌
○ For‌‌any‌‌point‌‌on‌‌a‌‌Normal‌‌curve,‌‌the‌‌standard‌‌units‌‌(or‌‌z‌‌score)‌‌is‌‌how‌‌
many‌‌standard‌‌deviations‌‌that‌‌point‌‌is‌‌above‌‌(+)‌‌or‌‌below‌‌(-)‌‌the‌‌mean‌ ‌
○ standard units = (data point − sample mean) ÷ (sample SD)
● The‌‌Normal‌‌curve‌‌is‌‌symmetric‌ ‌
○ If X follows a normal curve with mean 0, then
■ P(X < −0.5) = P(X > 0.5)
Summary‌ ‌
● The‌‌Normal‌‌curve‌‌naturally‌‌describes‌‌many‌‌histograms,‌‌and‌‌so‌‌can‌‌be‌‌used‌‌in‌‌
modelling‌‌data.‌‌It‌‌has‌‌many‌‌useful‌‌properties,‌‌including‌‌the‌‌68/95/99.7%‌‌rule.‌‌
Any‌‌General‌‌Normal‌‌can‌‌be‌‌rescaled‌‌into‌‌a‌‌Standard‌‌Normal‌ ‌

Lecture‌‌8‌ ‌
Reproducible‌‌Research‌ ‌


● Increasingly,‌‌journals‌‌are‌‌requiring‌‌reproducible‌‌research,‌‌which‌‌requires‌‌“data‌‌
set‌‌and‌‌software‌‌to‌‌be‌‌made‌‌available‌‌for‌‌verifying‌‌published‌‌findings‌‌and‌‌
conducting‌‌alternative‌‌analyses”.‌ ‌
○ A study by Begley and Ellis (2012) found that the findings of 47 out of 53 medical research papers focused on cancer research were irreproducible
○ A follow-up study by Begley (2013) identified "6 flags for suspect work": studies were not performed by investigators blinded to the experimental versus the control arms, there was a failure to repeat experiments, a lack of positive and negative controls, failure to show all data, inappropriate use of statistical tests, and use of reagents that were not appropriately validated
What‌‌can‌‌go‌‌wrong?‌ ‌
● Without‌‌reproducible‌‌research:‌ ‌
○ Data‌‌version‌‌can‌‌change‌‌(eg‌‌people‌‌edit‌‌an‌‌Excel‌‌file‌‌without‌‌
documenting‌‌what‌‌has‌‌changed‌‌and‌‌why);‌ ‌
○ Graphical‌‌summaries‌‌can‌‌change‌‌(eg‌‌people‌‌can‌‌photoshop‌‌images‌‌
without‌‌keeping‌‌record‌‌of‌‌what‌‌changed‌‌and‌‌why)‌ ‌
● Reproducible‌‌research‌‌is‌‌about‌‌being‌‌responsible‌‌with‌‌possible‌‌human‌‌errors,‌‌or‌‌
worse,‌‌detecting‌‌intentionally‌‌changed‌‌results‌‌ ‌

Lecture‌‌9‌ ‌
Bivariate‌‌Data‌ ‌
● Bivariate data involves a pair of variables. We are interested in the relationship between the two variables. Can one variable be used to predict the other?
○ Formally,‌‌we‌‌have‌‌(xi , y i ) ‌for‌‌i = 1, 2, ..., n ‌
○ X ‌is‌‌called‌‌the‌i‌ndependent‌‌‌variable‌‌(or‌‌explanatory‌‌variable,‌‌predictor‌‌
or‌‌regressor)‌ ‌
○ Y ‌is‌‌called‌‌the‌d‌ ependent‌‌‌variable‌‌(or‌‌response‌‌variable).‌ ‌
Scatter‌‌Plot‌ ‌
● A scatter plot is a graphical summary of two quantitative variables on the same 2D plane, resulting in a cloud of points
How‌‌can‌‌we‌‌summarise‌‌a‌‌scatter‌‌plot?‌ ‌
● The‌‌scatter‌‌plot‌‌can‌‌be‌‌summarised‌‌by‌‌the‌‌following‌‌five‌‌‌numerical‌‌summaries‌ ‌
○ Sample‌‌mean‌‌and‌‌sample‌‌SD‌‌of‌‌X (x, SDx ) ‌
○ Sample‌‌mean‌‌and‌‌sample‌‌SD‌‌of‌‌Y (y, SD y ) ‌
○ Correlation‌‌coefficient‌‌(r) ‌
The‌‌Correlation‌‌coefficient‌ ‌


● The‌‌correlation‌‌coefficient‌‌‌r ‌is‌‌a‌‌numerical‌‌summary‌‌that‌‌measures‌‌the‌‌
clustering‌‌around‌‌the‌‌line‌ ‌
● It‌‌indicates‌‌both‌‌the‌‌sign‌‌and‌‌strength‌‌of‌‌the‌‌linear‌‌association‌ ‌
● The‌‌correlation‌‌coefficient‌‌is‌‌between‌‌-1‌‌and‌‌1‌ ‌
○ If‌‌r ‌is‌‌positive:‌‌the‌‌cloud‌‌sloped‌‌up‌ ‌
○ If‌‌r ‌is‌‌negative:‌‌the‌‌cloud‌‌slopes‌‌down‌ ‌
○ As‌r ‌gets‌‌closer‌‌to‌‌± 1 ‌:‌‌the‌‌points‌‌cluster‌‌more‌‌tightly‌‌around‌‌the‌‌line‌ ‌
Why‌‌does‌‌r‌‌measure‌‌association‌ ‌
● It‌‌divides‌‌the‌‌scatter‌‌plot‌‌into‌‌4‌‌quadrants,‌‌at‌‌the‌‌point‌‌of‌‌averages‌‌(centre)‌
○ A‌‌majority‌‌of‌‌points‌‌in‌‌the‌‌upper‌‌right‌‌(+)‌‌and‌‌lower‌‌left‌‌quadrants‌‌(+)‌‌will‌‌
be‌‌indicated‌‌by‌‌a‌‌positive‌‌r‌ ‌
○ A‌‌majority‌‌of‌‌points‌‌in‌‌the‌‌upper‌‌left‌‌(-)‌‌and‌‌the‌‌lower‌‌right‌‌(-)‌‌will‌‌be‌‌
indicated‌‌by‌‌a‌‌negative‌‌r‌ ‌
Symmetry‌ ‌
● The‌‌correlation‌‌coefficient‌‌is‌‌not‌‌affected‌‌by‌‌interchanging‌‌the‌‌variables‌ ‌
Scaling‌ ‌
● The‌‌correlation‌‌coefficient‌‌is‌‌shift‌‌and‌‌scale‌‌invariant‌ ‌
Warning‌ ‌
1. The‌‌correlation‌‌coefficient‌‌is‌‌unitless‌ ‌
○ Mistake‌:‌‌r‌‌=‌‌0.8‌‌means‌‌that‌‌80%‌‌of‌‌the‌‌points‌‌are‌‌tightly‌‌clustered‌‌around‌‌
the‌‌line‌‌or‌‌is‌‌twice‌‌as‌‌clustered‌‌as‌‌r‌‌=‌‌0.4‌ ‌
2. Outliers‌‌can‌‌overly‌‌influence‌‌the‌‌correlation‌‌coefficient‌ ‌
3. Non-linear‌‌association‌‌can’t‌‌be‌‌detected‌‌by‌‌the‌‌correlation‌‌coefficient‌ ‌
4. The‌‌same‌‌correlation‌‌coefficient‌‌can‌‌arise‌‌from‌‌very‌‌different‌‌data‌ ‌
5. Rates or averages tend to inflate the correlation coefficient
○ An‌‌ecological‌‌correlation‌‌‌(or‌‌spatial‌‌correlation)‌‌is‌‌the‌‌correlation‌‌
between‌‌two‌‌variables‌‌that‌‌are‌‌group‌‌means‌‌or‌‌rates‌ ‌
○ For‌‌example,‌‌if‌‌we‌‌recorded‌‌the‌‌heights‌‌of‌‌fathers‌‌and‌‌sons‌‌in‌‌many‌‌
communities‌‌and‌‌then‌‌calculated‌‌the‌‌average‌‌for‌‌each‌‌community‌ ‌
○ Ecological‌‌correlations‌‌tend‌‌to‌‌overestimate‌‌the‌‌strength‌‌of‌‌association‌‌
between‌‌the‌‌two‌‌variables‌ ‌
6. Association‌‌is‌‌not‌‌causation‌ ‌
○ Correlation‌‌measures‌‌association‌ ‌
○ But‌‌as‌‌discussed,‌‌association‌‌does‌‌not‌‌necessarily‌‌mean‌‌causation‌ ‌
○ Both‌‌variables‌‌may‌‌be‌‌simultaneously‌‌influenced‌‌by‌‌a‌‌3rd‌‌variable‌‌
(confounder)‌ ‌
Summary‌ ‌
● The scatter plot is a cloud of points which represents bivariate quantitative data (a pair of variables). Useful summaries are the point of averages (the two sample means), the two sample SDs of the variables and the correlation coefficient. The correlation coefficient is the mean of the product of the variables in standard units and can be found using cor() in R
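● A brief sketch with cor() (the vectors are made up for illustration):

    x <- c(1, 2, 3, 4, 5, 6)
    y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 7.0)
    cor(x, y)   # correlation coefficient r
    # same value from the "mean of products in standard units" definition (population SDs)
    zx <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
    zy <- (y - mean(y)) / sqrt(mean((y - mean(y))^2))
    mean(zx * zy)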

Lecture‌‌10‌ ‌
Regression‌‌Line‌ ‌
1. SD Line (Not great)
○ The SD line might look like a good candidate as it connects the point of averages (x̄, ȳ) to (x̄ + SD_x, ȳ + SD_y) (for this data with positive correlation)
○ However, it does not use the correlation coefficient, so it is insensitive to the amount of clustering around the line
○ Note how it underestimates (LHS) and overestimates (RHS) at the extremes
2. Regression Line
○ To describe the scatter plot, we need to use all five summaries: x̄, ȳ, SD_x, SD_y, r
○ The Regression line connects (x̄, ȳ) to (x̄ + SD_x, ȳ + r·SD_y)
Summary‌‌Regression‌‌Line‌ ‌
● We‌‌can‌‌derive‌‌the‌‌(least-squares)‌‌regression‌‌line‌‌using‌‌calculus,‌‌by‌‌minimizing‌‌
the‌‌squared‌‌residuals‌‌‌(extension)‌ ‌
Predictions‌ ‌
1. Baseline‌‌prediction‌ ‌
○ If‌‌you‌‌don’t‌‌use‌‌x ‌as‌‌an‌‌information‌‌source‌‌at‌‌all,‌‌a‌‌basic‌‌prediction‌‌of‌‌y
would‌‌be‌‌the‌a ‌ verage‌‌‌of‌‌y ‌over‌a
‌ ll‌‌‌the‌‌x ‌values‌‌in‌‌the‌‌data‌ ‌
○ So‌‌for‌‌any‌‌CE‌‌reading,‌‌we‌‌could‌‌predict‌‌the‌‌NW‌‌air‌‌quality‌‌to‌‌be‌‌56.13‌ ‌
2. Prediction‌‌in‌‌a‌‌strip‌ ‌
○ Given‌‌a‌‌certain‌‌value‌‌x0 ‌,‌‌a‌‌more‌‌careful‌‌prediction‌‌of‌‌y ‌‌would‌‌be‌‌the‌‌
average‌‌of‌‌all‌‌the‌‌y ‌in‌‌the‌‌data‌‌corresponding‌‌to‌‌a‌‌neighbourhood‌‌of‌‌x ‌
value‌‌around‌‌x0 ‌.‌ ‌
3. The‌‌Regression‌‌line‌ ‌
○ The‌‌best‌‌prediction‌‌is‌‌based‌‌on‌‌the‌‌Regression‌‌line‌ ‌
○ For‌‌AQI,‌‌we‌‌have‌‌y = 19.8874 + 0.7138x ‌
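○ A minimal R sketch of fitting and using a regression line; the data frame here is invented to stand in for the lecture's air-quality data, keeping the CE and NW variable names:

    aqi <- data.frame(CE = c(30, 45, 50, 62, 70, 85),
                      NW = c(40, 52, 55, 66, 69, 80))
    fit <- lm(NW ~ CE, data = aqi)
    coef(fit)                                    # intercept and slope (lecture example: 19.8874 and 0.7138)
    predict(fit, newdata = data.frame(CE = 60))  # predicted NW air quality for a CE reading of 60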
Residuals‌ ‌


● A‌r‌ esidual‌‌‌is‌‌the‌‌vertical‌‌distance‌‌(or‌‌‘gap’)‌‌of‌‌a‌‌point‌‌above‌‌or‌‌below‌‌the‌‌
regression‌‌line‌ ‌
● A‌‌residual‌‌represents‌‌the‌‌error‌‌between‌‌the‌‌actual‌‌value‌‌and‌‌the‌‌prediction‌ ‌
● More formally, a residual is e_i = y_i − ŷ_i, given the actual value (y_i) and the prediction (ŷ_i)
Residual‌‌plot‌ ‌
● A residual plot graphs the residuals vs x
● If the linear fit is appropriate for the data, it should show no pattern (random scatter about y = 0)
● The residual plot is a diagnostic plot to check the appropriateness of a linear model
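● Continuing the sketch above (same invented aqi data and fit object):

    plot(aqi$CE, resid(fit), xlab = "CE", ylab = "Residual")
    abline(h = 0, lty = 2)   # residuals should scatter randomly about this line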
Vertical Strips
● If the vertical strips on the scatter plot show equal spread in the y direction, then the data is homoscedastic
○ The regression line could be used for predictions
● If the vertical strips don't show equal spread in the y direction, then the data is heteroscedastic
○ The regression line should not be used for predictions
Common‌‌mistakes‌‌when‌‌predicting‌ ‌
1. Extrapolating‌ ‌
○ If‌‌we‌‌make‌‌a‌‌prediction‌‌from‌‌an‌‌x ‌‌value‌‌that‌‌is‌‌not‌‌within‌‌the‌‌range‌‌of‌‌the‌‌
data,‌‌then‌‌that‌‌prediction‌‌can‌‌be‌‌completely‌u‌ nreliable‌ ‌
2. Not‌‌checking‌‌the‌‌scatter‌‌plot‌ ‌
○ We‌‌can‌‌have‌‌a‌‌high‌‌correlation‌‌coefficient‌‌and‌‌then‌‌fit‌‌a‌‌regression‌‌line,‌‌
but‌‌the‌‌data‌‌may‌‌not‌‌even‌‌be‌‌linear‌ ‌
○ So‌‌always‌‌check‌‌the‌‌scatter‌‌plot‌ ‌
3. Not checking the residual plot
○ You should also check the residual plot
○ This detects any pattern that has not been captured by fitting a linear model
○ If the linear model is appropriate, the residual plot should be a random scatter of points (about the horizontal line y = 0)
Summary‌ ‌
● For prediction, the regression line is better than the SD line as it uses all five numerical summaries for the scatter plot
● For the Regression line, the residuals are the gaps between the actual value and the prediction


● The‌‌residual‌‌plot‌‌is‌‌a‌‌diagnostic‌‌for‌‌seeing‌‌whether‌‌a‌‌linear‌‌model‌‌is‌‌appropriate‌‌
-‌‌if‌‌it‌‌is‌‌random,‌‌then‌‌a‌‌linear‌‌model‌‌seems‌‌appropriate‌ ‌
● If the vertical strips on the scatter plot show equal spread in the y-direction, then the data is homoscedastic; otherwise, the data is heteroscedastic

Lecture‌‌11‌ ‌
Probability‌ ‌
● The‌‌frequentist‌‌definition‌‌of‌‌probability‌‌‌(or‌‌chance)‌‌is‌‌the‌‌percentage‌‌of‌‌time‌‌a‌‌
certain‌‌event‌‌is‌‌expected‌‌to‌‌happen‌‌if‌‌the‌‌same‌‌process‌‌is‌‌repeated‌‌long-term‌‌
(infinitely‌‌often)‌ ‌
● This differs from the Bayesian definition of probability, which relates to the degree of belief that an event will occur (extension)
Basic‌‌properties‌‌of‌‌Probability‌ ‌
1. Probabilities‌‌are‌‌between‌‌0%‌‌(impossible)‌‌and‌‌100%‌‌(certain)‌ ‌
○ P(Impossible‌‌event)‌‌=‌‌0‌ ‌
○ P(Certain‌‌event)‌‌=‌‌1‌ ‌
2. The probability of something equals 100% minus the probability of its opposite (complement)
○ P(Event) = 1 − P(Complement event)
Conditional‌‌probability‌ ‌
● Conditional‌‌probability‌‌‌is‌‌the‌‌chance‌‌that‌‌a‌‌certain‌‌event‌‌(1)‌‌occurs,‌g ‌ iven‌‌
another‌‌event‌‌(2)‌‌has‌‌occurred‌ ‌
○ P(Event‌‌1|Event‌‌2)‌ ‌
Multiplication‌‌Rule‌ ‌
● The‌‌probability‌‌that‌‌two‌‌events‌‌occur‌‌is‌‌the‌‌chance‌‌of‌‌the‌‌1st‌‌event‌m ‌ ultiplied‌‌‌by‌‌
that‌‌chance‌‌of‌‌2nd‌‌event,‌‌given‌‌the‌‌1st‌‌has‌‌occurred‌ ‌
○ P(Event‌‌1‌‌and‌‌Event‌‌2)‌‌=‌‌P(event‌‌1)‌‌✕‌‌P(Event‌‌2|Event‌‌1)‌ ‌
Addition‌‌Rule‌ ‌
● The‌‌probability‌‌at‌‌least‌‌one‌‌of‌‌two‌‌events‌‌occurs‌‌is‌‌the‌‌chance‌‌of‌‌the‌‌1st‌‌event‌‌
plus‌‌‌the‌‌chance‌‌of‌‌2nd‌‌event‌m ‌ inus‌‌‌the‌‌probability‌‌that‌‌both‌‌events‌‌occur‌ ‌
○ P(Event‌‌1‌‌or‌‌Event‌‌2)‌‌=‌‌P(Event‌‌1)‌‌+‌‌P(Event‌‌2)‌‌-‌‌P(Event‌‌1‌‌and‌‌Event‌‌2)‌ ‌
Mutually‌‌exclusive‌ ‌
● Two‌‌events‌‌are‌m ‌ utually‌‌exclusive‌‌‌when‌‌the‌‌occurrence‌‌of‌‌one‌‌event‌‌prevents‌‌
the‌‌other‌ ‌
Independence‌ ‌
● Two‌‌events‌‌are‌i‌ndependent‌‌‌if‌‌the‌‌chance‌‌of‌‌1st‌‌given‌‌the‌‌2nd‌‌is‌‌the‌‌same‌‌as‌‌
the‌‌1st,‌‌ie.‌‌P(Event‌‌1|Event‌‌2)‌‌=‌‌P(Event‌‌1)‌ ‌
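● A small worked example (two fair dice, not from the lecture): P(first die shows 6 and second die shows 6) = 1/6 × 1/6 = 1/36 by the multiplication rule, since the rolls are independent; P(at least one die shows 6) = 1/6 + 1/6 − 1/36 = 11/36 by the addition rule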
The‌‌Prosecutor’s‌‌fallacy‌ ‌


● The‌‌prosecutor’s‌‌fallacy‌‌‌is‌‌a‌‌mistake‌‌in‌‌statistical‌‌thinking,‌‌whereby‌‌it‌‌is‌‌
assumed‌‌that‌‌the‌‌probability‌‌of‌‌a‌‌random‌‌match‌‌is‌‌equal‌‌to‌‌the‌‌probability‌‌that‌‌
the‌‌defendant‌‌is‌‌innocent‌ ‌
○ It‌‌has‌‌been‌‌used‌‌by‌‌the‌‌prosecution‌‌to‌‌argue‌‌for‌‌the‌‌guilt‌‌of‌‌a‌‌defendant‌‌
during‌‌famous‌‌criminal‌‌trials‌ ‌
○ It‌‌can‌‌also‌‌be‌‌used‌‌by‌‌defense‌‌lawyers‌‌to‌‌argue‌‌for‌‌the‌‌innocence‌‌of‌‌their‌‌
client‌ ‌
Summary‌ ‌
● Addition‌‌Rule‌ ‌
○ Two‌‌events‌‌are‌‌mutually‌‌exclusive‌‌when‌‌the‌‌occurrence‌‌of‌‌one‌‌event‌‌
prevents‌‌the‌‌other‌ ‌
○ If‌‌two‌‌events‌‌are‌‌mutually‌‌exclusive‌‌then‌‌the‌‌chance‌‌of‌a ‌ t‌‌least‌‌one‌‌event‌‌
occurring‌‌is‌‌the‌s
‌ um‌‌‌of‌‌the‌‌individual‌‌chances‌ ‌
● Multiplication‌‌Rule‌ ‌
○ Two‌‌events‌‌are‌‌independent‌‌if‌‌the‌‌occurrence‌‌of‌‌the‌‌first‌‌event‌‌does‌‌not‌‌
change‌‌the‌‌chance‌‌of‌‌the‌‌second‌‌event‌ ‌
○ If‌‌the‌‌two‌‌events‌‌are‌‌independent‌‌then‌‌the‌‌chance‌‌of‌b ‌ oth‌‌‌events‌‌
occurring‌‌is‌‌the‌m
‌ ultiplication‌‌‌of‌‌the‌‌individual‌‌chances‌ ‌

Lecture‌‌12‌ ‌
Counting‌‌and‌‌drawing‌‌trees‌‌(The‌‌old‌‌way)‌ ‌
● For‌‌simple‌‌chance‌‌problems,‌‌a‌‌good‌‌way‌‌to‌‌start‌‌is:‌ ‌
a. Method‌‌1:‌‌Write‌‌a‌‌full‌‌list‌‌of‌‌outcomes‌‌and‌‌count‌‌the‌‌outcomes‌‌of‌‌interest‌
■ Write‌‌a‌‌list‌‌of‌‌all‌‌outcomes‌ ‌
■ Count‌‌which‌‌outcomes‌‌belong‌‌to‌‌the‌‌event‌‌of‌‌interest‌ ‌
b. Method‌‌2:‌‌Summarise‌‌in‌‌a‌‌tree‌‌diagram‌ ‌
■ Draw‌‌a‌‌tree‌ ‌
Running‌‌a‌‌simulation‌‌(The‌‌new‌‌way)‌ ‌
1. Method 3: Simulate
○ Use R to simulate throwing the dice x times and record the findings (see the sketch below)
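● A minimal simulation sketch in R (the number of rolls is arbitrary):

    set.seed(1)
    rolls <- sample(1:6, size = 10000, replace = TRUE)  # simulate 10,000 die throws
    mean(rolls == 6)       # proportion of sixes, close to 1/6
    table(rolls) / 10000   # simulated probability of each face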
Summary‌ ‌
● Counting‌‌outcomes‌‌or‌‌drawing‌‌a‌‌tree‌‌to‌‌derive‌‌probabilities‌‌of‌‌outcomes‌‌can‌‌
quickly‌‌become‌‌tedious.‌‌One‌‌solution‌‌is‌‌to‌‌use‌‌simulations‌ ‌

Lecture‌‌13‌ ‌
Chance‌‌error‌ ‌
● Every‌‌time‌‌you‌‌toss‌‌a‌‌fair‌‌coin,‌‌there‌‌is‌‌chance‌‌variability‌ ‌


○ Number‌‌of‌‌heads‌‌(observed‌‌value)‌‌=‌‌half‌‌the‌‌number‌‌of‌‌tosses‌‌(expected‌‌
value)‌‌+‌‌chance‌‌error‌ ‌
Law‌‌of‌‌Averages‌ ‌
● The‌‌Law‌‌of‌‌Averages‌‌‌states‌‌that‌‌the‌p ‌ roportion‌‌‌of‌‌heads‌‌becomes‌‌more‌‌stable‌‌
as‌‌the‌‌length‌‌of‌‌the‌‌simulation‌‌increases‌‌and‌‌approaches‌‌a‌‌fixed‌‌number‌‌called‌‌
the‌r‌ elative‌‌frequency‌ ‌
● The‌‌chance‌‌error‌‌in‌‌the‌‌number‌‌of‌‌heads‌‌is‌‌likely‌‌to‌‌be‌l‌arge‌‌‌in‌‌absolute‌‌size,‌‌but‌‌
small‌‌‌relative‌‌to‌‌the‌‌number‌‌of‌‌tosses.‌ ‌
Important‌‌Facts‌ ‌
● For‌‌a‌‌fair‌‌coin:‌ ‌
○ Even‌‌if‌‌we‌‌observe‌‌100‌‌heads‌‌in‌‌a‌‌row,‌‌still‌‌P(Tail)‌‌=‌‌0.5.‌‌
Misunderstanding‌‌this‌‌leads‌‌to‌‌the‌‌Gambler’s‌‌Fallacy‌ ‌
○ As‌‌the‌‌number‌‌of‌‌tosses‌i‌ncreases‌ ‌
■ The‌‌absolute‌‌size‌‌of‌‌the‌‌chance‌‌error‌i‌ncreases‌ ‌
■ The percentage (i.e. relative) size of the chance error decreases
■ The‌‌proportion‌‌of‌‌the‌‌event‌‌will‌‌converge‌‌‌to‌‌the‌‌theoretical‌‌or‌‌
expected‌‌proportion‌ ‌
Summary‌ ‌
● For‌‌independent‌‌events,‌‌it‌‌is‌‌a‌‌mistake‌‌to‌‌assume‌‌that‌‌the‌‌chance‌‌of‌‌observing‌‌a‌‌
particular‌‌event‌‌changes‌‌over‌‌time,‌‌even‌‌if‌‌the‌‌event‌‌has‌‌not‌‌occurred‌‌for‌‌a‌‌long‌‌
time.‌‌This‌‌is‌‌the‌‌Gambler’s‌‌fallacy‌‌and‌‌downfall‌ ‌
● Rather‌‌The‌‌Law‌‌of‌‌Large‌‌Numbers‌‌states‌‌that‌‌the‌o ‌ bserved‌‌proportion‌‌‌of‌‌
occurrences‌‌of‌‌the‌‌event,‌‌in‌‌the‌‌long‌‌run,‌‌approaches‌‌the‌e ‌ xpected‌‌proportion‌ ‌

Lecture‌‌14‌ ‌
Box‌‌model‌ ‌
● The‌‌box‌‌model‌‌‌is‌‌a‌‌simple‌‌way‌‌to‌‌describe‌‌many‌‌chance‌‌processes‌ ‌
● The‌‌box‌‌represents‌‌the‌‌population‌,‌‌containing‌‌different‌‌types‌‌of‌t‌ ickets‌ ‌
● We‌‌need‌‌to‌‌know:‌ ‌
○ The‌‌number‌‌‌or‌p ‌ roportion‌‌‌of‌‌each‌‌kind‌‌of‌‌ticket‌‌in‌‌the‌‌box‌ ‌
○ The‌‌number‌‌of‌d ‌ raws‌‌‌from‌‌the‌‌box‌ ‌
○ For‌‌now,‌‌we‌‌only‌‌consider‌‌drawing‌‌with‌‌replacement‌ ‌
Modelling‌‌the‌‌Sum‌‌of‌‌a‌‌sample‌ ‌
● For‌‌the‌S‌ um‌‌‌of‌‌random‌‌draws‌‌from‌‌a‌‌box‌‌model‌‌with‌‌replacement,‌‌ ‌
○ observed‌‌value‌‌=‌‌expected‌‌value‌‌+‌‌chance‌‌error‌ ‌
■ Expected‌‌value‌‌(EV)‌‌=‌‌number‌‌of‌‌draws‌‌×‌‌mean‌‌of‌‌the‌‌box‌ ‌


■ Standard error (SE) = sqrt(number of draws) × SD of the box
■ SE‌‌is‌‌the‌‌expected‌‌magnitude‌‌of‌‌the‌‌chance‌‌error.‌ ‌


How‌‌to‌‌calculate‌‌the‌‌SD‌‌of‌‌the‌‌box‌ ‌
● As‌‌the‌b ‌ ox‌‌‌represents‌‌the‌‌population,‌‌the‌S
‌ D‌‌of‌‌the‌‌box‌‌‌is‌‌the‌p
‌ opulation‌‌SD‌ ‌
● We‌‌could‌‌call‌‌it‌‌SD pop ‌,‌‌but‌‌in‌‌this‌‌context,‌‌we‌‌will‌‌simply‌‌use‌‌SD‌ ‌
● 3‌‌ways‌‌to‌‌calculate‌‌the‌‌SD‌‌of‌‌the‌‌box‌ ‌
○ Formula:‌‌RMS(gaps)‌‌=‌‌Root‌‌of‌‌the‌‌Mean‌‌of‌‌the‌‌Squared‌‌gaps‌ ‌
○ R:‌‌popsd()‌‌with‌‌package‌‌multicon‌ ‌
○ Short‌‌cut‌‌(for‌‌simply‌‌binary‌‌(two‌‌tickets)‌‌boxes)‌ ‌
■ If a box only contains 2 different numbers ("big" and "small"), then
● SD = (big − small) × sqrt(proportion of big × proportion of small)
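● A small R sketch of the box model for the sum of draws (a 0–1 box with an invented proportion):

    box <- c(1, 0, 0, 0)                     # one "1" ticket and three "0" tickets
    n_draws <- 100
    mean_box <- mean(box)
    sd_box <- (1 - 0) * sqrt(0.25 * 0.75)    # shortcut formula for a binary box
    ev <- n_draws * mean_box                 # expected value of the sum
    se <- sqrt(n_draws) * sd_box             # standard error of the sum
    c(EV = ev, SE = se)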
How‌‌does‌‌chance‌‌error‌‌relate‌‌to‌‌standard‌‌error‌ ‌
● An‌‌observed‌‌value‌‌is‌‌likely‌‌to‌‌be‌‌around‌‌its‌‌expected‌‌value,‌‌with‌‌a‌c ‌ hance‌‌error‌‌
similar‌‌to‌‌SE‌ ‌
● Observed‌‌values‌‌usually‌‌lie‌‌within‌‌2‌‌SEs‌‌away‌‌from‌‌the‌‌expected‌‌value‌ ‌
Modelling‌‌the‌‌Mean‌‌of‌‌the‌‌Sample‌ ‌
● As‌‌the‌M ‌ ean‌‌‌of‌‌the‌‌sample‌‌is‌‌just‌‌the‌S
‌ um‌‌‌of‌‌the‌‌sample‌‌divided‌‌by‌‌the‌‌number‌‌
of‌‌the‌‌draws,‌‌we‌‌get‌‌an‌‌equivalent‌‌result‌‌as‌‌follows‌ ‌
● For‌‌the‌M ‌ ean‌‌‌of‌‌the‌‌random‌‌draws‌‌from‌‌a‌‌box‌‌model‌‌with‌‌replacement‌ ‌
○ observed‌‌value‌‌=‌‌expected‌‌value‌‌+‌‌chance‌‌error‌ ‌
■ Expected‌‌value‌‌(EV)‌‌=‌‌mean‌‌of‌‌the‌‌box‌ ‌
■ Standard error (SE) = SD of the box ÷ sqrt(number of draws)
Comparison‌ ‌
● Notice that there are two sets of formulas, depending on whether we are modelling the sum or mean of a sample
● The research question will dictate whether the sum or mean of a sample is more appropriate
● Given‌‌the‌‌mean‌‌and‌‌SD‌‌of‌‌the‌‌population‌ ‌
○ Sum of the Sample
■ Expected value (EV) = n × mean
■ Standard error (SE) = sqrt(n) × SD
○ Mean of the Sample
■ Expected value (EV) = mean
■ Standard error (SE) = SD ÷ sqrt(n)
● Notice‌‌that‌‌as‌‌the‌‌sample‌‌size‌‌(n)‌‌increases,‌‌the‌‌SE‌‌for‌‌the‌‌sum‌‌increases,‌‌but‌‌
the‌‌SE‌‌for‌‌the‌‌mean‌‌decreases‌ ‌
Summary‌ ‌


● The box model is a simple chance process involving drawing tickets from a fixed box (population).
● We can describe the behaviour of the Sum and the Mean of the sample in terms of the expected value (EV) and the standard error (SE), and compare them to the observed value (OV)
● We can find SD_box by using the shortcut formula or popsd()
● Given the mean and SD of the population
○ When there is one desired outcome: make the desired tickets a "1" and all other tickets "0"

Lecture‌‌15‌ ‌
The‌‌Central‌‌Limit‌‌Theorem‌ ‌
● If‌‌draws‌‌are‌‌independent‌‌and‌‌random‌‌with‌‌replacement‌‌and‌‌the‌‌sample‌‌size‌‌for‌‌
the‌‌sum‌‌(or‌‌average)‌‌is‌‌sufficiently‌‌large,‌‌then‌
○ The‌‌distribution‌‌‌for‌‌the‌‌sum‌‌(or‌‌average)‌‌will‌‌closely‌‌follow‌‌the‌n ‌ ormal‌‌
curve‌,‌‌even‌‌if‌‌the‌‌contents‌‌of‌‌the‌‌box‌‌do‌‌not‌ ‌
● “The‌‌Normal‌‌curve‌‌becomes‌‌a‌‌good‌‌model‌‌for‌‌the‌‌chance‌‌error‌‌of‌‌a‌‌sum‌‌(or‌‌
average)‌‌in‌‌sufficiently‌‌large‌‌samples”‌ ‌
● “As‌‌the‌‌sample‌‌size‌‌increases,‌‌the‌‌distribution‌‌for‌‌a‌‌sum‌‌(or‌‌average)‌‌tends‌‌
towards‌‌the‌‌Normal‌‌distribution”‌ ‌
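● A small simulation sketch of the CLT in R (box contents and sample size chosen arbitrarily):

    set.seed(1)
    box <- c(0, 0, 0, 1)   # a skewed box: 25% ones
    sums <- replicate(5000, sum(sample(box, 100, replace = TRUE)))
    hist(sums, freq = FALSE)   # the histogram of simulated sums looks close to a Normal curve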
Conditions‌‌for‌‌the‌‌CLT‌ ‌
● The‌‌draws‌‌must‌‌be‌‌random‌‌and‌‌independent‌‌from‌‌a‌‌fixed‌‌population‌ ‌
● The‌‌number‌‌of‌‌draws‌‌must‌‌be‌‌reasonably‌‌large‌‌‌(especially‌‌if‌‌the‌‌histogram‌‌of‌‌the‌‌
box‌‌differs‌‌from‌‌the‌‌normal‌‌curve)‌ ‌
● How‌‌large?‌‌This‌‌depends‌‌on‌‌the‌‌shape‌‌of‌‌the‌‌histogram‌ ‌
● A‌‌common‌‌convention‌‌is‌‌the‌‌number‌‌of‌‌draws‌‌larger‌‌than‌‌30‌‌(assuming‌‌a‌‌
basically‌‌symmetric‌‌distribution‌‌with‌‌no‌‌obvious‌‌outliers)‌ ‌
○ However,‌‌this‌‌is‌‌NOT‌‌‌a‌‌rule‌ ‌
Summary‌ ‌
● The Central Limit Theorem states that for repeated simulations of a chance process resulting in a sum or average, the simulation histogram of the observed values converges to the Normal distribution

Lecture‌‌16‌ ‌
Parameter‌‌&‌‌Estimate‌
● A‌p
‌ arameter‌‌‌is‌‌a‌‌numerical‌‌fact‌‌about‌‌the‌‌population‌‌which‌‌we‌‌are‌‌interested‌‌in.‌‌
For‌‌example‌‌the‌‌population‌‌mean‌‌μ‌‌or‌‌population‌‌standard‌‌deviation‌‌σ‌ ‌


● An estimate (or statistic) is a calculation from sample values which best predicts the parameter. For example the sample mean μ̂ (sometimes also denoted x̄) or the sample standard deviation σ̂
● Observed value (OV) = expected value (EV) + chance error
● In the Sample Mean case:
○ μ̂ = μ + chance error
■ The chance error is random by nature (noise). We can quantify the chance error by estimating the spread (= expected magnitude) of the chance error. This spread is called the standard error (SE) and it is the standard deviation of the chance error. It is often denoted by σ (as well)
■ Note that we now have two different σs. We can call the population SD σ_pop, and the standard error (= standard deviation of the chance error) remains denoted by σ. Note that σ_pop = SD(Box)
■ The central limit theorem tells us that for a sample mean, the chance error behaves approximately like N(0, σ²), with σ = (1 ÷ sqrt(sample size)) × SD(Box), and in practice we can estimate σ by σ̂ = (1 ÷ sqrt(sample size)) × SD(Sample). In particular SE = SD(chance error) = σ ≈ σ̂
● Mean and standard deviation describe a set of data. They are numerical summaries. Expected value and standard error describe the sum/mean of a random sample. The standard error is the standard deviation of the chance error
The‌‌Correction‌‌factor‌ ‌
● When sampling with replacement, the SE is determined by the absolute sample size
● When sampling without replacement, the SE will be decreased by increasing the ratio of sample size to population size, as when a higher proportion of the population is sampled, the variability will decrease
● When the sample is only a small part of the population, the size of the population has almost no effect on the SE of the estimate
● SE(without replacement) = correction factor × SE(with replacement), where
correction factor = sqrt( (population size − sample size) ÷ (population size − 1) )
Summary
● If we draw without replacement, then strictly the SE should be adjusted by the correction factor
○ correction factor = sqrt( (population size − sample size) ÷ (population size − 1) )
● However, when the population size is large compared to the sample size, the correction factor is almost 1
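● A one-line R sketch of the correction factor (the population and sample sizes are made up):

    correction_factor <- function(N, n) sqrt((N - n) / (N - 1))
    correction_factor(N = 10000, n = 100)  # close to 1 when the sample is a small part of the population
    correction_factor(N = 200, n = 100)    # noticeably below 1 when half the population is sampled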

Lecture‌‌17‌ ‌
Population‌‌vs‌‌Sample‌ ‌
● A‌s ‌ ample‌‌‌is‌‌a‌‌part‌‌of‌‌the‌p
‌ opulation‌ ‌
Limitation‌‌of‌‌a‌‌census‌ ‌
● Collecting‌‌every‌‌‌unit‌‌of‌‌a‌‌population:‌ ‌
○ Is‌‌hard‌ ‌
○ Takes‌‌lots‌‌of‌‌time‌ ‌
○ Costs‌‌a‌‌lot‌‌of‌‌money‌ ‌
○ Requires‌‌lots‌‌of‌‌resources‌ ‌
Finding‌‌the‌‌best‌‌estimate‌‌of‌‌the‌‌parameter‌ ‌
● Much‌‌Statistical‌‌theory‌‌is‌‌concerned‌‌with‌‌how‌‌to‌‌find‌‌the‌‌best‌‌estimate‌‌of‌‌a‌‌
parameter‌ ‌
● 2‌‌critical‌‌issues‌‌are:‌ ‌
○ How‌‌was‌‌the‌‌sample‌‌chosen?‌‌Is‌‌it‌‌representative‌‌of‌‌the‌‌population?‌ ‌
○ What‌‌estimate‌‌is‌‌closest‌‌to‌‌the‌‌parameter?‌ ‌
Examples‌‌of‌‌how‌‌bias‌‌can‌‌occur‌ ‌
● If there is a systematic tendency to exclude or include certain types of people in the sample
○ E.g.‌‌Convenience‌‌sampling‌‌(or‌‌“grab‌‌sampling”):‌‌A‌‌non-probability‌‌
sampling‌‌technique‌‌where‌‌subjects‌‌are‌‌selected‌‌because‌‌of‌‌their‌‌
convenient‌‌accessibility.‌‌It‌‌is‌‌definitely‌‌not‌‌recommended,‌‌except‌‌possibly‌‌
to‌‌test‌‌a‌‌survey‌‌(pilot)‌ ‌
● If‌‌some‌‌participants‌‌fail‌‌to‌‌complete‌‌surveys‌ ‌
○ What‌‌was‌‌the‌‌response‌‌rate?‌ ‌
○ Non-respondents‌‌can‌‌be‌‌very‌‌different‌‌to‌‌respondents‌ ‌
● If‌‌characteristics‌‌of‌‌the‌‌interview‌‌have‌‌an‌‌effect‌‌on‌‌the‌‌answer‌‌given‌‌by‌‌
participants‌ ‌
● If‌‌the‌‌form‌‌of‌‌the‌‌question‌‌in‌‌the‌‌survey‌‌affects‌‌the‌‌response‌‌to‌‌the‌‌question‌ ‌
● Because‌‌people‌‌may‌‌forget‌‌details‌ ‌
● Because‌‌of‌‌sensitive‌‌questions:‌‌people‌‌may‌‌not‌‌tell‌‌the‌‌truth‌ ‌
● Because‌‌of‌‌lack‌‌of‌‌clarity‌‌in‌‌the‌‌question‌ ‌
● Because‌‌attributes‌‌of‌‌the‌‌interview‌‌process‌‌may‌‌cause‌‌bias‌ ‌


Warning‌‌about‌‌bias‌‌and‌‌sample‌‌size‌ ‌
● When‌‌a‌‌selection‌‌process‌‌is‌‌biased,‌‌taking‌‌a‌‌larger‌‌sample‌‌‌does‌‌not‌‌reduce‌‌
bias,‌‌rather‌‌it‌‌can‌‌amplify‌‌‌the‌‌bias.‌‌It‌‌repeats‌‌the‌‌mistake‌‌on‌‌a‌‌larger‌‌scale‌ ‌
● In‌‌the‌‌famous‌‌1936‌‌US‌‌elections,‌‌the‌‌Literary‌‌digest‌‌‌magazine‌‌predicted‌‌an‌‌
overwhelming‌‌victory‌‌for‌‌Alfred‌‌Landon‌‌over‌‌Franklin‌‌Roosevelt,‌‌based‌‌on‌‌a‌‌poll‌‌
of‌‌2.4‌‌million‌‌people.‌‌However,‌‌Roosevelt‌‌won‌‌62%‌‌to‌‌38%.‌‌The‌‌Digest‌‌went‌‌
bankrupt‌‌soon‌‌after‌ ‌
● The‌‌problem‌‌was‌‌that‌‌their‌‌sampling‌‌procedure‌‌involved‌‌mailing‌‌questionnaires‌‌
to‌‌10‌‌million‌‌people,‌‌with‌‌names‌‌and‌‌addresses‌‌from‌‌sources‌‌that‌‌were‌‌biased‌‌
against‌‌the‌‌poor‌ ‌
How‌‌to‌‌pick‌‌a‌‌good‌‌sample?‌ ‌
● A‌‌sampling‌‌procedure‌‌should‌‌give‌‌a‌‌representative‌‌cross‌‌section‌‌of‌‌the‌‌
population‌ ‌
● We‌‌use‌‌a‌p ‌ robability‌‌method‌‌‌to‌‌pick‌‌the‌‌sample,‌‌so‌‌that‌ ‌
○ The‌‌interviewer‌‌is‌‌not‌‌involved‌‌in‌‌the‌‌selection.‌‌The‌‌method‌‌of‌‌selection‌‌is‌‌
impartial‌ ‌
○ The‌‌interviewer‌‌can‌‌compute‌‌the‌‌chance‌‌of‌‌any‌‌particular‌‌individuals‌‌
being‌‌chosen‌‌i.e.‌‌There‌‌is‌‌a‌‌defined‌‌procedure‌‌for‌‌selecting‌‌the‌‌sample,‌‌
which‌‌uses‌‌chance.‌‌It‌‌is‌‌objective.‌ ‌
● For‌‌example,‌‌Simple‌‌random‌‌sampling‌‌‌involves‌‌drawing‌‌at‌‌random‌‌without‌‌
replacement‌ ‌
● Multi-stage‌‌cluster‌‌sampling‌ ‌
○ As‌‌simple‌‌random‌‌sampling‌‌is‌‌often‌‌not‌‌practical,‌‌organisations‌‌may‌‌use‌‌
multi-stage‌‌cluster‌‌sampling.‌‌This‌‌is‌‌a‌‌probability‌‌sampling‌‌technique‌‌
which‌‌takes‌‌samples‌‌in‌‌stages,‌‌and‌‌individuals‌‌or‌‌clusters‌‌are‌‌chosen‌‌at‌‌
random‌‌at‌‌each‌‌stage.‌ ‌
Unavoidable‌‌Bias‌ ‌
● Even‌‌with‌‌a‌‌probability‌‌method‌‌determining‌‌the‌‌sample,‌‌bias‌‌can‌‌easily‌‌come‌‌in‌ ‌
● In‌‌addition,‌‌because‌‌the‌‌sample‌‌is‌‌only‌‌part‌‌of‌‌the‌‌population,‌‌we‌‌always‌‌have‌‌
chance‌‌error‌ ‌
○ Parameter‌‌estimate‌‌=‌‌true‌‌parameter‌‌+‌‌bias‌‌+‌‌chance‌‌error‌ ‌
Summary‌ ‌
● Unless‌‌a‌‌census‌‌is‌‌possible,‌‌information‌‌about‌‌a‌‌population‌‌comes‌‌from‌‌an‌‌
estimate‌‌from‌‌a‌‌sample.‌‌The‌‌reliability‌‌of‌‌such‌‌an‌‌estimate‌‌depends‌‌on‌‌how‌‌the‌‌
sample‌‌was‌‌chosen.‌‌Hence,‌‌we‌‌usually‌‌have:‌ ‌
○ Observed‌‌value‌‌=‌‌true‌‌parameter‌‌+‌‌bias‌‌+‌‌chance‌‌error;‌‌or‌ ‌
○ Parameter‌‌estimate‌‌=‌‌true‌‌parameter‌‌+‌‌bias‌‌+‌‌chance‌‌error‌ ‌


Lecture‌‌18‌ ‌
Confidence‌‌Interval‌ ‌
● A‌‌confidence‌‌interval‌‌quantifies‌‌the‌‌uncertainty‌‌of‌‌our‌‌estimates.‌ ‌
● A‌q ‌ %‌‌‌confidence‌‌interval‌‌covers‌‌the‌‌true‌‌parameter‌‌with‌q ‌ %‌‌‌probability.‌‌More‌‌
precisely,‌‌if‌‌you‌‌calculated‌‌intervals‌‌for‌‌many‌‌samples‌‌under‌‌the‌‌same‌‌setting,‌‌
q%‌‌‌of‌‌them‌‌would‌‌cover‌‌the‌‌true‌‌parameter‌ ‌
● If the chance error follows a symmetric distribution, then a q% confidence interval is given by:
○ Observed Value ± the (1 − (1 − q)/2)-th percentile of the chance error
○ For the 95% confidence interval we thus have
■ [OV − 97.5th percentile (CE), OV + 97.5th percentile (CE)]
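● A minimal sketch of a 95% confidence interval for a mean in R (the data are invented; the Normal 97.5th percentile is used for the chance error):

    x <- c(12, 15, 9, 11, 14, 13, 10, 16, 12, 13)
    se <- sd(x) / sqrt(length(x))            # estimated standard error of the sample mean
    mean(x) + c(-1, 1) * qnorm(0.975) * se   # approximate 95% confidence interval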
Hypothesis‌‌Testing‌ ‌
● In‌H ‌ ypothesis‌‌Testing‌,‌‌we‌‌start‌‌with‌‌a‌h
‌ ypothesis‌‌‌about‌‌our‌‌population.‌‌For‌‌
example:‌ ‌
○ “The‌‌coin‌‌is‌‌fair‌‌(so‌‌the‌‌population‌‌mean‌‌is‌‌0.5)”‌ ‌
● We‌‌then‌‌calculate‌‌the‌c‌ hance‌‌error‌‌‌and‌d ‌ ecide‌‌‌whether:‌ ‌
○ The‌‌chance‌‌error‌‌fell‌‌within‌‌an‌‌interval‌‌to‌‌be‌‌expected‌‌→‌‌Our‌‌data‌‌is‌‌
consistent‌‌with‌‌the‌‌hypothesis‌ ‌
○ The‌‌chance‌‌error‌‌was‌‌extremely‌‌big‌‌→‌‌Either‌‌we‌‌observed‌‌a‌‌very‌‌rare‌‌
event‌‌or‌‌our‌‌hypothesis‌‌is‌‌wrong‌ ‌
3‌‌Main‌‌Steps‌ ‌
● Set‌‌up‌‌research‌‌question‌ ‌
○ Hypothesis‌‌H 0 vs H 1 ‌
● Weigh‌‌up‌‌evidence‌ ‌
○ Assumptions‌ ‌
○ Test‌‌Statistic‌ ‌
○ P-value‌ ‌
● Explain‌‌conclusion‌ ‌
○ Conclusion‌ ‌
Why‌‌hypothesis‌‌testing?‌ ‌
● To‌‌make‌‌evidence‌‌based‌‌decisions,‌‌we‌‌need‌‌to‌‌weigh‌‌up‌‌‌evidence,‌ ‌
● Hypothesis‌‌Testing‌‌is‌‌a‌‌scientific‌‌method‌‌for‌‌weighing‌‌up‌‌the‌‌evidence‌‌given‌‌in‌‌
the‌‌data‌‌against‌‌a‌‌given‌‌hypothesis‌‌(model)‌ ‌
○ We‌‌say‌‌that‌‌the‌‌data‌‌is‌‌not‌‌consistent‌‌with‌‌the‌‌hypothesis‌‌if‌‌the‌‌difference‌‌
between‌‌the‌‌observed‌‌value‌‌(in‌‌our‌‌case‌‌sample‌‌mean‌‌or‌‌sample‌‌sum)‌‌
and‌‌the‌‌expected‌‌value‌‌(assuming‌‌the‌‌hypothesis)‌‌is‌‌too‌‌big‌ ‌
○ Alternative‌‌formulation:‌‌If‌‌the‌‌chance‌‌error‌‌is‌‌too‌‌big‌‌we‌‌should‌‌consider‌‌to‌‌
reject‌‌the‌‌hypothesis‌ ‌


Summary‌ ‌
● A confidence interval quantifies the uncertainty of our estimates. A q% confidence interval covers the true parameter with q% probability
● Hypothesis testing is a scientific method for weighing up the evidence in the data against a given hypothesis (model)

Lecture‌‌19‌ ‌
The‌‌Z‌‌Test‌ ‌
● This‌‌test‌‌is‌‌used‌‌to‌‌test‌‌a‌‌hypothesis‌‌about‌‌a‌p‌ roportion‌‌‌in‌‌a‌‌population‌ ‌
● Some‌‌examples‌‌could‌‌be:‌ ‌
○ Is the proportion of coin flips that are heads equal to 50%?
○ Is the proportion of CEOs that are female less than 50%?
○ Is the proportion of students that drop out of school greater than 25%?
H:‌‌Hypothesis‌ ‌
● The‌‌null‌‌hypothesis‌‌‌H 0 ‌postulates‌‌a‌‌certain‌‌expected‌‌value‌ ‌
● The‌‌alternative‌‌hypothesis‌‌‌H 1 ‌is‌‌that‌‌the‌‌underlying‌‌expected‌‌value‌‌is‌‌actually‌‌
different‌ ‌
● When‌‌performing‌‌a‌‌statistical‌‌test,‌‌we‌‌calculate‌‌the‌‌chance‌‌error‌‌under‌‌H 0 ‌and‌‌
weigh‌‌up‌‌whether‌‌its‌‌size‌‌is‌‌plausible‌‌or‌‌not‌ ‌
A:‌‌Assumption‌‌(for‌‌Z‌‌Test)‌ ‌
● Observation‌‌are‌‌independent‌‌‌of‌‌each‌‌other‌ ‌
● Sample‌‌mean‌‌(sample‌‌sum)‌‌follows‌‌a‌n ‌ ormal‌‌distribution‌‌‌or‌‌sample‌‌size‌‌is‌‌big‌‌
enough‌‌such‌‌that‌‌normality‌‌is‌‌approximately‌‌satisfied‌‌(from‌‌Central‌‌Limit‌‌
Theorem)‌ ‌
○ We‌‌do‌N ‌ OT‌‌‌need‌‌the‌‌data‌‌to‌‌be‌‌normal‌ ‌
○ But‌‌if‌‌the‌‌sample‌‌size‌‌is‌‌not‌‌big‌‌enough,‌‌then‌‌we‌‌can‌‌use‌‌that‌‌if‌‌the‌‌data‌‌is‌‌
approximately‌‌normal,‌‌the‌‌sample‌‌mean‌‌and‌‌sample‌‌sum‌‌will‌‌be‌‌
approximately‌‌normal‌‌as‌‌well‌ ‌
T:‌‌Test‌‌statistic‌‌(for‌‌Z‌‌test)‌ ‌
● A test statistic measures the difference between what is observed in the data and what is expected under the null hypothesis
● It takes the form
○ Test statistic = (observed value (OV) − expected value (EV)) ÷ standard error (SE) = chance error (CE) ÷ standard error (SE)
● NOTE:‌‌If‌‌the‌‌null‌‌hypothesis‌‌is‌‌true,‌‌then‌‌the‌‌test‌‌statistic‌‌follows‌‌a‌‌standard‌‌
normal‌‌curve:‌‌N(0,1)‌ ‌
P:‌‌p-value‌‌(for‌‌Z‌‌test)‌ ‌


● The p-value is the chance of observing the test statistic (or something more extreme), assuming H 0 is true:
○ In‌‌a‌‌Z‌‌test,‌‌the‌‌test‌‌statistic‌‌follows‌‌a‌‌standard‌‌normal‌‌curve,‌‌hence‌‌the‌‌
p-value‌‌is‌‌given‌‌by‌ ‌
■ p = P (Z ≥ ∣test statistic∣) ‌
○ Where‌‌Z ‌‌is‌‌a‌‌standard‌‌normal:‌‌Z ~N (0, 1) ‌
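● A small R sketch of a one-sided Z test for a proportion (the counts are invented for illustration):

    n <- 100; x <- 60; p0 <- 0.5     # 60 heads in 100 tosses, testing p = 0.5
    se <- sqrt(p0 * (1 - p0) / n)    # SE of the sample proportion under H0
    z <- (x / n - p0) / se           # test statistic
    pnorm(z, lower.tail = FALSE)     # one-sided p-value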
P:‌‌p-value‌‌(In‌‌general)‌ ‌
● In general (for all tests), the smaller the p-value, the less likely it is to observe a test statistic of the magnitude observed. If the p-value is small enough, it provides evidence to reject the null hypothesis, H 0.
● The convention is to reject the null hypothesis if p < α, where α is a predetermined significance level, often chosen as α = 0.05
● However, you don't need to follow this convention strictly, and arguably you shouldn't
Summary‌‌of‌‌the‌‌hypothesis‌‌test‌ ‌
● H: If p = proportion of patients who responded to the treatment, we test H 0: p = 0.8 vs H 1: p > 0.8
● A: We assume that the participants in the treatment group are independent of each other and, given a sample size of 29, the sample mean is approximately normal
● T: The test statistic for the observed sum is 1.3
● P: The p-value for this test statistic is 0.097
● C: As the p-value is greater than 0.05, we do not have enough evidence to reject the null hypothesis, and so the data does not provide strong evidence that p > 0.8
One-sided‌‌and‌‌Two-sided‌‌Tests‌ ‌
● 1‌‌sided:‌ ‌
○ Specifies‌‌the‌‌direction‌‌of‌‌the‌‌alternative‌‌hypothesis.‌‌Eg‌‌H 1 : p > 0.8 ‌
● 2‌‌sided:‌ ‌
○ Does not specify the direction of the alternative hypothesis. Eg H 1: p ≠ 0.8
○ In‌‌this‌‌case‌‌the‌‌p-value‌‌doubles‌ ‌
Summary‌ ‌
● Hypothesis‌‌testing‌‌‌is‌‌a‌‌scientific‌‌method‌‌for‌‌weighing‌‌up‌‌the‌‌evidence‌‌given‌‌in‌‌
the‌‌data‌‌against‌‌a‌‌given‌‌hypothesis‌‌(model).‌‌It‌‌involves‌‌the‌‌following‌‌parts:‌
○ H:‌‌Hypothesis‌‌H 0 ‌vs‌‌H 1 ‌
○ A:‌‌Assumptions‌ ‌
○ T:‌‌Test‌‌Statistic‌ ‌
○ P:‌‌p-value‌ ‌


○ C:‌‌Conclusion‌ ‌
● The‌‌Z‌‌test‌‌is‌‌used‌‌to‌‌test‌‌a‌‌hypothesis‌‌about‌‌a‌p
‌ roportion‌‌‌in‌‌a‌‌population‌ ‌

Lecture‌‌20‌ ‌
When‌‌to‌‌use‌‌the‌‌Z-test‌ ‌
● ‌To‌‌use‌‌the‌‌Z-test,‌‌we‌‌need‌‌to‌k
‌ now‌‌‌the‌‌population‌‌SD‌ ‌
○ One‌‌plausible‌‌case:‌‌Under‌‌H 0 ‌in‌‌a‌‌binary‌‌case‌‌assuming‌‌a‌‌proportion‌‌
(and‌‌using‌‌Box‌‌SD)‌ ‌
● Can‌‌we‌‌just‌e ‌ stimate‌‌‌the‌‌population‌‌SD‌‌using‌‌the‌‌sample‌‌SD?‌ ‌
○ Yes‌ ‌
○ But‌‌this‌‌estimation‌‌will‌‌add‌e ‌ xtra‌‌variability‌‌‌to‌‌the‌‌test‌‌statistic‌‌as‌‌the‌‌
sample‌‌SD‌‌varies‌‌from‌‌sample‌‌to‌‌sample‌ ‌
○ For‌‌large‌‌samples,‌‌the‌‌difference‌‌between‌‌the‌‌population‌‌and‌‌sample‌‌SD‌‌
should‌‌be‌‌small,‌‌and‌‌so‌‌the‌‌Z-test‌‌may‌‌be‌‌appropriate‌ ‌
○ For‌‌small‌‌samples,‌‌the‌‌difference‌‌will‌‌be‌‌more‌‌noticeable.‌‌Hence,‌‌we‌‌
should‌‌use‌‌the‌t‌ -Test‌‌‌instead‌ ‌
The‌‌t-Test‌ ‌
● W.S‌‌Gosset‌‌(1876-1936)‌‌invented‌‌a‌‌similar‌‌test‌‌to‌‌the‌‌Z-test,‌‌which‌‌uses‌‌the‌‌
sample‌‌SD‌‌and‌‌the‌‌t-distribution‌ ‌
● The t-distribution varies in shape according to the sample size. The smaller the sample size, the more variable the sample is, and hence the distribution of the test statistic will be "wider". The degree of "wideness" (also called degrees of freedom) depends on the sample size, and here it is n − 1. We write such a distribution as t_(n−1)
The t-distribution
● The t-distribution with ν degrees of freedom is written t_ν
● ν = ∞ results in the standard normal distribution
● The standardised chance error (if standardised with the sample SD) follows a t-distribution with ν = n − 1 = sample size − 1
Summary:‌‌T-Test‌ ‌
● H: H 0: population mean = μ vs H 1: population mean <, >, or ≠ μ
● A: Individuals are independent; sample size is large enough for the CLT (or population is normal)
● T: Test statistic = (observed mean − population mean) ÷ SE_hat, where SE_hat = sample SD ÷ sqrt(n)
● P:Use‌‌tn−1 ‌curve‌‌to‌‌find‌‌tail‌‌area‌‌for‌‌observed‌‌test‌‌statistic‌ ‌


○ 1-sided:‌P (tn−1 > ∣T est statistic∣) ‌
○ 2-sided:‌2 × P (tn−1 > ∣T est statistic∣) ‌
● C:‌‌Retain‌‌or‌‌Reject‌‌H 0 ‌
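● A minimal one-sample t-test sketch in R (invented data, testing a hypothesised mean of 10):

    x <- c(9.1, 10.4, 11.2, 8.7, 10.9, 9.8, 10.5, 11.0)
    t.test(x, mu = 10)                           # two-sided by default
    t.test(x, mu = 10, alternative = "greater")  # one-sided version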
Summary‌ ‌
● The‌‌t-test‌‌is‌‌used‌‌to‌‌decide‌‌whether‌‌an‌‌observed‌‌difference‌‌between‌‌data‌‌and‌‌
expected‌‌value‌‌is‌‌just‌‌due‌‌to‌‌chance‌‌error‌‌alone‌‌(the‌‌null‌‌hypothesis)‌‌or‌‌another‌‌
reason‌‌(alternative‌‌hypothesis)‌ ‌
● If‌‌the‌‌population‌‌SD‌‌is‌‌unknown,‌‌we‌‌use‌‌the‌‌t-test,especially‌‌in‌‌the‌‌case‌‌of‌‌small‌‌
samples‌ ‌
● The test statistic is:
○ (observed value − expected value) ÷ SE_hat
● We‌‌can‌‌also‌‌use‌‌the‌‌t-distribution‌‌to‌‌construct‌‌confidence‌‌intervals‌ ‌

Lecture‌‌21‌ ‌
Inference‌ ‌
● While visualisation of the data gives us an initial glimpse at the possible relationship between the two populations (those who have drunk a Red Bull and those who have not), we often want to make a decision on whether the mean of the two populations is the same or different.
● Inference‌‌‌is‌‌making‌‌a‌‌decision‌‌about‌‌population‌‌parameter(s)‌‌based‌‌on‌‌a‌‌
sample‌ ‌
2-Sample‌‌T-Test‌ ‌
● H:‌‌Hypothesis‌ ‌
○ μ1 ‌=‌‌mean‌‌heart‌‌rate‌‌of‌‌the‌‌control‌‌group‌ ‌
○ μ2 ‌=‌‌mean‌‌heart‌‌rate‌‌our‌‌treatment‌‌group‌ ‌
○ H 0 ‌:‌‌There‌‌is‌‌no‌‌difference:‌μ1 = μ2 ‌,‌‌or‌‌μ1 − μ2 = 0 ‌
○ H 1 ‌:‌‌There‌‌is‌‌a‌‌difference:‌μ1 =/ μ2 ‌,‌‌or‌‌μ1 − μ2 =/ 0 ‌
● A:‌‌Assumption‌ ‌
○ A1)‌‌All‌‌observed‌‌individuals‌‌are‌‌independent‌‌(within‌‌groups‌‌and‌‌between‌‌
different‌‌groups)‌ ‌
■ The‌‌two‌‌sample‌‌(Red‌‌Bull‌‌and‌‌Control)‌‌contain‌‌different‌‌people‌ ‌
● Note:‌‌This‌‌design‌‌differs‌‌from‌‌the‌‌caffeine‌‌one‌‌in‌‌which‌‌the‌‌
same‌‌person‌‌is‌‌tested‌‌at‌‌both‌‌0‌‌and‌‌13‌‌mg‌‌level‌‌of‌‌caffeine‌‌
and‌‌we‌‌consider‌‌the‌‌sample‌‌of‌‌differences‌‌from‌‌each‌‌pair‌ ‌


● The‌‌paired‌‌differences‌‌can‌‌eliminate‌‌personal‌‌effect‌‌on‌‌the‌‌
experimental‌‌result‌‌but‌‌it‌‌is‌‌also‌‌harder‌‌to‌‌find‌‌the‌‌same‌‌
person‌‌to‌‌undergo‌‌both‌‌treatment‌‌and‌‌control‌‌for‌‌some‌‌
experiments‌ ‌
○ A2)‌‌The‌‌sample‌‌means‌‌follow‌‌a‌‌Normal‌‌distribution‌ ‌
■ Our‌‌samples‌‌are‌‌quite‌‌small,‌‌so‌‌the‌‌Central‌‌limit‌‌Theorem‌‌might‌‌not‌‌
fully‌‌kick‌‌in.‌‌Hence,‌‌2‌‌sample‌‌t-Test‌‌is‌‌questionable‌ ‌
○ A3)‌‌The‌‌2‌‌populations‌‌have‌‌equal‌‌spread‌‌(SD/variance)‌ ‌
■ We‌‌assume‌‌that‌‌the‌‌2‌‌populations‌‌have‌‌the‌s ‌ ame‌‌variation‌‌‌in‌‌
heart‌‌rate‌ ‌
■ Check:‌‌Box‌‌Plots,‌‌Histograms,‌‌Variance‌‌Test‌ ‌
● Box‌‌plots‌‌show‌‌that‌‌RB‌‌seems‌‌to‌‌have‌‌smaller‌‌sd.‌‌But‌‌
difference‌‌might‌‌not‌‌be‌‌significant‌ ‌
■ Better:‌‌This‌‌assumption‌‌can‌‌be‌‌relaxed‌‌by‌‌using‌‌the‌W ‌ elch‌‌2‌‌
sample‌‌T-Test‌ ‌
● T: Test Statistic
○ Equal variance
■ We compare 2 populations. Our observed value is the difference in sample means. Our null hypothesis is that there is no difference in population means
● test statistic = (OV − EV) ÷ SE_hat = (x̄1 − x̄2 − 0) ÷ SE_hat, where
SE_hat = sqrt( SD²_p × (1/n1 + 1/n2) ), df = n1 + n2 − 2, based on the pooled sample SD, where SD²_p = ( (n1 − 1)SD²_1 + (n2 − 1)SD²_2 ) ÷ (n1 + n2 − 2)
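● A short R sketch of the equal-variance two-sample t-test (the heart-rate vectors are invented for illustration):

    control <- c(68, 72, 75, 70, 74, 69, 71)
    redbull <- c(74, 78, 76, 80, 77, 75, 79)
    t.test(redbull, control, var.equal = TRUE)   # pooled-SD two-sample t-test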
Summary‌ ‌
● The‌‌2‌‌Sample‌‌T-Test‌‌‌is‌‌used‌‌to‌‌test‌‌for‌‌the‌d
‌ ifference‌‌in‌‌means‌‌‌of‌‌two‌‌
populations‌ ‌
● We‌‌need‌‌to‌‌assume‌‌that:‌ ‌
○ All‌‌observed‌‌individuals‌‌are‌i‌ndependent‌ ‌
○ The‌‌sample‌‌means‌‌follow‌‌a‌‌Normal‌‌distribution‌ ‌
○ The‌‌2‌‌populations‌‌have‌e ‌ qual‌‌spread‌‌‌(SD/variance)‌ ‌
● We‌‌can‌‌relax‌‌the‌‌final‌‌assumption‌‌by‌‌using‌‌a‌W ‌ elch‌‌Two-Sample‌‌T-Test‌ ‌

Lecture‌‌22‌ ‌
Welch‌‌2-Sample‌‌T-test‌ ‌
● Welch‌‌2-Sample‌‌T-test‌‌has‌‌a‌‌different‌‌SE‌‌and‌‌df‌‌formula‌‌compared‌‌to‌‌the‌‌
standard‌‌two‌‌sample‌‌t-test‌ ‌


● Standard error and df formula for the difference with unequal variance:
○ SE = sqrt( s1²/n1 + s2²/n2 )
○ df = ( s1²/n1 + s2²/n2 )² ÷ [ (s1²/n1)² ÷ (n1 − 1) + (s2²/n2)² ÷ (n2 − 1) ]
■ where s_k = SD_hat(Sample_k), k = 1, 2
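● In R, t.test() performs the Welch version by default (same invented heart-rate vectors as in the sketch above):

    control <- c(68, 72, 75, 70, 74, 69, 71)
    redbull <- c(74, 78, 76, 80, 77, 75, 79)
    t.test(redbull, control)   # Welch two-sample t-test; unequal variances is R's default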
Non-Independent‌‌Data‌‌(Paired‌‌T-test)‌ ‌
● Sometimes‌‌it‌‌is‌‌desirable‌‌to‌‌analyse‌‌dependent‌‌‌data.‌‌We‌‌often‌‌design‌‌an‌‌
experiment‌‌to‌‌take‌‌advantage‌‌of‌‌this‌‌dependency‌‌in‌‌order‌‌to‌‌control‌‌variation‌‌
between‌‌experimental‌‌groups‌‌ ‌
Summary‌ ‌
● We can perform a Welch Two-Sample t-test to compare two populations with different variances
● We can perform a Paired t-test (one sample t-test on the differences) if we have paired data
● We can perform a t-test for the slope of a regression line

You might also like