Session 07 (Inference)

Uploaded by

Gerry Contillo

Data Analysis, Statistics, Machine Learning

Leland Wilkinson

Adjunct Professor
UIC Computer Science
Chief Scientist
H2O.ai

Inference
o Inference involves drawing conclusions from evidence
  o In logic, the evidence is a set of premises
  o In data analysis, the evidence is a set of data
  o In statistics, the evidence is a sample from a population
    o A population is assumed to have a distribution
    o The sample is assumed to be random (there are ways around that)
    o The population may be the same size as the sample
o There are two historical approaches to statistical inference
  o Frequentist
  o Bayesian
o There are many widespread abuses of statistical inference
  o We cherry pick our results (scientists, journals, reporters, …)
  o We didn’t have a big enough sample to detect a real difference
  o We think a large sample guarantees accuracy (the bigger the better)

Copyright © 2016 Leland Wilkinson

Inference
Deductive (top down)
  All men are mortal. (premise)
  Apollo is a man. (premise)
  Therefore, Apollo is mortal. (conclusion)
  The conclusion is guaranteed if premises are true
Abductive
  Bill and Jane had a fight and stopped seeing each other
  I just saw Bill and Jane having coffee together
  I conclude they are friends again
  The conclusion is not guaranteed even if premise(s) are true
Inductive (bottom up)
  All of the swans we have seen are white.
  Therefore, all swans are white.
  The conclusion is not guaranteed even if premise(s) are true
  There exist black swans (also blue lobsters)
Mathematical proofs are deductive
Data-analytic inference tends to be abductive
Statistical inference tends to be inductive
Abductive and inductive inference necessarily involve risk

Inference
o Data Analytic Inference
  o Works like a legal argument
  o Collect evidence
  o Evaluate the believability of each piece of evidence
  o Combine evidence
  o Draw a conclusion

[Diagram: cycle of Collect, Evaluate, Combine, Conclude]

Inference
o Data Analytic Inference
  o When data analysis is sufficient (without needing statistics)
    o The data are deterministic

Inference
o Data Analytic Inference
  o When data analysis is sufficient
    o Error distribution is ignorable (Berkson’s Interocular Traumatic Test)
    o Shepard and Metzler: mental rotation angle beautifully predicted reaction time
      o The time it took to identify whether two figures were the same

[Figure: Shepard & Metzler (1971)]

Inference
o Data Analytic Inference
  o When a graph tells the whole story

Inference
o Statistical Inference
  o Based on probability distributions
  o Sample data
  o Apply a probability model to the data
  o Assess the adequacy of the model
  o Draw conclusion with resulting level of confidence

[Diagram: cycle of Sample, Model, Assess, Conclude]

Inference
o Inferring Parameters of a Distribution via Maximum Likelihood
  o We have a sample
  o We know (assume) it is a simple random sample (SRS) from a population
    o SRS: every possible sample of size n has an equal probability of being selected
    o Each observation sampled is independent of the others sampled
  o We know (assume) the probability distribution representing the population
o What is not a random sample?
  o Every other case, record, instance, number in the phone book, etc.
  o First n cases
  o Any method that fails to consider every possible case
  o Persi Diaconis tossing a coin (he can toss heads every time)
  o Human-generated random numbers (people can’t imitate randomness)
  o Pseudo-random numbers
    o Actually, good algorithms produce numbers indistinguishable from truly random
    o So we use them and hold our breath

Inference
o Inferring Parameters of a Distribution via Maximum Likelihood
  o How do we infer the parameter(s) of that probability distribution?
  o The likelihood that θ is our parameter value, based on our sample information, is:
    $$L(\theta; x_1, \ldots, x_n) = P(x_1, \ldots, x_n; \theta)$$
  o The likelihood is the probability of observing our sample values based on different values of θ
  o It is not a probability density function (its mass or the area under it is not 1)
  o We are going to maximize this likelihood in order to estimate θ
  o Because we want our estimate to be the most likely value to have generated our data

Inference
o Inferring Parameters of a Distribution via Maximum Likelihood
  o Likelihood functions are not probability density functions

Inference
o Inferring Parameters of a Distribution via Maximum Likelihood
  o Given the product rule for independent events,
    $$L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta)$$
  o The product function is rather awkward, so we log the likelihood
    $$l(\theta; x_1, \ldots, x_n) = \log\left[L(\theta; x_1, \ldots, x_n)\right] = \sum_{i=1}^{n} \log f(x_i; \theta)$$
  o Maximizing the log-likelihood is equivalent to maximizing the likelihood
  o So, all we need is f(.) for a given probability distribution
  o Problems:
    o A closed-form solution may not exist
      o In that case, we have to use numerical optimization
    o In other cases, there may be no maximum
      o In that case, we are hosed
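A minimal sketch of the numerical-optimization route, for illustration only. It assumes an Exponential(rate) model (not one used in the slides) because its MLE also has a closed form, n / sum(x), so the numerical answer can be checked. The golden-section search and the helper names are illustrative choices, not the author's method.

```python
import math
import random

def exp_log_likelihood(rate, xs):
    """Log-likelihood of an Exponential(rate) model: sum of log f(x_i; rate)."""
    return len(xs) * math.log(rate) - rate * sum(xs)

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d          # the maximum lies in [a, d]
        else:
            a = c          # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2

random.seed(1)
xs = [random.expovariate(2.0) for _ in range(500)]  # simulated data, true rate 2

rate_numeric = golden_section_max(lambda r: exp_log_likelihood(r, xs), 1e-6, 50.0)
rate_closed_form = len(xs) / sum(xs)
print(rate_numeric, rate_closed_form)
```

The two printed values should agree to several decimal places. For models with several parameters, a general-purpose optimizer would replace the one-dimensional search.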
Inference
o Maximum Likelihood Estimates for Normal Distribution
  o $N(\mu, \sigma^2)$ has two parameters
  o Density is
    $$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
  o Likelihood is product of densities
    $$L(\mu, \sigma^2; x_1, \ldots, x_n) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right)$$
  o And log-likelihood is
    $$l(\mu, \sigma^2; x_1, \ldots, x_n) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

Inference
o Maximum Likelihood Estimates for Normal Distribution
  o Maximizing the log-likelihood
    $$\max_{\mu, \sigma^2} \; l(\mu, \sigma^2; x_1, \ldots, x_n)$$
  o requires
    $$\frac{\partial}{\partial \mu} l(\mu, \sigma^2; x_1, \ldots, x_n) = 0$$
    $$\frac{\partial}{\partial \sigma^2} l(\mu, \sigma^2; x_1, \ldots, x_n) = 0$$
  o The respective partial derivatives are
    $$\frac{\partial}{\partial \mu} l(\mu, \sigma^2; x_1, \ldots, x_n) = \frac{1}{\sigma^2}\left(\sum_{i=1}^{n} x_i - n\mu\right)$$
  o and
    $$\frac{\partial}{\partial \sigma^2} l(\mu, \sigma^2; x_1, \ldots, x_n) = \frac{1}{2\sigma^2}\left(\frac{1}{\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2 - n\right)$$

Inference
o Maximum Likelihood Estimates for Normal Distribution
  o Maximizing the log-likelihood with respect to µ
    $$\frac{\partial}{\partial \mu} l(\mu, \sigma^2; x_1, \ldots, x_n) = \frac{1}{\sigma^2}\left(\sum_{i=1}^{n} x_i - n\mu\right) = 0$$
  o implies (because $\sigma^2$ cannot be 0)
    $$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
  o and maximizing it with respect to $\sigma^2$
    $$\frac{\partial}{\partial \sigma^2} l(\mu, \sigma^2; x_1, \ldots, x_n) = \frac{1}{2\sigma^2}\left(\frac{1}{\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2 - n\right) = 0$$
  o implies
    $$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$$
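A short numerical check of the closed-form estimates, as a sketch on simulated data (the seed, sample size and true parameters are arbitrary assumptions). It computes the sample mean and the divide-by-n variance, then confirms that the normal log-likelihood is lower at nearby parameter values.

```python
import math
import random
import statistics

def normal_log_likelihood(mu, sigma2, xs):
    """Normal log-likelihood in the (mu, sigma^2) parameterization."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - ss / (2 * sigma2))

random.seed(7)
xs = [random.gauss(10.0, 3.0) for _ in range(200)]  # simulated sample
n = len(xs)

mu_hat = sum(xs) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divides by n, not n - 1

best = normal_log_likelihood(mu_hat, sigma2_hat, xs)
# Log-likelihood at a small grid of perturbed parameter values.
nearby = [normal_log_likelihood(mu_hat + dm, sigma2_hat + ds, xs)
          for dm in (-0.1, 0.0, 0.1)
          for ds in (-0.5, 0.0, 0.5)
          if (dm, ds) != (0.0, 0.0)]
print(mu_hat, sigma2_hat, best > max(nearby))
```

Note that the maximum likelihood variance matches `statistics.pvariance` (population variance), not the n - 1 version returned by `statistics.variance`.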

Inference
o Likelihood Ratio Tests
  o Let $L_1^{\max}$ be the maximum value of the likelihood for a given full model
  o Let $L_0^{\max}$ be the maximum value of the likelihood for a restricted model
    o A restricted model is one where the values of some of the parameters are fixed
    o These fixed values may be null (set to zero) or some other value
  o Then
    $$\chi^2_k = -2 \log\left(\frac{L_0^{\max}}{L_1^{\max}}\right)$$
    (that is, −2 times the difference between the two log-likelihoods)
  o has a chi-square distribution with k degrees of freedom
    o k is the difference between the number of parameters in the full vs the restricted model
  o (The Wald test is a closely related test that is asymptotically equivalent to the LR test)
  o Assumptions
    o Models must be nested
    o The test is asymptotic (n must be large)
      o This last assumption is widely abused
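The recipe above can be sketched for one simple nested pair. The full model is a normal with free mean and variance, and the restricted model fixes the mean at a hypothesized value mu0, so k = 1. The data are simulated and the function names are illustrative. The p-value uses the identity P(χ²₁ > x) = erfc(√(x/2)), which holds for one degree of freedom only.

```python
import math
import random

def max_normal_loglik(xs, mu=None):
    """Maximized normal log-likelihood; mu is fixed if given, else estimated."""
    n = len(xs)
    if mu is None:
        mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n  # ML variance for this mu
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def lr_test_mean(xs, mu0):
    """LR statistic and asymptotic p-value for H0: mean == mu0 (k = 1)."""
    stat = -2 * (max_normal_loglik(xs, mu0) - max_normal_loglik(xs))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square upper tail, 1 df
    return stat, p_value

random.seed(3)
xs = [random.gauss(0.5, 1.0) for _ in range(100)]  # simulated sample
stat, p_value = lr_test_mean(xs, 0.0)
print(stat, p_value)
```

Because the test is asymptotic, the p-value is only an approximation, and a poor one for small n.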

Inference
o Inferring Parameters of a Distribution via the Bootstrap
  o Efron (1981)
    o Sample with replacement from a sample
    o Compute estimate of a parameter from this bootstrap sample
    o Do this lots of times (say, 1000)
    o Histogram the bootstrap parameter estimates
    o Compute sample statistics on histogram
      o sample mean, sd
      o fractiles
      o confidence intervals
    o Or, smooth the histogram before computing statistics
  o Efron and others have proofs for why this works
  o Not as effective for skewed distributions
  o Not as effective for dependent observations
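The bootstrap steps above can be sketched in a few lines. This is a minimal illustration on simulated data (n = 25, 1000 resamples); the estimator, seed and helper name are assumptions rather than anything taken from the slides. It bootstraps the sample mean and compares the bootstrap standard deviation with the usual s / √n standard error.

```python
import math
import random
import statistics

def bootstrap_estimates(sample, estimator, n_boot=1000, rng=random):
    """Resample with replacement n_boot times and apply the estimator each time."""
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(n_boot)]

rng = random.Random(42)
sample = [rng.gauss(500.0, 200.0) for _ in range(25)]  # simulated sample

boot_means = bootstrap_estimates(sample, statistics.mean, n_boot=1000, rng=rng)

boot_se = statistics.stdev(boot_means)                      # sd of bootstrap estimates
clt_se = statistics.stdev(sample) / math.sqrt(len(sample))  # s / sqrt(n)

ordered = sorted(boot_means)
ci_95 = (ordered[25], ordered[974])  # approximate 95% percentile interval
print(boot_se, clt_se, ci_95)
```

Any statistic can be passed as `estimator` (median, trimmed mean, a regression coefficient), which is the main attraction when no standard-error formula is available.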

Inference
o Inferring Parameters of a Distribution via the Bootstrap

[Figure: sample ($n_{sample} = 25$, $\hat{\sigma} = 214.1$) and bootstrap distribution ($n_{boot} = 1000$, $\hat{\sigma} = 42.3$)]

  o Central limit theorem: $\sigma_\mu = \sigma / \sqrt{n}$
  o $214.1 / \sqrt{25} = 42.82$, pretty close!
Inference
o Inferring Parameters of a Distribution via the Bootstrap
  o 20 Bootstrap estimates of robust piecewise regression

[Figure: Females; Bone Alkaline Phosphatase (y-axis, 0 to 30) versus Age (x-axis, 0 to 90)]

Inference
o Confidence Intervals
  o An interval $I_\theta = [l(x), u(x)]$ such that $P(\theta \in I_\theta) = 1 - \alpha$
  o For a two-sided interval on a normal mean, we use $I_\mu = \left(\hat{\mu} - z_{1-\alpha/2}\,\hat{\sigma}/\sqrt{n},\; \hat{\mu} + z_{1-\alpha/2}\,\hat{\sigma}/\sqrt{n}\right)$
  o Computation based on likelihood $P(x|\theta)$, where θ is a fixed value
  o $P(\theta \in I_\theta)$ is based on a collection of intervals, not this one
    o Wrong to say, “There is a 95% chance θ lies in this interval”
    o It either does or doesn’t (the interval is a random variable, not θ)
    o Say instead, “There is a 95% chance that when I compute a confidence interval on a sample from this population, the true value of θ will fall within it”
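The "collection of intervals" reading can be illustrated by simulation. This sketch draws many samples from a population with a known mean and counts how often the two-sided normal interval (using the 1 − α/2 quantile) covers that mean. The true mean, sample size and number of experiments are arbitrary assumptions, and the sample standard deviation (n − 1 divisor) stands in for σ̂.

```python
import math
import random
import statistics

def normal_mean_ci(xs, alpha=0.05):
    """Two-sided normal-theory interval for the mean."""
    n = len(xs)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    mu_hat = statistics.mean(xs)
    half_width = z * statistics.stdev(xs) / math.sqrt(n)
    return mu_hat - half_width, mu_hat + half_width

def coverage(true_mu=0.0, sigma=1.0, n=50, n_experiments=2000, seed=0):
    """Fraction of repeated experiments whose interval contains true_mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        xs = [rng.gauss(true_mu, sigma) for _ in range(n)]
        lo, hi = normal_mean_ci(xs)
        hits += lo <= true_mu <= hi
    return hits / n_experiments

observed_coverage = coverage()
print(observed_coverage)
```

The printed proportion should land near 0.95. It describes the procedure over repeated samples and says nothing about any single interval.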

Inference
o Why confidence is not probability
  o Let $x_1, x_2 \sim U(\theta - 1, \theta + 1)$
  o There is a 25% chance that both $x_1$ and $x_2$ will lie below θ
  o There is a 25% chance that both $x_1$ and $x_2$ will lie above θ
  o Therefore, there is a 50% chance that θ will lie between them
  o Then $(y_1 = \min[x_1, x_2],\; y_2 = \max[x_1, x_2])$ is a 50% confidence interval
  o However, when $y_2 - y_1 > 1$, it MUST contain θ, even though $(y_1, y_2)$ is only a 50% confidence interval
  o In other words, confidence intervals are not betworthy
  o Hartigan proved this argument for other distributions (e.g., Normal)
  o Thanks to Jerry Dallal for distilling Hartigan’s argument
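The argument is easy to check by simulation. The sketch below (θ = 5, the seed and the number of trials are arbitrary assumptions) estimates the overall coverage of (y1, y2) and then looks only at the intervals wider than 1.

```python
import random

def simulate(theta=5.0, n_trials=100_000, seed=0):
    """Return overall coverage, number of wide intervals, and wide intervals covering theta."""
    rng = random.Random(seed)
    covered = wide = wide_and_covered = 0
    for _ in range(n_trials):
        x1 = rng.uniform(theta - 1, theta + 1)
        x2 = rng.uniform(theta - 1, theta + 1)
        y1, y2 = min(x1, x2), max(x1, x2)
        inside = y1 <= theta <= y2
        covered += inside
        if y2 - y1 > 1:          # interval wider than half the support
            wide += 1
            wide_and_covered += inside
    return covered / n_trials, wide, wide_and_covered

overall_coverage, wide, wide_and_covered = simulate()
print(overall_coverage, wide, wide_and_covered)
```

Overall coverage comes out near 50%, yet every interval wider than 1 contains θ. Someone who sees the observed width knows more than the nominal confidence level conveys.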

Inference
o Credible Intervals
  o An interval such that $P(l(\theta) \le \theta \le u(\theta)) = 1 - \alpha$
  o Computation based on posterior $P(\theta|x) \propto P(x|\theta)P(\theta)$
  o θ is fixed, but we are uncertain about its value, so we use a prior $P(\theta)$
  o $P(l(\theta) \le \theta \le u(\theta))$ is based on observed data
    o “Given our observed data, there is a 95% chance that the true value of θ falls within this credible interval”

Inference
o Hypothesis testing
  o Protecting against false positives
    o Construct Null Hypothesis H0 (usually, that a result is due to chance)
    o State rule for rejecting H0
    o Compute likelihood of observed result under H0
    o Draw a conclusion based on decision rule
Inference
o Hypothesis testing
  o Protecting against false positives: the first significance test
  o An Argument for Divine Providence, taken from the Constant Regularity observed in the Births of both Sexes. By Dr. John Arbuthnot, Physician in Ordinary to her Majesty, and Fellow of the College of Physicians and the Royal Society
    o There seems no more probable Cause to be assigned in Physics for this Equality of the Births, than that in our first Parents Seed there were at first formed an equal Number of both Sexes.
    o […] From hence it follows, that Polygamy is contrary to the Law of Nature and Justice, and to the Propagation of the Human Race; for where Males and Females are in equal number, if one Man take Twenty Wives, Nineteen Men must live in Celibacy, which is repugnant to the Design of Nature; nor is it probable that Twenty Women will be so well impregnated by one Man as by Twenty.

$$P(\text{exactly equal numbers of Males and Females}) = \binom{n}{n/2} \left(\frac{1}{2}\right)^{n}$$

Inference
o Hypothesis testing
  o The Lady Tasting Tea (Fisher)
    o The lady claimed she could tell whether milk or tea was first added to the cup
    o Fisher gave her 8 cups, 4 of each type, in random order
    o The first table below shows the lady’s responses
    o The second table shows all possible tables, given the margins of 4 cups
    o There is only a 1 in 70 chance that the lady could have guessed all 8 correctly

                        Answer
                        Milk First   Tea First   Total
  Truth   Milk First        4            0         4
          Tea First         0            4         4
          Total             4            4         8

  All possible tables (rows: truth Milk First, Tea First; columns: answer Milk First, Tea First)

  Table          0 4     1 3     2 2     3 1     4 0     Total
                 4 0     3 1     2 2     1 3     0 4
  Count           1      16      36      16       1       70
  Probability   .014    .229    .514    .229    .014       1
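The counts and probabilities in the table follow from the hypergeometric distribution, and a few lines reproduce them. The function name is an illustrative choice.

```python
from math import comb

def tea_table_probability(k, cups_per_type=4):
    """P(exactly k milk-first cups are labelled correctly) under pure guessing."""
    n = cups_per_type
    # Choose k of the n milk-first cups and n - k of the n tea-first cups,
    # out of all ways of labelling n of the 2n cups as milk-first.
    return comb(n, k) * comb(n, n - k) / comb(2 * n, n)

counts = [comb(4, k) * comb(4, 4 - k) for k in range(5)]
probs = [tea_table_probability(k) for k in range(5)]
p_all_correct = probs[4]

print(counts)                         # [1, 16, 36, 16, 1]
print([round(p, 3) for p in probs])   # [0.014, 0.229, 0.514, 0.229, 0.014]
```

Getting all 8 cups right corresponds to the single table with k = 4, hence 1 in 70.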

Inference
o Hypothesis testing
  o one-tailed and two-tailed tests
  o Tukey once pointed out that the two-sided null hypothesis that a parameter is zero is really saying that we don't know its sign.

[Figure: three panels showing the density f(x) centered at µ0 with shaded "Reject H0" regions. H_A: µ < µ0 (lower tail, area α); H_A: µ > µ0 (upper tail, area α); H_A: µ ≠ µ0 (both tails, area α/2 each)]

A one-sided test
Here is a graph of the results of an experiment by Smeesters & Liu
The bars show the mean number of correct answers
The red lines (my addition) show the standard deviations

[Figure: Smeesters & Liu (2011) JEXP. Thanks to Uri Simonsohn and Richard Gill]

A one-sided test
o Contrast black vs. white bars over each level
o Compute an ANOVA on black vs. white bars
o If F value is very small, be very suspicious
  o Because between-groups variation is too small relative to within-groups variation
  o The lower tail of the F distribution yields our p value
o For the Smeesters and Liu article, F was so small that it revealed fraud
o Smeesters was forced to resign from the university

Inference
o Hypothesis testing
  o Neyman-Pearson procedure (Jerzy Neyman and Egon Pearson)
    o Construct Null Hypothesis H0 (ordinarily, that a sample result is due to chance)
    o Construct Alternate Hypothesis HA (ordinarily, that a sample result is not H0)
    o State criterion for rejecting H0
    o Compute test statistic
    o Make a decision based on test statistic
  o Like Fisher’s method, this is a falsification procedure
  o But it allows us to determine the power of the test

Inference
o Hypothesis testing
  o Type I and Type II errors (power analysis)

                            Conclusion
                            Accept* H0                          Reject H0
  Null Hypothesis True      True Negative                       Type I error (α), False Positive
  Null Hypothesis False     Type II error (β), False Negative   True Positive

  *Fail to reject

[Figure: sampling distributions under H0 and HA, divided by the critical value]
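Power (1 − β) can be computed directly in the simplest setting. The sketch below assumes a one-sided z test with known σ, which is a simplification chosen for illustration and not an example from the slides. The parameter names are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z test of H0: mu = mu0 against mu = mu0 + effect."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)   # reject H0 when z exceeds this
    shift = effect * math.sqrt(n) / sigma        # mean of z under the alternative
    beta = std_normal.cdf(critical_z - shift)    # Type II error rate
    return 1 - beta

for n in (10, 25, 50, 100):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

With a zero effect the function returns α, the Type I error rate. Power rises toward 1 as n or the effect size grows, so a non-significant result from a small sample says little.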

Inference
o Hypothesis testing
  o Multiple Tests (m tests)
    o Per Comparison Error Rate (PCER)
      o Uncorrected single tests
        $$P(FD_k > 0) \le \alpha \quad \forall \; 1 \le k \le m$$
        $$\alpha_k^* = \alpha$$
    o Family-Wise Error Rate (FWER)
      o e.g., Bonferroni
        $$P(FD > 0) \le \alpha$$
        $$P\left(\bigcup_{k=1}^{m} E_k\right) \le \sum_{k=1}^{m} P(E_k)$$
        $$\alpha_k^* = \alpha / m$$
    o False Discovery Rate (FDR)
      o Benjamini-Hochberg
        $$E\left[\frac{FD}{D}\right] \le \alpha$$
        $$\alpha_k^* = \frac{k}{m}\,\alpha$$

              Retain H0             Reject H0
  H0 True     True Nondiscovery     False Discovery
  H0 False    False Nondiscovery    True Discovery
Inference
o Hypothesis testing
  o Multiple Tests
    o False Discovery Rate (FDR)
      o 4 false discoveries out of 10 rejected null hypotheses is a more serious error than 20 false discoveries out of 100 rejected null hypotheses
      o Assumption is that tests are independent
        o Although this can be relaxed somewhat
      o Doesn’t depend on distribution, only p values from tests

Inference
o Hypothesis testing
  o Multiple Tests
    o FDR Plot
      o Sort the p-values from smallest to largest, so that k is the rank of the k-th smallest
      o Plot the ordered p-values on the y-axis versus k/m on the x-axis
      o Superimpose a line that passes through the origin and has slope α
      o Find the largest p-value that falls on or below this line; it and every smaller p-value correspond to significant results

[Figure: FDR plot, m = 25, α = .05; ordered p versus k/m, with the Unadjusted and Bonferroni thresholds shown for comparison. Original graph by Jack Weiss, UNC]
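The plot's decision rule is the Benjamini-Hochberg step-up procedure: rank the p-values from smallest to largest, find the largest rank k with p(k) ≤ αk/m, and reject everything up to that rank. A minimal sketch follows, with Bonferroni alongside for comparison. The p-values are made up for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_k when p_k <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Reject the hypotheses with the largest_k smallest p-values (step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    largest_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:   # on or below the line with slope alpha
            largest_k = rank
    reject = [False] * m
    for i in order[:largest_k]:
        reject[i] = True
    return reject

# Made-up p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

n_unadjusted = sum(p <= 0.05 for p in p_values)
n_bonferroni = sum(bonferroni_reject(p_values))
n_bh = sum(benjamini_hochberg_reject(p_values))
print(n_unadjusted, n_bonferroni, n_bh)   # 5 1 2
```

On these numbers the FDR rule sits between the uncorrected tests (5 rejections) and Bonferroni (1 rejection), which is the usual pattern.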

Inference
o Hypothesis testing
  o Covert multiple tests
    o Texas sharpshooter fallacy
      o A Texan fires some gunshots at the side of a barn
      o He paints a target centered on the biggest cluster of hits
      o He then claims to be a sharpshooter
    o M. Feychting and M. Ahlbom (1992). Magnetic fields and cancer in children residing near Swedish high-voltage power lines. American Journal of Epidemiology, 138, 467-481.
      • Surveyed everyone living within 300 meters of high-voltage power lines from 1960 through 1985.
      • Looked for statistically significant increases in relative risk (against baseline) of over 800 illnesses.
      • Found that there was a significant relative risk of childhood leukemia for those living near power lines.
      • The number of illnesses considered was so large, however, that there was high probability that the increased risk of at least one illness would appear statistically significant by chance alone.
      • Subsequent studies failed to show any links between power lines and childhood leukemia.
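A back-of-the-envelope calculation shows why 800 comparisons almost guarantee a "finding". The sketch assumes, for illustration only, that the tests are independent, that every null hypothesis is true, and that each test uses α = .05. The study itself need not have met these assumptions.

```python
def prob_at_least_one_false_positive(m, alpha=0.05):
    """P(at least one of m independent true-null tests is significant)."""
    return 1 - (1 - alpha) ** m

def expected_false_positives(m, alpha=0.05):
    """Expected number of significant results when all m nulls are true."""
    return m * alpha

for m in (1, 10, 100, 800):
    print(m, prob_at_least_one_false_positive(m), expected_false_positives(m))
```

Under these assumptions, 800 tests are expected to yield about 40 spurious "significant" illnesses, and the chance of seeing none is negligible.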

Inference
o Hypothesis testing
  o Highly significant p-value doesn’t mean effect is large or strong or influential
  o Statistical significance does not imply practical significance
  o Practical significance (importance) depends on meaning, not chance

Inference
o Hypothesis testing
  o H0 doesn’t mean zero value of the parameter
    o H0 can involve any value
    o Zero value used for “nil hypothesis” instead of “null hypothesis”
    o Nil hypothesis is absurd, of course

Inference
o Hypothesis testing
  o Failure to reject H0 doesn’t prove it
  o Can increase n enough to make almost any H0 false
    o This is a trick used in ESP research (see Duke studies)
  o WRONG: p = .05 means “the probability of the null hypothesis not being true is 95%”
  o WRONG: p = .06 means “the average effect size (d = 0.04) is not different from 0.”

Inference
o Hypothesis testing
  o Confidence intervals are not a cure for NHST problems
    o They come out of same calculations for NHST, although they convey more information

Inference
o Hypothesis testing
  o p = .05 is not sacred
    o Fisher thought of p values as quantifying evidence against an hypothesis
    o He picked .05 for a cutoff for most practical problems

Inference
o Hypothesis testing
  o Falsification (Popper) is wrong
    o We build evidence for a theory
    o There is no such thing as a critical experiment (see Kuhn)
  o But a theory that cannot be falsified is suspect
    o Try to disprove Freud’s Oedipus Complex
    o Freud wouldn’t accept any evidence to the contrary

Inference
Hypothesis testing
P-values are not the main problem
• fraud
• journal selection bias in favor of “significant” results and against “non-significant” replication (the winner’s curse)
• publishing p-values without supporting information (effect sizes, confidence intervals, …)
• failure to control false discovery rate in multiple tests
• small samples (low power, Tversky and Kahneman’s “Law of Small Numbers”)
• large samples (“more than 100,000 women diagnosed from 1988 to 2011 with DCIS”; wow! that must make this study trustworthy)
• convenience samples (“we studied depression by giving a questionnaire to sophomore psychology students”)
• experimenter bias (failure to use double blind and other controls when available)
• cherry picking (uncontrolled model selection, stepwise regression, …)
• promiscuous mining: for one of the most egregious examples, see
  http://googlecloudplatform.blogspot.com/2014/08/correlating-patterns-of-world-history-with-bigquery.html

Inference
Hypothesis testing
P-values are not the main problem
• pollution from money (covert support by tobacco, soft-drink, chemical, energy, food and drug companies, …)
• overly complex statistical models in place of simple alternatives (LISREL, HLM, BUGS, …)
• misuse of statistical concepts in interpreting results
• relying on statistical bloggers or Wikipedia for advice (the madness of crowds)
• 15 minutes of fame (h-index, citations, awards, keynotes, TED talks, and other factors driving excess publications, premature media publicity, and inattention to detail)
• inflation in tenure requirements (10 publications a year? Are you kidding?)
• pressure to get grants
• ignorant media reporters (failure to understand the basics of causation, probability and inference: coffee causes cancer, cancer causes coffee)
• plagiarism (copy someone else’s study without understanding the statistics and data analysis)

Inference
o The Bayesian objection
  o Likelihood principle (Leonard Jimmie Savage)
    o After x is observed, all relevant experimental information is contained in the likelihood function for the observed x. Furthermore, two likelihood functions contain the same information about θ if they are proportional to each other.
  o Suppose X is the number of heads in 12 flips of a fair coin and Y is the number of flips needed to get 3 heads.
  o A frequentist tests the result that X = 3 against a Binomial, with resulting p = .073.
  o But she tests the result that Y = 12 against a Negative Binomial, with p = .0327.
  o The data are the same in both circumstances, but the experiments differ
  o The difference between observing X = 3 and observing Y = 12 lies not in the actual data, but merely in the design of the experiment. In the first case, one has decided in advance to try 12 flips. In the second, one has decided to keep flipping until 3 successes are observed. Bayesians say the inference about θ should be the same because the two likelihoods are proportional to each other.
    $$L(\theta) \propto \theta^3 (1 - \theta)^9$$
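Both p-values can be reproduced exactly. The sketch assumes one-sided tests in the direction of "few heads" with a fair coin under H0, which is what the quoted values correspond to. The function names are illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_p_value(heads=3, flips=12):
    """P(X <= heads) for X ~ Binomial(flips, 1/2): the number of flips is fixed."""
    return sum(Fraction(comb(flips, k), 2 ** flips) for k in range(heads + 1))

def negative_binomial_p_value(heads=3, flips=12):
    """P(Y >= flips), where Y is the number of flips needed to reach `heads` heads.
    Y >= flips means at most heads - 1 heads in the first flips - 1 tosses."""
    return sum(Fraction(comb(flips - 1, k), 2 ** (flips - 1)) for k in range(heads))

def likelihood_ratio(theta):
    """Binomial likelihood over negative binomial likelihood at theta."""
    binomial = comb(12, 3) * theta ** 3 * (1 - theta) ** 9
    negative_binomial = comb(11, 2) * theta ** 3 * (1 - theta) ** 9
    return binomial / negative_binomial

p_binomial = binomial_p_value()
p_negative_binomial = negative_binomial_p_value()
print(float(p_binomial), float(p_negative_binomial))   # about 0.0730 and 0.0327
print(likelihood_ratio(0.2), likelihood_ratio(0.7))    # same constant for every theta
```

The likelihood ratio is the constant 220/55 = 4 whatever θ is, so the two likelihoods are proportional even though the frequentist p-values fall on opposite sides of .05.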

Inference
o Hypothesis testing
  o Bayesian inference
    o Identify prior distribution P(H) on hypothesis parameters
    o Specify parameter values of prior distribution
    o Compute likelihood P(E|H) on evidence given hypothesis
    o Compute marginal likelihood P(E), which averages over parameters of interest
    o Compute posterior distribution through Bayes’ theorem
      $$P(H|E) = \frac{P(E|H)\,P(H)}{P(E)}$$
    o Graphically display the posterior distribution
    o Or, compute credible intervals and other statistics characterizing posterior
    o The posterior can be used as a prior with new data (Bayesian updating)
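These steps can be sketched with a grid approximation for a coin's heads probability θ. The flat prior, the grid size, and the data (3 heads in 12 flips, then a hypothetical further 5 heads in 12 flips) are illustrative assumptions. A conjugate Beta prior would give the same posterior in closed form.

```python
def grid_posterior(heads, flips, prior, grid):
    """Posterior over a grid of theta values via Bayes' theorem."""
    likelihood = [t ** heads * (1 - t) ** (flips - heads) for t in grid]
    unnormalized = [lik * p for lik, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)   # marginal likelihood P(E), up to a constant
    return [u / evidence for u in unnormalized]

def credible_interval(grid, posterior, mass=0.95):
    """Equal-tailed interval read off the cumulative posterior."""
    tail = (1 - mass) / 2
    cumulative = 0.0
    lower, upper, lower_found = grid[0], grid[-1], False
    for t, p in zip(grid, posterior):
        cumulative += p
        if not lower_found and cumulative >= tail:
            lower, lower_found = t, True
        if cumulative >= 1 - tail:
            upper = t
            break
    return lower, upper

grid = [(i + 0.5) / 1000 for i in range(1000)]   # midpoints of 1000 bins on (0, 1)
flat_prior = [1 / len(grid)] * len(grid)

posterior = grid_posterior(3, 12, flat_prior, grid)
posterior_mean = sum(t * p for t, p in zip(grid, posterior))
interval = credible_interval(grid, posterior)

# Bayesian updating: yesterday's posterior is today's prior.
updated = grid_posterior(5, 12, posterior, grid)
direct = grid_posterior(8, 24, flat_prior, grid)
print(posterior_mean, interval)
```

Updating in two steps gives the same posterior as analyzing all 24 flips at once, which is the sense in which the posterior can serve as the prior for new data.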

Inference
o Hypothesis testing
  o Bayesian visualization

Andrewgelman.com