Statistical Methods
Lecture 1
Census: collection of data from every member of the population; usually infeasible because the population is too large.
Sample: sub-collection from the population.
Different sample → different data → different conclusions about population.
A sample should be representative and unbiased.
Sampling methods:
o Voluntary response sample: subjects decide themselves to be included in sample.
o Random sample: each member of population has equal probability of being selected.
o Simple random sample: each sample of size n has equal probability of being chosen.
o Systematic sampling: after starting point, select every k-th member.
o Stratified sampling: divide population into subgroups such that subjects within groups have
same characteristics, then draw a (simple) random sample from each group.
o Cluster sampling: divide population into clusters, then randomly select some of these
clusters.
o Convenience sampling: use results that are easily available.
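The sampling schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 numbered members, the two strata, the clusters of 20 and all function names are hypothetical, and drawing the starting point of the systematic sample at random is an assumption (the notes only say "after starting point").

```python
import random

def simple_random_sample(population, n, rng):
    # Every subset of size n is equally likely to be drawn.
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    # Assumed: random starting point among the first k members, then every k-th member.
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    # strata maps a stratum label to the list of its members;
    # a simple random sample is drawn from each stratum.
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Select whole clusters at random and keep all of their members.
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
population = list(range(1, 101))  # hypothetical population of 100 numbered members

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
strata = {"low": population[:50], "high": population[50:]}
stratified = stratified_sample(strata, 5, rng)
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
clustered = cluster_sample(clusters, 2, rng)
```

Note the difference between the last two: stratified sampling takes a few members from every group, cluster sampling takes every member from a few groups.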
Types of data
o Qualitative (categorical): names or labels that do not represent counts/measurements
Nominal: names, labels, categories (no ordering)
Gender, eye color
Ordinal: categories with ordering, but no (meaningful) differences
U.S. grades, opinions
o Quantitative (numerical): numbers represent counts/measurements
Interval: ordering possible and differences between numbers are meaningful. No
natural zero starting point
Year of birth, temperatures
Ratio: ordering possible, differences are meaningful and there is a natural starting
point.
body length, marathon times
Summarizing data:
o Graphical: tables, graphs, other figures
o Descriptive:
Qualitative: describe shape, location and dispersion/variation
Quantitative: numerical summaries of location and variation
Graphical summaries:
o Frequency distribution (table)
Count occurrences of category
o Bar chart
Spaces in between the categories
o Pareto bar chart
Bar chart, but categories are ordered w.r.t. frequency.
Data of nominal measurement level is required
o Pie chart
Pie piece sizes determined by relative frequency of category
o Histogram
Bar areas are proportional to frequency in respective interval. No white space.
Only used for quantitative data
o Time series
Visualization of time-varying quantity
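As a small illustration of the frequency table, the relative frequencies used for a pie chart, and the ordering used in a Pareto chart, here is a sketch with made-up eye-color data (the data and variable names are hypothetical):

```python
from collections import Counter

# Hypothetical nominal data
eye_colors = ["brown", "blue", "brown", "green", "blue", "brown", "brown", "blue"]

# Frequency distribution: count occurrences of each category
counts = Counter(eye_colors)

# Relative frequencies, which determine pie piece sizes
n = len(eye_colors)
relative = {category: count / n for category, count in counts.items()}

# Pareto ordering: categories sorted from most to least frequent
pareto_order = [category for category, _ in counts.most_common()]
```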
Qualitative description:
o Shape:
Make smooth approximation of histogram
Shape of smooth curve relates data distribution to familiar distributions.
Symmetrical
Left- or Right-skewed
Uniform
o Location:
Position on x-axis
Same shape, different location
o Dispersion (spread/variation):
Measure of variation within dataset
Same shape and location; different dispersion
Small or large dispersion
o Average (mean)
Every data value is used
Strongly affected by extreme values
Sample mean denoted by x̄
Population mean denoted by μ
o Median
Middle value after sorting
Not much affected by extreme values
o Mode
Value with highest frequency
Bimodal (2), multimodal (>2)
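To see how differently the average and the median react to an extreme value, consider this short sketch using Python's standard `statistics` module. The data values are made up for illustration.

```python
import statistics

data = [2, 3, 3, 4, 5]         # hypothetical sample
with_outlier = data + [100]    # same sample plus one extreme value

mean_before = statistics.mean(data)              # 17 / 5 = 3.4
median_before = statistics.median(data)          # middle value after sorting: 3
mean_after = statistics.mean(with_outlier)       # 117 / 6 = 19.5
median_after = statistics.median(with_outlier)   # average of 3 and 4: 3.5

# A bimodal sample: two values share the highest frequency
modes = statistics.multimode([1, 1, 2, 3, 3])    # [1, 3]
```

The single extreme value moves the mean from 3.4 to 19.5, while the median only moves from 3 to 3.5.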
5 number summary:
1. Minimum
2. Q1
3. Median, Q2
4. Q3
5. Maximum
Interquartile range = Q3 – Q1
Boxplot: provides information about the distribution
top value: maximum
top of box: Q3
thick line: median
bottom of box: Q1
lowest value: minimum
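The 5 number summary and the interquartile range can be computed as in the sketch below. Quartile conventions differ between textbooks and software, so the `method="inclusive"` choice is an assumption and other conventions may give slightly different Q1 and Q3; the function name and data are hypothetical.

```python
import statistics

def five_number_summary(data):
    # statistics.quantiles with n=4 returns the three quartiles Q1, Q2, Q3.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

data = [7, 1, 9, 3, 5, 2, 8, 4, 6]   # hypothetical sample
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1                         # interquartile range = Q3 - Q1
```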
Lecture 2
Probability experiment: production of (random) outcome.
dice roll, coin toss
Sample space Ω: set of all possible outcomes
Ω = { 1,2,3,4,5,6}
Event A, B, …: collection of outcomes
A = {even number thrown} = { 2,4,6}
Simple event: consists of 1 outcome
Probability measure: function P(.) assigning values between 0 and 1 to events
P(A) = P({2,4,6}) = ½
Interpretation of probabilities:
o P(A) = 0: occurrence of A is impossible
o P(A) = 1: occurrence of A is certain
o P(A) small (e.g. < 0.05): occurrence of A is unlikely
Relative frequency approach: over many trials, the relative frequency of A is almost equal to the real value of P(A).
Law of Large Numbers: suppose a procedure is repeated independently many times. The relative frequency of an event A then tends towards the true P(A).
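A quick simulation makes the relative frequency idea concrete. This sketch assumes a fair die and the event A = {2, 4, 6} from the example above; the function name, the seed and the trial counts are arbitrary choices.

```python
import random

def relative_frequency(event, n_trials, rng):
    # Fraction of simulated fair-die rolls that land in the event.
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) in event)
    return hits / n_trials

rng = random.Random(42)   # fixed seed for reproducibility
A = {2, 4, 6}             # even number thrown, P(A) = 1/2

# Relative frequency of A for an increasing number of trials
freqs = {n: relative_frequency(A, n, rng) for n in (10, 1000, 100000)}
```

For small n the relative frequency can be far from 1/2; for large n it is typically very close.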
Counting principle:
Suppose 2 probability experiments are performed:
experiment a has x possible outcomes;
experiment b has y possible outcomes.
Combined: x · y possible outcomes.
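The counting principle can be checked by enumerating the combined outcomes directly. The sketch assumes a coin toss (2 outcomes) followed by a die roll (6 outcomes) as the two experiments.

```python
from itertools import product

coin = ["H", "T"]            # first experiment: 2 possible outcomes
die = [1, 2, 3, 4, 5, 6]     # second experiment: 6 possible outcomes

# Every combined outcome is a pair (coin result, die result)
combined = list(product(coin, die))
n_combined = len(combined)   # equals 2 * 6
```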
Addition rule:
o P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Notation:
A ∪ B = A or B: union, set of outcomes which are in A or B (or both)
A ∩ B = A and B: intersection, set of outcomes which are both in A and B
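The addition rule can be verified on the fair-die sample space with exact fractions. The events A and B and the helper `prob` are illustrative choices, and equally likely outcomes are assumed.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # sample space of a fair die

def prob(event):
    # Equally likely outcomes: P(event) = number of outcomes in event / size of omega
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # even number thrown
B = {4, 5, 6}   # number larger than 3

lhs = prob(A | B)                        # P(A ∪ B), with A ∪ B = {2, 4, 5, 6}
rhs = prob(A) + prob(B) - prob(A & B)    # P(A) + P(B) - P(A ∩ B)
```

Subtracting P(A ∩ B) corrects for the outcomes 4 and 6, which would otherwise be counted twice.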
Multiplication rule
o P(B|A): conditional probability that B occurs given that A has occurred.
Conditional probability:
If P(A) > 0, then: P(B|A) = P(A ∩ B) / P(A)
BUT in general P(B|A) ≠ P(A|B)
o Multiplication rule:
P(A ∩ B) = P(A) · P(B|A).
o Independence:
Two events A and B are independent if P(A ∩ B) = P(A) · P(B)
P(B) = P(B|A) when A and B are independent
Independence ≠ disjointness
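The following sketch works through conditional probability, the multiplication rule and independence on a fair die, and shows that disjoint events need not be independent. The events A, B, C and the helper names are hypothetical choices for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # fair die, equally likely outcomes assumed

def prob(event):
    return Fraction(len(event & omega), len(omega))

def cond_prob(b, a):
    # P(B|A) = P(A ∩ B) / P(A), only defined when P(A) > 0
    if prob(a) == 0:
        raise ValueError("P(A) must be positive")
    return prob(a & b) / prob(a)

def independent(a, b):
    # Definition: P(A ∩ B) = P(A) · P(B)
    return prob(a & b) == prob(a) * prob(b)

A = {2, 4, 6}   # even number
B = {1, 2}      # at most 2
C = {1, 3, 5}   # odd number, disjoint from A

p_b_given_a = cond_prob(B, A)   # (1/6) / (1/2) = 1/3
p_a_given_b = cond_prob(A, B)   # (1/6) / (1/3) = 1/2
```

Here A and B are independent, so P(B|A) = P(B) = 1/3, yet P(A|B) = 1/2 differs from P(B|A). A and C are disjoint but not independent: P(A ∩ C) = 0 while P(A) · P(C) = 1/4.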
Lecture 3
Addition rule for disjoint events: if A ∩ B = ∅, then P(A ∩ B) = 0, so P(A ∪ B) = P(A) + P(B).
o Partition:
Events A1, …, Am are called a partition if
They are pairwise disjoint: Ai ∩ Aj = ∅, if i ≠ j;
Their union is the entire sample space: A1 ∪ A2 ∪ … ∪ Am = Ω.
o Bayes’ Theorem
Let A1, …, Am be a partition and B an event with P(B) > 0, then for r ∈ {1, …, m}:
P(Ar|B) = P(B|Ar) · P(Ar) / [P(B|A1) · P(A1) + … + P(B|Am) · P(Am)]
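Bayes' theorem can be applied mechanically once the probabilities P(Ai) of the partition and the conditional probabilities P(B|Ai) are known. The sketch below is a minimal illustration: the function name, the zero-based index r and the numbers in the example are all hypothetical.

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods, r):
    # priors[i]      = P(Ai) for a partition A1, ..., Am (must sum to 1)
    # likelihoods[i] = P(B|Ai)
    # Returns P(Ar|B); note that r is a zero-based list index here.
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(B ∩ Ai) = P(Ai) · P(B|Ai)
    total = sum(joint)                                     # P(B), summing over the partition
    return joint[r] / total

# Hypothetical numbers: a partition with two events
priors = [Fraction(1, 100), Fraction(99, 100)]
likelihoods = [Fraction(95, 100), Fraction(5, 100)]

posterior = bayes_posterior(priors, likelihoods, 0)   # 19/118, about 0.161
```

Even though P(B|A1) is large, P(A1|B) stays small because P(A1) is small, which illustrates that P(B|A) and P(A|B) can differ a lot.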
A random variable is a variable that assigns a numerical value to each outcome of a probability
experiment.
Notation: X, Y, …
X: random variable; x: a value of the random variable
o Probability distribution:
Determines probabilities of values of a random variable
Given by table, formula or graph
Table: one column with all values x of X and one column with the probabilities P(X = x)
o Expected value (expectation/mean)
The expected value of a discrete random variable X with possible values x1, …, xk:
Weighted average of all possible values of X:
E(X) = x1 · P(X = x1) + … + xk · P(X = xk)
o Variance
The variance of a discrete random variable X with values x1, …, xk and µ = E(X):
Var(X) = (x1 − µ)² · P(X = x1) + … + (xk − µ)² · P(X = xk)
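As a worked illustration, the expected value and variance of a fair die roll can be computed from its probability table, using the usual definitions E(X) = Σ x · P(X = x) and Var(X) = Σ (x − µ)² · P(X = x). The fair die and the function names are illustrative assumptions.

```python
from fractions import Fraction

# Probability table of X = result of a fair die roll: value x -> P(X = x)
distribution = {x: Fraction(1, 6) for x in range(1, 7)}

def expected_value(dist):
    # Weighted average of the possible values, weights are the probabilities
    return sum(x * p for x, p in dist.items())

def variance(dist):
    # Weighted average of squared deviations from the expected value
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

mu = expected_value(distribution)   # 7/2
var = variance(distribution)        # 35/12
```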
Law of Large Numbers:
Let X1, …, Xn be n independent versions of random variable X; let µ = E(X).
Their mean (1/n)(X1 + … + Xn) tends to approach µ as n grows.
The LLN of Lecture 2 is the special case with Xi = 1 if A occurs in trial i and Xi = 0 if it does not, so that µ = P(A) and the mean is the relative frequency of A.
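The link between the two versions of the Law of Large Numbers can be simulated. In this sketch a fair die is assumed, so E(X) = 3.5, and A = {2, 4, 6}; the seed and the number of trials are arbitrary.

```python
import random

rng = random.Random(1)   # fixed seed for reproducibility
n = 100_000
rolls = [rng.randint(1, 6) for _ in range(n)]   # independent versions X1, ..., Xn

# Mean of the rolls tends to E(X) = 3.5
mean_of_rolls = sum(rolls) / n

# Indicator variables: 1 if A occurs in trial i, 0 otherwise
A = {2, 4, 6}
indicators = [1 if x in A else 0 for x in rolls]
mean_of_indicators = sum(indicators) / n

# The mean of the indicators is exactly the relative frequency of A
relative_frequency = sum(1 for x in rolls if x in A) / n
```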
Lecture 4
Probability density function
A curve p(x) such that
o p(x) ≥ 0 for all x
o total area under curve = 1
P(X ∈ [a, b]) = area under the curve p(x) between a and b
Normal distribution
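Random variable X has a normal distribution N(µ, σ²) if its density p(x) is the continuous, bell-shaped curve that is symmetric around the mean µ, with spread set by the standard deviation σ:
p(x) = 1 / (σ √(2π)) · exp(−(x − µ)² / (2σ²))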
A model distribution
Probability distribution for describing the unknown true population distribution
Examples (continuous variables): normal, uniform, t, χ², exponential.
Example: The variable ‘Date of birth - Due date’ is a random variable having a normal distribution
with mean 0 and standard deviation 10
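Under this model, probabilities are areas under the normal density and can be computed from the cumulative distribution function, since P(X ∈ [a, b]) = P(X ≤ b) − P(X ≤ a). The sketch uses `NormalDist` from Python's standard library; the unit of X is assumed to be days, and the two questions asked are made up for illustration.

```python
from statistics import NormalDist

# Model from the example: mean 0, standard deviation 10 (unit assumed to be days)
X = NormalDist(mu=0, sigma=10)

def prob_between(dist, a, b):
    # Area under the density between a and b
    return dist.cdf(b) - dist.cdf(a)

# Probability of a birth within 10 days of the due date (one standard deviation)
within_10 = prob_between(X, -10, 10)          # about 0.683

# Probability of a birth more than 14 days after the due date
more_than_14_late = 1 - X.cdf(14)             # about 0.081
```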
Assessing normality
Consider a dataset x1, …, xn. When is the model distribution N(µ, σ²) reasonable?
o Shape of histogram
Bell-shaped curve
Strong deviation from bell shape? Then N(µ, σ²) unlikely
o Normal QQ plot
Approximately straight line
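What is a QQ plot? A plot of the sample quantiles (the sorted data values) against the corresponding theoretical quantiles of a model distribution.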
There are QQ plots other than “normal QQ plots”: use theoretical quantiles of other continuous
distributions.
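The points of a normal QQ plot can be computed without any plotting library, as in the sketch below. The plotting positions (i − 0.5)/n are one common convention and an assumption here (software packages differ), the correlation coefficient is used only as a rough numerical stand-in for "approximately straight line", and both datasets are artificial.

```python
import math
from statistics import NormalDist, mean

def normal_qq_points(data):
    # Pair each sorted data value with the standard normal quantile
    # at the assumed plotting position (i - 0.5) / n.
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def pearson_r(points):
    # Correlation between x and y coordinates; close to 1 means nearly straight.
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

n = 50
probs = [(i - 0.5) / n for i in range(1, n + 1)]

# Artificial data lying exactly on normal quantiles, shifted and scaled
normal_like = [170 + 8 * NormalDist().inv_cdf(p) for p in probs]
# Artificial right-skewed data lying on exponential quantiles
skewed = [-math.log(1 - p) for p in probs]

r_normal = pearson_r(normal_qq_points(normal_like))   # essentially 1
r_skewed = pearson_r(normal_qq_points(skewed))        # clearly smaller
```

The shifted and scaled normal data still give a straight line, because only the intercept and slope of the line change; the skewed data bend away from it.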
Sample size
Small n: more variation
histogram / QQ plot could deviate (from bell shape / straight line), even if N(µ, σ²) is true.
Large n: histogram and QQ plot are more reliable
Random variables X and Y have probability distributions in the same location-scale family if and only
if the QQ plot of their quantiles shows a straight line y = a + bx