Statistical Methods
Lecture 1
Census: collection of data from every member of the population; the population is usually too large for this.
Sample: sub-collection from the population.
Different sample → different data → different conclusions about population.
A sample should be representative and unbiased.
Sampling methods:
o Voluntary response sample: subjects decide themselves to be included in sample.
o Random sample: each member of population has equal probability of being selected.
o Simple random sample: each sample of size n has equal probability of being chosen.
o Systematic sampling: after starting point, select every k-th member.
o Stratified sampling: divide population into subgroups such that subjects within groups have
same characteristics, then draw a (simple) random sample from each group.
o Cluster sampling: divide population into clusters, then randomly select some of these
clusters.
o Convenience sampling: easily available results.
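The sampling methods above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered members, the two strata and the cluster size of 10 are all made-up assumptions, not part of the lecture.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: members numbered 1..100.
population = list(range(1, 101))

# Simple random sample: every subset of size 10 is equally likely.
simple = random.sample(population, 10)

# Systematic sample: random starting point, then every k-th member.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sample: split into subgroups, then draw a random sample from each.
strata = {"low": population[:50], "high": population[50:]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

# Cluster sample: split into clusters, then keep the members of
# a few randomly selected clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [x for c in random.sample(clusters, 2) for x in c]
```

Note the difference between the last two: stratified sampling takes some members from every subgroup, while cluster sampling takes whole clusters but only some of them.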
Types of data
o Qualitative (categorical): names or labels that do not represent counts/measurements
• Nominal: names, labels, categories (no ordering)
➔ Gender, eye color
• Ordinal: categories with ordering, but no (meaningful) differences
➔ U.S. grades, opinions
o Quantitative (numerical): numbers represent counts/measurements
• Interval: ordering possible and differences between numbers are meaningful. No
natural zero starting point
➔ Year of birth, temperatures
• Ratio: ordering possible, differences are meaningful and there is a natural starting
point.
➔ Body length, marathon times
Summarizing data:
o Graphical: tables, graphs, other figures
o Descriptive:
• Qualitative: describe shape, location and dispersion/variation
• Quantitative: numerical summaries of location and variation
Graphical summaries:
o Frequency distribution (table)
Count occurrences of category
o Bar chart
Spaces in between the categories
o Pareto bar chart
Bar chart, but categories are ordered w.r.t. frequency.
Data of nominal measurement level is required
o Pie chart
Pie piece sizes determined by relative frequency of category
o Histogram
Bar areas are proportional to frequency in respective interval. No white space.
Only used for quantitative data
o Time series
Visualization of time-varying quantity
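A frequency distribution, the relative frequencies behind a pie chart, and the category order of a Pareto chart can be computed as follows. A minimal sketch; the eye-colour data are invented for illustration.

```python
from collections import Counter

# Hypothetical nominal data (eye colours), made up for illustration.
data = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "grey"]

# Frequency distribution: count occurrences of each category.
counts = Counter(data)

# Relative frequencies: these determine the pie piece sizes.
n = len(data)
relative = {category: counts[category] / n for category in counts}

# Pareto order: categories sorted by decreasing frequency.
pareto_order = [category for category, _ in counts.most_common()]
```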
Qualitative description:
o Shape:
Make smooth approximation of histogram
Shape of smooth curve relates data distribution to familiar distributions.
• Symmetrical
• Left- or Right-skewed
• Uniform
o Location:
Position on x-axis
Same shape, different location
o Dispersion (spread/variation):
Measure of variation within dataset
Same shape and location; different dispersion
• Small or large dispersion
Quantitative description (measures of location):
o Mean (average)
Every data value is used
Strongly affected by extreme values
• Sample mean denoted by x̄
• Population mean denoted by μ
o Median
Middle value after sorting
Not much affected by extreme values
o Mode
Value with highest frequency
Bimodal (2), multimodal (>2)
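The effect of an extreme value on the mean and the median can be checked with Python's standard `statistics` module. The data values are hypothetical.

```python
import statistics

data = [2, 3, 3, 4, 5]        # hypothetical data
with_outlier = data + [100]   # same data plus one extreme value

mean_before = statistics.mean(data)              # 3.4
mean_after = statistics.mean(with_outlier)       # 19.5
median_before = statistics.median(data)          # 3
median_after = statistics.median(with_outlier)   # 3.5
mode = statistics.mode(data)                     # 3
```

One extreme value moves the mean from 3.4 to 19.5, while the median only shifts from 3 to 3.5.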
Lecture 2
Probability experiment: production of (random) outcome.
➔ dice roll, coin toss
Sample space Ω: set of all possible outcomes
➔ Ω = {1, 2, 3, 4, 5, 6}
Event A, B, …: collection of outcomes
➔ A = {even number thrown} = {2, 4, 6}
Simple event: consists of one outcome
Probability measure: function P(.) assigning values between 0 and 1 to events
➔ P(A) = P({2,4,6}) = ½
Interpretation of probabilities:
o P(A) = 0 → occurrence of A is impossible
o P(A) = 1 → occurrence of A is certain
o P(A) small (e.g. < 0.05) → occurrence of A is unlikely
Relative frequency: when a procedure is repeated many times, the relative frequency of an event A is almost equal to the true value of P(A).
➔ Law of large numbers: suppose a procedure is repeated independently many times. The relative frequency of an event A then tends towards the true P(A).
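The law of large numbers can be illustrated with a small simulation of the die example, where P(even number thrown) = ½. This is a sketch only; the function name and the numbers of rolls are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def relative_frequency_even(n_rolls):
    """Roll a fair die n_rolls times; return the fraction of even outcomes."""
    hits = sum(1 for _ in range(n_rolls) if random.randint(1, 6) % 2 == 0)
    return hits / n_rolls

# The relative frequency settles near 0.5 as the number of rolls grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_even(n))
```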
Addition rule:
o P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Notation:
A ∪ B = A or B: union, set of outcomes which are in A or B (or both)
A ∩ B = A and B: intersection, set of outcomes which are both in A and B
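For a fair die, where all outcomes are equally likely, the addition rule can be verified with Python sets. The event B below is a hypothetical second event chosen for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # sample space of one fair die roll

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # even number thrown
B = {4, 5, 6}   # hypothetical second event: number greater than 3

lhs = prob(A | B)                       # P(A ∪ B)
rhs = prob(A) + prob(B) - prob(A & B)   # P(A) + P(B) − P(A ∩ B)
```

Both sides equal 2/3. Subtracting P(A ∩ B) corrects for the outcomes 4 and 6, which would otherwise be counted twice.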
Multiplication rule
o P(B|A): conditional probability that B occurs given that A has occurred.
Conditional probability:
If P(A) > 0, then: P(B|A) = P(A ∩ B) / P(A)
BUT in general P(B|A) ≠ P(A|B)
o Multiplication rule:
P(A ∩ B) = P(A) · P(B|A).
o Independence:
Two events A and B are independent if P(A ∩ B) = P(A) · P(B)
➔ P(B) = P(B|A) when A and B are independent
➔ Independence ≠ disjointness
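Conditional probability, the multiplication rule and independence can be checked on the die example. The events B and D are hypothetical choices made for this sketch.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # sample space of one fair die roll

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(b, a):
    """P(B|A) = P(A ∩ B) / P(A); only defined when P(A) > 0."""
    return prob(a & b) / prob(a)

def independent(a, b):
    """True if P(A ∩ B) = P(A) · P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = {2, 4, 6}   # even number thrown
B = {5, 6}      # hypothetical event: number greater than 4
D = {1, 3, 5}   # hypothetical event: odd number thrown (disjoint from A)

p_b_given_a = cond_prob(B, A)   # 1/3, equal to P(B)
p_a_given_b = cond_prob(A, B)   # 1/2, not equal to P(B|A)
```

Here A and B are independent although they share the outcome 6, whereas the disjoint events A and D are not independent: if A occurs, D cannot.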
Two different sampling methods:
1. Sampling with replacement: selections are independent events
2. Sampling without replacement: selections are dependent events
➔ When drawing a small sample from a large population, selections may be treated as independent events.
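The difference between the two sampling methods, and why it hardly matters for a small sample from a large population, can be seen in a short calculation. The scenario (drawing two items and asking whether both are defective) and all numbers are assumptions made for this sketch.

```python
from fractions import Fraction

def both_defective(n_total, n_defective, replace):
    """P(first two draws are both defective)."""
    first = Fraction(n_defective, n_total)
    if replace:
        # With replacement: the second draw is independent of the first.
        second = first
    else:
        # Without replacement: one defective item is already gone.
        second = Fraction(n_defective - 1, n_total - 1)
    return first * second

small_with = both_defective(10, 5, replace=True)         # 1/4
small_without = both_defective(10, 5, replace=False)     # 2/9
large_with = both_defective(10_000, 500, replace=True)
large_without = both_defective(10_000, 500, replace=False)
```

In the small population the two answers differ clearly (1/4 versus 2/9); in the large population they are nearly identical.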
Lecture 3
Addition rule for disjoint events
If A and B are disjoint (A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).
o Partition:
Events A1, …, Am are called partition if
• They are pairwise disjoint: Ai ∩ Aj = ∅, if i ≠ j;
• Their union is the entire sample space: A1 ∪ A2 ∪ … ∪ Am = Ω.
o Bayes’ Theorem
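Let A1, …, Am be a partition and let B be an event with P(B) > 0. Then for r ∈ {1, …, m}:
P(Ar|B) = P(B|Ar) · P(Ar) / [P(B|A1) · P(A1) + … + P(B|Am) · P(Am)]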
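Bayes' theorem for a partition multiplies each prior probability by the corresponding conditional probability and divides by the sum of these products over the whole partition. A minimal sketch with invented numbers: three machines (the partition) produce 50%, 30% and 20% of all items, with defect rates of 1%, 2% and 3%; B is the event that an item is defective.

```python
from fractions import Fraction

# Hypothetical priors P(A1), P(A2), P(A3): share of production per machine.
prior = [Fraction(50, 100), Fraction(30, 100), Fraction(20, 100)]

# Hypothetical conditional probabilities P(B|Ai): defect rate per machine.
likelihood = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]

# Denominator: P(B) as the sum of P(B|Ai) · P(Ai) over the partition.
p_b = sum(p * l for p, l in zip(prior, likelihood))

# Posterior P(Ar|B) for each r.
posterior = [p * l / p_b for p, l in zip(prior, likelihood)]
```

With these numbers P(B) = 17/1000 and the posteriors are 5/17, 6/17 and 6/17. Although machine 1 produces half of all items, a defective item is slightly more likely to come from machine 2 or machine 3.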
A random variable is a variable that assigns a numerical value to each outcome of a probability
experiment.
Notation: X, Y, ..
X: random variable; x: a value of the random variable
o Probability distribution:
Determines probabilities of values of a random variable
Given by table, formula or graph
Table: left column with all values x of X, right column with the probabilities P(X = x)
o Expected value (expectation/mean)
The expected value of a discrete random variable X with possible values x1, …, xk is the weighted average of all possible values of X:
E(X) = μ = x1 · P(X = x1) + … + xk · P(X = xk)
o Variance
The variance of a discrete random variable X with values x1, …, xk and expected value μ = E(X):
Var(X) = σ² = (x1 − μ)² · P(X = x1) + … + (xk − μ)² · P(X = xk)
4. Determine
5. Tabulate the results.
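The expected value and the variance of a discrete random variable can be computed directly from its probability table. A sketch for a fair die roll, with the table stored as a dictionary (a representation chosen for this example):

```python
from fractions import Fraction

# Probability distribution of a fair die roll as a table {x: P(X = x)}.
dist = {x: Fraction(1, 6) for x in range(1, 7)}

# Expected value: weighted average of the possible values.
mean = sum(x * p for x, p in dist.items())

# Variance: weighted average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 * p for x, p in dist.items())
```

For the fair die this gives an expected value of 7/2 and a variance of 35/12.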
Lecture 4
Probability density function
A curve p(x) such that
o p(x) ≥ 0 for all x
o total area under curve = 1
P(X ∈ [a, b]) = area under the curve p(x) between a and b
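Probabilities for a continuous random variable are areas under the density curve, which can be approximated numerically. The sketch below uses an exponential density with rate 1 as an assumed example and a simple midpoint rule; both are choices made for illustration.

```python
import math

def density(x):
    """Assumed example density: exponential with rate 1."""
    return math.exp(-x) if x >= 0 else 0.0

def area(p, a, b, steps=100_000):
    """Midpoint-rule approximation of the area under p between a and b."""
    width = (b - a) / steps
    return sum(p(a + (i + 0.5) * width) for i in range(steps)) * width

prob_0_1 = area(density, 0.0, 1.0)   # P(X ∈ [0, 1])
total = area(density, 0.0, 50.0)     # close to 1; the tail beyond 50 is negligible
```

The first value approximates 1 − e⁻¹, and the second confirms that the total area under the curve is 1.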
Normal distribution
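Random variable X has a normal distribution N(µ, σ²) if its density p(x) is the continuous, bell-shaped curve that is symmetric around the mean µ, with spread determined by the standard deviation σ.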
A model distribution
Probability distribution for describing the unknown true population distribution
Examples (continuous variables): normal, uniform, t, χ², exponential.
Example: the variable ‘Date of birth - Due date’ is a random variable having a normal distribution with mean 0 and standard deviation 10.
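With this model distribution, probabilities can be computed using `statistics.NormalDist` from the Python standard library. The sketch assumes the variable is measured in days, which the notes do not state explicitly.

```python
from statistics import NormalDist

# Model distribution: normal with mean 0 and standard deviation 10
# (unit assumed to be days).
X = NormalDist(mu=0, sigma=10)

# P(-10 <= X <= 10): birth within 10 days of the due date.
within_10 = X.cdf(10) - X.cdf(-10)   # about 0.683

# P(X > 20): birth more than 20 days after the due date.
more_than_20_late = 1 - X.cdf(20)    # about 0.023
```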
Assessing normality
Consider dataset x1, …, xn. When is the model distribution N(µ, σ²) reasonable?
o Shape of histogram
Bell-shaped curve
Strong deviation from bell shape? Then N(µ, σ²) unlikely
o Normal QQ plot
Approximately straight line
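What is a QQ plot?
A QQ plot (quantile-quantile plot) plots the sorted data values (sample quantiles) against the corresponding theoretical quantiles of a model distribution.
There are QQ plots other than “normal QQ plots”: these use theoretical quantiles of other continuous distributions.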
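The points of a normal QQ plot can be computed by pairing the sorted data with standard normal quantiles. This is a sketch under assumptions: the plotting positions (i − 0.5)/n are one common convention among several, the sample is simulated, and the correlation coefficient is used here only as a rough numerical stand-in for "approximately a straight line".

```python
import random
from statistics import NormalDist

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical sample of size 200, simulated from a normal distribution.
data = [random.gauss(50, 5) for _ in range(200)]

def normal_qq_points(values):
    """Pair standard normal quantiles with the sorted data values."""
    n = len(values)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(values)))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

points = normal_qq_points(data)
r = pearson([t for t, _ in points], [s for _, s in points])
```

A value of `r` close to 1 means the points lie close to a straight line. Plotting `points` with any plotting library gives the QQ plot itself.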
Sample size
Small n: more variation
➔ Histogram / QQ plot could deviate (from bell shape / straight line), even if N(µ, σ²) is the true distribution.
Large n: histogram and QQ plot are more reliable.
Random variables X and Y have probability distributions in the same location-scale family if and only if the QQ plot of their quantiles shows a straight line, i.e. Y has the same distribution as a + bX for constants a and b > 0.