Ap Stat 1-7 Notes
Kathryn Jiang
Contents
1 Part I Describing Data
1.1 Plotting Data
1.2 Normal Distribution
1.3 Contingency Tables
1.4 Central Limit Theorem
4 Part IV Probability
4.1 Probability and Counting
4.2 General Probability
4.3 Bernoulli Trials
6 Part VI Inference
6.1 1 and 2 Sample t-tests
6.2 2-Sample Independent t-tests
6.3 Paired t-tests
6.4 Confidence Intervals
6.5 Hypothesis Testing Control Form
1.1 Plotting Data
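1.2 Normal Distribution
About 0.683 of the area under a normal curve lies within 1 standard deviation of the mean, and about 0.954 lies within 2 standard deviations.
z-score: how many standard deviations away from the mean a value is. Used to find the area under the curve using the green sheet.
$z = \frac{x - \mu}{\sigma}$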
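As a rough sketch of the z-score idea, the snippet below computes a z-score and the area within k standard deviations of the mean using Python's standard library instead of the green sheet. The values 130, 100, and 15 are hypothetical and not from the notes.

```python
from statistics import NormalDist

def z_score(x, mean, stdev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / stdev

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Hypothetical example: a value of 130 from a population with mean 100, stdev 15.
z = z_score(130, 100, 15)
print(z)                          # 2.0
print(round(area_within(1), 3))   # 0.683
print(round(area_within(2), 3))   # 0.954
```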
[Figure: distributions with the mode, median, and mean marked]
1.3 Contingency Tables
Contingency tables show how individuals are distributed along each variable, depending on another variable. They are used to check whether the two variables are independent.
Table 2: Contingency Table of Class and Survival of Titanic Passengers
         Alive   Dead   Total
First      202    123     325
Second     118    167     285
Third      178    528     706
Crew       212    673     885
Total      710   1491    2201
In this case, we are trying to determine if survival depends on your class. The independent
variable (thing that happened first) is your ticket, not whether you lived or died.
H0 : Survival is independent of ticket
HA : Survival is dependent on ticket
Table 3: Percentages of Survival by Class of Titanic Passengers
         First   Second   Third   Crew
Alive    62.2%   41.4%    25.2%   24.0%
Dead     37.8%   58.6%    74.8%   76.0%
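A minimal sketch of how the percentages in Table 3 follow from the counts in Table 2: each count is divided by the total for its class. The function and variable names are made up for illustration.

```python
# Counts of Titanic passengers by class and survival, from Table 2.
titanic_counts = {
    "First":  {"Alive": 202, "Dead": 123},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
    "Crew":   {"Alive": 212, "Dead": 673},
}

def percentages_by_group(counts):
    """Convert each group's counts into percentages of that group's total."""
    result = {}
    for group, outcomes in counts.items():
        total = sum(outcomes.values())
        result[group] = {
            outcome: round(100 * count / total, 1)
            for outcome, count in outcomes.items()
        }
    return result

percentages = percentages_by_group(titanic_counts)
for group, outcomes in percentages.items():
    print(group, outcomes)
```

If survival were independent of class, the "Alive" percentage would be roughly the same in every group.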
1.4 Central Limit Theorem
$\sigma_{CLT} = \frac{\sigma}{\sqrt{n}}$
As $n \to \infty$, shape becomes more normal, center stays the same, spread (stdev) decreases.
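To make the shrinking spread concrete, here is a small sketch that divides a population standard deviation by the square root of the sample size. The population standard deviation of 10 is a hypothetical value.

```python
import math

def stdev_of_sample_mean(sigma, n):
    """Spread of the sampling distribution of the mean for samples of size n."""
    return sigma / math.sqrt(n)

# Hypothetical population stdev of 10: the spread shrinks as n grows.
for n in (1, 4, 25, 100):
    print(n, stdev_of_sample_mean(10, n))
```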
2.1 Scatterplots
2.2 Linear Regression
Table 4: Regression Variables
$\hat{y}$       predicted y-value
$y$             observed y-value
$b_0$           y-intercept
$b_1$           slope $= r \frac{s_y}{s_x}$
$r^2$           coefficient of determination
$y - \hat{y}$   residuals
Replacing x and y with variables named in context gives the least squares regression line:
$\hat{y} = b_0 + b_1 x$
Interpreting $r^2$: 56.6% of the variability in male weight (response variable) can be explained by the least squares regression line on male height (explanatory variable).
Interpreting slope: According to the model, as fht increases by 1, predicted fwt increases by 3.70 on average.
Interpreting y-intercept: According to the model, a gestation period of 0 days corresponds to a predicted life expectancy of 7.87 years.
***CORRELATION DOES NOT IMPLY CAUSATION***
The LSRL always passes through the centroid $(\bar{x}, \bar{y})$. Points with extreme x-values, outside the x-range of the rest of the data, have high leverage; if removing one noticeably changes the slope, it is influential. Points near $\bar{x}$ that sit far above or below the line drag the entire line up or down toward them, but they barely change the slope.
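The regression quantities above can be sketched in a few lines: the slope from $r$, $s_y$, and $s_x$, the intercept chosen so the line passes through the centroid, and $r^2$. The height and weight values are hypothetical, not the data behind the interpretations in these notes.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (b0, b1, r_squared) for the least squares regression line."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation coefficient r from the sample standard deviations.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x            # slope
    b0 = y_bar - b1 * x_bar       # intercept, so the line hits the centroid
    return b0, b1, r ** 2

# Hypothetical data: heights (explanatory) and weights (response).
heights = [60, 62, 64, 66, 68]
weights = [110, 120, 125, 140, 145]

b0, b1, r_squared = least_squares_line(heights, weights)
predicted_at_centroid = b0 + b1 * mean(heights)
print(b0, b1, r_squared, predicted_at_centroid)
```

The prediction at the mean height equals the mean weight, which is the centroid property.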
2.3 Transformations
If there is a pattern in the residuals, or if the data looks curved, try transforming it to get
a better model. Four ways to transform data:
1. log(x)
2. log(y)
3. log-log
4. when all else fails, break apart into linear segments, perform regression on each
How to see if a model has high predictive power:
- residuals show random scatter, with a roughly even number of points above and below 0
- high $r^2$ value, although some variables are strongly associated even though the correlation is not close to 1, because the relationship is curved
3.1 Biases
Table 5: Types of Bias
Name                      Definition                           Example
voluntary response bias   self-selected, type of nonresponse   American Idol voting
undercoverage             excluding people                     only calling households on landlines
nonresponse bias          choose not to respond                calling when not at home
response bias             response changes                     loaded questions
The best way to minimize bias is to use randomization, where each individual is given a
fair, random chance of selection. Using random number tables: starting at the top left, look
at one digit at a time until a dozen numbers are selected. Or, you can use another random
number generator.
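As one example of "another random number generator", the sketch below draws a simple random sample of a dozen individuals with Python's `random` module. The roster and the seed are hypothetical; the seed is only there so the draw can be repeated.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Pick k distinct individuals, each with the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, k)

# Hypothetical roster of 30 individuals, numbered like a random number table.
roster = [f"student_{i:02d}" for i in range(1, 31)]
chosen = simple_random_sample(roster, 12, seed=1)
print(chosen)
```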
3.2 Types of samples
Table 6: Types of Sampling
Nonsampling error occurs when you mess up by not randomizing, or when there is a computer error.
3.3 Experiments
Table 7: Principles of Experimental Design
control     control group as baseline using null or placebo treatment
randomize   used to even out effects we cannot control
replicate   be able to reproduce results using a different sample
block       match similar subjects to reduce effects of things you cannot control
Blinding: there are two groups of people who can affect an experiment:
- people who influence results: subjects, treatment administrators, etc.
- people who evaluate results: judges, treating physicians, etc.
If only one of these groups is blinded, the experiment is single-blind. Everyone from both groups needs to be blinded for the experiment to be double-blind.
Confounding variables: occur when levels of one factor are associated with levels of another factor, so their effects cannot be separated.
Lurking variables: occur when a third, outside factor affects the variables being studied.
Part IV Probability
4.1 Probability and Counting
$\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$
8
52
4
51
If two events cannot happen at the same time, they are mutually exclusive. The law of large numbers says observed probabilities will approach theoretical probabilities as the number of trials grows.
$\lim_{n \to \infty} P(\text{observed}) = P(\text{theoretical})$
Conditional probability is the probability that B happens given that A has already happened, so it's the probability of both A and B happening over the probability of A happening.
$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$
Table 8: Probability Variables
$n$           number of trials
$x$           event
$\sigma$      standard deviation
$p$           probability of success
$q$           $1 - p$, probability of failure
$P(x)$        probability of event x
EV or $\mu$   expected value
4.2 General Probability
$\sigma = \sqrt{\sum (x - \mu)^2 P(x)} = \sqrt{\sum x^2 P(x) - \mu^2}$
$EV = \sum x \, P(x)$
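A short sketch of the expected value and standard deviation of a discrete random variable, using both forms of the standard deviation formula. The fair six-sided die is a hypothetical example distribution.

```python
import math

def expected_value(dist):
    """EV = sum of x * P(x), where dist maps each value x to P(x)."""
    return sum(x * p for x, p in dist.items())

def standard_deviation(dist):
    """sigma = sqrt(sum of (x - mu)^2 * P(x))."""
    mu = expected_value(dist)
    return math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

def standard_deviation_shortcut(dist):
    """sigma = sqrt(sum of x^2 * P(x) - mu^2), the equivalent shortcut form."""
    mu = expected_value(dist)
    return math.sqrt(sum(x ** 2 * p for x, p in dist.items()) - mu ** 2)

# Hypothetical distribution: one roll of a fair six-sided die.
die = {face: 1 / 6 for face in range(1, 7)}
print(expected_value(die))
print(standard_deviation(die), standard_deviation_shortcut(die))
```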
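4.3 Bernoulli Trials
Geometric model (first success on trial n): $P(n) = q^{n-1} p$
Binomial model (x successes in n trials): $P(x) = \binom{n}{x} p^x q^{n-x}$, with $\sigma = \sqrt{npq}$
Bernoulli models can be approximated with normal models, but only when $np \ge 10$ and $nq \ge 10$.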
Table 10: Useful Nspire Functions
geometricPdf(p, n)                probability that the event first happens on the nth trial
binomialCdf(n, p, lower, upper)   probability that between lower and upper events occur
binomialPdf(n, p, x)              probability exactly x events occur in n trials
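For checking calculator answers by hand, the Bernoulli trial formulas can be sketched in plain Python. These are illustrative stand-ins with made-up names, not the Nspire functions themselves, and the upper bound in `binomial_cdf` is assumed to be inclusive.

```python
import math

def geometric_pmf(p, n):
    """Probability that the first success happens on trial n: q^(n-1) * p."""
    return (1 - p) ** (n - 1) * p

def binomial_pmf(n, p, x):
    """Probability of exactly x successes in n trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(n, p, lower, upper):
    """Probability of between lower and upper successes (both inclusive)."""
    return sum(binomial_pmf(n, p, x) for x in range(lower, upper + 1))

def normal_approximation_ok(n, p):
    """Check the np >= 10 and nq >= 10 condition for using a normal model."""
    return n * p >= 10 and n * (1 - p) >= 10

print(geometric_pmf(0.5, 3))          # 0.125
print(binomial_pmf(10, 0.5, 5))       # 0.24609375
print(normal_approximation_ok(10, 0.5), normal_approximation_ok(100, 0.5))
```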
$z_{ts} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}}$
Follow this progression to find out what values of p and q to use in formulas:
1. p, q from previous studies
2. proportion from hypothesis
3. use p, q from the 2 samples
4. p = 0.5, q = 0.5 as the worst case
We use $\hat{p}$ to estimate p, and create a confidence interval accordingly:
$CI = \hat{p} \pm \text{margin of error} = \hat{p} \pm z^* \sqrt{\frac{\hat{p}\hat{q}}{n}}$
where $\sqrt{\frac{\hat{p}\hat{q}}{n}}$ is the standard error. Essentially, this interval is created by going $z^*$ standard errors out on each side of $\hat{p}$. If many, many samples are collected and a 95% confidence interval is created from each one, then about 95% of the intervals will capture the true proportion of whatever you're trying to measure.
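A minimal sketch of the one-proportion confidence interval, with $z^*$ taken from the standard normal distribution rather than a table. The sample of 120 successes out of 200 is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return (low, high) for p-hat +/- z* * sqrt(p-hat * q-hat / n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    # z* leaves (1 - confidence) / 2 in each tail of the standard normal curve.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z_star * math.sqrt(p_hat * q_hat / n)
    return p_hat - margin_of_error, p_hat + margin_of_error

# Hypothetical sample: 120 successes out of 200.
low, high = proportion_confidence_interval(120, 200)
print(round(low, 4), round(high, 4))
```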
Table 12: Types of Errors
Name      Symbol     Also called      Null hypothesis
Type I    $\alpha$   false positive   wrongly rejected
Type II   $\beta$    false negative   wrongly supported
Power $= 1 - \beta$
Part VI Inference
6.1 1 and 2 Sample t-tests
Table 13: Independent t-test Variables
$n$         sample size
$\mu$       mean of population
$\mu_0$     value from hypothesis
$\bar{x}$   mean of sample
$\sigma$    stdev of population
$s$         stdev of sample
$t_{ts}$    t-test statistic
$t_{ts} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
This finds the t-test statistic for 1-sample data. The t-test statistic is similar to the z-test statistic in that it describes the number of standard errors your value is away from the mean on a t-distribution, which depends on degrees of freedom and has fatter tails than a normal distribution.
$\text{Error} = t^* \frac{s}{\sqrt{n}}$
This is the margin of error associated with the confidence interval. In most cases, since we cannot find $t^*$, we use $z^*$ instead, which is a close approximation when n is large.
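The 1-sample t-test statistic can be sketched as follows, with degrees of freedom taken as n − 1. The sample values and the hypothesized mean of 300 are hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu_0):
    """Return (t statistic, degrees of freedom) for a 1-sample t-test."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                  # sample standard deviation
    standard_error = s / math.sqrt(n)
    return (x_bar - mu_0) / standard_error, n - 1

# Hypothetical data: five measured lifespans, hypothesized mean of 300.
lifespans = [300, 310, 295, 320, 305]
t_ts, df = one_sample_t(lifespans, 300)
print(round(t_ts, 3), df)
```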
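6.2 2-Sample Independent t-tests
$t_{ts} = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
$\text{StandardError}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
$\Delta_0$ is the hypothesized difference in means for the two independent groups, which can be written as
$H_0: \mu_1 - \mu_2 = \Delta_0$
$H_A: \mu_1 - \mu_2 > \Delta_0$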
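A sketch of the 2-sample statistic for independent groups, using the unpooled standard error. The two groups and the default hypothesized difference of 0 are hypothetical.

```python
import math
from statistics import mean, variance

def two_sample_t(sample_1, sample_2, hypothesized_difference=0.0):
    """t statistic for the difference in means of two independent samples."""
    standard_error = math.sqrt(
        variance(sample_1) / len(sample_1) + variance(sample_2) / len(sample_2)
    )
    observed_difference = mean(sample_1) - mean(sample_2)
    return (observed_difference - hypothesized_difference) / standard_error

# Hypothetical data for two independent groups.
group_1 = [12, 15, 14, 16, 13]
group_2 = [10, 11, 13, 12, 9]
t_ts = two_sample_t(group_1, group_2)
print(t_ts)
```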
6.3 Paired t-tests
Table 14: Paired t-test Variables
$n$          number of pairs
$\mu_d$      population mean of differences
$\bar{d}$    sample mean of differences
$\sigma_d$   population stdev of differences
$s_d$        sample stdev of differences
$t_{ts}$     t-test statistic
Paired t-tests are used when each pair of data is related in some way. For example, the data could be before/after a certain treatment given to the same person.
$t_{ts} = \frac{\bar{d} - \Delta_0}{\text{StandardError}(\bar{d})}$
$\text{StandardError}(\bar{d}) = \frac{s_d}{\sqrt{n}}$
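The paired statistic is a 1-sample t-test on the differences, which the sketch below makes explicit. The before/after scores are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t(before, after, hypothesized_difference=0.0):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    differences = [a - b for b, a in zip(before, after)]
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    t_ts = (mean(differences) - hypothesized_difference) / standard_error
    return t_ts, n - 1

# Hypothetical scores for the same four people before and after a treatment.
before = [80, 75, 90, 85]
after = [84, 77, 96, 89]
t_ts, df = paired_t(before, after)
print(round(t_ts, 3), df)
```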
6.4 Confidence Intervals
$\text{Confidence Interval} = \text{Statistic} \pm \text{Critical Value} \times \text{Standard Error} = \bar{x} \pm t^* \frac{s}{\sqrt{n}}$
Confidence interval statement: I am 90% confident that the true mean of battery lifespans is between 291.1 and 321.4 minutes. $t^*$ is the number of standard errors you go out on each side to get that confidence interval (just like a z-score).
If $\mu_0$ does not lie within the confidence interval, then the data is statistically significant. If $\mu_0$ lies anywhere inside the interval, even if it is not in the center, the data is not statistically significant.
For 2-tailed tests ($H_A$: $\ne$), each tail will be $\frac{\alpha}{2}$. For 1-tailed tests ($H_A$: $>$ or $<$), the single tail will be $\alpha$.
6.5 Hypothesis Testing Control Form
7.1 Chi-square Tests
Table 17: Chi-square tests
Name                      Goodness of Fit                     Homogeneity                                      Independence
Variables                 1                                   1 (data stratified)                              2
Degrees of Freedom (df)   $k - 1$                             $(r - 1)(c - 1)$                                 $(r - 1)(c - 1)$
Expected                  given                               $\frac{\sum row \times \sum col}{total}$         $\frac{\sum row \times \sum col}{total}$
Compares                  counts against given distribution   proportions                                      two variables
$H_0$                     $p_O = p_E$ for each                $p_{s,w} = p_{j,w}$                              response independent of gender
$H_A$                     $\ge 1$ sig diff                    $p_{s,w} \ne p_{j,w}$                            response dependent on gender
Note: a 2-proportion z-test can also be used instead of a chi-square test for homogeneity
$\chi^2 = \sum \frac{(Obs - Exp)^2}{Exp}$
The chi-square statistic is used to find the p-value based on the chi-square distribution, which
is skewed right. Increasing df reduces skew.
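As a sketch of the chi-square calculation for a test of independence, the code below builds expected counts as (row total × column total) / grand total and sums (Obs − Exp)² / Exp over all cells. It reuses the Titanic counts from Table 2; finding the p-value from the chi-square distribution is left to the calculator or the green sheet.

```python
def expected_counts(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    expected = expected_counts(observed)
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(observed, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Rows: First, Second, Third, Crew. Columns: Alive, Dead (Table 2).
titanic = [[202, 123], [118, 167], [178, 528], [212, 673]]
statistic, df = chi_square(titanic)
print(round(statistic, 1), df)
```

A statistic this large relative to df suggests survival was not independent of class.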
7.2 Linear Regression
$H_0: \beta = 0$
$H_A: \beta \ne 0$
To test whether there is a linear association between two variables, compare the slope $\beta$, which is the parameter, to 0. The statistic is $b_1$, the coefficient of the x-variable in the calculator printout (given).
$t_{TS} = \frac{b_1 - \beta}{SE(b_1)}$
The t-test statistic, when calculated by hand, is found by the equation above. In most cases, $\beta = 0$, but it depends on the hypotheses.
tCdf($t_{TS}$, $\infty$, df) = p-value for one tail; double it for two tails
In most cases, you need to double it because the test is two-tailed ($H_A$: $\ne$). Or use the green sheet.
Confidence interval: $b_1 \pm t^*_{n-2} \, SE(b_1)$
Statement: I am 95% confident the true slope of the regression line is between 0.883
and 1.737.
From the p-value, you can conclude there is / is not an association between the two
variables.
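When no calculator printout is given, the slope, its standard error, and the t-test statistic can be computed by hand. The sketch below assumes the usual formula SE(b1) = s_e / sqrt(sum of (x − x̄)²), where s_e is the residual standard deviation with n − 2 degrees of freedom; the data are hypothetical.

```python
import math
from statistics import mean

def slope_t_test(xs, ys, hypothesized_slope=0.0):
    """Return (b1, SE(b1), t statistic, df) for a regression slope test."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                      # same value as r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    # Residual standard deviation uses n - 2 degrees of freedom.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    se_b1 = s_e / math.sqrt(sxx)
    return b1, se_b1, (b1 - hypothesized_slope) / se_b1, n - 2

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b1, se_b1, t_ts, df = slope_t_test(xs, ys)
print(round(b1, 3), round(se_b1, 3), round(t_ts, 3), df)
```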