Probability Notes
Contents
1 Events and Their Probabilities
1.1 Introduction
1.2 Events as Sets
1.3 Probability
1.4 Different Types of Probabilities
1.5 Conditional Probability
1.6 Independence
1.7 Product Spaces
6 Limit Theorems
6.1 Chebyshev's Inequality
6.2 Central Limit Theorem
1.1 Introduction

Almost everything in our life is random; there exist only very few events having probability 1. The aim of probability theory is to describe random experiments by mathematical methods.
Examples:
- rolling a die
- lifetime of a particle
- weather forecast
- the number of accidents in a city per day
- number of website hits
History: In the 11th century, Arab mathematicians studied the numbers {2, . . . , 12}, the possible sums when rolling two dice, and tried to find the probability of each sum. In the year 1900 there was a mathematical congress in Paris where David Hilbert presented his 23 most important problems in mathematics, now called the Hilbert Problems. One of the problems was this: give a solid foundation of probability theory.
This problem was solved by A. N. Kolmogorov in 1933 with his book Foundations of Probability Theory. Much of it is based on the earlier work of H. Lebesgue, whose 1902 doctoral thesis developed measure theory. Kolmogorov realized this was a very good tool for formulating the axioms of probability theory.
Definition 1.1 (Random Experiment). A random experiment is one where, without changing the conditions, we observe different results/outcomes that are not predictable. We are only able to give a list of the possible outcomes.
1.2 Events as Sets

Definition 1.3 (Events). A subset A ⊆ Ω is called an event. We denote by P(Ω), also written 2^Ω, the power set of Ω, i.e. the set of all subsets of Ω. We write #A for the number of elements of a set A. If Ω is finite then #P(Ω) = 2^{#Ω}.

We observe ω ∈ Ω. Let A ⊆ Ω; one case is that ω ∈ A, and then A occurred. If ω ∉ A then A did not occur. As an example, if we describe the lifetime of a particle by Ω = [0, ∞) and A = [2, ∞), then A occurs if the particle lives longer than two years.
Easy Rules:
(i) Ω always occurs (certain event).
(ii) ∅ never occurs (impossible event).
(iii) A occurs if and only if A^c does not occur.
(iv) The union of two events occurs if and only if either one or the other occurs.
(v) The intersection occurs if and only if both occur.
There are some events of special interest. An event A is elementary if #A = 1.
Basic idea: Events occur with a certain probability. Given A ⊆ Ω, there is a number p ∈ [0, 1] such that A occurs with probability p. We write P(A) = p for the probability of occurrence.
Example 1.2. If Ω = {1, . . . , 6} and A = {2, . . . , 6} then P(A) = 5/6.
In short, we are looking for a function P : P(Ω) → [0, 1] such that P(A) is the probability of occurrence of A. Our problem is that if we let Ω be R, then there are no non-trivial functions P : P(Ω) → [0, 1] possessing the natural properties of a probability. Kolmogorov's solution to this problem was to classify good events and bad events. By this we mean that we choose a collection of sets A ⊆ P(Ω), where P(A) is defined for A ∈ A and not defined for A ∉ A.
Definition 1.4 (σ-field). A collection A ⊆ P(Ω) is called a σ-field if:
1. Ω ∈ A.
2. A ∈ A ⇒ A^c ∈ A.
3. For any countable collection A_1, A_2, . . . ∈ A we have ⋃_{n=1}^∞ A_n ∈ A.

Proposition 1.1. Let A be a σ-field. Then
(i) ∅ ∈ A.
(ii) A_1, . . . , A_n ∈ A ⇒ ⋃_{j=1}^n A_j ∈ A.
(iii) A_1, A_2, . . . ∈ A ⇒ ⋂_{j=1}^∞ A_j ∈ A.
(iv) A, B ∈ A ⇒ A \ B ∈ A.
Proof.
(i) ∅ = Ω^c.
(ii) Let A_j = ∅ for j > n. Then A_1 ∪ A_2 ∪ · · · ∈ A, but this is just A_1 ∪ · · · ∪ A_n.
(iii) A_1, A_2, . . . ∈ A implies that A_1^c, A_2^c, . . . ∈ A. Thus, because of (⋃ A_n^c)^c = ⋂ A_n by De Morgan's law, we get ⋂ A_n ∈ A as well.
(iv) B ∈ A means B^c ∈ A, thus A ∩ B^c = A \ B ∈ A.
Proposition 1.2. Let E be any non-empty collection of events in Ω. Then there is a smallest σ-field A_0 such that E ⊆ A_0.

Proof. Let Σ = {A : A is a σ-field and E ⊆ A}. We know Σ is non-empty, because P(Ω) ∈ Σ. So let
    A_0 = ⋂_{A ∈ Σ} A = {A ⊆ Ω : A ∈ A for every A ∈ Σ}.
1.3 Probability

What does P(A) mean, say P(A) = 0.7? One repeats the experiment n times under the same conditions, independently. Then a_n(A) = #{j ≤ n : A occurs in the j-th trial} denotes the absolute frequency of occurrence of A. Of course a_n(A) is random with 0 ≤ a_n(A) ≤ n. Its relative frequency r_n(A) is defined as
    r_n(A) = a_n(A)/n.                                            (1)
Then we expect r_n(A) → p for some p ∈ [0, 1] as n → ∞. We let p = P(A) be the limit of these frequencies. If n is large then A occurs approximately n · P(A) times.
Example 1.3. If Ω = {1, . . . , 6}, A = {1, 2} and n = 10^3, then on average A occurs 10^3/3 ≈ 333 times. Similarly, if a treatment cures a disease with probability 0.9, then out of 1000 patients approximately 900 will be cured.
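This frequency heuristic is easy to check numerically. A minimal sketch (Python; the fair die, the event A = {1, 2} and the sample sizes are illustrative choices):

    import random

    def relative_frequency(n, event=frozenset({1, 2})):
        # count how often the event occurs in n independent rolls of a fair die
        hits = sum(1 for _ in range(n) if random.randint(1, 6) in event)
        return hits / n

    for n in (10, 100, 10_000, 1_000_000):
        print(n, relative_frequency(n))   # values approach P(A) = 1/3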
Properties of r_n:
    r_n(∅) = 0, r_n(Ω) = 1.
    A ∩ B = ∅ ⇒ r_n(A ∪ B) = r_n(A) + r_n(B).
The limit should also have similar properties:
1. P(∅) = 0, P(Ω) = 1.
2. A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B).
3. By induction on n we get that if A_1, . . . , A_n are pairwise disjoint then P(⋃_{j=1}^n A_j) = Σ_{j=1}^n P(A_j).
Property 2, or equivalently property 3, is called finite additivity. It does not suffice to build a powerful theory, although many mathematicians tried to do so. We need the stronger σ-additivity of P defined below.
Definition 1.6 (Probability Measure). Let (Ω, A) be a measurable space, i.e. Ω ≠ ∅ and A is a σ-field on Ω. Then P : A → [0, 1] is called a probability measure or a probability if:
1. P(∅) = 0, P(Ω) = 1.
2. P is σ-additive, i.e. if A_1, A_2, . . . ∈ A are pairwise disjoint, then
       P(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).

Proposition 1.3. Let (Ω, A, P) be a probability space. Then:
1. P is finitely additive: if A_1, . . . , A_n ∈ A are pairwise disjoint, then P(⋃_{j=1}^n A_j) = Σ_{j=1}^n P(A_j).
2. In particular, P(A ∪ B) = P(A) + P(B) for disjoint A, B ∈ A.
3. If A ⊆ B, then P(B \ A) = P(B) - P(A); in particular P(A) ≤ P(B).
4. P(A^c) = 1 - P(A).
5. For arbitrary A_1, A_2, . . . ∈ A (σ-subadditivity),
       P(⋃_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ P(A_n).
6. If A_1 ⊆ A_2 ⊆ · · · , then
       P(⋃_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).
7. If A_1 ⊇ A_2 ⊇ · · · , then
       P(⋂_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).
Proof.
1. Let A_{n+1} = A_{n+2} = · · · = ∅ and apply σ-additivity.
2. Immediate consequence of 1.
3. B = (B \ A) ∪ A is a disjoint union, so P(B) = P(B \ A) + P(A).
4. Let B = Ω in 3.
5. Let B_1 = A_1, B_2 = A_2 \ A_1, B_3 = A_3 \ (A_1 ∪ A_2), and so on, so B_n = A_n \ (A_1 ∪ · · · ∪ A_{n-1}). Then ⋃_{n=1}^∞ B_n = ⋃_{n=1}^∞ A_n, the B_1, B_2, . . . are pairwise disjoint, and B_n ⊆ A_n for all n. Hence
       P(⋃_{n=1}^∞ A_n) = P(⋃_{n=1}^∞ B_n) = Σ_{n=1}^∞ P(B_n) ≤ Σ_{n=1}^∞ P(A_n).
6. Set B_1 = A_1 and B_n = A_n \ A_{n-1}; then ⋃_{n=1}^∞ B_n = ⋃_{n=1}^∞ A_n and the B_n are pairwise disjoint. Using 3 and a telescoping sum (with A_0 = ∅),
       P(⋃_{n=1}^∞ A_n) = P(⋃_{n=1}^∞ B_n) = Σ_{n=1}^∞ P(B_n) = lim_{n→∞} Σ_{j=1}^n P(A_j \ A_{j-1}) = lim_{n→∞} P(A_n).
7. If A_1 ⊇ A_2 ⊇ · · · , then A_1^c ⊆ A_2^c ⊆ · · · , so by 6,
       P(⋃_{n=1}^∞ A_n^c) = lim_{n→∞} P(A_n^c) = 1 - lim_{n→∞} P(A_n),
   and on the other hand
       P(⋃_{n=1}^∞ A_n^c) = P((⋂_{n=1}^∞ A_n)^c) = 1 - P(⋂_{n=1}^∞ A_n).
Example (gift exchange). Suppose n persons each bring a gift and the gifts are redistributed at random; let A_j be the event that person j receives his own gift. By the inclusion-exclusion formula,
    P(⋃_{j=1}^n A_j) = Σ_{k=1}^n (-1)^{k+1} Σ_{1 ≤ j_1 < · · · < j_k ≤ n} P(A_{j_1} ∩ · · · ∩ A_{j_k}).
The probability inside the right-hand sum is the probability that persons j_1, . . . , j_k all get their own gift out of n people. This is just (n-k)!/n!. Since there are C(n, k) such index sets and C(n, k) (n-k)!/n! = 1/k!, our probability is
    P(⋃_{j=1}^n A_j) = Σ_{k=1}^n (-1)^{k+1} (1/k!) → 1 - e^{-1},   n → ∞.
Final remark: If P(A) = 1 then A occurs almost surely (a.s.). But P(A) = 1 does not imply A = Ω. An easy way to see this is to let Ω be the set of all infinite sequences of 0s and 1s. Then the probability of choosing one fixed sequence at random out of all of them is 0, so the probability of not choosing it is 1. Thus we choose a sequence that is not {0, 0, 0, . . . } almost surely. Similarly, we say A is a zero set if P(A) = 0.
1.4 Different Types of Probabilities
Case 1: Discrete Probabilities
Here we have either Ω = {x_1, . . . , x_N} or Ω = {x_1, x_2, . . .}. As σ-field A we may always choose the power set, i.e. A = P(Ω).
We define a function f : Ω → R by
    f(x) := P({x}),   x ∈ Ω.
Then
    f(x) ≥ 0  and  Σ_{x ∈ Ω} f(x) = 1,                            (2)
and
    P(A) = Σ_{x ∈ A} f(x),   A ⊆ Ω.                               (3)
Conversely, given any function f : Ω → R satisfying (2), a probability P on P(Ω) is defined by (3). Thus we have a one-to-one relation between probabilities on P(Ω) and real-valued functions f on Ω satisfying (2).
Example 1.5. Suppose Ω = {0, 1, 2, 3}. Letting f(0) = 1/4, f(1) = 1/4, f(2) = 1/6 and f(3) = 1/3, the generated probability P satisfies
    P({1, 3}) = f(1) + f(3) = 1/4 + 1/3 = 7/12.
Basic examples of discrete probabilities
1. Uniform distribution on a finite set: We have Ω = {x_1, . . . , x_N} and assume that all elementary events are equally likely. Hence
       f(x_i) = P({x_i}) = 1/N,   i = 1, . . . , N.
   This leads to
       P(A) = #A/N = #A/#Ω,   A ⊆ Ω.
   This probability is called uniform distribution on Ω = {x_1, . . . , x_N}.
2. Binomial distribution: Given an integer n ≥ 1 let Ω = {0, . . . , n}. For p ∈ [0, 1] we set
       f(k) = B_{n,p}({k}) = C(n, k) p^k (1-p)^{n-k},   k = 0, . . . , n.
   The generated probability B_{n,p} on P(Ω) is called binomial distribution with parameters n and p. Execute n independent trials of an experiment where each time success appears with probability p and failure with probability 1-p. Then B_{n,p}({k}) is the probability to observe exactly k times success (hence n-k times failure). The number p is called the success probability.
3. Geometric distribution: Let Ω = {1, 2, . . .} and p ∈ (0, 1]. Set
       f(k) = p (1-p)^{k-1},   k = 1, 2, . . . ,
   the probability that, in independent trials with success probability p, the first success appears in the k-th trial.
4. Poisson distribution: Let Ω = {0, 1, 2, . . .} and λ > 0. Set
       f(k) = Pois_λ({k}) = (λ^k / k!) e^{-λ},   k = 0, 1, . . . .
   The generated probability Pois_λ is called Poisson distribution with parameter λ.
5. Hypergeometric distribution: For integers M ≤ N and n ≤ N set
       H_{N,M,n}({m}) = C(M, m) C(N-M, n-m) / C(N, n),   m = 0, . . . , n.
   Suppose there is a delivery of N items among which M are defective. Choose at random n of the N items. Then H_{N,M,n}({m}) is the probability to observe exactly m defective ones among the chosen n.
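Each of these mass functions should sum to 1 over its support, which is easy to verify numerically. A minimal sketch (Python, standard library only; the parameter values are arbitrary illustrative choices):

    from math import comb, exp, factorial

    def binomial_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def poisson_pmf(k, lam):
        return lam**k / factorial(k) * exp(-lam)

    def hypergeom_pmf(m, N, M, n):
        return comb(M, m) * comb(N - M, n - m) / comb(N, n)

    print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))      # 1.0
    print(sum(poisson_pmf(k, 2.5) for k in range(60)))           # ~1.0 (truncated tail)
    print(sum(hypergeom_pmf(m, 50, 10, 8) for m in range(9)))    # 1.0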
Case 2: Continuous Probabilities
These probabilities describe experiments where uncountably many outcomes are possible. Typical examples are the lifetime of a light bulb, the length of a phone call, or the values of a measurement (e.g. of the pressure of the air or of an item).
Here we have Ω = R or Ω ⊆ R, for example Ω = [0, 1]. As σ-field we choose A = B(R), the Borel σ-field.

Definition 1.7. A Riemann integrable function f : R → R is called a probability density if it satisfies
    f(x) ≥ 0,  x ∈ R,  and  ∫_{-∞}^∞ f(x) dx = 1.                 (4)
A probability P on B(R) is generated by the density f via
    P([a, b]) = ∫_a^b f(x) dx.                                    (5)

Proposition 1.7 (Extension Theorem). Given a probability density f on R, there is a unique probability P on B(R) satisfying (5).

The function f is called density of P. Probabilities having a density are said to be continuous.¹

Remark. There exists a third kind of probability, different from both mentioned above, the so-called singular continuous probabilities. Although they satisfy P({a}) = 0 for all a ∈ R, they do not possess a density. They are concentrated on thin sets, for example on Cantor's discontinuum.

¹ The correct notation would be absolutely continuous. But since we do not deal with singular probabilities, we shortly say continuous.
Remark.
(1) In contrast to the discrete case, continuous probabilities satisfy P({a}) = 0 for all a ∈ R.
(2) The density of a continuous probability is not unique. For example, changing the density at finitely many points does not change the generated probability.
Basic examples of continuous probabilities
1. Uniform distribution on a finite interval: Let I = [, ] be a finite interval in the real line.
Define the function p by
1
: x [, ]
f (x) =
0 : x
/ [, ]
Of course, f satisfies (4), hence it is a probability density, and the probability P generated by
f via (5) is called uniform distribution on the interval I.
Note that P satisfies
[a, b] [, ]
P([a, b]) =
[, ]
where |A| denotes the length of a set A. In particular, if [a, b] [, ], then we get
P([a, b]) =
ba
.
The crucial property of the uniform distribution is as follows: The probability of occurrence of
a set A I is independent of the position of A in I = [, ]. Only the size of A matters.
2. Exponential distribution: Given λ > 0 define f on R by
       f(x) = λ e^{-λx}  if x ≥ 0,   f(x) = 0  if x < 0.
   Let E_λ be the probability generated by f. It is called exponential distribution with parameter λ > 0.
   If 0 ≤ a < b, then we get
       E_λ([a, b]) = ∫_a^b λ e^{-λx} dx = e^{-λa} - e^{-λb}.
   The exponential distribution describes the lifetime of non-aging particles, components or other items.
3. Cauchy distribution: Define f by
       f(x) = (1/π) · 1/(1 + x²),   x ∈ R.
   The probability P generated by this density is called Cauchy distribution. Note that
       P([a, b]) = (arctan(b) - arctan(a))/π.
   The Cauchy distribution is used to describe experiments where large values appear with comparably large probability.
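The interval probabilities given by (5) can be checked against a crude numerical integration of the densities. A small sketch (Python; the midpoint rule and the parameter λ = 2 are illustrative choices, not part of the text):

    from math import exp, atan, pi

    def integrate(f, a, b, n=100_000):
        # midpoint rule on a fine grid
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

    lam = 2.0
    expo = lambda x: lam * exp(-lam * x) if x >= 0 else 0.0
    cauchy = lambda x: 1 / (pi * (1 + x * x))

    print(integrate(expo, 0, 1), 1 - exp(-lam))              # e^{-0} - e^{-2}
    print(integrate(cauchy, -1, 1), (atan(1) - atan(-1)) / pi)   # both 0.5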
1.5 Conditional Probability

Let (Ω, A, P) be our probability space and B ∈ A such that P(B) > 0. Then we write P(A|B) for the probability of A given B, also called the probability of A conditional on B.
Example 1.6.
1. Roll a die twice. Let A = {sum of the results is 5}; then P(A) = 4/36 = 1/9. Let B = {first result is 1 or 2}. We get the result by restricting the sample space Ω = {1, . . . , 6} × {1, . . . , 6} to the new sample space B = {1, 2} × {1, . . . , 6}. So
       P(A|B) = #(A ∩ B)/#B = 2/12 = 1/6.
2. We have an urn containing two black balls and two white ones. Let A = {2nd ball is white} and B = {1st ball is black}. Then P(A|B) = 2/3.
Definition 1.8 (Conditional Probability). Let (Ω, A, P) be a probability space and B ∈ A with P(B) > 0. Then we set
    P(A|B) = P(A ∩ B)/P(B).

1. In the urn example above, compute P(A ∩ B) = P({(b, w)}). We have P(B) = 1/2 and P(A|B) = 2/3, so P(A ∩ B) = P(B) · P(A|B) = 1/3.
2. Lottery with 49 numbers; choose 6 in a row (without replacing them). What is the probability that the first number is even and the second is odd? Let B = {first number is even} and A = {second number is odd}. Then P(B) = 24/49 and P(A|B) = 25/48, so our result is
       P(A ∩ B) = P(B) · P(A|B) = (24 · 25)/(49 · 48) = 25/98.
Proposition 1.8. For fixed B with P(B) > 0, the mapping A ↦ P(A|B) is a probability measure on (Ω, A).

Proof. P(∅|B) = P(∅ ∩ B)/P(B) = 0, and similarly P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.
Let A_1, A_2, . . . be disjoint. Then
    P(⋃_{j=1}^∞ A_j | B) = P((⋃_{j=1}^∞ A_j) ∩ B)/P(B) = P(⋃_{j=1}^∞ (A_j ∩ B))/P(B)
                         = Σ_{j=1}^∞ P(A_j ∩ B)/P(B) = Σ_{j=1}^∞ P(A_j|B).
Definition 1.9. The probability A ↦ P(A|B) is called conditional probability (with respect to B). It satisfies P(B|B) = 1 and P(B^c|B) = 0.

Proposition 1.9 (Law of Total Probability). Let B_1, . . . , B_n ∈ A be pairwise disjoint with P(B_j) > 0 and ⋃_{j=1}^n B_j = Ω. Then
    P(A) = Σ_{j=1}^n P(B_j) · P(A|B_j).                           (6)
Proof. We see
    Σ_{j=1}^n P(B_j) · P(A|B_j) = Σ_{j=1}^n P(B_j) · P(A ∩ B_j)/P(B_j) = Σ_{j=1}^n P(A ∩ B_j) = P(⋃_{j=1}^n (A ∩ B_j)).
Note that the B_j are all disjoint, allowing us to write the sum as the union. Then this is equal to
    P(A ∩ (⋃_{j=1}^n B_j)) = P(A ∩ Ω) = P(A).
Example 1.8. Consider an urn with four balls, two black and two white. Let A be the event that the 2nd ball picked is white, B_1 the event that the first ball is white, and B_2 the event that the first ball is black. Then Ω = B_1 ∪ B_2 and
    P(A) = P(B_1)P(A|B_1) + P(B_2)P(A|B_2) = (1/2)(1/3) + (1/2)(2/3) = 1/2.
n
S
13
j=1
abilities. These are the probabilities of the Bj s before executing the experiment. Now execute
the experiment and observe the occurrence of an event A. Then P(B1 |A), . . . , P(Bn |A) are the a
posteriori probabilities, i.e., those after knowing the occurrence of A.
Proposition 1.10 (Bayes' Rule).
    P(B_j|A) = P(B_j)P(A|B_j) / Σ_{i=1}^n P(B_i)P(A|B_i).         (7)
Proof. From the previous proposition, the denominator is equal to P(A). Then
    P(B_j)P(A|B_j) / Σ_{i=1}^n P(B_i)P(A|B_i) = P(B_j)P(A|B_j)/P(A) = (P(B_j) · P(A ∩ B_j)/P(B_j))/P(A) = P(B_j ∩ A)/P(A) = P(B_j|A).
Example 1.9. We have three coins; two are fair and one is biased. For the biased coin, heads occurs with probability 1/3 and tails with probability 2/3. Say we choose a coin at random and toss it, and we observe heads. What is the probability that the chosen coin was fair?
Let B_j be the event that the j-th coin was chosen, j ∈ {1, 2, 3}. The a priori probabilities are P(B_j) = 1/3. Now let A be the event that we observe heads. Then
    P(A) = (1/3)(1/2) + (1/3)(1/2) + (1/3)(1/3) = 4/9.
For the a posteriori probabilities we see
    P(B_1|A) = P(B_1)P(A|B_1)/P(A) = (1/3 · 1/2)/(4/9) = 3/8.
By the same computation, P(B_2|A) = 3/8, thus P(B_3|A) = 1/4. The probability we wanted to find, that the chosen coin was fair, is therefore P(B_1|A) + P(B_2|A) = 3/4.
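A Monte Carlo check of this a posteriori probability (Python; the trial count is arbitrary, and the simulation assumes, as in the example, that coins 1 and 2 are fair while coin 3 shows heads with probability 1/3):

    import random

    def trial():
        coin = random.randint(1, 3)
        p_heads = 0.5 if coin <= 2 else 1/3
        return coin, random.random() < p_heads

    N = 1_000_000
    heads = fair_and_heads = 0
    for _ in range(N):
        coin, is_heads = trial()
        if is_heads:
            heads += 1
            fair_and_heads += coin <= 2

    print(fair_and_heads / heads)   # should be close to 3/4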
1.6 Independence

What is the mathematical description of independence, i.e. of A and B occurring independently? We can describe this using conditional probabilities. If A and B are independent, then P(A|B) = P(A). We say this because the event B does not affect A, or A occurs independently of B. In short, no data about A can be recovered from knowledge of B. So we have
    P(A) = P(A|B) = P(A ∩ B)/P(B),  i.e.  P(A ∩ B) = P(A)P(B).    (8)
Examples:
1. Roll a die twice. Let B be the event that the sum of the results is 7 and A the event that the first result is 3. We know P(A) = 1/6 and P(B) = 6/36 = 1/6, and P(A ∩ B) = 1/36. Thus P(A ∩ B) = P(A)P(B), so they are independent. This is an interesting example because the events are no longer independent if we replace 7 with 6.
2. Consider the urn with two black and two white balls. Let A be the event that the first ball is white and B the event that the second is black. We see P(A) = P(B) = 1/2, but P(A ∩ B) = P(A)P(B|A) = (1/2)(2/3) = 1/3 ≠ 1/4 = P(A)P(B). Thus they are dependent.
Proposition 1.11.
(i) A is independent of B if and only if B is independent of A.
(ii) ∅ and Ω are independent of each event.
(iii) If A and B are independent, then so are A and B^c, and also A^c and B^c.

Proof.
(i) Trivial.
(ii) P(Ω ∩ B) = P(B) = P(Ω)P(B), and P(∅ ∩ B) = 0 = P(∅)P(B).
(iii) P(A ∩ B^c) = P(A) - P(A ∩ B) = P(A) - P(A)P(B) = P(A)(1 - P(B)) = P(A)P(B^c). The first equality follows by noting that the disjoint sets A ∩ B and A ∩ B^c give us P(A) = P(A ∩ B) + P(A ∩ B^c). Since A and B^c are thus independent, applying the same argument once more shows that A^c and B^c are independent.
Definition 1.11 (Independence of a family of events). Events A_i, i ∈ I, are called independent if for every finite subset I_0 ⊆ I,
    P(⋂_{i ∈ I_0} A_i) = ∏_{i ∈ I_0} P(A_i).

Remark.
1. It suffices to take #(I_0) ≥ 2, because for #(I_0) = 1 there is nothing to prove.
2. If I = {1, . . . , n}, then for all 1 ≤ i_1 < i_2 < · · · < i_m ≤ n we have
       P(A_{i_1} ∩ · · · ∩ A_{i_m}) = P(A_{i_1}) · · · P(A_{i_m})
   by taking I_0 = {i_1, . . . , i_m}.
3. For n = 3: if #(I_0) = 2, then P(A_i ∩ A_j) = P(A_i)P(A_j) for all i ≠ j in {1, 2, 3}, and if #(I_0) = 3, then P(A_1 ∩ A_2 ∩ A_3) = P(A_1)P(A_2)P(A_3).
Example 1.12. Consider a fair coin with sides 0 and 1 and toss it n times. Let A_j = {j-th toss is 0}. We want to show A_1, . . . , A_n are independent. We have Ω = {0, 1}^n, A = P(Ω) and P(A) = #A/2^n, so
    P(A_j) = 2^{n-1}/2^n = 1/2.
For I_0 ⊆ {1, . . . , n} with #(I_0) = m, the event ⋂_{i ∈ I_0} A_i fixes m coordinates, hence
    P(⋂_{i ∈ I_0} A_i) = 2^{n-m}/2^n = (1/2)^m.
Since ∏_{i ∈ I_0} P(A_i) = (1/2)^m as well, we are done.

1.7 Product Spaces
Suppose we execute two (maybe different) random experiments, described by probability spaces (Ω_1, A_1, P_1) and (Ω_2, A_2, P_2). How do we express that they are executed independently? Independence of the experiments means that the probability of occurrence of A_1 ∈ A_1 in the first experiment and of A_2 ∈ A_2 in the second one should be P_1(A_1) · P_2(A_2). But so far there is no probability space that describes the joint occurrence of A_1 and A_2. Consequently we need a probability space (Ω, A, P) which describes the common execution of both experiments and where the probability of the joint occurrence of A_1 and A_2 equals P_1(A_1) · P_2(A_2).

Example 1.13. The first experiment is rolling a fair die while the second one is tossing a fair coin twice. In order to express that both experiments are independent we need a probability space which describes the simultaneous execution of both experiments. A suitable sample space would be
    Ω = {1, . . . , 6} × {H, T}² = {(x_1, x_2, x_3) : x_1 = 1, . . . , 6;  x_2, x_3 are H or T}.
Let (Ω_1, A_1, P_1) and (Ω_2, A_2, P_2) be two probability spaces. The sample space for combining the two experiments is
    Ω := Ω_1 × Ω_2 = {(ω_1, ω_2) : ω_1 ∈ Ω_1, ω_2 ∈ Ω_2}.
The natural σ-field on Ω is the smallest σ-field containing all rectangles:
    A = A_1 ⊗ A_2 = σ({A_1 × A_2 : A_1 ∈ A_1, A_2 ∈ A_2}).

Definition 1.12. The σ-field A_1 ⊗ A_2 is called the product σ-field of A_1 and A_2.

Next we observe that the event A_1 × A_2 occurs if and only if A_1 and A_2 occur. Hence a suitable probability P on (Ω, A) should satisfy
    P(A_1 × A_2) = P_1(A_1)P_2(A_2),   A_1 ∈ A_1, A_2 ∈ A_2.      (10)
Such a P exists and is unique; it is called the product measure P = P_1 ⊗ P_2.
Concrete cases:
1. Discrete case: Suppose Ω_1 = {x_1, x_2, . . .} and Ω_2 = {y_1, y_2, . . .}. Then
       Ω = Ω_1 × Ω_2 = {(x_i, y_j) : 1 ≤ i, j < ∞}.
   The product σ-field of P(Ω_1) and P(Ω_2) is P(Ω).

Proposition 1.13. If f_1(x) = P_1({x}), x ∈ Ω_1, and f_2(y) = P_2({y}), y ∈ Ω_2, their product probability P = P_1 ⊗ P_2 is given by
    P(A) = Σ_{(x,y) ∈ A} f_1(x) f_2(y),   A ⊆ Ω.
Example 1.14. Choose independently two numbers from {1, . . . , n}, where the same number may be chosen twice. Find the probability that the first number is strictly less than the second one.
The two sample spaces are {1, . . . , n}, so the common sample space is Ω = {(i, j) : 1 ≤ i, j ≤ n}. Since all numbers are equally likely, the probability mass functions f_1 and f_2 are constant with f_1({i}) = f_2({j}) = 1/n. Hence, for any A ⊆ Ω we have
    P(A) = Σ_{(i,j) ∈ A} (1/n)(1/n) = #A/n²,
that is, the product probability P is the uniform distribution on {1, . . . , n}². In particular, if A = {(i, j) : i < j}, then #A = (n-1)n/2, hence
    P(A) = n(n-1)/(2n²) = (n-1)/(2n).
Example 1.15. Two players, say X and Y, simultaneously roll a die. The winner is the one who gets the number 6 first (here "X wins" means X's first 6 appears strictly before Y's). Find the probability that X wins.
Here we have Ω_1 = Ω_2 = N. Moreover, the probability to observe 6 for the first time in the i-th roll is
    f_1({i}) = f_2({i}) = (1/6)(5/6)^{i-1},   i = 1, 2, . . . .
Then
    P{X wins} = Σ_{i=1}^∞ f_1({i}) · P{Y needs more than i rolls} = Σ_{i=1}^∞ (1/6)(5/6)^{i-1}(5/6)^i
              = (1/5) Σ_{i=1}^∞ (25/36)^i = (1/5) · (25/36)/(1 - 25/36) = 5/11.
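A quick numerical check of the series just computed (Python; 60 terms are plenty since the ratio is 25/36):

    total = sum((1/6) * (5/6)**(i - 1) * (5/6)**i for i in range(1, 61))
    print(total, 5/11)   # both print 0.4545...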
Example 1.16. Suppose the number of accidents per week in a city is Poisson distributed with parameter λ > 0. Further we assume that the numbers of accidents in different weeks are independent of each other. Find the probability that the number of accidents in the next week is at least twice that of this week.
We have Ω_1 = Ω_2 = N_0, hence Ω = N_0². The probability mass functions are given by
    f_1({i}) = f_2({i}) = (λ^i / i!) e^{-λ},   i = 0, 1, . . . .
With A = {(i, j) : j ≥ 2i},
    P(A) = Σ_{(i,j) ∈ A} (λ^{i+j} / (i! j!)) e^{-2λ} = Σ_{i=0}^∞ (λ^i / i!) (Σ_{j=2i}^∞ λ^j / j!) e^{-2λ}.
2. Continuous case: We are given two probability spaces (R, B(R), P_1) and (R, B(R), P_2) and suppose that both probabilities possess densities:
       P_1(A) = ∫_A f_1(x) dx  and  P_2(A) = ∫_A f_2(y) dy.       (11)
   Then the product probability P = P_1 ⊗ P_2 on (R², B(R²)) has the density f(x, y) = f_1(x) f_2(y).
Example 1.17. Choose at random and independently of each other two numbers from [0, 1]. Find the probability that their sum does not exceed 1/4.
The probabilities P_1 and P_2 are the uniform distributions on [0, 1]. Hence the densities f_1 and f_2 are given by f_1(x) = f_2(x) = 1_{[0,1]}(x). The set of interest is
    A = {(x, y) : x + y ≤ 1/4}.
Consequently, if P = P_1 ⊗ P_2, it follows that
    P(A) = ∫∫_A 1_{[0,1]}(x) 1_{[0,1]}(y) dx dy = ∫∫_{x,y ≥ 0, x+y ≤ 1/4} dx dy = 1/32,
the area of a right triangle with legs of length 1/4.
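A Monte Carlo version of this computation (Python; P{X + Y ≤ 1/4} should come out near 1/32 = 0.03125):

    import random

    N = 1_000_000
    hits = sum(1 for _ in range(N) if random.random() + random.random() <= 0.25)
    print(hits / N)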
Example 1.18. Suppose the lifetime of certain light bulbs is exponentially distributed with parameter λ > 0. Switch on two bulbs at the same time. Find the probability that at least one of the bulbs still burns at time T > 0.
The densities f_1 and f_2 are given by
    f_1(x) = f_2(x) = λ e^{-λx}  for x ≥ 0,  and 0 for x < 0.
Let t_1 and t_2 be the lifetimes of bulb 1 and bulb 2. Then we ask for the probability of the event A which occurs if max{t_1, t_2} > T. Let us look at the complementary event
    A^c = {(t_1, t_2) : max{t_1, t_2} ≤ T}.
Then it follows that
    P(A^c) = ∫∫_{0 ≤ t_1, t_2 ≤ T} λ² e^{-λ(t_1 + t_2)} dt_1 dt_2 = (1 - e^{-λT})²,
hence
    P(A) = 1 - (1 - e^{-λT})² = 2 e^{-λT} - e^{-2λT}.
Remark. Of course, the construction of products easily extends from two probability spaces to finitely many. More precisely, given probability spaces (Ω_1, A_1, P_1) to (Ω_n, A_n, P_n), we set
    Ω = Ω_1 × · · · × Ω_n,
and A = A_1 ⊗ · · · ⊗ A_n is the smallest σ-field containing the rectangle sets A_1 × · · · × A_n with A_j ∈ A_j. A probability measure P on (Ω, A) is called the product measure of P_1 to P_n (write P = P_1 ⊗ · · · ⊗ P_n) if
    P(A_1 × · · · × A_n) = P_1(A_1) · · · P_n(A_n),   A_j ∈ A_j.
2.1 Random Variables

Throw a dart at a board Ω ⊆ R². For an outcome ω = (t, s) ∈ Ω, a quantity of interest is
    X(t, s) = √(t² + s²),
the distance of its impact to the center of the board.

Definition 2.1 (Random Variable). Suppose that X : Ω → R. Then X is said to be a random variable if for all t ∈ R the sets
    {ω ∈ Ω : X(ω) ≤ t} = {X ≤ t} ∈ A.
Remark.
(i) If A = P(Ω), then every X is a random variable.
(ii) The other extreme case is A = {∅, Ω}. Then only constant functions X are random variables.
(iii) Suppose Ω = R^n and X : R^n → R is a continuous function; then X is a random variable. We see this because {ω ∈ R^n : X(ω) ≤ t} is closed in R^n, thus belongs to B(R^n).
(iv) If X : R → R is monotone, then {ω ∈ R : X(ω) ≤ t} is an open or a closed interval, hence belongs to B(R).
We call X^{-1}(B) = {ω ∈ Ω : X(ω) ∈ B} the pre-image of B. We have some easily verified properties:
    X^{-1}(⋃_{j=1}^∞ B_j) = ⋃_{j=1}^∞ X^{-1}(B_j),
    X^{-1}(A ∩ B) = X^{-1}(A) ∩ X^{-1}(B)  and  X^{-1}(A^c) = X^{-1}(A)^c.
Proposition 2.1. X is a random variable if and only if X^{-1}(B) ∈ A for all Borel sets B ⊆ R.

Proof. Suppose first that X^{-1}(B) ∈ A for all Borel sets B. Take t ∈ R and set B = (-∞, t]. This is a Borel set, thus X^{-1}(B) = {X ≤ t} ∈ A, hence X is a random variable.
For the other direction, let X be a random variable. Then we know X^{-1}((-∞, t]) ∈ A for all t. We want to prove that X^{-1}(B) ∈ A for all Borel sets B. Let us prove that
    C = {B ⊆ R : X^{-1}(B) ∈ A}
is a σ-field. We know this is true because:
- R ∈ C, as X^{-1}(R) = Ω ∈ A.
- If B ∈ C, then X^{-1}(B) ∈ A, which means X^{-1}(B^c) = X^{-1}(B)^c ∈ A.
- If B_1, B_2, . . . ∈ C, then X^{-1}(B_n) ∈ A for all n, thus X^{-1}(⋃_{n=1}^∞ B_n) = ⋃_{n=1}^∞ X^{-1}(B_n) ∈ A, as A is closed under countable unions.
Since C is a σ-field containing all sets (-∞, t], and B(R) is the smallest σ-field containing these, we get B(R) ⊆ C, which proves the claim.
Recall that the distribution function of a random variable X is F_X(t) = P{X ≤ t}, t ∈ R. Examples:
2. Let X be Bernoulli distributed: P{X = 0} = 1 - p and P{X = 1} = p. Then
       F_X(t) = 0 for t < 0,   F_X(t) = 1 - p for 0 ≤ t < 1,   F_X(t) = 1 for t ≥ 1.
3. Let P{X = k} = (λ^k / k!) e^{-λ} for λ > 0, k = 0, 1, 2, . . . . Then
       F_X(t) = 0 for t < 0,   F_X(t) = Σ_{k=0}^n (λ^k / k!) e^{-λ} for n ≤ t < n + 1.
4. Suppose x_1 < x_2 < · · · < x_N in R. Let X be uniformly distributed on {x_1, . . . , x_N}, meaning that P{X = x_j} = 1/N. Then one easily gets
       F_X(t) = k/N for t ∈ [x_k, x_{k+1}),   k = 1, . . . , N - 1.
   Furthermore, F_X(t) = 0 for t < x_1 and F_X(t) = 1 for t ≥ x_N.
Properties: F_X is non-decreasing with lim_{t→-∞} F_X(t) = 0 and lim_{t→∞} F_X(t) = 1, and F_X is right-continuous.
Indeed, if t_n ↓ -∞ and A_n = {X ≤ t_n}, then ⋂_{j=1}^∞ A_j = ∅, thus P(A_n) → 0, and noting P(A_n) = F_X(t_n) we are done. If t_n ↑ ∞, then ⋃_{j=1}^∞ A_j = Ω, thus P(A_n) → 1, and noting P(A_n) = F_X(t_n) we are done. For right-continuity, if t_n ↓ t, then ⋂_{j=1}^∞ A_j = {ω : X(ω) ≤ t}.
Thus F_X has a jump of height h > 0 at t if and only if P{X = t} = h, and it is continuous at t if and only if P{X = t} = 0.
The distribution law of X is the probability P_X on B(R) defined by P_X(B) = P{X ∈ B}. Its σ-additivity follows from that of P: if B_1, B_2, . . . are disjoint with pre-images A_j = X^{-1}(B_j) and A = X^{-1}(⋃ B_j), then
    P_X(⋃_{j=1}^∞ B_j) = P(A) = Σ_{j=1}^∞ P(A_j) = Σ_{j=1}^∞ P_X(B_j).
Example. Roll a die twice and let X be the sum of the two results. Then
    P_X({2}) = P{X = 2} = 1/36,   P_X({3}) = P{X = 3} = 2/36,   F_X(3) = 3/36,   P_X([2, 4]) = 1/36 + 2/36 + 3/36 = 1/6.
2.2

For a discrete random variable X with values in an at most countable set D and mass function f(t) = P{X = t}, t ∈ D, we have:
3. F_X(s) = Σ_{t ≤ s, t ∈ D} f(t).
4. P_X(B) = Σ_{t ∈ B ∩ D} f(t).

Example. Let P{X = k} = 2^{-k} for k ∈ N. Then
    f(t) = 2^{-t} if t ∈ N,   f(t) = 0 otherwise.
What are the distribution function F_X and the distribution law P_X(B) of this X? For t ≥ 1,
    F_X(t) = Σ_{k=1}^{[t]} 2^{-k} = (1/2) · (1 - (1/2)^{[t]})/(1 - 1/2) = 1 - 2^{-[t]},
and
    P_X(B) = Σ_{k ∈ B ∩ N} 2^{-k}.
For a continuous random variable X with density f:
3. P{a ≤ X ≤ b} = ∫_a^b f(u) du.
4. (d/dt) F_X(t) = f(t) if f is continuous at t.
Example 2.6. If
    F_X(t) = 0 for t ≤ 0,   F_X(t) = 1 - e^{-t^λ} for t > 0,
then
    f(t) = 0 for t ≤ 0,   f(t) = λ t^{λ-1} e^{-t^λ} for t > 0.
Example 2.7 (Lifetime of a light bulb). Let X = t if the bulb burns out at time t > 0. What is the probability P(X ≥ t)? It is the probability that the bulb is still burning at time t, and it is given by P(X ≥ t) = e^{-λt}; the lifetime distribution is determined by the parameter λ. Then
    F_X(t) = P{X ≤ t} = 1 - P{X ≥ t} = 1 - e^{-λt},   t > 0,
and the density is
    f(t) = λ e^{-λt},   t > 0.
This is called the exponential distribution with parameter λ.
One important question: are discrete and continuous random variables the only random variables? The answer is no; here are some examples:
1. First of all we may have random variables which are a mixture of a discrete and of a continuous random variable.
   Example 2.8. Let X be the lifetime of a light bulb. At a certain time T > 0 we switch the bulb off, provided it is still burning. Then
       F_X(t) = 1 - e^{-λt} for 0 ≤ t < T,   F_X(t) = 1 for t ≥ T.
2. If F_X is continuous, does this imply that X is continuous, i.e. can any continuous, non-decreasing distribution function be recovered from its derivative? The answer is no; take the devil's staircase for example.
Definition 2.7 (Binomially distributed random variables). Let X : Ω → {0, . . . , n} and 0 ≤ p ≤ 1. Then X is B_{n,p} distributed if
    P{X = k} = C(n, k) p^k (1 - p)^{n-k},
so that
    f(k) = C(n, k) p^k (1 - p)^{n-k},   k = 0, . . . , n.
This describes an experiment with n trials where each trial can be a failure (0) or a success (1). X = k means k times success, and p is the probability of success in each trial.
Example 2.9. An airline has a plane with m seats and sells n > m tickets in order to minimize the risk of empty seats. Let p be the probability that a passenger shows up, so that 1 - p is the probability of a passenger not showing up. Let X_n be the number of passengers that show up when selling n tickets. Then X_n is B_{n,p} distributed, and the company has to find the maximal n > m such that the overbooking probability
    P{X_n > m} = Σ_{k=m+1}^n P{X_n = k} = Σ_{k=m+1}^n C(n, k) p^k (1 - p)^{n-k}
stays below an acceptable bound.
2.3 Random Vectors

A random vector is a map X = (X_1, . . . , X_n) : Ω → R^n whose coordinates X_j, 1 ≤ j ≤ n, are random variables. Its distribution law P_X is a probability on B(R^n). Note that
    P_X([a_1, b_1] × · · · × [a_n, b_n]) = P(X ∈ [a_1, b_1] × · · · × [a_n, b_n]) = P(a_1 ≤ X_1 ≤ b_1, . . . , a_n ≤ X_n ≤ b_n).
If n = 1, then the distribution function F_X and the distribution law P_X are tightly related via
    F_X(t) = P_X((-∞, t])  and  P_X((a, b]) = P{a < X ≤ b} = F_X(b) - F_X(a).
Let t = (t_1, . . . , t_n). For a random vector X define
    F_X(t) = P{X_1 ≤ t_1, . . . , X_n ≤ t_n}.
Then F_X : R^n → [0, 1] is the distribution function. The relation between F_X and P_X now becomes much more complicated. If n = 2, then
    P{a_1 < X_1 ≤ b_1, a_2 < X_2 ≤ b_2} = F_X(b_1, b_2) - F_X(a_1, b_2) - F_X(b_1, a_2) + F_X(a_1, a_2),
and the bigger n becomes, the more complicated is the relation between F_X and P_X.
Conclusion: distribution functions are a helpful tool if n = 1. If n > 1, then they are less useful; there are only a few cases where we consider distribution functions.
Definition 2.10 (Special Cases).
(a) Call X = (X_1, . . . , X_n) discrete if there is an at most countable set D ⊆ R^n such that X : Ω → D. The joint probability mass function is defined for t ∈ D by f(t) = P{X = t}, and
       P_X(B) = P{X ∈ B} = Σ_{t ∈ D ∩ B} f(t).
(b) X is continuous with joint density f if
       P{X_1 ≤ t_1, . . . , X_n ≤ t_n} = ∫_{-∞}^{t_1} · · · ∫_{-∞}^{t_n} f(x) dx.
Example 2.11.
1. Consider an urn with four balls, two marked 0 and two marked 1. Take out two balls without replacement. Let X_1 be the value of the first ball and X_2 the value of the second. Then X = (X_1, X_2) is a 2-dimensional random vector. The possible values are D = {(0, 0), (0, 1), (1, 0), (1, 1)}. What is f(0, 0)? It is
       f(0, 0) = P{X_1 = 0, X_2 = 0} = (2/4)(1/3) = 1/6,
   and similarly f(1, 1) = 1/6 and f(0, 1) = f(1, 0) = 1/3. If instead we draw with replacement, each of the four pairs has probability 1/4.
2. Roll a die twice; ω = (ω_1, ω_2). Let X_1 = min{ω_1, ω_2} and X_2 = max{ω_1, ω_2}. (Draw the tables of the joint mass function yourself.)
3. Consider the case where we throw a dart at the unit circle K_1, with X = (X_1, X_2) the coordinates of the point where it lands. Then
       f(u, v) = 1/π if u² + v² ≤ 1,   f(u, v) = 0 otherwise,
   and
       P_X(B) = ∫_B f(x) dx = vol_2(B ∩ K_1)/π.
2.4

Let X = (X_1, . . . , X_n) be a random vector with P_X = P_{(X_1,...,X_n)}. Recall that the joint distribution P_X is uniquely described by
    P_X(B_1 × · · · × B_n) = P{X_1 ∈ B_1, . . . , X_n ∈ B_n}.
On the other hand, we also have the n probabilities
    P_{X_j}(B) = P{X_j ∈ B},
which are called the marginal distributions. Recall that they are probabilities on (R, B(R)) while the joint distribution is a probability on (R^n, B(R^n)).

Proposition 2.6. P_{X_j}(B) = P_{(X_1,...,X_n)}(R × · · · × B × · · · × R), with B in the j-th coordinate.

Proof. P_{X_j}(B) = P{X_1 ∈ R, . . . , X_j ∈ B, . . . , X_n ∈ R} = P_{(X_1,...,X_n)}(R × · · · × B × · · · × R).
Consequences: In the discrete case with n = 2, consider (X, Y). Then P_{(X,Y)} is the joint distribution and P_X, P_Y are the marginals. Say
    X : Ω → D = {x_1, x_2, . . .},   Y : Ω → E = {y_1, y_2, . . .},
so that
    (X, Y) : Ω → D × E = {(x_i, y_j) : i, j ∈ N}.
Then the probability mass function f is
    f(x_i, y_j) = P{X = x_i, Y = y_j},   i, j ∈ N,
and the marginal mass functions are
    f_X(x_i) = Σ_{j=1}^∞ f(x_i, y_j)  and  f_Y(y_j) = Σ_{i=1}^∞ f(x_i, y_j).
Proof: since E = ⋃_{j=1}^∞ {y_j} is a disjoint union, P{X = x_i} = Σ_{j=1}^∞ P{X = x_i, Y = y_j}.
For example, for two independent fair 0-1 tosses the joint mass function is given by the table

              Y = 0   Y = 1
       X = 0   1/4     1/4
       X = 1   1/4     1/4
In the continuous case with joint density f,
    P(X ≤ t, Y ≤ s) = ∫_{-∞}^t ∫_{-∞}^s f(u, v) dv du,
and the marginal densities are obtained by integrating out the other variable:
    f_X(u) = ∫_{-∞}^∞ f(u, v) dv is a density of X,   f_Y(v) = ∫_{-∞}^∞ f(u, v) du is a density of Y.
Proof sketch: P{X ≤ t} = ∫_{-∞}^t (∫_{-∞}^∞ f(u, v) dv) du, and the inner integral is f_X(u).
For the dart thrown at the unit circle,
    f_X(u) = (1/π) ∫_{v² ≤ 1-u²} 1 dv = (2/π) √(1 - u²),   -1 ≤ u ≤ 1.

2.5
Definition (Independence of random variables). The random variables X_1, . . . , X_n are called independent if
    P{X_1 ∈ B_1, . . . , X_n ∈ B_n} = P{X_1 ∈ B_1} · · · P{X_n ∈ B_n}  for all B_j ∈ B(R).   (12)

Remark.
(1) It suffices to satisfy (12) for B_j = (-∞, t_j]. In different words, the X_j are independent if and only if for all t_1, . . . , t_n ∈ R we have
    P{X_1 ≤ t_1, . . . , X_n ≤ t_n} = P{X_1 ≤ t_1} · · · P{X_n ≤ t_n}.
(2) X_1, . . . , X_n are independent if and only if X_1^{-1}(B_1), . . . , X_n^{-1}(B_n) are independent as events for all B_1, . . . , B_n ∈ B(R).²

² Why is this not a trivial assertion? Recall how the independence of n events was defined.
In the discrete case, X and Y are independent if and only if f(x_i, y_j) = f_X(x_i) f_Y(y_j) for all i, j. Indeed, summing over x_i ∈ B_1 and y_j ∈ B_2 gives
    P{X ∈ B_1, Y ∈ B_2} = Σ_{x_i ∈ B_1, y_j ∈ B_2} f_X(x_i) f_Y(y_j) = (Σ_{x_i ∈ B_1} f_X(x_i)) (Σ_{y_j ∈ B_2} f_Y(y_j)) = P{X ∈ B_1} P{Y ∈ B_2}.
Example 2.15.
1. Roll a die twice. Let X be the result of the first roll and Y the result of the second. Then
       f(x_i, y_j) = 1/36 = f_X(x_i) f_Y(y_j),
   so X and Y are independent.
2. Let P{X = k} = P{Y = k} = 2^{-k}, k = 1, 2, . . . , with X and Y independent, so that P{X = k, Y = l} = 2^{-(k+l)}. We want to find
       P{|X - Y| ≤ 1} = Σ_{k=1}^∞ P{X = k, |Y - k| ≤ 1}
                      = Σ_{k=1}^∞ [P{X = k, Y = k+1} + P{X = k, Y = k} + P{X = k, Y = k-1}]
                      = Σ_{k=1}^∞ 2^{-(2k+1)} + Σ_{k=1}^∞ 2^{-2k} + Σ_{k=2}^∞ 2^{-(2k-1)}
                      = 1/6 + 1/3 + 1/6 = 2/3.
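A numeric check of the three geometric series above (Python; 200 terms truncate the tails far below float precision):

    s = sum(2 ** -(2*k + 1) for k in range(1, 200)) \
      + sum(2 ** -(2*k) for k in range(1, 200)) \
      + sum(2 ** -(2*k - 1) for k in range(2, 200))
    print(s)   # ~ 1/6 + 1/3 + 1/6 = 0.666...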
3. Say you have a massive ball of dough in which the raisins are uniformly distributed. Take out a piece of dough; what is the number of raisins in it? Let X be the number of raisins in one pound of bread. A good model is to assume that P_X is a Poisson distribution with parameter λ > 0,
       P{X = k} = (λ^k / k!) e^{-λ},   k = 0, 1, 2, . . . .
   Now take another pound of bread and let Y be the number of raisins in that pound, independent of X. What is P{X = Y}?
       P{X = Y} = Σ_{k=0}^∞ P{X = k, Y = k} = Σ_{k=0}^∞ (λ^{2k} / (k!)²) e^{-2λ}.
Proposition 2.9 (Extension to n random variables). Let
    X_j : Ω → D_j,   1 ≤ j ≤ n,
hence
    (X_1, . . . , X_n) : Ω → D_1 × · · · × D_n.
Then X_1, . . . , X_n are independent if and only if
    P{X_1 = d_1, . . . , X_n = d_n} = P{X_1 = d_1} · · · P{X_n = d_n}
for all d_j ∈ D_j.

Example 2.16. Toss a coin with sides 0 and 1 n times. Let X_j be the result of the j-th toss. Suppose P{X_j = 0} = 1 - p and P{X_j = 1} = p. Then, given x_j ∈ {0, 1}, it follows that
    P{X_1 = x_1, . . . , X_n = x_n} = p^k (1 - p)^{n-k},
where exactly k of the x_j satisfy x_j = 1. On the other hand,
    P{X_1 = x_1} · · · P{X_n = x_n} = p^k (1 - p)^{n-k} = P{X_1 = x_1, . . . , X_n = x_n}.
Thus X_1, . . . , X_n are independent.
2.6

Suppose now that X and Y are continuous with densities f_X and f_Y. If X and Y are independent, then
    P{X ≤ t, Y ≤ s} = P{X ≤ t} P{Y ≤ s} = (∫_{-∞}^t f_X(u) du)(∫_{-∞}^s f_Y(v) dv) = ∫_{-∞}^t ∫_{-∞}^s f_X(u) f_Y(v) dv du.
Thus f_X(u) f_Y(v) is the joint density of (X, Y). The other direction is proved in exactly the same way: X and Y are independent if and only if the joint density factorizes.
Example 2.17. Consider the lifetime of a bulb; let X = t denote that the bulb burns out at time t, with distribution
    P{X ≤ t} = ∫_0^t λ e^{-λu} du.
Let Y denote the lifetime of a second, independent bulb. The joint density of (X, Y) is then λ² e^{-λ(u+v)}. What is
    P{Y ≥ X + 1} = P{(X, Y) ∈ {(u, v) : v ≥ u + 1, u > 0}}?
We compute
    P{Y ≥ X + 1} = ∫∫_{v ≥ u+1, u > 0} λ² e^{-λ(u+v)} du dv
                 = ∫_0^∞ λ e^{-λu} (∫_{u+1}^∞ λ e^{-λv} dv) du
                 = ∫_0^∞ λ e^{-λu} e^{-λ(u+1)} du = e^{-λ}/2.
Example 2.18. Take (X, Y) to be the coordinates of a point in the unit circle, with density
    f(u, v) = 1/π if u² + v² ≤ 1,   f(u, v) = 0 otherwise.
Then f_X(u) = ∫ f(u, v) dv = (2/π)√(1 - u²) and f_Y(v) = ∫ f(u, v) du = (2/π)√(1 - v²). Thus f(u, v) ≠ f_X(u) f_Y(v), so X and Y are not independent. Indeed, for X and Y to be independent, the support of the joint density must be a rectangle.
3.1

(a) Uniformly distributed random variables: Let X : Ω → D = {x_1, . . . , x_N} with
    f(x_j) = P{X = x_j} = 1/N,   j = 1, . . . , N,  so that  Σ_{j=1}^N f(x_j) = 1,
and
    P_X(B) = P{X ∈ B} = Σ_{x_j ∈ B} f(x_j) = #(B ∩ D)/#D,   F_X(t) = Σ_{x_j ≤ t} f(x_j).
Example: roll a die n times, so Ω = {1, . . . , 6}^n with the uniform distribution P({ω}) = 1/6^n. Let X_j(ω) = ω_j be the j-th result; then
    P{X_j = k} = 6^{n-1}/6^n = 1/6,   k = 1, . . . , 6.

We say X and Y are identically distributed, written X =_d Y, if P_X = P_Y.
Remark. Note that X and Y need not be defined on the same probability space in order that X =_d Y. Roll a die twice and take X as the first result. Now roll it four times and define Y as the last result. Then X =_d Y although they are defined on different probability spaces.
(b) Binomially distributed random variables: see Definition 2.7. For example, suppose an exam is passed if at least 60 of 100 questions are answered correctly, each question being answered correctly with probability p, independently. Then
    P(pass) = Σ_{k=60}^{100} C(100, k) p^k (1 - p)^{100-k}.
For example, in order that P(pass) ≥ 0.75, the success probability p has to satisfy p > 0.6274.
(c) Poisson distributed random variables: D = N_0 = {0, 1, 2, . . .} with λ > 0. We say X ~ Pois_λ (Poisson distributed with parameter λ > 0) if
    P{X = k} = (λ^k / k!) e^{-λ},   k = 0, 1, 2, . . . .
Note that
    Σ_{k=0}^∞ (λ^k / k!) e^{-λ} = e^λ e^{-λ} = 1.
Lemma 3.1. x_n → x in R implies
    (1 + x_n/n)^n → e^x,   n → ∞.

Proof. Take the algebraic identity
    |a^n - b^n| = |a - b| · |a^{n-1} + a^{n-2} b + · · · + b^{n-1}|.
Suppose |a|, |b| ≤ c < ∞. Then
    |a^n - b^n| ≤ |a - b| · n c^{n-1}.
We know x_n → x, so |x_n| ≤ 1 + |x| =: d for n large enough. Apply the bound with a = 1 + x_n/n and b = 1 + x/n; in addition |a|, |b| ≤ 1 + d/n, so
    |a^n - b^n| ≤ (|x - x_n|/n) · n (1 + d/n)^{n-1} ≤ C |x - x_n| → 0,
since (1 + d/n)^{n-1} ≤ e^d is bounded. Because (1 + x/n)^n → e^x, the claim follows.
Theorem 3.1 (Poisson's Limit Theorem). Let X_n be B_{n,p_n} distributed and suppose
    lim_{n→∞} n p_n = λ > 0.
Then
    P{X_n = k} → (λ^k / k!) e^{-λ},   n → ∞.

Proof. We have
    P{X_n = k} = C(n, k) p_n^k (1 - p_n)^{n-k} = (n!/(k!(n-k)!)) p_n^k (1 - p_n)^{n-k}
               = (1/k!) · (n(n-1) · · · (n-k+1)/n^k) · (n p_n)^k (1 - p_n)^n (1 - p_n)^{-k}.
Since n p_n → λ > 0, it follows that p_n → 0 as n → ∞, hence we get
(i) n(n-1) · · · (n-k+1)/n^k → 1,
(ii) (n p_n)^k → λ^k,
(iii) (1 - p_n)^{-k} → 1,
(iv) employing the previous lemma, (1 - p_n)^n = (1 + (-n p_n)/n)^n → e^{-λ}, as n p_n → λ.
Thus for any k ∈ N_0 we have
    lim_{n→∞} P{X_n = k} = (λ^k / k!) e^{-λ}.
Example 3.3. Let X be the number of people among N having a birthday today. Then X is B_{N,1/365} distributed and
    P{X ≥ 2} = 1 - P{X ≤ 1} = 1 - (364/365)^N - N · (1/365) · (364/365)^{N-1}.      (13)
Using the above approximation with λ = N/365,
    P{X ≥ 2} ≈ 1 - e^{-N/365} - (N/365) e^{-N/365}.                                (14)
For example, if N = 500, then (13) gives P{X ≥ 2} = 0.397895 while (14) equals 0.397719.
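The two numbers quoted for N = 500 are easy to reproduce (Python, standard library only):

    from math import exp

    N, q = 500, 1/365
    exact = 1 - (1 - q)**N - N * q * (1 - q)**(N - 1)       # formula (13)
    lam = N / 365
    approx = 1 - exp(-lam) - lam * exp(-lam)                # formula (14)
    print(exact, approx)   # 0.397895...  0.397719...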
Remark. Poisson's Limit Theorem also explains why Poisson distributed random variables appear in cases where there are many trials, each having a small success probability. For example, there are many cars in the city (each one driven is a trial), but the probability that a fixed car has an accident on a given day is very small. This explains why the number of accidents per day may be described by a Poisson distributed random variable.
(e) Negative binomial random variables: Execute independent trials with success probability p until the n-th success appears, and let X count the failures observed on the way. A sequence with n successes (ones) and k failures (zeros), ending with a success, is determined by the positions of the zeros among the first n + k - 1 trials; there are exactly C(n+k-1, k) ways to distribute the n - 1 ones and k zeros. So we say X is B⁻_{n,p}, negative binomial distributed with parameters n and p, if
    P{X = k} = C(n+k-1, k) p^n (1 - p)^k,   k = 0, 1, . . . .
If n = 1, then
    P{X = k} = p (1 - p)^k,   k = 0, 1, . . . ,
and we say that X is modified geometric distributed. Note that in this case Y := X + 1 is geometric distributed with parameter p. Indeed, we have
    P{Y = k} = P{X = k - 1} = p (1 - p)^{k-1},   k = 1, 2, . . . .
In different words, X is modified geometric if X = k says that in the (k+1)-st trial we have success for the first time. So everything is shifted by one.
Let us come back to the general case. We can write
    C(n+k-1, k) = (n+k-1)(n+k-2) · · · n / k!
                = (-1)^k (-n)(-n-1)(-n-2) · · · (-n-k+1) / k!
                = (-1)^k C(-n, k),
where C(-n, k) is the generalized binomial coefficient.
Hence the negative binomial probabilities may be written as
    P{X = k} = C(-n, k) p^n (p - 1)^k,
and
    Σ_{k=0}^∞ P{X = k} = p^n Σ_{k=0}^∞ C(-n, k) (p - 1)^k = p^n · (1 + (p - 1))^{-n} = 1.
This is found by computing the Taylor expansion of 1/(1 + x)^n, namely (1 + x)^{-n} = Σ_{k=0}^∞ C(-n, k) x^k for |x| < 1.
Example 3.6 (Banach's Matchbox Problem). Two matchboxes M_1 and M_2 each contain N matches. At each step a match is removed from either box with probability p = 1/2. Eventually we take the last match out of M_1 or M_2. How many matches remain in the other box?
For each m there are two cases: M_2 is emptied first while M_1 has m matches, or M_1 is emptied first while M_2 has m matches. Consider the first case and take
    X = m, if m matches are left in M_1.
Let success be choosing M_2. Taking out the last match from M_2 means that we have success for exactly the N-th time. In order that there are still m matches in M_1, this N-th success has to happen in the (2N - m)-th trial. The probability for this is
    P{X = m} = C(2N-m-1, N-m) (1/2)^N (1/2)^{N-m},   m = 1, . . . , N.
The event that after 2N - m trials M_1 is empty has the same probability (and the two events are disjoint). Thus the probability that, when taking out the last match from one box, the other one contains m matches, equals
    2 · C(2N-m-1, N-m) (1/2)^N (1/2)^{N-m} = C(2N-m-1, N-m) (1/2)^{2N-m-1}.
Remark. Since some m = 1, . . . , N has to appear, the last result tells us
    Σ_{m=1}^N C(2N-m-1, N-m) (1/2)^{2N-m-1} = 1.
Letting k = N - m, we get
    Σ_{k=0}^{N-1} C(N+k-1, k) (1/2)^k = 2^{N-1}.
Remark. In the literature one also finds Banach's Matchbox Problem in a slightly modified form: one asks for the probability of having m matches left in the other box when choosing a box which turns out to be empty. Note that here m = 0 may occur, namely if both boxes become empty at the same time.
(f) Hypergeometric Random Variables
A retailer delivers N machines, among them M defective. Choose n of the N machines and check them. What is the probability that among the n tested, m are defective? Counting, this becomes
    P{X = m} = C(M, m) C(N-M, n-m) / C(N, n),   m = 0, . . . , n.
We say that the random variable X here is hypergeometric distributed with parameters N, M and n.

Example 3.7. In an urn are M white balls and N - M black ones. Take out n balls in a single draw. Let X be the number of observed white balls. Then X is hypergeometric with parameters N, M, and n.
What happens if we take out the n balls one after the other with replacement? The probability for a white ball is p = M/N in each draw, so
    P{X = m} = C(n, m) (M/N)^m (1 - M/N)^{n-m}.
Example 3.8. Lottery of 49 numbers, 6 chosen without repetition; there are 6 numbers written on the lottery ticket. What is the probability that exactly three of them were chosen and three not? Let X = m if m numbers of my ticket were chosen. Then
    P{X = m} = C(6, m) C(43, 6-m) / C(49, 6),   m = 0, . . . , 6,
i.e., X is hypergeometric distributed with parameters N = 49, M = 6 and n = 6; the answer is the value at m = 3.
2. Size Estimation: Suppose we have the following situation: we know M and n and observe m. Is it possible to get some information about N?
Let us illustrate this problem by an example. Say there are N fish in a pond, N unknown. Catch M, mark them, put them back and wait. Then catch n and observe m marked ones. We investigate the likelihood
    N ↦ C(M, m) C(N-M, n-m) / C(N, n)
and take as estimator the value N̂ maximizing it, which turns out to be
    N̂(m) = [M n / m].
Say M = 50 and n = 30. If m = 7 we estimate N̂ = 214. If m = 2 then N̂ = 750. If m = 16 then N̂ = 93.
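A brute-force maximum-likelihood reading of the fish example (Python; the search bound N_max = 2000 is an arbitrary choice, and float rounding can pick either of two tied maximizers when Mn/m is an integer):

    from math import comb

    def likelihood(N, M, n, m):
        return comb(M, m) * comb(N - M, n - m) / comb(N, n)

    def estimate_N(M, n, m, N_max=2000):
        best_N, best_L = None, -1.0
        for N in range(M + (n - m), N_max):
            L = likelihood(N, M, n, m)
            if L >= best_L:
                best_N, best_L = N, L
        return best_N

    for m in (7, 2, 16):
        print(m, estimate_N(50, 30, m), 50 * 30 // m)   # estimates ~ [Mn/m]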
(g) Multinomial Distributed Random Vectors
We have m boxes B_1, . . . , B_m and place, independently one after the other, n balls into these m boxes. The probability of placing a single ball into box B_j is p_j, where 0 ≤ p_j ≤ 1 and p_1 + · · · + p_m = 1. Let X_j be the number of balls in B_j. Then X_1, . . . , X_m are random variables. What is the probability
    P{X_1 = k_1, . . . , X_m = k_m}, where k_1 + · · · + k_m = n?
This can be modeled by a random vector X = (X_1, . . . , X_m) that takes values in the set
    D = {(k_1, . . . , k_m) : k_1 + · · · + k_m = n}.
Then
    P{X_1 = k_1, . . . , X_m = k_m} = (n!/(k_1! · · · k_m!)) p_1^{k_1} · · · p_m^{k_m},   k_1 + · · · + k_m = n.
Compare with the multinomial theorem:
    (a_1 + · · · + a_m)^n = Σ_{k_1+···+k_m = n} (n!/(k_1! · · · k_m!)) a_1^{k_1} · · · a_m^{k_m}.
Example 3.9. Suppose n ≤ m, i.e. the number of balls is smaller than the number of boxes, and p_1 = · · · = p_m = 1/m. What is the probability that each of the first n boxes B_1, . . . , B_n contains exactly one ball? Then k_1 = · · · = k_n = 1 and k_{n+1} = · · · = k_m = 0, so the probability is
    n!/m^n.
Since there are C(m, n) ways to fix n of the m boxes, the probability that each of the m boxes contains at most one ball is
    C(m, n) · n!/m^n.
Example 3.10. A train with 3 coaches arrives and 6 passengers enter the train. Find the probability that there are exactly 2 persons in each coach. We have m = 3 and n = 6 with p_1 = p_2 = p_3 = 1/3. Then
    P{X_1 = 2, X_2 = 2, X_3 = 2} = (6!/(2! 2! 2!)) (1/3)^6 = 10/81.
Remark. Suppose the vector X = (X_1, . . . , X_m) is multinomial distributed with parameters n and p_1, . . . , p_m.
Question: What are the marginal distributions, that is, the distributions of the X_j?
Answer: Each X_j is B_{n,p_j} distributed.
3.2

Recall, for a continuous random variable X whose distribution has density f, we have
    P{X ≤ t} = ∫_{-∞}^t f(u) du,
where f is a density: f(u) ≥ 0 and ∫_{-∞}^∞ f(u) du = 1. From this we deduce
    P{a ≤ X ≤ b} = ∫_a^b f(u) du.

(a) Uniform distribution on an interval: X ~ U[α, β] has density 1/(β-α) on [α, β] and distribution function
    F(t) = 0 for t < α,   F(t) = (t - α)/(β - α) for α ≤ t ≤ β,   F(t) = 1 for t > β.
Basically the uniform distribution says that the probability of an event B is just the length of B ∩ [α, β] divided by the length of [α, β], i.e. the Lebesgue measure of the event in [α, β] normalized by β - α.
Example 3.11. Say a train leaves the station every 30 minutes and X is the time since the last departure, X ~ U[0, 30]. What is the probability that we wait more than 20 minutes? If we wait more than 20 minutes, then the last train must have departed within the first 10 minutes, so this probability is
    P{0 ≤ X ≤ 10} = ∫_0^{10} (1/30) dt = 1/3.
(b) Uniform distribution on R^n: Given a bounded and closed set K ⊆ R^n with Lebesgue measure λ_n(K) > 0, we write X ~ U(K), and say X is uniformly distributed on K, if
    P{X ∈ B} = λ_n(B ∩ K)/λ_n(K).
Example 3.13. Two friends visit a bar, independently, between 1 and 2 o'clock. After arrival each of them waits 20 minutes. Find the probability that the two friends meet each other.
If X_1 is the arrival time of the first friend and X_2 that of the second, the vector X = (X_1, X_2) is uniformly distributed on K = [1, 2]². They meet each other if X ∈ A where (after shifting to [0, 1]²)
    A = {(x_1, x_2) ∈ [0, 1]² : |x_1 - x_2| ≤ 1/3}.
Hence P{X ∈ A} = λ_2(A) = 1 - (2/3)² = 5/9.
Example 3.14 (Buffon's Needle Test). Take a needle of length a < 1 and throw it on a lined sheet of paper, where the distance between two lines is 1. Find the probability that the needle cuts a line.
What is random in this experiment? Choose the two lines such that the midpoint of the needle lies between them. Let x ∈ [0, 1] be the distance of the midpoint of the needle to the lower line. Furthermore, denote by φ ∈ [-π/2, π/2] the angle of the needle to a line perpendicular to the lines on the paper. For example, if φ = 0, then the needle is perpendicular to the lines on the paper, while for φ = ±π/2 it lies parallel.
Hence, to throw a needle randomly is equivalent to choosing a point (φ, x) uniformly distributed in K = [-π/2, π/2] × [0, 1].
The needle cuts the lower line if and only if
    (a/2) cos φ ≥ x,
and the upper line if and only if (a/2) cos φ ≥ 1 - x. If
    A = {(φ, x) ∈ [-π/2, π/2] × [0, 1] : x ≤ (a/2) cos φ  or  1 - x ≤ (a/2) cos φ},
then we get
    P{the needle cuts a line} = P(A) = λ_2(A)/λ_2(K) = λ_2(A)/π.
But it follows that
    λ_2(A) = 2 ∫_{-π/2}^{π/2} (a/2) cos φ dφ = 2a,
hence
    P(A) = 2a/π.
(c) Normally distributed random variables: Recall that
    ∫_{-∞}^∞ e^{-x²/2} dx = √(2π),
so (1/√(2π)) e^{-x²/2} is a probability density; a random variable X with this density is called standard normal, X ~ N(0, 1).

Proposition. (a) If X ~ N(0, 1), then Y = σX + μ ~ N(μ, σ²). (b) Conversely, if Y ~ N(μ, σ²), then (Y - μ)/σ ~ N(0, 1).

Proof. We only prove part (a); the second assertion is proved in the same way.
    P{Y ≤ t} = P{σX + μ ≤ t} = P{X ≤ (t - μ)/σ} = (1/√(2π)) ∫_{-∞}^{(t-μ)/σ} e^{-u²/2} du.
Letting s = σu + μ we see that
    P{Y ≤ t} = (1/(σ√(2π))) ∫_{-∞}^t e^{-(s-μ)²/(2σ²)} ds.

Conclusion: If Y ~ N(μ, σ²), then
    P{a ≤ Y ≤ b} = Φ((b - μ)/σ) - Φ((a - μ)/σ),
where Φ denotes the distribution function of N(0, 1).
Tail behavior: by L'Hôpital's rule,
    lim_{t→∞} (1 - Φ(t)) / ((1/(t√(2π))) e^{-t²/2}) = lim_{t→∞} 1/(1 + 1/t²) = 1,
i.e. 1 - Φ(t) behaves like (1/(t√(2π))) e^{-t²/2} as t → ∞.
What is the n-dimensional standard normal? Suppose X_1, . . . , X_n are all independent and N(0, 1). Then if X = (X_1, . . . , X_n) and f : R^n → [0, ∞) is the density of X, we see that
    f(s_1, s_2, . . . , s_n) = f_1(s_1) f_2(s_2) · · · f_n(s_n) = (1/(2π)^{n/2}) e^{-|s|²/2},
where s = (s_1, . . . , s_n) and |s|² = s_1² + · · · + s_n².
(e) Gamma Distributed Random Variables:
Recall the Gamma function,
    Γ(s) = ∫_0^∞ u^{s-1} e^{-u} du,   s > 0.
Properties:
(1) Γ is continuous on (0, ∞).
(2) Γ(s + 1) = s Γ(s).
(3) Γ(n) = (n - 1)!.
(4) Γ(1/2) = √π.
Proof of (4): substituting u = v²/2,
    Γ(1/2) = ∫_0^∞ u^{-1/2} e^{-u} du = √2 ∫_0^∞ e^{-v²/2} dv = √2 · √(2π)/2 = √π.
0
(
0
1
u1 eu/
()
:u0
:u>0
f, (u)du ,
0
Special Cases:
0 t < .
44
eu du
P{X t + s}
e(t+s)
= s = et = P{X t}
P{X s}
e
(ii) Exp_{λ,n} := Γ_{1/λ,n} is the Erlang distribution, with density
       (λ^n/(n-1)!) u^{n-1} e^{-λu},   u > 0.
   To see where this applies, take n bulbs with the same lifetime distribution. Switch on the first bulb; when it burns out, replace it by the second, and so on. Then the time X_n at which the n-th bulb burns out is Erlang distributed.
(iii) χ²_n = Γ_{2,n/2} is the chi-square distribution with n degrees of freedom. So the density is
       (2^{-n/2}/Γ(n/2)) u^{n/2-1} e^{-u/2},   u > 0.
Recall also the Beta function,
    B(x, y) = ∫_0^1 s^{x-1} (1 - s)^{y-1} ds = Γ(x)Γ(y)/Γ(x + y),   x, y > 0;
in particular B(n, m) = (n-1)!(m-1)!/(n+m-1)!.
(f) Cauchy distributed random variables: X is Cauchy if it has density 1/(π(1 + u²)), so that
    F_X(t) = ∫_{-∞}^t (1/π) · 1/(1 + u²) du = (1/π) [arctan(t) + π/2],   t ∈ R.
4.1

Let X be a discrete random variable with values in D and mass function f, and let φ be a function on D. Then Y = φ(X) is again a random variable, with
    P{φ(X) = y_k} = Σ_{d ∈ D : φ(d) = y_k} f(d).
Example. If X is uniformly distributed on {-3, -2, -1, 0, 1, 2, 3} with P{X = k} = 1/7, then
    P{X² = 0} = 1/7,   P{X² = 1} = 2/7,   P{X² = 4} = 2/7,   P{X² = 9} = 2/7.
Example 4.3 (Simple Random Walk). Say there is a drunken sailor who takes steps of exactly length 1 to the right with probability p and to the left with probability 1 - p. Let X_j be the value of the j-th step, either +1 or -1. Then S_n = X_1 + · · · + X_n is the place where the sailor ends up after taking n random steps. We have S_n ∈ {-n, -n + 2, . . . , n - 2, n}. Let φ(s) = (s + n)/2 and
    Z_n = φ(S_n) = (S_n + n)/2 = (X_1 + 1)/2 + (X_2 + 1)/2 + · · · + (X_n + 1)/2.
Each (X_j + 1)/2 is 0 or 1 with P{(X_j + 1)/2 = 1} = p, so Z_n is B_{n,p} distributed, and
    P{S_n = m} = P{Z_n = (m + n)/2} = C(n, (m+n)/2) p^{(m+n)/2} (1 - p)^{(n-m)/2}.
To go even further, assume that p = 1/2, n is even and m = 0. Then
    P{S_n = 0} = C(n, n/2) 2^{-n} = (n!/[((n/2)!)²]) 2^{-n}.
Stirling's formula gives n! ≈ (n/e)^n √(2πn), so that
    P{S_n = 0} ≈ ((n/e)^n √(2πn)) / ((n/2e)^n πn · 2^n) = √(2/(πn)).
Now let X be continuous with density f, and let φ be strictly monotone and differentiable. Then Y = φ(X) has density
    g(t) = f(φ^{-1}(t)) · |(φ^{-1})'(t)| = f(φ^{-1}(t)) / |φ'(φ^{-1}(t))|.        (15)

Example 4.4.
1. X ~ N(0, 1) and Y = e^X, φ(s) = e^s. Then φ^{-1}(t) = ln(t). The density is
       g(t) = (1/(t√(2π))) e^{-(ln t)²/2},   t > 0,
   and g(t) = 0 for t ≤ 0.
2. Let U be uniform on [0, 1] and Y = 1/U with φ(s) = 1/s. Of course f(t) = 1_{[0,1]}(t). Then
       g(t) = 1_{[0,1]}(1/t)/t² = 1/t² if t ≥ 1, and 0 otherwise.
   Note that here φ is decreasing, hence the absolute value in (15) matters.
What do we do if φ is not monotone?
Idea: Try to find F_Y(t) = P{φ(X) ≤ t} directly. Then g(t) = F_Y'(t).
Example. Let X ~ N(0, 1) and Y = X². For t > 0,
    F_Y(t) = P{X² ≤ t} = P{-√t ≤ X ≤ √t} = (2/√(2π)) ∫_0^{√t} e^{-u²/2} du.
What is F_Y'(t)? This is
    F_Y'(t) = (1/√(2π)) t^{-1/2} e^{-t/2} = (1/(2^{1/2} Γ(1/2))) t^{1/2-1} e^{-t/2},
the density of Γ_{2,1/2} = χ²_1. Hence the square of a standard normal random variable is chi-square distributed with one degree of freedom.
Linear transformations: let Y = aX + b with a ≠ 0.
1. For a > 0:
       F_Y(t) = P{Y ≤ t} = P{X ≤ (t - b)/a} = F_X((t - b)/a).
2. For a < 0:
       F_Y(t) = P{aX ≤ t - b} = P{X ≥ (t - b)/a} = 1 - F_X((t - b)/a)  (when F_X is continuous).
In both cases, differentiating gives the density
    g(t) = (1/|a|) f((t - b)/a).                                   (16)
Examples:
1. Suppose X ~ N(0, 1) and Y = aX + b. Then
       g(t) = (1/(|a|√(2π))) e^{-(t-b)²/(2|a|²)},
   so that Y ~ N(b, |a|²).
2. Suppose X ~ Exp_λ, so that f(t) = λ e^{-λt} for t > 0. Let Y = aX for a > 0. Then
       g(t) = (1/a) f(t/a) = (λ/a) e^{-λt/a},
   so that Y ~ Exp_{λ/a}.
4.2

Say x ∈ [0, 1); then we can write x in binary form, i.e. x = 0.x_1 x_2 . . . where the x_j are either 0 or 1. Equivalently, we have
    x = Σ_{k=1}^∞ x_k/2^k.
To make the representation of x unique, we will not consider representations ending in all 1s.
Let U : Ω → [0, 1] be uniformly distributed, P{U ≤ t} = t for 0 ≤ t ≤ 1, and let X_j : Ω → {0, 1} be the binary digits of U, so that
    U(ω) = 0.X_1(ω)X_2(ω) . . . ,  i.e.  U(ω) = Σ_{j=1}^∞ X_j(ω)/2^j.
Theorem. The digits X_1, X_2, . . . are independent with P{X_j = 0} = P{X_j = 1} = 1/2.

Proof sketch. For fixed ε_1, . . . , ε_n ∈ {0, 1}, the event {X_1 = ε_1, . . . , X_n = ε_n} means that U lies in the dyadic interval I_{ε_1 . . . ε_n} of length 2^{-n}, hence
    P{X_1 = ε_1, . . . , X_n = ε_n} = 2^{-n}.
Summing over the 2^{n-1} choices of ε_1, . . . , ε_{n-1} gives
    P{X_n = ε_n} = Σ_{ε_1,...,ε_{n-1}} 2^{-n} = 2^{n-1}/2^n = 1/2,
and therefore
    P{X_1 = ε_1, . . . , X_n = ε_n} = 2^{-n} = P{X_1 = ε_1} · · · P{X_n = ε_n}.
Since n and the choice of ε_1, . . . , ε_n are arbitrary, this shows that X_1, . . . , X_n are independent.
Remark. The preceding theorem tells us that choosing a number u = 0.x_1 x_2 x_3 . . . uniformly in [0, 1] leads to an infinite sequence (x_1, x_2, . . .) of coin tosses. Every time, 0 and 1 appear with probability 1/2 and the results are independent of each other.
The next theorem shows that we may also go the other way round: we toss a coin infinitely often and obtain a number uniformly distributed on [0, 1]. So let us assume X_1, X_2, . . . are random variables with the two properties
(a) P{X_j = 0} = P{X_j = 1} = 1/2,
(b) X_1, X_2, . . . are independent.
Theorem. Let U = Σ_{j=1}^∞ X_j/2^j. Then U is uniformly distributed on [0, 1].

Proof. Fix t = 0.t_1 t_2 . . . ∈ [0, 1) and write s = 0.s_1 s_2 . . . for the digits of U. Then
    {U < t} = ⋃_{n=1}^∞ A_n(t),
where A_n(t) = {s : s_j = t_j, j < n, s_n = 0, t_n = 1}. When is A_n(t) ≠ ∅? This happens if and only if t_n = 1. Note also that A_n(t) is disjoint from A_m(t) for n ≠ m. Then we investigate
    P{U < t} = Σ_{t_n = 1} P{U ∈ A_n(t)}.                          (17)
Since A_n(t) fixes the first n digits, P{U ∈ A_n(t)} = 2^{-n}, so
    P{U < t} = Σ_{t_n = 1} 2^{-n} = Σ_{n=1}^∞ t_n/2^n = t.
Thus
    P{U ≤ t} = P{U < t} + P{U = t} = t,
so that U is uniformly distributed.
Remark. In view of the previous result we may generate a random number in [0, 1] by using a coin: take a fair coin with 0 and 1 and flip it N times. We get u = 0.x_1 x_2 . . . x_N where x_j represents the j-th coin toss. Then, if N is large enough, u is almost uniformly chosen from [0, 1]. What does this mean? It tells us that the probability of getting a u which is in [a, b] ⊆ [0, 1] is approximately b - a.
Say we want not only one number uniformly distributed on [0, 1] but n independent numbers u_1, . . . , u_n. How can one do this? The answer is very easy: toss not one coin but n, independently. The results are x_1^1, x_2^1, . . . up to x_1^n, x_2^n, . . . . Constructing u_1, . . . , u_n from these sequences, the numbers are uniformly distributed and independent.
Problem: Until now we are only able to construct uniformly distributed numbers on [0, 1] by tossing a coin. Is it also possible to find such numbers which possess other distributions?
Let us explain by an example why such a question may be of interest. Say you want to design a service machine. If the machine serves too quickly, the operating cost is too high; if it serves too slowly, not enough money is made. Say the random variable X_j describing the time between the j-th and the (j + 1)-st customer is Exp_λ distributed and the X_j are independent. So we have arrival times 0 < t_1 < t_2 < · · · with t_{j+1} - t_j independent and Exp_λ distributed. To simulate whether or not the machine is adequate, we have to test it with randomly chosen t_1, t_2, . . . having the desired properties.
Discrete case:
Construct a discrete random variable X so that for given x_k ∈ R and p_k ≥ 0 with Σ_{k=1}^∞ p_k = 1 we have
    P{X = x_k} = p_k,   k = 1, 2, . . . .
The answer: split [0, 1] into disjoint intervals I_k with
    [0, 1] = ⋃_{k=1}^∞ I_k,   |I_k| = p_k,                         (18)
and set
    X = Σ_{k=1}^∞ x_k 1_{I_k}(U).
Proof:
    P{X = x_k} = P{U ∈ I_k} = |I_k| = p_k.
Example 4.7. We want to construct a random variable X which is B_{n,p} distributed, i.e.
    P{X = k} = C(n, k) p^k (1 - p)^{n-k},   k = 0, . . . , n.
To this end split [0, 1] into intervals I_k, k = 0, . . . , n, where each length |I_k| = C(n, k) p^k (1 - p)^{n-k}. Then let
    X = Σ_{k=0}^n k 1_{I_k}(U).
Continuous case:
Given a function f with f ≥ 0 and ∫_{-∞}^∞ f(s) ds = 1, we want to construct a random variable X so that f is its density, i.e. for all t ∈ R,
    P{X ≤ t} = ∫_{-∞}^t f(s) ds.
Define F : R → [0, 1] by
    F(t) := ∫_{-∞}^t f(s) ds,                                      (19)
and let F⁻ be its pseudo-inverse,
    F⁻(s) := inf{t ∈ R : F(t) ≥ s},   0 < s < 1.

Remark. Of course, F⁻(s) is a well-defined real number if 0 < s < 1, and F⁻(0) = -∞. If F is one-to-one on a finite or infinite interval (a, b), then F⁻(s) = F^{-1}(s) if F(a) < s < F(b). This happens, for example, if f(t) > 0 for a < t < b.
We shall need the following properties of F⁻:

Lemma 4.1. For s ∈ (0, 1) and t ∈ R,
1. F(F⁻(s)) = s and F⁻(F(t)) ≤ t;
2. F⁻(s) ≤ t if and only if s ≤ F(t).                              (20)

Proof. The equality F(F⁻(s)) = s is a direct consequence of the continuity of F. Indeed, there are t_n ↓ F⁻(s) with F(t_n) → s, so s = lim_n F(t_n) = F(F⁻(s)). The second part of assertion 1 follows directly from the definition of F⁻.
Now let us come to the proof of (20). If F⁻(s) ≤ t, then the monotonicity of F as well as F(F⁻(s)) = s lead to s = F(F⁻(s)) ≤ F(t).
Conversely, if s ≤ F(t), then F⁻(s) ≤ F⁻(F(t)) ≤ t by the first part, thus (20) is proved.
Now choose a uniformly distributed U and set X = F⁻(U). Since P{U = 0} = 0, we may assume that X attains values in R.

Proposition 4.4. Let f satisfy f ≥ 0 and ∫_{-∞}^∞ f(s) ds = 1. Define F by (19) and let F⁻ be its pseudo-inverse. Take U uniform on [0, 1] and set X = F⁻(U). Then f is a density of the random variable X.

Proof. Using (20) it follows that
    F_X(t) = P{X ≤ t} = P{ω : F⁻(U(ω)) ≤ t} = P{ω : U(ω) ≤ F(t)} = F(t),
which completes the proof.
Example 4.8. Let f(s) = 0 for s < 0 and f(s) = λ e^{-λs} for s ≥ 0. Then
    F(t) = ∫_0^t λ e^{-λs} ds = 1 - e^{-λt},   t > 0.
Hence
    F⁻(s) = -ln(1 - s)/λ,   0 < s < 1,
and
    X = -ln(1 - U)/λ
is exponentially distributed with parameter λ. Note that 1 - U is also uniformly distributed on [0, 1], hence
    X̃ = -ln(U)/λ
is exponential as well.
Example. For the density
    f(t) = 3t² for 0 ≤ t ≤ 1,   f(t) = 0 otherwise,                (21)
we get
    F(t) = 0 for t < 0,   F(t) = t³ for 0 ≤ t ≤ 1,   F(t) = 1 for t > 1,
and X = U^{1/3}, where U is uniform on [0, 1], is a random variable having density (21).
4.3

Problem: Given random variables X and Y, how is X + Y distributed? Do we even know that X + Y is a random variable?

Proposition 4.5. If X and Y are random variables, then X + Y is a random variable.

Proof. Since Q is dense in R, for any a < b we can choose q ∈ Q such that a < q < b. Then
    {X + Y < t} = {X < t - Y} = ⋃_{q ∈ Q} [{X < q} ∩ {q < t - Y}].
Since X and Y are random variables, {X < q} ∈ A and {q < t - Y} ∈ A. Thus their intersection is in A, as a σ-field is closed under finite intersections. Since A is also closed under countable unions, the whole set is in A, thus X + Y is a random variable.
Theorem 4.3. Let X and Y be independent with values in Z and mass functions
    f(k) = P{X = k},   g(k) = P{Y = k}.
Then h(k) = P{X + Y = k} satisfies
    h(k) = (f ∗ g)(k) = Σ_{j=-∞}^∞ f(k - j) g(j) = Σ_{j=-∞}^∞ f(j) g(k - j).
Proof. Let B_k = {(i, j) : i + j = k}. Then
    P{X + Y = k} = P{(X, Y) ∈ B_k} = Σ_{(i,j) ∈ B_k} P{X = i, Y = j}
                 = Σ_{(i,j) ∈ B_k} P{X = i} P{Y = j} = Σ_{j=-∞}^∞ P{X = k - j} P{Y = j}
                 = Σ_{j=-∞}^∞ f(k - j) g(j) = (f ∗ g)(k).
Example 4.11. Say P{X = k} = P{Y = k} = 2^{-k}, k = 1, 2, 3, . . . , with X and Y independent, and let Z = X - Y. For k ≥ 0,
    P{X - Y = k} = Σ_{j=1}^∞ P{X = k + j} P{Y = j} = Σ_{j=1}^∞ 2^{-(k+j)} 2^{-j} = 2^{-k} Σ_{j=1}^∞ 2^{-2j} = 2^{-k}/3.
By symmetry,
    P{X - Y = k} = 2^{-|k|}/3,   k ∈ Z.
Corollary. If X and Y are independent with values in N_0, then
    P{X + Y = k} = Σ_{j=0}^k P{X = j} P{Y = k - j},   k ∈ N_0.
Proof. Same as the above theorem, noting that f(k) = g(k) = 0 if k ∉ N_0.
Example 4.12. Let X ~ B_{n,p} and Y ~ B_{m,p} be independent. Then
    P{X + Y = k} = Σ_{j=0}^k P{X = j} P{Y = k - j} = Σ_{j=0}^k C(n, j) p^j (1-p)^{n-j} C(m, k-j) p^{k-j} (1-p)^{m-(k-j)}
                 = p^k (1-p)^{n+m-k} Σ_{j=0}^k C(n, j) C(m, k-j)
                 = C(n+m, k) p^k (1-p)^{n+m-k},
using the Vandermonde identity Σ_{j=0}^k C(n, j) C(m, k-j) = C(n+m, k). Thus X + Y ~ B_{n+m,p}.

Corollary 4.1. If X_1, . . . , X_n ~ B_{1,p} are independent, then S_n = X_1 + · · · + X_n ~ B_{n,p}.
Similarly, if X ~ Pois_λ and Y ~ Pois_μ are independent, then
    P{X + Y = k} = Σ_{j=0}^k (λ^j/j!) e^{-λ} (μ^{k-j}/(k-j)!) e^{-μ}
                 = (e^{-(λ+μ)}/k!) Σ_{j=0}^k (k!/(j!(k-j)!)) λ^j μ^{k-j} = ((λ+μ)^k/k!) e^{-(λ+μ)},
so X + Y ~ Pois_{λ+μ}.
Explanation: Let X ~ Pois_λ and Y ~ Pois_μ be the numbers of accidents in cities A and B respectively. If Z is the number of accidents in both, then Z ~ Pois_{λ+μ}.
In the same way, if X ~ B⁻_{n,p} and Y ~ B⁻_{m,p} are independent, then with f(k) = C(n+k-1, k) p^n (1-p)^k and g(k) = C(m+k-1, k) p^m (1-p)^k,
    Σ_{j=0}^k f(k - j) g(j) = p^{n+m} (1-p)^k Σ_{j=0}^k C(n+k-j-1, k-j) C(m+j-1, j) = C(n+m+k-1, k) p^{n+m} (1-p)^k,
so X + Y ~ B⁻_{n+m,p}.
Now we ask a different question. Say we have continuous random variables X and Y with densities f and g. Does X + Y possess a density, and if so, what is it?
Definition 4.1. If f, g : R → R, then we define the convolution of f and g as
    (f ∗ g)(x) = ∫_{-∞}^∞ f(x - y) g(y) dy = ∫_{-∞}^∞ f(y) g(x - y) dy.
For densities the defining integral is finite for almost all x, since
    ∫_{-∞}^∞ |g(y)| (∫_{-∞}^∞ |f(x - y)| dx) dy < ∞.
If X and Y are independent with densities f and g, then X + Y has density f ∗ g. Indeed, f_{(X,Y)}(u, y) = f(u) g(y), and with B_t = {(u, y) : u + y ≤ t},
    P{X + Y ≤ t} = ∫∫_{(u,y) ∈ B_t} f(u) g(y) du dy = ∫_{-∞}^∞ (∫_{-∞}^{t-y} f(u) du) g(y) dy = ∫_{-∞}^t (f ∗ g)(x) dx.
Example 4.15. Let X and Y be independent and uniform on [0, 1], so f(x) = 1_{[0,1]}(x) and g(y) = 1_{[0,1]}(y). Then
    (f ∗ g)(x) = ∫_{-∞}^∞ f(x - y) g(y) dy = ∫_0^1 1_{[0,1]}(x - y) dy.
If 0 ≤ x ≤ 1, then
    ∫_0^1 1_{[0,1]}(x - y) dy = ∫_0^x dy = x,
and if 1 ≤ x ≤ 2, then
    ∫_0^1 1_{[0,1]}(x - y) dy = ∫_{x-1}^1 dy = 2 - x.
Hence the density of X + Y is
    h(x) = x for 0 ≤ x ≤ 1,   h(x) = 2 - x for 1 ≤ x ≤ 2,   h(x) = 0 otherwise.
Example 4.16. Let X ~ Γ_{β,α_1} and Y ~ Γ_{β,α_2} with X and Y independent. Then
    f(y) = (1/(β^{α_1} Γ(α_1))) y^{α_1-1} e^{-y/β},   y ≥ 0,
    g(x - y) = (1/(β^{α_2} Γ(α_2))) (x - y)^{α_2-1} e^{-(x-y)/β},   y ≤ x,
so
    (f ∗ g)(x) = (e^{-x/β}/(β^{α_1+α_2} Γ(α_1) Γ(α_2))) · (∗),   where  (∗) = ∫_0^x y^{α_1-1} (x - y)^{α_2-1} dy.
Let s = y/x, so y = sx and dy = x ds. Then
    (∗) = x^{α_1+α_2-1} ∫_0^1 s^{α_1-1} (1 - s)^{α_2-1} ds = x^{α_1+α_2-1} B(α_1, α_2) = x^{α_1+α_2-1} Γ(α_1)Γ(α_2)/Γ(α_1+α_2).
So
    (f ∗ g)(x) = (1/(β^{α_1+α_2} Γ(α_1+α_2))) x^{α_1+α_2-1} e^{-x/β},   x > 0,
i.e. X + Y ~ Γ_{β,α_1+α_2}.
Proposition 4.8. If X and Y are independent Erlang distributed with parameters n resp. m and λ > 0, then X + Y is Erlang distributed with parameters n + m and λ.

Proof. X ~ Exp_{λ,n} = Γ_{1/λ,n} and Y ~ Exp_{λ,m} = Γ_{1/λ,m} implies X + Y ~ Γ_{1/λ,n+m} = Exp_{λ,n+m}.

Lemma. If S_n ~ Exp_{λ,n}, then for T > 0,
    P{S_n > T} = (1/(n-1)!) ∫_{λT}^∞ u^{n-1} e^{-u} du = Σ_{j=0}^{n-1} ((λT)^j/j!) e^{-λT}.
Proposition 4.9. Let X_1, X_2, . . . be independent Exp_λ distributed random variables and set S_n = X_1 + · · · + X_n. Given T > 0, it follows that
    P{S_n ≤ T < S_{n+1}} = ((λT)^n/n!) e^{-λT},   n = 1, 2, . . . .

Proof. It holds that
    {S_n ≤ T < S_{n+1}} = {S_{n+1} > T} \ {S_n > T}.
Hence, in view of the preceding lemma, we get
    P{S_n ≤ T < S_{n+1}} = P{S_{n+1} > T} - P{S_n > T} = Σ_{j=0}^n ((λT)^j/j!) e^{-λT} - Σ_{j=0}^{n-1} ((λT)^j/j!) e^{-λT} = ((λT)^n/n!) e^{-λT}.
If X ~ N(μ_1, σ_1²) and Y ~ N(μ_2, σ_2²) are independent, then X + Y ~ N(μ_1 + μ_2, σ_1² + σ_2²), or, equivalently,
    (1/(2πσ_1σ_2)) ∫_{-∞}^∞ exp(-(x - y - μ_1)²/(2σ_1²) - (y - μ_2)²/(2σ_2²)) dy
        = (1/(√(2π)(σ_1² + σ_2²)^{1/2})) exp(-(x - μ_1 - μ_2)²/(2(σ_1² + σ_2²))).
5

5.1

Discrete Case:
Let X : Ω → R be discrete. What is the mean value, or expected value, of X? Say P{X = x_j} = p_j for j = 1, . . . , n. Then the barycentre of the values is
    p_1 x_1 + · · · + p_n x_n = Σ_{j=1}^n x_j P{X = x_j} =: EX.
If X ≥ 0 with countably many values, we may always set
    EX = Σ_{j=1}^∞ x_j P{X = x_j} ∈ [0, ∞].
A first example shows that EX = ∞ may occur even in quite natural problems.
Example 5.1. Play a series of fair games. Whenever you put $M$ dollars into the pool, you get back $2M$ dollars if you win. If you lose, the $M$ dollars are lost.
Apply the following strategy: after losing a game, double the amount in the pool. Say you start with \$1 and lose; then next time put \$2 into the pool, then \$4, and so on, until you win for the first time. As is easily seen, in the $n$-th game the stake is $\$2^{n-1}$.
Suppose for some $n \ge 1$ you lost $n-1$ games and won the $n$-th one. How much money did you lose? If $n = 1$, then you lost nothing, while for $n \ge 2$ you spent
$$1 + 2 + 4 + \cdots + 2^{n-2} = 2^{n-1} - 1$$
dollars. Note that $2^{n-1} - 1 = 0$ if $n = 1$, hence for all $n \ge 1$ the total loss is $2^{n-1} - 1$ dollars. On the other hand, if you win the $n$-th game, you gain $2^{n-1}$ dollars. Consequently, no matter what the results are, you always win $2^{n-1} - (2^{n-1} - 1) = 1$ dollar.
Where is the catch? Let $X$ be the amount of money needed in the case that one wins for the first time in the $n$-th game. One needs \$1 to play the first game, $\$1 + \$2 = \$3$ to play the second, up to
$$1 + 2 + 4 + \cdots + 2^{n-1} = 2^{n} - 1$$
to play the $n$-th game. Thus $X$ has values in $\{1, 3, 7, 15, \ldots\}$ and
$$P\{X = 2^{n} - 1\} = P\{\text{first win in game } n\} = \frac{1}{2^{n}}\,, \qquad n = 1, 2, \ldots$$
Consequently, it follows that
$$EX = \sum_{n=1}^{\infty} (2^{n} - 1)\, P\{X = 2^{n} - 1\} = \sum_{n=1}^{\infty} \frac{2^{n} - 1}{2^{n}} = \infty\,.$$
This result tells us that the average amount of money needed to follow this strategy is infinite. Of course, the owners of gambling casinos know this strategy as well. Therefore they limit the possible amount of money in the pool: if the largest possible stake is \$N, then the strategy breaks down as soon as you lose $n$ games with $2^{n} > N$.
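A simulation makes the catch visible: every run of the doubling strategy wins exactly \$1, yet the sample mean of the required capital does not stabilize, in line with $EX = \infty$ (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 100_000

needed = np.empty(trials)
for i in range(trials):
    stake, spent = 1, 0
    while rng.random() < 0.5:      # lose this game with probability 1/2
        spent += stake
        stake *= 2                 # double the stake after a loss
    needed[i] = spent + stake      # capital required: past losses plus the final stake

# the net gain is always exactly 1 dollar, but the average capital needed
# keeps drifting upward as the number of trials grows
print(needed.mean(), needed.max())
```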
Definition 5.2. Let $X : \Omega \to \{x_1, x_2, \ldots\}$. We say that $X$ has an expected value, or that its expectation exists, if
$$E|X| = \sum_{j=1}^{\infty} |x_j|\, P\{X = x_j\} < \infty\,. \qquad (22)$$
Then we set
$$EX = \sum_{j=1}^{\infty} x_j\, P\{X = x_j\}\,. \qquad (23)$$
Remark. Condition (22) ensures that the series in (23) converges absolutely,
$$\sum_{k=1}^{\infty} x_k\, P\{X = x_k\} = \lim_{n\to\infty}\sum_{k=1}^{n} x_k\, P\{X = x_k\}\,,$$
so its value does not depend on the order of summation.

Example. Let $X$ be uniform on $\{1, \ldots, N\}$, i.e. $P\{X = k\} = \frac{1}{N}$, $k = 1, \ldots, N$. Then
$$EX = \sum_{k=1}^{N} k\, P\{X = k\} = \frac{1}{N}\sum_{k=1}^{N} k = \frac{1}{N}\cdot\frac{N(N+1)}{2} = \frac{N+1}{2}\,.$$
Example. Let $X \sim B_{n,p}$, i.e. $P\{X = k\} = \binom{n}{k} p^{k}(1-p)^{n-k}$, $k = 0, \ldots, n$. Then
$$EX = \sum_{k=1}^{n} k \binom{n}{k} p^{k} (1-p)^{n-k} = np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!}\, p^{k-1} (1-p)^{n-k}$$
$$= np \sum_{k=0}^{n-1} \frac{(n-1)!}{k!\,(n-1-k)!}\, p^{k} (1-p)^{n-1-k} = np\,(p + (1-p))^{n-1} = np\,.$$
The interpretation: in $n$ trials with success probability $p$, one succeeds $np$ times on average.
Continuous Case:
Think of a Riemann sum of discrete random variables approximating a continuous r.v. $X$ with density $f$. Then one is led to
$$EX = \int_{-\infty}^{\infty} u\, f(u)\,du\,.$$
If $X \ge 0$ almost surely, then this integral is well-defined in $[0, \infty]$. For the general case:

Definition 5.3. $X$ has an expected value if
$$E|X| = \int_{-\infty}^{\infty} |u|\, f(u)\,du < \infty\,. \qquad (24)$$
Then
$$EX = \int_{-\infty}^{\infty} u\, f(u)\,du\,. \qquad (25)$$
Recall that an improper integral is defined as
$$\int_{-\infty}^{\infty} g(x)\,dx = \lim_{a \to -\infty,\; b \to +\infty} \int_{a}^{b} g(x)\,dx\,.$$
Hence, if condition (24) is satisfied, then the integral in (25) is well-defined and $EX$ exists in $\mathbb{R}$.
Example 5.4. Let $X$ be Cauchy distributed with density $f(u) = \frac{1}{\pi}\cdot\frac{1}{1+u^2}$. Then
$$E|X| = \int_{-\infty}^{\infty} \frac{|u|}{\pi\,(1+u^2)}\,du = \frac{2}{\pi}\int_{0}^{\infty} \frac{u}{1+u^2}\,du\,,$$
and since
$$\int_{0}^{\infty} \frac{u}{1+u^2}\,du = \frac{1}{2}\ln(1+u^2)\Big|_{0}^{\infty} = \infty\,,$$
we get $E|X| = \infty$: a Cauchy distributed random variable does not possess an expected value.
In contrast, if $X \sim N(0,1)$, then
$$E|X| = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} |u|\, e^{-u^2/2}\,du < \infty\,, \qquad EX = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} u\, e^{-u^2/2}\,du = 0\,,$$
the latter because the integrand is odd.
Example. For $X \sim \Gamma_{\beta,\alpha}$,
$$EX = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\int_{0}^{\infty} u \cdot u^{\alpha-1} e^{-u/\beta}\,du = \frac{\beta\,\Gamma(\alpha+1)}{\Gamma(\alpha)} = \alpha\beta\,.$$

Example. Suppose the lifetime $X$ of a device is $\mathrm{Exp}_{\lambda}$ distributed with expected lifetime $EX = \frac{1}{\lambda} = 80$. Then
$$P\{X \ge 90\} = \int_{90}^{\infty} \lambda\, e^{-\lambda u}\,du = e^{-90/80} = e^{-9/8}\,.$$
Example. Let $X$ be geometric distributed, $P\{X = k\} = p(1-p)^{k-1}$, $k = 1, 2, \ldots$ Then
$$EX = \sum_{k=1}^{\infty} k\, p\,(1-p)^{k-1} = p \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,.$$
Since the derivative of $h(x) = \frac{1}{1-x} = \sum_{k=0}^{\infty} x^{k}$ is
$$h'(x) = \frac{1}{(1-x)^2} = \sum_{k=1}^{\infty} k\, x^{k-1}\,,$$
it follows that
$$EX = \frac{p}{(1 - (1-p))^{2}} = \frac{1}{p}\,.$$
Remark. The expected value of a random variable is motivated by the strong law of large numbers, which asserts that the average of the values of $n$ experiments converges to the expected value as the number of trials goes to infinity.

Theorem 5.1 (Strong Law of Large Numbers). If $X_1, X_2, \ldots$ are i.i.d. and $EX_1$ exists, then
$$P\Big\{\lim_{n\to\infty} \frac{X_1 + \cdots + X_n}{n} = EX_1\Big\} = 1\,.$$

Remark. The expected value should not be mixed up with the median. Recall that $m$ is a median of $X$ if $P\{X \le m\} \ge \frac12$ and $P\{X \ge m\} \ge \frac12$. For example, if $X$ is $\mathrm{Exp}_{\lambda}$-distributed, then $EX = \frac{1}{\lambda}$ while the median is $m = \frac{\ln 2}{\lambda}$.
5.2 Expected Values of Special Distributions

(a) $P\{X = x_j\} = \frac{1}{N}$, $j = 1, \ldots, N$, i.e. $X$ is uniform. Then $EX = \frac{1}{N}\sum_{j=1}^{N} x_j$.

(b) If $X \sim B_{n,p}$, then $EX = np$, as computed above.

(c) If $P\{X = k\} = \frac{\lambda^{k}}{k!}\,e^{-\lambda}$, $k = 0, 1, \ldots$, then $EX = \lambda$.

(d) If $P\{X = k\} = p(1-p)^{k-1}$, $k = 1, 2, \ldots$, then $EX = \frac{1}{p}$. For example, when rolling a die, on average 6 appears for the first time on the 6th trial.

(e) If $X \sim U[\alpha, \beta]$, then $EX = \frac{\alpha+\beta}{2}$.

(f) If $X$ is Beta distributed with parameters $\alpha$ and $\beta$, then
$$EX = \frac{B(\alpha+1, \beta)}{B(\alpha, \beta)} = \frac{\alpha}{\alpha+\beta}\,.$$
In particular, for the order statistics $X_1 \le X_2 \le \cdots \le X_n$ of $n$ independent $U[0,1]$ random variables one has $X_k \sim B(k,\, n-k+1)$, so $EX_k = \frac{k}{n+1}$ (see the simulation sketch below).
5.3 Properties of the Expected Value
If $\varphi : \mathbb{R} \to \mathbb{R}$, then
$$E\varphi(X) = \sum_{j=1}^{\infty} \varphi(x_j)\, P\{X = x_j\}$$
in the discrete case, and
$$E\varphi(X) = \int_{-\infty}^{\infty} \varphi(s)\, f(s)\,ds$$
in the continuous case.

Next let $X = \sum_{j=1}^{n} x_j \mathbf{1}_{A_j}$ and $Y = \sum_{i=1}^{m} y_i \mathbf{1}_{B_i}$ be simple random variables with $X$ and $Y$ independent. Then $XY = \sum_{i=1}^{m}\sum_{j=1}^{n} x_j y_i\, \mathbf{1}_{A_j \cap B_i}$, hence
$$E(XY) = \sum_{i=1}^{m}\sum_{j=1}^{n} x_j y_i\, P(A_j \cap B_i) = \sum_{i=1}^{m}\sum_{j=1}^{n} x_j y_i\, P(A_j)\, P(B_i) = \Big(\sum_{j=1}^{n} x_j P(A_j)\Big)\Big(\sum_{i=1}^{m} y_i P(B_i)\Big) = (EX)(EY)\,.$$
1. $X \sim N(\mu, \sigma^{2})$. Writing $X = \sigma Z + \mu$ with $Z \sim N(0,1)$, whose density $\frac{1}{\sqrt{2\pi}}\,e^{-s^{2}/2}$ is even so that $EZ = 0$, we get $EX = E(\sigma Z + \mu) = \sigma\, EZ + \mu = \mu$.
2. $X \sim N(0,1)$. Then
$$EX^{n} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} s^{n}\, e^{-s^{2}/2}\,ds\,.$$
If $n$ is odd, then clearly $EX^{n} = 0$, the integrand being odd. So look at
$$(*) = EX^{2n} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} s^{2n}\, e^{-s^{2}/2}\,ds = \frac{2}{\sqrt{2\pi}}\int_{0}^{\infty} s^{2n}\, e^{-s^{2}/2}\,ds\,.$$
Substitute $s = \sqrt{2u}$, so that $ds = \frac{1}{\sqrt{2u}}\,du$. Then
$$(*) = \frac{2}{\sqrt{2\pi}}\int_{0}^{\infty} 2^{n}\, u^{n-1/2}\, e^{-u}\,du \cdot \frac{1}{\sqrt{2}} = \frac{2^{n}}{\sqrt{\pi}}\,\Gamma\big(n + \tfrac12\big)\,.$$
Since $\Gamma(n+\tfrac12) = (n-\tfrac12)(n-\tfrac32)\cdots\tfrac12\,\Gamma(\tfrac12)$ and $\Gamma(\tfrac12) = \sqrt{\pi}$,
$$EX^{2n} = 2^{n}\,(n-\tfrac12)(n-\tfrac32)\cdots\tfrac12 = (2n-1)(2n-3)(2n-5)\cdots 3\cdot 1 = (2n-1)!!$$
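A Monte Carlo check of $EX^{2n} = (2n-1)!!$ (a sketch assuming NumPy; the Monte Carlo error grows quickly with $n$, so only small $n$ are meaningful here):

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.standard_normal(2_000_000)

def double_factorial(m):
    # (2n-1)!! = (2n-1)(2n-3) ... 3 * 1
    return int(np.prod(np.arange(m, 0, -2)))

for n in (1, 2, 3, 4):
    # sample moments close to 1, 3, 15, 105
    print(2 * n, np.mean(z**(2 * n)), double_factorial(2 * n - 1))
```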
3. A company produces cornflakes. In order to sell the cornflakes they put a picture in each box. If there are $n$ distinct pictures, how many boxes must one buy, on average, to collect all $n$ pictures?
Solution: Assume we already have $k$ pictures, $k = 0, \ldots, n-1$. Let $X_k$ be the number of purchases before one gets a new picture. We see that $P\{X_0 = 1\} = 1$ and in general
$$P\{X_k = m\} = p_k\,(1-p_k)^{m-1}\,,$$
where the success probability of getting a new picture is $p_k = \frac{n-k}{n}$. In other words, the $X_k$ are geometric distributed with parameter $p_k = \frac{n-k}{n}$, $0 \le k \le n-1$. Then
$$EX_k = \frac{1}{p_k} = \frac{n}{n-k}\,,$$
hence the expected total number of purchases is
$$\sum_{k=0}^{n-1}\frac{1}{p_k} = 1 + \frac{n}{n-1} + \cdots + \frac{n}{n-(n-1)} = n\sum_{j=1}^{n}\frac{1}{j} \approx n\ln n\,.$$
What is the average number of necessary purchases before one has $n/2$ pictures? Then one needs
$$EX_0 + \cdots + EX_{n/2-1} = 1 + \frac{n}{n-1} + \cdots + \frac{n}{n-(n/2-1)} = n\Big(\frac{1}{n} + \frac{1}{n-1} + \cdots + \frac{1}{n/2+1}\Big)\,.$$
This can be written as
$$(*) = n\Big(\sum_{j=1}^{n}\frac{1}{j} - \sum_{j=1}^{n/2}\frac{1}{j}\Big)\,.$$
We know that
$$\lim_{n\to\infty}\Big(\sum_{j=1}^{n}\frac{1}{j} - \ln n\Big) = \gamma = 0.577\ldots\,,$$
Euler's constant, hence
$$(*) \approx n\,\big(\ln n - \ln(n/2)\big) = n\ln 2 \approx 0.7\, n\,.$$
So only about $0.7\,n$ purchases are needed for the first half of the pictures, compared with roughly $n \ln n$ for the full collection.
5.4 Higher Moments and Variance
If $1 \le m \le n$, then
$$(E|X|^{m})^{1/m} \le (E|X|^{n})^{1/n}\,,$$
so the existence of the $n$-th moment implies the existence of all lower-order moments.
Example 5.5.
If $X \sim N(0,1)$, then $E|X|^{n} < \infty$ for all $n$.
If $X \sim \mathrm{Exp}_{\lambda}$, then
$$E|X|^{n} = \lambda\int_{0}^{\infty} s^{n}\, e^{-\lambda s}\,ds < \infty\,.$$
If $X$ has density $f(x) = (\beta - 1)\, x^{-\beta}$ for $x \ge 1$ (and $0$ otherwise), then
$$E|X|^{n} = (\beta - 1)\int_{1}^{\infty} \frac{s^{n}}{s^{\beta}}\,ds < \infty \iff n < \beta - 1\,.$$
Remark (Hamburger Moment Problem). Given two random variables $X$ and $Y$ such that $EX^{n} = EY^{n}$ for all $n \in \mathbb{N}$, do $X$ and $Y$ possess the same distribution? In general the answer is no; the moments determine the distribution only under additional growth conditions.
Now we come to the variance. We know that $EX$ is the mean value of $X$. Let $\mu = EX$. Then the mean quadratic distance of $X$ from $\mu$ is $E|X - \mu|^{2}$; it measures how closely the values of $X$ concentrate, on average, around the mean value.

Definition 5.4. Suppose $EX^{2} < \infty$. Then
$$VX = E|X - \mu|^{2} = EX^{2} - 2\mu\, EX + \mu^{2}$$
is called the variance of $X$.
Proposition 5.2.
(a) $VX = EX^{2} - (EX)^{2}$.
(b) $V(aX + b) = a^{2}\, VX$.
(c) If $X$ is discrete, then
$$VX = \sum_{j=1}^{\infty} (x_j - \mu)^{2}\, P\{X = x_j\}\,,$$
and if $X$ is continuous with density $f$, then
$$VX = \int_{-\infty}^{\infty} (s - \mu)^{2}\, f(s)\,ds\,.$$
Example 5.6.
1. $X \sim \mathrm{Pois}_{\lambda}$. Then $EX = \lambda$ and
$$EX^{2} = \sum_{k=1}^{\infty} k^{2}\,\frac{\lambda^{k}}{k!}\,e^{-\lambda} = \sum_{k=1}^{\infty} k\,\frac{\lambda^{k}}{(k-1)!}\,e^{-\lambda} = \lambda\sum_{k=0}^{\infty} (k+1)\,\frac{\lambda^{k}}{k!}\,e^{-\lambda} = \lambda^{2} + \lambda\,.$$
Hence, $VX = EX^{2} - (EX)^{2} = \lambda^{2} + \lambda - \lambda^{2} = \lambda$.
2. $X$ is uniform on $[\alpha, \beta]$. Then $EX = \frac{\alpha+\beta}{2}$ and
$$EX^{2} = \frac{1}{\beta - \alpha}\int_{\alpha}^{\beta} s^{2}\,ds = \frac{1}{\beta-\alpha}\cdot\frac{\beta^{3}-\alpha^{3}}{3} = \frac{\alpha^{2} + \alpha\beta + \beta^{2}}{3}\,.$$
Hence,
$$VX = \frac{\alpha^{2}+\alpha\beta+\beta^{2}}{3} - \frac{\alpha^{2}+2\alpha\beta+\beta^{2}}{4} = \frac{(\beta-\alpha)^{2}}{12}\,.$$
3. $X \sim N(0,1)$. Then $EX = 0$ and
$$EX^{2} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} s^{2}\, e^{-s^{2}/2}\,ds = 1\,,$$
hence $VX = 1$. If $X \sim N(\mu, \sigma^{2})$, then $EX = \mu$ and $VX = \sigma^{2}$.
4. $X$ geometric with parameter $p$. Starting from $\frac{1}{1-x} = \sum_{k=0}^{\infty} x^{k}$, differentiation gives
$$\sum_{k=1}^{\infty} k\, x^{k-1} = \frac{1}{(1-x)^{2}}\,, \qquad \text{hence} \qquad \sum_{k=1}^{\infty} k\, x^{k} = \frac{x}{(1-x)^{2}}\,.$$
Differentiating once more,
$$\sum_{k=1}^{\infty} k^{2}\, x^{k-1} = \frac{1}{(1-x)^{2}} + \frac{2x}{(1-x)^{3}} = \frac{1+x}{(1-x)^{3}}\,.$$
Then
$$EX^{2} = p\sum_{k=1}^{\infty} k^{2}\,(1-p)^{k-1} = p\,\frac{1 + (1-p)}{p^{3}} = \frac{2-p}{p^{2}}\,,$$
and finally
$$VX = \frac{2-p}{p^{2}} - \frac{1}{p^{2}} = \frac{1-p}{p^{2}}\,.$$

5.5 Covariance
Say we are given $X$ and $Y$, possibly dependent. We want to measure their degree of dependence.
Definition 5.5. Let $E|X|^{2} < \infty$ and $E|Y|^{2} < \infty$, and set $\mu = EX$, $\nu = EY$. The covariance of $X$ and $Y$ is
$$\mathrm{Cov}(X, Y) = E\big[(X - \mu)(Y - \nu)\big]\,.$$
Properties:
(a) Since $|ab| \le \frac12 (a^{2} + b^{2})$,
$$|(X-\mu)(Y-\nu)| \le \frac12\big[(X-\mu)^{2} + (Y-\nu)^{2}\big]$$
implies
$$E|(X-\mu)(Y-\nu)| \le \frac12\big[VX + VY\big] < \infty\,,$$
which means that $\mathrm{Cov}(X, Y)$ is well-defined.
In the discrete case,
$$\mathrm{Cov}(X, Y) = \sum_{i}\sum_{j} (x_j - \mu)(y_i - \nu)\, P\{X = x_j,\, Y = y_i\}\,.$$

Example 5.7. Let $X$ and $Y$ take the values $0$ and $1$ with joint probabilities
$$P\{X=0, Y=0\} = \tfrac16\,, \quad P\{X=1, Y=0\} = \tfrac13\,, \quad P\{X=0, Y=1\} = \tfrac13\,, \quad P\{X=1, Y=1\} = \tfrac16\,.$$
Then $EX = EY = \frac12$ and
$$\mathrm{Cov}(X, Y) = \big(0-\tfrac12\big)\big(0-\tfrac12\big)\tfrac16 + \big(1-\tfrac12\big)\big(0-\tfrac12\big)\tfrac13 + \big(0-\tfrac12\big)\big(1-\tfrac12\big)\tfrac13 + \big(1-\tfrac12\big)\big(1-\tfrac12\big)\tfrac16 = -\frac{1}{12}\,.$$
Continuous Case:
If the joint distribution is
$$P\{X \le t,\, Y \le s\} = \int_{-\infty}^{t}\int_{-\infty}^{s} f(u, v)\,du\,dv\,,$$
then
$$\mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (u - \mu)(v - \nu)\, f(u, v)\,du\,dv\,.$$

Example 5.8. Let $(X, Y)$ be uniform on the unit disc,
$$f(u, v) = \begin{cases} \frac{1}{\pi} & : u^{2} + v^{2} \le 1 \\ 0 & : \text{else.} \end{cases}$$
What we see is that $EX = 0 = EY$ because $u f_X(u)$ and $v f_Y(v)$ are odd functions. Then
$$\mathrm{Cov}(X, Y) = \frac{1}{\pi}\iint_{u^{2}+v^{2}\le 1} (u - 0)(v - 0)\,du\,dv = \frac{1}{\pi}\int_{-1}^{1} v \int_{-\sqrt{1-v^{2}}}^{\sqrt{1-v^{2}}} u\,du\,dv = 0\,.$$
We see then that X and Y are uncorrelated, but we have proven earlier that they are not independent.
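This example is easily reproduced numerically (a sketch assuming NumPy; rejection sampling from the square is our choice of method):

```python
import numpy as np

rng = np.random.default_rng(6)

# sample uniformly from the unit disc by rejection from the square [-1, 1]^2
pts = rng.uniform(-1, 1, size=(1_000_000, 2))
pts = pts[(pts**2).sum(axis=1) <= 1]
x, y = pts[:, 0], pts[:, 1]

print(np.mean(x * y) - x.mean() * y.mean())      # ~0: X and Y are uncorrelated
print(np.corrcoef(np.abs(x), np.abs(y))[0, 1])   # clearly negative: a large |X| forces
                                                 # a small |Y|, so X, Y are not independent
```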
Let us summarize some properties of the covariance.
(a) $\mathrm{Cov}(X, Y) = E(XY) - (EX)(EY)$. This is seen by computing
$$\mathrm{Cov}(X, Y) = E\big[(X-\mu)(Y-\nu)\big] = E(XY) - \mu\, EY - \nu\, EX + \mu\nu = E(XY) - \mu\nu\,,$$
as $\mu = EX$ and $\nu = EY$. If we look at Example 5.7 we see that $E(XY) = \frac16$ and $(EX)(EY) = \frac14$, so $\mathrm{Cov}(X, Y) = \frac16 - \frac14 = -\frac{1}{12}$.
(b) (Cauchy–Schwarz inequality, Proposition 5.3) $E|XY| \le (EX^{2})^{1/2}\,(EY^{2})^{1/2}$. Indeed, with $t = (EX^{2})^{1/2}/(EY^{2})^{1/2}$,
$$0 \le E\big(|X| - t\,|Y|\big)^{2} = EX^{2} - 2t\, E|XY| + t^{2}\, EY^{2} = 2\, EX^{2} - 2\,\frac{(EX^{2})^{1/2}}{(EY^{2})^{1/2}}\, E|XY|\,,$$
which rearranges to $E|XY| \le (EX^{2})^{1/2}(EY^{2})^{1/2}$.

Definition. The correlation coefficient of $X$ and $Y$ is
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{(VX)^{1/2}\,(VY)^{1/2}}\,.$$
Note that by Proposition 5.3 we have $-1 \le \rho(X, Y) \le 1$. Moreover $\rho(X, X) = 1$, $\rho(X, -X) = -1$, and $\rho(X, Y) = 0$ if $X, Y$ are independent (the converse, again, is not always true).
We say that $X$ and $Y$ are strongly correlated if $\rho(X, Y)$ is near $1$ or $-1$. We say that $X$ and $Y$ are positively correlated if $\rho(X, Y) > 0$; this means that a large $X$ makes a larger $Y$ more likely. Similarly, $X$ and $Y$ are negatively correlated if $\rho(X, Y) < 0$; this means that larger values of $X$ make smaller values of $Y$ more likely.
For example, if we choose a person at random and $X$ is the height and $Y$ the weight, then $X$ and $Y$ are surely positively correlated. On the other hand, if $X$ is the number of cigarettes a person smokes per day and $Y$ their lifetime, then $X$ and $Y$ are, as is well known, negatively correlated.
Example 5.9. We have an urn with $2n$ balls; $n$ are labeled $0$ and $n$ are labeled $1$. Take out two balls without replacement, and let $X$ be the label of the first and $Y$ the label of the second ball. How are $X$ and $Y$ correlated? By the law of multiplication,
$$EX = EY = \frac12\,, \qquad E(XY) = P\{X = 1, Y = 1\} = \frac{n}{2n}\cdot\frac{n-1}{2n-1} = \frac{n-1}{4n-2}\,,$$
hence
$$\mathrm{Cov}(X, Y) = \frac{n-1}{4n-2} - \frac14 = -\frac{1}{8n-4}\,.$$
Since $VX = VY = \frac14$, it follows that
$$\rho(X, Y) = -\frac{1}{8n-4}\cdot 4 = -\frac{1}{2n-1}\,.$$
The result tells us the following: the larger $n$, the smaller the correlation between $X$ and $Y$; for very large $n$ they are almost uncorrelated. Furthermore, we see that $X$ and $Y$ are negatively correlated. Why is this so? If $X$ is large, i.e. if $X = 1$, then it becomes more likely that $Y = 0$, that is, that $Y$ is smaller.
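A simulation sketch of Example 5.9 (assuming NumPy; $n = 5$ is an arbitrary choice, so $\rho = -\frac19$):

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 5, 200_000

balls = np.array([0] * n + [1] * n)
# two balls drawn without replacement per trial
draws = np.array([rng.permutation(balls)[:2] for _ in range(trials)])
x, y = draws[:, 0], draws[:, 1]

print(np.corrcoef(x, y)[0, 1], -1 / (2 * n - 1))   # both about -1/9
```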
Example. Let $X$ and $Y$ take the values $-1$, $0$, $1$ with joint probabilities given by the following table (rows: values of $X$, columns: values of $Y$):

           Y = -1   Y = 0   Y = 1
  X = -1    1/10     1/10    1/10
  X =  0    1/10     2/10    1/10
  X =  1    1/10     1/10    1/10

Then $EX = EY = 0$ and $E(XY) = 0$, so $\mathrm{Cov}(X, Y) = 0$, although $X$ and $Y$ are not independent: $P\{X = 0, Y = 0\} = \frac{2}{10}$, while $P\{X = 0\}\,P\{Y = 0\} = \frac{4}{10}\cdot\frac{4}{10} = \frac{16}{100}$.
6 Limit Theorems
6.1 Chebyshev's Inequality

Let $Z \ge 0$ be a random variable and $\varepsilon > 0$. Since $\varepsilon\,\mathbf{1}_{\{Z \ge \varepsilon\}}(\omega) \le Z(\omega)$ for all $\omega$, taking expectations gives $\varepsilon\, E\mathbf{1}_{\{Z \ge \varepsilon\}} \le EZ$, i.e.
$$P\{Z \ge \varepsilon\} \le \frac{1}{\varepsilon}\, EZ \qquad \text{(Markov's inequality)}\,.$$
(If $Z$ is continuous with density $f$, then indeed $E\mathbf{1}_{\{Z \ge \varepsilon\}} = \int_{\varepsilon}^{\infty} f(s)\,ds = P\{Z \ge \varepsilon\}$.) Applying this to $Z = |X - EX|^{2}$ and $\varepsilon = c^{2}$ yields Chebyshev's inequality:
$$P\{|X - EX| \ge c\} \le \frac{1}{c^{2}}\, E|X - EX|^{2} = \frac{1}{c^{2}}\, VX\,.$$
Example 6.1.
1. Roll a die $n$ times and let $S_n$ be the sum of the values. Then
$$E\Big(\frac{S_n}{n}\Big) = \frac{1}{n}\cdot n\cdot\frac{7}{2} = \frac{7}{2}$$
and
$$V\Big(\frac{S_n}{n}\Big) = \frac{1}{n^{2}}\, VS_n = \frac{1}{n^{2}}\cdot n\, VX_1 = \frac{1}{n}\cdot\frac{35}{12}\,.$$
Then Chebyshev's inequality gives
$$P\Big\{\Big|\frac{S_n}{n} - \frac{7}{2}\Big| \ge \varepsilon\Big\} \le \frac{35}{12\, n\, \varepsilon^{2}}\,.$$
For example, with $n = 1000$ and $\varepsilon = 0.1$,
$$P\Big\{3.4 < \frac{S_n}{n} < 3.6\Big\} \ge 1 - \frac{35}{12\cdot 1000\cdot 0.01} = 1 - \frac{35}{120} \approx 0.7083$$
(a simulation comparison of this bound follows after this example).
2. Toss a fair coin $n$ times and let $S_n$ be the number of heads. Then $ES_n = \frac{n}{2}$, $VS_n = \frac{n}{4}$, and
$$P\Big\{\Big|S_n - \frac{n}{2}\Big| \ge c\Big\} \le \frac{n}{4c^{2}}\,.$$
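Chebyshev's inequality is distribution-free and therefore usually far from sharp; a simulation of the die example shows this (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(8)
n, trials, eps = 1000, 10_000, 0.1

rolls = rng.integers(1, 7, size=(trials, n))   # trials runs of n die rolls
means = rolls.mean(axis=1)                     # S_n / n for each run

empirical = np.mean(np.abs(means - 3.5) >= eps)
bound = 35 / (12 * n * eps**2)
print(empirical, bound)   # roughly 0.06 versus the Chebyshev bound 0.2917
```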
Theorem 6.1 (Weak Law of Large Numbers). Let $X_1, X_2, \ldots$ be i.i.d. (independent and identically distributed) and set $\mu = EX_1 = EX_2 = \cdots$. Let $S_n = X_1 + \cdots + X_n$. Then for each $\varepsilon > 0$ we have
$$\lim_{n\to\infty} P\Big\{\Big|\frac{S_n}{n} - \mu\Big| \ge \varepsilon\Big\} = 0\,.$$
Theorem 6.2 (Strong Law of Large Numbers). Let $X_1, X_2, \ldots$ be i.i.d. with $E|X_1| < \infty$ and $\mu = EX_1$, and set $S_n = X_1 + \cdots + X_n$. Then
$$P\Big\{\lim_{n\to\infty} \frac{S_n}{n} = \mu\Big\} = 1\,.$$
Application (Monte Carlo integration):
Let $h : [0,1]^{d} \to \mathbb{R}$. Our goal is to evaluate
$$\int_{[0,1]^{d}} h(x)\,dx\,.$$
What we do is take $U$ uniform on $[0,1]^{d}$, i.e. $U = (U_1, \ldots, U_d)$ with density $f(x) = \mathbf{1}_{[0,1]^{d}}(x)$. Then
$$Eh(U) = \int_{\mathbb{R}^{d}} h(x)\, f(x)\,dx = \int_{[0,1]^{d}} h(x)\,dx\,.$$
Hence, if $U^{1}, U^{2}, \ldots$ are independent and uniform on $[0,1]^{d}$, the strong law of large numbers gives, almost surely,
$$\frac{1}{n}\sum_{j=1}^{n} h(U^{j}) \longrightarrow Eh(U^{1}) = \int_{[0,1]^{d}} h(x)\,dx\,.$$
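A sketch of this Monte Carlo method (assuming NumPy; the integrand $h$ is our example, chosen because its integral factorizes into one-dimensional integrals):

```python
import numpy as np

rng = np.random.default_rng(9)

def h(x):
    # example integrand on [0,1]^d with known integral (2/pi)^d
    return np.prod(np.sin(np.pi * x), axis=-1)

d, n = 3, 1_000_000
u = rng.uniform(0, 1, size=(n, d))   # U^1, ..., U^n uniform on [0,1]^d
estimate = h(u).mean()               # (1/n) * sum_j h(U^j)

print(estimate, (2 / np.pi)**d)      # both about 0.258
```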
6.2 Central Limit Theorem
Let $X_1, X_2, \ldots$ be i.i.d. with $E|X_1|^{2} < \infty$. Let $S_n = X_1 + \cdots + X_n$ with $\mu = EX_1$ and $\sigma^{2} = VX_1$. Then
$$E(S_n - n\mu) = 0\,, \qquad V(S_n - n\mu) = n\sigma^{2}\,,$$
which implies
$$E\Big(\frac{S_n - n\mu}{\sigma\sqrt{n}}\Big) = 0 \qquad \text{and} \qquad V\Big(\frac{S_n - n\mu}{\sigma\sqrt{n}}\Big) = 1\,.$$

Theorem 6.3 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be i.i.d. with $E|X_1|^{2} < \infty$, and let $S_n$, $\mu$, and $\sigma^{2}$ be as above. Then
$$P\Big\{a \le \frac{S_n - n\mu}{\sigma\sqrt{n}} \le b\Big\} \longrightarrow \frac{1}{\sqrt{2\pi}}\int_{a}^{b} e^{-t^{2}/2}\,dt\,.$$
In other words, if $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$, then $Z_n$ is asymptotically $N(0,1)$ distributed; equivalently, $S_n$ is approximately $N(n\mu,\, n\sigma^{2})$ distributed for large $n$.
Example 6.2.
1. Roll a die $n$ times; here $\mu = \frac{7}{2}$ and $\sigma^{2} = \frac{35}{12}$, so $S_n$ is approximately $N(7n/2,\; 35n/12)$ distributed for $n$ large. Let $n = 10^{3}$, $a = 3400$, and $b = 3600$ (note $n\mu = 3500$). Then, with $\Phi$ the standard normal distribution function,
$$P\{3400 \le S_n \le 3600\} \approx \Phi\Big(\frac{3600 - 3500}{\sqrt{n \cdot 35/12}}\Big) - \Phi\Big(\frac{3400 - 3500}{\sqrt{n \cdot 35/12}}\Big) \approx 0.9352\,.$$
2. Calculations in a bank. Say there is a deposit of \$1.2347. The bank rounds down, so that \$1.23 shows in the account and the bank gains \$0.0047. Let $X_j$ be the amount the bank gains or loses on transaction $j$; then $-0.5 \le X_j \le 0.5$ if we measure $X_j$ in cents. Assume the $X_j$ are independent and uniformly distributed on $[-0.5, 0.5]$, so $EX_j = 0$ and $VX_j = \frac{1}{12}$. Let $S_n = X_1 + \cdots + X_n$ be the total amount of money (in cents) gained or lost from the transactions. Then $S_n$ is approximately $N(0, n/12)$ distributed for large $n$. Let $n = 10^{6}$ and $a = 10^{3}$ cents $= \$10$. Thus
$$P\{S_n \ge \$10\} = 1 - P\{S_n \le a\} \approx 1 - \Phi\Big(\frac{10^{3}\sqrt{12}}{10^{3}}\Big) = 1 - \Phi(\sqrt{12}) \approx 0.0002663\,.$$