Continuous Valued Random Variables
{X ≤ a} ∈ F,
{X ≤ x} = {ω ∈ Ω ∶ X(ω) ≤ x}.
Thus {X ≤ x} ∈ F will imply that we will be able to assign a probability to every such
element, as the measure P (⋅) is defined for all elements of F. In this sense, {(−∞, x]} are
the seed-sets that we discussed previously, with σ((−∞, x]) = B(R).
Definition 2 The cumulative distribution function (cdf ) is the induced probability measure
defined as
FX (x) = P ((−∞, x]) = P (X ≤ x).
Clearly, since the cdf defines probability for the class of sets which generate B(R), we can
extend the cdf to B(R). That such an extension exists, and is unique, follows from the
properties that we discussed in earlier lectures. We will not revisit these at this point,
rather assume that it is sufficient to consider {X ≤ x}. The cdf of X is usually denoted as
FX (x).
Theorem 1 The cdf FX (x) is right continuous and monotone non-decreasing. Further-
more, if X is R−valued
F (∞) = lim_{x↑∞} F (x) = 1  and  F (−∞) = lim_{x↓−∞} F (x) = 0.
Proof: The proof is left as an exercise; see the class notes. ∎
While the cdf FX (x) of a random variable completely specifies it, it is often convenient to work with the so-called probability density function. However, many random variables do not have such a representation, for example, when the random variable also takes a few discrete values. Discrete random variables have a probability mass function as opposed to a density function.
Definition 3 Consider a non-negative function f (⋅) such that
∫R f (x)dx = 1.
If the cdf FX (x) of a R− valued random variable can be represented as
FX (x) = ∫_{−∞}^{x} f (u)du, ∀x ∈ R,
then X is said to admit a density f (x). In this case, the random variable X is also called
absolutely continuous (or absolutely continuous with respect to the Lebesgue measure on R).
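As a quick numerical sketch of this definition (not from the notes; the Exp(1) density, the grid size, and the tolerance are arbitrary choices), the cdf can be recovered from the density by integration:

```python
import math

# A small numerical sketch (not from the notes): if X admits the density
# f(u) = e^{-u} on [0, inf) (the Exp(1) density), then the cdf should be
# F_X(x) = 1 - e^{-x}. We recover F_X(2) by integrating the density.

def exp_density(u):
    """Exp(1) density; zero on the negative axis."""
    return math.exp(-u) if u >= 0 else 0.0

def cdf_from_density(f, x, lo=0.0, n=10_000):
    """Trapezoidal approximation of the integral of f from lo to x."""
    if x <= lo:
        return 0.0
    h = (x - lo) / n
    total = 0.5 * (f(lo) + f(x))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

approx = cdf_from_density(exp_density, 2.0)
exact = 1.0 - math.exp(-2.0)
```

The trapezoidal rule here stands in for the integral ∫_{−∞}^{x} f(u)du; since the density vanishes on the negative axis, integrating from 0 suffices.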
One standard trick that we will repeatedly perform is to write the cdf of a random variable in the integral form shown above. Then we can compute the density function, if it exists, by mere identification. The identification gives a pdf which is unique except on a set of measure zero. This latter qualification is due to the fact that two functions differing only on a set of measure zero (in particular, on any countable set of real values) are equivalent under integration.
General random variables that we deal with can have a discontinuous cdf. We can handle the discontinuities separately by writing the cdf as the sum of two parts: one part corresponding to a discrete random variable, and another for the absolutely continuous component. Let us use this idea to define integrals in a convenient way.
Definition 4 Consider a cdf FX (x) with a countable number of discontinuities occurring
at {dn }, n ∈ N. The integral w.r.t a probability measure is defined as,
∫_{−∞}^{+∞} g(x)dFX (x) = ∑_{n∈N} g(d_n ) (FX (d_n ) − FX (d_n^− )) + ∫_{−∞}^{+∞} g(x)f_c (x)dx    (1)
where f_c (x) is the density function of the continuous part FX^c (x). For those who are familiar with the theory of distributions (as used in Fourier analysis), the discrete part FX^d corresponds to a density with impulses of appropriate heights at the discontinuities d_1 and d_2 .
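Definition 4 can be exercised on a small made-up example (a sketch; the jump location, its height, and the sub-density below are arbitrary choices): take g(x) = x and a mixed X with one jump plus an exponential continuous part, and evaluate (1) term by term.

```python
import math

# Illustration of (1) on a made-up mixed distribution (a sketch, with
# arbitrary numbers): X takes the value d1 = 1 with probability 1/2 (one jump
# in the cdf), and otherwise follows the continuous part with sub-density
# f_c(x) = 0.5 * e^{-x} for x >= 0. We compute E[X], i.e. g(x) = x in (1).

def g(x):
    return x

# discrete part: g(d1) * (F_X(d1) - F_X(d1-))
jump_part = g(1.0) * 0.5

# continuous part: integral of g(x) * f_c(x) dx, truncated at x = 50
n, hi = 200_000, 50.0
h = hi / n
cont_part = sum(g(i * h) * 0.5 * math.exp(-i * h) for i in range(1, n)) * h

expectation = jump_part + cont_part  # 0.5 + 0.5 = 1.0 analytically
```

The jump contributes g(d_1) times the jump height, and the continuous part contributes the ordinary integral against f_c, exactly as (1) prescribes.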
[Figure: a cdf FX (x) with jumps at d_1 and d_2 , decomposed into a discrete part FX^d , a step function with jumps of height FX (d_i ) − FX (d_i^− ), and an absolutely continuous part FX^c .]
2 Random Vectors
Like in the discrete case, a collection of random variables is called a random vector. Let us
first consider two random variables X1 and X2 .
Recall the definition of the product sigma-field F = F1 × F2 , which is the smallest sigma-
field containing events of the form A1 × A2 with Ai ∈ Fi , i = 1, 2. For example, we already
demonstrated the construction of a probability measure in R2 , by considering rectangles as
the seed-sets (refer chapters 2-3).
Consider a probability measure P (A × B) defined on F. The random vector (X1 , X2 )
induces a probability on (B(R) × B(R)). Thus it makes sense to talk about joint events
generated by (X1 , X2 ). We can easily generalize the above definition to Rn .
2.2 Independence
The notion of independence of RVs translates to the independence of events associated
with X1 and X2 . The relevant events can be expressed as {Xi ≤ xi }, i = 1, 2.
3 Expectation
Many properties of the expectation from the discrete random variable case carry over to the continuous valued ones. In particular, expectation in the former is a weighted summation, where g(x) is weighted by P (X = x) and summed to obtain E[g(X)]. In the more general form, g(x) is integrated with respect to the measure dFX (x) to obtain E[g(X)].
Notice that if the absolutely continuous component of X is identically zero, the expectation
reduces to the summation as in the discrete case, see (1). Let us look at another intuitive
result which also has an exact counterpart in the discrete case.
Theorem 2 Let X admit a density fX (x). Then, for A ∈ B(R),
E1_{X∈A} = P (X ∈ A).
Proof: By definition,
E1_{X∈A} = ∫_A fX (x)dx = P (X ∈ A). ∎
Thus probabilities can be written as expectations. This is very useful, since many properties of expectation are not only easy to derive, but also carry over from the discrete case with the appropriate change to integrals. For example, we know that when X and Y are independent discrete RVs, E[g1 (X)g2 (Y )] = E[g1 (X)]E[g2 (Y )]. The same holds true in the continuous case. The key here is the decomposition of the joint density into a product form.
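The product rule for expectations of independent random variables can be sanity-checked by simulation (a sketch; the distributions, test functions, and sample size below are arbitrary choices):

```python
import random

# Monte Carlo sanity check (illustrative only; distributions and sample size
# are arbitrary choices): for independent X and Y,
# E[g1(X) g2(Y)] should agree with E[g1(X)] E[g2(Y)].
random.seed(0)
N = 200_000

def g1(x):
    return x * x

def g2(y):
    return y + 1.0

xs = [random.gauss(0.0, 1.0) for _ in range(N)]    # X ~ N(0, 1)
ys = [random.uniform(0.0, 1.0) for _ in range(N)]  # Y ~ Uniform(0, 1), independent of X

lhs = sum(g1(x) * g2(y) for x, y in zip(xs, ys)) / N
rhs = (sum(g1(x) for x in xs) / N) * (sum(g2(y) for y in ys) / N)
# both should be near E[X^2] * E[Y + 1] = 1 * 1.5
```

Both sample averages converge to the same limit as N grows, which is the content of the product decomposition.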
Theorem 3 If X1 and X2 are independent random variables admitting respective densities,
fX1 ,X2 (x1 , x2 ) = fX1 (x1 )fX2 (x2 ).
Proof: By the definition of independence, the events {X1 ≤ x1 } and {X2 ≤ x2 } are inde-
pendent.
P (X1 ≤ x1 , X2 ≤ x2 ) = P (X1 ≤ x1 )P (X2 ≤ x2 )
= ∫_{−∞}^{x1} fX1 (u)du ∫_{−∞}^{x2} fX2 (v)dv
= ∫_{−∞}^{x1} ∫_{−∞}^{x2} fX1 (u)fX2 (v) dv du.
Identifying this with the integral form of the joint cdf gives the product density. ∎
E[g1 (X1 )g2 (X2 )] = ∫∫ g1 (x1 )g2 (x2 )fX1 ,X2 (x1 , x2 )dx1 dx2
Theorem 5 Let X1 , X2 be independent random variables. Consider a function g ∶ R × R → R, which is either non-negative or integrable. Then
E[g(X1 , X2 )] = ∫∫ g(x1 , x2 )fX1 (x1 )fX2 (x2 )dx1 dx2 .
Proof: This follows from Theorem 3 together with the integral form of the expectation. ∎
The last theorem will find several applications, some of which we will illustrate in the coming sections.
[Figure: a Gaussian density fX (x) centered at its mean µ.]
Exercise 1 Find the value of x for which a zero mean Gaussian distribution packs 99% of
the probability in the interval [−x, x].
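One numerical way to attack this exercise (a sketch; σ = 1 is an arbitrary choice, and the answer scales linearly in σ): by symmetry P (−x ≤ X ≤ x) = 0.99 means Φ(x/σ) = 0.995, so x = σ Φ^{−1}(0.995).

```python
from statistics import NormalDist

# A numerical take on Exercise 1 (a sketch; sigma = 1 is an arbitrary choice):
# P(-x <= X <= x) = 0.99 for X ~ N(0, sigma^2) means Phi(x / sigma) = 0.995,
# so x = sigma * Phi^{-1}(0.995).
sigma = 1.0
x = sigma * NormalDist().inv_cdf(0.995)

# verify that [-x, x] indeed packs 99% of the probability
d = NormalDist(0.0, sigma)
packed = d.cdf(x) - d.cdf(-x)
```

This gives x ≈ 2.576σ, the familiar 99% quantile of the standard Gaussian.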
Theorem 6 If X1 ∼ N (µ1 , σ12 ) and X2 ∼ N (µ2 , σ22 ) are independent Gaussian random
variables, the joint density function of (X1 , X2 ) is given by,
fX1 ,X2 (x1 , x2 ) = (1/√(det(2πK))) e^{−½ [x1 −µ1  x2 −µ2 ] K^{−1} [x1 −µ1 ; x2 −µ2 ]},    (3)
where
K = [σ1² 0; 0 σ2²].    (4)
The quadratic form in the exponent expands as
½ [x1 −µ1  x2 −µ2 ] K^{−1} [x1 −µ1 ; x2 −µ2 ] = (x1 −µ1 )²/(2σ1²) + (x2 −µ2 )²/(2σ2²). ∎
Let us now find the joint distribution of Gaussians which are not independent. What does it mean by ‘Gaussians which are not independent’? One way to think about this is to start with two independent Gaussians and take linear combinations of them. Linear combinations of independent Gaussians are again Gaussian, but such combinations are not necessarily independent of each other. We have not yet proved this statement. While it can be proved by basic probability tools, let us wait till we introduce the more elegant generating function framework, and proceed for now by taking linear combinations.
Theorem 7 Let X and Z be two independent and identical Gaussians with X ∼ N (0, σ 2 ).
Consider the random vector (X1 , X2 ) as
X1 = X ;  X2 = √a X + √(1 − a) Z,  0 ≤ a ≤ 1.
Then (X1 , X2 ) has a jointly Gaussian density with covariance matrix
K = σ² [1 √a; √a 1].    (5)
Proof: We use the scaling α = √a and β = √(1 − a) so that X1 and X2 have the same variance. Let us compute the pdf of (X1 , X2 ). We start with the joint cdf
P (X1 ≤ x1 , X2 ≤ x2 ) = ∫_{−∞}^{x1} ∫_{−∞}^{x2} fX1 X2 (u, v) dv du.    (6)
Once we get the form in the RHS, we can obtain the joint pdf by identification.
P (X1 ≤ x1 , X2 ≤ x2 ) = E1_{X1 ≤x1 ,X2 ≤x2}
= E1_{X1 ≤x1} 1_{X2 ≤x2}
= E1_{X≤x1} 1_{αX+βZ≤x2}
= ∫_{−∞}^{x1} ∫_{−∞}^{(x2 −αu)/β} fX,Z (u, v) dv du
= ∫_{−∞}^{x1} ∫_{−∞}^{(x2 −αu)/β} fX (u)fZ (v) dv du
= ∫_{−∞}^{x1} fX (u) ( ∫_{−∞}^{(x2 −αu)/β} fZ (v) dv ) du.
5 Covariance Matrices
We have encountered the matrix K while dealing with Gaussian random vectors. Clearly K is related to the moments of the participating random variables. What is the significance of K? Is it merely an end-product of the manipulations on joint probability density functions, as we saw in the last section? It turns out that the matrix K has a physical significance, and can be easily identified: it depends only on the individual and pair-wise statistics of the participating random variables. The matrix K is popularly known as the covariance matrix. Notice the connection to the word variance, which we have already introduced; the covariance is a kind of pair-wise variance, or inner product, of random variables.
Definition 9 For a random vector X̄ = (X1 , ⋯, Xn )^T with EXi² < ∞, 1 ≤ i ≤ n, the covariance matrix K is defined as the expectation of the outer product
K = E[(X̄ − EX̄)(X̄ − EX̄)^T ].
Since E[U1 U2 ] = 0,
K = [σ1² 0; 0 σ2²]. ∎
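Definition 9 can be read off directly in code (a sketch; the two component variances are arbitrary choices): estimate K from samples of a vector with independent components and observe that the off-diagonal entries vanish.

```python
import random

# A direct reading of Definition 9 (illustrative; the two variances are
# arbitrary): estimate K = E[(X - EX)(X - EX)^T] from samples of a vector with
# independent components, and observe that K comes out (nearly) diagonal.
random.seed(2)
N = 100_000
samples = [(random.gauss(0.0, 1.0), random.gauss(0.0, 2.0)) for _ in range(N)]

mu = [sum(s[i] for s in samples) / N for i in range(2)]
K = [[sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in samples) / N
      for j in range(2)]
     for i in range(2)]
# K should be close to [1 0; 0 4]
```

By construction K is symmetric, and here nearly diagonal, matching the example above.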
In the last example, notice that the covariance matrix is diagonal. A diagonal covariance matrix implies the lack of pairwise covariance. This is different from, and in general weaker than, saying that X1 and X2 are independent.
We have already shown that if X1 and X2 are independent, then E[g1 (X1 )g2 (X2 )] = E[g1 (X1 )]E[g2 (X2 )]. Thus independence implies uncorrelatedness, and the former is the stronger notion. The reverse is not true in general, but for the important class of Gaussian random variables, independence and uncorrelatedness amount to the same notion. Before we prove such a statement, we need to first show that linear combinations of independent Gaussians result in Gaussians. In other words, the Gaussian family is invariant under linear transformations. While this can be proved in many ways, perhaps the simplest proof uses an analog of the generating functions of discrete random variables, which we call characteristic functions in the real-valued case.
6 Characteristic Functions
Recall that we defined the Z−transform of the probability law of a discrete random variable as its generating function. The transform there is a polynomial representation. In the continuous valued case, we cannot simply use polynomials and their coefficients; the more general framework of the Fourier (or two-sided Laplace) transform is required. We call
E[e^{sX}]
the characteristic function of the random variable X.
Observe that for the discrete case, we get back our generating function by substituting z = e^s. For continuous valued cases which admit a density, it is sufficient to consider the Fourier transform of the pdf, and take s = −jω to conclude our results. We have learnt in signals and systems that for a large class of functions, for example integrable functions, the Fourier transform completely specifies the function. Even otherwise, we know the formalism to determine the function almost everywhere from its Fourier transform. Thus, we assume for the rest of the section that the characteristic function uniquely specifies a random variable. In other words, two random variables having the same characteristic function are considered to be identical.
To show the power of this transformation, let us find the probability distribution of the sum of two independent random variables.
Let Y = X1 + X2 with X1 and X2 independent. Then
E[e^{sY}] = E[e^{sX1} e^{sX2}] = E[e^{sX1}] E[e^{sX2}].
The last step used the independence of X1 and X2 . We also know that multiplication in the frequency domain corresponds to convolution in the time domain. Thus, taking inverse Fourier transforms, we get
fY = fX1 ⋆ fX2 ,
where ⋆ denotes convolution.
Proof: Integrate and apply the fact that the pdf of a random variable integrates to one. ∎
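The convolution rule fY = fX1 ⋆ fX2 can be illustrated numerically (a sketch; the Uniform(0, 1) inputs and grid size are arbitrary choices): the sum of two independent uniforms has the triangular density.

```python
# Numerical illustration of f_Y = f_X1 * f_X2 (a sketch; Uniform(0, 1) inputs
# are an arbitrary choice): for Y = X1 + X2 with X1, X2 independent
# Uniform(0, 1), the convolution gives the triangular density
# f_Y(y) = y on [0, 1] and f_Y(y) = 2 - y on [1, 2].

def f_unif(x):
    """Uniform(0, 1) density."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def conv_at(y, n=20_000):
    """Riemann-sum approximation of (f_X1 * f_X2)(y) over u in [0, 1]."""
    h = 1.0 / n
    return sum(f_unif(i * h) * f_unif(y - i * h) for i in range(n)) * h

fy_half = conv_at(0.5)        # triangular density: 0.5
fy_three_half = conv_at(1.5)  # triangular density: 0.5
```

Evaluating the convolution at a few points reproduces the triangular shape predicted by the formula.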
Theorem 9 Let X̄ = (X1 , ⋯, Xn )^T be a random vector with independent Gaussian entries of mean µi and variance σi², 1 ≤ i ≤ n. Then for any a ∈ R^n , the random variable a^T X̄ = ∑_i ai Xi is Gaussian distributed with mean ∑_i ai µi and variance ∑_i ai² σi².
Proof: Let us find the generating function of aT X̄. Using the independence of Xi ,
E[e^{s a^T X̄}] = ∏_{i=1}^{n} E[e^{s ai Xi}]
= ∏_{i=1}^{n} e^{s ai µi + ½ s² ai² σi²}
= e^{s ∑_{i=1}^{n} ai µi + ½ s² ∑_{i=1}^{n} ai² σi²}.
This expression corresponds to a Gaussian random variable N (∑ni=1 ai µi , ∑ni=1 a2i σi2 ).
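Theorem 9 is also easy to check empirically (a sketch; the coefficients, means, variances, and sample size below are arbitrary choices):

```python
import random

# Quick empirical check of Theorem 9 (a sketch; the coefficients, means and
# variances below are arbitrary): a^T X should have mean sum_i a_i mu_i and
# variance sum_i a_i^2 sigma_i^2.
random.seed(3)
a = [1.0, -2.0, 0.5]
mu = [0.0, 1.0, 3.0]
sig = [1.0, 0.5, 2.0]

N = 200_000
vals = [sum(ai * random.gauss(mi, si) for ai, mi, si in zip(a, mu, sig))
        for _ in range(N)]

mean_hat = sum(vals) / N
var_hat = sum((v - mean_hat) ** 2 for v in vals) / N

mean_true = sum(ai * mi for ai, mi in zip(a, mu))            # -0.5
var_true = sum(ai * ai * si * si for ai, si in zip(a, sig))  # 3.0
```

The empirical mean and variance of a^T X̄ match the predicted ∑ ai µi and ∑ ai² σi² up to sampling error.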
Now suppose X1 and X2 are jointly Gaussian and uncorrelated; then K is diagonal, since they are uncorrelated. Recall that the Gaussian joint pdf is completely determined by the mean vector µ̄ and K. Thus
fX1 ,X2 (x1 , x2 ) = (1/√(det(2πK))) e^{−½ (x̄−µ̄)^T K^{−1} (x̄−µ̄)}
= (1/√(2πσ1²)) e^{−(x1 −µ1 )²/(2σ1²)} ⋅ (1/√(2πσ2²)) e^{−(x2 −µ2 )²/(2σ2²)}
= fX1 (x1 )fX2 (x2 ).
Comparing this with the joint pdf, it is evident that P (X1 ≤ x1 , X2 ≤ x2 ) = P (X1 ≤ x1 )P (X2 ≤ x2 ), and thus X1 and X2 are independent. ∎
Let us now precisely define what is meant by a Gaussian random vector, so that we can extend our results to arbitrary dimensions.
Example 3 Consider two independent zero mean Gaussian random variables X1 and X2 with variances σ1² and σ2² respectively. Let
[U1 ; U2 ] = [1 3; 2 6] [X1 ; X2 ].
Does (U1 , U2 ) admit a joint density?
Solution: Observe that 2U1 − U2 = 0, which is not a strict Gaussian random variable (it is a trivial random variable); the vector (U1 , U2 ) is degenerate and admits no joint density. ∎
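The degeneracy in Example 3 shows up in the covariance matrix (a small illustrative check; the component variances are arbitrary choices): the covariance of (U1 , U2 ) is A K A^T , and its determinant is zero.

```python
# A small check for Example 3 (illustrative; the variances are arbitrary): the
# covariance of (U1, U2) = A (X1, X2)^T is A K A^T, and for A = [1 3; 2 6] its
# determinant vanishes, so the vector is degenerate and has no joint density.
s1sq, s2sq = 1.0, 4.0

A = [[1.0, 3.0], [2.0, 6.0]]
K = [[s1sq, 0.0], [0.0, s2sq]]

# K_U = A K A^T, written out for 2x2 matrices
AK = [[sum(A[i][k] * K[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
KU = [[sum(AK[i][k] * A[j][k] for k in range(2)) for j in range(2)]
      for i in range(2)]

det_KU = KU[0][0] * KU[1][1] - KU[0][1] * KU[1][0]  # 0 => degenerate
```

A singular covariance matrix means the mass lives on a lower-dimensional subspace, so no density with respect to the Lebesgue measure on R² exists.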
7.1 Multi-Dimensional Gaussian
Definition 13 A jointly Gaussian vector X̄ = (X1 , ⋯, Xn )T is specified by the pdf
fX̄ (x̄) = (1/√(det(2πK))) e^{−½ (x̄−µ̄)^T K^{−1} (x̄−µ̄)},
where K is the covariance matrix and µ̄ = E[X̄]. We will say X̄ ∼ N (µ̄, K).
It is easy to see that uncorrelatedness and independence imply the same notion for jointly Gaussian random vectors in arbitrary dimensions, as the matrix K becomes diagonal in each case.
Consider Y = aX for some a ≠ 0. Then
E[e^{sY}] = E[e^{saX}]
= ∫_R e^{sax} fX (x)dx
= ∫_R e^{su} fX (u/a) du/∣a∣
= ∫_R e^{su} (fX (u/a)/∣a∣) du.
By the inverse Fourier transform, we can identify
fY (y) = (1/∣a∣) fX (y/a). ∎
The important point to realize in the example is that the Jacobian is J = dy/dx = a. So the integral transformation takes the form
E[e^{sY}] = ∫_R e^{su} (fX (u/a)/∣det(J)∣) du.
This is the key in doing change of variables in integration. If there are multiple variables to
be changed, we have to compute the Jacobian matrix J and divide by the absolute value
of det(J). Let us do an example.
Example 4 Show that if Y = AX for an invertible n × n matrix A, then
fY (y) = fX (A^{−1} y)/∣det A∣.
Solution: Comparing with the last theorem, the above result is akin to saying that the Jacobian matrix is A. Recall that det(A) = det(A^T ) for an n × n matrix A. We follow the same route as in the scalar case, except that the Fourier transform now contains n parameters/frequency variables s1 , s2 , ⋯, sn . Thus the characteristic function is E[e^{s^T Y}], where s = (s1 , ⋯, sn )^T .
E[e^{s^T Y}] = ∫ fX (x) e^{s^T Ax} dx.
Let u = Ax. The Jacobian of this transformation is J = A, since J_{ij} = ∂u_i /∂x_j by definition. Hence
E[e^{s^T Y}] = ∫ (fX (A^{−1} u)/∣det(A)∣) e^{s^T u} du.
Clearly, the characteristic function appears as the Fourier transform of fX (A^{−1} u)/∣det(A)∣, validating our claim. ∎
Theorem 12 If Y = X² , then
fY (y) = (1/(2√y)) (fX (√y) + fX (−√y)),  y > 0.
Proof: We have shown the direct computation in class. We can also use the generating
function framework.
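As a sanity check of this formula for X standard normal (a sketch; the cut-off and grid are arbitrary choices), integrating fY up to t should reproduce P (Y ≤ t) = P (−√t ≤ X ≤ √t) = 2Φ(√t) − 1:

```python
import math
from statistics import NormalDist

# Sanity check of the Y = X^2 density for X ~ N(0, 1) (a sketch; the cut-off t
# and grid are arbitrary): integrating f_Y up to t should reproduce
# P(Y <= t) = P(-sqrt(t) <= X <= sqrt(t)) = 2 Phi(sqrt(t)) - 1.
phi = NormalDist().pdf

def f_Y(y):
    """Density of Y = X^2 as given by the theorem."""
    r = math.sqrt(y)
    return (phi(r) + phi(-r)) / (2.0 * r)

t, n = 2.0, 200_000
h = t / n
integral = sum(f_Y((i + 0.5) * h) for i in range(n)) * h  # midpoint rule

target = 2.0 * NormalDist().cdf(math.sqrt(t)) - 1.0
```

The midpoint rule is used because fY blows up like 1/√y near 0; the singularity is integrable, so the sum still converges to the cdf value.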
9 Markov’s Inequality
Recall the Markov’s inequality for the discrete random variables. An exact analog holds
for continuous valued random variables too. We will state a more general version.
E[X]
P (X > a) ≤ , a > 0.
a
Proof: The proof follows exactly as in the discrete case. ∎
Many other theorems and inequalities related to the expectation carry over from the discrete case to the continuous valued one. We will not repeat each of them; the reader can check the proofs in the discrete case to find out whether they generalize.
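Markov’s inequality can be watched in action on a concrete distribution (a sketch; Exp(1) and the test points are arbitrary choices), since for X ∼ Exp(1) the tail P (X > a) = e^{−a} is known exactly:

```python
import math

# Numerical illustration of Markov's inequality (a sketch; Exp(1) is an
# arbitrary choice): for X ~ Exp(1) we have E[X] = 1 and P(X > a) = e^{-a},
# so e^{-a} <= 1/a must hold for every a > 0.
checks = []
for a in [0.5, 1.0, 2.0, 5.0, 10.0]:
    tail = math.exp(-a)   # exact P(X > a)
    bound = 1.0 / a       # Markov bound E[X] / a
    checks.append(tail <= bound)
```

The bound is loose for large a (the exact tail decays exponentially while the bound decays only like 1/a), which is typical of Markov-type inequalities.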
10 Conditional Probability
As the name implies, there are at least two random variables involved; let us denote them by X and Y . Let us start with the familiar idea of probabilities of joint events of the form {X ≤ x} and {Y ≤ y}. We will use this later to define conditional probabilities. We know that
P (X ≤ x∣Y ≤ y) = (P (X ≤ x, Y ≤ y)/P (Y ≤ y)) 1_{P (Y ≤y)≠0}.
In the all-discrete case, we used P (Y = y) instead of P (Y ≤ y) in the above expression to define the conditional distribution. However, this cannot be done if Y is absolutely continuous, as P (Y = y) = 0 for every y. So we will define four separate versions of conditional probability depending on whether X and Y are continuous or not.
Though we say different versions, all of them have the same theme, with appropriate
replacement of distributions by densities. The first case is when X and Y are discrete,
which is already familiar.
1. X− discrete, Y − discrete
P (X = x∣Y = y) = (P (X = x, Y = y)/P (Y = y)) 1_{P (Y =y)>0}.
2. X− continuous, Y − discrete
Here we will replace the event {X = x} by {X ≤ x}:
P (X ≤ x∣Y = y) = ∫_{−∞}^{x} fX∣Y (u∣y)du ⋅ 1_{P (Y =y)>0},
where we used the notation fX∣Y (x∣y) to denote the density function of X given Y = y.
Notice that
P (X ≤ x∣Y = y) = (P (X ≤ x, Y = y)/P (Y = y)) 1_{P (Y =y)>0}
= (1/P (Y = y)) ∫_{−∞}^{x} fXY (u, y)du ⋅ 1_{P (Y =y)>0}
= ∫_{−∞}^{x} (fXY (u, y)/P (Y = y)) 1_{P (Y =y)>0} du.
Thus, our definition gives a valid probability measure for all y ∶ P (Y = y) > 0. Also,
the marginal distribution of X becomes
P (X ≤ x) = ∑_k P (Y = k) ∫_{−∞}^{x} fX∣Y (u∣k)du.
3. X− discrete, Y −continuous
We will reverse engineer a consistent definition of conditional distribution from the
last two cases. Specifically, let us define
P (X = i∣Y = y) = (fXY (i, y)/fY (y)) 1_{fY (y)>0}.    (10)
Clearly
∑_i P (X = i∣Y = y) = ∑_i (fXY (i, y)/fY (y)) 1_{fY (y)>0}
= (fY (y)/fY (y)) 1_{fY (y)>0}
= 1_{fY (y)>0}.
Thus, our definition gives a valid probability measure for all y ∶ fY (y) > 0. The
conditional probability that we defined also takes the convenient form,
4. X− continuous, Y − continuous
Let us define
fX∣Y (x∣y) = (fXY (x, y)/fY (y)) 1_{fY (y)>0}.    (12)
11 Conditional Expectation
Expectation with respect to a conditional distribution is known as conditional expectation.
Since we defined the conditional distribution for 4 separate cases, the conditional expec-
tation has to be evaluated accordingly in these cases. For generality, we will denote the
conditional distribution that we introduced in the last section by Π(x∣y). We will mention
the general framework here.
Definition 14 Consider a function g ∶ R × R → R, which is either non-negative or inte-
grable. The function
Ψ(y) = E[g(X, Y )∣Y = y]
is known as the conditional expectation of g(X, Y ) given Y = y, where the expectation is
evaluated with respect to Π(x∣y).
Observe that this definition can be easily specialized to each of the cases that we dealt with.
The most important case in our discussion is the last one, where it reads
Ψ(y) = ∫_R g(x, y)fX∣Y (x∣y)dx,
in which Π(x∣y) is the same as the function fX∣Y (x∣y) given in (12). On the other hand, when X is discrete and Y continuous, we can write
Ψ(y) = ∑_i g(i, y)Π(i∣y),
where Π(i∣y) is taken as per (10), or the more convenient form in (11).
Example 5 Let (X1 , X2 ) be a zero mean jointly Gaussian random vector with covariance
matrix
K = [σ1² ρσ1 σ2 ; ρσ1 σ2 σ2²].
Find E[X1 ∣X2 ] and E[X12 ∣X2 ].
Solution: Notice that we need to specialize our definition of conditional distribution. In
particular, since both RVs are continuous, let us look at their joint density.
fX1 ,X2 (x1 , x2 ) = (1/√(det(2πK))) e^{−½ x^T K^{−1} x},
where x = (x1 , x2 )T . The marginal distribution of X2 is a Gaussian (since the given vector
is jointly Gaussian), which we can identify as
fX2 (x2 ) = (1/√(2πσ2²)) e^{−x2²/(2σ2²)}.
The conditional density becomes
fX1 ,X2 (x1 , x2 )/fX2 (x2 ) = (1/√(2π(1 − ρ²)σ1²)) e^{−(x1 − ρ(σ1 /σ2 )x2 )²/(2(1−ρ²)σ1²)}.
Observe that the conditional density is nothing but a Gaussian density with mean ρ(σ1 /σ2 )x2 and variance (1 − ρ²)σ1². Thus, it is easy to identify
E[X1 ∣X2 = x2 ] = ρ(σ1 /σ2 )x2  and  E[X1² ∣X2 = x2 ] = (1 − ρ²)σ1² + ρ²(σ1²/σ2²)x2². ∎
12 Other results
It is clear that Ψ(Y ) defined above is a random variable taking values in R̄ whenever
E∣g(X, Y )∣ < ∞, and then it is meaningful to talk about its expectation. We have performed
such computations for the discrete case. Many results from the conditional expectation for
discrete cases have exact analogs in the general case. Here we list a few of them, whose
proofs are straightforward, whenever the quantities involved are finite or non-negative. For
example
1.
2.
3. If X ⊥ Y ,
E_Y [E_X [g(X)∣Y ]] = E[g(X)].
4. Wald’s identity.
Exercise 3 Using the definition of conditional probabilities, prove each of these expres-
sions.