
Conditionally independent random variables


Konstantin Makarychev and Yury Makarychev
Princeton University
E-mail: {kmakaryc,ymakaryc}@princeton.edu

This work was done while the authors were at Moscow State University. Supported by Russian Foundation for Basic Research grant 01-01-01028. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Abstract: In this paper we investigate the notion of conditional independence and prove several information inequalities for conditionally independent random variables.

Keywords: Conditionally independent random variables, common information, rate region.
I. Introduction
Ahlswede, Gács, Körner, Witsenhausen and Wyner [1], [2], [4], [7], [8] studied the problem of extraction of common information from a pair of random variables. The simplest form of this problem is the following: Fix some distribution for a pair of random variables α and β. Consider n independent pairs (α_1, β_1), ..., (α_n, β_n); each has the same distribution as (α, β). We want to extract common information from the sequences α_1, ..., α_n and β_1, ..., β_n, i.e., to find a random variable γ such that H(γ | (α_1, ..., α_n)) and H(γ | (β_1, ..., β_n)) are small. We say that extraction of common information is impossible if the entropy of any such variable γ is small.
Let us show that this is the case if α and β are independent. In this case α^n = (α_1, ..., α_n) and β^n = (β_1, ..., β_n) are independent. Recall the well-known inequality

H(γ) ≤ H(γ | α^n) + H(γ | β^n) + I(α^n : β^n).

Here I(α^n : β^n) = 0 (because α^n and β^n are independent); the two other summands on the right-hand side are small by our assumption.
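For illustration only (this sketch is ours, not part of the paper), the well-known inequality above can be checked numerically on any joint distribution of three variables; the helper names and the random test distribution below are arbitrary.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array of any shape."""
    q = p[p > 0]
    return float(-(q * np.log2(q)).sum())

rng = np.random.default_rng(0)
p = rng.random((3, 3, 4)); p /= p.sum()        # generic joint distribution p[alpha, beta, gamma]

H_g_given_a = H(p.sum(1)) - H(p.sum((1, 2)))   # H(gamma | alpha) = H(alpha, gamma) - H(alpha)
H_g_given_b = H(p.sum(0)) - H(p.sum((0, 2)))   # H(gamma | beta)
I_ab = H(p.sum((1, 2))) + H(p.sum((0, 2))) - H(p.sum(2))   # I(alpha : beta)

print(H(p.sum((0, 1))) <= H_g_given_a + H_g_given_b + I_ab)  # True for every joint distribution
```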
It turns out that a similar statement holds for dependent random variables. However, there is one exception. If the joint probability matrix of (α, β) can be divided into blocks, there is a random variable γ that is a function of α and a function of β (the block number). Then γ^n = (γ_1, ..., γ_n) is common information of α^n and β^n.

It was shown by Ahlswede, Gács and Körner [1], [2], [4] that this is the only case when there exists common information.

Their original proof is quite technical. Several years ago another approach was proposed by Romashchenko [5] using conditionally independent random variables. Romashchenko introduced the notion of conditionally independent random variables and showed that extraction of common information from conditionally independent random variables is impossible. We prove that if the joint probability matrix of a pair of random variables (α, β) is not a block matrix, then α and β are conditionally independent. We also show several new information inequalities for conditionally independent random variables.
II. Conditionally independent random variables
Consider four random variables α, β, α', β'. Suppose that α' and β' are independent, α and β are independent given α', and also independent given β', i.e., I(α' : β') = 0, I(α : β | α') = 0 and I(α : β | β') = 0. Then we say that α and β are conditionally independent of order 1. (Conditionally independent random variables of order 0 are independent random variables.)
We consider conditional independence of random variables as a property of their joint distributions. If a pair of random variables α and β has the same joint distribution as a pair of conditionally independent random variables α_0 and β_0 (on another probability space), we say that α and β are conditionally independent.

Replacing the requirement of independence of α' and β' by the requirement of conditional independence of order 1, we get the definition of conditionally independent random variables (α and β) of order 2 and so on. (Conditionally independent variables of order k are also called k-conditionally independent in the sequel.)
Definition 1: We say that α and β are conditionally independent with respect to α' and β' if α and β are independent given α', and they are also independent given β', i.e., I(α : β | α') = I(α : β | β') = 0.

Definition 2: (Romashchenko [5]) Two random variables α and β are called conditionally independent random variables of order k (k ≥ 0) if there exists a probability space and a sequence of pairs of random variables

(α^0, β^0), (α^1, β^1), ..., (α^k, β^k)

on it such that
(a) The pair (α^0, β^0) has the same distribution as (α, β).
(b) α^i and β^i are conditionally independent with respect to α^{i+1} and β^{i+1} when 0 ≤ i < k.
(c) α^k and β^k are independent random variables.

The sequence

(α^0, β^0), (α^1, β^1), ..., (α^k, β^k)

is called a derivation for (α, β).

We say that random variables α and β are conditionally independent if they are conditionally independent of some order k.
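To make Definition 1 concrete, here is a small Python illustration (the example and all names are ours, not from the paper): take independent uniform bits α' and β' and set α = α' AND β', β = α' OR β'. Then α and β are conditionally independent of order 1, although they are not independent.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array."""
    q = p[p > 0]
    return float(-(q * np.log2(q)).sum())

def I(pxy):
    """Mutual information of a 2-D joint distribution p[x, y]."""
    return H(pxy.sum(1)) + H(pxy.sum(0)) - H(pxy)

def I_cond(pxyz):
    """Conditional mutual information I(X : Y | Z) for a 3-D joint p[x, y, z]."""
    return H(pxyz.sum(1)) + H(pxyz.sum(0)) - H(pxyz) - H(pxyz.sum((0, 1)))

# Joint distribution p[a1, b1, a, b] of (alpha', beta', alpha, beta):
# alpha', beta' are independent uniform bits, alpha = alpha' AND beta', beta = alpha' OR beta'.
p = np.zeros((2, 2, 2, 2))
for a1 in range(2):
    for b1 in range(2):
        p[a1, b1, a1 & b1, a1 | b1] = 0.25

print(I(p.sum((2, 3))))                      # I(alpha' : beta')        = 0
print(I_cond(p.sum(1).transpose(1, 2, 0)))   # I(alpha : beta | alpha') = 0
print(I_cond(p.sum(0).transpose(1, 2, 0)))   # I(alpha : beta | beta')  = 0
print(I(p.sum((0, 1))))                      # I(alpha : beta)          > 0
```

Incidentally, the joint probability matrix of (α, β) in this toy example is [[1/4, 1/2], [0, 1/4]], a matrix of the form considered after Lemma 8 below.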
The notion of conditional independence can be applied for analysis of common information using the following observations (see below for proofs):

Lemma 1: Consider conditionally independent random variables α and β of order k. Let α^n [β^n] be a sequence of independent random variables each with the same distribution as α [β]. Then the variables α^n and β^n are conditionally independent of order k.
Theorem 1: (Romashchenko [5]) If random variables α and β are conditionally independent of order k, and γ is an arbitrary random variable (on the same probability space), then

H(γ) ≤ 2^k H(γ | α) + 2^k H(γ | β).
Definition 3: An m × n matrix is called a block matrix if (after some permutation of its rows and columns) it consists of four blocks; the blocks on the diagonal are not equal to zero; the blocks outside the diagonal are equal to zero.

Formally, A is a block matrix if the set of its first indices {1, ..., m} can be divided into two disjoint nonempty sets I_1 and I_2 (I_1 ∪ I_2 = {1, ..., m}) and the set of its second indices {1, ..., n} can be divided into two sets J_1 and J_2 (J_1 ∪ J_2 = {1, ..., n}) in such a way that each of the blocks {a_ij : i ∈ I_1, j ∈ J_1} and {a_ij : i ∈ I_2, j ∈ J_2} contains at least one nonzero element, and all the elements outside these two blocks are equal to 0, i.e., a_ij = 0 when (i, j) ∈ (I_1 × J_2) ∪ (I_2 × J_1).
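As an aside (our addition, not part of the paper), Definition 3 can be tested mechanically: a matrix is a block matrix exactly when its nonzero entries split into two groups that share no row and no column (all-zero rows and columns can be attached to either block). A minimal Python sketch under that reading, with the function name and bookkeeping being ours:

```python
import numpy as np

def is_block_matrix(A, tol=0.0):
    """Check Definition 3: do the nonzero entries of A split into two groups
    that share no row and no column?"""
    m, n = A.shape
    parent = list(range(m + n))           # union-find over m row-nodes and n column-nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    rows, cols = np.nonzero(np.abs(A) > tol)
    for i, j in zip(rows, cols):
        union(i, m + j)                   # a nonzero entry links its row and its column

    # block matrix iff the nonzero entries fall into at least two components
    return len({find(i) for i in rows}) >= 2

print(is_block_matrix(np.array([[0.3, 0.0], [0.0, 0.7]])))    # True  (block matrix)
print(is_block_matrix(np.array([[0.25, 0.5], [0.0, 0.25]])))  # False (not a block matrix)
```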
Theorem 2: Random variables are conditionally independent iff their joint probability matrix is not a block matrix.

Using these statements, we conclude that if the joint probability matrix of a pair of random variables (α, β) is not a block matrix, then no information can be extracted from a sequence of n independent random variables each with the same distribution as (α, β):

H(γ) ≤ 2^k H(γ | α^n) + 2^k H(γ | β^n)

for some k (that does not depend on n) and for any random variable γ.
III. Proof of Theorem 1
Theorem 1: If random variables α and β are conditionally independent of order k, and γ is an arbitrary random variable (on the same probability space), then

H(γ) ≤ 2^k H(γ | α) + 2^k H(γ | β).

Proof: The proof is by induction on k. The statement is already proved for independent random variables α and β (k = 0).

Suppose α and β are conditionally independent with respect to conditionally independent random variables α' and β' of order k − 1. From the conditional form of the inequality

H(γ) ≤ H(γ | α) + H(γ | β) + I(α : β)

(α' is added everywhere as a condition) it follows that

H(γ | α') ≤ H(γ | α, α') + H(γ | β, α') + I(α : β | α') = H(γ | α, α') + H(γ | β, α') ≤ H(γ | α) + H(γ | β).

Similarly, H(γ | β') ≤ H(γ | α) + H(γ | β). By the induction hypothesis, H(γ) ≤ 2^{k−1} H(γ | α') + 2^{k−1} H(γ | β'). Replacing H(γ | α') and H(γ | β') by their upper bounds, we get

H(γ) ≤ 2^k H(γ | α) + 2^k H(γ | β).
Corollary 1.1: If the joint probability matrix A of a pair of random variables is a block matrix, then these random variables are not conditionally independent.

Proof: Suppose that the joint probability matrix A of random variables (α, β) is a block matrix and these random variables are conditionally independent of order k. Let us divide the matrix A into blocks I_1 × J_1 and I_2 × J_2 as in Definition 3. Consider a random variable γ with two values that is equal to the number of the block that contains (α, β):

γ = 1 if (α, β) ∈ I_1 × J_1;
γ = 2 if (α, β) ∈ I_2 × J_2.

The random variable γ is a function of α and at the same time a function of β. Therefore, H(γ | α) = 0 and H(γ | β) = 0. However, γ takes two different values with positive probability. Hence H(γ) > 0, which contradicts Theorem 1.

A similar argument shows that the order of conditional independence should be large if the matrix is close to a block matrix.
IV. Proof of Theorem 2
For brevity, we call joint probability matrices of conditionally independent random variables good matrices.

The proof of Theorem 2 consists of three main steps. First, we prove that the set of good matrices is dense in the set of all joint probability matrices. Then we prove that any matrix without zero elements is good. Finally, we consider the general case and prove that any matrix that is not a block matrix is good.

The following statements are used in the sequel.

(a) The joint probability matrix of independent random variables is a matrix of rank 1 and vice versa. In particular, all matrices of rank 1 are good.

(b) If α and β are conditionally independent, α'' is a function of α and β'' is a function of β, then α'' and β'' are conditionally independent. (Indeed, if α and β are conditionally independent with respect to some α' and β', then α'' and β'' are also conditionally independent with respect to α' and β'.)

(c) If two random variables are k-conditionally independent, then they are l-conditionally independent for any l > k. (We can add some constant random variables to the end of the derivation.)
(d) Assume that conditionally independent random variables α_1 and β_1 are defined on a probability space Ω_1 and conditionally independent random variables α_2 and β_2 are defined on a probability space Ω_2. Consider random variables (α_1, α_2) and (β_1, β_2) that are defined in a natural way on the Cartesian product Ω_1 × Ω_2. Then (α_1, α_2) and (β_1, β_2) are conditionally independent. Indeed, for each pair (α_i, β_i) consider its derivation

(α^0_i, β^0_i), (α^1_i, β^1_i), ..., (α^l_i, β^l_i)

(using (c), we may assume that both derivations have the same length l). Then the sequence

((α^0_1, α^0_2), (β^0_1, β^0_2)), ..., ((α^l_1, α^l_2), (β^l_1, β^l_2))

is a derivation for the pair of random variables ((α_1, α_2), (β_1, β_2)). For example, the random variables (α_1, α_2) = (α^0_1, α^0_2) and (β_1, β_2) = (β^0_1, β^0_2) are independent given the value of (α^1_1, α^1_2), because α_1 and β_1 are independent given α^1_1, the variables α_2 and β_2 are independent given α^1_2, and the measure on Ω_1 × Ω_2 is equal to the product of the measures on Ω_1 and Ω_2.
Applying (d) several times, we get Lemma 1.

Combining Lemma 1 and (b), we get the following statement:

(e) Let (α_1, β_1), ..., (α_n, β_n) be independent and identically distributed random variables. Assume that the variables in each pair (α_i, β_i) are conditionally independent. Then any random variables α' and β', where α' depends only on α_1, ..., α_n and β' depends only on β_1, ..., β_n, are conditionally independent.
Definition 4: Let us introduce the following notation:

D_ε = [ 1/2 − ε      ε     ]
      [    ε      1/2 − ε  ]

(where 0 ≤ ε ≤ 1/2).

The matrix D_{1/4} corresponds to a pair of independent random bits; as ε tends to 0 these bits become more dependent (though each is still uniformly distributed over {0, 1}).

Lemma 2: (i) D_{1/4} is a good matrix.
(ii) If D_ε is a good matrix then D_{ε(1−ε)} is good.
(iii) There exist arbitrarily small ε such that D_ε is good.
Proof:
(i) The matrix D_{1/4} is of rank 1, hence it is good (independent random bits).

(ii) Consider a pair of random variables α and β distributed according to D_ε. Define new random variables α'' and β'' as follows:

if (α, β) = (0, 0) then (α'', β'') = (0, 0);
if (α, β) = (1, 1) then (α'', β'') = (1, 1);
if (α, β) = (0, 1) or (α, β) = (1, 0) then

(α'', β'') = (0, 0) with probability ε/2;
             (0, 1) with probability (1 − ε)/2;
             (1, 0) with probability (1 − ε)/2;
             (1, 1) with probability ε/2.

The joint probability matrix of α'' and β'' given α = 0 is equal to

[ (1 − ε)^2    ε(1 − ε) ]
[ ε(1 − ε)       ε^2    ]

and its rank equals 1. Therefore, α'' and β'' are independent given α = 0. Similarly, the joint probability matrix of α'' and β'' given α = 1, β = 0 or β = 1 has rank 1. This yields that α'' and β'' are conditionally independent with respect to α and β, hence α'' and β'' are conditionally independent (since D_ε is good by the assumption of (ii), the pair (α, β) is conditionally independent of some order, and α'' and β'' are then conditionally independent of the next order).

The joint distribution of α'' and β'' is

[ 1/2 − ε(1 − ε)      ε(1 − ε)     ]
[    ε(1 − ε)      1/2 − ε(1 − ε)  ]

hence D_{ε(1−ε)} is a good matrix.

(iii) Consider the sequence ε_n defined by ε_0 = 1/4 and ε_{n+1} = ε_n(1 − ε_n). The sequence ε_n tends to zero (its limit is a root of the equation x = x(1 − x)). It follows from statements (i) and (ii) that all the matrices D_{ε_n} are good.
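The construction in step (ii) is easy to check numerically. The following sketch (ours; variable names are not from the paper) builds the joint distribution of (α, β, α'', β'') for D_ε, confirms that the conditional joint matrices of (α'', β'') given each value of α and of β have rank 1, and confirms that the marginal of (α'', β'') is D_{ε(1−ε)}.

```python
import numpy as np

eps = 0.25
D = np.array([[0.5 - eps, eps], [eps, 0.5 - eps]])   # joint matrix of (alpha, beta)

def cond(a, b, eps):
    """Conditional distribution of (alpha'', beta'') given (alpha, beta) = (a, b)."""
    if a == b:
        out = np.zeros((2, 2)); out[a, b] = 1.0
        return out
    return np.array([[eps / 2, (1 - eps) / 2],
                     [(1 - eps) / 2, eps / 2]])

# full joint p[a, b, a2, b2]
p = np.zeros((2, 2, 2, 2))
for a in range(2):
    for b in range(2):
        p[a, b] = D[a, b] * cond(a, b, eps)

# given alpha = a (and given beta = b) the joint matrix of (alpha'', beta'') has rank 1
for a in range(2):
    print(np.linalg.matrix_rank(p[a].sum(axis=0)))       # 1
for b in range(2):
    print(np.linalg.matrix_rank(p[:, b].sum(axis=0)))    # 1

# the marginal joint matrix of (alpha'', beta'') equals D_{eps(1 - eps)}
new_eps = eps * (1 - eps)
D_new = np.array([[0.5 - new_eps, new_eps], [new_eps, 0.5 - new_eps]])
print(np.allclose(p.sum(axis=(0, 1)), D_new))            # True
```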
Note: The order of conditional independence of D_ε tends to infinity as ε → 0. Indeed, applying Theorem 1 to random variables α and β with joint distribution D_ε and to γ = α, we obtain

H(α) ≤ 2^k (H(α | α) + H(α | β)) = 2^k H(α | β).

Here H(α) = 1; for any fixed value of β the random variable α takes two values with probabilities 2ε and 1 − 2ε, therefore

H(α | β) = −(1 − 2ε) log_2(1 − 2ε) − 2ε log_2(2ε) = O(−ε log_2 ε)

and (if D_ε corresponds to conditionally independent variables of order k)

2^k ≥ H(α)/H(α | β) = 1/O(−ε log_2 ε) → ∞

as ε → 0.
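This lower bound on the order is easy to tabulate. A short sketch (ours), using the fact that H(α | β) = h(2ε) for D_ε, where h is the binary entropy function:

```python
import numpy as np

def h(p):
    """Binary entropy (bits)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

# For D_eps, H(alpha) = 1 and H(alpha | beta) = h(2*eps); Theorem 1 applied to
# gamma = alpha forces 2^k >= 1 / h(2*eps), so the order k grows as eps -> 0.
for eps in [0.1, 0.01, 0.001, 1e-6]:
    print(eps, np.log2(1.0 / h(2 * eps)))   # lower bound on k
```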
Lemma 3: The set of good matrices is dense in the set of all joint probability matrices (i.e., the set of m × n matrices with non-negative elements whose sum is 1).

Proof: Any joint probability matrix A can be approximated as closely as desired by matrices with elements of the form l/2^N for some N (where N is the same for all matrix elements). Therefore, it suffices to prove that any joint probability matrix B with elements of the form l/2^N can be approximated (as closely as desired) by good matrices. Take a pair of random variables (α, β) distributed according to B. The pair (α, β) can be represented as a function of N independent Bernoulli trials. The joint distribution matrix of each of these trials is D_0 and, by Lemma 2, can be approximated by a good matrix. Using statement (e), we get that (α, β) can also be approximated by a good matrix. Hence B can be approximated as closely as desired by good matrices.
Lemma 4: If A = (a_ij) and B = (b_ij) are stochastic matrices and M is a good matrix, then A^T M B is a good matrix.

Proof: Consider a pair of random variables (α, β) distributed according to M. This pair of random variables is conditionally independent.

Roughly speaking, we define the random variable α'' [β''] as a transition from α [β] with transition matrix A [B]. The joint probability matrix of (α'', β'') is equal to A^T M B. But since the transitions are independent of α and β, the new random variables are conditionally independent.

More formally, let us randomly (independently of α and β) choose vectors c and d as follows:

Pr(proj_i(c) = j) = a_ij,
Pr(proj_i(d) = j) = b_ij,

where proj_i is the projection onto the i-th component. Define α'' = proj_α(c) and β'' = proj_β(d). Then
(i) the joint probability matrix of (α'', β'') is equal to A^T M B;
(ii) the pair (α, c) is conditionally independent from the pair (β, d). Hence, by statement (b), α'' and β'' are conditionally independent.
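A quick numerical check of claim (i) in the proof (the check and all names are ours): summing M[a, b] times the outer product of the a-th row of A and the b-th row of B over all (a, b) reproduces A^T M B.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(k, l):
    """Random k x l matrix with rows summing to 1."""
    X = rng.random((k, l))
    return X / X.sum(axis=1, keepdims=True)

M = rng.random((3, 4)); M /= M.sum()   # a joint probability matrix for (alpha, beta)
A = random_stochastic(3, 5)            # P(alpha'' = i | alpha = a) = A[a, i]
B = random_stochastic(4, 6)            # P(beta''  = j | beta  = b) = B[b, j]

# joint distribution of (alpha'', beta'') when the two transitions are applied independently
joint = np.zeros((5, 6))
for a in range(3):
    for b in range(4):
        joint += M[a, b] * np.outer(A[a], B[b])

print(np.allclose(joint, A.T @ M @ B))   # True
```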
Now let us prove the following technical lemma.

Lemma 5: For any nonsingular n × n matrix M and any matrix R = (r_ij) with the sum of its elements equal to 0, there exist matrices P and Q such that
1. R = P^T M + M Q;
2. the sum of all elements in each row of P is equal to 0;
3. the sum of all elements in each row of Q is equal to 0.

Proof: First, we assume that M = I (here I is the identity matrix of the proper size), and find matrices P' and Q' such that

R = P'^T + Q'.

Let us define P' = (p'_ij) and Q' = (q'_ij) as follows:

q'_ij = (1/n) Σ_{k=1}^{n} r_kj.

Note that all rows of Q' are the same and equal to the average of the rows of R. Put

P' = (R − Q')^T.

It is easy to see that condition (1) holds. Condition (3) holds because the sum of all elements in any row of Q' is equal to the sum of all elements of R divided by n, which is 0 by the condition. Condition (2) holds because

Σ_{j=1}^{n} p'_ij = Σ_{j=1}^{n} ( r_ji − (1/n) Σ_{k=1}^{n} r_ki ) = 0.

Now we consider the general case. Put P = (M^{−1})^T P' and Q = M^{−1} Q'. Clearly (1) holds. Conditions (2) and (3) can be rewritten as Pu = 0 and Qu = 0, where u is the vector consisting of ones. But Pu = (M^{−1})^T (P'u) = 0 and Qu = M^{−1}(Q'u) = 0. Hence (2) and (3) hold.
By altering the signs of P and Q, we get Corollary 5.1.

Corollary 5.1: For any nonsingular matrix M and any matrix R with the sum of its elements equal to 0, there exist matrices P and Q such that
1. R = −P^T M − M Q;
2. the sum of all elements in each row of P is equal to 0;
3. the sum of all elements in each row of Q is equal to 0.
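The construction in the proof of Lemma 5 is explicit, so it can be verified directly. A small sketch (ours), for a random nonsingular M and a random R with zero total sum:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

M = rng.random((n, n)) + np.eye(n)             # generic nonsingular matrix
R = rng.random((n, n)); R -= R.sum() / n**2    # make the total sum of R equal to 0

# construction from the proof of Lemma 5
Qp = np.tile(R.mean(axis=0), (n, 1))           # every row of Q' is the average of the rows of R
Pp = (R - Qp).T                                # P' = (R - Q')^T, so that R = P'^T + Q'
P = np.linalg.inv(M).T @ Pp                    # P = (M^{-1})^T P'
Q = np.linalg.inv(M) @ Qp                      # Q = M^{-1} Q'

print(np.allclose(P.T @ M + M @ Q, R))         # condition 1: R = P^T M + M Q
print(np.allclose(P.sum(axis=1), 0))           # condition 2: zero row sums of P
print(np.allclose(Q.sum(axis=1), 0))           # condition 3: zero row sums of Q
```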
Lemma 6: Any nonsingular matrix M without zero elements is good.

Proof: Let M be a nonsingular n × n matrix without zero elements. By Lemma 4, it suffices to show that M can be represented as

M = A^T G B,

where G is a good matrix and A and B are stochastic matrices. In other words, we need to find invertible stochastic matrices A, B such that (A^T)^{−1} M B^{−1} is a good matrix.

Let V be the affine space of all n × n matrices in which the sum of all the elements is equal to 1:

V = { X : Σ_{i=1}^{n} Σ_{j=1}^{n} x_ij = 1 }.

(This space contains the set of all joint probability matrices.)

Let U be the affine space of all n × n matrices in which the sum of all elements in each row is equal to 1:

U = { X : Σ_{j=1}^{n} x_ij = 1 for all i }.

(This space contains the set of stochastic matrices.)

Let Ũ be a neighborhood of I in U such that all matrices from this neighborhood are invertible. Define a mapping φ : Ũ × Ũ → V as follows:

φ(A, B) = (A^T)^{−1} M B^{−1}.
Let us show that the differential of this mapping at the point A = B = I is a surjective mapping from T_{(I,I)}(Ũ × Ũ) (the tangent space of Ũ × Ũ at the point (I, I)) to T_M V (the tangent space of V at the point M). Differentiate φ at (I, I):

dφ|_{A=I, B=I} = d( (A^T)^{−1} M B^{−1} ) = −(dA)^T M − M dB.

We need to show that for any matrix R ∈ T_M V there exist matrices (P, Q) ∈ T_{(I,I)}(Ũ × Ũ) such that

R = −P^T M − M Q.

But this is guaranteed by Corollary 5.1.

Since the mapping φ has a surjective differential at (I, I), it has a surjective differential in some neighborhood N_1 of (I, I) in Ũ × Ũ. Take a pair of stochastic matrices (A_0, B_0) from this neighborhood such that these matrices are interior points of the set of stochastic matrices.

Now take a small neighborhood N_2 of (A_0, B_0) from the intersection of N_1 and the set of stochastic matrices. Since the differential of φ at (A_0, B_0) is surjective, the image of N_2 has an interior point. Hence it contains a good matrix (recall that the set of good matrices is dense in the set of all joint probability matrices). In other words, φ(A_1, B_1) = (A_1^T)^{−1} M B_1^{−1} is a good matrix for some pair of stochastic matrices (A_1, B_1) ∈ N_2. This finishes the proof.
Lemma 7: Any joint probability matrix without zero elements is a good matrix.

Proof: Suppose that X = (v_1, ..., v_n) is an m × n (m > n) matrix of rank n. It is equal to the product of a nonsingular matrix and a stochastic matrix:

X = (v_1 − u_1 − ... − u_{m−n}, v_2, ..., v_n, u_1, ..., u_{m−n}) ×

    [      I      ]
    [  1 0 ... 0  ]
    [  . .     .  ]
    [  1 0 ... 0  ]

where u_1, ..., u_{m−n} are sufficiently small vectors with positive components that form a basis in R^m together with v_1, ..., v_n (it is easy to see that such vectors do exist); the vectors u_1, ..., u_{m−n} should be small enough to ensure that the vector v_1 − u_1 − ... − u_{m−n} has positive elements.

The first factor is a nonsingular matrix with positive elements and hence is good. The second factor is a stochastic matrix, so the product is a good matrix.

Therefore, any matrix of full rank without zero elements is good. If an m × n matrix with positive elements does not have full rank, we can add (in a similar way) m linearly independent columns to get a matrix of full rank and then represent the given matrix as a product of a matrix of full rank and a stochastic matrix.
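For a concrete instance of this factorization, the sketch below (ours; the particular choice of the vectors u_i is only one possible choice) builds the two factors for a random positive 4 × 2 matrix and checks the required properties.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 2
X = rng.random((m, n)) + 0.1          # m x n positive matrix, generically of rank n
v = [X[:, j] for j in range(n)]

# small positive vectors u_1, ..., u_{m-n} completing v_1, ..., v_n to a basis of R^m
delta, tau = 0.01, 0.1
u = [delta * (np.eye(m)[n + i] + tau) for i in range(m - n)]

F = np.column_stack([v[0] - sum(u)] + v[1:] + u)                # nonsingular, positive entries
S = np.vstack([np.eye(n), np.tile(np.eye(n)[0], (m - n, 1))])   # stochastic second factor

print(np.allclose(F @ S, X))                            # the factorization X = F S
print((F > 0).all(), np.linalg.matrix_rank(F) == m)     # positive and of full rank
print(np.allclose(S.sum(axis=1), 1))                    # rows of S sum to 1
```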
We denote by S(M) the sum of all elements of a matrix M.

Lemma 8: Consider a matrix N whose elements are matrices N_ij of the same size. If
(a) all N_ij contain only nonnegative elements;
(b) the sum of the matrices in each row and in each column of the matrix N is a matrix of rank 1;
(c) the matrix P with elements p_ij = S(N_ij) is a good joint probability matrix;
then the sum of all the matrices N_ij is a good matrix.

Proof: This lemma is a reformulation of the definition of conditionally independent random variables. Consider random variables α, β, α', β' such that the probability of the event (α', β') = (i, j) is equal to p_ij, and the probability of the event

{α = k, β = l, α' = i, β' = j}

is equal to the (k, l)-th element of the matrix N_ij.

The sum of the matrices N_ij in a row i corresponds to the distribution of the pair (α, β) given α' = i; the sum of the matrices N_ij in a column j corresponds to the distribution of the pair (α, β) given β' = j; the sum of all the matrices N_ij corresponds to the distribution of the pair (α, β).
From Lemma 8 it follows that any 2 × 2 matrix of the form

[ a  b ]
[ 0  c ]

(where a, b and c are positive numbers whose sum equals 1) is good. Indeed, let us apply Lemma 8 to the following matrix:

N = [ N_11  N_12 ]
    [ N_21  N_22 ]

where

N_11 = [ a  0 ],   N_12 = N_21 = [ 0  b/2 ],   N_22 = [ 0  0 ]
       [ 0  0 ]                  [ 0   0  ]           [ 0  c ].

The sum of the matrices in each row and in each column is of rank 1. The sum of the elements of each matrix N_ij is positive, so (by Lemma 7) the matrix p_ij = S(N_ij) is a good matrix. Hence the sum of the matrices N_ij is good.

Recalling that a, b and c stand for arbitrary positive numbers whose sum is 1, we conclude that any 2 × 2 matrix with 0 in the bottom left corner and positive elements elsewhere is a good matrix. Combining this result with the result of Lemma 7, we get that any non-block 2 × 2 matrix is good.
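This 2 × 2 example can also be checked mechanically. The sketch below (ours) assembles the matrices N_ij above for a = c = 1/4, b = 1/2 and verifies the hypotheses of Lemma 8.

```python
import numpy as np

a, b, c = 0.25, 0.5, 0.25            # positive numbers summing to 1

# the 2 x 2 array of 2 x 2 matrices N_ij from the text
N = [[np.array([[a, 0.0], [0.0, 0.0]]), np.array([[0.0, b / 2], [0.0, 0.0]])],
     [np.array([[0.0, b / 2], [0.0, 0.0]]), np.array([[0.0, 0.0], [0.0, c]])]]

for i in range(2):                   # row sums and column sums have rank 1
    print(np.linalg.matrix_rank(N[i][0] + N[i][1]),
          np.linalg.matrix_rank(N[0][i] + N[1][i]))

p = np.array([[Nij.sum() for Nij in row] for row in N])
print(p)                             # [[a, b/2], [b/2, c]]: strictly positive, good by Lemma 7

print(sum(Nij for row in N for Nij in row))   # the target matrix [[a, b], [0, c]]
```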
In the general case (we have to prove that any non-block matrix is good) the proof is more complicated. We will use the following definitions:

Definition 5: The support of a matrix is the set of positions of its nonzero elements. An r-matrix is a matrix with nonnegative elements and with a rectangular support (i.e., with support A × B, where A [B] is some set of rows [columns]).
Lemma 9: Any r-matrix M is the sum of some r-matrices of rank 1 with the same support as M.

Proof: Denote the support of M by N = A × B. Consider the basis E_ij in the vector space of matrices whose support is a subset of N. (Here E_ij is the matrix that has 1 in the (i, j)-position and 0 elsewhere.)

The matrix M has positive coordinates in the basis E_ij. Let us approximate each matrix E_ij by a slightly different matrix E^ε_ij of rank 1 with support N:

E^ε_ij = ( e_i + ε Σ_{k∈A} e_k ) ( e_j + ε Σ_{l∈B} e_l )^T,

where e_1, ..., e_n is the standard basis in R^n.

The coordinates c_ij of M in the new basis E^ε_ij continuously depend on ε. Thus they remain positive if ε is sufficiently small. So taking a sufficiently small ε we get the required representation of M as the sum of matrices of rank 1 with support N:

M = Σ_{(i,j)∈N} c_ij E^ε_ij.
Definition 6: An r-decomposition of a matrix is its expression as a (finite) sum of r-matrices M = M_1 + M_2 + ... of the same size such that the supports of M_i and M_{i+1} intersect (for any i). The length of the decomposition is the number of summands; the r-complexity of a matrix is the length of its shortest r-decomposition (or +∞ if there is no such decomposition).
Lemma 10: Any non-block matrix M with nonnegative elements has an r-decomposition.

Proof: Consider a graph whose vertices are the nonzero entries of M. Two vertices are connected by an edge iff they are in the same row or column. By assumption, the matrix is a non-block matrix, hence the graph is connected and there exists a (possibly non-simple) path (i_1, j_1), ..., (i_m, j_m) that visits each vertex of the graph at least once.

Express M as the sum of matrices corresponding to the edges of the path: each edge corresponds to a matrix whose support consists of the endpoints of the edge; each positive element of M is distributed among the matrices corresponding to the adjacent edges. Each of these matrices is of rank 1. So the expression of M as the sum of these matrices is an r-decomposition.

Corollary 10.1: The r-complexity of any non-block matrix is finite.
Lemma 11: Any non-block matrix M is good.

Proof: The proof uses induction on the r-complexity of M. For matrices of r-complexity 1, we apply Lemma 7.

Now suppose that M has r-complexity 2. In this case M is equal to the sum of some r-matrices A and B such that their supports are intersecting rectangles. By Lemma 9, each of the matrices A and B is the sum of matrices of rank 1 with the same support. Suppose, for example, that A = A_1 + A_2 + A_3 and B = B_1 + B_2. Consider the block matrix

[ A_1  0    0    0    0   ]
[ 0    A_2  0    0    0   ]
[ 0    0    A_3  0    0   ]
[ 0    0    0    B_1  0   ]
[ 0    0    0    0    B_2 ]

The sum of the matrices in each row and in each column is a matrix of rank 1. The sum of all the entries is equal to A + B. All the conditions of Lemma 8 but one hold. The only problem is that the matrix p_ij is diagonal and hence is not good, where p_ij is the sum of the elements of the matrix in the (i, j)-th entry (see Lemma 8). To overcome this obstacle take a matrix e with only one nonzero element that is located in the intersection of the supports of A and B. If this nonzero element is sufficiently small, then all the entries of the matrix

N = [ A_1 − 4e   e          e          e          e        ]
    [ e          A_2 − 4e   e          e          e        ]
    [ e          e          A_3 − 4e   e          e        ]
    [ e          e          e          B_1 − 4e   e        ]
    [ e          e          e          e          B_2 − 4e ]

are nonnegative matrices. The sum of the elements of each of the matrices that form the matrix N is positive. And the sum of the matrices in any row and in any column is not changed, so it is still a matrix of rank 1. Using Lemma 8 we conclude that the matrix M is good.
The proof for matrices of r-complexity 3 is similar. For simplicity, consider the case where a matrix of r-complexity 3 has an r-decomposition M = A + B + C, where A, B, C are r-matrices of rank 1. Let e_1 be a matrix with one positive element that belongs to the intersection of the supports of A and B (all other matrix elements are zeros), and let e_2 be a matrix with a positive element in the intersection of the supports of B and C.

Now consider the block matrix

N = [ A − e_1   e_1             0       ]
    [ e_1       B − e_1 − e_2   e_2     ]
    [ 0         e_2             C − e_2 ]

Clearly, the sums of the matrices in each row and in each column are of rank 1. The support of the matrix (p)_ij is of the form

[ * * 0 ]
[ * * * ]
[ 0 * * ]

and (p)_ij has r-complexity 2 (its support is the union of two intersecting rectangles, so the matrix is the sum of two r-matrices). By the inductive assumption any matrix of r-complexity 2 is good. Therefore, M is a good matrix (Lemma 8).
In the general case (any matrix of r-complexity 3) the reasoning is similar. Each of the matrices A, B, C is represented as the sum of some matrices of rank 1 (by Lemma 9). Then we need several entries e_1 (e_2), as for matrices of r-complexity 2. In the same way, we prove the lemma for matrices of r-complexity 4, etc.

This concludes the proof of Theorem 2: Random variables are conditionally independent if and only if their joint probability matrix is not a block matrix.

Note that this proof is constructive in the following sense. Assume that the joint probability matrix for α, β is given and this matrix is not a block matrix. (For simplicity we assume that the matrix elements are rational numbers, though this is not an important restriction.) Then we can effectively find k such that α and β are k-conditionally independent, and find the joint distribution of all random variables that appear in the definition of k-conditional independence. (Probabilities for that distribution are not necessarily rational numbers, but we can provide algorithms that compute approximations with arbitrary precision.)
V. Improved version of Theorem 1
The inequality

H(γ) ≤ 2^k H(γ | α) + 2^k H(γ | β)

from Theorem 1 can be improved. In this section we prove a stronger theorem.

Theorem 3: If random variables α and β are conditionally independent of order k, and γ is an arbitrary random variable, then

H(γ) ≤ 2^k H(γ | α) + 2^k H(γ | β) − (2^{k+1} − 1) H(γ | α, β),

or, in another form,

I(γ : (α, β)) ≤ 2^k I(γ : α | β) + 2^k I(γ : β | α).
Proof: The proof is by induction on k. We use the following inequality:

H(γ) = H(γ | α) + H(γ | β) + I(α : β) − I(α : β | γ) − H(γ | α, β)
     ≤ H(γ | α) + H(γ | β) + I(α : β) − H(γ | α, β).

If α and β are independent then I(α : β) = 0, and we get the required inequality.
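The equality in the first line above is an identity between joint entropies, so it can be confirmed numerically on an arbitrary joint distribution. A short sketch (ours; the test distribution is random):

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array of any shape."""
    q = p[p > 0]
    return float(-(q * np.log2(q)).sum())

rng = np.random.default_rng(2)
p = rng.random((3, 4, 5)); p /= p.sum()      # generic joint distribution p[alpha, beta, gamma]

H_abg = H(p)
H_g   = H(p.sum((0, 1)))
H_ab  = H(p.sum(2))
H_ag  = H(p.sum(1))
H_bg  = H(p.sum(0))
H_a   = H(p.sum((1, 2)))
H_b   = H(p.sum((0, 2)))

H_g_given_a  = H_ag - H_a                    # H(gamma | alpha)
H_g_given_b  = H_bg - H_b                    # H(gamma | beta)
H_g_given_ab = H_abg - H_ab                  # H(gamma | alpha, beta)
I_ab         = H_a + H_b - H_ab              # I(alpha : beta)
I_ab_given_g = H_ag + H_bg - H_abg - H_g     # I(alpha : beta | gamma)

rhs = H_g_given_a + H_g_given_b + I_ab - I_ab_given_g - H_g_given_ab
print(np.isclose(H_g, rhs))                  # True: the identity used in the proof
```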
Assume that α and β are conditionally independent with respect to α' and β', and that α' and β' are conditionally independent of order k − 1.

We can assume without loss of generality that the two random variables (α', β') and γ are independent given (α, β). Indeed, consider random variables (α'', β'') defined by the following formula:

Pr(α'' = c, β'' = d | α = a, β = b, γ = g) = Pr(α' = c, β' = d | α = a, β = b).

The distribution of (α, β, α'', β'') is the same as the distribution of (α, β, α', β'), and (α'', β'') is independent from γ given (α, β).
From the relativized form of the inequality

H(γ) ≤ H(γ | α) + H(γ | β) + I(α : β) − H(γ | α, β)

(α' is added as a condition everywhere) it follows that

H(γ | α') ≤ H(γ | α, α') + H(γ | β, α') + I(α : β | α') − H(γ | α, β, α') ≤ H(γ | α) + H(γ | β) − H(γ | α, β, α').

Note that according to our assumption α' and γ are independent given α and β, so H(γ | α, β, α') = H(γ | α, β).

Using the upper bound for H(γ | α'), the similar bound for H(γ | β') and the induction assumption, we conclude that

H(γ) ≤ 2^k H(γ | α) + 2^k H(γ | β) − 2^k H(γ | α, β) − (2^k − 1) H(γ | α', β').

Applying the inequality

H(γ | α', β') ≥ H(γ | α, β, α', β') = H(γ | α, β),

we get the statement of the theorem.
VI. Rate Regions
Definition 7: The rate region of a pair of random variables α, β is the set of triples of real numbers (u, v, w) such that for all ε > 0, δ > 0 and sufficiently large n there exist

coding functions t, f and g, whose arguments are pairs (α^n, β^n) and whose values are binary strings of length ⌊(u + ε)n⌋, ⌊(v + ε)n⌋ and ⌊(w + ε)n⌋ (respectively), and

decoding functions r and s such that

r(t(α^n, β^n), f(α^n, β^n)) = α^n  and  s(t(α^n, β^n), g(α^n, β^n)) = β^n

with probability more than 1 − δ.

This definition (standard for multisource coding theory, see [3]) corresponds to the scheme of information transmission presented in Figure 1.

The following theorem was discovered by Vereshchagin. It gives a new constraint on the rate region when α and β are conditionally independent.
Fig. 1. Values of α^n and β^n are encoded by functions f, t and g and then transmitted via channels of limited capacity (dashed lines); decoder functions r and s have to reconstruct the values α^n and β^n with high probability having access only to a part of the transmitted information.
Theorem 4: Let α and β be k-conditionally independent random variables. Then

H(α) + H(β) ≤ v + w + (2 − 2^{−k}) u

for any triple (u, v, w) in the rate region.

(It is easy to see that H(α) ≤ u + v, since α^n can be reconstructed with high probability from strings of length approximately nu and nv. For similar reasons we have H(β) ≤ u + w. Therefore,

H(α) + H(β) ≤ v + w + 2u

for any α and β. Theorem 4 gives a stronger bound for the case when α and β are k-conditionally independent.)
Proof: Consider the random variables

γ = t(α^n, β^n),  γ_1 = f(α^n, β^n),  γ_2 = g(α^n, β^n)

from the definition of the rate region (for some fixed ε > 0).

By Theorem 1, we have

H(γ) ≤ 2^k ( H(γ | α^n) + H(γ | β^n) ).

We can rewrite this inequality as

2^{−k} H(γ) ≤ H((γ, α^n)) + H((γ, β^n)) − H(α^n) − H(β^n)

or

H(γ_1) + H(γ_2) + (2 − 2^{−k}) H(γ) ≥ H(γ_1) + H(γ_2) + 2H(γ) − H((γ, α^n)) − H((γ, β^n)) + H(α^n) + H(β^n).

We will prove the following inequality:

H(γ_1) + H(γ) − H((γ, α^n)) ≥ −cδn

for some constant c that does not depend on δ and for sufficiently large n. Using this inequality and the symmetric inequality

H(γ_2) + H(γ) − H((γ, β^n)) ≥ −cδn

we conclude that

H(γ_1) + H(γ_2) + (2 − 2^{−k}) H(γ) ≥ H(α^n) + H(β^n) − 2cδn.

Recall that the values of γ_1 are (v + ε)n-bit strings; therefore H(γ_1) ≤ (v + ε)n. Using similar arguments for γ and γ_2 and recalling that H(α^n) = nH(α) and H(β^n) = nH(β) (independence), we conclude that

(v + ε)n + (w + ε)n + (2 − 2^{−k})(u + ε)n ≥ nH(α) + nH(β) − 2cδn.

Dividing by n and recalling that ε and δ may be chosen arbitrarily small (according to the definition of the rate region), we get the statement of Theorem 4.

It remains to prove that

H(γ_1) + H(γ) − H((γ, α^n)) ≥ −cδn

for some c that does not depend on δ and for sufficiently large n. For that we need the following simple bound:
Lemma 12: Let η and η' be two random variables that coincide with probability 1 − δ, where δ < 1/2. Then

H(η') ≤ H(η) + 1 + δ log m,

where m is the number of possible values of η'.

Proof: Consider a new random variable τ with m + 1 values that is equal to η' if η ≠ η' and takes a special value ∗ if η = η'. We can use at most 1 + δ log m bits on average to encode τ (log m bits with probability δ, if η ≠ η', and one additional bit to distinguish between the cases η = η' and η ≠ η'). Therefore, H(τ) ≤ 1 + δ log m. If we know the values of η and τ, we can determine the value of η', therefore

H(η') ≤ H(η) + H(τ) ≤ H(η) + 1 + δ log m.

The statement of Lemma 12 remains true if η' can be reconstructed from η with probability at least 1 − δ (just replace η with the corresponding function of η).
Now recall that the pair (γ, α^n) can be reconstructed from γ_1 and γ (using the decoding function r) with probability 1 − δ. Therefore, H((γ, α^n)) does not exceed H((γ_1, γ)) + 1 + cδn (for some c and large enough n), because both γ and α^n have range of cardinality O(1)^n. It remains to note that H((γ_1, γ)) ≤ H(γ_1) + H(γ).
Acknowledgements
We thank participants of the Kolmogorov seminar, and especially Alexander Shen and Nikolai Vereshchagin, for the formulation of the problem, helpful discussions and comments.

We wish to thank Emily Cavalcanti, Daniel J. Webre and the referees for useful comments and suggestions.
References
[1] R. Ahlswede, J. Körner, "On the connection between the entropies of input and output distributions of discrete memoryless channels," Proceedings of the 5th Brasov Conference on Probability Theory, Brasov, 1974; Editura Academiei, Bucuresti, pp. 13-23, 1977.
[2] R. Ahlswede, J. Körner, "On common information and related characteristics of correlated information sources." [Online]. Available: www.mathematik.uni-bielefeld.de/ahlswede/homepage.
[3] I. Csiszár, J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Second Edition, Akadémiai Kiadó, 1997.
[4] P. Gács, J. Körner, "Common information is far less than mutual information," Problems of Control and Information Theory, vol. 2(2), pp. 149-162, 1973.
[5] A. E. Romashchenko, "Pairs of Words with Nonmaterializable Mutual Information," Problems of Information Transmission, vol. 36, no. 1, pp. 3-20, 2000.
[6] C. E. Shannon, "A mathematical theory of communication," Bell System Tech. J., vol. 27, pp. 379-423, 623-656, 1948.
[7] H. S. Witsenhausen, "On sequences of pairs of dependent random variables," SIAM J. Appl. Math., vol. 28, pp. 100-113, 1975.
[8] A. D. Wyner, "The Common Information of Two Dependent Random Variables," IEEE Trans. on Information Theory, vol. IT-21, pp. 163-179, 1975.