Chapter 1
Shannon perceived that the goals of approaching error-free digital communication on noisy
channels and of maximally efficient conversion of analog signals to digital form were dual
facets of the same problem, that they share a common framework and virtually a common
solution.
2. Less certain events ought to contain more information than more certain events.
3. The information of unrelated events taken as a single event should equal the sum of the
information of the unrelated events.
The formal term for unrelatedness is independence; two events a and B are said to be
independent if
P(a, B) = P(a)P(B)   (1.1.1)
These properties lead to defining the information of an event a as
I(a) = -log P(a) = log [1/P(a)]   (1.1.2)
The base of the logarithm merely specifies the scaling and hence the unit of information we
wish to use.
Sources of information, for example, are defined in terms of probabilistic models which emit
events or random variables.
(1.1.3)
Each unit of time, say every Ts seconds, the source emits a random variable which is
independent of all preceding and succeeding source outputs.
If we use natural logarithms, then our units are called nats and if we use logarithms to the
base 2, our units are called bits.
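As a quick numerical check of these unit conventions, the short sketch below (the event probability is chosen purely for illustration) evaluates the information of an event in both bits and nats.

import math

def self_information(p, base=2):
    # Information of an event with probability p; base 2 gives bits, base e gives nats.
    return -math.log(p, base)

p = 1 / 8                                  # illustrative event probability
print(self_information(p, base=2))         # 3.0 bits
print(self_information(p, base=math.e))    # about 2.079 nats, i.e., 3 * ln 2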
Fundamental Inequality: ln x <= x - 1, with equality if and only if x = 1.
(Binary memoryless source) For a DMS with alphabet U = {0, 1} with probabilities P(0) = p
and P(1) = 1 - p, we have entropy
H(U) = -p log2 p - (1 - p) log2 (1 - p) = H(p)
where H(p) is called the binary entropy function. Here H(p) <= 1 with equality if and only if
p = 1/2. When p = 1/2, we call this source a binary symmetric source (BSS). Each output of a
BSS contains 1 bit of information.
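A minimal numerical sketch of the binary entropy function (the values of p below are illustrative):

import math

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1-p)*log2(1-p) in bits; H(0) = H(1) = 0 by convention.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))    # 1.0 bit: the binary symmetric source
print(binary_entropy(0.11))   # about 0.5 bit: a biased source is more predictable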
For source output sequences of length N, we can define the average amount of information
per source output sequence as
H(U_1, U_2, ..., U_N) = -Σ P_N(u_1, u_2, ..., u_N) log P_N(u_1, u_2, ..., u_N)   (1.1.10)
where the sum is over all source sequences of length N.
The total average information in a sequence of independent outputs is the sum of the average
information in each output in the sequence.
H(U_1, U_2, ..., U_N) = Σ_{n=1}^{N} H(U_n)   (1.1.12)
If the N outputs are not independent, the equality (1.1.12) becomes only an upper bound.
H(U_1, U_2, ..., U_N) <= Σ_{n=1}^{N} H(U_n)   (1.1.14)
with equality if and only if the source outputs u_1, u_2, ..., u_N are independent; i.e., the
random variables u_1, u_2, ..., u_N are the outputs of a memoryless source.
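A small numerical check of this additivity, using an illustrative two-symbol memoryless source and N = 3 outputs:

import math
from itertools import product

def entropy(dist):
    # Entropy in bits of a distribution given as {symbol: probability}.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

single = {0: 0.25, 1: 0.75}        # illustrative single-output distribution
N = 3
joint = {seq: math.prod(single[u] for u in seq)   # independent outputs factor
         for seq in product(single, repeat=N)}

print(entropy(joint))              # equals N * H(U) for a memoryless source
print(N * entropy(single))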
For fixed N, the set of A^N binary sequences (codewords), one for each of the A^N source
sequences of length N, is called a code.
Since codeword lengths can be different, in order to be able to recover the original source
sequence from the binary symbols we require that no two distinct finite sequences of
codewords form the same overall binary sequence. Such codes are called uniquely
decodable.
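One simple way to guarantee unique decodability is to use a prefix code, in which no codeword is a prefix of any other. The sketch below uses an illustrative four-symbol codebook (not taken from the text) and shows that the concatenated binary stream parses back in only one way.

# Illustrative prefix code for a 4-symbol source.
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

def encode(symbols):
    return ''.join(code[s] for s in symbols)

def decode(bits):
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inverse:         # a complete codeword has been read
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)

msg = 'abacad'
assert decode(encode(msg)) == msg
print(encode(msg))                 # 01001100111 parses uniquely back to 'abacad'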
If the event B determines a with certainty (P(a | B) = 1), then the occurrence of B provides
us with all the information regarding a. That is
I(a; B) = log [1/P(a)] = I(a)   (1.2.1)
Hence I(a) is sometimes referred to as the self-information of the event a. Note that although
I(a) is always nonnegative, mutual information I(a;B) can assume negative values.
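A quick numerical illustration of that remark, assuming the pointwise definition I(a; B) = log [P(a | B)/P(a)] (the probabilities below are illustrative): when observing B makes a less likely, the mutual information is negative.

import math

p_a = 0.5            # prior probability of event a (illustrative)
p_a_given_b = 0.1    # a becomes less likely once B is observed (illustrative)

mutual_info = math.log2(p_a_given_b / p_a)   # I(a; B) in bits
self_info = -math.log2(p_a)                  # I(a)

print(mutual_info)   # about -2.32 bits: observing B argued against a
print(self_info)     # 1.0 bit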
A discrete memoryless channel (DMC) is characterized by a discrete input alphabet, a
discrete output alphabet, and a set of conditional probabilities for outputs given each of the
inputs.
Figure 1.4: Binary symmetric channel.
(1.2.2)
(1.2.3)
• Definition The memoryless discrete-input additive Gaussian noise channel is a
memoryless channel with discrete input alphabet {a_1, a_2, ..., a_Q}, output alphabet the
real line, and conditional probability density
p(y | a_k) = (1 / sqrt(2 π σ^2)) exp[ -(y - a_k)^2 / (2 σ^2) ]   (1.2.4)
where k = 1, 2, ..., Q.
• We define the average mutual information between inputs and outputs of the DMC
as
I(X; Y) = Σ_x Σ_y q(x) p(y | x) log [ p(y | x) / Σ_{x'} q(x') p(y | x') ]   (1.2.7)
where q(x) denotes the input probability distribution.
Definition The channel capacity of a DMC is the maximum average mutual information,
where the maximization is over all possible input probability distributions. That is,
C = max over q(x) of I(X; Y)
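A hedged sketch of this maximization for the binary symmetric channel: the snippet below computes the average mutual information of a BSC as a function of the input distribution and searches over it numerically (the crossover probability and grid resolution are illustrative). The maximum occurs at the uniform input and equals 1 - H(p).

import math

def mutual_information(q, p):
    # Average mutual information of a BSC with P(input 0) = q and crossover p, in bits.
    trans = {(0, 0): 1 - p, (0, 1): p, (1, 0): p, (1, 1): 1 - p}   # p(y|x)
    inputs = {0: q, 1: 1 - q}
    out = {y: sum(inputs[x] * trans[(x, y)] for x in (0, 1)) for y in (0, 1)}
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            joint = inputs[x] * trans[(x, y)]
            if joint > 0:
                total += joint * math.log2(trans[(x, y)] / out[y])
    return total

p = 0.11                                                      # illustrative crossover
capacity = max(mutual_information(q / 1000, p) for q in range(1001))
print(capacity)                                               # about 0.5 = 1 - H(0.11), at q = 1/2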
The incoming data which arrives at the rate of one symbol every Ts seconds is stored in an
input register until a block of K data symbols has been accumulated.
This block is then presented to the channel encoder as one of M possible messages, denoted
H_1, H_2, ..., H_M, where M = q^K and q is the size of the data alphabet. The combination of
encoder and modulator performs a mapping from the set of M messages onto a set of M
finite-energy signals x_1(t), x_2(t), ..., x_M(t), each of finite duration T = KT_s.
Over the finite interval 0 <= t < T, the M finite-energy signals x_1(t), x_2(t), ..., x_M(t),
representing the M block messages H_1, H_2, ..., H_M respectively, can be expressed as
linear combinations of N basis functions φ_1(t), ..., φ_N(t):
x_m(t) = Σ_{n=1}^{N} x_mn φ_n(t),   m = 1, 2, ..., M   (2.1.1)
The basis functions are orthonormal,
∫_0^T φ_n(t) φ_n'(t) dt = 1 if n = n', and 0 otherwise   (2.1.2)
and N<=M. In fact, N = M if and only if the signals are linearly independent.
x_mn = ∫_0^T x_m(t) φ_n(t) dt   (2.1.3)
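The Gram-Schmidt construction behind this representation can be sketched numerically. Below, the "signals" are already discrete sample vectors rather than waveforms (an assumption made purely for illustration); linearly dependent signals contribute no new basis vector, so N <= M.

import numpy as np

def gram_schmidt(signals, tol=1e-10):
    # Orthonormalize a list of sampled signal vectors; returns N <= M basis vectors.
    basis = []
    for x in signals:
        residual = x.astype(float)
        for phi in basis:
            residual -= np.dot(residual, phi) * phi   # remove projection onto existing basis
        norm = np.linalg.norm(residual)
        if norm > tol:                                # dependent signals add no new dimension
            basis.append(residual / norm)
    return basis

signals = [np.array([1.0, 1.0, 0.0]),
           np.array([1.0, -1.0, 0.0]),
           np.array([2.0, 0.0, 0.0])]   # dependent on the first two
basis = gram_schmidt(signals)
print(len(basis))                        # N = 2 < M = 3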
At this point the only disturbance that we will consider is additive white Gaussian noise.
The additive white Gaussian noise (AWGN) channel is modeled simply with a summing
junction
y(t) = x_m(t) + n(t),   0 <= t < T   (2.1.4)
where n(t) is a white Gaussian noise process with autocorrelation
E[n(t) n(t + τ)] = (N_0 / 2) δ(τ)   (2.1.5)
where δ(τ) is the Dirac delta function and N_0 is the one-sided noise power spectral density.
The demodulator-decoder can be regarded in general as a mapping from the received process
y(t) to a decision on the state of the original message
The demodulator projects the random process y(t) onto each of the modulator's basis
functions, thus generating the N integral inner products
y_n = ∫_0^T y(t) φ_n(t) dt,   n = 1, 2, ..., N   (2.1.6)
(2.1.7)
(2.1.8)
(2.1.9)
(2.1.10)
Returning to the representation (2.1.11) of y(t), while it is clear that the vector of observables
y = (y_1, y_2, ..., y_N) completely characterizes the terms of the summation, there remains the
term defined by (2.1.10), which depends only on the noise and not at all on the signals.
(2.1.11)
(2.1.12)
(2.1.13)
Any statistical decision regarding the transmitted message is based on the a priori
probabilities of the messages and on the conditional probabilities (densities or distributions)
of the measurements performed on y(t), generally called the observables.
(2.1.14)
Since the variables are Gaussian, their uncorrelatedness implies that they are also independent.
We then define the vector of N observables y = (y_1, y_2, ..., y_N).
The conditional probability density of y given the signal vector xm (or equivalently, given
that message Hm was sent) is
p_N(y | x_m) = Π_{n=1}^{N} (1 / sqrt(π N_0)) exp[ -(y_n - x_mn)^2 / N_0 ]   (2.1.15)
Hence, we conclude finally that the components of the original observable vector y are the
only data based on y(t) useful for the decision and thus represent sufficient statistics.
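A discrete-time sketch of this projection (2.1.6): here the received waveform is represented by its samples, and the orthonormal basis functions are sample vectors as well; the signals, the noise level N_0, and the sampling are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Two orthonormal "basis functions" represented as sampled vectors (illustrative).
phi = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]])
x_m = 2.0 * phi[0] - 1.0 * phi[1]            # transmitted signal with coefficients (2, -1)

N0 = 1.0
noise = rng.normal(0.0, np.sqrt(N0 / 2), size=x_m.shape)   # white Gaussian noise samples
y_t = x_m + noise                                           # received waveform y(t)

y = phi @ y_t        # inner products y_n = <y, phi_n>: the sufficient statistics
print(y)             # close to (2, -1) plus Gaussian perturbations of variance N0/2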
• p_N(y | x_m) = Π_{n=1}^{N} p(y_n | x_mn)   (2.1.16)
Any channel whose conditional (or transition) probability density (or distribution) satisfies
(2.1.16) is called a memoryless channel.
MINIMUM ERROR PROBABILITY AND MAXIMUM LIKELIHOOD DECODER
The goal of the decoder is to perform a mapping from the vector y to a decision on the
message transmitted.
When the vector y takes on some particular value (a real vector), we make a decision on the
transmitted message. The probability of an error in this decision is then
(2.2.1)
Since our criterion is to minimize the error probability in mapping each given y into a
decision, it follows that the optimum decision rule is to choose the message with the largest
a posteriori probability P(H_m | y):
(2.2.2)
(2.2.3)
In terms of the conditional probabilities of y given each H_m (usually called the likelihood
functions),
p_N(y | H_m) = p_N(y | x_m)   (2.2.4)
This last relation follows from the fact that the mapping from Hm to xm , which is the coding
operation, is deterministic and one-to-one.
These likelihood functions, which are in fact the channel characterization (Fig. 2.2), are also
called the channel transition probabilities.
(2.2.5)
so,
(2.2.6)
• For a memoryless channel as defined by (2.1.16), this decision simplifies further to: decide H_m
if   (2.2.7)
• Another useful interpretation of the above, consistent with our original view of the
decoder as a mapping, is that the decision rule (2.2.6) or (2.2.7) defines a partition of
the N-dimensional space of all observable vectors y into M decision regions
Λ_1, Λ_2, ..., Λ_M, where
(2.2.8)
(2.2.9)
It then follows that the union of the regions covers the entire N-dimensional observation
space
(2.2.11)
In most cases of interest, the message a priori probabilities are all equal; that is,
P(H_1) = P(H_2) = ... = P(H_M) = 1/M   (2.2.13)
The maximum likelihood decoder depends only on the channel, and is often robust (strong) in
the sense that it gives the same or nearly the same error probability for each message
regardless of the true message a priori probabilities.
For a memoryless channel, the logarithm of the likelihood function (2.2.4) is commonly
called the metric; thus a maximum likelihood decoder computes the metrics for each possible
signal vector, compares them, and decides in favor of the maximum.
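A minimal sketch of such a decoder for the memoryless AWGN channel, where the metric log p_N(y | x_m) reduces, up to constants common to all messages, to the negative squared Euclidean distance; the signal set, noise level, and random seed below are illustrative.

import numpy as np

def ml_decode(y, codebook, N0):
    # Return the index of the signal vector with the largest log-likelihood metric.
    # For AWGN, log p(y | x_m) = const - ||y - x_m||^2 / N0, so maximize the negative distance.
    metrics = [-np.sum((y - x) ** 2) / N0 for x in codebook]
    return int(np.argmax(metrics))

codebook = [np.array([+1.0, +1.0]),      # illustrative M = 4 biphase signal vectors
            np.array([+1.0, -1.0]),
            np.array([-1.0, +1.0]),
            np.array([-1.0, -1.0])]

rng = np.random.default_rng(1)
N0 = 0.5
sent = 2
y = codebook[sent] + rng.normal(0.0, np.sqrt(N0 / 2), size=2)
print(ml_decode(y, codebook, N0))        # most likely 2; errors occur when the noise is large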
Given that message H_m (signal vector x_m) was sent and a given observation vector y was
received, an error will occur if y is not in Λ_m (denoted y ∉ Λ_m). Since y is a
random vector, the probability of error when x_m is sent is then
(2.3.1)
We use a common symbol to denote summation or integration over a subspace of the
observation space. Thus, for continuous channels (such as the AWGN channel) with
N-dimensional observation vectors, it represents an N-dimensional integration and p_N(·) is
a density function. On the other hand, for discrete channels where both the x_m and y vector
components are elements of a finite symbol alphabet, it represents an N-fold summation and
p_N(·) represents a discrete distribution.
The overall error probability is then the average of the message error probabilities
(2.3.2)
(2.3.3)
Note that each of the terms of the union is actually the decision region for x_m' when there
are only the two signals (messages) x_m and x_m'.
P(error | x_m sent) <= Σ_{m' ≠ m} Pr{ y ∈ Λ_mm' | x_m sent }   (2.3.4)
Each such term denotes the pairwise error probability when x_m is sent and x_m' is the only
alternative.
We note that the inequality (2.3.4) becomes an equality whenever the regions Λ_mm' are
disjoint, which occurs only in the trivial case where M = 2. For obvious reasons the bound of
(2.3.4) is called a union bound.
Pr{ y ∈ Λ_mm' | x_m sent } <= Σ_y sqrt[ p_N(y | x_m) p_N(y | x_m') ]   (2.3.15)
The expression (2.3.15) is called the Bhattacharyya bound, and its negative logarithm the
Bhattacharyya distance.
Combining the union bound (2.3.4) with the general Bhattacharyya bound (2.3.15), we obtain
finally a bound on the error probability for the mth message
P(error | x_m sent) <= Σ_{m' ≠ m} Σ_y sqrt[ p_N(y | x_m) p_N(y | x_m') ]   (2.3.16)
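As a concrete (and hedged) illustration: for a BSC with crossover probability p, the per-symbol factorization of a memoryless channel reduces (2.3.15) to [2 sqrt(p(1 - p))]^d for two codewords differing in d positions; this closed form is an assumption following from that factorization, not stated above. The sketch below sums these terms as in (2.3.16) for an illustrative small code.

import numpy as np

def union_bhattacharyya_bound(codewords, m, p):
    # Bound on P(error | codeword m sent) over a BSC with crossover probability p.
    z = 2.0 * np.sqrt(p * (1.0 - p))            # per-symbol Bhattacharyya parameter (assumed form)
    xm = np.array(codewords[m])
    bound = 0.0
    for mp, x in enumerate(codewords):
        if mp != m:
            d = int(np.sum(xm != np.array(x)))  # Hamming distance between the two codewords
            bound += z ** d
    return bound

codewords = [(0, 0, 0, 0, 0), (1, 1, 1, 0, 0), (0, 0, 1, 1, 1), (1, 1, 0, 1, 1)]  # illustrative
print(union_bhattacharyya_bound(codewords, 0, p=0.05))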
<skipped>
When the union bound fails to give useful results, a more refined technique will invariably
yield an improved bound which is tight over a significantly wider range.
<skipped>
(2.5.13)
Rate,
R = (log_2 q) / T_s bits per second   (2.5.14)
as is appropriate since we assumed that the source emits one of q equally likely symbols once
every Ts seconds.
For q = 2, we define the energy per bit as the energy per signal normalized by the number of
bits transmitted per signal, that is
(2.5.17)
(2.5.18)
M=2^K
The only limitation on dimensionality discussed thus far was the one inherent in the fact that
M signals defined over a T-second interval can be represented using no more than M
orthogonal basis functions, or dimensions, as established by the Gram-Schmidt theorem
These orthogonal functions (or signal sets) can take on an infinite multitude of forms.
(2.6.1)
All other channels would have no effect on the given channel's performance since the basis
functions of the other channels are orthogonal to those assigned to the given channel and
consequently would add zero components to the integrator outputs y1, y2 , ..., yN of the
demodulator of Figure 2.1b.
Aside from the obvious impossibility of regulating all channels to adopt a common
modulation system with identical timing, there is the fact that, inherent in the transmitter,
receiver, and transmission medium, there is a linear distortion which causes some frequencies
to be attenuated more than others. This leads to the necessity of defining bandwidth in terms
of the signal spectrum. The spectral density of the transmission just described is actually
nonzero for all frequencies, although its envelope decreases in proportion to the frequency
separation from the carrier frequency ω_0. This, in fact, is a property of all time-limited signals.
LINEAR CODES
We may label each of the M = q^K messages, as in Fig. 2.1, by a K-vector over a q-ary
alphabet.
Then the encoding becomes a one-to-one mapping from the set of message vectors into the
set of signal vectors.
The coder then consists of L modulo-2 adders, each of which adds together a subset of the K
data symbols to generate one code symbol v_n, where n = 1, 2, ..., L:
v_n = Σ_{k=1}^{K} g_kn u_k   (modulo 2),   n = 1, 2, ..., L   (2.9.1)
Thus the first stage of the linear coding operation for binary data can be represented by
v = u G   (2.9.3)
where
G = [g_kn] is the K x L binary generator matrix.   (2.9.4a)
Note that the set of all possible codewords is the space spanned by the row vectors of G. The
rows of G then form a basis with the information bits being the basis coefficients for the
codeword. Since the basis vectors are not unique for any linear space, it is clear that there are
many generator matrices that give the same set of codewords.
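A minimal sketch of this point over GF(2), with an illustrative K = 2, L = 5 generator matrix: encoding is v = uG modulo 2, and replacing one row of G by the modulo-2 sum of two rows leaves the set of codewords unchanged.

import numpy as np
from itertools import product

def codebook(G):
    # All 2^K codewords generated by v = uG (mod 2).
    K = G.shape[0]
    return {tuple(np.mod(np.array(u) @ G, 2)) for u in product((0, 1), repeat=K)}

G = np.array([[1, 0, 1, 1, 0],     # illustrative K = 2, L = 5 generator matrix
              [0, 1, 0, 1, 1]])

G2 = G.copy()
G2[0] = np.mod(G2[0] + G2[1], 2)   # replace row 1 by row 1 + row 2 (mod 2)

print(codebook(G) == codebook(G2))  # True: same linear code, relabeled data vectors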
For the simplest cases of biphase or quadriphase modulation, we need only a one-
dimensional mapping so that, in fact, L = N
(2.9.5)
Similarly, for the 16-phase modulation scheme of Fig.2.12b, we must take L = 4N and use
four consecutive v-symbols to select one of the 16-phase x-symbols.
• inverse property - Every vector is its own negative (or additive inverse) under the
operation of modulo-2 addition.
v ⊕ v = 0   (2.9.8)
v ⊕ 0 = v   (2.9.7)
When a set satisfies the closure property (2.9.6), the identity property (2.9.7), and the inverse
property (2.9.8) under an operation which is associative and commutative it is called an
Abelian group.
Hence linear codes are also called group codes and parity-check codes.
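A short check of these group properties on an illustrative binary linear code (the generator matrix is chosen arbitrarily): the codeword set is closed under componentwise modulo-2 addition, contains the all-zeros identity, and every codeword is its own inverse.

import numpy as np
from itertools import product

G = np.array([[1, 0, 0, 1, 1],     # illustrative K = 3, L = 5 binary generator matrix
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1]])
code = {tuple(np.mod(np.array(u) @ G, 2)) for u in product((0, 1), repeat=3)}

zero = (0,) * G.shape[1]
assert zero in code                                        # identity element
assert all(tuple(np.mod(np.array(a) + np.array(b), 2)) in code
           for a in code for b in code)                    # closure under mod-2 addition
assert all(tuple(np.mod(np.array(a) + np.array(a), 2)) == zero for a in code)  # self-inverse
print("group properties hold")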
• The set of Hamming distances from a given code vector to the (M-1) other code
vectors is the same for all code vectors.
(2.9.10)
(2.9.11)
(2.9.12)
(2.9.13)
A linear code is defined as one whose code vectors are generated from the data vectors by
the linear mapping
v = u G   (2.10.1)
Of particular interest is the generator matrix in systematic form
G = [ I_K   P ]   (2.10.2)
where I_K is the K x K identity matrix and P is a K x (L - K) binary matrix.
A linear code (2.10.1) generated by the matrix (2.10.2) has its first K code symbols identical
to the data symbols, that is
v_n = u_n,   n = 1, 2, ..., K   (2.10.3a)
v_n = Σ_{k=1}^{K} u_k p_{k, n-K}   (modulo 2),   n = K + 1, K + 2, ..., L   (2.10.3b)
Such a code, which transmits the original K data symbols unchanged together with L-K
"parity-check" symbols is called a systematic code.
Interchanging any two rows of G or adding together modulo-2 any combination of rows does
not alter the set of code vectors generated; it simply relabels (changes the order of) them.
Thus, whenever the code-generator matrix has linearly independent rows and nonzero
columns, the code is equivalent, except for relabeling of code vectors and possibly
reordering of the columns, to a systematic code generated by (2.10.2).
(2.10.5)
Thus decoding might be performed by taking the weight of the vector formed by adding
modulo-2 the received binary vector y to each possible code vector and deciding in favor of
the message whose code vector results in the lowest weight.
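A minimal sketch of exactly that rule on an illustrative small code: the decoder forms y ⊕ v for every code vector v and picks the one of least weight, which for a BSC with crossover probability below 1/2 is the maximum likelihood choice.

import numpy as np
from itertools import product

G = np.array([[1, 0, 1, 1, 0],                 # illustrative generator matrix
              [0, 1, 0, 1, 1]])
codewords = [np.mod(np.array(u) @ G, 2) for u in product((0, 1), repeat=2)]

y = np.array([1, 0, 1, 0, 0])                  # received vector with one channel error
weights = [int(np.sum(np.mod(y + v, 2))) for v in codewords]   # weight of y XOR v
best = int(np.argmin(weights))
print(codewords[best])                         # [1 0 1 1 0]: closest codeword in Hamming distance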
H^T is the L x (L - K) matrix
(2.10.6b)
(2.10.7)
(2.10.10)
(2.10.8)
For a given received vector y and corresponding syndrome vector s, (2.10.10) will have 2^K
solutions, one for each possible transmitted code vector.
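A hedged sketch of the syndrome computation s = y H^T for a systematic code. The parity-check matrix is built here as H = [P^T  I_{L-K}], a construction assumed to be consistent with the systematic generator (2.10.2) but not spelled out above; the code, data vector, and error pattern are illustrative.

import numpy as np

K, L = 3, 6
P = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])                                  # illustrative parity part
G = np.hstack([np.eye(K, dtype=int), P])                   # G = [I_K | P]
H = np.hstack([P.T, np.eye(L - K, dtype=int)])             # assumed H = [P^T | I_{L-K}]

u = np.array([1, 1, 0])
v = np.mod(u @ G, 2)                                       # transmitted code vector
e = np.array([0, 0, 0, 1, 0, 0])                           # single channel error
y = np.mod(v + e, 2)

print(np.mod(v @ H.T, 2))   # [0 0 0]: every codeword has zero syndrome
print(np.mod(y @ H.T, 2))   # equals e @ H.T: the syndrome depends only on the error pattern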
A maximum likelihood decoder for any binary code on the BSC will decode correctly if
(2.10.12)
(2.10.13)
<skipped>