0% found this document useful (0 votes)

240 views

Information Theory A Tutorial Introduction-1-20

This document provides an introduction to information theory as developed by Claude Shannon. Some key points: 1) Information theory defines limits on how much information can be transmitted through any communication channel based on factors like channel capacity and noise. 2) Information is quantifiable and can be treated like a physical quantity. The amount of information is measured in units called bits. 3) Shannon's theory established that information is the amount needed to make a choice between equally probable alternatives, with each bit of information allowing for one binary decision. The more alternatives there are, the more bits of information are needed.

Uploaded by

Zer R

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

240 views

Information Theory A Tutorial Introduction-1-20

Uploaded by

Zer R

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

Information Theory: A Tutorial Introduction

James V Stone, Psychology Department, University of Sheffield, England.

[email protected]
File: main InformationTheory JVStone v3.tex
arXiv:1802.05968v2 [cs.IT] 20 Feb 2018

Abstract

Shannon’s mathematical theory of communication defines fundamental limits on

how much information can be transmitted between the different components of any
man-made or biological system. This paper is an informal but rigorous introduction to
the main ideas implicit in Shannon’s theory. An annotated reading list is provided for
further reading.

1 Introduction

In 1948, Claude Shannon published a paper called A Mathematical Theory of

Communication[1]. This paper heralded a transformation in our understanding of
information. Before Shannon’s paper, information had been viewed as a kind of poorly
defined miasmic fluid. But after Shannon’s paper, it became apparent that information is
a well-defined and, above all, measurable quantity. Indeed, as noted by Shannon,

A basic idea in information theory is that information can be treated very much
like a physical quantity, such as mass or energy.
Caude Shannon, 1985.

Message Noise Message

s η
s
x y
Encoding Channel Decoding

Figure 1: The communication channel. A message (data) is encoded before being used as
input to a communication channel, which adds noise. The channel output is decoded by a
receiver to recover the message.

1
Information theory defines definite, unbreachable limits on precisely how much
information can be communicated between any two components of any system, whether
this system is man-made or natural. The theorems of information theory are so important
that they deserve to be regarded as the laws of information[2, 3, 4].

The basic laws of information can be summarised as follows. For any communication
channel (Figure 1): 1) there is a definite upper limit, the channel capacity, to the amount
of information that can be communicated through that channel, 2) this limit shrinks as
the amount of noise in the channel increases, 3) this limit can very nearly be reached by
judicious packaging, or encoding, of data.

2 Finding a Route, Bit by Bit

Information is usually measured in bits, and one bit of information allows you to choose
between two equally probable, or equiprobable, alternatives. In order to understand why
this is so, imagine you are standing at the fork in the road at point A in Figure 2, and
that you want to get to the point marked D. The fork at A represents two equiprobable
alternatives, so if I tell you to go left then you have received one bit of information. If
we represent my instruction with a binary digit (0=left and 1=right) then this binary digit
provides you with one bit of information, which tells you which road to choose.

Now imagine that you come to another fork, at point B in Figure 2. Again, a binary
digit (1=right) provides one bit of information, allowing you to choose the correct road,
which leads to C. Note that C is one of four possible interim destinations that you could

0 ----- 0 0 0 = 0
0
1 -----0 0 1 = 1
B
0 0 -----0 1 0 = 2
1
Left C
1 D --0 1 1 = 3
A
0 ----- 1 0 0 = 4
Right
0 ----- 1 0 1 = 5
1 1
0 ----- 1 1 0 = 6
1
1 ----- 1 1 1 = 7

Figure 2: For a traveller who does not know the way, each fork in the road requires one bit
of information to make a correct decision. The 0s and 1s on the right-hand side summarise
the instructions needed to arrive at each destination; a left turn is indicated by a 0 and a
right turn by a 1.

2
have reached after making two decisions. The two binary digits that allow you to make
the correct decisions provided two bits of information, allowing you to choose from four
(equiprobable) alternatives; 4 equals 2 × 2 = 22 .

A third binary digit (1=right) provides you with one more bit of information, which
allows you to again choose the correct road, leading to the point marked D. There are now
eight roads you could have chosen from when you started at A, so three binary digits (which
provide you with three bits of information) allow you to choose from eight equiprobable
alternatives, which also equals 2 × 2 × 2 = 23 = 8.

We can restate this in more general terms if we use n to represent the number of forks,
and m to represent the number of final destinations. If you have come to n forks then you
have effectively chosen from m = 2n final destinations. Because the decision at each fork
requires one bit of information, n forks require n bits of information.

Viewed from another perspective, if there are m = 8 possible destinations then the
number of forks is n = 3, which is the logarithm of 8. Thus, 3 = log2 8 is the number
of forks implied by eight destinations. More generally, the logarithm of m is the power to
which 2 must be raised in order to obtain m; that is, m = 2n . Equivalently, given a number
m, which we wish to express as a logarithm, n = log2 m. The subscript 2 indicates that we
are using logs to the base 2 (all logarithms in this book use base 2 unless stated otherwise).

3 Bits Are Not Binary Digits

The word bit is derived from binary digit, but a bit and a binary digit are fundamentally
different types of quantities. A binary digit is the value of a binary variable, whereas a bit
is an amount of information. To mistake a binary digit for a bit is a category error. In this
case, the category error is not as bad as mistaking marzipan for justice, but it is analogous
to mistaking a pint-sized bottle for a pint of milk. Just as a bottle can contain between
zero and one pint, so a binary digit (when averaged over both of its possible states) can
convey between zero and one bit of information.

4 Information and Entropy

Consider a coin which lands heads up 90% of the time (i.e. p(xh ) = 0.9). When this coin
is flipped, we expect it to land heads up (x = xh ), so when it does so we are less surprised
than when it lands tails up (x = xt ). The more improbable a particular outcome is, the
more surprised we are to observe it. If we use logarithms to the base 2 then the Shannon

3
information or surprisal of each outcome, such as a head xh , is measured in bits (see Figure
3a)

1
Shannon information = log bits, (1)
p(xh )

which is often expressed as: information = − log p(xh ) bits.

Entropy is Average Shannon Information. We can represent the outcome of a coin flip
as the random variable x, such that a head is x = xh and a tail is x = xt . In practice, we
are not usually interested in the surprise of a particular value of a random variable, but we
are interested in how much surprise, on average, is associated with the entire set of possible
values. The average surprise of a variable x is defined by its probability distribution p(x),
and is called the entropy of p(x), represented as H(x).

The Entropy of a Fair Coin. The average amount of surprise about the possible outcomes
of a coin flip can be found as follows. If a coin is fair or unbiased then p(xh ) = p(xt ) = 0.5
then the Shannon information gained when a head or a tail is observed is log 1/0.5 = 1 bit,
so the average Shannon information gained after each coin flip is also 1 bit. Because entropy
is defined as average Shannon information, the entropy of a fair coin is H(x) = 1 bit.

The Entropy of an Unfair (Biased) Coin. If a coin is biased such that the probability
of a head is p(xh ) = 0.9 then it is easy to predict the result of each coin flip (i.e. with 90%
accuracy if we predict a head for each flip). If the outcome is a head then the amount of
Shannon information gained is log(1/0.9) = 0.15 bits. But if the outcome is a tail then

7
Surprise = log 1/p(x)

0
0 0.2 0.4 0.6 0.8 1
p(x)
(a) (b)

Figure 3: a) Shannon information as surprise. Values of x that are less probable have larger
values of surprise, defined as log2 (1/p(x)) bits. b) Graph of entropy H(x) versus coin bias
(probability p(xh ) of a head). The entropy of a coin is the average amount of surprise or
Shannon information in the distribution of possible outcomes (i.e. heads and tails).

4
the amount of Shannon information gained is log(1/0.1) = 3.32 bits. Notice that more
information is associated with the more surprising outcome. Given that the proportion of
flips that yield a head is p(xh ), and that the proportion of flips that yield a tail is p(xt )
(where p(xh ) + p(xt ) = 1), the average surprise is

1 1
H(x) = p(xh ) log + p(xt ) log , (2)
p(xh ) p(xt )

which comes to H(x) = 0.469 bits, as in Figure 3b. If we define a tail as x1 = xt and a
head as x2 = xh then Equation 2 can be written as

2
X 1
H(x) = p(xi ) log bits. (3)
p(xi )
i=1

More generally, a random variable x with a probability distribution p(x) =

{p(x1 ), . . . , p(xm )} has an entropy of
m
X 1
H(x) = p(xi ) log bits. (4)
p(xi )
i=1

The reason this definition matters is because Shannon’s source coding theorem (see Section
7) guarantees that each value of the variable x can be represented with an average of (just
over) H(x) binary digits. However, if the values of consecutive values of a random variable
are not independent then each value is more predictable, and therefore less surprising, which
reduces the information-carrying capability (i.e. entropy) of the variable. This is why it is
important to specify whether or not consecutive variable values are independent.

Interpreting Entropy. If H(x) = 1 bit then the variable x could be used to represent
m = 2H(x) or 2 equiprobable values. Similarly, if H(x) = 0.469 bits then the variable x

0.18
0.16
Outcome probability

0.14
0.12
0.1
0.08
0.06
0.04
0.02
0
2 3 4 5 6 7 8 9 10 11 12
Outcome value

(a) (b)

Figure 4: (a) A pair of dice. (b) Histogram of dice outcome values.

5
could be used to represent m = 20.469 or 1.38 equiprobable values; as if we had a die
with 1.38 sides. At first sight, this seems like an odd statement. Nevertheless, translating
entropy into an equivalent number of equiprobable values serves as an intuitive guide for
the amount of information represented by a variable.

Dicing With Entropy. Throwing a pair of 6-sided dice yields an outcome in the form of
an ordered pair of numbers, and there are a total of 36 equiprobable outcomes, as shown
in Table 1. If we define an outcome value as the sum of this pair of numbers then there
are m = 11 possible outcome values Ax = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, represented by the
symbols x1 , . . . , x11 . These outcome values occur with the frequencies shown in Figure 4b
and Table 1. Dividing the frequency of each outcome value by 36 yields the probability
P of each outcome value. Using Equation 4, we can use these 11 probabilities to find the
entropy

1 1 1
H(x) = p(x1 ) log + p(x2 ) log + · · · + p(x11 ) log
p(x1 ) p(x2 ) p(x11 )
= 3.27 bits.

Using the interpretation described above, a variable with an entropy of 3.27 bits can
represent 23.27 = 9.65 equiprobable values.

Entropy and Uncertainty. Entropy is a measure of uncertainty. When our uncertainty

is reduced, we gain information, so information and entropy are two sides of the same coin.
However, information has a rather subtle interpretation, which can easily lead to confusion.

Average information shares the same definition as entropy, but whether we call a given
quantity information or entropy usually depends on whether it is being given to us or taken

Symbol Sum Outcome Frequency P Surprisal

x1 2 1:1 1 0.03 5.17
x2 3 1:2, 2:1 2 0.06 4.17
x3 4 1:3, 3:1, 2:2 3 0.08 3.59
x4 5 2:3, 3:2, 1:4, 4:1 4 0.11 3.17
x5 6 2:4, 4:2, 1:5, 5:1, 3:3 5 0.14 2.85
x6 7 3:4, 4:3, 2:5, 5:2, 1:6, 6:1 6 0.17 2.59
x7 8 3:5, 5:3, 2:6, 6:2, 4:4 5 0.14 2.85
x8 9 3:6, 6:3, 4:5, 5:4 4 0.11 3.17
x9 10 4:6, 6:4, 5:5 3 0.08 3.59
x10 11 5:6, 6:5 2 0.06 4.17
x11 12 6:6 1 0.03 5.17

Table 1: A pair of dice have 36 possible outcomes.

Sum: outcome value, total number of dots for a given throw of the dice.
Outcome: ordered pair of dice numbers that could generate each symbol.
Freq: number of different outcomes that could generate each outcome value.
P : the probability that the pair of dice yield a given outcome value (freq/36).
Surprisal: P log(1/P ) bits.

6
away. For example, if a variable has high entropy then our initial uncertainty about the value
of that variable is large and is, by definition, exactly equal to its entropy. If we are told the
value of that variable then, on average, we have been given an amount of information equal
to the uncertainty (entropy) we initially had about its value. Thus, receiving an amount of
information is equivalent to having exactly the same amount of entropy (uncertainty) taken
away.

5 Entropy of Continuous Variables

For discrete variables, entropy is well-defined. However, for all continuous variables, entropy
is effectively infinite. Consider the difference between a discrete variable xd with n possible
values and a continuous variable xc with an uncountably infinite number of possible values;
for simplicity, assume that all values are equally probable. The probability of observing
each value of the discrete variable is Pd = 1/m, so the entropy of xd is H(xd ) = log m. In
contrast, the probability of observing each value of the continuous variable is Pc = 1/∞ = 0,
so the entropy of xc is H(xc ) = log ∞ = ∞. In one respect, this makes sense, because each
value of of a continuous variable is implicitly specified with infinite precision, from which
it follows that the amount of information conveyed by each value is infinite. However,
this result implies that all continuous variables have the same entropy. In order to assign
different values to different variables, all infinite terms are simply ignored, which yields the
differential entropy
Z
1
H(xc ) = p(xc ) log dxc . (5)
p(xc )

It is worth noting that the technical difficulties associated with entropy of continuous
variables disappear for quantities like mutual information, which involve the difference
between two entropies. For convenience, we use the term entropy for both continuous
and discrete variables below.

6 Maximum Entropy Distributions

A distribution of values that has as much entropy (information) as theoretically possible is

a maximum entropy distribution. Maximum entropy distributions are important because,
if we wish to use a variable to transmit as much information as possible then we had better
make sure it has maximum entropy. For a given variable, the precise form of its maximum
entropy distribution depends on the constraints placed on the values of that variable[3]. It
will prove useful to summarise three important maximum entropy distributions. These are
listed in order of decreasing numbers of constraints below.

7
The Gaussian Distribution. If a variable x has a fixed variance, but is otherwise
unconstrained, then the maximum entropy distribution is the Gaussian distribution (Figure
5a). This is particularly important in terms of energy efficiency because no other distribution
can provide as much information at a lower energy cost per bit.

If a variable has a Gaussian or normal distribution then the probability of observing a

particular value x is

1 2
p(x) = √ e−(µx −x) /(2vx ) , (6)
2πvx

where e = 2.7183. This equation defines the bell-shaped curve in Figure 5a. The term µx is
the mean of the variable x, and defines the central value of the distribution; we assume that
all variables have a mean of zero (unless stated otherwise). The term vx is the variance of
the variable x, which is the square of the standard deviation σx of x, and defines the width
of the bell curve. Equation 6 is a probability density function, and (strictly speaking) p(x)
is the probability density of x.

The Exponential Distribution. If a variable has no values below zero, and has a
fixed mean µ, but is otherwise unconstrained, then the maximum entropy distribution
is exponential,

1 −x/µ
p(x) = e , (7)
µ

which has a variance of var(x) = µ2 , as shown in Figure 5b.

The Uniform Distribution. If a variable has a fixed lower bound xmin and upper bound
xmax , but is otherwise unconstrained, then the maximum entropy distribution is uniform,

p(x) = 1/(xmax − xmin ), (8)

0.4 1 0.5
0.9
0.35
0.8 0.4
0.3 0.7
0.25 0.6 0.3
p(x)

p(x)

σ 0.5
0.2
0.4 0.2
0.15
0.3
0.1
0.2 0.1
0.05 0.1
0 0 0
-3 -2 -1 0 1 2 3 0 1 2 3 4 5 -0.5 0 0.5 1 1.5 2 2.5
x x x

(a) (b) (c)

Figure 5: Maximum entropy distributions. a) Gaussian distribution, with mean µ = 0 and

a standard deviation σ = 1 (Equation 6). b) Exponential distribution, with mean indicated
by the vertical line (Equation 7). c) A uniform distribution with a range between zero and
two (Equation 8).

8
which has a variance (xmax − xmin )2 /12, as shown in Figure 5c.

7 Channel Capacity

A very important (and convenient) channel is the additive channel. As encoded values pass
through an additive channel, noise η (eta) is added, so that the channel output is a noisy
version y of the channel input x

y = x + η. (9)

The channel capacity C is the maximum amount of information that a channel can provide
at its output about the input.

The rate at which information is transmitted through the channel depends on the
entropies of three variables: 1) the entropy H(x) of the input, 2) the entropy H(y) of
the output, 3) the entropy H(η) of the noise in the channel. If the output entropy is high
then this provides a large potential for information transmission, and the extent to which
this potential is realised depends on the input entropy and the level of noise. If the noise
is low then the output entropy can be close to the channel capacity. However, channel
capacity gets progressively smaller as the noise increases. Capacity is usually expressed in
bits per usage (i.e. bits per output), or bits per second (bits/s).
Binary digits transmitted

Noise
Capacity
Channel

Capacity
Channel

Noiseless Noisy
Channel Channel

Figure 6: The channel capacity of noiseless and noisy channels is the maximum rate at
which information can be communicated. If a noiseless channel communicates data at 10
binary digits/s then its capacity is C = 10 bits/s. The capacity of a noiseless channel is
numerically equal to the rate at which it communicates binary digits, whereas the capacity
of a noisy channel is less than this because it is limited by the amount of noise in the
channel.

9
8 Shannon’s Source Coding Theorem

Shannon’s source coding theorem, described below, applies only to noiseless channels. This
theorem is really about re-packaging (encoding) data before it is transmitted, so that, when
it is transmitted, every datum conveys as much information as possible. This theorem is
highly relevant to the biological information processing because it defines definite limits to
how efficiently sensory data can be re-packaged. We consider the source coding theorem
using binary digits below, but the logic of the argument applies equally well to any channel
inputs.

Given that a binary digit can convey a maximum of one bit of information, a noiseless
channel which communicates R binary digits per second can communicate information at
the rate of up to R bits/s. Because the capacity C is the maximum rate at which it can
communicate information from input to output, it follows that the capacity of a noiseless
channel is numerically equal to the number R of binary digits communicated per second.
However, if each binary digit carries less than one bit (e.g. if consecutive output values are
correlated) then the channel communicates information at a lower rate R < C.

Now that we are familiar with the core concepts of information theory, we can quote
Shannon’s source coding theorem in full. This is also known as Shannon’s fundamental
theorem for a discrete noiseless channel, and as the first fundamental coding theorem.

Let a source have entropy H (bits per symbol) and a channel have a capacity C
(bits per second). Then it is possible to encode the output of the source in such
a way as to transmit at the average rate C/H − symbols per second over the
channel where is arbitrarily small. It is not possible to transmit at an average
rate greater than C/H [symbols/s].
Shannon and Weaver, 1949[2].
[Text in square brackets has been added by the author.]

Recalling the example of the sum of two dice, a naive encoding would require 3.46
(log 11) binary digits to represent the sum of each throw. However, Shannon’s source
coding theorem guarantees that an encoding exists such that an average of (just over) 3.27
(i.e. log 9.65) binary digits per value of s will suffice (the phrase ‘just over’ is an informal
interpretation of Shannon’s more precise phrase ‘arbitrarily close to’).

This encoding process yields inputs with a specific distribution p(x), where there
are implicit constraints on the form of p(x) (e.g.power constraints). The shape of the
distribution p(x) places an upper limit on the entropy H(x), and therefore on the maximum
information that each input can carry. Thus, the capacity of a noiseless channel is defined

10
in terms of the particular distribution p(x) which maximises the amount of information per
input

C = max H(x) bits per input. (10)

p(x)

This states that channel capacity C is achieved by the distribution p(x) which makes H(x)
as large as possible (see Section 6).

9 Noise Reduces Channel Capacity

Here, we examine how noise effectively reduces the maximum information that a channel
can communicate. If the number of equiprobable (signal) input states is mx then the input
entropy is

H(x) = log mx bits. (11)

For example, suppose there are mx = 3 equiprobable input states, say, x1 = 100 and
x2 = 200 and x3 = 300, so the input entropy is H(x) = log 3 = 1.58 bits. And if there
are mη = 2 equiprobable values for the channel noise, say, η1 = 10 and η2 = 20, then the
noise entropy is H(η) = log 2 = 1.00 bit.

Now, if the input is x1 = 100 then the output can be one of two equiprobable states,
y1 = 100 + 10 = 110 or y2 = 100 + 20 = 120. And if the input is x2 = 200 then the

Possible inputs
for one output = 2H(x|y) Outputs y

2H(x|y)

Number of Number of
inputs = 2H(x) outputs = 2H(y)

2H(y|x)

Inputs x
Possible outputs
for one input = 2H(y|x)

Figure 7: A fan diagram shows how channel noise affects the number of possible outputs
given a single input, and vice versa. If the noise η in the channel output has entropy
H(η) = H(Y |X) then each input value could yield one of 2H(Y |X) equally probable output
values. Similarly, if the noise in the channel input has entropy H(X|Y ) then each output
value could have been caused by one of 2H(X|Y ) equally probable input values.

11
output can be either y3 = 210 or y4 = 220. Finally, if the input is x3 = 300 then the output
can be either y5 = 310 or y6 = 320. Thus, given three equiprobable input states and two
equiprobable noise values, there are my = 6(= 3 × 2) equiprobable output states. So the
output entropy is H(y) = log 6 = 2.58 bits. However, some of this entropy is due to noise,
so not all of the output entropy comprises information about the input.

In general, the total number my of equiprobable output states is my = mx × mη , from

which it follows that the output entropy is

H(y) = log mx + log mη (12)

= H(x) + H(η) bits. (13)

Because we want to explore channel capacity in terms of channel noise, we will pretend
to reverse the direction of data along the channel. Accordingly, before we ‘receive’ an input
value, we know that the output can be one of 6 values, so our uncertainty about the input
value is summarised by its entropy H(y) = 2.58 bits.

Conditional Entropy. Our average uncertainty about the output value given an input
value is the conditional entropy H(y|x). The vertical bar denotes ‘given that’, so H(y|x) is,
‘the residual uncertainty (entropy) of y given that we know the value of x’.

After we have received an input value, our uncertainty about the output value is reduced
from H(y) = 2.58 bits to

H(y|x) = H(η) = log 2 = 1bit. (14)

Figure 8: The relationships between information theoretic quantities. Noise refers to noise
η in the output, which induces uncertainty H(y|x) = H(η) regarding the output given the
input; this noise also induces uncertainty H(x|y) regarding the input given the output. The
mutual information is I(x, y) = H(x) − H(x|y) = H(y) − H(y|x) bits.

12
Because H(y|x) is the entropy of the channel noise η, we can write it as H(η). Equation 14
is true for every input, and it is therefore true for the average input. Thus, for each input,
we gain an average of

H(y) − H(η) = 2.58 − 1 bits, (15)

about the output, which is the mutual information between x and y.

10 Mutual Information

The mutual information I(x, y) between two variables, such as a channel input x and output
y, is the average amount of information that each value of x provides about y

I(x, y) = H(y) − H(η) bits. (16)

Somewhat counter-intuitively, the average amount of information gained about the output
when an input value is received is the same as the average amount of information gained
about the input when an output value is received, I(x, y) = I(y, x). This is why it did
not matter when we pretended to reverse the direction of data through the channel. These
quantities are summarised in Figure 8.

11 Shannon’s Noisy Channel Coding Theorem

All practical communication channels are noisy. To take a trivial example, the voice signal
coming out of a telephone is not a perfect copy of the speaker’s voice signal, because various
electrical components introduce spurious bits of noise into the telephone system.

As we have seen, the effects of noise can be reduced by using error correcting
codes. These codes reduce errors, but they also reduce the rate at which information is
communicated. More generally, any method which reduces the effects of noise also reduces
the rate at which information can be communicated. Taking this line of reasoning to its
logical conclusion seems to imply that the only way to communicate information with zero
error is to reduce the effective rate of information transmission to zero, and in Shannon’s
day this was widely believed to be true. But Shannon proved that information can be
communicated, with vanishingly small error, at a rate which is limited only by the channel
capacity.

Now we give Shannon’s fundamental theorem for a discrete channel with noise, also
known as the second fundamental coding theorem, and as Shannon’s noisy channel coding
theorem[2]:

13
Let a discrete channel have the capacity C and a discrete source the entropy per
second H. If H ≤ C there exists a coding system such that the output of the
source can be transmitted over the channel with an arbitrarily small frequency
of errors (or an arbitrarily small equivocation). If H ≥ C it is possible to encode
the source so that the equivocation is less than H − C + where is arbitrarily
small. There is no method of encoding which gives an equivocation less than
H − C.

(The word ‘equivocation’ means the average uncertainty that remains regarding the value
of the input after the output is observed, i.e. the conditional entropy H(X|Y )). In essence,
Shannon’s theorem states that it is possible to use a communication channel to communicate
information with a low error rate (epsilon), at a rate arbitrarily close to the channel
capacity of C bits/s, but it is not possible to communicate information at a rate greater
than C bits/s.

The capacity of a noisy channel is defined as

C = max I(x, y) (17)

p(x)
= max [H(y) − H(y|x)] bits. (18)
p(x)

If there is no noise (i.e. if H(y|x) = 0) then this reduces to Equation 10, which is the
capacity of a noiseless channel. The data processing inequality states that, no matter how
sophisticated any device is, the amount of information I(x, y) in its output about its input
cannot be greater than the amount of information H(x) in the input.

12 The Gaussian Channel

If the noise values in a channel are drawn independently from a Gaussian distribution (i.e.
η ∼ N (µη , vη ), as defined in Equation 6) then this defines a Gaussian channel.

Given that y = x + η, if we want p(y) to be Gaussian then we should ensure that

p(x) and p(η) are Gaussian, because the sum of two independent Gaussian variables is also
Gaussian[3]. So, p(x) must be (iid) Gaussian in order to maximise H(x), which maximises
H(y), which maximises I(x, y). Thus, if each input, output, and noise variable is (iid)
Gaussian then the average amount of information communicated per output value is the
channel capacity, so that I(x, y) = C bits. This is an informal statement of Shannon’s
continuous noisy channel coding theorem for Gaussian channels. We can use this to express
capacity in terms of the input, output, and noise.

14
If the channel input x ∼ N (µx , vx ) then the continuous analogue (integral) of Equation
4 yields the differential entropy

H(x) = (1/2) log 2πevx bits. (19)

The distinction between differential entropy effectively disappears when considering the
difference between entropies, and we will therefore find that we can safely ignore this
distinction here. Given that the channel noise is iid Gaussian, its entropy is

H(η) = (1/2) log 2πevη bits. (20)

Because the output is the sum y = x + η, it is also iid Gaussian with variance vy = vx + vη ,
and its entropy is

H(y) = (1/2) log 2πe(vx + vη ) bits. (21)

Substituting Equations 20 and 21 into Equation 16 yields

1 vx
I(x, y) = log 1 + bits, (22)
2 vη

which allows us to choose one out of m = 2I equiprobable values. For a Gaussian channel,
I(x, y) attains its maximal value of C bits.

The variance of any signal with a mean of zero is equal to its power, which is the rate at
which energy is expended per second, and the physical unit of power is measured in Joules
per second (J/s) or Watts, where 1 Watt = 1 J/s. Accordingly, the signal power is S = vx
J/s, and the noise power is N = vη J/s. This yields Shannon’s famous equation for the

2.5
Channel capacity, bits/s

1.5

0.5

0
0 5 10 15 20
Signal power, J/s

Figure 9: Gaussian channel capacity C (Equation 23) increases slowly with signal power S,
which equals signal power here because N = 1.

15
capacity of a Gaussian channel

1 S
C = log 1 + bits, (23)
2 N

where the ratio of variances S/N is the signal to noise ratio (SNR), as in Figure 9. It is
worth noting that, given a Gaussian signal obscured by Gaussian noise, the probability of
detecting the signal is[5]
r !!
1 S
P = log 1 + erf , (24)
2 8N

where erf is the cumulative distribution function of a Gaussian.

13 Fourier Analysis

If a sinusoidal signal has a period of λ seconds then it has a frequency of f = 1/λ periods per
second, measured in Hertz (Hz). A sinusoid with a frequency of W Hz can be represented
perfectly if its value is measured at the Nyquist sample rate[6] of 2W times per second.
Indeed, Fourier analysis allows almost any signal x to be represented as a mixture of
sinusoidal Fourier components x(f ) : (f = 0, . . . , W ), shown in Figure 10 (see Section 13).
A signal which includes frequencies between 0 Hz and W Hz has a bandwidth of W Hz.

Fourier Analysis. Fourier analysis allows any signal to be represented as a weighted sum
of sine and cosine functions (see Section 13). More formally, consider a signal x with a value

Figure 10: Fourier analysis decomposes the signal x in d into a unique set of sinusoidal
Fourier components x(f ) (f = 0, . . . , W Hz) in a-c, where d=a+b+c.

16
xt at time t, which spans a time interval of T seconds. This signal can be represented as a
weighted average of sine and cosine functions
∞
X ∞
X
xt = x0 + an cos(fn t) + bn sin(fn t), (25)
n=1 n=1

where fn = 2πn/T represents frequency, an is the Fourier coefficient (amplitude) of a

cosine with frequency fn , and bn is the Fourier coefficient of a sine with frequency fn ;
and x0 represents the background amplitude (usually assumed to be zero). Taken over all
frequencies, these pairs of coefficients represent the Fourier transform of x.

The Fourier coefficients can be found from the integrals

Z T
2
an = xt cos(fn t) dt (26)
T 0
Z T
2
bn = xt sin(fn t) dt. (27)
T 0

Each coefficient an specifies how much of the signal x consists of a cosine at the frequency
fn , and bn specifies how much consists of a sine. Each pair of coefficients specifies the power
and phase of one frequency component; the power at frequency fn is Sf = (a2n + b2n ), and
the phase is arctan(bn /an ). If x has a bandwidth of W Hz then its power spectrum is the
set of W values S0 , . . . , SW .

An extremely useful property of Fourier analysis is that, when applied to any variable,
the resultant Fourier components are mutually uncorrelated[7], and, when applied to any
Gaussian variable, these Fourier components are also mutually independent. This means
that the entropy of any Gaussian variable can be estimated by adding up the entropies
of its Fourier components, which can be used to estimate the mutual information between
Gaussian variables.

Consider a variable y = x + η, which is the sum of a Gaussian signal x with variance

S, and Gaussian noise with variance N . If the highest frequency in y is W Hz, and if values
of x are transmitted at the Nyquist rate of 2W Hz, then the channel capacity is 2W C bits
per second, (where C is defined in Equation 23). Thus, when expressed in terms of bits per
second, this yields a channel capacity of

S
C = W log 1 + bits/s. (28)
N

If the signal power of Fourier component x(f ) is S(f ), and the noise power of component
η(f ) is N (f ) then the signal to noise ratio is S(f )/N (f ) (see Section 13). The mutual
information at frequency f is therefore

S(f )
I(x(f ), y(f )) = log 1 + bits/s. (29)
N (f )

17
Because the Fourier components of any Gaussian variable are mutually independent, the
mutual information between Gaussian variables can be obtained by summing I(x(f ), y(f ))
over frequency
Z W
I(x, y) = I(x(f ), y(f )) df bits/s. (30)
f =0

If each Gaussian variable x, y and η is also iid then I(x, y) = C bits/s, otherwise I(x, y) <
C bits/s[2]. If the peak power at all frequencies is a constant k then it can be shown
that I(x, y) is maximised when S(f ) + N (f ) = k, which defines a flat power spectrum.
Finally, if the signal spectrum is sculpted so that the signal plus noise spectrum is flat then
the logarithmic relation in Equation 23 yields improved, albeit still diminishing, returns[7]
C ∝ (S/N )1/3 bits/s.

14 A Very Short History of Information Theory

Even the most gifted scientist cannot command an original theory out of thin air. Just as
Einstein could not have devised his theories of relativity if he had no knowledge of Newton’s
work, so Shannon could not have created information theory if he had no knowledge of the
work of Boltzmann (1875) and Gibbs (1902) on thermodynamic entropy, Wiener (1927)
on signal processing, Nyquist (1928) on sampling theory, or Hartley (1928) on information
transmission[8].

Even though Shannon was not alone in trying to solve one of the key scientific problems
of his time (i.e. how to define and measure information), he was alone in being able to
produce a complete mathematical theory of information: a theory that might otherwise
have taken decades to construct. In effect, Shannon single-handedly accelerated the rate of
scientific progress, and it is entirely possible that, without his contribution, we would still
be treating information as if it were some ill-defined vital fluid.

15 Key Equations

Logarithms use base 2 unless stated otherwise.

Entropy
m
X 1
H(x) = p(xi ) log bits (31)
p(xi )
i=1
Z
1
H(x) = p(x) log dx bits (32)
x p(x)

18
Joint entropy
m X
m
X 1
H(x, y) = p(xi , yj ) log bits (33)
p(xi , yj )
i=1 j=1
Z Z
1
H(x, y) = p(y, x) log dy dx bits (34)
x y p(y, x)

H(x, y) = I(x, y) + H(x|y) + H(y|x) bits (35)

H(x|y) = H(x, y) − H(y) bits (40)

H(y|x) = H(x, y) − H(x) bits (41)

from which we obtain the chain rule for entropy

H(x, y) = H(x) + H(y|x) bits (42)

= H(y) + H(x|y) bits (43)

Marginalisation
m
X m
X
p(xi ) = p(xi , yj ), p(yj ) = p(xi , yj ) (44)
j=1 i=1
Z Z
p(x) = p(x, y) dy, p(y) = p(x, y) dx (45)
y x

19
Mutual Information

m X
m
X p(xi , yj )
I(x, y) = p(xi , yj ) log bits (46)
p(xi )p(yj )
i=1 j=1
Z Z
p(x, y)
I(x, y) = p(x, y) log dx dy bits (47)
y x p(x)p(y)

I(x, y) = H(x) + H(y) − H(x, y) (48)

= H(x) − H(x|y) (49)
= H(y) − H(y|x) (50)
= H(x, y) − [H(x|y) + H(y|x)] bits (51)

If y = x + η, with x and y (not necessarily iid) Gaussian variables then

Z W
S(f )
I(x, y) = log 1 + df bits/s, (52)
f =0 N (f )

where W is the bandwidth, S(f )/N (f ) is the signal to noise ratio of the signal and noise
Fourier components at frequency f (Section 13), and data are transmitted at the Nyquist
rate of 2W samples/s.

Channel Capacity

C = max I(x, y) bits per value. (53)

p(x)

If the channel input x has variance S, the noise η has variance N , and both x and η are iid
Gaussian variables then I(x, y) = C, where

1 S
C = log 1 + bits per value, (54)
2 N

where the ratio of variances S/N is the signal to noise ratio.

Applebaum D (2008)[9]. Probability and Information: An Integrated Approach. A thorough

introduction to information theory, which strikes a good balance between intuitive and
technical explanations.

Avery J (2012)[10]. Information Theory and Evolution. An engaging account of how

information theory is relevant to a wide range of natural and man-made systems, including

The AI Wealth Creation Blueprint PDF
67% (3)
The AI Wealth Creation Blueprint PDF
50 pages
Procedural Generation in Game Design
93% (14)
Procedural Generation in Game Design
339 pages
Christopher Langan - CTMU, The Cognitive-Theoretic Model of The Universe, A New Kind of Reality Theory
88% (8)
Christopher Langan - CTMU, The Cognitive-Theoretic Model of The Universe, A New Kind of Reality Theory
56 pages
Salomon Smith Barney Principles of Principal Components A Fresh Look at Risk Hedging and Relative Value
100% (1)
Salomon Smith Barney Principles of Principal Components A Fresh Look at Risk Hedging and Relative Value
45 pages
Gayle Laakmann McDowell - Cracking The Coding Interview - 189 Programming Questions and Solutions (2015, CareerCup)
81% (48)
Gayle Laakmann McDowell - Cracking The Coding Interview - 189 Programming Questions and Solutions (2015, CareerCup)
708 pages
Gödel, Escher, Bach - An Eternal Golden Braid (20th Anniversary Edition) by Douglas R. Hofstadter (Charm-Quark) PDF
100% (10)
Gödel, Escher, Bach - An Eternal Golden Braid (20th Anniversary Edition) by Douglas R. Hofstadter (Charm-Quark) PDF
821 pages
Variance Analysis Problems With Answers
75% (4)
Variance Analysis Problems With Answers
5 pages
A Coomer's Guide To AI Dungeon
No ratings yet
A Coomer's Guide To AI Dungeon
30 pages
Chris Bailey - Hyperfocus - The New Science of Attention, Productivity, and Creativity-Viking (2018)
100% (25)
Chris Bailey - Hyperfocus - The New Science of Attention, Productivity, and Creativity-Viking (2018)
306 pages
The Art of Asking ChatGPT For High-Quality Answers A Complete Guide To Prompt Engineering Techniques (Ibrahim John) (Z-Library)
100% (24)
The Art of Asking ChatGPT For High-Quality Answers A Complete Guide To Prompt Engineering Techniques (Ibrahim John) (Z-Library)
52 pages
Banana Pancakes - Ukulele Chord Chart
100% (1)
Banana Pancakes - Ukulele Chord Chart
2 pages
The Fabric of Reality
100% (1)
The Fabric of Reality
6 pages
EST.2215 Accuracy of FEL 2 Estimates in Process Plants: Melissa C. Matthews
No ratings yet
EST.2215 Accuracy of FEL 2 Estimates in Process Plants: Melissa C. Matthews
17 pages
75 Productivity Hacks - System Sunday
100% (7)
75 Productivity Hacks - System Sunday
75 pages
Military Remote Viewing Manual
100% (5)
Military Remote Viewing Manual
72 pages
Information Theory: A Tutorial Introduction
0% (1)
Information Theory: A Tutorial Introduction
23 pages
Subhajit_FinalReport_Quantum
No ratings yet
Subhajit_FinalReport_Quantum
26 pages
TSDS Admission Exam Questions 2022 FINAL
No ratings yet
TSDS Admission Exam Questions 2022 FINAL
4 pages
Lecture01 02 Part1
No ratings yet
Lecture01 02 Part1
27 pages
Ec23ec4211itc PPT
No ratings yet
Ec23ec4211itc PPT
148 pages
Statistic SimpleLinearRegression
No ratings yet
Statistic SimpleLinearRegression
7 pages
Information Theory Final
No ratings yet
Information Theory Final
50 pages
INFORMATION THEORY AND SOURCE CODING
No ratings yet
INFORMATION THEORY AND SOURCE CODING
45 pages
Information, Entropy, and The Motivation For Source Codes: Hapter
No ratings yet
Information, Entropy, and The Motivation For Source Codes: Hapter
12 pages
Lecture 01
No ratings yet
Lecture 01
3 pages
MIT6 02F12 Chap02
No ratings yet
MIT6 02F12 Chap02
12 pages
Chapter 02 Information Theory
No ratings yet
Chapter 02 Information Theory
15 pages
Module 1
No ratings yet
Module 1
29 pages
Error-Correcting Codes With Linear Algebra: 1 The Problem
No ratings yet
Error-Correcting Codes With Linear Algebra: 1 The Problem
8 pages
Principal Components Analysis: A How-To Manual For R
No ratings yet
Principal Components Analysis: A How-To Manual For R
12 pages
Course Notes For Unit 6 of The Udacity Course ST101 Introduction To Statistics PDF
No ratings yet
Course Notes For Unit 6 of The Udacity Course ST101 Introduction To Statistics PDF
23 pages
Quantum Chaos and The Brain
No ratings yet
Quantum Chaos and The Brain
21 pages
Notes PDF
No ratings yet
Notes PDF
54 pages
Lecture On Communication Theory Gabor
100% (1)
Lecture On Communication Theory Gabor
52 pages
All Coding
No ratings yet
All Coding
52 pages
Lec35 - 210108062 - ZAINAB ALI
No ratings yet
Lec35 - 210108062 - ZAINAB ALI
9 pages
Example: Anscombe's Quartet Revisited: CC-BY-SA-3.0 GFDL
No ratings yet
Example: Anscombe's Quartet Revisited: CC-BY-SA-3.0 GFDL
10 pages
Notes STA408 - Chapter 3
No ratings yet
Notes STA408 - Chapter 3
17 pages
A Brief Introduction On Shannon's Information Theory: January 2016
No ratings yet
A Brief Introduction On Shannon's Information Theory: January 2016
10 pages
Classical Information Theory
No ratings yet
Classical Information Theory
6 pages
What Is Shannon Information
No ratings yet
What Is Shannon Information
33 pages
MIT18 05S14 Reading6b PDF
No ratings yet
MIT18 05S14 Reading6b PDF
13 pages
Information Theory and Coding PDF
No ratings yet
Information Theory and Coding PDF
61 pages
Information Theory
No ratings yet
Information Theory
37 pages
Detection Theory
No ratings yet
Detection Theory
24 pages
Dsrp Unit-II Notes
No ratings yet
Dsrp Unit-II Notes
62 pages
Digital Signal Processing (DSP) with Python Programming
From Everand
Digital Signal Processing (DSP) with Python Programming
Maurice Charbit
No ratings yet
Machine Learning Assignments and Answers
No ratings yet
Machine Learning Assignments and Answers
35 pages
17th Jan-2
No ratings yet
17th Jan-2
9 pages
4 20240 456
0% (1)
4 20240 456
5 pages
Entropy: Low Entropy High Entropy
No ratings yet
Entropy: Low Entropy High Entropy
11 pages
3 The Physics of Computing
No ratings yet
3 The Physics of Computing
32 pages
Digital Communication: Information Theory
No ratings yet
Digital Communication: Information Theory
5 pages
Why in Mayan Mathematics Zero and Infinity Are the Same_ A Possi
No ratings yet
Why in Mayan Mathematics Zero and Infinity Are the Same_ A Possi
5 pages
What Is Information?: W. Szpankowski
No ratings yet
What Is Information?: W. Szpankowski
29 pages
Solutions Manual To Mathematical Analysis by Tom Apostol - Second Edition-Chapter One
No ratings yet
Solutions Manual To Mathematical Analysis by Tom Apostol - Second Edition-Chapter One
67 pages
Robust Statistics For Outlier Detection (Peter J. Rousseeuw and Mia Hubert)
No ratings yet
Robust Statistics For Outlier Detection (Peter J. Rousseeuw and Mia Hubert)
8 pages
A Statistical Measure of Complexity- Desiquilibrio
No ratings yet
A Statistical Measure of Complexity- Desiquilibrio
23 pages
Interpolation/Extrapolation and Its Application To Solar Cells
No ratings yet
Interpolation/Extrapolation and Its Application To Solar Cells
18 pages
Econ 623 AsymptoticTheory 2023
No ratings yet
Econ 623 AsymptoticTheory 2023
74 pages
Kemh115 PDF
No ratings yet
Kemh115 PDF
36 pages
Anomaly Detection
No ratings yet
Anomaly Detection
15 pages
Channel Coding Theorem
No ratings yet
Channel Coding Theorem
23 pages
A Brief Introduction On Shannon's Information Theory
No ratings yet
A Brief Introduction On Shannon's Information Theory
12 pages
nn1
No ratings yet
nn1
6 pages
Information Theory 1
No ratings yet
Information Theory 1
31 pages
Tutorial Question 2
No ratings yet
Tutorial Question 2
2 pages
In Social Systems Anticipatory Information Is Reduced To Produce Sub-Systems
No ratings yet
In Social Systems Anticipatory Information Is Reduced To Produce Sub-Systems
8 pages
Shannon's Coding Theorems: Saint Mary's College of California
No ratings yet
Shannon's Coding Theorems: Saint Mary's College of California
23 pages
Learn Statistics Fast: A Simplified Detailed Version for Students
From Everand
Learn Statistics Fast: A Simplified Detailed Version for Students
Hesbon R.M
No ratings yet
Active Inference: The Free Energy Principle in Mind, Brain, and Behavior
From Everand
Active Inference: The Free Energy Principle in Mind, Brain, and Behavior
Thomas Parr
4/5 (3)
Analytic Inequalities
From Everand
Analytic Inequalities
Nicholas D. Kazarinoff
5/5 (1)
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
From Everand
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
Yue Jiang
4.5/5 (2)
Ordinary Differential Equations and Stability Theory: An Introduction
From Everand
Ordinary Differential Equations and Stability Theory: An Introduction
David A. Sanchez
No ratings yet
Top Numerical Methods With Matlab For Beginners!
From Everand
Top Numerical Methods With Matlab For Beginners!
Andrei Besedin
No ratings yet
My Ai Cheat List
100% (11)
My Ai Cheat List
3 pages
The Secrets of A Slot Machine
No ratings yet
The Secrets of A Slot Machine
4 pages
2045: The Year Man Becomes Immortal
No ratings yet
2045: The Year Man Becomes Immortal
9 pages
Teas Topics To Study
100% (12)
Teas Topics To Study
6 pages
Mercity - Ai-Guide To Fine-Tuning LLMs Using PEFT and LoRa Techniques
No ratings yet
Mercity - Ai-Guide To Fine-Tuning LLMs Using PEFT and LoRa Techniques
25 pages
Mythic Magazine #009
100% (3)
Mythic Magazine #009
27 pages
Improved Statistical Test
87% (171)
Improved Statistical Test
20 pages
Download Complete Artificial Intelligence and Problem Solving 1st Edition Danny Kopec PDF for All Chapters
100% (4)
Download Complete Artificial Intelligence and Problem Solving 1st Edition Danny Kopec PDF for All Chapters
61 pages
Algebra Workbook
100% (3)
Algebra Workbook
299 pages
Next Generation Sequencing Data Analysis
No ratings yet
Next Generation Sequencing Data Analysis
435 pages
Ghosh S. Mathematics and Computer Science Vol 1. 2023
No ratings yet
Ghosh S. Mathematics and Computer Science Vol 1. 2023
743 pages
Prompt Engineering - Links and Resources
No ratings yet
Prompt Engineering - Links and Resources
2 pages
A Methodology For Detecting Credit Card Fraud
No ratings yet
A Methodology For Detecting Credit Card Fraud
60 pages
Deep Thinking Where Machine Intelligence PDF
100% (1)
Deep Thinking Where Machine Intelligence PDF
3 pages
Scientific American - April 2024
100% (1)
Scientific American - April 2024
88 pages
Websites and Tools Links
No ratings yet
Websites and Tools Links
3 pages
List of Deepfake Tools
No ratings yet
List of Deepfake Tools
5 pages
Cognitive Bias Cheat Sheet
100% (1)
Cognitive Bias Cheat Sheet
17 pages
Instant Download The R Book 2nd Edition Michael J. Crawley PDF All Chapters
100% (4)
Instant Download The R Book 2nd Edition Michael J. Crawley PDF All Chapters
60 pages
مقال اجنبي PDF
No ratings yet
مقال اجنبي PDF
6 pages
(Economics Collection) Vu, Tam Bang-Seeing The Future - How To Build Basic Forecasting Models-Business Expert Press (2015) PDF
No ratings yet
(Economics Collection) Vu, Tam Bang-Seeing The Future - How To Build Basic Forecasting Models-Business Expert Press (2015) PDF
284 pages
MTL390
No ratings yet
MTL390
8 pages
Anjali BRMbasic Data Analysis
No ratings yet
Anjali BRMbasic Data Analysis
19 pages
Unit-1: The Random Variable
100% (1)
Unit-1: The Random Variable
63 pages
The Role of Salesperson Communication in Luxury Selling
No ratings yet
The Role of Salesperson Communication in Luxury Selling
15 pages
Cps 330 C
No ratings yet
Cps 330 C
110 pages
SPC SQC
No ratings yet
SPC SQC
68 pages
Fundamental Statistics For The Behavioral Sciences v2.0
No ratings yet
Fundamental Statistics For The Behavioral Sciences v2.0
342 pages
The Rolling Cross-Section and Causal Attribution: Henry E. Brady and Richard Johnston
No ratings yet
The Rolling Cross-Section and Causal Attribution: Henry E. Brady and Richard Johnston
32 pages
Introduction To Probability Solutions To The Odd Numbered Exercises (Charles M. Grinstead & J. Laurie Snell - 0821807498) PDF
No ratings yet
Introduction To Probability Solutions To The Odd Numbered Exercises (Charles M. Grinstead & J. Laurie Snell - 0821807498) PDF
51 pages
Shipping Exam Question
No ratings yet
Shipping Exam Question
5 pages
Stats 3
No ratings yet
Stats 3
46 pages
Article Review 1 Eng
No ratings yet
Article Review 1 Eng
30 pages
MJC JC 2 H2 Maths 2011 Mid Year Exam Questions Paper 2
No ratings yet
MJC JC 2 H2 Maths 2011 Mid Year Exam Questions Paper 2
10 pages
RV and Distributions Examples
No ratings yet
RV and Distributions Examples
5 pages
GRR Study MSA Template
No ratings yet
GRR Study MSA Template
20 pages
Two-way ANOVA final notes
No ratings yet
Two-way ANOVA final notes
10 pages
Information Theory Textbook
No ratings yet
Information Theory Textbook
14 pages
40 Interview Questions Asked at Startups in Machine Learning - Data Science
100% (3)
40 Interview Questions Asked at Startups in Machine Learning - Data Science
33 pages
Economic Analysis Journal
No ratings yet
Economic Analysis Journal
7 pages
GMD 7 1247 2014
No ratings yet
GMD 7 1247 2014
5 pages
Previewpdf
No ratings yet
Previewpdf
27 pages
Descriptive Statistics
No ratings yet
Descriptive Statistics
11 pages
Data Science Manual
No ratings yet
Data Science Manual
16 pages
Ch7 Sampling Distributions
No ratings yet
Ch7 Sampling Distributions
14 pages

Information Theory A Tutorial Introduction-1-20

Uploaded by

Information Theory A Tutorial Introduction-1-20

Uploaded by

Information Theory: A Tutorial Introduction

James V Stone, Psychology Department, University of Sheffield, England.

Shannon’s mathematical theory of communication defines fundamental limits on

In 1948, Claude Shannon published a paper called A Mathematical Theory of

Message Noise Message

2 Finding a Route, Bit by Bit

3 Bits Are Not Binary Digits

4 Information and Entropy

which is often expressed as: information = − log p(xh ) bits.

More generally, a random variable x with a probability distribution p(x) =

Figure 4: (a) A pair of dice. (b) Histogram of dice outcome values.

Entropy and Uncertainty. Entropy is a measure of uncertainty. When our uncertainty

Symbol Sum Outcome Frequency P Surprisal

Table 1: A pair of dice have 36 possible outcomes.

5 Entropy of Continuous Variables

6 Maximum Entropy Distributions

A distribution of values that has as much entropy (information) as theoretically possible is

If a variable has a Gaussian or normal distribution then the probability of observing a

which has a variance of var(x) = µ2 , as shown in Figure 5b.

p(x) = 1/(xmax − xmin ), (8)

(a) (b) (c)

Figure 5: Maximum entropy distributions. a) Gaussian distribution, with mean µ = 0 and

C = max H(x) bits per input. (10)

9 Noise Reduces Channel Capacity

H(x) = log mx bits. (11)

In general, the total number my of equiprobable output states is my = mx × mη , from

H(y) = log mx + log mη (12)

H(y|x) = H(η) = log 2 = 1bit. (14)

H(y) − H(η) = 2.58 − 1 bits, (15)

about the output, which is the mutual information between x and y.

I(x, y) = H(y) − H(η) bits. (16)

11 Shannon’s Noisy Channel Coding Theorem

The capacity of a noisy channel is defined as

C = max I(x, y) (17)

12 The Gaussian Channel

Given that y = x + η, if we want p(y) to be Gaussian then we should ensure that

H(x) = (1/2) log 2πevx bits. (19)

H(η) = (1/2) log 2πevη bits. (20)

H(y) = (1/2) log 2πe(vx + vη ) bits. (21)

Substituting Equations 20 and 21 into Equation 16 yields

where erf is the cumulative distribution function of a Gaussian.

where fn = 2πn/T represents frequency, an is the Fourier coefficient (amplitude) of a

The Fourier coefficients can be found from the integrals

Consider a variable y = x + η, which is the sum of a Gaussian signal x with variance

14 A Very Short History of Information Theory

Logarithms use base 2 unless stated otherwise.

H(x, y) = I(x, y) + H(x|y) + H(y|x) bits (35)

H(x|y) = H(x, y) − H(y) bits (40)

from which we obtain the chain rule for entropy

H(x, y) = H(x) + H(y|x) bits (42)

I(x, y) = H(x) + H(y) − H(x, y) (48)

If y = x + η, with x and y (not necessarily iid) Gaussian variables then

C = max I(x, y) bits per value. (53)

where the ratio of variances S/N is the signal to noise ratio.

Applebaum D (2008)[9]. Probability and Information: An Integrated Approach. A thorough

Avery J (2012)[10]. Information Theory and Evolution. An engaging account of how

You might also like