Data Transmission and Channel Capacity

Chapter 4 discusses data transmission and channel capacity, focusing on reliable transmission methods that minimize errors in communication over noisy channels. It defines discrete memoryless channels and various types of channels, including binary symmetric and erasure channels, along with their transition matrices. The chapter also introduces fixed-length data transmission codes and the average probability of error associated with these codes.
Chapter 4

Data Transmission and Channel Capacity

Po-Ning Chen, Professor

Institute of Communications Engineering

National Chiao Tung University

Hsin Chu, Taiwan 30010, R.O.C.


Principle of Data Transmission I: 4-1

• Data transmission
– To carefully select codewords from the set of channel input words (of a given
length) so that a minimal ambiguity is obtained at the channel receiver.
• E.g., to transmit a binary message through the following channel, in which inputs 00 and 01 are both received as 0, and inputs 10 and 11 are both received as 1, each with probability 1:

      00 → 0 ← 01
      10 → 1 ← 11

The code (00 for event A, 10 for event B) obviously induces less ambiguity at
the receiver than the code (00 for event A, 01 for event B).
Reliable Transmission I: 4-2

• Definition of “reliable” transmission


– The message can be transmitted with arbitrarily small error.
• Objective of data transmission
– To transform a noisy channel into a reliable medium for sending messages
and recovering them at the receiver.
• How?
– By taking advantage of the common parts between the sender and the
receiver sites that are least affected by the channel noise.
– We will see that these common parts are probabilistically captured by
the mutual information between the channel input and the channel
output.
Notations I: 4-3

W → [Channel Encoder] → X^n → [Channel P_{Y^n|X^n}(·|·)] → Y^n → [Channel Decoder] → Ŵ

• A data transmission system, where

– W represents the message for transmission,
– X^n = (X_1, . . . , X_n) denotes the codeword corresponding to the message W,
– Y^n = (Y_1, . . . , Y_n) represents the received vector due to channel input X^n,
– Ŵ denotes the reconstructed message from Y^n.
Query? I: 4-4

• What is the maximum amount of information (per channel input) that can be
reliably transmitted via a given noisy channel?
– E.g., we can transmit 1 bit per channel use with the following code over the channel of the previous example (00 → 0, 01 → 0, 10 → 1, 11 → 1):

      Code = (00 for event A, 10 for event B)
Discrete memoryless channels I: 4-5

Definition 4.1 (Discrete channel) A discrete communication channel is characterized by
• A finite input alphabet X.
• A finite output alphabet Y.
• A sequence of n-dimensional transition distributions {P_{Y^n|X^n}(y^n|x^n)}_{n=1}^∞ such that

      Σ_{y^n ∈ Y^n} P_{Y^n|X^n}(y^n|x^n) = 1

for every x^n ∈ X^n, where x^n = (x_1, ···, x_n) ∈ X^n and y^n = (y_1, ···, y_n) ∈ Y^n. We assume that the above sequence of n-dimensional distributions is consistent, i.e.,

      P_{Y^i|X^i}(y^i|x^i) = Σ_{x_{i+1} ∈ X} Σ_{y_{i+1} ∈ Y} P_{X_{i+1}|X^i}(x_{i+1}|x^i) P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})

for every x^i, y^i, P_{X_{i+1}|X^i} and i = 1, 2, ···.


Discrete memoryless channels I: 4-6

Definition 4.2 (Discrete memoryless channel) A discrete memoryless channel (DMC) is a channel whose sequence of transition distributions P_{Y^n|X^n} satisfies

      P_{Y^n|X^n}(y^n|x^n) = ∏_{i=1}^n P_{Y|X}(y_i|x_i)          (4.2.1)

for every n = 1, 2, ···, x^n ∈ X^n and y^n ∈ Y^n. In other words, a DMC is fully described by the channel's transition distribution matrix Q := [p_{x,y}] of size |X| × |Y|, where

      p_{x,y} := P_{Y|X}(y|x)

for x ∈ X, y ∈ Y. Furthermore, the matrix Q is stochastic; i.e., the sum of the entries in each of its rows is equal to 1, since Σ_{y∈Y} p_{x,y} = 1 for all x ∈ X.
Frequently used channels I: 4-7

1. Identity (noiseless) channels: An identity channel has equal-size input and output alphabets (|X| = |Y|) and channel transition probability satisfying

      P_{Y|X}(y|x) = 1 if y = x,  and  0 if y ≠ x.

This is a noiseless or perfect channel, as the channel input is received error-free at the channel output.
Frequently used channels I: 4-8

2. Binary symmetric channels (BSC): input X and output Y are binary;
   0 → 0 and 1 → 1 each with probability 1 − ε, and 0 → 1 and 1 → 0 each with probability ε.

• ε ∈ [0, 1] is called the channel's crossover probability or bit error rate.
• The channel's transition distribution matrix is given by

      Q = [p_{x,y}] = | p_{0,0}  p_{0,1} | = | P_{Y|X}(0|0)  P_{Y|X}(1|0) | = | 1−ε   ε  |
                      | p_{1,0}  p_{1,1} |   | P_{Y|X}(0|1)  P_{Y|X}(1|1) |   |  ε   1−ε |          (4.2.4)

• ε = 0 reduces the BSC to the binary identity (noiseless) channel.


Frequently used channels I: 4-9

• BSC can be explicitly represented via a binary modulo-2 additive noise


channel whose output at time i is the modulo-2 sum of its input and noise
variables:
Yi = Xi ⊕ Zi for i = 1, 2, · · ·
where
– ⊕ denotes addition modulo-2,
– Y_i, X_i and Z_i are the channel output, input and noise, respectively,
– the alphabets X = Y = Z = {0, 1} are all binary,
– X_i ⊥ Z_j (input and noise are independent) for any i, j = 1, 2, ···, and
– the noise process is a Bernoulli(ε) process, i.e., a binary i.i.d. process with Pr[Z = 1] = ε.
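The modulo-2 noise representation above is easy to check by simulation. The following sketch (plain Python, standard library only; the helper name and parameters are our choices) pushes random bits through a simulated BSC and confirms that the empirical bit error rate matches the crossover probability ε:

```python
import random

rng = random.Random(0)

def bsc(bits, eps):
    # Y_i = X_i xor Z_i with Z_i i.i.d. Bernoulli(eps) noise, independent of X_i
    return [x ^ (rng.random() < eps) for x in bits]

n, eps = 100_000, 0.1
x = [rng.randrange(2) for _ in range(n)]
y = bsc(x, eps)
empirical = sum(a != b for a, b in zip(x, y)) / n
print(empirical)   # close to the crossover probability eps = 0.1
```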
Frequently used channels I: 4-10

3. Binary erasure channels (BEC): input X ∈ {0, 1}, output Y ∈ {0, E, 1};
   0 → 0 and 1 → 1 each with probability 1 − α, and 0 → E and 1 → E each with probability α.

• In the BEC, the receiver knows the exact locations of the "error" bits in the received bitstream or codeword, but not their actual values.
• These "error" bits are then declared as "erased" during transmission and are called "erasures."
• The channel transition matrix is given by

      Q = [p_{x,y}] = | p_{0,0}  p_{0,E}  p_{0,1} | = | P_{Y|X}(0|0)  P_{Y|X}(E|0)  P_{Y|X}(1|0) | = | 1−α   α    0  |
                      | p_{1,0}  p_{1,E}  p_{1,1} |   | P_{Y|X}(0|1)  P_{Y|X}(E|1)  P_{Y|X}(1|1) |   |  0    α   1−α |

where 0 ≤ α ≤ 1 is called the channel's erasure probability.
Frequently used channels I: 4-11

4. Binary symmetric erasure channel (BSEC): input X ∈ {0, 1}, output Y ∈ {0, E, 1};
   0 → 0 and 1 → 1 with probability 1 − ε − α, 0 → 1 and 1 → 0 with probability ε, and 0 → E and 1 → E with probability α.

• One can combine the BSC with the BEC to obtain a binary channel with both errors and erasures.
• The channel's transition matrix is given by

      Q = [p_{x,y}] = | p_{0,0}  p_{0,E}  p_{0,1} | = | 1−ε−α   α     ε   |
                      | p_{1,0}  p_{1,E}  p_{1,1} |   |   ε     α   1−ε−α |          (4.2.8)

where ε, α ∈ [0, 1] are the channel's crossover and erasure probabilities, respectively.
• Clearly, setting α = 0 reduces the BSEC to the BSC, and setting ε = 0
reduces the BSEC to the BEC.
Frequently used channels I: 4-12

• More generally, the channel need not have a symmetric property in the sense of having identical transition distributions when input bits 0 or 1 are sent. For example, the channel's transition matrix can be given by

      Q = [p_{x,y}] = | p_{0,0}  p_{0,E}  p_{0,1} | = | 1−ε−α    α       ε    |
                      | p_{1,0}  p_{1,E}  p_{1,1} |   |   ε′     α′   1−ε′−α′ |          (4.2.10)

where in general the probabilities ε′ ≠ ε and α′ ≠ α. We call such a channel an asymmetric channel with errors and erasures.
Frequently used channels I: 4-13

5. q-ary symmetric channels:

• Given an integer q ≥ 2, the q-ary symmetric channel is a nonbinary extension of the BSC; it has alphabets X = Y = {0, 1, ···, q − 1} of size q and channel transition matrix given by

      Q = [p_{x,y}] = | p_{0,0}      p_{0,1}      ···   p_{0,q−1}   |
                      | p_{1,0}      p_{1,1}      ···   p_{1,q−1}   |
                      |   ...          ...        ...      ...      |
                      | p_{q−1,0}    p_{q−1,1}    ···  p_{q−1,q−1}  |

                    = | 1−ε        ε/(q−1)    ···   ε/(q−1) |
                      | ε/(q−1)    1−ε        ···   ε/(q−1) |
                      |   ...        ...      ...     ...   |
                      | ε/(q−1)    ε/(q−1)    ···   1−ε     |          (4.2.11)

where 0 ≤ ε ≤ 1 is the channel's symbol error rate (or probability).
• When q = 2, the channel reduces to the BSC with bit error rate ε, as
expected.
Frequently used channels I: 4-14

• Similar to the BSC, the q-ary symmetric channel can be expressed as a


modulo-q additive noise channel with common input, output and noise
alphabets X = Y = Z = {0, 1, · · · , q − 1} and whose output Yi at time i
is given by
Yi = Xi ⊕q Zi,
for i = 1, 2, · · · , where ⊕q denotes addition modulo-q, and Xi and Zi are
the channel’s input and noise variables, respectively, at time i.
• Here, the noise process {Z_n}_{n=1}^∞ is assumed to be an i.i.d. process with distribution

      Pr[Z = 0] = 1 − ε  and  Pr[Z = a] = ε/(q−1)  ∀ a ∈ {1, ···, q − 1}.
It is also assumed that the input and noise processes are independent from
each other.
Frequently used channels I: 4-15

6. q-ary erasure channels:


• Given an integer q ≥ 2, one can also consider a non-binary extension of
the BEC, yielding the so called q-ary erasure channel. Specifically, this
channel has input and output alphabets given by X = {0, 1, · · · , q − 1}
and Y = {0, 1, · · · , q − 1, E}, respectively, where E denotes an erasure,
and channel transition distribution given by

      P_{Y|X}(y|x) = 1 − α   if y = x, x ∈ X
                     α       if y = E, x ∈ X          (4.2.12)
                     0       if y ≠ x, y ∈ X, x ∈ X

where 0 ≤ α ≤ 1 is the erasure probability.


• As expected, setting q = 2 reduces the channel to the BEC.
4.3 Block codes for data transmission over DMCs I: 4-16

W → [Channel Encoder] → X^n → [Channel P_{Y^n|X^n}(·|·)] → Y^n → [Channel Decoder] → Ŵ

Definition 4.4 (Fixed-length data transmission code) Given positive integers n and M, and a discrete channel with input alphabet X and output alphabet Y, a fixed-length data transmission code (or block code) for this channel with blocklength n and rate (1/n) log2 M message bits per channel symbol (or channel use) is denoted by ∼Cn = (n, M) and consists of:
1. M information messages intended for transmission.
2. An encoding function

      f : {1, 2, . . . , M} → X^n

   yielding codewords f(1), f(2), ···, f(M) ∈ X^n, each of length n. The set of these M codewords is called the codebook, and we also usually write ∼Cn = {f(1), f(2), ···, f(M)} to list the codewords.
3. A decoding function g : Y^n → {1, 2, . . . , M}.
4.3 Block codes for data transmission over DMCs I: 4-17

Definition 4.5 (Average probability of error) The average probability of error for a channel block code ∼Cn = (n, M) with encoder f(·) and decoder g(·) used over a channel with transition distribution P_{Y^n|X^n} is defined as

      Pe(∼Cn) := (1/M) Σ_{w=1}^M λ_w(∼Cn),

where

      λ_w(∼Cn) := Pr[Ŵ ≠ W | W = w] = Pr[g(Y^n) ≠ w | X^n = f(w)]
                = Σ_{y^n ∈ Y^n : g(y^n) ≠ w} P_{Y^n|X^n}(y^n|f(w))

is the code's conditional probability of decoding error given that message w is sent over the channel.
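For small codes, Definition 4.5 can be evaluated exactly by enumerating every channel output. The sketch below uses a hypothetical two-codeword code {00, 11} over a BSC with ε = 0.1 and a minimum-distance decoder (all choices ours, for illustration only), and computes each λ_w and the average Pe:

```python
from itertools import product

eps = 0.1
code = {1: (0, 0), 2: (1, 1)}      # hypothetical toy codebook: f(1) = 00, f(2) = 11

def p_channel(y, x):
    # memoryless BSC(eps): P_{Y^n|X^n}(y^n|x^n) = prod_i P_{Y|X}(y_i|x_i)
    p = 1.0
    for yi, xi in zip(y, x):
        p *= (1 - eps) if yi == xi else eps
    return p

def g(y):
    # minimum Hamming-distance decoder (ties broken toward message 1)
    return min(code, key=lambda w: sum(a != b for a, b in zip(y, code[w])))

# lambda_w = sum of P(y^n | f(w)) over all y^n decoded to some other message
lam = {w: sum(p_channel(y, cw) for y in product((0, 1), repeat=2) if g(y) != w)
       for w, cw in code.items()}
Pe = sum(lam.values()) / len(code)
print(lam, Pe)   # lambda_1 = eps^2 = 0.01, lambda_2 = 0.19, Pe = 0.10
```

Note the asymmetry between λ_1 and λ_2 here comes purely from the decoder's tie-breaking rule, a reminder that λ_w depends on g(·), not just on the codewords.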
4.3 Block codes for data transmission over DMCs I: 4-18

Observation 4.6 Another more conservative error criterion is the so-called maximal probability of error

      λ(∼Cn) := max_{w ∈ {1,2,···,M}} λ_w(∼Cn).

Clearly,

      Pe(∼Cn = (n, M)) ≤ λ(∼Cn = (n, M)).

However,

      2 × Pe(∼Cn = (n, M)) ≥ λ(∼C′n = (n, M/2)),

where ∼C′n is constructed by throwing away from ∼Cn the half of its codewords with the largest conditional probabilities of error λ_w(∼Cn). So

      (1/2) λ(∼C′n) ≤ Pe(∼Cn) ≤ λ(∼Cn)

with code rates

      R = (1/n) log2(M)  and  R′ = (1/n) log2(M/2) = R − 1/n.

Consequently, a reliable transmission rate R under the average probability of error criterion is also a reliable transmission rate under the maximal probability of error criterion.
4.3 Block codes for data transmission over DMCs I: 4-19

Definition 4.7 (Jointly typical set) The set Fn(δ) of jointly δ-typical n-tuple pairs (x^n, y^n) with respect to the memoryless distribution

      P_{X^n,Y^n}(x^n, y^n) = ∏_{i=1}^n P_{X,Y}(x_i, y_i)

is defined by

      Fn(δ) := { (x^n, y^n) ∈ X^n × Y^n :
                 | −(1/n) log2 P_{X^n}(x^n) − H(X) | < δ,
                 | −(1/n) log2 P_{Y^n}(y^n) − H(Y) | < δ,
             and | −(1/n) log2 P_{X^n,Y^n}(x^n, y^n) − H(X, Y) | < δ }.

In short, a pair (x^n, y^n) generated by independently drawing n times under P_{X,Y} is jointly δ-typical if its joint and marginal empirical entropies are respectively δ-close to the true joint and marginal entropies.
4.3 Block codes for data transmission over DMCs I: 4-20

Theorem 4.8 (Joint AEP) If (X_1, Y_1), (X_2, Y_2), . . ., (X_n, Y_n), . . . are i.i.d., i.e., {(X_i, Y_i)}_{i=1}^∞ is a dependent pair of DMSs, then

      −(1/n) log2 P_{X^n}(X_1, X_2, . . . , X_n) → H(X)   in probability,
      −(1/n) log2 P_{Y^n}(Y_1, Y_2, . . . , Y_n) → H(Y)   in probability,

and

      −(1/n) log2 P_{X^n,Y^n}((X_1, Y_1), . . . , (X_n, Y_n)) → H(X, Y)   in probability

as n → ∞.

Proof: By the weak law of large numbers, we have the desired result. □
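The convergence in Theorem 4.8 is easy to visualize by simulation. In this sketch (the joint pmf `pxy` is an arbitrary choice of ours), n i.i.d. pairs are drawn and the three normalized log-likelihoods are compared against the true entropies:

```python
import math, random

rng = random.Random(1)

# a hypothetical joint pmf P_{X,Y} on {0,1} x {0,1} (our arbitrary choice)
pxy = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2}
px = {x: sum(p for (a, _), p in pxy.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in pxy.items() if b == y) for y in (0, 1)}

H = lambda dist: -sum(p * math.log2(p) for p in dist.values() if p > 0)

n = 50_000
pairs = rng.choices(list(pxy), weights=list(pxy.values()), k=n)
hx = -sum(math.log2(px[a]) for a, _ in pairs) / n      # -(1/n) log2 P_{X^n}(x^n)
hy = -sum(math.log2(py[b]) for _, b in pairs) / n      # -(1/n) log2 P_{Y^n}(y^n)
hxy = -sum(math.log2(pxy[ab]) for ab in pairs) / n     # -(1/n) log2 P_{X^n,Y^n}(x^n,y^n)
print(hx - H(px), hy - H(py), hxy - H(pxy))            # all close to 0
```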
4.3 Block codes for data transmission over DMCs I: 4-21

Theorem 4.9 (Shannon-McMillan-Breiman theorem for pairs) Given


a dependent pair of DMSs with joint entropy H(X, Y ) and any δ greater than zero,
we can choose n big enough so that the jointly δ-typical set satisfies:
1. P_{X^n,Y^n}(F_n^c(δ)) < δ for sufficiently large n.
2. The number of elements in Fn(δ) is at least (1 − δ) 2^{n(H(X,Y)−δ)} for sufficiently large n, and at most 2^{n(H(X,Y)+δ)} for every n.
3. If (x^n, y^n) ∈ Fn(δ), its probability of occurrence satisfies

      2^{−n(H(X,Y)+δ)} < P_{X^n,Y^n}(x^n, y^n) < 2^{−n(H(X,Y)−δ)}.

Proof: The proof is quite similar to that of the Shannon-McMillan-Breiman theorem for a single memoryless source presented in the previous chapter; we hence leave it as an exercise. □
4.3 Block codes for data transmission over DMCs I: 4-22

Definition 4.10 (Operational capacity) A rate R is said to be achievable for a discrete channel if there exists a sequence of (n, Mn) channel codes ∼Cn with

      liminf_{n→∞} (1/n) log2 Mn ≥ R   and   lim_{n→∞} Pe(∼Cn) = 0.

The channel's operational capacity, Cop, is the supremum of all achievable rates:

      Cop = sup{R : R is achievable}.

• The next theorem shows Cop = C, i.e., the information capacity is equal
to the operational capacity.
4.3 Block codes for data transmission over DMCs I: 4-23

Theorem 4.11 (Shannon's channel coding theorem) Consider a DMC with finite input alphabet X, finite output alphabet Y and transition distribution probability P_{Y|X}(y|x), x ∈ X and y ∈ Y. Define the channel capacity (or information capacity)

      C := max_{P_X} I(X; Y) = max_{P_X} I(P_X, P_{Y|X})

where the maximum is taken over all input distributions P_X. Then the following hold.

• Forward part (achievability): For any 0 < ε < 1, there exist γ > 0 and a sequence of data transmission block codes {∼Cn = (n, Mn)}_{n=1}^∞ with

      C > liminf_{n→∞} (1/n) log2 Mn ≥ C − γ

and

      Pe(∼Cn) < ε for sufficiently large n,

where Pe(∼Cn) denotes the (average) probability of error for block code ∼Cn.
4.3 Block codes for data transmission over DMCs I: 4-24

• Converse part: For any 0 < ε < 1, any sequence of data transmission block codes {∼Cn = (n, Mn)}_{n=1}^∞ with

      liminf_{n→∞} (1/n) log2 Mn > C

satisfies

      Pe(∼Cn) > (1 − ε)µ for sufficiently large n,          (4.3.1)

where

      µ = 1 − C / (liminf_{n→∞} (1/n) log2 Mn) > 0,

i.e., the codes' probability of error is bounded away from zero for all n sufficiently large.

Notes:
• (4.3.1) actually implies that

      liminf_{n→∞} Pe(∼Cn) ≥ lim_{ε↓0} (1 − ε)µ = µ,

where the error probability lower bound has nothing to do with ε. Here we state the converse of Theorem 4.11 in a form parallel to the converse statements in Theorems 3.6 and 3.15.
4.3 Block codes for data transmission over DMCs I: 4-25

• Also note that the mutual information I(X; Y) is actually a function of the input statistics P_X and the channel statistics P_{Y|X}. Hence, we may write it as

      I(P_X, P_{Y|X}) = Σ_{x∈X} Σ_{y∈Y} P_X(x) P_{Y|X}(y|x) log2 [ P_{Y|X}(y|x) / Σ_{x′∈X} P_X(x′) P_{Y|X}(y|x′) ].

Such an expression is more suitable for calculating the channel capacity.


• Channel capacity C is well-defined
– since for a fixed PY |X , I(PX , PY |X ) is concave and continuous in PX
with respect to both the variational distance and the Euclidean distance
(i.e., L2-distance) [415, Chapter 2], and
– since the set of all input distributions PX is a compact (closed and bounded)
subset of R|X | due to the finiteness of X .
For the above two reasons, there must exist a PX that achieves the supremum
of the mutual information and the maximum is attainable.
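This expression for I(P_X, P_{Y|X}) is the basis of the Blahut-Arimoto algorithm, a standard fixed-point iteration for computing C = max_{P_X} I(P_X, P_{Y|X}) numerically. A minimal sketch (not from the text; the iteration count and the BSC test case are our choices):

```python
import math

def capacity(Q, iters=200):
    """Blahut-Arimoto iteration for C = max_{P_X} I(P_X, P_{Y|X}).
    Q[x][y] = P_{Y|X}(y|x); returns the capacity in bits (to iteration accuracy)."""
    nx, ny = len(Q), len(Q[0])
    p = [1.0 / nx] * nx                              # start from the uniform input
    z = 1.0
    for _ in range(iters):
        # output distribution induced by the current input distribution
        q = [sum(p[x] * Q[x][y] for x in range(nx)) for y in range(ny)]
        # c[x] = exp( D( Q(.|x) || q ) ), computed in nats
        c = [math.exp(sum(Q[x][y] * math.log(Q[x][y] / q[y])
                          for y in range(ny) if Q[x][y] > 0)) for x in range(nx)]
        z = sum(p[x] * c[x] for x in range(nx))      # at the fixed point, ln z = C
        p = [p[x] * c[x] / z for x in range(nx)]
    return math.log2(z)

eps = 0.11
bsc = [[1 - eps, eps], [eps, 1 - eps]]
hb = -(eps * math.log2(eps) + (1 - eps) * math.log2(1 - eps))
print(capacity(bsc), 1 - hb)   # both close to 0.5001
```

For the BSC the uniform input is already the fixed point, so the iteration returns the known closed form 1 − hb(ε) essentially exactly; for channels without such symmetry the iteration converges to C from below.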
4.3 Block codes for data transmission over DMCs I: 4-26

Idea behind the proof of the forward part:


• It suffices to prove the existence of a good block code sequence, satisfying the rate condition

      liminf_{n→∞} (1/n) log2 Mn ≥ C − γ

for some γ > 0, whose average error probability is ultimately less than ε.
• Random coding argument:
– The desired good block code sequence is not deterministically constructed;
– instead, its existence is implicitly proven by showing that for a class (ensemble) of block code sequences {∼Cn}_{n=1}^∞ and a code-selecting distribution Pr[∼Cn] over these block code sequences, the expected value of the average error probability, evaluated under the code-selecting distribution on these block code sequences, can be made smaller than ε for n sufficiently large:

      E_{∼Cn}[Pe(∼Cn)] = Σ_{∼Cn} Pr[∼Cn] Pe(∼Cn) → 0 as n → ∞.

– Hence, there must exist at least one desired good code sequence {∼Cn*}_{n=1}^∞ among them (with Pe(∼Cn*) → 0 as n → ∞).
4.3 Block codes for data transmission over DMCs I: 4-27

Proof of the forward part:


• Since the forward part holds trivially when C = 0 by setting Mn = 1, we
assume in the sequel that C > 0.
• Fix ε ∈ (0, 1) and some γ with 0 < γ < min{4ε, C}.
• Observe that there exists N0 such that for n > N0, we can choose an integer Mn with

      C − γ/2 ≥ (1/n) log2 Mn > C − γ.          (4.3.2)

(Since we are only concerned with the case of "sufficiently large n," it suffices to consider only those n's satisfying n > N0, and ignore those n's for n ≤ N0.)
• Define δ := γ/8.
4.3 Block codes for data transmission over DMCs I: 4-28

• Let P_X̂ be the probability distribution achieving the channel capacity:

      C := max_{P_X} I(P_X, P_{Y|X}) = I(P_X̂, P_{Y|X}).

Denote by P_Ŷ^n the channel output distribution due to the channel input product distribution P_X̂^n with P_X̂^n(x^n) = ∏_{i=1}^n P_X̂(x_i); in other words,

      P_Ŷ^n(y^n) = Σ_{x^n ∈ X^n} P_{X̂^n,Ŷ^n}(x^n, y^n)

and

      P_{X̂^n,Ŷ^n}(x^n, y^n) := P_X̂^n(x^n) P_{Y^n|X^n}(y^n|x^n)

for all x^n ∈ X^n and y^n ∈ Y^n.

– Note that since P_X̂^n(x^n) = ∏_{i=1}^n P_X̂(x_i) and the channel is memoryless, the resulting joint input-output process {(X̂_i, Ŷ_i)}_{i=1}^∞ is also memoryless with

      P_{X̂^n,Ŷ^n}(x^n, y^n) = ∏_{i=1}^n P_{X̂,Ŷ}(x_i, y_i)

and

      P_{X̂,Ŷ}(x, y) = P_X̂(x) P_{Y|X}(y|x) for x ∈ X, y ∈ Y.

We next present the proof in three steps.


4.3 Block codes for data transmission over DMCs I: 4-29

Step 1: Code construction.

• For any blocklength n, independently select Mn channel inputs with replacement from X^n according to the distribution P_X̂^n(x^n).
• For the selected Mn channel inputs yielding codebook

      ∼Cn := {c_1, c_2, . . . , c_Mn},

define the encoder fn(·) and decoder gn(·), respectively, as follows:

      fn(m) = c_m for 1 ≤ m ≤ Mn,

and

      gn(y^n) = m,                           if c_m is the only codeword in ∼Cn
                                             satisfying (c_m, y^n) ∈ Fn(δ);
                any one in {1, 2, . . . , Mn}, otherwise,

where Fn(δ) is defined in Definition 4.7 with respect to the distribution P_{X̂^n,Ŷ^n}. (We assume that the codebook ∼Cn and the channel distribution P_{Y|X} are known at both the encoder and the decoder.)
4.3 Block codes for data transmission over DMCs I: 4-30


Fn (δ) := (xn, y n ) ∈ X n × Y n :
   
 1   1 
− log2 PX n (xn) − H(X) < δ, − log2 PY n (y n ) − H(Y ) < δ,
 n   n 
  
 1 
and − log2 PX n,Y n (xn, y n ) − H(X, Y ) < δ .
n

• Again, let me repeat the encoding and decoding process here!


– A message W is chosen according to the uniform distribution from the
set of messages.
– The encoder fn then transmits the W th codeword cW in ∼Cn over the
channel.
– Then Y n is received at the channel output and the decoder guesses the
sent message via Ŵ = gn(Y n).
– Note that there is a total |X |nMn possible randomly generated codebooks
∼Cn and the probability of selecting each codebook is given by
Mn

Pr[∼Cn] = PX̂ n (cm).
m=1
4.3 Block codes for data transmission over DMCs I: 4-31

Step 2: Conditional error probability.

• For each (randomly generated) data transmission code ∼Cn, the conditional probability of error given that message m was sent, λ_m(∼Cn), can be upper bounded by:

      λ_m(∼Cn) ≤ Σ_{y^n ∈ Y^n : (c_m, y^n) ∉ Fn(δ)} P_{Y^n|X^n}(y^n|c_m)
               + Σ_{m′=1, m′≠m}^{Mn} Σ_{y^n ∈ Y^n : (c_{m′}, y^n) ∈ Fn(δ)} P_{Y^n|X^n}(y^n|c_m),          (4.3.3)

where
– the first term in (4.3.3) considers the case that the received channel output y^n is not jointly δ-typical with c_m (and hence, the decoding rule gn(·) could possibly result in a wrong guess), and
– the second term in (4.3.3) reflects the situation when y^n is jointly δ-typical not only with the transmitted codeword c_m, but also with another codeword c_{m′} (which may cause a decoding error).
4.3 Block codes for data transmission over DMCs I: 4-32

• By taking the expectation in (4.3.3) with respect to the mth codeword-selecting distribution P_X̂^n(c_m), we obtain

      Σ_{c_m ∈ X^n} P_X̂^n(c_m) λ_m(∼Cn)
        ≤ Σ_{c_m ∈ X^n} Σ_{y^n ∉ Fn(δ|c_m)} P_X̂^n(c_m) P_{Y^n|X^n}(y^n|c_m)
          + Σ_{c_m ∈ X^n} Σ_{m′=1, m′≠m}^{Mn} Σ_{y^n ∈ Fn(δ|c_{m′})} P_X̂^n(c_m) P_{Y^n|X^n}(y^n|c_m)
        = P_{X̂^n,Ŷ^n}(F_n^c(δ)) + Σ_{m′=1, m′≠m}^{Mn} Σ_{c_m ∈ X^n} Σ_{y^n ∈ Fn(δ|c_{m′})} P_{X̂^n,Ŷ^n}(c_m, y^n),          (4.3.4)

where

      Fn(δ|x^n) := {y^n ∈ Y^n : (x^n, y^n) ∈ Fn(δ)}.
4.3 Block codes for data transmission over DMCs I: 4-33

Step 3: Average error probability.

      E_{∼Cn}[Pe(∼Cn)] = Σ_{∼Cn} Pr[∼Cn] Pe(∼Cn)
        = Σ_{c_1 ∈ X^n} ··· Σ_{c_Mn ∈ X^n} P_X̂^n(c_1) ··· P_X̂^n(c_Mn) [ (1/Mn) Σ_{m=1}^{Mn} λ_m(∼Cn) ]
        = (1/Mn) Σ_{m=1}^{Mn} Σ_{c_1 ∈ X^n} ··· Σ_{c_{m−1} ∈ X^n} Σ_{c_{m+1} ∈ X^n} ··· Σ_{c_Mn ∈ X^n}
              P_X̂^n(c_1) ··· P_X̂^n(c_{m−1}) P_X̂^n(c_{m+1}) ··· P_X̂^n(c_Mn)
              × [ Σ_{c_m ∈ X^n} P_X̂^n(c_m) λ_m(∼Cn) ]
        ≤ (1/Mn) Σ_{m=1}^{Mn} Σ_{c_1} ··· Σ_{c_{m−1}} Σ_{c_{m+1}} ··· Σ_{c_Mn}
              P_X̂^n(c_1) ··· P_X̂^n(c_{m−1}) P_X̂^n(c_{m+1}) ··· P_X̂^n(c_Mn) × P_{X̂^n,Ŷ^n}(F_n^c(δ))
          + (1/Mn) Σ_{m=1}^{Mn} Σ_{c_1} ··· Σ_{c_{m−1}} Σ_{c_{m+1}} ··· Σ_{c_Mn}
              P_X̂^n(c_1) ··· P_X̂^n(c_{m−1}) P_X̂^n(c_{m+1}) ··· P_X̂^n(c_Mn)
              × Σ_{m′=1, m′≠m}^{Mn} Σ_{c_m ∈ X^n} Σ_{y^n ∈ Fn(δ|c_{m′})} P_{X̂^n,Ŷ^n}(c_m, y^n)          (4.3.5)
        = P_{X̂^n,Ŷ^n}(F_n^c(δ))
          + (1/Mn) Σ_{m=1}^{Mn} Σ_{m′=1, m′≠m}^{Mn} Σ_{c_1} ··· Σ_{c_{m−1}} Σ_{c_{m+1}} ··· Σ_{c_Mn}
              P_X̂^n(c_1) ··· P_X̂^n(c_{m−1}) P_X̂^n(c_{m+1}) ··· P_X̂^n(c_Mn)
              × [ Σ_{c_m ∈ X^n} Σ_{y^n ∈ Fn(δ|c_{m′})} P_{X̂^n,Ŷ^n}(c_m, y^n) ],

where (4.3.5) follows from (4.3.4), and the last step holds since P_{X̂^n,Ŷ^n}(F_n^c(δ)) is a constant independent of c_1, . . ., c_Mn and m.
4.3 Block codes for data transmission over DMCs I: 4-36


(Then for n > N0,)

      Σ_{m′=1, m′≠m}^{Mn} Σ_{c_1} ··· Σ_{c_{m−1}} Σ_{c_{m+1}} ··· Σ_{c_Mn}
          P_X̂^n(c_1) ··· P_X̂^n(c_{m−1}) P_X̂^n(c_{m+1}) ··· P_X̂^n(c_Mn)
          × [ Σ_{c_m ∈ X^n} Σ_{y^n ∈ Fn(δ|c_{m′})} P_{X̂^n,Ŷ^n}(c_m, y^n) ]
        = Σ_{m′=1, m′≠m}^{Mn} Σ_{c_{m′} ∈ X^n} P_X̂^n(c_{m′}) Σ_{c_m ∈ X^n} Σ_{y^n ∈ Fn(δ|c_{m′})} P_{X̂^n,Ŷ^n}(c_m, y^n)
        = Σ_{m′=1, m′≠m}^{Mn} Σ_{c_{m′} ∈ X^n} P_X̂^n(c_{m′}) Σ_{y^n ∈ Fn(δ|c_{m′})} [ Σ_{c_m ∈ X^n} P_{X̂^n,Ŷ^n}(c_m, y^n) ]
        = Σ_{m′=1, m′≠m}^{Mn} Σ_{c_{m′} ∈ X^n} Σ_{y^n ∈ Fn(δ|c_{m′})} P_X̂^n(c_{m′}) P_Ŷ^n(y^n)
        = Σ_{m′=1, m′≠m}^{Mn} Σ_{(c_{m′}, y^n) ∈ Fn(δ)} P_X̂^n(c_{m′}) P_Ŷ^n(y^n)
        ≤ Σ_{m′=1, m′≠m}^{Mn} |Fn(δ)| 2^{−n(H(X̂)−δ)} 2^{−n(H(Ŷ)−δ)}
        ≤ Σ_{m′=1, m′≠m}^{Mn} 2^{n(H(X̂,Ŷ)+δ)} 2^{−n(H(X̂)−δ)} 2^{−n(H(Ŷ)−δ)}
        = (Mn − 1) 2^{n(H(X̂,Ŷ)+δ)} 2^{−n(H(X̂)−δ)} 2^{−n(H(Ŷ)−δ)}
        < Mn · 2^{n(H(X̂,Ŷ)+δ)} 2^{−n(H(X̂)−δ)} 2^{−n(H(Ŷ)−δ)}
        ≤ 2^{n(C−4δ)} · 2^{−n(I(X̂;Ŷ)−3δ)} = 2^{−nδ},

where
– the 1st inequality follows from the definition of the jointly typical set Fn(δ),
– the 2nd inequality holds by the Shannon-McMillan-Breiman theorem for pairs (Theorem 4.9), and
– the last inequality follows since C = I(X̂; Ŷ) by definition of X̂ and Ŷ, and since (1/n) log2 Mn ≤ C − (γ/2) = C − 4δ.
4.3 Block codes for data transmission over DMCs I: 4-38

Consequently,

      E_{∼Cn}[Pe(∼Cn)] ≤ P_{X̂^n,Ŷ^n}(F_n^c(δ)) + 2^{−nδ},

which for sufficiently large n (and n > N0) can be made smaller than 2δ = γ/4 < ε by the Shannon-McMillan-Breiman theorem for pairs. □
Fano’s inequality I: 4-39

Relation between Fano’s inequality and converse proof:


• Consider an (n, Mn) channel block code ∼Cn with encoding and decoding functions given respectively by

      fn : {1, 2, ···, Mn} → X^n   and   gn : Y^n → {1, 2, ···, Mn}.
• Let message W, which is uniformly distributed over the set of messages {1, 2, ···, Mn}, be sent via codeword X^n(W) = fn(W) over the DMC.
• Let Y n be received at the channel output.
• At the receiver, the decoder estimates the sent message via Ŵ = gn (Y n).
• The probability of estimation error is given by the code's average error probability:

      Pr[W ≠ Ŵ] = Pe(∼Cn).

• Then Fano's inequality yields

      H(W|Y^n) ≤ 1 + Pe(∼Cn) log2(Mn − 1)
               < 1 + Pe(∼Cn) log2 Mn.          (4.3.6)
4.3 Block codes for data transmission over DMCs I: 4-40

Proof of the converse part:


• For any (n, Mn) block channel code ∼Cn as described above, we have that
W → Xn → Y n
form a Markov chain; we thus obtain by the data processing inequality that
I(W ; Y n ) ≤ I(X n; Y n ). (4.3.7)

• We can also upper bound I(X^n; Y^n) in terms of the channel capacity C as follows:

      I(X^n; Y^n) ≤ max_{P_X^n} I(X^n; Y^n)
                  ≤ max_{P_X^n} Σ_{i=1}^n I(X_i; Y_i)   (by Theorem 2.21: Bounds on mutual information)
                  ≤ Σ_{i=1}^n max_{P_{X_i}} I(X_i; Y_i)
                  = nC.          (4.3.8)
4.3 Block codes for data transmission over DMCs I: 4-41

• Consequently, code ∼Cn satisfies the following:

      log2 Mn = H(W)   (since W is uniformly distributed)
              = H(W|Y^n) + I(W; Y^n)
              ≤ H(W|Y^n) + I(X^n; Y^n)   (by (4.3.7))
              ≤ H(W|Y^n) + nC   (by (4.3.8))
              < 1 + Pe(∼Cn) · log2 Mn + nC.   (by (4.3.6))

• This implies that

      Pe(∼Cn) > 1 − C/((1/n) log2 Mn) − 1/log2 Mn = 1 − (C + 1/n)/((1/n) log2 Mn).

• So if

      liminf_{n→∞} (1/n) log2 Mn = C/(1 − µ),

then for any 0 < ε < 1, there exists an integer N such that for n ≥ N,

      (1/n) log2 Mn ≥ (C + 1/n)/(1 − (1 − ε)µ),          (4.3.9)

because, otherwise, (4.3.9) would be violated for infinitely many n, implying the contradiction that

      liminf_{n→∞} (1/n) log2 Mn ≤ liminf_{n→∞} (C + 1/n)/(1 − (1 − ε)µ) = C/(1 − (1 − ε)µ) < C/(1 − µ),

where the last inequality holds since ε > 0 and µ > 0.
4.3 Block codes for data transmission over DMCs I: 4-42

• Hence, for n ≥ N ,
C + 1/n
Pe(∼Cn) > 1 − [1 − (1 − ε)µ] = (1 − )µ > 0;
C + 1/n
i.e., Pe(∼Cn ) is bounded away from zero for n sufficiently large. 2

• Converse part: For any 0 < ε < 1, any sequence of data transmission block codes {∼Cn = (n, Mn)}_{n=1}^∞ with

      R = liminf_{n→∞} (1/n) log2 Mn > C

satisfies

      Pe(∼Cn) > (1 − ε)µ for sufficiently large n,

where

      µ = 1 − C / (liminf_{n→∞} (1/n) log2 Mn) = 1 − C/R > 0,

i.e., the codes' probability of error is bounded away from zero for all n sufficiently large.
4.3 Block codes for data transmission over DMCs I: 4-43

[Figure: the error-probability lower bound µ = 1 − C/R, plotted as a function of the rate R; µ = 0 for R ≤ C and increases toward 1 as R grows beyond C. When R > C, Pe(Cn) > (1 − ε)µ is bounded away from 0 for n sufficiently large.]
4.3 Block codes for data transmission over DMCs I: 4-44

Observation 4.12

• The results of the above channel coding theorem are illustrated in the figure below, where

      R = liminf_{n→∞} Rn = liminf_{n→∞} (1/n) log2 Mn   message bits/channel use

is usually called the asymptotic coding rate of channel block codes, and Rn is the code rate for codes of blocklength n.

      [Figure: the rate axis splits at R = C; for R < C, lim sup_{n→∞} Pe(Cn) = 0 for the best channel block code, while for R > C, lim sup_{n→∞} Pe(Cn) > 0 for all channel block codes.]

– Note that Theorem 4.11 actually indicates

      lim_{n→∞} Pe(Cn) = 0,      for R < C;
      liminf_{n→∞} Pe(Cn) > 0,   for R > C.

– Such a "two-region" behavior however only holds for a DMC.


4.3 Block codes for data transmission over DMCs I: 4-45

– For a more general channel, three regions instead of two may result, i.e.,

      (i) R < C,  (ii) C < R < C̄,  and  (iii) R > C̄,

which respectively correspond to

      (i) lim sup_{n→∞} Pe(Cn) = 0 for the best block code,
      (ii) lim sup_{n→∞} Pe(Cn) > 0 but liminf_{n→∞} Pe(Cn) = 0 for the best block code, and
      (iii) liminf_{n→∞} Pe(Cn) > 0 for all channel block codes,

where C̄ is named the optimistic channel capacity.

– Since C̄ = C for DMCs, the three regions are thus reduced to two.
4.5 Calculating channel capacity I: 4-46

4.5.1 Symmetric, Weakly Symmetric, and Quasi-symmetric Channels

Definition 4.15
• A DMC with finite input alphabet X , finite output alphabet Y and channel
transition matrix Q = [px,y ] of size |X |×|Y| is said to be symmetric if the rows
of Q are permutations of each other and the columns of Q are permutations of
each other.
• The channel is said to be weakly-symmetric if the rows of Q are permutations
of each other and all the column sums in Q are equal.

Example of symmetric channel: A ternary DMC with X = Y = {0, 1, 2} and transition matrix

      Q = | P_{Y|X}(0|0)  P_{Y|X}(1|0)  P_{Y|X}(2|0) |   | 0.4  0.1  0.5 |
          | P_{Y|X}(0|1)  P_{Y|X}(1|1)  P_{Y|X}(2|1) | = | 0.5  0.4  0.1 |
          | P_{Y|X}(0|2)  P_{Y|X}(1|2)  P_{Y|X}(2|2) |   | 0.1  0.5  0.4 |
4.5 Calculating channel capacity I: 4-47

Example of weakly symmetric but non-symmetric channel: A quaternary DMC with |X| = |Y| = 4 and

      Q = | 0.5  0.25  0.25  0   |
          | 0.5  0.25  0.25  0   |          (4.5.1)
          | 0    0.25  0.25  0.5 |
          | 0    0.25  0.25  0.5 |

is weakly-symmetric (but not symmetric).

Lemma 4.16 The capacity of a weakly-symmetric channel Q is achieved by a uniform input distribution and is given by

      C = log2 |Y| − H(q_1, q_2, ···, q_|Y|)          (4.5.3)

where (q_1, q_2, ···, q_|Y|) denotes any row of Q and

      H(q_1, q_2, ···, q_|Y|) := − Σ_{i=1}^{|Y|} q_i log2 q_i

is the row entropy.
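Applying (4.5.3) to the weakly-symmetric matrix in (4.5.1) takes only a few lines. A sketch (plain Python; the helper name is ours):

```python
import math

def weakly_symmetric_capacity(Q):
    # Lemma 4.16: C = log2 |Y| - H(any row of Q)
    row = Q[0]
    H_row = -sum(p * math.log2(p) for p in row if p > 0)
    return math.log2(len(row)) - H_row

# the weakly-symmetric (but non-symmetric) matrix from (4.5.1)
Q = [[0.5, 0.25, 0.25, 0.0],
     [0.5, 0.25, 0.25, 0.0],
     [0.0, 0.25, 0.25, 0.5],
     [0.0, 0.25, 0.25, 0.5]]
print(weakly_symmetric_capacity(Q))   # log2(4) - 1.5 = 0.5 bits
```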


4.5 Calculating channel capacity I: 4-48

Proof:
• The mutual information between the channel's input and output is given by

      I(X; Y) = H(Y) − H(Y|X) = H(Y) − Σ_{x∈X} P_X(x) H(Y|X = x)

where

      H(Y|X = x) = − Σ_{y∈Y} P_{Y|X}(y|x) log2 P_{Y|X}(y|x) = − Σ_{y∈Y} p_{x,y} log2 p_{x,y}.

• Noting that every row of Q is a permutation of every other row, we obtain that H(Y|X = x) is independent of x and can be written as

      H(Y|X = x) = H(q_1, q_2, ···, q_|Y|)

where (q_1, q_2, ···, q_|Y|) is any row of Q.

• Thus

      H(Y|X) = Σ_{x∈X} P_X(x) H(q_1, q_2, ···, q_|Y|) = H(q_1, q_2, ···, q_|Y|) Σ_{x∈X} P_X(x) = H(q_1, q_2, ···, q_|Y|).

This implies

      I(X; Y) = H(Y) − H(q_1, q_2, ···, q_|Y|) ≤ log2 |Y| − H(q_1, q_2, ···, q_|Y|)

with equality achieved iff Y is uniformly distributed over Y.

• The proof is completed by confirming that for a weakly symmetric channel, the uniform input distribution induces the uniform output distribution (see the text). □
4.5 Calculating channel capacity I: 4-50

Example 4.18 (Capacity of the BSC) Since the BSC with crossover probability (or bit error rate) ε is symmetric, we directly obtain from Lemma 4.16 that its capacity is achieved by a uniform input distribution and is given by

      C = log2(2) − H(1 − ε, ε) = 1 − hb(ε)          (4.5.5)

where hb(·) is the binary entropy function.

Example 4.19 (Capacity of the q-ary symmetric channel) Similarly, the q-ary symmetric channel with symbol error rate ε described in (4.2.11) is symmetric; hence, by Lemma 4.16, its capacity is given by

      C = log2 q − H(1 − ε, ε/(q−1), ···, ε/(q−1))
        = log2 q + ε log2 (ε/(q−1)) + (1 − ε) log2(1 − ε).
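The closed form in Example 4.19 can be coded directly; setting q = 2 should recover the BSC capacity of Example 4.18. A sketch (function names ours):

```python
import math

def qary_symmetric_capacity(q, eps):
    # Example 4.19: C = log2 q + eps*log2(eps/(q-1)) + (1-eps)*log2(1-eps)
    c = math.log2(q)
    if eps > 0:
        c += eps * math.log2(eps / (q - 1))
    if eps < 1:
        c += (1 - eps) * math.log2(1 - eps)
    return c

hb = lambda e: -(e * math.log2(e) + (1 - e) * math.log2(1 - e))
print(qary_symmetric_capacity(2, 0.1), 1 - hb(0.1))   # q = 2 recovers the BSC
```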
Question: Does the uniform input achieve the channel capacity iff
the channel is weakly symmetric? No.
4.5 Calculating channel capacity I: 4-51

Definition 4.20 (Quasi-symmetric channels) A DMC with finite input alphabet X, finite output alphabet Y and channel transition matrix Q = [p_{x,y}] of size |X| × |Y| is said to be quasi-symmetric if Q can be partitioned along its columns into m weakly-symmetric sub-matrices Q_1, Q_2, ···, Q_m for some integer m ≥ 1, where each sub-matrix Q_i has size |X| × |Y_i| for i = 1, 2, ···, m, with Y_1 ∪ ··· ∪ Y_m = Y and Y_i ∩ Y_j = ∅ for all i ≠ j, i, j = 1, 2, ···, m.

Quasi- = “having some, but not all of the features of” such as quasi-scholar and
quasi-official.

• The notion of “quasi-symmetry” we provide here is slightly more general than


Gallager’s notion [135, p. 94], as we herein allow each sub-matrix to be weakly-
symmetric (instead of symmetric as in [135]).
4.5 Calculating channel capacity I: 4-52
Lemma 4.21 The capacity of a quasi-symmetric channel Q is achieved by a uni-
form input distribution and is given by

    C = Σ_{i=1}^{m} ai Ci                    (4.5.6)

where

    ai := Σ_{y∈Yi} px,y = sum of any row in Qi,   i = 1, · · · , m,

and

    Ci := log2 |Yi| − H(any row in the matrix (1/ai) Qi),   i = 1, · · · , m

is the capacity of the ith weakly-symmetric “sub-channel” whose transition matrix
is obtained by multiplying each entry of Qi by 1/ai (this normalization renders sub-
matrix Qi into a stochastic matrix and hence a channel transition matrix).
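Formula (4.5.6) translates directly into code. The following minimal sketch (the partition interface and function names are our own) takes a channel matrix and a list of column-index groups, one per weakly-symmetric sub-matrix, and returns Σ ai Ci; applied to a BEC it returns 1 − α:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def quasi_symmetric_capacity(Q, partition):
    """Capacity via Lemma 4.21: C = sum_i a_i * C_i, where 'partition'
    lists the column indices of each weakly-symmetric sub-matrix Q_i."""
    C = 0.0
    for cols in partition:
        a_i = sum(Q[0][y] for y in cols)       # sum of any row of Q_i
        if a_i == 0:
            continue
        row = [Q[0][y] / a_i for y in cols]    # a row of (1/a_i) Q_i
        C_i = math.log2(len(cols)) - entropy(row)
        C += a_i * C_i
    return C

# BEC with erasure probability alpha: partition the columns into the two
# non-erasure outputs {0, 1} and the erasure output {E}.
alpha = 0.2
Q = [[1 - alpha, alpha, 0.0],
     [0.0, alpha, 1 - alpha]]
C = quasi_symmetric_capacity(Q, [[0, 2], [1]])
assert abs(C - (1 - alpha)) < 1e-12
```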
4.5 Calculating channel capacity I: 4-53
Example 4.22 (Capacity of the BEC) The BEC with erasure probability α
and transition matrix

    Q = [ PY|X(0|0)  PY|X(E|0)  PY|X(1|0) ]  =  [ 1−α   α    0  ]
        [ PY|X(0|1)  PY|X(E|1)  PY|X(1|1) ]     [  0    α   1−α ]

is quasi-symmetric (but neither weakly-symmetric nor symmetric).
• Its transition matrix Q can be partitioned along its columns into two symmetric
(hence weakly-symmetric) sub-matrices

    Q1 = [ 1−α   0  ]      and      Q2 = [ α ]
         [  0   1−α ]                    [ α ].
4.5 Calculating channel capacity I: 4-54
• Thus applying the capacity formula for quasi-symmetric channels of Lemma 4.21
yields that the capacity of the BEC is given by

    C = a1 C1 + a2 C2

where a1 = 1 − α, a2 = α,

    C1 = log2(2) − H((1−α)/(1−α), 0/(1−α)) = 1 − H(1, 0) = 1 − 0 = 1,

and

    C2 = log2(1) − H(α/α) = 0 − 0 = 0.

Therefore, the BEC capacity is given by

    C = (1 − α)(1) + (α)(0) = 1 − α.                    (4.5.7)
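As a cross-check on (4.5.7), one can also maximize I(X;Y) over all binary input distributions by brute force; the maximum should sit at the uniform input with value 1 − α. An illustrative sketch (our own function names):

```python
import math

def mutual_information(p, Q):
    """I(X;Y) in bits for input distribution (p, 1-p) and 2 x |Y| channel Q."""
    px = [p, 1.0 - p]
    ny = len(Q[0])
    py = [sum(px[x] * Q[x][y] for x in range(2)) for y in range(ny)]
    I = 0.0
    for x in range(2):
        for y in range(ny):
            if px[x] > 0 and Q[x][y] > 0:
                I += px[x] * Q[x][y] * math.log2(Q[x][y] / py[y])
    return I

alpha = 0.3
Q = [[1 - alpha, alpha, 0.0],
     [0.0, alpha, 1 - alpha]]

# Grid search over input distributions; the maximum should equal 1 - alpha.
best = max(mutual_information(k / 1000.0, Q) for k in range(1001))
assert abs(best - (1 - alpha)) < 1e-6
```

(For the BEC one can show I(X;Y) = (1 − α) hb(p), so the grid search just confirms that the binary entropy peaks at p = 1/2.)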
4.5 Calculating channel capacity I: 4-55
Example 4.23 (Capacity of the BSEC) Similarly, the BSEC with crossover
probability ε and erasure probability α and transition matrix

    Q = [px,y ] = [ p0,0  p0,E  p0,1 ]  =  [ 1−ε−α   α     ε   ]
                  [ p1,0  p1,E  p1,1 ]     [   ε     α   1−ε−α ]

is quasi-symmetric; its transition matrix can be partitioned along its columns into
two symmetric sub-matrices

    Q1 = [ 1−ε−α     ε   ]      and      Q2 = [ α ]
         [   ε     1−ε−α ]                    [ α ].

Hence by Lemma 4.21, the channel capacity is given by C = a1C1 + a2C2 where
a1 = 1 − α, a2 = α,

    C1 = log2(2) − H((1−ε−α)/(1−α), ε/(1−α)) = 1 − hb((1−ε−α)/(1−α)),

and

    C2 = log2(1) − H(α/α) = 0.
4.5 Calculating channel capacity I: 4-56
We thus obtain that

    C = (1 − α)[1 − hb((1−ε−α)/(1−α))] + (α)(0)
      = (1 − α)[1 − hb((1−ε−α)/(1−α))].                    (4.5.8)
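Two quick consistency checks on (4.5.8), in an illustrative sketch (function names are our own): setting α = 0 must recover the BSC capacity 1 − hb(ε), and setting ε = 0 must recover the BEC capacity 1 − α:

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsec_capacity(eps, alpha):
    """C = (1 - alpha) * (1 - hb((1 - eps - alpha)/(1 - alpha))), as in (4.5.8)."""
    return (1 - alpha) * (1 - hb((1 - eps - alpha) / (1 - alpha)))

# alpha = 0 recovers the BSC capacity 1 - hb(eps) ...
assert abs(bsec_capacity(0.1, 0.0) - (1 - hb(0.1))) < 1e-12
# ... and eps = 0 recovers the BEC capacity 1 - alpha.
assert abs(bsec_capacity(0.0, 0.25) - 0.75) < 1e-12
```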
4.5.2 Karush-Kuhn-Tucker cond. for chan. capacity I: 4-57
Definition 4.24 (Mutual information for a specific input symbol) The
mutual information for a specific input symbol is defined as:

    I(x; Y) := Σ_{y∈Y} PY|X(y|x) log2 [ PY|X(y|x) / PY(y) ].
From the above definition, the mutual information becomes:

    I(X; Y) = Σ_{x∈X} PX(x) Σ_{y∈Y} PY|X(y|x) log2 [ PY|X(y|x) / PY(y) ]
            = Σ_{x∈X} PX(x) I(x; Y).
4.5.2 Karush-Kuhn-Tucker cond. for chan. capacity I: 4-58
Lemma 4.25 (KKT condition for channel capacity) For a given DMC, an
input distribution PX achieves its channel capacity iff there exists a constant C
such that

    I(x; Y) = C   for all x ∈ X with PX(x) > 0;
    I(x; Y) ≤ C   for all x ∈ X with PX(x) = 0.        (4.5.9)

Furthermore, the constant C is the channel capacity (justifying the choice of nota-
tion).
Proof: The forward (if) part holds directly; hence, we only prove the converse
(only-if) part.
• Without loss of generality, we assume that PX (x) < 1 for all x ∈ X , since
PX (x) = 1 for some x implies that I(X; Y ) = 0.
4.5.2 Karush-Kuhn-Tucker cond. for chan. capacity I: 4-59
• The problem of calculating the channel capacity is to maximize

    I(X; Y) = Σ_{x∈X} Σ_{y∈Y} PX(x) PY|X(y|x) log2 [ PY|X(y|x) / Σ_{x′∈X} PX(x′) PY|X(y|x′) ],   (4.5.10)

subject to the condition

    Σ_{x∈X} PX(x) = 1                    (4.5.11)

for a given channel distribution PY|X.
• By using the Lagrange multiplier method (e.g., see Appendix B.10), maximizing
(4.5.10) subject to (4.5.11) is equivalent to maximizing:

    f(PX) := Σ_{x∈X} Σ_{y∈Y} PX(x) PY|X(y|x) log2 [ PY|X(y|x) / Σ_{x′∈X} PX(x′) PY|X(y|x′) ]
             + λ ( Σ_{x∈X} PX(x) − 1 ).
4.5.2 Karush-Kuhn-Tucker cond. for chan. capacity I: 4-60
• We then take the derivative of the above quantity with respect to PX(x′), and
obtain that

    ∂f(PX)/∂PX(x′) = I(x′; Y) − log2(e) + λ.

The details for taking the derivative are as follows:

    ∂/∂PX(x′) [ Σ_{x∈X} Σ_{y∈Y} PX(x) PY|X(y|x) log2 PY|X(y|x)
                − Σ_{x∈X} Σ_{y∈Y} PX(x) PY|X(y|x) log2 ( Σ_{x″∈X} PX(x″) PY|X(y|x″) )
                + λ ( Σ_{x∈X} PX(x) − 1 ) ]

    = Σ_{y∈Y} PY|X(y|x′) log2 PY|X(y|x′) − Σ_{y∈Y} PY|X(y|x′) log2 ( Σ_{x″∈X} PX(x″) PY|X(y|x″) )
      − log2(e) Σ_{x∈X} Σ_{y∈Y} PX(x) PY|X(y|x) PY|X(y|x′) / ( Σ_{x″∈X} PX(x″) PY|X(y|x″) ) + λ

    = I(x′; Y) − log2(e) Σ_{y∈Y} PY|X(y|x′) + λ

    = I(x′; Y) − log2(e) + λ.
4.5.2 Karush-Kuhn-Tucker cond. for chan. capacity I: 4-61
• By Property 2 of Lemma 2.46, I(X; Y) = I(PX, PY|X) is a concave function
in PX (for a fixed PY|X). Therefore,
1. the maximum of I(PX, PY|X) occurs at a zero derivative when PX(x) does
not lie on the boundary, namely 1 > PX(x) > 0.
2. For those PX(x) lying on the boundary, i.e., PX(x) = 0, the maximum
occurs iff a displacement from the boundary into the interior decreases the
quantity, which implies a non-positive derivative, namely

    I(x; Y) ≤ −λ + log2(e), for those x with PX(x) = 0.
• To summarize, if an input distribution PX achieves the channel capacity, then

    I(x; Y) = −λ + log2(e), for those x with PX(x) > 0;
    I(x; Y) ≤ −λ + log2(e), for those x with PX(x) = 0,

for some λ.
• With the above result, setting C = −λ + log2(e) yields (4.5.9).
• Finally, multiplying both sides of each equation in (4.5.9) by PX(x) and sum-
ming over x yields maxPX I(X; Y) on the left and the constant C on the
right, thus proving that the constant C is indeed the channel’s capacity. 2
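The KKT condition is easy to verify numerically for a concrete channel. In the sketch below (our own function names), the uniform input on the BSC(0.1) yields the same per-symbol mutual information I(x;Y) for both inputs, equal to the capacity 1 − hb(0.1):

```python
import math

def I_x(x, px, Q):
    """Mutual information I(x;Y) for a specific input symbol (Definition 4.24)."""
    ny = len(Q[0])
    py = [sum(px[a] * Q[a][y] for a in range(len(px))) for y in range(ny)]
    return sum(Q[x][y] * math.log2(Q[x][y] / py[y])
               for y in range(ny) if Q[x][y] > 0)

# BSC(0.1) under a uniform input: I(x;Y) should equal the same constant
# C = 1 - hb(0.1) for both inputs, as the KKT condition requires.
eps = 0.1
Q = [[1 - eps, eps], [eps, 1 - eps]]
px = [0.5, 0.5]
C = 1 + eps * math.log2(eps) + (1 - eps) * math.log2(1 - eps)
assert abs(I_x(0, px, Q) - C) < 1e-12
assert abs(I_x(1, px, Q) - C) < 1e-12
```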
4.5.2 Karush-Kuhn-Tucker cond. for chan. capacity I: 4-62
Question: Does the uniform input achieve the channel capacity iff
the channel is quasi-symmetric? No.
Observation 4.28 (Capacity achieved by a uniform input distribution)
• T-symmetric channels [319, Section V, Definition 1]: A channel is T-symmetric
if

    T(x) := I(x; Y) − log2 |X| = Σ_{y∈Y} PY|X(y|x) log2 [ PY|X(y|x) / Σ_{x′∈X} PY|X(y|x′) ]

is a constant function of x (i.e., functionally independent of x), where I(x; Y)
is the mutual information for input x under a uniform input distribution.
• An example of a T-symmetric channel that is not quasi-symmetric is the binary-
input ternary-output channel with the following transition matrix

    Q = [ 1/3  1/3  1/3 ]
        [ 1/6  1/6  2/3 ].
Its capacity is achieved by the uniform input distribution.
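T-symmetry of this example channel can be confirmed numerically (a sketch; the function names are ours). T(x) evaluates to the same constant for both inputs, so the uniform input satisfies the KKT condition and the capacity equals T(x) + log2 |X| ≈ 0.0817 bits:

```python
import math

# The binary-input ternary-output channel from Observation 4.28.
Q = [[1/3, 1/3, 1/3],
     [1/6, 1/6, 2/3]]

def T(x):
    """T(x) = sum_y P(y|x) log2( P(y|x) / sum_x' P(y|x') )."""
    return sum(Q[x][y] * math.log2(Q[x][y] / (Q[0][y] + Q[1][y]))
               for y in range(3))

# T-symmetry: T(x) is the same for both inputs ...
assert abs(T(0) - T(1)) < 1e-12
# ... so the uniform input achieves capacity C = T(x) + log2|X|.
C = T(0) + 1.0
assert 0.08 < C < 0.083
```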
4.5.2 Karush-Kuhn-Tucker cond. for chan. capacity I: 4-63
• Unlike quasi-symmetric channels, T-symmetric channels do not admit in gen-
eral a simple closed-form expression for their capacity (such as the one given
in (4.5.6)):

    C = Σ_{i=1}^{m} ai Ci                    (4.5.6)
4.4 Example of Polar Codes for the BEC I: 4-64
• Polar coding is a new channel coding method proposed by Arikan during 2008-
2009, which can provably achieve the capacity of any binary-input memoryless
channel Q whose capacity is realized by a uniform input distribution.
• The main idea behind polar codes is channel “polarization,” which transforms
n uses of BEC(ε) into extremal “polarized” channels; i.e., channels which are
either perfect (noiseless) or completely noisy.
• It is shown that as n → ∞, the number of unpolarized channels converges to
0 and the fraction of perfect channels converges to I(X; Y ) = 1 − ε under a
uniform input, which is the capacity of the BEC (see Example 4.22 in Section
4.5).
• A polar code can then be naturally obtained by sending information bits di-
rectly through those perfect channels and sending known bits (usually called
frozen bits) through the completely noisy channels.
4.4 Example of Polar Codes for the BEC I: 4-65
    U1 ──⊕──→ X1 ──→ BEC(ε) ──→ Y1
         ↑
    U2 ──┴──→ X2 ──→ BEC(ε) ──→ Y2
• We start with the simplest case (often named basic transformation) of n = 2.
• Under uniformly distributed X1 and X2, we have

    I(Q) := I(X1; Y1) = I(X2; Y2) = 1 − ε.

• Now consider the following linear modulo-2 operation:

    X1 = U1 ⊕ U2,
    X2 = U2,

where U1 and U2 represent uniformly distributed independent message bits.
4.4 Example of Polar Codes for the BEC I: 4-66
    U1 ──⊕──→ X1 ──→ BEC(ε) ──→ Y1
         ↑
    U2 ──┴──→ X2 ──→ BEC(ε) ──→ Y2
• The decoder performs successive cancellation decoding as follows.
– It first decodes U1 from the received (Y1, Y2),
– and then decodes U2 based on (Y1, Y2) and the previously decoded U1
(assuming the decoding is done correctly).
• This will create two new channels; namely the “worse” channel Q− and the
“better” channel Q+ given by

    Q− : U1 → (Y1, Y2),
    Q+ : U2 → (Y1, Y2, U1),

respectively (the names of these channels will be justified shortly).
4.4 Example of Polar Codes for the BEC I: 4-67
    U1 ──⊕──→ U1 ⊕ U2 ──→ BEC(ε) ──→ Y1
         ↑
    U2 ──┴──→ U2 ──────→ BEC(ε) ──→ Y2

• Q− : U1 is estimated as

    Y1 ⊕ Y2,  if Y1, Y2 ∈ {0, 1}
    ? ⊕ Y2,   if Y1 = E, Y2 ∈ {0, 1}
    Y1 ⊕ ?,   if Y1 ∈ {0, 1}, Y2 = E
    ? ⊕ ?,    if Y1 = Y2 = E

Noting that given output E for a BEC, the receiver knows “nothing” about
the input.
• Thus, Q− is a BEC with erasure probability ε− := 1 − (1 − ε)².
4.4 Example of Polar Codes for the BEC I: 4-68
    U1 ──⊕──→ U1 ⊕ U2 ──→ BEC(ε) ──→ Y1
         ↑
    U2 ──┴──→ U2 ──────→ BEC(ε) ──→ Y2

• Q+ : U2 is estimated as

    Y1 ⊕ U1,  if Y1 ∈ {0, 1}
    Y2,       if Y2 ∈ {0, 1}
    ?,        if Y1 = Y2 = E

• Q+ is a BEC with erasure probability ε+ := ε².

Thus, let U1 be the frozen bit and U2 be the info bit. One can transform the
system to a BEC(ε²) with code rate 1/2 bits/channel usage.
4.4 Example of Polar Codes for the BEC I: 4-69
The channel capacity remains the same:

    I(Q+) + I(Q−) = I(U2; Y1, Y2, U1) + I(U1; Y1, Y2)
                  = (1 − ε²) + [1 − (1 − (1 − ε)²)]
                  = 2(1 − ε)
                  = 2 I(Q).                    (4.4.1)
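The one-step erasure-probability recursion can be checked directly, together with the capacity-conservation identity (4.4.1). A minimal sketch (our own function name):

```python
def polarize(e):
    """One basic transformation: two uses of BEC(e) become the worse
    channel BEC(1-(1-e)^2) and the better channel BEC(e^2)."""
    return 1 - (1 - e) ** 2, e ** 2   # (e_minus, e_plus)

e = 0.5
e_minus, e_plus = polarize(e)
assert (e_minus, e_plus) == (0.75, 0.25)

# Capacity is preserved: I(Q+) + I(Q-) = 2 I(Q), as in (4.4.1).
assert abs((1 - e_plus) + (1 - e_minus) - 2 * (1 - e)) < 1e-12
```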
4.4 Example of Polar Codes for the BEC I: 4-70
• Now, let us consider the case of n = 4 and suppose we perform the basic trans-
formation twice to send (i.i.d. uniform) message bits (U1, U2, U3, U4), yielding

    Q− : V1 → (Y1, Y2), where X1 = V1 ⊕ V2,
    Q+ : V2 → (Y1, Y2, V1), where X2 = V2,
    Q− : V3 → (Y3, Y4), where X3 = V3 ⊕ V4,
    Q+ : V4 → (Y3, Y4, V3), where X4 = V4,

where V1 = U1 ⊕ U2, V3 = U2, V2 = U3 ⊕ U4 and V4 = U4.

• Applying the basic transformation once more yields

    Q−− : U1 → (Y1, Y2, Y3, Y4) with erasure probability ε−− := 1 − (1 − ε−)²
    Q+− : U3 → (Y1, Y2, Y3, Y4, U1, U2) with erasure probability ε+− := 1 − (1 − ε+)²
    Q−+ : U2 → (Y1, Y2, Y3, Y4, U1) with erasure probability ε−+ := (ε−)²
    Q++ : U4 → (Y1, Y2, Y3, Y4, U1, U2, U3) with erasure probability ε++ := (ε+)².
4.4 Example of Polar Codes for the BEC I: 4-71
In polar coding terminology,

• the process of using multiple basic transformations to get X1, . . . , Xn from
U1, . . . , Un (where the Ui’s are i.i.d. uniform message random variables) is
called channel “combining,”
• and that of using Y1, . . . , Yn and U1, . . . , Ui−1 to obtain Ui for i ∈ {1, . . . , n}
is called channel “splitting.”
• Altogether, the phenomenon is called channel “polarization.”
Example 4.14 Consider a BEC with erasure probability ε = 0.5 and let n = 8.
4.4 Example of Polar Codes for the BEC I: 4-72
[Figure: channel-combining circuit for n = 8 over BEC(0.5). The numbers in
parentheses are the erasure probabilities of the synthesized channels at each
stage, obtained by applying ε− = 1 − (1 − ε)² and ε+ = ε² three times:

    stage 1: 0.75, 0.25
    stage 2: 0.9375, 0.5625, 0.4375, 0.0625
    stage 3: U1: 0.9961, U2: 0.8789, U3: 0.8086, U4: 0.3164,
             U5: 0.6836, U6: 0.1914, U7: 0.1211, U8: 0.0039]
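The erasure probabilities of the n = 8 example can be reproduced by iterating the basic transformation, as in this sketch (the function name is our own):

```python
def polarize_levels(eps, levels):
    """Repeatedly apply the basic transformation to BEC(eps); after k
    levels this yields the 2^k synthesized channels' erasure probabilities."""
    es = [eps]
    for _ in range(levels):
        nxt = []
        for e in es:
            nxt += [1 - (1 - e) ** 2, e ** 2]   # worse channel, better channel
        es = nxt
    return es

es = polarize_levels(0.5, 3)   # n = 8 with three levels of polarization
assert len(es) == 8
# Extremes match the figure (0.0039 and 0.9961, to four decimals).
assert abs(min(es) - 0.00390625) < 1e-12
assert abs(max(es) - 0.99609375) < 1e-12
# Total capacity is conserved: sum_i (1 - eps_i) = 8 * (1 - 0.5) = 4.
assert abs(sum(1 - e for e in es) - 4.0) < 1e-12
```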
4.4 Example of Polar Codes for the BEC I: 4-73
• A key reason for the prevalence of polar coding after its invention is that polar
codes form the first coding scheme that has an explicit low-complexity construction
structure while being capable of achieving channel capacity as the code length
approaches infinity.
• More importantly, polar codes do not exhibit the error floor behavior, which
Turbo and (to a lesser extent) LDPC codes are prone to.
• Due to their attractive properties, polar codes were adopted in 2016 by the
3rd Generation Partnership Project (3GPP) as error correcting codes for the
control channel of the 5th generation (5G) mobile communication standard.
4.6 Lossless joint source-channel coding I: 4-74
and Shannon’s separation principle

• We next establish Shannon’s lossless joint source-channel coding theorem
(or lossless information transmission theorem), which provides explicit (and
directly verifiable) conditions for any communication system in terms of its
source and channel information-theoretic quantities under which the source can
be reliably transmitted (i.e., with asymptotically vanishing error probability).
• This key theorem is sometimes referred to as Shannon’s source-channel sep-
aration theorem or principle.
– Why is it named the “separation principle”?
– Answer: The theorem’s necessary and sufficient conditions for reliable trans-
missibility are a function of entirely “separable” or “disentangled” informa-
tion quantities, i.e., the source’s minimal compression rate and the chan-
nel’s capacity, with no quantities that depend on both the source and the
channel.

• We will prove the theorem by assuming that the source is stationary ergodic in
the forward part and just stationary in the converse part and that the channel
is a DMC.
• Note that the theorem can be extended to more general sources and channels
with memory (see Dobrushin 1963, Vembu & Verdu & Steinberg 1995, Chen
& Alajaji 1999).
Xn Yn
Source - Source - Channel - Channel - Channel - Source - Sink
Encoder Encoder Decoder Decoder

A separate (tandem) source-channel coding scheme.

Xn Yn
Source - Encoder - Channel - Decoder - Sink

A joint source-channel coding scheme.
4.6 Lossless joint source-channel coding I: 4-76
Definition 4.29 (Source-channel block code) Given a discrete source {Vi}∞i=1
with finite alphabet V and a discrete channel {PY n|X n }∞n=1 with finite input and
output alphabets X and Y, respectively, an m-to-n source-channel block code ∼Cm,n
with rate m/n source symbols/channel symbol is a pair of mappings (f (sc), g (sc)), where

    f (sc) : V m → X n    and    g (sc) : Y n → V m.

    V m ──→ [ Encoder f (sc) ] ──→ X n ──→ [ Channel PY n|X n ] ──→ Y n ──→ [ Decoder g (sc) ] ──→ V̂ m

An m-to-n block source-channel coding system.

The code’s error probability is given by

    Pe(∼Cm,n ) := Pr[V m ≠ V̂ m]
                = Σ_{v m∈V m} Σ_{y n∈Y n : g (sc)(y n) ≠ v m} PV m(v m) PY n|X n(y n |f (sc)(v m))

where PV m and PY n|X n are the source and channel distributions, respectively.
4.6 Lossless joint source-channel coding I: 4-77
Theorem 4.30 (Lossless joint source-channel coding theorem for rate-
one block codes) Consider a discrete source {Vi}∞i=1 with finite alphabet V and
entropy rate H(V) and a DMC with input alphabet X , output alphabet Y and
capacity C, where both H(V) and C are measured in the same units (i.e., they
both use the same base of the logarithm). Then the following hold:
• Forward part (achievability): For any 0 < ε < 1 and given that the source is
stationary ergodic, if

    H(V) < C,

then there exists a sequence of rate-one source-channel codes {∼Cm,m }∞m=1 such
that

    Pe(∼Cm,m ) < ε for sufficiently large m,

where Pe(∼Cm,m ) is the error probability of the source-channel code ∼Cm,m .
• Converse part: For any 0 < ε < 1 and given that the source is stationary, if

    H(V) > C,

then any sequence of rate-one source-channel codes {∼Cm,m }∞m=1 satisfies

    Pe(∼Cm,m ) > (1 − ε)µ for sufficiently large m,        (4.6.1)

where µ := HD (V) − CD with D = |V|, and HD (V) and CD are the entropy rate
and channel capacity measured in D-ary digits; i.e., the codes’ error probability
is bounded away from zero and it is not possible to transmit the source over
the channel via rate-one source-channel block codes with arbitrarily low error
probability.
4.6 Lossless joint source-channel coding I: 4-79
Proof of the forward part:

• Without loss of generality, we assume throughout this proof that both the
source entropy rate H(V) and the channel capacity C are measured in nats
(i.e., they are both expressed using the natural logarithm).
• Key idea: We will show the existence of the desired rate-one source-channel
codes ∼Cm,m via a separate (tandem or two-stage) source and channel coding
scheme.
• Let γ := C − H(V) > 0.
• Given any 0 < ε < 1, by the lossless source-coding theorem for stationary
ergodic sources (Theorem 3.15), there exists a sequence of source codes of
blocklength m and size Mm with

encoder fs : V m → {1, 2, . . . , Mm} and decoder gs : {1, 2, . . . , Mm} → V m

such that

    (1/m) log Mm < H(V) + γ/2                    (4.6.2)

and

    Pr [gs(fs(V m)) ≠ V m] < ε/2

for m sufficiently large.
4.6 Lossless joint source-channel coding I: 4-80
• Furthermore, by the channel coding theorem under the maximal probability of
error criterion (see Observation 4.6 and Theorem 4.11), there exists a sequence
of channel codes of blocklength m and size M̄m with encoder

    fc : {1, 2, . . . , M̄m} → X m

and decoder

    gc : Y m → {1, 2, . . . , M̄m}

such that

    (1/m) log M̄m > C − γ/2 = H(V) + γ/2 > (1/m) log Mm        (4.6.5)

and

    λ := max_{w∈{1,...,M̄m}} Pr [gc(Y m) ≠ w | X m = fc(w)] < ε/2

for m sufficiently large.
4.6 Lossless joint source-channel coding I: 4-81
• Now we form our source-channel code by concatenating in tandem the above
source and channel codes.
• Specifically, the m-to-m source-channel code ∼Cm,m has the following encoder-
decoder pair (f (sc), g (sc)):

    f (sc) : V m → X m with f (sc)(v m) = fc(fs(v m)) for all v m ∈ V m

and

    g (sc) : Y m → V m

with

    g (sc)(y m) = gs(gc(y m)), if gc(y m) ∈ {1, 2, . . . , Mm};
    g (sc)(y m) = arbitrary,   otherwise,

for all y m ∈ Y m.
• The above construction is possible since {1, 2, . . . , Mm} is a subset of
{1, 2, . . . , M̄m}.
4.6 Lossless joint source-channel coding I: 4-82
Pe(∼Cm,m ) = Pr[g (sc)(Y m) ≠ V m]
    = Pr[g (sc)(Y m) ≠ V m, gc(Y m) = fs(V m)]
      + Pr[g (sc)(Y m) ≠ V m, gc(Y m) ≠ fs(V m)]
    = Pr[gs(gc(Y m)) ≠ V m, gc(Y m) = fs(V m)]
      + Pr[g (sc)(Y m) ≠ V m, gc(Y m) ≠ fs(V m)]
    ≤ Pr[gs(fs(V m)) ≠ V m] + Pr[gc(Y m) ≠ fs(V m)]
    = Pr[gs(fs(V m)) ≠ V m]
      + Σ_{w∈{1,2,...,Mm}} Pr[fs(V m) = w] Pr[gc(Y m) ≠ w | fs(V m) = w]
    = Pr[gs(fs(V m)) ≠ V m]
      + Σ_{w∈{1,2,...,Mm}} Pr[X m = fc(w)] Pr[gc(Y m) ≠ w | X m = fc(w)]
    ≤ Pr[gs(fs(V m)) ≠ V m] + λ
    < ε/2 + ε/2 = ε

for m sufficiently large. Thus the source can be reliably sent over the channel via
rate-one block source-channel codes as long as H(V) < C.

Proof of the converse part: For simplicity, we assume in this proof that H(V)
and C are measured in bits.
For any m-to-m source-channel code ∼Cm,m , we can write

    H(V) ≤ (1/m) H(V m)                                              (4.6.6)
         = (1/m) H(V m |V̂ m) + (1/m) I(V m; V̂ m)
         ≤ (1/m) [Pe(∼Cm,m ) log2(|V|m) + 1] + (1/m) I(V m; V̂ m)     (4.6.7)
         ≤ Pe(∼Cm,m ) log2 |V| + 1/m + (1/m) I(X m; Y m)             (4.6.8)
         ≤ Pe(∼Cm,m ) log2 |V| + 1/m + C                             (4.6.9)

where
• (4.6.6) is due to the fact that (1/m)H(V m) is non-increasing in m and converges
to H(V) as m → ∞ since the source is stationary (see Observation 3.12),
• (4.6.7) follows from Fano’s inequality,

    H(V m|V̂ m) ≤ Pe(∼Cm,m ) log2(|V|m) + hb(Pe(∼Cm,m )) ≤ Pe(∼Cm,m ) log2(|V|m) + 1,

• (4.6.8) is due to the data processing inequality since V m → X m → Y m → V̂ m
form a Markov chain.
4.6 Lossless joint source-channel coding I: 4-84
Note that in the above derivation, the information measures are all measured in
bits. This implies that for m ≥ logD (2)/(εµ),

    Pe(∼Cm,m ) ≥ [H(V) − C]/log2(|V|) − 1/[m log2(|V|)] = µ − logD (2)/m ≥ (1 − ε)µ,

since [H(V) − C]/log2(|V|) = HD (V) − CD = µ and logD (2)/m ≤ εµ.
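For a feel of the converse bound, the following illustrative computation (the source and channel numbers are our own, not from the text) evaluates µ − logD(2)/m and confirms it stays above (1 − ε)µ once m ≥ logD(2)/(εµ):

```python
import math

# Illustrative numbers: a binary source with entropy rate 0.9 bits and a
# channel with capacity 0.5 bits, so D = |V| = 2 and
# mu = H_D(V) - C_D = 0.4 D-ary digits.
H_V, C, D = 0.9, 0.5, 2
mu = H_V - C                           # already in D-ary digits since D = 2
eps = 0.1
m_min = math.log(2, D) / (eps * mu)    # threshold log_D(2)/(eps*mu) = 25

# For every m >= m_min, the bound mu - log_D(2)/m is at least (1 - eps)*mu.
for m in (int(m_min) + 1, 100, 1000):
    lower = mu - math.log(2, D) / m
    assert lower >= (1 - eps) * mu - 1e-12
```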
4.6 Lossless joint source-channel coding I: 4-85
Theorem 4.32 (Lossless joint source-channel coding theorem for gen-
eral rate block codes) Under the same notation as in Theorem 4.30, the fol-
lowing hold:
• Forward part (achievability): For any 0 < ε < 1 and given that the source
is stationary ergodic, there exists a sequence of m-to-nm source-channel codes
{∼Cm,nm }∞m=1 such that

    Pe(∼Cm,nm ) < ε for sufficiently large m

if

    lim sup_{m→∞} m/nm < C/H(V).

• Converse part: For any 0 < ε < 1 and given that the source is stationary, any
sequence of m-to-nm source-channel codes {∼Cm,nm }∞m=1 with

    lim inf_{m→∞} m/nm > C/H(V)

satisfies

    Pe(∼Cm,nm ) > (1 − ε)µ for sufficiently large m,

for some positive constant µ that depends on lim inf_{m→∞} (m/nm), H(V) and C.
4.6 Lossless joint source-channel coding I: 4-86
Discussion: separate vs joint source-channel coding
• Shannon’s separation principle has provided the linchpin for most modern com-
munication systems where source coding and channel coding schemes are sep-
arately constructed (with the source (resp., channel) code designed by only
taking into account the source (resp., channel) characteristics) and applied in
tandem without the risk of sacrificing optimality in terms of reliable transmis-
sibility under unlimited coding delay and complexity.
• However, in practical implementations, there is a price to pay in delay and
complexity for extremely long coding blocklengths (particularly when delay and
complexity constraints are quite stringent such as in wireless communications
systems).
• Under finite coding blocklengths and/or complexity, many studies have demon-
strated that joint source-channel coding can provide better performance than
separate coding.
4.6 Lossless joint source-channel coding I: 4-87
• Even in the infinite blocklength regime where separate coding is optimal in
terms of reliable transmissibility, it can be shown that for a large class of sys-
tems, joint source-channel coding can achieve an error exponent that is as large
as double the error exponent resulting from separate coding. This indicates that
one can realize via joint source-channel coding the same performance as sepa-
rate coding while reducing the coding delay by half (this result translates into
notable power savings of more than 2 dB when sending binary sources over
channels with Gaussian noise, fading and output quantization).
Key Notes I: 4-88
• Definition of reliable transmission
• Discrete memoryless channels
• Data transmission code and its rate
• Joint typical set
• Shannon’s channel coding theorem and its converse theorem
• Fano’s inequality
• Calculation of the channel capacity
– Symmetric, weakly symmetric, quasi-symmetric and T -symmetric channels
– KKT condition
• Polar coding
• Joint source-channel coding theorem