Communication with Feedback via Posterior Matching

Ofer Shayevitz
Tel Aviv University, Dept. of EE-Systems
Tel Aviv 69978, Israel
[email protected]

Meir Feder
Tel Aviv University, Dept. of EE-Systems
Tel Aviv 69978, Israel
[email protected]
Abstract— In this paper we describe a general algorithmic
scheme for communication over any memoryless channel in the
presence of noiseless feedback. The scheme is based on the idea of
posterior matching, in which the information still missing at the
receiver is extracted from the a-posteriori density function, and
matched to any desirable input distribution. We analyze the error
probability attained by this scheme for additive noise channels,
and show that the well-known Schalkwijk-Kailath scheme for the
AWGN channel with average power constraint and the Horstein
scheme for the BSC can be derived as special cases.
I. INTRODUCTION
Feedback cannot increase the capacity of memoryless channels, but it can significantly improve the error probability
performance and, perhaps more importantly, it can drastically
simplify the transmission schemes required to achieve capacity. Whereas complex coding techniques strive to approach
capacity in the absence of feedback, the same goal can
sometimes be attained using noiseless feedback by simple
deterministic schemes that work "on the fly".
Probably the first elegant feedback scheme ever to be
presented is due to Horstein [1] for the Binary Symmetric
Channel (BSC) with feedback. In that paper, a message point
inside the unit interval was used to represent the data bits, and
was conveyed to the receiver by always indicating whether it
lies to the left or to the right of the receiver posterior’s median,
which is also known to the transmitter via feedback. That way,
the transmitter always answers the most informative question
that can be posed by the receiver based on the information
the latter has. Remarkably, this simple technique is enough to
attain the capacity of the BSC, and is easily adapted to any
Discrete Memoryless Channel (DMC) with feedback.
A few years later, two landmark papers by Schalkwijk and Kailath [2] and by Schalkwijk [3] presented an elegant capacity-achieving feedback scheme for the Additive White Gaussian
Noise (AWGN) channel with an average power constraint.
The Schalkwijk-Kailath scheme has a “parameter estimation”
spirit. At each time point, it finds the Minimum Mean Square
Error (MMSE) estimate of the message point at the receiver,
and transmits the MMSE error on the next channel use,
amplified to match the permissible input power constraint. This
scheme is strikingly simple and yet achieves capacity; in fact
at any rate below capacity it has an error probability decaying
double-exponentially with the block length, as opposed to the
single exponential attained by non-feedback schemes.
Since the emergence of the Horstein and the Schalkwijk-Kailath schemes, it was evident that those were similar in
some fundamental sense. Both schemes used the message point
representation, and both always attempted to tell the receiver
what it was still missing in order to “get it right”. However,
neither the precise correspondence between these schemes nor
a generalization to other cases has ever been established. In
this paper, we show that in fact there exists an underlying
principle connecting these two methods. We present a general
feedback transmission scheme for any memoryless channel
and any required input distribution. Our scheme is simple and
elegant, and manifests the idea of always transmitting what the
receiver is missing. In the special cases of a BSC with uniform
input distribution and an AWGN channel with a Gaussian input
distribution, our scheme reduces to those of Horstein and
Schalkwijk-Kailath respectively.
The paper is organized as follows. In section II we present
the new scheme. In section III we provide a preliminary
analysis of the scheme and its error performance. In section IV
we derive the Horstein and Schalkwijk schemes as a special
case, and in section V we provide an illustrative example by
applying our scheme to a simple setting. A discussion and
some future research items are provided in section VI.
II. THE NEW SCHEME
Consider a discrete-time memoryless channel with instantaneous noiseless feedback and a transition probability
law W = W(y|x), and let Q = Q(x) be any desirable
input distribution¹,² with finite expectation and variance. Let
θ_0 ∈ [0, 1) be the message point, whose binary expansion
represents an infinite bitstream to be reliably conveyed to the
receiver. The message point is selected according to a uniform
distribution over the unit interval. Denote the transmitted
signal at time n by x_n = g_n(θ_0, y_1^{n-1}), where y_n is the
corresponding channel output, and

    g_n : [0, 1) × ℝ^{n-1} ↦ ℝ

is a sequence of deterministic transmission functions known
at both terminals. At each time point, the receiver calculates the a-posteriori density function of the message point,
f_n(θ) = f_{θ|y_1^n}(θ | y_1^n), starting with f_0(θ) uniform over the
unit interval. Thanks to the noiseless feedback, the transmitter
can track f_n(θ) as well, and is assumed to do so. A proper
selection of transmission functions will hopefully result in a
fast concentration of the posterior f_n(θ) around θ_0, rapidly
reducing the uncertainty at the receiver.

¹ For instance, Q may be selected to be a capacity achieving distribution under some desirable input constraint.
² Our approach is valid for any channel/input distribution pair, but here we assume Q and W are Probability Density Functions (PDFs).
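As a concrete illustration of the receiver-side tracking, the Bayes update of f_n(θ) can be carried out on a discretization grid. This is a minimal sketch of ours, not part of the scheme's specification; the function names and the vectorized channel density W(y, x) are assumptions of this sketch:

```python
import numpy as np

def posterior_update(grid, f_prev, g_fn, y, W):
    """One Bayes update of the receiver's posterior on a grid (a sketch).
    g_fn(theta) returns the input each candidate message point would have
    sent at this time (computable from past outputs); W(y, x) is the
    channel transition density, assumed vectorized in x."""
    likelihood = W(y, g_fn(grid))            # f(y_n | theta, y_1^{n-1})
    f_new = f_prev * likelihood
    f_new /= f_new.sum() * (grid[1] - grid[0])   # divide by f(y_n | y_1^{n-1})
    return f_new
```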
For communication at a fixed rate R, the decoding rule
we consider after n channel uses looks for an interval of
size 2^{−nR} whose a-posteriori probability is maximal³. Alternatively, one can set the bit error probability at a threshold,
and decode bits whenever their respective intervals accumulate
enough probability, as in [1]. This variable-rate approach
possesses an error exponent inherently superior to that of the
aforementioned fixed-rate approach; however, below we focus
on the latter as it is easier to analyze.

³ As in arithmetic coding, this interval may be positioned so that less than nR bits are decoded. Similarly, the number of bits not decoded is expected to be small and independent of n [4]. Alternatively, note that just a single extra bit is required to decode the rest, and it can be appended to the next block.
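Concretely, the fixed-rate rule above can be sketched as a sliding-window search over a discretized posterior (an illustration of ours; the grid discretization and names are hypothetical):

```python
import numpy as np

def decode_fixed_rate(grid, f_n, R, n):
    """Pick the interval of length 2^(-nR) with maximal posterior
    probability, via a sliding-window sum of per-bin masses."""
    dx = grid[1] - grid[0]
    w = max(1, int(round(2.0 ** (-n * R) / dx)))    # interval width in bins
    mass = np.convolve(f_n * dx, np.ones(w), mode="valid")
    i = int(np.argmax(mass))
    return grid[i], grid[i] + w * dx                # decoded interval
```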
What is the best selection of the functions g_n? We argue the
following: since f_n(θ) describes the receiver's knowledge (or
lack thereof) regarding θ_0 at time n given y_1^n, it is reasonable to
zoom in on θ_0 by somehow "stretching" the posterior into the
desired input distribution, thereby describing to the receiver
in greater detail what it is still missing. Therefore, we suggest
using

    g_{n+1}(θ_0, y_1^n) = F_Q^{-1} ∘ F_{θ|y_1^n}(θ_0 | y_1^n)    (1)

where F_{θ|y_1^n} is the Cumulative Distribution Function (CDF)
corresponding to the posterior f_n(θ), and F_Q is the CDF of the
desired input distribution Q. It is easy to see that F_{θ|y_1^n}(θ_0 | y_1^n),
viewed as a random variable, is uniform over the unit interval
given any value of y_1^n, and is therefore independent of y_1^n.
Hence, g_{n+1} is Q-distributed and independent of y_1^n as well.
Notice that the inputs are essentially produced in two steps.
In the first step, the information regarding θ_0 still missing
at the receiver is "extracted", by deterministically generating
a random variable that is independent of previous observations
and that, together with those observations, uniquely determines θ_0. In
the second step, the distribution of that random variable is
"matched" to the channel, by transforming it into the
desired input distribution Q.
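A minimal numerical sketch of these two steps, under assumptions of ours (a grid discretization, an arbitrary peaked "posterior", and Q = N(0, 1) as the target input distribution):

```python
import numpy as np
from scipy.stats import norm

grid = np.linspace(0.0, 1.0, 1000, endpoint=False)
dx = grid[1] - grid[0]
f_n = np.exp(-200.0 * (grid - 0.3) ** 2)   # some current posterior (illustrative)
f_n /= f_n.sum() * dx                      # normalize to a density
theta0 = 0.3                               # message point

# Step 1 ("extract"): evaluate the posterior CDF at theta0. Viewed as a
# random variable, this is uniform on [0, 1) and independent of past outputs.
u = np.interp(theta0, grid, np.cumsum(f_n) * dx)

# Step 2 ("match"): push through F_Q^{-1} so the next input is Q-distributed.
x_next = norm.ppf(u)                       # here F_Q is the N(0, 1) CDF
```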
This strategy admits a simpler recursive form. Define the
inverse channel for W with input distribution Q as

    V(x|y) = Q(x)W(y|x) / Σ_x Q(x)W(y|x)

and let F_{X|Y}(x|y) be the CDF of V(x|y) for a fixed value y,
namely

    F_{X|Y}(x|y) = ∫_{−∞}^x V(ξ|y) dξ

Define also

    S(x, y) ≜ F_Q^{-1} ∘ F_{X|Y}(x|y).
Lemma 1: The transmission functions (1) are also given by
the recursive formula

    g_1(θ_0) = F_Q^{-1}(θ_0)
    g_{n+1}(θ_0, y_1^n) = S( g_n(θ_0, y_1^{n-1}), y_n )    (2)

Proof: This Lemma can be proved directly by taking
derivatives, as we shall verify in the sequel. Here we provide
a more illuminating proof by induction, showing that (1) and
(2) are the same as functions for any n. This is immediately
true for n = 1. Assume now it is true for n = k. As
shown above, g_k is independent of y_1^{k-1}. Since the channel is
memoryless, (g_k, y_k) are also independent of y_1^{k-1}. Therefore,
F_{X|Y}(g_k|y_k) is uniform given y_1^k, and applying F_Q^{-1} transforms its distribution into Q. Thus, g_{k+1} = S(g_k, y_k) is Q-distributed given y_1^k, and is obtained from θ_0 by a composition
of monotonic functions, which is itself monotonic in θ_0.
The same is true for g_{k+1} obtained from (1). By the uniqueness
of a monotonic transformation between distributions, g_{k+1}
generated by either (1) or (2) is identical for any y_1^k, and the
proof is concluded.
The recursive form (2) provides a simple way of implementing our transmission scheme: the next channel input is
given by a deterministic function of the previous input and
previous output only, i.e., x_{n+1} = S(x_n, y_n).
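In code, the transmitter then amounts to the following loop (a generic sketch; channel, S and F_Q_inv are caller-supplied callables, and no particular channel is assumed):

```python
def transmit(theta0, n_uses, channel, S, F_Q_inv):
    """Posterior-matching transmitter via the recursion x_{n+1} = S(x_n, y_n)."""
    x = F_Q_inv(theta0)            # g_1(theta_0) = F_Q^{-1}(theta_0)
    ys = []
    for _ in range(n_uses):
        y = channel(x)             # one channel use; y is known via feedback
        ys.append(y)
        x = S(x, y)                # next input from previous input and output
    return ys
```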
III. ANALYSIS
In this section we derive some basic properties of the
suggested scheme. We shall focus henceforth on the case of
an additive noise channel, but essentially the same results are
expected to be valid in a general memoryless setting, under
some regularity conditions. Both the noise and the input are
assumed to have bounded first and second moments, and the
input distribution is assumed to satisfy Q(x) < Q_max. We
denote the noise sequence by z_k and its PDF by f_Z(·). The
mutual information of the channel W with input distribution
Q is denoted by I = I(Q, W). The dependence of f_n(θ) and
g_{n+1}(θ) on y_1^n is usually omitted for notational clarity.
Lemma 2: The posterior evaluated at the message point has
the following asymptotic behavior:

    lim_{n→∞} (1/n) log f_n(θ_0) = I(Q, W)  with probability 1    (3)

Proof: Applying Bayes' law, it is easily verified that the
posterior satisfies the following recursion rule:

    f_n(θ) = [ f(y_n | θ, y_1^{n-1}) / f(y_n | y_1^{n-1}) ] f_{n-1}(θ)

Applying the recursion rule n times and taking a logarithm,
we get

    (1/n) log f_n(θ) = (1/n) Σ_{k=1}^n log W(y_k | g_k(θ, y_1^{k-1})) − (1/n) Σ_{k=1}^n log f(y_k | y_1^{k-1})

Evaluating the above at the message point, we use the fact that
g_k, y_k are independent of y_1^{k-1} and the noise is additive, and
apply the law of large numbers (LLN) to the i.i.d. sequences:

    (1/n) log f_n(θ_0) = (1/n) Σ_{k=1}^n log f_Z(z_k) − (1/n) Σ_{k=1}^n log f_Y(y_k)
                       → −H(Z) + H(Y) = I(Q, W)  as n → ∞

with probability 1, as required.
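As a quick sanity check of Lemma 2, one can simulate its BSC instance (worked out in Example 2 below), where the posterior at the message point gains a factor of 2(1−p) per correct reception and 2p per crossover. A minimal Monte-Carlo sketch, with illustrative values of p and n:

```python
import numpy as np

# Check that (1/n) log2 f_n(theta0) approaches 1 - h_b(p) = C for the BSC.
p, n = 0.1, 100_000
rng = np.random.default_rng(0)
crossover = rng.random(n) < p
log_f = np.where(crossover, np.log2(2 * p), np.log2(2 * (1 - p))).sum()
C = 1 + p * np.log2(p) + (1 - p) * np.log2(1 - p)
print(log_f / n, C)    # the two values should nearly coincide
```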
Lemma 3: The derivative of the transmission function, evaluated at the message point, has the following asymptotic
behavior:

    lim_{n→∞} (1/n) log ( ∂g_n(θ, y_1^{n-1})/∂θ |_{θ=θ_0} ) ≥ I(Q, W)  with probability 1    (4)

Proof: From (1) we easily find that

    ∫_0^θ f_{n-1}(θ′) dθ′ = F_Q(g_n(θ))

which results in

    ∂g_n(θ)/∂θ = f_{n-1}(θ) / Q(g_n(θ))    (5)

This can also be obtained from (2) by noticing that

    S_1(x, y) ≜ ∂S(x, y)/∂x = f_{X|Y}(x|y) / Q(S(x, y))

and then applying the chain rule for derivatives:

    ∂g_n(θ, y_1^{n-1})/∂θ = (1/Q(g_1(θ))) ∏_{k=1}^{n-1} S_1(g_k(θ), y_k)
                          = (1/Q(g_n(θ))) ∏_{k=1}^{n-1} f_{X|Y}(g_k(θ)|y_k) / Q(g_k(θ))
                          = f_{n-1}(θ) / Q(g_n(θ))

verifying Lemma 1 again. We now immediately have

    (1/n) log (∂g_n(θ)/∂θ) = (1/n) log f_{n-1}(θ) − (1/n) log Q(g_n(θ))

and using Lemma 2 together with the assumption Q < Q_max, we get the desired result.

The properties described above provide a good idea regarding the behavior of the posterior. Loosely speaking, the
posterior has a peak of 2^{nI} at the message point, and since the
derivative of g_n(θ) at that point is at least 2^{nI}, the trajectory⁴
of points that lie 2^{−n(I+ε)}-close to θ_0 is attracted to that of θ_0;
hence for such points we expect f_n(θ) ≈ 2^{nI}. In contrast,
the trajectory of points that lie 2^{−n(I−ε)}-far from θ_0 diverges
from that of θ_0, towards the boundaries of support(Q). We
therefore expect a probability mass approaching one to be
concentrated in a 2^{−nR} vicinity of the message point for any
R < I, which translates to reliable communication at any
rate below the mutual information.

⁴ The trajectory of a point θ is the sequence of values obtained by applying g_k(θ, y_1^{k-1}) with increasing k. When calculating the a-posteriori density, the receiver in fact tracks the trajectories of all possible message points.

The following Lemma provides a useful expression for the
error probability of our scheme, which is applied to the AWGN
channel in the next section.

Lemma 4: For any rate R, our scheme attains an error
probability upper bounded by

    P_e ≤ 1 − E( F_Q(g_{n+1}(θ_0 + ∆θ)) − F_Q(g_{n+1}(θ_0 − ∆θ)) )    (6)

where ∆θ = 2^{−(nR+1)}.

Proof: From (1) we easily find again that the posterior's
integral is given by

    ∫_{θ_1}^{θ_2} f_n(θ) dθ = F_Q(g_{n+1}(θ_2)) − F_Q(g_{n+1}(θ_1))    (7)

We therefore have the following expression for the error
probability given y_1^n:

    P_e(y_1^n) = 1 − sup_{θ_1} ∫_{θ_1}^{θ_1 + 2^{−nR}} f_n(θ) dθ
               = 1 − sup_{θ_1} ( F_Q(g_{n+1}(θ_1 + 2∆θ)) − F_Q(g_{n+1}(θ_1)) )
               ≤ 1 − F_Q(g_{n+1}(θ_0 + ∆θ)) + F_Q(g_{n+1}(θ_0 − ∆θ))

and the proof is concluded by taking the expectation on both
sides to get the average error probability.

Lemma 4 demonstrates that the error probability is determined
by two factors: the input CDF's tail behavior, and the sensitivity of the transmission functions to a 2^{−nR} perturbation
in the assumed position of the message point, namely how
fast the resulting trajectory diverges towards the
boundaries of support(Q).

Corollary 1: Assume sup S_1(x, y) < ∞, so that the divergence of the trajectory is at most exponential. If support(Q) =
ℝ and fixed-rate block decoding is used, then a necessary
condition for a doubly-exponential error probability is that Q
has an exponentially decaying tail.
IV. SCHALKWIJK AND HORSTEIN REVISITED
Example 1 (The AWGN channel): We now provide a
sketch of the analysis for the AWGN setting, and show that
our scheme in this particular case is essentially the same
as the Schalkwijk-Kailath scheme [2], [3]. Assume the noise
is N(0, σ²), the average power constraint is P, and denote
SNR = P/σ². Set Q ∼ N(0, P) (capacity achieving) and let
φ_0 = F_Q^{-1}(θ_0), which is the message point converted into a
Gaussian distribution, and also the first channel input g_1(θ_0).
It is easily verified that (1) in this case is merely an
affine transformation that transforms the posterior into N(0, P);
hence the transmission functions are given by

    g_n(θ_0, y_1^{n-1}) = (1 + SNR)^{n/2} ( φ_0 − E(φ_0 | y_1^{n-1}) )

Observe that in this case g_n(θ_0) is just the estimation error of
an MMSE estimator for φ_0 (which represents θ_0) given the
observations, amplified to match the permissible input power.
The recursive representation (2) in this case is simply

    S(x, y) = √(1 + SNR) ( x − (SNR/(1 + SNR)) y )

which is exactly the transmission strategy of the Schalkwijk-Kailath scheme [3]. We now find an explicit expression upper
bounding the error probability. Taking the derivative of the
transmission function, we get

    ∂g_n(θ)/∂θ = (1 + SNR)^{n/2} / Q(F_Q^{-1}(θ)) ≥ √(2πP) (1 + SNR)^{n/2}

and so

    g_n(θ_0 + 2^{−nR}) ≥ g_n(θ_0) + ∫_{θ_0}^{θ_0 + 2^{−nR}} √(2πP) (1 + SNR)^{n/2} dθ
                       = g_n(θ_0) + √(2πP) · 2^{n(C−R)}

where C = (1/2) log(1 + SNR) is the Gaussian channel capacity. Similarly,

    g_n(θ_0 − 2^{−nR}) ≤ g_n(θ_0) − √(2πP) · 2^{n(C−R)}
Applying Lemma 4, we bound each of the terms in (6)
separately, using the fact that g_n(θ_0) is Gaussian. Denoting
a_n = √(2πP) · 2^{n(C−R)}, we have

    E F_Q(g_n + a_n) ≥ P(g_n > −a_n/2) · E( F_Q(g_n + a_n) | g_n > −a_n/2 ) ≥ F_Q²(a_n/2)

    E F_Q(g_n − a_n) ≤ E( F_Q(g_n − a_n) | g_n < a_n/2 ) + P(g_n > a_n/2) ≤ 2(1 − F_Q(a_n/2))

Putting the terms together, we get asymptotically

    P_e ≤ 1 − F_Q²(a_n/2) + 2(1 − F_Q(a_n/2)) ≈ 4(1 − F_Q(a_n/2))
        = 4( 1 − F_Q( (1/2)√(2πP) · 2^{n(C−R)} ) ) ≈ 2 exp( −(π/4) 2^{2n(C−R)} )

where we have used the exponential approximation of the
Gaussian CDF. We thus get the same double-exponential decay
as in the Schalkwijk-Kailath scheme,

    lim_{n→∞} (1/n) log log (1/P_e) ≥ 2(C − R)    (8)

via a slightly different analysis.

The difference between our general scheme and the "estimation error" approach of the Schalkwijk-Kailath scheme in
a non-Gaussian setting should now be evident. For general
additive noise, the Schalkwijk-Kailath scheme transmits the
linear MMSE estimation error given past observations, which
is uncorrelated with those observations but, except in the
Gaussian case, not independent of them as in our scheme.
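For illustration, the AWGN instance can be simulated directly from the recursion above. This is a sketch with illustrative parameters (the name sk_transmit and the defaults are ours):

```python
import numpy as np
from scipy.stats import norm

def sk_transmit(theta0, snr=4.0, P=1.0, n=25, seed=0):
    """AWGN instance of the scheme: with Q = N(0, P), the kernel S reduces
    to the Schalkwijk-Kailath recursion and each input stays N(0, P)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(P / snr)                   # noise std, from SNR = P / sigma^2
    x = norm.ppf(theta0, scale=np.sqrt(P))     # g_1 = F_Q^{-1}(theta_0)
    for _ in range(n):
        y = x + rng.normal(0.0, sigma)         # AWGN channel use
        x = np.sqrt(1 + snr) * (x - snr / (1 + snr) * y)   # S(x, y)
    return x
```

One can verify numerically that each input indeed remains N(0, P)-distributed and independent of past outputs, as claimed.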
Example 2 (The BSC): We now consider the BSC
setting with crossover probability p, and show that our scheme
in this case is essentially the same as the Horstein scheme [1].
The discussion is easily adjusted to any DMC with feedback.
According to our approach, the channel's input should be
independent of previous outputs and distributed ∼ Ber(1/2)
(capacity achieving). To that end, the function g_n can be an
indicator function of any subset with a-posteriori probability
equal to 1/2. One possibility is

    g_n(θ_0, y_1^{n-1}) = { 0,  θ_0 < median{f_{n-1}(θ)};  1,  otherwise }    (9)

which is precisely the Horstein scheme. Applying (1) results in
(9) as well, since F_Q^{-1} corresponds to a selection of a median
subset. Note that unlike the continuous alphabet case, there is
an inherent loss of information in the "matching" step here,
since the posterior is converted into a discrete distribution.

The posterior in this case is built by multiplying each side of
the median by either 2p or 2(1 − p) according to the received
bit, and since the message point always lies on the correct side
of the median, we get

    f_n(θ_0) = 2^n p^{n_1} (1 − p)^{n − n_1}

where n_1 ≈ np is the number of crossovers that occurred
during transmission. This immediately results in

    (1/n) log f_n(θ_0) → 1 − h_b(p) = C  with probability 1, as n → ∞

as expected. Notice that the posterior is quasi-constant over
at most n + 1 disjoint intervals, therefore the size of the
interval containing the message point is no larger than 2^{−nC}.
These observations have been utilized before [5] for variable-rate universal communications when the noise is an individual
sequence. Due to the discrete nature of the setting, the error
probability analysis differs from that described herein (for
instance, Lemma 3 naturally does not apply) and is left out.
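A grid-based sketch of one Horstein step follows; the discretization is our own device (the true posterior is piecewise constant, which a grid merely approximates):

```python
import numpy as np

def horstein_step(grid, f, theta0, p, rng):
    """One use of the BSC instance (9). Transmit 1 iff theta0 lies above the
    posterior median; on output b, the receiver multiplies the side indicated
    by b by 2(1-p) and the other side by 2p."""
    dx = grid[1] - grid[0]
    median = grid[np.searchsorted(np.cumsum(f) * dx, 0.5)]
    bit = int(theta0 >= median)                    # channel input
    out = bit ^ int(rng.random() < p)              # BSC output, crossover w.p. p
    indicated = (grid >= median) == bool(out)      # side the output points to
    f = np.where(indicated, 2 * (1 - p) * f, 2 * p * f)
    return f / (f.sum() * dx)                      # renormalize (discretization)
```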
V. UNIFORM NOISE EXAMPLE

Our suggested method generalizes previously proposed
feedback schemes, and to demonstrate its application in cases
not handled before, we provide a simple illustrative example of
a uniform noise channel with a uniform input distribution. We
shall see that the resulting transmission strategy in this case
turns out to be a very intuitive one, which vividly demonstrates
the zoom-in effect mentioned earlier.

Example 3 (Uniform noise with uniform input distribution):
Consider a memoryless additive noise channel with U(0, 1)
noise, and say we choose an input distribution which is also
U(0, 1). What is our transmission strategy in this simple
case? It is easy to verify that the inverse channel V(x|y) is

    V(x|y) ∼ U(0, y)       for y ≤ 1
    V(x|y) ∼ U(y − 1, 1)   for y > 1

Since the input distribution was set to be U(0, 1), the function
S(x, y) is merely the CDF of V(x|y), and is given by

    S(x, y) = Λ( x/y )                    for y ≤ 1
    S(x, y) = Λ( (x − y + 1)/(2 − y) )    for y > 1

where Λ(x) = min(max(x, 0), 1). This means that our transmission strategy in this case is very simple. We start by
transmitting g_1 = θ_0. Then, given y_1, we find the range
of inputs that could have generated it, and apply to g_1 a
transformation that linearly stretches this range to fill the entire
unit interval, which provides us with g_2 to transmit. We then
find the range of possible inputs given y_2, apply the
corresponding linear transformation to g_2, and so on. This is
intuitively appealing, since what we do in each iteration is just
zoom in on the remaining uncertainty region for θ_0. Since the
posterior is always uniform, this zooming-in is linear.
This transmission strategy results in a posterior which is uniform over an ever-shrinking sequence of intervals. Consequently,
in this case it is easier to look at a variable-rate decoding rule,
by simply decoding the current interval (a_n, b_n) within which
the posterior is uniform. The size of that interval is
    |b_n − a_n| = ∏_{k∈A} y_k · ∏_{k∉A} (2 − y_k)

where A = {k : y_k < 1}. This is a zero-error decoding rule
which results in a variable rate that converges to
    R = −(1/n) log |b_n − a_n|
      = (1/n) Σ_{k∈A} log(1/y_k) + (1/n) Σ_{k∉A} log(1/(2 − y_k))
      = (1/n) Σ_{k=1}^n log( f_{X|Y}(x_k|y_k) / f_X(x_k) )
      → I  with probability 1, as n → ∞
where I is the corresponding mutual information. Note that in
this example, every channel output actually produces a number
of bits corresponding to its individual mutual information.
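The whole of Example 3 fits in a few lines. The following sketch (ours) tracks (a_n, b_n) and reports the empirical rate, which should approach I ≈ 0.72 bits per channel use:

```python
import numpy as np

def uniform_pm(theta0, n=50, seed=1):
    """Uniform-noise, uniform-input posterior matching, end to end.
    The interval (a_n, b_n) always brackets theta0, so decoding it is
    zero-error; -log2|b_n - a_n| / n is the achieved variable rate."""
    rng = np.random.default_rng(seed)
    a, b = 0.0, 1.0
    x = theta0                            # g_1 = theta_0 since F_Q is identity
    for _ in range(n):
        y = x + rng.uniform(0.0, 1.0)     # one channel use
        lo, hi = (0.0, y) if y <= 1.0 else (y - 1.0, 1.0)  # feasible inputs
        a, b = a + (b - a) * lo, a + (b - a) * hi          # zoom in on theta_0
        x = (x - lo) / (hi - lo)          # S(x, y): linear stretch to [0, 1)
    return (a, b), -np.log2(b - a) / n    # decoded interval and empirical rate
```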
VI. DISCUSSION
A sequential communications strategy for memoryless channels with feedback was described, providing in particular a
unified view of the known Horstein and Schalkwijk-Kailath
schemes. The core of the strategy lies in the constantly refined
representation of the message point’s position relative to the
uncertainty at the receiver. This is accomplished by evaluating
the receiver’s a-posteriori cumulative distribution function at
the message point, followed by a technical step of matching
this quantity to the channel via an appropriate transformation.
A preliminary analysis for additive noise channels was provided. The proposed scheme is expected to attain the capacity
of general memoryless channels under suitable regularity
conditions, an issue which is currently under investigation.
A known drawback of the Schalkwijk-Kailath scheme is that
its peak power may become arbitrarily large. This problem
was treated in [6] by ceasing transmission and declaring an
error whenever the time-averaged power exceeded some given
threshold, at the cost of losing the doubly-exponential error
probability. However, our scheme allows for a much simpler
solution, since the input distribution can be set (and optimized)
to obey any required single letter peak constraint.
An interesting research direction could be the treatment of
channels with memory within the same framework, possibly
by modifying the channel matching step to depend on previous
outputs. Another direction to be explored is the possible use
of our method for universal communications with feedback. In
a stochastic universal setting, the transmitter can estimate the
channel with increasing accuracy, and match the transmission
strategy accordingly. Although the receiver does not know the
channel, it seems plausible that for a “not too rich” family of
channels, the calculated posterior will have a significant peak
only when “close enough” to the true channel, and will be
flat otherwise. Furthermore, it should be examined whether
the same method can be used in an individual noise setting as
well, employing randomization techniques in the spirit of [5].
VII. ACKNOWLEDGMENT
We gratefully acknowledge the useful comments made by an
anonymous reviewer, which greatly improved the presentation.
REFERENCES

[1] M. Horstein, "Sequential transmission using noiseless feedback," IEEE Trans. Info. Theory, pp. 136–143, July 1963.
[2] J. P. M. Schalkwijk and T. Kailath, "A coding scheme for additive noise channels with feedback, Part I: No bandwidth constraint," IEEE Trans. Info. Theory, vol. IT-12, pp. 172–182, 1966.
[3] J. P. M. Schalkwijk, "A coding scheme for additive noise channels with feedback, Part II: Band-limited signals," IEEE Trans. Info. Theory, vol. IT-12, pp. 183–189, 1966.
[4] O. Shayevitz, R. Zamir, and M. Feder, "Bounded expected delay in arithmetic coding," in Proc. of the International Symposium on Information Theory, 2006.
[5] O. Shayevitz and M. Feder, "Achieving the empirical capacity using feedback - Part I: Memoryless additive models," submitted to the IEEE Trans. Info. Theory. Available online at: http://www.eng.tau.ac.il/~ofersha/empirical_capacity_part1.pdf
[6] A. D. Wyner, "On the Schalkwijk-Kailath coding scheme with a peak energy constraint," IEEE Trans. Info. Theory, vol. IT-14, no. 1, pp. 129–134, Jan. 1968.