
Communication with Feedback via Posterior Matching

2007 IEEE International Symposium on Information Theory

Communication with Feedback via Posterior Matching

Ofer Shayevitz
Tel Aviv University, Dept. of EE-Systems
Tel Aviv 69978, Israel
[email protected]

Meir Feder
Tel Aviv University, Dept. of EE-Systems
Tel Aviv 69978, Israel
[email protected]

Abstract— In this paper we describe a general algorithmic scheme for communication over any memoryless channel in the presence of noiseless feedback. The scheme is based on the idea of posterior matching, in which the information still missing at the receiver is extracted from the a-posteriori density function and matched to any desirable input distribution. We analyze the error probability attained by this scheme for additive noise channels, and show that the well-known Schalkwijk-Kailath scheme for the AWGN channel with an average power constraint and the Horstein scheme for the BSC can be derived as special cases.

I. INTRODUCTION

Feedback cannot increase the capacity of memoryless channels, but it can significantly improve the error probability performance and, perhaps more importantly, it can drastically simplify the transmission schemes required to achieve capacity. Whereas complex coding techniques strive to approach capacity in the absence of feedback, the same goal can sometimes be attained using noiseless feedback by simple deterministic schemes that work "on the fly".

Probably the first elegant feedback scheme ever to be presented is due to Horstein [1], for the Binary Symmetric Channel (BSC) with feedback. In that paper, a message point inside the unit interval was used to represent the data bits, and was conveyed to the receiver by always indicating whether it lies to the left or to the right of the median of the receiver's posterior, which is also known to the transmitter via feedback. That way, the transmitter always answers the most informative question that can be posed by the receiver based on the information the latter has. Remarkably, this simple technique is enough to attain the capacity of the BSC, and is easily adapted to any Discrete Memoryless Channel (DMC) with feedback.

A few years later, two landmark papers by Schalkwijk-Kailath [2] and Schalkwijk [3] presented an elegant capacity-achieving feedback scheme for the Additive White Gaussian Noise (AWGN) channel with an average power constraint. The Schalkwijk-Kailath scheme has a "parameter estimation" spirit. At each time point, it finds the Minimum Mean Square Error (MMSE) estimate of the message point at the receiver, and transmits the MMSE error on the next channel use, amplified to match the permissible input power constraint. This scheme is strikingly simple and yet achieves capacity; in fact, at any rate below capacity it attains an error probability decaying double-exponentially with the block length, as opposed to the single exponential attained by non-feedback schemes.

Since the emergence of the Horstein and the Schalkwijk-Kailath schemes, it was evident that they were similar in some fundamental sense. Both schemes used the message point representation, and both always attempted to tell the receiver what it was still missing in order to "get it right". However, neither the precise correspondence between these schemes nor a generalization to other cases has ever been established. In this paper, we show that there exists in fact an underlying principle connecting these two methods. We present a general feedback transmission scheme for any memoryless channel and any required input distribution.
Our scheme is simple and elegant, and manifests the idea of always transmitting what the receiver is missing. In the special cases of a BSC with a uniform input distribution and an AWGN channel with a Gaussian input distribution, our scheme reduces to those of Horstein and Schalkwijk-Kailath respectively.

The paper is organized as follows. In Section II we present the new scheme. In Section III we provide a preliminary analysis of the scheme and its error performance. In Section IV we derive the Horstein and Schalkwijk schemes as special cases, and in Section V we provide an illustrative example by applying our scheme to a simple setting. A discussion and some future research items are provided in Section VI.

II. THE NEW SCHEME

Consider a discrete-time memoryless channel with instantaneous noiseless feedback and a transition probability law $W = W(y|x)$, and let $Q = Q(x)$ be any desirable input distribution with finite mean and variance. (For instance, $Q$ may be selected to be a capacity-achieving distribution under some desirable input constraint. Our approach is valid for any channel/input distribution pair, but here we assume $Q, W$ are Probability Density Functions (PDFs).) Let $\theta_0 \in [0,1)$ be the message point whose binary expansion represents an infinite bitstream to be reliably conveyed to the receiver. The message point is selected according to a uniform distribution over the unit interval. Denote the transmitted signal at time $n$ by $x_n = g_n(\theta_0, y_1^{n-1})$, where $y_n$ is the corresponding channel output and $g_n : [0,1) \times \mathbb{R}^{n-1} \mapsto \mathbb{R}$ is a sequence of deterministic transmission functions known at both terminals.

At each time point, the receiver calculates the a-posteriori density function of the message point, $f_n(\theta) = f_{\theta|y_1^n}(\theta \mid y_1^n)$, starting with $f_0(\theta)$ uniform over the unit interval. Thanks to the noiseless feedback, the transmitter can track $f_n(\theta)$ as well, and is assumed to do so. A proper selection of transmission functions will hopefully result in a fast concentration of the posterior $f_n(\theta)$ around $\theta_0$, rapidly reducing the uncertainty at the receiver. For communications at a fixed rate $R$, the decoding rule we consider after $n$ channel uses looks for an interval of size $2^{-nR}$ whose a-posteriori probability is maximal. (As in arithmetic coding, this interval may be positioned so that fewer than $nR$ bits are decoded; the number of bits not decoded is expected to be small and independent of $n$ [4]. Alternatively, note that just a single extra bit is required to decode the rest, and it can be appended to the next block.) Alternatively, one can set the bit error probability at a threshold, and decode bits whenever their respective intervals accumulate enough probability, as in [1]. This variable-rate approach possesses an error exponent inherently superior to that of the aforementioned fixed-rate approach; however, below we focus on the latter as it is easier to analyze.

What is the best selection of the functions $g_n$? We argue the following: since $f_n(\theta)$ describes the receiver's knowledge (or lack of it) regarding $\theta_0$ at time $n$ given $y_1^n$, it is reasonable to zoom in on $\theta_0$ by somehow "stretching" the posterior into the desired input distribution, and hence describe to the receiver in greater detail what it is still missing. Therefore, we suggest using

$$g_{n+1}(\theta_0, y_1^n) = F_Q^{-1} \circ F_{\theta|y_1^n}(\theta_0 \mid y_1^n) \qquad (1)$$

where $F_{\theta|y_1^n}$ is the Cumulative Distribution Function (CDF) corresponding to the posterior $f_n(\theta)$, and $F_Q$ is the CDF of the desired input distribution $Q$. It is easy to see that $F_{\theta|y_1^n}(\theta_0|y_1^n)$, viewed as a random variable, is uniform over the unit interval given any value of $y_1^n$, and is therefore independent of $y_1^n$. Hence, $g_{n+1}$ is $Q$-distributed and independent of $y_1^n$ as well. Notice that the inputs are essentially produced in two steps.
In the first step, the information regarding $\theta_0$ still missing at the receiver is "extracted", by deterministically generating a random variable independent of previous observations which, together with those observations, uniquely determines $\theta_0$. In the second step, the distribution of that random variable is "matched" to the channel by transforming it into the desired input distribution $Q$.

This strategy admits a simpler recursive form. Define the inverse channel for $W$ with input distribution $Q$ to be

$$V(x|y) = \frac{Q(x)W(y|x)}{\sum_x Q(x)W(y|x)}$$

and let $F_{X|Y}(x|y)$ be the CDF of $V(x|y)$ for a fixed value $y$, namely

$$F_{X|Y}(x|y) = \int_{-\infty}^{x} V(\xi|y)\,d\xi$$

Define also

$$S(x,y) \triangleq F_Q^{-1} \circ F_{X|Y}(x|y)$$

Lemma 1: The transmission functions (1) are also given by the recursive formula

$$g_1(\theta_0) = F_Q^{-1}(\theta_0), \qquad g_{n+1}(\theta_0, y_1^n) = S\big(g_n(\theta_0, y_1^{n-1}),\, y_n\big) \qquad (2)$$

Proof: This lemma can be proved directly by taking derivatives, as we verify in the sequel. Here we provide a more illuminating proof by induction, showing that (1) and (2) define the same function of $\theta_0$ for any $n$. This is immediately true for $n = 1$. Assume now it is true for $n = k$. As shown above, $g_k$ is independent of $y_1^{k-1}$. Since the channel is memoryless, $(g_k, y_k)$ are also independent of $y_1^{k-1}$. Therefore, $F_{X|Y}(g_k|y_k)$ is uniform given $y_1^k$, and applying $F_Q^{-1}$ transforms its distribution into $Q$. Thus, $g_{k+1} = S(g_k, y_k)$ is $Q$-distributed given $y_1^k$, and is obtained from $\theta_0$ by a composition of functions monotonic in $\theta_0$, which is itself monotonic in $\theta_0$. The same is true for $g_{k+1}$ obtained from (1). By the uniqueness of a monotonic transformation between distributions, $g_{k+1}$ generated by either (1) or (2) is identical for any $y_1^k$, and the proof is concluded.

The recursive form (2) provides a simple way of implementing our transmission scheme: the next channel input is given by a deterministic function of the previous input and previous output only, i.e., $x_{n+1} = S(x_n, y_n)$.
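To make the two-step construction concrete, the following is a minimal grid-based simulation sketch of rule (1), instantiated for an additive Gaussian noise channel with a Gaussian input. The discretization, parameter values, horizon, and all variable names are our own illustrative choices, not part of the scheme's specification.

```python
import numpy as np
from scipy import stats

# Grid-based sketch of posterior matching, eq. (1):
#   x_{n+1} = F_Q^{-1}( F_{theta|y^n}(theta_0 | y_1^n) )
# illustrated for an AWGN channel with Gaussian input Q ~ N(0, P).
rng = np.random.default_rng(0)
P, sigma2 = 1.0, 0.5                     # assumed input power / noise variance
Q = stats.norm(0, np.sqrt(P))            # desired input distribution
noise_std = np.sqrt(sigma2)

grid = np.linspace(0, 1, 10_001)[1:-1]   # discretized unit interval for theta
log_post = np.zeros_like(grid)           # log f_n(theta); f_0 is uniform

theta0 = rng.uniform()                   # the message point
for n in range(30):                      # modest horizon, within grid resolution
    # Both terminals track the posterior CDF (the transmitter via feedback).
    post = np.exp(log_post - log_post.max())
    cdf = np.cumsum(post)
    cdf /= cdf[-1]
    # Step 1 ("extract"): F_{theta|y^n}(theta_0) is uniform given y_1^n.
    u = np.interp(theta0, grid, cdf)
    # Step 2 ("match"): transform it into the input distribution Q.
    x = Q.ppf(np.clip(u, 1e-12, 1 - 1e-12))
    y = x + rng.normal(0, noise_std)     # channel use
    # Receiver Bayes update: candidate theta would have sent F_Q^{-1}(F_n(theta)).
    x_cand = Q.ppf(np.clip(cdf, 1e-12, 1 - 1e-12))
    log_post += stats.norm.logpdf(y, loc=x_cand, scale=noise_std)

est = grid[np.argmax(log_post)]
print(f"theta0 = {theta0:.6f}, posterior peak at {est:.6f}")
```

On a typical run the posterior peak settles on the message point; the rate at which it concentrates is quantified in the analysis below.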
III. ANALYSIS

In this section we derive some basic properties of the suggested scheme. We shall focus henceforth on the case of an additive noise channel, but essentially the same results are expected to be valid in a general memoryless setting, under some regularity conditions. Both the noise and the input are assumed to have bounded first and second moments, and the input distribution is assumed to satisfy $Q(x) < Q_{\max}$. We denote the noise sequence by $z_k$ and its PDF by $f_Z(\cdot)$. The mutual information of the channel $W$ with input distribution $Q$ is denoted by $I = I(Q,W)$. The dependence of $f_n(\theta)$ and $g_{n+1}(\theta)$ on $y_1^n$ is usually omitted for notational clarity.

Lemma 2: The posterior evaluated at the message point has the following asymptotic behavior:

$$\lim_{n\to\infty} \frac{1}{n}\log f_n(\theta_0) = I(Q,W) \quad \text{with probability } 1 \qquad (3)$$

Proof: Applying Bayes' law, it is easily verified that the posterior satisfies the recursion rule

$$f_n(\theta) = \frac{f(y_n \mid \theta, y_1^{n-1})}{f(y_n \mid y_1^{n-1})}\, f_{n-1}(\theta)$$

Applying the recursion rule $n$ times and taking a logarithm, we get

$$\frac{1}{n}\log f_n(\theta) = \frac{1}{n}\sum_{k=1}^{n} \log W\big(y_k \mid g_k(\theta, y_1^{k-1})\big) - \frac{1}{n}\sum_{k=1}^{n} \log f\big(y_k \mid y_1^{k-1}\big)$$

Evaluating the above at the message point, we use the fact that $g_k, y_k$ are independent of $y_1^{k-1}$ and the noise is additive, and apply the Law of Large Numbers (LLN) to the i.i.d. sequences:

$$\frac{1}{n}\log f_n(\theta_0) = \frac{1}{n}\sum_{k=1}^{n} \log f_Z(z_k) - \frac{1}{n}\sum_{k=1}^{n} \log f_Y(y_k) \;\xrightarrow{n\to\infty}\; -H(Z) + H(Y) = I(Q,W)$$

with probability 1, as required.

Lemma 3: The derivative of the transmission function evaluated at the message point has the following asymptotic behavior:

$$\lim_{n\to\infty} \frac{1}{n}\log \left.\frac{\partial g_n(\theta, y_1^{n-1})}{\partial\theta}\right|_{\theta=\theta_0} \geq I(Q,W) \quad \text{with probability } 1 \qquad (4)$$

Proof: From (1) we easily find that

$$\int_0^\theta f_{n-1}(\theta')\,d\theta' = F_Q(g_n(\theta))$$

which results in

$$\frac{\partial g_n(\theta)}{\partial\theta} = \frac{f_{n-1}(\theta)}{Q(g_n(\theta))} \qquad (5)$$

This can also be obtained from (2) by noticing that

$$S_1(x,y) \triangleq \frac{\partial S}{\partial x}(x,y) = \frac{f_{X|Y}(x|y)}{Q(S(x,y))}$$

and then applying the chain rule for derivatives:

$$\frac{\partial g_n(\theta, y_1^{n-1})}{\partial\theta} = \frac{1}{Q(g_1(\theta))}\prod_{k=1}^{n-1} S_1(g_k(\theta), y_k) = \frac{1}{Q(g_n(\theta))}\prod_{k=1}^{n-1} \frac{f_{X|Y}(g_k(\theta)|y_k)}{Q(g_k(\theta))} = \frac{f_{n-1}(\theta)}{Q(g_n(\theta))}$$

verifying Lemma 1 again. We now immediately have that

$$\frac{1}{n}\log\frac{\partial g_n(\theta)}{\partial\theta} = \frac{1}{n}\log f_{n-1}(\theta) - \frac{1}{n}\log Q(g_n(\theta))$$

and using Lemma 2 together with the assumption $Q < Q_{\max}$, we get the desired result.

The properties described above provide a good idea of the behavior of the posterior. Loosely speaking, the posterior has a peak of $2^{nI}$ at the message point, and since the derivative of $g_n(\theta)$ at that point is at least $2^{nI}$, the trajectory of points that lie $2^{-n(I+\varepsilon)}$ close to $\theta_0$ is attracted to that of $\theta_0$; hence for such points we expect $f_n(\theta) \approx 2^{nI}$. (The trajectory of a point $\theta$ is the sequence of values obtained by applying $g_k(\theta, y_1^{k-1})$ with increasing $k$; when calculating the a-posteriori density, the receiver in fact tracks the trajectories of all possible message points.) In contrast, the trajectory of points that lie $2^{-n(I-\varepsilon)}$ far from $\theta_0$ diverges from that of $\theta_0$, towards the boundaries of $\mathrm{support}(Q)$. We therefore expect a probability mass approaching one to be concentrated in a $2^{-nR}$ vicinity of the message point for any $R < I$, which translates into reliable communications at any rate below the mutual information.

The following lemma provides a useful expression for the error probability of our scheme, which is applied to the AWGN channel in the next section.

Lemma 4: For any rate $R$, our scheme attains an error probability upper bounded by

$$P_e \leq 1 - \mathbb{E}\Big(F_Q\big(g_{n+1}(\theta_0 + \Delta\theta)\big) - F_Q\big(g_{n+1}(\theta_0 - \Delta\theta)\big)\Big) \qquad (6)$$

where $\Delta\theta = 2^{-(nR+1)}$.

Proof: From (1) we easily find again that the posterior's integral is given by

$$\int_{\theta_1}^{\theta_2} f_n(\theta)\,d\theta = F_Q(g_{n+1}(\theta_2)) - F_Q(g_{n+1}(\theta_1)) \qquad (7)$$

We therefore have the following expression for the error probability given $y_1^n$:

$$P_e(y_1^n) = 1 - \sup_{\theta_1} \int_{\theta_1}^{\theta_1 + 2^{-nR}} f_n(\theta)\,d\theta = 1 - \sup_{\theta_1}\Big(F_Q(g_{n+1}(\theta_1 + 2\Delta\theta)) - F_Q(g_{n+1}(\theta_1))\Big) \leq 1 - F_Q(g_{n+1}(\theta_0 + \Delta\theta)) + F_Q(g_{n+1}(\theta_0 - \Delta\theta))$$

and the proof is concluded by taking the expectation of both sides to get the average error probability.

Lemma 4 demonstrates that the error probability is determined by two factors: the tail behavior of the input CDF, and the sensitivity of the transmission functions to a $2^{-nR}$ perturbation in the assumed position of the message point, namely how fast the trajectory consequently diverges towards the boundaries of $\mathrm{support}(Q)$.

Corollary 1: Assume $\sup S_1(x,y) < \infty$, so that the divergence of the trajectory is at most exponential. If $\mathrm{support}(Q) = \mathbb{R}$ and fixed-rate block decoding is used, then a necessary condition for a doubly-exponential error probability is that $Q$ has an exponentially decaying tail.
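Before turning to the examples, here is a quick numeric illustration of Lemma 2 in the Gaussian setting treated next. It relies on the standard fact (assumed here, and consistent with Example 1 below) that under the Schalkwijk-Kailath scheme the posterior of the Gaussian message representative after $n$ uses has variance on the order of $P(1+SNR)^{-n}$, so its peak grows like $2^{nC}$; the change of variable back to the unit interval only contributes a vanishing $O(1/n)$ term to the exponent. The snippet and its parameter values are ours.

```python
import math

# Lemma 2 sanity check in the Gaussian case: the posterior peak after n uses
# is (2*pi*P)^(-1/2) * (1+SNR)^(n/2), assuming a posterior variance of
# P*(1+SNR)^(-n), so (1/n)*log2(peak) -> C = 0.5*log2(1+SNR).
P, sigma2 = 1.0, 0.25          # assumed power and noise variance
snr = P / sigma2
C = 0.5 * math.log2(1 + snr)
for n in (10, 100, 1000):
    # work in the log domain to avoid underflow of (1+snr)**(-n)
    log2_peak = 0.5 * n * math.log2(1 + snr) - 0.5 * math.log2(2 * math.pi * P)
    print(f"n={n:5d}  (1/n) log2 f_n(theta0) ~ {log2_peak / n:.4f}  (C = {C:.4f})")
```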
IV. SCHALKWIJK AND HORSTEIN REVISITED

Example 1 (The AWGN channel): We now provide a sketch of the analysis for the AWGN setting, and show that our scheme in this particular case is essentially the same as the Schalkwijk-Kailath scheme [2][3]. Assume the noise is $\mathcal{N}(0,\sigma^2)$, the average power constraint is $P$, and denote $SNR = P/\sigma^2$. Set $Q \sim \mathcal{N}(0,P)$ (capacity achieving), and let $\varphi_0 = F_Q^{-1}(\theta_0)$, which is the message point converted into a Gaussian distribution, and also the first channel input $g_1(\theta_0)$. It is easily verified that (1) in this case is merely an affine transformation that transforms the posterior into $\mathcal{N}(0,P)$; hence the transmission functions are given by

$$g_n(\theta_0, y_1^{n-1}) = (1+SNR)^{\frac{n}{2}}\big(\varphi_0 - \mathbb{E}(\varphi_0 \mid y_1^{n-1})\big)$$

Observe that in this case $g_n(\theta_0)$ is just the estimation error of an MMSE estimator for $\varphi_0$ (which represents $\theta_0$) given the observations, amplified to match the permissible input power. The recursive representation (2) in this case is simply

$$S(x,y) = \sqrt{1+SNR}\left(x - \frac{SNR}{1+SNR}\,y\right)$$

which is exactly the transmission strategy of the Schalkwijk-Kailath scheme [3].

We now find an explicit expression upper bounding the error probability. Taking the derivative of the transmission function, we get

$$\frac{\partial g_n(\theta)}{\partial\theta} = \frac{(1+SNR)^{\frac{n}{2}}}{Q\big(F_Q^{-1}(\theta)\big)} \geq \sqrt{2\pi P}\,(1+SNR)^{\frac{n}{2}}$$

and so

$$g_n(\theta_0 + 2^{-nR}) \geq g_n(\theta_0) + \int_{\theta_0}^{\theta_0 + 2^{-nR}} \sqrt{2\pi P}\,(1+SNR)^{\frac{n}{2}}\,d\theta = g_n(\theta_0) + \sqrt{2\pi P}\cdot 2^{n(C-R)}$$

where $C = \frac{1}{2}\log(1+SNR)$ is the Gaussian channel capacity. Similarly,

$$g_n(\theta_0 - 2^{-nR}) \leq g_n(\theta_0) - \sqrt{2\pi P}\cdot 2^{n(C-R)}$$

Applying Lemma 4, we bound each of the terms in (6) separately, using the fact that $g_n(\theta_0)$ is Gaussian. Denoting $a_n = \sqrt{2\pi P}\cdot 2^{n(C-R)}$, we have

$$\mathbb{E}\, F_Q(g_n + a_n) \geq \mathbb{P}\Big(g_n > -\frac{a_n}{2}\Big)\,\mathbb{E}\Big(F_Q(g_n + a_n)\,\Big|\, g_n > -\frac{a_n}{2}\Big) \geq F_Q^2\Big(\frac{a_n}{2}\Big)$$

$$\mathbb{E}\, F_Q(g_n - a_n) \leq \mathbb{E}\Big(F_Q(g_n - a_n)\,\Big|\, g_n < \frac{a_n}{2}\Big) + \mathbb{P}\Big(g_n > \frac{a_n}{2}\Big) \leq 2\Big(1 - F_Q\Big(\frac{a_n}{2}\Big)\Big)$$

Putting the terms together, we get asymptotically

$$P_e \leq 1 - F_Q^2\Big(\frac{a_n}{2}\Big) + 2\Big(1 - F_Q\Big(\frac{a_n}{2}\Big)\Big) \approx 4\Big(1 - F_Q\Big(\frac{a_n}{2}\Big)\Big) = 4\Big(1 - F_Q\Big(\tfrac{1}{2}\sqrt{2\pi P}\cdot 2^{n(C-R)}\Big)\Big) \approx 2\exp\Big(-\frac{\pi}{4}\,2^{2n(C-R)}\Big)$$

where we have used the exponential approximation of the Gaussian CDF. We thus obtain the same double-exponential decay as in the Schalkwijk-Kailath scheme,

$$\lim_{n\to\infty} \frac{1}{n}\log\log\frac{1}{P_e} \geq 2(C-R) \qquad (8)$$

via a slightly different analysis.

The difference between our general scheme and the "estimation error" approach of the Schalkwijk-Kailath scheme in a non-Gaussian setting should now be evident. For general additive noise, the Schalkwijk-Kailath scheme transmits the linear MMSE estimation error given past observations, which is uncorrelated with those observations but, unlike in our scheme, not independent of them, except in the Gaussian case.

Example 2 (The BSC channel): We now consider the BSC setting with crossover probability $p$, and show that our scheme in this case is essentially the same as the Horstein scheme [1]. The discussion is easily adjusted to any DMC with feedback. According to our approach, the channel's input should be independent of previous outputs and distributed $\sim \mathrm{Ber}(\frac{1}{2})$ (capacity achieving). To that end, the function $g_n$ can be the indicator function of any subset with a-posteriori probability equal to $\frac{1}{2}$. One possibility is

$$g_n(\theta_0, y_1^{n-1}) = \begin{cases} 0 & \theta_0 < \mathrm{median}\{f_{n-1}(\theta)\} \\ 1 & \text{otherwise} \end{cases} \qquad (9)$$

which is precisely the Horstein scheme. Applying (1) results in (9) as well, since $F_Q^{-1}$ corresponds to a selection of a median subset. Note that unlike the continuous alphabet case, there is an inherent loss of information in the "matching" step here, since the posterior is converted into a discrete distribution. The posterior in this case is built by multiplying each side of the median by either $2p$ or $2(1-p)$ according to the received bit, and since the message point always lies on the correct side of the median, we get

$$f_n(\theta_0) = 2^n p^{n_1}(1-p)^{n-n_1}$$

where $n_1 \approx np$ is the number of crossovers that occurred during transmission. This immediately results in

$$\frac{1}{n}\log f_n(\theta_0) \;\xrightarrow{n\to\infty}\; 1 - h_b(p) = C \quad \text{with probability } 1$$

as expected. Notice that the posterior is quasi-constant over at most $n+1$ disjoint intervals; therefore the size of the interval containing the message point is no larger than $2^{-nC}$.

These observations have been utilized before [5] for variable-rate universal communications when the noise is an individual sequence. Due to the discrete nature of the setting, the error probability analysis differs from that described herein (for instance, Lemma 3 naturally does not apply) and is left out.
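A minimal simulation sketch of the median rule (9), with the posterior held as weights on a discretized unit interval; the grid size, seed, and horizon are arbitrary choices of ours:

```python
import numpy as np

# Horstein median rule (9) on a discretized unit interval.
rng = np.random.default_rng(1)
p = 0.1                                   # BSC crossover probability
grid = np.linspace(0, 1, 100_000, endpoint=False)
w = np.full(grid.size, 1.0 / grid.size)   # posterior weights; f_0 uniform
theta0 = rng.uniform()

for n in range(100):
    median = grid[np.searchsorted(np.cumsum(w), 0.5)]
    x = int(theta0 >= median)             # transmit the side of the median
    y = x ^ int(rng.random() < p)         # BSC: flip with probability p
    # Bayes update: the side matching y is scaled by 2(1-p), the other by 2p
    # (here without the factor 2, then renormalized; the posterior is the same).
    side = (grid >= median).astype(int)
    w *= np.where(side == y, 1.0 - p, p)
    w /= w.sum()

print(f"theta0 = {theta0:.6f}, posterior peak at {grid[np.argmax(w)]:.6f}")
```

With these parameters the posterior mass concentrates around $\theta_0$ down to the grid resolution, consistent with the $2^{-nC}$ interval-size observation above.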
V. UNIFORM NOISE EXAMPLE

Our suggested method generalizes previously proposed feedback schemes, and to demonstrate its application in cases not handled before, we provide a simple illustrative example of a uniform noise channel with a uniform input distribution. We shall see that the resulting transmission strategy in this case turns out to be a very intuitive one, which vividly demonstrates the zoom-in effect mentioned earlier.

Example 3 (Uniform noise with uniform input distribution): Consider a memoryless additive noise channel with $U(0,1)$ noise, and say we choose an input distribution which is also $U(0,1)$. What is our transmission strategy in this simple case? It is easy to verify that the inverse channel $V(x|y)$ is

$$V(x|y) \sim \begin{cases} U(0,\,y) & y \leq 1 \\ U(y-1,\,1) & y > 1 \end{cases}$$

Since the input distribution was set to be $U(0,1)$, the function $S(x,y)$ is merely the CDF of $V(x|y)$, and is given by

$$S(x,y) = \begin{cases} \Lambda\big(\tfrac{x}{y}\big) & y \leq 1 \\[4pt] \Lambda\big(\tfrac{x-y+1}{2-y}\big) & y > 1 \end{cases}$$

where $\Lambda(x) = \min(\max(x,0),1)$. This means that our transmission strategy in this case is very simple. We start by transmitting $g_1 = \theta_0$. Then, given $y_1$, we find the range of inputs that could have generated it, and apply to $g_1$ a transformation that linearly stretches this range to fill the entire unit interval, which provides us with $g_2$ to transmit. We then find the range of possible inputs given $y_2$, apply the corresponding linear transformation to $g_2$, and so on. This is intuitively appealing, since what we do in each iteration is just zoom in on the remaining uncertainty region for $\theta_0$. Since the posterior is always uniform, this zooming-in is linear.

This transmission strategy results in a posterior which is uniform over an ever-shrinking sequence of intervals. Consequently, in this case it is easier to look at a variable-rate decoding rule, by simply decoding the current interval $(a_n, b_n)$ within which the posterior is uniform. The size of that interval is

$$|b_n - a_n| = \prod_{k\in A} y_k \prod_{k\notin A} (2 - y_k)$$

where $A = \{k : y_k < 1\}$.
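The zoom-in recursion and the interval-size product above are straightforward to simulate. The following sketch (variable names and horizon are ours) runs $x_{n+1} = S(x_n, y_n)$ and accumulates $\log_2|b_n - a_n|$; its normalized value approaches the mutual information, as derived next. The numeric value of $I$ in the last line is our own computation of $h(Y) - h(Z) = \tfrac{1}{2}$ nat for this input/noise pair.

```python
import numpy as np

# Example 3: U(0,1) noise and U(0,1) inputs. Run x_{n+1} = S(x_n, y_n) and
# accumulate log2 |b_n - a_n| via the per-step shrinkage factors
# y_k (when y_k <= 1) and 2 - y_k (when y_k > 1).
rng = np.random.default_rng(2)

def S(x, y):
    lam = lambda t: min(max(t, 0.0), 1.0)        # Lambda: clip to [0, 1]
    return lam(x / y) if y <= 1 else lam((x - y + 1) / (2 - y))

theta0 = rng.uniform()
x, n, log2_len = theta0, 10_000, 0.0             # g_1 = theta_0
for _ in range(n):
    y = x + rng.uniform()                        # additive U(0,1) noise
    log2_len += np.log2(y if y <= 1 else 2 - y)  # interval shrinkage
    x = S(x, y)                                  # zoom in and retransmit

R = -log2_len / n                                # empirical rate, bits/use
I = 0.5 / np.log(2)                              # 1/2 nat ~ 0.7213 bits
print(f"empirical rate {R:.4f} bits/use vs mutual information {I:.4f}")
```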
This is a zero-error decoding rule which results in a variable rate that converges to

$$R = \frac{-\log|b_n - a_n|}{n} = \frac{1}{n}\sum_{k\in A} \log\frac{1}{y_k} + \frac{1}{n}\sum_{k\notin A} \log\frac{1}{2-y_k} = \frac{1}{n}\sum_{k=1}^{n} \log\frac{f_{X|Y}(x_k|y_k)}{f_X(x_k)} \;\xrightarrow{n\to\infty}\; I \quad \text{with probability } 1$$

where $I$ is the corresponding mutual information. Note that in this example, every channel output actually produces bits in the amount corresponding to its individual mutual information.

VI. DISCUSSION

A sequential communications strategy for memoryless channels with feedback was described, providing in particular a unified view of the known Horstein and Schalkwijk-Kailath schemes. The core of the strategy lies in the constantly refined representation of the message point's position relative to the uncertainty at the receiver. This is accomplished by evaluating the receiver's a-posteriori cumulative distribution function at the message point, followed by a technical step of matching this quantity to the channel via an appropriate transformation. A preliminary analysis for additive noise channels was provided. The proposed scheme is expected to attain the capacity of general memoryless channels under suitable regularity conditions, an issue which is currently under investigation.

A known drawback of the Schalkwijk-Kailath scheme is that its peak power may become arbitrarily large. This problem was treated in [6] by ceasing transmission and declaring an error whenever the time-averaged power exceeded some given threshold, at the cost of losing the doubly-exponential error probability. However, our scheme allows for a much simpler solution, since the input distribution can be set (and optimized) to obey any required single-letter peak constraint.

An interesting research direction could be the treatment of channels with memory within the same framework, possibly by modifying the channel matching step to depend on previous outputs. Another direction to be explored is the possible use of our method for universal communications with feedback. In a stochastic universal setting, the transmitter can estimate the channel with increasing accuracy, and match the transmission strategy accordingly. Although the receiver does not know the channel, it seems plausible that for a "not too rich" family of channels, the calculated posterior will have a significant peak only when "close enough" to the true channel, and will be flat otherwise. Furthermore, it should be examined whether the same method can be used in an individual noise setting as well, employing randomization techniques in the spirit of [5].

VII. ACKNOWLEDGMENT

We gratefully acknowledge the useful comments made by an anonymous reviewer, which greatly improved the presentation.

REFERENCES

[1] M. Horstein, "Sequential transmission using noiseless feedback," IEEE Trans. Info. Theory, pp. 136-143, July 1963.
[2] J. P. M. Schalkwijk and T. Kailath, "A coding scheme for additive noise channels with feedback part I: No bandwidth constraint," IEEE Trans. Info. Theory, vol. IT-12, pp. 172-182, 1966.
[3] J. P. M. Schalkwijk, "A coding scheme for additive noise channels with feedback part II: Band-limited signals," IEEE Trans. Info. Theory, vol. IT-12, pp. 183-189, 1966.
[4] O. Shayevitz, R. Zamir, and M. Feder, "Bounded expected delay in arithmetic coding," in Proc. of the International Symposium on Information Theory, 2006.
[5] O. Shayevitz and M. Feder, "Achieving the empirical capacity using feedback - part I: Memoryless additive models," submitted to the IEEE Trans. Info. Theory. Available online at: http://www.eng.tau.ac.il/~ofersha/empirical_capacity_part1.pdf
[6] A. D. Wyner, "On the Schalkwijk-Kailath coding scheme with a peak energy constraint," IEEE Trans. Info. Theory, vol. IT-14, no. 1, pp. 129-134, Jan. 1968.