
Feedback communication over individual channels

2009 IEEE International Symposium on Information Theory

We consider the problem of communicating over a channel for which no mathematical model is specified. We present achievable rates as a function of the channel input and output known a-posteriori for discrete and continuous channels, as well as a rate-adaptive scheme employing feedback which achieves these rates asymptotically without prior knowledge of the channel behavior.

Communication over Individual Channels

arXiv:0901.1473v2 [cs.IT] 20 Aug 2009

Yuval Lomnitz, Meir Feder, Tel Aviv University, Dept. of EE-Systems. Email: {yuvall,meir}@eng.tau.ac.il

Abstract—We consider the problem of communicating over a channel for which no mathematical model is specified. We present achievable rates as a function of the channel input and output known a-posteriori for discrete and continuous channels, as well as a rate-adaptive scheme employing feedback which achieves these rates asymptotically without prior knowledge of the channel behavior.

I. INTRODUCTION

The problem of communicating over a channel with an individual, predetermined noise sequence which is not known to the sender and receiver was addressed by Shayevitz and Feder [1][2] and Eswaran et al. [3][4]. The simple example discussed in [1] is a binary channel $y_n = x_n \oplus e_n$ where the error sequence $e_n$ can be any unknown sequence. Using perfect feedback and common randomness, communication is shown to be possible at a rate approaching the capacity of the binary symmetric channel (BSC) whose error probability equals the empirical error probability of the sequence (the relative number of '1'-s in $e_n$). Subsequently both sets of authors extended this model to general discrete channels and modulo-additive channels ([3], [2] resp.) with an individual state sequence, and showed that the empirical mutual information can be attained.

Now we take this model one step further. We consider a channel where no specific probabilistic or mathematical relation between the input and the output is assumed. In order to define positive communication rates without assumptions on the channel, we characterize the achievable rate using the specific input and output sequences, and we term this channel an individual channel. This way of treating unknown channels differs from other concepts for dealing with the problem, such as compound channels and arbitrarily varying channels, in that the latter require a specification of the channel model up to some unknown parameters, whereas the current approach makes no a-priori assumptions about the channel behavior. We usually assume the existence of a feedback link through which the channel output, or other information from the decoder, can be sent back to the encoder. Without this feedback it would not be possible to match the rate of transmission to the quality of the channel, so outage would be inevitable. Although one may not be fully comfortable with the mathematical formulation of the problem, there is no question about the reality of this model: this is the only channel model that we know for sure exists in nature. This point of view is similar to the approach used in universal source coding of individual sequences, where the goal is to asymptotically attain for each sequence the same coding rate achieved by the best encoder from a model class, tuned to the sequence.

Just to inspire thought, let us ask the following question: suppose the sequence $\{x_i\}_{i=1}^n$ with power $P = \frac{1}{n}\sum_{i=1}^n x_i^2$ encodes a message and is transmitted over a continuous real-valued input channel. The output sequence is $\{y_i\}_{i=1}^n$. One can think of $v_i = y_i - x_i$ as a noise sequence and measure its power $N = \frac{1}{n}\sum_{i=1}^n v_i^2$. Is the rate $R = \frac{1}{2}\log\left(1 + \frac{P}{N}\right)$, which is the Gaussian channel capacity, achievable in this case, under appropriate definitions?
The way it was posed, the answer to this question would be "no", since this model predicts a rate of $\frac{1}{2}$ bit/use for the channel whose output is $\forall i: y_i = 0$, which cannot convey any information. However, with the slight restatement done in the next section the answer would be "yes".

We consider two classes of individual channels: discrete input and output channels and continuous real-valued input and output channels, and two communication models: with feedback and without feedback. In both cases we assume common randomness exists. The case of feedback is of higher interest, since the encoder can adapt the transmission rate and avoid outage. The case of no feedback is used as an intermediate step, but the results are interesting since they can be used for the analysis of semi-probabilistic models. The main result is that with a small amount of feedback, communication at a rate close to the empirical mutual information (or its Gaussian equivalent for continuous channels) can be achieved, without any prior knowledge of, or assumptions about, the channel structure.

The paper is organized as follows: in section II we give a high level overview of the results. In section III-B we define the model and notation. Section IV deals with communication without feedback, where the results pertaining to the discrete and continuous cases are formalized and proven, and the choice of the rate function and the Gaussian prior for the continuous case is justified. Section V deals with the case where feedback is present. After reviewing similar results we state the main result and the adaptive rate scheme that achieves it, and delay the proof to section VI, where the error probability and the achieved rate are analyzed and bounded. Section VII gives several examples, and section VIII is dedicated to comments and highlights areas for further study.

II. OVERVIEW OF MAIN RESULTS

We start with a high level overview of the definitions and results. The definitions below are conceptual rather than accurate, and detailed definitions follow in the next sections. A rate function is a function $R_{emp}: \mathcal{X}^n \times \mathcal{Y}^n \to \mathbb{R}$ of the input and output sequences. In communication without feedback we say a given rate function is achievable if for large block size $n \to \infty$, it is possible to communicate at rate $R$ and an arbitrarily small error probability is obtained whenever $R_{emp}$ exceeds the rate of transmission, i.e. whenever $R_{emp}(\mathbf{x},\mathbf{y}) > R$. In communication with feedback we say a given rate function is achieved by a communication scheme if for large block size $n$, data at a rate close to or exceeding $R_{emp}(\mathbf{x},\mathbf{y})$ is decoded successfully with arbitrarily large probability for every output sequence and almost every input sequence. Roughly speaking, this means that in any instance of the system operation, where a specific $\mathbf{x}$ was the input and a specific $\mathbf{y}$ was the output, the communication rate had been at least $R_{emp}(\mathbf{x},\mathbf{y})$. Note that the only statistical assumptions are related to the common randomness, and we consider the rate and error probability conditioned on a specific input and output, where the error probability is averaged over the common randomness. We say that a rate function $R_{emp}$ is an optimal (but not the optimal) function if any $R'_{emp} \geq R_{emp}$ which is strictly larger than $R_{emp}$ at at least one point is not achievable. The definition of achievability is not complete without stating the input distribution, since it affects the empirical rate.
For example, by setting $\mathbf{x} = \mathbf{0}$ one can attain any rate function with $R_{emp}(\mathbf{0},\mathbf{y}) = 0$ in a void way, since other $\mathbf{x}$ sequences will never appear. Differently from classical results in information theory, we do not use the input distribution only as a means to show the existence of good codes: taking advantage of the common randomness, we require the encoder to emit input symbols that are random and distributed according to a defined prior (currently we assume an i.i.d. distribution).

The choice of the rate functions is arbitrary in a way: for any pair of encoder and decoder, we can tailor a function $R_{emp}(\mathbf{x},\mathbf{y})$ equaling the transmitted rate whenever the error probability given the two sequences (averaged over messages and the common randomness) is sufficiently small, and 0 otherwise. However, it is clear that there are certain rates which cannot be exceeded uniformly. Our interest will focus on simple functions of the input and output, and specifically in this paper we focus on functions of the instantaneous (zero order) empirical statistics. Extension to higher order models seems technical.

For the discrete channel we show that the rate

$$R_{emp} = \hat{I}(\mathbf{x};\mathbf{y}) \qquad (1)$$

is achievable with any input distribution $P_X$, where $\hat{I}(\cdot;\cdot)$ denotes the empirical mutual information [5] (see definition in section III-B, and Theorems 1, 3). For the continuous (real valued) channel we show that the rate

$$R_{emp} = \frac{1}{2}\log\left(\frac{1}{1-\hat{\rho}(\mathbf{x},\mathbf{y})^2}\right) \qquad (2)$$

is achievable with Gaussian input distribution $\mathcal{N}(0,P)$, where $\hat\rho$ is the empirical correlation factor between the input and output sequences (see Theorems 2, 4). These results pertain both to the case of feedback and to that of no feedback according to the definitions above. Throughout the current paper we define the correlation factor in a slightly non-standard way as $\rho = \frac{E(XY)}{\sqrt{E(X^2)E(Y^2)}}$ (that is, without subtracting the mean). This is done only to simplify definitions and derivations, and similar claims can be made using the correlation factor defined in the standard way. Although the result regarding the continuous case is less tight, we show that this is the best rate function that can be defined by second order moments, and it is tight for the Gaussian additive channel (for this channel $\rho^2 = \frac{P}{P+N}$, therefore $R_{emp} = \frac{1}{2}\log\left(1+\frac{P}{N}\right)$).

We may now rephrase our example question from the introduction so that it will have an affirmative answer: given the input and output sequences, describe the output by the virtual additive channel with a gain, $y_i = \alpha x_i + v_i$, so the effective noise sequence is $v_i = y_i - \alpha x_i$. Choose $\alpha$ so that $\mathbf{v} \perp \mathbf{x}$, i.e. $\frac{1}{n}\sum_i v_i x_i = 0$. An equivalent condition is that $\alpha$ minimizes $\|\mathbf{v}\|^2$. The resulting $\alpha$ is the LMMSE coefficient in the estimation of $y$ from $x$ (assuming zero mean), i.e. $\alpha = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|^2}$. Define the effective noise power as $N = \frac{1}{n}\sum_{i=1}^n v_i^2$ and the effective $\mathrm{SNR} \equiv \frac{\alpha^2 P}{N}$. It is easy to check that $\mathrm{SNR} = \frac{\hat\rho^2}{1-\hat\rho^2}$, where $\hat\rho = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|\cdot\|\mathbf{y}\|}$ is the empirical correlation factor between $\mathbf{x}$ and $\mathbf{y}$. Then according to Eq.(2) the rate $R = \frac{1}{2}\log(1+\mathrm{SNR})$ is achievable, in the sense defined above. Reexamining the counterexample we gave above, in this model if we set $\mathbf{y} = \mathbf{0}$ we obtain $\hat\rho = 0$ and therefore $R_{emp} = 0$; equivalently the effective channel has $\mathbf{v} = \mathbf{0}$ and $\alpha = 0$, therefore $\mathrm{SNR} = 0$ (instead of $\mathbf{v} = -\mathbf{x}$, $\alpha = 1$ and $\mathrm{SNR} = 1$).

As will be seen, we achieve these rates by random coding and universal decoders. For the case of feedback we use iterated instances of rateless coding (i.e. we encode a fixed number of bits and the decision time depends on the channel).
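To make the two rate functions concrete, the following minimal Python sketch (ours, not from the paper; all function and variable names are illustrative) computes the empirical mutual information of Eq.(1) for discrete sequences, and the correlation-based rate of Eq.(2) together with the effective-channel quantities $\alpha$, $N$ and SNR described above.

```python
# Illustrative sketch only: the two empirical rate functions of Eqs.(1)-(2).
import numpy as np
from collections import Counter

def remp_discrete(x, y, base=2.0):
    """Eq.(1): zero-order empirical mutual information I_hat(x; y)."""
    n = len(x)
    Pxy, Px, Py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * np.log((c * n) / (Px[a] * Py[b]))
               for (a, b), c in Pxy.items()) / np.log(base)

def remp_continuous(x, y):
    """Eq.(2) and the effective additive channel y = alpha*x + v with v orthogonal to x."""
    if not np.any(y) or not np.any(x):
        return dict(alpha=0.0, N=0.0, snr=0.0, R_emp=0.0)   # convention: rho_hat = 0 when y = 0
    alpha = x @ y / (x @ x)                  # LMMSE gain: makes v = y - alpha*x orthogonal to x
    v = y - alpha * x
    P, N = np.mean(x ** 2), np.mean(v ** 2)  # input power and effective noise power
    snr = alpha ** 2 * P / N                 # equals rho_hat^2 / (1 - rho_hat^2)
    rho = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return dict(alpha=alpha, N=N, snr=snr, R_emp=0.5 * np.log2(1.0 / (1.0 - rho ** 2)))

rng = np.random.default_rng(0)
# Discrete: BSC-like behaviour with ~10% flips gives I_hat close to 1 - h_b(0.1) ~ 0.53 bit.
xd = rng.integers(0, 2, 100_000)
yd = xd ^ (rng.random(100_000) < 0.1).astype(int)
print(remp_discrete(xd, yd))
# Continuous: y = x + noise gives R_emp close to 0.5*log2(1 + P/N); y = 0 gives R_emp = 0.
xc = rng.normal(0, 1, 100_000)
print(remp_continuous(xc, xc + rng.normal(0, 1, 100_000)))
print(remp_continuous(xc, np.zeros_like(xc)))
```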
The scheme is able to operate asymptotically with "zero rate" feedback (meaning any positive capacity of the feedback channel suffices). A similar, although more complicated, scheme was used in [3] (see a comparison in the appendix).

Before the detailed presentation we would like to examine the differences between the model used here and two proximate models: the arbitrarily varying channel (AVC) and the channel with an individual noise sequence. In the AVC (see for example [6][7]), the channel is defined by a probabilistic model which includes an unknown state sequence. Constraints on the sequence (such as power, or the number of errors) may be defined, and the target is to communicate equally well over all possible occurrences of the state sequence. In the AVC, the capacity depends on the existence of common randomness and on whether the average or maximum error probability (over the messages) is required to approach 0, yet when sufficient common randomness is used, the capacities for maximum and average error probability are equal. The notes in [6] regarding common randomness and randomized encoders (see p.2151) are also relevant to our case. A treatment of AVC-s which is similar in spirit to our results exists in watermarking problems. For example, a rather general case of the AVC is discussed in [8]. They consider communication over a black box (representing the attacker) which is only limited to a given level $D$ of distortion according to a predefined metric, but has an otherwise block-wise undefined behavior. They show that it is possible to achieve a rate equal to the rate-distortion function of the input, $R_X(D)$, if the black box guarantees a given level of average distortion with high probability. This result is similar to our Theorem 1. The remarkable distinction from other results for the AVC is that the rate is determined using a constraint on the channel inputs and outputs, rather than on the channel state sequence. We note that for the Gaussian additive channel the above result is suboptimal, since the rate is $R_X(N) = \frac{1}{2}\log(P/N)$, and our results improve on it by using the correlation factor rather than the mean squared error. See further discussion of these results in the proof of Lemma 1 and in the discussion following Theorem 3.

Channels with an individual noise (or state) sequence are treated by Shayevitz and Feder [1][2] and Eswaran et al. [3]. The probabilistic setting is the same as in the AVC, and the difference is that instead of achieving a uniform (hence worst-case) rate, the target is to achieve a variable rate which depends on the particular sequence of noise, using a feedback link. In this setup, prior constraints on the state sequence can be relaxed. As opposed to the AVC where the capacity is well defined, the target rate for each state sequence is determined in a somewhat arbitrary way (since many different constraints on the sequence can be defined). As an example, in the binary channel of [1], a rate of 0 would be obtained for the sequence $e = 01010101\ldots$ since the empirical error probability is $\frac{1}{2}$, although obviously a scheme which favors this specific sequence and achieves a rate of 1 can be designed. On the other hand, with the AVC approach communication over this channel would not be possible without prior constraints on the noise sequence. Channels with an individual noise sequence can be thought of as compound-AVCs (i.e. an AVC with an unknown parameter, in this case, the constraint).
As in the AVC, the existence of common randomness as well as the definition of error probability affect the achievable rates. In the individual channel model we use here, since no equation with a state sequence connecting the input and output is given, the achievable rates cannot be defined without relating to the channel input. Therefore the definitions of achieved rates depend in a somewhat circular way on the channel input, which is determined by the scheme itself. Currently we circumvent this difficulty by constraining the input distribution, as mentioned above. In many aspects the model used in this paper is more stringent than the AVC and the individual noise sequence models, since it makes fewer assumptions on the channel, and the error probability is required to be met for (almost) every input and output sequence (rather than on average). In other aspects it is lenient, since we may attribute 'bad' channel behavior to the rate rather than suffer an error; therefore the error exponents are better than in probabilistic models. This is further explained in section IV-A.

The model we propose suggests a new approach for the design of communication systems. The classical point of view first assumes a channel model and then devises a communication system optimized for it. Here we take the inverse direction: we devise a communication system without assumptions on the channel which guarantees rates depending on the channel behavior. This change of viewpoint does not make probabilistic or semi-probabilistic channel models redundant but merely suggests an alternative. By using a channel model we can formalize questions relating to optimality, such as capacity (single user, networks) and error exponent, as well as guarantee a communication rate a-priori. Another aspect is that we pay a price for universality. Even if one considers an individual channel scheme that guarantees asymptotically optimum rates over a large class of channels, it can never consider all possible channels (block-wise), and for a finite block size it will have a larger overhead (a reduction in the amount of information communicated with the same error probability) compared to a scheme optimized for the specific channel. Following our results, the individual channel approach becomes a very natural starting point for determining achievable rates for various probabilistic and arbitrary models (AVC-s, individual noise sequences, probabilistic models, compound channels) under the realm of randomized encoders, since the achievable rates for these models follow easily from the achievable rates for specific sequences, and the law of large numbers. We will give some examples later on.

III. DEFINITIONS AND NOTATION

A. Notation

In general we use uppercase letters to denote random variables, respective lowercase letters to denote their sample values and boldface letters to denote vectors, which are by default of length $n$. However we deviate from this practice when the change of case leads to confusion, and vectors are always denoted by lowercase letters even when they are random variables. $\|\mathbf{x}\| \equiv \sqrt{\mathbf{x}^T\mathbf{x}}$ denotes the L2 norm. We denote by $P \circ Q$ the product of conditional probability functions, e.g. $(P \circ Q)(x,y) = P(x)\cdot Q(y|x)$. A hat $(\hat{\ })$ denotes an estimated value. We denote the empirical distribution as $\hat{P}$ (e.g. $\hat{P}_{(\mathbf{x},\mathbf{y})}(x,y) \equiv \frac{1}{n}\sum_{i=1}^n \delta_{(x_i-x),(y_i-y)}$). The source vectors $\mathbf{x}, \mathbf{y}$ and/or the variables $x, y$ are sometimes omitted when they are clear from the context.
We denote by $\hat{H}(\cdot)$, $\hat{I}(\cdot;\cdot)$, $\hat{\rho}(\cdot;\cdot)$ the empirical entropy, the empirical mutual information and the empirical correlation factor, which are the respective values calculated for the empirical distribution. All expressions such as $\hat{H}(\mathbf{x})$, $\hat{H}(\mathbf{x}|\mathbf{y})$, $\hat{I}(\mathbf{x};\mathbf{y})$, $\hat{I}(\mathbf{x};\mathbf{y}|\mathbf{z})$, $\hat{I}(\mathbf{x};\mathbf{y}|\mathbf{z}=z_0)$ are interpreted as their respective probabilistic counterparts $H(X)$, $H(X|Y)$, $I(X;Y)$, $I(X;Y|Z)$, $I(X;Y|Z=z_0)$, where $(X,Y,Z)$ are random variables distributed according to the empirical distribution of the vectors $\hat{P}_{(\mathbf{x},\mathbf{y},\mathbf{z})}$, or equivalently are defined as a random selection of an element of the vectors, i.e. $(X,Y,Z) = (x_i,y_i,z_i)$, $i \sim \mathcal{U}\{1,\ldots,n\}$. It is clear from this equivalence that relations on entropy and mutual information (e.g. positivity, chain rules) are directly translated to relations on their empirical counterparts. We apply superscript and subscript indices to vectors to define subsequences in the standard way, i.e. $x_i^j \equiv (x_i, x_{i+1}, \ldots, x_j)$, $x^i \equiv x_1^i$. We denote by $I(P,W)$ the mutual information $I(X;Y)$ when $(X,Y) \sim P(x)\cdot W(y|x)$. $\mathcal{U}(A)$ denotes a uniform distribution over the set $A$. $\mathrm{Ber}(p)$ denotes the Bernoulli distribution, and $h_b(p) \equiv H(\mathrm{Ber}(p)) = -p\log p - (1-p)\log(1-p)$ denotes the binary entropy function. The indicator function $\mathrm{Ind}(E)$, where $E$ is a set or a probabilistic event, is defined as 1 over the set (or when the event occurs) and 0 otherwise. The functions $\log(\cdot)$ and $\exp(\cdot)$, as well as the information theoretic quantities $H(\cdot)$, $I(\cdot;\cdot)$, $D(\cdot\|\cdot)$, refer to the same, unspecified base. We use the term "information unit" as the unit of these quantities (equal to $\frac{1}{\log(2)}$ bits). The notation $f_n = O(g_n)$ and $f_n < O(g_n)$ (or equivalently $O(f_n) = O(g_n)$ and $O(f_n) < O(g_n)$) means $\frac{f_n}{g_n} \xrightarrow{n\to\infty} \mathrm{const} > 0$ and $\frac{f_n}{g_n} \xrightarrow{n\to\infty} 0$, respectively. Throughout this paper we use the term "continuous" to refer to the continuous real-valued channel $\mathbb{R} \to \mathbb{R}$, although this definition does not cover all continuous-input, continuous-output channels. By the term "discrete" in this paper we always refer to finite alphabets (as opposed to countable ones).

B. Definitions

Definition 1 (Channel). A channel is defined by a pair of input and output alphabets $\mathcal{X}, \mathcal{Y}$, and denoted $\mathcal{X} \to \mathcal{Y}$.

Definition 2 (Fixed rate encoder, decoder, error probability). A randomized block encoder and decoder pair for the channel $\mathcal{X} \to \mathcal{Y}$ with block length $n$ and rate $R$ without feedback is defined by a random variable $S$ distributed over the set $\mathcal{S}$, a mapping $\phi: \{1,2,\ldots,\exp(nR)\} \times \mathcal{S} \to \mathcal{X}^n$ and a mapping $\bar\phi: \mathcal{Y}^n \times \mathcal{S} \to \{1,2,\ldots,\exp(nR)\}$. The error probability for message $w \in \{1,2,\ldots,\exp(nR)\}$ is defined as

$$P_e^{(w)}(\mathbf{x},\mathbf{y}) = \Pr\left(\bar\phi(\mathbf{y},S) \neq w \mid \phi(w,S) = \mathbf{x}\right) \qquad (3)$$

where for $\mathbf{x}$ such that the condition cannot hold, we define $P_e^{(w)}(\mathbf{x},\mathbf{y}) = 0$. Note that the encoder rate must pertain to a discrete number of messages $\exp(nR) \in \mathbb{Z}^+$, but the empirical rates defined in the following theorems may be any positive real numbers.

Definition 3 (Adaptive rate encoder, decoder, error probability). A randomized block encoder and decoder pair for the channel $\mathcal{X} \to \mathcal{Y}$ with block length $n$, adaptive rate and feedback is defined as follows:
• The message $w$ is expressed by the infinite sequence $w_1^\infty \in \{0,1\}^\infty$
• The common randomness is defined as a random variable $S$ distributed over the set $\mathcal{S}$
• The feedback alphabet is denoted $\mathcal{F}$
• The encoder is defined by a series of mappings $x_k = \phi_k(w, s, f^{k-1})$ where $\phi_k: \{0,1\}^\infty \times \mathcal{S} \times \mathcal{F}^{k-1} \to \mathcal{X}$.
• The decoder is defined by the feedback function $\varphi_k: \mathcal{Y}^{k-1} \times \mathcal{S} \to \mathcal{F}$, the decoding function $\bar\phi: \mathcal{Y}^n \times \mathcal{S} \to \{0,1\}^\infty$ and the rate function $r: \mathcal{Y}^n \times \mathcal{S} \to \mathbb{R}^+$ (where the rate is measured in bits), applied as follows:

$$f_k = \varphi_k(\mathbf{y}^{k-1}, S) \qquad (4)$$
$$\hat{w} = \bar\phi(\mathbf{y}, S) \qquad (5)$$
$$R = r(\mathbf{y}, S) \qquad (6)$$

The error probability for message $w$ is defined as

$$P_e^{(w)}(\mathbf{x},\mathbf{y}) = \Pr\left(\hat{w}_1^{\lceil nR\rceil} \neq w_1^{\lceil nR\rceil} \,\middle|\, \mathbf{x},\mathbf{y}\right) \qquad (7)$$

In other words, a recovery of the first $\lceil nR\rceil$ bits by the decoder is considered a successful reception. For $\mathbf{x}$ such that the condition cannot hold, we define $P_e^{(w)}(\mathbf{x},\mathbf{y}) = 0$. The conditioning on $\mathbf{y}$ is mainly for clarification, since it can be treated as a fixed vector. This system is illustrated in figure 2. Note that if we are not interested in limiting the feedback rate, and perfect feedback can be assumed, the definition of the feedback alphabet and feedback function is redundant (in this case $\mathcal{F} = \mathcal{Y}$ and $f_k = y_k$).

The model in which the decoder determines the transmission rate is lenient in the sense that it gives the flexibility to exchange rate for error probability: the decoder may estimate the error probability and decrease it by reducing the decoding rate. In the scheme we discuss here the rate is determined during reception, but it is worth noting in this context the posterior matching scheme [9] for the known memoryless channel. In that scheme the message is represented as a real number $\theta \in [0,1)$ and the rate for a given error probability $P_e$ can be determined after the decoding by calculating $\Pr(\theta|\mathbf{y})$ and finding the smallest interval with probability at least $1 - P_e$.

IV. COMMUNICATION WITHOUT FEEDBACK

In this section we show that the empirical mutual information (in the discrete case) and its Gaussian counterpart (in the continuous case) are achievable in the sense defined in the overview. For the continuous case we justify the choice of the Gaussian distribution as the one yielding the maximum rate function that can be defined by second order moments.

A. The discrete channel without feedback

The following theorem formalizes the achievability of the rate $\hat{I}(\mathbf{x};\mathbf{y})$ without feedback:

Theorem 1 (Non-adaptive, discrete channel). Given discrete input and output alphabets $\mathcal{X}, \mathcal{Y}$, for every $P_e > 0$, $\delta > 0$, prior $Q(x)$ over $\mathcal{X}$ and rate $R > 0$ there exists $n$ large enough and a random encoder-decoder pair of rate $R$ over block size $n$, such that the distribution of the input sequence is $\mathbf{x} \sim Q^n$ and the probability of error for any message given an input sequence $\mathbf{x} \in \mathcal{X}^n$ and output sequence $\mathbf{y} \in \mathcal{Y}^n$ is not greater than $P_e$ if $\hat{I}(\mathbf{x};\mathbf{y}) > R + \delta$.

Theorem 1 follows almost immediately from the following lemma, which is proven in the appendix using a simple calculation based on the method of types [10]:

Lemma 1. For any sequence $\mathbf{y} \in \mathcal{Y}^n$, the probability of a sequence $\mathbf{x} \in \mathcal{X}^n$ drawn independently according to $Q^n$ to have $\hat{I}(\mathbf{x};\mathbf{y}) \geq t$ is upper bounded by:

$$Q^n\left(\hat{I}(\mathbf{x};\mathbf{y}) \geq t\right) \leq \exp\left(-n(t - \delta_n)\right) \qquad (8)$$

where $\delta_n = |\mathcal{X}||\mathcal{Y}|\frac{\log(n+1)}{n} \to 0$.

Fig. 1. Non rate adaptive encoder-decoder pair without feedback.

Fig. 2. Rate adaptive encoder-decoder pair with feedback.

Following the notations in [10], $Q^n(A)$ denotes the probability of the event $A$, or equivalently of the set of sequences $A$, under the i.i.d. distribution $Q^n$. Remarkably, this bound does not depend on $Q$.

To prove Theorem 1, the codebook $\{\mathbf{x}_m\}_{m=1}^{\exp(nR)}$ is randomly generated by i.i.d. selection of its $L = \exp(nR)\cdot n$ letters, so that the common randomness $S \in \mathcal{X}^L$ may be defined as the codebook itself and is distributed $Q^L$. The encoder sends the $w$-th codeword, and the decoder uses maximum mutual information (MMI) decoding, i.e. chooses:

$$\hat{w} = \bar\phi(\mathbf{y}, \{\mathbf{x}_m\}) = \arg\max_m \hat{I}(\mathbf{x}_m;\mathbf{y}) \qquad (9)$$

where ties are broken arbitrarily. By Lemma 1, the probability of error is bounded by:

$$P_e^{(w)}(\mathbf{x}_w,\mathbf{y}) \leq \Pr\left(\bigcup_{m\neq w}\left\{\hat{I}(\mathbf{x}_m;\mathbf{y}) \geq \hat{I}(\mathbf{x}_w;\mathbf{y})\right\}\right) \leq \exp(nR)\exp\left(-n\left(\hat{I}(\mathbf{x}_w;\mathbf{y}) - \delta_n\right)\right) = \exp\left(-n\left(\hat{I}(\mathbf{x}_w;\mathbf{y}) - R - \delta_n\right)\right) \qquad (10)$$

For any $\delta$ there is $n$ large enough such that $-\frac{\log(P_e)}{n} + \delta_n < \delta$. For this $n$, whenever $\hat{I}(\mathbf{x};\mathbf{y}) > R + \delta$ we have

$$P_e^{(w)}(\mathbf{x},\mathbf{y}) \leq \exp\left(-n(\delta - \delta_n)\right) < P_e \qquad (11)$$

which proves the theorem. ∎

Note that the MMI decoder used here is a popular universal decoder (see [5][10][11]), and was shown to achieve the same error exponent as the maximum likelihood decoder for fixed composition codes. The error exponent obtained here is better than the classical error exponent (slope of −1), and the reason is that the behavior of the channel is known, and therefore no errors occur as a result of non-typical channel behavior. Comparing, for example, with the derivation of the random coding error exponent for the probabilistic DMC based on the method of types (see [10]), in the latter the error probability is summed across all potential "behaviors" (conditional types) of the channel, accounting for their respective probabilities (resulting in one behavior, usually different from the typical behavior, dominating the bound), while here the behavior of the channel (the conditional distribution) is fixed, and therefore the error exponent is better. This is not necessarily the best error exponent that can be achieved (see [11][12], which discuss the error exponent with random decision time and feedback for probabilistic and compound models).

Note that the empirical mutual information is always well defined, even when some of the input and output symbols do not appear in the sequence, since at least one input symbol and one output symbol always appear. For the particular case of empirical mutual information measured over a single symbol, the empirical distributions become unit vectors (representing constants) and their mutual information is 0. In this discussion we have not dealt with the issue of choosing the prior $Q(x)$. Since the channel behavior is unknown, it makes sense to choose the maximum entropy, i.e. the uniform, prior, which was shown to obtain a bounded loss from capacity [13].

B. The continuous channel without feedback

When turning to define empirical rates for the real-valued alphabet case, the first obstacle we tackle is the definition of the empirical mutual information. A potential approach is to use discrete approximations. We only briefly describe this approach since it is somewhat arbitrary and less elegant than in the discrete case. The main focus is on empirical rates defined by the correlation factor. Although the latter approach is pessimistic and falls short of the mutual information for most channels, it is much simpler and more elegant than discrete approximations. We believe this approach can be further extended to obtain results closer to the (probabilistic) mutual information.

1) Discrete approximations: Define the continuous input and output alphabets $\mathcal{X}, \mathcal{Y}$. Suppose $Q$ is an arbitrary (continuous) prior.
Define input and output quantizers to discrete alphabets $A_n: \mathcal{X} \to \tilde{\mathcal{X}}_n$ and $B_n: \mathcal{Y} \to \tilde{\mathcal{Y}}_n$, where $\tilde{\mathcal{X}}_n, \tilde{\mathcal{Y}}_n$ are discrete alphabets of growing size, chosen to grow slowly enough so that $\delta_n = |\tilde{\mathcal{X}}_n||\tilde{\mathcal{Y}}_n|\frac{\log(n+1)}{n} \xrightarrow{n\to\infty} 0$. Define the empirical mutual information between continuous vectors as the empirical mutual information between their quantized versions (quantized letter by letter):

$$\hat{I}_{A,B}(\mathbf{x},\mathbf{y}) \equiv \hat{I}(A_n(\mathbf{x}), B_n(\mathbf{y})) \qquad (12)$$

Then based on Lemma 1, by using a random codebook drawn according to $Q$ and applying a maximum mutual information decoder using the above definition, we could asymptotically achieve the rate function $R_{emp} = \hat{I}_{A,B}(\mathbf{x},\mathbf{y})$ based on the definitions of Theorem 1. The main issue with this approach is that determining $A_n, B_n$ is arbitrary, and especially $B_n$ is difficult to define when the output range is unknown. Therefore in the following we focus on the suboptimal approach using the correlation factor.

2) Choosing the input distribution and rate function: First we justify our choice of the Gaussian input distribution and the aforementioned rate function. We take the point of view of a compound (probabilistic, unknown) channel. If a rate function cannot be attained for a compound channel model, it cannot be attained in the more stringent individual model either. It is well known that for a memoryless additive noise channel with constraints on the transmit power and noise variance, the Gaussian noise is the worst noise when the prior is Gaussian, and the Gaussian prior is the best prior when the noise is Gaussian. Thus by choosing a Gaussian prior we choose the best prior for the worst noise, and can guarantee that the mutual information will equal, at least, the Gaussian channel capacity. See the "mutual information game" (problem 9.21) in [14]. For the additive noise channel, [15] shows the loss from capacity when using a Gaussian distribution is limited to $\frac{1}{2}$ a bit. However the above is true only for additive noise channels. For the more general case where no additivity is assumed, we show below (Lemma 3) that the rate function $R = -\frac{1}{2}\log(1-\rho^2)$ is the best rate function that can be defined by second order moments and attained universally. Of course, this proof merely supplies the motivation to use a Gaussian distribution and does not rid us of the need to prove this rate is achievable for specific, individual sequences.

Lemma 2. Let $X, Y$ be two continuous random variables with correlation factor $\rho \equiv \frac{E(XY)}{\sqrt{E(X^2)E(Y^2)}}$, where $X$ is Gaussian, $X \sim \mathcal{N}(0,P)$. Then $I(X;Y) \geq -\frac{1}{2}\log(1-\rho^2)$.

Corollary 2.1. Equality holds iff $X, Y$ are jointly Gaussian.

Corollary 2.2. The lemma does not hold for general $X$ (not Gaussian).

The proof is given in the appendix. Note that $-\frac{1}{2}\log(1-\rho^2)$ is the mutual information of two Gaussian r.v.-s ([14], example 8.5.1). Also note the relation to Theorem 1 in [16], dealing with an additive channel with uncorrelated, but not necessarily independent, noise. The following lemma justifies our selection of $R(\rho) = -\frac{1}{2}\log(1-\rho^2)$:

Lemma 3. Let $Q(x)$ be an input prior, $W(y|x)$ be an unknown channel, $\Lambda(Q,W)$ be the correlation matrix $\Lambda \equiv E\left[\binom{X}{Y}\binom{X}{Y}^T\right]$ between $X, Y$ induced by the joint probability $Q \circ W$, and $\rho(Q,W)$ be the correlation factor induced by $Q, W$ ($\rho = \frac{\Lambda_{12}}{\sqrt{\Lambda_{11}\Lambda_{22}}}$). We say a function $R(\Lambda)$ is an attainable second order rate function if there exists a $Q(x)$ such that for every channel $W(y|x)$ inducing correlation $\Lambda$ the mutual information is at least $R(\Lambda)$ (in other words the channel can carry the rate $R(\Lambda)$). Then $R(\Lambda) = -\frac{1}{2}\log(1-\rho^2)$ is the largest attainable second order rate function. Alternatively this can be stated as:

$$R(\Lambda) \equiv \max_{Q}\ \min_{W:\ \Lambda(Q,W)=\Lambda} I(Q,W) = -\frac{1}{2}\log(1-\rho^2) \qquad (13)$$

Proof of Lemma 3: $R(\Lambda) = -\frac{1}{2}\log(1-\rho^2)$ is attainable by selecting an input prior $Q = \mathcal{N}(0,\sigma_x^2)$, and by Lemma 2 the mutual information is at least $R(\Lambda)$ for all channels. $R(\Lambda)$ is the maximum attainable function since, by writing the condition of the lemma for the additive white Gaussian noise (AWGN) channel $W^*$ (a specific choice of $W$) and any $Q$, we have $R(\Lambda) \leq I(Q, W^*) \leq I(\mathcal{N}(0, E_P(X^2)), W^*) = -\frac{1}{2}\log(1-\rho^2)$, where the inequalities follow from the conditions of the lemma on $R$ and from the fact that the Gaussian prior achieves the AWGN capacity. ∎

3) Communication scheme for the empirical channel (without feedback): The following theorem is the analogue of Theorem 1, where the expression $-\frac{1}{2}\log(1-\rho^2)$ (interpreted as the Gaussian effective mutual information) plays the role of the mutual information.

Theorem 2 (Non-adaptive, continuous channel). Given the channel $\mathbb{R} \to \mathbb{R}$, for every $P_e > 0$, $\delta > 0$, power $P > 0$ and rate $R > 0$ there exists $n$ large enough and a random encoder-decoder pair of rate $R$ over block size $n$, such that the distribution of the input sequence is $\mathbf{x} \sim \mathcal{N}^n(0,P)$ and the probability of error for any message given an input sequence $\mathbf{x}$ and output sequence $\mathbf{y}$ with empirical correlation $\hat\rho$ is not greater than $P_e$ if $R_{emp} = \frac{1}{2}\log\left(\frac{1}{1-\hat\rho^2}\right) > R + \delta$.

As before, the theorem will follow easily from the following lemma, proven in the appendix.

Lemma 4. Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ be two sequences, and $\hat\rho \equiv \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}$ be the empirical correlation factor. For any $\mathbf{y}$, the probability of $\mathbf{x}$ drawn according to $\mathcal{N}^n(0,\sigma_x^2)$ to have $|\hat\rho| \geq t$ is bounded by:

$$\Pr(|\hat\rho| \geq t) \leq 2\exp\left(-(n-1)R_2(t)\right) \qquad (14)$$

where

$$R_2(t) \equiv \frac{1}{2}\log\left(\frac{1}{1-t^2}\right) \qquad (15)$$

To prove Theorem 2, the codebook $\{\mathbf{x}_m\}_{m=1}^{\exp(nR)}$ is randomly generated by Gaussian i.i.d. selection of its $L = \exp(nR)\cdot n$ letters, and the common randomness $S \in \mathcal{X}^L$ is defined as the codebook itself and is distributed $\mathcal{N}^L(0,P)$. The encoder sends the $w$-th codeword, and the decoder uses a maximum empirical correlation decoder, i.e. chooses:

$$\hat{w} = \bar\phi(\mathbf{y}, \{\mathbf{x}_m\}) = \arg\max_m |\hat\rho(\mathbf{x}_m;\mathbf{y})| = \arg\max_m \frac{|\mathbf{x}_m^T\mathbf{y}|}{\|\mathbf{x}_m\|} \qquad (16)$$

where ties are broken arbitrarily. By Lemma 4, the probability of error is bounded by:

$$P_e^{(w)}(\mathbf{x}_w,\mathbf{y}) \leq \Pr\left(\bigcup_{m\neq w}\left\{|\hat\rho(\mathbf{x}_m;\mathbf{y})| \geq |\hat\rho(\mathbf{x}_w;\mathbf{y})|\right\}\right) \leq \exp(nR)\cdot 2\exp\left(-(n-1)R_2(\hat\rho(\mathbf{x}_w;\mathbf{y}))\right) = 2\exp(R)\cdot\exp\left(-(n-1)\left(R_2(\hat\rho) - R\right)\right) \qquad (17)$$

Choosing $n$ large enough so that $\frac{1}{n-1}\left(R + \log\frac{2}{P_e}\right) \leq \delta$ (where $P_e$ is from Theorem 2), we have that when $R_2(\hat\rho) > R + \delta$:

$$P_e^{(w)}(\mathbf{x},\mathbf{y}) \leq 2\exp(R)\cdot\exp\left(-(n-1)\delta\right) \leq P_e \qquad (18)$$

which proves the theorem. ∎

A note is due regarding the definition of $\hat\rho$ in singular cases where $\mathbf{x}$ or $\mathbf{y}$ are $\mathbf{0}$. The limit of $\hat\rho$ as $\mathbf{y} \to \mathbf{0}$ is undefined (the directional derivative may take any value in [0,1]), however for consistency we define $\hat\rho = 0$ when $\mathbf{y} = \mathbf{0}$. Since $\mathbf{x}$ is generated from a Gaussian distribution we do not worry about the event $\mathbf{x} = \mathbf{0}$, since the probability of this event is 0. It is worth spending a few words on the connections between the receivers used for the discrete and the continuous cases.
Since the mutual information between two Gaussian r.v.-s is $-\frac{1}{2}\log(1-\rho^2)$, one can think of this value as a measure of mutual information under Gaussian assumptions. Thus, using this metric as an effective mutual information, and since it is an increasing function of $|\rho|$, the MMI decoder becomes a maximum empirical correlation decoder. On the other hand, the receiver we used can be identified as the GLRT (generalized likelihood ratio test) for the AWGN channel $Y = \alpha X + \mathcal{N}(0,\sigma^2)$ with $\alpha$ an unknown parameter, resulting from maximizing the likelihood of the codeword and the channel simultaneously:

$$\hat{w} = \arg\max_m \max_\alpha \log\Pr(\mathbf{y}|\mathbf{x}_m;\alpha) = \arg\min_m \min_\alpha \|\mathbf{y} - \alpha\mathbf{x}_m\|^2 = \arg\max_m \frac{(\mathbf{x}_m^T\mathbf{y})^2}{\|\mathbf{x}_m\|^2} = \arg\max_m \hat\rho^2(\mathbf{x}_m,\mathbf{y}) \qquad (19)$$

The choice of the GLRT is motivated by considering the individual channel as an effective additive channel with unknown gain (as presented in section II), combined with the fact that Gaussian noise is the worst. For discrete memoryless channels it is easy to show that the GLRT (where the group of channels consists of all DMC-s) is synonymous with the MMI decoder (see [6]). Thus, we can identify the two decoders as GLRT decoders, or equivalently as variants of MMI decoders. In the sequel we sometimes use the term "empirical mutual information" in a broad sense that includes also the metric $-\frac{1}{2}\log(1-\hat\rho^2)$.

Regarding the receiver required to obtain the rates of Theorem 2, it is interesting to consider the simpler maximum projection receiver $\arg\max_m |\mathbf{x}_m^T\mathbf{y}|$. This receiver seems to differ from the maximum correlation receiver only in the term $\|\mathbf{x}_m\|$, which is nearly constant for large $n$ due to the law of large numbers. However, surprisingly, the maximum rate achievable with the projection receiver is only $\frac{1}{2}\hat\rho^2$, as can be shown by a simple calculation equivalent to Lemma 4 (simpler, since $z = \mathbf{x}^T\mathbf{y}$ is Gaussian). The reason is that when $\mathbf{x}$ is chosen independently of $\mathbf{y}$, a large value of the projection (a non-typical event) is usually created by a sequence with power significantly exceeding the average (another non-typical event). When one non-typical event occurs there is no reason to believe the sequence is typical in other senses, thus the approximation $\|\mathbf{x}_m\| \approx \sqrt{nP}$ is invalid. The correlation receiver normalizes by the power of $\mathbf{x}$ and compensates for this effect. An alternative receiver which yields the rates of Theorem 2, and is similar to the AEP receiver, looks for the codeword with the maximum absolute projection subject to the power being limited to $\frac{1}{n}\|\mathbf{x}_m\|^2 < P + \epsilon$. This can be shown by Sanov's theorem [10] or by using the Chernoff bound. The maximum correlation receiver was chosen because of its elegance and the simplicity of the proof of Lemma 4. Combining this lemma with the law of large numbers provides a simple proof for the achievability of the AWGN capacity ($\frac{1}{2}\log(1+\mathrm{SNR})$), which uses much simpler mechanics than the popular proofs based on the AEP or on error exponents. This receiver has the technical advantage, compared to the AEP receiver, that it does not declare an error for codewords whose power deviates from the nominal power. This technical advantage is important in the context of rateless decoding, since the power condition would need to be re-validated at each symbol, thus increasing its contribution to the overall error probability. Lapidoth [17] showed that the nearest neighbor receiver achieves a rate equal to the Gaussian capacity $\frac{1}{2}\log(1+P/N)$ over the additive channel $Y = X + V$ with an arbitrary noise distribution (with fixed noise power).
This result parallels the result that the random code capacity of the AVC $Y = X + V$ with a power constraint on $V$ equals the Gaussian capacity [18] (this stems directly from the characterization of the random code capacity of the AVC as $\max_{P_X(x)}\min_{P_S(s)} I(X;Y)$, cf. [10] Eq.(V.4)). Our result is stronger since it does not assume the channel is additive (nor any fixed behavior), but considering the former results it is not surprising, if one assumes (1) that any channel can be modeled as $Y = \alpha X + V$ with $V \perp X$, (2) that the dependence of $V$ on $X$ does not increase the error probability due to orthogonality (see [16]) and (3) that the loss from the single unknown parameter $\alpha$ is asymptotically small.

Another related result is Agarwal et al.'s [8] result that it is possible to communicate at a rate approaching the rate-distortion function $R_X(D)$ over an arbitrarily varying channel with unknown block-wise behavior satisfying a distortion constraint $\hat{E}d(\mathbf{x},\mathbf{y}) \leq D$ with high probability. This relation is further discussed in the proof of Lemma 1. Their result is similar to ours in that they define the rate in terms of the input and output alone. The result is similar to obtaining the rate function $R_{emp} \approx R_X(\hat{E}d(\mathbf{x},\mathbf{y}))$ in the sense of Theorems 1, 2. However their result is not tight even for Gaussian channels: for the Gaussian channel $Y = X + V$ with noise $V$ limited to power $N$ and the Gaussian prior $X \sim \mathcal{N}(0,P)$, this rate function equals $R_X(N) = \frac{1}{2}\log\frac{P}{N}$, which is smaller than this channel's capacity, whereas with Theorem 2 we would obtain $\frac{1}{2}\log\left(1+\frac{P}{N}\right)$. Agarwal's result is tight in the sense that this is the maximum rate that can be guaranteed given this distortion. There exists a channel with the same distortion $N$ whose capacity is only $\frac{1}{2}\log\frac{P}{N}$: the channel $Y = \alpha X + \beta V$ with $\alpha = \beta^2 = 1 - \frac{N}{P}$. The reason for the sub-optimality of the result is that the squared distance between the input and output, in contrast with the correlation factor, does not yield a tight representation of all memoryless linear Gaussian channels (in the sense of Lemma 3).

V. COMMUNICATION WITH FEEDBACK

A. Overview and background

In this section we present the rate-adaptive counterparts of Theorems 1, 2, and the scheme achieving them. The proof is delayed to the next section. The scheme we use in order to adaptively attain these rates is an iterated rateless coding scheme. In other words, in each iteration we send a fixed number of bits $K$, by transmitting symbols from an $n$-length codebook, until the receiver has enough information to decode. Then, the receiver sends an indication that the block is over and a new block starts. Before developing the details we give some background regarding the evolution of rateless codes, and the differences between the proposed techniques. The earliest work is by Burnashev [12], who showed that for known channels, using feedback and a random decision time (i.e. a decision time which depends on the channel output) yields an improved error exponent, which is attained by a 3-step protocol (best described in [11]) and shown to be optimal. Shulman [19] proposed to use random decision time as a means to deal with sending common information over broadcast channels (static broadcasting), and for unknown compound channels (which are treated as broadcast).
In this scheme, later described as "rateless coding" (or Incremental Redundancy Hybrid ARQ), a codebook of $\exp(K)$ infinite sequences is generated, and the sequence representing the message is sent to the receiver symbol by symbol, until the receiver decides to decode (and turn off, in the case of a broadcast channel). Tchamkerten and Telatar [11] connect the two results by showing that for some, but not all, compound channels the Burnashev error exponent can be attained universally using rateless coding and the 3-step protocol. Eswaran, Sarwate, Sahai and Gastpar [3] used iterated rateless coding to achieve the mutual information related to the empirical noise statistics on channels with individual noise sequences. The scheme we use here is most similar to the one used in [3] but less complicated. We do not use training symbols to learn the channel in order to decide on the decoding time, but rely on the mutual information itself as the criterion (based on Lemmas 1, 4), and the partitioning into blocks and the decision rules are simpler. The result in [3] is an extension of a result in [1] regarding the binary channel to general discrete channels with an individual noise sequence. The original result in [1] was obtained not by rateless codes but by a successive estimation scheme [20] which is a generalization of the Horstein [21] and Schalkwijk-Kailath [22] schemes. The same authors extend their results to discrete channels [2] using successive schemes (where the target rate is the capacity of the respective modulo-additive channel). The two concepts for achieving the empirical rates differ in various factors such as complexity and the amount of feedback and randomization required. The successive schemes require less common randomness but assume perfect feedback, while the schemes based on rateless coding require less (asymptotically zero rate) feedback but potentially more randomness. As noted, the technique we use here is similar to that of [3] in its high level structure, while the structure of the rateless decoder is similar to [19]'s (chapter 3). The application of this scheme to individual inputs and outputs and the extension to real-valued models requires proof, and especially issues such as the abnormal behavior of specific (e.g. last) symbols have to be treated carefully. The result of [3] cannot be applied directly to individual channels since the channel model cannot be extracted based on the input and output sequences alone, and in the latter both the model and the sequence are assumed to be fixed (over common randomness).

B. Statement of the main result

In this section we prove the following theorems, relating to the definitions given in section III-B:

Theorem 3 (Rate adaptive, discrete channels). Given discrete input and output alphabets $\mathcal{X}, \mathcal{Y}$, for every $P_e > 0$, $P_A > 0$, $\delta > 0$ and prior $Q(x)$ over $\mathcal{X}$ there is $n$ large enough and a random encoder and decoder with feedback and variable rate over block size $n$, with a subset $J \subset \mathcal{X}^n$, such that:
• The distribution of the input sequence is $\mathbf{x} \sim Q^n$, independently of the feedback and message
• The probability of error is smaller than $P_e$ for any $\mathbf{x}, \mathbf{y}$
• For any input sequence $\mathbf{x} \notin J$ and output sequence $\mathbf{y} \in \mathcal{Y}^n$ the rate is $R \geq \hat{I}(\mathbf{x};\mathbf{y}) - \delta$
• The probability of $J$ is bounded by $\Pr(\mathbf{x} \in J) \leq P_A$

Theorem 4 (Rate adaptive, continuous channels).
Given the channel $\mathbb{R} \to \mathbb{R}$, for every $P_e > 0$, $P_A > 0$, $\delta > 0$, $\bar{R} > 0$ and power $P > 0$ there is $n$ large enough and a random encoder and decoder with feedback and variable rate over block size $n$, with a subset $J \subset \mathbb{R}^n$, such that:
• The distribution of the input sequence is $\mathbf{x} \sim \mathcal{N}^n(0,P)$, independently of the feedback and message
• The probability of error is smaller than $P_e$ for any $\mathbf{x}, \mathbf{y}$
• For any input sequence $\mathbf{x} \notin J$ and output sequence $\mathbf{y} \in \mathbb{R}^n$ the rate is $R \geq \min\left[\frac{1}{2}\log\left(\frac{1}{1-\hat\rho(\mathbf{x},\mathbf{y})^2}\right) - \delta,\ \bar{R}\right]$
• The probability of $J$ is bounded by $\Pr(\mathbf{x} \in J) \leq P_A$

Note that in the last theorem we do not have uniform convergence of the rate function in $\mathbf{x}, \mathbf{y}$. Unfortunately our scheme is limited by having a maximum rate for each $n$, and although the maximum rate tends to infinity as $n \to \infty$, we cannot guarantee uniform convergence for each $n$ in the continuous case, where the target rate may be unbounded. The rates in the theorems are the minimal rates, and in certain conditions (e.g. a channel varying in time) higher rates may be achieved by the scheme proposed below.

Regarding the set $J$: as we shall see in the sequel there are some sequences for which a poor rate is obtained, and since we committed to an input distribution we cannot avoid them (one example is the sequence of $\frac{1}{2}n$ zeros followed by $\frac{1}{2}n$ ones, in which at most one block will be sent). However there is an important distinction between claiming, for example, that "for each $\mathbf{y}$ the probability of $R < R_{emp}$ is at most $P_A$" and the claim made in the theorems that "$R < R_{emp}$ only when $\mathbf{x}$ belongs to a subset $J$ with probability at most $P_A$". The first claim is weaker since a smartly chosen $\mathbf{y}$ may increase the probability (see figure 4). This is avoided in the second claim. A consequence of this definition is that the probability of $R < R_{emp}$ is bounded by $P_A$ for any conditional probability $\Pr(\mathbf{y}|\mathbf{x})$ on the sequences. This issue is further discussed in section VI-A. Note that the probability $P_A$ could be absorbed into $P_e$ by a simple trick, but this seems to make the theorems less insightful. After reception the receiver knows the input sequence with probability of at least $1 - P_e$ and may calculate the empirical mutual information $\hat{I}(\mathbf{x};\mathbf{y})$. If the rate achieved by the scheme we will describe later falls short of $\hat{I}(\mathbf{x};\mathbf{y})$, it may declare a rate of $R = \hat{I}(\mathbf{x};\mathbf{y})$ (which will most likely result in a decoding error). This way the receiver will never declare a rate which is lower than $\hat{I}(\mathbf{x};\mathbf{y})$ unless there is an error, and we could avoid the restriction $\mathbf{x} \notin J$ required for achieving $R_{emp}$, but on the other hand, the error probability becomes conditioned on the set $J$. The question whether the set $J$ itself is truly necessary (i.e. is it possible to attain the above theorems with $J = \emptyset$) is still open.

Figure 3 illustrates the lower bound for $R_{emp}$ presented by Theorem 4 ($R_{LB2}$), as well as a (higher) lower bound $R_{LB1}$ for the rate achieved by the proposed scheme (see section VI-C2, Eq.(65)). The parameters generating these curves appear in table III in the appendix.

Fig. 3. Illustration of the $R_{emp}$ lower bound of Theorem 4 ($R_{LB2}$) and the lower bound $R_{LB1}$ shown in the proof in section VI-C2, as a function of $\rho$ (top) and of the effective $\mathrm{SNR} = \frac{\rho^2}{1-\rho^2}$ (bottom). Parameters appear in table III in the appendix.

We prove the two theorems together. First we define the scheme, and in the next section we analyze its error performance and rate and show it achieves the promise of the theorems. Throughout this section and the following one we use $n$ to denote the length of a complete transmission, and $m$ to denote the length of a single block.

C. A proposed rate adaptive scheme

The following communication scheme sends $B$ indices from $\{1,\ldots,M\}$ over $n$ channel uses (or equivalently sends the number $\theta \in [0,1)$ in resolution $M^{-B}$), where $M$ is fixed and $B$ varies according to the empirical channel behavior. The building block is a rateless transmission of one of $M$ codewords ($K \equiv \log(M)$ information units), which is iterated until the $n$-th symbol is reached. The transmit distribution $Q$ is an arbitrary distribution for the discrete case and $Q = \mathcal{N}(0,P)$ for the continuous case. We define the decoding metric as the empirical rate:

$$R_{emp}(\mathbf{x},\mathbf{y}) \equiv \begin{cases} \hat{I}(\mathbf{x};\mathbf{y}) & \text{discrete} \\ \frac{1}{2}\log\left(\frac{1}{1-\hat\rho^2(\mathbf{x},\mathbf{y})}\right) & \text{continuous} \end{cases} \qquad (20)$$

The codebook $C_{M\times n}$ consists of $M$ codewords of length $n$, where all $M \times n$ symbols are drawn i.i.d. $\sim Q$ and known to the sender and receiver. For brevity of notation we write $R_{emp}^m(\mathbf{x},\mathbf{y})$ instead of $R_{emp}(\mathbf{x}_1^m,\mathbf{y}_1^m)$. $k$ denotes the absolute time index, $1 \leq k \leq n$. Block $b$ starts from index $k_b$, where $k_1 = 1$. $m = k - k_b + 1$ denotes the time index inside the current block. In each rateless block $b = 1, 2, \ldots$, a new index $i = i_b \in \{1,\ldots,M\}$ is sent to the receiver using the following procedure (a simulation sketch of this procedure appears at the end of this subsection):

1) The encoder sends index $i$ by sending the symbols of codeword $i$:
$$x_k = C_{i,k} \qquad (21)$$
Note that different blocks use different symbols of the codebook.
2) The encoder keeps sending symbols and incrementing $k$ until the decoder announces the end of the block through the feedback link.
3) The decoder announces the end of the block after symbol $m$ in the block if for any codeword $\mathbf{x}_i$:
$$R_{emp}^m(\mathbf{x}_i,\mathbf{y}) \equiv R_{emp}\left((\mathbf{x}_i)_{k_b}^{k},\ \mathbf{y}_{k_b}^{k}\right) \geq \mu_m^* \qquad (22)$$
where $\mu_m^*$ is a fixed threshold per symbol defined in Eq.(23) below.
4) When the end of the block is announced, one of the $i$ fulfilling Eq.(22) is determined as the index of the decoded codeword $\hat{i}_b$ (breaking ties arbitrarily).
5) Otherwise the transmission continues, until the $n$-th symbol is reached. If symbol $n$ is reached without fulfilling Eq.(22), then the last block is terminated without decoding.

After a block ends, $b$ is incremented and if $k < n$ a new block starts at symbol $k_b = k + 1$. After symbol $n$ is reached the transmission stops, and the number of blocks sent is $B = b - 1$. The threshold $\mu_m^*$ is defined as:

$$\mu_m^* = \frac{K + \log\left(\frac{n}{P_e}\right)}{m-s} + \delta_m = \begin{cases} \frac{K + \log\left(\frac{n}{P_e}\right) + |\mathcal{X}||\mathcal{Y}|\log(m+1)}{m} & \text{discrete} \\ \frac{K + \log\left(\frac{2n}{P_e}\right)}{m-1} & \text{continuous} \end{cases} \qquad (23)$$

where $s = 0$ for the discrete case and $1$ for the continuous case, and $\delta_m$ is defined in Lemma 1 for the discrete case and equals $\frac{\log(2)}{m-1}$ for the continuous case. The threshold $\mu_m^*$ is tailored to achieve the designated error probability and is composed of 3 parts. The first part requires that the empirical rate $R_{emp}$ approximately equal the transmission rate of the block, $\frac{K}{m}$, which guarantees there is approximately enough mutual information to send $K$ information units. The second part is an offset responsible for guaranteeing an error probability bounded by $P_e$ over all the blocks in the transmission. The third part, $\delta_m$, compensates for the overhead terms in Lemmas 1, 4. The scheme achieves the claims of Theorems 3, 4 with a proper choice of the parameters (discussed in section VI-C). Note that the scheme uses a feedback rate of 1 bit/use, however it is easy to show that any positive feedback rate is sufficient (see section VI-C), therefore we can claim the theorems hold with "zero rate" feedback.
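The following is a minimal simulation sketch of the scheme above for the continuous case (our illustration, not the authors' reference code): a Gaussian codebook, the correlation-based decoding metric of Eq.(20), and the threshold $\mu_m^*$ of Eq.(23) with $s=1$. Perfect feedback and small illustrative parameters ($M$, $n$, $P_e$) are assumed, and all identifiers are our own.

```python
# Sketch of the iterated rateless scheme, continuous case, under the stated assumptions.
import numpy as np

def remp_cont(x, y):
    """Correlation-based empirical rate (in nats) over one rateless block, Eq.(20)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0.0 or ny == 0.0:
        return 0.0
    rho2 = min((x @ y / (nx * ny)) ** 2, 1.0 - 1e-12)   # clip to keep the log finite
    return 0.5 * np.log(1.0 / (1.0 - rho2))

def run_scheme(channel, n=5_000, M=64, P=1.0, Pe=1e-3, seed=0):
    """Iterated rateless transmission with the threshold mu*_m of Eq.(23)."""
    rng = np.random.default_rng(seed)
    K = np.log(M)                                       # information units (nats) per block
    C = rng.normal(0.0, np.sqrt(P), size=(M, n))        # common randomness: the whole codebook
    sent = rng.integers(0, M, size=n)                   # indices i_b to be sent, one per block
    decoded, k, b = [], 0, 0
    while k < n:
        kb, i = k, sent[b]
        y_blk = []
        while k < n:
            y_blk.append(channel(C[i, k]))              # encoder sends the next symbol of codeword i
            k += 1
            m = k - kb
            if m < 2:
                continue
            mu_star = (K + np.log(2.0 * n / Pe)) / (m - 1)          # Eq.(23), s = 1
            yb = np.array(y_blk)
            metrics = [remp_cont(C[j, kb:k], yb) for j in range(M)]
            if max(metrics) >= mu_star:                 # decoder ends the block via feedback
                decoded.append(int(np.argmax(metrics)))
                break
        b += 1
    errors = sum(d != sent[idx] for idx, d in enumerate(decoded))
    return len(decoded) * K / n, errors                 # (delivered rate in nats/use, block errors)

# Example: an AWGN channel the scheme knows nothing about. The delivered rate is positive and,
# as n and K grow, approaches the empirical rate 0.5*log(1 + SNR); block errors should be rare.
noise = np.random.default_rng(1)
print(run_scheme(lambda x: x + noise.normal(0.0, 1.0)))
```

For these small parameters the overhead terms in $\mu_m^*$ dominate, so the delivered rate sits well below $\frac{1}{2}\log(1+\mathrm{SNR})$; this matches the discussion of the three parts of the threshold above and the asymptotic nature of Theorems 3, 4.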
We devote the next section to the analysis of the error probability and rate of the scheme, showing that it attains Theorems 3, 4. Unfortunately, although the scheme is simple, the current analysis we have is somewhat cumbersome.

VI. PROOF OF THE MAIN RESULT

In this section we analyze the adaptive rate scheme presented above and show it achieves Theorems 3, 4. Before analyzing the scheme we develop some general results pertaining to the convexity of the mutual information and correlation factors over sub-vectors. The proof of the error probability is simple (based on the construction of $\mu_m^*$) and common to the two cases. The proof of the achieved rate is more complex and is performed separately for each case.

A. Preliminaries

1) Likely convexity of the mutual information: A property which would be useful for the analysis is $\cup$-convexity of the empirical mutual information with respect to joint empirical distributions $\hat{P}_{(\mathbf{x},\mathbf{y})}(x,y)$ measured over different sub-vectors, so for example we would like to have for $0 \leq m \leq n$:

$$\hat{I}(\mathbf{x}_1^n;\mathbf{y}_1^n) \leq \frac{m}{n}\cdot\hat{I}(\mathbf{x}_1^m;\mathbf{y}_1^m) + \left(1-\frac{m}{n}\right)\cdot\hat{I}(\mathbf{x}_{m+1}^n;\mathbf{y}_{m+1}^n) \qquad (24)$$

which would guarantee that if we achieve a rate equal to the empirical mutual information over the two sections $0 \leq k \leq m$ and $m < k \leq n$, then we would achieve the empirical mutual information over the entire vector $0 \leq k \leq n$. However this property does not hold in general, since the mutual information is not convex with respect to the joint distribution. The mutual information $I(P,W)$ is known to be convex ($\cup$) with respect to $W$ and concave ($\cap$) with respect to $P$, so if, for example, the conditional distributions over the sections $[1,m]$ and $[m+1,n]$ are equal and only the distribution of $\mathbf{x}$ differs, the condition would in general not hold. On the other hand, should the empirical distributions of $\mathbf{x}_1^m$ and $\mathbf{x}_{m+1}^n$ be equal, then the empirical mutual information expressions appearing in Eq.(24) would differ only in the conditional distributions of $\mathbf{y}$ w.r.t. $\mathbf{x}$ and the assertion would hold. Since we generate $\mathbf{x}$ by i.i.d. drawing of its elements, the empirical distributions converge to the prior $Q$, and we would expect that if the sizes of both regions, $m$ and $n-m$, are large enough, the convexity would hold up to a fraction $\epsilon$ with high probability. We show below that such convexity holds under even milder conditions. The cases in which this approximate convexity is used later on can serve as examples of the difference between the individual model used here and probabilistic models (including the individual noise sequence). We use the lemma to:
1) Bound the loss due to insufficient utilization of the last symbol in each rateless encoding block.
2) Bound the loss due to not completing the last rateless encoding block.
3) Show that the average rate (empirical mutual information) over multiple blocks equals at least the mutual information measured over the blocks together.
Had the rate been averaged over multiple sequences $\mathbf{x}$ rather than obtained for a specific sequence, the regular convexity of the mutual information with respect to the channel distribution would have been sufficient. The property is formalized in the following lemma:

Lemma 5 (Likely convexity of mutual information). Let $\{A_i\}_{i=1}^p$ define a disjoint partitioning of the index set $\{1,\ldots,n\}$, i.e. $\bigcup_i A_i = \{1,\ldots,n\}$ and $A_i \cap A_j = \emptyset$ for $i \neq j$. $\mathbf{x}, \mathbf{y}$ are $n$-length sequences, and $\mathbf{x}_A, \mathbf{y}_A$ define the subsequences of $\mathbf{x}, \mathbf{y}$ (resp.) over the index set $A$. Let the elements of $\mathbf{x}$ be chosen i.i.d. with distribution $Q$. Then for any $\Delta > 0$ there is a subset $J_\Delta \subset \mathcal{X}^n$ such that:

$$\forall \mathbf{x} \notin J_\Delta,\ \mathbf{y} \in \mathcal{Y}^n:\quad \sum_{i=1}^p \frac{|A_i|}{n}\,\hat{I}(\mathbf{x}_{A_i};\mathbf{y}_{A_i}) \geq \hat{I}(\mathbf{x};\mathbf{y}) - \Delta \qquad (25)$$

and

$$Q^n\{J_\Delta\} \leq \exp\left(-n\left(\Delta - \tilde\delta_n\right)\right) \qquad (26)$$

with $\tilde\delta_n = p|\mathcal{X}|\cdot\frac{\log(n+1)}{n} \to 0$.

The lemma does not claim that convexity holds with high probability, but rather that any positive deviation from convexity may happen only on a subset of $\mathbf{x}$ with vanishing probability. It is surprising that the bound does not depend on $\mathbf{y}$, $Q$ and the sizes of the subsets, and only weakly depends on the number of subsets.

Before proving the lemma we emphasize a delicate point: the lemma does not only claim that for each $\mathbf{y}$ the probability of deviation from convexity is small, but makes the stronger claim that, apart from a subset of the $\mathbf{x}$ sequences with vanishing probability, convexity always holds, independently of $\mathbf{y}$. This distinction is important since this lemma defines a set of "bad" input sequences that fail our scheme. In these sequences there exists a partitioning that yields an excessive deviation from the distribution $Q$ between rateless blocks. As an example of such a sequence consider the binary channel and the input sequence $0^{n/2}1^{n/2}$ ($n/2$ zeros followed by $n/2$ ones). This sequence is bad since it guarantees that on one hand at most one block will be received (since at most one block includes both 0-s and 1-s at the input), but on the other hand the zero order empirical input distribution is good ($\mathrm{Ber}(\frac{1}{2})$), so potentially we have the combination of high empirical mutual information with low communication rate. The sequences that deviate from convexity are a function of the output $\mathbf{y}$. Had we only bounded the probability of deviation from convexity for each $\mathbf{y}$ individually, then a potential adversary could have increased this probability by determining $\mathbf{y}$ (given $\mathbf{x}$) such that $\mathbf{x}$ will be a bad sequence with respect to this $\mathbf{y}$. To avoid this, we claim that there is a fixed group of $\mathbf{x}$ such that if the sequence is not in the group, approximate convexity holds regardless of $\mathbf{y}$. This is illustrated in fig.(4), where the dark spots mark the pairs $(\mathbf{x},\mathbf{y})$ for which convexity does not hold.

Fig. 4. Illustration of bad sequences and lemma 5.

Proof of lemma 5: Define the vector $\mathbf{u}$ denoting the subset number of each element, $u_k = i\ \forall k \in A_i$. Then $\hat{I}(\mathbf{x}_{A_i};\mathbf{y}_{A_i}) = \hat{I}(\mathbf{x};\mathbf{y}|u=i)$, and $\hat{P}_u(i) = \frac{|A_i|}{n}$, therefore we can write the weighted sum of empirical mutual informations over the partitions as a conditional empirical mutual information:

$$\sum_{i=1}^p \frac{|A_i|}{n}\,\hat{I}(\mathbf{x}_{A_i};\mathbf{y}_{A_i}) = \sum_{i=1}^p \hat{P}_u(i)\,\hat{I}(\mathbf{x};\mathbf{y}|u=i) = \hat{I}(\mathbf{x};\mathbf{y}|\mathbf{u}) \qquad (27)$$

Using the chain rule for mutual information (see [14] section 2.5):

$$\hat{I}(\mathbf{x};\mathbf{y}) - \hat{I}(\mathbf{x};\mathbf{y}|\mathbf{u}) = \hat{I}(\mathbf{x};\mathbf{y}) - \left(\hat{I}(\mathbf{x};\mathbf{y}\mathbf{u}) - \hat{I}(\mathbf{x};\mathbf{u})\right) = \hat{I}(\mathbf{x};\mathbf{u}) - \hat{I}(\mathbf{x};\mathbf{u}|\mathbf{y}) \leq \hat{I}(\mathbf{x};\mathbf{u}) \qquad (28)$$

Define the set $J_\Delta = \{\mathbf{x}: \hat{I}(\mathbf{x};\mathbf{u}) > \Delta\}$; then

$$\forall \mathbf{x} \notin J_\Delta,\ \mathbf{y}:\quad \hat{I}(\mathbf{x};\mathbf{y}) - \hat{I}(\mathbf{x};\mathbf{y}|\mathbf{u}) \leq \hat{I}(\mathbf{x};\mathbf{u}) \leq \Delta \qquad (29)$$

And since $\mathbf{x}$ is chosen i.i.d. and $\mathbf{u}$ is a fixed vector, we have from Lemma 1:

$$\Pr(\mathbf{x} \in J_\Delta) \leq \exp\left(-n\left(\Delta - \tilde\delta_n\right)\right) \qquad (30)$$

with $\tilde\delta_n = |\mathcal{X}|\,|\{1,\ldots,p\}|\,\frac{\log(n+1)}{n}$. ∎

Note that if the distribution of $\mathbf{x}$ is the same over all partitions, then $\hat{H}(\mathbf{x}|\mathbf{u}) = \hat{H}(\mathbf{x})$, therefore $\hat{I}(\mathbf{x};\mathbf{u}) = 0$ and the empirical mutual information will be truly convex.
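As a small numerical illustration of Lemma 5 (a sketch with our own parameter choices, not part of the proof), the following Python snippet contrasts the "bad" input $0^{n/2}1^{n/2}$ discussed above, for which the per-block empirical mutual information collapses to zero, with a typical i.i.d. input, for which the deficit in Eq.(25) is negligible.

```python
# Illustration of Lemma 5: the bad sequence 0^(n/2) 1^(n/2) versus an i.i.d. input.
import numpy as np
from collections import Counter

def emp_mi(x, y):
    """Zero-order empirical mutual information of two equal-length sequences, in bits."""
    n = len(x)
    Pxy, Px, Py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * np.log2(c * n / (Px[a] * Py[b])) for (a, b), c in Pxy.items())

n = 4_000
half = n // 2
A1, A2 = np.arange(half), np.arange(half, n)            # a fixed partition into two blocks

# The bad sequence from the text: each block sees a constant input, so block-wise MI is zero,
# while the overall empirical MI can be as high as 1 bit; such x belong to the set J_Delta.
x_bad = np.array([0] * half + [1] * half)
y = x_bad.copy()                                         # an output that happens to follow x exactly
whole = emp_mi(x_bad, y)                                 # = 1 bit
split = 0.5 * emp_mi(x_bad[A1], y[A1]) + 0.5 * emp_mi(x_bad[A2], y[A2])   # = 0 bits
print(whole, split)

# For x drawn i.i.d. Ber(1/2), as the scheme requires, such a deficit is exponentially unlikely:
x = np.random.default_rng(0).integers(0, 2, n)
print(emp_mi(x, y) - 0.5 * emp_mi(x[A1], y[A1]) - 0.5 * emp_mi(x[A2], y[A2]))   # ~ 0
```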
2) Likely convexity of the correlation factor: For the continuous case we use the following property, which somewhat parallels Lemma 5. The reasons for not following the same path as in the discrete case are explained in the sequel (subsection VI-C). Unfortunately the proof is rather technical and less elegant, and is therefore deferred to the appendix (appendix E). Note that, again, the bound does not depend on the sizes of the subsets.

Lemma 6 (Likely convexity of ρ̂²). Define {A_i}_{i=1}^p as in Lemma 5. Let x, y be n-length sequences and define the correlation factors of the sub-sequences and the overall correlation factor as

\hat{\rho}_i = \frac{|x_{A_i}^T y_{A_i}|}{\|x_{A_i}\|\cdot\|y_{A_i}\|} \quad \text{and} \quad \hat{\rho} = \frac{|x^T y|}{\|x\|\cdot\|y\|} \qquad (31)

respectively. Let x be drawn i.i.d. from a Gaussian distribution x ∼ N(0, P). Then for any 0 < ∆ ≤ 1/7 there is a subset J_∆ ⊂ R^n such that

\forall x \notin J_\Delta,\ y \in \mathbb{R}^n: \quad \sum_{i=1}^{p} \frac{|A_i|}{n}\,\hat{\rho}_i^2 \ge \hat{\rho}^2 - \Delta \qquad (32)

and

\Pr\{x \in J_\Delta\} \le 2^p e^{-n\Delta^2/8} \qquad (33)

That is, outside a set of small probability, the weighted mean of the squared correlation factors over the sub-sequences does not fall considerably below the overall squared correlation factor.

3) Likely convexity with dependencies: The likely-convexity properties defined in the previous subsections pertain to the case where the partition of the n-block is fixed and x is drawn i.i.d. However, in the transmission scheme we described the partition varies in a way that depends on the value of x (through the decoding decisions and the empirical mutual information), which may, in general, change the probability that the convexity property holds with a given ∆. Although it stands to reason that the variability of the block sizes in the decoding process reduces the probability of deviating from convexity, since it tends to equalize the amount of mutual information in each rateless block, for the analysis we assume an arbitrary dependence, and allow the size of the set J to increase by a factor equal to the number of possible partitions, as explained below.

Denote a partition by π = {A_i}_{i=1}^p (as defined in Lemmas 5, 6) and the set of all possible partitions (for a given encoder-decoder) by Π. For each partition π, Lemmas 5, 6 give a subset J(π) with probability bounded by p_J outside which approximate convexity (as defined in the lemmas) holds. Then approximate convexity is guaranteed to hold for x ∉ J ≡ ∪_{π∈Π} J(π), where the probability of the set J is bounded by the union bound:

\Pr(x \in J) = \Pr\Big(\bigcup_{\pi\in\Pi}\{x \in J(\pi)\}\Big) \le |\Pi| \cdot p_J \qquad (34)

Now we bound the number of partitions. In the two cases we deal with in section VI-C the number of subsets can be bounded by p_max, and all subsets but one consist of consecutive indices. Therefore the partition is completely defined by the start and end indices of p_max − 1 subsets (allowed to overlap if there are fewer than p_max subsets), thus |Π| ≤ n^{2 p_max − 2} < n^{2 p_max}, and we have

\Pr(J) \le n^{2 p_{\max}} \cdot p_J = \exp(2 p_{\max} \log(n)) \cdot p_J \qquad (35)

where p_J is defined in the previous lemmas. So for our purposes we may say that these lemmas hold even if the partition depends on x, with an appropriate change in the probability of J.

B. Error probability analysis

In this subsection we show that the probability of decoding any of the B indices incorrectly is smaller than Pe. With Remp defined in Eq.(20), we have from Lemma 4 that, under the conditions of the lemma, Pr(Remp ≥ t) = Pr(|ρ̂| ≥ R_2^{−1}(t)) ≤ 2 exp(−(n − 1)t). Then, combining Lemmas 1 and 4, we may say that for any y_1^m the probability of x_1^m generated i.i.d.
from the relevant prior to have Remp ≥ t is bounded by: m Qm (Remp (xm 1 , y1 ) ≥ t) ≤ exp (−(m − s)(t − δm )) (36) where δm = ( |X ||Y| log(m+1) m log 2 m−1 And s=  discrete continuous 0 discrete 1 continuous (37) (38) An error might occur if at any symbol 1 ≤ k ≤ n an incorrect codeword meets the termination condition Eq.(23). The probability that codeword j 6= i meets Eq.(23) at a specific symbol k which is the m-th symbol of a rateless block is bounded by: m Pr(Remp (xj , y) ≥ µ∗m ) ≤ exp (−(m − s)(µ∗m − δm )) =     n Pe Pe = exp − K + log = = (39) Pe n exp(K) Mn The probability of any erroneous codeword to meet the threshold at any symbol is bounded by the union bound:   n [  [ (µm (xj , y) ≥ µ∗m ) ≤ Pr(error) ≤ Pr   k=1 j6=i Pe < Pe (40) Mn The first inequality is since the correct codeword might be decoded even if an erroneous codeword met the threshold. Although the index m in the expression above depends on k and the specific sequences x, y in an unspecified way, the assertion is true since the probability of the event in the union has an upper bound independent of m. ≤ n(M − 1) C. Rate analysis Roughly speaking, since µ∗m ≈ K m , if no error occurs, the m correct codeword crossed the threshold when Remp (xi , y) ≈ K therefore the rate achieved over a rateless block is Rb = m K m ≈ R (x , y), and due to the approximate convexity by emp i m achieving the above rate on each block separately we meet or exceed the rate Remp (x, y) over the entire transmission. However in a detailed analysis we have the following sources of rate loss: 1) The offsets inserted in µ∗m to meet the desired error probability 2) The offset from convexity (Lemma 5) introduced by the slight differences in empirical distribution of x between the blocks 3) Unused symbols: a) The last symbol of each block is not fully utilized, as explained below b) The last (unfinished) block is not utilized Regarding the last symbol of each block, note that after receiving the previous symbol the empirical mutual information is below the threshold, and at the last symbol it meets or exceeds the threshold. However the proposed scheme does not gain additional rate from the difference between the mutual information and the threshold, and thus it loses with respect to its target (the mutual information over the block) when this difference is large. Here a ”good” channel works adversely to our worse. Since we operate under an individual channel regime, the increase of the mutual information at the last symbol is not bounded to the average information contents of a single symbol. This is especially evident in the continuous case where the empirical mutual information is unbounded. A high value of y together with high value of x at the last symbol causes an unbounded increase in Remp : if we choose xm , ym → ∞ then ρ → 1 regardless of the history x1m−1 , y1m−1 . Therefore over a single block we might have an arbitrarily low rate (|ρ̂| is small over the m − 1 first symbols) and arbitrarily large Remp . In the discrete case this phenomenon exists but is less accented (consider for example the sequences x = y = 0n−1 1 = (0, . . . , 0, 1)) Similarly regarding the last block, the fact that the length of the block may be bounded does not mean the increase in the empirical mutual information can be bounded as well. 
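As a numerical illustration of the last-symbol effect (our own sketch; the block length and the spike value are arbitrary), the following Python snippet shows the empirical rate function of the continuous case jumping when a single large aligned sample is appended to two otherwise unrelated sequences:

# Illustration (not from the paper) of the last-symbol effect: one large
# (x_m, y_m) pair pushes the empirical correlation, and hence R_emp, up sharply.
import numpy as np

def rho_hat(x, y):
    return abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def r_emp(x, y):
    return -0.5 * np.log(1.0 - rho_hat(x, y) ** 2)   # nats per symbol

rng = np.random.default_rng(1)
m = 100
x = rng.normal(0, 1, m)
y = rng.normal(0, 1, m)          # y unrelated to x: rho_hat is near 0
print(r_emp(x[:-1], y[:-1]))     # small empirical rate over the first m-1 symbols
x[-1] = y[-1] = 1e4              # one huge aligned last sample
print(r_emp(x, y))               # R_emp jumps, and grows without bound with the spike

This is why the analysis below credits each block only up to its last symbol and bounds the residual terms separately.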
We use the 13 TABLE I S UMMARY OF DEFINITIONS AND REFERENCES FOR THE DISCRETE AND CONTINUOUS CASES Item Input distribution Discrete case Any Q Continuous case Q = N (0, P ) Decoding metric Decoder ˆ y) Remp (x, y) ≡ I(x, maximize Remp (x, y) ⇔ maximize ˆ y) I(x, ≤ exp(−n(t − δn )) (Lemma 1) Pp ˆ ˆ i=1 λi I(xi ; yi ) ≥ I(x; y)−∆ (Lemma 5) ”” “ “ ” “ Remp (x, y) ≡ 12 log 1−ρ̂21(x,y) maximize Remp (x, y) ⇔ maximize |ρ̂(x, y)| ≤ 2 exp(−(n − 1)t) (Lemma 4) Pp 2 2 i=1 λi ρ̂i ≥ ρ̂ − ∆ (Lemma 6) Pairwise error probability Pr(Remp ≥ t) Likely convexity condition (∀x 6∈ J∆ , y ∈ 1 Y n with λi ≡ n |Ai |) Likely convexity probability (Pr(x 6∈ J∆ ), fixed partitioning) ≥ 1 − exp −n ∆ − δ̃n approximate convexity (Lemma 5) to show the last two losses are bounded for most x sequences. Note that by the same argument that shows the loss from not utilizing the last symbol vanishes asymptotically, it is easy to show that feeding back the block success information only once every 1/ǫ symbols thereby decreasing the feedback rate to ǫ does not decrease the asymptotical rate, since this is equivalent to having 1/ǫ unused symbols instead of one. Hence the scheme can be modified to operate with ”zero rate” feedback. Similarly the scheme can operate with a noisy feedback channel by introducing in the feedback link a delay suitable to convey the decoder decisions with sufficiently low error rate over the noisy channel. In addition to having rate losses the scheme also has a minimal rate and a maximal rate for each block length. The minimal rate is K n resulting from sending a single block. If channel conditions are worse (Remp < K n ), no information will be sent. A maximal rate exists since at best K information units could be sent every 2 symbols (since for the continuous 1 case µ∗1 = ∞ and for the discrete case Remp (x, y) = 0 thus the decoding never terminates at the first symbol of the block), hence the maximum rate is K 2 . As n → ∞ we increase K so that the minimum rate (and the rate offsets) tend to 0 and the maximum rate tends to ∞. The maximum rate is the reason that the scheme cannot approach the target rate Remp (xi , y) uniformly in x, y in the continuous case, since for some pairs of sequences the target rate (which is unbounded) may be much higher than the maximum rate. The rate R̄ that we achieve in the proof of Theorem 4 is much smaller than the absolute maximum K 2 . Note that successive schemes (such as Schalkwijk’s [22]) do not suffer from the problem of maximum rate. For the discrete case the target rate is bounded by max(|X |, |Y|) therefore for sufficiently large n the maximal rate K 2 exceeds max(|X |, |Y|) and we are able to show uniform convergence. Although our target is the empirical mutual information over the n-block, an artifact of the partitioning to smaller blocks is that higher rates can be attained when the empirical conditional channel distribution varies over time, since by the convexity of mutual information with respect to the channel law the convex sum of mutual information over blocks exceeds the overall mutual information if these are not constant. We now turn to prove the achieved rate. The total amount of information sent (with or without error) is B · K therefore ≥ 1 − 2p e−n∆ the actual rate is Ract = 2 /8 BK n (41) We now endeavor to show this rate is close to or better than the empirical mutual information in probability of at least PA over the sequences x, regardless of y and of whether a decoding error occurred. The following definition of index sets in {1, . . . 
, n} is used for both the discrete and the continuous cases: Ub = kb+1 −2 {k}k=k denotes the channel uses of block b except the b last one, L0 collects the last channel uses of all the blocks L0 = {kb − 1 : b > 1}, and UB+1 denotes the indices of the un-decoded (last) block UB+1 = {k}nk=kB+1 (including its last symbol), and is an empty set if the last block is decoded. The sets {Ub }B+1 b=1 , L0 are disjoint and their union is {1, . . . , n}. We denote the length of each block not including the last symbol by mb ≡ |Ub |. From this point on we split the discussion and we start with the discrete case which is simpler. 1) Rate analysis for the discrete case: We write µ∗m as K+∆ ∗ m ≤ m µ with µm = K+∆ m ∆m    n n +mδm = log +|X ||Y| log(m+1) ≤ = log Pe Pe   n ≤ log + |X ||Y| log(n + 1) ≡ ∆µ (42) Pe  From Lemma 5 and Eq.(35) we have that the following equation: B+1 X  |L | mb ˆ 0 ˆ I(xBb ; yBb ) + I(xL0 ; yL0 ) n n b=1 (43) is satisfied when  x is outside  a set J∆ with probability of at most exp −n ∆ − δ̃n where δ̃n = (B + 2)|X | · ˆ y) − ∆ ≤ I(x; log(n+1) n + 2Bmax log(n) n . We shall find Bmax later on. To make sure the probability of J is less than PA we require  exp −n ∆ − δ̃n ≤ PA therefore 1 log (PA ) = n log(n + 1) log(n) 1 = (B + 2)|X | · + 2Bmax − log (PA ) n n n (44) ∆ ≥ δ̃n − 14 and we choose 1 log(n + 1) − log (PA ) (45) n n We now bound each element of Eq.(43). Consider block b with mb + 1 symbols. At the last symbol before decoding (symbol mb ≡ |Ub |) none of the codewords, including the correct one crosses the threshold µ∗m , therefore: ∆ = (3Bmax + 2)|X | · K + ∆mb ˆ U ; yU ) > I(x (46) b b mb Specifically for the unfinished block we have at symbol n: µ∗mb = µ∗mB+1 = K + ∆mB+1 ˆ U ; yU ) > I(x B+1 B+1 mB+1 (47) The way to understand these bounds is as guarantee on the shortness of the blocks given sufficient mutual information. On the other hand, at the end of each block including the last symbol (symbols (kb , kb + mb )), since one of the sequences was decoded we have: K + ∆mb +1 ≤ µ∗mb +1 =  mb + 1  k +m ≤ max Iˆ (xi )kbb b ; ykkbb +mb ≤ log min(|X |, |Y|) ≡ h0 i (48) Which we can use to bound the number of blocks, since mb + 1 ≥ hK0 therefore B≤ B  X h0  h0 · n (mb + 1) ≤ ≡ Bmax K K b=1 (49) As for the unused last-symbols we bound: ˆ L ; yL ) ≤ h0 I(x 0 0 (50) Combining Eq.(49) and Eq.(45) we have:   3h0 1 2 ∆≤ |X | · log(n + 1) − log (PA ) + K n n (51) Combining Eq.(46),(47),(50) with Eq.(43) and substituting ∆m ≤ ∆µ yields: ˆ y) < ∆ + I(x; B+1 X mb n b=1 B+1 X ≤∆+ b=1  K + ∆mb mb  + B h0 ≤ n B 1 (K + ∆µ ) + h0 = n n B+1 B (K + ∆µ ) + h0 (52) n n From Eq.(52) B and consequently Ract can be lower bounded: =∆+ Ract = ˆ y) − ∆ − 1 (K + ∆µ ) I(x; B n ·K > ·K = n K + ∆ µ + h0   ˆ y) − ∆ − K 1 + ∆µ I(x; n K = (53) ∆µ +h0 1+ K Now if we increase K with n such that O(log(n)) < O(K) < O(n) (for example by choosing K = nα , 0 < α < 1), then K n → 0 as n → ∞, since ∆µ = O(log(n)) we have ∆µ → 0 and from Eq.(51) we have ∆ → 0 thus for any ǫ we K have n large enough so that: Ract >  ˆ y) − ǫ  I(x; ˆ y) − ǫ (1 − ǫ) > > I(x; 1+ǫ ˆ y) − (1 + h0 )ǫ ≡ Remp > I(x; (54) Outside the set J, where the last inequality is due to the fact Iˆ is bounded. Hence we proved our claim that the rate exceeds a rate function which converges uniformly to the empirical mutual information and the proof of Theorem 3 is complete.  2) Rate analysis for the continuous case: The continuous case is more difficult from several reasons. 
One is that the error probability exponent has a missing degree of freedom (≈ exp((n − 1)t)). This results in a rate loss (through s in the definition of µ∗m ), which is larger for small blocks, and can be bounded only when assuming the number of blocks does not grow linearly with n. Since the effective mutual information Remp (x, y) is unbounded we cannot simply bound the loss of mutual information over the unused symbols. Specifically for a single symbol, ρ̂ = 1 and Remp = ∞. Therefore we use the convexity of the correlation factor and the fact it is bounded by 1. As a result, the loss introduced in order to attain convexity (over the rateless blocks) is in the correlation factor rather than the empirical mutual information. A loss in the correlation factor induces unbounded loss in the rate function for ρ ≈ 1, leading to a maximum rate. In order to cope with these difficulties we use a threshold T on the number of symbols in a block (T is chosen to grow slower than n), and treat large and small blocks differently: the large blocks are analyzed through their correlation factor and for the small blocks the correlation factor is upper bounded by 1 and only the number of blocks is accounted for. We denote ρ̂b ≡ ρ̂(xUb , yUb ) and ρ̂ ≡ ρ̂(x, y) the correlation factor measured on a rateless block and on the entire transmission block, respectively. We denote by BS = {b : mb ≤ T } and BL = {b : mb > T } the indices of the small and the large blocks respectively (the last unfinished block included). The total P number of symbols in the large blocks is denoted mL ≡ b∈BL mb . The number of large blocks is bounded by |BL | < Tn . The decoding threshold is written as   K n 1 K + ∆µ log(2) ∗ µm = + log = (55) + m−1 m−1 Pe m−1 m−1   2n where we denoted ∆µ ≡ log P . We consider the partie tioning of the index set {1, . . . , n} into at most p = Tn sets: the first Tn −S1 (or less) sets are the large blocks except their last symbol b∈BL Ub (each with at least T + 1 symbols by definition), and the last set denoted L1 includes the rest of the symbols (last symbols of these blocks and all symbols of small blocks), and has |L1 | = n − mL . Since this partitioning has a bounded number of sets, by applying Lemma 6 and Eq.(35) with p = Tn we have that Eq.57 below is satisfied when x is 15 outside a set J with probability at most: √  2 2 2n Pr(J) ≤ n2p · 2p e−n∆ /8 = e−n∆ /8 =   √  2 (56) 2n = exp −n log(e)∆2 /8 − log T n 2T For any 0 < ∆ ≤ 17 . This bound √ tends  to 0 if T > O(log(n)) (since log(e)∆2 /8 − T2 log 2n → log(e)∆2 /8 > 0) therefore for any such ∆ there is n large enough such that this probability falls below the required PA . The convexity condition is: ρ̂2 − ∆ ≤ X mb |L1 | ρ̂2b + ρ̂(xL1 ; yL1 )2 ≤ n n b∈BL X mb n − mL ρ̂2 + ≤ n b n (57) The last equation is a lower bound on a linear combination of |BL | and |BS |. Since the total information sent depends on |BL | + |BS | we equalize the coefficients multiplying |BL | and |BS | by determining η1 so that:   1 K + ∆µ 1 (62) − log (1 − η1 ) = 1 + 2 T T This is always possible since the RHS is positive and the LHS maps η1 ∈ (0, 1) to (0, ∞). Then    |BL | + (T + 1)|BS | 1 K + ∆µ r0 ≤ |BL | + 1+ = T T n K + ∆µ K + ∆µ = (|BL | + |BS |) = (B + 1) (63) n n Extracting a lower bound on B from Eq.(63) yields a bound on the empirical rate: b∈BL where ∆ can be made arbitrarily close to 0. We define a factor η1 < 1 and apply the function (− 21 ) log(1 − η1 t) to both sides of the above equation. 
Since the function is monotonically increasing and convex ∪ over t ∈ [0, 1) (stemming from concavity ∩ of log(t)), we have: r0 ≡ (− 12 ) log(1 − η1 · (ρ̂2 − ∆)) ≤ " !# X mb (57) n − mL 2 1 ≤ (− 2 ) log 1 − η1 ρ̂ + ·1 ≤ n b n b∈BL X mb  (− 21 ) log 1 − η1 ρ̂2b + ≤ n b∈BL n − mL (− 21 ) log (1 − η1 · 1) (58) n We start by bounding the terms related to the large blocks. At the last symbol before decoding in each block (or symbol n for the unfinished block) none of the codewords, including the correct one crosses the threshold µ∗m , therefore we have for b = 1, . . . , B + 1: + µ∗mb = 1 K + ∆µ > Remp (xUb , yUb ) = − log(1 − ρ̂2b ) (59) mb − 1 2 and since mb ≥ T + 1:  mb  mb (− 21 ) log 1 − η1 ρ̂2b ≤ (− 21 ) log 1 − ρ̂2b < n  n  (59) mb K + ∆µ 1 K + ∆µ · = 1+ ≤ < n mb − 1 mb − 1 n   1 K + ∆µ ≤ 1+ (60) T n P For the small blocks we use n ≤ b∈BL (mb + 1) + P (m + 1) ≤ m + |B | + (T + 1)|B b L L S | (where the b∈BS inequality is since the unterminated block has length mb ) to bound n − mL ≤ |BL | + (T + 1)|BS |. Combining Eq.(58) with these bounds we have:   1 K + ∆µ r0 ≤ |BL | 1 + + T n   1 |BL | + (T + 1)|BS | − log (1 − η1 ) + (61) n 2 K ·B ≥ n   r0 · n r0 K K −1 = − ≥ · = n K + ∆µ 1 + K −1 ∆µ n (− 21 ) log(1 − η1 (ρ̂2 − ∆)) K = − ≡ RLB1 (64) (1 + K −1 ∆µ ) n Ract = Equation (64) may be optimized with respect to T to obtain a tighter bound, but this is not necessary to prove the theorem. Recall that ∆µ = O(log(n)). By choosing O(log(n)) < K < O(n) the factor (1 + K −1 ∆µ ) in Eq.(64) can be made arbitrarily close to 1 and K n can be made arbitrarily close to 0. As we saw above choosing O(log(n)) < T < O(n) enables us to have PA → 0 with ∆ arbitrarily close to 0, and finally if K > O(T ) then the RHS of Eq.(62) tends to ∞ and therefore we can choose η1 arbitrarily close to 1. Summarizing the above, by selecting O(log(n)) < O(T ) < O(K) < O(n) we can write the rate as Ract ≥ RLB1 = (− 12 ) log(1 − η1 · (ρ̂2 − ∆)) · η2 − ǫ1 (65) With η1 , η2 n→∞ 0+ . RLB1 tends to the target −→ 1− and n→∞  ǫ1 ,∆−→ rate R2 (ρ̂) ≡ 12 log 1−1ρ̂2 for each point ρ̂ ∈ [0, 1) (but not uniformly), and it remains to show that for any R̄, ǫ there is n large enough such that RLB1 ≥ RLB2 ≡ min(R2 (ρ̂) − ǫ, R̄). The functions R2 (ρ) and RLB1 (ρ) are monotonically increasing (for fixed η1 , η2 and ǫ1 ) and it is easy to verify by differentiation that the difference R2 (ρ) − RLB1 (ρ) is also monotonically increasing. Given R̄, ǫ, choose ρ0 such that R2 (ρ0 ) = R̄ + ǫ. Since RLB1 (ρ0 )−→ R2 (ρ0 ), for n n→∞ large enough we have R2 (ρ0 ) − RLB1 (ρ0 ) ≤ ǫ, and therefore RLB1 (ρ0 ) ≥ R2 (ρ0 ) − ǫ = R̄. For this n, for any ρ ≤ ρ0 from the monotonicity of the difference we have that R2 (ρ) − RLB1 (ρ) ≤ ǫ, and for any ρ ≥ ρ0 we have from the monotonicity of RLB1 (ρ) that RLB1 (ρ) ≥ R̄, therefore RLB1 ≥ RLB2 , which completes the proof of Theorem 4.  VII. E XAMPLES In this section we give some examples to illustrate the model developed in this paper. In this section we use a slightly less formal notation. 16 A. Constant outputs and other illustrative cases The statement that a rate which is determined by the input and output sequences can be attained without assuming any dependence between them may seem paradoxical at first. Some insight can be gained by looking at the specific case where the output sequence is fixed and does not depend on the input. In this case, obviously, no information can be transferred. 
Since the encoder uses random sequences, the result of fixing the output is that the probability to have an empirical mutual information larger than ǫ > 0 tends to 0, therefore most of the time the rate will be 0. Infrequently, however, the input sequence accidentally has empirical mutual information larger than ǫ > 0 with the output sequence. In this case the decoder will set a positive rate, but very likely fail to decode. These cases occur in vanishing probability and constitute part of the error probability. So in this case we will transmit rate R = 0 with probability of at least 1 − Pe and R > 0 with probability at most Pe . Conversely, if the channel appears to be good according to the input and output sequences (suppose for example yk = xk ), the decoder does not know if it is facing a good channel or just a coincidence, however it takes a small risk by assuming it is indeed a good channel and attempting to decode, since the chances of high mutual information appearing accidentally are small (and uniformly bounded for all output sequences). Another point that appears paradoxical at first sight is that the decoder is able to determine a rate R ≥ Remp without knowing x for any x 6∈ J. First observe that although it is an output of the decoder, the rate R is not controlled by the encoder and therefore cannot convey information. Since the decoder knows the codebook, and given the codebook the sequence x is limited to a number of possibilities (determined by the possible messages and block locations), it is easy to find an R(y) ≥ Remp (x, y) by maximizing Remp over all possible sequences x. Vaguely speaking, the decoding process is indeed a maximization of Remp over multiple x sequences and by Lemmas 1, 4 such a decoding process guarantees small probability of error. B. Applying the continuous alphabet scheme to other input alphabets The scheme used for the continuous case can be adapted to peak limited or even discrete input, by using an adaptation function, i.e. the channel input will be x′k = f (xk ). In this case the modified codebook C ′ = f (C) will be generated by passing the Gaussian codebook through the adaptation function, but for analysis purposes the adaptation function f (·) may be considered part of the channel and the correlation factor is calculated with respect to x which is used to generate the codebook. In order to write the rate guaranteed by this approach as a function of x′ rather than x, the law of large numbers has to be utilized (in general) with respect to the distribution Pr(xk |x′k ). C. Non linear channels In analyzing probabilistic the correlation model   channels, 1 1 determines the rate 2 log 1−ρ2 is always achievable using Gaussian code (no randomization is needed if the channel is probabilistic as can be shown by the standard argument about the existence of a good code). This is actually a result of Lemma 2. This expression is useful for analyzing channels in which the noise is not additive or non linearities exist. As an example, transmitter noise is usually modeled as an additive noise. However large part of this noise is due to distortion (e.g. in the power amplifier), and therefore depends on the transmitted signal and is inversely correlated to it. Consider the non linear channel Y = f (X) + V with V ∼ N (0, N ). In this case ρ2 if we define the effective SNR as SNR = 1−ρ 2 then rate R = 12 log (1 + SNR) is achievable. 
The correlation factor is: ρ2 = E(Xf (X))2 E(XY )2 = E(X 2 )E(Y 2 ) E(X 2 )(E(f (X)2 ) + N ) (66) Therefore the effective SNR is: ρ2 = 1 − ρ2 Peff E(Xf (X))2 = = 2 E(X )(E(f (X)2 ) + N ) − E(Xf (X))2 N + Neff (67) SNR = where we defined the effective gain γ, the effective power Peff and the effective noise Neff as: γ Peff Neff E(Xf (X)) E(X 2 )   (E[(Xf (X)])2 ≡ = E (γX)2 E(X 2 ) (E[Xf (X)])2 ≡ E(f (X)2 ) − E(X 2 )   = E (f (X) − γX)2 ≡ (68) (69) (70) This yields a simple characterization of the degradation caused by the non linearity, which is independent of the noise power and is tight if the non linearity is small. This model enables to characterize the transmitter distortions by the two parameters Peff , Neff , a characterization which is more convenient and practical to calculate than the channel capacity, and on the other hand guarantees that transmitter noise evaluated this way never degrades the channel capacity in more than determined by Eq.(67). Another interesting application of this bound is in treating receiver estimation errors, since it is simpler to calculate the loss in the correlation factor induced due to the imperfect knowledge of the channel parameters than the loss in capacity. For example, the bound in [16] for the loss due to channel estimation from training, when specialized to single input single output (SISO) channels, may be computed using the correlation factor bound. D. Employing continuous channel scheme over a BSC When operated over a channel different than the Gaussian additive noise channel, the rates achieved with the scheme we described in the continuous case are suboptimal compared to the channel capacity. The loss depends on the channel in 17 from the simplicity of the models used, and can be solved by schemes employing higher order empirical distributions (over blocks, or by using Markov models), and by employing tighter approximations of the empirical statistics (e.g. by higher order statistics) in the continuous case. F. Using individual channel model to analyze adversarial individual sequence Fig. 5. Comparison of C,R for the BSC question. As an example, suppose the communication system is used over a BSC with error probability ǫ, i.e. the continuous input value X is translated to a binary value by sign(X), and the output is Y = sign(X) · (−1)Ber(ǫ) . The capacity of this channel is C = 1bit − hb (ǫ) and we are interested to calculate the rate which would be achieved by our scheme (which does not know the channel) for this channel behavior. For this channel with Gaussian N (0, P ) input we have (through a simple calculation): r 2P E(XY ) = (1 − 2ǫ) (71) π Hence ρ2 = And 2 E(XY )2 = (1 − 2ǫ)2 2 P · E(1 ) π 1 R = log 2  1 2 1 − π (1 − 2ǫ)2  (72) (73) The comparison between C and R is presented in fig.(5). It can be shown that R ≥ π2 C, thus the maximum loss is 36%. E. Channels that fail the zero order and the correlation model Although we did not assume anything about the channel, and specifically we did not assume the channel is memoryless, the fact we used the zero-order empirical distribution means the results are less tight for channels with memory. Specifically if delay is introduced then the scheme would fail completely. For example, for the channel yk = xk + 12 xk−1 + vk we would obtain positive rates and the intersymbol interference (ISI) 21 xk−1 would be treated (suboptimally) as noise, but for the error free channel yk = xk−1 the achieved rate would be 0 (with high probability). 
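The following short sketch (an illustration of ours; the input power, ISI coefficient and noise level are arbitrary) reproduces this contrast numerically: the ISI channel still shows substantial empirical correlation and hence a positive rate, while the noiseless one-symbol delay shows essentially none:

# A numerical sketch (not from the paper) of the zero-order model's blindness
# to memory: ISI is treated as noise, a pure delay is missed entirely.
import numpy as np

def r_emp(x, y):
    rho = abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return -0.5 * np.log(1.0 - rho ** 2)          # nats per symbol

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(0, 1, n)
v = rng.normal(0, 0.1, n)

y_isi = x + 0.5 * np.roll(x, 1) + v               # ISI treated as noise -> positive rate
y_delay = np.roll(x, 1)                           # perfect channel with delay -> rate near 0
print(r_emp(x, y_isi), r_emp(x, y_delay))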
Similarly we can find a memoryless channel with infinite capacity but for which the correlation model we used for the continuous alphabet scheme fails: if yk = x2k then ρ = 0. Another example of practical importance is the fading channel (with memory) yn = hn xn + vn , where hn is slowly fading with mean 0. All these examples result As we noted in the overview, the results obtained for the individual channel model constitute a convenient starting point for analyzing channel models which have a full or partial probabilistic behavior. It is clear that results regarding achievable rates in fully probabilistic, compound, arbitrarily varying and individual noise sequence models can be obtained by applying the weak law of large numbers to the theorems discussed here (limited, in general, to the randomized encoders regime). E.g. for a compound channel model Wθ (y|x) with an unknown parameter θ since P̂ (x; y)−→ Pθ (x, y) = n→∞ Wθ (y|x)Q(x) in probability for every θ and since I(·; ·) is ˆ y)−→ Iθ (X; Y ). Hence from Theorem 1 rate continuous I(x; n→∞ minθ Iθ (X; Y ) can be obtained without feedback, and from Theorem 3 rate Iθ (X; Y ) can be obtained with feedback. These results are not new (see [23][24] for the first and the second is obtained as a special case of the results in [3] and [2] since the individual noise sequence model can be degenerated into a compound model) and are given only to show the ease of using the individual model once established. To show the strength of the model we analyze a problem considered also in [2] of an individual sequence which is determined by an adversary and allowed to depend in a fixed or randomized way on the past channel inputs and outputs. For simplicity we start with the binary channel yk = xk ⊕ ek where ek is allowed to depend on x1k−1 and y1k−1 (possibly in a random fashion), and the target is to show the empirical capacity is still achievable in this scenario. Note that here Ek is a random variable but not assumed toP be i.i.d. We n denote the relative number of errors by ǫ̂ ≡ n1 k=1 ek . We would like to show the communication scheme achieves a rate close to 1bit − hb (ǫ̂) in high probability, regardless of the adversary’s policy. Note that both the achieved rate and the target 1bit − hb (ǫ̂) are random variables and the claim is that they are close in high probability (i.e. that the difference converges in probability to 0 when n → ∞) Applying the scheme achieving Theorem 3 with Q = Ber( 12 ) we can asymptotically approach (or exceed) the rate: ˆ y) = Ĥ(y) − Ĥ(y|x) = Ĥ(y) − Ĥ(e|x) ≥ I(x; ≥ Ĥ(y) − Ĥ(e) = Ĥ(y) − hb (ǫ̂) (74) Note that unlike in the probabilistic BSC where we have I(X; Y ) = H(Y )−H(E), here the empirical distribution of e is not necessarily independent of x, therefore the entropies are only related by the inequality Ĥ(e|x) ≤ Ĥ(e) (conditioning reduces entropy). In order to show a rate of 1bit − hb (ǫ̂) is −→ 1bit . Since Xk achieved, we only need to show Ĥ(y)n→∞,prob. k−1 k−1 is independent of X1 , Y1 and therefore also of Ek we 18 have: Pr(Yk = 0|Y1k−1 ) = X Pr(Yk = 0|Y1k−1 , ek )Pr(ek ) = ek = X ek = X ek Y1n Pr(Xk = ek |Y1k−1 , ek )Pr(ek ) = Pr(Xk = ek )Pr(ek ) = X1 ek 2 Pr(ek ) = 1 2 (75) Ber( 21 ) Therefore is distributed i.i.d. and from the law of large numbers and the continuity of H(·) we have the desired result. This result is a special case of the results in [2]. 
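The following simulation sketch (our own illustration; the adversary's flipping rule is an arbitrary example of a policy that depends on past outputs) shows the two quantities involved: the zero-order empirical output entropy stays close to 1 bit, as argued around Eq.(75), so 1bit − h_b(ǫ̂) is the resulting target rate:

# Illustrative simulation (not the paper's scheme) of the adversarial binary
# example: e_k may depend on past outputs, yet H_hat(y) stays near 1 bit.
import numpy as np

def h_b(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

rng = np.random.default_rng(4)
n = 100_000
x = rng.integers(0, 2, n)                 # i.i.d. Ber(1/2) input
y = np.zeros(n, dtype=int)
e = np.zeros(n, dtype=int)
for k in range(n):
    # an arbitrary adversary: flip whenever the last two outputs were both 1
    e[k] = 1 if k >= 2 and y[k - 1] == 1 and y[k - 2] == 1 else 0
    y[k] = x[k] ^ e[k]

eps_hat = e.mean()
H_y = h_b(y.mean())                        # zero-order empirical output entropy (bits)
print(H_y, 1 - h_b(eps_hat))               # H_y is close to 1 bit, as in Eq.(75)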
We can extend the example above to general discrete channels and perform a consolidation of the adversarial sequence model considered in [2] (for modulu additive channels) with the general discrete channel with fixed sequence considered in [3]. We address the channel Ws (y|x) with state sequence sk potentially determined by an adversary knowing all past inputsPand outputs. We would like to show that the rate I(Q, s Ws (y|x)P̂s (s)) (the mutual information of the stateaveraged channel) can be asymptotically attained in the sense defined above. This result is a superset of the results of [3] and [2]. It overlaps with [3] in the case s is a fixed sequence and with [2] for the case of modulu-additive channel (or when the target rate is based on the modulu additive model). ˆ y) ≡ I(P̂ (x), P̂ (y|x)) Since Theorem 3 shows the rate I(x; can be approached or exceeded asymptotically, it remains to show that the empirical distribution P̂ (x, y) is asymptotically close to the state-averaged P P distribution Pavg (x, y) ≡ 1 W (y|x) P̂ (s)Q(x) = s s s k WSk (y|x)Q(x), and the ren sult will follow from continuity of the mutual information. Note that the later value is a random variable (function) depending on the behavior of the adversary. Here we do not use the law of large numbers because of the interdependencies between the signals x, y and s. Our purpose is to prove that the difference ∆(t, r) defined below converges in probability to 0 for every t, r: ∆(t, r) ≡ P̂(x,y) (t, r) − Pavg (t, r) = 1X 1X Ind(Xk = t, Yk = r) − WSk (r|t)Q(t) ≡ = n n k k 1X ϕk (t, r) (76) ≡ n k where ϕk (t, r) ≡ Ind(Xk = t, Yk = r) − WSk (r|t)Q(t). For brevity of notation we omit the argument (t, r) from ϕk (t, r) since from this point on it takes a fixed value. Then E(Ind(Xk = t, Yk = r)|X k−1 , Y k−1 , S k ) = = Pr(Xk = t, Yk = r|X k−1 , Y k−1 , S k ) = = Pr(Xk = t|X k−1 , Y k−1 , S k )· (a) · Pr(Yk = r|Xk = t, X k−1 , Y k−1 , S k ) = (b) = Pr(Xk = t) · Pr(Yk = r|Xk = t, Sk ) = Q(t)WSk (r|t) (77) where (a) is due to the independent drawing of Xk (when not conditioned on the codebook), the fact S k is independent of Xk , and the memoryless channel (defining the Markov chain (X k−1 , Y k−1 , S k−1 ) ↔ (Xk , Sk ) ↔ Yk ), and (b) is due to the i.i.d drawing of Xk from Q and the definition of W . From Eq.(77) we have that: E(ϕk |X k−1 , Y k−1 , S k ) = 0 (78) By the smoothing theorem we also have that ϕk has zero mean E(ϕk ) = 0. We now show that ϕk are uncorrelated. Consider two different indices j < k (without loss of generality) then   E(ϕk · ϕj ) = E E(ϕk · ϕj |X k−1 , Y k−1 , S k ) =   = E ϕj · E(ϕk |X k−1 , Y k−1 , S k ) = 0 (79) where we used the smoothing theorem and the fact ϕj is completely determined by Xj , Yj , Sj which are given. In addition since by definition −1 ≤ ϕk ≤ 1, E(ϕ2k ) ≤ 1. Therefore n n 1 X 1 1 X (80) E(ϕk · ϕj ) ≤ 2 δjk = E(∆2 ) = 2 n n n j,k=1 j,k=1 and by Chebychev inequality for any ǫ > 0: Pr(|∆(t, r)| > ǫ) ≤ 1 E(∆2 ) ≤ 2 n→∞ −→ 0 ǫ2 nǫ (81) which proves the claim.  This result is new, to our knowledge, however the main point here is the relative simplicity in which it is attained when relying on the empirical channel model (note that most of the proof did not require any information-theoretic argument). VIII. C OMMENTS AND FURTHER STUDY A. Limitations of the model The scheme presented here is suboptimal when operated over channels with memory or, in the continuous case over non AWGN channels, and in section VII-E we discussed several cases where the communication fails completely. 
Obviously the solution is to extend the time order of the model. A simple extension is by using the super-alphabets X p and Y p and treating a block of channel uses as one symbol. A more delicate extension is by considering a Markov model (the p-th order k−1 empirical conditional probability P̂ (xk , yk |xk−1 k−p , yk−p )). For the continuous channel we focused on a specific class of continuous channels where the alphabet is the real numbers (we have not considered vectors as in MIMO channels), and we did not achieve the full mutual information. A possible extension is to find measures of empirical mutual information for the continuous channels which are also attainable and approach the probabilistic mutual information for probabilistic channels. The current paper exhibits a considerable similarity between the continuous case and the discrete case which is not fully explored here, and a unifying theory which will include the two as particular cases is wanting. We conjecture that the following definition of empirical mutual information may achieve these goals: given a family 19 of joint distributions (not necessarily i.i.d) {Pθ (x, y), θ ∈ Θ} define the entropy with respect to the family Θ as the entropy of the closest member of the family (in maximum likelihood sense): ĤΘ (x) = minθ∈Θ − n1 log Pθ (x) and likewise ĤΘ (x|y) = minθ∈Θ − n1 log Pθ (x|y), and define the relative mutual information as IˆΘ (x; y) = ĤΘ (x) − ĤΘ (x|y). This definition corresponds to our target rates for the discrete case (with Θ as the family of all DMC-s) and continuous case (with Θ the family of all joint Gaussian zero-mean distributions N (0, ΛXY )). B. Overhead and error exponent Another aspect is the overhead associated with extending the empirical distribution (”channel”) family which is considered (both in considering time dependence and in increasing the accuracy with which the distribution is estimated or described). This overhead is related to the redundancy or regret associated with universal distributions (see [25]). Although we haven’t performed a detailed analysis of the overheads and considered only the asymptotically achievable rates, it is obvious from comparing Lemmas 1 and 4 that the tighter rates we obtained for the discrete channel come at the cost of additional overhead (O(log(n)) compared to O(1) in the continuous case) which is associated with the richness of the channel family (describing a conditional probability as opposed to a single correlation factor). Thus for example for a discrete channel with a large alphabet and a small block size n we would sometimes be better off using the ”continuous channel model” version of our scheme (gaining only from the correlation) rather than the scheme of the discrete case (gaining the empirical mutual information). The issue of overheads requires additional analysis in order to determine the bounds on the overheads and the tradeoff between richness of the channel family and the rate, for a finite n. As we noted in section VI-C2 the bounds we currently have for the rate-adaptive continuous case are especially loose and call for improvement. Since rate can be traded off for error probability, a related question is the error exponent. Here, a good definition is still lacking for variable rate schemes, and the error exponents are not known for individual channels. The scheme we described does not endeavor to attain a good error exponent. 
Specifically, since the block of n channel uses is broken into multiple smaller blocks, it is probably not an efficient scheme in terms of error rate. We note, however, that for rate adaptive schemes with feedback a good error exponent does not necessarily relate to the capability of sending a message with small probability of error, but rather to the capability to detect the errors. A similar situation occurs in the setting of random decision time considered by Burnashev [12]. In the later, an uncertainty of the decoder with respect to the message is mitigated by sending an acknowledge / unacknowledge (ACK/NACK) message and possibly repeating the transmission with small penalty in the average rate (see a good description in [11] sec IV.B). A similar approach can be used in our setting (fixed decoding time, variable rate), by sending an ACK/NACK over a fixed portion of the block and setting R = 0 when the decoder is not certain of the received message. However we did not perform a detailed analysis. Note also that the analysis of the probability PA to transmit at a rate lower than the target rate function is entangled with the error analysis, since by such schemes it is possible to trade off rate for error, and reduce the error probability at the expense of increasing the probability to fall short of the target rate. C. Determining the behavior of the transmitted signal (prior) In this work we assumed a fixed prior (input probability distribution) and haven’t dealt with the question of determining the prior, or more generally, how the encoder should adapt its behavior based on the feedback. Had the channel been a compound one, it stands to reason that a scheme using feedback may estimate the channel and adjust the input prior, and may asymptotically attain the channel capacity. However in the scope of individual channels (as well as individual sequence channels and AVC-s) it is not clear whether the approach of adjusting to the input distribution to the measured conditional distribution is of merit, if the empirical channel capacity can be attained for every sequence, and even the definition of achievability is unclear if the input distribution is allowed to vary. Another related aspect is what we require from a communications system when considered under the individual channel framework. This question is relevant to all the requirements defined in the theorems (for example is the existence of the failure set J necessary ?), however the most outstanding requirement is related to the prior. Currently we constrained the input sequence to be a random i.i.d. sequence chosen from a fixed prior, which seems to be an overly narrow definition. The rationale behind this choice is that without any constraint on the input, the theorems we presented can be attained in a void way by transmitting only bad (e.g. fixed) sequences that guarantee zero empirical rate. Furthermore, without this constraint, attainability results for probabilistic models, and in general any attainable rates which are not conditioned on the input sequence could not be derived from our individual sequence theorems. A weaker requirement from the encoder is to be able to emit any possible sequence, however this requirement is not sufficient, since from the existence of such encoders we could not infer the existence of encoders achieving any positive rate over a specific channel. Consider for example the encoder satisfying the requirement by transmitting bad sequences in probability 1 − ǫ and good sequences in probability ǫ → 0. 
Theorems 1,2,3 and 4 are existence theorems, i.e. they guarantee the existence of at least one system satisfying the conditions. Had we removed the requirement for fixed input prior we saw these theorems would be attained by encoders that are unsatisfactory in other aspects. Once the theorem is satisfied by one encoder it cannot guarantee the existence of other (satisfactory) encoders, thus making it un-useful. Therefore the requirement for fixed prior is necessary in the current framework. Although in the scope of the theorems presented here, this requirement only strengthens the theorems (since it reveals additional properties of the encoder attaining the other conditions of the theorem), we are still bothered by 20 the question what should be the minimal requirements from a communication system, and these hopefully will not include a constraint on the input distribution. This issue relates to a fundamental difficulty which aries in communication over individual channels: unlike universal source coding in which the sequence is given a-priori, here the sequences are given a-posteriori, and the actions of the encoder affect the outcome in an unspecified way. Currently we broke the tie by placing a constraint on the encoder, but we seek a more general definition of the problem. coding (as in [12] [19]) in which the block size is not fixed but determined by the decoder. We did not include this scenario since the achievability result is less elegant in a way: the decoder indirectly affects the target rate (mutual information) through the block size. On the other hand this case may be of practical interest. Clearly the mutual information can be asymptotically attained for this communication scenario as well and its analysis is merely a simpler version of the rate analysis performed in section VI-C, since convexity is not required. D. Amount of randomization G. Bounds We have assumed so far there is no restriction on the amount of common randomness available and have not attempted to minimize the amount of randomization required (while maintaining the same rates). It is shown in [2] that less than O(n) of randomization information is required in some cases and O(n) is enough for others (see section V.5 therein), whereas we have used at least O(M · n) > O(n2 ) random drawings to produce the codebook. In this paper we focused on achievable rates and did not show a converse. An almost obvious statement is that any continuous rate function which depends only on the zero-order empirical statistics / correlation (respectively) cannot exceed asymptotically the rate functions of Theorems 3, 4 respectively with vanishing error probability. To show the statement for the discrete case determine y using a memoryless channel W (y|x). Then by the law of large numbers the empirical distribution converges to the channel distribution and from the continuity of the rate function the empirical rate converges to the rate function taken at the channel distribution. Since by Theorem 3 the actual rate asymptotically meets or exceeds the rate function, and by the converse of the channel capacity theorem the actual rate cannot exceed (asymptotically) the mutual information, we have that the rate function cannot exceed the mutual information (Remp ≤ Ract ≤ I(P, W )), up to asymptotically vanishing factors. For the continuous case the analogue claim is shown by taking a Gaussian additive channel and replacing ”distribution” by ”correlation” and ”empirical mutual information” by − 21 log(1 − ρ̂2 ). 
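For instance, the following quick check (illustrative; the power P and noise variance N are arbitrary) shows the continuous-case step numerically: over a true additive Gaussian noise channel the empirical rate function −½ log(1 − ρ̂²) converges to the Gaussian capacity ½ log(1 + P/N):

# Illustrative check that the empirical rate function of Theorem 4 converges
# to the Gaussian capacity over a true AWGN channel, as used in the converse.
import numpy as np

rng = np.random.default_rng(5)
n, P, N = 1_000_000, 1.0, 0.25
x = rng.normal(0, np.sqrt(P), n)
y = x + rng.normal(0, np.sqrt(N), n)

rho2 = (x @ y) ** 2 / ((x @ x) * (y @ y))
print(-0.5 * np.log(1 - rho2))             # empirical rate function (nats)
print(0.5 * np.log(1 + P / N))             # Gaussian capacity: the two coincide for large n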
The same applies also to rate functions obeying the conditions of Theorems 1, 2. More general bounds are yet to be studied. E. Practical aspects The scheme described in this work is a theoretical one, but the concept appears to be extendable to practical coding systems. Below we focus on the continuous case and merely give the motivation (without proof). One may replace the correlation receiver (GLRT) by a receiver utilizing training symbols to learn the channel effective gain, and then apply maximum likelihood (or approximate, e.g. iterative) decoding. The randomization of the codebook may be replaced by using a fixed code with random interleaving, since with random interleaving only the empirical distribution of the (effective) noise sequence affects the error probability, and we may conjecture that the property that Gaussian noise distribution is the worst is approximately true for practical codes (such as turbo codes and LDPC). When using a random interleaver the training symbols as well as the part of the coded symbols can be interleaved together, and the decoding attempts (which occur every symbol in the theoretical scheme) occur only at the end of each interleaving block. The rateless code is replaced by an incremental redundancy scheme, i.e. by sending each time part of the symbols of the codeword, and repeating the codeword if all symbols were transmitted without successful decoding. The decision when to decode can be simply replaced by decoding and using a CRC check. Finally the common randomness (required only for the generation of the interleaver permutation) can be replaced by pseudo-randomness. Such a scheme may not be able to attain the promise of Theorem 4 for every individual sequence but may be able to adapt to every natural and man-made channel. H. Comparison of the rate adaptive scheme with the similar scheme in [3] As noted the rate adaptive scheme we use is similar to the scheme of [3] in its high level structure. Table II compares some attributes of the schemes. Another important factor is the overhead (i.e. the loss in number of bits communicated with a given error exponent, compared to the target rate), which we were unable to compare. We conjecture that the current scheme may have a lower overhead due to its simplicity which results in a smaller number of parameters and constraints on their order of magnitude (compared to the scheme of [3] where relations between factors such as number of pilots and the minimum size of a chunk may require a large value of n). IX. C ONCLUSION F. Random decision time In our discussion we have described two communication scenarios: fixed rate without feedback and variable rate with feedback, and in both we assumed a fixed block size n. Another scenario is that of random decision time or rateless We examined achievable transmission rates for channels with unspecified models, and focused on rates determined by a channel’s a-posteriori empirical behavior, and specifically on rate functions which are determined by the zeroorder empirical distribution. 
This communication approach 21 TABLE II C OMPARISON RATE ADAPTIVE SCHEMES IN CURRENT PAPER AND [3] Item Channel model Mechanism for adaptivity Transmit format Feedback Alphabet Training Randomness Codebook construction Stopping condition Decoding Stopping location Eswaran et al [3] Individual sequence Repeated instanced of rateless coding Total time divided to rounds (=rateless blocks) which are divided to chunks Ternary (Bad Noise/Decoded/Keep Going), once per chunk Discrete Known symbols in random locations in each chunk Full (O(exp(nR))) Current Paper Individual channel Repeated instanced of rateless coding Total time divided to rateless blocks Constant composition + expurgation + training insertion Threshold over mutual information of channel estimated from training Maximum (empirical) mutual information End of Chunk Random i.i.d. Comments Chunks in [3] used as feedback instances and expurgated code has constant type over chunks Easy to generalize to once every 1/ǫ symbols (see VI-C) Binary (Decoded/Not Decoded) per symbol Discrete or Real valued None Full (O(exp(nR))) Might be reduced by selection from a smaller collection of codebooks (in both cases) Threshold over empirical mutual information of best codeword Maximum (empirical) mutual information Any symbol does not require a-priori specification of the channel model. The main result is that for discrete channels the empirical mutual information between the input and output sequences is attainable for any output sequence using feedback and common randomness, and for continuous real valued channels an effective ”Gaussian capacity” − 12 (1− ρ̂2 ) can be attained. This generalizes results obtained for individual noise sequences and is a useful model for analyzing compound, arbitrarily varying, and individual noise sequence channels. conditional type have the same (marginal) type, we can write:  X   ˆ y) ≥ t = Qn TX|Y (y) = Qn I(x; Tt (a) = X Tt (b) ≤ X Tt |TX|Y (y)| exp {−n [H(TX ) + D(TX ||Q)]} ≤ n o n h io exp nH(X̃|Ỹ ) exp −n H(X̃) + D(TX ||Q) = = X Tt ACKNOWLEDGMENT n h io exp −n I(X̃; Ỹ ) + D(TX ||Q) ≤      ≤ |Pn (X Y)|·exp −n min I(TY , TX|Y ) + D(TX ||Q) Tt The authors would like to thank the reviewers of the ISIT 2009 conference paper on the subject for their helpful comments and references. A PPENDIX A. Proof of Lemma 1 The proof is a rather standard calculation using the method of types. We use the notations of [10]. We divide the sequences according to their joint type TXY . The type TXY is defined by the probability distribution TXY ∈ Pn (X Y). For notational purposes we define the dummy random variables (X̃, Ỹ ) ∼ TXY and TX , TY , TY |X as the marginal and conditional distributions resulting from TXY . Following [10], the conditional type is defined as TX|Y (y) ≡ {y : (x, y) ∈ TXY }. The empirical mutual information of sequences in the type TXY is simply I(X̃; Ỹ ) = I(TY , TY |X ). Define Tt ≡ {TXY ∈ Pn (X Y) : I(TY , TY |X ) ≥ t}. 
Since all sequences in the (c) ≤ (n + 1)|X ||Y| · exp (−nt) =    log(n + 1) = exp −n t − |X ||Y| n (82) where (a) is due to [10] Eq.(II.1), (b) results from eq.(83) below which is an extension of (II.4) there to conditional types (and is a stronger version of Lemma II.3), based on the fact that in the conditional type TX|Y (y) the values of x over the na = nTY (a) indices for which yi = a have empirical distribution TX|Y and therefore thenumber of such sequences is limited to exp na H(X̃|Ỹ = a) , hence: |TX|Y (y)| ≤ Y a   exp nTY (a)H(X̃|Ỹ = a) =   = exp nH(X̃|Ỹ ) (83) (c) is based on bounding the number of types (see [14], Theorem 11.1.1), and the fact that in the minimization region I(TY , TX|Y ) ≥ t and D(TX ||Q) ≥ 0 therefore the result of the minimum is at least t. 22 B. Discussion of Lemma 1 1) An alternative proof for the exponential rate: For the proof of Theorem 1 we do not need the strict inequalities and equality in the error exponent would be sufficient, however these will be useful later for the rateless coding. An explanation for the fact that the result does not depend on Q can be obtained by showing that the above probability can be bounded for each type of x separately. I.e. if x is drawn uniformly over the type TX the probability of the above condition is: X X exp(nH(X̃|Ỹ )) |TX|Y (y)| TXY ∈Tt . TXY ∈Tt = = |TX | exp(nH(X̃)) X . exp(−nI(X̃; Ỹ )) = exp(−nt) (84) = TXY ∈Tt  where Tt ≡ TXY ∈ Pn (X Y) : (TXY )X = TX , (TXY )Y = TY , I(TY , TY |X ) ≥ t and since drawing x ∼ Qn is equivalent to first drawing the type of x and then drawing x uniformly over the type, the bound holds when x ∼ Qn . 2) Extension to alpha receivers: Following we discuss an extension of the bound and relate it to Agarwal’s [8] coding theorem using the rate distortion function. Consider a communication system similar to that of Theorem 1, where the codebook is a constant composition code, consisting of randomly selected sequences of type Q, and the receiver is an α receiver (see [26]), i.e. selects the received codeword by maximizing a function α̂(x, y) depending only on the joint empirical distribution of the sequences x, y. The function α(X̃, Ỹ ) = α(TXY ) is defined as the respective function of the distribution of X̃, Ỹ . Then, the pairwise error probability may be bounded similarly to eq. (84) by replacing the condition the condition I(TY , TY |X ) ≥ t in the definition of Tt by α(TXY ) ≥ t, and obtaining: . Pr(α̂(x, y) ≥ t) ≤ Pα =   . = exp − n min PX̃ Ỹ :X̃∼Q  I(X̃; Ỹ ) ≤ Ỹ ∼P̂ (y) α(X̃,Ỹ )≥t   ≤ exp − n min PX̃ Ỹ :X̃∼Q  I(X̃; Ỹ ) (85) α(X̃,Ỹ )≥t Following the proof of Theorem 1, the RHS of eq.(85) determines the following achievable rate: Remp (x, y) ≈ min X̃∼Q, I(X̃; Ỹ ) ≈ ˆ y) (86) ≤ I(x, α(X̃,Ỹ )≥α̂(x,y) Where the approximate inequality stems from substituting the empirical distribution of x, y as a particular distribution of X̃, Ỹ meeting the minimization constraints. The above expression is similar to the one obtained in mismatch decoding with random codes. Eq.(85) allows a larger (but still limited) scope of empirical rate functions, but also shows that within this scope the best function is still the empirical mutual information. On the other hand, an advantage of this expression is that under some continuity conditions it can be extended from discrete to continuous vectors (as performed in [8]). 
When substituting α with the distortion function α(X̃, Ỹ ) = −Ed(X̃, Ỹ ), we would obtain: Remp (x, y) ≈ min I(X̃; Ỹ ) = X̃∼Q, Ed(X̃,Ỹ )≥Êd(x,y) = RX (Êd(x, y)) = RX (D̂) (87) where RX (D) is the rate distortion function of an i.i.d. source X ∼ Q with the distortion metric d. The later relation can be used to show the result that communication at the rate RX (D) is possible where D is the empirical or the maximum guaranteed distortion of the channel as shown in [8]. On the other hand, when using the correlation function α(X̃, Ỹ ) = E(X̃ Ỹ ) = ρ, we would obtain from eq.(86) and Lemma E(X̃ 2 )E(Ỹ 2 ) 2: Remp (x, y) ≈ − 21 log(1 − ρ̂2 ). Note that although the later expression is the same as the one obtained in Theorem 2, the above derivation only proves it for discrete vectors. C. Proof of Lemma 2 For random variables X and Y where X is continuous (not necessarily Gaussian) we have the following bound on the conditional differential entropy (Ỹ denotes a dummy variable with the same distribution as Y and used for notational purposes): h  i ≤ h(X|Y ) = EỸ h X Y = Ỹ   (a) 1 log (2πeV AR(X|Y )) ≤ ≤ E 2 (b) 1 ≤ log (2πeE [V AR(X|Y )]) = 2 1 = log (2πeE [V AR(X − α · Y |Y )]) ≤ 2 (c) 1  ≤ log 2πeE(X − α · Y )2 =α:= E(XY ) 2 E(Y 2 )    2 1 E(XY ) = log 2πe E(X 2 ) − = 2 E(Y 2 )  1 = log 2πeE(X 2 )(1 − ρ2 ) = 2  1  1 (88) = log 2πeE(X 2 ) + log 1 − ρ2 2 2 where the (a) is based on Gaussian bound for entropy and (b) on concavity of the log function (see also [14] Eq.(17.24)) (c) is based on V AR(X) = E(X 2 ) − (EX)2 ≤ E(X 2 ) and is similar to the assertion that E[V AR(X|Y )] which is the MMSE estimation error is not worse than the LMMSE estimation error (except our disregard for the mean). Therefore for a Gaussian X: I(X; Y ) = h(X) − h(X|Y ) = = (88) 1 1 log(2πeE(X 2 )) − h(X|Y ) ≥ − log(1 − ρ2 ) (89) 2 2 23  Proof of corollary 2.1: Equality (a) holds only if X|Y is Gaussian for every value of Y , (b) holds if X has fixed variance conditioned on every Y , and (c) if E(X −α·Y |Y ) = 0 =⇒ E(X|Y ) = α · Y , therefore it results in X|Y ∼ N (αY, const) which implies X, Y are jointly Gaussian (easy to check by calculating the pdf). Note that if X, Y are jointly Gaussian then Y can be represented as a result of an additive white Gaussian noise channel (AWGN) with gain operating on X: Y ∼ E(Y |X)+N (0, V AR(Y |X)) = α̃·X +N (0, σ 2 )+const (90) To show corollary 2.2 consider X = Y = Ber( 12 ), in which case I(X; Y ) = 1 and ρ = 1, therefore the assertion doesn’t hold. Fig. 6. A geometric interpretation of Lemma 4 we have: D. Proof of Lemma 4 Write the empirical correlation as ρ̂ ≡ xT y = kxkkyk  x kxk T  y kyk  (91) From the expression above we can infer that ρ̂ does not depend on the amplitude of x and y but only on their direction. Since x is isotropically distributed, the result does not depend on the direction of y (unless y = 0 in which case it is trivially correct), therefore it is independent of y and we can conveniently choose y = (1, 0, 0, . . . , 0). 
D. Proof of Lemma 4

Fig. 6. A geometric interpretation of Lemma 4.

Write the empirical correlation as

$$\hat\rho \equiv \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|} = \left(\frac{\mathbf{x}}{\|\mathbf{x}\|}\right)^T\left(\frac{\mathbf{y}}{\|\mathbf{y}\|}\right) \qquad (91)$$

From the expression above we can infer that $\hat\rho$ does not depend on the amplitude of $\mathbf{x}$ and $\mathbf{y}$ but only on their direction. Since $\mathbf{x}$ is isotropically distributed, the result does not depend on the direction of $\mathbf{y}$ (unless $\mathbf{y}=\mathbf{0}$, in which case the claim is trivially correct); therefore it is independent of $\mathbf{y}$, and we can conveniently choose $\mathbf{y}=(1,0,0,\ldots,0)$.

To put the claim above more formally, for any unitary $n\times n$ matrix $\mathbf{U}$ we can write:

$$\hat\rho = \frac{\mathbf{x}^T\mathbf{y}}{\sqrt{(\mathbf{x}^T\mathbf{x})(\mathbf{y}^T\mathbf{y})}} = \frac{\mathbf{x}^T\mathbf{U}^T\mathbf{U}\mathbf{y}}{\sqrt{(\mathbf{x}^T\mathbf{U}^T\mathbf{U}\mathbf{x})(\mathbf{y}^T\mathbf{U}^T\mathbf{U}\mathbf{y})}} = \left(\frac{\mathbf{U}\mathbf{x}}{\|\mathbf{U}\mathbf{x}\|}\right)^T\left(\frac{\mathbf{U}\mathbf{y}}{\|\mathbf{U}\mathbf{y}\|}\right) \qquad (92)$$

Since $\mathbf{x}$ is Gaussian, $\mathbf{U}\mathbf{x}$ has the same distribution as $\mathbf{x}$; thus the probability remains unchanged if we remove $\mathbf{U}$ from the left-hand factor and remain with $\hat\rho' = \left(\frac{\mathbf{x}}{\|\mathbf{x}\|}\right)^T\left(\frac{\mathbf{U}\mathbf{y}}{\|\mathbf{U}\mathbf{y}\|}\right)$. For $\mathbf{y}\neq\mathbf{0}$, we may choose the unitary matrix $\mathbf{U}$ whose first row is $\frac{\mathbf{y}}{\|\mathbf{y}\|}$ and whose other rows complete it to an orthonormal basis of the linear space $\mathbb{R}^n$. Then $\mathbf{U}\mathbf{y} = (\|\mathbf{y}\|,0,0,\ldots,0)$ and therefore $\frac{\mathbf{U}\mathbf{y}}{\|\mathbf{U}\mathbf{y}\|} = (1,0,0,\ldots,0)$. Thus the distribution of $\hat\rho' = (1,0,0,\ldots,0)\cdot\frac{\mathbf{x}}{\|\mathbf{x}\|} = \frac{x_1}{\|\mathbf{x}\|}$ equals the distribution of $\hat\rho$. Assuming without loss of generality that $\mathbf{x}\sim\mathcal{N}^n(0,1)$, we have:

$$
\begin{aligned}
\Pr(|\hat\rho|\geq t) &= \Pr\left(\frac{|x_1|}{\|\mathbf{x}\|}\geq t\right) = \Pr\big(x_1^2 \geq t^2(\|\mathbf{x}_2^n\|^2 + x_1^2)\big) = \Pr\left(x_1^2 \geq \frac{t^2}{1-t^2}\|\mathbf{x}_2^n\|^2\right) \\
&= \mathbb{E}\left[\Pr\left(x_1^2 \geq \frac{t^2}{1-t^2}\|\mathbf{x}_2^n\|^2 \,\Big|\, \mathbf{x}_2^n\right)\right] = \mathbb{E}\left[2Q\left(\sqrt{\frac{t^2}{1-t^2}}\,\|\mathbf{x}_2^n\|\right)\right] \leq \mathbb{E}\left[2e^{-\frac{1}{2}\frac{t^2}{1-t^2}\|\mathbf{x}_2^n\|^2}\right] \\
&= 2\int_{\mathbb{R}^{n-1}} e^{-\frac{1}{2}\frac{t^2}{1-t^2}\|\mathbf{x}_2^n\|^2}\,\frac{1}{(2\pi)^{(n-1)/2}}\,e^{-\frac{1}{2}\|\mathbf{x}_2^n\|^2}\,d\mathbf{x}_2^n = 2\int_{\mathbb{R}^{n-1}} \frac{1}{(2\pi)^{(n-1)/2}}\,e^{-\frac{1}{2}\frac{1}{1-t^2}\|\mathbf{x}_2^n\|^2}\,d\mathbf{x}_2^n \\
&= 2(1-t^2)^{\frac{n-1}{2}}\int_{\mathbb{R}^{n-1}} f_{\mathcal{N}^{n-1}(0,1-t^2)}(\mathbf{x}_2^n)\,d\mathbf{x}_2^n = 2(1-t^2)^{\frac{n-1}{2}} = 2\exp\big(-(n-1)R_2(t)\big)
\end{aligned} \qquad (93)
$$

where we used the rough upper bound on the Gaussian error function $Q(x)\equiv\Pr(\mathcal{N}(0,1)\geq x)\leq e^{-x^2/2}$, and $f_{\mathcal{N}^n(\mu,\sigma^2)}$ denotes the pdf of a Gaussian i.i.d. vector. ∎

Discussion: A geometrical interpretation of Lemma 4 relates this probability to the solid angle of the cone $\{\mathbf{x}: |\hat\rho|>t\}$. Since $\mathbf{x}$ is isotropically distributed, the probability of having $|\hat\rho|>t$ equals the relative surface determined by vectors having $|\hat\rho|>t$ on the unit $n$-ball (termed the solid angle). Since $\hat\rho$ is the cosine of the angle between $\mathbf{x}$ and $\mathbf{y}$, the points where $|\hat\rho|>t$ generate a cone with inner angle $2\alpha$, where $\cos(\alpha)=t$, and their intersection with the unit $n$-ball is a spherical cap (dome), shown in Fig. 6. We can obtain a similar bound to the above using geometrical considerations. Write the volume of an $n$-dimensional ball of radius $r$ as $V_n r^n$, where $V_n$ is a fixed factor $V_n = \frac{\pi^{n/2}}{\Gamma(1+n/2)}$ [27]; accordingly, the surface of an $n$-dimensional ball is (the derivative) $nV_n r^{n-1}$. Then the relative surface of the spherical cap can be computed by integrating the surfaces of the $(n-1)$-dimensional balls with radius $\sin(\theta)$ that have a fixed angle $\theta$ with respect to $\mathbf{y}$, and can be bounded as follows:

$$
\begin{aligned}
\Pr(|\hat\rho|\geq t) &= \frac{\text{surface of cap}}{\text{surface of ball}} = \frac{1}{nV_n}\int_{\theta=0}^{\alpha}(n-1)V_{n-1}\sin^{n-2}(\theta)\,d\theta \\
&\leq \frac{V_{n-1}}{V_n}\sin^{n-3}(\alpha)\int_{\theta=0}^{\alpha}\sin(\theta)\,d\theta = \frac{V_{n-1}}{V_n}\sin^{n-3}(\alpha)\big(1-\cos(\alpha)\big) \\
&\underset{\alpha\leq\frac{\pi}{2}}{\leq} \frac{V_{n-1}}{V_n}\sin^{n-3}(\alpha)\big(1-\cos^2(\alpha)\big) = O(\sqrt{n})\cdot\sin^{n-1}(\alpha) = O(\sqrt{n})\cdot\left(\sqrt{1-\cos^2(\alpha)}\right)^{n-1} \\
&= O(\sqrt{n})\cdot(1-t^2)^{(n-1)/2}
\end{aligned} \qquad (94)
$$

where the fact that the ratio $\frac{V_{n-1}}{\sqrt{n}\,V_n}$ converges to a constant (so that $\frac{V_{n-1}}{V_n}=O(\sqrt{n})$) is based on [28], Eq. (99). An interesting observation is that the assumption of a Gaussian distribution is not necessary, and this bound is true for all isotropic distributions.
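The exponential decay $2(1-t^2)^{(n-1)/2}$ of Lemma 4 is easy to check numerically. The sketch below (ours; the parameters are arbitrary) draws isotropic Gaussian vectors $\mathbf{x}$, fixes an arbitrary non-zero $\mathbf{y}$, and compares the empirical frequency of $\{|\hat\rho|\geq t\}$ with the bound of Eq. (93).

import numpy as np

rng = np.random.default_rng(2)
n, t, trials = 50, 0.4, 100000
y = np.ones(n)                                    # any fixed non-zero direction
X = rng.normal(size=(trials, n))                  # isotropic Gaussian draws of x
rho_hat = (X @ y) / (np.linalg.norm(X, axis=1) * np.linalg.norm(y))
empirical = np.mean(np.abs(rho_hat) >= t)
bound = 2 * (1 - t**2) ** ((n - 1) / 2)           # 2*exp(-(n-1)*R_2(t)), the bound of Lemma 4
print(empirical, bound)                           # the empirical frequency stays below the bound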
E. Proof of Lemma 6

We denote by $\mathbf{x}_i,\mathbf{y}_i$ the sub-vectors over $A_i$ (i.e. $\mathbf{x}_i\equiv\mathbf{x}_{A_i}$, $\mathbf{y}_i\equiv\mathbf{y}_{A_i}$), their length by $n_i\equiv|A_i|$, and their relative length by $\lambda_i = n_i/n$. We are interested in finding a subset $J$ of $\mathbf{x}$-s with bounded probability, such that outside the set $\sum_i\lambda_i\hat\rho_i^2 \geq \hat\rho^2-\Delta$ for any $\mathbf{y}$. Consider the following inequality:

$$
\begin{aligned}
\|\mathbf{x}\|^2\cdot\|\mathbf{y}\|^2\cdot\hat\rho^2 &= \big(\mathbf{x}^T\mathbf{y}\big)^2 = \Big(\sum_i \mathbf{x}_i^T\mathbf{y}_i\Big)^2 = \Big(\sum_i \hat\rho_i\|\mathbf{x}_i\|\cdot\|\mathbf{y}_i\|\Big)^2 \overset{(a)}{\leq} \Big(\sum_i \hat\rho_i^2\|\mathbf{x}_i\|^2\Big)\cdot\Big(\sum_i\|\mathbf{y}_i\|^2\Big) \\
&= \left(\sum_i\lambda_i\hat\rho_i^2 + \sum_i\hat\rho_i^2\Big(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2}-\lambda_i\Big)\right)\cdot\|\mathbf{x}\|^2\cdot\|\mathbf{y}\|^2 \\
&\overset{(b)}{\leq} \left(\sum_i\lambda_i\hat\rho_i^2 + \sum_i\max\Big(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2}-\lambda_i,\,0\Big)\right)\cdot\|\mathbf{x}\|^2\cdot\|\mathbf{y}\|^2
\end{aligned} \qquad (95)
$$

where (a) is from the Cauchy-Schwarz inequality, and (b) is since $\hat\rho_i^2 z_i \leq z_i$ for $z_i\geq 0$ and $\hat\rho_i^2 z_i\leq 0$ for $z_i\leq 0$, therefore always $\hat\rho_i^2 z_i\leq\max(z_i,0)$ (attained for $\hat\rho_i = \mathrm{Ind}(z_i>0)$). Both inequalities are tight in the sense that for each $\mathbf{x}$ there is a sequence $\mathbf{y}$ (equivalent to choosing $\{\|\mathbf{y}_i\|^2\}$ and $\{\hat\rho_i\}$) that meets them with equality. Dividing by $\|\mathbf{x}\|^2\cdot\|\mathbf{y}\|^2$ we have that

$$\sum_i\lambda_i\hat\rho_i^2 \geq \hat\rho^2 - \sum_i\max\Big(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2}-\lambda_i,\,0\Big) \qquad (96)$$

where the subtracted term on the RHS depends only on $\mathbf{x}$ and should be bounded by $\Delta$. Thus the minimal set $J_\Delta$ is:

$$J_\Delta \equiv \left\{\mathbf{x} : \sum_i\max\Big(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2}-\lambda_i,\,0\Big) > \Delta\right\} \qquad (97)$$

The set is minimal in the sense that none of its elements can be removed while meeting the conditions of the lemma.

We would like to bound the probability of $J_\Delta$. The result of $\sum_i\max(z_i,0)$ is a partial sum of the $z_i$; since negative $z_i$ are not summed, it is easy to see that this is the maximal partial sum, i.e. we can write this sum alternatively as

$$\sum_i\max(z_i,0) = \max_{I\in\mathcal{P}}\sum_{i\in I}z_i \qquad (98)$$

where $\mathcal{P}\equiv 2^{\{1,\ldots,p\}}\setminus\{\emptyset\}$ denotes the set of all non-empty subsets of $\{1,\ldots,p\}$, and its size is $2^p-1$. Therefore, from the union bound we have:

$$\Pr\{J_\Delta\} = \Pr\left\{\max_{I\in\mathcal{P}}\sum_{i\in I}\Big(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2}-\lambda_i\Big)>\Delta\right\} \leq \sum_{I\in\mathcal{P}}\Pr\left\{\sum_{i\in I}\Big(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2}-\lambda_i\Big)>\Delta\right\} \qquad (99)$$

To bound the above, we first develop a bound on the probability $\Pr\big(\sum_i a_i\|\mathbf{x}_i\|^2\leq 0\big)$ for some coefficients $a_i$:

Lemma 7. Let $\mathbf{x}\sim\mathcal{N}(0,P)^n$. For coefficients $\{a_i\}_{i=1}^p$ with $\sum_i\lambda_i a_i=\bar a>0$ and $|a_i|\leq A$, where $|\bar a|\leq\frac{1}{8}A$, we have

$$\Pr\Big(\sum_i a_i\|\mathbf{x}_i\|^2\leq 0\Big)\leq e^{-nE} \qquad (100)$$

where

$$E = \frac{\bar a^2}{6A^2} \qquad (101)$$

Now we apply the bound to the events in Eq. (99):

$$\sum_{i\in I}\Big(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2}-\lambda_i\Big) > \Delta
\;\Longleftrightarrow\;
\sum_{i\in I}\|\mathbf{x}_i\|^2 - \Big(\Delta+\sum_{i\in I}\lambda_i\Big)\cdot\|\mathbf{x}\|^2 > 0
\;\Longleftrightarrow\;
\sum_{i=1}^{p}\underbrace{\Big(\Delta+\sum_{j\in I}\lambda_j-\mathrm{Ind}(i\in I)\Big)}_{\equiv a_i}\,\|\mathbf{x}_i\|^2 < 0$$

We have:

$$\bar a = \sum_{i=1}^{p}\lambda_i a_i = \Delta\sum_{i=1}^{p}\lambda_i + \sum_{j\in I}\lambda_j\sum_{i=1}^{p}\lambda_i - \sum_{i\in I}\lambda_i = \Delta \qquad (102)$$

In addition, $|a_i|\leq 1+\Delta\equiv A$; therefore for $\Delta\leq 1/7$ we have $\bar a\leq\frac{1}{8}A$, and by Lemma 7:

$$\Pr\left\{\sum_{i\in I}\Big(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2}-\lambda_i\Big) > \Delta\right\} \leq e^{-nE} \leq e^{-nE_0} \qquad (103)$$

where

$$E = \frac{\bar a^2}{6A^2} = \frac{\Delta^2}{6(1+\Delta)^2} \geq \frac{\Delta^2}{6(1+1/7)^2} \geq \frac{\Delta^2}{8} \equiv E_0 \qquad (104)$$

and from Eq. (99) we have:

$$\Pr\{J_\Delta\} \leq \sum_{I\in\mathcal{P}}\Pr\left\{\sum_{i\in I}\Big(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2}-\lambda_i\Big) > \Delta\right\} \leq |\mathcal{P}|\cdot e^{-nE_0} \leq 2^p e^{-nE_0} \qquad (105)$$

which proves the lemma. Note that different bounds can be obtained by applying the bound to $m$ smaller sets in $\{1,\ldots,p\}$ and requiring that the sum over each set be bounded by $\Delta/m$ (as an example, we could bound each $\max(z_i,0)$ separately by $\Delta/p$); however, the bound above is most suitable for our purpose, since when $p\ll n$ the factor $2^p$ becomes negligible. ∎
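The algebraic core of the proof is the decomposition (95)-(96). The following sketch (ours; the block sizes and parameters are arbitrary) verifies inequality (96) numerically for random $\mathbf{x},\mathbf{y}$ and an equal-size partition of the indices, computing the slack term that defines $J_\Delta$ in (97).

import numpy as np

rng = np.random.default_rng(3)
n, p, trials = 120, 4, 1000
lam = np.full(p, 1.0 / p)            # relative lengths lambda_i for equal-size blocks A_i

def corr(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for _ in range(trials):
    x, y = rng.normal(size=n), rng.normal(size=n)
    xb, yb = np.split(x, p), np.split(y, p)
    rho = corr(x, y)
    rho_i = np.array([corr(a, b) for a, b in zip(xb, yb)])
    ratios = np.array([a @ a for a in xb]) / (x @ x)
    slack = np.sum(np.maximum(ratios - lam, 0.0))    # the quantity defining J_Delta in (97)
    # inequality (96): sum_i lambda_i * rho_i^2 >= rho^2 - slack
    assert np.sum(lam * rho_i**2) >= rho**2 - slack - 1e-12
print("inequality (96) holds on all", trials, "random draws")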
Proof of Lemma 7: We assume without loss of generality that $\mathbf{x}\sim\mathcal{N}(0,1)^n$. For a Gaussian r.v. $X\sim\mathcal{N}(0,1)$ and $a<\tfrac{1}{2}$ we have:

$$\mathbb{E}\big(e^{aX^2}\big) = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{(a-\frac{1}{2})x^2}\,dx = \frac{1}{\sqrt{1-2a}}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi(1-2a)^{-1}}}\,e^{-\frac{x^2}{2(1-2a)^{-1}}}\,dx = \frac{1}{\sqrt{1-2a}} \qquad (106)$$

For coefficients $\{a_i\}_{i=1}^p$ with $\sum_i\lambda_i a_i=\bar a>0$ and $|a_i|\leq A$, a positive constant $w>0$ of our choice, and $\mathbf{x}\sim\mathcal{N}(0,1)^n$, we have:

$$
\begin{aligned}
\ln\Pr\Big(\sum_i a_i\|\mathbf{x}_i\|^2\leq 0\Big) &\leq \ln\mathbb{E}\,e^{-\frac{1}{2}w\sum_i a_i\|\mathbf{x}_i\|^2} = \ln\mathbb{E}\,e^{-\frac{1}{2}w\sum_i a_i\sum_{j\in A_i}x_j^2} = \ln\prod_i\prod_{j\in A_i}\mathbb{E}\,e^{-\frac{1}{2}w a_i x_j^2} \\
&= \sum_i\sum_{j\in A_i}-\tfrac{1}{2}\ln(1+w a_i) = -\tfrac{1}{2}n\sum_i\lambda_i\ln(1+w a_i) \\
&\overset{(a)}{=} -\tfrac{1}{2}n\sum_i\lambda_i\left(w a_i-\tfrac{1}{2}\frac{(w a_i)^2}{(1+w t_i)^2}\right) \overset{(b)}{\leq} -\tfrac{1}{2}n\left(\sum_i\lambda_i w a_i-\tfrac{1}{2}\frac{(wA)^2}{(1-wA)^2}\right) \\
&= -\tfrac{1}{2}n\left(\bar a w-\frac{A^2w^2}{2(1-wA)^2}\right)
\end{aligned} \qquad (107)
$$

where (a) is based on the second-order Taylor expansion of $\ln(1+wt)$ around $t=0$, with some $t_i\in[0,a_i]\cup[a_i,0]$, and (b) is since $|t_i|\leq|a_i|\leq A$. For simplicity we choose a sub-optimal $w^* = \frac{\bar a}{A^2}$ (which is obtained by assuming small $a,w$ and optimizing the bound with respect to $w$ while ignoring the denominator) and obtain:

$$\bar a w^* - \frac{A^2 w^{*2}}{2(1-w^*A)^2} = \frac{\bar a^2}{A^2} - \frac{\bar a^2/A^2}{2(1-\bar a/A)^2} = \frac{\bar a^2}{A^2}\left(1-\frac{A^2}{2(A-\bar a)^2}\right) \qquad (108)$$

To simplify the bound, we make the further assumption that $|\bar a|\leq\frac{1}{8}A$; therefore:

$$\frac{\bar a^2}{A^2}\left(1-\frac{A^2}{2(A-\bar a)^2}\right) \geq \frac{\bar a^2}{A^2}\left(1-\frac{1}{2\cdot(7/8)^2}\right) = \frac{\bar a^2}{A^2}\cdot\frac{17}{49} \geq \frac{\bar a^2}{3A^2} \qquad (109)$$

Substituting this back into (107) gives $\ln\Pr\big(\sum_i a_i\|\mathbf{x}_i\|^2\leq 0\big)\leq -\frac{n}{2}\cdot\frac{\bar a^2}{3A^2}$. Therefore we can write the following bound: for $|\bar a|\leq\frac{1}{8}A$ we have

$$\Pr\Big(\sum_i a_i\|\mathbf{x}_i\|^2\leq 0\Big)\leq e^{-nE} \qquad (110)$$

where $E=\frac{\bar a^2}{6A^2}$. Note that the bound is true for any $\mathbf{x}\sim\mathcal{N}(0,P)^n$, since scaling $\mathbf{x}$ does not change the sign of $\sum_i a_i\|\mathbf{x}_i\|^2$. ∎

F. Parameters of the adaptive rate scheme used for Figure 3

Table III lists two sets of parameters for the continuous-alphabet adaptive rate scheme. The first set was used for the curves in Figure 3, and the second set shows the convergence of $\epsilon,\bar R$ for higher values of $n,K$. Note that the values of $n,K$ are extremely high; this is due to the looseness of the bounds used in the continuous case, specifically the exponent of Lemma 6, which yields a relatively slow convergence of the ill-convexity probability in Eq. (56).

TABLE III
PARAMETERS OF THE ADAPTIVE RATE SCHEME USED FOR FIGURE 3

Item                | Reference                | Parameter set 1 (Figure 3)                                                          | Parameter set 2
Transmission scheme | Section V-C              | n = 1e+008, K = 1e+006, PA = 0.001, Pe = 0.001                                      | n = 1e+020, K = 1e+017, PA = 0.001, Pe = 0.001
RLB1 parameters     | Section VI-C2, Eq. (65)  | T = 2.5e+005, ∆µ = 37.5412, ∆ = 0.0345958, η1 = 0.996007, η2 = 0.999962, ǫ1 = 0.01  | T = 7.5e+015, ∆µ = 77.4043, ∆ = 3.14616e−007, η1 = 1, η2 = 1, ǫ1 = 0.001
RLB2 parameters     | Section VI-C2, Theorem 4 | ρ0 = 0.9, ǫ = 0.139438, R̄ = 1.05173                                                 | ρ0 = 0.99998, ǫ = 0.0068209, R̄ = 7.29818

REFERENCES

[1] O. Shayevitz and M. Feder, "Communicating using feedback over a binary channel with arbitrary noise sequence," in Proc. International Symposium on Information Theory (ISIT), Adelaide, Australia, September 2005.
[2] O. Shayevitz and M. Feder, "Achieving the empirical capacity using feedback Part I: Memoryless additive models," Dept. of Electrical Engineering Systems, Tel Aviv University, Tel Aviv 69978, Israel.
[3] K. Eswaran, A. D. Sarwate, A. Sahai, and M. Gastpar, "Limited feedback achieves the empirical capacity," arXiv:0711.0237v1 [cs.IT], Nov. 2007.
[4] K. Eswaran, A. D. Sarwate, A. Sahai, and M. Gastpar, "Using zero-rate feedback on binary additive channels with individual noise sequences," in Proc. 2007 International Symposium on Information Theory (ISIT 2007), Nice, France, June 2007.
[5] V. D. Goppa, "Nonprobabilistic mutual information without memory," Probl. Contr. Inform. Theory, vol. 4, pp. 97-102, 1975.
[6] A. Lapidoth and P. Narayan, "Reliable communication under channel uncertainty," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2148-2177, Oct. 1998.
[7] I. Csiszár and P. Narayan, "The capacity of the arbitrarily varying channel revisited: Positivity, constraints," IEEE Transactions on Information Theory, vol. 34, no. 2, March 1988.
[8] M. Agarwal, A. Sahai, and S. Mitter, "Coding into a source: A direct inverse rate-distortion theorem," arXiv:cs/0610142v1 [cs.IT]. Originally presented at Allerton 2006.
[9] O. Shayevitz and M. Feder, "The posterior matching feedback scheme: Capacity achieving and error analysis," in Proc. IEEE International Symposium on Information Theory (ISIT 2008), pp. 900-904, July 2008.
[10] I. Csiszár, "The method of types," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2505-2523, Oct. 1998.
[11] A. Tchamkerten and I. E. Telatar, "Variable length coding over an unknown channel," IEEE Transactions on Information Theory, vol. 52, no. 5, May 2006.
[12] M. V. Burnashev, "Data transmission over a discrete channel with feedback: Random transmission time," Problems of Information Transmission, vol. 12, no. 4, pp. 250-265, 1976.
[13] N. Shulman and M. Feder, "The uniform distribution as a universal prior," IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1356-1362, June 2004.
[14] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York: Wiley-Interscience, 2006.
[15] R. Zamir and U. Erez, "A Gaussian input is not too bad," IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1362-1367, June 2004.
[16] B. Hassibi and B. M. Hochwald, "How much training is needed in multiple-antenna wireless links?," IEEE Transactions on Information Theory, vol. 49, no. 4, pp. 951-963, April 2003.
[17] A. Lapidoth, "Nearest neighbor decoding for additive non-Gaussian noise channels," IEEE Transactions on Information Theory, vol. 42, no. 5, pp. 1520-1529, Sep. 1996.
[18] B. Hughes and P. Narayan, "Gaussian arbitrarily varying channels," IEEE Transactions on Information Theory, vol. 33, no. 2, pp. 267-284, Mar. 1987.
[19] N. Shulman, "Communication over an Unknown Channel via Common Broadcasting," Ph.D. dissertation, Tel Aviv University, 2003.
[20] O. Shayevitz and M. Feder, "Communication with feedback via posterior matching," in Proc. IEEE International Symposium on Information Theory (ISIT 2007), pp. 391-395, June 2007.
[21] M. Horstein, "Sequential transmission using noiseless feedback," IEEE Transactions on Information Theory, pp. 136-143, July 1963.
[22] J. P. M. Schalkwijk, "A coding scheme for additive noise channels with feedback part II: Band-limited signals," IEEE Transactions on Information Theory, vol. IT-12, pp. 183-189, 1966.
[23] D. Blackwell, L. Breiman, and A. J. Thomasian, "The capacities of certain channel classes under random coding," Annals of Mathematical Statistics, vol. 31, pp. 558-567, 1960.
[24] A. Lapidoth and I. E. Telatar, "The compound channel capacity of a class of finite-state channels," IEEE Transactions on Information Theory, vol. 44, no. 3, pp. 973-983, May 1998.
[25] A. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743-2760, Oct. 1998.
[26] I. Csiszár and P. Narayan, "Channel capacity for a given decoding metric," IEEE Transactions on Information Theory, vol. 41, no. 1, pp. 35-43, Jan. 1995.
[27] E. W. Weisstein, "Ball," from MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/Ball.html
[28] E. W. Weisstein, "Gamma Function," from MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/GammaFunction.html