Communication over Individual Channels
Yuval Lomnitz, Meir Feder
Tel Aviv University, Dept. of EE-Systems
Email: {yuvall,meir}@eng.tau.ac.il
Abstract—We consider the problem of communicating over a
channel for which no mathematical model is specified. We present
achievable rates as a function of the channel input and output
known a-posteriori for discrete and continuous channels, as well
as a rate-adaptive scheme employing feedback which achieves
these rates asymptotically without prior knowledge of the channel
behavior.
I. INTRODUCTION
The problem of communicating over a channel with an
individual, predetermined noise sequence which is not known
to the sender and receiver was addressed by Shayevitz and
Feder [1] [2] and Eswaran et al [3][4]. The simple example
discussed in [1] is of a binary channel yn = xn ⊕ en where
the error sequence en can be any unknown sequence. Using
perfect feedback and common randomness, communication is shown to be possible at a rate approaching the capacity of the binary symmetric channel (BSC) whose error probability equals the empirical error probability of the sequence (the relative number of '1'-s in e^n). Subsequently both authors extended this model to general discrete channels and modulo-additive channels ([3], [2] resp.) with an individual state sequence, and showed that the empirical mutual information can be attained.
Now we take this model one step further. We consider
a channel where no specific probabilistic or mathematical
relation between the input and the output is assumed. In order
to define positive communication rates without assumptions
on the channel, we characterize the achievable rate using the
specific input and output sequences, and we term this channel
an individual channel. This way of treating unknown channels differs from other approaches to the problem, such as compound channels and arbitrarily varying channels, in that the latter require a specification of the channel model up to some unknown parameters, whereas the current approach makes no a-priori assumptions about
the channel behavior. We usually assume the existence of a
feedback link in which the channel output or other information
from the decoder can be sent back to the encoder. Without
this feedback it would not be possible to match the rate of
transmission to the quality of the channel so outage would be
inevitable.
Although one may not be fully comfortable with the mathematical formulation of the problem, there is no question about
the reality of this model: this is the only channel model that
we know for sure exists in nature. This point of view is similar
to the approach used in universal source coding of individual
sequences where the goal is to asymptotically attain for each
sequence the same coding rate achieved by the best encoder
from a model class, tuned to the sequence.
Just to inspire thought, let's ask the following question: suppose the sequence {x_i}_{i=1}^n with power P = (1/n) Σ_{i=1}^n x_i² encodes a message and is transmitted over a continuous real-valued input channel. The output sequence is {y_i}_{i=1}^n. One can think of v_i = y_i − x_i as a noise sequence and measure its power N = (1/n) Σ_{i=1}^n v_i². Is the rate R = ½ log(1 + P/N), which is the Gaussian channel capacity, achievable in this case, under appropriate definitions?
The way it was posed, the answer to this question would be "no", since this model predicts a rate of ½ bit/use for the channel whose output is ∀i : y_i = 0, which cannot convey any information. However, with the slight restatement done in the next section the answer would be "yes".
We consider two classes of individual channels: discrete
input and output channels and continuous real valued input
and output channels, and two communication models: with
feedback and without feedback. In both cases we assume
common randomness exists. The case of feedback is of higher
interest, since the encoder can adapt the transmission rate
and avoid outage. The case of no-feedback is used as an
intermediate step, but the results are interesting since they
can be used for analysis of semi-probabilistic models. The main result is that with a small amount of feedback, communication at a rate close to the empirical mutual information (or its Gaussian equivalent for continuous channels) can be achieved, without any prior knowledge, or assumptions, about the channel structure.
The paper is organized as follows: in section II we give a
high level overview of the results. In section III-B we define
the model and notation. Section IV deals with communication
without feedback where the results pertaining to discrete and
continuous case are formalized and proven, and the choice
of the rate function and the Gaussian prior for the continuous
case is justified. Section V deals with the case where feedback
is present. After reviewing similar results we state the main
result and the adaptive rate scheme that achieves it, and delay
the proof to section VI. Here, the error probability and the
achieved rate are analyzed and bounded. Section VII gives
several examples, and section VIII is dedicated to comments
and highlights areas for further study.
II. OVERVIEW OF MAIN RESULTS
We start with a high level overview of the definitions
and results. The definitions below are conceptual rather than
accurate, and detailed definitions follow in the next sections.
A rate function is a function Remp : X n × Y n → R of
the input and output sequences. In communication without
feedback we say a given rate function is achievable if for
large block size n → ∞, it is possible to communicate at
rate R and an arbitrarily small error probability is obtained
whenever Remp exceeds the rate of transmission, i.e. whenever
Remp (x, y) > R. In communication with feedback we say a
given rate function is achieved by a communication scheme
if for large block size n, data at rate close to or exceeding
Remp (x, y) is decoded successfully with arbitrarily large
probability for every output sequence and almost every input
sequence. Roughly speaking, this means that in any instance
of the system operation, where a specific x was the input and
a specific y was the output, the communication rate had been
at least Remp (x, y). Note that the only statistical assumptions
are related to the common randomness, and we consider the
rate and error probability conditioned on a specific input and
output, where the error probability is averaged over common
randomness. We say that a rate function Remp is an optimal (but not the optimal) function if any R′emp ≥ Remp which is strictly larger than Remp at at least one point, is not achievable.
The definition of achievability is not complete without
stating the input distribution, since it affects the empirical
rate. For example, by setting x = 0 one can attain every rate
function where Remp (0, y) = 0 in a void way, since other x
sequences will never appear. Different from classical results in
information theory, we do not use the input distribution only as
a means to show the existence of good codes: taking advantage
of the common randomness we require the encoder to emit
input symbols that are random and distributed according to a
defined prior (currently we assume i.i.d. distribution).
The choice of the rate functions is arbitrary in a way: for
any pair of encoder and decoder, we can tailor a function
Remp (x, y) as a function equaling the transmitted rate whenever the error probability given the two sequences (averaged
over messages and the common randomness) is sufficiently
small, and 0 otherwise. However it is clear that there are
certain rates which cannot be exceeded uniformly. Our interest
will focus on simple functions of the input and output,
and specifically in this paper we focus on functions of the
instantaneous (zero order) empirical statistics. Extension to
higher order models seems technical.
For the discrete channel we show that a rate

Remp = Î(x; y)   (1)

is achievable with any input distribution P_X where Î(·; ·) denotes the empirical mutual information [5] (see definition in section III-B, and Theorems 1, 3). For the continuous (real valued) channel we show that a rate

Remp = ½ log( 1 / (1 − ρ̂(x, y)²) )   (2)
is achievable with Gaussian input distribution N (0, P ), where
ρ̂ is the empirical correlation factor between the input and
output sequences (see Theorems 2, 4). These results pertain
both to the case of feedback and of no-feedback according to
the definitions above.
Throughout the current paper we define the correlation factor in a slightly non-standard way as ρ = E(XY)/√(E(X²)E(Y²)) (that is, without subtracting the mean). This is done only to simplify definitions and derivations, and similar claims can be made using the correlation factor defined in the standard way. Although the result regarding the continuous case is less tight, we show that this is the best rate function that can be defined by second order moments, and is tight for the Gaussian additive channel (for this channel ρ² = P/(P + N), therefore Remp = ½ log(1 + P/N)).
We may now rephrase our example question from the introduction so that it will have an affirmative answer: given the input and output sequences, describe the output by the virtual additive channel with a gain, y_i = αx_i + v_i, so the effective noise sequence is v_i = y_i − αx_i. Choose α so that v ⊥ x, i.e. (1/n) Σ_i v_i x_i = 0. An equivalent condition is that α minimizes ‖v‖². The resulting α is the LMMSE coefficient in estimation of y from x (assuming zero mean), i.e. α = xᵀy/‖x‖². Define the effective noise power as N = (1/n) Σ_{i=1}^n v_i², and the effective SNR ≡ α²P/N. It is easy to check that SNR = ρ̂²/(1 − ρ̂²) where ρ̂ = xᵀy/(‖x‖·‖y‖) is the empirical correlation factor between x and y. Then according to Eq.(2) the rate R = ½ log(1 + SNR) is achievable, in the sense defined above. Reexamining the counter example we gave above, in this model if we set y = 0 we obtain ρ̂ = 0 and therefore Remp = 0, or equivalently the effective channel has v = 0 and α = 0, therefore SNR = 0 (instead of v = −x, α = 1 and SNR = 1).
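As a concrete illustration of this effective-channel computation, the following is a minimal sketch of ours (function names and the toy example are not from the paper): it fits the LMMSE gain α, measures the effective noise power, and evaluates the empirical rate of Eq.(2) for arbitrary input/output sequences.

```python
# Sketch (ours): effective channel y = alpha*x + v, effective SNR and Remp of Eq. (2).
import numpy as np

def effective_channel_rate(x: np.ndarray, y: np.ndarray):
    """Return (alpha, N, SNR, Remp) for the virtual additive channel."""
    alpha = float(x @ y) / float(x @ x)            # LMMSE gain, makes v orthogonal to x
    v = y - alpha * x                              # effective noise sequence
    N = float(v @ v) / len(x)                      # effective noise power
    norm_y = np.linalg.norm(y)
    rho = 0.0 if norm_y == 0 else float(x @ y) / (np.linalg.norm(x) * norm_y)
    snr = rho**2 / (1 - rho**2)                    # matches alpha^2 * P / N when defined
    r_emp = 0.5 * np.log2(1 + snr)                 # bits per channel use, Eq. (2)
    return alpha, N, snr, r_emp

# Example: a noisy linear channel vs. the degenerate output y = 0.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 10_000)
print(effective_channel_rate(x, 0.8 * x + rng.normal(0, 0.5, x.size)))
print(effective_channel_rate(x, np.zeros_like(x)))   # gives Remp = 0, as argued above
```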
As will be seen, we achieve these rates by random coding
and universal decoders. For the case of feedback we use
iterated instances of rateless coding (i.e. we encode a fixed
number of bits and the decision time depends on the channel).
The scheme is able to operate asymptotically with ”zero rate”
feedback (meaning any positive capacity of the feedback channel suffices). A similar although more complicated scheme was
used in [3] (see a comparison in the appendix).
Before the detailed presentation we would like to examine the differences between the model used here and two closely related models: the arbitrarily varying channel (AVC) and the channel with an individual noise sequence.
In the AVC (see for example [6][7]), the channel is defined
by a probabilistic model which includes an unknown state
sequence. Constraints on the sequence (such as power, number
of errors) may be defined, and the target is to communicate equally well over all possible occurrences of the state
sequence. In AVC, the capacity depends on the existence of
common randomness and on whether the average or maximum
error probability (over the messages) is required to approach 0,
yet when sufficient common randomness is used, the capacities
for maximum and average error probability are equal. The
notes in [6] regarding common randomness and randomized
encoders (see p.2151) are also relevant to our case.
A treatment of AVC-s which is similar in spirit to our results
exists in watermarking problems. For example a rather general
case of AVC is discussed in [8]. They consider communication
over a black box (representing the attacker) which is only
limited to a given level D of distortion according to a
predefined metric, but has otherwise a block-wise undefined
behavior. They show that it is possible to achieve a rate equal
to the rate-distortion function of the input RX (D), if the
black box guarantees a given level of average distortion in
high probability. This result is similar to our Theorem 1. The
remarkable distinction from other results for AVC is that the
rate is determined using a constraint on the channel inputs
and outputs, rather than the channel state sequence. We note
that for the Gaussian additive channel the above result is suboptimal since the rate is R_X(N) = ½ log(P/N), and our results improve on it by using the correlation factor rather than the mean squared error. See further discussion of these results in the proof of Lemma 1 and the discussion following Theorem 3.
Channels with individual noise (or state) sequence are
treated by Shayevitz and Feder [1][2] and Eswaran et al [3].
The probabilistic setting is the same as in the AVC, and
the difference is that instead of achieving a uniform (hence
worst-case) rate, the target is to achieve a variable rate which
depends on the particular sequence of noise, using a feedback
link. In this setup, prior constraints on the state sequence can
be relaxed. As opposed to AVC where the capacity is well
defined, the target rate for each state sequence is determined
in a somewhat arbitrary way (since many different constraints
on the sequence can be defined). As an example, in the binary
channel of [1], a rate of 0 would be obtained for the sequence e = '01010101...' since the empirical error probability is ½, although obviously a scheme which favors this specific sequence and achieves a rate of 1 can be designed. On the
other hand, with the AVC approach communication over this
channel would not be possible without prior constraints on the
noise sequence. Channels with individual noise sequence can
be thought of as compound-AVCs (i.e. an AVC with unknown
parameter, in this case, the constraint). As in AVC, existence
of common randomness as well as the definition of error
probability affect the achievable rates.
In the individual channel model we use here, since no equation connecting the input and output through a state sequence is given, the achievable rates cannot be defined without relating to the channel input. Therefore the definitions of achieved rates depend in a somewhat circular way on the channel input, which is determined by the scheme itself. Currently we circumvent this difficulty by constraining the input distribution, as mentioned above.
In many respects the model used in this paper is more stringent than the AVC and the individual noise sequence models, since it makes fewer assumptions on the channel, and the error probability is required to be met for (almost) every input and output sequence (rather than on average). In other respects it is lenient, since we may attribute 'bad' channel behavior to the rate rather than suffer an error; therefore the error exponents are better than in probabilistic models. This is further explained in section IV-A.
The model we propose suggests a new approach for the
design of communication systems. The classical point of
view first assumes a channel model and then devises a
communication system optimized for it. Here we take the
inverse direction: we devise a communication system without
assumptions on the channel which guarantees rates depending
on channel behavior. This change of viewpoint does not make
probabilistic or semi probabilistic channel models redundant
but merely suggests an alternative. By using a channel model
we can formalize questions relating to optimality such as
capacity (single user, networks) and error exponent as well
as guarantee a communication rate a-priori. Another aspect
is that we pay a price for universality. Even if one considers
an individual channel scheme that guarantees asymptotically
optimum rates over a large class of channels, it can never
consider all possible channels (block-wise), and for a finite
block size it will have a larger overhead (a reduction in
the amount of information communicated with same error
probability) compared to a scheme optimized for the specific
channel.
Following our results, the individual channel approach becomes a very natural starting point for determining achievable
rates for various probabilistic and arbitrary models (AVC-s,
individual noise sequences, probabilistic models, compound
channels) under the realm of randomized encoders, since
the achievable rates for these models follow easily from the
achievable rates for specific sequences, and the law of large
numbers. We will give some examples later on.
III. DEFINITIONS AND NOTATION
A. Notation
In general we use uppercase letters to denote random
variables, respective lowercase letters to denote their sample
values and boldface letters to denote vectors, which are by
default of length n. However we deviate from this practice
when the change of case leads to confusion, and vectors
are always denoted by lowercase letters even when they are
random variables.
‖x‖ ≡ √(xᵀx) denotes the L2 norm. We denote by P ◦ Q the product of conditional probability functions, e.g. (P ◦ Q)(x, y) = P(x) · Q(y|x). A hat (ˆ) denotes an estimated value.
We denote the empirical distribution as P̂ (e.g. P̂_(x,y)(x, y) ≡ (1/n) Σ_{i=1}^n δ_{(x_i−x),(y_i−y)}). The source vectors x, y and/or the variables x, y are sometimes omitted when they are clear from the context. We denote by Ĥ(·), Î(·; ·), ρ̂(·; ·) the empirical entropy, the empirical mutual information and the empirical correlation factor, which are the respective values calculated for the empirical distribution. All expressions such as Ĥ(x), Ĥ(x|y), Î(x; y), Î(x; y|z), Î(x; y|z = z₀) are interpreted as their respective probabilistic counterparts H(X), H(X|Y), I(X; Y), I(X; Y|Z), I(X; Y|Z = z₀) where (X, Y, Z) are random variables distributed according to the empirical distribution of the vectors P̂_(x,y,z), or equivalently are defined as a random selection of an element of the vectors, i.e. (X, Y, Z) = (x_i, y_i, z_i), i ∼ U{1, . . . , n}. It is clear from this equivalence that relations on entropy and mutual information (e.g. positivity, chain rules) are directly translated to relations on their empirical counterparts.
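For concreteness, here is a small self-contained sketch of ours (not part of the paper) of these empirical quantities: the joint type of two finite-alphabet sequences and the empirical mutual information computed from it.

```python
# Sketch (ours): joint type and empirical mutual information of finite-alphabet sequences.
import numpy as np
from collections import Counter

def empirical_joint(x, y):
    """Joint empirical distribution P̂_(x,y) as a dict {(a, b): probability}."""
    n = len(x)
    return {ab: c / n for ab, c in Counter(zip(x, y)).items()}

def empirical_mutual_information(x, y, base=2.0):
    """Î(x; y): the mutual information of the joint type of (x, y)."""
    p_xy = empirical_joint(x, y)
    p_x, p_y = Counter(), Counter()
    for (a, b), p in p_xy.items():
        p_x[a] += p
        p_y[b] += p
    return sum(p * np.log(p / (p_x[a] * p_y[b]))
               for (a, b), p in p_xy.items()) / np.log(base)

# Example: one realization of a BSC-like channel with crossover ~0.1.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 100_000)
y = x ^ (rng.random(x.size) < 0.1).astype(int)
print(empirical_mutual_information(x, y))   # close to 1 - h_b(0.1) ≈ 0.53 bits
```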
We apply superscript and subscript indices to vectors to define subsequences in the standard way, i.e. x_i^j ≡ (x_i, x_{i+1}, . . . , x_j), x^i ≡ x_1^i.
We denote by I(P, W) the mutual information I(X; Y) when (X, Y) ∼ P(x) · W(y|x). U(A) denotes a uniform distribution over the set A. Ber(p) denotes the Bernoulli distribution, and h_b(p) ≡ H(Ber(p)) = −p log p − (1 − p) log(1 − p) denotes the binary entropy function. The indicator function Ind(E), where E is a set or a probabilistic event, is defined as 1 over the set (or when the event occurs) and 0 otherwise.
The functions log(·) and exp(·) as well as the information theoretic quantities H(·), I(·; ·), D(·‖·) refer to the same, unspecified base. We use the term "information unit" as the unit of these quantities (equals 1/log(2) bits).
The notation f_n = O(g_n) and f_n < O(g_n) (or equivalently O(f_n) = O(g_n) and O(f_n) < O(g_n)) means f_n/g_n → const > 0 and f_n/g_n → 0 as n → ∞, respectively.
Throughout this paper we use the term ”continuous” to refer
to the continuous real valued channel R → R, although this
definition does not cover all continuous input - continuous
output channels. By the term ”discrete” in this paper we always
refer to finite alphabets (as opposed to countable ones).
B. Definitions
Definition 1 (Channel). A channel is defined by a pair of
input and output alphabets X , Y, and denoted X → Y
Definition 2 (Fixed rate encoder, decoder, error probability).
A randomized block encoder and decoder pair for the channel
X → Y with block length n and rate R without feedback is
defined by a random variable S distributed over the set S, a
mapping φ : {1, 2, . . . exp(nR)} × S → X n and a mapping
φ̄ : Y^n × S → {1, 2, . . . , exp(nR)}. The error probability for message w ∈ {1, 2, . . . , exp(nR)} is defined as

P_e^(w)(x, y) = Pr( φ̄(y, S) ≠ w | φ(w, S) = x )   (3)

where for x such that the condition cannot hold, we define P_e^(w)(x, y) = 0.
Note that the encoder rate must pertain to a discrete number
of messages exp(nR) ∈ Z+ , but the empirical rates defined
in the following theorems may be any positive real numbers.
Definition 3 (Adaptive rate encoder, decoder, error probability). A randomized block encoder and decoder pair for
the channel X → Y with block length n, adaptive rate and
feedback is defined as follows:
• The message w is expressed by the infinite sequence
w1∞ ∈ {0, 1}∞
• The common randomness is defined as a random variable
S distributed over the set S
• The feedback alphabet is denoted F
• The encoder is defined by a series of mappings xk =
φk (w, s, f k−1 ) where φk : {0, 1}∞ × S × F k−1 → X .
• The decoder is defined by the feedback function ϕk :
Y k−1 × S → F, the decoding function φ̄ : Y n × S →
{0, 1}∞ and the rate function r : Y n × S → R+ (where
the rate is measured in bits), applied as follows:
f_k = ϕ_k(y^k, S)   (4)
ŵ = φ̄(y, S)   (5)
R = r(y, S)   (6)

The error probability for message w is defined as

P_e^(w)(x, y) = Pr( ŵ_1^⌈nR⌉ ≠ w_1^⌈nR⌉ | x, y )   (7)
In other words, a recovery of the first ⌈nR⌉ bits by the
decoder is considered a successful reception. For x such that
(w)
the condition cannot hold, we define Pe (x, y) = 0. The
conditioning on y is mainly for clarification, since it can be
treated as a fixed vector. This system is illustrated in figure 2.
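For readers who prefer an operational view, the mappings of Definition 3 can be summarized by the following hypothetical interface; the names and typing are ours and purely illustrative.

```python
# Hypothetical interface mirroring Definition 3 (names are ours, not the paper's):
# the encoder maps (message bits, common randomness, past feedback) to the next
# channel input; the decoder produces per-symbol feedback and, at the end of the
# transmission, a decoded bit stream together with a declared rate.
from typing import Protocol, Sequence, Tuple

class AdaptiveEncoder(Protocol):
    def next_input(self, w_bits: Sequence[int], s, feedback: Sequence) -> object:
        """x_k = phi_k(w, s, f^{k-1}); called once per channel use."""

class AdaptiveDecoder(Protocol):
    def feedback(self, y_so_far: Sequence, s) -> object:
        """f_k = varphi_k(y^k, S), sent back after observing y_k."""
    def decode(self, y: Sequence, s) -> Tuple[Sequence[int], float]:
        """(w_hat, R) = (phi_bar(y, S), r(y, S)); the first ceil(nR) bits count."""
```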
Note that if we are not interested in limiting the feedback
rate, and perfect feedback can be assumed, the definition of
feedback alphabet and feedback function is redundant (in this
case F = Y and fk = yk ). The model in which the decoder
determines the transmission rate is lenient in the sense that
it gives the flexibility to exchange rate for error probability:
the decoder may estimate the error probability and decrease
it by reducing the decoding rate. In the scheme we discuss
here the rate is determined during reception, but it’s worth
noting in this context the posterior matching scheme [9] for
the known memoryless channel. In this scheme the message is
represented as a real number θ ∈ [0, 1) and the rate for a given
error probability Pe can be determined after the decoding
by calculating Pr(θ|y) and finding the smallest interval with
probability at least 1 − Pe .
IV. COMMUNICATION WITHOUT FEEDBACK
In this section we show that the empirical mutual information (in the discrete case) and its Gaussian counterpart (in the
continuous case) are achievable in the sense defined in the
overview. For the continuous case we justify the choice of the
Gaussian distribution as the one yielding the maximum rate
function that can be defined by second order moments.
A. The discrete channel without feedback
The following theorem formalizes the achievability of rate Î(x; y) without feedback:
Theorem 1 (Non-adaptive, discrete channel). Given discrete
input and output alphabets X , Y, for every Pe > 0, δ > 0,
prior Q(x) over X and rate R > 0 there exists n large enough
and a random encoder-decoder pair of rate R over block size
n, such that the distribution of the input sequence is x ∼ Qn
and the probability of error for any message given an input
sequence x ∈ X^n and output sequence y ∈ Y^n is not greater than Pe if Î(x; y) > R + δ.
Theorem 1 follows almost immediately from the following
lemma, which is proven in the appendix using a simple calculation based on the method of types [10]:
Lemma 1. For any sequence y ∈ Y^n the probability of a sequence x ∈ X^n drawn independently according to Q^n to have Î(x; y) ≥ t is upper bounded by:

Q^n( Î(x; y) ≥ t ) ≤ exp(−n(t − δ_n))   (8)

where δ_n = |X||Y| log(n+1)/n → 0.
Following notations in [10], Q^n(A) denotes the probability of the event A, or equivalently of the set of sequences A, under the i.i.d. distribution Q^n. Remarkably, this bound does not depend on Q.

Fig. 1. Non rate adaptive encoder-decoder pair without feedback.
Fig. 2. Rate adaptive encoder-decoder pair with feedback.
To prove Theorem 1, the codebook {x_m}_{m=1}^{exp(nR)} is randomly generated by i.i.d. selection of its L = exp(nR) · n letters, so that the common randomness S ∈ X^L may be defined as the codebook itself and is distributed Q^L. The encoder sends the w-th codeword, and the decoder uses maximum mutual information (MMI) decoding, i.e. chooses:

ŵ = φ̄(y, {x_m}) = argmax_m [ Î(x_m; y) ]   (9)
where ties are broken arbitrarily. By Lemma 1, the probability of error is bounded by:

P_e^(w)(x_w, y) ≤ Pr( ∪_{m≠w} { Î(x_m; y) ≥ Î(x_w; y) } ) ≤
  ≤ exp(nR) exp(−n(Î(x_w; y) − δ_n)) =
  = exp(−n(Î(x_w; y) − R − δ_n))   (10)

For any δ there is n large enough such that −log(P_e)/n + δ_n < δ. For this n, whenever Î(x; y) > R + δ we have

P_e^(w)(x, y) ≤ exp(−n(δ − δ_n)) < P_e   (11)
which proves the theorem.
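A compact sketch of ours of this random-codebook construction and the MMI decoding rule of Eq.(9) follows; the binary alphabet, uniform prior and toy channel are illustrative choices only, and the codebook size is deliberately kept tiny.

```python
# Sketch (ours): random codebook drawn i.i.d. from Q and MMI decoding, Eq. (9).
import numpy as np
from collections import Counter

def emp_mi(x, y):
    """Empirical mutual information (bits) of the joint type of (x, y)."""
    n = len(x)
    pxy = Counter(zip(x, y)); px = Counter(x); py = Counter(y)
    return sum((c / n) * np.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mmi_decode(codebook, y):
    """Return the index m maximizing Î(x_m; y) (ties broken by argmax)."""
    return int(np.argmax([emp_mi(x_m, y) for x_m in codebook]))

# Toy usage over an unmodeled "black box" channel (here: a noisy mapping).
rng = np.random.default_rng(2)
n, R = 200, 0.3                                   # block length, rate in bits/use
M = min(int(np.exp2(n * R)), 64)                  # exp(nR) messages; capped for the toy example
codebook = rng.integers(0, 2, (M, n))             # i.i.d. Q = Ber(1/2) codewords
w = 5
y = codebook[w] ^ (rng.random(n) < 0.05).astype(int)
print(mmi_decode(codebook, y) == w)
```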
Note that the MMI decoder used here is a popular universal
decoder (see [5][10][11]), and was shown to achieve the same
error exponent as the maximum likelihood decoder for fixed
composition codes. The error exponent obtained here is better
than the classical error exponent (slope of -1), and the reason
is that the behavior of the channel is known, and therefore
no errors occur as result of non-typical channel behavior.
Comparing for example with the derivation of the random
coding error exponent for the probabilistic DMC based on the
method of types (see [10]), in the latter the error probability is
summed across all potential ”behaviors” (conditional types)
of the channel accounting for their respective probabilities
(resulting in one behavior, usually different from the typical
behavior, dominating the bound), while here the behavior
of the channel (the conditional distribution) is fixed, and
therefore the error exponent is better. This is not necessarily
the best error exponent that can be achieved (see [11][12]
which discuss error exponent with random decision time and
feedback for probabilistic and compound models).
Note that the empirical mutual information is always well
defined, even when some of the input and output symbols do
not appear in the sequence, since at least one input symbol and
one output symbol always appear. For the particular case of
empirical mutual information measured over a single symbol,
the empirical distributions become unit vectors (representing
constants) and their mutual information is 0.
In this discussion we have not dealt with the issue of choosing the prior Q(x). Since the channel behavior is unknown it
makes sense to choose the maximum entropy, i.e. the uniform,
prior which was shown to obtain a bounded loss from capacity
[13].
B. The continuous channel without feedback
When turning to define empirical rates for the real valued
alphabet case, the first obstacle we tackle is the definition
of the empirical mutual information. A potential approach
is to use discrete approximations. We only briefly describe
this approach since it is somewhat arbitrary and less elegant
than in the discrete case. The main focus is on empirical rates
defined by the correlation factor. Although the latter approach is pessimistic and falls short of the mutual information for most channels, it is much simpler and more elegant than discrete approximations. We believe this approach can be further
extended to obtain results closer to the (probabilistic) mutual
information.
1) Discrete approximations: Define the continuous input and output alphabets X, Y. Suppose Q is an arbitrary (continuous) prior. Define input and output quantizers to discrete alphabets A_n : X → X̃_n and B_n : Y → Ỹ_n where X̃_n, Ỹ_n are discrete alphabets of growing size, chosen to grow slowly enough so that δ_n = |X̃_n||Ỹ_n| log(n+1)/n → 0 as n → ∞. Define the empirical mutual information between continuous vectors as the empirical mutual information between their quantized versions (quantized letter by letter):

Î_{A,B}(x, y) ≡ Î(A_n(x), B_n(y))   (12)
Then based on Lemma 1, by using a random codebook drawn
according to Q and applying a maximum mutual information
decoder using the above definition, we could asymptotically
achieve the rate function Remp = IˆA,B (x, y) based on the
definitions of Theorem 1. The main issue with this approach
is that determining An , Bn is arbitrary, and especially Bn is
difficult to define when the output range is unknown. Therefore
in the following we focus on the suboptimal approach using
the correlation factor.
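As a rough illustration of the quantized definition in Eq.(12), the sketch below is ours; the uniform quantizer and the alphabet sizes are arbitrary illustrative choices rather than the quantizers A_n, B_n of the text.

```python
# Sketch (ours): empirical mutual information of letter-by-letter quantized real sequences.
import numpy as np
from collections import Counter

def quantize(v, num_levels, lo, hi):
    """Uniform scalar quantizer onto {0, ..., num_levels-1} over [lo, hi]."""
    return np.clip(((v - lo) / (hi - lo) * num_levels).astype(int), 0, num_levels - 1)

def emp_mi(x, y):
    n = len(x)
    pxy = Counter(zip(x, y)); px = Counter(x); py = Counter(y)
    return sum((c / n) * np.log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())

def quantized_emp_mi(x, y, levels=8):
    xq = quantize(x, levels, x.min(), x.max())
    yq = quantize(y, levels, y.min(), y.max())
    return emp_mi(xq, yq)

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 50_000)
y = x + rng.normal(0, 1, x.size)      # unit-SNR additive channel
print(quantized_emp_mi(x, y))         # approaches 1/2*log2(1+SNR) = 0.5 bit as levels grow
```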
2) Choosing the input distribution and rate function: First
we justify our choice of the Gaussian input distribution and
the aforementioned rate function. We take the point of view
of a compound (probabilistic, unknown) channel. If a rate
function cannot be attained for compound channel model,
it cannot be attained also in the more stringent individual
model. It is well known that for a memoryless additive noise
channel with constraints on the transmit power and noise
variance, the Gaussian noise is the worst noise when the
prior is Gaussian, and the Gaussian prior is the best prior
when the noise is Gaussian. Thus by choosing a Gaussian
prior we choose the best prior for the worst noise, and can
we guarantee the mutual information will equal, at least, the
Gaussian channel capacity. See the ”mutual information game”
(problem 9.21) in [14]. For the additive noise channel [15]
shows the loss from capacity when using Gaussian distribution
is limited to 12 a bit. However the above is true only for additive
noise channels. For the more general where no additivity is
assumed case we show below (Lemma 3) that the rate function
R = − 12 log(1−ρ2 ) is the best rate function that can be defined
by second order moments, and attained universally. Of course,
this proof merely supplies the motivation to use a Gaussian
distribution and does not rid us from the need to prove this
rate is achievable for specific, individual sequences.
Lemma 2. Let X, Y be two continuous random variables with correlation factor ρ ≡ E(XY)/√(E(X²)E(Y²)), where X is Gaussian, X ∼ N(0, P). Then I(X; Y) ≥ −½ log(1 − ρ²).

Corollary 2.1. Equality holds iff X, Y are jointly Gaussian.

Corollary 2.2. The lemma does not hold for general X (not Gaussian).

The proof is given in the appendix. Note that −½ log(1 − ρ²) is the mutual information of two Gaussian r.v-s ([14], example 8.5.1). Also note the relation to Theorem 1 in [16] dealing with an additive channel with uncorrelated, but not necessarily independent noise. The following lemma justifies our selection of R(ρ) = −½ log(1 − ρ²):

Lemma 3. Let Q(x) be an input prior, W(y|x) be an unknown channel, Λ(Q, W) be the correlation matrix Λ ≡ E[(X, Y)ᵀ(X, Y)] between X, Y induced by the joint probability Q ◦ W, and ρ(Q, W) be the correlation factor induced by Q, W (ρ = Λ₁₂/√(Λ₁₁Λ₂₂)). We say a function R(Λ) is an attainable second order rate function if there exists a Q(x) such that for every channel W(y|x) inducing correlation Λ the mutual information is at least R(Λ) (in other words, can carry the rate R(Λ)). Then R(Λ) = −½ log(1 − ρ²) is the largest attainable second order rate function.

Alternatively this can be stated as:

R(Λ) ≡ max_Q min_{W : Λ(Q,W)=Λ} I(Q, W) = −½ log(1 − ρ²)   (13)

Proof of lemma 3: R(Λ) = −½ log(1 − ρ²) is attainable by selecting an input prior Q = N(0, σ_x²), and by Lemma 2 the mutual information is at least R(Λ) for all channels. R(Λ) is the maximum attainable function since by writing the condition of the lemma for the additive white Gaussian noise (AWGN) channel W* (a specific choice of W) and any Q, we have R(Λ) ≤ I(Q, W*) ≤ I(N(0, E_P(X²)), W*) = −½ log(1 − ρ²), where the inequalities follow from the conditions of the lemma on R and from the fact that the Gaussian prior achieves the AWGN capacity.

3) Communication scheme for the empirical channel (without feedback): The following theorem is the analogue of Theorem 1 where the expression −½ log(1 − ρ²) (interpreted as the Gaussian effective mutual information) plays the role of mutual information.

Theorem 2 (Non-adaptive, continuous channel). Given the channel R → R, for every Pe > 0, δ > 0, power P > 0 and rate R > 0 there exists n large enough and a random encoder-decoder pair of rate R over block size n, such that the distribution of the input sequence is x ∼ N^n(0, P) and the probability of error for any message given an input sequence x and output sequence y with empirical correlation ρ̂ is not greater than Pe if Remp = ½ log( 1 / (1 − ρ̂²) ) > R + δ.

As before, the theorem will follow easily from the following lemma, proven in the appendix.

Lemma 4. Let x, y ∈ R^n be two sequences, and ρ̂ ≡ xᵀy/(‖x‖‖y‖) be the empirical correlation factor. For any y, the probability of x drawn according to N^n(0, σ_x²) to have |ρ̂| ≥ t is bounded by:

Pr(|ρ̂| ≥ t) ≤ 2 exp(−(n − 1)R₂(t))   (14)

where

R₂(t) ≡ ½ log( 1 / (1 − t²) )   (15)
To prove Theorem 2, the codebook {x_m}_{m=1}^{exp(nR)} is randomly generated by Gaussian i.i.d. selection of its L = exp(nR) · n letters, and the common randomness S ∈ X^L is defined as the codebook itself and is distributed N^L(0, P). The encoder sends the w-th codeword, and the decoder uses the maximum empirical correlation decoder, i.e. chooses:

ŵ = φ̄(y, {x_m}) = argmax_m |ρ̂(x_m; y)| = argmax_m |x_mᵀ y| / ‖x_m‖   (16)
where ties are broken arbitrarily. By Lemma 4, the probability of error is bounded by:

P_e^(w)(x_w, y) ≤ Pr( ∪_{m≠w} { |ρ̂(x_m; y)| ≥ |ρ̂(x_w; y)| } ) ≤
  ≤ exp(nR) · 2 exp(−(n − 1)R₂(ρ̂(x_w; y))) =
  = 2 exp(R) · exp(−(n − 1)(R₂(ρ̂) − R))   (17)

Choosing n large enough so that (1/(n−1))(R + log(2/Pe)) ≤ δ (where Pe is from Theorem 2) we have that when R₂(ρ̂) > R + δ:

P_e^(w)(x, y) ≤ 2 exp(R) · exp(−(n − 1)δ) ≤ Pe   (18)
which proves the theorem.
A note is due regarding the definition of ρ̂ in singular cases
where x or y are 0. The limit of ρ̂ as y → 0 is undefined (the
directional derivative may take any value in [0,1]), however
for consistency we define ρ̂ = 0 when y = 0. Since x is
generated from a Gaussian distribution we do not worry about
the event x = 0 since the probability of this event is 0.
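The following minimal sketch is ours, with illustrative parameters: it mirrors the random Gaussian codebook and the maximum empirical-correlation rule of Eq.(16), adopting the convention ρ̂ = 0 for y = 0 discussed above.

```python
# Sketch (ours): Gaussian codebook and maximum empirical-correlation decoding, Eq. (16).
import numpy as np

def rho_hat(x, y):
    ny = np.linalg.norm(y)
    return 0.0 if ny == 0 else abs(float(x @ y)) / (np.linalg.norm(x) * ny)

def max_correlation_decode(codebook, y):
    """Return argmax_m |rho_hat(x_m, y)|."""
    return int(np.argmax([rho_hat(x_m, y) for x_m in codebook]))

rng = np.random.default_rng(4)
n, P, M = 500, 1.0, 64
codebook = rng.normal(0, np.sqrt(P), (M, n))       # x ~ N^n(0, P), i.i.d. letters
w = 17
y = 0.7 * codebook[w] + rng.normal(0, 0.8, n)      # an arbitrary, unmodeled output
print(max_correlation_decode(codebook, y) == w)
print(0.5 * np.log2(1 / (1 - rho_hat(codebook[w], y)**2)))   # empirical rate, Eq. (2)
```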
It is worth spending a few words on the connections between the receivers used for the discrete and the continuous cases. Since the mutual information between two Gaussian r.v-s with correlation ρ is −½ log(1 − ρ²), one can think of this value as a measure of mutual information under Gaussian assumptions. Using this metric as an effective mutual information, and since it is an increasing function of |ρ|, the MMI decoder becomes a maximum empirical correlation decoder. On the other hand, the receiver we used can be identified as the GLRT (generalized likelihood ratio test) for the AWGN channel Y = αX + N(0, σ²) with α an unknown parameter, resulting from maximizing the likelihood of the codeword and the channel simultaneously:
ŵ = argmax_{x_m} max_α log Pr(y|x_m; α) =
  = argmin_m min_α ‖y − αx_m‖² = argmax_m (x_mᵀ y)² / ‖x_m‖² =
  = argmax_m ρ̂²(x_m, y)   (19)
m
The choice of the GLRT is motivated by considering the
individual channel as an effective additive channel with unknown gain (as presented in section II), combined with the
fact Gaussian noise is the worse. For discrete memoryless
channels it is easy to show that the GLRT (where the group of
channels consists of all DMC-s) is synonymous with the MMI
decoder (see [6]). Thus, we can identify the two decoders as
GLRT decoders, or equivalently as variants of MMI decoders.
In the sequel we sometimes use the term ”empirical mutual
information” in a broad sense that includes also the metric
− 12 log(1 − ρ̂2 ).
Regarding the receiver required to obtain the rates of Theorem 2, it is interesting to consider the simpler maximum projection receiver argmax_m |x_mᵀ y|. This receiver seems to differ from the maximum correlation receiver only in the term ‖x_m‖, which is nearly constant for large n due to the law of large numbers. Surprisingly, however, the maximum rate achievable with the projection receiver is only ½ρ̂², as can be shown by a simple calculation equivalent to Lemma 4 (simpler, since z = xᵀy is Gaussian). The reason is that when x is chosen independently of y, a large value of the projection (a non-typical event) is usually created by a sequence with power significantly exceeding the average (another non-typical event). When one non-typical event occurs there is no reason to believe the sequence is typical in other senses, thus the approximation ‖x_m‖ ≈ √(nP) is invalid. The correlation receiver normalizes by the power of x and compensates for this effect. An alternative receiver which yields the rates of Theorem 2 and is similar to the AEP receiver looks for the codeword with the maximum absolute projection subject to power limited to (1/n)‖x_m‖² < P + ε. This can be shown by Sanov's theorem [10] or by using the Chernoff bound. The maximum correlation receiver was chosen because of its elegance and the simplicity of the proof of Lemma 4.
Combining this lemma with the law of large numbers provides a simple proof for the achievability of the AWGN capacity (½ log(1 + SNR)), which uses much simpler mechanics than the popular proofs based on AEP or error exponents. This receiver has the technical advantage, compared to the AEP receiver, that it does not declare an error for codewords whose power deviates from the nominal power. This technical advantage is important in the context of rateless decoding since the power condition needs to be re-validated at each symbol, thus increasing its contribution to the overall error probability.
Lapidoth [17] showed that the nearest neighbor receiver achieves a rate equal to the Gaussian capacity ½ log(1 + P/N) over the additive channel Y = X + V with arbitrary noise distribution (with fixed noise power). This result parallels the result that the random code capacity of the AVC Y = X + V with a power constraint on V equals the Gaussian capacity [18] (this stems directly from the characterization of the random code capacity of the AVC as max_{P_X(x)} min_{P_S(s)} I(X; Y),
assume the channel is additive (nor any fixed behavior), but
considering the former results it is not surprising, if one
assumes (1) that any channel can be modeled as Y = αX + V
with V ⊥ X, (2) that the dependence of V on X does not
increase the error probability due to orthogonality (see [16])
and (3) that the loss from the single unknown parameter α is
asymptotically small.
Another related result is Agarwal et al's [8] result that it is possible to communicate at a rate approaching the rate-distortion function R_X(D) over an arbitrarily varying channel with unknown block-wise behavior satisfying a distortion constraint Êd(x, y) ≤ D in high probability. This relation is further discussed in the proof of Lemma 1. Their result is similar to ours in the fact that they define the rate in terms of the input and output alone. The result is similar to obtaining the rate function Remp ≈ R_X(Êd(x, y)) in the sense of Theorems 1, 2. However their result is not tight even for Gaussian channels: for the Gaussian channel Y = X + V with noise V limited to power N and the Gaussian prior X ∼ N(0, P), this rate function equals R_X(N) = ½ log(P/N), which is smaller than this channel capacity, whereas with Theorem 2 we would obtain ½ log(1 + P/N). Agarwal's result is tight in the sense that this is the maximum rate that can be guaranteed given this distortion. There exists a channel with the same distortion N whose capacity is only ½ log(P/N): the channel Y = αX + βV with α = β² = 1 − N/P. The reason for the sub-optimality of the result is that the squared distance between the input and output, in contrast with the correlation factor, does not yield a tight representation of all memoryless linear Gaussian channels (in the sense of Lemma 3).
V. COMMUNICATION WITH FEEDBACK
A. Overview and background
In this section we present the rate-adaptive counterparts of
Theorems 1, 2, and the scheme achieving them. The proof is
delayed to the next section. The scheme we use in order to
adaptively attain these rates is by iterating a rateless coding
scheme. In other words, in each iteration we send a fixed
number of bits K, by transmitting symbols from an n length
codebook, until the receiver has enough information to decode.
Then, the receiver sends an indication that the block is over
and a new block starts.
Before developing the details we give some background
regarding the evolution of rateless codes, and the differences
between the proposed techniques. The earliest work is of
Burnashev [12] who showed that for known channels, using
feedback and a random decision time (i.e. decision time
which depends on the channel output) yields an improved
error exponent, which is attained by a 3 step protocol (best
described in [11]) and shown to be optimal. Shulman [19]
proposed to use random decision time as a means to deal with
sending common information over broadcast channels (static
broadcasting), and for unknown compound channels (which
are treated as broadcast). In this scheme later described as
”rateless coding” (or Incremental Redundancy Hybrid ARQ)
a codebook of exp(K) infinite sequences is generated, and
the sequence representing the message is sent to the receiver
symbol by symbol, until the receiver decides to decode (and
turn off, in case of a broadcast channel). Tchamkerten and
Telatar [11] connect the two results by showing that for some,
but not all, compound channels the Burnashev error exponent can be attained universally using rateless coding and the 3
step protocol. Eswaran, Sarwate, Sahai and Gastpar [3] used
iterated rateless coding to achieve the mutual information
related to the empirical noise statistics on channels with
individual noise sequences. The scheme we use here is most
similar to the one used in [3] but less complicated. We do not
use training symbols to learn the channel in order to decide on
the decoding time but rely on the mutual information itself as
the criterion (based on Lemmas 1,4) and the partitioning into
blocks and the decision rules are simpler. The result in [3] is
an extension of a result in [1] regarding the binary channel to
general discrete channels with individual noise sequence. The
original result in [1] was obtained not by rateless codes but by
a successive estimation scheme [20] which is a generalization
of the Horstein [21] and Schalkwijk-Kailath [22] schemes.
The same authors extend their results to discrete channels
[2] using successive schemes (where the target rate is the
capacity of the respective modulo-additive channel). The two
concepts in achieving the empirical rates differ in various
factors such as complexity and the amount of feedback and
randomization required. The successive schemes require less
common randomness but assume perfect feedback, while the
schemes based on rateless coding require less (asymptotically
0 rate) feedback but potentially more randomness.
As noted the technique we use here is similar to that of [3]
in its high level structure, while the structure of the rateless
decoder is similar to [19]’s (chapter 3). The application of this
scheme to individual inputs and outputs and the extension to
real-valued models requires proof and especially issues such
as abnormal behavior of specific (e.g. last) symbols have to be
treated carefully. The result of [3] cannot be applied directly
to individual channels since the channel model cannot be
extracted based on the input and output sequences alone, and
in the later both the model and the sequence are assumed to
be fixed (over common randomness).
B. Statement of the main result
In this section we prove the following theorems, relating to
the definitions given in section III-B:
Theorem 3 (Rate adaptive, discrete channels). Given discrete input and output alphabets X, Y, for every Pe > 0, PA > 0, δ > 0 and prior Q(x) over X there is n large enough and a random encoder and decoder with feedback and variable rate over block size n with a subset J ⊂ X^n, such that:
• The distribution of the input sequence is x ∼ Q^n independently of the feedback and message
• The probability of error is smaller than Pe for any x, y
• For any input sequence x ∉ J and output sequence y ∈ Y^n the rate is R ≥ Î(x; y) − δ
• The probability of J is bounded by Pr(x ∈ J) ≤ PA

Theorem 4 (Rate adaptive, continuous channels). Given the channel R → R, for every Pe > 0, PA > 0, δ > 0, R̄ > 0, and power P > 0 there is n large enough and a random encoder and decoder with feedback and variable rate over block size n with a subset J ⊂ R^n, such that:
• The distribution of the input sequence is x ∼ N^n(0, P) independently of the feedback and message
• The probability of error is smaller than Pe for any x, y
• For any input sequence x ∉ J and output sequence y ∈ R^n the rate is R ≥ min[ ½ log( 1 / (1 − ρ̂(x, y)²) ) − δ, R̄ ]
• The probability of J is bounded by Pr(x ∈ J) ≤ PA
Note that in the last theorem we do not have uniform
convergence of the rate function in x, y. Unfortunately our
scheme is limited by having a maximum rate for each n,
and although the maximum rate tends to infinity as n → ∞,
we cannot guarantee uniform convergence for each n in the
continuous case, where the target rate may be unbounded. The
rates in the theorems are the minimal rates, and in certain
conditions (e.g. a channel varying in time) higher rates may
be achieved by the scheme proposed below.
Regarding the set J, as we shall see in the sequel there are some sequences for which poor rate is obtained, and since we committed to an input distribution we cannot avoid them (one example is the sequence of ½n zeros followed by ½n ones, in which at most one block will be sent). However there is an important distinction between claiming for example that "for each y the probability of R < Remp is at most PA" and the claim made in the theorems that "R < Remp only when x belongs to a subset J with probability at most PA". The first claim is weaker since a smartly chosen y may increase the probability (see figure 4). This is avoided in the second claim. A consequence of this definition is that the probability of R < Remp is bounded by PA for any conditional probability Pr(y|x) on the sequences. This issue is further discussed in section VI-A.

Note that the probability PA could be absorbed into Pe by a simple trick, but this seems to make the Theorem less insightful. After reception the receiver knows the input sequence with probability of at least 1 − Pe and may calculate the empirical mutual information Î(x; y). If the rate achieved by the scheme we will describe later falls short of Î(x; y) it may declare a rate of R = Î(x; y) (which will most likely result in a decoding error). This way the receiver will never declare a rate which is lower than Î(x; y) unless there is an error, and we could avoid the restriction x ∉ J required for achieving Remp, but on the other hand, the error probability becomes conditioned on the set J. The question whether the set J itself is truly necessary (i.e. is it possible to attain the above Theorems with J = ∅) is still open.

Figure (3) illustrates the lower bound for Remp presented by Theorem 4 (R_LB2) as well as a (higher) lower bound R_LB1 for the rate achieved by the proposed scheme (see section VI-C2, Eq.(65)). The parameters generating these curves appear in table III in the appendix.

Fig. 3. Illustration of the Remp lower bound of Theorem 4 (R_LB2) and the lower bound R_LB1 shown in the proof in section VI-C2, as a function of ρ (top) and the effective SNR = ρ²/(1 − ρ²) (bottom). Parameters appear in table III in the appendix.

We prove the two theorems together. First we define the scheme, and in the next section we analyze its error performance and rate and show it achieves the promise of the theorems. Throughout this section and the following one we use n to denote the length of a complete transmission, and m to denote the length of a single block.

C. A proposed rate adaptive scheme
The following communication scheme sends B indices from {1, . . . , M} over n channel uses (or equivalently sends the number θ ∈ [0, 1) in resolution M^{−B}), where M is fixed, and B varies according to the empirical channel behavior. The building block is a rateless transmission of one of M codewords (K ≡ log(M) information units), which is iterated until the n-th symbol is reached.
The transmit distribution Q is an arbitrary distribution for the discrete case and Q = N(0, P) for the continuous case. We define the decoding metric as the empirical rate:

Remp(x, y) ≡ Î(x; y)   (discrete),   Remp(x, y) ≡ ½ log( 1 / (1 − ρ̂²(x, y)) )   (continuous)   (20)

The codebook C_{M×n} consists of M codewords of length n, where all M × n symbols are drawn i.i.d. ∼ Q and known to the sender and receiver. For brevity of notation we denote R^m_emp(x, y) instead of Remp(x_1^m, y_1^m). k denotes the absolute time index 1 ≤ k ≤ n. Block b starts from index k_b, where k_1 = 1. m = k − k_b + 1 denotes the time index inside the current block.
In each rateless block b = 1, 2, . . ., a new index i = i_b ∈ {1, . . . , M} is sent to the receiver using the following procedure:
1) The encoder sends index i by sending the symbols of codeword i:

x_k = C_{i,k}   (21)

Note that different blocks use different symbols from the codebook.
2) The encoder keeps sending symbols and incrementing k until the decoder announces the end of the block through the feedback link.
3) The decoder announces the end of the block after symbol m in the block if for any codeword x_i:

R^m_emp(x_i, y) ≡ Remp( (x_i)_{k_b}^k, y_{k_b}^k ) ≥ μ*_m   (22)

where μ*_m is a fixed threshold per symbol defined in Eq.(23) below.
4) When the end of block is announced, one of the i fulfilling Eq.(22) is determined as the index of the decoded codeword î_b (breaking ties arbitrarily).
5) Otherwise the transmission continues, until the n-th symbol is reached. If symbol n is reached without fulfilling Eq.(22), then the last block is terminated without decoding.
After a block ends, b is incremented and if k < n a new
block starts at symbol kb = k + 1. After symbol n is reached
the transmission stops and the number of blocks sent is B =
b − 1.
The threshold μ*_m is defined as:

μ*_m = K/(m − s) + (1/(m − s)) log(n/Pe) + δ_m
     = (K + log(n/Pe) + |X||Y| log(m + 1)) / m   (discrete)
     = (K + log(2n/Pe)) / (m − 1)   (continuous)   (23)

where s = 0 for the discrete case and 1 for the continuous case, and δ_m is defined in Lemma 1 for the discrete case and equals log(2)/(m − 1) for the continuous case. The threshold μ*_m is tailored to achieve the designated error probability and is composed of 3 parts. The first part requires that the empirical rate Remp would approximately equal the transmission rate of the block K/m, which guarantees there is approximately enough mutual information to send K information units. The second part is an offset responsible for guaranteeing an error probability bounded by Pe over all the blocks in the transmission. The third part δ_m compensates for the overhead terms in Lemmas 1, 4.
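The following condensed sketch is ours (not the paper's implementation) of the iterated rateless scheme just described, for the continuous case: the end-of-block feedback is modeled as an ideal one-bit indication, logarithms are base 2 (so K is in bits), and all function and parameter names are illustrative.

```python
# Sketch (ours): iterated rateless blocks with the threshold mu*_m of Eq. (23), continuous case.
import numpy as np

def remp(x, y):
    """Empirical rate 1/2*log2(1/(1 - rho_hat^2)) of Eq. (20), continuous case."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0 or ny == 0:
        return 0.0
    rho2 = min((float(x @ y) / (nx * ny)) ** 2, 1.0 - 1e-15)   # guard against rho^2 = 1
    return 0.5 * np.log2(1.0 / (1.0 - rho2))

def threshold(m, K, n, Pe):
    """mu*_m of Eq. (23), continuous case (valid for m >= 2); all rates in bits."""
    return (K + np.log2(2.0 * n / Pe)) / (m - 1)

def run_scheme(codebook, channel, messages, n, K, Pe):
    """Iterated rateless blocks; `channel(x)` returns one output symbol per input symbol."""
    decoded, k = [], 0
    for w in messages:                                   # one rateless block per message
        kb, y_block = k, []
        while k < n:
            y_block.append(channel(codebook[w][k]))       # each block uses fresh codebook symbols
            k += 1
            m = k - kb
            if m >= 2:                                    # rho_hat is degenerate over one symbol
                scores = [remp(c[kb:k], np.asarray(y_block)) for c in codebook]
                if max(scores) >= threshold(m, K, n, Pe):
                    decoded.append(int(np.argmax(scores)))   # decoder signals end of block
                    break
        if k >= n:                                        # out of channel uses; last block may be lost
            break
    return decoded

# Toy usage: the "channel" is just an unmodeled black box seen one symbol at a time.
rng = np.random.default_rng(5)
n, K, Pe, P = 4000, 4, 1e-3, 1.0
codebook = rng.normal(0.0, np.sqrt(P), (2 ** K, n))
channel = lambda x: 0.9 * x + rng.normal(0.0, 0.5)
sent = list(rng.integers(0, 2 ** K, 300))
print(run_scheme(codebook, channel, sent, n, K, Pe)[:10], sent[:10])
```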
The scheme achieves the claims of Theorems 3, 4 with a proper choice of the parameters (discussed in section VI-C). Note that the scheme uses a feedback rate of 1 bit/use; however, it is easy to show that any positive feedback rate is sufficient (see section VI-C), therefore we can claim the theorems hold with "zero rate" feedback.
We devote the next section to the analysis of the error probability and rate of the scheme, showing it attains Theorems
3,4. Unfortunately although the scheme is simple, the current
analysis we have is somewhat cumbersome.
VI. PROOF OF THE MAIN RESULT
In this section we analyze the adaptive rate scheme presented and show it achieves Theorems 3,4. Before analyzing
the scheme we develop some general results pertaining to the
convexity of the mutual information and correlation factors
over sub-vectors. The proof of the error probability is simple
(based on the construction of µ∗m ) and common to the two
cases. The proof of the achieved rate is more complex and
performed separately for each case.
A. Preliminaries
1) Likely convexity of the mutual information: A property which would be useful for the analysis is ∪-convexity of the empirical mutual information with respect to joint empirical distributions P̂_(x,y)(x, y) measured over different sub-vectors, so for example we would like to have for 0 ≤ m ≤ n:

Î(x_1^n; y_1^n) ≤ (m/n) · Î(x_1^m; y_1^m) + (1 − m/n) · Î(x_{m+1}^n; y_{m+1}^n)   (24)

which would guarantee that if we achieve a rate equal to the empirical mutual information over the two sections 0 ≤ k ≤ m and m < k ≤ n, then we would achieve the empirical mutual information over the entire vector 0 ≤ k ≤ n. However this property does not hold in general since the mutual information is not convex with respect to the joint distribution. The mutual information I(P, W) is known to be convex ∪ with respect to W and concave ∩ with respect to P, so if, for example, the conditional distributions over the sections [1, m] and [m + 1, n] are equal and only the distribution of x differs, the condition would in general not hold. On the other hand, should the empirical distributions of x_1^m and x_{m+1}^n be equal, then the empirical mutual information expressions appearing in Eq.(24) would differ only in the conditional distributions of y w.r.t x and the assertion would hold. Since we generate x by i.i.d. drawing of its elements the empirical distributions converge to the prior Q, and we would expect that if the size of both regions m and n − m is large enough the convexity would hold up to a fraction ε in high probability. We show below that such convexity holds under even milder conditions. The cases in which this approximate convexity is used later on can serve as examples of the difference between the individual model used here and probabilistic models (including the individual noise sequence). We use the lemma to:
1) Bound the loss due to insufficient utilization of the last symbol in each rateless encoding block.
2) Bound the loss due to not completing the last rateless encoding block.
3) Show that the average rate (empirical mutual information) over multiple blocks equals at least the mutual information measured over the blocks together.
Had the rate been averaged over multiple sequences x rather than obtained for a specific sequence, the regular convexity of the mutual information with respect to the channel distribution would have been sufficient. The property is formalized in the following lemma:

Lemma 5 (Likely convexity of mutual information). Let {A_i}_{i=1}^p define a disjoint partitioning of the index set {1, . . . , n}, i.e. ∪_i A_i = {1, . . . , n} and A_i ∩ A_j = ∅ for i ≠ j. x, y are n-length sequences, and x_A, y_A define the subsequences of x, y (resp.) over the index set A. Let the elements of x be chosen i.i.d. with distribution Q. Then for any ∆ > 0 there is a subset J_∆ ⊂ X^n such that:

∀x ∉ J_∆, y ∈ Y^n:  Σ_{i=1}^p (|A_i|/n) · Î(x_{A_i}; y_{A_i}) ≥ Î(x; y) − ∆   (25)

And

Q^n{J_∆} ≤ exp(−n(∆ − δ̃_n))   (26)

with δ̃_n = p|X| · log(n+1)/n → 0.

The lemma does not claim that convexity holds with high probability, but rather that any positive deviation from convexity may happen only on a subset of x with vanishing probability. It is surprising that the bound does not depend on y, Q and the size of the subsets, and only weakly depends on the number of subsets.

Before proving the lemma we emphasize a delicate point: the lemma does not only claim that for each y the probability of deviation from convexity is small, but makes a stronger claim that apart from a subset of the x sequences with vanishing probability, convexity always holds independently of y. This distinction is important since this lemma defines a set of "bad" input sequences that fail our scheme. In these sequences there exists a partitioning that yields an excessive deviation from the distribution Q between rateless blocks. As an example of such a sequence consider the binary channel and the input sequence 0^{n/2} 1^{n/2} (n/2 zeros followed by n/2 ones). This sequence is bad since it guarantees that on one hand at most one block will be received (since at most one block includes both 0-s and 1-s at the input), but on the other hand the zero order empirical input distribution is good (Ber(½)), so potentially we have the combination of high empirical mutual information with low communication rate. The sequences that deviate from convexity are a function of the output y. Had we only bounded the probability of deviation from convexity to occur for each y individually, then a potential adversary could have increased this probability by determining y (given x) such that x will be a bad sequence with respect to this y. To avoid this, we claim that there is a fixed group of x such that if the sequence is not in the group, approximate convexity holds regardless of y. This is illustrated in fig.(4) where the dark spots mark the pairs (x, y) for which convexity does not hold.

Fig. 4. Illustration of bad sequences and lemma 5.

Proof of lemma 5: Define the vector u denoting the subset number of each element, u_k = i ∀k ∈ A_i. Then Î(x_{A_i}; y_{A_i}) = Î(x; y|u = i), and P̂_u(i) = |A_i|/n, therefore we can write the weighted sum of empirical mutual information over the partitions as a conditional empirical mutual information:

Σ_{i=1}^p (|A_i|/n) Î(x_{A_i}; y_{A_i}) = Σ_{i=1}^p P̂_u(i) Î(x; y|u = i) = Î(x; y|u)   (27)

Using the chain rule for mutual information (see [14] section 2.5):

Î(x; y) − Î(x; y|u) = Î(x; y) − [Î(x; yu) − Î(x; u)] = Î(x; u) − Î(x; u|y) ≤ Î(x; u)   (28)

Define the set J_∆ = {x : Î(x; u) > ∆}, then

∀x ∉ J_∆, y:  Î(x; y) − Î(x; y|u) ≤ Î(x; u) ≤ ∆   (29)

And since x is chosen i.i.d. and u is a fixed vector, we have from Lemma 1:

Pr(x ∈ J_∆) ≤ exp(−n(∆ − δ̃_n))   (30)

with δ̃_n = |X||{1, . . . , p}| log(n+1)/n.

Note that if the distribution of x is the same over all partitions then Ĥ(x|u) = Ĥ(x), therefore Î(x; u) = 0 and the empirical mutual information will be truly convex.
2) Likely convexity of the correlation factor: For the continuous case we use the following property which somewhat
parallels Lemma 5. The reasons for not following the same
path as the discrete case will be explained in the sequel
(subsection VI-C). Unfortunately the proof is very technical
and less elegant and will therefore be expelled to the appendix
(appendix-E). Note that again the bound does not depend on
the size of the subsets.
Lemma 6 (Likely convexity of ρ̂2 ). Define {Ai }pi=1 as in
Lemma 5. Let x , y be n-length sequences and define the
correlation factors of the sub-sequences, and the overall
correlation factor as
ρ̂i =
|xTAi yAi |
kxAi k · kyAi k
and
|xT y|
kxk · kyk
(31)
ρ̂2i ≥ ρ̂2 − ∆
(32)
ρ̂ =
respectively. Let x be drawn i.i.d from a Gaussian distribution
x ∼ N (0, P ). Then for any 0 < ∆ ≤ 17 there is a subset
J∆ ⊂ Rn such that:
∀x 6∈ J∆ , y ∈ Rn :
And
p
X
|Ai |
i=1
n
Pr {x ∈ J∆ } ≤ 2p e−n∆
2
/8
(33)
I.e. there is a subset with high probability on which the mean
of the correlation factors does not fall considerably below the
overall correlation factor.
3) Likely convexity with dependencies: The properties of
likely convexity defined in the previous sections pertain to
a case where the partition of the n block is fixed and x is
drawn i.i.d. However in the transmission scheme we described,
the partition varies in a way that depends on the value of
x (through the decoding decisions and the empirical mutual
information), which may, in general, change the probability
of the convexity property with a given ∆ to occur. Although
it stands to reason that the variability of the block sizes in
the decoding process reduces the probability to deviate from
convexity since it tends to equalize the amount of mutual
information in each rateless block, for the analysis we assume
12
an arbitrary dependence, and assume that the size of the set
J increases by factor of the number of possible partitions, as
explained below.
Denote a partition by π = {Ai }pi=1 (as defined in Lemmas
5,6) and the group of all possible partitions (for a given
encoder-decoder) by Π. For each partition π from Lemmas 5,6
there is a subset J(π) with probability bounded by pJ outside
which approximate convexity (as defined in the lemmas)
holds. Then [
approximate convexity is guaranteed to hold for
J(π), where the probability of the set J is
x 6∈ J ≡
π∈Π
bounded by the union bound:
Pr(x ∈ J) = Pr
[
π∈Π
!
(x ∈ J(π))
≤ |Π| · pJ
(34)
Now we bound the number of partitions. In the two cases
we will deal with in section VI-C the number of subsets can be
bounded by pmax , and all subsets but one contain continuous
indices. Therefore the partition is completely defined by the
start and end indices of pmax − 1 subsets (allowed to overlap
if there are less than pmax subsets), thus |Π| ≤ n2pmax −2 <
n2pmax and we have
Pr(J) ≤ n2pmax · pJ = exp(2pmax log(n)) · pJ
(35)
where pJ is defined in the previous lemmas. So for our
purposes we may say that these lemmas hold even if the
partition depends on x with an appropriate change in the
probability of J.
B. Error probability analysis
In this subsection we show the probability to decode incorrectly any of the B indices is smaller than Pe .
With Remp defined in Eq.(20), we have from Lemma 4 that
under the conditions of the lemma Pr(Remp ≥ t) = Pr(|ρ̂| ≥
R2−1 (t)) ≤ 2 exp(−(n − 1)t). Then combining Lemmas 1 and
4, we may say that for any y1m the probability of xm
1 generated
i.i.d. from the relevant prior to have Remp ≥ t is bounded by:
m
Qm (Remp (xm
1 , y1 ) ≥ t) ≤ exp (−(m − s)(t − δm )) (36)
where
δm =
(
|X ||Y| log(m+1)
m
log 2
m−1
And
s=
discrete
continuous
0 discrete
1 continuous
(37)
(38)
An error might occur if at any symbol 1 ≤ k ≤ n an
incorrect codeword meets the termination condition Eq.(23).
The probability that codeword j 6= i meets Eq.(23) at a specific
symbol k which is the m-th symbol of a rateless block is
bounded by:
m
Pr(Remp
(xj , y) ≥ µ∗m ) ≤ exp (−(m − s)(µ∗m − δm )) =
n
Pe
Pe
= exp − K + log
=
=
(39)
Pe
n exp(K)
Mn
The probability of any erroneous codeword to meet the
threshold at any symbol is bounded by the union bound:
n [
[
(µm (xj , y) ≥ µ∗m ) ≤
Pr(error) ≤ Pr
k=1 j6=i
Pe
< Pe (40)
Mn
The first inequality is since the correct codeword might be
decoded even if an erroneous codeword met the threshold.
Although the index m in the expression above depends on
k and the specific sequences x, y in an unspecified way, the
assertion is true since the probability of the event in the union
has an upper bound independent of m.
≤ n(M − 1)
C. Rate analysis
Roughly speaking, since µ∗m ≈ K
m , if no error occurs, the
m
correct codeword crossed the threshold when Remp
(xi , y) ≈
K
therefore
the
rate
achieved
over
a
rateless
block
is Rb =
m
K
m
≈
R
(x
,
y),
and
due
to
the
approximate
convexity
by
emp i
m
achieving the above rate on each block separately we meet
or exceed the rate Remp (x, y) over the entire transmission.
However in a detailed analysis we have the following sources
of rate loss:
1) The offsets inserted in µ∗m to meet the desired error
probability
2) The offset from convexity (Lemma 5) introduced by the
slight differences in empirical distribution of x between
the blocks
3) Unused symbols:
a) The last symbol of each block is not fully utilized,
as explained below
b) The last (unfinished) block is not utilized
Regarding the last symbol of each block, note that after
receiving the previous symbol the empirical mutual information is below the threshold, and at the last symbol it meets
or exceeds the threshold. However the proposed scheme does
not gain additional rate from the difference between the mutual
information and the threshold, and thus it loses with respect
to its target (the mutual information over the block) when this
difference is large. Here a ”good” channel works adversely
to our worse. Since we operate under an individual channel
regime, the increase of the mutual information at the last
symbol is not bounded to the average information contents of
a single symbol. This is especially evident in the continuous
case where the empirical mutual information is unbounded.
A high value of y together with high value of x at the
last symbol causes an unbounded increase in Remp : if we
choose xm , ym → ∞ then ρ → 1 regardless of the history
x1m−1 , y1m−1 . Therefore over a single block we might have
an arbitrarily low rate (|ρ̂| is small over the m − 1 first
symbols) and arbitrarily large Remp . In the discrete case this
phenomenon exists but is less accented (consider for example
the sequences x = y = 0n−1 1 = (0, . . . , 0, 1)) Similarly
regarding the last block, the fact that the length of the block
may be bounded does not mean the increase in the empirical
mutual information can be bounded as well. We use the
13
TABLE I
S UMMARY OF DEFINITIONS AND REFERENCES FOR THE DISCRETE AND CONTINUOUS CASES
Item
Input distribution
Discrete case
Any Q
Continuous case
Q = N (0, P )
Decoding metric
Decoder
ˆ y)
Remp (x, y) ≡ I(x,
maximize Remp (x, y) ⇔ maximize
ˆ y)
I(x,
≤ exp(−n(t − δn )) (Lemma 1)
Pp
ˆ
ˆ
i=1 λi I(xi ; yi ) ≥ I(x; y)−∆ (Lemma
5)
””
“
“
”
“
Remp (x, y) ≡ 12 log 1−ρ̂21(x,y)
maximize Remp (x, y) ⇔ maximize
|ρ̂(x, y)|
≤ 2 exp(−(n − 1)t) (Lemma 4)
Pp
2
2
i=1 λi ρ̂i ≥ ρ̂ − ∆ (Lemma 6)
Pairwise error probability Pr(Remp ≥ t)
Likely convexity condition (∀x 6∈ J∆ , y ∈
1
Y n with λi ≡ n
|Ai |)
Likely convexity probability (Pr(x 6∈ J∆ ),
fixed partitioning)
≥ 1 − exp −n ∆ − δ̃n
approximate convexity (Lemma 5) to show the last two losses
are bounded for most x sequences.
Note that by the same argument that shows the loss from
not utilizing the last symbol vanishes asymptotically, it is
easy to show that feeding back the block success information
only once every 1/ǫ symbols thereby decreasing the feedback
rate to ǫ does not decrease the asymptotical rate, since this
is equivalent to having 1/ǫ unused symbols instead of one.
Hence the scheme can be modified to operate with ”zero
rate” feedback. Similarly the scheme can operate with a noisy
feedback channel by introducing in the feedback link a delay
suitable to convey the decoder decisions with sufficiently low
error rate over the noisy channel.
In addition to having rate losses the scheme also has a
minimal rate and a maximal rate for each block length. The
minimal rate is K
n resulting from sending a single block. If
channel conditions are worse (Remp < K
n ), no information
will be sent. A maximal rate exists since at best K information
units could be sent every 2 symbols (since for the continuous
1
case µ∗1 = ∞ and for the discrete case Remp
(x, y) = 0
thus the decoding never terminates at the first symbol of
the block), hence the maximum rate is K
2 . As n → ∞ we
increase K so that the minimum rate (and the rate offsets)
tend to 0 and the maximum rate tends to ∞. The maximum
rate is the reason that the scheme cannot approach the target
rate Remp (xi , y) uniformly in x, y in the continuous case,
since for some pairs of sequences the target rate (which is
unbounded) may be much higher than the maximum rate. The
rate R̄ that we achieve in the proof of Theorem 4 is much
smaller than the absolute maximum K
2 . Note that successive
schemes (such as Schalkwijk’s [22]) do not suffer from the
problem of maximum rate. For the discrete case the target rate
is bounded by max(|X |, |Y|) therefore for sufficiently large n
the maximal rate K
2 exceeds max(|X |, |Y|) and we are able
to show uniform convergence.
Although our target is the empirical mutual information over
the n-block, an artifact of the partitioning to smaller blocks is
that higher rates can be attained when the empirical conditional
channel distribution varies over time, since by the convexity
of mutual information with respect to the channel law the
convex sum of mutual information over blocks exceeds the
overall mutual information if these are not constant.
We now turn to prove the achieved rate. The total amount
of information sent (with or without error) is B · K therefore
≥ 1 − 2p e−n∆
the actual rate is
Ract =
2
/8
BK
n
(41)
We now endeavor to show this rate is close to or better than the
empirical mutual information in probability of at least PA over
the sequences x, regardless of y and of whether a decoding
error occurred.
The following definition of index sets in {1, . . . , n} is
used for both the discrete and the continuous cases: Ub =
kb+1 −2
{k}k=k
denotes the channel uses of block b except the
b
last one, L0 collects the last channel uses of all the blocks
L0 = {kb − 1 : b > 1}, and UB+1 denotes the indices of
the un-decoded (last) block UB+1 = {k}nk=kB+1 (including
its last symbol), and is an empty set if the last block is
decoded. The sets {Ub }B+1
b=1 , L0 are disjoint and their union is
{1, . . . , n}. We denote the length of each block not including
the last symbol by mb ≡ |Ub |. From this point on we split
the discussion and we start with the discrete case which is
simpler.
1) Rate analysis for the discrete case: We write µ∗m as
K+∆
∗
m
≤ m µ with
µm = K+∆
m
∆m
n
n
+mδm = log
+|X ||Y| log(m+1) ≤
= log
Pe
Pe
n
≤ log
+ |X ||Y| log(n + 1) ≡ ∆µ (42)
Pe
From Lemma 5 and Eq.(35) we have that the following
equation:
B+1
X
|L |
mb ˆ
0 ˆ
I(xBb ; yBb ) +
I(xL0 ; yL0 )
n
n
b=1
(43)
is satisfied when
x is outside
a set J∆ with probability
of at most exp −n ∆ − δ̃n
where δ̃n = (B + 2)|X | ·
ˆ y) − ∆ ≤
I(x;
log(n+1)
n
+ 2Bmax log(n)
n . We shall find Bmax later on. To
make sure the probability
of J is less than PA we require
exp −n ∆ − δ̃n
≤ PA therefore
1
log (PA ) =
n
log(n + 1)
log(n)
1
= (B + 2)|X | ·
+ 2Bmax
− log (PA )
n
n
n
(44)
∆ ≥ δ̃n −
14
and we choose
1
log(n + 1)
− log (PA )
(45)
n
n
We now bound each element of Eq.(43). Consider block
b with mb + 1 symbols. At the last symbol before decoding
(symbol mb ≡ |Ub |) none of the codewords, including the
correct one crosses the threshold µ∗m , therefore:
∆ = (3Bmax + 2)|X | ·
K + ∆mb
ˆ U ; yU )
> I(x
(46)
b
b
mb
Specifically for the unfinished block we have at symbol n:
µ∗mb =
µ∗mB+1 =
K + ∆mB+1
ˆ U ; yU )
> I(x
B+1
B+1
mB+1
(47)
The way to understand these bounds is as guarantee on the
shortness of the blocks given sufficient mutual information.
On the other hand, at the end of each block including the last
symbol (symbols (kb , kb + mb )), since one of the sequences
was decoded we have:
K + ∆mb +1
≤
µ∗mb +1 =
mb + 1
k +m
≤ max Iˆ (xi )kbb b ; ykkbb +mb ≤ log min(|X |, |Y|) ≡ h0
i
(48)
Which we can use to bound the number of blocks, since mb +
1 ≥ hK0 therefore
B≤
B
X
h0
h0 · n
(mb + 1) ≤
≡ Bmax
K
K
b=1
(49)
As for the unused last-symbols we bound:
ˆ L ; yL ) ≤ h0
I(x
0
0
(50)
Combining Eq.(49) and Eq.(45) we have:
3h0
1
2
∆≤
|X | · log(n + 1) − log (PA )
+
K
n
n
(51)
Combining Eq.(46),(47),(50) with Eq.(43) and substituting
∆m ≤ ∆µ yields:
ˆ y) < ∆ +
I(x;
B+1
X
mb
n
b=1
B+1
X
≤∆+
b=1
K + ∆mb
mb
+
B
h0 ≤
n
B
1
(K + ∆µ ) + h0 =
n
n
B+1
B
(K + ∆µ ) + h0 (52)
n
n
From Eq.(52) B and consequently Ract can be lower
bounded:
=∆+
Ract =
ˆ y) − ∆ − 1 (K + ∆µ )
I(x;
B
n
·K >
·K =
n
K + ∆ µ + h0
ˆ y) − ∆ − K 1 + ∆µ
I(x;
n
K
=
(53)
∆µ +h0
1+ K
Now if we increase K with n such that O(log(n)) <
O(K) < O(n) (for example by choosing K = nα , 0 < α <
1), then K
n → 0 as n → ∞, since ∆µ = O(log(n)) we have
∆µ
→
0
and
from Eq.(51) we have ∆ → 0 thus for any ǫ we
K
have n large enough so that:
Ract >
ˆ y) − ǫ
I(x;
ˆ y) − ǫ (1 − ǫ) >
> I(x;
1+ǫ
ˆ y) − (1 + h0 )ǫ ≡ Remp
> I(x;
(54)
Outside the set J, where the last inequality is due to the fact Iˆ
is bounded. Hence we proved our claim that the rate exceeds
a rate function which converges uniformly to the empirical
mutual information and the proof of Theorem 3 is complete.
2) Rate analysis for the continuous case: The continuous
case is more difficult from several reasons. One is that the
error probability exponent has a missing degree of freedom
(≈ exp((n − 1)t)). This results in a rate loss (through s in the
definition of µ∗m ), which is larger for small blocks, and can be
bounded only when assuming the number of blocks does not
grow linearly with n. Since the effective mutual information
Remp (x, y) is unbounded we cannot simply bound the loss
of mutual information over the unused symbols. Specifically
for a single symbol, ρ̂ = 1 and Remp = ∞. Therefore we
use the convexity of the correlation factor and the fact it is
bounded by 1. As a result, the loss introduced in order to
attain convexity (over the rateless blocks) is in the correlation
factor rather than the empirical mutual information. A loss
in the correlation factor induces unbounded loss in the rate
function for ρ ≈ 1, leading to a maximum rate. In order to
cope with these difficulties we use a threshold T on the number
of symbols in a block (T is chosen to grow slower than n),
and treat large and small blocks differently: the large blocks
are analyzed through their correlation factor and for the small
blocks the correlation factor is upper bounded by 1 and only
the number of blocks is accounted for.
We denote ρ̂b ≡ ρ̂(xUb , yUb ) and ρ̂ ≡ ρ̂(x, y) the correlation factor measured on a rateless block and on the entire transmission block, respectively. We denote by BS = {b : mb ≤ T }
and BL = {b : mb > T } the indices of the small and the
large blocks respectively (the last unfinished block included).
The total
P number of symbols in the large blocks is denoted
mL ≡ b∈BL mb . The number of large blocks is bounded by
|BL | < Tn .
The decoding threshold is written as
K
n
1
K + ∆µ
log(2)
∗
µm =
+
log
=
(55)
+
m−1 m−1
Pe
m−1
m−1
2n
where we denoted ∆µ ≡ log P
. We consider the partie
tioning of the index set {1, . . . , n} into at most p = Tn sets:
the first Tn −S1 (or less) sets are the large blocks except their
last symbol b∈BL Ub (each with at least T + 1 symbols by
definition), and the last set denoted L1 includes the rest of the
symbols (last symbols of these blocks and all symbols of small
blocks), and has |L1 | = n − mL . Since this partitioning has a
bounded number of sets, by applying Lemma 6 and Eq.(35)
with p = Tn we have that Eq.57 below is satisfied when x is
15
outside a set J with probability at most:
√
2
2
2n
Pr(J) ≤ n2p · 2p e−n∆ /8 =
e−n∆ /8 =
√
2
(56)
2n
= exp −n log(e)∆2 /8 − log
T
n
2T
For any 0 < ∆ ≤ 17 . This bound
√ tends
to 0 if T > O(log(n))
(since log(e)∆2 /8 − T2 log 2n → log(e)∆2 /8 > 0)
therefore for any such ∆ there is n large enough such that
this probability falls below the required PA . The convexity
condition is:
ρ̂2 − ∆ ≤
X mb
|L1 |
ρ̂2b +
ρ̂(xL1 ; yL1 )2 ≤
n
n
b∈BL
X mb
n − mL
ρ̂2 +
≤
n b
n
(57)
The last equation is a lower bound on a linear combination
of |BL | and |BS |. Since the total information sent depends on
|BL | + |BS | we equalize the coefficients multiplying |BL | and
|BS | by determining η1 so that:
1 K + ∆µ
1
(62)
− log (1 − η1 ) = 1 +
2
T
T
This is always possible since the RHS is positive and the LHS
maps η1 ∈ (0, 1) to (0, ∞). Then
|BL | + (T + 1)|BS |
1 K + ∆µ
r0 ≤ |BL | +
1+
=
T
T
n
K + ∆µ
K + ∆µ
= (|BL | + |BS |)
= (B + 1)
(63)
n
n
Extracting a lower bound on B from Eq.(63) yields a bound
on the empirical rate:
b∈BL
where ∆ can be made arbitrarily close to 0. We define a factor
η1 < 1 and apply the function (− 21 ) log(1 − η1 t) to both sides
of the above equation. Since the function is monotonically
increasing and convex ∪ over t ∈ [0, 1) (stemming from
concavity ∩ of log(t)), we have:
r0 ≡ (− 12 ) log(1 − η1 · (ρ̂2 − ∆)) ≤
"
!#
X mb
(57)
n − mL
2
1
≤ (− 2 ) log 1 − η1
ρ̂ +
·1
≤
n b
n
b∈BL
X mb
(− 21 ) log 1 − η1 ρ̂2b +
≤
n
b∈BL
n − mL
(− 21 ) log (1 − η1 · 1) (58)
n
We start by bounding the terms related to the large blocks.
At the last symbol before decoding in each block (or symbol
n for the unfinished block) none of the codewords, including
the correct one crosses the threshold µ∗m , therefore we have
for b = 1, . . . , B + 1:
+
µ∗mb =
1
K + ∆µ
> Remp (xUb , yUb ) = − log(1 − ρ̂2b ) (59)
mb − 1
2
and since mb ≥ T + 1:
mb
mb
(− 21 ) log 1 − η1 ρ̂2b ≤
(− 21 ) log 1 − ρ̂2b <
n
n
(59) mb K + ∆µ
1
K + ∆µ
·
= 1+
≤
<
n
mb − 1
mb − 1
n
1 K + ∆µ
≤ 1+
(60)
T
n
P
For the small blocks we use n ≤
b∈BL (mb + 1) +
P
(m
+
1)
≤
m
+
|B
|
+
(T
+
1)|B
b
L
L
S | (where the
b∈BS
inequality is since the unterminated block has length mb ) to
bound n − mL ≤ |BL | + (T + 1)|BS |.
Combining Eq.(58) with these bounds we have:
1 K + ∆µ
r0 ≤ |BL | 1 +
+
T
n
1
|BL | + (T + 1)|BS |
− log (1 − η1 )
+
(61)
n
2
K
·B ≥
n
r0 · n
r0
K
K
−1 =
−
≥
·
=
n
K + ∆µ
1 + K −1 ∆µ
n
(− 21 ) log(1 − η1 (ρ̂2 − ∆)) K
=
−
≡ RLB1 (64)
(1 + K −1 ∆µ )
n
Ract =
Equation (64) may be optimized with respect to T to obtain a
tighter bound, but this is not necessary to prove the theorem.
Recall that ∆µ = O(log(n)). By choosing O(log(n)) <
K < O(n) the factor (1 + K −1 ∆µ ) in Eq.(64) can be made
arbitrarily close to 1 and K
n can be made arbitrarily close
to 0. As we saw above choosing O(log(n)) < T < O(n)
enables us to have PA → 0 with ∆ arbitrarily close to 0, and
finally if K > O(T ) then the RHS of Eq.(62) tends to ∞ and
therefore we can choose η1 arbitrarily close to 1. Summarizing
the above, by selecting O(log(n)) < O(T ) < O(K) < O(n)
we can write the rate as
Ract ≥ RLB1 = (− 12 ) log(1 − η1 · (ρ̂2 − ∆)) · η2 − ǫ1 (65)
With η1 , η2 n→∞
0+ . RLB1 tends to the target
−→ 1− and
n→∞
ǫ1 ,∆−→
rate R2 (ρ̂) ≡ 12 log 1−1ρ̂2 for each point ρ̂ ∈ [0, 1) (but not
uniformly), and it remains to show that for any R̄, ǫ there is n
large enough such that RLB1 ≥ RLB2 ≡ min(R2 (ρ̂) − ǫ, R̄).
The functions R2 (ρ) and RLB1 (ρ) are monotonically increasing (for fixed η1 , η2 and ǫ1 ) and it is easy to verify
by differentiation that the difference R2 (ρ) − RLB1 (ρ) is
also monotonically increasing. Given R̄, ǫ, choose ρ0 such
that R2 (ρ0 ) = R̄ + ǫ. Since RLB1 (ρ0 )−→
R2 (ρ0 ), for n
n→∞
large enough we have R2 (ρ0 ) − RLB1 (ρ0 ) ≤ ǫ, and therefore RLB1 (ρ0 ) ≥ R2 (ρ0 ) − ǫ = R̄. For this n, for any
ρ ≤ ρ0 from the monotonicity of the difference we have that
R2 (ρ) − RLB1 (ρ) ≤ ǫ, and for any ρ ≥ ρ0 we have from
the monotonicity of RLB1 (ρ) that RLB1 (ρ) ≥ R̄, therefore
RLB1 ≥ RLB2 , which completes the proof of Theorem 4.
VII. E XAMPLES
In this section we give some examples to illustrate the model
developed in this paper. In this section we use a slightly less
formal notation.
16
A. Constant outputs and other illustrative cases
The statement that a rate which is determined by the input
and output sequences can be attained without assuming any
dependence between them may seem paradoxical at first. Some
insight can be gained by looking at the specific case where the
output sequence is fixed and does not depend on the input. In
this case, obviously, no information can be transferred. Since
the encoder uses random sequences, the result of fixing the
output is that the probability to have an empirical mutual
information larger than ǫ > 0 tends to 0, therefore most of
the time the rate will be 0. Infrequently, however, the input
sequence accidentally has empirical mutual information larger
than ǫ > 0 with the output sequence. In this case the decoder
will set a positive rate, but very likely fail to decode. These
cases occur in vanishing probability and constitute part of
the error probability. So in this case we will transmit rate
R = 0 with probability of at least 1 − Pe and R > 0 with
probability at most Pe . Conversely, if the channel appears to
be good according to the input and output sequences (suppose
for example yk = xk ), the decoder does not know if it
is facing a good channel or just a coincidence, however it
takes a small risk by assuming it is indeed a good channel
and attempting to decode, since the chances of high mutual
information appearing accidentally are small (and uniformly
bounded for all output sequences).
Another point that appears paradoxical at first sight is that
the decoder is able to determine a rate R ≥ Remp without
knowing x for any x 6∈ J. First observe that although it
is an output of the decoder, the rate R is not controlled by
the encoder and therefore cannot convey information. Since
the decoder knows the codebook, and given the codebook the
sequence x is limited to a number of possibilities (determined
by the possible messages and block locations), it is easy to
find an R(y) ≥ Remp (x, y) by maximizing Remp over all
possible sequences x. Vaguely speaking, the decoding process
is indeed a maximization of Remp over multiple x sequences
and by Lemmas 1, 4 such a decoding process guarantees small
probability of error.
B. Applying the continuous alphabet scheme to other input
alphabets
The scheme used for the continuous case can be adapted
to peak limited or even discrete input, by using an adaptation
function, i.e. the channel input will be x′k = f (xk ). In this
case the modified codebook C ′ = f (C) will be generated
by passing the Gaussian codebook through the adaptation
function, but for analysis purposes the adaptation function
f (·) may be considered part of the channel and the correlation
factor is calculated with respect to x which is used to generate
the codebook. In order to write the rate guaranteed by this
approach as a function of x′ rather than x, the law of large
numbers has to be utilized (in general) with respect to the
distribution Pr(xk |x′k ).
C. Non linear channels
In analyzing probabilistic
the correlation model
channels,
1
1
determines the rate 2 log 1−ρ2 is always achievable using
Gaussian code (no randomization is needed if the channel is
probabilistic as can be shown by the standard argument about
the existence of a good code). This is actually a result of
Lemma 2.
This expression is useful for analyzing channels in which
the noise is not additive or non linearities exist. As an example,
transmitter noise is usually modeled as an additive noise.
However large part of this noise is due to distortion (e.g. in
the power amplifier), and therefore depends on the transmitted
signal and is inversely correlated to it. Consider the non linear
channel Y = f (X) + V with V ∼ N (0, N ). In this case
ρ2
if we define the effective SNR as SNR = 1−ρ
2 then rate
R = 12 log (1 + SNR) is achievable. The correlation factor is:
ρ2 =
E(Xf (X))2
E(XY )2
=
E(X 2 )E(Y 2 )
E(X 2 )(E(f (X)2 ) + N )
(66)
Therefore the effective SNR is:
ρ2
=
1 − ρ2
Peff
E(Xf (X))2
=
=
2
E(X )(E(f (X)2 ) + N ) − E(Xf (X))2
N + Neff
(67)
SNR =
where we defined the effective gain γ, the effective power Peff
and the effective noise Neff as:
γ
Peff
Neff
E(Xf (X))
E(X 2 )
(E[(Xf (X)])2
≡
= E (γX)2
E(X 2 )
(E[Xf (X)])2
≡ E(f (X)2 ) −
E(X 2 )
= E (f (X) − γX)2
≡
(68)
(69)
(70)
This yields a simple characterization of the degradation caused
by the non linearity, which is independent of the noise power
and is tight if the non linearity is small. This model enables to
characterize the transmitter distortions by the two parameters
Peff , Neff , a characterization which is more convenient and
practical to calculate than the channel capacity, and on the
other hand guarantees that transmitter noise evaluated this way
never degrades the channel capacity in more than determined
by Eq.(67).
Another interesting application of this bound is in treating
receiver estimation errors, since it is simpler to calculate the
loss in the correlation factor induced due to the imperfect
knowledge of the channel parameters than the loss in capacity.
For example, the bound in [16] for the loss due to channel
estimation from training, when specialized to single input
single output (SISO) channels, may be computed using the
correlation factor bound.
D. Employing continuous channel scheme over a BSC
When operated over a channel different than the Gaussian
additive noise channel, the rates achieved with the scheme
we described in the continuous case are suboptimal compared
to the channel capacity. The loss depends on the channel in
17
from the simplicity of the models used, and can be solved by
schemes employing higher order empirical distributions (over
blocks, or by using Markov models), and by employing tighter
approximations of the empirical statistics (e.g. by higher order
statistics) in the continuous case.
F. Using individual channel model to analyze adversarial
individual sequence
Fig. 5.
Comparison of C,R for the BSC
question. As an example, suppose the communication system
is used over a BSC with error probability ǫ, i.e. the continuous
input value X is translated to a binary value by sign(X), and
the output is Y = sign(X) · (−1)Ber(ǫ) . The capacity of this
channel is C = 1bit − hb (ǫ) and we are interested to calculate
the rate which would be achieved by our scheme (which
does not know the channel) for this channel behavior. For
this channel with Gaussian N (0, P ) input we have (through a
simple calculation):
r
2P
E(XY ) = (1 − 2ǫ)
(71)
π
Hence
ρ2 =
And
2
E(XY )2
= (1 − 2ǫ)2
2
P · E(1 )
π
1
R = log
2
1
2
1 − π (1 − 2ǫ)2
(72)
(73)
The comparison between C and R is presented in fig.(5). It
can be shown that R ≥ π2 C, thus the maximum loss is 36%.
E. Channels that fail the zero order and the correlation model
Although we did not assume anything about the channel,
and specifically we did not assume the channel is memoryless,
the fact we used the zero-order empirical distribution means
the results are less tight for channels with memory. Specifically
if delay is introduced then the scheme would fail completely.
For example, for the channel yk = xk + 12 xk−1 + vk we
would obtain positive rates and the intersymbol interference
(ISI) 21 xk−1 would be treated (suboptimally) as noise, but for
the error free channel yk = xk−1 the achieved rate would be
0 (with high probability). Similarly we can find a memoryless
channel with infinite capacity but for which the correlation
model we used for the continuous alphabet scheme fails: if
yk = x2k then ρ = 0. Another example of practical importance
is the fading channel (with memory) yn = hn xn + vn , where
hn is slowly fading with mean 0. All these examples result
As we noted in the overview, the results obtained for
the individual channel model constitute a convenient starting
point for analyzing channel models which have a full or
partial probabilistic behavior. It is clear that results regarding
achievable rates in fully probabilistic, compound, arbitrarily
varying and individual noise sequence models can be obtained
by applying the weak law of large numbers to the theorems
discussed here (limited, in general, to the randomized encoders
regime).
E.g. for a compound channel model Wθ (y|x) with
an unknown parameter θ since P̂ (x; y)−→
Pθ (x, y) =
n→∞
Wθ (y|x)Q(x) in probability for every θ and since I(·; ·) is
ˆ y)−→ Iθ (X; Y ). Hence from Theorem 1 rate
continuous I(x;
n→∞
minθ Iθ (X; Y ) can be obtained without feedback, and from
Theorem 3 rate Iθ (X; Y ) can be obtained with feedback.
These results are not new (see [23][24] for the first and the
second is obtained as a special case of the results in [3] and [2]
since the individual noise sequence model can be degenerated
into a compound model) and are given only to show the ease
of using the individual model once established.
To show the strength of the model we analyze a problem
considered also in [2] of an individual sequence which is
determined by an adversary and allowed to depend in a fixed
or randomized way on the past channel inputs and outputs.
For simplicity we start with the binary channel yk = xk ⊕ ek
where ek is allowed to depend on x1k−1 and y1k−1 (possibly
in a random fashion), and the target is to show the empirical
capacity is still achievable in this scenario. Note that here
Ek is a random variable but not assumed toP
be i.i.d. We
n
denote the relative number of errors by ǫ̂ ≡ n1 k=1 ek . We
would like to show the communication scheme achieves a
rate close to 1bit − hb (ǫ̂) in high probability, regardless of
the adversary’s policy. Note that both the achieved rate and
the target 1bit − hb (ǫ̂) are random variables and the claim is
that they are close in high probability (i.e. that the difference
converges in probability to 0 when n → ∞)
Applying the scheme achieving Theorem 3 with Q =
Ber( 12 ) we can asymptotically approach (or exceed) the rate:
ˆ y) = Ĥ(y) − Ĥ(y|x) = Ĥ(y) − Ĥ(e|x) ≥
I(x;
≥ Ĥ(y) − Ĥ(e) = Ĥ(y) − hb (ǫ̂) (74)
Note that unlike in the probabilistic BSC where we have
I(X; Y ) = H(Y )−H(E), here the empirical distribution of e
is not necessarily independent of x, therefore the entropies are
only related by the inequality Ĥ(e|x) ≤ Ĥ(e) (conditioning
reduces entropy). In order to show a rate of 1bit − hb (ǫ̂) is
−→ 1bit . Since Xk
achieved, we only need to show Ĥ(y)n→∞,prob.
k−1
k−1
is independent of X1 , Y1
and therefore also of Ek we
18
have:
Pr(Yk = 0|Y1k−1 ) =
X
Pr(Yk = 0|Y1k−1 , ek )Pr(ek ) =
ek
=
X
ek
=
X
ek
Y1n
Pr(Xk = ek |Y1k−1 , ek )Pr(ek ) =
Pr(Xk = ek )Pr(ek ) =
X1
ek
2
Pr(ek ) =
1
2
(75)
Ber( 21 )
Therefore
is distributed i.i.d.
and from the law of
large numbers and the continuity of H(·) we have the desired
result. This result is a special case of the results in [2].
We can extend the example above to general discrete channels and perform a consolidation of the adversarial sequence
model considered in [2] (for modulu additive channels) with
the general discrete channel with fixed sequence considered
in [3]. We address the channel Ws (y|x) with state sequence
sk potentially determined by an adversary knowing all past
inputsPand outputs. We would like to show that the rate
I(Q, s Ws (y|x)P̂s (s)) (the mutual information of the stateaveraged channel) can be asymptotically attained in the sense
defined above.
This result is a superset of the results of [3] and [2]. It
overlaps with [3] in the case s is a fixed sequence and with
[2] for the case of modulu-additive channel (or when the target
rate is based on the modulu additive model).
ˆ y) ≡ I(P̂ (x), P̂ (y|x))
Since Theorem 3 shows the rate I(x;
can be approached or exceeded asymptotically, it remains
to show that the empirical distribution P̂ (x, y) is asymptotically
close to the state-averaged
P
P distribution Pavg (x, y) ≡
1
W
(y|x)
P̂
(s)Q(x)
=
s
s
s
k WSk (y|x)Q(x), and the ren
sult will follow from continuity of the mutual information.
Note that the later value is a random variable (function)
depending on the behavior of the adversary. Here we do not
use the law of large numbers because of the interdependencies
between the signals x, y and s.
Our purpose is to prove that the difference ∆(t, r) defined
below converges in probability to 0 for every t, r:
∆(t, r) ≡ P̂(x,y) (t, r) − Pavg (t, r) =
1X
1X
Ind(Xk = t, Yk = r) −
WSk (r|t)Q(t) ≡
=
n
n
k
k
1X
ϕk (t, r) (76)
≡
n
k
where ϕk (t, r) ≡ Ind(Xk = t, Yk = r) − WSk (r|t)Q(t). For
brevity of notation we omit the argument (t, r) from ϕk (t, r)
since from this point on it takes a fixed value. Then
E(Ind(Xk = t, Yk = r)|X k−1 , Y k−1 , S k ) =
= Pr(Xk = t, Yk = r|X k−1 , Y k−1 , S k ) =
= Pr(Xk = t|X k−1 , Y k−1 , S k )·
(a)
· Pr(Yk = r|Xk = t, X k−1 , Y k−1 , S k ) =
(b)
= Pr(Xk = t) · Pr(Yk = r|Xk = t, Sk ) = Q(t)WSk (r|t)
(77)
where (a) is due to the independent drawing of Xk (when not
conditioned on the codebook), the fact S k is independent of
Xk , and the memoryless channel (defining the Markov chain
(X k−1 , Y k−1 , S k−1 ) ↔ (Xk , Sk ) ↔ Yk ), and (b) is due to the
i.i.d drawing of Xk from Q and the definition of W . From
Eq.(77) we have that:
E(ϕk |X k−1 , Y k−1 , S k ) = 0
(78)
By the smoothing theorem we also have that ϕk has zero
mean E(ϕk ) = 0. We now show that ϕk are uncorrelated.
Consider two different indices j < k (without loss of generality) then
E(ϕk · ϕj ) = E E(ϕk · ϕj |X k−1 , Y k−1 , S k ) =
= E ϕj · E(ϕk |X k−1 , Y k−1 , S k ) = 0 (79)
where we used the smoothing theorem and the fact ϕj is
completely determined by Xj , Yj , Sj which are given. In
addition since by definition −1 ≤ ϕk ≤ 1, E(ϕ2k ) ≤ 1.
Therefore
n
n
1 X
1
1 X
(80)
E(ϕk · ϕj ) ≤ 2
δjk =
E(∆2 ) = 2
n
n
n
j,k=1
j,k=1
and by Chebychev inequality for any ǫ > 0:
Pr(|∆(t, r)| > ǫ) ≤
1
E(∆2 )
≤ 2 n→∞
−→ 0
ǫ2
nǫ
(81)
which proves the claim.
This result is new, to our knowledge, however the main
point here is the relative simplicity in which it is attained when
relying on the empirical channel model (note that most of the
proof did not require any information-theoretic argument).
VIII. C OMMENTS AND FURTHER STUDY
A. Limitations of the model
The scheme presented here is suboptimal when operated
over channels with memory or, in the continuous case over non
AWGN channels, and in section VII-E we discussed several
cases where the communication fails completely. Obviously
the solution is to extend the time order of the model. A simple
extension is by using the super-alphabets X p and Y p and treating a block of channel uses as one symbol. A more delicate
extension is by considering a Markov model (the p-th order
k−1
empirical conditional probability P̂ (xk , yk |xk−1
k−p , yk−p )).
For the continuous channel we focused on a specific class
of continuous channels where the alphabet is the real numbers
(we have not considered vectors as in MIMO channels), and
we did not achieve the full mutual information. A possible
extension is to find measures of empirical mutual information
for the continuous channels which are also attainable and
approach the probabilistic mutual information for probabilistic
channels. The current paper exhibits a considerable similarity
between the continuous case and the discrete case which is not
fully explored here, and a unifying theory which will include
the two as particular cases is wanting.
We conjecture that the following definition of empirical
mutual information may achieve these goals: given a family
19
of joint distributions (not necessarily i.i.d) {Pθ (x, y), θ ∈ Θ}
define the entropy with respect to the family Θ as the entropy
of the closest member of the family (in maximum likelihood sense): ĤΘ (x) = minθ∈Θ − n1 log Pθ (x) and likewise
ĤΘ (x|y) = minθ∈Θ − n1 log Pθ (x|y), and define the relative
mutual information as IˆΘ (x; y) = ĤΘ (x) − ĤΘ (x|y). This
definition corresponds to our target rates for the discrete case
(with Θ as the family of all DMC-s) and continuous case (with
Θ the family of all joint Gaussian zero-mean distributions
N (0, ΛXY )).
B. Overhead and error exponent
Another aspect is the overhead associated with extending the
empirical distribution (”channel”) family which is considered
(both in considering time dependence and in increasing the
accuracy with which the distribution is estimated or described).
This overhead is related to the redundancy or regret associated
with universal distributions (see [25]). Although we haven’t
performed a detailed analysis of the overheads and considered
only the asymptotically achievable rates, it is obvious from
comparing Lemmas 1 and 4 that the tighter rates we obtained
for the discrete channel come at the cost of additional overhead
(O(log(n)) compared to O(1) in the continuous case) which is
associated with the richness of the channel family (describing
a conditional probability as opposed to a single correlation
factor). Thus for example for a discrete channel with a large
alphabet and a small block size n we would sometimes be
better off using the ”continuous channel model” version of
our scheme (gaining only from the correlation) rather than
the scheme of the discrete case (gaining the empirical mutual
information). The issue of overheads requires additional analysis in order to determine the bounds on the overheads and
the tradeoff between richness of the channel family and the
rate, for a finite n. As we noted in section VI-C2 the bounds
we currently have for the rate-adaptive continuous case are
especially loose and call for improvement.
Since rate can be traded off for error probability, a related
question is the error exponent. Here, a good definition is still
lacking for variable rate schemes, and the error exponents
are not known for individual channels. The scheme we described does not endeavor to attain a good error exponent.
Specifically, since the block of n channel uses is broken
into multiple smaller blocks, it is probably not an efficient
scheme in terms of error rate. We note, however, that for
rate adaptive schemes with feedback a good error exponent
does not necessarily relate to the capability of sending a
message with small probability of error, but rather to the
capability to detect the errors. A similar situation occurs in
the setting of random decision time considered by Burnashev
[12]. In the later, an uncertainty of the decoder with respect
to the message is mitigated by sending an acknowledge /
unacknowledge (ACK/NACK) message and possibly repeating
the transmission with small penalty in the average rate (see a
good description in [11] sec IV.B). A similar approach can
be used in our setting (fixed decoding time, variable rate), by
sending an ACK/NACK over a fixed portion of the block and
setting R = 0 when the decoder is not certain of the received
message. However we did not perform a detailed analysis.
Note also that the analysis of the probability PA to transmit at
a rate lower than the target rate function is entangled with the
error analysis, since by such schemes it is possible to trade off
rate for error, and reduce the error probability at the expense
of increasing the probability to fall short of the target rate.
C. Determining the behavior of the transmitted signal (prior)
In this work we assumed a fixed prior (input probability
distribution) and haven’t dealt with the question of determining
the prior, or more generally, how the encoder should adapt
its behavior based on the feedback. Had the channel been
a compound one, it stands to reason that a scheme using
feedback may estimate the channel and adjust the input prior,
and may asymptotically attain the channel capacity. However
in the scope of individual channels (as well as individual
sequence channels and AVC-s) it is not clear whether the
approach of adjusting to the input distribution to the measured
conditional distribution is of merit, if the empirical channel
capacity can be attained for every sequence, and even the
definition of achievability is unclear if the input distribution
is allowed to vary.
Another related aspect is what we require from a communications system when considered under the individual channel
framework. This question is relevant to all the requirements
defined in the theorems (for example is the existence of
the failure set J necessary ?), however the most outstanding
requirement is related to the prior.
Currently we constrained the input sequence to be a random
i.i.d. sequence chosen from a fixed prior, which seems to be
an overly narrow definition. The rationale behind this choice
is that without any constraint on the input, the theorems
we presented can be attained in a void way by transmitting
only bad (e.g. fixed) sequences that guarantee zero empirical
rate. Furthermore, without this constraint, attainability results
for probabilistic models, and in general any attainable rates
which are not conditioned on the input sequence could not
be derived from our individual sequence theorems. A weaker
requirement from the encoder is to be able to emit any
possible sequence, however this requirement is not sufficient,
since from the existence of such encoders we could not infer
the existence of encoders achieving any positive rate over a
specific channel. Consider for example the encoder satisfying
the requirement by transmitting bad sequences in probability
1 − ǫ and good sequences in probability ǫ → 0. Theorems
1,2,3 and 4 are existence theorems, i.e. they guarantee the
existence of at least one system satisfying the conditions.
Had we removed the requirement for fixed input prior we
saw these theorems would be attained by encoders that are
unsatisfactory in other aspects. Once the theorem is satisfied
by one encoder it cannot guarantee the existence of other
(satisfactory) encoders, thus making it un-useful. Therefore
the requirement for fixed prior is necessary in the current
framework. Although in the scope of the theorems presented
here, this requirement only strengthens the theorems (since
it reveals additional properties of the encoder attaining the
other conditions of the theorem), we are still bothered by
20
the question what should be the minimal requirements from a
communication system, and these hopefully will not include
a constraint on the input distribution.
This issue relates to a fundamental difficulty which aries
in communication over individual channels: unlike universal
source coding in which the sequence is given a-priori, here
the sequences are given a-posteriori, and the actions of the
encoder affect the outcome in an unspecified way. Currently
we broke the tie by placing a constraint on the encoder, but
we seek a more general definition of the problem.
coding (as in [12] [19]) in which the block size is not fixed but
determined by the decoder. We did not include this scenario
since the achievability result is less elegant in a way: the
decoder indirectly affects the target rate (mutual information)
through the block size. On the other hand this case may be
of practical interest. Clearly the mutual information can be
asymptotically attained for this communication scenario as
well and its analysis is merely a simpler version of the rate
analysis performed in section VI-C, since convexity is not
required.
D. Amount of randomization
G. Bounds
We have assumed so far there is no restriction on the amount
of common randomness available and have not attempted
to minimize the amount of randomization required (while
maintaining the same rates). It is shown in [2] that less than
O(n) of randomization information is required in some cases
and O(n) is enough for others (see section V.5 therein),
whereas we have used at least O(M · n) > O(n2 ) random
drawings to produce the codebook.
In this paper we focused on achievable rates and did not
show a converse. An almost obvious statement is that any
continuous rate function which depends only on the zero-order
empirical statistics / correlation (respectively) cannot exceed
asymptotically the rate functions of Theorems 3, 4 respectively
with vanishing error probability. To show the statement for
the discrete case determine y using a memoryless channel
W (y|x). Then by the law of large numbers the empirical
distribution converges to the channel distribution and from the
continuity of the rate function the empirical rate converges to
the rate function taken at the channel distribution. Since by
Theorem 3 the actual rate asymptotically meets or exceeds
the rate function, and by the converse of the channel capacity
theorem the actual rate cannot exceed (asymptotically) the
mutual information, we have that the rate function cannot
exceed the mutual information (Remp ≤ Ract ≤ I(P, W )),
up to asymptotically vanishing factors. For the continuous
case the analogue claim is shown by taking a Gaussian
additive channel and replacing ”distribution” by ”correlation”
and ”empirical mutual information” by − 21 log(1 − ρ̂2 ). The
same applies also to rate functions obeying the conditions of
Theorems 1, 2. More general bounds are yet to be studied.
E. Practical aspects
The scheme described in this work is a theoretical one,
but the concept appears to be extendable to practical coding
systems. Below we focus on the continuous case and merely
give the motivation (without proof). One may replace the
correlation receiver (GLRT) by a receiver utilizing training
symbols to learn the channel effective gain, and then apply
maximum likelihood (or approximate, e.g. iterative) decoding.
The randomization of the codebook may be replaced by using
a fixed code with random interleaving, since with random
interleaving only the empirical distribution of the (effective)
noise sequence affects the error probability, and we may
conjecture that the property that Gaussian noise distribution
is the worst is approximately true for practical codes (such
as turbo codes and LDPC). When using a random interleaver
the training symbols as well as the part of the coded symbols
can be interleaved together, and the decoding attempts (which
occur every symbol in the theoretical scheme) occur only at the
end of each interleaving block. The rateless code is replaced
by an incremental redundancy scheme, i.e. by sending each
time part of the symbols of the codeword, and repeating the
codeword if all symbols were transmitted without successful
decoding. The decision when to decode can be simply replaced
by decoding and using a CRC check. Finally the common
randomness (required only for the generation of the interleaver
permutation) can be replaced by pseudo-randomness. Such a
scheme may not be able to attain the promise of Theorem
4 for every individual sequence but may be able to adapt to
every natural and man-made channel.
H. Comparison of the rate adaptive scheme with the similar
scheme in [3]
As noted the rate adaptive scheme we use is similar to the
scheme of [3] in its high level structure. Table II compares
some attributes of the schemes.
Another important factor is the overhead (i.e. the loss in
number of bits communicated with a given error exponent,
compared to the target rate), which we were unable to
compare. We conjecture that the current scheme may have
a lower overhead due to its simplicity which results in a
smaller number of parameters and constraints on their order
of magnitude (compared to the scheme of [3] where relations
between factors such as number of pilots and the minimum
size of a chunk may require a large value of n).
IX. C ONCLUSION
F. Random decision time
In our discussion we have described two communication
scenarios: fixed rate without feedback and variable rate with
feedback, and in both we assumed a fixed block size n.
Another scenario is that of random decision time or rateless
We examined achievable transmission rates for channels
with unspecified models, and focused on rates determined
by a channel’s a-posteriori empirical behavior, and specifically on rate functions which are determined by the zeroorder empirical distribution. This communication approach
21
TABLE II
C OMPARISON RATE ADAPTIVE SCHEMES IN CURRENT PAPER AND [3]
Item
Channel model
Mechanism for adaptivity
Transmit format
Feedback
Alphabet
Training
Randomness
Codebook construction
Stopping condition
Decoding
Stopping location
Eswaran et al [3]
Individual sequence
Repeated instanced of rateless coding
Total time divided to rounds (=rateless blocks) which are divided to
chunks
Ternary (Bad Noise/Decoded/Keep
Going), once per chunk
Discrete
Known symbols in random locations in each chunk
Full (O(exp(nR)))
Current Paper
Individual channel
Repeated instanced of rateless coding
Total time divided to rateless
blocks
Constant composition + expurgation + training insertion
Threshold over mutual information
of channel estimated from training
Maximum (empirical) mutual information
End of Chunk
Random i.i.d.
Comments
Chunks in [3] used as feedback
instances and expurgated code has
constant type over chunks
Easy to generalize to once every
1/ǫ symbols (see VI-C)
Binary (Decoded/Not Decoded)
per symbol
Discrete or Real valued
None
Full (O(exp(nR)))
Might be reduced by selection from
a smaller collection of codebooks
(in both cases)
Threshold over empirical mutual
information of best codeword
Maximum (empirical) mutual information
Any symbol
does not require a-priori specification of the channel model.
The main result is that for discrete channels the empirical
mutual information between the input and output sequences is
attainable for any output sequence using feedback and common randomness, and for continuous real valued channels an
effective ”Gaussian capacity” − 12 (1− ρ̂2 ) can be attained. This
generalizes results obtained for individual noise sequences and
is a useful model for analyzing compound, arbitrarily varying,
and individual noise sequence channels.
conditional type have the same (marginal) type, we can write:
X
ˆ y) ≥ t =
Qn TX|Y (y) =
Qn I(x;
Tt
(a)
=
X
Tt
(b)
≤
X
Tt
|TX|Y (y)| exp {−n [H(TX ) + D(TX ||Q)]} ≤
n
o
n
h
io
exp nH(X̃|Ỹ ) exp −n H(X̃) + D(TX ||Q) =
=
X
Tt
ACKNOWLEDGMENT
n
h
io
exp −n I(X̃; Ỹ ) + D(TX ||Q) ≤
≤ |Pn (X Y)|·exp −n min I(TY , TX|Y ) + D(TX ||Q)
Tt
The authors would like to thank the reviewers of the
ISIT 2009 conference paper on the subject for their helpful
comments and references.
A PPENDIX
A. Proof of Lemma 1
The proof is a rather standard calculation using the method
of types. We use the notations of [10]. We divide the sequences according to their joint type TXY . The type TXY is
defined by the probability distribution TXY ∈ Pn (X Y). For
notational purposes we define the dummy random variables
(X̃, Ỹ ) ∼ TXY and TX , TY , TY |X as the marginal and conditional distributions resulting from TXY . Following [10], the
conditional type is defined as TX|Y (y) ≡ {y : (x, y) ∈ TXY }.
The empirical mutual information of sequences in the type
TXY is simply I(X̃; Ỹ ) = I(TY , TY |X ). Define Tt ≡ {TXY ∈
Pn (X Y) : I(TY , TY |X ) ≥ t}. Since all sequences in the
(c)
≤ (n + 1)|X ||Y| · exp (−nt) =
log(n + 1)
= exp −n t − |X ||Y|
n
(82)
where (a) is due to [10] Eq.(II.1), (b) results from eq.(83)
below which is an extension of (II.4) there to conditional types
(and is a stronger version of Lemma II.3), based on the fact
that in the conditional type TX|Y (y) the values of x over
the na = nTY (a) indices for which yi = a have empirical
distribution TX|Y and therefore thenumber of such sequences
is limited to exp na H(X̃|Ỹ = a) , hence:
|TX|Y (y)| ≤
Y
a
exp nTY (a)H(X̃|Ỹ = a) =
= exp nH(X̃|Ỹ )
(83)
(c) is based on bounding the number of types (see [14],
Theorem 11.1.1), and the fact that in the minimization region
I(TY , TX|Y ) ≥ t and D(TX ||Q) ≥ 0 therefore the result of
the minimum is at least t.
22
B. Discussion of Lemma 1
1) An alternative proof for the exponential rate: For the
proof of Theorem 1 we do not need the strict inequalities and
equality in the error exponent would be sufficient, however
these will be useful later for the rateless coding. An explanation for the fact that the result does not depend on Q can be
obtained by showing that the above probability can be bounded
for each type of x separately. I.e. if x is drawn uniformly over
the type TX the probability of the above condition is:
X
X
exp(nH(X̃|Ỹ ))
|TX|Y (y)|
TXY ∈Tt
. TXY ∈Tt
=
=
|TX |
exp(nH(X̃))
X
.
exp(−nI(X̃; Ỹ )) = exp(−nt) (84)
=
TXY ∈Tt
where Tt ≡ TXY ∈ Pn (X Y) : (TXY )X = TX , (TXY )Y =
TY , I(TY , TY |X ) ≥ t and since drawing x ∼ Qn is
equivalent to first drawing the type of x and then drawing
x uniformly over the type, the bound holds when x ∼ Qn .
2) Extension to alpha receivers: Following we discuss
an extension of the bound and relate it to Agarwal’s [8]
coding theorem using the rate distortion function. Consider
a communication system similar to that of Theorem 1, where
the codebook is a constant composition code, consisting of
randomly selected sequences of type Q, and the receiver is
an α receiver (see [26]), i.e. selects the received codeword
by maximizing a function α̂(x, y) depending only on the
joint empirical distribution of the sequences x, y. The function
α(X̃, Ỹ ) = α(TXY ) is defined as the respective function of
the distribution of X̃, Ỹ . Then, the pairwise error probability
may be bounded similarly to eq. (84) by replacing the condition the condition I(TY , TY |X ) ≥ t in the definition of Tt by
α(TXY ) ≥ t, and obtaining:
.
Pr(α̂(x, y) ≥ t) ≤ Pα =
.
= exp − n
min
PX̃ Ỹ :X̃∼Q
I(X̃; Ỹ )
≤
Ỹ ∼P̂ (y)
α(X̃,Ỹ )≥t
≤ exp − n
min
PX̃ Ỹ :X̃∼Q
I(X̃; Ỹ )
(85)
α(X̃,Ỹ )≥t
Following the proof of Theorem 1, the RHS of eq.(85)
determines the following achievable rate:
Remp (x, y) ≈
min
X̃∼Q,
I(X̃; Ỹ )
≈
ˆ y) (86)
≤ I(x,
α(X̃,Ỹ )≥α̂(x,y)
Where the approximate inequality stems from substituting
the empirical distribution of x, y as a particular distribution
of X̃, Ỹ meeting the minimization constraints. The above
expression is similar to the one obtained in mismatch decoding
with random codes. Eq.(85) allows a larger (but still limited)
scope of empirical rate functions, but also shows that within
this scope the best function is still the empirical mutual information. On the other hand, an advantage of this expression is
that under some continuity conditions it can be extended from
discrete to continuous vectors (as performed in [8]).
When substituting α with the distortion function α(X̃, Ỹ ) =
−Ed(X̃, Ỹ ), we would obtain:
Remp (x, y) ≈
min
I(X̃; Ỹ )
=
X̃∼Q,
Ed(X̃,Ỹ )≥Êd(x,y)
= RX (Êd(x, y)) = RX (D̂) (87)
where RX (D) is the rate distortion function of an i.i.d. source
X ∼ Q with the distortion metric d. The later relation can
be used to show the result that communication at the rate
RX (D) is possible where D is the empirical or the maximum
guaranteed distortion of the channel as shown in [8]. On the
other hand, when using the correlation function α(X̃, Ỹ ) =
E(X̃ Ỹ )
= ρ, we would obtain from eq.(86) and Lemma
E(X̃ 2 )E(Ỹ 2 )
2: Remp (x, y) ≈ − 21 log(1 − ρ̂2 ). Note that although the later
expression is the same as the one obtained in Theorem 2, the
above derivation only proves it for discrete vectors.
C. Proof of Lemma 2
For random variables X and Y where X is continuous (not
necessarily Gaussian) we have the following bound on the
conditional differential entropy (Ỹ denotes a dummy variable
with the same distribution as Y and used for notational
purposes):
h
i
≤
h(X|Y ) = EỸ h X Y = Ỹ
(a)
1
log (2πeV AR(X|Y )) ≤
≤ E
2
(b) 1
≤ log (2πeE [V AR(X|Y )]) =
2
1
= log (2πeE [V AR(X − α · Y |Y )]) ≤
2
(c) 1
≤ log 2πeE(X − α · Y )2 =α:= E(XY )
2
E(Y 2 )
2
1
E(XY )
= log 2πe E(X 2 ) −
=
2
E(Y 2 )
1
= log 2πeE(X 2 )(1 − ρ2 ) =
2
1
1
(88)
= log 2πeE(X 2 ) + log 1 − ρ2
2
2
where (a) is based on the Gaussian bound on entropy, (b) on the concavity of the log function (see also [14], Eq. (17.24)), and (c) on VAR(X) = E(X²) − (EX)² ≤ E(X²); the last step is similar to the assertion that E[VAR(X|Y)], which is the MMSE estimation error, is not worse than the LMMSE estimation error (except for our disregard of the mean).
Therefore for a Gaussian X:
\[
I(X;Y) = h(X) - h(X|Y) = \tfrac{1}{2}\log\left(2\pi e\,\mathbb{E}(X^2)\right) - h(X|Y)
\overset{(88)}{\ge} -\tfrac{1}{2}\log\left(1-\rho^2\right)
\tag{89}
\]
Proof of Corollary 2.1: Equality in (a) holds only if X|Y is Gaussian for every value of Y, in (b) only if X has a fixed variance conditioned on every Y, and in (c) only if E(X − α·Y | Y) = 0, i.e. E(X|Y) = α·Y. Together these yield X|Y ∼ N(αY, const), which implies X, Y are jointly Gaussian (easy to check by calculating the pdf). Note that if X, Y are jointly Gaussian then Y can be represented as the result of an additive white Gaussian noise (AWGN) channel with gain operating on X:
\[
Y \sim \mathbb{E}(Y|X) + N\!\left(0, \mathrm{VAR}(Y|X)\right) = \tilde\alpha\cdot X + N(0, \sigma^2) + \mathrm{const}
\tag{90}
\]
To show Corollary 2.2, consider X = Y ∼ Ber(½), in which case I(X;Y) = 1 and ρ = 1, therefore the assertion does not hold.
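For illustration, the inequality of Lemma 2 can be probed numerically on a non-Gaussian additive channel. The following Python sketch (assuming numpy; the uniform-noise example, sample size and the crude histogram plug-in estimator are our illustrative choices) checks that the estimated mutual information exceeds −½ log(1 − ρ²):

import numpy as np

# Our toy check of Lemma 2: X ~ N(0,1), Y = X + U(-sqrt(3), sqrt(3)) (uniform noise,
# unit variance).  The mutual information, estimated by a crude 2-D histogram plug-in,
# should be at least -0.5*log(1 - rho^2).  Sample size and binning are arbitrary choices.
rng = np.random.default_rng(5)
N = 500_000
x = rng.standard_normal(N)
y = x + rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N)

rho = float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
lower_bound = -0.5 * np.log(1 - rho**2)

counts, _, _ = np.histogram2d(x, y, bins=60)
pxy = counts / N
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
mask = pxy > 0
mi_est = float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))
print(f"I_hat = {mi_est:.3f} nats >= -0.5*log(1-rho^2) = {lower_bound:.3f} nats")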
Fig. 6. A geometric interpretation of Lemma 4

D. Proof of Lemma 4

Write the empirical correlation as
\[
\hat\rho \equiv \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}
= \left(\frac{\mathbf{x}}{\|\mathbf{x}\|}\right)^{T}\frac{\mathbf{y}}{\|\mathbf{y}\|}
\tag{91}
\]
From the expression above we can infer that ρ̂ does not depend on the amplitude of x and y but only on their direction. Since x is isotropically distributed, the result does not depend on the direction of y (unless y = 0, in which case the claim is trivially correct); therefore it is independent of y and we can conveniently choose y = (1, 0, 0, . . . , 0). To put this claim more formally, for any unitary n × n matrix U we can write:
\[
\hat\rho = \frac{\mathbf{x}^T\mathbf{y}}{\sqrt{(\mathbf{x}^T\mathbf{x})(\mathbf{y}^T\mathbf{y})}}
= \frac{\mathbf{x}^T U^T U \mathbf{y}}{\sqrt{(\mathbf{x}^T U^T U\mathbf{x})(\mathbf{y}^T U^T U\mathbf{y})}}
= \left(\frac{U\mathbf{x}}{\|U\mathbf{x}\|}\right)^{T}\frac{U\mathbf{y}}{\|U\mathbf{y}\|}
\tag{92}
\]
Since x is Gaussian, Ux has the same distribution as x; thus the probability remains unchanged if we remove U from the left-hand side and remain with ρ̂′ = (x/‖x‖)ᵀ (Uy/‖Uy‖). For y ≠ 0, we may choose the unitary matrix U whose first row is y/‖y‖ and whose other rows complete it to an orthonormal basis of the linear space ℝⁿ. Then Uy = (‖y‖, 0, 0, . . . , 0) and therefore Uy/‖Uy‖ = (1, 0, 0, . . . , 0). Thus the distribution of ρ̂′ = (1, 0, 0, . . . , 0)·(x/‖x‖) = x₁/‖x‖ equals the distribution of ρ̂. Assuming without loss of generality that x ∼ Nⁿ(0, 1), we have:
\[
\begin{aligned}
\Pr(|\hat\rho| \ge t) &= \Pr\!\left(\frac{|x_1|}{\|\mathbf{x}\|} \ge t\right)
= \Pr\!\left(x_1^2 \ge t^2\left(\|\mathbf{x}_2^n\|^2 + x_1^2\right)\right)
= \Pr\!\left(x_1^2 \ge \frac{t^2}{1-t^2}\|\mathbf{x}_2^n\|^2\right) \\
&= \mathbb{E}\left[\Pr\!\left(x_1^2 \ge \frac{t^2}{1-t^2}\|\mathbf{x}_2^n\|^2 \,\Big|\, \mathbf{x}_2^n\right)\right]
= \mathbb{E}\left[2Q\!\left(\sqrt{\frac{t^2}{1-t^2}}\,\|\mathbf{x}_2^n\|\right)\right]
\le \mathbb{E}\left[2e^{-\frac{1}{2}\frac{t^2}{1-t^2}\|\mathbf{x}_2^n\|^2}\right] \\
&= 2\int_{\mathbb{R}^{n-1}} \frac{1}{(2\pi)^{(n-1)/2}}\, e^{-\frac{1}{2}\|\mathbf{x}_2^n\|^2}\, e^{-\frac{1}{2}\frac{t^2}{1-t^2}\|\mathbf{x}_2^n\|^2}\, d\mathbf{x}_2^n
= 2\int_{\mathbb{R}^{n-1}} \frac{1}{(2\pi)^{(n-1)/2}}\, e^{-\frac{1}{2}\frac{1}{1-t^2}\|\mathbf{x}_2^n\|^2}\, d\mathbf{x}_2^n \\
&= 2(1-t^2)^{\frac{n-1}{2}} \int_{\mathbb{R}^{n-1}} f_{N^{n-1}(0,\,1-t^2)}(\mathbf{x}_2^n)\, d\mathbf{x}_2^n
= 2(1-t^2)^{\frac{n-1}{2}}
= 2\exp\left(-(n-1)R_2(t)\right)
\end{aligned}
\tag{93}
\]
where we used the rough upper bound on the Gaussian error function Q(x) ≡ Pr(N(0,1) ≥ x) ≤ e^{−x²/2}, and f_{Nⁿ(µ,σ²)} denotes the pdf of a Gaussian i.i.d. vector.
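For illustration, the bound of Lemma 4 can be checked by direct Monte Carlo simulation; in the following Python sketch (assuming numpy), the dimension n, the threshold t and the number of trials are arbitrary illustrative choices:

import numpy as np

# Our Monte Carlo check of the Lemma 4 bound: for x ~ N(0, I_n) and any fixed y != 0,
# Pr(|rho_hat| >= t) <= 2*(1 - t^2)^((n-1)/2).  By isotropy any fixed direction y works.
rng = np.random.default_rng(2)
n, t, trials = 50, 0.3, 200000
y = np.ones(n)
x = rng.standard_normal((trials, n))
rho_hat = (x @ y) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y))
print("empirical:", np.mean(np.abs(rho_hat) >= t),
      "bound:", 2 * (1 - t**2) ** ((n - 1) / 2))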
Discussion: A geometrical interpretation of Lemma 4 relates this probability to the solid angle of the cone {x : |ρ̂| > t}. Since x is isotropically distributed, the probability to have |ρ̂| > t equals the relative surface determined by vectors having |ρ̂| > t on the unit n-ball (termed the solid angle). Since ρ̂ is the cosine of the angle between x and y, the points where |ρ̂| > t generate a cone with inner angle 2α, where cos(α) = t, and their intersection with the unit n-ball is a spherical cap (dome), shown in Figure 6. We can obtain a similar bound to the one above using geometrical considerations. Write the volume of an n-dimensional ball as $V_n r^n$, where $V_n$ is a fixed factor $V_n = \frac{\pi^{n/2}}{\Gamma(1+n/2)}$ [27]; accordingly, the surface of an n-dimensional ball is (the derivative) $nV_n r^{n-1}$. Then the relative surface of the spherical cap can be computed by integrating the surfaces of the (n−1)-dimensional balls with radius sin(θ) that have a fixed angle θ with respect to y, and can be bounded as follows:
\[
\begin{aligned}
\Pr(|\hat\rho| \ge t) &= \frac{\text{Surface of cap}}{\text{Surface of ball}}
= \frac{1}{nV_n}\int_{\theta=0}^{\alpha} (n-1)V_{n-1}\sin^{n-2}(\theta)\, d\theta \\
&\le \frac{V_{n-1}}{V_n}\cdot\sin^{n-3}(\alpha)\int_{\theta=0}^{\alpha}\sin(\theta)\,d\theta
= \frac{V_{n-1}}{V_n}\cdot\sin^{n-3}(\alpha)\,\big(1-\cos(\alpha)\big) \\
&\overset{\alpha\le\frac{\pi}{2}}{\le} \frac{V_{n-1}}{V_n}\cdot\sin^{n-3}(\alpha)\,\big(1-\cos^2(\alpha)\big)
= O(\sqrt n)\cdot\sin^{n-1}(\alpha)
= O(\sqrt n)\cdot\left(\sqrt{1-\cos^2(\alpha)}\right)^{n-1}
= O(\sqrt n)\cdot(1-t^2)^{(n-1)/2}
\end{aligned}
\tag{94}
\]
where the asymptotic behavior $\frac{V_{n-1}}{V_n} = O(\sqrt{n})$ (the ratio $\frac{V_{n-1}}{\sqrt{n}\,V_n}$ tends to a constant) is based on [28], Eq. (99). An interesting observation is that the assumption of a Gaussian distribution is not necessary and this bound is true for all isotropic distributions.
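For illustration, the cap-surface integral above can be evaluated numerically and compared with the closed-form bound; in the following Python sketch (assuming numpy), n, t and the integration grid are arbitrary choices, and the computed quantity is the relative surface of a single cap {ρ̂ ≥ t} as it appears in the derivation of (94):

import numpy as np
from math import gamma, pi, acos

def V(n):
    # Volume factor of the unit n-ball: V_n = pi^(n/2) / Gamma(1 + n/2), as in [27].
    return pi ** (n / 2) / gamma(1 + n / 2)

# Relative surface of a single cap {x : rho_hat >= t}, from the integral in (94),
# compared with the closed-form bound (V_{n-1}/V_n)*(1 - t^2)^((n-1)/2).
n, t = 50, 0.3
alpha = acos(t)
theta = np.linspace(0.0, alpha, 100_000)
integrand = (n - 1) * V(n - 1) * np.sin(theta) ** (n - 2)
cap_ratio = float(np.sum(integrand)) * (theta[1] - theta[0]) / (n * V(n))
bound = (V(n - 1) / V(n)) * (1 - t**2) ** ((n - 1) / 2)
print(f"cap surface ratio = {cap_ratio:.4g}, bound from (94) = {bound:.4g}")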
E. Proof of Lemma 6

We denote by xᵢ, yᵢ the sub-vectors over Aᵢ (i.e. xᵢ ≡ x_{Aᵢ}, yᵢ ≡ y_{Aᵢ}), their length by nᵢ ≡ |Aᵢ| and their relative length by λᵢ = nᵢ/n. We are interested in finding a subset J_∆ of x-s with bounded probability, such that outside this set Σᵢ λᵢρ̂ᵢ² ≥ ρ̂² − ∆ for any y. Consider the following inequality:
\[
\begin{aligned}
\|\mathbf{x}\|^2\cdot\|\mathbf{y}\|^2\cdot\hat\rho^2 &= \left(\mathbf{x}^T\mathbf{y}\right)^2
= \left(\sum_i \mathbf{x}_i^T\mathbf{y}_i\right)^2
= \left(\sum_i \hat\rho_i\|\mathbf{x}_i\|\cdot\|\mathbf{y}_i\|\right)^2
\overset{(a)}{\le} \sum_i \hat\rho_i^2\|\mathbf{x}_i\|^2 \cdot \sum_i\|\mathbf{y}_i\|^2 \\
&= \left(\sum_i \lambda_i\hat\rho_i^2 + \sum_i \hat\rho_i^2\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right)\right)\cdot\|\mathbf{x}\|^2\cdot\|\mathbf{y}\|^2 \\
&\overset{(b)}{\le} \left(\sum_i \lambda_i\hat\rho_i^2 + \sum_i \max\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i,\, 0\right)\right)\cdot\|\mathbf{x}\|^2\cdot\|\mathbf{y}\|^2
\end{aligned}
\tag{95}
\]
where (a) is the Cauchy-Schwarz inequality and (b) holds since ρ̂ᵢ²zᵢ ≤ zᵢ for zᵢ ≥ 0 and ρ̂ᵢ²zᵢ ≤ 0 for zᵢ ≤ 0, therefore always ρ̂ᵢ²zᵢ ≤ max(zᵢ, 0) (attained for ρ̂ᵢ = Ind(zᵢ > 0)). Both inequalities are tight in the sense that for each x there is a sequence y (equivalent to choosing {‖yᵢ‖²} and {ρ̂ᵢ}) that meets them with equality. Dividing by ‖x‖²·‖y‖² we have that
\[
\hat\rho^2 - \sum_i \lambda_i\hat\rho_i^2 \le \sum_i \max\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i,\, 0\right)
\tag{96}
\]
where the RHS depends only on x and should be bounded by ∆. Thus the minimal set J_∆ is:
\[
J_\Delta \equiv \left\{\mathbf{x} : \sum_i \max\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i,\, 0\right) > \Delta\right\}
\tag{97}
\]
The set is minimal in the sense that none of its elements can be removed while still meeting the conditions of the lemma. We would like to bound the probability of J_∆. The expression Σᵢ max(zᵢ, 0) is a partial sum of the zᵢ, and since negative zᵢ are not summed, it is easy to see this is the maximal partial sum, i.e. we can write this sum alternatively as
\[
\sum_i \max(z_i, 0) = \max_{I\in\mathcal{P}} \sum_{i\in I} z_i
\tag{98}
\]
where $\mathcal{P} \equiv 2^{\{1,\ldots,p\}}\setminus\emptyset$ denotes all non-empty subsets of {1, . . . , p}, and its size is 2ᵖ − 1. Therefore from the union bound we have:
\[
\Pr\{J_\Delta\} = \Pr\left\{\max_{I\in\mathcal{P}}\sum_{i\in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta\right\}
\le \sum_{I\in\mathcal{P}} \Pr\left\{\sum_{i\in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta\right\}
\tag{99}
\]
To bound the above probability we first develop a bound on the probability $\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right)$ for some coefficients aᵢ:

Lemma 7. Let x ∼ N(0, P)ⁿ. For coefficients {aᵢ}ᵢ₌₁ᵖ with Σᵢ λᵢaᵢ = ā > 0 and |aᵢ| ≤ A, where |ā| ≤ ⅛A, we have
\[
\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right) \le e^{-nE}
\tag{100}
\]
where
\[
E = \frac{\bar a^2}{6A^2}
\tag{101}
\]
Now we apply the bound to the events in Eq. (99):
\[
\begin{gathered}
\sum_{i\in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta \\
\Updownarrow \\
\sum_{i\in I}\|\mathbf{x}_i\|^2 - \sum_{i\in I}\lambda_i\cdot\|\mathbf{x}\|^2 > \Delta\cdot\|\mathbf{x}\|^2 \\
\Updownarrow \\
\sum_{i=1}^{p}\underbrace{\left(\Delta + \sum_{j\in I}\lambda_j - \mathrm{Ind}(i\in I)\right)}_{\equiv a_i}\,\|\mathbf{x}_i\|^2 < 0
\end{gathered}
\]
We have:
\[
\bar a = \sum_{i=1}^{p}\lambda_i a_i
= \Delta\cdot\sum_{i=1}^{p}\lambda_i + \sum_{j\in I}\lambda_j\cdot\sum_{i=1}^{p}\lambda_i - \sum_{i=1}^{p}\mathrm{Ind}(i\in I)\,\lambda_i = \Delta
\tag{102}
\]
And |aᵢ| ≤ 1 + ∆ ≡ A, therefore for ∆ ≤ 1/7 we have ā ≤ ⅛A, and by Lemma 7:
\[
\Pr\left\{\sum_{i\in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta\right\} \le e^{-nE} \le e^{-nE_0}
\tag{103}
\]
where
\[
E = \frac{\bar a^2}{6A^2} = \frac{\Delta^2}{6(1+\Delta)^2} \ge \frac{\Delta^2}{6(1+1/7)^2} \ge \frac{\Delta^2}{8} \equiv E_0
\tag{104}
\]
TABLE III
PARAMETERS OF ADAPTIVE RATE SCHEME USED FOR FIGURE 3

Item                | Reference                | Parameter set 1 of figure 3                                                          | Parameter set 2
Transmission scheme | section V-C              | n = 1e+008, K = 1e+006, PA = 0.001, Pe = 0.001                                       | n = 1e+020, K = 1e+017, PA = 0.001, Pe = 0.001
RLB1 parameters     | section VI-C2, Eq.(65)   | T = 2.5e+005, ∆µ = 37.5412, ∆ = 0.0345958, η1 = 0.996007, η2 = 0.999962, ǫ1 = 0.01   | T = 7.5e+015, ∆µ = 77.4043, ∆ = 3.14616e−007, η1 = 1, η2 = 1, ǫ1 = 0.001
RLB2 parameters     | section VI-C2, Theorem 4 | ρ0 = 0.9, ǫ = 0.139438, R̄ = 1.05173                                                  | ρ0 = 0.99998, ǫ = 0.0068209, R̄ = 7.29818
and from Eq. (99) we have:
\[
\Pr\{J_\Delta\} \le \sum_{I\in\mathcal{P}} \Pr\left\{\sum_{i\in I}\left(\frac{\|\mathbf{x}_i\|^2}{\|\mathbf{x}\|^2} - \lambda_i\right) > \Delta\right\}
\le |\mathcal{P}|\cdot e^{-nE_0} \le 2^p e^{-nE_0}
\tag{105}
\]
which proves the lemma. Note that different bounds can be obtained by applying the bound to m smaller sets in {1, . . . , p} and requiring that the sum over each set be bounded by ∆/m (as an example, we could bound each max(zᵢ, 0) separately by ∆/p); however the bound above is most suitable for our purpose, since when p ≪ n the factor 2ᵖ becomes negligible.
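For illustration, the statement just proved can be probed numerically: for x ∼ N(0,1)ⁿ split into equal blocks, the excess Σᵢ max(‖xᵢ‖²/‖x‖² − λᵢ, 0) concentrates, and its tail probability stays below 2ᵖ e^{−n∆²/8}. In the following Python sketch (assuming numpy), n, p, ∆ and the equal-size partition are arbitrary illustrative choices:

import numpy as np

# Our Monte Carlo probe of Lemma 6: for x ~ N(0,1)^n split into p equal blocks A_i,
# Pr{ sum_i max(||x_i||^2/||x||^2 - lambda_i, 0) > Delta } <= 2^p * exp(-n*Delta^2/8)
# (for Delta <= 1/7).  Block energies ||x_i||^2 are chi-square(n_i), so we draw them directly.
rng = np.random.default_rng(3)
n, p, Delta, trials = 8000, 4, 0.1, 100000
lam = np.full(p, 1.0 / p)                                      # lambda_i = n_i / n
block_energy = rng.chisquare(df=n // p, size=(trials, p))      # ||x_i||^2
frac = block_energy / block_energy.sum(axis=1, keepdims=True)  # ||x_i||^2 / ||x||^2
excess = np.maximum(frac - lam, 0.0).sum(axis=1)
print("empirical:", np.mean(excess > Delta),
      "bound:", (2**p) * np.exp(-n * Delta**2 / 8))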
Proof of Lemma 7: We assume without loss of generality that x ∼ N(0,1)ⁿ. For a Gaussian r.v. X ∼ N(0,1) and a < ½ we have:
\[
\mathbb{E}\left(e^{aX^2}\right) = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\,e^{(a-\frac{1}{2})x^2}\,dx
= \frac{1}{\sqrt{1-2a}}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi(1-2a)^{-1}}}\,e^{-\frac{x^2}{2(1-2a)^{-1}}}\,dx
= \frac{1}{\sqrt{1-2a}}
\tag{106}
\]
For coefficients {aᵢ}ᵢ₌₁ᵖ with Σᵢ λᵢaᵢ = ā > 0 and |aᵢ| ≤ A, a positive constant w > 0 of our choice, and x ∼ N(0,1)ⁿ we have:
\[
\begin{aligned}
\ln\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right)
&\le \ln\mathbb{E}\, e^{-\frac{1}{2}w\sum_i a_i\|\mathbf{x}_i\|^2}
= \ln\mathbb{E}\, e^{-\frac{1}{2}w\sum_i a_i\sum_{j\in A_i}x_j^2}
= \ln\prod_i\prod_{j\in A_i}\mathbb{E}\, e^{-\frac{1}{2}w a_i x_j^2} \\
&= \sum_i\sum_{j\in A_i}\left(-\tfrac{1}{2}\right)\ln(1 + w a_i)
= -\tfrac{1}{2}n\sum_i\lambda_i\ln(1 + w a_i) \\
&\overset{(a)}{=} -\tfrac{1}{2}n\sum_i\lambda_i\left((w a_i) - \tfrac{1}{2}\,\frac{(w a_i)^2}{(1 + w t_i)^2}\right)
\overset{(b)}{\le} -\tfrac{1}{2}n\left(\sum_i\lambda_i(w a_i) - \tfrac{1}{2}\,\frac{(wA)^2}{(1 - wA)^2}\right) \\
&= -\tfrac{1}{2}n\left(\bar a w - \frac{A^2 w^2}{2(1 - wA)^2}\right)
\end{aligned}
\tag{107}
\]
where (a) is based on the second-order Taylor series of ln(1 + wt) around t = 0, with some tᵢ ∈ [0, aᵢ] ∪ [aᵢ, 0], and (b) holds since |tᵢ| ≤ |aᵢ| ≤ A. For simplicity we choose a sub-optimal w* = ā/A² (which is obtained by assuming small a, w and optimizing the bound with respect to w while ignoring the denominator) and obtain:
\[
\bar a w^* - \frac{A^2 {w^*}^2}{2(1 - w^*A)^2}
= \frac{\bar a^2}{A^2} - \frac{\bar a^2/A^2}{2(1 - \bar a/A)^2}
= \frac{\bar a^2}{A^2}\left(1 - \frac{A^2}{2(A - \bar a)^2}\right)
\tag{108}
\]
To simplify the bound, we make the further assumption that |ā| ≤ ⅛A, therefore:
\[
\frac{\bar a^2}{A^2}\left(1 - \frac{A^2}{2(A - \bar a)^2}\right)
\ge \frac{\bar a^2}{A^2}\left(1 - \frac{A^2}{2\cdot(7/8)^2\cdot A^2}\right)
= \frac{\bar a^2}{A^2}\cdot\frac{17}{49}
\ge \frac{\bar a^2}{3A^2}
\tag{109}
\]
Therefore $\ln\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right) \le -\tfrac{1}{2}n\cdot\frac{\bar a^2}{3A^2} = -n\frac{\bar a^2}{6A^2}$, and we can write the following bound: for |ā| ≤ ⅛A we have
\[
\Pr\left(\sum_i a_i\|\mathbf{x}_i\|^2 \le 0\right) \le e^{-nE}
\tag{110}
\]
where E = ā²/(6A²). Note that the bound is true for any x ∼ N(0, P)ⁿ.
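For illustration, both the moment identity (106) and the final bound (110) are easy to verify by simulation; in the following Python sketch (assuming numpy), all constants and the choice of coefficients are arbitrary illustrative choices satisfying the conditions of Lemma 7:

import numpy as np

# Two quick numerical checks (ours) related to the proof of Lemma 7.
rng = np.random.default_rng(4)

# (i) The moment identity (106): E[exp(a*X^2)] = 1/sqrt(1 - 2a) for X ~ N(0,1), a < 1/2.
a = 0.2
x = rng.standard_normal(2_000_000)
print(np.mean(np.exp(a * x**2)), 1 / np.sqrt(1 - 2 * a))

# (ii) The bound (110): Pr(sum_i a_i*||x_i||^2 <= 0) <= exp(-n*abar^2/(6*A^2)) for
# x ~ N(0,1)^n with p equal blocks and coefficients satisfying the lemma's conditions.
n, p, trials = 400, 4, 200000
coeff = np.array([1.2, -1.0, 0.9, -0.9])     # |a_i| <= A = 1.2, abar = 0.05 <= A/8
lam = np.full(p, 1.0 / p)
abar, A = float(lam @ coeff), float(np.max(np.abs(coeff)))
block_energy = rng.chisquare(df=n // p, size=(trials, p))
print("empirical:", np.mean(block_energy @ coeff <= 0),
      "bound:", np.exp(-n * abar**2 / (6 * A**2)))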
F. Parameters of adaptive rate scheme used for figure 3
Table III lists two sets of parameters for the continuous alphabet adaptive rate scheme. The first set was used for the curves in figure 3, and the second set shows the convergence of ǫ, R̄ for higher values of n, K. Note that the values of n, K are extremely high; this is due to the looseness of the bounds used in the continuous case, specifically the exponent of Lemma 6, which yields a relatively slow convergence of the ill-convexity probability in Eq. (56).
REFERENCES
[1] O. Shayevitz and M. Feder, "Communicating using Feedback over a Binary Channel with Arbitrary Noise Sequence," International Symposium on Information Theory (ISIT), Adelaide, Australia, September 2005.
[2] O. Shayevitz and M. Feder, "Achieving the Empirical Capacity Using Feedback Part I: Memoryless Additive Models," Dept. of Electrical Engineering Systems, Tel Aviv University, Tel Aviv 69978, Israel.
[3] K. Eswaran, A. D. Sarwate, A. Sahai, and M. Gastpar, "Limited feedback achieves the empirical capacity," Department of Electrical Engineering and Computer Sciences, University of California, arXiv:0711.0237v1 [cs.IT], 2 Nov 2007.
[4] K. Eswaran, A. D. Sarwate, A. Sahai, and M. Gastpar, "Using zero-rate feedback on binary additive channels with individual noise sequences," Proceedings of the 2007 International Symposium on Information Theory (ISIT 2007), Nice, France, June 2007.
[5] V. D. Goppa, "Nonprobabilistic mutual information without memory," Probl. Contr. Inform. Theory, vol. 4, pp. 97-102, 1975.
[6] A. Lapidoth and P. Narayan, "Reliable communication under channel uncertainty," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2148-2177, Oct 1998.
[7] I. Csiszár and P. Narayan, "The Capacity of the Arbitrarily Varying Channel Revisited: Positivity, Constraints," IEEE Transactions on Information Theory, vol. 34, no. 2, March 1988.
[8] M. Agarwal, A. Sahai, and S. Mitter, "Coding into a source: a direct inverse Rate-Distortion theorem," arXiv:cs/0610142v1 [cs.IT]. Originally presented at Allerton 2006.
[9] O. Shayevitz and M. Feder, "The posterior matching feedback scheme: Capacity achieving and error analysis," 2008 IEEE International Symposium on Information Theory (ISIT 2008), pp. 900-904, July 2008.
[10] I. Csiszár, "The method of types," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2505-2523, Oct 1998.
[11] A. Tchamkerten and I. E. Telatar, "Variable Length Coding Over an Unknown Channel," IEEE Transactions on Information Theory, vol. 52, no. 5, May 2006.
[12] M. V. Burnashev, "Data transmission over a discrete channel with feedback: Random transmission time," Probl. Inf. Transm., vol. 12, no. 4, pp. 250-265, 1976.
[13] N. Shulman and M. Feder, "The uniform distribution as a universal prior," IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1356-1362, June 2004.
[14] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley-Interscience, second edition, 2006.
[15] R. Zamir and U. Erez, "A Gaussian input is not too bad," IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1362-1367, June 2004.
[16] B. Hassibi and B. M. Hochwald, "How much training is needed in multiple-antenna wireless links?," IEEE Transactions on Information Theory, vol. 49, no. 4, pp. 951-963, April 2003.
[17] A. Lapidoth, "Nearest neighbor decoding for additive non-Gaussian noise channels," IEEE Transactions on Information Theory, vol. 42, no. 5, pp. 1520-1529, Sep 1996.
[18] B. Hughes and P. Narayan, "Gaussian arbitrarily varying channels," IEEE Transactions on Information Theory, vol. 33, no. 2, pp. 267-284, Mar 1987.
[19] N. Shulman, "Communication over an Unknown Channel via Common Broadcasting," Ph.D. dissertation, Tel Aviv University, 2003.
[20] O. Shayevitz and M. Feder, "Communication with Feedback via Posterior Matching," 2007 IEEE International Symposium on Information Theory (ISIT 2007), pp. 391-395, June 2007.
[21] M. Horstein, "Sequential transmission using noiseless feedback," IEEE Trans. Inform. Theory, pp. 136-143, July 1963.
[22] J. P. M. Schalkwijk, "A coding scheme for additive noise channels with feedback part II: Band-limited signals," IEEE Trans. Inform. Theory, vol. IT-12, pp. 183-189, 1966.
[23] D. Blackwell, L. Breiman, and A. J. Thomasian, "The capacities of certain channel classes under random coding," Ann. Math. Statist., vol. 31, pp. 558-567, 1960.
[24] A. Lapidoth and I. E. Telatar, "The compound channel capacity of a class of finite-state channels," IEEE Transactions on Information Theory, vol. 44, no. 3, pp. 973-983, May 1998.
[25] A. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743-2760, Oct 1998.
[26] I. Csiszár and P. Narayan, "Channel capacity for a given decoding metric," IEEE Transactions on Information Theory, vol. 41, no. 1, pp. 35-43, Jan 1995.
[27] E. W. Weisstein, "Ball," From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/Ball.html
[28] E. W. Weisstein, "Gamma Function," From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/GammaFunction.html