
Information and Coding Theory Autumn 2017

Lecture 11: November 1, 2017


Lecturer: Madhur Tulsiani

1 Matrix Scaling

We will consider an application of I-projections to a problem known as matrix scaling. Say we are given two nonnegative matrices $M, N \in \mathbb{R}_+^{n \times n}$ such that for all $i, j$, $M_{ij} = 0 \Leftrightarrow N_{ij} = 0$. The goal is to multiply (scale) each row $i$ of $M$ by a number $r_i$ and each column $j$ by $c_j$, such that the resulting matrix $M'$ has the same row and column sums as the target matrix $N$. Another way of stating this is that we want to find diagonal matrices $R$ and $C$ such that for $M' = RMC$, we have
$$\sum_j M'_{ij} = \sum_j N_{ij} \quad \forall i \in [n] \qquad \text{and} \qquad \sum_i M'_{ij} = \sum_i N_{ij} \quad \forall j \in [n].$$

We will show a special case when the goal is to scale $M$ so that the resulting matrix $M'$ is doubly stochastic, i.e.,
$$\sum_j M'_{ij} = \sum_i M'_{ij} = 1 \quad \forall i, j \in [n].$$

First, note that by a global scaling of $1/\sum_{i,j} M_{ij}$, we can assume that $\sum_{i,j} M_{ij} = 1$ and the goal is instead to scale it to have row and column sums equal to $1/n$. We assume that $M_{ij} > 0$ for all $i, j$ (we can think of the target matrix $N$ with $N_{ij} = 1/n^2$ for all $i, j$). Since the entries of $M$ are positive and sum to 1, we can think of it as a probability distribution $Q$ with $\mathrm{Supp}(Q) = [n] \times [n]$ (where $q(i, j) = M_{ij}$). We consider the linear family of distributions on $[n] \times [n]$ (written as matrices) with the required row and column sums:
$$\mathcal{L} := \left\{ P ~\middle|~ \sum_j p(i, j) = \sum_i p(i, j) = \frac{1}{n} \quad \forall i, j \in [n] \right\}.$$

Note that the above is a linear family as defined in the previous lecture, since we can consider functions $f_1, \ldots, f_n$ and $g_1, \ldots, g_n$ defined as
$$f_{i_0}(i, j) = \begin{cases} 1 & \text{if } i = i_0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad g_{j_0}(i, j) = \begin{cases} 1 & \text{if } j = j_0 \\ 0 & \text{otherwise} \end{cases}.$$

Then, the above family can be written in terms of the expectations of the functions $f_i$ and $g_j$ for all $i, j \in [n]$. Moreover, we know from the previous lecture that the I-projection $P^*$ of $Q$ onto $\mathcal{L}$ is of the form
$$p^*(i, j) = c_0 \cdot q(i, j) \cdot \exp\left( \sum_{i_0} \lambda_{i_0} f_{i_0}(i, j) + \sum_{j_0} \mu_{j_0} g_{j_0}(i, j) \right) = c_0 \cdot q(i, j) \cdot \exp\left( \lambda_i + \mu_j \right).$$

Thus, scaling each row $i$ of $M$ by $c_0 \exp(\lambda_i)$ and each column $j$ by $\exp(\mu_j)$ ensures that all row and column sums are $1/n$. Combining this with the global scaling, we can also get all the row and column sums to be 1 (i.e., make the matrix doubly stochastic).
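In practice, such scaling factors can be found by simply alternating between normalizing the rows and normalizing the columns, a procedure known as the Sinkhorn-Knopp iteration; each step multiplies by a diagonal matrix, matching the $RMC$ form above. Below is a minimal sketch (our illustration, not from the lecture) in Python; the function name, iteration cap, and tolerance are our own choices.

```python
import numpy as np

def sinkhorn(M, iters=1000, tol=1e-9):
    """Alternately normalize rows and columns of a positive matrix so the
    result is (approximately) doubly stochastic."""
    A = np.array(M, dtype=float)
    for _ in range(iters):
        A = A / A.sum(axis=1, keepdims=True)  # scale each row to sum to 1
        A = A / A.sum(axis=0, keepdims=True)  # scale each column to sum to 1
        if np.abs(A.sum(axis=1) - 1).max() < tol:  # rows still ~1 after column step?
            break
    return A

A = sinkhorn([[1.0, 2.0], [3.0, 4.0]])
print(A.sum(axis=0))  # ~[1. 1.]
print(A.sum(axis=1))  # ~[1. 1.]
```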

Exercise 1.1 Where did we use the fact that $M_{ij} > 0$ for all $i, j \in [n]$?

Exercise 1.2 Use the above techniques to solve the matrix scaling problem for an arbitrary target matrix $N$ (assuming $M_{ij} = 0 \Leftrightarrow N_{ij} = 0$).

Matrix scaling and its generalization, known as operator scaling, have found a variety of applications in combinatorial optimization, complexity theory, and analysis. Please take a look at the recent tutorial by Wigderson [Wig17] for an introduction to many of these connections.

2 Error-Correcting Codes

Over the next few lectures, we will switch to the coding theory part of the course and
see how to construct (and work with) error correcting codes. We first review some basic
notions about finite fields that we will need.

2.1 Some relevant facts about finite fields

We will use the fact that the set $\mathbb{F}_p = \{0, \ldots, p-1\}$ is a field for prime $p$, with all arithmetic operations defined modulo $p$. There also exist finite fields of size $p^r$ for any prime $p$ and positive integer $r$, but we will mostly restrict our discussion to prime fields. We recall some basic facts:

- $\mathbb{F}_p^n$ is a vector space over the field $\mathbb{F}_p$. Note this is not an inner product space.

- Fermat's little theorem: $a^p \equiv a \pmod{p}$, for all $a \in \{0, \ldots, p-1\}$.

- The set of all polynomials in a single variable $x$, over $\mathbb{F}_p$, is defined as
$$\mathbb{F}_p[x] := \left\{ c_0 + c_1 x + \cdots + c_{p-1} x^{p-1} ~\middle|~ c_0, \ldots, c_{p-1} \in \mathbb{F}_p \right\}.$$
Note that the degree (highest power of $x$ with a non-zero coefficient) is never more than $p - 1$ (why?).

- A polynomial of degree $d$, which is not identically zero, has at most $d$ roots.

- Lagrange interpolation: Given distinct $a_1, \ldots, a_{d+1} \in \mathbb{F}_p$ and any $b_1, \ldots, b_{d+1} \in \mathbb{F}_p$, the unique polynomial $Q$ with degree at most $d$, satisfying $Q(a_i) = b_i$ for all $i \in [d+1]$, is given by
$$Q(x) = \sum_{i=1}^{d+1} b_i \cdot \prod_{j \neq i} \frac{x - a_j}{a_i - a_j}.$$
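As an aside, Lagrange interpolation is straightforward to implement over $\mathbb{F}_p$. The sketch below (our illustration, not part of the notes) computes the coefficients of $Q$, using Fermat's little theorem ($a^{p-2} = a^{-1}$ for $a \neq 0$) for the divisions.

```python
def interpolate(points, p):
    """Lagrange interpolation over F_p: given d+1 pairs (a_i, b_i) with
    distinct a_i, return coefficients [c_0, ..., c_d] (lowest degree first)
    of the unique polynomial Q of degree <= d with Q(a_i) = b_i (mod p)."""
    n = len(points)
    coeffs = [0] * n
    for i, (ai, bi) in enumerate(points):
        basis = [1]   # running product prod_{j != i} (x - a_j), low degree first
        denom = 1     # running product prod_{j != i} (a_i - a_j)
        for j, (aj, _) in enumerate(points):
            if j == i:
                continue
            new = [0] * (len(basis) + 1)
            for m, c in enumerate(basis):
                new[m] = (new[m] - aj * c) % p      # contribution of -a_j * c
                new[m + 1] = (new[m + 1] + c) % p   # contribution of x * c
            basis = new
            denom = (denom * (ai - aj)) % p
        scale = (bi * pow(denom, p - 2, p)) % p     # b_i / denom via Fermat inverse
        for m, c in enumerate(basis):
            coeffs[m] = (coeffs[m] + scale * c) % p
    return coeffs

# Recover Q(x) = x^2 + 1 over F_7 from three of its evaluations:
print(interpolate([(0, 1), (1, 2), (2, 5)], 7))  # [1, 0, 1]
```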

2.2 The Shannon and Hamming models

Suppose Alice wants to send Bob a message over a noisy channel, where some of the symbols in Alice's message may become corrupted. Let's assume that Alice's message is a sequence of elements of $\mathbb{F}_p$ for some prime $p$. If the original message Alice wants to send is $m = m_1, \ldots, m_k$, she will encode it with some encoding function $\mathrm{Enc} : \mathbb{F}_p^k \to \mathbb{F}_p^n$, where the goal of the encoding is to introduce some redundancy (to handle errors in transmission). Say $\mathrm{Enc}(m) = x$. Alice will transmit $x$ and Bob will receive a corrupted version $y$. He will then apply some decoding function $\mathrm{Dec} : \mathbb{F}_p^n \to \mathbb{F}_p^k$ to try and figure out the message. The goal is to define encoding and decoding functions such that $\mathrm{Dec}(y) = m$ even when $y$ is a corrupted version of $x$. Of course, it is impossible to handle arbitrary corruptions. We will consider two models of how the corruption might occur.

Shannon model. Let $x = x_1, x_2, \ldots, x_n$ be a string where each $x_i \in \mathbb{F}_p$. In the Shannon model, Alice sends $x$ over the communication channel, and each $x_i$ is corrupted independently at random (though not necessarily with identical distributions for each symbol).

Example 2.1 (Binary Symmetric Channel) Suppose the message is a tuple from $\mathbb{F}_2^n$. Then, for some $\varepsilon \in [0, \frac{1}{2}]$, each bit is flipped independently at random with probability $\varepsilon$, and is transmitted without error with probability $1 - \varepsilon$.
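For intuition, the binary symmetric channel is easy to simulate. The following sketch is ours (the choice $\varepsilon = 0.25$ is arbitrary); it flips each bit independently with probability $\varepsilon$.

```python
import random

def bsc(bits, eps):
    """Binary symmetric channel: flip each bit independently with probability eps."""
    return [b ^ (random.random() < eps) for b in bits]

random.seed(0)
print(bsc([0, 1, 1, 0, 1, 0, 0, 1], eps=0.25))
```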

We will work with a different model, which is more relevant for some of the applications
in theoretical computer science.

Hamming model. Suppose Alice sends $n$ symbols $x_1, x_2, \ldots, x_n$ from $\mathbb{F}_p$. In the Hamming model, up to $t$ of these symbols are corrupted (but not lost) in some arbitrary manner. There is no way to know which of the symbols have been corrupted, and which symbols were transmitted correctly. We wish to design encoding and decoding schemes to recover from up to $t$ errors for some fixed $t$.

2.3 Codes and distances

Definition 2.2 A code is a subset $C \subseteq \mathbb{F}_p^n$. If $|C| = p^k$, we think of the code as encoding messages in $\mathbb{F}_p^k$. The quantity $k$ is called the message length and $n$ is called the block length of $C$. We call $R(C) = \frac{k}{n}$ the rate of $C$.

Remark 2.3 Often the code $C$ is implicitly associated with the encoding function $\mathrm{Enc} : \mathbb{F}_p^k \to C$. For $x \in \mathbb{F}_p^k$, we may refer to $\mathrm{Enc}(x)$ simply as $C(x)$. The use of $C$ as a set or a function will be clear from the context.

Our goal is to introduce some redundancy into our message such that we can uniquely recover the correct $k$ symbols even when up to $t$ of the $n$ transmitted symbols may be corrupted. Consider the following trivial code.

Example 2.4 Define $C : \mathbb{F}_2^4 \to \mathbb{F}_2^{12}$ as
$$C(x_1 x_2 x_3 x_4) = (x_1 x_1 x_1 ~ x_2 x_2 x_2 ~ x_3 x_3 x_3 ~ x_4 x_4 x_4).$$

It is easy to see that the above code can handle one error. When Bob receives the string, he simply takes the majority of each block of three symbols and thus recovers the original message. As we shall see, no more than one error can be corrected using this code. Also, the rate of this code is 1/3.
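A minimal sketch of this repetition code and its majority-vote decoder (our own illustration; the function names are ours):

```python
def encode(msg):
    """Encode a 4-bit message with the 3-fold repetition code of Example 2.4."""
    return [b for b in msg for _ in range(3)]

def decode(received):
    """Majority vote within each block of three received symbols."""
    return [int(sum(received[i:i + 3]) >= 2) for i in range(0, len(received), 3)]

x = encode([1, 0, 1, 1])   # 12-bit codeword
x[4] ^= 1                  # corrupt one position
print(decode(x))           # [1, 0, 1, 1]: the single error is corrected
```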

In general, we'd like to keep this rate close to 1, as this would imply we introduce less redundancy into the encoded message. To better understand how many errors can be corrected using a particular code, we define the distance of a code.

Definition 2.5 Let $C \subseteq \mathbb{F}_p^n$ be a code. Let $\Delta(x, y)$ denote the Hamming distance of two strings $x, y$. We define the distance of a code, $\Delta(C)$, as
$$\Delta(C) := \min_{\substack{x, y \in C \\ x \neq y}} \Delta(x, y).$$

Remark 2.6 The Hamming distance is a metric on strings of a fixed length, and in particular satisfies the triangle inequality.
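For small codes, $\Delta(C)$ can be computed directly from the definition by brute force. The following sketch (ours, not from the notes) confirms that the repetition code of Example 2.4 has distance 3.

```python
from itertools import combinations, product

def hamming(x, y):
    """Number of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def distance(C):
    """Delta(C): minimum Hamming distance over distinct pairs of codewords."""
    return min(hamming(x, y) for x, y in combinations(C, 2))

# All 16 codewords of the 3-fold repetition code on 4-bit messages:
C = [tuple(b for b in msg for _ in range(3)) for msg in product([0, 1], repeat=4)]
print(distance(C))  # 3
```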

The distance can be used to understand the number of errors one can correct, as follows.

Proposition 2.7 A code $C : \mathbb{F}_p^k \to \mathbb{F}_p^n$ can correct $t$ errors if and only if $\Delta(C) \geq 2t + 1$.

Note that for the code in Example 2.4, $\Delta(C) = 3$. This implies that the code from Example 2.4 can handle no more than one error. We now prove the bound:

Proof: First, we prove the reverse direction. If $\Delta(C) \geq 2t + 1$, then $\frac{\Delta(C)}{2} > t$. For each codeword $x \in C$, let $H(x, r) = \{ y \in \mathbb{F}_p^n : \Delta(x, y) < r \}$. In particular, consider the case when we let $r = \frac{\Delta(C)}{2}$. Notice that for two distinct $x, x' \in C$, $H(x, r) \cap H(x', r) = \emptyset$, since otherwise the triangle inequality would give $\Delta(x, x') < \Delta(C)$. Now, suppose a codeword is corrupted in $t$ positions to the string $y$. Since $t < \frac{\Delta(C)}{2}$, we have that the corrupted message $y$ is contained in precisely one set $H(x, r)$. Thus, we can simply find the unique $x$ such that $y \in H(x, r)$.

Now, we prove the forward direction. Suppose the codeword $x \in C$ is corrupted in $t$ positions to $y$, i.e., $\Delta(x, y) = t$. Since we can correct up to $t$ errors, decoding $y$ to $w = \mathrm{argmin}_{z \in C} \Delta(y, z)$ must return $w = x$, so we need
$$\Delta(z, y) > \Delta(x, y) = t \quad \forall z \in C, ~ z \neq x.$$
However, if $x, z$ are the closest pair of codewords in $C$, we can always find $y$ such that $\Delta(x, y) = t$ and $\Delta(z, y) = \Delta(C) - t$. Combining this with the above, we get $\Delta(C) - t > t$, which implies $\Delta(C) \geq 2t + 1$.
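The reverse direction of this proof is exactly nearest-codeword decoding. The brute-force sketch below (our illustration) decodes the repetition code of Example 2.4, correcting $t = 1$ error since $\Delta(C) = 3$.

```python
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def nearest_codeword(y, C):
    """Decode y to a codeword at minimum Hamming distance; the answer is
    unique (and correct) whenever at most (Delta(C) - 1) // 2 errors occurred."""
    return min(C, key=lambda c: hamming(c, y))

# 3-fold repetition code from Example 2.4: Delta(C) = 3, so t = 1.
C = [tuple(b for b in msg for _ in range(3)) for msg in product([0, 1], repeat=4)]
y = list(C[5])
y[0] ^= 1                                 # introduce a single error
print(nearest_codeword(y, C) == C[5])     # True: the error is corrected
```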

References

[Wig17] Avi Wigderson. Operator scaling: theory, applications and connections, 2017. Tutorial given at CCC 2017. URL: http://computationalcomplexity.org/Archive/2017/tutorial.php.
