Mathematics of Sparsity and Entropy: Axioms, Core Functions and Sparse Recovery

Download as pdf or txt
Download as pdf or txt
You are on page 1of 34

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/271218681

Mathematics of Sparsity and Entropy: Axioms, Core Functions and Sparse


Recovery

Article · January 2015


Source: arXiv

CITATIONS READS
9 505

4 authors, including:

Inma Mora Jiménez Riku Jantti


King Juan Carlos University Aalto University
89 PUBLICATIONS   728 CITATIONS    267 PUBLICATIONS   3,232 CITATIONS   

SEE PROFILE SEE PROFILE

Antonio Caamaño
King Juan Carlos University
78 PUBLICATIONS   897 CITATIONS   

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

2013-PIT-014 EWBS-WSN View project

QUASAR View project

All content following this page was uploaded by Antonio Caamaño on 27 January 2015.

The user has requested enhancement of the downloaded file.


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 1

Mathematics of Sparsity and Entropy:


Axioms, Core Functions and Sparse Recovery
Giancarlo Pastor, Inmaculada Mora-Jiménez, Riku Jäntti, and Antonio J. Caamaño
arXiv:1501.05126v2 [cs.IT] 22 Jan 2015

Abstract

Sparsity and entropy are pillar notions of modern theories in signal processing and information theory. However,
there is no clear consensus among scientists on the characterization of these notions. Previous efforts have contributed
to understand individually sparsity or entropy from specific research interests. This paper proposes a mathematical
formalism, a joint axiomatic characterization, which contributes to comprehend (the beauty of) sparsity and entropy.
The paper gathers and introduces inherent and first principles criteria as axioms and attributes that jointly characterize
sparsity and entropy. The proposed set of axioms is constructive and allows to derive simple or core functions
and further generalizations. Core sparsity generalizes the Hoyer measure, Gini index and pq-means. Core entropy
generalizes the Rényi entropy and Tsallis entropy, both of which generalize Shannon entropy. Finally, core functions
are successfully applied to compressed sensing and to minimum entropy given sample moments. More importantly,
the (simplest) core sparsity adds theoretical support to the `1 -minimization approach in compressed sensing.

Index Terms

Sparsity, entropy, axioms, compressive sampling, compressed sensing, Rényi entropy, Tsallis entropy.

This paper was presented in part in the Proceedings of the Tenth International Symposium in Wireless Communication Systems (ISWCS
2013).
G.P. (corresponding author, [email protected]) and R.J. are with the Department of Communications and Networking, Aalto University,
Finland.
G.P., I.M.J. and A.J.C. are with the Department of Information and Communication Technologies, King Juan Carlos University, Spain.
This work was supported by the Spanish Ministry of Economy and Competitiveness (TEC2013-4839-C4-1-R) and the Finnish Cultural
Foundation.

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 2

I. I NTRODUCTION

For a given signal x, the uncertainty (randomness) of its elements defines the compressibility (compactness)
of its coefficients w in a given domain. This dependence, between the (elements’) uncertainty and (coefficients’)
compressibility of a signal, suggests a connection between the two families of functions that measure these properties.
But, what axioms do these families have in common? This paper adopts the following definitions with the aim to
respond to this question. Let p be the probability mass function (pmf) of signal x.
Definition 1.1 (Compressibility or sparsity): The property of to concentrate most of the energy in few coefficients
of w.
Definition 1.2 (Uncertainty or entropy): The property of not to concentrate most of the probability mass in few
atoms of p.
Sparsity and entropy functions quantify properties 1.1 and 1.2, respectively. One approach to evaluate the goodness
of these functions is through validation of criteria (see [30] [33] for sparsity, [20] [28] for entropy, and [41] for
both). Following this approach, the paper gathers and introduces inherent and first principles criteria as axioms
and attributes that jointly characterize sparsity and entropy functions. As expected, it turns out that criteria of both
families of functions are strictly complementary.
In [16] a set of conditions are identified under which variance and entropy order distributions in a similar way.
This suggests that entropy could be extracted from the “shape” of distributions. Intuitively, for a given variance,
random processes with similar sorted pmf’s should present similar levels of entropy. Hence, for a given signal with
elements following a univariate distribution, the uncertainty of the undergoing process equals the non-compressibility
of the pmf. The following toy examples illustrate this relation. Let r.v. X ∈ {x1 , x2 } with pmf p = (p1 , p2 ).
1) Assume p1 > p2 : if p1 increases (thus p2 decreases since p1 , p2 ≥ 0 and kpk1 = 1), then x1 is even more
certain to appear, i.e. if the compressibility of p increases then the uncertainty of X decreases,

sparsity(p) ↑ ≡ entropy(x) ↓ . (1)

2) Assume p = (p1 , p2 ) = (1, 0) (a constant r.v.): if p2 increases (thus p1 decreases), then x1 is not the unique
possible outcome, i.e. if the compressibility of p decreases then the uncertainty of X increases,

sparsity(p) ↓ ≡ entropy(x) ↑ . (2)

Relation to Previous Work

Entropy functions (related to uncertainty, information, fuzziness or complexity measures) and sparsity functions
(related to compressibility, fairness or dispersion measures) are not regarded as easily characterizable. Hence these
functions have been thoroughly studied in numerous articles from different perspectives and research areas. The
following present relevant references which contribute explicitly to the set of axioms and attributes of sparsity and
entropy functions of Section II.
The main method for axiomatic characterization of entropy functions consist on to treat inherent or satisfactory
properties as axioms for (reference) Shannon entropy function. This started in 1948 with [5] which established

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 3

continuity and monotonicity of entropy functions; [13] added concavity for fuzziness functions and [18] relaxed it to
quasi-concavity for dispersion functions; [15] added concentration for information functions; [20] added maximality
for entropy functions; and [28] added symmetry for information functions.
On the other hand, although sparse data and sparsity has attracted a lot of attention in recent years, concentration,
scaling, homogeneous growth and replication were originally applied in 1920 by [2] in a financial setting to measure
the inequity of wealth distribution; [18] added bounds for dispersion functions; [30] [33] added symmetry and
continuity for sparsity and fairness functions, respectively; and [38] added quasi-convexity for reward-risk ratio
functions.

Contribution of the Paper

For each property, previous works have separately established some axioms and relationships, and derived
generalizations. The contribution of the paper is threefold.
1) The main contribution of this paper is a refined and constructive set of axioms (and attributes) of sparsity
and entropy functions (Section II), which allows to derive function generalizations.
2) The derived sparsity functions explain the benefits of proposed sparsity functions in Compressed Sensing and,
in particular, the effectiveness of the `1 -minimization approach (Section III).
3) The derived entropy functions generalize Rényi and Tsallis entropy, and offer simple formulations to the
sparse recovery of probability measures (Section IV).
The rest of the paper is organized as follows: For sparsity and entropy functions, Section II presents the axioms;
Sections III-IV describe existing functions and derive generalizations; Section V concludes; and the Appendix
contains the proofs.

II. A XIOMS

This section gathers and introduces inherent and first principles criteria as axioms and attributes that jointly
characterize sparsity and entropy functions.
Let X be an independent and identically distributed (i.i.d.) discrete random variable (r.v.), defined on a countable
sample space Ω = {x1 , . . . , xb } ⊆ R (b could be infinity), and let x = (xi ) ∈ Ωn be a vector containing n
realizations of r.v. X. The definition of vector x allows two possible interpretations. It is an arbitrary signal as a
whole and a random process formed by its elements. Throughout this paper both interpretations are adopted, and
different operations are applied on vector x according to the analysis performed. The analysis of sparsity will be from
the signal’s compressibility perspective. Analogously, the analysis of entropy will be from the process’s uncertainty
perspective. The analogies between these perspectives are summarized in Table I. The following operations will
provide adequate representations for the analyzes.
For the compressibility assessment of signal x, a domain transformation is applied using basis Φ ∈ Rn×n ,

Φ : Rn → Rn+ : x 7→ w = Φx, (3)

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 4

Table I
A NALOGIES BETWEEN SPARSITY FUNCTIONS AND ENTROPY FUNCTIONS .

Description Sparsity Entropy


Property compressibility uncertainty
Arguments coefficients probability mass function
Arguments’ unit energy (probability) mass
High value if few dominant coefficients unlikely events
Low value if few negligible coefficients likely events

where w = (wi ) is the (vector of) coefficients of x. For ease of notation, since compressibility is measured from
magnitudes, the energy values of coefficients w are assumed non-negative, i.e. w ∈ Rn+ . Essentially, redefine
coefficients (in terms of true coefficients) as wi = |wi |. Under this model, a sparsity function, which measures the
compactness of signal x under basis Φ, is

s : Rn+ → R : w = Φx 7→ s(w). (4)

For the uncertainty calculation of process x, a b-bins-based `1 -normalized histogram method is applied using
operator Ψ,

Ψ : Rn → [0, 1]b : x 7→ p = Ψx, (5)

where p = (pj ) is the sample probability mass function (pmf) of x, with kpk1 = 1 and
cardinality({i : xi ∈ xj })
pj = ≈ P(X = xj ). (6)
n
Similarly, define Sb+ = {p ∈ [0, 1]b : kpk1 = 1} (probability simplex). Then, for a given variance σ 2 , an entropy
function, which measures the randomness of process x from its outcomes’ occurrences using method Ψ, is

hb : Sb+ → R : p = Ψx 7→ hb (p). (7)

Under this notation (see Table II for a summary), the following enumerates a collection of axioms and attributes
of sparsity and entropy functions. Axioms and attributes describe the effect of different actions on the argument
of functions, e.g. changes on its pattern, ratios or relative differences (see Table III for mathematical description).
In the case of entropy analysis, and in order to stay inside the probability simplex, the actions related to regularity
II-B5 and homogeneous growth II-B6 require a subsequent normalization kp̃k1 = 1, where p̃ denotes p after the
action. The histogram model described above makes this normalization step transparent.

A. Axioms

1) Continuity (II-A1):
S. (Sparsity) Slight energy perturbation retains the ratio of dominant and negligible coefficients. [33]
H. (Entropy) Slight probability mass perturbation retains the ratio of unlikely and likely events. [5]

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 5

Table II
S UMMARY OF NOTATION AND FUNCTIONS .

Symbol Description
s(.) sparsity function
hb (.) entropy function given b events (states)
Ω countable sample space, Ω = {x1 , . . . , xb } ⊆ R
P probability measure defined on Ω
X random variable, X : ω 7→ X(ω) = ω
x = (xi ) vector with elements xi ∈ Ω (realizations of X)
n length of vector x
Φ domain transformation, Φ ∈ Rn×n
w = (wi ) vector with coefficients wi of x, w = Φx
Ψ b-bins histogram operator, Ψ
p = (pj ) sample probability mass function of x, p = Ψx
b number of events (states) or bins, length of vector p
xj j-th event in the histogram method
v = (vk ) arbitrary non-null vector (v 6= 0) with elements vk
k.k0 `0 -“norm”, kvk0 = cardinality(supp(v))
1
`p -norm, kvkp = ( k |vk |p ) p , p ∈ (0, ∞)
P
k.kp
k.k∞ `∞ -norm, kvk∞ = maxk |vk |
P
k.k1∗ `1∗ weighted norm, kvk1∗ = k k|vk |, vi ≥ vj , i > j
.k. concatenation operator of vectors
h., .i inner product of vectors
π(.) permutation in the order of elements
Sn
+ n-simplex, Sn b
+ = {v ∈ [0, 1] : kvk1 = n}
0k , 1k vector of all zeros and ones of length k, resp.

2) Symmetry (II-A2):
S. Energy permutation retains the ratio of dominant and negligible coefficients. [18]
H. Probability mass permutation retains the ratio of unlikely and likely events. [28]
3) Concentration (II-A3) (Dalton’s 1st Law):
S. Moving energy from negligible to dominant coefficients shorten the set of dominant coefficients. [2]
H. Moving probability mass from unlikely to likely events shorten the set of likely events. [15]
4) Scaling (II-A4) (Dalton’s modified 2nd Law):

S. Scaled coefficients contain proportional amounts of energy in proportional number of coefficients. [2]
H. Scaled occurrences contain proportional amounts of probability mass in proportional number of events (or
by the kpk1 = 1 normalization).
5) Replication (II-A5) (Dalton’s 4th Law):

S. Concatenating replicas of (all) coefficients retains the ratio of dominant and negligible coefficients. [2]
H. Adding realizations following the same law retains the ratio of unlikely and likely events. This axiom

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 6

affirms that entropy is intrinsic to a given process (see Subsection IV-B for details).

B. Attributes

1) Bounds (II-B1):
S. The least (most) compressible coefficients allocate (all) its energy on all coefficients (on one single
coefficient), e.g. a white noise (a single “tone”). [18]
H. The most (least) uncertain distribution allocates (all) its probability mass on all events (on one single
event), e.g. a uniform r.v. (a constant r.v.). [5]
Lemma 2.1: II-A3 and II-A4 imply II-B1.
2) Quasi-convexity (II-B2):
S. Quasi-convexity prefers extremes than averages. [38]
H. Quasi-concavity encourages diversification. [13]
Lemma 2.2: II-A1, II-A2, II-A3 and II-A4 imply II-B2.
3) Monotonicity (II-B3): A simple version of II-A3.
S. For two signals with non-zero pair of coefficients, the signal with the largest ratio highest-energy by
lowest-energy is more compressible. [33]
H. For two processes with non-impossible pair of events, the process with the largest ratio highest-mass by
lowest-mass is less uncertain.
Lemma 2.3: II-A3 implies II-B3.
4) Completeness (II-B4):
S. Adding zero-energy coefficients concentrates the energy in relatively less coefficients. [30] (Babies)
H. (Conditioning on) having impossible (zero-mass) events concentrates the mass in relatively less events.
Lemma 2.4: [30] II-A3, II-A4 and II-A5 imply II-B4.
5) Regularity (II-B5):
S. Significant concentration of energy in one single coefficient makes the rest negligible. [30] (Bill Gates)
H. Significant concentration of probability mass in one single event makes the rest unlikely.
Lemma 2.5: [30] II-A3 and II-A4 imply II-B5.
6) Homogeneous growth (II-B6) (Dalton’s 3rd Law):
S. Increasing relatively more the energy of negligible coefficients makes them relatively more dominant. [2]
H. Increasing relatively more the probability mass of unlikely events makes them relatively more likely.
Lemma 2.6: II-A4, II-B1 and II-B2 imply II-B6.
7) Schur-convexity (II-B7):
S. Schur-convexity holds.
H. Schur-concavity holds.
Lemma 2.7: II-A2 and II-B2 imply II-B7.

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 7

Table III
A XIOMS II-A1-II-A5 AND ATTRIBUTES II-B1-II-B8 OF SPARSITY FUNCTIONS AND ENTROPY FUNCTIONS , WITH α, β > 0.

Name Sparsity or Compressibility Entropy or Uncertainty


II-A1 Continuity w̃ → w, s(w̃) → s(w) p̃ → p, hb (p̃) → hb (p)
II-A2 Symmetry w̃ = π(w), s(w̃) = s(w) p̃ = π(p), hb (p̃) = hb (p)
 
w̃ = (w̃k ), w̃i = wi + α, w̃j = wj − α, wi ≥ wj
 p̃ = (p̃k ), p̃i = pi + α, p̃j = pj − α, pi ≥ pj

II-A3 Concentration
w̃k = wk , k 6= {i, j}, s(w̃) > s(w)
 p̃k = pk , k 6= {i, j}, hb (p̃) < hb (p)

II-A4 Scaling w̃ = αw, s(w̃) = s(w) p̃ = αp, hb (p̃) = hb (p)


II-A5 Replication w̃ = wk . . . kw, s(w̃) = s(w) p̃ = pk . . . kp, hb (p̃) = hb (p)
II-B1 Bounds w̃ = α1n , ŵ = 0n−1 kα, s(w̃) ≤ s(ŵ) p̃ = α1b , p̂ = 0b−1 kα, hb (p̃) ≥ hb (p̂)
II-B2 Quasi-convexity α < 1, s(αw + (1 − α)w̃) < max{s(w), s(w̃)} α < 1, hb (αp + (1 − α)p̃) > min{hb (p), hb (p̃)}
α α
II-B3 Monotonicity 2
≤ w < w̃ ≤ α, s(w̃, α − w̃) > s(w, α − w) 2
≤ p < p̃ ≤ α, hb (p̃, α − p̃) > hb (p, α − p)
II-B4 Completeness w̃ = wk0, s(w̃) > s(w) p̃ = pk0, hb (p̃) < hb (p)
 
∃β, w̃ = (w̃k ), ŵ = (ŵk ), w̃k = ŵk = wk , k 6= i
 
 ∃β, p̃ = (p̃k ), p̂ = (p̂k ), p̃k = p̂k = pk , k 6= i
II-B5 Regularity
w̃i = wi + β, ŵi = wi + β + α, s(w̃) < s(ŵ)
 p̃i = pi + β, p̂i = pi + β + α, hb (p̃) > hb (p̂)

II-B6 Hom. growth w̃ = w + α1n , s(w̃) < s(w) p̃ = p + α1b , hb (p̃) > hb (p)
II-B7 Schur-convexity w  w̃, s(w) ≥ s(w̃) p  p̃, hb (p) ≤ hb (p̃)
II-B8 Triangle Ineq. s(.) ≥ 0, s(w + w̃) ≤ s(w) + s(w̃) hb (.) ≤ 0, hb (p + p̃) ≥ hb (p) + hb (p̃)

8) Triangle inequality (II-B8):


S. If s > 0, the triangle inequality holds.
H. If h < 0, the reverse triangle inequality holds.
Lemma 2.8: If s(.) > 0, II-A4 and II-B2 imply II-B8.

III. S PARSITY F UNCTIONS

In signal processing, sparsity functions describe the efficiency of basis representation.


The `0 -pseudo-norm is known as the canonical sparsity count. It is also known as strict or hard sparsity since
it counts the cardinality of the support of a function, e.g. the number of non-zero elements of a vector. In the
following, this sparsity count will be denoted “`0 ”.1 Its lack of useful derivative leads to combinatorial optimization
when used as the objective functional in maximizing-sparsity problems. For instance, in Compressed Sensing [25]
a signal x ∈ Rn , of assumed sparse coefficients w ∈ Rn , is recovered from samples y ∈ Rm , with m  n. The
recovery is based on searching the vector x# of sparsest representation or minimum complexity [25],

maxw kwks

Φx# = arg (8)
s.t. Πw = y,

1 The quotation marks (“ ”) help to preserve and tries to alert about the misleading notation which treats `0 -pseudo- (`1 -) norm as a sparsity
function. In fact, `0 -pseudo- (`1 -) norm measures non-sparsity. This misleading notation led [30] to treat `0 (no quotation marks) as a sparsity
function.

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 8

Table IV
SPARSITY FUNCTIONS .

Function Description spq family


1
“`0 ” or sparsity count −n kwk0 not a member
1
“`1 ” (proxy) −n kwk1 not a sparsity func.
kwk4 4
 
Kurtosis n kwk not a member
2
2 kwk1∗
Gini index 1+n kwk1
−1 p = 1∗ , q = 1
1 √ kwk1

Hoyer measure √
n−1
n − kwk2
p = 1, q = 2
1 − 1 kwk
pq-means −n q p kwkp p ≤ 1, q > 1
q
1 kwk1
s1∞ or max-sparsity −n kwk∞
p = 1, q = ∞

Table V
T EST RESULTS OF SPARSITY FUNCTIONS . (L EMMAS 2.#, P ROOFS A.##.)

“`0 ” “`1 ” Hoyer Gini max-s spq


II-A1 3 A.11 3 A.11 3 A.11 3 A.11 3 A.11 3 A.11
II-A2 3 A.11 3 A.11 3 A.11 3 A.11 3 A.11 3 A.11
II-A3 3 A.12 7 A.21 3 A.31 3 A.40 3 A.49 3 A.57
II-A4 3 A.13 7 A.22 3 A.32 3 A.41 3 A.50 3 A.58
II-A5 3 A.14 3 A.23 7 A.33 7 A.42 3 A.51 3 A.59
II-B1 3 A.15 7 A.24 3 A.34 3 A.43 3 A.52 3 A.60
II-B2 3 A.16 3 A.25 3 A.35 3 A.44 3 A.35 3 A.61
II-B3 3 A.17 7 A.26 3 A.36 3 A.45 3 A.53 3 A.62
II-B4 3 A.18 3 A.27 3 A.37 3 A.46 3 A.54 3 A.63
II-B5 7 A.19 7 A.28 3 A.38 3 A.47 3 A.55 3 A.64
II-B6 3 A.20 3 A.29 3 A.39 3 A.48 3 A.56 3 A.65
II-B7 3 2.7 3 2.7 3 2.7 3 2.7 3 2.7 3 2.7
II-B8 3 2.8 7 A.30 3 2.8 3 2.8 3 2.8 3 2.8

where k.ks measures sparsity, and Φ and Π are the sparsifying and measurement matrices, respectively.
For (8) and related problems, the adoption of the `1 -norm as a proxy function allows to derive efficient algorithms
and performance guarantees [21]. “`1 is a convex surrogate for `0 count. It is the best surrogate in the sense that
the `1 ball is the smallest convex body containing all 1-sparse objects of the form ±ei ” [42]. This function will be
denoted “`1 ”. “`1 ” is not though a sparsity function [30] and it should not be treated as such. However, Subsection
III-B will validate its adoption in (8). Other sparsity functions proposed for (8) will be discussed later. Among
these functions, the kurtosis [1], Gini index [4], Hoyer measure [19] and pq-means {p ≤ 1, q > 1} [30] (denoted
pq-means in the following) satisfy most of the six-rules criteria of [30]. [41] extends these criteria and derives
max-sparsity. Table IV defines all these functions, and Table V shows the results of the test using axioms and
attributes of Section II.

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 9

A. The spq Core Sparsity

Further inspection of sparsity functions (see Table IV) reveals a parallel in their construction [41]. Metrics of
inequality of wealth or fairness, i.e. non-sparsity, exhibit the same parallel, e.g. the Theil [8] and Atkinson [9]
indexes. This simple observation allows to derive a generalization.
Theorem 3.1: [Core sparsity] The function
1 1 kwkp
spq (w) = −n q − p , {p ≤ 1, q ≥ 1, p 6= q}, (9)
kwkq
satisfies axioms II-A1-II-A5.
Core sparsity is continuous (except at the origin), symmetric, scale-invariant and semi-strictly quasi-convex.
The kurtosis (p = 4, q = 2) is out of the range for p and q in (9). This function failed the test using criteria of
[30] hence it will not be further analyzed. Nevertheless, the Gini index (weighted mean p = 1∗ , q = 1) and Hoyer
measure (p = 1, q = 2) are positive affine transformations of core sparsity. These transformations are increasing
hence they preserve the quasi-convexity attribute II-B2 by Theorem A.2 of [31].
More importantly, core sparsity extends pq-means. [22] presents pq-means with {p ≤ 1, q = 2}. “This, (`p )
normalization by the `2 -norm may turn out to be the best sparseness criterion. This however, has yet to be further
investigated. [22]”. And although [30] grants the origin (citation principle) of pq-means to [22], (anonymous)
reviewers suggested these functions to [30] as stated in its acknowledgments. Thus, pq-means have no clear origin
neither a formal derivation in the literature. Thus, the Appendix derives core sparsity (which extends pq-means)
from axioms II-A1-II-A5.
A key factor of core sparsity is its resemblance to “`0 ”. Fig. 1 shows this resemblance. Figs. 1(d)-1(f) show
the smoothness, symmetry and quasi-convexity of sparsity functions; and Figs. 1(g)-1(l) show that the unit balls
of their respective “core” functions and of “`0 ” are the same. A difference between functions in Fig. 1 is their
behavior at the origin. “`0 ” attains its maximum at the origin, which makes sense from a practical perspective since
no information is required to reconstruct null vectors. However, core sparsity is not defined at the origin due to
continuity, e.g. in Fig. 1,

0 = lim spq ((, )) 6= lim spq ((, 0)) = 1. (10)


→0 →0

Notably, p and q parameters localize the domain of core sparsity. Fig. 2(b) presents this localization process
1
for w following a power law decay, i.e. w = (wi ), with wi = ri− ρ . Specifically, Fig. 2(a) shows the versatility
of max-sparsity (and of “`1 ”) to assess the sparsity of arbitrarily-sparse coefficients, and the localized domain of
the kurtosis, Gini index and Hoyer measure, which are better distinguishing among highly-sparse coefficients [41].
This would suggest the (slightly) superior performance, in terms of number of measurements m (the dimension of
y), of recovery strategies based on these last functions compared with the common `1 -minimization, as reported
in [26] for `p , with p ∈ (0, 1); in [36] for Gini index, which stochastic algorithm tends to be unstable; in [27] for
kurtosis, although the function selects inappropriate basis; and in [35] for Hoyer measure in the framework of image
regularization. Thus, appropriate choice of parameters p and q would offer a localized-sparsity formulation of (8),

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 10

(a) “`0 ” (b) “`1 ” (proxy) (c) Kurtosis (d) Gini index (e) Hoyer measure (f) max-sparsity

1 1 1 1 1 1

0.5 0.5 0.5 0.5 0.5 0.5

0 0 0 0 0 0

-0.5 -0.5 -0.5 -0.5 -0.5 -0.5

-1 -1 -1 -1 -1 -1
-1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1

kwk4 kwk1∗ kwk1 kwk1


(g) kwk0 = 1 (h) kwk1 = 1 (i) kwk2
=1 (j) kwk1
=1 (k) kwk2
=1 (l) kwk∞
=1

Figure 1. Sparsity functions (normalized to [0, 2]) (a)-(f) for w ∈ [0, 1]2 , and the unit ball of their “core” functions (g)-(l) for w ∈ [−1, 1]2 .

1 1
"L1" (Ref.) Ref.
max sparsity q = 32
0.8 Hoyer 0.8 q = 16
Gini q=8
Kurtosis q=4
q=2
0.6 0.6
q = 1.5
q = 1.25
q = 1.125
0.4 0.4

0.2 0.2

0 0
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
ρ ρ

(a) Sparsity functions (b) s1q core sparsity

1
−ρ
Figure 2. Sparsity (normalized to [0, 1]) of coefficients w = (wi ) of length n = 1000, with wi = ri (power law decay), r = 1.2 and
ρ ∈ [0.03, 8].

since core sparsity strongly encourages coefficients of a given (or estimated) level of sparsity. Further, core sparsity
is differentiable (by construction) which is appropriate for gradient-based optimization, and by the quasi-convexity,
its solution to (8) is attained at an extreme point of the polytope defined by linear (equality) constraints [32], e.g.
Πw = y. Then, a simple exact algorithm can solve (8) by walking along the extreme points of the feasible region.
Still, the potential of core sparsity in sparse recovery has not been studied.

B. Essentials of `1 -minimization in Compressed Sensing

“`1 ” fails the test using axioms and attributes as shown in Table V hence it is not a sparsity function. But “`1 ”
is able to solve (8) in which it commonly plays the role of objective functional. Further, it offers a computational
efficient convex, actually linear, formulation. Certainly core sparsity can not improve this formulation due to its
quasi-convex nature, which leads to quasi-convex maximization in (8). Nevertheless, maximization of sparsity via

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 11

s1∞ core sparsity or max-sparsity2 validates the adoption of “`1 ” in (8).


Theorem 3.2: In linear constrained (8), s1∞ -maximization simplifies to “`1 ”-maximization, i.e. `1 -minimization,
 
maxw s1∞ (w)
 minw kwk1

≡ (11)
s.t. Πw = y, s.t. Πw = y.
 

1
Proof 3.3: Let t = |wmax | , ui = twi , and vi = 1t ui . Then,
 
maxw − 1 kwk1
 minw |w1 |+···+|wn |

n kwk∞ |wmax |
≡ (12)
s.t. Πw = y, s.t. Πw = y,
 



 min|u|≤1 |u1 | + · · · + |un |


≡ s.t. Πu = yt (13)




 t ≥ 0,

min|v|≤ 1t |v1 | + · · · + |vn |


≡ t≥0 (14)

s.t. Πv = y.

Theorem 3.2 does not (only) state that s1∞ -maximization is equivalent to `1 -minimization, but that `1 -minimization
can be formulated in (8) since it is a (simple) reformulation of s1∞ -maximization: essentially, the linear constraint
in (8) ensures the appropriate normalization (by the maximum) that allows “`1 ” to perform as an actual spar-
sity function, e.g. Proof A.30. This result clearly extends to `p -minimization, with p ∈ (0, 1) [26], which is a
reformulation of sp∞ -maximization, with p ∈ (0, 1).
More importantly, the formalism of the present paper and the generalized convexity framework developed by
[31] offer an opportunity to face the non-linear-constrained version of (8), i.e. the non-linear Compressed Sensing
problem [39].

IV. E NTROPY F UNCTIONS

Entropy is a concept of physics and information theory that characterizes the unpredictability of systems’ states
and processes’ events.
In physics, the Tsallis entropy [11] generalizes Boltzmann-Gibbs entropy. In information theory, Rényi entropy
[6] generalizes entropy functions via its parameter p ≥ 0: Hartley entropy or log max entropy [3] (p = 0) counts the
cardinality of non-zero probability events; Shannon entropy (p = 1) follows a similar expression than Boltzmann-
Gibbs entropy to measure the average unpredictability of events; collision entropy (p = 2); log min entropy [23]
(p = ∞) is the most conservative entropy function since it measures the unpredictability of the most likely event.
Hence log min entropy is never greater than Shannon entropy (see Figs. 3-4). Table VI defines all these functions,
and Table VII shows the results of the test using axioms and attributes of Section II.

2 max-sparsity measures the non-similarity between w and 1n , the least compressible vector. It was originally denoted s1∞ in [41].

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 12

Table VI
ENTROPY FUNCTIONS .

Function Description hbpq family


log max entropy log2 kpk0 p = 0, q = 1
Shannon entropy −hp, log2 pi p = 1, q ↓ 1
Collision entropy −2 log2 kpk2 p = 1, q = 2
log min entropy − log2 kpk∞ p = 1, q = ∞
p
Rényi entropy 1−p
log2 kpkp p 6= 1, q = 1
Tsallis entropy 1
p−1
(1 − kpkpp ) p 6= 1, q = 1
1
hb1∞ or min-entropy kpk∞
p = 1, q = ∞

Table VII
T EST RESULTS OF ENTROPY FUNCTIONS . (L EMMAS 2.#, P ROOFS A.##.)

Shannon Rényi Tsallis min-h hbpq


II-A1 3 A.11 3 A.11 3 A.11 3 A.11 3 A.11
II-A2 3 A.11 3 A.11 3 A.11 3 A.11 3 A.11
II-A3 3 A.67 3 A.76 3 A.83 3 A.49 3 A.57
II-A4 3 A.68 3 A.68 3 A.68 3 A.50 3 A.58
II-A5 3 A.69 3 A.69 3 A.69 3 A.51 3 A.59
II-B1 3 A.70 3 A.77 3 A.84 3 A.52 3 A.60
II-B2 3 A.71 3 A.78 3 A.85 3 A.35 3 A.61
II-B3 3 A.72 3 A.79 3 A.86 3 A.53 3 A.62
II-B4 3 A.73 3 A.80 3 A.87 3 A.54 3 A.63
II-B5 3 A.74 3 A.81 3 A.88 3 A.55 3 A.64
II-B6 3 A.75 3 A.82 3 A.89 3 A.56 3 A.65
II-B7 3 2.7 3 2.7 3 2.7 3 2.7 3 2.7
II-B8 3 2.8 3 2.8 3 2.8 3 2.8 3 2.8

In computer science, entropy characterizes the complexity of sequences. In this context, entropy functions are
called complexity algorithms. Among these algorithms, Lempel-Ziv complexity [10] characterizes the randomness of
a sequence of symbols by measuring its rate of (new) pattern generation. Following a distinct approach, Approximate
entropy [12] examines finite-length time series for similar epochs, and Sample entropy [17] performs similarly but
without counting self-matches. Both algorithms emerged from the formulation of Kolmogorov complexity for finite-
sample approximations. Axioms and attributes of Section II are not applicable to these algorithms but they will be
used for comparison (see Figs. 3-4).

A. The hbpq Core Entropy

Although Rényi entropy and Tsallis entropy are concepts from different fields and are applied to different objects,
both functions posses the same parallel that was identified in sparsity functions (see Tables IV and VI). The key

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 13

1 1
Shannon (Ref.) Ref.
Lempel-Ziv q = 1.125
0.8 Apen 0.8 q = 1.25
SampEn q = 1.5
min entropy q=2
q=4
0.6 0.6
q=8
q = 16
q = 32
0.4 0.4

0.2 0.2

0 0
0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8
π π

(a) Entropy functions (b) hb1q core entropy

Figure 3. Entropy (normalized to [0, 1]) of p = (π, 1 − π) of n = 10000 samples xi ∼ Bern(π), with π ∈ [0, 1].

difference between Rényi entropy and Tsallis entropy (and sparsity functions of Section III) is the logarithm
transformation which provides the information units to Rényi entropy and which, at the same time, preserves the
quasi-concavity attribute by Theorem A.2 of [31]. The same applies to the power function in Tsallis entropy. Thus,
following Section II (which states the complementary behavior of sparsity functions and entropy functions) and
Section III (which introduces core sparsity), and removing all unit transformations, allows to derive a generalization.
Theorem 4.1: [Core entropy] The function
kpkp
hbpq (p) = , {p ≤ 1, q ≥ 1, p 6= q}, (15)
kpkq
satisfies axioms II-A1-II-A5.
Core entropy is continuous (except at the origin which corresponds to the empty set), symmetric and semi-strictly
quasi-concave. It follows the quotient-of-weighted-functions form of general entropy functions as in [14], e.g.
Aczel-Daróczy entropy [7].
Interestingly, the application of function a(.) = − 1. to core entropy increases the domain of its parameters p
and q to {p ≥ 1, q ≤ 1, p 6= q} and makes core entropy negative. Then, the Tsallis entropy (p 6= 1, q = 1), Rényi
entropy (p 6= 1, q = 1) and their special cases (Shannon, etc.) are transformations (by the logarithm and power
functions) of core entropy. All these previous transformations (including a(.)) are increasing hence they preserve
the (now) quasi-concavity attribute II-B2 by Theorem A.2 of [31]. Note that the domain extension of parameters p
and q to {p ≥ 1, q ≤ 1, p 6= q} also applies to core sparsity, which will be now positive.
Some features of core entropy are shown in Fig. 3 which presents the entropy of a r.v. following the Bernoulli
distribution with parameter π, the canonical example of information theory textbooks [24]. Fig. 3(a) shows the
smoothness, monotonicity, symmetry and quasi-concavity of entropy functions and complexity algorithms. Fig.
3(b) shows the same features in core entropy. Setting parameter p = 1 allows to appreciate the most interesting
case: core entropy tends to Shannon entropy as q ↓ 1. Unsurprisingly, the same tendency is present for q = 1 fixed
and p ↑ 1.
For general univariate distributions, [16] identifies conditions under which entropy defines an order (on distri-
butions). Fig. 4 shows this ordering phenomenon for several distributions. For a better visual comparison, these

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 14

1 1

0.8 0.8

0.6 0.6
Ref.
q = 1.125
q = 1.25
0.4 0.4
q = 1.5
Shannon (Ref.) q=2
Lempel-Ziv q=4
0.2 Apen 0.2 q=8
SampEn q = 16
min-entropy q = 32
0 0
2 4 6 8 2 4 6 8
Distributions Distributions

(a) Entropy functions (b) hb1q core entropy

Figure 4. Entropy and ordering of univariate distributions. Unit-variance test distributions: (1) Constant (all probability mass in one single
event, σ = 0), (2) Pareto, (3) Frechet, (4) Log-normal, (5) Exponential, (6) Gamma, (7) Weibull, (8) Gumbel, and (9) Gaussian. Parameters:
deviation σ = 1, alphabet size b = 10, m = 2 and r = 0.2σ.

test distributions do not include the uniform distribution which reaches the maximum in entropy functions. Fig.
4(a) shows the ordering obtained using entropy functions and complexity algorithms. Although its simplicity, hb1∞
core entropy or min-entropy respects the same trend ordering reported by other functions. As in Fig. 3(b), Fig.
4(b) shows that the ordering trend drawn by core entropy, with p = 1 fixed and q ↓ 1, tends to resulting ordering
obtained using Shannon entropy.

B. On Relative and Absolute Functions

Core sparsity is a relative measure and core entropy is an absolute measure. The following discusses this subtle
difference and its consequences on the way both functions follow the axioms (and attributes) of Section II.
For entropy functions, replication II-A5 does not refer to a larger number of events or states. As the number
of “replicas” of p goes to infinity, replication II-A5 affirms that entropy is intrinsic to a process (system). Hence
entropy functions measure an absolute property as is the uncertainty of a process (system) of fixed possible events
(states). Analogously, sparsity functions measure a relative property as is the efficiency of a basis representation or
compressibility. For example, coefficients w = (1, 0) and w̃ = (1, 1, 0, 0) both require 50% of their content for an
exact representation.
1 1
This subtle difference is characterized by the dimension normalization n q − p , which is present only in core
sparsity (cf. Proof A.10). Furthermore, (discrete) entropy is defined for discrete r.v.s on countable sample spaces
1 1
which cardinality could be infinity, i.e. b → ∞. Thus, normalization of core entropy by b q − p would not be possible
in general.
More importantly, replication II-A5 and completeness II-B4 should not be confused with the following two
common axioms of entropy functions [28].
Lemma 4.2: Core entropy satisfies

hbpq (p) = hb+1


pq (pk0). (16)

Proof 4.3: Trivial by the definition of core entropy.

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 15

Lemma 4.4: Core entropy satisfies, for m < n,

hm n
pq (1m ) < hpq (1n ). (17)

Proof 4.5:
16
hm n
pq (1m ) = hpq (1m k0n−m ) (18)
II-B4 n
< hpq (1n ), (19)

which is also obvious given 1m k0n−m is more “compressible”, hence less uncertain, than 1n . Or by simple
1 1 1 1
calculation: hm p − q , hn (1 ) = n p − q .
pq (1m ) = m pq n

C. Sparse Recovery of Probability Measures

Consider the minimum (Shannon) entropy problem on the probability simplex with sample moment constraints
[34],

min exp(hbShannon (p))

 p≥0
kpk1 =1 (20)

s.t. Ep [X k ] = µk , 1 ≤ k ≤ m − 1,

where Ep denotes expectation with respect to desired p.


Write A = (aij ), with aij = xi−1
j , 1 ≤ j ≤ b and 1 ≤ i ≤ m; and sample moments µ = (1, µ1 , . . . , µm−1 ).

Since log max entropy upper bounds Shannon entropy, the solution of (20) is upper-bounded by the solution of
 
minp≥0 hbmax (p) 
 minp≥0 kpk0
≡ (21)
s.t. Ap = µ s.t. Ap = µ,
 

where hbmax denotes hb01 core entropy or max-entropy. Deterministic measurement matrix A is Vandermonde [29],
and linear constraint Ap = µ includes the kpk1 = 1 normalization. Then, under the common Compressed Sensing
hypotheses, (21) is equivalent to
 
minp≥0 kpk1 minp≥0 1
 
≡ (22)
s.t. Ap = µ s.t. Ap = µ.
 

However, formulation (22) is useless in this case since it produces a feasibility problem which solution is not
unique if the problem is under-determined. Nevertheless, by the linearity of the constraints and Theorem 3.2, (22)
is equivalent to s1∞ maximization, which is equivalent to minimization of hb1∞ or min-entropy,3 denoted in the
following hbmin .
Theorem 4.6: Under an appropriate Vandermonde matrix A, min-entropy minimization solves (20) uniquely.

3 min-entropy measures the similarity between p and 1b , the most uncertain distribution, i.e. uniform distribution [41].

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 16

Proof 4.7: Given the linear constraints of (20),

min hbmin (p) ≤ min hbShannon (p) ≤ min hbmax (p) . (23)
| {z } | {z }
=min hb1∞ (p) ≡min kpk0
≡max s1∞ (p) ≡min kpk1
≡min kpk1

Hence, (21) not only upper bounds (20) (cf. [37]).

V. C ONCLUSIONS

A complete characterization of sparsity and entropy offered new functions and tools to solve efficiently problems
of sparse recovery. This paper proposed a formalism, a joint axiomatic characterization, for sparsity and entropy.
The proposed set of axioms is constructive and allows to derive the core functions, core sparsity and core entropy.
Both core functions are simple functional forms found in well-known sparsity and entropy functions. Finally, core
functions were applied to the sparse recovery problem where they offered efficient formulations. More importantly,
the (simplest) core sparsity validates the use of the `1 -norm as a (non-)sparsity measure in compressed sensing
problem which further provides insights to this problem.
Future work will follow two directions. Concerning entropy, to study the compatibility of proposed axioms (and
attributes) and specific axioms of entropy, especially those related to relative and conditional entropy. Some of these
specific axioms were already verified here. Concerning sparsity, to study the potential of core sparsity in a general
optimization framework, where core sparsity can be “tuned” to match the optimal recovery strategy according to a
sparsity level.

A PPENDIX

In this appendix Lemmas 2.1-2.8 are proved, which derive attributes II-B1-II-B8 from axioms II-A1-II-A5. The
appendix also contains the proofs listed in Tables V and VII where functions, applied to non-zero vectors, are
evaluated using axioms and attributes of Section II. These proofs give a refinement of the formalism compared to
proofs of [30] of which some proofs are corrected.
Tables VIII-IX describe the notation adopted in the proofs related to sparsity and entropy functions, respectively.
The proofs are based on the following key results of [31].
Theorem A.1: [31] Let f , g be defined on S convex, and
f (w)
s(w) = . (24)
g(w)
If any of the following properties hold

1) f non-negative convex, g positive concave; or


2) f non-positive convex, g positive convex; or
3) f convex, g positive affine.

Then, s is semi-strictly quasi-convex on S.

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 17

Theorem A.2: [31] (Also [18].) Let s be semi-strictly quasi-convex (-concave) defined on S ⊆ Rn convex. Let
a : A → R be increasing, with s(S) ⊆ A. Then, a ◦ s is semi-strictly quasi-convex (-concave) on S.
Examples of function a(.) are the log(.) and − 1. (which changes the sign and allows to prove Lemma 2.8).

A. Proofs of Lemmas
1
Proof A.3: [Lem. 2.1, Bounds] Let w arbitrary with kwk = 1. Construct ŵ = 0n−1 k1 and w̃ = n 1n from w
via concentration actions II-A3. Then,
II-A3 II-A3
s(w̃) ≤ s(w) ≤ s(ŵ). (25)

and by scaling II-A4,


II-A4
s(α1n ) = s(w̃), (26)
II-A4
s(ŵ) = s(0n−1 kα). (27)

Proof A.4: [Lem. 2.2, Quasi-convexity] Let s continuous II-A1, symmetric II-A2, homogeneous of degree 0 II-A4
f
and non-negative. Then, s can be written as g, with f and g continuous, symmetric, homogeneous of degree 1 and
non-negative. Construct w̃ from w via a concentration action II-A3. Then, it holds that

f (w̃) > f (w) and g(w̃) < g(w). (28)

For function f (differentiable)

f (w) − f (w̃) ≈ −α∇ei −ej f (w) (29)


 
∂f ∂f
=α (w) − (w) . (30)
∂wj ∂wi
Then, by the symmetry of f and (28), each component function fk of f satisfies
fk (wj ) − fk (wj + α) fk (wi ) − fk (wi + α)
lim < lim (31)
α→0 α α→0 α
f
with wi > wj , which implies that f is convex. Similarly, g is concave. Now, by Theorem A.1, s = g is semi-strictly
quasi-convex. A similar proof holds if s non-positive.
Proof A.5: [Lem. 2.3, Monotonicity] Construct w̃ = (w̃, α − w̃) from w = (w, α − w) via concentration actions
α
II-A3, with 2 ≤ w ≤ w̃ ≤ α.
Proof A.6: [Lem. 2.4, Completeness][30] Let w arbitrary. Then,
II-A5
s(w) = s(wk . . . kw) (32)
| {z }
n+1 times
  
II-A3 1
< s 1+ wk . . . kwk0n (33)
n
II-A4
= s(wk . . . kwk0n ) (34)
II-A5
= s(wk0). (35)

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 18

Proof A.7: [Lem. 2.5, Regularity][30] Let w̃ = [w1 , . . . , wi + β, . . . , wn ] and 0 <  = 1 + δ. Construct ŵ from
w = w̃ via concentration actions II-A3,

ŵj = wj − δwj = wj , j 6= i, (36)


X
ŵi = (wi + β) + δwj (37)
j6=i

= wi + β + α, (38)
P
where α = δ(β + j wj ). Then,
II-A4
s(w0 ) = s(w1 ) (39)
II-A3
< s(w2 ). (40)

Proof A.8: [Lem. 2.6, Hom. growth] Let w arbitrary, w̃ = 1n and α < 1. Then,
 
1−α II-A4
s w+ w̃ = s(αw + (1 − α)w̃) (41)
α
II-B2
< max{s(w), s(w̃)} (42)
II-B1
= s(w). (43)

Proof A.9: [Lem. 2.8, Triangle ineq.] Let w and w̃ arbitrary and α < 1,
II-B2
s(αw + (1 − α)w̃) ≤ max{s(w), s(w̃)} (44)
s>0
≤ s(w) + s(w̃) (45)
II-A4
= s(αw) + s((1 − α)w̃). (46)

B. Derivation of spq Core Functions

Proof A.10: [Thm. 3.1, Thm. 4.1.] Since concentration II-A3 implies quasi-convexity II-B2, Theorems A.1 and
A.2, with a(.) = − 1. increasing, state that sparsity s can be written as
g(w)
s(w) = −ξ(n) (47)
f (w)
with ξ(n) > 0, ∀n, responsible of the dimension-invariance of s or replication II-A5; and g > 0 concave and
f > 0 convex, and both functions continuous II-A1, symmetric II-A2 and homogeneous of degree 1, such that s
is homogeneous of degree 0 or scale-invariant II-A4. By the homogeneity, f (0n ) = 0 and g(0n ) = 0 (hence s
cannot be defined at the origin). The previous description of f and q resembles vector norms and leads to the pair
of candidate functions

f (.) = k.kq , q ≥ 1, (48)

g(.) = k.kp , p ≤ 1, (49)


1 1
with p 6= q. Finally, normalization by ξ(n) = n q − p allows s to satisfy replication II-A5.
A similar proof holds for core entropy which does not need this last normalization.

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 19

Table VIII
S UMMARY OF NOTATION USED IN PROOFS RELATED TO SPARSITY FUNCTIONS .

w̃ ñ ŵ n̂ Assumptions Desired result


Continuity n w̃ → w s(w̃) → s(w)
Symmetry π(w) n s(w̃) = s(w)
Concentration w̃i = wi + α, w̃j = wj − α n wi ≥ wj s(w̃) > s(w)
Scaling αw n s(w̃) = s(w)
Replication wk . . . kw mn n→∞ s(w̃) → s(w)
Bounds α1n n 0n−1 kα n s(w̃) < s(ŵ)
Quasi-convex n αw + (1 − α)w̃ n α < 1, s(w) ≤ s(w̃) s(w̃) > s(ŵ)
α
Monotonicity [w̃, α − w̃] 2 [ŵ, α − ŵ] 2 2
≤ w̃ < ŵ ≤ α s(w̃) < s(ŵ)
Completeness wk0 n+1 s(w̃) > s(w)
Regularity w̃i = wi + β n ŵi = wi + β + α n s(w̃) < s(ŵ)
Hom. growth w + α1n n s(w̃) < s(w)

C. Proofs for Sparsity Functions

Proof A.11: [Continuity, symmetry] By the continuity and symmetry of `p -norms.


1) Proofs for “`0 ”: Denoted s0 in the following.
Proof A.12: [Concentration](cf. correction of [30].)

kwk0 − 1, new null coefficient,

kw̃k0 = (50)
kwk0 ,

otherwise,

i.e. s0 (w̃) ≥ s0 (w).


Proof A.13: [Scaling](cf. correction of [30].)

kw̃k0 = kwk0 , (51)

i.e. s0 (w̃) = s0 (w).


Proof A.14: [Replication](cf. correction of [30].)

kw̃k0 = mkwk0 , (52)

i.e. s0 (w̃) = s0 (w).


Proof A.15: [Bounds]

s0 (w̃) = −1, (53)


1
s0 (ŵ) = − , (54)
n
i.e. s0 (w̃) < s0 (ŵ).
Proof A.16: [Quasi-convexity]

kŵk0 ≥ min{kwk0 , kw̃k0 }, (55)

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 20

i.e. s0 (ŵ) ≤ max{s0 (w), s0 (w̃)}.


Proof A.17: [Monotonicity]

kw̃k0 = 2, (56)

2, ŵ < α,

kŵk0 = (57)
1, otherwise,

i.e. s0 (w̃) ≤ s0 (ŵ).


Proof A.18: [Completeness](cf. correction of [30].)

kw̃k0 = kwk0 , (58)

i.e. s0 (w̃) > s0 (w).


Proof A.19: [Regularity](cf. correction of [30].)

kw̃k0 = kŵk0 , (59)

i.e. s0 (w̃) = s0 (ŵ).


Proof A.20: [Hom. growth](cf. correction of [30].)

kwk0 + m, (m) new non-null coefficients,

kw̃k0 = (60)
kwk0 ,

otherwise,

i.e. s0 (w̃) ≤ s0 (w).


2) Proofs for “`1 ”: Denoted s1 in the following.
Proof A.21: [Concentration](cf. counterexample in [30].)

kw̃k1 = kwk1 , (61)

i.e. s1 (w̃) = s1 (w).


Proof A.22: [Scaling](cf. counterexample in [30].)

kw̃k1 = αkwk1 , (62)

i.e. s1 (w̃) = αs1 (w).


Proof A.23: [Replication](Correction of [30].)

kw̃k1 = mkwk1 , (63)

i.e. s1 (w̃) = s1 (w).


Proof A.24: [Bounds]
α
s1 (w̃) = s1 (ŵ) = − , (64)
n
i.e. s1 (w̃) = s1 (ŵ).

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 21

Proof A.25: [Quasi-convexity] Assume kw̃k1 ≤ kwk1 , then, s1 (w̃) ≥ s1 (w). Since w, w̃ non-negative,

kw̃k1 ≤ αkw̃k1 + (1 − α)kwk1 (65)

= kŵk1 , (66)

i.e. s1 (ŵ) ≤ s1 (w̃) = max{s1 (w), s1 (w̃)}.


Proof A.26: [Monotonicity]

kw̃k1 = kŵk1 , (67)

i.e. s1 (w̃) = s1 (ŵ).


Proof A.27: [Completeness](Correction of [30].)

kw̃k1 = kwk1 , (68)

i.e. s1 (w̃) > s1 (w).


Proof A.28: [Regularity](cf. counterexample in [30].)

kŵk1 = kw̃k1 + α, (69)

i.e. s1 (w̃) > s1 (ŵ).


Proof A.29: [Hom. growth](cf. obvious in [30].)

kw̃k1 = kwk1 + nα, (70)

i.e. s1 (w̃) < s1 (w).


Proof A.30: [Triangle ineq.] Notice that @α, β such that s̄1 (v) = αs1 (v) + β ≥ 0, ∀v without knowledge of
the maximum of w (sup |w| = kwk∞ ). Then,

kŵk ≤ kwk1 + kw̃k1 , (71)

i.e. s1 (ŵ) ≥ s1 (w) + s1 (w̃).


3) Proofs for Hoyer:
Proof A.31: [Concentration](Correction of [30].)

g(α) = sHoyer (w̃(α)) − sHoyer (w), (72)

where g(0) = 0. Then,


 0
0 kwk1 −1
g (α) = √ (73)
n − 1 kw̃(α)k2
1 kwk1
=√ (wi − wj + 2α) > 0, (74)
n − 1 kw̃k32
since wi + α > wj − α, i.e. sHoyer (w̃) > sHoyer (w).
Proof A.32: [Scaling](cf. obvious in [30].)
kw̃k1 kwk1
= , (75)
kw̃k2 kwk2

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 22

i.e. sHoyer (w̃) = sHoyer (w).


Proof A.33: [Replication](cf. counterexample in [30].)

n−1
sHoyer (w̃) = √ sHoyer (w), (76)
n − √1m

n−1
where √
n− √1m
→ 1 as n → ∞, i.e. sHoyer (w̃) → sHoyer (w).
Proof A.34: [Bounds]

sHoyer (w̃) = 0, (77)

sHoyer (ŵ) = 1, (78)

i.e. sHoyer (w̃) < sHoyer (ŵ).


Proof A.35: [Quasi-convexity] Let p = 1, q > 1, e.g. Hoyer with q = 2, s1∞ with q = ∞, then
• f (.) = k.kq convex,
• g(.) = k.kp affine (or concave),
f (.)
• by Theorem A.1(3 (or 1)), z(.) = g(.) is semi-strictly quasi-convex,
• and, by Theorem A.2 with a(.) = − 1. , a ◦ z(.) = − fg(.)
(.) is semi-strictly quasi-convex.

Proof A.36: [Monotonicity]

g(α) = sHoyer (ŵ(α)) − sHoyer (w̃(α)) (79)


α 1 1
=√ ( − ) > 0, (80)
n − 1 kw̃k2 kŵk2
α
since k.k2 convex and 2 ≤ w1 < w2 ≤ α, i.e. sHoyer (ŵ) > sHoyer (w̃).
Proof A.37: [Completeness](cf. obvious in [30].)

g(w) = sHoyer (w̃(w)) − sHoyer (w) (81)


√ √  
n+1− n kwk1
= √ √ − 1 > 0, (82)
( n + 1 − 1)( n − 1) kwk2
since kvk2 < kvk1 , i.e. sHoyer (w̃) > sHoyer (w).
Proof A.38: [Regularity](Correction of [30].)

g(α) = sHoyer (ŵ(α)) − sHoyer (w̃), (83)

where g(0) = 0. Then,


 0
−1 kw̃k1 + α
g 0 (α) = √ (84)
n−1 kŵ(α)k2
kw̃k1 (kw̃k∞ + α) − kw̃k22 − kw̃k∞ α
= √ > 0, (85)
( n − 1)kŵk32
since kvk22 < kvk∞ kvk1 and kvk1 > kvk∞ , i.e. sHoyer (ŵ) > sHoyer (w̃).
Proof A.39: [Hom. growth](cf. [30].)

g(α) = sHoyer (w) − sHoyer (w̃(α)), (86)

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 23

where g(0) = 0. Then,


 0
0 1 kwk1 + nα
g (α) = √ (87)
n−1 kw̃(α)k2
1 nkwk22 − kwk21
=√ > 0, (88)
n−1 kw̃k32

since kvk1 < nkvk2 , i.e. sHoyer (w) > sHoyer (w̃).
4) Proofs for Gini: Recall that the order of indexes changes for this function, i.e. wi > wj as i > j.
Proof A.40: [Concentration](cf. [30].)

g(α) = sGini (w̃(α)) − sGini (w), (89)

where g(0) = 0. Then,


2
g 0 (α) = (kw̃(α)k1∗ )0 (90)
(1 + n)kwk1
2
= (i∗ (wi + α) + j ∗ (wj − α))0 (91)
(1 + n)kwk1
2
= (i∗ − j ∗ ) > 0, (92)
(1 + n)kwk1
where i∗ ≥ i and j ∗ ≤ j are the new indexes (and weights) of coefficients wi + α and wj − α, respectively; i.e.
sGini (w̃) > sGini (w).
Proof A.41: [Scaling](cf. [30].)
kw̃k1∗ kwk1∗
= , (93)
kw̃k1 kwk1
i.e. sGini (w̃) = sGini (w).
Proof A.42: [Replication](Correction of [30].)
Pn Pim
2 i=1 k=(i−1)m+1 wi
s12 (w̃) = −1 (94)
1 + mn mkwk1
(1 + n)m
= s12 (w), (95)
1 + mn
(1+n)m
where 1+mn → 1 as n → ∞, i.e. sGini (w̃) → sGini (w).
Proof A.43: [Bounds]

sGini (w̃) = 0, (96)


n−1
sGini (ŵ) = , (97)
n+1
i.e, sGini (w̃) < sGini (ŵ).
Proof A.44: [Quasi-convexity] Let p = 1∗ , q = 1, e.g. Gini, then
• f (.) = k.kp affine (and convex)
• g(.) = k.kq affine
f (.)
• by Theorem A.1(3), z(.) = g(.) is semi-strictly quasi-convex.

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 24

Proof A.45: [Monotonicity]

g(α) = sGini (ŵ(α)) − sGini (w̃(α)) (98)


2(ŵ − w̃)
= > 0, (99)
α(1 + n)
since ŵ > w̃, i.e. sGini (ŵ) > sGini (w̃).
Proof A.46: [Completeness](cf. [30].)

g(w) = sGini (w̃) − sGini (w), (100)


 Pn Pn 
2 i=1 (i + 1)wi i=1 iwi
= − (101)
1+n kwk1 kwk1
2
= > 0, (102)
1+n
i.e. sGini (w̃) > sGini (w).
Proof A.47: [Regularity](cf. [30].)

g(α) = sGini (ŵ(α)) − sGini (w̃), (103)

where g(0) = 0. Then,


 0
0 2 kw̃k1∗ + nα
g (α) = (104)
1+n kw̃k1 + α
2
(nkw̃k1 − kw̃k1∗ ) > 0, (105)
(1 + n)kŵk21
since nkvk1 − kvk1∗ , i.e. sGini (ŵ) > sGini (w̃).
Proof A.48: [Hom. growth](cf. [30].)

g(α) = sGini (w) − sGini (w̃(α)), (106)

where g(0) = 0. Then,


 Pn 0
0 −2 kwk1∗ + i=1 iα
g (α) = (107)
n+1 kwk1 + nα
n
2n X
= wi (2i − 1 − n) (108)
(n + 1)kw̃k21 i=1
bX
2c
n

2nkw̃k−2
1
= (wn−i+1 − wi )(n + 1 − 2i) > 0, (109)
(n + 1) i=1

since wn−i+1 > wi and n + 1 > 2i with i ≤ b n2 c, i.e. sGini (w) > sGini (w̃).
5) Proofs for s1∞ :
Proof A.49: [Concentration]

g(α) = s1∞ (w̃(α)) − s1∞ (w) (110)


 
kwk1 1 1
= − ≥ 0, (111)
n kwk∞ kw̃k∞

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 25

since kw̃k∞ ≥ kwk∞ , i.e. s1∞ (w̃) ≥ s1∞ (w).


Proof A.50: [Scaling]
kw̃k1 kwk1
= , (112)
kw̃k∞ kwk∞
i.e. s1∞ (w̃) = s1∞ (w).
Proof A.51: [Replication]

s1∞ (w̃) = s1∞ (w). (113)

Proof A.52: [Bounds]

s1∞ (w̃) = 0 (114)


1
s1∞ (ŵ) = 1 − , (115)
n
1
i.e. s1∞ (w̃) < s1∞ (ŵ). Further, let arbitrary w̄ with kw̄k1 = β and kw̄k∞ = λβ, n ≤ λ ≤ 1. Then,
1
s1∞ (w̄) = 1 − , (116)
λn
i.e. s1∞ (w̃) ≤ s1∞ (w̄) ≤ s1∞ (w2 ).
Proof A.53: [Monotonicity]
α
s1∞ (w̃) = 1 − , (117)
2w̃
α
s1∞ (w̃) = 1 − , (118)
2ŵ
i.e. s1∞ (w̃) < s1∞ (ŵ).
Proof A.54: [Completeness]

g(w) = s1∞ (w̃(w)) − s1∞ (w) (119)


1 kwk1
= > 0, (120)
n(n + 1) kwk∞
i.e. s1∞ (w̃) > s1∞ (w).
Proof A.55: [Regularity]

g(α) = s1∞ (ŵ(α)) − s1∞ (w̃), (121)


α(kw̃k1 − kw̃k∞ )
= > 0, (122)
nkw̃k∞ kŵk∞
since kvk1 > kvk∞ , i.e. s1∞ (ŵ) > s1∞ (w̃).
Proof A.56: [Hom. growth]

g(α) = s1∞ (w) − s1∞ (w̃(α)) (123)


α(nkwk∞ − kwk1 )
= > 0, (124)
nkwk∞ kw̃k∞
since nkvk∞ > kvk1 , i.e. s1∞ (w) > s1∞ (w̃).

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 26

6) Proofs for spq compressibility: Let p ≤ 1 < q.


Proof A.57: [Concentration](Correction of [30].)

g(α) = spq (w̃(α)) − spq (w) (125)

≈ α∇ei −ej spq (w), (126)


∂f ∂f
where g(0) = 0, ∇ei −ej f (v) = ∂vi (v) − ∂vj (v), and
!
∂spq 1 1 kwkp wkq−1 wkp−1
(w) = n q − p − (127)
∂wk kwkq kwkqq kwkpp

positive, since vkq−1 kvkpp > vkp−1 kvkqq (expand to see); i.e. spq (w̃) > spq (w).
Proof A.58: [Scaling](cf. [30].)
kw̃kp kwkp
= , (128)
kw̃kq kwkq
i.e. spq (w̃) = spq (w).
Proof A.59: [Replication](cf. [30].)

spq (w̃) = spq (w). (129)

Proof A.60: [Bounds]

spq (w̃) = −1, (130)


1 1
spq (ŵ) = −n q − p , (131)

i.e. spq (w̃) < spq (ŵ).


Proof A.61: [Quasi-convexity] Let p ≤ 1, q > 1, e.g. Hoyer with q = 2, s1∞ with q = ∞, then
• f (.) = −k.kp non-positive convex,
• g(.) = k.kq positive convex,
f (.)
• by Theorem A.1(2), z(.) = g(.) is semi-strictly quasi-convex;
Proof A.62: [Monotonicity] Let β = ŵ − w̃,

g(β) = spq (ŵ) − spq (w̃) (132)

≈ β∇e1 −e2 spq (w̃), (133)


∂f ∂f
where g(0) = 0, ∇e1 −e2 f (v) = ∂v1 (v) − ∂v2 (v), and
!
∂spq 1 1 kw̃kp w̃kq−1 w̃kp−1
(w̃) = n q − p q − (134)
∂ w̃k kw̃kq kw̃kq kw̃kpp

positive, since vkq−1 kvkpp > vkp−1 kvkqq (expand to see); i.e. spq (ŵ) > spq (w̃).
Proof A.63: [Completeness](cf. [30].)

g(w) = spq (w̃(w)) − spq (w) (135)


kwkp q1 − p1 1 1
= (n − (n + 1) q − p ) > 0, (136)
kwkq

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 27

Table IX
S UMMARY OF NOTATION USED IN PROOFS RELATED TO ENTROPY FUNCTIONS .

p̃ b̃, ñ p̂ b̂, n̂ Assumptions Desired result


Continuity b, n p̃ → p hb (p̃) → hb (p)
Symmetry π(p) b, n hb (p̃) = hb (p)
Concentration p̃i = pi + α, p̃j = pj − α b, n pi ≥ pj hb (p̃) < hb (p)
Scaling αp b, αn hb (p̃) = hb (p)
Replication pk . . . kp b, mn b→∞ hb (p̃) → hb (p)
1
Bounds 1
b b
b, n 0b−1 k1 b, n hb (p̃) > hb (p̂)
Quasi-concave b, n αp + (1 − α)p̃ b, n α < 1, hb (p) ≥ hb (p̃) hb (p̃) < hb (p̂)
1
Monotonicity [p̃, 1 − p̃] 2, n [p̂, 1 − p̂] 2, n 2
≤ p̃ < p̂ ≤ 1 hb (p̃) > hb (p̂)
Completeness pk0 b + 1, n hb (p̃) < hb (p)
Regularity p̃i = pi + β b, n + β p̂i = pi + β + α b, n + β + α hb (p̃) > hb (p̂)
Hom. growth p + α1n b, n + bα hb (p̃) > hb (p)

i.e. spq (w̃) > spq (w).


Proof A.64: [Regularity](cf. approx. in [30] under minor corrections.)

g(α) = spq (ŵ) − spq (w̃) (137)

≈ α∇ei spq (w̃), (138)

∂f
where g(0) = 0, ∇ei f (v) = ∂vi (v), and
!
∂spq 1 1 kw̃kp w̃iq−1 w̃ip−1
(w̃) = n q − p q − (139)
∂ w̃i kw̃kq kw̃kq kw̃kpp

positive, since vkq−1 kvkpp > vkp−1 kvkqq (expand to see); i.e. spq (ŵ) > spq (w̃).
Proof A.65: [Hom. growth](cf. approx. in [30] which divides by non-necessarily non-null coefficients.)

g(α) = spq (w) − spq (w̃(α)), (140)

where g(0) = 0. Then,


 0
1 1 kw + α1n kp
g 0 (α) = n q − p (141)
kw + α1n kq
p−1 q−1
!
1 1 kw̃kp kw̃kp−1 kw̃kq−1
=n q−p − > 0, (142)
kw̃kq kw̃kpp kw̃kqq

since kvkp−1 q p q−1


p−1 kvkq > kvkp kvkq−1 (expand to see), i.e. spq (w) > spq (w̃).

D. Proofs for Entropy Functions

Regularity II-B5 and homogeneous growth II-B6 use the histogram model: in each bin k, new qk count is added
to old count pk , then sample pmf p̃ is obtained after normalization k.k1 = 1 (using the number of realizations ñ).

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 28

1 1
Completeness II-B4 requires normalization by the number of events. The simple trial 1
b (instead of b q − p ) will do
it.
Proof A.66: [Continuity, symmetry] By the continuity and symmetry of `p -norms, and the continuity of the
logarithm and power functions.
1) Proofs for Shannon entropy:
Proof A.67: [Concentration](cf. [15].)

g(α) = hbShannon (p) − hbShannon (p̃(α)), (143)

where g(0) = 0. Then,

g 0 (α) = (log2 (pj − α)pj −α + log2 (pi + α)pi +α )0 (144)


pi + α
= log2 > 0, (145)
pj − α
since pi + α > pj − α, i.e. hbShannon (w) > hbShannon (w̃).
Proof A.68: [Scaling] Scaling holds by the `1 normalization of p, i.e. kpk1 = 1.
Proof A.69: [Replication] For entropy functions, replication equals scaling since each replica of the histogram
allocates proportional occurrences to events.
Proof A.70: [Bounds] Shannon entropy attains its maximum with the uniform distribution and its minimum with
a constant process [24].

hbShannon (p̃) = log2 b, (146)

hbShannon (p̂) = 0, (147)

i.e. hbShannon (p̃) > hbShannon (p̂).


Proof A.71: [Quasi-convexity] Shannon entropy is concave [24].
Proof A.72: [Monotonicity] By the Shannon entropy of the Bernoulli distribution [24].
Proof A.73: [Completeness]

g(w) = hbShannon (p) − hbShannon (p̃(p)) (148)


1
= hb (p) ≥ 0, (149)
b + 1 Shannon
since hbShannon (p) ≥ 0, i.e. hbShannon (p) ≥ hbShannon (p̃).
Proof A.74: [Regularity]

g(α) = hbShannon (p̂(α)) − hbShannon (p̂), (150)

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 29

where g(0) = 0. Then,


b  0
X qk qk
g 0 (α) = log2 ... (151)
ñ + α ñ + α
k=1
k6=i
 0
q˜i + α q˜i + α
··· + log2 (152)
ñ + α ñ + α
1 b
= (h (p̂) − hb1∞ (p̂)) > 0, (153)
n̂ Shannon
since hbShannon (p̂) > hb1∞ (p̂) (cf. Figs. 3 and 4), i.e. hbShannon (p̂) > hbShannon (p̂).
Proof A.75: [Hom. growth]

g(α) = hbShannon (p̃(α)) − hbShannon (p), (154)

where g(0) = 0. Then,


b  0
X qk + α qk + α
g 0 (α) = − log2 (155)
i=1
n + bα n + bα
b
X qk − n q̃k
= log2 > 0, (156)
ñ ñ
k=1

i.e. hbShannon (w̃) ≥ hbShannon (w).


2) Proofs for Rényi entropy:
Proof A.76: [Concentration]

g(α) = hbRenyi (p) − hbRenyi (p̃(α)), (157)

−1
where g(0) = 0. Let ξ = (1−q) log 2 . Then,
  0
b
X
g 0 (α) = ξ log  pqk + (pi + α)q + (pj − α)q  (158)
  
k=1
k6=i,j
ξ
= ((pi + α)q−1 − (pj − α)q−1 ) > 0, (159)
kp̃kqq
since pi + α > pj − α, i.e. hbRenyi (p) > hbRenyi (p̃).
Proof A.77: [Bounds]

hbRenyi (p̃) = log2 b, (160)

hbRenyi (p̂) = 0, (161)

i.e. hbRenyi (p̃) > hbRenyi (p̂).


Proof A.78: [Quasi-concavity] By Theorem A.2, with log2 increasing,
q
hbRenyi (p̂) = (log2 ◦ k.kq )(p). (162)
1−q |{z}
| {z } concave, q<1
>0, q<1 convex, q>1
<0, q>1

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 30

Proof A.79: [Monotonicity]

g(1) = hbRenyi (p̃) − hbRenyi (p̂) (163)


1 kp̃kq
= log2 , (164)
(1 − q) kp̂kq
since kp̃kq > (<)kp̂kq by the concavity (convexity) when q < (>)1, i.e. hbRenyi (p̃) < hbRenyi (p̂).
Proof A.80: [Completeness]

g(p) = hbRenyi (p) − hbRenyi (p̃(p)) (165)


1 b
= hRenyi (p) ≥ 0, (166)

since hbRenyi (p) ≥ 0, i.e. hbRenyi (p) > hbRenyi (p̃).
Proof A.81: [Regularity]

g(α) = hbRenyi (p̃) − hbRenyi (p̂(α)), (167)

where g(0) = 0. Then,


b  0  0
−1 X qk q̃k + α
g 0 (α) = + (168)
(1 − q) log 2 ñ + α ñ + α
k=1
k6=i
q
= (kp̂kqq − kp̂kq−1
∞ ) > 0, (169)
(1 − q) log 2kp̂kqq n̂
since kvkqq > (<)kvkq−1 b b
∞ when q < (>)1, i.e. hRenyi (p̃) > hRenyi (p̂).

Proof A.82: [Hom. growth]

g(α) = hbRenyi (p̃) − hbRenyi (p), (170)

where g(0) = 0. Then,


b  q 0
1 X qk + α
g 0 (α) = (171)
(1 − q) log2 n + bα
k=1
b
q X q−1
= q (p̃k − 1) > 0, (172)
(1 − q) log 2kpkq ñ
k=1

since p̃q−1
k < (>)1 with q > (<)1, i.e. hbRenyi (p̃) > hbRenyi (p̂).
3) Proofs for Tsallis entropy:
Proof A.83: [Concentration]

g(α) = hbTsallis (p) − hbTsallis (p̃(α)), (173)

where g(0) = 0. Then,


1
g 0 (α) = ((pi + α)q + (pj − α)q )0 (174)
(q − 1)
q
= ((pi + α)q−1 − (pj − α)q−1 ) > 0, (175)
(q − 1)

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 31

since pi + α > pj − α, i.e. hbTsallis (p) > hbTsallis (p̃).


Proof A.84: [Bounds]
1 − b1−q
hbTsallis (p̃) = , (176)
(q − 1)
hbTsallis (p̂) = 0, (177)

i.e. hbTsallis (p̃) > hbTsallis (p̂).


Proof A.85: [Quasi-concavity] By Theorem A.2, with (.)q increasing,
q
hbTsallis (p̂) = (1 − (.)q ◦ k.kq )(p). (178)
q−1 |{z}
| {z } concave,q<1
<0,q<1 convex,q>1
>0,q>1
Proof A.86: [Monotonicity]

g(1) = hbTsallis (p̃) − hbTsallis (p̂) (179)


kp̃kqq kp̂kqq
 
= −1 , (180)
(q − 1) kp̃kqq
since kp̃kq > (<)kp̂kq by the concavity (convexity) when q < (>)1, i.e. hbTsallis (p̃) < hbTsallis (p̂).
Proof A.87: [Completeness]

g(p) = hbTsallis (p) − hbTsallis (p̃(p)) (181)


1
= hb (p) ≥ 0, (182)
b + 1 Tsallis
since hbTsallis (p) ≥ 0, i.e. hbTsallis (p) > hbTsallis (p̃).
Proof A.88: [Regularity]

g(α) = hbTsallis (p̃) − hbTsallis (p̂(α)), (183)

where g(0) = 0. Then,


b  q  q 0
0 1 X qk q̃i + α
g (α) = + (184)
(1 − q) ñ + α ñ + α
k=1
k6=i
q
(kp̂kq−1 q
∞ − kp̂kq ) > 0, (185)
(1 − q)ñ
since kvkq−1 q b b
∞ > (<)kvkq with q < (>)1, i.e. hTsallis (p̃) > hTsallis (p̂).

Proof A.89: [Hom. growth]

g(α) = hbTsallis (p̃) − hbTsallis (p), (186)

where g(0) = 0. Then,


b  q 0
0 −1 X qk + α
g (α) = (187)
(1 − q) n + bα
k=1
−q
= (kp̃kq−1 q
q−1 − bkp̃kq ) > 0, (188)
(q − 1)ñ
since kvkq−1 q b b
q−1 ( )kvk0 kvkq with q ( )1, i.e. hTsallis (p̃) > hTsallis (p̂).

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 32

R EFERENCES

[1] K. Pearson, “Das Fehlergesetz und Seine Verallgemeinerungen Durch Fechner und Pearson. A Rejoinder,” Biometrika, vol. 4, no. 1-2, pp.
169–212, June 1905.
[2] H. Dalton, “The Measurement of the Inequity of Incomes,” The Economic Journal, vol. 30, no. 119, pp. 348–361, September 1920.
[3] R. V. L. Hartley, “Transmission of Information,” Bell System Technical Journal, vol. 7, pp. 535–563, 1928.
[4] C. Gini, “On the Measure of Concentration with Special Reference to Income and Statistics,” Colorado College Publication, General
Series, no. 208, pp. 73–79, 1936.
[5] C. E. Shannon, “A Mathematical Theory of Communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 623–656,
July, October 1948.
[6] A. Rényi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and
Probability, Volume 1: Contributions to the Theory of Statistics. University of California Press, 1961, pp. 547–561.
[7] J. Aczél and Z. Daróczy, “Charakterisierung der Entropien Positiver Ordnung und der Shannonschen Entropie,” Acta Mathematica
Academiae Scientiarum Hungarica, vol. 14, no. 1-2, pp. 95–121, 1963.
[8] H. Theil, Economics and Information Theory. North-Holland Publishing Company, 1967.
[9] A. B. Atkinson, “On the Measurement of Inequality,” Journal of Economic Theory, vol. 2, no. 3, p. 244263, September 1970.
[10] A. Lempel and J. Ziv, “On the Complexity of Finite Sequences,” IEEE Trans. Inf. Theory, vol. 22, no. 1, pp. 75–81, January 1976.
[11] C. Tsallis, “Possible Generalization of Boltzmann-Gibbs Statistics,” Journal of Statistical Physics, vol. 52, 1988.
[12] S. M. Pincus, “Approximate Entropy as a Measure of System Complexity,” Proceedings of the National Academy of Sciences of the United
States of America (Mathematics), vol. 88, no. 6, pp. 2297–2301, March 1991.
[13] N. R. Pal and J. C. Bezdek, “Measuring Fuzzy Uncertainty,” IEEE Trans. Fuzzy Syst., vol. 2, no. 2, pp. 107–118, May 1994.
[14] M. D. Esteban and D. Morales, “A Summary of Entropy Statistics,” Kybernetika, vol. 31, no. 4, pp. 337–346, 1995.
[15] Lin Yuan and H. Kesavan, “Minimum Entropy and Information Measure,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 28, no. 3,
pp. 488–491, August 1998.
[16] N. Ebrahimi, E. Maasoumi, and E. S. Soofi, “Ordering Univariate Distributions by Entropy and Variance,” Journal of Econometrics, vol. 90,
no. 2, pp. 317–336, June 1999.
[17] J. Richman and J. Moorman, “Physiological Time-Series Analysis using Approximate Entropy and Sample Entropy,” American Journal
of Physiology. Heart and Circulatory Physiology, vol. 278, no. 6, pp. H2039–H2049, 2000.
[18] J. Martı́n, G. Mayor Forteza, and J. Suñer, “On Dispersion Measures,” Mathware and Soft Computing, vol. 8, no. 3, pp. 227–237, 2001.
[19] P. O. Hoyer, “Non-negative Matrix Factorization with Sparseness Constraints,” Journal of Machine Learning Research, vol. 5, pp. 1457–
1469, December 2004.
[20] H. Suyari, “Generalization of Shannon-Khinchin Axioms to Nonextensive Systems and the Uniqueness Theorem for the Nonextensive
Entropy,” IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1783–1787, August 2004.
[21] D. L. Donoho, “For Most Large Underdetermined Systems of Linear Equations the Minimal `1 -norm Solution is also the Sparsest Solution,”
Communications on Pure and Applied Mathematics, vol. 59, pp. 797–829, 2004.
[22] A. M. Bronstein, M. M. Bronstein, M. Zibulevsky, and Y. Y. Zeevi, “Sparse ICA for Blind Separation of Transmitted and Reflected
Images,” International Journal of Imaging Systems and Technology, vol. 15, no. 1, pp. 84–91, 2005.
[23] R. Renner, “Security of Quantum Key Distribution,” PhD Thesis , ETH Zurich, 2005.
[24] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-
Interscience, 2006.
[25] E. J. Candès, J. Romberg, and T. Tao, “Robust Uncertainty Principles: Exact Signal Reconstruction From Highly Incomplete Frequency
Information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, February 2006.
[26] R. Chartrand, “Exact Reconstruction of Sparse Signals via Nonconvex Minimization,” Signal Processing Letters, IEEE, vol. 14, no. 10,
pp. 707–710, Octpber 2007.
[27] C. Akujuobi, O. Odejide, A. Annamalai, and G. Fudge, “Sparseness Measures of Signals for Compressive Sampling,” in Signal Processing
and Information Technology, 2007 IEEE International Symposium on, December 2007, pp. 1042–1047.
[28] I. Csiszár, “Axiomatic Characterizations of Information Measures,” Entropy, vol. 10, no. 3, pp. 261–273, 2008.

January 23, 2015 DRAFT


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. VOL, NO. NO, MONTH 2015 (DRAFT) 33

[29] A. Hormati and M. Vetterli, “Distributed Compressed Sensing: Sparsity Models and Reconstruction Algorithms using Annihilating Filter,”
in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, March 2008, pp. 5141–5144.
[30] N. Hurley and S. Rickard, “Comparing Measures of Sparsity,” IEEE Trans. Inf. Theory, vol. 55, no. 10, pp. 4723–4741, October 2009.
[31] A. Cambini and L. Martein, Generalized Convexity and Optimization. Springer, 2009, vol. 616.
[32] J. Haberl, “On Global Minima of Semistrictly Quasiconcave Functions,” Optimization Letters, vol. 3, no. 3, pp. 387–396, 2009.
[33] Tian Lan, D. Kao, Mung Chiang, and A. Sabharwal, “An Axiomatic Theory of Fairness in Network Resource Allocation,” in INFOCOM,
2010 Proceedings IEEE, March 2010, pp. 1–9.
[34] A. Cohen and A. Yeredor, “On the use of sparsity for recovering discrete probability distributions from their moments,” in Statistical
Signal Processing Workshop (SSP), 2011 IEEE, June 2011, pp. 753–756.
[35] D. Krishnan, T. Tay, and R. Fergus, “Blind Deconvolution using a Normalized Sparsity Measure,” in Computer Vision and Pattern
Recognition (CVPR), 2011 IEEE Conference on, June 2011, pp. 233–240.
[36] D. Zonoobi, A. A. Kassim, and Y. V. Venkatesh, “Gini Index as Sparsity Measure for Signal Reconstruction from Compressive Samples,”
Selected Topics in Signal Processing, IEEE Journal of, vol. 5, no. 5, pp. 927–932, September 2011.
[37] M. Pilanci, L. El Ghaoui, and V. Chandrasekaran, “Recovery of Sparse Probability Measures via Convex Programming,” in Proceedings
of Advances in Neural Information Processing Systems (NIPS), 2012, pp. 2429–2437.
[38] P. Cheridito and E. Kromer, “Reward-Risk Ratios,” Journal of Investment Strategies, Forthcoming, October 2013.
[39] T. Blumensath, “Compressed Sensing with Nonlinear Observations and Related Nonlinear Optimization Problems,” IEEE Trans. Inf. Theory,
vol. 59, no. 6, pp. 3466–3474, June 2013.
[40] M. Lopes, “Estimating Unknown Sparsity in Compressed Sensing,” in Proceedings of the 30th International Conference on Machine
Learning (ICML) (3), vol. 28, no. 3, May 2013, pp. 217–225.
[41] G. Pastor, I. Mora-Jiménez, R. Jäntti, and A. J. Caamaño, “Sparsity-Based Criteria for Entropy Measures,” in Wireless Communication
Systems (ISWCS 2013), Proceedings of the Tenth International Symposium on, August 2013, pp. 1–5.
[42] E. J. Candès, “Mathematics of Sparsity (and a Few Other Things),” in Proceedings of the International Congress of Mathematicians, 2014.

January 23, 2015 DRAFT

View publication stats

You might also like