Mathematics of Sparsity and Entropy: Axioms, Core Functions and Sparse Recovery
Mathematics of Sparsity and Entropy: Axioms, Core Functions and Sparse Recovery
Mathematics of Sparsity and Entropy: Axioms, Core Functions and Sparse Recovery
net/publication/271218681
CITATIONS READS
9 505
4 authors, including:
Antonio Caamaño
King Juan Carlos University
78 PUBLICATIONS 897 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Antonio Caamaño on 27 January 2015.
Abstract
Sparsity and entropy are pillar notions of modern theories in signal processing and information theory. However,
there is no clear consensus among scientists on the characterization of these notions. Previous efforts have contributed
to understand individually sparsity or entropy from specific research interests. This paper proposes a mathematical
formalism, a joint axiomatic characterization, which contributes to comprehend (the beauty of) sparsity and entropy.
The paper gathers and introduces inherent and first principles criteria as axioms and attributes that jointly characterize
sparsity and entropy. The proposed set of axioms is constructive and allows to derive simple or core functions
and further generalizations. Core sparsity generalizes the Hoyer measure, Gini index and pq-means. Core entropy
generalizes the Rényi entropy and Tsallis entropy, both of which generalize Shannon entropy. Finally, core functions
are successfully applied to compressed sensing and to minimum entropy given sample moments. More importantly,
the (simplest) core sparsity adds theoretical support to the `1 -minimization approach in compressed sensing.
Index Terms
Sparsity, entropy, axioms, compressive sampling, compressed sensing, Rényi entropy, Tsallis entropy.
This paper was presented in part in the Proceedings of the Tenth International Symposium in Wireless Communication Systems (ISWCS
2013).
G.P. (corresponding author, [email protected]) and R.J. are with the Department of Communications and Networking, Aalto University,
Finland.
G.P., I.M.J. and A.J.C. are with the Department of Information and Communication Technologies, King Juan Carlos University, Spain.
This work was supported by the Spanish Ministry of Economy and Competitiveness (TEC2013-4839-C4-1-R) and the Finnish Cultural
Foundation.
I. I NTRODUCTION
For a given signal x, the uncertainty (randomness) of its elements defines the compressibility (compactness)
of its coefficients w in a given domain. This dependence, between the (elements’) uncertainty and (coefficients’)
compressibility of a signal, suggests a connection between the two families of functions that measure these properties.
But, what axioms do these families have in common? This paper adopts the following definitions with the aim to
respond to this question. Let p be the probability mass function (pmf) of signal x.
Definition 1.1 (Compressibility or sparsity): The property of to concentrate most of the energy in few coefficients
of w.
Definition 1.2 (Uncertainty or entropy): The property of not to concentrate most of the probability mass in few
atoms of p.
Sparsity and entropy functions quantify properties 1.1 and 1.2, respectively. One approach to evaluate the goodness
of these functions is through validation of criteria (see [30] [33] for sparsity, [20] [28] for entropy, and [41] for
both). Following this approach, the paper gathers and introduces inherent and first principles criteria as axioms
and attributes that jointly characterize sparsity and entropy functions. As expected, it turns out that criteria of both
families of functions are strictly complementary.
In [16] a set of conditions are identified under which variance and entropy order distributions in a similar way.
This suggests that entropy could be extracted from the “shape” of distributions. Intuitively, for a given variance,
random processes with similar sorted pmf’s should present similar levels of entropy. Hence, for a given signal with
elements following a univariate distribution, the uncertainty of the undergoing process equals the non-compressibility
of the pmf. The following toy examples illustrate this relation. Let r.v. X ∈ {x1 , x2 } with pmf p = (p1 , p2 ).
1) Assume p1 > p2 : if p1 increases (thus p2 decreases since p1 , p2 ≥ 0 and kpk1 = 1), then x1 is even more
certain to appear, i.e. if the compressibility of p increases then the uncertainty of X decreases,
2) Assume p = (p1 , p2 ) = (1, 0) (a constant r.v.): if p2 increases (thus p1 decreases), then x1 is not the unique
possible outcome, i.e. if the compressibility of p decreases then the uncertainty of X increases,
Entropy functions (related to uncertainty, information, fuzziness or complexity measures) and sparsity functions
(related to compressibility, fairness or dispersion measures) are not regarded as easily characterizable. Hence these
functions have been thoroughly studied in numerous articles from different perspectives and research areas. The
following present relevant references which contribute explicitly to the set of axioms and attributes of sparsity and
entropy functions of Section II.
The main method for axiomatic characterization of entropy functions consist on to treat inherent or satisfactory
properties as axioms for (reference) Shannon entropy function. This started in 1948 with [5] which established
continuity and monotonicity of entropy functions; [13] added concavity for fuzziness functions and [18] relaxed it to
quasi-concavity for dispersion functions; [15] added concentration for information functions; [20] added maximality
for entropy functions; and [28] added symmetry for information functions.
On the other hand, although sparse data and sparsity has attracted a lot of attention in recent years, concentration,
scaling, homogeneous growth and replication were originally applied in 1920 by [2] in a financial setting to measure
the inequity of wealth distribution; [18] added bounds for dispersion functions; [30] [33] added symmetry and
continuity for sparsity and fairness functions, respectively; and [38] added quasi-convexity for reward-risk ratio
functions.
For each property, previous works have separately established some axioms and relationships, and derived
generalizations. The contribution of the paper is threefold.
1) The main contribution of this paper is a refined and constructive set of axioms (and attributes) of sparsity
and entropy functions (Section II), which allows to derive function generalizations.
2) The derived sparsity functions explain the benefits of proposed sparsity functions in Compressed Sensing and,
in particular, the effectiveness of the `1 -minimization approach (Section III).
3) The derived entropy functions generalize Rényi and Tsallis entropy, and offer simple formulations to the
sparse recovery of probability measures (Section IV).
The rest of the paper is organized as follows: For sparsity and entropy functions, Section II presents the axioms;
Sections III-IV describe existing functions and derive generalizations; Section V concludes; and the Appendix
contains the proofs.
II. A XIOMS
This section gathers and introduces inherent and first principles criteria as axioms and attributes that jointly
characterize sparsity and entropy functions.
Let X be an independent and identically distributed (i.i.d.) discrete random variable (r.v.), defined on a countable
sample space Ω = {x1 , . . . , xb } ⊆ R (b could be infinity), and let x = (xi ) ∈ Ωn be a vector containing n
realizations of r.v. X. The definition of vector x allows two possible interpretations. It is an arbitrary signal as a
whole and a random process formed by its elements. Throughout this paper both interpretations are adopted, and
different operations are applied on vector x according to the analysis performed. The analysis of sparsity will be from
the signal’s compressibility perspective. Analogously, the analysis of entropy will be from the process’s uncertainty
perspective. The analogies between these perspectives are summarized in Table I. The following operations will
provide adequate representations for the analyzes.
For the compressibility assessment of signal x, a domain transformation is applied using basis Φ ∈ Rn×n ,
Table I
A NALOGIES BETWEEN SPARSITY FUNCTIONS AND ENTROPY FUNCTIONS .
where w = (wi ) is the (vector of) coefficients of x. For ease of notation, since compressibility is measured from
magnitudes, the energy values of coefficients w are assumed non-negative, i.e. w ∈ Rn+ . Essentially, redefine
coefficients (in terms of true coefficients) as wi = |wi |. Under this model, a sparsity function, which measures the
compactness of signal x under basis Φ, is
For the uncertainty calculation of process x, a b-bins-based `1 -normalized histogram method is applied using
operator Ψ,
where p = (pj ) is the sample probability mass function (pmf) of x, with kpk1 = 1 and
cardinality({i : xi ∈ xj })
pj = ≈ P(X = xj ). (6)
n
Similarly, define Sb+ = {p ∈ [0, 1]b : kpk1 = 1} (probability simplex). Then, for a given variance σ 2 , an entropy
function, which measures the randomness of process x from its outcomes’ occurrences using method Ψ, is
Under this notation (see Table II for a summary), the following enumerates a collection of axioms and attributes
of sparsity and entropy functions. Axioms and attributes describe the effect of different actions on the argument
of functions, e.g. changes on its pattern, ratios or relative differences (see Table III for mathematical description).
In the case of entropy analysis, and in order to stay inside the probability simplex, the actions related to regularity
II-B5 and homogeneous growth II-B6 require a subsequent normalization kp̃k1 = 1, where p̃ denotes p after the
action. The histogram model described above makes this normalization step transparent.
A. Axioms
1) Continuity (II-A1):
S. (Sparsity) Slight energy perturbation retains the ratio of dominant and negligible coefficients. [33]
H. (Entropy) Slight probability mass perturbation retains the ratio of unlikely and likely events. [5]
Table II
S UMMARY OF NOTATION AND FUNCTIONS .
Symbol Description
s(.) sparsity function
hb (.) entropy function given b events (states)
Ω countable sample space, Ω = {x1 , . . . , xb } ⊆ R
P probability measure defined on Ω
X random variable, X : ω 7→ X(ω) = ω
x = (xi ) vector with elements xi ∈ Ω (realizations of X)
n length of vector x
Φ domain transformation, Φ ∈ Rn×n
w = (wi ) vector with coefficients wi of x, w = Φx
Ψ b-bins histogram operator, Ψ
p = (pj ) sample probability mass function of x, p = Ψx
b number of events (states) or bins, length of vector p
xj j-th event in the histogram method
v = (vk ) arbitrary non-null vector (v 6= 0) with elements vk
k.k0 `0 -“norm”, kvk0 = cardinality(supp(v))
1
`p -norm, kvkp = ( k |vk |p ) p , p ∈ (0, ∞)
P
k.kp
k.k∞ `∞ -norm, kvk∞ = maxk |vk |
P
k.k1∗ `1∗ weighted norm, kvk1∗ = k k|vk |, vi ≥ vj , i > j
.k. concatenation operator of vectors
h., .i inner product of vectors
π(.) permutation in the order of elements
Sn
+ n-simplex, Sn b
+ = {v ∈ [0, 1] : kvk1 = n}
0k , 1k vector of all zeros and ones of length k, resp.
2) Symmetry (II-A2):
S. Energy permutation retains the ratio of dominant and negligible coefficients. [18]
H. Probability mass permutation retains the ratio of unlikely and likely events. [28]
3) Concentration (II-A3) (Dalton’s 1st Law):
S. Moving energy from negligible to dominant coefficients shorten the set of dominant coefficients. [2]
H. Moving probability mass from unlikely to likely events shorten the set of likely events. [15]
4) Scaling (II-A4) (Dalton’s modified 2nd Law):
S. Scaled coefficients contain proportional amounts of energy in proportional number of coefficients. [2]
H. Scaled occurrences contain proportional amounts of probability mass in proportional number of events (or
by the kpk1 = 1 normalization).
5) Replication (II-A5) (Dalton’s 4th Law):
S. Concatenating replicas of (all) coefficients retains the ratio of dominant and negligible coefficients. [2]
H. Adding realizations following the same law retains the ratio of unlikely and likely events. This axiom
affirms that entropy is intrinsic to a given process (see Subsection IV-B for details).
B. Attributes
1) Bounds (II-B1):
S. The least (most) compressible coefficients allocate (all) its energy on all coefficients (on one single
coefficient), e.g. a white noise (a single “tone”). [18]
H. The most (least) uncertain distribution allocates (all) its probability mass on all events (on one single
event), e.g. a uniform r.v. (a constant r.v.). [5]
Lemma 2.1: II-A3 and II-A4 imply II-B1.
2) Quasi-convexity (II-B2):
S. Quasi-convexity prefers extremes than averages. [38]
H. Quasi-concavity encourages diversification. [13]
Lemma 2.2: II-A1, II-A2, II-A3 and II-A4 imply II-B2.
3) Monotonicity (II-B3): A simple version of II-A3.
S. For two signals with non-zero pair of coefficients, the signal with the largest ratio highest-energy by
lowest-energy is more compressible. [33]
H. For two processes with non-impossible pair of events, the process with the largest ratio highest-mass by
lowest-mass is less uncertain.
Lemma 2.3: II-A3 implies II-B3.
4) Completeness (II-B4):
S. Adding zero-energy coefficients concentrates the energy in relatively less coefficients. [30] (Babies)
H. (Conditioning on) having impossible (zero-mass) events concentrates the mass in relatively less events.
Lemma 2.4: [30] II-A3, II-A4 and II-A5 imply II-B4.
5) Regularity (II-B5):
S. Significant concentration of energy in one single coefficient makes the rest negligible. [30] (Bill Gates)
H. Significant concentration of probability mass in one single event makes the rest unlikely.
Lemma 2.5: [30] II-A3 and II-A4 imply II-B5.
6) Homogeneous growth (II-B6) (Dalton’s 3rd Law):
S. Increasing relatively more the energy of negligible coefficients makes them relatively more dominant. [2]
H. Increasing relatively more the probability mass of unlikely events makes them relatively more likely.
Lemma 2.6: II-A4, II-B1 and II-B2 imply II-B6.
7) Schur-convexity (II-B7):
S. Schur-convexity holds.
H. Schur-concavity holds.
Lemma 2.7: II-A2 and II-B2 imply II-B7.
Table III
A XIOMS II-A1-II-A5 AND ATTRIBUTES II-B1-II-B8 OF SPARSITY FUNCTIONS AND ENTROPY FUNCTIONS , WITH α, β > 0.
II-B6 Hom. growth w̃ = w + α1n , s(w̃) < s(w) p̃ = p + α1b , hb (p̃) > hb (p)
II-B7 Schur-convexity w w̃, s(w) ≥ s(w̃) p p̃, hb (p) ≤ hb (p̃)
II-B8 Triangle Ineq. s(.) ≥ 0, s(w + w̃) ≤ s(w) + s(w̃) hb (.) ≤ 0, hb (p + p̃) ≥ hb (p) + hb (p̃)
1 The quotation marks (“ ”) help to preserve and tries to alert about the misleading notation which treats `0 -pseudo- (`1 -) norm as a sparsity
function. In fact, `0 -pseudo- (`1 -) norm measures non-sparsity. This misleading notation led [30] to treat `0 (no quotation marks) as a sparsity
function.
Table IV
SPARSITY FUNCTIONS .
Table V
T EST RESULTS OF SPARSITY FUNCTIONS . (L EMMAS 2.#, P ROOFS A.##.)
where k.ks measures sparsity, and Φ and Π are the sparsifying and measurement matrices, respectively.
For (8) and related problems, the adoption of the `1 -norm as a proxy function allows to derive efficient algorithms
and performance guarantees [21]. “`1 is a convex surrogate for `0 count. It is the best surrogate in the sense that
the `1 ball is the smallest convex body containing all 1-sparse objects of the form ±ei ” [42]. This function will be
denoted “`1 ”. “`1 ” is not though a sparsity function [30] and it should not be treated as such. However, Subsection
III-B will validate its adoption in (8). Other sparsity functions proposed for (8) will be discussed later. Among
these functions, the kurtosis [1], Gini index [4], Hoyer measure [19] and pq-means {p ≤ 1, q > 1} [30] (denoted
pq-means in the following) satisfy most of the six-rules criteria of [30]. [41] extends these criteria and derives
max-sparsity. Table IV defines all these functions, and Table V shows the results of the test using axioms and
attributes of Section II.
Further inspection of sparsity functions (see Table IV) reveals a parallel in their construction [41]. Metrics of
inequality of wealth or fairness, i.e. non-sparsity, exhibit the same parallel, e.g. the Theil [8] and Atkinson [9]
indexes. This simple observation allows to derive a generalization.
Theorem 3.1: [Core sparsity] The function
1 1 kwkp
spq (w) = −n q − p , {p ≤ 1, q ≥ 1, p 6= q}, (9)
kwkq
satisfies axioms II-A1-II-A5.
Core sparsity is continuous (except at the origin), symmetric, scale-invariant and semi-strictly quasi-convex.
The kurtosis (p = 4, q = 2) is out of the range for p and q in (9). This function failed the test using criteria of
[30] hence it will not be further analyzed. Nevertheless, the Gini index (weighted mean p = 1∗ , q = 1) and Hoyer
measure (p = 1, q = 2) are positive affine transformations of core sparsity. These transformations are increasing
hence they preserve the quasi-convexity attribute II-B2 by Theorem A.2 of [31].
More importantly, core sparsity extends pq-means. [22] presents pq-means with {p ≤ 1, q = 2}. “This, (`p )
normalization by the `2 -norm may turn out to be the best sparseness criterion. This however, has yet to be further
investigated. [22]”. And although [30] grants the origin (citation principle) of pq-means to [22], (anonymous)
reviewers suggested these functions to [30] as stated in its acknowledgments. Thus, pq-means have no clear origin
neither a formal derivation in the literature. Thus, the Appendix derives core sparsity (which extends pq-means)
from axioms II-A1-II-A5.
A key factor of core sparsity is its resemblance to “`0 ”. Fig. 1 shows this resemblance. Figs. 1(d)-1(f) show
the smoothness, symmetry and quasi-convexity of sparsity functions; and Figs. 1(g)-1(l) show that the unit balls
of their respective “core” functions and of “`0 ” are the same. A difference between functions in Fig. 1 is their
behavior at the origin. “`0 ” attains its maximum at the origin, which makes sense from a practical perspective since
no information is required to reconstruct null vectors. However, core sparsity is not defined at the origin due to
continuity, e.g. in Fig. 1,
Notably, p and q parameters localize the domain of core sparsity. Fig. 2(b) presents this localization process
1
for w following a power law decay, i.e. w = (wi ), with wi = ri− ρ . Specifically, Fig. 2(a) shows the versatility
of max-sparsity (and of “`1 ”) to assess the sparsity of arbitrarily-sparse coefficients, and the localized domain of
the kurtosis, Gini index and Hoyer measure, which are better distinguishing among highly-sparse coefficients [41].
This would suggest the (slightly) superior performance, in terms of number of measurements m (the dimension of
y), of recovery strategies based on these last functions compared with the common `1 -minimization, as reported
in [26] for `p , with p ∈ (0, 1); in [36] for Gini index, which stochastic algorithm tends to be unstable; in [27] for
kurtosis, although the function selects inappropriate basis; and in [35] for Hoyer measure in the framework of image
regularization. Thus, appropriate choice of parameters p and q would offer a localized-sparsity formulation of (8),
(a) “`0 ” (b) “`1 ” (proxy) (c) Kurtosis (d) Gini index (e) Hoyer measure (f) max-sparsity
1 1 1 1 1 1
0 0 0 0 0 0
-1 -1 -1 -1 -1 -1
-1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1 -1 -0.5 0 0.5 1
Figure 1. Sparsity functions (normalized to [0, 2]) (a)-(f) for w ∈ [0, 1]2 , and the unit ball of their “core” functions (g)-(l) for w ∈ [−1, 1]2 .
1 1
"L1" (Ref.) Ref.
max sparsity q = 32
0.8 Hoyer 0.8 q = 16
Gini q=8
Kurtosis q=4
q=2
0.6 0.6
q = 1.5
q = 1.25
q = 1.125
0.4 0.4
0.2 0.2
0 0
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
ρ ρ
1
−ρ
Figure 2. Sparsity (normalized to [0, 1]) of coefficients w = (wi ) of length n = 1000, with wi = ri (power law decay), r = 1.2 and
ρ ∈ [0.03, 8].
since core sparsity strongly encourages coefficients of a given (or estimated) level of sparsity. Further, core sparsity
is differentiable (by construction) which is appropriate for gradient-based optimization, and by the quasi-convexity,
its solution to (8) is attained at an extreme point of the polytope defined by linear (equality) constraints [32], e.g.
Πw = y. Then, a simple exact algorithm can solve (8) by walking along the extreme points of the feasible region.
Still, the potential of core sparsity in sparse recovery has not been studied.
“`1 ” fails the test using axioms and attributes as shown in Table V hence it is not a sparsity function. But “`1 ”
is able to solve (8) in which it commonly plays the role of objective functional. Further, it offers a computational
efficient convex, actually linear, formulation. Certainly core sparsity can not improve this formulation due to its
quasi-convex nature, which leads to quasi-convex maximization in (8). Nevertheless, maximization of sparsity via
1
Proof 3.3: Let t = |wmax | , ui = twi , and vi = 1t ui . Then,
maxw − 1 kwk1
minw |w1 |+···+|wn |
n kwk∞ |wmax |
≡ (12)
s.t. Πw = y, s.t. Πw = y,
min|u|≤1 |u1 | + · · · + |un |
≡ s.t. Πu = yt (13)
t ≥ 0,
min|v|≤ 1t |v1 | + · · · + |vn |
≡ t≥0 (14)
s.t. Πv = y.
Theorem 3.2 does not (only) state that s1∞ -maximization is equivalent to `1 -minimization, but that `1 -minimization
can be formulated in (8) since it is a (simple) reformulation of s1∞ -maximization: essentially, the linear constraint
in (8) ensures the appropriate normalization (by the maximum) that allows “`1 ” to perform as an actual spar-
sity function, e.g. Proof A.30. This result clearly extends to `p -minimization, with p ∈ (0, 1) [26], which is a
reformulation of sp∞ -maximization, with p ∈ (0, 1).
More importantly, the formalism of the present paper and the generalized convexity framework developed by
[31] offer an opportunity to face the non-linear-constrained version of (8), i.e. the non-linear Compressed Sensing
problem [39].
Entropy is a concept of physics and information theory that characterizes the unpredictability of systems’ states
and processes’ events.
In physics, the Tsallis entropy [11] generalizes Boltzmann-Gibbs entropy. In information theory, Rényi entropy
[6] generalizes entropy functions via its parameter p ≥ 0: Hartley entropy or log max entropy [3] (p = 0) counts the
cardinality of non-zero probability events; Shannon entropy (p = 1) follows a similar expression than Boltzmann-
Gibbs entropy to measure the average unpredictability of events; collision entropy (p = 2); log min entropy [23]
(p = ∞) is the most conservative entropy function since it measures the unpredictability of the most likely event.
Hence log min entropy is never greater than Shannon entropy (see Figs. 3-4). Table VI defines all these functions,
and Table VII shows the results of the test using axioms and attributes of Section II.
2 max-sparsity measures the non-similarity between w and 1n , the least compressible vector. It was originally denoted s1∞ in [41].
Table VI
ENTROPY FUNCTIONS .
Table VII
T EST RESULTS OF ENTROPY FUNCTIONS . (L EMMAS 2.#, P ROOFS A.##.)
In computer science, entropy characterizes the complexity of sequences. In this context, entropy functions are
called complexity algorithms. Among these algorithms, Lempel-Ziv complexity [10] characterizes the randomness of
a sequence of symbols by measuring its rate of (new) pattern generation. Following a distinct approach, Approximate
entropy [12] examines finite-length time series for similar epochs, and Sample entropy [17] performs similarly but
without counting self-matches. Both algorithms emerged from the formulation of Kolmogorov complexity for finite-
sample approximations. Axioms and attributes of Section II are not applicable to these algorithms but they will be
used for comparison (see Figs. 3-4).
Although Rényi entropy and Tsallis entropy are concepts from different fields and are applied to different objects,
both functions posses the same parallel that was identified in sparsity functions (see Tables IV and VI). The key
1 1
Shannon (Ref.) Ref.
Lempel-Ziv q = 1.125
0.8 Apen 0.8 q = 1.25
SampEn q = 1.5
min entropy q=2
q=4
0.6 0.6
q=8
q = 16
q = 32
0.4 0.4
0.2 0.2
0 0
0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8
π π
Figure 3. Entropy (normalized to [0, 1]) of p = (π, 1 − π) of n = 10000 samples xi ∼ Bern(π), with π ∈ [0, 1].
difference between Rényi entropy and Tsallis entropy (and sparsity functions of Section III) is the logarithm
transformation which provides the information units to Rényi entropy and which, at the same time, preserves the
quasi-concavity attribute by Theorem A.2 of [31]. The same applies to the power function in Tsallis entropy. Thus,
following Section II (which states the complementary behavior of sparsity functions and entropy functions) and
Section III (which introduces core sparsity), and removing all unit transformations, allows to derive a generalization.
Theorem 4.1: [Core entropy] The function
kpkp
hbpq (p) = , {p ≤ 1, q ≥ 1, p 6= q}, (15)
kpkq
satisfies axioms II-A1-II-A5.
Core entropy is continuous (except at the origin which corresponds to the empty set), symmetric and semi-strictly
quasi-concave. It follows the quotient-of-weighted-functions form of general entropy functions as in [14], e.g.
Aczel-Daróczy entropy [7].
Interestingly, the application of function a(.) = − 1. to core entropy increases the domain of its parameters p
and q to {p ≥ 1, q ≤ 1, p 6= q} and makes core entropy negative. Then, the Tsallis entropy (p 6= 1, q = 1), Rényi
entropy (p 6= 1, q = 1) and their special cases (Shannon, etc.) are transformations (by the logarithm and power
functions) of core entropy. All these previous transformations (including a(.)) are increasing hence they preserve
the (now) quasi-concavity attribute II-B2 by Theorem A.2 of [31]. Note that the domain extension of parameters p
and q to {p ≥ 1, q ≤ 1, p 6= q} also applies to core sparsity, which will be now positive.
Some features of core entropy are shown in Fig. 3 which presents the entropy of a r.v. following the Bernoulli
distribution with parameter π, the canonical example of information theory textbooks [24]. Fig. 3(a) shows the
smoothness, monotonicity, symmetry and quasi-concavity of entropy functions and complexity algorithms. Fig.
3(b) shows the same features in core entropy. Setting parameter p = 1 allows to appreciate the most interesting
case: core entropy tends to Shannon entropy as q ↓ 1. Unsurprisingly, the same tendency is present for q = 1 fixed
and p ↑ 1.
For general univariate distributions, [16] identifies conditions under which entropy defines an order (on distri-
butions). Fig. 4 shows this ordering phenomenon for several distributions. For a better visual comparison, these
1 1
0.8 0.8
0.6 0.6
Ref.
q = 1.125
q = 1.25
0.4 0.4
q = 1.5
Shannon (Ref.) q=2
Lempel-Ziv q=4
0.2 Apen 0.2 q=8
SampEn q = 16
min-entropy q = 32
0 0
2 4 6 8 2 4 6 8
Distributions Distributions
Figure 4. Entropy and ordering of univariate distributions. Unit-variance test distributions: (1) Constant (all probability mass in one single
event, σ = 0), (2) Pareto, (3) Frechet, (4) Log-normal, (5) Exponential, (6) Gamma, (7) Weibull, (8) Gumbel, and (9) Gaussian. Parameters:
deviation σ = 1, alphabet size b = 10, m = 2 and r = 0.2σ.
test distributions do not include the uniform distribution which reaches the maximum in entropy functions. Fig.
4(a) shows the ordering obtained using entropy functions and complexity algorithms. Although its simplicity, hb1∞
core entropy or min-entropy respects the same trend ordering reported by other functions. As in Fig. 3(b), Fig.
4(b) shows that the ordering trend drawn by core entropy, with p = 1 fixed and q ↓ 1, tends to resulting ordering
obtained using Shannon entropy.
Core sparsity is a relative measure and core entropy is an absolute measure. The following discusses this subtle
difference and its consequences on the way both functions follow the axioms (and attributes) of Section II.
For entropy functions, replication II-A5 does not refer to a larger number of events or states. As the number
of “replicas” of p goes to infinity, replication II-A5 affirms that entropy is intrinsic to a process (system). Hence
entropy functions measure an absolute property as is the uncertainty of a process (system) of fixed possible events
(states). Analogously, sparsity functions measure a relative property as is the efficiency of a basis representation or
compressibility. For example, coefficients w = (1, 0) and w̃ = (1, 1, 0, 0) both require 50% of their content for an
exact representation.
1 1
This subtle difference is characterized by the dimension normalization n q − p , which is present only in core
sparsity (cf. Proof A.10). Furthermore, (discrete) entropy is defined for discrete r.v.s on countable sample spaces
1 1
which cardinality could be infinity, i.e. b → ∞. Thus, normalization of core entropy by b q − p would not be possible
in general.
More importantly, replication II-A5 and completeness II-B4 should not be confused with the following two
common axioms of entropy functions [28].
Lemma 4.2: Core entropy satisfies
hm n
pq (1m ) < hpq (1n ). (17)
Proof 4.5:
16
hm n
pq (1m ) = hpq (1m k0n−m ) (18)
II-B4 n
< hpq (1n ), (19)
which is also obvious given 1m k0n−m is more “compressible”, hence less uncertain, than 1n . Or by simple
1 1 1 1
calculation: hm p − q , hn (1 ) = n p − q .
pq (1m ) = m pq n
Consider the minimum (Shannon) entropy problem on the probability simplex with sample moment constraints
[34],
min exp(hbShannon (p))
p≥0
kpk1 =1 (20)
s.t. Ep [X k ] = µk , 1 ≤ k ≤ m − 1,
Since log max entropy upper bounds Shannon entropy, the solution of (20) is upper-bounded by the solution of
minp≥0 hbmax (p)
minp≥0 kpk0
≡ (21)
s.t. Ap = µ s.t. Ap = µ,
where hbmax denotes hb01 core entropy or max-entropy. Deterministic measurement matrix A is Vandermonde [29],
and linear constraint Ap = µ includes the kpk1 = 1 normalization. Then, under the common Compressed Sensing
hypotheses, (21) is equivalent to
minp≥0 kpk1 minp≥0 1
≡ (22)
s.t. Ap = µ s.t. Ap = µ.
However, formulation (22) is useless in this case since it produces a feasibility problem which solution is not
unique if the problem is under-determined. Nevertheless, by the linearity of the constraints and Theorem 3.2, (22)
is equivalent to s1∞ maximization, which is equivalent to minimization of hb1∞ or min-entropy,3 denoted in the
following hbmin .
Theorem 4.6: Under an appropriate Vandermonde matrix A, min-entropy minimization solves (20) uniquely.
3 min-entropy measures the similarity between p and 1b , the most uncertain distribution, i.e. uniform distribution [41].
min hbmin (p) ≤ min hbShannon (p) ≤ min hbmax (p) . (23)
| {z } | {z }
=min hb1∞ (p) ≡min kpk0
≡max s1∞ (p) ≡min kpk1
≡min kpk1
V. C ONCLUSIONS
A complete characterization of sparsity and entropy offered new functions and tools to solve efficiently problems
of sparse recovery. This paper proposed a formalism, a joint axiomatic characterization, for sparsity and entropy.
The proposed set of axioms is constructive and allows to derive the core functions, core sparsity and core entropy.
Both core functions are simple functional forms found in well-known sparsity and entropy functions. Finally, core
functions were applied to the sparse recovery problem where they offered efficient formulations. More importantly,
the (simplest) core sparsity validates the use of the `1 -norm as a (non-)sparsity measure in compressed sensing
problem which further provides insights to this problem.
Future work will follow two directions. Concerning entropy, to study the compatibility of proposed axioms (and
attributes) and specific axioms of entropy, especially those related to relative and conditional entropy. Some of these
specific axioms were already verified here. Concerning sparsity, to study the potential of core sparsity in a general
optimization framework, where core sparsity can be “tuned” to match the optimal recovery strategy according to a
sparsity level.
A PPENDIX
In this appendix Lemmas 2.1-2.8 are proved, which derive attributes II-B1-II-B8 from axioms II-A1-II-A5. The
appendix also contains the proofs listed in Tables V and VII where functions, applied to non-zero vectors, are
evaluated using axioms and attributes of Section II. These proofs give a refinement of the formalism compared to
proofs of [30] of which some proofs are corrected.
Tables VIII-IX describe the notation adopted in the proofs related to sparsity and entropy functions, respectively.
The proofs are based on the following key results of [31].
Theorem A.1: [31] Let f , g be defined on S convex, and
f (w)
s(w) = . (24)
g(w)
If any of the following properties hold
Theorem A.2: [31] (Also [18].) Let s be semi-strictly quasi-convex (-concave) defined on S ⊆ Rn convex. Let
a : A → R be increasing, with s(S) ⊆ A. Then, a ◦ s is semi-strictly quasi-convex (-concave) on S.
Examples of function a(.) are the log(.) and − 1. (which changes the sign and allows to prove Lemma 2.8).
A. Proofs of Lemmas
1
Proof A.3: [Lem. 2.1, Bounds] Let w arbitrary with kwk = 1. Construct ŵ = 0n−1 k1 and w̃ = n 1n from w
via concentration actions II-A3. Then,
II-A3 II-A3
s(w̃) ≤ s(w) ≤ s(ŵ). (25)
Proof A.4: [Lem. 2.2, Quasi-convexity] Let s continuous II-A1, symmetric II-A2, homogeneous of degree 0 II-A4
f
and non-negative. Then, s can be written as g, with f and g continuous, symmetric, homogeneous of degree 1 and
non-negative. Construct w̃ from w via a concentration action II-A3. Then, it holds that
Proof A.7: [Lem. 2.5, Regularity][30] Let w̃ = [w1 , . . . , wi + β, . . . , wn ] and 0 < = 1 + δ. Construct ŵ from
w = w̃ via concentration actions II-A3,
= wi + β + α, (38)
P
where α = δ(β + j wj ). Then,
II-A4
s(w0 ) = s(w1 ) (39)
II-A3
< s(w2 ). (40)
Proof A.8: [Lem. 2.6, Hom. growth] Let w arbitrary, w̃ = 1n and α < 1. Then,
1−α II-A4
s w+ w̃ = s(αw + (1 − α)w̃) (41)
α
II-B2
< max{s(w), s(w̃)} (42)
II-B1
= s(w). (43)
Proof A.9: [Lem. 2.8, Triangle ineq.] Let w and w̃ arbitrary and α < 1,
II-B2
s(αw + (1 − α)w̃) ≤ max{s(w), s(w̃)} (44)
s>0
≤ s(w) + s(w̃) (45)
II-A4
= s(αw) + s((1 − α)w̃). (46)
Proof A.10: [Thm. 3.1, Thm. 4.1.] Since concentration II-A3 implies quasi-convexity II-B2, Theorems A.1 and
A.2, with a(.) = − 1. increasing, state that sparsity s can be written as
g(w)
s(w) = −ξ(n) (47)
f (w)
with ξ(n) > 0, ∀n, responsible of the dimension-invariance of s or replication II-A5; and g > 0 concave and
f > 0 convex, and both functions continuous II-A1, symmetric II-A2 and homogeneous of degree 1, such that s
is homogeneous of degree 0 or scale-invariant II-A4. By the homogeneity, f (0n ) = 0 and g(0n ) = 0 (hence s
cannot be defined at the origin). The previous description of f and q resembles vector norms and leads to the pair
of candidate functions
Table VIII
S UMMARY OF NOTATION USED IN PROOFS RELATED TO SPARSITY FUNCTIONS .
kw̃k0 = 2, (56)
2, ŵ < α,
kŵk0 = (57)
1, otherwise,
Proof A.25: [Quasi-convexity] Assume kw̃k1 ≤ kwk1 , then, s1 (w̃) ≥ s1 (w). Since w, w̃ non-negative,
= kŵk1 , (66)
2nkw̃k−2
1
= (wn−i+1 − wi )(n + 1 − 2i) > 0, (109)
(n + 1) i=1
since wn−i+1 > wi and n + 1 > 2i with i ≤ b n2 c, i.e. sGini (w) > sGini (w̃).
5) Proofs for s1∞ :
Proof A.49: [Concentration]
positive, since vkq−1 kvkpp > vkp−1 kvkqq (expand to see); i.e. spq (w̃) > spq (w).
Proof A.58: [Scaling](cf. [30].)
kw̃kp kwkp
= , (128)
kw̃kq kwkq
i.e. spq (w̃) = spq (w).
Proof A.59: [Replication](cf. [30].)
positive, since vkq−1 kvkpp > vkp−1 kvkqq (expand to see); i.e. spq (ŵ) > spq (w̃).
Proof A.63: [Completeness](cf. [30].)
Table IX
S UMMARY OF NOTATION USED IN PROOFS RELATED TO ENTROPY FUNCTIONS .
∂f
where g(0) = 0, ∇ei f (v) = ∂vi (v), and
!
∂spq 1 1 kw̃kp w̃iq−1 w̃ip−1
(w̃) = n q − p q − (139)
∂ w̃i kw̃kq kw̃kq kw̃kpp
positive, since vkq−1 kvkpp > vkp−1 kvkqq (expand to see); i.e. spq (ŵ) > spq (w̃).
Proof A.65: [Hom. growth](cf. approx. in [30] which divides by non-necessarily non-null coefficients.)
Regularity II-B5 and homogeneous growth II-B6 use the histogram model: in each bin k, new qk count is added
to old count pk , then sample pmf p̃ is obtained after normalization k.k1 = 1 (using the number of realizations ñ).
1 1
Completeness II-B4 requires normalization by the number of events. The simple trial 1
b (instead of b q − p ) will do
it.
Proof A.66: [Continuity, symmetry] By the continuity and symmetry of `p -norms, and the continuity of the
logarithm and power functions.
1) Proofs for Shannon entropy:
Proof A.67: [Concentration](cf. [15].)
−1
where g(0) = 0. Let ξ = (1−q) log 2 . Then,
0
b
X
g 0 (α) = ξ log pqk + (pi + α)q + (pj − α)q (158)
k=1
k6=i,j
ξ
= ((pi + α)q−1 − (pj − α)q−1 ) > 0, (159)
kp̃kqq
since pi + α > pj − α, i.e. hbRenyi (p) > hbRenyi (p̃).
Proof A.77: [Bounds]
since p̃q−1
k < (>)1 with q > (<)1, i.e. hbRenyi (p̃) > hbRenyi (p̂).
3) Proofs for Tsallis entropy:
Proof A.83: [Concentration]
R EFERENCES
[1] K. Pearson, “Das Fehlergesetz und Seine Verallgemeinerungen Durch Fechner und Pearson. A Rejoinder,” Biometrika, vol. 4, no. 1-2, pp.
169–212, June 1905.
[2] H. Dalton, “The Measurement of the Inequity of Incomes,” The Economic Journal, vol. 30, no. 119, pp. 348–361, September 1920.
[3] R. V. L. Hartley, “Transmission of Information,” Bell System Technical Journal, vol. 7, pp. 535–563, 1928.
[4] C. Gini, “On the Measure of Concentration with Special Reference to Income and Statistics,” Colorado College Publication, General
Series, no. 208, pp. 73–79, 1936.
[5] C. E. Shannon, “A Mathematical Theory of Communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 623–656,
July, October 1948.
[6] A. Rényi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and
Probability, Volume 1: Contributions to the Theory of Statistics. University of California Press, 1961, pp. 547–561.
[7] J. Aczél and Z. Daróczy, “Charakterisierung der Entropien Positiver Ordnung und der Shannonschen Entropie,” Acta Mathematica
Academiae Scientiarum Hungarica, vol. 14, no. 1-2, pp. 95–121, 1963.
[8] H. Theil, Economics and Information Theory. North-Holland Publishing Company, 1967.
[9] A. B. Atkinson, “On the Measurement of Inequality,” Journal of Economic Theory, vol. 2, no. 3, p. 244263, September 1970.
[10] A. Lempel and J. Ziv, “On the Complexity of Finite Sequences,” IEEE Trans. Inf. Theory, vol. 22, no. 1, pp. 75–81, January 1976.
[11] C. Tsallis, “Possible Generalization of Boltzmann-Gibbs Statistics,” Journal of Statistical Physics, vol. 52, 1988.
[12] S. M. Pincus, “Approximate Entropy as a Measure of System Complexity,” Proceedings of the National Academy of Sciences of the United
States of America (Mathematics), vol. 88, no. 6, pp. 2297–2301, March 1991.
[13] N. R. Pal and J. C. Bezdek, “Measuring Fuzzy Uncertainty,” IEEE Trans. Fuzzy Syst., vol. 2, no. 2, pp. 107–118, May 1994.
[14] M. D. Esteban and D. Morales, “A Summary of Entropy Statistics,” Kybernetika, vol. 31, no. 4, pp. 337–346, 1995.
[15] Lin Yuan and H. Kesavan, “Minimum Entropy and Information Measure,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 28, no. 3,
pp. 488–491, August 1998.
[16] N. Ebrahimi, E. Maasoumi, and E. S. Soofi, “Ordering Univariate Distributions by Entropy and Variance,” Journal of Econometrics, vol. 90,
no. 2, pp. 317–336, June 1999.
[17] J. Richman and J. Moorman, “Physiological Time-Series Analysis using Approximate Entropy and Sample Entropy,” American Journal
of Physiology. Heart and Circulatory Physiology, vol. 278, no. 6, pp. H2039–H2049, 2000.
[18] J. Martı́n, G. Mayor Forteza, and J. Suñer, “On Dispersion Measures,” Mathware and Soft Computing, vol. 8, no. 3, pp. 227–237, 2001.
[19] P. O. Hoyer, “Non-negative Matrix Factorization with Sparseness Constraints,” Journal of Machine Learning Research, vol. 5, pp. 1457–
1469, December 2004.
[20] H. Suyari, “Generalization of Shannon-Khinchin Axioms to Nonextensive Systems and the Uniqueness Theorem for the Nonextensive
Entropy,” IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1783–1787, August 2004.
[21] D. L. Donoho, “For Most Large Underdetermined Systems of Linear Equations the Minimal `1 -norm Solution is also the Sparsest Solution,”
Communications on Pure and Applied Mathematics, vol. 59, pp. 797–829, 2004.
[22] A. M. Bronstein, M. M. Bronstein, M. Zibulevsky, and Y. Y. Zeevi, “Sparse ICA for Blind Separation of Transmitted and Reflected
Images,” International Journal of Imaging Systems and Technology, vol. 15, no. 1, pp. 84–91, 2005.
[23] R. Renner, “Security of Quantum Key Distribution,” PhD Thesis , ETH Zurich, 2005.
[24] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-
Interscience, 2006.
[25] E. J. Candès, J. Romberg, and T. Tao, “Robust Uncertainty Principles: Exact Signal Reconstruction From Highly Incomplete Frequency
Information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, February 2006.
[26] R. Chartrand, “Exact Reconstruction of Sparse Signals via Nonconvex Minimization,” Signal Processing Letters, IEEE, vol. 14, no. 10,
pp. 707–710, Octpber 2007.
[27] C. Akujuobi, O. Odejide, A. Annamalai, and G. Fudge, “Sparseness Measures of Signals for Compressive Sampling,” in Signal Processing
and Information Technology, 2007 IEEE International Symposium on, December 2007, pp. 1042–1047.
[28] I. Csiszár, “Axiomatic Characterizations of Information Measures,” Entropy, vol. 10, no. 3, pp. 261–273, 2008.
[29] A. Hormati and M. Vetterli, “Distributed Compressed Sensing: Sparsity Models and Reconstruction Algorithms using Annihilating Filter,”
in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, March 2008, pp. 5141–5144.
[30] N. Hurley and S. Rickard, “Comparing Measures of Sparsity,” IEEE Trans. Inf. Theory, vol. 55, no. 10, pp. 4723–4741, October 2009.
[31] A. Cambini and L. Martein, Generalized Convexity and Optimization. Springer, 2009, vol. 616.
[32] J. Haberl, “On Global Minima of Semistrictly Quasiconcave Functions,” Optimization Letters, vol. 3, no. 3, pp. 387–396, 2009.
[33] Tian Lan, D. Kao, Mung Chiang, and A. Sabharwal, “An Axiomatic Theory of Fairness in Network Resource Allocation,” in INFOCOM,
2010 Proceedings IEEE, March 2010, pp. 1–9.
[34] A. Cohen and A. Yeredor, “On the use of sparsity for recovering discrete probability distributions from their moments,” in Statistical
Signal Processing Workshop (SSP), 2011 IEEE, June 2011, pp. 753–756.
[35] D. Krishnan, T. Tay, and R. Fergus, “Blind Deconvolution using a Normalized Sparsity Measure,” in Computer Vision and Pattern
Recognition (CVPR), 2011 IEEE Conference on, June 2011, pp. 233–240.
[36] D. Zonoobi, A. A. Kassim, and Y. V. Venkatesh, “Gini Index as Sparsity Measure for Signal Reconstruction from Compressive Samples,”
Selected Topics in Signal Processing, IEEE Journal of, vol. 5, no. 5, pp. 927–932, September 2011.
[37] M. Pilanci, L. El Ghaoui, and V. Chandrasekaran, “Recovery of Sparse Probability Measures via Convex Programming,” in Proceedings
of Advances in Neural Information Processing Systems (NIPS), 2012, pp. 2429–2437.
[38] P. Cheridito and E. Kromer, “Reward-Risk Ratios,” Journal of Investment Strategies, Forthcoming, October 2013.
[39] T. Blumensath, “Compressed Sensing with Nonlinear Observations and Related Nonlinear Optimization Problems,” IEEE Trans. Inf. Theory,
vol. 59, no. 6, pp. 3466–3474, June 2013.
[40] M. Lopes, “Estimating Unknown Sparsity in Compressed Sensing,” in Proceedings of the 30th International Conference on Machine
Learning (ICML) (3), vol. 28, no. 3, May 2013, pp. 217–225.
[41] G. Pastor, I. Mora-Jiménez, R. Jäntti, and A. J. Caamaño, “Sparsity-Based Criteria for Entropy Measures,” in Wireless Communication
Systems (ISWCS 2013), Proceedings of the Tenth International Symposium on, August 2013, pp. 1–5.
[42] E. J. Candès, “Mathematics of Sparsity (and a Few Other Things),” in Proceedings of the International Congress of Mathematicians, 2014.