
The Cauchy-Schwarz divergence for Poisson point processes

Hung Gia Hoang, Ba-Ngu Vo, Ba-Tuong Vo, and Ronald Mahler

Preprint: IEEE Trans. Information Theory, vol. 61, no. 8, pp. 4475-4485, 2015

Abstract—In this paper, we extend the notion of Cauchy-Schwarz divergence to point processes and establish that the Cauchy-Schwarz divergence between the probability densities of two Poisson point processes is half the squared L2-distance between their intensity functions. We extend this result to mixtures of Poisson point processes and, for the case where the intensity functions are Gaussian mixtures, present closed form expressions for the Cauchy-Schwarz divergence. Our result also implies that the Bhattacharyya distance between the probability distributions of two Poisson point processes is equal to the square of the Hellinger distance between their intensity measures. We illustrate the result via a sensor management application where the system states are modeled as point processes.

Index Terms—Poisson point process, information divergence, random finite sets

Acknowledgement: The work of B.-N. Vo and B.-T. Vo is supported by the Australian Research Council under Future Fellowship FT0991854 and Discovery Early Career Research Award DE120102388 respectively. H. G. Hoang, B.-N. Vo, and B.-T. Vo are with the Department of Electrical and Computer Engineering, Curtin University, Bentley, WA 6102, Australia (email: {hung.hoang,ba-ngu.vo,ba-tuong.vo}@curtin.edu.au). R. Mahler is with Random Sets LLC (email: [email protected]). Part of the paper has been presented at the 2014 IEEE Workshop on Statistical Signal Processing, Gold Coast, Australia [1].

I. INTRODUCTION

The Poisson point process, which models "no interaction" or "complete spatial randomness" in spatial point patterns, is arguably one of the best known and most tractable of point processes [2]–[6]. Point process theory is the study of random counting measures with applications spanning numerous disciplines, see for example [2], [3], [6]–[8]. The Poisson point process itself arises in forestry [9], geology [10], biology [11], particle physics [12], communication networks [13]–[15] and signal processing [16]–[18]. The role of the Poisson point process in point process theory, in most respects, is analogous to that of the normal distribution in random vectors [19].

Similarity measures between random variables are fundamental in information theory and statistical analysis [20]. Information theoretic divergences, for example Kullback-Leibler, Rényi (or α-divergence) and their generalization Csiszár-Morimoto (or Ali-Silvey), Jensen-Rényi, Cauchy-Schwarz, etc., measure the difference between the information content of the random variables. Similarity between random variables can also be measured via the statistical distance between their probability distributions, for example total variation, Bhattacharyya, Hellinger/Matusita, Wasserstein, etc. Some distances are actually special cases of f-divergences [21]. Note that statistical distances are not necessarily proper metrics.

For point processes or random finite sets, similarity measures have been studied extensively in various application areas such as sensor management [22]–[27] and neuroscience [28]. However, so far, except for trivial special cases, these similarity measures cannot be computed analytically and require expensive approximations such as Monte Carlo.
In this paper, we present results on similarity measures for Poisson point processes via the Cauchy-Schwarz divergence and its relationship to the Bhattacharyya and Hellinger distances. In particular, we show that the Cauchy-Schwarz divergence between two Poisson point processes is half the squared L2-distance between their intensity functions. Geometrically, this result relates the angle subtended by the probability densities of the Poisson point processes to the L2-distance between their corresponding intensity functions. For Gaussian mixture intensity functions, their L2-distance, and hence the Cauchy-Schwarz divergence, can be evaluated analytically. We also extend the result to the Cauchy-Schwarz divergence for mixtures of Poisson point processes. In addition, using our result on the Cauchy-Schwarz divergence, we show that the Bhattacharyya distance between the probability distributions of two Poisson point processes is the square of the Hellinger distance between their respective intensity measures. The Poisson point process enjoys a number of nice properties [2]–[4], and our results are useful additions. We illustrate the use of our result on the Cauchy-Schwarz divergence in a sensor management application for multi-target tracking involving the Probability Hypothesis Density (PHD) filter [16].

The organization of the paper is as follows. Background on point processes and the Cauchy-Schwarz divergence is provided in Section II. Section III presents the main results of the paper, which establish the analytical formulation for the Cauchy-Schwarz divergence and Bhattacharyya distance between two Poisson point processes. In Section IV, the application of the Cauchy-Schwarz divergence to sensor management, including numerical examples, is studied. Finally, Section V concludes the paper.

II. BACKGROUND

In this work we consider a state space X ⊆ R^d, and adopt the inner product notation ⟨f, g⟩ ≜ ∫ f(x)g(x) dx; the L2 norm notation ‖f‖ ≜ √⟨f, f⟩; the multi-target exponential notation h^X ≜ ∏_{x∈X} h(x), where h is a real-valued function, with h^∅ = 1 by convention; and the indicator function notation

1_B(x) ≜ 1 if x ∈ B, and 0 otherwise.

The notation N(x; m, Q) is used to explicitly denote the probability density of a Gaussian random variable with mean m and covariance Q, evaluated at x.

A. Point processes

This section briefly summarizes concepts in point process theory needed for the exposition of our results. Point process theory, in general, is concerned with random counting measures. Our results are restricted to simple-finite point processes, which can be regarded as random finite sets. For simplicity, we omit the prefix "simple-finite" in the rest of the paper. For an introduction to the subject we refer the reader to the article [7], and for detailed treatments, books such as [2], [3], [5], [6].

A point process or random finite set X on X is a random variable taking values in F(X), the space of finite subsets of X. Let |X| denote the number of elements in a set X. A point process X on X is said to be Poisson with a given intensity function u (defined on X) if [2], [3]:

1) for any B ⊆ X such that ⟨u, 1_B⟩ < ∞, the random variable |X ∩ B| is Poisson distributed with mean ⟨u, 1_B⟩;
2) for any disjoint B_1, ..., B_i ⊆ X, the random variables |X ∩ B_1|, ..., |X ∩ B_i| are independent.
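To make the definition concrete, here is a minimal NumPy sketch (our illustration, not from the paper) that simulates a homogeneous Poisson point process with constant intensity u(x) = λ on the unit square and empirically checks property 1. It uses the i.i.d. characterization recalled in the next paragraph, which for a constant intensity amounts to uniform placement of the points.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 50.0                        # homogeneous intensity u(x) = lam on [0,1]^2

def sample_poisson_pp(lam, rng):
    # total number of points is Poisson with mean <u,1> = lam * area
    n = rng.poisson(lam * 1.0)
    # given n, points are i.i.d. with density u/<u,1> (uniform here)
    return rng.uniform(0.0, 1.0, size=(n, 2))

# empirical check of property 1 for B = [0,0.3] x [0,0.3]
counts = []
for _ in range(10000):
    X = sample_poisson_pp(lam, rng)
    counts.append(np.sum((X[:, 0] < 0.3) & (X[:, 1] < 0.3)))
counts = np.asarray(counts)
print(counts.mean(), lam * 0.09)  # both close to <u,1_B> = lam * |B|
print(counts.var(), lam * 0.09)   # Poisson: variance equals the mean
```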
Since ⟨u, 1_B⟩ is the expected number of points of X in the region B, the intensity value u(x) can be interpreted as the instantaneous expected number of points per unit hyper-volume at x. Consequently, u(x) is not dimensionless in general. If hyper-volume (on X) is measured in units of K (e.g. m^d, cm^d, in^d, etc.) then the intensity function u has unit K^{-1}. The number of points of a Poisson point process X is Poisson distributed with mean ⟨u, 1⟩, and conditional on the number of points, the elements x of X are independently and identically distributed (i.i.d.) according to the probability density u(·)/⟨u, 1⟩ [2], [3], [5], [6]. It is implicit that ⟨u, 1⟩ is finite since we only consider simple-finite point processes.

The probability distribution of a Poisson point process X with intensity function u is given by [6, p. 15]

Pr(X ∈ T) = Σ_{i=0}^∞ (e^{-⟨u,1⟩}/i!) ∫_{X^i} 1_T({x_1, ..., x_i}) u^{{x_1,...,x_i}} d(x_1, ..., x_i),   (1)

for any (measurable) subset T of F(X), where X^i denotes the i-th-fold Cartesian product of X, with the convention X^0 = {∅}, and the integral over X^0 is 1_T(∅). A Poisson point process is completely characterized by its intensity function (or more generally its intensity measure).

Probability densities of point processes considered in this work are defined with respect to the reference measure µ given by

µ(T) = Σ_{i=0}^∞ (1/(i! K^i)) ∫_{X^i} 1_T({x_1, ..., x_i}) d(x_1, ..., x_i)   (2)

for any (measurable) subset T of F(X). The measure µ is analogous to the Lebesgue measure on X (indeed it is the unnormalized distribution of a Poisson point process with unit intensity u = 1/K when the state space X is bounded). Moreover, it was shown in [29] that for this choice of reference measure, the integral of a function f : F(X) → R, given by

∫ f(X) µ(dX) = Σ_{i=0}^∞ (1/(i! K^i)) ∫_{X^i} f({x_1, ..., x_i}) d(x_1, ..., x_i),   (3)

is equivalent to Mahler's set integral [30]. Note that the reference measure µ and the integrand f are all dimensionless.

Our main result involves Poisson point processes with probability densities of the form

f(X) = e^{-⟨u,1⟩} [Ku]^X.   (4)

Note that for any (measurable) subset T of F(X)

∫_T f(X) µ(dX) = ∫ 1_T(X) f(X) µ(dX) = Σ_{i=0}^∞ (e^{-⟨u,1⟩}/i!) ∫_{X^i} 1_T({x_1, ..., x_i}) u^{{x_1,...,x_i}} d(x_1, ..., x_i).

Thus, comparing with (1), f is indeed a probability density (with respect to µ) of a Poisson point process with intensity function u.

B. The Cauchy-Schwarz divergence

The Cauchy-Schwarz divergence is based on the Cauchy-Schwarz inequality for inner products, and is defined for two random vectors with probability densities f and g by [31]

D_CS(f, g) = - ln ( ⟨f, g⟩ / (‖f‖ ‖g‖) ).   (5)

The argument of the logarithm in (5) is non-negative (since probability densities are non-negative) and does not exceed one (by the Cauchy-Schwarz inequality). Moreover, this quantity can be interpreted as the cosine of the angle subtended by f and g in L2(X, R), the space of square integrable functions taking X to R. Note that D_CS(f, g) is symmetric and positive unless f = g, in which case D_CS(f, g) = 0. Geometrically, the Cauchy-Schwarz divergence determines the information "difference" between random vectors from the angle between their probability densities. The Cauchy-Schwarz divergence can also be interpreted as an approximation to the Kullback-Leibler divergence [31].
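As a quick numerical illustration of (5) (our example, not from the paper), the sketch below evaluates the Cauchy-Schwarz divergence between two univariate Gaussian densities by quadrature and checks it against the closed form obtained from the standard Gaussian inner-product identity ⟨N(·; µ0, σ0²), N(·; µ1, σ1²)⟩ = N(µ0; µ1, σ0² + σ1²), which reappears in Section III-B.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

m0, s0, m1, s1 = 0.0, 1.0, 2.0, 1.5

f = lambda x: norm.pdf(x, m0, s0)
g = lambda x: norm.pdf(x, m1, s1)

# D_CS by direct numerical quadrature of the inner products in (5)
ip = lambda a, b: quad(lambda x: a(x) * b(x), -np.inf, np.inf)[0]
d_cs_num = -np.log(ip(f, g) / np.sqrt(ip(f, f) * ip(g, g)))

# closed form via <N0, N1> = N(m0; m1, s0^2 + s1^2)
gauss_ip = lambda ma, va, mb, vb: norm.pdf(ma, mb, np.sqrt(va + vb))
d_cs_cf = -np.log(gauss_ip(m0, s0**2, m1, s1**2)
                  / np.sqrt(gauss_ip(m0, s0**2, m0, s0**2)
                            * gauss_ip(m1, s1**2, m1, s1**2)))
print(d_cs_num, d_cs_cf)   # the two values should agree
```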
While the Kullback-Leibler divergence can be evaluated analytically for Gaussians (random vectors) [32], [33], for the more versatile class of Gaussian mixtures only the Jensen-Rényi and Cauchy-Schwarz divergences can be evaluated in closed form [31], [34]. Hence, the Cauchy-Schwarz divergence between two densities of random variables has been employed in many information theoretic applications, especially in machine learning and pattern recognition [31], [35]–[38].

III. THE CAUCHY-SCHWARZ DIVERGENCE FOR POISSON POINT PROCESSES

This section presents the main theoretical results of the paper. Subsection III-A establishes the Cauchy-Schwarz divergence for general Poisson point processes. Subsection III-B presents the analytical solution for Poisson point processes with Gaussian mixture intensities, while subsection III-C details the solution for mixtures of Poisson point processes. Finally, subsection III-D presents a result on the Bhattacharyya distance between two Poisson point processes.

A. Cauchy-Schwarz divergence for Poisson point processes

For point processes, the Csiszár-Morimoto divergences, which include the Kullback-Leibler and Rényi divergences, were formulated in [23] by replacing the standard (Lebesgue) integral with the set integral, which is defined for a Finite Set Statistics (FISST) density ϕ as follows [30]

∫ ϕ(X) δX = Σ_{i=0}^∞ (1/i!) ∫ ϕ({x_1, ..., x_i}) d(x_1, ..., x_i).

The FISST density ϕ is not a probability density, but is closely related to a probability density, see [29] for further details. Note that ϕ({x_1, ..., x_i}) has unit K^{-i}, since the infinitesimal hyper-volume d(x_1, ..., x_i) has unit K^i. Thus, ϕ(X) has different units for different cardinalities of X.

Unlike the Csiszár-Morimoto divergence, however, the Cauchy-Schwarz divergence cannot be extended to point processes by simply replacing the standard integral with the set integral. To see this, consider the naïve inner product between two FISST densities ϕ and φ via the set integral:

⟨ϕ, φ⟩ = ∫ ϕ(X) φ(X) δX = Σ_{i=0}^∞ (1/i!) ∫ ϕ({x_1, ..., x_i}) φ({x_1, ..., x_i}) d(x_1, ..., x_i);

since the i-th term in the above sum has units of K^{-i}, the sum itself is meaningless because the terms cannot be added together due to unit mismatch, e.g. if K = m^3, then the first term is unitless, the second term is in m^{-3}, the third term is in m^{-6}, etc. Indeed such a naïve inner product has been used incorrectly in [39].

Using the standard notion of density and integration summarized in subsection II-A, we can define the inner product

⟨f, g⟩_µ = ∫ f(X) g(X) µ(dX),

and corresponding norm ‖f‖_µ ≜ √⟨f, f⟩_µ, on L2(F(X), R). Such forms for the inner product and norm are well-defined because the densities f, g and the reference measure µ are all unitless. Interestingly, the inner product between multi-target exponentials is given by the following result.

Lemma 1. Let f(X) = r^X and g(X) = s^X with r, s ∈ L2(X, R). Then ⟨f, g⟩_µ = e^{K^{-1}⟨r,s⟩}.

Proof:

⟨f, g⟩_µ = ∫ [rs]^X µ(dX) = Σ_{i=0}^∞ (1/(i! K^i)) [ ∫_X r(x)s(x) dx ]^i   (using (3))
        = e^{K^{-1}⟨r,s⟩}.

In the spirit of using the angle between probability densities to determine the information "difference", the Cauchy-Schwarz divergence can be extended to point processes as follows.

Definition 1. The Cauchy-Schwarz divergence between the probability densities f and g of two point processes with respect to the reference measure µ is defined by

D_CS(f, g) = - ln ( ⟨f, g⟩_µ / (‖f‖_µ ‖g‖_µ) ).   (6)
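The geometric series argument in the proof of Lemma 1 is easy to verify numerically. The following sketch (illustrative; the scaled Gaussians r, s and the choice K = 1 are arbitrary assumptions of ours) compares a truncated version of the sum Σ_i ⟨r,s⟩^i/(i! K^i) with e^{⟨r,s⟩/K}.

```python
import numpy as np
from scipy.stats import norm
from math import factorial

K = 1.0                                   # unit of hyper-volume (assumed)
# r, s: scaled Gaussians on R, so <r,s> has a closed form
wr, mr, vr = 2.0, 0.0, 1.0
ws, ms, vs = 3.0, 1.0, 2.0
rs = wr * ws * norm.pdf(mr, ms, np.sqrt(vr + vs))   # <r,s>

# truncated version of the series in the proof of Lemma 1
series = sum(rs**i / (factorial(i) * K**i) for i in range(40))
print(series, np.exp(rs / K))             # both equal exp(<r,s>/K)
```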
Definition 1 can be equivalently expressed in terms of set integrals as follows. Let ϕ and φ denote the FISST densities of the respective point processes. Using the relationship between the FISST density and the Radon-Nikodym derivative in [29], the corresponding probability densities relative to µ are given by f(X) = K^{|X|} ϕ(X) and g(X) = K^{|X|} φ(X). Since

⟨f, g⟩_µ = Σ_{i=0}^∞ (1/(i! K^i)) ∫_{X^i} K^i ϕ({x_1, ..., x_i}) K^i φ({x_1, ..., x_i}) d(x_1, ..., x_i) = ∫ K^{|X|} ϕ(X) φ(X) δX,

the Cauchy-Schwarz divergence can be written as

D_CS(ϕ, φ) = - ln [ ∫ K^{|X|} ϕ(X) φ(X) δX / √( ∫ K^{|X|} ϕ²(X) δX ∫ K^{|X|} φ²(X) δX ) ].

The following proposition asserts that the Cauchy-Schwarz divergence between two Poisson point processes is half the squared distance between their intensity functions.

Proposition 1. The Cauchy-Schwarz divergence between the probability densities f and g of two Poisson point processes with respective intensity functions u and v ∈ L2(X, R) (measured in units of K^{-1}) is given by

D_CS(f, g) = (K/2) ‖u - v‖².   (7)

Proof: Substituting f(X) = e^{-⟨u,1⟩} [Ku]^X and g(X) = e^{-⟨v,1⟩} [Kv]^X into (6) and canceling out the constants e^{-⟨u,1⟩}, e^{-⟨v,1⟩}, we have

D_CS(f, g) = - ln [ ⟨[Ku]^(·), [Kv]^(·)⟩_µ / ( ⟨[Ku]^(·), [Ku]^(·)⟩_µ^{1/2} ⟨[Kv]^(·), [Kv]^(·)⟩_µ^{1/2} ) ].

Applying Lemma 1 to the above equation gives

D_CS(f, g) = - ln ( e^{K⟨u,v⟩ - (K/2)⟨u,u⟩ - (K/2)⟨v,v⟩} ) = - ln ( e^{-(K/2)(⟨u,u⟩ - 2⟨u,v⟩ + ⟨v,v⟩)} ) = (K/2) ‖u - v‖².

Note that since the intensity functions have units of K^{-1}, ‖u - v‖² also has unit K^{-1} and hence K ‖u - v‖² is unitless. Moreover, K ‖u - v‖², referred to as the squared distance between the intensity functions u and v, takes on the same value regardless of the choice of measurement unit. Suppose that the unit of hyper-volume in the state space X is changed from K to ρK (for example, from dm^3 to m^3 = 10^3 dm^3) as illustrated in Fig. 1. The change of unit inevitably leads to a change in the numerical values of the two intensity functions (for example, the intensity measured in m^{-3}, which is the expected number of points per cubic meter, is one thousand times the intensity measured in dm^{-3}). However, these changes cancel each other in the product ρK ‖u_ρ - v_ρ‖², so that the squared distance remains unchanged.

[Fig. 1. Change of unit in the state space: the same intensity function is shown as u [K^{-1}] against x [K], and as u′ [ρ^{-1}K^{-1}] against x′ [ρK]; the value u0 at location x0 becomes ρu0 at ρ^{-1}x0.]

Proposition 1 has a nice geometric interpretation that relates the angle subtended by the probability densities in L2(F(X), R) to the distance between the corresponding intensity functions in L2(X, R), as depicted in Fig. 2. More concisely: the secant of the angle between the probability densities of two Poisson point processes equals the exponential of half the squared distance between their intensity functions, i.e. ln(sec θ) = (K/2)‖u - v‖².

[Fig. 2. Geometric interpretation of Proposition 1: the angle θ between f and g in L2(F(X), R) corresponds to the distance ‖u - v‖ between u and v in L2(X, R), via ln(sec θ) = (K/2)‖u - v‖².]

The above result has important implications in the approximation of Poisson point processes through their intensity functions. It is intuitive that the "difference" between the Poisson distributions vanishes as the distance between their intensity functions tends to zero. However, it was not clear that a reduction in the error between the intensity functions necessarily implies a reduction in the "difference" between the corresponding distributions. Our result not only verifies that the "difference" between the distributions is reduced, it also quantifies the reduction.

B. Gaussian Mixture Intensities

In general, the L2-distance between the intensity functions, and hence the Cauchy-Schwarz divergence, cannot be evaluated in closed form. However, for Poisson point processes with Gaussian mixture intensity functions, applying the following identity for Gaussian probability density functions [40, p. 200]

⟨N(·; µ_0, Σ_0), N(·; µ_1, Σ_1)⟩ = N(µ_0; µ_1, Σ_0 + Σ_1)

to (7) yields an analytic expression for the Cauchy-Schwarz divergence. This is stated more concisely in the following result.
Corollary 1. The Cauchy-Schwarz divergence between two Poisson point processes with Gaussian mixture intensities

u(x) = Σ_{i=1}^{N_u} w_u^{(i)} N(x; m_u^{(i)}, P_u^{(i)}),   (8a)
v(x) = Σ_{i=1}^{N_v} w_v^{(i)} N(x; m_v^{(i)}, P_v^{(i)})   (8b)

(measured in units of K^{-1}) is given by

D_CS(f, g) = K [ (1/2) Σ_{i=1}^{N_u} Σ_{j=1}^{N_u} w_u^{(i)} w_u^{(j)} N(m_u^{(i)}; m_u^{(j)}, P_u^{(i)} + P_u^{(j)})
  + (1/2) Σ_{i=1}^{N_v} Σ_{j=1}^{N_v} w_v^{(i)} w_v^{(j)} N(m_v^{(i)}; m_v^{(j)}, P_v^{(i)} + P_v^{(j)})
  - Σ_{i=1}^{N_u} Σ_{j=1}^{N_v} w_u^{(i)} w_v^{(j)} N(m_u^{(i)}; m_v^{(j)}, P_u^{(i)} + P_v^{(j)}) ].   (9)

In terms of computational complexity, each term in (9) involves evaluations of a Gaussian probability density function within a double sum. Hence, if we use standard Gauss-Jordan elimination for the matrix inversions, computing D_CS(f, g) is quadratic in the number of Gaussian components and cubic in the state dimension (i.e. O(N_v² d³), assuming N_v ≥ N_u). The complexity can be reduced to O(N_v² d^{2.373}) if the optimized Coppersmith-Winograd algorithm [41] is employed in place of Gauss-Jordan elimination.

This Corollary has important implications in Gaussian mixture reduction for intensity functions: it provides mathematical justification for Gaussian mixture reduction based on L2-error. Furthermore, since Gaussian mixtures can approximate any density to any desired accuracy [42], Corollary 1 enables the Cauchy-Schwarz divergence between two Poisson point processes to be approximated to any desired accuracy.
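A direct implementation of (9) is short. The following sketch is our illustration (function names gm_cross_term and dcs_poisson_gm are ours, K = 1 is assumed, and the toy mixture parameters are arbitrary); it uses SciPy's multivariate normal pdf for the N(·; ·, ·) evaluations.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def gm_cross_term(w_a, m_a, P_a, w_b, m_b, P_b):
    # sum_i sum_j w_a[i] w_b[j] N(m_a[i]; m_b[j], P_a[i] + P_b[j])
    total = 0.0
    for wi, mi, Pi in zip(w_a, m_a, P_a):
        for wj, mj, Pj in zip(w_b, m_b, P_b):
            total += wi * wj * mvn.pdf(mi, mean=mj, cov=Pi + Pj)
    return total

def dcs_poisson_gm(w_u, m_u, P_u, w_v, m_v, P_v, K=1.0):
    # Corollary 1, eq. (9): D_CS = K( 0.5<u,u> + 0.5<v,v> - <u,v> )
    return K * (0.5 * gm_cross_term(w_u, m_u, P_u, w_u, m_u, P_u)
                + 0.5 * gm_cross_term(w_v, m_v, P_v, w_v, m_v, P_v)
                - gm_cross_term(w_u, m_u, P_u, w_v, m_v, P_v))

# toy 2-D example; the weights are intensity masses, not probabilities
w_u = [5.0, 3.0]
m_u = [np.zeros(2), np.array([4.0, 0.0])]
P_u = [np.eye(2), 2.0 * np.eye(2)]
w_v = [6.0]
m_v = [np.array([1.0, 1.0])]
P_v = [np.eye(2)]
print(dcs_poisson_gm(w_u, m_u, P_u, w_v, m_v, P_v))
```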
C. Mixture of Poisson point processes

Proposition 1 can be easily extended to mixtures of Poisson point processes, i.e. those whose probability densities can be written as a weighted sum of Poisson point process densities:

f(X) = Σ_{i=1}^{N_f} w_f^{(i)} e^{-⟨u_i,1⟩} [Ku_i]^X,   (10a)
g(X) = Σ_{i=1}^{N_g} w_g^{(i)} e^{-⟨v_i,1⟩} [Kv_i]^X,   (10b)

where Σ_{i=1}^{N_f} w_f^{(i)} = Σ_{i=1}^{N_g} w_g^{(i)} = 1. Such point processes have applications in immunology [43], neural data analysis [44], criminology [45], and machine learning [46]. Substituting (10) into (6) and applying Lemma 1, the Cauchy-Schwarz divergence between two mixtures of Poisson point processes is stated as follows.

Corollary 2. The Cauchy-Schwarz divergence between two mixtures of Poisson point processes given in (10) is

D_CS(f, g) = - ln [ Σ_{i=1}^{N_f} Σ_{j=1}^{N_g} w_f^{(i)} w_g^{(j)} e^{K⟨u_i,v_j⟩} / e^{⟨u_i+v_j,1⟩} ]
  + (1/2) ln [ Σ_{i=1}^{N_f} Σ_{j=1}^{N_f} w_f^{(i)} w_f^{(j)} e^{K⟨u_i,u_j⟩} / e^{⟨u_i+u_j,1⟩} ]
  + (1/2) ln [ Σ_{i=1}^{N_g} Σ_{j=1}^{N_g} w_g^{(i)} w_g^{(j)} e^{K⟨v_i,v_j⟩} / e^{⟨v_i+v_j,1⟩} ].   (11)

Furthermore, if the intensity function of each Poisson point process component is a Gaussian mixture (in units of K^{-1}):

u_i(x) = Σ_{ℓ=1}^{N_{u_i}} ω_{u_i}^{(ℓ)} N(x; m_{u_i}^{(ℓ)}, P_{u_i}^{(ℓ)}),
v_j(x) = Σ_{ℓ=1}^{N_{v_j}} ω_{v_j}^{(ℓ)} N(x; m_{v_j}^{(ℓ)}, P_{v_j}^{(ℓ)}),

then D_CS(f, g) can be evaluated analytically by substituting the following expressions into (11):

K⟨u_i, u_j⟩ = K Σ_{ℓ=1}^{N_{u_i}} Σ_{k=1}^{N_{u_j}} ω_{u_i}^{(ℓ)} ω_{u_j}^{(k)} N(m_{u_i}^{(ℓ)}; m_{u_j}^{(k)}, P_{u_i}^{(ℓ)} + P_{u_j}^{(k)}),
K⟨v_i, v_j⟩ = K Σ_{ℓ=1}^{N_{v_i}} Σ_{k=1}^{N_{v_j}} ω_{v_i}^{(ℓ)} ω_{v_j}^{(k)} N(m_{v_i}^{(ℓ)}; m_{v_j}^{(k)}, P_{v_i}^{(ℓ)} + P_{v_j}^{(k)}),
K⟨u_i, v_j⟩ = K Σ_{ℓ=1}^{N_{u_i}} Σ_{k=1}^{N_{v_j}} ω_{u_i}^{(ℓ)} ω_{v_j}^{(k)} N(m_{u_i}^{(ℓ)}; m_{v_j}^{(k)}, P_{u_i}^{(ℓ)} + P_{v_j}^{(k)}),
⟨u_i + v_j, 1⟩ = Σ_{ℓ=1}^{N_{u_i}} ω_{u_i}^{(ℓ)} + Σ_{k=1}^{N_{v_j}} ω_{v_j}^{(k)},
⟨u_i + u_j, 1⟩ = Σ_{ℓ=1}^{N_{u_i}} ω_{u_i}^{(ℓ)} + Σ_{k=1}^{N_{u_j}} ω_{u_j}^{(k)},
⟨v_i + v_j, 1⟩ = Σ_{ℓ=1}^{N_{v_i}} ω_{v_i}^{(ℓ)} + Σ_{k=1}^{N_{v_j}} ω_{v_j}^{(k)}.
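For the simplest case of the substitution above, where each component intensity is a single scalar Gaussian u_i(x) = a_i N(x; µ_i, σ_i²) so that ⟨u_i, 1⟩ = a_i, (11) can be coded directly. The sketch below is our illustration and assumes K = 1 and scalar states; the function names are ours.

```python
import numpy as np
from scipy.stats import norm

# each Poisson component has intensity a * N(x; mu, var), so <u,1> = a
# and K<u_i, v_j> = K * a_i a_j N(mu_i; mu_j, var_i + var_j)
def log_mix_ip(w1, comps1, w2, comps2, K=1.0):
    # log of sum_ij w1_i w2_j exp(K<u_i,v_j>) / exp(<u_i+v_j,1>), cf. (11)
    total = 0.0
    for wi, (ai, mi, vi) in zip(w1, comps1):
        for wj, (aj, mj, vj) in zip(w2, comps2):
            kip = K * ai * aj * norm.pdf(mi, mj, np.sqrt(vi + vj))
            total += wi * wj * np.exp(kip - (ai + aj))
    return np.log(total)

def dcs_mixture(wf, uf, wg, ug, K=1.0):
    # Corollary 2, eq. (11)
    return (-log_mix_ip(wf, uf, wg, ug, K)
            + 0.5 * log_mix_ip(wf, uf, wf, uf, K)
            + 0.5 * log_mix_ip(wg, ug, wg, ug, K))

# two mixtures: components are (mass a, mean mu, variance var)
wf = [0.4, 0.6]; uf = [(5.0, 0.0, 1.0), (8.0, 3.0, 2.0)]
wg = [1.0];      ug = [(6.0, 1.0, 1.0)]
print(dcs_mixture(wf, uf, wg, ug))
```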
D. Bhattacharyya distance for Poisson point processes

The Cauchy-Schwarz divergence is based on the angle between two probability densities (with respect to a reference measure), and is not necessarily invariant to the choice of reference measure. Closely related to the Cauchy-Schwarz divergence is the Bhattacharyya distance between two probability measures [47].

Definition 2. The Bhattacharyya distance between two probability measures F and G is defined by

D_B(F, G) = - ln ⟨ √(dF/dµ), √(dG/dµ) ⟩_µ,   (12)

where µ is any measure dominating F and G.

The inner product in the above definition, denoted by C_B(F, G), is called the Bhattacharyya coefficient and is invariant to the choice of reference measure µ [47]. Unlike the Cauchy-Schwarz divergence, the Bhattacharyya distance avoids the requirement of square integrable probability densities, since square roots of probability densities are always square integrable. Note also that the Bhattacharyya distance can be expressed as the Cauchy-Schwarz divergence between the square roots of the probability densities, i.e. for any µ that dominates F and G

D_B(F, G) = D_CS( √(dF/dµ), √(dG/dµ) ).   (13)

Hence, Proposition 1 can be applied to relate the Bhattacharyya distance between the probability distributions of Poisson point processes to their intensity functions.

Corollary 3. The Bhattacharyya distance between the probability distributions F and G of two Poisson point processes with respective intensity measures U and V (assumed to have densities with respect to the Lebesgue measure) is given by

D_B(F, G) = D_H²(U, V),   (14)

where

D_H(U, V) = (1/√2) ‖ √(dU/dλ) - √(dV/dλ) ‖

is the Hellinger distance between the measures U and V (which is invariant to the choice of reference measure).

Proof: Let u and v be the densities (measured in units of K^{-1}) of U and V relative to the Lebesgue measure λ. Then the densities of F and G relative to µ are given by f(X) = e^{-⟨u,1⟩} [Ku]^X and g(X) = e^{-⟨v,1⟩} [Kv]^X. From Proposition 1, the Cauchy-Schwarz divergence between √f(X) ∝ [K√(u/K)]^X and √g(X) ∝ [K√(v/K)]^X is given by

D_CS(√f, √g) = (K/2) ‖ √(u/K) - √(v/K) ‖² = (1/2) ‖ √u - √v ‖² = D_H²(U, V).

The above Corollary asserts that the Bhattacharyya distance between two Poisson point processes is the squared Hellinger distance between their intensity measures. Moreover, the square of the Hellinger distance can be expanded as

D_H²(U, V) = ( ‖√(dU/dλ)‖² + ‖√(dV/dλ)‖² - 2⟨√(dU/dλ), √(dV/dλ)⟩ ) / 2 = ( U(X) + V(X) ) / 2 - C_B(U, V).

The intensity masses U(X) and V(X) are the expected numbers of points of the respective Poisson point processes. Thus, Corollary 3 provides another interesting interpretation: the Bhattacharyya distance between two Poisson point processes is the difference between the mean of their expected numbers of points and the Bhattacharyya coefficient of their intensity measures.

In general, the Hellinger distance cannot be evaluated in closed form. However, for Poisson point processes with Gaussian intensity functions, using the Bhattacharyya coefficient for Gaussians [48]

C_B( N(·; µ_0, Σ_0), N(·; µ_1, Σ_1) ) = √( (2π)^d √(|Σ_0| |Σ_1|) ) N( µ_0/2; µ_1/2, (Σ_0 + Σ_1)/2 )

yields an analytic expression for the Hellinger distance between Gaussian intensity functions, stated as follows.

Corollary 4. The Bhattacharyya distance between two Poisson point processes with Gaussian intensities

u(x) = w_u N(x; m_u, P_u),   (15a)
v(x) = w_v N(x; m_v, P_v)   (15b)

(measured in units of K^{-1}) is given by

D_B(F, G) = (w_u + w_v)/2 - √( (2π)^d w_u w_v √(|P_u| |P_v|) ) N( m_u/2; m_v/2, (P_u + P_v)/2 ).   (16)

Remark. For point processes, the Bhattacharyya distance can also be defined by replacing the standard (Lebesgue) integral with the set integral. Again let ϕ and φ denote the FISST densities of the respective point processes. Then it follows from [29] that the corresponding probability densities relative to µ are given by f(X) = K^{|X|} ϕ(X) and g(X) = K^{|X|} φ(X). Hence,

⟨√f, √g⟩_µ = Σ_{i=0}^∞ (1/(i! K^i)) ∫_{X^i} √(K^i ϕ({x_1, ..., x_i})) √(K^i φ({x_1, ..., x_i})) d(x_1, ..., x_i) = ∫ √ϕ(X) √φ(X) δX,

and the Bhattacharyya distance can be written in terms of FISST densities and the set integral as

D_B(ϕ, φ) = - ln ∫ √ϕ(X) √φ(X) δX.
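The closed form (16) is easily cross-checked against direct numerical integration of the Hellinger distance. The sketch below (our illustration with arbitrary scalar parameters, d = 1) computes D_B from (16) and compares it with the squared Hellinger distance between the intensity measures obtained by quadrature, as Corollary 3 predicts.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

wu, mu, Pu = 5.0, 0.0, 1.0     # u(x) = wu N(x; mu, Pu), state dimension d = 1
wv, mv, Pv = 7.0, 2.0, 2.0

# Corollary 4, eq. (16), specialized to d = 1
cb = np.sqrt(2 * np.pi * wu * wv * np.sqrt(Pu * Pv)) \
     * norm.pdf(mu / 2, mv / 2, np.sqrt((Pu + Pv) / 2))
db_closed = (wu + wv) / 2 - cb

# squared Hellinger distance between the intensity measures, by quadrature
u = lambda x: wu * norm.pdf(x, mu, np.sqrt(Pu))
v = lambda x: wv * norm.pdf(x, mv, np.sqrt(Pv))
dh2 = 0.5 * quad(lambda x: (np.sqrt(u(x)) - np.sqrt(v(x))) ** 2,
                 -np.inf, np.inf)[0]
print(db_closed, dh2)          # Corollary 3: the two values should agree
```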
IV. APPLICATION TO MULTI-TARGET SENSOR MANAGEMENT

In this section, we present an application of our result to a sensor management (a.k.a. sensor control) problem for multi-target systems, where system states are modeled as point processes or random finite sets (RFS) [16], [29], [30], [49]. A multi-target system is fundamentally different from a single-target system in that the number of states changes with time due to births and deaths of targets. For the purpose of illustrating the result in the previous section, we assume a linear Gaussian multi-target model [50], where the hidden multi-target state at time k is a finite set X_k, which is partially observed as another finite set Z_k. All aspects of the system dynamics as well as sensor detection and false alarms are described in detail in Appendix A.

Multi-target sensor management is a stochastic control problem which involves the following steps:

1) propagating the multi-target posterior density, or alternatively a tractable approximation, recursively in time;
2) at each time, determining the action of the sensor by optimizing an objective function over a set of admissible actions.

In step 1, propagating the full posterior is generally intractable. However, for linear Gaussian multi-target systems, the first moment of the posterior (a.k.a. the intensity function) can be propagated efficiently via the Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter [50], as documented in Appendix B. The sensor action in step 2 is executed by applying a control command/signal to the sensor, usually in order to either minimize a cost or maximize a reward. In the rest of this section, we demonstrate that the Cauchy-Schwarz divergence is a useful reward function for multi-target sensor management.

A. Cauchy-Schwarz divergence based reward

Denote by R(a_{k-1}, Z_{k:k+p}) the value of a reward function if the control command a_{k-1} were applied to the sensor at time k - 1 and subsequently the measurement sequence Z_{k:k+p} = [Z_k, Z_{k+1}, ..., Z_{k+p}] is observed for p + 1 time steps in the future. For illustration purposes, we focus only on the single step look-ahead (i.e. p = 0) policy. Naturally, given the reward function R(a_{k-1}, Z_{k:k+p}), the optimal control command a*_{k-1} is chosen to maximize the expected reward E[R(a_{k-1}, Z_k)], where the expectation is taken over all possible values of the future measurement Z_k. A computationally cheaper approach is to maximize the ideal predicted reward R(a_{k-1}, Z*_k) [26], [51], [52], where Z*_k is the ideal predicted measurement from the predicted intensity v_{k|k-1}, that is, assuming no false alarms (zero clutter) and perfect target measurements (unity detection probability and negligible measurement noise). Other choices of objective functions are discussed in [26], [51]–[53].

A common class of reward functions for sensor control is that of information theoretic divergences between the predicted and posterior probability densities. For example, in [26], [52], [53] the Rényi divergence is employed to quantify the information gain from the future measurements for a chosen control action. The main drawback of the Rényi divergence based approach is that it involves computation of integrals in infinite dimensional spaces, which is generally intractable.

As an alternative to the Rényi divergence, we propose the use of the Cauchy-Schwarz divergence for multi-target sensor control. According to Proposition 1, computing the Cauchy-Schwarz reward function for Poisson multi-target densities reduces to calculating the squared L2-distance between the predicted and posterior intensities:

R(a_{k-1}, Z_k) = (K/2) ‖ v_{k|k-1}(·) - v_k(·; a_{k-1}, Z_k) ‖².   (17)

This strategy effectively replaces the evaluation of the Rényi divergence, via integrals in the infinite dimensional space F(X), with the Cauchy-Schwarz divergence, which can be computed via standard integrals on the finite dimensional space X. Moreover, when the GM-PHD filter is used for the propagation of the Gaussian mixture posterior intensity, the reward function R(a_{k-1}, Z_k) can be evaluated in closed form using Corollary 1. In this section, our control policy is to select the control command a_{k-1} so as to maximize the ideal reward R(a_{k-1}, Z*_k).
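To summarize how (17) drives the control choice, the following skeleton is a purely illustrative sketch of ours: the 1-D intensity, the candidate command set, and the ideal-measurement update are hypothetical stand-ins (they are not the GM-PHD equations of Appendix B), and the squared L2-distance is computed on a grid rather than in closed form.

```python
import numpy as np
from scipy.stats import norm

K = 1.0
x = np.linspace(0.0, 1000.0, 2001)              # 1-D surrogate state grid
dx = x[1] - x[0]

v_pred = 5.0 * norm.pdf(x, 400.0, 50.0)         # toy predicted intensity

def posterior_for(command):
    # hypothetical stand-in for the GM-PHD update under the ideal
    # measurement Z_k^*: commands closer to the target (at 400) let the
    # posterior concentrate more around it; mass is kept unchanged
    gain = np.exp(-abs(command - 400.0) / 200.0)
    return (1.0 - gain) * v_pred + gain * 5.0 * norm.pdf(x, 400.0, 15.0)

def reward(v_post):
    # eq. (17): R = (K/2) ||v_pred - v_post||^2, here by a Riemann sum
    return 0.5 * K * np.sum((v_pred - v_post) ** 2) * dx

commands = np.arange(0.0, 1001.0, 100.0)        # candidate sensor positions
best = max(commands, key=lambda a: reward(posterior_for(a)))
print(best)                                     # picks the command nearest 400
```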
B. Numerical example

This example is based on a scenario adapted from [26] in which a mobile robot is tracking a varying number of moving targets. The surveillance area is a square of dimensions 1000m × 1000m. Each target at time k - 1 is characterized by a single-target state of the form x_{k-1} = [p^T_{k-1}, ṗ^T_{k-1}]^T, where p_{k-1} is the 2D position vector and ṗ_{k-1} is the 2D velocity vector. If the control command a_{k-1} is applied at time k - 1, the sensor will move from its current position s_{k-1} to a new position s_k(a_{k-1}), where a target with state x_k can be detected with probability

p_{D,k}(x_k; a_{k-1}) = N(s_k(a_{k-1}); H x_k, S) / N(0; 0, S),   (18)

where

H = [1 0 0 0; 0 1 0 0],   S = 10^6 [3 -2.4; -2.4 3.6].

The detection profile is illustrated in Fig. 3. The single-target transition density is f(x_k|x_{k-1}) = N(x_k; F x_{k-1}, Q), where

F = [1 0 T 0; 0 1 0 T; 0 0 1 0; 0 0 0 1],   Q = σ_ν² [T³/3 0 T²/2 0; 0 T³/3 0 T²/2; T²/2 0 T 0; 0 T²/2 0 T],

with T = 1s. Measurements are noisy position returns according to the single-target likelihood

g(z_k|x_k) = N(z_k; H x_k, R_k),

where R_k = σ²_{ε,k} I_2 with σ_{ε,k} = 3m. Clutter is modeled by a Poisson RFS with intensity κ(z) = λ c(z), where λ = 2 × 10^{-5} m^{-2} and c(z) = U([0, 1000m] × [0, 1000m]) is the uniform density over the surveillance area.

[Fig. 3. Initial positions of the sensor (♢) and targets (□). The contours depict the sensor's detection profile given in (18), in which the detection probability decreases with distance from the sensor (contour levels from 0.99 near the sensor down to 0.48).]

At time k - 1, the set A_{k-1} contains all admissible control commands that drive the sensor from the current position s_{k-1} = [s^{(x)}_{k-1}, s^{(y)}_{k-1}]^T to one of the following locations:

S_k = { [s^{(x)}_{k-1} + j∆R cos(ℓ∆θ), s^{(y)}_{k-1} + j∆R sin(ℓ∆θ)]^T }_{(j,ℓ)=(0,0)}^{(N_R,N_θ)},

where ∆θ = 2π/N_θ rad and ∆R = 50m are the angular and radial step sizes respectively. The numbers of radial and angular steps are N_R = 2 and N_θ = 8. The set S_k thus has 17 options in total, which discretize the angular and radial region around the current sensor position. The sensor is always kept inside the surveillance area by setting the value of the objective function corresponding to positions outside the surveillance area to -∞.

With these settings, it is expected that our control policy should, intuitively speaking, move the sensor towards the targets and remain in their vicinity in order to obtain a high detection probability. Fig. 4 depicts a typical sensor trajectory, which appears to be consistent with this intuitive expectation.

[Fig. 4. A typical sensor trajectory. Target start and stop positions are marked by □ and ∇, respectively. The red target died at k = 19 whereas the green target was born at k = 27. The sensor initially moved towards the targets and remained in their vicinity, then moved again to the middle of the existing targets and the new born target for optimal detection of all targets.]

We proceed to illustrate the performance of the proposed strategy. First, we compare the performance of the Cauchy-Schwarz divergence based control strategy to that of an existing Rényi divergence based control strategy proposed in [26]. Since the Rényi divergence in general has no closed form solution and thus must be approximated by Sequential Monte Carlo (SMC), we also implement the Cauchy-Schwarz divergence using an SMC approximation in order to enable a fair comparison. Second, the proposed GM implementation is then benchmarked against the SMC-based approach. When the objective function is approximated by SMC, the corresponding SMC-PHD filter [29] is used for recursive propagation of the posterior intensity function.
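The admissible position set S_k above is a small polar grid around the current sensor position. The following sketch is our construction (the helper name admissible_positions is illustrative); it builds the grid and confirms the count of 17.

```python
import numpy as np

def admissible_positions(s, n_r=2, n_theta=8, dr=50.0):
    # polar grid of candidate sensor positions around s; the (j, l) = (0, 0)
    # term contributes the current position itself, giving 1 + n_r * n_theta
    dtheta = 2.0 * np.pi / n_theta
    pts = {(round(s[0], 6), round(s[1], 6))}      # j = 0 term
    for j in range(1, n_r + 1):
        for l in range(n_theta):
            p = (s[0] + j * dr * np.cos(l * dtheta),
                 s[1] + j * dr * np.sin(l * dtheta))
            pts.add((round(p[0], 6), round(p[1], 6)))
    return [np.array(p) for p in pts]

S_k = admissible_positions(np.array([500.0, 500.0]))
print(len(S_k))   # 17
```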
All algorithms were implemented in MATLAB R2010b on a laptop with an Intel Core i5-3360 CPU and 8GB of RAM. The average run time for the Rényi divergence based strategy is 10.62 seconds (SMC-PHD filter implementation), while those for the Cauchy-Schwarz based strategies are 10.68 seconds (SMC-PHD filter implementation) and 3.21 seconds (GM-PHD filter implementation). It is evident that the closed form Cauchy-Schwarz divergence based strategy is the fastest.

Fig. 5 shows the Optimal SubPattern Assignment (OSPA) metric or miss distance [54] (with parameters p = 2, c = 100m) averaged over 200 Monte Carlo runs for each of the considered control strategies. The OSPA curves in Fig. 5 suggest that the closed form GM-PHD filter based strategy outperforms its approximate SMC-PHD filter based counterparts, while the performance of the two approximate SMC-PHD filter based strategies is virtually identical. These numerical results suggest that the Cauchy-Schwarz divergence can be at least as effective as the Rényi divergence when used as a reward function for multi-target sensor control. The results further suggest that the former has the distinct advantage of admitting a GM implementation, which leads to superior performance due to its closed form solution and better filtering capability.

[Fig. 5. Comparison of the averaged OSPA distance generated by different control strategies. While SMC-PHD implementations for the Rényi divergence (dashed line) and the Cauchy-Schwarz divergence (starred line) yield similar results, they are outperformed by the GM-PHD implementation (solid line) due to the closed form solution for the Cauchy-Schwarz divergence and better filtering performance.]

V. CONCLUSIONS

In this paper, we have extended the notion of the Cauchy-Schwarz divergence to point processes, and have shown that for an appropriate choice of reference measure, the Cauchy-Schwarz divergence between the probability densities of two Poisson point processes is half the squared distance between their intensity functions. We have extended this result to mixtures of Poisson point processes and derived closed form expressions for the Cauchy-Schwarz divergence when the intensity functions are Gaussian mixtures. The Cauchy-Schwarz divergence for probability densities is not necessarily invariant to the choice of reference measure. Nonetheless, the Cauchy-Schwarz divergence for the square roots of probability densities, or equivalently, the Bhattacharyya distance for probability measures, is invariant to the choice of reference measure. For Poisson point processes, our result implies that the Bhattacharyya distance between the probability distributions is equal to the square of the Hellinger distance between the intensity measures, which in turn is the difference between the mean of the expected numbers of points of the two processes and the Bhattacharyya coefficient of their intensity measures. We have illustrated an application of our result on a sensor control problem for multi-target tracking where the system state is modeled as a point process. Our result is an addition to the list of interesting properties of Poisson point processes and has important implications in the approximation of point processes.
APPENDIX A
LINEAR GAUSSIAN SYSTEM MODEL

In a linear Gaussian multi-target model, each constituent element x_{k-1} of the multi-target state X_{k-1} at time k - 1 either continues to exist at time k with probability p_{S,k} or dies with probability 1 - p_{S,k}, and conditional on its existence at time k, transitions from x_{k-1} to x_k with probability density

f(x_k|x_{k-1}) = N(x_k; F_{k-1} x_{k-1}, Q_{k-1}).   (19)

The set of surviving targets at time k is thus a Multi-Bernoulli point process or RFS [26], [51]–[53]. New targets can arise at time k either by spontaneous births, or by spawning from targets at time k - 1. The sets of birth targets and spawned targets are modeled as Poisson point processes with respective Gaussian mixture intensity functions

γ_k(x) = Σ_{i=1}^{J_{γ,k}} w_{γ,k}^{(i)} N(x; m_{γ,k}^{(i)}, P_{γ,k}^{(i)}),
β_{k|k-1}(x|ζ) = Σ_{j=1}^{J_{β,k}} w_{β,k}^{(j)} N(x; F_{β,k-1}^{(j)} ζ + d_{β,k-1}^{(j)}, Q_{β,k-1}^{(j)}).

The multi-target state is hidden and is partially observed by a sensor driven by the control vector a_{k-1} at time k - 1. Each target evolves and generates observations independently of one another. A target with state x_k is detected by the sensor with probability

p_{D,k}(x; a_{k-1}) = Σ_{j=0}^{J_{D,k}} w_{D,k}^{(j)} N(x; m_{D,k}^{(j)}(a_{k-1}), P_{D,k}^{(j)})   (20)

(or missed with probability 1 - p_{D,k}(x_k; a_{k-1})), and conditional on detection generates a measurement z_k according to the probability density

g_k(z_k|x_k) = N(z_k; H_k x_k, R_k).

The detections corresponding to targets thus form a Multi-Bernoulli point process [26], [51]–[53]. The sensor also registers a set of spurious measurements (clutter), independent of the detections, modeled as a Poisson point process with intensity κ_k. Thus, at each time step the measurement is a collection of detections Z_k, only some of which are generated by targets.
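To make the model concrete, the sketch below simulates a single time step under the dynamics and observation model above. It is a hedged illustration of ours: the numeric parameters are arbitrary, spawning is omitted, and the detection probability is taken constant rather than the Gaussian mixture form (20).

```python
import numpy as np

rng = np.random.default_rng(1)

def step(X, F, Q, p_s, birth_rate, birth_sampler):
    # survival thinning + linear Gaussian transition, cf. (19)
    survivors = [rng.multivariate_normal(F @ x, Q)
                 for x in X if rng.random() < p_s]
    # spontaneous births: a Poisson point process (spawning omitted here)
    births = [birth_sampler() for _ in range(rng.poisson(birth_rate))]
    return survivors + births

def observe(X, H, R, p_d, clutter_rate, clutter_sampler):
    # Multi-Bernoulli detections (constant p_d for brevity) plus clutter
    Z = [rng.multivariate_normal(H @ x, R) for x in X if rng.random() < p_d]
    Z += [clutter_sampler() for _ in range(rng.poisson(clutter_rate))]
    return Z

T = 1.0
F = np.block([[np.eye(2), T * np.eye(2)], [np.zeros((2, 2)), np.eye(2)]])
Q = 0.1 * np.eye(4)
H = np.hstack([np.eye(2), np.zeros((2, 2))])
R = 9.0 * np.eye(2)
X = [np.array([100., 100., 5., 5.]), np.array([800., 300., -5., 0.])]
X = step(X, F, Q, p_s=0.98, birth_rate=0.2,
         birth_sampler=lambda: rng.multivariate_normal(
             np.array([500., 500., 0., 0.]), np.diag([1e4, 1e4, 25., 25.])))
Z = observe(X, H, R, p_d=0.95, clutter_rate=2.0,
            clutter_sampler=lambda: rng.uniform(0., 1000., size=2))
print(len(X), len(Z))
```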
APPENDIX B
POSTERIOR INTENSITY PROPAGATION

In general, the posterior intensity function is propagated recursively in time via the Probability Hypothesis Density (PHD) filter [16]. For the linear Gaussian multi-target system described in Appendix A, the posterior intensity function is propagated via the Gaussian Mixture PHD (GM-PHD) filter [50] as follows. Here, we use a slightly different technique from that in [55], which proposes an approximate propagation for the original GM-PHD filter in order to mitigate computational issues involving negative Gaussian mixture weights which arise due to a state dependent detection probability. For notational compactness we omit the time index on the state variable and the conditioning on the measurement history in expressions involving the posterior intensity function.

Prediction: If the posterior intensity at time k - 1 is a Gaussian mixture of the form

v_{k-1}(x) = Σ_{i=1}^{J_{k-1}} w_{k-1}^{(i)} N(x; m_{k-1}^{(i)}, P_{k-1}^{(i)}),

then the predicted intensity at time k is also a Gaussian mixture and is given by

v_{k|k-1}(x) = v_{S,k|k-1}(x) + v_{β,k|k-1}(x) + γ_k(x),

where

v_{S,k|k-1}(x) = p_{S,k} Σ_{i=1}^{J_{k-1}} w_{k-1}^{(i)} N(x; m_{S,k|k-1}^{(i)}, P_{S,k|k-1}^{(i)}),
m_{S,k|k-1}^{(i)} = F_{k-1} m_{k-1}^{(i)},
P_{S,k|k-1}^{(i)} = Q_{k-1} + F_{k-1} P_{k-1}^{(i)} F_{k-1}^T,
v_{β,k|k-1}(x) = Σ_{i=1}^{J_{k-1}} Σ_{j=1}^{J_{β,k}} w_{k-1}^{(i)} w_{β,k}^{(j)} N(x; m_{β,k|k-1}^{(i,j)}, P_{β,k|k-1}^{(i,j)}),
m_{β,k|k-1}^{(i,j)} = F_{β,k-1}^{(j)} m_{k-1}^{(i)} + d_{β,k-1}^{(j)},
P_{β,k|k-1}^{(i,j)} = Q_{β,k-1}^{(j)} + F_{β,k-1}^{(j)} P_{k-1}^{(i)} [F_{β,k-1}^{(j)}]^T.

Update: If the predicted intensity and detection probability are Gaussian mixtures of the form

v_{k|k-1}(x) = Σ_{i=1}^{J_{k|k-1}} w_{k|k-1}^{(i)} N(x; m_{k|k-1}^{(i)}, P_{k|k-1}^{(i)})

and (20), then the posterior intensity at time k is given by

v_k(x; Z_k(a_{k-1})) = v_{M,k}(x; a_{k-1}) + Σ_{z∈Z_k(a_{k-1})} v_{D,k}(x; z),

where

v_{M,k}(x; a_{k-1}) = Σ_{i=1}^{J_{k|k-1}} w_{M,k}^{(i)}(a_{k-1}) N(x; m_{k|k-1}^{(i)}, P_{k|k-1}^{(i)}),
w_{M,k}^{(i)}(a_{k-1}) = w_{µ,k}^{(i)}(a_{k-1}) T_k(a_{k-1}) / Σ_{i=1}^{J_{k|k-1}} w_{µ,k}^{(i)}(a_{k-1}),
w_{µ,k}^{(i)}(a_{k-1}) = [1 - p_{D,k}(m_{k|k-1}^{(i)}; a_{k-1})] w_{k|k-1}^{(i)},
T_k(a_{k-1}) = Σ_{i=1}^{J_{k|k-1}} w_{k|k-1}^{(i)} - Σ_{i=1}^{J_{k|k-1}} Σ_{j=0}^{J_{D,k}} w_{k|k-1}^{(i,j)}(a_{k-1}),
w_{k|k-1}^{(i,j)}(a_{k-1}) = w_{k|k-1}^{(i)} w_{D,k}^{(j)} q_{k|k-1}^{(i,j)}(a_{k-1}),
q_{k|k-1}^{(i,j)}(a_{k-1}) = N(m_{D,k}^{(j)}(a_{k-1}); m_{k|k-1}^{(i)}, P_{k|k-1}^{(i)} + P_{D,k}^{(j)}),

and

v_{D,k}(x; z) = Σ_{i=1}^{J_{k|k-1}} Σ_{j=0}^{J_{D,k}} w_k^{(i,j)}(z) N(x; m_{k|k}^{(i,j)}(z), P_{k|k}^{(i,j)}),
w_k^{(i,j)}(z) = w_{k|k-1}^{(i,j)}(a_{k-1}) q_k^{(i,j)}(z) / [ κ_k(z) + Σ_{i=1}^{J_{k|k-1}} Σ_{j=0}^{J_{D,k}} w_{k|k-1}^{(i,j)}(a_{k-1}) q_k^{(i,j)}(z) ],
q_k^{(i,j)}(z) = N(z; H_k m_{k|k-1}^{(i,j)}, R_k + H_k P_{k|k-1}^{(i,j)} H_k^T),
m_{k|k-1}^{(i,j)} = m_{k|k-1}^{(i)} + K_{k|k-1}^{(i,j)} [m_{D,k}^{(j)}(a_{k-1}) - m_{k|k-1}^{(i)}],
P_{k|k-1}^{(i,j)} = [I - K_{k|k-1}^{(i,j)}] P_{k|k-1}^{(i)},
K_{k|k-1}^{(i,j)} = P_{k|k-1}^{(i)} [P_{k|k-1}^{(i)} + P_{D,k}^{(j)}]^{-1},
m_{k|k}^{(i,j)}(z) = m_{k|k-1}^{(i,j)} + K_k^{(i,j)} [z - H_k m_{k|k-1}^{(i,j)}],
P_{k|k}^{(i,j)} = [I - K_k^{(i,j)} H_k] P_{k|k-1}^{(i,j)},
K_k^{(i,j)} = P_{k|k-1}^{(i,j)} H_k^T (H_k P_{k|k-1}^{(i,j)} H_k^T + R_k)^{-1},

and by convention q_{k|k-1}^{(i,0)} = 1, m_{k|k-1}^{(i,0)} = m_{k|k-1}^{(i)}, and P_{k|k-1}^{(i,0)} = P_{k|k-1}^{(i)}.

REFERENCES

[1] H. G. Hoang, B.-N. Vo, B.-T. Vo, and R. Mahler, "The Cauchy-Schwarz divergence for Poisson point processes," in Proc. IEEE Workshop on Statistical Signal Processing (SSP 2014), June 2014, pp. 240–243.
[2] D. Daley and D. Vere-Jones, An Introduction to the Theory of Point Processes. Springer-Verlag, 1988.
[3] D. Stoyan, D. Kendall, and J. Mecke, Stochastic Geometry and its Applications. John Wiley & Sons, 1995.
[4] J. Kingman, Poisson Processes. Oxford University Press, 1993.
[5] M. N. M. van Lieshout, Markov Point Processes and Their Applications. Imperial College Press, 2000.
[6] J. Moller and R. Waagepetersen, Statistical Inference and Simulation for Spatial Point Processes. Chapman & Hall/CRC, 2004.
[7] A. Baddeley, I. Bárány, R. Schneider, and W. Weil, Stochastic Geometry: Lectures Given at the C.I.M.E. Summer School Held in Martina Franca, Italy, September 13-18, 2004. Springer, 2007.
[8] F. Baccelli and B. Blaszczyszyn, "Stochastic geometry and wireless networks: Volume I Theory," Foundations and Trends in Networking, vol. 3, no. 3-4, pp. 249–449, Mar. 2009.
[9] D. Stoyan and A. Penttinen, "Recent applications of point process methods in forestry statistics," Statistical Science, vol. 15, no. 1, pp. 61–78, 2000.
[10] Y. Ogata, "Seismicity analysis through point-process modeling: A review," Pure and Applied Geophysics, vol. 155, no. 2-4, pp. 471–507, 1999.
[11] V. Marmarelis and T. Berger, "General methodology for nonlinear modeling of neural systems with Poisson point-process inputs," Mathematical Biosciences, vol. 196, no. 1, pp. 1–13, 2005.
[12] D. L. Snyder, L. J. Thomas, and M. M. Ter-Pogossian, "A mathematical model for positron-emission tomography systems having time-of-flight measurements," IEEE Trans. Nucl. Sci., vol. 28, no. 3, pp. 3575–3583, 1981.
[13] F. Baccelli, M. Klein, M. Lebourges, and S. A. Zuyev, "Stochastic geometry and architecture of communication networks," Telecommunication Systems, vol. 7, no. 1-3, pp. 209–227, 1997.
[14] M. Haenggi, "On distances in uniformly random networks," IEEE Trans. Inf. Theory, vol. 51, no. 10, pp. 3584–3586, 2005.
[15] M. Haenggi, J. Andrews, F. Baccelli, O. Dousse, and M. Franceschetti, "Stochastic geometry and random graphs for the analysis and design of wireless networks," IEEE J. Sel. Areas Commun., vol. 27, no. 7, pp. 1029–1046, 2009.
[16] R. Mahler, "Multitarget Bayes filtering via first-order multitarget moments," IEEE Trans. Aerosp. Electron. Syst., vol. 39, no. 4, pp. 1152–1178, Oct 2003.
[17] S. S. Singh, B.-N. Vo, A. J. Baddeley, and S. A. Zuyev, "Filters for spatial point processes," SIAM J. Control and Optimization, vol. 48, no. 4, pp. 2275–2295, 2009.
[18] F. Caron, P. Del Moral, A. Doucet, and M. Pace, "On the conditional distributions of spatial point processes," Advances in Applied Probability, vol. 43, no. 2, pp. 301–307, 2011.
[19] D. R. Cox and V. Isham, Point Processes. Chapman & Hall, 1980.
[20] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience, 1991.
[21] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B (Methodological), vol. 28, no. 1, pp. 131–142, 1966.
[22] W. Schmaedeke, "Information based sensor management," in Proc. SPIE Signal Processing, Sensor Fusion, and Target Recognition II, vol. 155, 1993, pp. 156–164.
[23] R. P. S. Mahler, "Global posterior densities for sensor management," in Proc. SPIE, vol. 3365, 1998, pp. 252–263.
[24] S. Singh, N. Kantas, B.-N. Vo, A. Doucet, and R. Evans, "Simulation based optimal sensor scheduling with application to observer trajectory planning," Automatica, vol. 43, no. 5, pp. 817–830, 2007.
[25] A. O. Hero, C. M. Kreucher, and D. Blatt, "Information theoretic approaches to sensor management," in Foundations and Applications of Sensor Management, A. O. Hero, D. A. Castanón, D. Cochran, and K. Kastella, Eds. Springer, 2008, ch. 3, pp. 33–57.
[26] B. Ristic, B.-N. Vo, and D. Clark, "A note on the reward function for PHD filters with sensor control," IEEE Trans. Aerosp. Electron. Syst., vol. 47, no. 2, pp. 1521–1529, 2011.
[27] H. G. Hoang, "Control of a mobile sensor for multi-target tracking using Multi-Target/Object Multi-Bernoulli filter," in Proc. International Conference on Control, Automation and Information Sciences (ICCAIS 2012), Ho Chi Minh City, Vietnam, 2012, pp. 7–12.
[28] J. D. Victor, "Spike train metrics," Current Opinion in Neurobiology, vol. 15, no. 5, pp. 585–592, 2005.
[29] B.-N. Vo, S. Singh, and A. Doucet, "Sequential Monte Carlo methods for multi-target filtering with random finite sets," IEEE Trans. Aerosp. Electron. Syst., vol. 41, no. 4, pp. 1224–1245, 2005.
[30] R. Mahler, Statistical Multisource-Multitarget Information Fusion. Artech House, 2007.
[31] K. Kampa, E. Hasanbelliu, and J. C. Principe, "Closed-form Cauchy-Schwarz PDF divergence for mixture of Gaussians," in Proc. International Joint Conference on Neural Networks (IJCNN 2011). IEEE, 2011, pp. 2578–2585.
[32] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[33] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Trans. Inf. Theory, vol. 37, no. 1, pp. 145–151, 1991.
[34] F. Wang, T. Syeda-Mahmood, B. Vemuri, D. Beymer, and A. Rangarajan, "Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration," in Medical Image Computing and Computer-Assisted Intervention (MICCAI 2009).
Springer, 2009, vol. 5761, pp. 648–655.
[35] R. Jenssen, D. Erdogmus, K. E. Hild, J. C. Principe, and T. Eltoft, "Optimizing the Cauchy-Schwarz PDF distance for information theoretic, non-parametric clustering," in Proc. 5th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2005.
[36] T. Villmann, B. Hammer, F.-M. Schleif, T. Geweniger, T. Fischer, and M. Cottrell, "Prototype based classification using information theoretic learning," in Neural Information Processing, I. King, J. Wang, L.-W. Chan, and D. Wang, Eds. Springer Berlin Heidelberg, 2006, vol. 4233, pp. 40–49.
[37] R. Jenssen, J. C. Principe, D. Erdogmus, and T. Eltoft, "The Cauchy-Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels," Journal of the Franklin Institute, vol. 343, no. 6, pp. 614–629, 2006.
[38] E. Hasanbelliu, L. Sanchez-Giraldo, and J. Principe, "A robust point matching algorithm for non-rigid registration using the Cauchy-Schwarz divergence," in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2011), 2011, pp. 1–6.
[39] K. DeMars, I. Hussein, M. Jah, and R. Erwin, "The Cauchy-Schwarz divergence for assessing situational information gain," in Proc. International Conference on Information Fusion (FUSION 2012), 2012, pp. 1126–1133.
[40] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2005.
[41] F. Le Gall, "Powers of tensors and fast matrix multiplication," in Proc. 39th International Symposium on Symbolic and Algebraic Computation, ser. ISSAC '14, 2014, pp. 296–303.
[42] J.-H. Lo, "Finite-dimensional sensor orbits and optimal nonlinear filtering," IEEE Trans. Inf. Theory, vol. 18, no. 5, pp. 583–588, 1972.
[43] C. Ji, D. Merl, T. B. Kepler, and M. West, "Spatial mixture modelling for unobserved point processes: Examples in immunofluorescence histology," Bayesian Analysis, vol. 4, pp. 297–316, 2009.
[44] A. Kottas and S. Behseta, "Bayesian nonparametric modeling for comparison of single-neuron firing intensities," Biometrics, vol. 66, pp. 277–286, 2010.
[45] M. Taddy, "Autoregressive mixture models for dynamic spatial Poisson processes: Application to tracking intensity of violent crime," Journal of the American Statistical Association, vol. 105, pp. 1403–1417, 2010.
[46] D. Phung and B.-N. Vo, "A random finite set model for data clustering," in Proc. International Conference on Information Fusion (FUSION 2014), July 2014, pp. 1–8.
[47] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distribution," Bulletin of the Calcutta Mathematical Society, vol. 35, pp. 99–110, 1943.
[48] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Trans. Inf. Theory, vol. 15, no. 1, pp. 52–60, 1967.
[49] I. R. Goodman, R. Mahler, and H. T. Nguyen, Mathematics of Data Fusion. Kluwer Academic Publishers, 1997.
[50] B.-N. Vo and W.-K. Ma, "The Gaussian mixture probability hypothesis density filter," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4091–4104, 2006.
[51] R. Mahler, "Multitarget sensor management of dispersed mobile sensors," in Theory and Algorithms for Cooperative Systems, D. Grundel, R. Murphey, and P. Pardalos, Eds. World Scientific Books, 2004, ch. 12, pp. 239–310.
[52] H. G. Hoang and B.-T. Vo, "Sensor management for multi-target tracking via multi-Bernoulli filtering," Automatica, vol. 50, no. 4, pp. 1135–1142, 2014.
[53] B. Ristic and B.-N. Vo, "Sensor control for multi-object state-space estimation using random finite sets," Automatica, vol. 46, no. 11, pp. 1812–1818, 2010.
[54] D. Schuhmacher, B.-T. Vo, and B.-N. Vo, "A consistent metric for performance evaluation of multi-object filters," IEEE Trans. Signal Process., vol. 56, no. 8, pp. 3447–3457, 2008.
[55] M. Ulmke, O. Erdinc, and P. Willett, "GMTI tracking via the Gaussian Mixture Cardinalized Probability Hypothesis Density filter," IEEE Trans. Aerosp. Electron. Syst., vol. 46, no. 4, pp. 1821–1833, 2010.