Statistical Modeling of Photographic Images


To appear as Chapter 4.7 in: Handbook of Image and Video Processing, 2nd edition, ed. Alan Bovik. © Academic Press, 2005.

Eero P. Simoncelli
New York University
January 18, 2005

The set of all possible visual images is huge, but not all of these are equally likely to be encountered by an imaging device such as the eye. Knowledge of this nonuniform probability on the image space is known to be exploited by biological visual systems, and can be used to advantage in the majority of applications in image processing and machine vision. For example, loosely speaking, when one observes a visual image that has been corrupted by some sort of noise, the process of estimating the original source image may be viewed as one of looking for the highest-probability image that is "close to" the noisy observation. The problem of compression essentially boils down to using a larger proportion of the available bits to encode those regions of the image space that are more likely. And problems such as resolution enhancement or image synthesis involve selecting (sampling) a high-probability image from the distribution, perhaps subject to some set of constraints. Precise developments of such applications can be found in many chapters throughout this book.

In order to develop a probability model for visual images, we first must decide which images to model. In a practical sense, this means we must (a) decide on imaging conditions, such as the field of view, resolution, sensor or post-processing nonlinearities, etc., and (b) decide what kind of scenes, under what kind of lighting, are to be captured in the images. It may seem odd, if one has not encountered such models, to imagine that all images are drawn from a single universal probability urn. In particular, the features and properties in any given image are often specialized. For example, outdoor nature scenes contain structures that are quite different from city streets, which in turn are nothing like human faces. There are two means by which this dilemma is resolved. First, the statistical properties that we will examine are basic enough that they are relevant for essentially all visual scenes. Second, we will use parametric models, in which a set of hyperparameters (possibly random variables themselves) govern the detailed behavior of the model, and thus allow a certain degree of adaptability of the model to different types of source material.

How does one build and test a probability model for images? Many approaches have been developed, but in this chapter, we'll describe an empirically driven methodology, based on the study of discretized (pixelated) images. Currently available digital cameras record such images, typically containing millions of pixels. Naively, one could imagine examining a large set of such images to try to determine how they are distributed. But a moment's thought leads one to realize the hopelessness of the endeavor. The amount of data needed to estimate a probability distribution from samples grows as $K^D$, where $D$ is the dimensionality of the space (in this case, the number of pixels). This is known as the "curse of dimensionality". Thus, in order to make progress on image modeling, it is essential to reduce the dimensionality of the space. Two types of simplifying assumption can help in this regard.
The first, known as a Markov assumption, is that the probability density of a pixel, when conditioned on a set of pixels in a small spatial neighborhood, is independent of the pixels outside of the neighborhood. A second type of simplification comes from imposing certain symmetries or invariances on the probability structure. The most common of these is translation-invariance (sometimes called homogeneity, or strict-sense stationarity): the distribution of pixels in a neighborhood does not depend on the absolute location of that neighborhood within the image. This seems intuitively sensible, given that a lateral or vertical translation of the camera leads approximately to a translation of the image intensities across the pixel array. Note that translation-invariance is not well defined at the boundaries, and as is often the case in image processing, these locations must usually be handled specially.

Another common assumption is scale-invariance: resizing the image does not alter the probability structure. This may also be loosely justified by noting that adjusting the focal length (zoom) of a camera lens approximates (apart from perspective distortions) image resizing. As with translation-invariance, scale-invariance will clearly fail to hold at certain "boundaries". Specifically, scale-invariance must fail for discretized images at fine scales approaching the size of the pixels. And similarly, it must also fail for finite-size images at coarse scales approaching the size of the entire image.

With these sorts of simplifying structural assumptions in place, we can return to the problem of developing a probability model. In recent years, researchers from image processing, computer vision, physics, applied math and statistics have proposed a wide variety of different types of model. In this chapter, I'll review some basic statistical properties of photographic images, as observed empirically, and describe several models that have been developed to incorporate these properties. I'll give some indication of how these models have been validated by examining how well they fit the data, but the true test usually comes when one uses the model to solve an image processing problem (such as compression or denoising). Although this is somewhat beyond the scope of this chapter, I'll show some simple denoising examples to give an indication of how much performance gain one can obtain by using a better statistical model. In order to keep the discussion focused, I'll limit the discussion to discretized grayscale photographic images. Many of the principles are easily extended to color photographs [8, 43] or temporal image sequences (movies) [15], as well as to more specialized image classes such as portraits, landscapes, or textures.
In addition, the general approach may also be applied to non-visual imaging devices, such as medical images, infrared images, radar and other types of range image, or astronomical images.

1 The Gaussian Model

The classical model of image statistics was developed by television engineers in the 1950s (see [41] for a review), who were interested in optimal signal representation and transmission. The most basic motivation for these models comes from the observation that pixels at nearby locations tend to have similar intensity values. This is easily confirmed by measurements like those shown in Fig. 1(a). Each scatterplot shows values of a pair of pixels with a given relative spatial displacement. Implicit in these measurements is the assumption of homogeneity mentioned in the introduction: the distributions are assumed to be independent of the absolute location within the image. And although the pixels were taken from a single photographic image (in this case, a New York City street scene), they are nevertheless representative of what one sees in most visual images.

Fig. 1. (a) Scatterplots of pairs of pixels at three different spatial displacements (∆x = 1, 4, 16), averaged over five example images. (b) Autocorrelation function (normalized correlation as a function of ∆x, in pixels). Photographs are of New York City street scenes, taken with a Canon 10D digital camera, and processed in RAW linear sensor mode (producing pixel intensities that are roughly proportional to light intensity). Correlations were computed on the logs of these sensor intensity values [41].

The most striking behavior observed in the plots is that the pixel values are highly correlated: when one is large, the other tends to also be large. But this correlation falls with the distance between pixels. This behavior is summarized in Fig. 1(b), which shows the image autocorrelation (pixel correlation as a function of separation).
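The measurements behind Fig. 1 are straightforward to reproduce. The sketch below is not the code used for the figure; it assumes a grayscale image already loaded as a NumPy array img, with intensities roughly proportional to light, and considers only horizontal displacements. It collects pixel pairs at each separation and computes the normalized correlation of their log values, as in Fig. 1(b).

```python
import numpy as np

def pixel_pair_correlation(img, max_sep=64):
    """Normalized correlation of log-intensity pixel pairs separated
    horizontally by 1..max_sep pixels (cf. Fig. 1(b))."""
    x = np.log(img.astype(np.float64) + 1e-6)   # log sensor values, as in Fig. 1
    x = x - x.mean()
    corrs = []
    for d in range(1, max_sep + 1):
        a = x[:, :-d].ravel()                   # left member of each pair
        b = x[:, d:].ravel()                    # right member, displaced by d
        corrs.append(np.mean(a * b) / np.sqrt(np.mean(a * a) * np.mean(b * b)))
    return np.array(corrs)
```

A scatterplot of a against b for a fixed displacement d reproduces the panels of Fig. 1(a).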
The correlation statistics of Fig. 1 place a strong constraint on the structure of images, but they do not provide a full probability model. Specifically, there are many probability densities that would share the same correlation (or equivalently, covariance) structure. How should we choose a model from amongst this set? One natural criterion is to select a density that has maximal entropy, subject to the covariance constraint [25]. Solving for this density turns out to be relatively straightforward, and the result is a multi-dimensional Gaussian:

$$P(\vec{x}) \propto \exp\left(-\vec{x}^{\,T} C_x^{-1}\, \vec{x}/2\right), \qquad (1)$$

where $\vec{x}$ is a vector containing all of the image pixels (assumed, for notational simplicity, to be zero-mean) and $C_x \equiv \mathbb{E}[\vec{x}\vec{x}^{\,T}]$ is the covariance matrix.

Gaussian densities are more succinctly described by transforming to a coordinate system in which the covariance matrix is diagonal. This is easily achieved using standard linear algebra techniques: $\vec{y} = E^T \vec{x}$, where $E$ is an orthogonal matrix containing the eigenvectors of $C_x$, such that

$$C_x = E D E^T \quad\Longrightarrow\quad E^T C_x E = D, \qquad (2)$$

with $D$ a diagonal matrix containing the associated eigenvalues. When the probability distribution on $\vec{x}$ is stationary (assuming periodic handling of boundaries), the covariance matrix $C_x$ will be circulant. In this special case, the Fourier transform is known in advance to be a diagonalizing transformation matrix $E$, and is guaranteed to satisfy the relationship of Eq. (2).

In order to complete the Gaussian image model, we need only specify the entries of the diagonal matrix $D$, which correspond to the variances of frequency components in the Fourier transform. There are two means of arriving at an answer. First, setting aside the caveats mentioned in the introduction, we can assume that image statistics are scale-invariant. Specifically, suppose that the second-order (covariance) statistical properties of the image are invariant to resizing of the image. We can express scale-invariance in the Fourier domain as:

$$\mathbb{E}\left[|F(s\omega)|^2\right] = h(s)\, \mathbb{E}\left[|F(\omega)|^2\right], \qquad \forall\, \omega, s.$$

That is, rescaling the frequency axis does not change the shape of the function; it merely multiplies the spectrum by a constant. The only functions that satisfy this identity are power laws:

$$\mathbb{E}\left[|F(\omega)|^2\right] = \frac{A}{\omega^\gamma},$$

where the exponent γ controls the rate at which the spectrum falls. Thus, the dual assumptions of translation- and scale-invariance constrain the covariance structure of images to a model with a single parameter!

Fig. 2. Power spectral estimates for five example images (see Fig. 1 for image description), as a function of spatial frequency, averaged over orientation (plotted as log2(power) versus log2(frequency/π)). These are well described by power-law functions with an exponent, γ, slightly larger than 2.0.

Alternatively, the form of the power spectrum may be estimated empirically [e.g., 14, 18, 50, 42, 53]. For many "typical" images, it turns out to be quite well approximated by a power law, thus providing confirmation of the scale-invariance property for second-order statistics. In these empirical measurements, the value of the exponent is typically near two. Examples of power spectral estimates for several example images are shown in Fig. 2. It has also been demonstrated that scale-invariance holds for statistics other than the power spectrum [e.g., 42, 52].

The spectral model is the classic model of image processing. In addition to accounting for spectra of typical image data, the simplicity of the Gaussian form leads to direct solutions for image compression and denoising that may be found in essentially any textbook on image processing. As an example, consider the problem of removing additive Gaussian white noise from an image, $\vec{x}$. The degradation process is described by the conditional density of the observed (noisy) image, $\vec{y}$, given the original (clean) image $\vec{x}$:

$$P(\vec{y}\,|\,\vec{x}) \propto \exp\left(-\|\vec{y}-\vec{x}\|^2/2\sigma_n^2\right),$$

where $\sigma_n^2$ is the variance of the noise. Using Bayes' Rule, we can reverse the conditioning by multiplying by the prior probability density on $\vec{x}$:

$$P(\vec{x}\,|\,\vec{y}) \propto \exp\left(-\|\vec{y}-\vec{x}\|^2/2\sigma_n^2\right) \cdot P(\vec{x}).$$

An estimate for $\vec{x}$ may now be obtained from this posterior density. One can, for example, choose the $\vec{x}$ that maximizes the probability (the maximum a posteriori or MAP estimate), or the mean of the density (the Bayes Least Squares or BLS estimate). In the case of the Gaussian prior of Eq. (1), these two solutions are identical, and the solution is linear in the observed (noisy) image $\vec{y}$:

$$\hat{x}(\vec{y}) = C_x\,(C_x + \sigma_n^2 I)^{-1}\, \vec{y}.$$

Finally, the solution may be rewritten in the Fourier domain, where the scale-invariance of the power spectrum may be explicitly incorporated:

$$\hat{X}(\omega) = \frac{A/\omega^\gamma}{A/\omega^\gamma + \sigma_n^2}\cdot Y(\omega),$$

where $\hat{X}(\omega)$ and $Y(\omega)$ are the Fourier transforms of $\hat{x}(\vec{y})$ and $\vec{y}$, respectively. Thus, the estimate may be computed by linearly rescaling each Fourier coefficient. In order to apply this denoising method, one must be given (or must estimate) the parameters $A$, γ and $\sigma_n$.
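As a concrete illustration, here is a minimal sketch of this Fourier-domain estimator, assuming a grayscale image noisy stored as a 2-D NumPy array and assumed (rather than estimated) values for A, γ, and σn; the DC component is simply passed through unchanged.

```python
import numpy as np

def spectral_denoise(noisy, A, gamma, sigma_n):
    """Linear denoiser under the power-law spectral prior: rescale each
    Fourier coefficient by (A/w^gamma) / (A/w^gamma + sigma_n^2)."""
    h, w = noisy.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    omega = 2 * np.pi * np.hypot(fx, fy)     # radial spatial frequency
    omega[0, 0] = 1.0                        # placeholder; DC gain is set below
    prior_power = A / omega ** gamma         # assumed power-law signal spectrum
    gain = prior_power / (prior_power + sigma_n ** 2)
    gain[0, 0] = 1.0                         # leave the mean (DC) untouched
    return np.real(np.fft.ifft2(gain * np.fft.fft2(noisy)))
```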
Despite the simplicity and tractability of the Gaussian model, it is easy to see that the model provides a rather weak description. In particular, while the model strongly constrains the amplitudes of the Fourier coefficients, it places no constraint on their phases. When one randomizes the phases of an image, the appearance is completely destroyed [37].

As a direct test, one can draw sample images from the distribution by simply generating white noise in the Fourier domain, weighting each sample appropriately by $1/\omega^\gamma$, and then inverting the transform to generate an image. An example is shown in Fig. 3. The fact that such an experiment invariably produces images of clouds implies that a covariance constraint is insufficient to capture the richer structure of features that are found in most real images.

Fig. 3. Example image randomly drawn from the Gaussian spectral model, with γ = 2.0.
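A sketch of this sampling experiment is given below (a minimal version, not the code used for Fig. 3). Note that a power spectrum falling as $1/\omega^\gamma$ corresponds to weighting the Fourier amplitudes of the white noise by $1/\omega^{\gamma/2}$.

```python
import numpy as np

def sample_spectral_model(size=256, gamma=2.0, seed=0):
    """Draw an image from the Gaussian spectral model: white Gaussian noise
    in the Fourier domain, amplitude-weighted by 1/omega^(gamma/2) (cf. Fig. 3)."""
    rng = np.random.default_rng(seed)
    white = np.fft.fft2(rng.standard_normal((size, size)))
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    omega = np.hypot(fx, fy)                                     # radial frequency
    weight = np.zeros_like(omega)
    weight[omega > 0] = 1.0 / omega[omega > 0] ** (gamma / 2)    # DC weight left at zero
    sample = np.real(np.fft.ifft2(white * weight))
    return (sample - sample.min()) / (sample.max() - sample.min())  # rescale for display
```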
2 The Wavelet Marginal Model

For decades, the inadequacy of the Gaussian model was apparent. But direct improvement, through introduction of constraints on the Fourier phases, turned out to be quite difficult. Relationships between phase components are not easily measured, in part because of the difficulty of working with joint statistics of circular variables, and in part because the dependencies between phases of different frequencies do not seem to be well captured by a model that is localized in frequency. A breakthrough occurred in the 1980s, when a number of authors began to describe more direct indications of non-Gaussian behaviors in images. Specifically, a multidimensional Gaussian statistical model has the property that all conditional or marginal densities must also be Gaussian. But these authors noted that histograms of bandpass-filtered natural images were highly non-Gaussian [9, 18, 13, 31, 58]. These marginals tend to be much more sharply peaked at zero, with more extensive tails, when compared with a Gaussian of the same variance. As an example, Fig. 4 shows histograms of several example images, filtered with a Gabor function (a Gaussian-windowed sinusoidal grating). The intuitive reason for this behavior is that images typically contain smooth regions, punctuated by localized "features" such as lines, edges or corners. The smooth regions lead to small filter responses that generate the sharp peak at zero, and the localized features produce large-amplitude responses that generate the extensive tails.

Fig. 4. Log histograms of a single wavelet subband of four example images (see Fig. 1 for image description). For each histogram, tails are truncated so as to show 99.8% of the distribution. Also shown (dashed lines) are fitted model densities corresponding to Eq. (3). Text indicates the maximum-likelihood value of p used for the fitted model density (p = 0.46, 0.48, 0.58, 0.59), and the relative entropy (Kullback-Leibler divergence) of the model and histogram, as a fraction of the total entropy of the histogram (∆H/H = 0.0031, 0.0014, 0.0011, 0.0012).

This basic behavior holds for essentially any bandpass filter, whether it is non-directional (center-surround) or oriented, but some filters lead to responses that are more non-Gaussian than others. By the mid 1990s, a number of authors had developed methods of optimizing a basis of filters in order to maximize the non-Gaussianity of the responses [e.g., 36, 4]. Often these methods operate by optimizing a higher-order statistic such as kurtosis (the fourth moment divided by the squared variance). The resulting basis sets contain oriented filters of different sizes with frequency bandwidths of roughly one octave. Figure 5 shows an example basis set, obtained by optimizing kurtosis of the marginal responses to 12 × 12 pixel blocks drawn from a large ensemble of natural images.

Fig. 5. Example basis functions derived by optimizing a marginal kurtosis criterion [see 35].

In parallel with these statistical developments, authors from a variety of communities were developing multi-scale orthonormal bases for signal and image analysis, now generically known as "wavelets" (see Chapter 4.2 in this volume). These provide a good approximation to optimized bases such as that shown in Fig. 5.

Once we've transformed the image to a multi-scale wavelet representation, what statistical model can we use to characterize the coefficients? The statistical motivation for the choice of basis came from the shape of the marginals, and thus it would seem natural to assume that the coefficients within a subband are independent and identically distributed. With this assumption, the model is completely determined by the marginal statistics of the coefficients, which can be examined empirically as in the examples of Fig. 4. For natural images, these histograms are surprisingly well described by a two-parameter generalized Gaussian (also known as a stretched, or generalized exponential) distribution [e.g., 31, 47, 34]:

$$P_c(c; s, p) = \frac{\exp(-|c/s|^p)}{Z(s, p)}, \qquad (3)$$

where the normalization constant is $Z(s, p) = 2\,\frac{s}{p}\,\Gamma\!\left(\frac{1}{p}\right)$. An exponent of p = 2 corresponds to a Gaussian density, and p = 1 corresponds to the Laplacian density. In general, smaller values of p lead to a density that is both more concentrated at zero and has more expansive tails. Each of the histograms in Fig. 4 is plotted with a dashed curve corresponding to the best fitting instance of this density function, with the parameters {s, p} estimated by maximizing the likelihood of the data under the model. The density model fits the histograms remarkably well, as indicated numerically by the relative entropy measures given below each plot. We have observed that values of the exponent p typically lie in the range [0.4, 0.8]. The factor s varies monotonically with the scale of the basis functions, with correspondingly higher variance for coarser-scale components.
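A maximum-likelihood fit of this kind can be approximated with a simple sketch like the one below (assuming c is a 1-D NumPy array of subband coefficients; the grid range for p is an arbitrary choice). For a fixed p, the ML estimate of s has the closed form s = (p · mean(|c|^p))^(1/p), so only p needs to be searched.

```python
import numpy as np
from scipy.special import gammaln

def gg_loglik(c, s, p):
    """Log-likelihood of coefficients c under the generalized Gaussian of
    Eq. (3), with normalizer Z(s, p) = 2 (s/p) Gamma(1/p)."""
    logZ = np.log(2.0 * s / p) + gammaln(1.0 / p)
    return np.sum(-np.abs(c / s) ** p - logZ)

def fit_generalized_gaussian(c, p_grid=np.linspace(0.3, 2.0, 171)):
    """Grid search over p; for each candidate p, s is set to its ML value."""
    best = None
    for p in p_grid:
        s = (p * np.mean(np.abs(c) ** p)) ** (1.0 / p)   # closed-form ML scale
        ll = gg_loglik(c, s, p)
        if best is None or ll > best[0]:
            best = (ll, s, p)
    return {"loglik": best[0], "s": best[1], "p": best[2]}
```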
This wavelet marginal model is significantly more powerful than the classical Gaussian (spectral) model. For example, when applied to the problem of compression, the entropy of the distributions described above is significantly less than that of a Gaussian with the same variance, and this leads directly to gains in coding efficiency. In denoising, the use of this model as a prior density for images leads to significant improvements over the Gaussian model [e.g., 48, 11, 2, 34, 47].

Consider again the problem of removing additive Gaussian white noise from an image. If the wavelet transform is orthogonal, then the noise remains white in the wavelet domain. The degradation process may be described in the wavelet domain as:

$$P(d\,|\,c) \propto \exp\left(-(d-c)^2/2\sigma_n^2\right),$$

where d is a wavelet coefficient of the observed (noisy) image, c is the corresponding wavelet coefficient of the original (clean) image, and $\sigma_n^2$ is the variance of the noise. Again, using Bayes' Rule, we can reverse the conditioning:

$$P(c\,|\,d) \propto \exp\left(-(d-c)^2/2\sigma_n^2\right) \cdot P(c),$$

where the prior on c is given by Eq. (3). The MAP and BLS solutions cannot, in general, be written in closed form, but numerical solutions are fairly easy to compute [48, 47]. The resulting estimators are nonlinear "coring" functions, in which small-amplitude coefficients are suppressed and large-amplitude coefficients preserved. These estimates show substantial improvement over the linear estimates associated with the Gaussian model of the previous section (see examples in Fig. 10).
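Since the MAP estimate has no closed form, it can be computed numerically. The sketch below is an illustrative brute-force version (applied per subband, with an arbitrary grid resolution): it minimizes the negative log posterior (d − c)²/(2σn²) + |c/s|^p over a grid of candidate clean values. Plotting its output against a ramp of input values displays the characteristic coring nonlinearity.

```python
import numpy as np

def map_coring(d, s, p, sigma_n, n_grid=1001):
    """Numerical MAP estimate under the generalized Gaussian prior of Eq. (3)."""
    d = np.asarray(d, dtype=float).ravel()
    lim = np.max(np.abs(d)) + 4.0 * sigma_n
    c_grid = np.linspace(-lim, lim, n_grid)               # candidate clean values
    cost = ((d[:, None] - c_grid) ** 2 / (2.0 * sigma_n ** 2)
            + np.abs(c_grid / s) ** p)                    # negative log posterior
    return c_grid[np.argmin(cost, axis=1)]

# Illustrative parameter values: plot map_coring(np.linspace(-50, 50, 201),
# s=5.0, p=0.6, sigma_n=10.0) against its input to see small coefficients
# suppressed and large coefficients preserved.
```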
Despite these successes, it is again easy to see that important attributes of images are not captured by wavelet marginal models. When the wavelet transform is orthonormal, we can easily draw statistical samples from the model. Figure 6 shows the result of drawing the coefficients of a wavelet representation independently from generalized Gaussian densities. The density parameters for each subband were chosen as those that best fit the "Einstein" image. Although it has more structure than an image of white noise, and perhaps more than the image drawn from the spectral model (Fig. 3), the result still does not look very much like a photographic image!

Fig. 6. A sample image drawn from the wavelet marginal model, with subband density parameters chosen to fit the image of Fig. 7.

The wavelet marginal model may be improved by extending it to an overcomplete wavelet basis. In particular, Zhu et al. have shown that large numbers of marginals are sufficient to uniquely constrain a high-dimensional probability density [62] (this is a variant of the Fourier projection-slice theorem used for tomographic reconstruction). This idea has been used to construct effective models of texture representation and synthesis [20, 61, 38]. The drawback of this approach is that the joint statistical properties are defined implicitly through the imposition of marginal statistics. They are thus difficult to study directly, or to utilize in developing solutions for image processing applications. In the next section, we consider the more direct development of joint statistical descriptions.

3 Wavelet Joint Models

The primary reason for the poor appearance of the image in Fig. 6 is that the coefficients of the wavelet transform are not independent. Empirically, the coefficients of orthonormal wavelet decompositions of visual images are found to be moderately well decorrelated (i.e., their covariance is zero). But this is only a statement about their second-order dependence, and one can easily see that there are important higher-order dependencies. Figure 7 shows the amplitudes (absolute values) of coefficients in a four-level separable orthonormal wavelet decomposition. Note that large-magnitude coefficients tend to occur near each other within subbands, and also occur at the same relative spatial locations in subbands at adjacent scales and orientations [e.g., 46, 7].

Fig. 7. Amplitudes of multi-scale wavelet coefficients for the "Einstein" image. Each subimage shows coefficient amplitudes of a subband obtained by convolution with a filter of a different scale and orientation, and subsampled by an appropriate factor. Coefficients that are spatially near each other within a band tend to have similar amplitudes. In addition, coefficients at different orientations or scales but in nearby (relative) spatial positions tend to have similar amplitudes.

The intuitive reason for the clustering of large-amplitude coefficients is that typical localized and isolated image features are represented in the wavelet domain via the superposition of a group of basis functions at different positions, orientations and scales. The signs and relative magnitudes of the coefficients associated with these basis functions will depend on the precise location, orientation and scale of the underlying feature. The magnitudes will also scale with the contrast of the structure. Thus, measurement of a large coefficient at one scale means that large coefficients at adjacent scales are more likely.

This clustering property was exploited in a heuristic but highly effective manner in the Embedded Zerotree Wavelet (EZW) image coder [44], and has been used in some fashion in nearly all image compression systems since. A more explicit description was first developed in the context of denoising. More than 20 years ago, Lee [28] suggested a two-step procedure for image denoising, in which the local signal variance is first estimated from a neighborhood of observed pixels, after which the pixels in the neighborhood are denoised using a standard linear least squares method. Although it was done in the pixel domain, this paper introduced the idea that variance is a local property that should be estimated adaptively, as compared with the classical Gaussian model in which one assumes a fixed global variance. Ruderman [41] examined local variance properties of image derivatives, and noted that the derivative field could be made more homogeneous by normalizing by a local estimate of the standard deviation. It was not until the 1990s that a number of authors began to apply this concept to denoising in the wavelet domain, estimating the variance of clusters of wavelet coefficients at nearby positions, scales, and/or orientations, and then using these estimated variances in order to denoise the cluster [30, 46, 10, 47, 33, 55, 1].

The locally-adaptive variance principle is powerful, but does not constitute a full probability model. As in the previous sections, we can develop a more explicit model by directly examining the statistics of the coefficients [46]. The top row of Fig. 8 shows joint histograms of several different pairs of wavelet coefficients. As with the marginals, we assume homogeneity in order to consider the joint histogram of each pair of coefficients, gathered over the spatial extent of the image, as representative of the underlying density. Coefficients that come from adjacent basis functions are seen to produce contours that are nearly circular, whereas the others are clearly extended along the axes. Zetzsche [59] has examined the empirical joint densities of quadrature (Hilbert transform) pairs of basis functions and found that the contours are roughly circular. Several authors have also suggested circular generalized Gaussians as a model for joint statistics of nearby wavelet coefficients [22, 49].

Fig. 8. Empirical joint distributions of wavelet coefficients associated with different pairs of basis functions (columns: adjacent, near, far, other scale, other orientation), for a single image of a New York City street scene (see Fig. 1 for image description). The top row shows joint distributions as contour plots, with lines drawn at equal intervals of log probability. The three leftmost examples correspond to pairs of basis functions at the same scale and orientation, but separated by different spatial offsets. The next corresponds to a pair at adjacent scales (but the same orientation, and nearly the same position), and the rightmost corresponds to a pair at orthogonal orientations (but the same scale and nearly the same position). The bottom row shows corresponding conditional distributions: brightness corresponds to frequency of occurrence, except that each column has been independently rescaled to fill the full range of intensities.
The joint histograms shown in the first row of Fig. 8 do not make explicit the issue of whether the coefficients are independent. In order to make this more explicit, the bottom row shows conditional histograms of the same data. Let $x_2$ correspond to the density coefficient (vertical axis), and $x_1$ the conditioning coefficient (horizontal axis). The histograms illustrate several important aspects of the relationship between the two coefficients. First, the expected value of $x_2$ is approximately zero for all values of $x_1$, indicating that they are nearly decorrelated (to second order). Second, the variance of the conditional histogram of $x_2$ clearly depends on the value of $x_1$, and the strength of this dependency depends on the particular pair of coefficients being considered. Thus, although $x_2$ and $x_1$ are uncorrelated, they still exhibit statistical dependence!

The form of the histograms shown in Fig. 8 is surprisingly robust across a wide range of images. Furthermore, the qualitative form of these statistical relationships also holds for pairs of coefficients at adjacent spatial locations and adjacent orientations. As one considers coefficients that are more distant (either in spatial position or in scale), the dependency becomes weaker, suggesting that a Markov assumption might be appropriate.
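This uncorrelated-but-dependent behavior is easy to verify directly. The sketch below (assuming subband is a 2-D NumPy array of coefficients from any orthonormal wavelet decomposition, e.g. one detail subband produced by a wavelet toolbox) compares the correlation of horizontally adjacent coefficients with the correlation of their amplitudes: the former is near zero, the latter clearly positive.

```python
import numpy as np

def neighbor_dependence(subband):
    """Second-order vs. amplitude dependence of horizontally adjacent
    wavelet coefficients (cf. the conditional histograms of Fig. 8)."""
    a = subband[:, :-1].ravel()          # each coefficient...
    b = subband[:, 1:].ravel()           # ...and its right-hand neighbor
    corr = lambda u, v: np.corrcoef(u, v)[0, 1]
    return {"coefficient_correlation": corr(a, b),               # near zero
            "amplitude_correlation": corr(np.abs(a), np.abs(b))}  # substantially positive
```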
The circular (or elliptical) contours, the dependency between local coefficient amplitudes, as well as the associated marginal behaviors, can be modeled using a random field with a spatially fluctuating variance. A particularly useful example arises from the product of a Gaussian vector and a hidden scalar multiplier, known as a Gaussian scale mixture (GSM) [3]. These distributions represent an important subset of the elliptically symmetric distributions, which are those that can be defined as functions of a quadratic norm of the random vector. Embedded in a random field, these kinds of models have been found useful in the speech-processing community [6]. A related set of models, known as autoregressive conditional heteroskedastic (ARCH) models [e.g., 5], have proven useful for many real signals that suffer from abrupt fluctuations, followed by relative "calm" periods (stock market prices, for example). Finally, physicists studying properties of turbulence have noted similar behaviors [e.g., 51].

Formally, a random vector $\vec{x}$ is a Gaussian scale mixture [3] if and only if it can be expressed as the product of a zero-mean Gaussian vector $\vec{u}$ and an independent positive scalar random variable $\sqrt{z}$:

$$\vec{x} \sim \sqrt{z}\,\vec{u}, \qquad (4)$$

where ∼ indicates equality in distribution. The variable z is known as the multiplier. The vector $\vec{x}$ is thus an infinite mixture of Gaussian vectors, whose density is determined by the covariance matrix $C_u$ of vector $\vec{u}$ and the mixing density, $p_z(z)$:

$$p_{\vec{x}}(\vec{x}) = \int p(\vec{x}\,|\,z)\, p_z(z)\, dz = \int \frac{\exp\!\left(-\vec{x}^{\,T}(zC_u)^{-1}\vec{x}/2\right)}{(2\pi)^{N/2}\,|zC_u|^{1/2}}\, p_z(z)\, dz, \qquad (5)$$

where N is the dimensionality of $\vec{x}$ and $\vec{u}$ (in our case, the size of the neighborhood).

The conditions under which a random vector may be represented using a GSM have been studied [3], and the GSM family includes the α-stable family (including the Cauchy distribution), the generalized Gaussian (or stretched exponential) family, and the symmetrized Gamma family [55]. GSM densities are symmetric and zero-mean, and they have highly kurtotic marginal densities (i.e., heavier tails than a Gaussian). A key property of the GSM model is that the density of $\vec{x}$ is Gaussian when conditioned on z. Also, the normalized vector $\vec{x}/\sqrt{z}$ is Gaussian.
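The defining product form of Eq. (4) makes GSM variables easy to simulate. The sketch below draws samples using a lognormal mixing density (an arbitrary illustrative choice; the models cited in this section use various priors on z) and checks that the resulting marginal kurtosis exceeds the Gaussian value of 3.

```python
import numpy as np

def sample_gsm(n_samples, C_u, z_sampler, seed=0):
    """Draw x = sqrt(z) * u per Eq. (4): u ~ N(0, C_u), z from the mixing density."""
    rng = np.random.default_rng(seed)
    u = rng.multivariate_normal(np.zeros(len(C_u)), C_u, size=n_samples)
    z = z_sampler(rng, n_samples)
    return np.sqrt(z)[:, None] * u

def kurtosis(x):
    """Fourth moment divided by squared variance (equals 3 for a Gaussian)."""
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2

C_u = np.array([[1.0, 0.5], [0.5, 1.0]])          # illustrative neighborhood covariance
x = sample_gsm(100_000, C_u, lambda rng, n: rng.lognormal(0.0, 1.0, n))
print(kurtosis(x[:, 0]))                          # substantially greater than 3
```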
A number of recent image models describe the wavelet coefficients within each local neighborhood using a Gaussian scale mixture (GSM) model, which can capture the strongly leptokurtotic behavior of the marginal densities of natural image wavelet coefficients, as well as the correlation in their local amplitudes, as illustrated in Fig. 9. For example, Baraniuk and colleagues used a 2-state hidden multiplier variable to characterize the two modes of behavior corresponding to smooth or low-contrast textured regions and features [12, 40]. Others assume that the local variance is governed by a continuous multiplier variable [29, 54, 33, 55, 39]. Some GSM models for images treat the multiplier variables, z, as if they were independent, even when they belong to overlapping coefficient neighborhoods [29, 54, 39]. More sophisticated models describe dependencies between these variables [12, 40, 55].

Fig. 9. Comparison of statistics of coefficients from an example image subband (left panels) with those generated by simulation of a local GSM model (right panels). Model parameters (covariance matrix and the multiplier prior density) are estimated by maximizing the likelihood of the subband coefficients (see [39]). (a,b) Log of marginal histograms. (c,d) Conditional histograms of two spatially adjacent coefficients. Pixel intensity corresponds to frequency of occurrence, except that each column has been independently rescaled to fill the full range of intensities.

The underlying Gaussian structure of the GSM model allows it to be adapted for problems such as denoising. The estimator is more complex than that described for the Gaussian or wavelet marginal models (see [39] for details), but the denoising performance shows a substantial improvement across a wide variety of images and noise levels. As a demonstration, Fig. 10 shows a performance comparison of BLS estimators based on the GSM model, a wavelet marginal model, and a wavelet Gaussian model. The GSM estimator is significantly better, both visually and in terms of mean squared error.

Fig. 10. Denoising examples. Upper left: Original 8-bit image (cropped for visibility). Upper middle: Contaminated by adding simulated Gaussian white noise, σ = 21.4, peak signal-to-noise ratio (PSNR) = 22.06. Upper right: Denoised with a Gaussian marginal model (PSNR = 27.87). Lower left: Denoised with a generalized-Gaussian marginal model (PSNR = 29.24). Lower right: Denoised with a GSM model (PSNR = 30.86). All methods were implemented in an overcomplete wavelet domain (see [47, 39]). In each case, the noise variance was assumed known, and the denoising procedure was Bayes least-squares (i.e., the mean of the posterior distribution). Model parameters were estimated by maximizing the likelihood of the data.

As with the models of the previous two sections, there are indications that the GSM model is insufficient to fully capture the structure of typical visual images. To demonstrate this, we note that normalizing each coefficient by (the square root of) its estimated variance should produce a field of Gaussian white noise [54]. Figure 11 illustrates this process, showing an example wavelet subband, the estimated variance field, and the normalized coefficients. There are two important types of structure that remain. First, although the normalized coefficients are certainly closer to a homogeneous field, the signs of the coefficients still exhibit important structure. Second, the variance field itself is far from homogeneous, with most of the significant values concentrated on one-dimensional contours.

Fig. 11. Example wavelet subband (original coefficients), square root of the estimated variance field, and normalized subband.
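A crude version of this normalization can be sketched as follows (assuming subband is a 2-D NumPy array; the local average of squared coefficients is used as a simple variance estimate, not the maximum-likelihood estimator of [39]).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def normalize_subband(subband, nbhd=5, eps=1e-6):
    """Divide each coefficient by a local standard-deviation estimate,
    approximating the normalization illustrated in Fig. 11."""
    local_power = uniform_filter(subband ** 2, size=nbhd)   # local mean of c^2
    sqrt_z_hat = np.sqrt(local_power + eps)                 # rough sqrt(z) estimate
    return subband / sqrt_z_hat, sqrt_z_hat
```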
4 Discussion

After nearly 50 years of Fourier/Gaussian modeling, the late 1980s and 1990s saw a sudden and remarkable shift in viewpoint, arising from the confluence of (a) multi-scale image decompositions, (b) non-Gaussian statistical observations and descriptions, and (c) variance-adaptive statistical models based on hidden variables. The improvements in image processing applications arising from these ideas have been steady and substantial. But the complete synthesis of these ideas, and the development of further refinements, are still underway.

Variants of the GSM model described in the previous section seem to represent the current state-of-the-art, both in terms of characterizing the density of coefficients, and in terms of the quality of results in image processing applications. There are several issues that seem to be of primary importance in trying to extend such models. First, a number of authors have examined different methods of describing regularities in the local variance field. These include spatial random fields [23, 26, 24] and multiscale tree-structured models [40, 55]. Much of the structure in the variance field may be attributed to discontinuous features such as edges, lines, or corners. There is a substantial literature in computer vision describing such structures [e.g., 57, 32, 17, 27, 56], but it has proven difficult to establish models that are both explicit and flexible. Finally, there have been several recent studies investigating geometric regularities that arise from the continuity of contours and boundaries [45, 16, 19, 21, 60]. These and other image structures will undoubtedly be incorporated into future statistical models, leading to further improvements in image processing applications.

References

[1] F. Abramovich, T. Besbeas, and T. Sapatinas. Empirical Bayes approach to block wavelet function estimation. Computational Statistics and Data Analysis, 39:435–451, 2002.

[2] F. Abramovich, T. Sapatinas, and B. W. Silverman. Wavelet thresholding via a Bayesian approach. J R Stat Soc B, 60:725–749, 1998.

[3] D. Andrews and C. Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974.

[4] A. J. Bell and T. J. Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.

[5] T. Bollerslev, K. Engle, and D. Nelson. ARCH models. In B. Engle and D. McFadden, editors, Handbook of Econometrics V. 1994.

[6] H. Brehm and W. Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987.

[7] R. W. Buccigrossi and E. P. Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999.

[8] G. Buchsbaum and A. Gottschalk. Trichromacy, opponent color coding, and optimum colour information transmission in the retina. Proc. R. Soc. Lond. Ser. B, 220:89–113, 1983.

[9] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. Comm., COM-31(4):532–540, April 1983.

[10] S. G. Chang, B. Yu, and M. Vetterli. Spatially adaptive wavelet thresholding with context modeling for image denoising. In Fifth IEEE Int'l Conf on Image Proc, Chicago, October 1998. IEEE Computer Society.

[11] H. A. Chipman, E. D. Kolaczyk, and R. M. McCulloch. Adaptive Bayesian wavelet shrinkage. J American Statistical Assoc, 92(440):1413–1421, 1997.

[12] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-based statistical signal processing using hidden Markov models. IEEE Trans. Signal Proc., 46:886–902, April 1998.

[13] J. G. Daugman. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Trans. Acoust. Speech Signal Proc., 36(7):1169–1179, 1988.

[14] N. G. Deriugin. The power spectrum and the correlation function of the television signal. Telecommunications, 1(7):1–12, 1956.

[15] D. W. Dong and J. J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6:345–358, 1995.

[16] J. H. Elder and R. M. Goldberg. Ecological statistics of gestalt laws for the perceptual organization of contours. Journal of Vision, 2(4):324–353, 2002. DOI 10.1167/2.4.5.

[17] J. H. Elder and S. W. Zucker. Local scale control for edge detection and blur estimation. IEEE Pat. Anal. Mach. Intell., 20(7):699–716, 1998.

[18] D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987.

[19] W. S. Geisler, J. S. Perry, B. J. Super, and D. P. Gallogly. Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41(6):711–724, March 2001.

[20] D. Heeger and J. Bergen. Pyramid-based texture analysis/synthesis. In Proc. ACM SIGGRAPH, pages 229–238. Association for Computing Machinery, August 1995.
[21] P. Hoyer and A. Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12):1593–1605, 2002.

[22] J. Huang and D. Mumford. Statistics of natural images and models. In CVPR, Paper 216, 1999.

[23] A. Hyvärinen and P. Hoyer. Emergence of topography and complex cell properties from natural images using extensions of ICA. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press.

[24] A. Hyvärinen, J. Hurri, and J. Väyrynen. Bubbles: A unifying framework for low-level statistical properties of natural image sequences. J. Opt. Soc. Am. A, 20(7), July 2003.

[25] E. T. Jaynes. Where do we stand on maximum entropy? In R. D. Levine and M. Tribus, editors, The Maximal Entropy Formalism. MIT Press, Cambridge, MA, 1978.

[26] Y. Karklin and M. S. Lewicki. Learning higher-order structures in natural images. Network, 14:483–499, 2003.

[27] P. Kovesi. Image features from phase congruency. Videre: Journal of Computer Vision Research, 1(3), Summer 1999.

[28] J. S. Lee. Digital image enhancement and noise filtering by use of local statistics. IEEE Pat. Anal. Mach. Intell., PAMI-2:165–168, March 1980.

[29] S. M. LoPresto, K. Ramchandran, and M. T. Orchard. Wavelet image coding based on a new generalized Gaussian mixture model. In Data Compression Conf, Snowbird, Utah, March 1997.

[30] M. Malfait and D. Roose. Wavelet-based image denoising using a Markov random field a priori model. IEEE Trans. Image Proc., 6:549–565, April 1997.

[31] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Pat. Anal. Mach. Intell., 11:674–693, July 1989.

[32] S. G. Mallat. Zero-crossings of a wavelet transform. IEEE Trans. Info. Theory, 37(4):1019–1033, July 1991.

[33] M. K. Mihçak, I. Kozintsev, K. Ramchandran, and P. Moulin. Low-complexity image denoising based on statistical modeling of wavelet coefficients. IEEE Trans. Sig. Proc., 6(12):300–303, December 1999.

[34] P. Moulin and J. Liu. Analysis of multiresolution image denoising schemes using a generalized Gaussian and complexity priors. IEEE Trans. Info. Theory, 45:909–919, 1999.

[35] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

[36] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.

[37] A. V. Oppenheim and J. S. Lim. The importance of phase in signals. Proc. of the IEEE, 69:529–541, 1981.

[38] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision, 40(1):49–71, December 2000.

[39] J. Portilla, V. Strela, M. Wainwright, and E. P. Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003.

[40] J. Romberg, H. Choi, and R. Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int'l Conf on Image Proc, Kobe, Japan, October 1999.

[41] D. L. Ruderman. The statistics of natural images. Network: Computation in Neural Systems, 5:517–548, 1996.

[42] D. L. Ruderman and W. Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994.

[43] D. L. Ruderman, T. W. Cronin, and C.-C. Chiao. Statistics of cone responses to natural images: Implications for visual coding. J. Opt. Soc. Am. A, 15(8):2036–2045, 1998.

[44] J. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans Sig Proc, 41(12):3445–3462, December 1993.

[45] M. Sigman, G. A. Cecchi, C. D. Gilbert, and M. O. Magnasco. On a common circle: Natural scenes and Gestalt rules. Proc. National Academy of Sciences, 98(4):1935–1940, 2001.

[46] E. P. Simoncelli. Statistical models for images: Compression, restoration and synthesis. In Proc 31st Asilomar Conf on Signals, Systems and Computers, pages 673–678, Pacific Grove, CA, November 1997. IEEE Computer Society. Available from http://www.cns.nyu.edu/~eero/publications.html.

[47] E. P. Simoncelli. Bayesian denoising of visual images in the wavelet domain. In P. Müller and B. Vidakovic, editors, Bayesian Inference in Wavelet Based Models, chapter 18, pages 291–308. Springer-Verlag, New York, 1999. Lecture Notes in Statistics, vol. 141.

[48] E. P. Simoncelli and E. H. Adelson. Noise removal via Bayesian wavelet coring. In Third Int'l Conf on Image Proc, volume I, pages 379–382, Lausanne, September 1996. IEEE Sig Proc Society.

[49] A. Srivastava, X. Liu, and U. Grenander. Universal analytical forms for modeling image probability. IEEE Pat. Anal. Mach. Intell., 28(9), 2002.

[50] D. J. Tolhurst, Y. Tadmor, and T. Chao. Amplitude spectra of natural images. Opth. and Physiol. Optics, 12:229–232, 1992.

[51] A. Turiel, G. Mato, N. Parga, and J. P. Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998.

[52] A. Turiel and N. Parga. The multi-fractal structure of contrast changes in natural images: From sharp edges to textures. Neural Computation, 12:763–793, 2000.

[53] A. van der Schaaf and J. H. van Hateren. Modelling the power spectra of natural images: Statistics and information. Vision Research, 28(17):2759–2770, 1996.

[54] M. J. Wainwright and E. P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems (NIPS*99), volume 12, pages 855–861, Cambridge, MA, May 2000. MIT Press.

[55] M. J. Wainwright, E. P. Simoncelli, and A. S. Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001.
[56] Z. Wang and E. P. Simoncelli. Local phase coherence and the perception of blur. In S. Thrun, L. Saul, and B. Schölkopf, editors, Adv. Neural Information Processing Systems (NIPS*03), volume 16, Cambridge, MA, 2004. MIT Press.

[57] A. P. Witkin. Scale-space filtering. In Proc. Intl. Joint Conf. Artificial Intelligence, pages 1019–1021, 1985.

[58] C. Zetzsche and E. Barth. Fundamental limits of linear filters in the visual processing of two-dimensional signals. Vision Research, 30:1111–1117, 1990.

[59] C. Zetzsche, B. Wegmann, and E. Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int'l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993.

[60] S.-C. Zhu. Statistical modeling and conceptualization of visual patterns. IEEE Trans PAMI, 25(6), June 2003.

[61] S. C. Zhu, Y. N. Wu, and D. Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation, 9:1627–1660, 1997.

[62] S. C. Zhu, Y. N. Wu, and D. Mumford. FRAME: Filters, random fields and maximum entropy – towards a unified theory for texture modeling. Intl. J. Comp. Vis., 27(2):1–20, 1998.