A Tour of Mobile Photography
Abstract
The first mobile camera phone was sold only 20 years ago, when taking pictures with one’s
phone was an oddity, and sharing pictures online was unheard of. Today, the smartphone is
more camera than phone. How did this happen? This transformation was enabled by advances
in computational photography—the science and engineering of making great images from small
form factor, mobile cameras. Modern algorithmic and computing advances, including machine
learning, have changed the rules of photography, bringing to it new modes of capture, post-
processing, storage, and sharing. In this paper, we give a brief history of mobile computational
photography and describe some of the key technological components, including burst photog-
raphy, noise reduction, and super-resolution. At each step, we may draw naive parallels to the
human visual system.
Contents
1 Introduction and historical overview
4 Compound Features
4.1 Low-light imaging
4.2 Super-resolution and hybrid optical/digital zoom
4.2.1 Upscaling
4.3 Synthetic bokeh
Figure 1: An incomplete timeline of photographic innovations in the last four decades.
Figure 2: Note the crowds at Saint Peter’s Square in Vatican City when Pope Benedict XVI was
announced in 2005 and when Pope Francis was announced in 2013. Image courtesy of NBC News.
devices is still often bettered by complementary development and close integration of algorithmic
software solutions. In some ways, the situation is not altogether dissimilar to the development of
vision in nature. The evolution of the physical form of the eye has had to contend with physical
constraints that limit the size, shape, and sensitivity of light gathering in our vision system. In
tandem and complementary fashion, the visual cortex has developed the computational machinery
to interpolate, interpret, and expand the limits imposed by the physical shape of our eyes—our
visual “hardware.”
The introduction of the iPhone in 2007 was a watershed moment in the evolution of mobile
devices and changed the course of both phone and camera technology. Looking back at the early
devices, the layperson may conclude that the camera was immediately transformed to become the
high-utility application we know today. This was in fact not the case. What was reinvented for the
better at the time were the display and user interface, but not necessarily the camera yet; indeed, the
two-megapixel camera on the first iPhone was far inferior to just about any existing point-and-shoot
camera of comparable price at the time.
The year 2010 was pivotal for the mobile camera. A transition to both 4G wireless and 300 dots
per inch (dpi) displays enabled users to finally appreciate their photographs on the screens of their
own mobile devices. Indeed, users felt that their phone screens were not only rich enough for the
consumption of their own photos, but also sufficient to make photo-sharing worthwhile. Meanwhile,
the significantly faster wireless network speeds meant that individuals could share (almost instan-
taneously) their photos (see Figure 3). Once viewing and sharing had been improved by leaps and
bounds, it became imperative for mobile manufacturers to significantly improve the quality of the
captured photos as well. Hence began an emphasis on improved light collection, better dynamic
range, and higher resolution for camera phones so that consumers could use their phones as cameras
and communication devices. The added bonus was that users would no longer have to carry both a
phone and a camera—they could rely on their smartphone as a multipurpose device.
Achieving this ambitious goal meant overcoming a large number of limitations, challenges that
have kept the computational photography community busy for the last decade. In the next section,
we begin by highlighting some of the main limitations of the mobile imaging platform as compared
to standalone devices, and go on in the rest of the paper to review how these challenges have been
met in research and practice.
Figure 3: The year 2010 saw the convergence of two important trends: improved displays and
increased wireless speed. These forces conspired to catapult mobile photography to the dominant
mode of imaging in the ensuing decade.
Figure 4: This figure shows the pros and cons of a smartphone compared to a DSLR. The most
notable differences are the larger sensor and optics available on a DSLR. Surprisingly, however, a
high-end smartphone has significantly more computing power than most DSLR cameras.
flexible than those on a DSLR camera. Yet, while the smartphone’s physical hardware is limited,
the smartphone has access to much more computing power than is available on a DSLR. To draw a
rough, but stark contrast between the two platforms, a mobile camera’s small aperture limits light
collection by two orders of magnitude as compared to a typical DSLR. Meanwhile, the same mobile
device houses roughly two orders of magnitude more computing power. The trade-off of additional
computing for more sophisticated imaging hardware is thus inevitable.
Next, we briefly summarize several key limitations of the mobile phone camera as compared to
a DSLR.
2.4 Limited zoom
As noted earlier, in response to consumer demands, smartphone design has trended towards ultra-
thin form factors. This design trend imposes severe limitations on the thickness (or z-height) of the
smartphone’s camera module, limiting the effective focal length, which in turn limits the camera
module’s optical zoom capability. To overcome this z-height limitation, modern smartphone man-
ufacturers typically feature multiple camera modules with different effective focal lengths and fields
of view, enabling zoom capabilities ranging from ultra-wide to telephoto. The z-height form
factor restriction has spurred a so-called thinnovation (a portmanteau of thin and innovation) in
optical design, with manufacturers exploring folded optics in an effort to increase the optical path
and effective focal length beyond the physical z-height limits of the device.
[Figure 5 panels: a camera sensor with its R/G/B color filter array; a sensor cross section showing microlenses that help focus light onto the photodiodes; and the spectral sensitivity profiles of the R/G/B filters.]
Figure 5: A typical camera sensor with a color filter array layout (Bayer pattern) is shown. A cross
section of the sensor is shown along with an example of the spectral sensitivities of the color filters.
1 Exceptions do exist, including sensors developed by Foveon and others, though these are not in common use.
3.1 Camera sensor
A camera sensor consists of a 2D grid of photodiodes. A photodiode is a semiconductor de-
vice that converts photons (light radiation) into electrical charge. A single photodiode typically
corresponds to a single image pixel. In order to produce a color image, color filters are placed over
the photodiodes. These color filters roughly correspond to the long, medium, and short cone cells
found in the retina. The typical arrangement of this color filter array (CFA) is often called a Bayer
pattern, named after Bryce Bayer, who proposed this design at Kodak in 1975 (Bayer, 1975). The
CFA appears as a mosaic of color tiles laid on top of the sensor as shown in Figure 5. A key process
in the camera pipeline is to “demosaic” the CFA array by interpolating a red, green, and blue value
for each pixel based on the surrounding R, G, B colors. It is important to note that the spectral sen-
sitivities of the red, green, and blue color filters are specific to a particular sensor’s make and model.
Because of this, a crucial step in the camera imaging pipeline is to convert these sensor-specific RGB
values to a device-independent perceptual color space, such as CIE 1931 XYZ. An image captured
directly from a sensor that is still in its mosaiced format is called a Bayer image or Bayer frame.
[Figure 6 diagram: the image sensor and ISO gain feed a Bayer/raw processing unit and the single-frame camera pipeline, followed by an image processing unit (enhance) that performs white balance, a color space transform to CIE XYZ/ProPhoto, color manipulation and tone mapping (photo-finishing), before saving to file.]
Figure 6: The top of this figure shows a standard single-frame camera pipeline. The bottom figure
shows the extension to multi-frame (or burst imaging) used by most modern smartphone cameras.
Figure 6 (top) shows a diagram of a typical camera imaging pipeline that would be implemented
by an ISP. Depending on the ISP design, the routines shown may appear in a slightly different order.
Many of the routines described would represent proprietary algorithms specific to a particular ISP
manufacturer. Two different camera manufacturers may use the same ISP hardware, but can tune
and modify the ISP’s parameters and algorithms to produce images with a photographic quality
unique to their respective devices. The following provides a description of each of the processing
steps outlined in Figure 6 (top).
Sensor frame acquisition: When the Bayer image from the camera’s sensor is captured and
passed to the ISP, the ISO gain factor is adjusted at capture time depending on the scene brightness,
desired shutter speed, and aperture. The sensor Bayer frame is considered an unprocessed image
and is commonly referred to as a raw image. As shown in Figure 5, the Bayer frame has a single R,
G, B value per pixel location. These raw R, G, B values are not in a perceptual color space but are
specific to the color filter array’s spectral sensitivities.
Raw-image pre-processing: The raw sensor image is normalized such that its values range
from 0 to 1. Many cameras provide a BlackLevel parameter that represents the lowest pixel value
Figure 7: Light entering the camera does not fall evenly across the sensor. This creates an undesired
vignetting effect. Lens shading correction is used to adjust the recorded values on the sensor to have
a uniform response.
produced by the sensor. Interestingly, this deviates from 0 due to sensor error. For example, a
sensor that is exposed to no light should report a value of 0 for its output, but instead outputs a
small positive value called the BlackLevel. This BlackLevel is subtracted off the raw image. The
BlackLevel is often image specific and related to other camera settings, including ISO and gain. An
additional WhiteLevel (maximum value) can also be specified. If these values are not provided, the minimum and
maximum intensities in the image are used to normalize the image between 0 and 1 after the
BlackLevel adjustment has been applied.
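As a concrete illustration, here is a minimal Python/NumPy sketch of this normalization step; black_level and white_level stand in for the values reported in the camera metadata and are not tied to any particular vendor API.

```python
import numpy as np

def normalize_raw(bayer, black_level, white_level=None):
    """Normalize a raw Bayer frame to [0, 1] after BlackLevel subtraction."""
    raw = bayer.astype(np.float64) - black_level        # remove the black-level offset
    if white_level is not None:
        raw = raw / max(white_level - black_level, 1e-8)
    else:                                               # fall back to the observed range
        raw = (raw - raw.min()) / max(raw.max() - raw.min(), 1e-8)
    return np.clip(raw, 0.0, 1.0)
```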
The pre-processing stage also corrects any defective pixels on the sensor. A defect pixel mask is
pre-calibrated in the factory and marks locations that have malfunctioning photodiodes. Defective
pixels can be photodiodes that always report a high value (a hot pixel) or pixels that output no
value (a dead pixel). Defective pixel values are interpolated using their neighbors.
Finally, a lens shading (or flat field) correction is applied to correct the effects of uneven light
hitting the sensor. The role of lens shading correction is shown in Figure 7. The figure shows the
result of capturing a flat illumination field before lens shading correction. The amount of light hitting
the sensor falls off radially towards the edges. The necessary radial correction is represented as a
lens shading correction mask that is applied by the ISP to correct the effects of the non-uniform
fall-off. The lens shading mask is pre-calibrated by the manufacturer and is adjusted slightly per
frame to accommodate different brightness levels, gain factors, and the estimated scene illumination
used for white-balance (described below).
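To make the idea concrete, the sketch below builds a toy radial gain mask and applies it to a normalized raw frame. Real ISPs use per-channel masks calibrated in the factory; the quadratic fall-off model here is only an illustrative assumption.

```python
import numpy as np

def radial_gain_mask(height, width, strength=0.4):
    """Toy lens-shading gain mask: gain grows with distance from the optical center."""
    yy, xx = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    r = np.hypot(yy - cy, xx - cx) / max(np.hypot(cy, cx), 1e-8)  # normalized radius
    falloff = 1.0 - strength * r**2                               # assumed quadratic fall-off
    return 1.0 / np.clip(falloff, 1e-3, 1.0)                      # gain that undoes the fall-off

def lens_shading_correct(raw, gain_mask):
    """Multiply the raw frame by the (pre-calibrated) gain mask."""
    return np.clip(raw * gain_mask, 0.0, 1.0)
```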
Bayer demosaicing: A demosaicing algorithm is applied to convert the single channel raw image
to a three-channel full-size RGB image. Demosaicing is performed by interpolating the missing values
in the Bayer pattern based on neighboring values in the CFA. Figure 8 shows an example of the
demosaicing process. In this example, a zoomed photodiode with a red color filter is shown. This
pixel’s green and blue color values need to be estimated. These missing pixel values are estimated by
interpolating the missing green pixel using the neighboring green values. A per-pixel weight mask is
computed based on the red pixel’s similarity to neighboring red pixels. The use of this weight mask
in the interpolation helps to avoid blurring around scene edges. Figure 8 illustrates a simplistic and
generic approach, whereas most demosaicing algorithms are proprietary methods that often also
perform highlight clipping, sharpening, and some initial denoising (Longere et al., 2002)2 .
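The following is a minimal sketch of this kind of similarity-weighted green interpolation at a red pixel of an RGGB Bayer mosaic, in the spirit of the generic scheme in Figure 8; it is not any ISP's proprietary demosaicer.

```python
import numpy as np

def green_at_red(bayer, i, j, eps=1e-4):
    """Estimate the missing green value at a red pixel location (i, j).

    `bayer` is a single-channel RGGB mosaic; (i, j) is assumed to be a red
    location at least two pixels from the border. Each of the four green
    neighbors is weighted by how similar the center red value is to the red
    value two pixels away in that direction, which keeps the interpolation
    from blurring across edges.
    """
    directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right
    center = bayer[i, j]
    greens, weights = [], []
    for di, dj in directions:
        greens.append(bayer[i + di, j + dj])              # nearest green neighbor
        red_neighbor = bayer[i + 2 * di, j + 2 * dj]      # next red pixel in that direction
        weights.append(1.0 / (abs(center - red_neighbor) + eps))
    weights = np.asarray(weights)
    return float(np.dot(weights / weights.sum(), greens))
```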
2 The astute reader will note that this demosaicing step is effectively interpolating two out of three colors at every
pixel in the output image. The naive consumer may be shocked to learn that 2/3 of their image is made up!
White balance: White balance is performed to mimic the human visual system’s ability to
perform chromatic adaptation to the scene illumination. White balance is often referred to as
[Figure 8 panels: a neighborhood about a red pixel in the captured raw Bayer image; the neighboring green values; and a weight mask (with values between 0.1 and 1.0 in this example) based on the red pixel’s similarity to the neighboring red values, used to compute the missing green pixel value as a weighted interpolation of the neighboring green values.]
Figure 8: This figure illustrates a common approach to image demosaicing. Shown is a red pixel and
its neighboring Bayer pixels. The missing green and blue pixel values need to be estimated. These
missing values are interpolated from the neighboring pixels. A weight mask based on the red pixel’s
similarity to its neighbors is computed to guide this interpolation. This weighted interpolation helps
to avoid blurring across scene edges. This figure shows the interpolation of the missing green pixel
value.
computational color constancy to denote the connection to the human visual system. White balance
requires an estimate of the sensor’s R, G, B color filter response to the scene illumination. This
response can be pre-calibrated in the factory by recording the sensor’s response to spectra of common
illuminations (e.g., sunlight, incandescent, and fluorescent lighting). These pre-calibrated settings
are then part of the camera’s white-balance preset that a user can select. A more common alternative
is to rely on the camera’s auto-white-balance (AWB) algorithm that estimates the sensor’s R, G,
B response to the illumination directly from the captured image. Illumination estimation is a well-
studied topic in computer vision and image processing with a wide range of solutions (Barron &
Tsai, 2017, Buchsbaum, 1980, Cheng et al., 2014, 2015, Gehler et al., 2008, Hu et al., 2017, Van
De Weijer et al., 2007). Figure 9 illustrates the white-balance procedure.
[Figure 9 diagram: an auto white balance (AWB) algorithm estimates the sensor’s RGB response to the scene illumination $\ell$ from the input image (e.g., $[\ell_r, \ell_g, \ell_b] = [0.2, 0.8, 0.8]$), and the white-balance correction is applied as a diagonal matrix,
$$\begin{bmatrix} r_{wb} \\ g_{wb} \\ b_{wb} \end{bmatrix} = \begin{bmatrix} 1/\ell_r & 0 & 0 \\ 0 & 1/\ell_g & 0 \\ 0 & 0 & 1/\ell_b \end{bmatrix} \begin{bmatrix} r \\ g \\ b \end{bmatrix}.$$
The panels show the raw sensor image before and after white-balance correction.]
Figure 9: White balance is applied to the image to mimic our visual system’s ability to perform
color constancy. An auto white balance (AWB) algorithm estimates the sensor’s response to the
scene illumination. The raw RGB values of the image are then scaled based on the estimated
illumination.
Once the sensor’s R, G, B values of the scene illumination have been obtained either by a preset
or by the AWB feature, the image is modified (i.e., white-balanced) by dividing all pixels for each
color channel by its corresponding R, G, B illumination value. This is similar to the well-known
diagonal von Kries color adaptation transform (Ramanath & Drew, 2014). The von Kries model is
based on the responses of the eye’s short, medium, and long cone cells, while white balance uses the
sensor’s R, G, B color filter responses.
Color space transform: After white balance is applied, the image is still in the sensor-specific
RGB color space. The color space transform step is performed to convert the image from the sensor’s
raw-RGB color space to a device-independent perceptual color space derived directly from the CIE
1931 XYZ color space. Most cameras use the wide-gamut ProPhoto RGB color space (Süsstrunk
et al., 1999). ProPhoto is able to represent 90% of colors visible to the average human observer.
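In practice this conversion is typically implemented as a 3×3 matrix multiply. The sketch below applies such a matrix; the entries of SENSOR_TO_XYZ are illustrative placeholders, not real calibration data for any sensor.

```python
import numpy as np

# Hypothetical, illustrative sensor-to-XYZ matrix; real values come from per-sensor calibration.
SENSOR_TO_XYZ = np.array([[0.65, 0.28, 0.07],
                          [0.27, 0.66, 0.07],
                          [0.00, 0.05, 0.95]])

def color_space_transform(rgb, matrix=SENSOR_TO_XYZ):
    """Map white-balanced sensor RGB to a device-independent space via a 3x3 matrix."""
    h, w, _ = rgb.shape
    out = rgb.reshape(-1, 3) @ matrix.T          # apply the matrix to every pixel
    return out.reshape(h, w, 3)
```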
Figure 10: Photo-finishing is used to enhance the aesthetic quality of an image. Cameras often have
multiple picture styles. The color manipulation is often performed as a combination of a 3D lookup
table to modify the RGB colors and a 1D lookup table to adjust the image’s tonal values.
Color manipulation: Once the image is in a perceptual color space, cameras apply proprietary
color manipulation to enhance the visual aesthetics of the image. For DSLR devices, this enhance-
ment can be linked to different picture styles or photo-finishing modes that the user can select, such
as vivid, landscape, portrait, and standard. Such color manipulation is often implemented as a 3D
lookup table (LUT) that is used to map the input ProPhoto RGB values to new RGB values based
on a desired manipulation. Figure 10 shows an example. A 1D LUT tone map (discussed next) is
also part of this photo-finishing manipulation.
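Conceptually, applying a 3D LUT amounts to indexing each RGB triplet into a lattice of output colors and interpolating. The sketch below uses nearest-neighbor lookup for brevity (production ISPs use trilinear or tetrahedral interpolation), and the identity LUT is merely a placeholder for a tuned, proprietary table.

```python
import numpy as np

def identity_lut(size=17):
    """Placeholder 3D LUT: maps every RGB value to itself on a size^3 lattice."""
    grid = np.linspace(0.0, 1.0, size)
    r, g, b = np.meshgrid(grid, grid, grid, indexing="ij")
    return np.stack([r, g, b], axis=-1)          # shape (size, size, size, 3)

def apply_3d_lut(rgb, lut):
    """Nearest-neighbor application of a 3D LUT to an image with values in [0, 1]."""
    size = lut.shape[0]
    idx = np.clip(np.round(rgb * (size - 1)).astype(int), 0, size - 1)
    return lut[idx[..., 0], idx[..., 1], idx[..., 2]]
```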
Additional color manipulation may be performed on a smaller set of select colors used to enhance
skin tones. Establishing the 3D LUT can be a time-consuming process and is often performed by
a group of “golden eye” experts who tune the ISP algorithms and tables to produce a particular
photographic aesthetic often associated with a particular camera. Note that camera manufacturers
may even sell the same camera with different color manipulation parameters based on the preferences
of users in different geographical locations. For example, cameras sold in Asia and South America
often have a slightly more vivid look than those sold in European and North American markets.
Tone mapping: A tone map is a 1D LUT that is applied per color channel to adjust the tonal
values of the image. Figure 10 shows an example. Tone mapping serves two purposes. The first,
combined with color manipulation, is to adjust the image’s aesthetic appeal, often by increasing the
contrast. Second, the final output image is usually only 8 to 10 bits per channel (i.e., 256 or 1024
tonal values) while the raw-RGB sensor represents a pixel’s digital value using 10–14 bits (i.e., 1024
up to 16384 tonal values). As a result, it is necessary to compress the tonal values from the wider
tonal range to a tighter range via tone mapping. This adjustment is reminiscent of the human
eye’s adaptation to scene brightness (Land, 1974). Figure 13 shows a typical 1D LUT used for tone
mapping.
Noise reduction: Noise reduction algorithms are a key step to improving the visual quality of the
image. A delicate balance must be struck in removing image noise while avoiding the suppression of
fine-detail image content. Overly aggressive denoising may leave the image with a blurred appearance.
Too little image denoising may result in visual noise being dominant and distracting in the final
image. Given the importance of denoising, there is a large body of literature on this problem,
which we will discuss in detail in Section 3.4.1. Denoising algorithms often consider multiple factors,
including the captured image’s ISO gain level and exposure settings. While we show noise reduction
applied after the color space transform and photo-finishing, some ISPs apply noise reduction before
photo-finishing, or both before and after. Indeed, ISPs often provide denoising algorithms that can
be tuned by camera makers to taste.
Output color space conversion: At this stage in the camera pipeline the image’s RGB values
are in a wide-gamut ProPhoto color space. However, modern display devices can only produce a
rather limited range of colors. As a result, the image is converted to a display-referred (or output-
referred) color space intended for consumer display devices with a narrow color gamut. The most
common color space is the standard RGB (sRGB)3 . Other color spaces, such as AdobeRGB and
Display-P3, are sometimes used. The output color space conversion includes a tone-mapping opera-
tor as part of its color space definition. This final tone-mapping operator is referred to as a “gamma”
encoding. The name comes from the Greek letter used in the formula to model the nonlinear tone
curve. The purpose of the gamma encoding is to code the digital values of the image into a percep-
tually uniform domain (Poynton, 2012). The gamma values used for sRGB and Display-P3 closely
follow Stevens’s power law coefficients for perceived brightness (Stevens, 1961).
Image resizing: The image can be resized based on the user preferences or target output device
(e.g., if the camera is used in a preview mode with a viewfinder). Image resizing is not limited to
image downsizing, but can be employed to upsample a cropped region in the captured image to a
larger size to provide a “digital zoom.” More details of this operation appear in Section 4.2.1.
JPEG compression and metadata: The image is finally compressed, typically with the JPEG
compression standard, and saved. Additional information, such as capture time, GPS location and
exposure setting, can be saved with the image as metadata.
Figure 11: Burst photography used in most mobile imaging pipelines consists of four major steps:
Capture: A burst of frames is captured based on an exposure schedule defining the total number
of frames to capture as well as the exposure time for each individual frame. This defines the total
exposure time for the burst. Align: Frames in the burst are spatially aligned. Merge: The aligned
frames are merged into a single output frame. Enhancement: Encapsulates all post-processing
steps after the merge step, including color manipulation, tone mapping and noise reduction.
(As each new frame is captured, it over-writes the oldest frame in the buffer, keeping a constant number of frames in memory.)
For example (see Figure 11), a total exposure time of 300 ms could be achieved through a schedule of five
frames, each with an exposure time of 60 ms. Similarly, in a burst processing pipeline implementing
HDR through bracketing, the exposure schedule might define a schedule of short, medium and long
exposures. Exposure control for burst processing therefore needs to take into consideration not only
the available light in the scene but also the merge processing and how it impacts the overall exposure.
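As a toy illustration of such scheduling logic (not any production auto-exposure algorithm), the sketch below splits a desired total exposure into frames whose individual exposure times are capped by an assumed motion-blur limit.

```python
import math

def plan_exposure_schedule(total_exposure_ms, max_frame_exposure_ms, max_frames=8):
    """Split a total exposure budget into per-frame exposures under a motion-blur cap."""
    n_frames = min(max_frames, max(1, math.ceil(total_exposure_ms / max_frame_exposure_ms)))
    per_frame = total_exposure_ms / n_frames
    return [per_frame] * n_frames

# e.g., plan_exposure_schedule(300, 60) -> [60.0, 60.0, 60.0, 60.0, 60.0]
```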
Another factor that greatly impacts exposure control is camera shake (e.g., due to being hand-
held), which can introduce motion blur. To enable longer exposure times, modern smartphone
cameras incorporate an optical image stabilizer (OIS), which actively counteracts camera shake.
However, this often does not completely remove the motion and does not help in the case of (local)
subject motion, which is also a source of motion blur. Adapting the exposure schedule in accor-
dance with motion observed in the scene is a common approach used to reduce the impact of motion
blur. In Section 4.1 we further examine exposure control and more severe limitations in the case of
low-light photography.
3.3.2 Alignment
Generating high-quality, artifact-free, images through burst processing relies on robust and accurate
spatial alignment of frames in the captured burst. This alignment process must account for not only
global camera motion (residual motion not compensated for by the OIS) but also local motion in
the scene. There is a long history of frame alignment techniques in the research literature, from
early variational methods that solve the global alignment problem using assumptions of brightness
constancy and spatial smoothness (Horn & Schunck, 1981), to multi-scale approaches solving for
both global and local motion (Bruhn et al., 2005).
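To give a flavor of the simplest possible approach, the sketch below estimates a per-tile integer translation by brute-force search for the offset minimizing the sum of squared differences; practical pipelines use coarse-to-fine, subpixel-accurate refinements of this idea.

```python
import numpy as np

def align_tile(ref_tile, alt_frame, top, left, search=8):
    """Find the integer (dy, dx) offset of alt_frame that best matches ref_tile.

    ref_tile sits at (top, left) in the reference frame; the search is a
    brute-force scan over +/- `search` pixels minimizing the L2 difference.
    """
    th, tw = ref_tile.shape
    best, best_offset = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + th > alt_frame.shape[0] or x + tw > alt_frame.shape[1]:
                continue
            candidate = alt_frame[y:y + th, x:x + tw]
            err = np.sum((candidate - ref_tile) ** 2)
            if err < best:
                best, best_offset = err, (dy, dx)
    return best_offset
```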
Given the omnipresent convenience of the smartphone, photography has been made possible in
the most extreme of conditions. As a result, accurate alignment can be challenging, particularly
in under-exposed or low-light scenes, where noise can dominate the signal. Similarly, over-exposed
scenes can introduce clipping or motion blur, making alignment difficult or impossible due to the
loss of image detail. Even with optimal exposure, complex non-rigid motion, lighting changes, and
occlusions can make alignment challenging.
Although state-of-the-art multi-scale deep learning methods can achieve accurate frame align-
ment in challenging conditions (Sun et al., 2018), they are currently beyond the computational
capabilities of many smartphones. As a result, burst processing on a smartphone is greatly limited
by the accuracy of the alignment process. The exposure schedule must be defined so as to facilitate
accurate alignment, and the merge method must in turn be designed to be robust to misalignments
to avoid jarring artifacts in the merged output (also known as fusion artifacts). Common merge
artifacts due to misalignments include ghosting and zipper artifacts, often observed along the edges
of moving objects in a captured scene.
3.3.3 Merge
Once accurately aligned, the smartphone’s burst processing pipeline must reduce the multiple cap-
tured frames to a single output frame. In static scenes and in the absence of camera shake, a simple
averaging of frames of the same exposure will reduce noise proportionally to the square root of the
total number of merged frames. However, few scenarios arise in real-world photography where such
a simple merging strategy could be effectively applied. Also, such a simple merge strategy under-
utilizes the burst processing approach, which, as previously mentioned, can also facilitate increasing
dynamic range and resolution. In this section, we describe a merge method aimed at reducing
noise and increasing dynamic range called HDR+ (Hasinoff et al., 2016), but later in Section 4.2 we
describe a generalization of the method aimed at increasing resolution as well.
HDR+ was one of the earliest burst processing approaches that saw mass commercial distribution,
featuring in the native camera app of Google’s Nexus and Pixel smartphones. Aimed at reducing
noise and increasing dynamic range, the HDR+ method employs a robust multi-frame merge process
operating on 2–8 frames, achieving interactive end-to-end processing rates. To reduce the impact
of motion blur and to avoid pixel saturation, HDR+ defines an exposure schedule to deliberately
under-expose the captured frames in a zero-shutter-lag (ZSL) buffer. Bypassing the smartphone’s standard (single-
frame) ISP, the merge pipeline operates on the raw Bayer frames directly from the camera’s sensor,
enabling the merge process to benefit from higher bit-depth accuracy and simplifying the modeling
of noise in the pipeline.
Given a reference frame close (in time) to the shutter press, the HDR+ pipeline successively
aligns and merges the alternate frames in the burst, pair-wise. The merging of frame content operates
on tiles and is implemented in the frequency domain. For each reference and alternate tile pair,
a new tile is linearly interpolated between them (per frequency) and averaged with the reference
tile to generate the merged tile output. Given that the merge strategy is applied per frequency,
the merging achieved per tile can be partial. The interpolation weight is defined as a function of
the measured difference between the aligned tile pairs and the expected (i.e., modeled) noise. For
very large measured differences (e.g., possibly due to misalignment), the interpolated output tends
towards the reference tile, whereas for differences much less than the expected noise, the interpolated
output tends towards the alternate tile. By adapting in this way, the merging process provides some
robustness to misalignment, and degrades gracefully to outputting the reference frame only in cases
where misalignment occurs across the entire burst.
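The sketch below is a highly simplified, single-tile version of this kind of pairwise frequency-domain merge; the actual HDR+ implementation operates on overlapped, windowed tiles of Bayer planes with a calibrated noise model, so treat this only as an illustration of the weighting logic.

```python
import numpy as np

def merge_tile_pair(ref_tile, alt_tile, noise_variance, c=8.0):
    """Simplified pairwise frequency-domain merge of an aligned tile pair.

    Per frequency, a tile is interpolated between the alternate and reference
    tiles with a weight driven by the ratio of their measured difference to
    the expected noise, then averaged with the reference tile.
    """
    ref_f = np.fft.fft2(ref_tile)
    alt_f = np.fft.fft2(alt_tile)
    diff2 = np.abs(ref_f - alt_f) ** 2
    # Weight -> 1 (fall back to the reference) when the difference dwarfs the noise,
    # weight -> 0 (trust the alternate frame) when the difference is noise-like.
    w = diff2 / (diff2 + c * noise_variance)
    interp_f = w * ref_f + (1.0 - w) * alt_f
    merged_f = 0.5 * (ref_f + interp_f)          # average with the reference tile
    return np.real(np.fft.ifft2(merged_f))
```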
The final output of the merge process is a Bayer frame with higher bit depth and overall SNR,
which is then passed to a post-merge enhancement stage, including demosaicing, color correction,
and photo-finishing. Of particular importance among these post-merge processes is spatial denoising.
As a consequence of the tile-wise, and partial merging of frames, the resulting merged frame can
have spatially varying noise strength which must be adequately handled by the post-merge denoising
process.
3.4.1 Denoising
As should be clear from the preceding material, filtering an image is a fundamental operation
throughout the computational photography pipeline. Within this class the most widely used canon-
ical filtering operation is one that smooths an image or, more specifically, removes or attenuates the
effect of noise.
The basic design and analysis of image denoising operations have informed a very large part of
the image processing literature (Lebrun et al., 2012, Milanfar, 2013, Takeda et al., 2006, 2007), and
the resulting techniques have often quickly spread or been generalized to address a wider range of
restoration and reconstruction problems in imaging.
Over the last five decades, many approaches have been tried, starting with the simplest averaging
filters and moving to ones that adapted somewhat better (but still rather empirically) to the content
of the given image. With shrinking device sizes, and the rise in the number of pixels per unit area of
the sensor, modern mobile cameras have become increasingly prone to noise. The manufacturers of
these devices, therefore, depend heavily on image denoising algorithms to reduce the spurious effects
of noise. A summary timeline of denoising methods is illustrated in Figure 12.
Only relatively recently in the last decade or so (and concomitant with the broad proliferation
of mobile devices) has a great leap forward in denoising performance been realized (Chatterjee &
Milanfar, 2010). What ignited this recent progress were patch-based methods (Buades et al., 2005,
Efros & Leung, 1999). This generation of algorithms exploits both local and non-local redundancies
or “self-similarities” in the image. The now-commonplace idea is to measure and make use of
affinities (or similarities) between a given pixel or patch of interest, and other pixels or patches
in the given image. These similarities are then used in a filtering (e.g., data-dependent weighted-
averaging) context to give higher weights to contributions from more similar data values, and to
properly discount data points that are less similar.
Early on, the bilateral filter (Tomasi & Manduchi, 1998a) was developed with very much this
idea in mind, as were its spiritually close predecessors like the SUSAN filter (Smith & Brady, 1997).
More recent extensions of these ideas include (Buades et al., 2005, Dabov et al., 2007, Takeda et al.,
2006, Zoran & Weiss, 2011), and other generalizations described in (Milanfar, 2013).
The general construction of many denoising filters begins by specifying a (symmetric positive
semi-definite) kernel $k_{ij}(y) = K(y_i, y_j) \ge 0$, from which the coefficients of the adaptive filter are
constructed. Here $y$ denotes the noisy image, and $y_i$ and $y_j$ denote pixels at locations $i$ and $j$,
respectively.5
Specifically,
$$w_{ij} = \frac{k_{ij}}{\sum_{i=1}^{n} k_{ij}},$$
where the coefficients $[w_{1j}, \cdots, w_{nj}]$ describe the relative contribution of the input (noisy) pixels to
the output pixels, with the constraint that they sum to one:
$$\sum_{i=1}^{n} w_{ij} = 1.$$
5 In practice, it is commonplace to compute the kernel not on the original noisy $y$, but on a “pre-filtered” version
of it, processed with some basic smoothing, with the intent to weaken the dependence of the filter coefficients on
noise.
Let’s concretely highlight a few such kernels which lead to popular denoising/smoothing filters.
These filters are commonly used in the computational photography, imaging, computer vision, and
graphics literature for many purposes.
The simplest example is a purely spatial Gaussian kernel,
$$k_{ij} = \exp\left(\frac{-\|x_i - x_j\|^2}{h^2}\right),$$
where $x_i$ and $x_j$ denote the spatial coordinates of pixels $i$ and $j$, and $h$ is a smoothing parameter.
Such kernels lead to the classical and well-worn Gaussian filters that apply the same weights regardless
of the underlying pixel values.
A richer alternative is the bilateral kernel (Tomasi & Manduchi, 1998a), which also takes the pixel values into account:
$$k_{ij} = \exp\left(\frac{-\|x_i - x_j\|^2}{h_x^2}\right)\exp\left(\frac{-(y_i - y_j)^2}{h_y^2}\right).$$
As can be observed in the exponents, the similarity metric here is a weighted
Euclidean distance between the vectors $(x_i, y_i)$ and $(x_j, y_j)$. This approach has several advantages.
Namely, while the kernel is easy to construct, and computationally simple to calculate, it yields
useful local adaptivity to the pixels.
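A straightforward (and deliberately unoptimized) implementation of a denoiser built from exactly this kind of kernel is sketched below.

```python
import numpy as np

def bilateral_filter(image, radius=3, h_x=2.0, h_y=0.1):
    """Naive bilateral filter: weights combine spatial and intensity similarity."""
    pad = np.pad(image, radius, mode="reflect")
    out = np.zeros_like(image, dtype=np.float64)
    # Precompute the spatial part of the kernel over the (2r+1)^2 window.
    offsets = np.arange(-radius, radius + 1)
    dy, dx = np.meshgrid(offsets, offsets, indexing="ij")
    spatial = np.exp(-(dy**2 + dx**2) / h_x**2)
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            window = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            rng = np.exp(-((window - image[i, j]) ** 2) / h_y**2)   # range (intensity) kernel
            k = spatial * rng
            out[i, j] = np.sum(k * window) / np.sum(k)
    return out
```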
A further generalization replaces the Euclidean distance with a data-adapted quadratic form,
$$k_{ij} = \exp\left(-(x_i - x_j)^\top Q_{ij}\,(x_i - x_j)\right),$$
where $Q_{ij} = Q(y_i, y_j)$ is the covariance matrix of the gradient of sample values estimated from the
given pixels, yielding an approximation of local geodesic distance in the exponent. The dependence
of $Q_{ij}$ on the given data means that the denoiser is highly nonlinear and shift varying. This kernel
is closely related to, but somewhat more general than, the Beltrami kernel of Spira et al. (2007) and
the coherence-enhancing diffusion approach of Weickert (1999).
More recently, methods based on deep convolutional neural networks (CNNs) have become domi-
nant in terms of the quality of the overall results with respect to well-established quantitative metrics
(Burger et al., 2012, Meinhardt et al., 2017). Supervised deep-learning based methods are currently
the state of the art—for example (Burger et al., 2012, Chen et al., 2015, Liu et al., 2018a,b, Mao
et al., 2016, Remez et al., 2018, Tai et al., 2017, Wang et al., 2015, Zhang et al., 2017, 2018a,b).
However, these CNN-based approaches are yet to become practical (especially on mobile devices)
due not only to heavy computation and memory demands, but also to their tendency to sometimes
produce artifacts that are unrealistic with respect to more qualitative perceptual measures.
Finally, it is worth noting that as denoising methods evolve from traditional signal/image-
processing approaches to deep neural networks, there has been an increasing need for training sets
comprised of images that accurately represent noise found on small sensors used in camera phones.
Towards this goal, the recent DND (Plötz & Roth, 2017) and SIDD datasets (Abdelhamed et al.,
2018) provide images captured directly from such devices for use in DNN training. Both works have
shown that training using real images vs. those synthesized from existing noise models provides
improved performance. This hints at the need for better noise models in the literature that are able
to capture the true characteristics of small camera sensors.
A tone map can be expressed as
$$s = T(r),$$
where the function $T$ produces a new intensity level $s$ for every input level $r$ (i.e., it maps input tones to
output tones). Tone mapping is applied either globally or locally. A global tone map, often referred
to as a tone curve, is applied to all pixels’ intensity values in the image irrespective of the pixel’s
spatial location. A global tone map is constrained to satisfy the following conditions:
(1) T (r) is single-valued and monotonically increasing; and
(2) 0 ≤ T (r) ≤ 1.
Because images have discrete intensity values, a global T (r) can be implemented as a 1D LUT.
A global tone map can be applied to each color channel, or a separate tone map can be designed per
color channel. In addition, tone maps can be customized depending on the mode of imaging. For
example, in burst mode for low-light imaging, a tone map can be adjusted to impart a night scene’s
look and feel. This can be done using a tone map that maintains strong contrast with dark shadows
and strong highlights (Levoy & Pritch, 2018).
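As an illustration, the sketch below builds a simple monotonic global tone curve as a 1D LUT (a plain gamma-style curve, standing in for a tuned curve) and applies it to an image.

```python
import numpy as np

def build_tone_lut(num_levels=256, gamma=1.0 / 2.2):
    """Monotonic global tone curve T(r) stored as a 1D lookup table."""
    r = np.linspace(0.0, 1.0, num_levels)
    return np.power(r, gamma)            # satisfies 0 <= T(r) <= 1 and monotonicity

def apply_tone_lut(image, lut):
    """Apply a 1D LUT to an image with values in [0, 1] (per channel)."""
    idx = np.clip((image * (len(lut) - 1)).round().astype(int), 0, len(lut) - 1)
    return lut[idx]
```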
[Figure 13 panels: the HDR image after burst fusion; the image after a global tone map (plotted as output tones s versus input tones r) is applied; and the image after local tone mapping is applied.]
Figure 13: An example of a global tone map applied to an HDR image and local tone maps that
vary spatially based on the image content.
Local tone mapping adjusts intensity values in a spatially varying manner. This is inspired by the
human visual system’s sensitivity to local contrast. Local tone mapping methods, often referred to
as tone operators, examine a spatial neighborhood about a pixel to adjust the intensity value (Ahn
et al., 2013, Banterle et al., 2017, Cerda et al., 2018, Ma et al., 2015, Paris et al., 2011, Reinhard
et al., 2002). As a result, the single-valued and monotonicity constraints are not always enforced as
done in global tone mapping. For example, in the case of burst imaging for HDR, intensity values
from the multiple images can be combined in a manner that darkens or lightens regions to enhance
the image’s visual quality. Most methods decomposed the input image into a base layer and one or
more detail layers. The detail layers are adjusted based on local contrast, while a global tone map
modifies the base layer. Figure 13 shows an HDR image that has been processed using both global
and local tone mapping.
3.4.7 Sharpening
Image sharpness is one of the most important attributes that defines the visual quality of a pho-
tograph. Every image processing pipeline has a dedicated component to mitigate the blurriness in
the captured image. Although there are several methods that try to quantify the sharpness of a
digital image, there is no clear definition that perfectly correlates with the quality perceived by our
visual system. This makes it particularly difficult to develop algorithms and properly adjust their
parameters in such a way that they produce appealing visual results across the universe of use cases while
introducing minimal artifacts.
Image blur can be observed when the camera’s focus is not correctly adjusted, when the objects
in the scene appear at different depths, or when there is relative motion between the camera and the
scene during the exposure (motion blur, camera shake). Even when a photograph is
perfectly shot, there are unavoidable physical limitations that introduce blur. Light diffraction due
to the finite lens aperture, integration of the light in the sensor, and other possible lens aberrations
introduce blur, leading to a loss of details. Additionally, other components of the image processing
pipeline itself, particularly demosaicing and denoising, introduce blur.
A powerful yet simple model of blur is to assume that the blurry image is formed by the local
average of nearby pixels of a latent unknown sharp image that we would like to estimate. This local
average acts as a low-pass filter attenuating the high-frequency image content, introducing blur.
This can be formally stated as a convolution operation—that is,
$$v[i, j] = \sum_{k,l} h[k, l]\, u[i - k, j - l],$$
where v is the blurry image that we want to enhance, u is the ideal sharp image that we don’t have
access to, and h models the typically unknown blur filter.
There are two conceptually different approaches to remove image blur and increase apparent
sharpness. Sharpening algorithms seek to directly boost high- and mid-frequency content (e.g.,
image details, image edges) without explicitly modeling the blurring process. These methods are
sometimes also known as edge enhancement algorithms since they mainly increase edge contrast. On
the other hand, de-convolution methods try to explicitly model the blurring process by estimating
a blur kernel h and then trying to invert it. In practice, there are infinitely many possible
combinations of u and h that can lead to the same image v, which implies that recovering u from v
is an ill-posed problem. One of these infinitely many solutions is the no-blur explanation: u =
v, and h is the trivial kernel that leaves the image unaltered. This implies that the degradation
model is not sufficient to disentangle the blur h and the image u from the input image v, and more
information about h and/or u (prior) is needed.
Most blind deconvolution methods proceed in two steps: a blur kernel is first estimated and
then using the estimated kernel a non-blind deconvolution step is applied. These methods generally
combine natural image priors (i.e., what characteristics does a natural sharp image have), and
assumptions on the blur kernel (e.g., maximum size) to cast the blind deconvolution problem as one
of variational optimization (El-Henawy et al., 2014, Fergus et al., 2006, Levin et al., 2009). In the
specific case of deblurring slightly blurry images, we can proceed in a more direct way by filtering the
image with an estimate of the blur and thus avoid using costly optimization procedures (Delbracio
Figure 14: Example of deblurring a mildly blurry image using Polyblur (Delbracio et al., 2020). The
estimated blur kernel is shown at the top right of the right panel.
[Figure 15 panels: lateral inhibition in vertebrates, from (Kramer & Davenport, 2015); the Laplace operator; and the Mach bands illusion, with perceived contrast plotted against pixel values.]
Figure 15: Mach bands: The human visual system enhances local changes in contrast by exciting
and inhibiting regions in a way similar to the action of the Laplace operator. Darker (brighter) areas
appear even darker (brighter) close to the boundary of two bands.
et al., 2020, Hosseini & Plataniotis, 2019). Figure 14 shows an example of Polyblur (Delbracio et al.,
2020), which efficiently removes blur by estimating the blur and combining multiple applications of
the estimated blur to approximate its inverse.
One of the most popular and simplest sharpening algorithms is unsharp masking. First, a copy
of the given image is further blurred to remove high-frequency image details. This new image is
subtracted from the original one to create a residual image that contains only image details. Finally,
a fraction of this residual image is added back to the original one, which results in boosting of the
high-frequency details. This procedure has its roots in analog photography, where a blurry positive
image is combined with the original negative to create a new more contrasted photograph. A typical
digital implementation of unsharp masking uses a Gaussian blur:
$$\hat{u} = u + \kappa\,(u - G_\sigma u),$$
where $G_\sigma$ is the Gaussian blur operator having strength $\sigma$. The parameter $\kappa$ and the amount of
blur $\sigma$ should be empirically set.
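A compact sketch of this operation, relying on SciPy's Gaussian filter for the blur operator, is:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(image, sigma=1.5, kappa=0.7):
    """Boost high-frequency detail: u_hat = u + kappa * (u - G_sigma(u))."""
    blurred = gaussian_filter(image, sigma=sigma)   # low-pass version of the image
    residual = image - blurred                      # detail (high-frequency) layer
    return np.clip(image + kappa * residual, 0.0, 1.0)
```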
At a very broad level, the human visual system (HVS) behaves similarly to unsharp masking
and the Laplace operator (Ratliff, 1965). The center-surround receptive fields present in the eye
have both excitatory (center) and inhibitory (surrounding) regions. This leads our visual system to
enhance changes in contrast (e.g., edge-detection) by exciting and inhibiting regions in a way similar
to the action of the Laplace operator. In fact, one manifestation of this phenomenon is the well-known
Mach bands illusion, where the contrast between edges is exaggerated by the HVS (Figure 15).
There are different variants of this high-frequency boosting principle. (Kovásznay & Joseph,
1955) introduced the idea that a mildly blurred image could be deblurred by subtracting a small
amount of its Laplacian:
$$\hat{u} = u - \kappa\,\Delta u.$$
In fact, this method is closely related to unsharp masking, where the residual mask $u - G_\sigma u$ is
replaced with the negative of the image Laplacian $-\Delta u$.
A perhaps not very well-known fact is that the Nobel Prize winner Dennis Gabor studied this
process and determined how to set the best amount to subtract (Lindenbaum et al., 1994). In
fact, (Lindenbaum et al., 1994) showed that Laplacian sharpening methods can be interpreted as
approximating inverse diffusion processes—for example, by diffusion according to the heat equation,
but in reverse time. This connection has led to numerous other sharpening methods in the form
of regularized partial differential equations (Buades et al., 2006, Osher & Rudin, 1990, You et al.,
1996).
4 Compound Features
A common adage has emerged that describes the age of mobile cameras aptly: “The best camera is
the one that’s with you.” The sentiment expressed here declares that there is no longer any need to
carry a second camera, if you have a mobile phone in your pocket that has a camera with “nearly
the same functionality.” Of course, the key caveat is nearly the same. Some of what we take for
granted in a large form-factor camera is possible only because of the form factor. To approximate
those functions we must combine various pieces of technology to emulate the end result.
Here we briefly describe how we combine the basic techniques described earlier to enable advanced
features that not only approximate some functionalities of larger cameras, but also sometimes even
exceed them. For instance, the hybrid optical/digital zoom requires state-of-the-art multi-frame
merge and single-frame upscaling technologies. A second example is synthetic bokeh (e.g., synthe-
sizing shallow DoF), which requires both segmentation of the image for depth, and application of
different processing to the foreground vs. the background.
AWB which is trained specifically on night scenes. Liba et al. (2019) also introduce a creative
tone-mapping solution that draws inspiration from artists’ portrayal of night scenes in paintings by
keeping shadows close to black and boosting color saturation (Levoy & Pritch, 2018).
Extending the low-light capabilities of photography even further, some smartphones now offer
Astrophotography modes. This has been made possible through more sophisticated motion detection
modes that utilize on-device sensors to detect tripod (or non-handheld) mounting, enabling synthetic
exposure times of over four minutes (Kainz & Murthy, 2019).
Figure 16: Overview of super-resolution from raw images: A captured burst of raw (Bayer CFA)
images (a) is the input to our algorithm. Every frame is aligned locally (d) to a single frame, called
the base frame. We estimate each frame’s contribution at every pixel through kernel regression
(g). The kernel shapes (c) are adjusted based on the estimated local gradients (b) and the sample
contributions are weighted based on a robustness model (f ). This robustness model computes a
per-pixel weight for every frame using the alignment field (d) and local statistics (e) gathered from
the neighborhood around each pixel. The final merged RGB image (h) is obtained by normalizing
the accumulated results per channel. We call the steps depicted in (b)–(g) the merge step. (Figure
from (Wronski et al., 2019))
Super-resolution is arguably not an alien process to the human visual system. It would appear
that the human brain also processes visual stimuli in a way that allows us to discriminate details
beyond the physical resolution given by optics and retinal sampling alone. This is commonly known
as visual hyperacuity, as in (Westheimer, 1975). A possible mechanism of visual super-resolution is
the random eye micro-movements known as microsaccades and ocular drifts (Intoy & Rucci, 2020,
Rucci et al., 2007).
Interestingly, in the super-resolution work described in Wronski & Milanfar (2018), natural hand
tremors play a similar role to eye movements. A natural, involuntary hand tremor is always present
when we hold any object. This tremor consists of low-amplitude, high-frequency motion with two parts:
a mechanical-reflex component, and a second component that causes micro-contractions
in the limb muscles (Riviere et al., 1998). In (Wronski et al., 2019), it was shown that the hand
tremor of a user holding a mobile camera is sufficient to provide sub-pixel movements across the
images in a burst for the purpose of super-resolution. Experimental measurements of such tremor
in captured bursts of images from a mobile device are illustrated in Figure 17.
As illustrated in Figure 19, the merge algorithm alone is able to deliver resolution
roughly comparable to that of a dedicated telephoto lens at a modest magnification factor, no more than
2×. Of course, the same algorithm can also be applied to the telephoto lens itself, again typically with
even more modest gains in resolution. This suggests that to have a general solution for zoom across
a broad range of magnifications, a combination of multi-frame merge and high-quality single-frame
crop-and-upscale is required (see Figure 18). This upscaling technology is described next.
4.2.1 Upscaling
Digital zoom, also known as crop-and-scale, allows the user to change the field of view (FOV) of
the captured photograph through digital post-processing. This operation is essential to allow the
photographer to zoom even in cameras that have a dedicated telephoto lens (optical zoom). The
operation consists of cropping the image to the desired FOV and then digitally enlarging the cropped
Figure 17: Horizontal and vertical angular displacement (excluding translational displacement) mea-
sured from handheld motion across 86 bursts. Red circle corresponds to one standard deviation, or
roughly 0.9 pixels. (Figure from (Wronski et al., 2019))
Figure 18: The full Super Res Zoom pipeline enhances image resolution in two distinct ways: the
merge step, and single-frame upscaling step.
Figure 19: Merging a burst onto different target grid resolutions: from left to right, 1×, 1.5×,
2×. The combination of the Super Res Zoom algorithm and the phone’s optical system leads to
significantly improved results when merged onto a 1.5× grid; small improvements up to 2× zoom are
also noted. (Figure from (Wronski et al., 2019))
structure encoded by the gradient strength, orientation, and coherence. Examples of such trained
filter banks are shown in Figure 21. Within each subset of filters, the angle varies from left to right; the
top, middle, and bottom three rows correspond to low, medium, and high coherence. It is important
to note that the filters at different upscaling factors are not trivial transformations of one another.
For instance, the 3× filters are not derived from the 2× filters—each set of filters carries novel
information from the training data. Given the apparent regularity of these filters, it may also be
tempting to imagine that they can be parameterized by known filter types (e.g., Gabor). This is not
the case. Specifically, the phase-response of the trained filter is deeply inherited from the training
data, and no parametric form has been found that is able to mimic this generally.
[Figure 20 diagram: in the learning stage, pairs of LR and HR images are used to learn filters; at run time, a cheap upscaling of the LR image followed by the learned filter produces the output image.]
Figure 20: The learning and application of a filter that maps a class of low-resolution patches to their
high-resolution versions. More generally, a set of such filters is learned, indexed by local geometric
structures, shown in Figure 21.
The overall idea behind RAISR and related methods is that well-defined structures, like edges,
can be better interpolated if the interpolation kernel makes use of the specific orientation and local
structure properties. During execution, RAISR scans the input image and defines which kernel to
use on each pixel and then computes the upscaled image using the selected kernels on a per pixel
basis. The overall structure of the RAISR algorithm is shown in Figure 20.
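The sketch below conveys the run-time structure of such a hashed-filter upscaler: pixels are bucketed by quantized gradient angle, strength, and a crude coherence proxy, and each bucket applies its own small filter. The random near-identity filter bank is only a placeholder for filters learned offline from LR/HR pairs; this is not the actual RAISR training procedure or its learned filters (which hash on the eigen-structure of the local gradient covariance).

```python
import numpy as np
from scipy.ndimage import convolve, sobel

def hash_buckets(image, n_angle=8, n_strength=3, n_coherence=3):
    """Assign each pixel a bucket index from its local gradient angle/strength/coherence."""
    gy, gx = sobel(image, axis=0), sobel(image, axis=1)
    angle = (np.arctan2(gy, gx) % np.pi) / np.pi                     # in [0, 1)
    strength = np.hypot(gx, gy)
    coherence = np.abs(np.abs(gx) - np.abs(gy)) / (strength + 1e-6)  # crude anisotropy proxy
    a = np.minimum((angle * n_angle).astype(int), n_angle - 1)
    s = np.minimum((strength / (strength.max() + 1e-6) * n_strength).astype(int), n_strength - 1)
    c = np.minimum((coherence * n_coherence).astype(int), n_coherence - 1)
    return (a * n_strength + s) * n_coherence + c                    # flat bucket id per pixel

def apply_hashed_filters(cheap_upscaled, filter_bank):
    """Filter each pixel with the small kernel selected by its bucket."""
    buckets = hash_buckets(cheap_upscaled)
    out = np.zeros_like(cheap_upscaled)
    for b, kernel in enumerate(filter_bank):
        filtered = convolve(cheap_upscaled, kernel, mode="reflect")
        out[buckets == b] = filtered[buckets == b]
    return out

# Placeholder "learned" filters: 72 near-identity 5x5 kernels with mild random perturbation.
rng = np.random.default_rng(0)
bank = []
for _ in range(8 * 3 * 3):
    k = rng.normal(0, 0.01, (5, 5))
    k[2, 2] += 1.0
    bank.append(k / k.sum())
```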
RAISR is trained to enlarge an input image by an integer factor (2×–4×), but it does not allow
Figure 21: Visualization of the learned filter sets for 2×, and 3× upscaling, indexed by angle,
strength, and coherence-based hashing of patch gradients. (Figure from (Romano et al., 2017))
intermediate zooms. In practice, RAISR is combined with a linear interpolation method (bicubic,
Lanczos) to generate the zoom factor desired by the user.
With the advancement of deep neural networks in recent years, a wide variety of new image up-
scaling methods have emerged (Wang et al., 2020). Similar to RAISR, deep-learning-based methods
propose to learn from image examples how to map a low-resolution image into a high-resolution
one. These methods generally produce high-quality results but they are not as computationally
efficient as shallow interpolation methods, such as RAISR, so their use in mobile phones is not yet
widespread. Undoubtedly, deep image upscaling is one of the most active areas of research. Recent
progress in academic research, combined with more powerful and dedicated hardware, may produce
significant improvements that could be part of the next generation of mobile cameras.
4.3 Synthetic bokeh
One of the main characteristics of mobile phone cameras is that the whole image is either entirely in
focus or entirely out of focus. The depth of field, defined as the range of depths that are in focus (sharp), is frequently used
by photographers to distinguish the main subject from the background. As discussed in Section 2,
due to the small and fixed aperture used on smartphone cameras, capturing a shallow DoF image is
virtually impossible.
[Figure 22 panels: the input image with a detected face; the segmentation mask, and the mask combined with a disparity map; and the resulting synthetic shallow-DoF image.]
Figure 22: Shallow depth of field can be computationally introduced by blurring an all-in-focus
image using depth estimation and segmentation. Image courtesy of (Wadhwa et al., 2018).
The range of the depth of field is inversely proportional to the size of the camera aperture: a
wide aperture produces a shallow depth of field while a narrow aperture leads to a wider depth of
field. On mobile phone cameras, physical limitations rule out a wide aperture, and hence an optically
shallow depth of field. Although an
all-in-focus image retains the most information, for aesthetic and artistic reasons, users may want
to have control of the depth of field.
Recently, mobile phone manufacturers have introduced a computational (shallow) depth-of-field
effect called a “synthetic bokeh” (see Figure 22). An accurate depth map estimate would enable
computationally introducing spatially varying depth blur and simulating the depth-of-field effect. The
traditional solution to estimate a depth map is based on stereo vision and requires two cameras.
Adding a second camera introduces additional costs and increases power consumption and size.
An alternative is to introduce a dedicated depth sensor based on structured light or time-of-flight
technologies. However, these tend to be expensive and mainly work indoors, which significantly
restricts their use. Accurately estimating a depth map from a single image is a severely ill-posed
problem that generally leads to very limited accuracy.
Wadhwa et al. (2018) introduced a system to synthetically generate a depth-of-field effect on
smartphone cameras. The method runs completely on the device and uses only the information
from a single camera (rear or front facing). The key idea is to incorporate a deep neural network
to segment out people and faces, and then use the segmentation to adaptively blur the background.
Additionally, if available, the system uses dual-pixel information now present in many hardware
auto-focus systems. The dual-pixel data provides very small baseline stereo information that allows
algorithmic generation of dense depth maps.
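At its core, the final rendering step composites a sharp subject over a blurred background. A highly simplified sketch of that idea, ignoring depth-dependent blur radii, occlusion handling, and realistic bokeh kernel shapes, is:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthetic_bokeh(image, person_mask, blur_sigma=8.0):
    """Composite the in-focus subject over a blurred background.

    `person_mask` is a soft segmentation mask in [0, 1] (1 = subject).
    A real pipeline would vary the blur with estimated depth/disparity.
    """
    blurred = np.stack([gaussian_filter(image[..., c], blur_sigma) for c in range(3)], axis=-1)
    alpha = person_mask[..., None]               # broadcast the mask over color channels
    return alpha * image + (1.0 - alpha) * blurred
```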
Meanwhile, the front-facing camera is used almost exclusively to take “selfie” photos—that is, a
close-up of the face and upper half of the photographer’s body. A neural network trained for this
type of image allows segmenting the main character out from the background. The background is
then appropriately blurred to give the idea of depth of field. When using the rear-facing camera,
no prior information about the photograph composition can be assumed. Thus, having dense depth
information becomes crucial.
It is worth mentioning that this computational DoF system does not necessarily lead to a physi-
cally plausible photograph as would have been taken by a camera with a wider aperture—it merely
suggests the right look. For instance, among other algorithm design choices, all pixels belonging to
the segmentation mask are assumed to be in focus even if they are at different depths.
5.1 Algorithmics
The impressive success of neural networks in computer vision has not (yet) been widely replicated in
practical aspects of computational photography. Indeed, the impressive progress in mobile imaging
has largely been facilitated by methods that are mostly not based on deep neural networks (DNN).
Given the proliferation of DNNs in every other aspect of technology, this may seem surprising. Two
observations may help explain this landscape. First, DNNs still have relatively heavy computing and memory requirements that mostly exceed the current capabilities of mobile devices.
This may change soon. Second, resource limitations aside, DNN-based (particularly the so-called
generative) models still have the tendency to produce certain artifacts in the final images that are
either undesirable or intolerable in a consumer device. Furthermore, such errors are not easily diagnosed and repaired because, unlike conventional methods, the behavior of deep models is not easy to “tune.” These issues, too, will likely be remedied in due time. Meanwhile, DNN approaches intended to replace the entire end-to-end processing pipeline continue to be developed; examples include DeepISP (Schwartz et al., 2018) and the work of Ignatov et al. (2020), to name just two.
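As a rough illustration of what replacing the pipeline end to end means in practice, a minimal learned ISP might map a packed four-channel Bayer mosaic directly to a full-resolution RGB image with a small convolutional network. The sketch below, in PyTorch, is an assumption-laden toy and does not reproduce the architecture of DeepISP or of Ignatov et al. (2020); every layer width and kernel size is an arbitrary choice.

import torch
import torch.nn as nn

class TinyLearnedISP(nn.Module):
    """Toy end-to-end ISP: packed RAW Bayer in, full-resolution RGB out.

    Input:  (N, 4, H/2, W/2) tensor holding the R, G, G, B planes of the mosaic.
    Output: (N, 3, H, W) RGB image. A sketch of the idea only, not any
    published architecture.
    """
    def __init__(self, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 3 * 4, 3, padding=1),  # 3 RGB channels x 2x2 upsampling
        )
        self.upsample = nn.PixelShuffle(2)  # (N, 12, H/2, W/2) -> (N, 3, H, W)

    def forward(self, raw):
        return torch.clamp(self.upsample(self.body(raw)), 0.0, 1.0)

# Example: one random packed-Bayer frame (small, for illustration).
raw = torch.rand(1, 4, 128, 128)
rgb = TinyLearnedISP()(raw)  # shape: (1, 3, 256, 256)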
5.2 Curation
Today, nearly everyone who can afford a smartphone owns one, and we take and share more photos than ever. Given how little cost and effort picture-taking entails, we have also developed a tendency to capture multiple pictures of the same subject. Yet typically only a few of our many shots turn out well, or to our liking. As such, storage, curation, and retrieval of photographs have become another aspect of photography that has recently drawn attention and deserves much more work. Recent work (Talebi & Milanfar, 2018) has trained neural network models on large collections of images annotated for technical and aesthetic quality, enabling machine evaluation of images in both qualitative and quantitative terms. Of course, this technology is in the very early stages of development; it reflects aggregate opinion and is not yet meant to cater to personal taste. Similar models can also rank photos based on their technical quality—aspects
such as whether the subject is well lit, centered, and in focus. Needless to say, much work remains
to be done here.
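Models of this kind typically predict, for each image, a distribution over human opinion scores; curation can then be as simple as ranking a burst or an album by the predicted mean score. The sketch below assumes a hypothetical predict_score_distribution function standing in for a trained network such as that of Talebi & Milanfar (2018); it illustrates only the ranking step, not the model itself.

import numpy as np

def mean_opinion_score(dist, scores=np.arange(1, 11)):
    """Mean of a predicted distribution over the ten score buckets 1..10."""
    dist = np.asarray(dist, dtype=np.float64)
    return float(np.dot(dist / dist.sum(), scores))

def rank_photos(photos, predict_score_distribution):
    """Sort photos from best to worst by predicted mean quality score.

    predict_score_distribution is a stand-in (an assumption, not a real API)
    for a trained model mapping an image to a 10-bin opinion-score histogram.
    """
    scored = [(mean_opinion_score(predict_score_distribution(p)), p)
              for p in photos]
    return [p for _, p in sorted(scored, key=lambda t: t[0], reverse=True)]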
certain, and varied, information extracted from the images. For instance, in the medical realm the end task may be diagnostic, and this may not be best served by a “pretty” picture; what is required instead is a maximally “informative” picture. Correspondingly, cameras on mobile devices are not built, configured, or tuned to provide such information (for instance, automatic white balance can be counter-productive if the end goal is to make a physical measurement, as the toy example below illustrates). An interesting case study is the role that cameras have played (or could better have played) in recent environmentally calamitous events such as the massive wildfires in California. Nearly all cameras, tuned for normal viewing conditions and biased toward making pictures pleasing, were largely unable to capture the true physical attributes of the dark, orange-hued skies polluted with smoke.
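The following toy calculation, with made-up numbers, shows how per-channel white-balance gains silently alter the channel ratios that carry physical information about the scene; nothing in it refers to any particular device or pipeline.

import numpy as np

# Hypothetical raw sensor reading of an orange, smoke-filled sky (illustrative values).
raw_rgb = np.array([0.80, 0.45, 0.20])

# Per-channel white-balance gains chosen to make the scene look roughly neutral
# (again, made-up values).
wb_gains = np.array([1.0, 1.4, 2.6])

rendered_rgb = raw_rgb * wb_gains

print("raw      R/B ratio:", raw_rgb[0] / raw_rgb[2])            # 4.0
print("rendered R/B ratio:", rendered_rgb[0] / rendered_rgb[2])  # about 1.54

The rendered image looks more neutral and pleasing, but the physical redness of the sky can no longer be recovered from it without knowing the gains that were applied.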
The bottom line is that mobile cameras, properly re-imagined and built, can play an even more useful and instrumental role in our lives than they do today.
5.4 Epilogue
The technology behind computational photography has advanced rapidly in the last decade—the
science and engineering techniques that generate high-quality images from small mobile cameras will
continue to evolve. But so too will our needs and tastes for the types of devices we are willing to
carry around, and the kinds of visual or other experiences we wish to record and share.
It is hard to predict with any certainty what the mobile devices of the future will look like. But
as surely as Ansel Adams would not have seen the mobile phone camera coming, we too may be surprised by both the form and the vast new uses of these devices in the next decade.
Disclosure Statement
The authors are not aware of any affiliations, memberships, funding, or financial holdings that might
be perceived as affecting the objectivity of this review.
Acknowledgments
The authors wish to acknowledge the computational imaging community of scholars and colleagues—
industrial and academic alike—whose work has led to the advances reported in this review. While
we could not cite them all, we dedicate this paper to their collective work.
References
Abdelhamed A, Lin S, Brown MS. 2018. A high-quality denoising dataset for smartphone cameras,
In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1692–1700
Adams A. 1935. Making a photograph: An introduction to photography. Studio Publishing
Ahn H, Keum B, Kim D, Lee HS. 2013. Adaptive local tone mapping based on Retinex for high
dynamic range images, In IEEE International Conference on Consumer Electronics, pp. 153–156
Banterle F, Artusi A, Debattista K, Chalmers A. 2017. Advanced high dynamic range imaging. CRC
Press
Barron JT, Tsai YT. 2017. Fast Fourier Color Constancy, In IEEE Conference on Computer Vision
and Pattern Recognition
Bayer BE. 1975. Color imaging array. US Patent 3,971,065
Bruhn A, Weickert J, Schnörr C. 2005. Lucas/Kanade meets Horn/Schunck: Combining local and
global optic flow methods. International Journal of Computer Vision 61:211–231
Buades A, Coll B, Morel JM. 2005. A review of image denoising algorithms, with a new one. Multi-
scale Modeling and Simulation (SIAM Interdisciplinary Journal) 4:490–530
Buades A, Coll B, Morel JM. 2006. Image enhancement by non-local reverse heat equation. Preprint
CMLA 22:2006
Buchsbaum G. 1980. A spatial processor model for object colour perception. Journal of the Franklin
Institute 310:1–26
Burger HC, Schuler CJ, Harmeling S. 2012. Image denoising: Can plain neural networks compete
with BM3D?, In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2392–2399
Burr D, Ross J, Morrone MC. 1986. Seeing objects in motion. Proceedings of the Royal Society of
London. Series B. Biological sciences 227:249–265
Cerda X, Parraga CA, Otazu X. 2018. Which tone-mapping operator is the best? A comparative
study of perceptual quality. Journal of the Optical Society of America A 35:626–638
Chatterjee P, Milanfar P. 2010. Is denoising dead? IEEE Transactions on Image Processing 19:895–
911
Chen Y, Yu W, Pock T. 2015. On learning optimized reaction diffusion processes for effective image
restoration, In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5261–5269
Cheng D, Prasad DK, Brown MS. 2014. Illuminant estimation for color constancy: Why spatial-
domain methods work and the role of the color distribution. Journal of the Optical Society of
America A 31:1049–1058
Cheng D, Price B, Cohen S, Brown MS. 2015. Effective learning-based illuminant estimation using
simple features, In IEEE Conference on Computer Vision and Pattern Recognition
Dabov K, Foi A, Katkovnik V, Egiazarian K. 2007. Image denoising by sparse 3-D transform-domain
collaborative filtering. IEEE Transactions on Image Processing 16:2080–2095
Debevec PE, Malik J. 1997. Recovering high dynamic range radiance maps from photographs, In
SIGGRAPH, pp. 369–378
Delbracio M, Garcia-Dorado I, Choi S, Kelly D, Milanfar P. 2020. Polyblur: Removing mild blur by
polynomial reblurring. arXiv preprint arXiv:2012.09322
Efros AA, Leung TK. 1999. Texture synthesis by non-parametric sampling, In IEEE International
Conference on Computer Vision, vol. 2, pp. 1033–1038, IEEE
El-Henawy I, Amin A, Kareem Ahmed HA. 2014. A comparative study on image deblurring tech-
niques. International Journal of Advances in Computer Science and Technology (IJACST) 3:01–08
Elad M. 2002. On the origin of the bilateral filter and ways to improve it. IEEE Transactions on
Image Processing 11:1141–1150
Farsiu S, Elad M, Milanfar P. 2006. Multiframe demosaicing and super-resolution of color images.
IEEE Trans. Image Processing 15:141–159
Fergus R, Singh B, Hertzmann A, Roweis ST, Freeman WT. 2006. Removing camera shake from a
single photograph. ACM Transactions on Graphics 25:787–794
Gehler PV, Rother C, Blake A, Minka T, Sharp T. 2008. Bayesian color constancy revisited, In
IEEE Conference on Computer Vision and Pattern Recognition
Gharbi M, Chaurasia G, Paris S, Durand F. 2016. Deep joint demosaicking and denoising. ACM
Transactions on Graphics 35:191
Godard C, Matzen K, Uyttendaele M. 2018. Deep burst denoising, In European Conference on
Computer Vision, pp. 560–577
Hasinoff SW, Durand F, Freeman WT. 2010. Noise-optimal capture for high dynamic range photog-
raphy, In IEEE Conference on Computer Vision and Pattern Recognition, pp. 553–560
Hasinoff SW, Sharlet D, Geiss R, Adams A, Barron JT, et al. 2016. Burst photography for high
dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics 35:192
Horn BK, Schunck BG. 1981. Determining optical flow, In Techniques and Applications of Image
Understanding, vol. 281, pp. 319–331, International Society for Optics and Photonics
Hosseini MS, Plataniotis KN. 2019. Convolutional deblurring for natural imaging. IEEE Transactions
on Image Processing 29:250–264
Hu Y, Wang B, Lin S. 2017. FC4: Fully convolutional color constancy with confidence-weighted
pooling, In IEEE Conference on Computer Vision and Pattern Recognition
Ignatov A, Van Gool L, Timofte R. 2020. Replacing mobile camera ISP with a single deep learning
model, In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2275–
2285
Intoy J, Rucci M. 2020. Finely tuned eye movements enhance visual acuity. Nature Communications
11:1–11
Jiang J, Liu D, Gu J, Süsstrunk S. 2013. What is the space of spectral sensitivity functions for
digital color cameras?, In IEEE Workshop on Applications of Computer Vision, pp. 168–179
Kainz F, Murthy K. 2019. Astrophotography with Night Sight on Pixel Phones.
https://ai.googleblog.com/2019/11/astrophotography-with-night-sight-on.html. [Online; ac-
cessed 06-Nov-2020]
Kelber A, Yovanovich C, Olsson P. 2017. Thresholds and noise limitations of colour vision in dim
light. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 372
Kovásznay LS, Joseph HM. 1955. Image processing. Proceedings of the Institute of Radio Engineers
43:560–570
Kramer RH, Davenport CM. 2015. Lateral inhibition in the vertebrate retina: the case of the missing
neurotransmitter. PLoS Biology 13:e1002322
Land EH. 1974. The retinex theory of colour vision. Proc. Roy. Institution Gr. Britain 47:23–58
Lebrun M, Colom M, Buades A, Morel JM. 2012. Secrets of image denoising cuisine. Acta Numerica
21:475
Levin A, Weiss Y, Durand F, Freeman WT. 2009. Understanding and evaluating blind deconvolution
algorithms, In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1964–
1971, IEEE
Levoy M, Pritch Y. 2018. Night Sight: Seeing in the Dark on Pixel Phones.
https://ai.googleblog.com/2018/11/night-sight-seeing-in-dark-on-pixel.html. [Online; accessed
06-Nov-2020]
Liba O, Murthy K, Tsai YT, Brooks T, Xue T, et al. 2019. Handheld mobile photography in very
low light. ACM Transactions on Graphics 38:1–16
Lindenbaum M, Fischer M, Bruckstein A. 1994. On Gabor’s contribution to image enhancement.
Pattern Recognition 27:1–8
Liu D, Wen B, Fan Y, Loy CC, Huang TS. 2018a. Non-local recurrent network for image restoration,
In Advances in Neural Information Processing Systems, pp. 1673–1682
Liu P, Zhang H, Zhang K, Lin L, Zuo W. 2018b. Multi-level wavelet-CNN for image restoration, In
IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 773–782
Longere P, Zhang X, Delahunt PB, Brainard DH. 2002. Perceptual assessment of demosaicing
algorithm performance. Proceedings of the IEEE 90:123–132
Ma K, Yeganeh H, Zeng K, Wang Z. 2015. High dynamic range image compression by optimizing
tone mapped image quality index. IEEE Transactions on Image Processing 24:3086–3097
Mao X, Shen C, Yang YB. 2016. Image restoration using very deep convolutional encoder-decoder
networks with symmetric skip connections, In Advances in Neural Information Processing Systems,
pp. 2802–2810
Meinhardt T, Möller M, Hazirbas C, Cremers D. 2017. Learning proximal operators: Using denoising
networks for regularizing inverse imaging problems, In International Conference on Computer
Vision
Mertens T, Kautz J, Reeth FV. 2007. Exposure fusion, In Pacific Conference on Computer Graphics
and Applications, pp. 382–390, USA
Milanfar P. 2013. A tour of modern image filtering: New insights and methods, both practical and
theoretical. IEEE Signal Processing Magazine 30:106–128
Mildenhall B, Barron JT, Chen J, Sharlet D, Ng R, Carroll R. 2018. Burst denoising with kernel
prediction networks, In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2502–2510
Osher S, Rudin LI. 1990. Feature-oriented image enhancement using shock filters. SIAM Journal on
Numerical Analysis 27:919–940
Paris S, Hasinoff SW, Kautz J. 2011. Local Laplacian filters: edge-aware image processing with a
Laplacian pyramid. ACM Transactions on Graphics 30:68
Plötz T, Roth S. 2017. Benchmarking denoising algorithms with real photographs, In IEEE Confer-
ence on Computer Vision and Pattern Recognition
Poynton C. 2012. Digital Video and HD: Algorithms and Interfaces. Morgan Kaufmann, 2nd ed.
Ramanath R, Drew MS. 2014. Von Kries hypothesis. Springer
Ratliff F. 1965. Mach bands: quantitative studies on neural networks, vol. 2. Holden-Day, San
Francisco London Amsterdam
Reinhard E, Stark M, Shirley P, Ferwerda J. 2002. Photographic tone reproduction for digital images,
In SIGGRAPH ’02, pp. 267–276
Remez T, Litany O, Giryes R, Bronstein AM. 2018. Class-aware fully convolutional Gaussian and
Poisson denoising. IEEE Transactions on Image Processing 27:5707–5722
Riviere CN, Rader RS, Thakor NV. 1998. Adaptive cancelling of physiological tremor for improved
precision in microsurgery. IEEE Trans. Biomedical Engineering 45:839–846
Romano Y, Isidoro J, Milanfar P. 2017. RAISR: rapid and accurate image super resolution. IEEE
Transactions on Computational Imaging 3:110–125
Rucci M, Iovin R, Poletti M, Santini F. 2007. Miniature eye movements enhance fine spatial detail.
Nature 447:852–855
Sampat N, Venkataraman S, Yeh T, Kremens RL. 1999. System implications of implementing auto-
exposure on consumer digital cameras, In Sensors, Cameras, and Applications for Digital Pho-
tography, eds. N Sampat, T Yeh, vol. 3650, pp. 100 – 107, International Society for Optics and
Photonics, SPIE
Schwartz E, Giryes R, Bronstein AM. 2018. DeepISP: Toward learning an end-to-end image pro-
cessing pipeline. IEEE Transactions on Image Processing 28:912–923
Smith SM, Brady JM. 1997. SUSAN-A new approach to low level image processing. International
Journal of Computer Vision 23:45–78
Spira A, Kimmel R, Sochen N. 2007. A short time Beltrami kernel for smoothing images and mani-
folds. IEEE Trans. Image Processing 16:1628–1636
Stevens SS. 1961. To honor fechner and repeal his law: A power function, not a log function, describes
the operating characteristic of a sensory system. Science 133
Sun D, Yang X, Liu M, Kautz J. 2018. PWC-Net: CNNs for optical flow using pyramid, warping, and
cost volume, In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943
Süsstrunk S, Buckley R, Swen S. 1999. Standard RGB color spaces, In Color and Imaging Conference
Tai Y, Yang J, Liu X, Xu C. 2017. MemNet: A persistent memory network for image restoration, In
Proceedings of the IEEE International Conference on Computer Vision, pp. 4539–4547
Takeda H, Farsiu S, Milanfar P. 2006. Robust kernel regression for restoration and reconstruction of
images from sparse, noisy data. International Conference on Image Processing :1257–1260
Takeda H, Farsiu S, Milanfar P. 2007. Kernel regression for image processing and reconstruction.
IEEE Transactions on Image Processing 16:349–366
Talebi H, Milanfar P. 2018. NIMA: Neural image assessment. IEEE Transactions on Image Processing
27:3998–4011
Tan H, Zeng X, Lai S, Liu Y, Zhang M. 2017. Joint demosaicing and denoising of noisy Bayer images
with ADMM, In International Conference on Image Processing, pp. 2951–2955
Tomasi C, Manduchi R. 1998a. Bilateral filtering for gray and color images, In International Con-
ference on Computer Vision, pp. 839–846, IEEE
Tomasi C, Manduchi R. 1998b. Bilateral filtering for gray and color images. International Conference
on Computer Vision :836–846
Van De Weijer J, Gevers T, Gijsenij A. 2007. Edge-based color constancy. IEEE Transactions on
Image Processing 16:2207–2214
Wadhwa N, Garg R, Jacobs DE, Feldman BE, Kanazawa N, et al. 2018. Synthetic depth-of-field
with a single-camera mobile phone. ACM Transactions on Graphics (TOG) 37:1–13
Wang Z, Chen J, Hoi SC. 2020. Deep learning for image super-resolution: A survey. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence
Wang Z, Liu D, Yang J, Han W, Huang T. 2015. Deep networks for image super-resolution with
sparse prior, In IEEE Conference on Computer Vision and Pattern Recognition, pp. 370–378
Weickert J. 1999. Coherence-enhancing diffusion. International Journal of Computer Vision 31:111–
127
Westheimer G. 1975. Visual acuity and hyperacuity. Investigative Ophthalmology & Visual Science
14:570–572
Wronski B, Garcia-Dorado I, Ernst M, Kelly D, Krainin M, et al. 2019. Handheld multi-frame
super-resolution. ACM Transactions on Graphics 38
Wronski B, Milanfar P. 2018. See better and further with Super Res Zoom on the Pixel 3.
https://ai.googleblog.com/2018/10/see-better-and-further-with-super-res.html
You YL, Xu W, Tannenbaum A, Kaveh M. 1996. Behavioral analysis of anisotropic diffusion in
image processing. IEEE Transactions on Image Processing 5:1539–1553
Zhang K, Zuo W, Chen Y, Meng D, Zhang L. 2017. Beyond a Gaussian denoiser: Residual learning
of deep CNN for image denoising. IEEE Transactions on Image Processing 26:3142–3155
Zhang K, Zuo W, Zhang L. 2018a. FFDNet: Toward a fast and flexible solution for CNN-based
image denoising. IEEE Transactions on Image Processing 27:4608–4622
Zhang Y, Tian Y, Kong Y, Zhong B, Fu Y. 2018b. Residual dense network for image restoration.
arXiv preprint arXiv:1812.10477
Zoran D, Weiss Y. 2011. From learning models of natural image patches to whole image restoration,
In 2011 International Conference on Computer Vision, pp. 479–486, IEEE