Visualizing Conformal Predictions


Conformal Prediction Masks: Visualizing Uncertainty in Medical Imaging

Gilad Kutiel, Regev Cohen, Michael Elad, Daniel Freedman, and Ehud Rivlin

Verily Life Sciences, South San Francisco, USA


[email protected]

Abstract. Estimating uncertainty in image-to-image recovery networks


is an important task, particularly as such networks are being increasingly
deployed in the biological and medical imaging realms. A recent confor-
mal prediction technique derives per-pixel uncertainty intervals, guar-
anteed to contain the true value with a user-specified probability. Yet,
these intervals are hard to comprehend and fail to express uncertainty
at a conceptual level. In this paper, we introduce a new approach for
uncertainty quantification and visualization, based on masking. The pro-
posed technique produces interpretable image masks with rigorous statis-
tical guarantees for image regression problems. Given an image recovery
model, our approach computes a mask such that a desired divergence
between the masked reconstructed image and the masked true image is
guaranteed to be less than a specified risk level, with high probability.
The mask thus identifies reliable regions of the predicted image while
highlighting areas of high uncertainty. Our approach is agnostic to the
underlying recovery model and the true unknown data distribution. We
evaluate the proposed approach on image colorization, image completion,
and super-resolution tasks, attaining high quality performance on each.

1 Introduction
Deep Learning has been successful in many applications, spanning computer
vision, speech recognition, natural language processing, and beyond [12,13]. For
many years, researchers were mainly content in developing new techniques that
achieve unprecedented accuracy, without concerns for understanding the uncer-
tainty implicit in such models. Recently, however, there has been a concerted
effort within the research community to quantify the uncertainty of deep mod-
els.
This paper addresses the problem of quantifying and visualizing uncertainty
in the realm of image-to-image tasks. Such problems include super-resolution,
deblurring, colorization, and image completion, amongst others. Assessing uncer-
tainty is important generally, but is particularly so in application domains such

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-39539-0_14.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
H. Chen and L. Luo (Eds.): TML4H 2023, LNCS 13932, pp. 163–176, 2023.
https://doi.org/10.1007/978-3-031-39539-0_14
164 G. Kutiel et al.

as biological and medical imaging, in which fidelity to the ground truth is


paramount. If there is an area of the reconstructed image where such fidelity
is unlikely or unreliable due to high uncertainty, this is crucial to convey.
Our approach to uncertainty estimation is based on masking. Specifically,
we are interested in computing a mask such that the uncertain regions in the
image are masked out. Based on conformal prediction [3], we derive an algorithm
that can be applied to any existing image-recovery model to produce an uncertainty
mask satisfying the following criterion: the divergence between the masked
reconstructed image and the masked true image is guaranteed to be less than
a specified level, with high probability. The resultant mask highlights areas in
the recovered image of high uncertainty while trustworthy regions remain intact.
Our distribution-free method, illustrated in Fig. 1, is agnostic to the prediction
model and to the choice of divergence function, which should be dictated by the
application. Our contributions are as follows:
1. We introduce the notion of conformal prediction masks: a distribution-
free approach to uncertainty quantification in image-to-image regression.
We derive masks which visually convey regions of uncertainty while rigor-
ously providing strong statistical guarantees for any regression model, image
dataset and desired divergence measure.
2. We develop a practical training algorithm for computing these masks which
only requires triplets of input (degraded), reconstructed and true images.
The resultant mask model is trained once for all possible risk levels and is
calibrated via a simple process to meet the required guarantees given a user-
specified risk level and confidence probability.
3. We demonstrate the power of the method on image colorization, image com-
pletion and super-resolution tasks. By assessing our performance both visu-
ally and quantitatively, we show the resultant masks attain the probabilistic
guarantee and provide interpretable uncertainty visualization without over-
masking the recovered images, in contrast to competing techniques.

2 Related Work

Bayesian Uncertainty Quantification. The Bayesian paradigm defines


uncertainty by assuming a distribution over the model parameters and/or acti-
vation functions. The most prevalent approach is Bayesian neural networks
[21,28,38], which are stochastic models trained using Bayesian inference. Yet,
as the number of model parameters has grown rapidly, computing the exact posteriors
has become computationally intractable. This shortcoming has led to the
development of approximation methods such as Monte Carlo dropout [15,16],
stochastic gradient Markov chain Monte Carlo [11,33], Laplace approximations
[31] and variational inference [10,27,30]. Alternative Bayesian techniques
include deep Gaussian processes [14], deep ensembles [8,19], and deep Bayesian
active learning [17], to name just a few. A comprehensive review on Bayesian
uncertainty quantification is given in [1].

Fig. 1. High-level overview. Given image measurements X (e.g. gray-scale image)


of a ground-truth image Y , and a predicted image fˆ(X) (e.g. colorized image), the
mask model outputs an uncertainty mask M(X) such that the divergence between the
masked ground-truth and the masked prediction is below a chosen risk level with high
probability.

Distribution-Free Methods and Conformal Prediction. Unlike Bayesian


methods, the frequentist approach assumes the true model parameters are fixed
with no underlying distribution. Examples of such distribution-free techniques
are model ensembles [24,29], bootstrap [2,22], interval regression [23,29,39] and
quantile regression [18,32]. An important distribution-free technique which is
most relevant to our work is conformal prediction [4,36]. This approach relies on
a labeled calibration dataset to convert point estimations into prediction regions.
Conformal methods can be used with any estimator, require no retraining, are
computationally efficient and provide coverage guarantees in finite samples [25].
Recent developments include conformalized quantile regression [7,32,35], conformal
risk control [5,6,9] and semantic uncertainty intervals for generative adversarial
networks [34]. An extensive survey of distribution-free conformal prediction
methods is provided in [37].

3 Background: Conformal Prediction in Image Regression


We present a brief overview of the work in [7], which stands out in the realm
of conformal prediction for image-to-image problems, and serves as the basis of
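our work. Let $Y \in \mathcal{Y} = \mathbb{R}^N$ be a ground-truth image in vector form, $X \in \mathcal{X} = \mathbb{R}^M$ be its measurements, and $\hat{f}(X) \in \mathcal{Y}$ an estimator of $Y$. Conformal prediction constructs uncertainty intervals

$$\mathcal{T}(X)_{[i]} = \left[ \hat{f}(X)_{[i]} - \hat{l}(X)_{[i]},\; \hat{f}(X)_{[i]} + \hat{u}(X)_{[i]} \right], \quad i = 0, \ldots, N-1, \tag{1}$$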

where $\hat{l}(X)_{[i]} \ge 0$ and $\hat{u}(X)_{[i]} \ge 0$ represent the uncertainty in the lower and upper directions respectively. Given heuristic uncertainty values $\tilde{l}$ and $\tilde{u}$, the uncertainty intervals are calibrated using a calibration dataset $\mathcal{C} \triangleq \{X_k, Y_k\}_{k=1}^{K}$ to guarantee they contain at least a fraction $\alpha$ of the ground-truth pixel values with probability $1 - \delta$. Here $\alpha \in (0, 1)$ and $\delta \in (0, 1)$ are user-specified risk and error levels respectively. Formally, the per-pixel uncertainty intervals are defined as follows.
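Definition 1. Risk-Controlling Prediction Set (RCPS). A random set-valued function $\mathcal{T} : \mathcal{X} \to \mathcal{Y}' = 2^{\mathcal{Y}}$ is an $(\alpha, \delta)$-Risk-Controlling Prediction Set if

$$\mathbb{P}\big( R(\mathcal{T}) \le 1 - \alpha \big) \ge 1 - \delta.$$

Here the risk is $R(\mathcal{T}) \triangleq 1 - \mathbb{E}\left[ \frac{1}{N} \left| \left\{ i : Y^{\text{test}}_{[i]} \in \mathcal{T}(X^{\text{test}})_{[i]} \right\} \right| \right]$, where the expectation is over a new test point $(X^{\text{test}}, Y^{\text{test}})$, while the outer probability is over the calibration data.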
The procedure for constructing RCPS consists of two stages. First, a machine
learning system (e.g. neural network) is trained to output a point prediction fˆ,
and heuristic lower and upper interval widths (˜l, ũ). The second phase utilizes
the calibration set to calibrate (˜l, ũ) so they contain the right fraction of ground
truth pixels. The final intervals are those in (1) with the calibrated widths (ˆl, û).
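
As a rough illustration of this two-stage procedure, the NumPy sketch below builds the intervals in (1) from given widths and measures the fraction of ground-truth pixels they miss. The function names, the synthetic data, and the use of a single scalar that inflates both heuristic widths are assumptions made for illustration only; they are not taken from [7].

```python
import numpy as np

def build_intervals(pred, lower_width, upper_width):
    """Return per-pixel interval bounds as in Eq. (1)."""
    return pred - lower_width, pred + upper_width

def empirical_risk(truth, lo, hi):
    """Fraction of ground-truth pixels that fall outside their interval."""
    covered = (truth >= lo) & (truth <= hi)
    return 1.0 - covered.mean()

rng = np.random.default_rng(0)
truth = rng.random((8, 8))                          # synthetic ground truth
pred = truth + 0.05 * rng.standard_normal((8, 8))   # synthetic prediction
heuristic = np.full_like(pred, 0.01)                # hypothetical heuristic widths

# Assumed calibration knob: one scalar that scales both widths.
risks = {}
for scale in (1.0, 5.0, 50.0):
    lo, hi = build_intervals(pred, scale * heuristic, scale * heuristic)
    risks[scale] = empirical_risk(truth, lo, hi)
    print(scale, risks[scale])
```

Because the intervals are nested as the scale grows, the empirical risk can only decrease, which is what makes a one-dimensional calibration search possible.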
Conformal prediction provides per-pixel uncertainty intervals with statistical
guarantees in image-to-image regression problems. Yet, the per-pixel prediction
sets may be difficult to comprehend on their own. To remedy this, the uncer-
tainty intervals are visualized by passing the pixel-wise interval lengths through
a colormap, where small sets render a pixel blue and large sets render it red.
Thus, the redder a region is, the greater the uncertainty, and the bluer it is, the
greater the confidence. The resultant uncertainty map, however, is not directly
endowed with rigorous guarantees. This raises the following question: can we
directly produce an uncertainty map with strong statistical guarantees?

4 Conformal Prediction Masks


 
Inspired by the above, we construct uncertainty masks M(X) = M X, fˆ(X) ∈
[0, 1]N such that
  
 ˆ test test 
E M(X )[i] · f (X )[i] − Y[i]  ≤ β[i] ,
test
(2)

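where the expectation is over a new test point, and $\beta_{[i]} \in \mathbb{R}^+$ is a user-specified risk level. Define $\hat{f}_{\mathcal{M}}(X) \triangleq \mathcal{M}(X) \odot \hat{f}(X)$ and $Y_{\mathcal{M}} \triangleq \mathcal{M}(X) \odot Y$, where $\odot$ represents a point-wise (Hadamard) product. Then, note that constructing masks that satisfy (2) is equivalent to creating the following uncertainty intervals

$$\mathcal{T}_{\mathcal{M}}(X)_{[i]} = \left[ \hat{f}_{\mathcal{M}}(X)_{[i]} - \beta_{[i]},\; \hat{f}_{\mathcal{M}}(X)_{[i]} + \beta_{[i]} \right], \tag{3}$$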

which satisfy

$$Y^{\text{test}}_{\mathcal{M}[i]} \in \mathcal{T}_{\mathcal{M}}(X^{\text{test}})_{[i]}. \tag{4}$$

We note a few differences between (3) and (1): in (1) the lower and upper per-pixel uncertainty widths $(\hat{l}, \hat{u})$ depend on $X$ and are calibrated, while in (3) $\hat{l} = \hat{u} \equiv \beta$ are user-specified and independent of $X$. Furthermore, the uncertainty parameters which undergo calibration are $\{\mathcal{M}(X)_{[i]}\}_{i=1}^{N}$.
One may notice that the above formulation exhibits a major limitation, as each value of the prediction mask is defined independently of the other values. Hence, it requires the user to specify a risk level for each pixel, which is cumbersome, especially in high dimensions. More importantly, setting each entry of the mask independently may fail to capture the dependency between pixels, and thus fail to express uncertainty at a conceptual level. To overcome this, we redefine our uncertainty masks to ensure that, with probability at least $1 - \delta$, it holds that $\mathbb{E}\left[ \big\| \hat{f}_{\mathcal{M}}(X^{\text{test}}) - Y^{\text{test}}_{\mathcal{M}} \big\|_1 \right] \le \alpha$, where $\alpha \in \mathbb{R}^+$ is a global risk level and $\|Z\|_1 \triangleq \sum_{i=1}^{N} |Z_{[i]}|$ is the L1 norm of an arbitrary image $Z$. Furthermore, the latter formulation can be generalized to any divergence measure $d : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ such that

$$\mathbb{E}\left[ d\big( \hat{f}_{\mathcal{M}}(X^{\text{test}}),\, Y^{\text{test}}_{\mathcal{M}} \big) \right] \le \alpha. \tag{5}$$

Note we avoid trivial solutions, e.g. a zero-mask, which satisfy (5) yet provide
no useful information. Thus, we seek solutions that employ the least masking
required to meet (5), with high probability.
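
To make the trade-off concrete, the following toy sketch evaluates the masked L1 divergence for three candidate masks on a four-pixel image. The pixel values and the helper name `masked_l1` are invented for illustration; the point is only that the zero mask trivially drives the divergence to zero, while a targeted mask achieves a small divergence with far less masking.

```python
import numpy as np

def masked_l1(mask, pred, truth):
    """L1 norm of the difference between masked prediction and masked ground truth."""
    return np.abs(mask * pred - mask * truth).sum()

pred = np.array([0.2, 0.5, 0.9, 0.4])
truth = np.array([0.2, 0.4, 0.1, 0.4])      # the third pixel is badly reconstructed

no_mask = np.ones(4)
zero_mask = np.zeros(4)                     # trivial solution: hides everything
targeted = np.array([1.0, 1.0, 0.0, 1.0])   # hides only the unreliable pixel

for name, m in [("none", no_mask), ("zero", zero_mask), ("targeted", targeted)]:
    print(name, "divergence:", masked_l1(m, pred, truth), "masked-out mass:", (1 - m).sum())
```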
The above formulation enjoys several benefits. First, the current definition of
the mask captures pixel-dependency. Thus, rather than focusing on individual
pixels, the resultant map would mask out (or reduce) regions of high uncertainty
within the predicted image to guarantee the divergence remains below the given
risk level. Second, it accepts any divergence measure, each leading to a different
mask. For example, selecting d(·, ·) to be a distortion measure may underline
uncertainty regions of high-frequency objects (e.g. edges), while setting d(·, ·) to
be a perceptual loss may highlight semantic factors within the image. Formally,
we refer to these uncertainty masks as Risk-Controlling Prediction Masks, which
are defined below.

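Definition 2. Risk-Controlling Prediction Mask (RCPM). A random function $\mathcal{M} : \mathcal{X} \times \mathcal{Y} \to [0, 1]^N$ is an $(\alpha, \delta)$-Risk-Controlling Prediction Mask if

$$\mathbb{P}\Big( \mathbb{E}\big[ R(\mathcal{M}) \big] \le \alpha \Big) \ge 1 - \delta,$$

where the risk is defined as $R(\mathcal{M}) \triangleq d\big( \hat{f}_{\mathcal{M}}(X^{\text{test}}),\, Y^{\text{test}}_{\mathcal{M}} \big)$ for a given divergence $d(\cdot, \cdot)$. The outer probability is over the calibration data, while the expectation is taken over a test point $(X^{\text{test}}, Y^{\text{test}})$.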

As for RCPS, the procedure for creating RCPM includes two main stages.
First, given a predictor fˆ, we require a heuristic notion of a non-zero uncertainty

mask $\tilde{\mathcal{M}}$. In particular, we train a neural network to output a mask given the measurements and the predicted image as inputs. Second, given a divergence measure, we use the calibration set to calibrate the heuristic mask until the divergence measure decreases below the desired risk level. The final outputs are the calibrated mask and the original prediction multiplied by the mask. The overall method is outlined in Algorithm 1. We now discuss the notion of initial uncertainty masks and the subsequent calibration process.

Algorithm 1. Generating RCPM

1. Given a regression model $\hat{f}$, train a model which outputs an initial mask $\tilde{\mathcal{M}}$.
2. Calibrate $\tilde{\mathcal{M}}$ using the calibration dataset to obtain $\mathcal{M}$ (e.g. using Algorithm 2).
3. Given $X$ at inference, output the risk-controlling masked prediction $\hat{f}_{\mathcal{M}}(X) = \mathcal{M}(X) \odot \hat{f}(X)$.

4.1 Initial Estimation of Uncertainty Masks


Here we present two notions of uncertainty masks. The first concept, based on
[7], translates given uncertainty intervals into a heuristic mask. In the second we
develop a process for training a neural network which accepts the input and the
predicted images and outputs an uncertainty mask based on a given divergence
between the prediction and the ground-truth image.

Intervals to Masks. In [7], the authors propose to build uncertainty intervals based on four heuristic notions of lower and upper interval widths $\tilde{l}$ and $\tilde{u}$: (1) regression to the magnitude of the residual; (2) one Gaussian per pixel; (3) softmax outputs; and (4) pixel-wise quantile regression. Given such intervals, we build a mask by setting the pixel values to be inversely proportional to the interval sizes:

$$\tilde{\mathcal{M}}(X)_{[i]} \propto \big( \tilde{u}_{[i]} - \tilde{l}_{[i]} \big)^{-1}. \tag{6}$$

Thus, the resultant mask holds high values at pixels with small-size intervals
(high confidence) and smaller values at pixels with larger intervals correspond-
ing to high uncertainty regions. However this approach requires first creating
uncertainty intervals, hence, we next introduce a technique which directly pro-
duces an uncertainty mask.
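
A minimal sketch of the interval-to-mask conversion in (6) is given below. Since (6) fixes the mask only up to proportionality, the normalization by the tightest interval and the small constant `eps` are assumptions, as is the function name; the interval sizes are passed in directly as an array.

```python
import numpy as np

def intervals_to_mask(interval_size, eps=1e-8):
    """Map per-pixel interval sizes to a mask in (0, 1], inversely proportional to size."""
    inv = 1.0 / (interval_size + eps)   # eps guards against zero-width intervals
    return inv / inv.max()              # assumed normalization: tightest interval -> 1

sizes = np.array([[0.1, 0.2],
                  [0.4, 0.8]])
mask = intervals_to_mask(sizes)
print(mask)
```

Pixels with small intervals keep values near one, while pixels with wide intervals are attenuated.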

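Mask Regression. Here, we introduce a notion of an uncertainty mask represented by a neural network $\tilde{\mathcal{M}}(X; \theta) \in [0, 1]^N$ with parameters $\theta$. The mask model is trained to output a mask which satisfies

$$\mathbb{E}\left[ d\big( \hat{f}_{\tilde{\mathcal{M}}}(X^{\text{train}}),\, Y^{\text{train}}_{\tilde{\mathcal{M}}} \big) \right] \le \alpha, \tag{7}$$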


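where the expectation is over the training samples $\mathcal{D} \triangleq \{X_j, Y_j\}_{j=1}^{J}$ used to train $\hat{f}$. To derive our loss function, we start by formulating the following problem for a given triplet $\big(X, Y, \hat{f}(X)\big)$:

$$\min_{\theta} \; \big\| \tilde{\mathcal{M}}\big(X, \hat{f}(X)\big) - \mathbf{1} \big\|_2^2 \quad \text{subject to} \quad d\big( \hat{f}_{\tilde{\mathcal{M}}}(X),\, Y_{\tilde{\mathcal{M}}} \big) \le \alpha, \tag{8}$$

where $\mathbf{1}$ is an image of all ones, representing no masking. The constraint above corresponds to (7), while the objective aims to find the minimal solution, i.e., the solution that masks the image the least (avoiding trivial solutions). The Lagrangian of the problem is given by

$$\mathcal{L}(\theta, \mu) \triangleq \big\| \tilde{\mathcal{M}}\big(X, \hat{f}(X)\big) - \mathbf{1} \big\|_2^2 + \mu \Big( d\big( \hat{f}_{\tilde{\mathcal{M}}}(X),\, Y_{\tilde{\mathcal{M}}} \big) - \alpha \Big), \tag{9}$$

where $\mu > 0$ is the dual variable, treated as a hyperparameter. Given $\mu$, the optimal mask can be obtained by minimizing $\mathcal{L}(\theta, \mu)$ with respect to $\theta$, which is equivalent to minimizing

$$\big\| \tilde{\mathcal{M}}\big(X, \hat{f}(X)\big) - \mathbf{1} \big\|_2^2 + \mu \cdot d\big( \hat{f}_{\tilde{\mathcal{M}}}(X),\, Y_{\tilde{\mathcal{M}}} \big) \tag{10}$$

since $\alpha$ does not depend on $\theta$. Thus, we train our mask model using the following loss function:

$$\mathcal{L}(\mathcal{D}, \theta) \triangleq \sum_{(X, Y) \in \mathcal{D}} \Big[ \big\| \tilde{\mathcal{M}}\big(X, \hat{f}(X)\big) - \mathbf{1} \big\|_2^2 + \mu \cdot d\big( \hat{f}_{\tilde{\mathcal{M}}}(X),\, Y_{\tilde{\mathcal{M}}} \big) \Big]. \tag{11}$$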

The proposed approach facilitates the use of any differentiable distortion


measure and is agnostic to the prediction model $\hat{f}$. Furthermore, notice that the
loss function is independent of $\alpha$; hence, the mask model can be trained once for
all values of $\alpha$. As a consequence, the output mask acts only as an initial uncertainty
map, which may not satisfy (5) and needs to be calibrated. Following proper calibration, discussed
next, our mask model attains (5) without requiring the ground-truth $Y$. Lastly,
this approach directly outputs uncertainty masks and thus it is the focus of our
work.
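
The loss in (11) can be sketched in a few lines of NumPy. This is an evaluation-only illustration under stated assumptions: the mean absolute error stands in for the divergence d, the function names are hypothetical, and in practice the mask would come from a network trained with an automatic-differentiation framework rather than being passed in as an array.

```python
import numpy as np

def l1_divergence(a, b):
    """Assumed divergence for this sketch: mean absolute error."""
    return np.abs(a - b).mean()

def mask_loss(mask, pred, truth, mu=2.0, divergence=l1_divergence):
    """Per-sample loss: ||mask - 1||_2^2 + mu * d(mask * pred, mask * truth)."""
    sparsity = np.sum((mask - 1.0) ** 2)                 # penalizes any masking
    fidelity = divergence(mask * pred, mask * truth)     # penalizes visible error
    return sparsity + mu * fidelity

def dataset_loss(masks, preds, truths, mu=2.0):
    """Sum of per-sample losses over the training set, as in Eq. (11)."""
    return sum(mask_loss(m, p, y, mu) for m, p, y in zip(masks, preds, truths))

pred = np.array([0.0, 1.0])
truth = np.array([1.0, 1.0])
print(mask_loss(np.ones(2), pred, truth))    # no masking: only the fidelity term remains
print(mask_loss(np.zeros(2), pred, truth))   # full masking: only the sparsity term remains
```

Note that the risk level does not appear anywhere in the loss, which is why a single trained mask model can later be calibrated for any risk level.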

4.2 Mask Calibration


We consider $\tilde{\mathcal{M}}(X)$ as an initial estimate of our uncertainty mask, which needs to be calibrated to provide the guarantee in Definition 2. As the calibration process is not the focus of our work, we perform a simple calibration outlined in Algorithm 2. The core of the calibration employs a parametric function $C(\cdot; \lambda)$ pixel-wise to obtain a mask $\mathcal{M}_{\lambda}(X)_{[i]} \triangleq C\big( \tilde{\mathcal{M}}(X)_{[i]}; \lambda \big)$. In general, $C(\cdot; \lambda)$ can be any monotonic non-decreasing function. Here we consider the following form¹

$$\mathcal{M}_{\lambda}(X)_{[i]} \triangleq \min\left( \frac{\lambda}{1 - \tilde{\mathcal{M}}(X)_{[i]} + \epsilon},\; 1 \right) \quad \forall i = 1, \ldots, N, \tag{12}$$

which has been found empirically to perform well in our experiments. To set $\lambda > 0$, we use the calibration dataset $\mathcal{C} \triangleq \{X_k, Y_k\}_{k=1}^{K}$ such that for any pair $(X_k, Y_k) \in \mathcal{C}$ we compute

$$\lambda_k \triangleq \max\left\{ \hat{\lambda} : d\big( \hat{f}_{\mathcal{M}_{\hat{\lambda}}}(X_k),\, Y_{k\,\mathcal{M}_{\hat{\lambda}}} \big) \le \alpha \right\}. \tag{13}$$

Finally, $\lambda$ is taken to be the $1 - \delta$ quantile of $\{\lambda_k\}_{k=1}^{K}$, i.e. the maximal value for which at least a $\delta$ fraction of the calibration set satisfies condition (5). Thus, assuming the calibration and test sets are i.i.d. samples from the same distribution, the calibrated mask is guaranteed to satisfy Definition 2.

¹ A small value $\epsilon$ is added to the denominator to ensure numerical stability.

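Algorithm 2. Calibration Process

Input: Calibration data $\mathcal{C} \triangleq \{X_k, Y_k\}_{k=1}^{K}$; risk level $\alpha$; error rate $\delta$; underlying predictor $\hat{f}$; heuristic mask $\tilde{\mathcal{M}}$; a monotonic non-decreasing function $C(\cdot; \lambda) : [0, 1] \to [0, 1]$ parameterized by $\lambda > 0$.

1. For a given $\tilde{\lambda} > 0$, define $\mathcal{M}_{\tilde{\lambda}}(X)_{[i]} \triangleq C\big( \tilde{\mathcal{M}}(X)_{[i]}; \tilde{\lambda} \big)$ for all $i = 1, \ldots, N$.
2. For each pair $(X_k, Y_k) \in \mathcal{C}$, set $\lambda_k \triangleq \max\left\{ \hat{\lambda} : d\big( \hat{f}_{\mathcal{M}_{\hat{\lambda}}}(X_k),\, Y_{k\,\mathcal{M}_{\hat{\lambda}}} \big) \le \alpha \right\}$.
3. Set $\lambda$ to be the $1 - \delta$ quantile of $\{\lambda_k\}_{k=1}^{K}$.
4. Define the final mask model as $\mathcal{M}_{\lambda}(X)_{[i]} \triangleq C\big( \tilde{\mathcal{M}}(X)_{[i]}; \lambda \big)$.

Output: Calibrated uncertainty mask model $\mathcal{M}_{\lambda}$.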
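
The calibration steps can be sketched as follows. Several choices here are assumptions rather than details given in the text: the maximization in (13) is approximated by a search over a fixed grid of candidate values, the divergence is the mean absolute error, and the quantile is parameterized by a hypothetical `coverage` argument (the fraction of calibration samples that should meet the risk level) to keep the sketch independent of the convention used for δ. Finite-sample corrections used elsewhere in the conformal literature are omitted.

```python
import numpy as np

def calibrate_mask(heuristic_mask, lam, eps=1e-6):
    """Eq. (12): pixel-wise min(lam / (1 - mask + eps), 1)."""
    return np.minimum(lam / (1.0 - heuristic_mask + eps), 1.0)

def masked_l1(mask, pred, truth):
    """Mean absolute error between masked prediction and masked ground truth."""
    return np.abs(mask * (pred - truth)).mean()

def largest_feasible_lambda(heuristic_mask, pred, truth, alpha, grid):
    """Eq. (13) restricted to a grid: largest lambda keeping the divergence <= alpha."""
    feasible = [lam for lam in grid
                if masked_l1(calibrate_mask(heuristic_mask, lam), pred, truth) <= alpha]
    # Assumed fallback: use the smallest grid value if nothing is feasible.
    return max(feasible) if feasible else min(grid)

def calibrate(heuristic_masks, preds, truths, alpha, coverage=0.9, grid=None):
    """Pick lambda so that at least `coverage` of the samples have lambda_k >= lambda."""
    grid = np.linspace(1e-3, 1.0, 200) if grid is None else grid
    lams = np.sort([largest_feasible_lambda(m, p, y, alpha, grid)
                    for m, p, y in zip(heuristic_masks, preds, truths)])
    index = int(np.floor((1.0 - coverage) * (len(lams) - 1)))
    return float(lams[index])
```

Since the calibrated mask grows with λ and the masked error grows with the mask, every calibration sample whose λ_k is at least the selected λ meets the risk level with the final mask.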

5 Experiments
5.1 Datasets and Tasks

Datasets. Two data-sets are used in our experiments:


Places365 [40]: A large collection of 256 × 256 images from 365 scene categories.
We use 1,803,460 images for training and 36,500 images for validation/test.
Rat Astrocyte Cells [26]: A dataset of 1,200 uncompressed images of scanned
rat cells of resolution 990×708. We crop the images into 256×256 tiles, and
randomly split them into train and validation/test sets of sizes 373,744 and
11,621 respectively. The tiles are partially overlapped as we use stride of 32
pixels when cropping the images.
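
For reference, the overlapping tiling can be sketched as below. The helper name is hypothetical, and two details are assumed because the text does not state them: that 990 is the width and 708 the height, and that only tiles lying fully inside the image are kept.

```python
def tile_origins(height, width, tile=256, stride=32):
    """Top-left corners of all full tiles that fit inside the image."""
    rows = range(0, height - tile + 1, stride)
    cols = range(0, width - tile + 1, stride)
    return [(r, c) for r in rows for c in cols]

origins = tile_origins(height=708, width=990)
print(len(origins))  # number of full 256x256 tiles per image at stride 32
```

Under these assumptions each image yields 15 row positions and 23 column positions.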
Tasks. We consider the following image-to-image tasks (illustrated in Fig. 4 in
the Appendix):
Image Completion: Using gray-scale version of Places365, we remove middle ver-
tical and horizontal stripes of 32 pixel width, and aim to reconstruct the missing
part.

Super Resolution: We experiment with this task on the two data-sets. The images
are scaled down to 64 × 64 images where the goal is to reconstruct the original
images.
Colorization: We convert the Places365 images to grayscale and aim to recover
their colors.
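
The three degradations can be mimicked with simple array operations, as in the sketch below. The specifics are assumptions: channel averaging for the grayscale conversion, zero-filling for the removed stripes, centering of the stripes, and average pooling for the downscaling are illustrative choices, not details given in the text.

```python
import numpy as np

def to_grayscale(rgb):
    """Average the color channels (the exact RGB weighting is an assumption)."""
    return rgb.mean(axis=-1)

def remove_center_stripes(gray, width=32):
    """Zero out one horizontal and one vertical stripe through the image center."""
    out = gray.copy()
    h, w = out.shape
    r0, c0 = (h - width) // 2, (w - width) // 2
    out[r0:r0 + width, :] = 0.0
    out[:, c0:c0 + width] = 0.0
    return out

def downscale(img, factor=4):
    """Average-pool by `factor` along both spatial axes (256 -> 64 for factor 4)."""
    h, w = img.shape[:2]
    pooled = img.reshape(h // factor, factor, w // factor, factor, *img.shape[2:])
    return pooled.mean(axis=(1, 3))

rgb = np.ones((256, 256, 3))
gray = to_grayscale(rgb)
print(gray.shape, remove_center_stripes(gray).sum(), downscale(rgb).shape)
```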

5.2 Experimental Settings

Image-to-Image Models. We start with training models for the above three
tasks. Note that these models are not intended to be state-of-the-art, but rather
used to demonstrate the uncertainty estimation technique proposed in this work.
We use the same model architecture for all tasks: an 8 layer U-Net. For each task
we train two versions of the network: (i) A simple regressor; and (ii) A conditional
GAN, where the generator plays the role of the reconstruction model. For the
GAN, the discriminator is implemented as a 4 layer CNN. We use the L1 loss as
the objective for the regressor, and add an adversarial loss for the conditional
GAN, as in [20]. All models are trained for 10 epochs using Adam optimizer
with a learning rate of 1e−5 and a batch size of 50.
Mask Model. For our mask model we use an 8 layer U-Net architecture for simplicity
and compatibility with previous works. The inputs to the mask model are
the measurement image and the predicted image, concatenated along the channel
axis. The output is a mask having the same shape as the predicted image with
values within the range [0, 1]. The mask model is trained using the loss function
(11) with μ = 2, a learning rate of 1e-5 and a batch size of 25.
Experiments. We consider the L1, L2, SSIM and LPIPS as our divergence measures.
We set aside 1,000 samples from each validation set for calibration and
use the remaining samples for evaluation. We demonstrate the flexibility of our
approach by conducting experiments on a variety of 12 settings: (i) Image Com-
pletion: {Regressor, GAN} × {L1, LPIPS}; (ii) Super Resolution: {Regressor,
GAN} × {L1, SSIM}; and (iii) Colorization: {Regressor, GAN} × {L1, L2}.
Risk and Error Levels. Recall that given a predicted image, our goal is to find
a mask that, when applied to both the prediction and the (unknown) reference
image, reduces the distortion between them to a predefined risk level α with high
probability δ. Here we fix δ = 0.9 and set α to be the 0.1-quantile of each measure
computed on a random sample from the validation set, i.e. roughly 10% of the
predictions are already considered sufficiently good and do not require masking
at all.
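
The choice of the risk level can be illustrated with a short sketch. The divergence values below are synthetic stand-ins drawn from an arbitrary distribution, and the helper name is hypothetical; only the use of the 0.1-quantile follows the text.

```python
import numpy as np

def risk_level_from_sample(divergences, q=0.1):
    """Set alpha to the q-quantile of the unmasked divergence values."""
    return float(np.quantile(divergences, q))

rng = np.random.default_rng(0)
unmasked = rng.gamma(shape=2.0, scale=0.05, size=1000)   # synthetic divergence values
alpha = risk_level_from_sample(unmasked)
print(alpha, np.mean(unmasked <= alpha))   # about 10% of samples already meet alpha
```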

5.3 Competing Techniques for Comparison

Quantile – Interval-Based Technique. We compare our method to the


quantile regression option presented in [7], denoted by Quantile. While their cal-
ibrated uncertainty intervals are markedly different from the expected distortion

we consider, we can use these intervals and transform them into a mask using
(6). For completeness, we also report the performance of the quantile regression
even when it is less suitable, i.e. when the underlying model is a GAN and when
the divergence function is different from L1. We note again that for the sake of
a fair comparison, our implementation of the mask model uses exactly the same
architecture as the quantile regressor.
Opt – Oracle. We also compare our method with an oracle, denoted Opt,
which given a ground-truth image computes an optimal mask by minimizing
(10). We perform gradient descent using Adam optimizer with a learning rate
of 0.01, iterating until the divergence term decreases below the risk level α.
This procedure is applied to each test image individually, so no calibration is
needed.
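
A simplified version of such an oracle is sketched below. It departs from the described setup in ways that should be read as assumptions: plain projected gradient descent replaces Adam, the divergence is the mean absolute error, the values of `mu` and `lr` are arbitrary, and a step cap is added because the loop is not guaranteed to reach the risk level for every choice of `mu`.

```python
import numpy as np

def oracle_mask(pred, truth, alpha, mu=8.0, lr=0.1, max_steps=1000):
    """Projected gradient descent on ||M - 1||^2 + mu * mean(M * |pred - truth|)."""
    err = np.abs(pred - truth)
    mask = np.ones_like(err)                 # start from no masking
    for _ in range(max_steps):
        if (mask * err).mean() <= alpha:     # stop once the divergence meets alpha
            break
        grad = 2.0 * (mask - 1.0) + mu * err / err.size
        mask = np.clip(mask - lr * grad, 0.0, 1.0)   # keep the mask in [0, 1]
    return mask

pred = np.array([0.2, 0.5, 0.9, 0.4])
truth = np.array([0.2, 0.4, 0.1, 0.4])
mask = oracle_mask(pred, truth, alpha=0.1)
print(mask, (mask * np.abs(pred - truth)).mean())
```

Pixels with zero error receive zero gradient and stay unmasked, while the pixel with the largest error is attenuated the most.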
Comparison Metrics. Given a mask $\mathcal{M}(X)$ we assess its performance using the following metrics: (i) Average mask size $s\big(\mathcal{M}(X)\big) \triangleq \frac{1}{N} \| \mathcal{M}(X) - \mathbf{1} \|_1$; (ii) Pearson correlation $\mathrm{Corr}(\mathcal{M}, d)$ between the mask size and the full (unmasked) divergence value; and (iii) Pearson correlation $\mathrm{Corr}(\mathcal{M}, \mathcal{M}_{\text{opt}})$ between the mask size and the optimal mask $\mathcal{M}_{\text{opt}}$ obtained by Opt.
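
These metrics are straightforward to compute; a minimal sketch follows. The data are synthetic and constructed so that more masking is applied where the error is larger, and the helper names are hypothetical.

```python
import numpy as np

def mask_size(mask):
    """Average amount of masking: (1/N) * ||mask - 1||_1."""
    return np.abs(mask - 1.0).mean()

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
errors = rng.random((50, 64))        # synthetic per-pixel errors for 50 images
masks = 1.0 - 0.5 * errors           # synthetic masks: more masking where error is larger
sizes = np.array([mask_size(m) for m in masks])
full_divergence = errors.mean(axis=1)
print(pearson(sizes, full_divergence))
```

The same `pearson` helper can be applied to the sizes of two sets of masks to obtain the correlation with the oracle.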

5.4 Results and Discussion


We now show a series of results that demonstrate our proposed uncertainty
masking approach, and its comparison with Opt and Quantile.² We begin with

Fig. 2. Examples of conformal prediction masks. The images from left to right
are the measurement, ground-truth, model prediction, our calibrated mask trained with
L1 loss and the ground-truth L1 error. Tasks are image completion (top), colorization
(middle) and super resolution (bottom).
² Due to space limitations, we show more extensive experimental results in the
Appendix, while presenting a selected portion of them here.

Table 1. Quantitative results. Arrows point in the better direction; best results are in blue.

                        s(M) (↓)                Corr(M, d) (↑)     Corr(M, Mopt) (↑)
Network     Distance    Opt    Ours   Quantile  Ours   Quantile    Ours   Quantile

Image Completion - Places365
Regression  L1          0.09   0.10   0.15      0.89   0.78        0.89   0.76
Regression  LPIPS       0.01   0.01   0.20      0.54   0.51        0.89   0.77
GAN         L1          0.09   0.09   0.14      0.95   0.85        0.94   0.80
GAN         LPIPS       0.01   0.01   0.08      0.31   0.24        0.50   0.23

Super Resolution - Rat Astrocyte Cells
Regression  L1          0.24   0.26   0.28      0.99   0.54        0.95   0.88
Regression  SSIM        0.03   0.03   0.13      0.66   0.64        0.82   0.57
GAN         L1          0.26   0.30   0.40      0.94   0.63        0.80   0.72
GAN         SSIM        0.03   0.03   0.13      0.79   0.63        0.83   0.63

Super Resolution - Places365
Regression  L1          0.30   0.36   0.39      0.99   0.97        0.95   0.94
Regression  SSIM        0.10   0.23   0.48      0.89   0.85        0.94   0.84
GAN         L1          0.37   0.38   0.47      0.97   0.81        0.95   0.67
GAN         SSIM        0.10   0.12   0.51      0.86   0.81        0.92   0.86

Colorization - Places365
Regression  L1          0.27   0.37   0.40      0.68   0.43        0.57   0.46
Regression  L2          0.18   0.37   0.38      0.57   0.30        0.60   0.48
GAN         L1          0.27   0.38   0.40      0.58   0.40        0.60   0.52
GAN         L2          0.18   0.36   0.38      0.42   0.28        0.59   0.49

a representative visual illustration of our proposed mask for several test cases in
Fig. 2. As can be seen, the produced masks indeed identify sub-regions of high
uncertainty. In the image completion task the bottom left corner is richer in
details and thus there is high uncertainty regarding this part in the reconstructed
image. In the colorization task, the mask highlights the colored area of the bus
which is the most unreliable region since it can be colorized with a large variety
of colors. In the super resolution task the mask marks regions of edges and text
while trustworthy parts such as smooth surfaces remain unmasked.
We present quantitative results in Table 1, showing that our method exhibits
smaller mask sizes, aligned well with the masks obtained by Opt. In contrast,
Quantile overestimates and produces larger masks as expected. In terms of the
correlation Corr(M, d), our method shows high agreement, while Quantile lags
behind. This correlation indicates a much desired adaptivity of the estimated
mask to the complexity of image content and thus to the corresponding uncertainty.
We provide a complementary illustration of the results in Fig. 3 in the
Appendix. As seen from the top row, all three methods meet the probabilis-
tic guarantees regarding the divergence/loss with fewer than 10% exceptions,

as required. Naturally, Opt does not have outliers since each mask is optimally
calibrated by its computation. The spread of loss values tends to be higher
with Quantile, indicating weaker performance. The middle and bottom rows are
consistent with results in Table 1, showing that our approach tends to produce
masks that are close in size to those of Opt; while Quantile produces larger,
and thus inferior, masked areas. We note that the colorization task seems to be
more challenging, resulting in a marginal performance increase for our method
compared to Quantile.

6 Conclusions
Uncertainty assessment in image-to-image regression problems is a challenging
task, due to the implied complexity, the high dimensions involved, and the need
to offer an effective and meaningful visualization of the estimated results. This
work proposes a novel approach towards these challenges by constructing a
conformal mask that visually differentiates between trustworthy and uncertain
regions in an estimated image. This mask provides a measure of uncertainty
accompanied by a statistical guarantee, stating that with high probability,
the divergence between the original and the recovered images over the non-
masked regions is below a desired risk level. The presented paradigm is flexible,
being agnostic to the choice of divergence measure, and the regression method
employed.

References
1. Abdar, M., et al.: A review of uncertainty quantification in deep learning: tech-
niques, applications and challenges. Inf. Fusion 76, 243–297 (2021)
2. Alaa, A., Van Der Schaar, M.: Frequentist uncertainty in recurrent neural net-
works via blockwise influence functions. In: International Conference on Machine
Learning, pp. 175–190. PMLR (2020)
3. Angelopoulos, A.N., Bates, S.: A gentle introduction to conformal prediction and
distribution-free uncertainty quantification. CoRR abs/2107.07511 (2021). https://
arxiv.org/abs/2107.07511
4. Angelopoulos, A.N., Bates, S.: A gentle introduction to conformal prediction
and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511
(2021)
5. Angelopoulos, A.N., Bates, S., Candès, E.J., Jordan, M.I., Lei, L.: Learn then
test: calibrating predictive algorithms to achieve risk control. arXiv preprint
arXiv:2110.01052 (2021)
6. Angelopoulos, A.N., Bates, S., Fisch, A., Lei, L., Schuster, T.: Conformal risk
control. arXiv preprint arXiv:2208.02814 (2022)
7. Angelopoulos, A.N., et al.: Image-to-image regression with distribution-free uncer-
tainty quantification and applications in imaging. arXiv preprint arXiv:2202.05265
(2022)
8. Ashukha, A., Lyzhov, A., Molchanov, D., Vetrov, D.: Pitfalls of in-domain uncer-
tainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470
(2020)

9. Bates, S., Angelopoulos, A., Lei, L., Malik, J., Jordan, M.: Distribution-free, risk-
controlling prediction sets. J. ACM (JACM) 68(6), 1–34 (2021)
10. Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in
neural network. In: International Conference on Machine Learning, pp. 1613–1622.
PMLR (2015)
11. Chen, T., Fox, E., Guestrin, C.: Stochastic gradient Hamiltonian Monte Carlo. In:
International Conference on Machine Learning, pp. 1683–1691. PMLR (2014)
12. Cohen, R., Blau, Y., Freedman, D., Rivlin, E.: It has potential: Gradient-driven
denoisers for convergent solutions to inverse problems. Adv. Neural. Inf. Process.
Syst. 34, 18152–18164 (2021)
13. Cohen, R., Elad, M., Milanfar, P.: Regularization by denoising via fixed-point
projection (red-pro). SIAM J. Imag. Sci. 14(3), 1374–1406 (2021)
14. Damianou, A., Lawrence, N.D.: Deep gaussian processes. In: Artificial intelligence
and statistics, pp. 207–215. PMLR (2013)
15. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing
model uncertainty in deep learning. In: International Conference on Machine Learn-
ing, pp. 1050–1059. PMLR (2016)
16. Gal, Y., Hron, J., Kendall, A.: Concrete dropout. Adv. Neural Inf. Process. Syst.
30 (2017)
17. Gal, Y., Islam, R., Ghahramani, Z.: Deep Bayesian active learning with image
data. In: International Conference on Machine Learning, pp. 1183–1192. PMLR
(2017)
18. Gasthaus, J., et al.: Probabilistic forecasting with spline quantile function RNNs.
In: The 22nd International Conference on Artificial Intelligence and Statistics, pp.
1901–1910. PMLR (2019)
19. Hu, R., Huang, Q., Chang, S., Wang, H., He, J.: The MBPEP: a deep ensem-
ble pruning algorithm providing high quality uncertainty prediction. Appl. Intell.
49(8), 2942–2955 (2019). https://doi.org/10.1007/s10489-019-01421-8
20. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with condi-
tional adversarial networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1125–1134 (2017)
21. Izmailov, P., Maddox, W.J., Kirichenko, P., Garipov, T., Vetrov, D., Wilson, A.G.:
Subspace inference for Bayesian deep learning. In: Uncertainty in Artificial Intel-
ligence, pp. 1169–1179. PMLR (2020)
22. Kim, B., Xu, C., Barber, R.: Predictive inference is free with the jackknife+-after-
bootstrap. Adv. Neural. Inf. Process. Syst. 33, 4138–4149 (2020)
23. Kivaranovic, D., Johnson, K.D., Leeb, H.: Adaptive, distribution-free prediction
intervals for deep networks. In: International Conference on Artificial Intelligence
and Statistics, pp. 4346–4356. PMLR (2020)
24. Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive
uncertainty estimation using deep ensembles. Adv. Neural Inf. Process. Syst. 30
(2017)
25. Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R.J., Wasserman, L.: Distribution-free
predictive inference for regression. J. Am. Stat. Assoc. 113(523), 1094–1111 (2018)
26. Ljosa, V., Sokolnicki, K.L., Carpenter, A.E.: Annotated high-throughput
microscopy image sets for validation. Nat. Methods 9(7), 637–637 (2012)
27. Louizos, C., Welling, M.: Multiplicative normalizing flows for variational Bayesian
neural networks. In: International Conference on Machine Learning, pp. 2218–2227.
PMLR (2017)
28. MacKay, D.J.: Bayesian interpolation. Neural Comput. 4(3), 415–447 (1992)

29. Pearce, T., Brintrup, A., Zaki, M., Neely, A.: High-quality prediction intervals for
deep learning: a distribution-free, ensembled approach. In: International Confer-
ence on Machine Learning, pp. 4075–4084. PMLR (2018)
30. Posch, K., Steinbrener, J., Pilz, J.: Variational inference to measure model uncer-
tainty in deep neural networks. arXiv preprint arXiv:1902.10189 (2019)
31. Ritter, H., Botev, A., Barber, D.: A scalable Laplace approximation for neural net-
works. In: 6th International Conference on Learning Representations, ICLR 2018-
Conference Track Proceedings, vol. 6. International Conference on Representation
Learning (2018)
32. Romano, Y., Patterson, E., Candes, E.: Conformalized quantile regression. Adv.
Neural Inf. Process. Syst. 32 (2019)
33. Salimans, T., Kingma, D., Welling, M.: Markov chain Monte Carlo and variational
inference: bridging the gap. In: International Conference on Machine Learning, pp.
1218–1226. PMLR (2015)
34. Sankaranarayanan, S., Angelopoulos, A.N., Bates, S., Romano, Y., Isola, P.:
Semantic uncertainty intervals for disentangled latent spaces. arXiv preprint
arXiv:2207.10074 (2022)
35. Sesia, M., Candès, E.J.: A comparison of some conformal quantile regression meth-
ods. Stat 9(1), e261 (2020)
36. Shafer, G., Vovk, V.: A tutorial on conformal prediction. J. Mach. Learn. Res. 9(3)
(2008)
37. Sun, S.: Conformal methods for quantifying uncertainty in spatiotemporal data: a
survey. arXiv preprint arXiv:2209.03580 (2022)
38. Valentin Jospin, L., Buntine, W., Boussaid, F., Laga, H., Bennamoun, M.: Hands-
on Bayesian neural networks-a tutorial for deep learning users. arXiv e-prints pp.
arXiv-2007 (2020)
39. Wu, D., et al.: Quantifying uncertainty in deep spatiotemporal forecasting. arXiv
preprint arXiv:2105.11982 (2021)
40. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million
image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell.
(2017)
