Conformal Prediction Masks: Visualizing Uncertainty in Medical Imaging
1 Introduction
Deep learning has been successful in many applications, spanning computer vision, speech recognition, natural language processing, and beyond [12,13]. For many years, researchers were content to develop new techniques that achieve unprecedented accuracy, with little concern for understanding the uncertainty implicit in such models. Recently, however, there has been a concerted effort within the research community to quantify the uncertainty of deep models.
This paper addresses the problem of quantifying and visualizing uncertainty in the realm of image-to-image tasks. Such problems include super-resolution, deblurring, colorization, and image completion, amongst others. Assessing uncertainty is important generally, but is particularly so in application domains such as medical imaging.
2 Related Work
T(X)[i] = [ f̂(X)[i] − l̂(X)[i] , f̂(X)[i] + û(X)[i] ],   (1)

where l̂(X)[i] ≥ 0 and û(X)[i] ≥ 0 represent the uncertainty in the lower and upper directions respectively. Given heuristic uncertainty values (l̃, ũ), the uncertainty intervals are calibrated using a calibration dataset C ≜ {(X_k, Y_k)}_{k=1}^{K} to guarantee that they contain at least a fraction α of the ground-truth pixel values with probability 1 − δ. Here α ∈ (0, 1) and δ ∈ (0, 1) are user-specified risk and error levels respectively. Formally, the per-pixel uncertainty intervals are defined as follows.
Definition 1. Risk-Controlling Prediction Set (RCPS). A random set-valued function T : 𝒳 → 2^𝒴 is an (α, δ)-Risk-Controlling Prediction Set if

P( R(T) ≤ 1 − α ) ≥ 1 − δ.
Here the risk is R(T) ≜ 1 − E[ (1/N) |{ i : Y^test[i] ∈ T(X^test)[i] }| ], where the expectation is over a new test point (X^test, Y^test), while the outer probability is over the calibration data.
The procedure for constructing an RCPS consists of two stages. First, a machine learning system (e.g., a neural network) is trained to output a point prediction f̂ along with heuristic lower and upper interval widths (l̃, ũ). The second stage utilizes the calibration set to calibrate (l̃, ũ) so that the intervals contain the right fraction of ground-truth pixels. The final intervals are those in (1) with the calibrated widths (l̂, û).
Conformal prediction provides per-pixel uncertainty intervals with statistical
guarantees in image-to-image regression problems. Yet, the per-pixel prediction
sets may be difficult to comprehend on their own. To remedy this, the uncer-
tainty intervals are visualized by passing the pixel-wise interval lengths through
a colormap, where small sets render a pixel blue and large sets render it red.
Thus, the redder a region is, the greater the uncertainty, and the bluer it is, the
greater the confidence. The resultant uncertainty map, however, is not directly
endowed with rigorous guarantees. This raises the following question: can we
directly produce an uncertainty map with strong statistical guarantees?
where the expectation is over a new test point, and β[i] ∈ ℝ+ is a user-specified risk level. Define f̂_M(X) ≜ M(X) ⊙ f̂(X) and Y_M ≜ M(X) ⊙ Y, where ⊙ represents a point-wise (Hadamard) product. Then, note that constructing (2) is equivalent to creating the following uncertainty intervals:

T_M(X)[i] = [ f̂_M(X)[i] − β[i] , f̂_M(X)[i] + β[i] ],   (3)
which satisfy

Y^test_M[i] ∈ T_M(X^test)[i].   (4)
We remark on a few differences between (3) and (1): in (1) the lower and upper per-pixel uncertainty widths (l̂, û) depend on X and are calibrated, while in (3), l̂ = û ≡ β are user-specified and independent of X. Furthermore, the uncertainty parameters that undergo calibration are {M(X)[i]}_{i=1}^{N}.
One may notice that the above formulation exhibits a major limitation, as each value of the prediction mask is defined independently of the other values. Hence, it requires the user to specify a risk level for each pixel, which is cumbersome, especially in high dimension. More importantly, setting each entry of the mask independently may fail to capture the dependency between pixels and, thus, fail to express uncertainty at a conceptual level. To overcome this, we redefine our uncertainty masks to ensure that with probability at least 1 − δ it holds that E[ ‖f̂_M(X^test) − Y^test_M‖_1 ] ≤ α, where α ∈ ℝ+ is a global risk level and ‖Z‖_1 ≜ (1/N) Σ_{i=1}^{N} |Z[i]| is the (normalized) L1 norm of an arbitrary image Z. Furthermore, the latter formulation can be generalized to any divergence measure d : 𝒴 × 𝒴 → ℝ+ such that

E[ d( f̂_M(X^test), Y^test_M ) ] ≤ α.   (5)
Note that we must avoid trivial solutions, e.g. a zero mask, which satisfy (5) yet provide no useful information. Thus, we seek solutions that employ the least masking required to meet (5), with high probability.
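To make the constraint in (5) concrete, here is a minimal sketch of the masked L1 divergence (Python/NumPy; the variable names are illustrative, not the paper's code):

    import numpy as np

    def masked_l1(mask, f_hat, y):
        # Normalized L1 divergence between the masked prediction and the
        # masked ground truth: d(f_M, Y_M) = (1/N) * sum_i |(M*f - M*y)[i]|.
        return np.abs(mask * f_hat - mask * y).mean()

    # Requirement (5): masked_l1(mask, f_hat, y) <= alpha should hold in
    # expectation over test points, with probability at least 1 - delta
    # over the calibration data, while the mask deviates from all-ones
    # as little as possible.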
This formulation enjoys several benefits. First, the current definition of the mask captures pixel dependency. Thus, rather than focusing on individual pixels, the resultant map masks out (or reduces) regions of high uncertainty within the predicted image to guarantee that the divergence remains below the given risk level. Second, it accepts any divergence measure, each leading to a different mask. For example, selecting d(·, ·) to be a distortion measure may underline uncertainty regions of high-frequency objects (e.g. edges), while setting d(·, ·) to be a perceptual loss may highlight semantic factors within the image. Formally, we refer to these uncertainty masks as Risk-Controlling Prediction Masks (RCPM), which are defined below.

Definition 2. Risk-Controlling Prediction Mask (RCPM). A random masking function M is an (α, δ)-Risk-Controlling Prediction Mask if

P( E[ d( f̂_M(X^test), Y^test_M ) ] ≤ α ) ≥ 1 − δ.
As with RCPS, the procedure for creating an RCPM includes two main stages. First, given a predictor f̂, we require a heuristic notion of a non-zero uncertainty
Thus, the resultant mask holds high values at pixels with small intervals (high confidence) and smaller values at pixels with larger intervals, corresponding to high-uncertainty regions. However, this approach requires first creating uncertainty intervals; hence, we next introduce a technique that directly produces an uncertainty mask.
where the expectation is over the training samples D ≜ {(X_j, Y_j)}_{j=1}^{J} used to train f̂. To derive our loss function, we start by formulating the following problem for a given triplet (X, Y, f̂(X)):

min_θ ‖ M_θ(X, f̂(X)) − 1 ‖_2^2   subject to   d( f̂_M(X), Y_M ) ≤ α,   (8)
since α does not depend on θ. Thus, we train our mask model using the following loss function:

L(D, θ) ≜ Σ_{(X,Y) ∈ D} [ ‖ M_θ(X, f̂(X)) − 1 ‖_2^2 + μ · d( f̂_M(X), Y_M ) ].   (11)
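For concreteness, a minimal PyTorch sketch of the loss (11) with the L1 divergence might look as follows (a per-batch variant; the mask-model call and variable names are illustrative assumptions, not the paper's code):

    import torch

    def rcpm_loss(mask_model, x, f_hat, y, mu=2.0):
        # Mask in [0, 1], computed from the measurement and the prediction.
        m = mask_model(torch.cat([x, f_hat], dim=1))
        # Push the mask toward all-ones (least masking)...
        size_term = ((m - 1.0) ** 2).mean()
        # ...while penalizing the divergence between the masked prediction
        # and the masked ground truth (here, a normalized L1 divergence).
        div_term = (m * f_hat - m * y).abs().mean()
        return size_term + mu * div_term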
This formulation has been found empirically to perform well in our experiments. To set the calibration parameter λ > 0, we use the calibration dataset C ≜ {(X_k, Y_k)}_{k=1}^{K}, computing for each pair (X_k, Y_k) ∈ C

λ_k ≜ max{ λ̂ : d( f̂_{M_λ̂}(X_k), Y_{k,M_λ̂} ) ≤ α }.   (13)
Finally, λ is taken to be the 1 − δ quantile of {λ_k}_{k=1}^{K}, i.e. the maximal value for which at least a δ fraction of the calibration set satisfies condition (5). Thus, assuming the calibration and test sets are i.i.d. samples from the same distribution, the calibrated mask is guaranteed to satisfy Definition 2.
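A sketch of this calibration step (Python/NumPy; mask_at_lambda is a hypothetical helper that applies the λ-adjusted mask, whose exact form depends on how M_λ̂ is defined):

    import numpy as np

    def calibrate_lambda(calib_triplets, divergence, mask_at_lambda,
                         alpha, delta, grid):
        # Per-sample lambda_k from (13): the largest lambda on the grid for
        # which the masked divergence stays below alpha. We assume the grid
        # always contains at least one feasible value.
        lambdas = []
        for x, f_hat, y in calib_triplets:
            feasible = [lam for lam in grid
                        if divergence(mask_at_lambda(lam, x, f_hat),
                                      f_hat, y) <= alpha]
            lambdas.append(max(feasible))
        # The (1 - delta)-quantile: at least a delta fraction of the
        # calibration pairs then satisfies condition (5).
        return float(np.quantile(lambdas, 1.0 - delta))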
5 Experiments
5.1 Datasets and Tasks
Super Resolution: We experiment with this task on both datasets. The images are scaled down to 64 × 64, and the goal is to reconstruct the original images.
Colorization: We convert the Places365 images to grayscale and aim to recover
their colors.
Image-to-Image Models. We start by training models for the above three tasks. Note that these models are not intended to be state-of-the-art, but rather serve to demonstrate the uncertainty estimation technique proposed in this work. We use the same model architecture for all tasks: an 8-layer U-Net. For each task we train two versions of the network: (i) a simple regressor; and (ii) a conditional GAN, where the generator plays the role of the reconstruction model. For the GAN, the discriminator is implemented as a 4-layer CNN. We use the L1 loss as the objective for the regressor, and add an adversarial loss for the conditional GAN, as in [20]. All models are trained for 10 epochs using the Adam optimizer with a learning rate of 1e−5 and a batch size of 50.
Mask Model. For our mask model we use an 8-layer U-Net architecture, for simplicity and compatibility with previous works. The inputs to the mask model are the measurement image and the predicted image, concatenated along the channel axis. The output is a mask with the same shape as the predicted image and values within the range [0, 1]. The mask model is trained using the loss function (11) with μ = 2, a learning rate of 1e−5, and a batch size of 25.
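The interface of such a mask model can be sketched as follows (PyTorch; the U-Net internals are abbreviated, with unet standing in for any backbone of matching input/output channels):

    import torch
    import torch.nn as nn

    class MaskModel(nn.Module):
        # Wraps a backbone (e.g., an 8-layer U-Net) so that the measurement
        # and the predicted image are concatenated along the channel axis
        # and the output is squashed into [0, 1].
        def __init__(self, unet):
            super().__init__()
            self.unet = unet

        def forward(self, measurement, prediction):
            z = torch.cat([measurement, prediction], dim=1)
            return torch.sigmoid(self.unet(z))  # mask, same shape as prediction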
Experiments. We consider L1, L2, SSIM, and LPIPS as our divergence measures. We set aside 1,000 samples from each validation set for calibration and use the remaining samples for evaluation. We demonstrate the flexibility of our approach by conducting experiments in 12 settings: (i) Image Completion: {Regressor, GAN} × {L1, LPIPS}; (ii) Super Resolution: {Regressor, GAN} × {L1, SSIM}; and (iii) Colorization: {Regressor, GAN} × {L1, L2}.
Risk and Error Levels. Recall that given a predicted image, our goal is to find
a mask that, when applied to both the prediction and the (unknown) reference
image, reduces the distortion between them to a predefined risk level α with high
probability δ. Here we fix δ = 0.9 and set α to be the 0.1-quantile of each measure
computed on a random sample from the validation set, i.e. roughly 10% of the
predictions are already considered sufficiently good and do not require masking
at all.
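Concretely, the risk level can be obtained per divergence measure as follows (a sketch; gathering the unmasked divergences over a random validation sample is assumed to happen elsewhere):

    import numpy as np

    def pick_alpha(unmasked_divergences):
        # alpha = 0.1-quantile of the unmasked divergence: roughly 10% of
        # predictions fall below it and therefore need no masking at all.
        return float(np.quantile(unmasked_divergences, 0.1))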
Quantile – Quantile Regression. Since a quantile regressor produces per-pixel uncertainty intervals for any divergence measure we consider, we can use these intervals and transform them into a mask using (6). For completeness, we also report the performance of the quantile regression even when it is less suitable, i.e. when the underlying model is a GAN and when the divergence function differs from L1. We note again that, for the sake of a fair comparison, our implementation of the mask model uses exactly the same architecture as the quantile regressor.
Opt – Oracle. We also compare our method with an oracle, denoted Opt, which, given a ground-truth image, computes an optimal mask by minimizing (10). We perform gradient descent using the Adam optimizer with a learning rate of 0.01, iterating until the divergence term decreases below the risk level α. This procedure is applied to each test image individually; thus, no calibration is needed.
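A sketch of this oracle (PyTorch; since the objective (10) trades mask size against divergence, we assume an illustrative unit weighting between the two terms):

    import torch

    def opt_oracle(f_hat, y, divergence, alpha, lr=0.01, max_iters=10_000):
        # Per-image oracle: optimize the mask directly with Adam until the
        # masked divergence drops below alpha (no calibration needed).
        m = torch.full_like(f_hat, 0.99, requires_grad=True)
        optimizer = torch.optim.Adam([m], lr=lr)
        for _ in range(max_iters):
            mask = m.clamp(0.0, 1.0)
            div = divergence(mask * f_hat, mask * y)
            if div.item() <= alpha:
                break
            loss = ((mask - 1.0) ** 2).mean() + div  # size term + divergence
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return m.detach().clamp(0.0, 1.0)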
Comparison Metrics. Given a mask M(X) we assess its performance using the following metrics: (i) average mask size s(M(X)) ≜ (1/N) ‖M(X) − 1‖_1; (ii) Pearson correlation Corr(M, d) between the mask size and the full (unmasked) divergence value; and (iii) Pearson correlation Corr(M, M_opt) between the mask size and the size of the optimal mask M_opt obtained by Opt.
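These metrics can be computed, for instance, as follows (Python; scipy.stats.pearsonr is assumed available, and the inputs are per-image masks and unmasked divergence values collected over the evaluation set):

    import numpy as np
    from scipy.stats import pearsonr

    def mask_size(mask):
        # s(M(X)) = (1/N) ||M(X) - 1||_1 : average deviation from all-ones.
        return np.abs(mask - 1.0).mean()

    def comparison_metrics(masks, unmasked_divergences, opt_masks):
        sizes = np.array([mask_size(m) for m in masks])
        opt_sizes = np.array([mask_size(m) for m in opt_masks])
        corr_d, _ = pearsonr(sizes, np.asarray(unmasked_divergences))  # Corr(M, d)
        corr_opt, _ = pearsonr(sizes, opt_sizes)                       # Corr(M, M_opt)
        return sizes.mean(), corr_d, corr_opt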
Fig. 2. Examples of conformal prediction masks. The images from left to right
are the measurement, ground-truth, model prediction, our calibrated mask trained with
L1 loss and the ground-truth L1 error. Tasks are image completion (top), colorization
(middle) and super resolution (bottom).
Due to space limitations, we show more extensive experimental results in the
Appendix, while presenting a selected portion of them here.
Table 1. Quantitative results. Arrows point in the better direction; best results are in blue.
We provide a representative visual illustration of our proposed mask for several test cases in Fig. 2. As can be seen, the produced masks indeed identify sub-regions of high uncertainty. In the image completion task, the bottom-left corner is richer in details, so there is high uncertainty regarding this part of the reconstructed image. In the colorization task, the mask highlights the colored area of the bus, which is the most unreliable region since it can be colorized with a large variety of colors. In the super resolution task, the mask marks regions of edges and text, while trustworthy parts such as smooth surfaces remain unmasked.
We present quantitative results in Table 1, showing that our method exhibits smaller mask sizes, aligning well with the masks obtained by Opt. In contrast, Quantile overestimates and produces larger masks, as expected. In terms of the correlation Corr(M, d), our method shows high agreement, while Quantile lags behind. This correlation indicates a much-desired adaptivity of the estimated mask to the complexity of the image content, and thus to the corresponding uncertainty. We provide a complementary illustration of the results in Fig. 3 in the Appendix. As seen from the top row, all three methods meet the probabilistic guarantees regarding the divergence/loss with fewer than 10% exceptions,
as required. Naturally, Opt has no outliers, since each mask is optimally calibrated by its computation. The spread of loss values tends to be higher with Quantile, indicating weaker performance. The middle and bottom rows are consistent with the results in Table 1, showing that our approach tends to produce masks that are close in size to those of Opt, while Quantile produces larger, and thus inferior, masked areas. We note that the colorization task seems to be more challenging, resulting in a marginal performance increase for our method compared to Quantile.
6 Conclusions
Uncertainty assessment in image-to-image regression problems is a challenging task, due to the implied complexity, the high dimensions involved, and the need to offer an effective and meaningful visualization of the estimated results. This work proposes a novel approach to these challenges by constructing a conformal mask that visually differentiates between trustworthy and uncertain regions in an estimated image. This mask provides a measure of uncertainty accompanied by a statistical guarantee, stating that, with high probability, the divergence between the original and the recovered images over the non-masked regions is below a desired risk level. The presented paradigm is flexible, being agnostic to both the choice of divergence measure and the regression method employed.
References
1. Abdar, M., et al.: A review of uncertainty quantification in deep learning: tech-
niques, applications and challenges. Inf. Fusion 76, 243–297 (2021)
2. Alaa, A., Van Der Schaar, M.: Frequentist uncertainty in recurrent neural net-
works via blockwise influence functions. In: International Conference on Machine
Learning, pp. 175–190. PMLR (2020)
3. Angelopoulos, A.N., Bates, S.: A gentle introduction to conformal prediction and
distribution-free uncertainty quantification. CoRR abs/2107.07511 (2021). https://
arxiv.org/abs/2107.07511
4. Angelopoulos, A.N., Bates, S.: A gentle introduction to conformal prediction
and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511
(2021)
5. Angelopoulos, A.N., Bates, S., Candès, E.J., Jordan, M.I., Lei, L.: Learn then
test: calibrating predictive algorithms to achieve risk control. arXiv preprint
arXiv:2110.01052 (2021)
6. Angelopoulos, A.N., Bates, S., Fisch, A., Lei, L., Schuster, T.: Conformal risk
control. arXiv preprint arXiv:2208.02814 (2022)
7. Angelopoulos, A.N., et al.: Image-to-image regression with distribution-free uncer-
tainty quantification and applications in imaging. arXiv preprint arXiv:2202.05265
(2022)
8. Ashukha, A., Lyzhov, A., Molchanov, D., Vetrov, D.: Pitfalls of in-domain uncer-
tainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470
(2020)
9. Bates, S., Angelopoulos, A., Lei, L., Malik, J., Jordan, M.: Distribution-free, risk-
controlling prediction sets. J. ACM (JACM) 68(6), 1–34 (2021)
10. Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in
neural network. In: International Conference on Machine Learning, pp. 1613–1622.
PMLR (2015)
11. Chen, T., Fox, E., Guestrin, C.: Stochastic gradient Hamiltonian Monte Carlo. In:
International Conference on Machine Learning, pp. 1683–1691. PMLR (2014)
12. Cohen, R., Blau, Y., Freedman, D., Rivlin, E.: It has potential: Gradient-driven
denoisers for convergent solutions to inverse problems. Adv. Neural. Inf. Process.
Syst. 34, 18152–18164 (2021)
13. Cohen, R., Elad, M., Milanfar, P.: Regularization by denoising via fixed-point
projection (red-pro). SIAM J. Imag. Sci. 14(3), 1374–1406 (2021)
14. Damianou, A., Lawrence, N.D.: Deep Gaussian processes. In: Artificial Intelligence
and Statistics, pp. 207–215. PMLR (2013)
15. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing
model uncertainty in deep learning. In: International Conference on Machine Learn-
ing, pp. 1050–1059. PMLR (2016)
16. Gal, Y., Hron, J., Kendall, A.: Concrete dropout. Adv. Neural Inf. Process. Syst.
30 (2017)
17. Gal, Y., Islam, R., Ghahramani, Z.: Deep Bayesian active learning with image
data. In: International Conference on Machine Learning, pp. 1183–1192. PMLR
(2017)
18. Gasthaus, J., et al.: Probabilistic forecasting with spline quantile function RNNs.
In: The 22nd International Conference on Artificial Intelligence and Statistics, pp.
1901–1910. PMLR (2019)
19. Hu, R., Huang, Q., Chang, S., Wang, H., He, J.: The MBPEP: a deep ensem-
ble pruning algorithm providing high quality uncertainty prediction. Appl. Intell.
49(8), 2942–2955 (2019). https://doi.org/10.1007/s10489-019-01421-8
20. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with condi-
tional adversarial networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1125–1134 (2017)
21. Izmailov, P., Maddox, W.J., Kirichenko, P., Garipov, T., Vetrov, D., Wilson, A.G.:
Subspace inference for Bayesian deep learning. In: Uncertainty in Artificial Intel-
ligence, pp. 1169–1179. PMLR (2020)
22. Kim, B., Xu, C., Barber, R.: Predictive inference is free with the jackknife+-after-
bootstrap. Adv. Neural. Inf. Process. Syst. 33, 4138–4149 (2020)
23. Kivaranovic, D., Johnson, K.D., Leeb, H.: Adaptive, distribution-free prediction
intervals for deep networks. In: International Conference on Artificial Intelligence
and Statistics, pp. 4346–4356. PMLR (2020)
24. Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive
uncertainty estimation using deep ensembles. Adv. Neural Inf. Process. Syst. 30
(2017)
25. Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R.J., Wasserman, L.: Distribution-free
predictive inference for regression. J. Am. Stat. Assoc. 113(523), 1094–1111 (2018)
26. Ljosa, V., Sokolnicki, K.L., Carpenter, A.E.: Annotated high-throughput
microscopy image sets for validation. Nat. Methods 9(7), 637–637 (2012)
27. Louizos, C., Welling, M.: Multiplicative normalizing flows for variational Bayesian
neural networks. In: International Conference on Machine Learning, pp. 2218–2227.
PMLR (2017)
28. MacKay, D.J.: Bayesian interpolation. Neural Comput. 4(3), 415–447 (1992)
29. Pearce, T., Brintrup, A., Zaki, M., Neely, A.: High-quality prediction intervals for
deep learning: a distribution-free, ensembled approach. In: International Confer-
ence on Machine Learning, pp. 4075–4084. PMLR (2018)
30. Posch, K., Steinbrener, J., Pilz, J.: Variational inference to measure model uncer-
tainty in deep neural networks. arXiv preprint arXiv:1902.10189 (2019)
31. Ritter, H., Botev, A., Barber, D.: A scalable Laplace approximation for neural net-
works. In: 6th International Conference on Learning Representations, ICLR 2018-
Conference Track Proceedings, vol. 6. International Conference on Representation
Learning (2018)
32. Romano, Y., Patterson, E., Candes, E.: Conformalized quantile regression. Adv.
Neural Inf. Process. Syst. 32 (2019)
33. Salimans, T., Kingma, D., Welling, M.: Markov chain Monte Carlo and variational
inference: bridging the gap. In: International Conference on Machine Learning, pp.
1218–1226. PMLR (2015)
34. Sankaranarayanan, S., Angelopoulos, A.N., Bates, S., Romano, Y., Isola, P.:
Semantic uncertainty intervals for disentangled latent spaces. arXiv preprint
arXiv:2207.10074 (2022)
35. Sesia, M., Candès, E.J.: A comparison of some conformal quantile regression meth-
ods. Stat 9(1), e261 (2020)
36. Shafer, G., Vovk, V.: A tutorial on conformal prediction. J. Mach. Learn. Res. 9(3)
(2008)
37. Sun, S.: Conformal methods for quantifying uncertainty in spatiotemporal data: a
survey. arXiv preprint arXiv:2209.03580 (2022)
38. Valentin Jospin, L., Buntine, W., Boussaid, F., Laga, H., Bennamoun, M.: Hands-on
Bayesian neural networks - a tutorial for deep learning users. arXiv e-prints (2020)
39. Wu, D., et al.: Quantifying uncertainty in deep spatiotemporal forecasting. arXiv
preprint arXiv:2105.11982 (2021)
40. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million
image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell.
(2017)