Deep Learning for Camera Calibration and Beyond: A Survey
Kang Liao, Lang Nie, Shujuan Huang, Chunyu Lin, Jing Zhang, Yao Zhao, Fellow, IEEE, Moncef
Gabbouj, Fellow, IEEE, Dacheng Tao, Fellow, IEEE

Abstract— Camera calibration involves estimating camera parameters to infer geometric features from captured sequences, which is crucial for computer vision and robotics. However, conventional calibration is laborious and requires dedicated collection. Recent efforts show that learning-based solutions have the potential to replace the repetitive work of manual calibration. Among these solutions, various learning strategies, networks, geometric priors, and datasets have been investigated. In this paper, we provide a comprehensive survey of learning-based camera calibration techniques by analyzing their strengths and limitations. Our main calibration categories include the standard pinhole camera model, distortion camera model, cross-view model, and cross-sensor model, following the research trend and extended applications. As there is no benchmark in this community, we collect a holistic calibration dataset that can serve as a public platform to evaluate the generalization of existing methods. It comprises both synthetic and real-world data, with images and videos captured by different cameras in diverse scenes. Toward the end of this paper, we discuss the challenges and provide further research directions. To our knowledge, this is the first survey of learning-based camera calibration (spanning 8 years). The summarized methods, datasets, and benchmarks are available and will be regularly updated at https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.

arXiv:2303.10559v1 [cs.CV] 19 Mar 2023

Index Terms—Camera calibration, Deep learning, Computational photography, Multiple view geometry, Robotics.

Kang Liao, Lang Nie, Shujuan Huang, Chunyu Lin (corresponding author), and Yao Zhao are with the Institute of Information Science, Beijing Jiaotong University (BJTU), Beijing 100044, China, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China (email: kang [email protected], [email protected], shujuan-[email protected], [email protected], [email protected])
Moncef Gabbouj is with the Department of Computing Sciences, Tampere University, 33101 Tampere, Finland (e-mail: [email protected])
Jing Zhang and Dacheng Tao are with the School of Computer Science, Faculty of Engineering, The University of Sydney, Australia (e-mail: [email protected]; [email protected])

1 INTRODUCTION

Camera calibration is a fundamental and indispensable field in computer vision and it has a long research history [1], [2], [3], [4], tracing back to around 60 years ago [5]. The first step for many vision and robotics tasks is to calibrate the intrinsic (image sensor and distortion parameters) and/or extrinsic (rotation and translation) camera parameters, ranging from computational photography, and multi-view geometry, to 3D reconstruction. In terms of the task type, there are different techniques to calibrate the standard pinhole camera, fisheye lens camera, stereo camera, light field camera, event camera, and LiDAR-camera system, etc. Figure 1 shows the popular calibration objectives, models, and extended applications in camera calibration.

[Figure 1 panels: Standard Camera Model (intrinsic parameters, extrinsic parameters); Distortion Camera Model (radial distortion and undistortion for a wide-angle camera; global shutter and rolling shutter for a CMOS camera); Cross-View Model (projection matrix); Cross-Sensor Model (LiDAR and camera, related by R and t).]
Fig. 1. Popular calibration objectives, models, and extended applications in camera calibration.

Traditional methods for camera calibration generally depend on hand-crafted features and model assumptions. These methods can be broadly divided into three categories. The most prevalent one involves using a known calibration target (e.g., a checkerboard) as it is deliberately moved in the 3D scene [6], [7], [8]. Then, the camera captures the target from different viewpoints and the checkerboard corners are detected for calculating the camera parameters. However, such a procedure requires cumbersome manual interactions and it cannot achieve automatic calibration “in the wild”. To pursue better flexibility, the second category of camera calibration, i.e., the geometric-prior-based calibration has been largely studied [9], [10], [11], [12]. To be specific, the geometric structures are leveraged to model the 3D-2D correspondence in the scene, such as lines and vanishing points. However, this type of method heavily relies on structured man-made scenes containing rich geometric priors, leading to poor performance when applied to general environments. The third category is self-calibration [13], [14], [15]. Such a solution takes a sequence of images as inputs and estimates the camera parameters using multi-view geometry. The accuracy of self-calibration, however, is
constrained by the limits of the feature detectors, which can be influenced by diverse lighting conditions and textures. Since there are many standard techniques for calibrating cameras in an industry/laboratory implementation [16], [17], this process is usually ignored in recent development. However, calibrating single and wild images remains challenging, especially when images are collected from websites and come from unknown camera models. This challenge motivates researchers to investigate a new paradigm.

Recently, deep learning has brought new inspirations to camera calibration and its applications. Learning-based methods achieve state-of-the-art performances on various tasks with higher efficiency. In particular, diverse deep neural networks (DNNs) have been developed, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), PointNet, and vision transformers (ViTs), whose high-level semantic features show more powerful representation capability than hand-crafted features. Moreover, diverse learning strategies have been exploited to boost the geometric perception of neural networks. Learning-based methods offer a fully automatic camera calibration solution, without manual interventions or calibration targets, which sets them apart from traditional methods. Furthermore, some of these methods achieve camera model-free and label-free calibration, showing promising and meaningful applications.

With the rapid increase in the number of learning-based camera calibration methods, it has become increasingly challenging to keep up with new advances. Consequently, there is an urgent need to analyze existing works and foster a community dedicated to this field. Previous surveys, e.g., [18], [19], [20], focused only on a specific task/camera in camera calibration or on one type of approach. For instance, Salvi et al. [18] reviewed the traditional camera calibration methods in terms of the algorithms. Hughes et al. [19] provided a detailed review for calibrating fisheye cameras with traditional solutions. While Fan et al. [20] discussed both the traditional methods and deep learning methods, their survey only considers calibrating wide-angle cameras. In addition, because Fan et al. [20] review only a small number of learning-based methods (around 10 papers), it is difficult for readers to picture the development trend of general camera calibration.

In this paper, we provide a comprehensive and in-depth overview of recent advances in learning-based camera calibration, covering over 100 papers. We also discuss potential directions for further improvements and examine various types of cameras and targets. To facilitate future research on different topics, we categorize the current solutions according to calibration objectives and applications. In addition to fundamental parameters such as focal length, rotation, and translation, we also provide detailed reviews for correcting image distortion (radial distortion and rolling shutter distortion), estimating cross-view mapping, calibrating camera-LiDAR systems, and other applications. Such a trend follows the development of cameras and market demands for virtual reality, autonomous driving, neural rendering, etc.

To our best knowledge, this is the first survey of learning-based camera calibration and its extended applications, and it makes the following unique contributions. (1) Our work mainly follows recent advances in deep learning-based camera calibration. In-depth analysis and discussion in various aspects are offered, including publications, network architecture, loss functions, datasets, evaluation metrics, learning strategies, implementation platforms, etc. Detailed information on each work is listed in Table 1. (2) Besides the calibration algorithms, we comprehensively review the classical camera models and their extended models. In particular, we summarize the redesigned calibration objectives in deep learning since some traditional calibration objectives are verified to be hard to learn by neural networks. (3) We collect a dataset containing images and videos captured by different cameras in different environments, which can serve as a platform to evaluate the generalization of existing methods. (4) We discuss the open challenges of learning-based camera calibration and propose some future directions to provide guidance for further research in this field. (5) An open-source repository is created that provides a taxonomy of all reviewed works and benchmarks. The repository will be updated regularly at https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.

In the following sections, we discuss and analyze various aspects of learning-based camera calibration. The remainder of this paper is organized as follows. In Section 2, we provide the concrete learning paradigms and learning strategies of learning-based camera calibration. Subsequently, we introduce and discuss the specific methods based on the standard camera model, distortion model, cross-view model, and cross-sensor model in Section 3, Section 4, Section 5, and Section 6, respectively (see Figure 2). The collected benchmark for calibration methods is described in Section 7. Finally, we conclude the survey of learning-based camera calibration and suggest future directions for this community in Section 8.

2 PRELIMINARIES

Deep learning has brought new inspirations to camera calibration, enabling a fully automatic calibration procedure without manual intervention. Here, we first summarize two prevalent paradigms in learning-based camera calibration: regression-based calibration and reconstruction-based calibration. Then, the widely-used learning strategies in this research field are reviewed. The detailed definitions for classical camera models and their corresponding calibration objectives are given in the supplementary material.

2.1 Learning Paradigm

Driven by different neural network architectures, researchers have developed two main paradigms for learning-based camera calibration and its applications.

Regression-based Calibration. Given an uncalibrated input, the regression-based calibration first extracts the high-level semantic features using stacked convolutional layers. Then, the fully connected layers aggregate the semantic features and form a vector of the estimated calibration objective. The regressed parameters are used to conduct subsequent tasks such as distortion rectification, image warping, camera localization, etc. This paradigm is the earliest and has a dominant role in learning-based camera calibration and its applications. All the first works in various objectives, e.g., intrinsics: Deepfocal [21], extrinsic: PoseNet [22], radial
distortion: Rong et al. [23], rolling shutter distortion: URS-CNN [24], homography matrix: DHN [25], hybrid parameters: Hold-Geoffroy et al. [26], camera-LiDAR parameters: RegNet [27] have been achieved with this paradigm.

Reconstruction-based Calibration. On the other hand, the reconstruction-based calibration paradigm discards the parameter regression and directly learns the pixel-level mapping function between the uncalibrated input and target, inspired by the conditional image-to-image translation [28] and dense visual perception [29], [30]. A pixel-wise loss is then computed between the reconstructed result and the ground truth. In this regard, most reconstruction-based calibration methods [31], [32], [33], [34] design their network architecture based on a fully convolutional network such as U-Net [35]. Specifically, an encoder-decoder network, with skip connections between the encoder and decoder features at the same spatial resolution, progressively extracts the features from low-level to high-level and effectively integrates multi-scale features. At the last convolutional layer, the learned features are aggregated into the target channel, reconstructing the calibrated result at the pixel level.

In contrast to the regression-based paradigm, the reconstruction-based paradigm does not require labels for diverse camera parameters. Besides, the imbalance loss problem can be eliminated since it only optimizes the photometric loss of calibrated results. Therefore, the reconstruction-based paradigm enables a blind camera calibration without a strong camera model assumption.

2.2 Learning Strategies

In the following, we review the learning-based camera calibration literature regarding different learning strategies.

Supervised Learning. Most learning-based camera calibration methods train their networks with the supervised learning strategy, from the classical methods [21], [22], [23], [25], [36], [37] to the state-of-the-art methods [38], [39], [40], [41], [42]. In terms of the learning paradigm, this strategy supervises the network with the ground truth of the camera parameters (regression-based paradigm) or paired data (reconstruction-based paradigm). In general, they synthesize the training dataset from other large-scale datasets, under random parameter/transformation sampling and camera model simulation. Some recent works [43], [44], [45], [46] establish their training dataset using a real-world setup and label the captured images with manual annotations, thereby fostering advancements in this research domain.

Semi-Supervised Learning. Training the network using an annotated dataset under diverse scenarios is an effective learning strategy. However, human annotation can be prone to errors, leading to inconsistent annotation quality or the inclusion of contaminated data. Consequently, increasing the training dataset to improve performance can be challenging due to the complexity and cost of constructing the dataset. To address this challenge, SS-WPC [47] proposes a semi-supervised method for correcting portraits captured by a wide-angle camera. It employs a surrogate task (segmentation) and a semi-supervised method that utilizes direction and range consistency and regression consistency to leverage both labeled and unlabeled data.

Weakly-Supervised Learning. Although significant progress has been made, data labeling for camera calibration is a notoriously costly process, and obtaining perfect ground-truth labels is challenging. As a result, it is often preferable to use weak supervision with machine learning methods. Weakly supervised learning refers to the process of building prediction models through learning with inadequate supervision. Zhu et al. [48] present a weakly supervised camera calibration method for single-view metrology in unconstrained environments, where there is only one accessible image of a scene composed of objects of uncertain sizes. This work leverages 2D object annotations from large-scale datasets, where people and buildings are frequently present and serve as useful “reference objects” for determining 3D size.

Unsupervised Learning. Unsupervised learning, commonly referred to as unsupervised machine learning, analyzes and groups unlabeled datasets using machine learning algorithms. UDHN [49] is the first work for a cross-view camera model using unsupervised learning, which estimates the homography matrix of an image pair without the projection labels. By reducing a pixel-wise intensity error that does not require ground truth data, UDHN [49] outperforms previous supervised learning techniques. While preserving superior accuracy and robustness to fluctuation in light, the proposed unsupervised algorithm can also achieve faster inference time. Inspired by this work, an increasing number of methods leverage the unsupervised learning strategy to estimate the homography, such as CA-UDHN [50], BasesHomo [51], HomoGAN [52], and Liu et al. [53]. Besides, UnFishCor [54] removes the need for distortion parameters and designs an unsupervised framework for the wide-angle camera.

Self-supervised Learning. Robotics is where the phrase “self-supervised learning” first appears, as training data is automatically categorized by utilizing relationships between various input sensor signals. Compared to supervised learning, self-supervised learning leverages input data itself as the supervision. Many self-supervised techniques are presented to learn visual characteristics from massive amounts of unlabeled photos or videos without the need for time-consuming and expensive human annotations. SSR-Net [55] presents a self-supervised deep homography estimation network, which relaxes the need for ground truth annotations and leverages the invertibility constraints of homography. To be specific, SSR-Net [55] utilizes the homography matrix representation in place of the 4-point parameterization typically used by other approaches, in order to apply the invertibility constraints. SIR [56] devises a brand-new self-supervised camera calibration pipeline for wide-angle image rectification, based on the principle that the corrected results of distorted images of the same scene taken with various lenses need to be the same. With self-supervised depth and pose learning as a proxy aim, Fang et al. [57] propose to self-calibrate a range of generic camera models from raw video, offering for the first time a calibration evaluation of camera model parameters learned solely via self-supervision.

Reinforcement Learning. Rather than minimizing an error at each individual stage, reinforcement learning can maximize the cumulative benefits of a learning process as a whole. To date, DQN-RecNet [58] is the first and only work in camera calibration using reinforcement learning. It applies a deep reinforcement learning technique to tackle the fisheye image rectification by a single Markov Decision Process, which is a multi-step gradual calibration scheme. In this situation, the
current fisheye image represents the state of the environment. The agent, Deep Q-Network [59], generates an action that should be executed to correct the distorted image.

In the following, we will review the specific methods and literature for learning-based camera calibration. The structural and hierarchical taxonomy is shown in Figure 2.

[Figure 2 content, in outline form]
Deep Learning for Camera Calibration and Beyond
  Standard Model (Section 3)
    Intrinsics Calibration (Section 3.1): Deepfocal, MisCaliDet
    Extrinsics Calibration (Section 3.2): PoseNet, DeepFEPE
    Joint Calibration (Section 3.3)
      Geometric Representation (Section 3.3.1): DeepVP, DeepHorizon
      Composite Parameters (Section 3.3.2): Hold-Geoffroy et al., CTRL-C
    Discussion (Section 3.4)
  Distortion Model (Section 4)
    Radial Distortion (Section 4.1)
      Regression-based (Section 4.1.1): Rong et al., DeepCalib
      Reconstruction-based (Section 4.1.2): DR-GAN, BlindCor
    Rolling Shutter Distortion (Section 4.2)
      Single-frame-based (Section 4.2.1): URS-CNN, RSC-Net
      Multi-frame-based (Section 4.2.2): DeepUnrollNet, AW-RSC
    Discussion (Section 4.3)
  Cross-view Model (Section 5), single view to multiple views
    Direct Solution (Section 5.1)
      4-pt Parameterization (Section 5.1.1): DHN, UDHN
      Other Parameterization (Section 5.1.2): SSR-Net, BasesHomo
    Cascaded Solution (Section 5.2): LocalTrans, MHN
    Iterative Solution (Section 5.3): IHN, CLKN, DLKFM
    Discussion (Section 5.4)
  Cross-sensor Model (Section 6), single sensor to multiple sensors
    Pixel-level (Section 6.1): RegNet, CalibNet, CFNet
    Semantics-level (Section 6.2): SOIC, SSI-Calib, SemAlign
    Object/Keypoint-level (Section 6.3): ATOP, RKGCNet
    Discussion (Section 6.4): Summary, Research Trend, Future Effort
Fig. 2. The structural and hierarchical taxonomy of camera calibration with deep learning. Some classical methods are listed under each category.

3 STANDARD MODEL

Generally, for learning-based calibration works, the objectives of the intrinsics calibration contain focal length and optical center, and the objectives of the extrinsic calibration contain the rotation matrix and translation vector.

3.1 Intrinsics Calibration

Deepfocal [21] is a pioneering work in learning-based camera calibration; it aims to estimate the focal length of any image “in the wild”. In detail, Deepfocal considered a simple pinhole camera model and regressed the horizontal field of view using a deep convolutional neural network. Given the width $w$ of an image, the relationship between the horizontal field of view $H_\theta$ and focal length $f$ can be described by:

$$H_\theta = 2\arctan\left(\frac{w}{2f}\right). \tag{1}$$

Due to component wear, temperature fluctuations, or outside disturbances like collisions, the calibrated parameters of a camera are susceptible to change over time. To this end, MisCaliDet [108] proposed to identify if a camera needs to be recalibrated intrinsically. Compared to the conventional intrinsic parameters such as the focal length and image center, MisCaliDet presented a new scalar metric, i.e., the average pixel position difference (APPD), to measure the degree of camera miscalibration, which describes the mean value of the pixel position differences over the entire image.

3.2 Extrinsics Calibration

In contrast to intrinsic calibration, extrinsic calibration infers the spatial correspondence between the camera and the 3D scene in which it is located. PoseNet [22] first proposed deep convolutional neural networks to regress 6-DoF camera pose in real-time. A pose vector p was predicted by PoseNet, given by the 3D position x and orientation represented by quaternion q of a camera, namely, p = [x, q]. For constructing the training dataset, the labels are automatically calculated from a video of the scenario using a structure from motion method [183]. Inspired by PoseNet [22], the following works improved the extrinsic calibration in terms of the intermediate representation, interpretability, data format, learning objective, etc. For example, to optimize the geometric pose objective, DeepFEPE [112] designed an end-to-end keypoint-based framework with learnable modules for detection, feature extraction, matching, and outlier rejection. Such a pipeline imitated the traditional baseline, in which the final performance can be analyzed and improved by the intermediate differentiable module. To bridge the domain gap between the extrinsic objective and image features, recent works proposed to first learn an intermediate representation from the input, such as surface geometry [86], depth map [134], directional probability distribution [148], and normal flow [167], etc. The extrinsics are then inferred from geometric constraints and the learned representation. Therefore, the neural networks are gradually guided to perceive the geometry-related features, which are crucial for extrinsic estimation. Considering privacy concerns and limited storage, some recent works compressed the scene and exploited point-like features to estimate the extrinsics. For example, Do et al. [164] trained a network to recognize sparse but significant 3D points, dubbed scene landmarks, by encoding their appearance as implicit features. The camera pose can then be calculated using a robust minimal solver followed by a Levenberg-Marquardt-based nonlinear refinement. Sce-

TABLE 1
Details of the learning-based camera calibration methods and their extended applications from 2015 to 2022, including the method abbreviation, publication venue, calibration objective, network architecture, loss function, dataset, evaluation metrics, learning strategy, platform, and whether the training data are simulated (an X in the last column marks simulated training data). Rows are grouped by year, which is given at the start of the first row of each group. For the learning strategies, SL, USL, WSL, Semi-SL, SSL, and RL denote supervised learning, unsupervised learning, weakly-supervised learning, semi-supervised learning, self-supervised learning, and reinforcement learning, respectively.
Method Publication Objective Network Loss Function Dataset Evaluation Learning Platform Simulation
2015 DeepFocal [21] ICIP Standard AlexNet L2 loss 1DSfM [60] Accuracy SL Caffe
PoseNet [22] ICCV Standard GoogLeNet L2 loss Cambridge Landmarks [61] Accuracy SL Caffe
2016 DeepHorizon [62] BMVC Standard GoogLeNet Huber loss HLW [63] Accuracy SL Caffe
DeepVP [36] CVPR Standard AlexNet Logistic loss YUD [64], ECD [65], HLW [63] Accuracy SL Caffe
Rong et al. [23] ACCV Distortion AlexNet Softmax loss ImageNet [66] Line length SL Caffe X
DHN [25] RSSW Cross-View VGG L2 loss MS-COCO [67] MSE SL Caffe X
2017 CLKN [68] CVPR Cross-View CNNs Hinge loss MS-COCO [67] MSE SL Torch X
HierarchicalNet [69] ICCVW Cross-View VGG L2 loss MS-COCO [67] MSE SL TensorFlow X
URS-CNN [24] CVPR Distortion CNNs L2 loss Sun [70], Oxford [71], Zubud [72], LFW [73] PSNR, RMSE SL Torch X
RegNet [27] IV Cross-Sensor CNNs L2 loss KITTI [74] MAE SL Caffe X
2018 Hold-Geoffroy et al. [26] CVPR Standard DenseNet Entropy loss SUN360 [75] Human sensitivity SL -
DeepCalib [37] CVMP Distortion Inception-V3 Logcosh loss SUN360 [75] Mean error SL TensorFlow X
FishEyeRecNet [76] ECCV Distortion VGG L2 loss ADE20K [77] PSNR, SSIM SL Caffe X
Shi et al. [78] ICPR Distortion ResNet L2 loss ImageNet [66] MSE SL PyTorch X
DeepFM [79] ECCV Cross-View ResNet L2 loss T&T [80], KITTI [74], 1DSfM [60] F-score, Mean SL PyTorch X
Poursaeed et al. [81] ECCVW Cross-View CNNs L1 , L2 loss KITTI [74] EPI-ABS, EPI-SQR SL -
UDHN [49] RAL Cross-View VGG L1 loss MS-COCO [67] RMSE USL TensorFlow X
PFNet [82] ACCV Cross-View FCN Smooth L1 loss MS-COCO [67] MAE SL TensorFlow X
CalibNet [83] IROS Cross-Sensor ResNet Point cloud distance, L2 loss KITTI [74] Geodesic distance, MAE SL TensorFlow X
Chang et al. [84] ICRA Standard AlexNet Cross-entropy loss DeepVP-1M [84] MSE, Accuracy SL Matconvnet
2019 Lopez et al. [85] CVPR Distortion DenseNet Bearing loss SUN360 [75] MSE SL PyTorch
UprightNet [86] ICCV Standard U-Net Geometry loss InteriorNet [87], ScanNet [88], SUN360 [75] Mean error SL PyTorch
Zhuang et al. [89] IROS Distortion ResNet L1 loss KITTI [74] Mean error, RMSE SL PyTorch X
SSR-Net [55] PRL Cross-View ResNet L2 loss MS-COCO [67] MAE SSL PyTorch X
Abbas et al. [90] ICCVW Cross-View CNNs Softmax loss CARLA [91] AUC [92], Mean error SL TensorFlow X
DR-GAN [31] TCSVT Distortion GANs Perceptual loss MS-COCO [67] PSNR, SSIM SL TensorFlow X
STD [93] TCSVT Distortion GANs+CNNs Perceptual loss MS-COCO [67] PSNR, SSIM SL TensorFlow X
Deep360Up [94] VR Standard DenseNet Log-cosh loss [95] SUN360 [75] Mean error SL - X
UnFishCor [54] JVCIR Distortion VGG L1 loss Places2 [96] PSNR, SSIM USL TensorFlow X
BlindCor [34] CVPR Distortion U-Net L2 loss Places2 [96] MSE SL PyTorch X
RSC-Net [97] CVPR Distortion ResNet L1 loss KITTI [74] Mean error SL PyTorch X
Xue et al. [98] CVPR Distortion ResNet L2 loss Wireframes [99], SUNCG [100] PSNR, SSIM, RPE SL PyTorch X
Zhao et al. [43] ICCV Distortion VGG+U-Net L1 loss Self-constructed+BU-4DFE [101] Mean error SL - X
NeurVPS [102] NeurIPS Standard CNNs Binary cross entropy, chamfer-L2 loss ScanNet [88], SU3 [103] Angle accuracy SL PyTorch
2020 Sha et al. [104] CVPR Cross-View U-Net Cross-entropy loss World Cup 2014 [105] IoU SL TensorFlow
Lee et al. [106] ECCV Standard PointNet + CNNs Cross-entropy loss Google Street View [107], HLW [63] Mean error, AUC [92] SL -
MisCaliDet [108] ICRA Distortion CNNs L2 loss KITTI [74] MSE SL TensorFlow X
DeepPTZ [109] WACV Distortion Inception-V3 L1 loss SUN360 [75] Mean error SL PyTorch X
MHN [110] CVPR Cross-View VGG Cross-entropy loss MS-COCO [67], Self-constructed MAE SL TensorFlow X
Davidson et al. [111] ECCV Standard FCN Dice loss SUN360 [75] Accuracy SL - X
CA-UDHN [50] ECCV Cross-View FCN + ResNet Triplet loss Self-constructed MSE USL PyTorch
DeepFEPE [112] IROS Standard VGG + PointNet L2 loss KITTI [74], ApolloScape [113] Mean error SL PyTorch
DDM [32] TIP Distortion GANs L1 loss MS-COCO [67] PSNR, SSIM SL TensorFlow X
Li et al. [114] TIP Distortion CNNs Cross-entropy, L1 loss CelebA [115] Cosine distance SL - X
PSE-GAN [116] ICPR Distortion GANs L1 , WGAN loss Places2 [96] MSE SL - X
RDC-Net [117] ICIP Distortion ResNet L1 , L2 loss ImageNet [66] PSNR, SSIM SL PyTorch X
FE-GAN [118] ICASSP Distortion GANs L1 , GAN loss Wireframe [99], LSUN [119] PSNR, SSIM, RMSE SSL PyTorch X
RDCFace [120] CVPR Distortion ResNet Cross-entropy, L2 loss IMDB-Face [121] Accuracy SL - X
LaRecNet [122] arXiv Distortion ResNet L2 loss Wireframes [99], SUNCG [100] PSNR, SSIM, RPE SL PyTorch X
Baradad et al. [123] CVPR Standard CNNs L2 loss ScanNet [88], NYU [124], SUN360 [75] Mean error, RMS SL PyTorch
Zheng et al. [125] CVPR Standard CNNs L1 loss FocaLens [126] Mean error, PSNR, SSIM SL - X
Zhu et al. [48] ECCV Standard CNNs + PointNet L1 loss SUN360 [75], MS-COCO [67] Mean error, Accuracy WSL PyTorch X
DeepUnrollNet [46] CVPR Distortion FCN L1 , perceptual, total variation loss Carla-RS [46], Fastec-RS [46] PSNR, SSIM SL PyTorch X
RGGNet [127] RAL Cross-Sensor ResNet Geodesic distance loss KITTI [74] MSE, MSEE, MRR SL TensorFlow X
CalibRCNN [128] IROS Cross-Sensor RNNs L2 , Epipolar geometry loss KITTI [74] MAE SL TensorFlow X
SSI-Calib [129] ICRA Cross-Sensor CNNs L2 loss Pascal VOC 2012 [130] Mean/standard deviation SL TensorFlow X
SOIC [131] arXiv Cross-Sensor ResNet + PointRCNN Cost function KITTI [74] Mean error SL -
NetCalib [132] ICPR Cross-Sensor CNNs L1 loss KITTI [74] MAE SL PyTorch X
SRHEN [133] ACM-MM Cross-View CNNs L2 loss MS-COCO [67], SUN397 [75] MACE SL - X
2021 StereoCaliNet [134] TCI Standard U-Net L1 loss TAUAgent [135], KITTI [74] Mean error SL PyTorch X
CTRL-C [136] ICCV Standard Transformer Cross-entropy, L1 loss Google Street View [107], SUN360 [75] Mean error, AUC [92] SL PyTorch X
Wakai et al. [137] ICCVW Distortion DenseNet Smooth L1 loss StreetLearn [138] Mean error, PSNR, SSIM SL - X
OrdinalDistortion [139] TIP Distortion CNNs Smooth L1 loss MS-COCO [67] PSNR, SSIM, MDLD SL TensorFlow X
PolarRecNet [140] TCSVT Distortion VGG + U-Net L1 , L2 loss MS-COCO [67], LMS [141] PSNR, SSIM, MSE SL PyTorch X
DQN-RecNet [58] PRL Distortion VGG L2 loss Wireframes [99] PSNR, SSIM, MSE RL PyTorch X
Tan et al. [44] CVPR Distortion U-Net L2 loss Self-constructed Accuracy SL PyTorch
PCN [142] CVPR Distortion U-Net L1 , L2 , GAN loss Places2 [96] PSNR, SSIM, FID, CW-SSIM SL PyTorch X
DaRecNet [33] ICCV Distortion U-Net Smooth L1 , L2 loss ADE20K [77] PSNR, SSIM SL PyTorch X
DLKFM [143] CVPR Cross-View Siamese-Net L2 loss MS-COCO [67], Google Earth, Google Map MSE SL TensorFlow X
LocalTrans [144] ICCV Cross-View Transformer L1 loss MS-COCO [67] MSE, PSNR, SSIM SL PyTorch X
BasesHomo [51] ICCV Cross-View ResNet Triplet loss CA-UDHN [50] MSE USL PyTorch
ShuffleHomoNet [145] ICIP Cross-View ShuffleNet L2 loss MS-COCO [67] RMSE SL TensorFlow X
DAMG-Homo [41] TCSVT Cross-View CNNs L1 loss MS-COCO [67], UDIS [146] RMSE, PSNR, SSIM SL TensorFlow X
SA-MobileNet [147] BMVC Standard MobileNet Cross-entropy loss SUN360 [75], ADE20K [77], NYU [124] MAE, Accuracy SL TensorFlow X
SPEC [45] ICCV Standard ResNet Softargmax-L2 loss Self-constructed W-MPJPE, PA-MPJPE SL PyTorch X
DirectionNet [148] CVPR Standard U-Net Cosine similarity loss InteriorNet [87], Matterport3D [149] Mean and median error SL TensorFlow X
JCD [150] CVPR Distortion FCN Charbonnier [151], perceptual loss BS-RSCD [150], Fastec-RS [46] PSNR, SSIM, LPIPS SL PyTorch
LCCNet [152] CVPRW Cross-Sensor CNNs Smooth L1 , L2 loss KITTI [74] MSE SL PyTorch X
CFNet [153] Sensors Cross-Sensor FCN L1 , Charbonnier [151] loss KITTI [74], KITTI-360 [154] MAE, MSEE, MRR SL PyTorch X
Fan et al. [155] ICCV Distortion U-Net L1 , perceptual loss Carla-RS [46], Fastec-RS [46] PSNR, SSIM, LPIPS SL PyTorch
SUNet [156] ICCV Distortion DenseNet + ResNet L1 , perceptual loss Carla-RS [46], Fastec-RS [46] PSNR, SSIM SL PyTorch
SemAlign [157] IROS Cross-Sensor CNNs Semantic alignment loss KITTI [74] Mean/median rotation errors SL PyTorch X
2022 DVPD [38] CVPR Standard CNNs Cross-entropy loss SU3 [103], ScanNet [88], YUD [64], NYU [124] Accuracy, AUC [92] SL PyTorch X
Fang et al. [57] ICRA Standard CNNs L2 loss KITTI [74], EuRoC [158], OmniCam [159] MRE, RMSE SSL PyTorch
CPL [160] ICASSP Standard Inception-V3 L1 loss CARLA [91], CyclistDetection [161] MAE SL TensorFlow X
IHN [162] CVPR Cross-View Siamese-Net L1 loss MS-COCO [67], Google Earth, Google Map MACE SL PyTorch X
HomoGAN [52] CVPR Cross-View GANs Cross-entropy, WGAN loss CA-UDHN [50] Mean error USL PyTorch X
SS-WPC [47] CVPR Distortion Transformer Cross-entropy, L1 loss Tan et al. [44] Accuracy Semi-SL PyTorch
AW-RSC [163] CVPR Distortion CNNs Charbonnier [151], perceptual loss Self-constructed, Fastec-RS [46] PSNR, SSIM SL PyTorch
EvUnroll [39] CVPR Distortion U-Net Charbonnier, perceptual, TV loss Self-constructed, Fastec-RS [46] PSNR, SSIM, LPIPS SL PyTorch
Do et al. [164] CVPR Standard ResNet L2 , Robust angular [165] loss Self-constructed, 7-SCENES [166] Median error, Recall SL PyTorch
DiffPoseNet [167] CVPR Standard CNNs + LSTM L2 loss TartanAir [168], KITTI [74], TUM-RGBD [169] PEE, AEE [170] SSL PyTorch
SceneSqueezer [171] CVPR Standard Transformer L1 loss RobotCar Seasons [172], Cambridge Landmarks [61] Mean error, Recall [170] SL PyTorch
FocalPose [173] CVPR Standard CNNs L1 , Huber loss Pix3D [174], CompCars [175], StanfordCars [175] Median error, Accuracy SL PyTorch
DXQ-Net [176] arXiv Cross-Sensor CNNs + RNNs L1 , geodesic loss KITTI [74], KITTI-360 [154] MSE SL PyTorch X
SST-Calib [42] ITSC Cross-Sensor CNNs L2 loss KITTI [74] QAD, AEAD SL PyTorch X
CCS-Net [177] IROS Distortion U-Net L1 loss TUM-RGBD [169] MAE, RPE SL PyTorch X
FishFormer [40] arXiv Distortion Transformer L2 loss Place2 [96], CelebA [115] PSNR, SSIM, FID SL PyTorch X
SIR [56] TIP Distortion ResNet L1 loss ADE20K [77], Wireframes [99], MS-COCO [67] PSNR, SSIM SSL PyTorch X
ATOP [178] TIV Cross-Sensor CNNs Cross entropy loss Self-constructed + KITTI [74] RRE, RTE SL -
FusionNet [179] ICRA Cross-Sensor CNNs+PointNet L2 loss KITTI [74] MAE SL PyTorch X
RKGCNet [180] TIM Cross-Sensor CNNs+PointNet L1 loss KITTI [74] MSE SL PyTorch X
GenCaliNet [181] ECCV Distortion DenseNet L2 loss StreetLearn [138], SP360 [182] MAE, PSNR, SSIM SL - X
Liu et al. [53] TPAMI Cross-View ResNet Triplet loss Self-constructed MSE, Accuracy USL PyTorch

neSqueezer [171] compressed the scene information at three levels: the database frames are clustered using pairwise co-visibility information, a point selection module prunes each cluster based on estimation performance, and learned quantization further compresses the selected points.

3.3 Joint Intrinsic and Extrinsic Calibration

3.3.1 Geometric Representations

Vanishing Points. The projections of a set of parallel lines in the world intersect at a vanishing point. The detection of vanishing points is a fundamental and crucial challenge in 3D vision. In general, vanishing points reveal the direction of 3D lines, allowing the agent to deduce 3D scene information from a single 2D image.

DeepVP [36] is the first learning-based work for detecting the vanishing points given a single image. It reversed the conventional process by scoring the horizon line candidates according to the vanishing points they contain. Chang et al. [84] recast this task as a CNN classification problem using an output layer with 225 discrete possible vanishing point locations. To construct the dataset, the camera view is panned and tilted in steps of 5° from -35° to 35° in the panorama scene (225 images in total) from a single GPS location. To directly leverage the geometric properties of vanishing points, NeurVPS [102] proposed a canonical conic space and a conic convolution operator that can be implemented as regular convolutions in this space, where the learning model is capable of calculating the global geometric information of vanishing points locally. To overcome the need for a large amount of training data in previous methods, DVPD [38] combined the neural network with two geometric priors: the Hough transformation and the Gaussian sphere. First, the convolutional features are transformed into a Hough domain, mapping lines to distinct bins. The projection of the Hough bins is then extended to the Gaussian sphere, where lines are transformed into great circles and vanishing points are located at the intersection of these circles. Geometric priors are data-efficient because they eliminate the necessity of learning this information from data, which enables an interpretable learning framework and generalizes better to domains with slightly different data distributions.

Horizon Lines. The horizon line is a crucial contextual attribute for various computer vision tasks, especially image metrology, computational photography, and 3D scene understanding. The horizon line is the image-plane projection of the line at infinity of any plane that is perpendicular to the local gravity vector.

Given the FoV, pitch, and roll of a camera, it is straightforward to locate the horizon line in its captured image space. DeepHorizon [62] proposed the first learning-based solution for estimating the horizon line from an image, without requiring any explicit geometric constraints or other cues. To train the network, a new benchmark dataset, Horizon Lines in the Wild (HLW), was constructed, which consists of real-world images with labeled horizon lines. SA-MobileNet [147] proposed an image tilt detection and correction method with a self-attention MobileNet [184] for smartphones. A spatial self-attention module was devised to learn long-range dependencies and global context within the input images. To address the difficulty of the regression task, they trained the network to estimate multiple angles within a narrow interval of the ground truth tilt, penalizing only those values that fall outside this narrow range.

3.3.2 Composite Parameters

Calibrating the composite parameters aims to estimate the intrinsic and extrinsic parameters simultaneously. By jointly estimating composite parameters and training on data from a large-scale panorama dataset [75], Hold-Geoffroy et al. [26] largely outperformed previous independent calibration methods. Moreover, Hold-Geoffroy et al. [26] conducted a human perception study in which the participants were asked to evaluate the realism of 3D objects composited with and without accurate calibration. This data was then used to design a new perceptual measure for calibration errors. In terms of the feature category, Lee et al. [106] and CTRL-C [136] considered both semantic features and geometric cues for camera calibration. They showed that making use of geometric features helps the network comprehend the underlying perspective structure of an image. The pipeline of CTRL-C is illustrated in Figure 3. In recent literature, more applications are jointly studied with camera calibration, for example, single view metrology [48], 3D human pose and shape estimation [45], depth estimation [57], [123], object pose estimation [173], and image reflection removal [125].

Fig. 3. Overview of CTRL-C. The figure is from [136].

Considering the heterogeneity and visual implicitness of different camera parameters, CPL [160] estimated the parameters using a novel camera projection loss, exploiting a camera model neural network to reconstruct the 3D point cloud. The proposed loss addressed the training imbalance problem by representing the errors of different camera parameters in terms of a unified metric.

3.4 Discussion

3.4.1 Technique Summary

The above methods target automatic calibration without manual intervention or scene assumptions. Early literature [21], [22] studied intrinsic calibration and extrinsic calibration separately. Driven by large-scale datasets and powerful networks, subsequent works [26], [36], [62], [136] considered a comprehensive camera calibration, inferring various parameters and geometric representations. To relieve the difficulty of learning the camera parameters, some works [86], [134], [148], [167] proposed to learn an intermediate representation. In recent literature, more applications are jointly studied with camera calibration [45], [48], [57], [123], [125]. This suggests that solving downstream vision tasks, especially 3D tasks, may require prior knowledge of the image formation model. Moreover, some geometric priors [38] can alleviate the data-hungry nature of deep learning, showing the potential to bridge the gap between the calibration target and semantic features.
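As one concrete instance of the image formation prior mentioned above, the relation from Section 3.3.1 between the FoV, pitch, roll, and the horizon line can be written down under a simple pinhole model. The sketch below is illustrative only: the function names, the sign conventions for pitch and roll, the centered principal point, and the square pixels are all assumptions made for this example and are not taken from any surveyed method.

```python
import math

def focal_from_vfov(vfov_deg, height):
    """Focal length in pixels of a pinhole camera with the given vertical FoV."""
    return 0.5 * height / math.tan(math.radians(vfov_deg) / 2.0)

def horizon_line(vfov_deg, pitch_deg, roll_deg, width, height):
    """Return (a, b, c) with a*u + b*v + c = 0 describing the horizon in pixels.

    Assumed conventions (hypothetical): camera axes are x right, y down,
    z forward; positive pitch tilts the camera up; the principal point is
    the image center. Degenerate when pitch or roll is +/-90 degrees (b = 0).
    """
    f = focal_from_vfov(vfov_deg, height)
    cx, cy = width / 2.0, height / 2.0
    theta, psi = math.radians(pitch_deg), math.radians(roll_deg)
    # World "up" direction expressed in the camera frame.
    up = (math.sin(psi) * math.cos(theta),
          -math.cos(psi) * math.cos(theta),
          math.sin(theta))
    # The viewing ray ((u - cx) / f, (v - cy) / f, 1) lies on the horizon
    # exactly when it is orthogonal to "up".
    a, b = up[0], up[1]
    c = f * up[2] - a * cx - b * cy
    return a, b, c

def horizon_v_at(u, line):
    """Row coordinate v of the horizon at column u."""
    a, b, c = line
    return -(a * u + c) / b

# Example: a level camera sees the horizon through the image center row.
level = horizon_line(60.0, 0.0, 0.0, 640, 480)
print(horizon_v_at(100.0, level))
```

Under these assumptions, pitching the camera up moves the horizon down in the image by f·tan(pitch) pixels at zero roll, and a nonzero roll gives the line a slope of tan(roll) in pixel coordinates.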
It is interesting to find that an increasing number of extrinsic calibration methods [112], [164], [171] revisited and restored the traditional feature point-based solutions. The standard extrinsics that describe the camera motion contain limited degrees of freedom, and thus some local features can well represent the spatial correspondence. Besides, networks designed for point learning, such as PointNet [185] and PointCNN [186], significantly improve the efficiency of calibration models. Such a pipeline also enables clear interpretability of learning-based camera calibration, which promotes understanding of how the network calibrates and magnifies the influences of intermediate modules.

3.4.2 Future Effort

(1) Explore more vision/geometric priors. Due to the scarcity of real-world datasets in the learning-based camera calibration field, mining more priors that ease the demand of learning from data is promising. For example, the prior of the image formation model could allow us to relate 3D camera parameters to the 2D image layout.

(2) Decouple different stages in an end-to-end calibration learning model. Most learning-based camera calibration methods include a feature extraction stage and an objective estimation stage. However, how the networks learn the features related to calibration is ambiguous. Therefore, decoupling the learning process according to the stages of traditional calibration can guide the way of feature extraction. It would be meaningful to extend the idea in extrinsic calibration [112], [164], [171] to more general calibration problems.

(3) Transfer the measurement space from the parameter error to the geometric difference. When it comes to jointly calibrating various camera parameters, the training process will suffer from an imbalanced loss optimization problem. The main reason is that different camera parameters correspond to different sample distributions. A simple normalization strategy cannot unify their error spaces. Therefore, we can formulate a straightforward measurement space in terms of the geometric property of different camera parameters.

4 DISTORTION MODEL

In learning-based camera calibration, calibrating the radial distortion and rolling shutter distortion has gained increasing attention due to the wide use of wide-angle lenses and CMOS sensors. In this part, we mainly review the calibration/rectification of these two distortions.

4.1 Radial Distortion

The literature on learning-based radial distortion calibration can be classified into two main categories: regression-based solutions and reconstruction-based solutions.

4.1.1 Regression-based Solution

Rong et al. [23] and DeepCalib [37] are pioneering works for learning-based wide-angle camera calibration. They treated the camera calibration as a supervised classification [23] or regression [37] problem, and then the networks with the convolutional layers and fully connected layers were used to learn the distortion features of inputs and predict the camera parameters. In particular, DeepCalib [37] explored three learning solutions for wide-angle camera calibration as illustrated in Figure 4. Their experiments showed that the simplest architecture, SingleNet, achieves the best performance in both accuracy and efficiency. To enhance the distortion perception of networks, the following works investigated introducing more diverse features such as semantic features [76] and geometry features [98], [120], [122]. Additionally, some works improved the generalization by designing learning strategies such as unsupervised learning [54], self-supervised learning [56], and reinforcement learning [43]. By randomly choosing coefficients throughout each mini-batch of the training process, RDC-Net [117] was able to dynamically generate distorted images on-the-fly. This enhanced the rectification performance and prevented the learning model from overfitting. Instead of contributing to the techniques of deep learning, other works leaned toward exploring vision priors for interpretable calibration. For example, having observed that a radially distorted image is centrally symmetric, with texture farther from the image center being more strongly distorted, Shi et al. [78] and PSE-GAN [116] developed a position-aware weight layer (fixed [78] and learnable [116]) based on this property, enabling the network to explicitly perceive the distortion. Lopez et al. [85] proposed a novel parameterization for radial distortion that is better suited for networks than directly learning the distortion parameters. Furthermore, OrdinalDistortion [139] presented a learning-friendly representation, i.e., ordinal distortion. Compared to the implicit and heterogeneous camera parameters, such a representation can facilitate the distortion perception of the neural network due to its clear relation to the image features.

Fig. 4. Three common learning solutions of the regression-based wide-angle camera calibration: (a) SingleNet, (b) DualNet, (c) SeqNet, where I is the distorted image and f and ξ denote the focal length and distortion parameters, respectively. The figure is from [37].

4.1.2 Reconstruction-based Solution

Inspired by conditional image-to-image translation and dense visual perception, the reconstruction-based solution started to evolve from the conventional regression-based paradigm. DR-GAN [31] is the first reconstruction-based solution for calibrating the radial distortion, which directly models the pixel-wise mapping between the distorted image and the rectified image. It achieved camera parameter-free training and one-stage rectification. Freed from the assumption of a specific camera model, the reconstruction-based solution showed the potential to calibrate various types of cameras in one learning network. For example, DDM [32] unified different camera models into a domain by presenting the distortion distribution map, which explicitly

describes the distortion level of each pixel in a distorted image. Then, the network learned to reconstruct the rectified image using this geometric prior map. To make the mapping function interpretable, the subsequent works [34], [43], [44], [47], [93], [118], [140], [142] developed the displacement field between the distorted image and the rectified image. Such a manner is able to eliminate the generated artifacts in the pixel-level reconstruction. In particular, FE-GAN [118] integrated the geometry prior, like Shi et al. [78] and PSE-GAN [116], into their reconstruction-based solution and presented a self-supervised strategy to learn the distortion flow for wide-angle camera calibration, as shown in Figure 5. Most reconstruction-based solutions exploit a U-Net-like architecture to learn pixel-level mapping. However, the distortion feature can be transferred from encoder to decoder by the skip-connection operation, leading to a blurry appearance and incomplete correction in reconstruction results. To address this issue, Li et al. [114] abandoned the skip-connection in their rectification network. To keep the feature fusion and restrain the geometric difference simultaneously, PCN [142] designed a correction layer in the skip-connection and applied the appearance flows to revise the convolved features in different encoder layers. Having noticed that the previous sampling strategy of the convolution kernel neglected the radial symmetry of distortion, PolarRecNet [140] transformed the distorted image from the Cartesian coordinate domain into the polar coordinate domain.

Fig. 5. Architecture of FE-GAN. The figure is from [118].

4.2 Rolling Shutter Distortion

The existing deep learning calibration works on rolling shutter (RS) distortion can be classified into two categories: single-frame-based [24], [39], [97] and multi-frame-based [46], [150], [155], [156], [163]. The single-frame-based solution studies the case of a single rolling shutter image as input and directly learns to correct the distortion using neural networks. The ideal corrected result can be regarded as the global shutter (GS) image. It is an ill-posed problem and requires some additional prior assumptions to be defined. In contrast, the multi-frame-based solution considers the consecutive frames (two or more) of a video taken by a rolling shutter camera, in which the strong temporal correlation can be investigated for more reasonable correction.

4.2.1 Single-frame-based Solution

URS-CNN [24] is the first learning work for calibrating the rolling shutter camera. In this work, a neural network with long kernel characteristics was used to understand how the scene structure and row-wise camera motion interact. To specifically address the nature of the RS effect produced by the row-wise exposure, the row-kernel and column-kernel convolutions were leveraged to extract attributes along horizontal and vertical axes. RSC-Net [97] improved URS-CNN [24] from 2 degrees of freedom (DoF) to 6-DoF and presented a structure-and-motion-aware RS correction model, where the camera scanline velocity and depth were estimated. Compared to URS-CNN [24], RSC-Net [97] further reasoned about the concealed motion between the scanlines as well as the scene structure, as shown in Figure 6. To bridge the spatiotemporal connection between RS and GS, EvUnroll [39] exploited the neuromorphic events to correct the RS effect. Event cameras can overcome a number of drawbacks of conventional frame-based cameras in dynamic situations with quick motion due to their high temporal resolution with microsecond-level sensitivity.

Fig. 6. Architecture of RSC-Net. The figure is from [97].

Fig. 7. Architecture of AW-RSC. The figure is from [163].

4.2.2 Multi-frame-based Solution

Most multi-frame-based solutions are based on the reconstruction paradigm. They mainly focus on how to represent the dense displacement field between RS and GS images and how to accurately warp the RS domain to the GS domain. For the first time, DeepUnrollNet [46] proposed an end-to-end network for two consecutive rolling shutter images using a differentiable forward warping module. In this method, a motion estimation network is used to estimate the dense displacement field from a rolling shutter image to its matching global shutter image. The second contribution of DeepUnrollNet [46] is to construct two novel datasets: the Fastec-RS dataset and the Carla-RS dataset. Furthermore, JCD [150] jointly considered the rolling shutter correction and deblurring (RSCD) problems, which largely co-exist in the medium and long exposure cases of rolling shutter cameras. It applied bi-directional warping streams to compensate for the displacement while keeping the non-warped deblurring stream to restore details. The authors also contributed a real-world dataset using a well-designed

beam-splitter acquisition system, BS-RSCD, which includes both ego-motion and object motion in dynamic scenes. SUNet [156] extended DeepUnrollNet [46] from the middle time of the second frame (3τ/2) to the intermediate time of two frames (τ). By using PWC-Net [187], SUNet [156] estimated the symmetric undistortion fields and reconstructed the potential GS frames by a time-centered GS image decoder network. To effectively reduce the misalignment between the contexts warped from two consecutive RS images, the context-aware undistortion flow estimator and the symmetric consistency enforcement were designed. To achieve a higher frame rate, Fan et al. [155] generated a GS video from two consecutive RS images based on the scanline-dependent nature of the RS camera. In particular, they first analyzed the inherent connection between bidirectional RS undistortion flow and optical flow, demonstrating that the RS undistortion flow map has a more pronounced scanline dependency than the isotropically smooth optical flow map. Then, they developed the bidirectional undistortion flows to describe the pixel-wise RS-aware displacement, and further devised a computation technique for the mutual conversion between different RS undistortion flows corresponding to various scanlines. To eliminate the inaccurate displacement field estimation and error-prone warping problems in previous methods, AW-RSC [163] proposed to predict multiple fields and adaptively warped the learned RS features into global shutter counterparts. Using a coarse-to-fine approach, these warped features were combined to generate precise global shutter frames, as shown in Figure 7. Compared to previous works [46], [150], [155], [156], the warping operation consisting of adaptive multi-head attention and a convolutional block in AW-RSC [163] is learnable and effective. In addition, AW-RSC [163] contributed a real-world rolling shutter correction dataset: BS-RSC, where the RS videos with corresponding GS ground truth are captured simultaneously with a beam-splitter-based acquisition system.

4.3 Discussion

4.3.1 Technique Summary

The deep learning works on wide-angle camera and rolling shutter calibration share a similar technique pipeline. Along this research trend, most early literature begins with the regression-based solution [23], [24], [37]. The subsequent works recast the traditional calibration from a reconstruction perspective [31], [32], [46], [118], which directly learns the displacement field to rectify the uncalibrated input. For higher calibration accuracy, more intuitive displacement fields and more effective warping strategies have been developed [142], [150], [155], [163]. To fit the distribution of different distortions, some works designed different shapes of the convolutional kernel [24] or transformed the convolved coordinates [140].

Existing works devoted themselves to designing more powerful networks and introducing more diverse features to facilitate calibration performance. An increasing number of methods focused on the geometry priors of the distortion [78], [116], [118]. These priors can be directly weighted into the convolutional layers or used to supervise network training, helping the learning model converge faster.

4.3.2 Future Effort

(1) The development of wide-angle camera calibration and rolling shutter camera calibration can promote each other. For instance, the well-studied multi-frame-based solution in rolling shutter calibration is able to inspire wide-angle calibration. The same object observed in different frames of a sequence could provide useful priors regarding radial distortion. Additionally, the elaborate studies of the displacement field and warping layer [150], [155], [163] have the potential to motivate the development of wide-angle camera calibration and other fields. Furthermore, the investigation of geometric priors in wide-angle calibration could also improve the interpretability of the network in rolling shutter calibration.

(2) Most methods synthesize their training dataset based on random samples from all camera parameters. However, for the images captured by real lenses, the distribution of camera parameters probably lies on a latent manifold [85]. Learning on a label-redundant calibration dataset makes the training process inefficient. Thus, exploring a practical sampling strategy for the synthesized dataset could be a meaningful future direction.

(3) To overcome the ill-posed problem of single-frame calibration, introducing other high-precision sensors, such as event cameras [39], can complement the current calibration performance. With the rapid development of vision sensors, joint calibration using multiple sensors is valuable. Consequently, more cross-modal and multi-modal fusion techniques will be investigated along this research direction.

Fig. 8. Architectures of DHN [25] and UDHN [49]. The figure is from [49].

5 CROSS-VIEW MODEL

The existing deep calibration methods can estimate the specific camera parameters from a single camera. In fact, there can be more complicated parameter representations in multi-camera circumstances. For example, in the multi-view model, the fundamental matrix and essential matrix describe the epipolar geometry, and they are intricately tangled with intrinsics and extrinsics. The homography depicts the pixel-level correspondences between different views. In addition to intrinsics and extrinsics, it is also intertwined with depth. Among these complex parameter representations, homography is the most widely leveraged in practical applications, and its related learning-based methods are the most investigated. To this end, we mainly focus on the review of deep homography estimation solutions for the cross-view model, which can be divided into three categories: direct, cascaded, and iterative solutions.
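The remark above that the homography is intertwined with intrinsics, extrinsics, and depth can be illustrated with the classical plane-induced homography. The following is a minimal sketch under stated assumptions, not a method from the surveyed works: it assumes the convention X2 = R·X1 + t for the relative pose, a scene plane n·X1 = d expressed in the first camera frame, and identical intrinsics for both views. All function names and numeric values are hypothetical and chosen only for illustration.

```python
import math

def matmul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def intrinsics(f, cx, cy):
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]

def intrinsics_inv(f, cx, cy):
    # Closed-form inverse of the simple intrinsic matrix above.
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]

def rot_y(angle_rad):
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def plane_homography(K2, K1_inv, R, t, n, d):
    """H = K2 (R + t n^T / d) K1^{-1}, valid for 3D points with n . X1 = d."""
    M = [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul(matmul(K2, M), K1_inv)

def project(K, X):
    """Pinhole projection of a 3D point to pixel coordinates."""
    x = matvec(K, X)
    return (x[0] / x[2], x[1] / x[2])

def apply_homography(H, uv):
    x = matvec(H, [uv[0], uv[1], 1.0])
    return (x[0] / x[2], x[1] / x[2])

def transform(R, t, X):
    """Map a point from the first camera frame to the second: R X + t."""
    RX = matvec(R, X)
    return [RX[i] + t[i] for i in range(3)]

# Toy setup (all values are made up for illustration).
K = intrinsics(500.0, 320.0, 240.0)
K_inv = intrinsics_inv(500.0, 320.0, 240.0)
R, t = rot_y(0.1), [0.2, 0.0, 0.05]
n, d = [0.0, 0.0, 1.0], 2.0  # the plane z = 2 in the first camera frame
H = plane_homography(K, K_inv, R, t, n, d)

on_plane = [0.3, -0.2, 2.0]
off_plane = [0.3, -0.2, 3.0]
for X in (on_plane, off_plane):
    predicted = apply_homography(H, project(K, X))
    actual = project(K, transform(R, t, X))
    print(predicted, actual)
```

In this toy setup the homography reproduces the second-view pixel for the point on the plane, while the point at a different depth is mapped to a visibly different location. This is the parallax effect that makes a single homography insufficient for non-planar scenes when the camera translates.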

5.1 Direct Solution

We review the direct deep homography solutions from the perspective of different parameterizations, including the classical 4-pt parameterization and other parameterizations.

5.1.1 4-pt Parameterization

Deep homography estimation was first proposed in DHN [25], where a VGG-style network is adopted to predict the 4-pt parameterization $H_{4pt}$. To train and evaluate the network, a synthetic dataset named Warped MS-COCO is created to provide the ground truth 4-pt parameterization $\hat{H}_{4pt}$. The pipeline is illustrated in Fig. 8(a), and the objective function is formulated as $\mathcal{L}_H$:

$$\mathcal{L}_H = \frac{1}{2} \| H_{4pt} - \hat{H}_{4pt} \|_2^2 . \tag{2}$$

Then the 4-pt parameterization can be solved as a $3 \times 3$ homography matrix using normalized DLT [188]. However, DHN is limited to synthetic datasets, where the ground truth can be generated for free; otherwise it requires costly labeling of real-world datasets. Subsequently, the first unsupervised solution, named UDHN [49], was proposed to address this problem. As shown in Fig. 8(c), it used the same network architecture as DHN and defined an unsupervised loss function by minimizing the average photometric error, motivated by traditional methods [189]:

$$\mathcal{L}_{PW} = \| \mathcal{P}(I_A(x)) - \mathcal{P}(I_B(\mathcal{W}(x; p))) \|_1 , \tag{3}$$

where $\mathcal{W}(\cdot; \cdot)$ and $\mathcal{P}(\cdot)$ denote the operations of warping via homography parameters $p$ and extracting an image patch, respectively. $I_A$ and $I_B$ are the original images with overlapping regions. The input of UDHN is a pair of image patches, but it warps the original images when calculating the loss. In this manner, it avoids the adverse effects of invalid pixels after warping and lifts the magnitude of pixel supervision. To gain accuracy and speed with a tiny model, Chen et al. proposed ShuffleHomoNet [145], which integrates ShuffleNet compressed units [190] and location-aware pooling [81] into a lightweight model. To further handle large displacement, a multi-scale weight-sharing version is exploited by extracting multi-scale feature representations and adaptively fusing multi-scale predictions. However, the homography cannot perfectly align images with parallax caused by non-planar structures with non-overlapping camera centers. To deal with parallax, CA-UDHN [50] designs learnable attention masks to overlook the parallax regions, contributing to better background plane alignment. Besides, the 4-pt homography can be extended to meshflow [53] to realize accurate non-planar alignment.

5.1.2 Other Parameterizations

In addition to the 4-pt parameterization, the homography can be parameterized in other forms. To better utilize homography invertibility, Wang et al. proposed SSR-Net [55]. They established the invertibility constraint through a conventional matrix representation in a cyclic manner. Zeng et al. [82] argued that the 4-point parameterization regressed by a fully-connected layer can harm the spatial order of the corners and be susceptible to perturbations, since four points are the minimum requirement to solve the homography. To address these issues, they formulated the parameterization as a perspective field (PF) that models pixel-to-pixel bijection and designed a PFNet. This extends the displacements of the four vertices to as many dense pixel points as possible. The homography can then be solved using RANSAC [191] with outlier filtering, enabling robust estimation by utilizing dense correspondences. Nevertheless, dense correspondences lead to a significant increase in the computational complexity of RANSAC. Furthermore, Ye et al. [51] proposed an 8-DOF flow representation without extra post-processing, which has a size of $H \times W \times 2$ in an 8D subspace constrained by the homography. To represent arbitrary homography flows in this subspace, 8 flow bases are defined, and the proposed BasesHomo predicts the coefficients for the flow bases. To obtain desirable bases, BasesHomo first generates 8 homography flows by modifying every single entry of an identity homography matrix except for the last entry. Then, these flows are normalized by their largest flow magnitude, followed by a QR decomposition, which makes all the bases normalized and orthogonal.

Fig. 9. Architecture of HomoGAN. The figure is from [52].

5.2 Cascaded Solution

Direct solutions explore various homography parameterizations with simple network structures, while the cascaded ones focus on complex designs of network architectures.

In HierarchicalNet [69], Nowruzi et al. held that the warped images can be regarded as the input of another network. Therefore, they stacked the networks sequentially to reduce the error bounds of the estimate. Based on HierarchicalNet, SRHEN [133] introduced the cost volume [187] to the cascaded network, measuring the feature correlation by cosine distance and formulating it as a volume. The stacked networks and cost volume increase the performance, but they cannot handle dynamic scenes. MHN [110] developed a multi-scale neural network and proposed to learn homography estimation and dynamic content detection simultaneously. Moreover, to tackle the cross-resolution problem, LocalTrans [144] formulated it as a multimodal problem and proposed a local transformer network embedded within a multiscale structure to explicitly learn correspondences between the multimodal inputs. These inputs include images with different resolutions, and LocalTrans achieved superior performance on cross-resolution cases with a resolution gap of up to 10x. All the solutions mentioned above leverage image pyramids to progressively enhance the ability to

address large displacements. However, every image pair at each level requires a unique feature extraction network, resulting in redundant feature maps. To alleviate this problem, some researchers [41], [52], [146], [192] replaced image pyramids with feature pyramids. Specifically, they warped the feature maps directly instead of images to avoid excessive feature extraction networks. To address the low-overlap homography estimation problem in real-world images, Nie et al. [146] modified the unsupervised constraint (Eq. 3) to adapt to low-overlap scenes:

$\mathcal{L}'_{PW} = \| I_A(x) \cdot \mathbb{1}(\mathcal{W}(x; p)) - I_B(\mathcal{W}(x; p)) \|_1, \quad (4)$

where $\mathbb{1}$ is an all-one matrix with the same size as $I_A$ or $I_B$. It solved the low-overlap problem by taking the original images as network input and masking out the pixels of $I_A$ that correspond to the invalid pixels of the warped $I_B$. To solve the non-planar homography estimation problem, DAMG-Homo [41] proposed backward multi-grid deformation with contextual correlation to align parallax images. Compared with the traditional cost volume, the proposed contextual correlation helped to reach better accuracy with lower computational complexity. Another way to address the non-planar problem is to focus on the dominant plane. In HomoGAN [52], an unsupervised GAN is proposed to impose a coplanarity constraint on the predicted homography, as shown in Figure 9. To implement this approach, a generator is used to predict masks of aligned regions, while a discriminator is used to determine whether two masked feature maps were produced by a single homography.

5.3 Iterative Solution
Compared with cascaded methods, iterative solutions achieve higher accuracy by iteratively optimizing the last estimation. The Lucas-Kanade (LK) algorithm [189] is usually used in image registration to iteratively estimate parameterized warps, such as affine transformation, optical flow, etc. It computes an incremental update $\Delta p$ of the warp parameters at every iteration by minimizing the sum of squared errors between a template image $T$ and an input image $I$:

$E(\Delta p) = \| T(x) - I(\mathcal{W}(x; p + \Delta p)) \|_2^2. \quad (5)$

However, when optimizing Eq. 5 using a first-order Taylor expansion, $\partial I(\mathcal{W}(x; p))/\partial p$ must be recomputed at every iteration because $I(\mathcal{W}(x; p))$ varies with $p$. To avoid this problem, the inverse compositional (IC) LK algorithm [193], which is equivalent to the LK algorithm, can be used to reformulate the optimization goal as follows:

$E'(\Delta p) = \| T(\mathcal{W}(x; \Delta p)) - I(\mathcal{W}(x; p)) \|_2^2. \quad (6)$

After linearizing Eq. 6 with a first-order Taylor expansion, we compute $\partial T(\mathcal{W}(x; 0))/\partial p$ instead of $\partial I(\mathcal{W}(x; p))/\partial p$, which does not vary across iterations.
To combine the advantages of deep learning with the IC-LK iterator, CLKN [68] conducted LK iterative optimization on semantic feature maps extracted by CNNs as follows:

$E^{f}(\Delta p) = \| F_T(\mathcal{W}(x; \Delta p)) - F_I(\mathcal{W}(x; p)) \|_2^2, \quad (7)$

where $F_T$ and $F_I$ are the feature maps of the template and input images. Then, they enforced the network to run a single iteration with a hinge loss, while the network runs multiple iterations until the stopping condition is met in the testing stage. Besides, CLKN stacked three similar LK networks to further boost the performance by treating the output of the last LK network as the initial warp parameters of the next LK network. From Eq. 7, the IC-LK algorithm heavily relied on feature maps, which tend to fail in multimodal images. Instead, DLKFM [143] constructed a single-channel feature map by using the eigenvalues of the local covariance matrix on the output tensor. To learn DLKFM, it designed two special constraint terms to align multimodal feature maps and contribute to convergence.
However, LK-based algorithms can fail if the Jacobian matrix is rank-deficient [194]. Additionally, the IC-LK iterator is untrainable, which means this drawback is theoretically unavoidable. To address this issue, a completely trainable iterative homography network (IHN) [162] was proposed. Inspired by RAFT [195], IHN updates the cost volume to refine the estimated homography using the same estimator repeatedly at every iteration. Furthermore, IHN can handle dynamic scenes by producing an inlier mask in the estimator without requiring extra supervision.

5.4 Discussion
5.4.1 Technique Summary
The above works are devoted to exploring different homography parameterizations such as the 4-pt parameterization [25], perspective field [82], and motion bases representation [51], which contribute to better convergence and performance. Other works tend to design various network architectures. In particular, cascaded and iterative solutions are proposed to refine the performance progressively, and they can be further combined to reach higher accuracy. To make the methods more practical, various challenging problems are preliminarily addressed, such as cross resolutions [144], multiple modalities [143], [162], dynamic objects [110], [162], and non-planar scenes [41], [50], [52], etc.

5.4.2 Challenge and Future Effort
We summarize the existing challenges as follows:
(1) Many homography estimation solutions are designed for fixed resolutions, while real-world applications often involve much more flexible resolutions. When pre-trained models are applied to images with different resolutions, performance can dramatically drop due to the need for input resizing to satisfy the regulated resolution.
(2) Unlike optical flow estimation, which assumes small motions between images, homography estimation often deals with images that have significantly low-overlap rates. In such cases, existing methods may exhibit inferior performance due to limited receptive fields.
(3) Existing methods address the parallax or dynamic objects by learning to reject outliers in the feature extractor [50], cost volume [196], or estimator [162]. However, it is still unclear which stage is more appropriate for outlier rejection.
Based on the challenges we have discussed, some potential research directions for future efforts can be identified:
(1) To overcome the first challenge, we can design various strategies to enhance resolution robustness, such as resolution-related data augmentation, and continual learning on multiple datasets with different resolutions. Besides,

we can also formulate a resolution-free parameterization form. The perspective field [82] is a typical case, which represents the homography as dense correspondences with the same resolution as the input images. But it requires RANSAC as the post-processing approach, introducing extra computational complexity, especially in the case of extensive correspondences. Therefore, a resolution-free and efficient parameterization form should be explored.
(2) To enhance the performance at low-overlap rates, the main insight is to increase the receptive fields of a network. To this end, the cross-attention module of the transformer explicitly leverages the long-range correlation to eliminate short-range inductive bias [197]. On the other hand, we can exploit beneficial varieties of cost volume to integrate feature correlation [41], [162].
(3) As there is no interaction between different image features in the feature extractor, it is reasonable to assume that outlier rejection should occur after feature extraction. It is not possible to identify outliers within a single image as the depth alone cannot be used as an outlier cue. For example, images captured by purely rotated cameras do not contain parallax outliers. Additionally, it seems intuitive to learn the capability of outlier rejection by combining global and local correlation, similar to the insight of RANSAC.

6 CROSS-SENSOR MODEL
Multi-sensor calibration estimates intrinsic and extrinsic parameters of multiple sensors like cameras, LiDARs, and IMUs. This ensures that data from different sensors are synchronized and registered in a common coordinate system, allowing them to be fused together for a more accurate representation of the environment. Accurate multi-sensor calibration is crucial for applications like autonomous driving and robotics, where reliable sensor fusion is necessary for safe and efficient operation.
In this part, we mainly review the literature on learning-based camera-LiDAR calibration, i.e., predicting the 6-DoF rigid body transformation between a camera and a 3D LiDAR, without requiring any presence of specific features or landmarks in the implementation. Like calibration works on other types of cameras/systems, this research field can also be classified into regression-based solutions and flow/reconstruction-based solutions. However, we prefer to follow the special matching principle in camera-LiDAR calibration and divide the existing learning-based literature into three categories: pixel-level solution, semantics-level solution, and object/keypoint-level solution.

6.1 Pixel-level Solution
The first deep learning technique in camera-LiDAR calibration, RegNet [27], used CNNs to combine feature extraction, feature matching, and global regression to infer the 6-DoF extrinsic parameters. It processed the RGB and LiDAR depth map separately in two parallel network streams. Then, a specific correlation layer was proposed to convolve the stacked LiDAR and RGB features as a joint representation. After this feature matching, the global information fusion and parameter regression were achieved by two fully connected layers with a Euclidean loss function. Motivated by this work, the subsequent works made a further step into more accurate camera-LiDAR calibration in terms of the geometric constraint [83], [128], temporal correlation [128], loss design [127], feature extraction [179], feature matching [132], [152], feature fusion [179], and calibration representation [153], [176].

Fig. 10. Network architecture of CalibNet. The figure is from [83].

For example, as shown in Figure 10, CalibNet [83] designed a network to predict calibration parameters that maximize the geometric and photometric consistency of images and point clouds, solving the underlying physical problem by 3D Spatial Transformers [198]. To refine the calibration model, CalibRCNN [128] presented a synthetic view and an epipolar geometry constraint to measure the photometric and geometric inaccuracies between consecutive frames, in which the temporal information learned by an LSTM network was investigated in learning-based camera-LiDAR calibration for the first time. Since the output space of the LiDAR-camera calibration is on the 3D Special Euclidean Group (SE(3)) rather than the normal Euclidean space, RGGNet [127] considered Riemannian geometry constraints in the loss function, namely, it used an SE(3) geodesic distance equipped with left-invariant Riemannian metrics to optimize the calibration network. LCCNet [152] exploited the cost volume layer to learn the correlation between the image and the depth transformed by the point cloud. Because the depth map ignores the 3D geometric structure of the point cloud, FusionNet [179] leveraged PointNet++ [199] to directly learn the features from the 3D point cloud. Subsequently, a feature fusion with Ball Query [199] and an attention strategy was proposed to effectively fuse the features of images and point clouds.
CFNet [153] first proposed the calibration flow for camera-LiDAR calibration, which represents the deviation between the positions of initial projected 2D points and ground truth. Compared to directly predicting extrinsic parameters, learning the calibration flow helped the network to understand the underlying geometric constraint. To build precise 2D-3D correspondences, CFNet [153] corrected the originally projected points using the estimated calibration flow. Then the efficient Perspective-n-Point (EPnP) algorithm was applied with RANSAC to calculate the final extrinsic parameters. Because RANSAC is nondifferentiable, DXQ-Net [176] further presented a probabilistic model for LiDAR-camera calibration flow, which estimates the uncertainty to measure the quality of LiDAR-camera data association. Then, the differentiable pose estimation module was designed for solving extrinsic parameters, back-propagating

the extrinsic error to the flow prediction network.

6.2 Semantics-level Solution
Semantic features can be well learned and represented by deep neural networks. A perfect calibration accurately aligns the same instance across different sensors. To this end, some works [42], [129], [131], [157] explored guiding the camera-LiDAR calibration with semantic information. SOIC [131] performed calibration by using semantic information to transform the initialization issue into a PnP problem on semantic centroids. Since the 3D semantic centroids of the point cloud and the 2D semantic centroids of the image cannot match precisely, a matching constraint cost function based on the semantic components was presented. SSI-Calib [129] reformulated the calibration as an optimization problem with a novel calibration quality metric based on semantic features. Then, a non-monotonic subgradient ascent algorithm was proposed to calculate the calibration parameters. Other works utilized off-the-shelf segmentation networks for point cloud and image, and optimized the calibration parameters by minimizing the semantic alignment loss in a single direction [157] and in both directions [42].

6.3 Object/Keypoint-level Solution
ATOP [178] designed an attention-based object-level matching network, i.e., the Cross-Modal Matching Network, to explore the overlapped FoV between camera and LiDAR, which facilitated generating the 2D-3D object-level correspondences. 2D and 3D object proposals were detected by YOLOv4 [200] and PointPillar [201]. Then, two cascaded PSO-based algorithms [202] were devised to estimate the calibration extrinsic parameters in the optimization stage. Using the deep declarative network (DDN) [203], RKGCNet [180] combined the standard neural layer and a PnP solver in the same network, formulating the 2D-3D data association and pose estimation as a bilevel optimization problem. Therefore, both the feature extraction capability of the convolutional layer and the conventional geometric solver can be employed. Microsoft's human keypoint extraction network [204] was applied to detect the 2D-3D matching keypoints. Additionally, RKGCNet [180] presented a learnable weight layer that determines the keypoints involved in the solver, enabling the whole pipeline to be trained end-to-end.

6.4 Discussion
6.4.1 Technique Summary
Current methods can be broadly classified based on the principle of building 2D and 3D matching, namely, the calibration target. In summary, most pixel-level solutions utilized the end-to-end framework to address this task. While these solutions delivered satisfactory performance on specific datasets, their generalization abilities are limited. Semantics-level and object/keypoint-level methods derived from traditional calibration offered both acceptable performance and generalization abilities. However, they heavily relied on the quality of front-end feature extraction.

6.4.2 Research Trend
(1) Network architecture is becoming more complex with the use of different structures for feature extraction, matching, and fusion. Current methods employ strategies like multi-scale feature extraction, cross-modal interaction, cost-volume establishment, and confidence-guided fusion.
(2) Directly regressing 6-DoF parameters yields weak generalization ability. To overcome this, intermediate representations like calibration flow have been introduced. Additionally, calibration flow can handle non-rigid transformations that are common in real-world applications.
(3) Traditional methods require specific environments but have well-designed strategies. To balance accuracy and generalization, a combination of geometric solving algorithms and learning methods has been investigated.

6.4.3 Future Effort
(1) Camera-LiDAR calibration methods typically rely on datasets like KITTI, which provide only initial extrinsic parameters. To create a decalibration dataset, researchers add noise transformations to the initial extrinsics, but this approach assumes a fixed-position camera-LiDAR system with miscalibration. In real-world applications, the camera-LiDAR relative pose varies, making it challenging to collect large-scale real data with ground truth extrinsics. To address this challenge, generating synthetic camera-LiDAR data using simulation systems could be a valuable solution.
(2) To optimize the combination of networks and traditional solutions, a more compact approach is needed. Current methods mainly use networks as feature extractors, resulting in non-end-to-end pipelines with inadequate feature extraction adjustments for calibration. A deep declarative network (DDN) is a promising framework for making the entire pipeline differentiable. The aggregation of learning and traditional methods can be optimized using DDN.
(3) The most important aspect of camera-LiDAR calibration is 2D-3D matching. To achieve this, the point cloud is commonly transformed into a depth image. However, large deviations in extrinsic simulation can result in detail loss. With the great development of Transformer and cross-modal techniques, we believe leveraging Transformer to directly learn the features of image and point cloud in the same pipeline could facilitate better 2D-3D matching.

7 BENCHMARK
As there is no public and unified benchmark in learning-based camera calibration, we contribute a dataset that can serve as a platform for generalization evaluations. In this dataset, the images and videos are captured by different cameras under diverse scenes, including simulation environments and real-world settings. Additionally, we provide the calibration ground truth, parameter label, and visual clues in this dataset based on different conditions. Figure 11 shows some samples of our collected dataset.
Standard Model. We collected 300 high-resolution images on the Internet, captured by popular digital cameras such as Canon, Fujifilm, Nikon, Olympus, Sigma, Sony, etc. For each image, we provide the specific focal length of its lens. We have included a diverse range of subjects, including landscapes, portraits, wildlife, architecture, etc. The focal length ranges from 4.5mm to 600mm.
Distortion Model. We created a comprehensive dataset for the distortion camera model, with a focus on wide-angle cameras. The dataset is comprised of three subcategories.

[Figure 11. Panels: Distortion Model (vehicular fisheye lens system with front, rear, left, and right cameras, unit: cm; calibration data; synthetic wide-angle image), Cross-View Model (MS-COCO, Google Earth, Google Map, CAHomo), Standard Pinhole Model (sample images labeled with focal lengths from 15mm to 600mm), and Cross-Sensor Model (Apollo, DAIR-V2X, KITTI, ONCE, NuScenes).]

Fig. 11. Overview of our collected benchmark, which covers all models reviewed in this paper. In this dataset, the image and video derive from
diverse cameras under different environments. The accurate ground truth and label are provided for each sample.
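
The decalibration data discussed in Section 6.4.3, and the random disturbance applied to the cross-sensor part of this benchmark, both amount to perturbing a known extrinsic matrix with a random rigid transformation. The following is a minimal sketch of that idea, not the procedure used by the authors or by RegNet [27]: the function names, the axis-angle sampling scheme, and the left-multiplication convention are all assumptions made for illustration. Only the bounds (about 20 degrees and 1.5 meters) come from the text.

```python
import numpy as np


def rotation_from_axis_angle(axis, angle_rad):
    """Rodrigues' formula: rotation by angle_rad about the given axis."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    x, y, z = axis
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])  # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(angle_rad) * K + (1.0 - np.cos(angle_rad)) * (K @ K)


def random_decalibration(T_gt, max_rot_deg=20.0, max_trans_m=1.5, rng=None):
    """Perturb a 4x4 ground-truth extrinsic (hypothetical helper).

    Returns (T_init, T_noise) with T_init = T_noise @ T_gt. The sampling
    distribution (random axis, uniform magnitude) is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    axis = rng.normal(size=3)
    angle = np.deg2rad(rng.uniform(0.0, max_rot_deg))
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    T_noise = np.eye(4)
    T_noise[:3, :3] = rotation_from_axis_angle(axis, angle)
    T_noise[:3, 3] = direction * rng.uniform(0.0, max_trans_m)
    return T_noise @ T_gt, T_noise


def rotation_angle_deg(R):
    """Rotation angle of R, using trace(R) = 1 + 2 cos(theta)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))


# Usage: a calibration network would be trained to predict the inverse of T_noise.
T_gt = np.eye(4)
T_gt[:3, 3] = [0.1, -0.2, 0.3]
T_init, T_noise = random_decalibration(T_gt, rng=np.random.default_rng(0))
T_recovered = np.linalg.inv(T_noise) @ T_init
```

Under these assumptions, applying the inverse of the sampled disturbance to the decalibrated extrinsic recovers the ground truth, which is the quantity a regression-based method is supervised with.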

The first is a synthetic dataset, which was generated using the widely-used 4th order polynomial model. It contains both circular and rectangular structures, with 1,000 distortion-rectification image pairs. The second subcategory consists of data captured under real-world settings, derived from the raw calibration data for around 40 types of wide-angle cameras. For each calibration record, the intrinsics, extrinsics, and distortion coefficients are available. Finally, we used a car equipped with different cameras to capture video sequences. The scenes cover both indoor and outdoor environments, including daytime and nighttime footage.
Cross-View Model. We selected 500 testing samples at random from each of four representative datasets (MS-COCO [25], GoogleEarth [143], GoogleMap [143], CAHomo [50]) to create a dataset for the cross-view model. It covers a range of scenarios: MS-COCO provides natural synthetic data, GoogleEarth contains aerial synthetic data, and GoogleMap offers multi-modal synthetic data. Parallax is not a factor in these three datasets, while CAHomo provides real-world data with non-planar scenes. To standardize the dataset, we converted all images to a unified format and recorded the matched points between two views. In MS-COCO, GoogleEarth, and GoogleMap, we used four vertices of the images as the matched points. In CAHomo, we identified six matched key points within the same plane.
Cross-Sensor Model. We collected RGB and point cloud data from Apollo [205], DAIR-V2X [206], KITTI [74], KUCL [207], NuScenes [208], and ONCE [209]. Around 300 data pairs with calibration parameters are included in each category. The datasets are captured in different countries to provide enough variety. Each dataset has a different sensor setup, obtaining camera-LiDAR data with varying image resolution, LiDAR scan pattern, and camera-LiDAR relative location. The image resolution ranges from 2448×2048 to 1242×375, while the LiDAR sensors are from Velodyne and Hesai, with 16, 32, 40, 64, and 128 beams. They include not only normal surrounding multi-view images but also small baseline multi-view data. Additionally, we added a random disturbance of around 20 degrees rotation and 1.5 meters translation based on classical settings [27] to simulate vibration and collision.

8 FUTURE RESEARCH DIRECTIONS
Camera calibration is a fundamental and challenging research topic. From the above technical reviews and limitation analysis, we can conclude there is still room for improvement with deep learning. From Section 3 to Section 6, specific future efforts are discussed for each model. In this section, we suggest more general future research directions.

8.1 Sequences
Most studies focus on calibrating a single image. However, the rich spatiotemporal correlation among sequences that offers useful information on calibration has been overlooked. Learning the spatiotemporal correlation can provide the network with knowledge of structure from motion, which aligns with the principles of traditional calibrations. Directly applying existing calibration methods to the first frame and then propagating the calibrated objectives to subsequent frames is a straightforward approach. However, there are no methods that can perfectly calibrate every uncalibrated input, and the calibration error will persist throughout the entire sequence. Another solution is to calibrate all frames simultaneously. However, the calibration results of learning-based methods heavily rely on the semantic features of the image. As a result, unstable jitter effects may occur in calibrated sequences when the scenes

change slightly. To this end, exploring video stabilization for sequence calibration is an interesting future direction.

8.2 Learning Target
Due to the implicit relationship to image features, conventional calibration objectives can be challenging for neural networks to learn. To this end, some works have developed novel learning targets that replace conventional calibration objectives, providing learning-friendly representations for neural networks. Additionally, intermediate geometric representations have been presented to bridge the gap between image features and calibration objectives, such as reflective amplitude coefficient maps [125], rectification flow [34], surface geometry [86], and normal flow [167], etc. Looking ahead to the future development of this community, we believe there is still great potential for designing more explicit and reasonable learning targets for calibration objectives.

8.3 Pre-training
Pre-training on ImageNet [66] has become a widely used strategy in deep learning. However, recent studies [93] have shown that this approach provides less benefit for specific camera calibration tasks, such as wide-angle camera calibration. This is due to two main reasons: the data gap and the task gap. The ImageNet dataset only contains perspective images without distortions, making the initialized weights of neural networks irrelevant to distortion models. Furthermore, He et al. [210] demonstrated that ImageNet pre-training has limited benefits when the final task is more sensitive to localization. As a result, the performance of extrinsics estimation may be impacted by this task gap. Moreover, pre-training beyond a single image and a single modality, to our knowledge, has not been thoroughly investigated in the related field. We suggest that designing a customized pre-training strategy for learning-based camera calibration is an interesting area of research.

8.4 Implicit Unified Model
Deep learning-based camera calibration methods use traditional parametric camera models, which lack the flexibility to fit complex situations. Non-parametric camera models relate each pixel to its corresponding 3D observation ray, overcoming parametric model limitations. However, they require strict calibration targets and are more complex for undistortion, projection, and unprojection. Deep learning methods show potential for calibration tasks, making non-parametric models worth revisiting and potentially replacing parametric models in the future. Moreover, they allow for implicit and unified calibration, fitting all camera types through pixel-level regression and avoiding explicit feature extraction and geometry solving. Researchers combined the advantages of implicit and unified representation with the Neural Radiance Field (NeRF) for reconstructing 3D structures and synthesizing novel views. Self-calibration NeRF [211] has been proposed for generic cameras with arbitrary non-linear distortions, and end-to-end pipelines have been explored to learn depth and ego-motion without calibration targets. We believe the implicit and unified camera models could be used to optimize learning-based algorithms or integrated into downstream 3D vision tasks.

9 CONCLUSION
In this paper, we present a comprehensive survey of the recent efforts in the area of deep learning-based camera calibration. Our survey covers conventional camera models, classified learning paradigms and learning strategies, detailed reviews of the state-of-the-art approaches, a public benchmark, and future research directions. To exhibit the development process and link the connections between existing works, we provide a fine-grained taxonomy that categorizes literature by jointly considering camera models and applications. Moreover, the relationships, strengths, distinctions, and limitations are thoroughly discussed in each category. An open-source repository will be updated regularly with new works and datasets. We hope that this survey could promote future research in this field.

ACKNOWLEDGMENT
We thank Leidong Qin and Shangrong Yang at Beijing Jiaotong University for the partial dataset collection. We thank Jinlong Fan at the University of Sydney for the insightful discussion.

REFERENCES
[1] D. C. Brown, “Close-range camera calibration,” Photogramm. Eng., vol. 37, no. 8, pp. 855–866, 1971.
[2] S. J. Maybank and O. D. Faugeras, “A theory of self-calibration of a moving camera,” International Journal of Computer Vision, vol. 8, no. 2, pp. 123–151, 1992.
[3] J. Weng, P. Cohen, M. Herniou et al., “Camera calibration with distortion models and accuracy evaluation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 10, pp. 965–980, 1992.
[4] Z. Zhang, “A flexible new technique for camera calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330–1334, 2000.
[5] D. C. Brown, “Decentering distortion of lenses,” Photogrammetric Engineering and Remote Sensing, 1966.
[6] Z. Zhang, “Flexible camera calibration by viewing a plane from unknown orientations,” in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1. IEEE, 1999, pp. 666–673.
[7] S. Gasparini, P. Sturm, and J. P. Barreto, “Plane-based calibration of central catadioptric cameras,” in 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009, pp. 1195–1202.
[8] S. Shah and J. Aggarwal, “A simple calibration procedure for fish-eye (high distortion) lens camera,” in Proceedings of the 1994 IEEE International Conference on Robotics and Automation. IEEE, 1994, pp. 3422–3427.
[9] J. P. Barreto and H. Araujo, “Geometric properties of central catadioptric line images and their application in calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1327–1333, 2005.
[10] R. Carroll, M. Agrawal, and A. Agarwala, “Optimizing content-preserving projections for wide-angle images,” in ACM Transactions on Graphics (TOG), vol. 28, no. 3. ACM, 2009, p. 43.
[11] F. Bukhari and M. N. Dailey, “Automatic radial distortion estimation from a single image,” Journal of Mathematical Imaging and Vision, vol. 45, no. 1, pp. 31–45, 2013.
[12] M. Alemán-Flores, L. Alvarez, L. Gomez, and D. Santana-Cedrés, “Automatic lens distortion correction using one-parameter division models,” Image Processing On Line, vol. 4, pp. 327–343, 2014.
[13] O. D. Faugeras, Q.-T. Luong, and S. J. Maybank, “Camera self-calibration: Theory and experiments,” in European Conference on Computer Vision. Springer, 1992, pp. 321–334.
[14] C. S. Fraser, “Digital camera self-calibration,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 52, no. 4, pp. 149–159, 1997.
[15] R. I. Hartley, “Self-calibration from multiple views with a rotating camera,” in European Conference on Computer Vision. Springer, 1994, pp. 471–478.

[16] [Online]. Available: https://docs.opencv.org/4.x/dc/dbb/tutorial_py_calibration.html
[17] [Online]. Available: https://www.mathworks.com/help/vision/camera-calibration.html
[18] J. Salvi, X. Armangué, and J. Batlle, “A comparative review of camera calibrating methods with accuracy evaluation,” Pattern recognition, vol. 35, no. 7, pp. 1617–1635, 2002.
[19] C. Hughes, M. Glavin, E. Jones, and P. Denny, “Review of geometric distortion compensation in fish-eye cameras,” 2008.
[20] J. Fan, J. Zhang, S. J. Maybank, and D. Tao, “Wide-angle image rectification: a survey,” International Journal of Computer Vision, vol. 130, no. 3, pp. 747–776, 2022.
[21] S. Workman, C. Greenwell, M. Zhai, R. Baltenberger, and N. Jacobs, “Deepfocal: A method for direct focal length estimation,” in 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 1369–1373.
[22] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
[23] J. Rong, S. Huang, Z. Shang, and X. Ying, “Radial lens distortion correction using convolutional neural networks trained with synthesized images,” in Asian Conference on Computer Vision. Springer, 2016, pp. 35–49.
[24] V. Rengarajan, Y. Balaji, and A. Rajagopalan, “Unrolling the shutter: Cnn to correct motion distortions,” in Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, 2017, pp. 2291–2299.
[25] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Deep image homography estimation,” arXiv preprint arXiv:1606.03798, 2016.
[26] Y. Hold-Geoffroy, K. Sunkavalli, J. Eisenmann, M. Fisher, E. Gambaretto, S. Hadap, and J.-F. Lalonde, “A perceptual measure for deep single image camera calibration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[27] N. Schneider, F. Piewak, C. Stiller, and U. Franke, “Regnet: Multimodal sensor registration using deep neural networks,” in 2017 IEEE intelligent vehicles symposium (IV). IEEE, 2017, pp. 1803–1810.
[28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
[29] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[30] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in neural information processing systems, vol. 27, 2014.
[31] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “Dr-gan: Automatic radial distortion rectification using conditional gan in real-time,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 725–733, 2020.
[32] K. Liao, C. Lin, Y. Zhao, and M. Xu, “Model-free distortion rectification framework bridged by distortion distribution map,” IEEE Transactions on Image Processing, vol. 29, pp. 3707–3718, 2020.
[33] K. Liao, C. Lin, L. Liao, Y. Zhao, and W. Lin, “Multi-level curriculum for training a distortion-aware barrel distortion rectification model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 4389–4398.
[34] X. Li, B. Zhang, P. V. Sander, and J. Liao, “Blind geometric distortion correction on images through deep learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[36] M. Zhai, S. Workman, and N. Jacobs, “Detecting vanishing points using global image context in a non-manhattan world,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[37] O. Bogdan, V. Eckstein, F. Rameau, and J.-C. Bazin, “Deepcalib: a deep learning approach for automatic intrinsic calibration of wide field-of-view cameras,” in Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, 2018, pp. 1–10.
[38] Y. Lin, R. Wiersma, S. L. Pintea, K. Hildebrandt, E. Eisemann, and J. C. van Gemert, “Deep vanishing point detection: Geometric priors make dataset variations vanish,” arXiv preprint arXiv:2203.08586, 2022.
[39] X. Zhou, P. Duan, Y. Ma, and B. Shi, “Evunroll: Neuromorphic events based rolling shutter image correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 775–17 784.
[40] Y. Shangrong, L. Chunyu, L. Kang, and Z. Yao, “Fishformer: Annulus slicing-based transformer for fisheye rectification with efficacy domain exploration,” arXiv preprint arXiv:2207.01925, 2022.
[41] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Depth-aware multi-grid deep homography estimation with contextual correlation,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2021.
[42] K. Akio, Z. Yiyang, Z. Pengwei, Z. Wei, and T. Masayoshi, “Sst-calib: Simultaneous spatial-temporal parameter calibration between lidar and camera,” arXiv preprint arXiv:2207.03704, 2022.
[43] Y. Zhao, Z. Huang, T. Li, W. Chen, C. LeGendre, X. Ren, A. Shapiro, and H. Li, “Learning perspective undistortion of portraits,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[44] J. Tan, S. Zhao, P. Xiong, J. Liu, H. Fan, and S. Liu, “Practical wide-angle portraits correction with deep structured models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 3498–3506.
[45] M. Kocabas, C.-H. P. Huang, J. Tesch, L. Müller, O. Hilliges, and M. J. Black, “Spec: Seeing people in the wild with an estimated camera,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 11 035–11 045.
[46] P. Liu, Z. Cui, V. Larsson, and M. Pollefeys, “Deep shutter unrolling network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5941–5949.
[47] F. Zhu, S. Zhao, P. Wang, H. Wang, H. Yan, and S. Liu, “Semi-supervised wide-angle portraits correction by multi-scale transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 689–19 698.
[48] R. Zhu, X. Yang, Y. Hold-Geoffroy, F. Perazzi, J. Eisenmann, K. Sunkavalli, and M. Chandraker, “Single view metrology in the wild,” in European Conference on Computer Vision. Springer, 2020, pp. 316–333.
[49] T. Nguyen, S. W. Chen, S. S. Shivakumar, C. J. Taylor, and V. Kumar, “Unsupervised deep homography: A fast and robust homography estimation model,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2346–2353, 2018.
[50] J. Zhang, C. Wang, S. Liu, L. Jia, N. Ye, J. Wang, J. Zhou, and J. Sun, “Content-aware unsupervised deep homography estimation,” in European Conference on Computer Vision. Springer, 2020, pp. 653–669.
[51] N. Ye, C. Wang, H. Fan, and S. Liu, “Motion basis learning for unsupervised deep homography estimation with subspace projection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 13 117–13 125.
[52] M. Hong, Y. Lu, N. Ye, C. Lin, Q. Zhao, and S. Liu, “Unsupervised homography estimation with coplanarity-aware gan,” arXiv preprint arXiv:2205.03821, 2022.
[53] S. Liu, N. Ye, C. Wang, K. Luo, J. Wang, and J. Sun, “Content-aware unsupervised deep homography estimation and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2022.
[54] S. Yang, C. Lin, K. Liao, Y. Zhao, and M. Liu, “Unsupervised fisheye image correction through bidirectional loss with geometric prior,” Journal of Visual Communication and Image Representation, vol. 66, p. 102692, 2020.
[55] X. Wang, C. Wang, B. Liu, X. Zhou, L. Zhang, J. Zheng, and X. Bai, “Multi-view stereo in the deep learning era: A comprehensive review,” Displays, vol. 70, p. 102102, 2021.
[56] J. Fan, J. Zhang, and D. Tao, “Sir: Self-supervised image rectification via seeing the same scene from multiple different lenses,” IEEE Transactions on Image Processing, 2022.
[57] J. Fang, I. Vasiljevic, V. Guizilini, R. Ambrus, G. Shakhnarovich, A. Gaidon, and M. R. Walter, “Self-supervised camera self-calibration from video,” arXiv preprint arXiv:2112.03325, 2021.
[58] J. Zhao, S. Wei, L. Liao, and Y. Zhao, “Dqn-based gradual fisheye image rectification,” Pattern Recognition Letters, vol. 152, pp. 129–134, 2021.
[59] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
[60] K. Wilson and N. Snavely, “Robust global translations with 1dsfm,” in European Conference on Computer Vision. Springer, 2014, pp. 61–75.
[61] [Online]. Available: https://www.repository.cam.ac.uk/handle/1810/251342;jsessionid=90AB1617B8707CD387CBF67437683F77
[62] S. Workman, M. Zhai, and N. Jacobs, “Horizon lines in the wild,” arXiv preprint arXiv:1604.02129, 2016.
[63] [Online]. Available: https://mvrl.cse.wustl.edu/datasets/hlw/
[64] P. Denis, J. H. Elder, and F. J. Estrada, “Efficient edge-based methods for estimating manhattan frames in urban imagery,” in European conference on computer vision. Springer, 2008, pp. 197–210.
[65] O. Barinova, V. Lempitsky, E. Tretiak, and P. Kohli, “Geometric image parsing in man-made environments,” in European conference on computer vision. Springer, 2010, pp. 57–70.
[66] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
[68] C.-H. Chang, C.-N. Chou, and E. Y. Chang, “Clkn: Cascaded lucas-kanade networks for image alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[69] F. Erlik Nowruzi, R. Laganiere, and N. Japkowicz, “Homography estimation from image pairs with hierarchical convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
[70] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010, pp. 3485–3492.
[71] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007, pp. 1–8.
[72] H. Shao, T. Svoboda, and L. Van Gool, “Zubud-zurich buildings database for image based recognition,” Computer Vision Lab, Swiss Federal Institute of Technology, Switzerland, Tech. Rep, vol. 260, no. 20, p. 6, 2003.
[73] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” in Workshop on faces in 'Real-Life' Images: detection, alignment, and recognition, 2008.
[74] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3354–3361.
[75] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, “Recognizing scene viewpoint using panoramic place representation,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2695–2702.
[76] X. Yin, X. Wang, J. Yu, M. Zhang, P. Fua, and D. Tao, “Fisheyerecnet: A multi-context collaborative deep network for fisheye image rectification,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[77] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
[78] Y. Shi, D. Zhang, J. Wen, X. Tong, X. Ying, and H. Zha, “Radial lens distortion correction by adding a weight layer with inverted foveal models to convolutional neural networks,” in 2018 24th International Conference on Pattern Recognition (ICPR), 2018, pp. 1–6.
[79] R. Ranftl and V. Koltun, “Deep fundamental matrix estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[80] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1–13, 2017.
[81] O. Poursaeed, G. Yang, A. Prakash, Q. Fang, H. Jiang, B. Hariharan, and S. Belongie, “Deep fundamental matrix estimation without correspondences,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.
[82] R. Zeng, S. Denman, S. Sridharan, and C. Fookes, “Rethinking planar homography estimation using perspective fields,” in Asian Conference on Computer Vision. Springer, 2018, pp. 571–586.
[83] G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna, “Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1110–1117.
[84] C.-K. Chang, J. Zhao, and L. Itti, “Deepvp: Deep learning for vanishing point detection on 1 million street view images,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 4496–4503.
[85] M. Lopez, R. Mari, P. Gargallo, Y. Kuang, J. Gonzalez-Jimenez, and G. Haro, “Deep single image camera calibration with radial distortion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[86] W. Xian, Z. Li, M. Fisher, J. Eisenmann, E. Shechtman, and N. Snavely, “Uprightnet: Geometry-aware camera orientation estimation from single images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[87] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger, “Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset,” arXiv preprint arXiv:1809.00716, 2018.
[88] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839.
[89] B. Zhuang, Q.-H. Tran, G. H. Lee, L. F. Cheong, and M. Chandraker, “Degeneracy in self-calibration revisited and a deep learning solution for uncalibrated slam,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 3766–3773.
[90] S. Ammar Abbas and A. Zisserman, “A geometric approach to obtain a bird’s eye view from an image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
[91] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on robot learning. PMLR, 2017, pp. 1–16.
[92] O. Barinova, V. Lempitsky, E. Tretiak, and P. Kohli, “Geometric image parsing in man-made environments,” in European conference on computer vision. Springer, 2010, pp. 57–70.
[93] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “Distortion rectification from static to dynamic: A distortion sequence construction perspective,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 3870–3882, 2020.
[94] R. Jung, A. S. J. Lee, A. Ashtari, and J.-C. Bazin, “Deep360up: A deep learning-based approach for automatic vr image upright adjustment,” in 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), 2019, pp. 1–8.
[95] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, “Robust optimization for deep regression,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2830–2838.
[96] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018.
[97] B. Zhuang, Q.-H. Tran, P. Ji, L.-F. Cheong, and M. Chandraker, “Learning structure-and-motion-aware rolling shutter correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[98] Z. Xue, N. Xue, G.-S. Xia, and W. Shen, “Learning to calibrate straight lines for fisheye image rectification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[99] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma, “Learning to parse wireframes in images of man-made environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 626–635.
[100] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1746–1754.
[101] L. Yin, X. Sun, T. Worm, and M. Reale, “A high-resolution 3d dynamic facial expression database, 2008,” in IEEE International Conference on Automatic Face and Gesture Recognition, Amsterdam, The Netherlands, vol. 126.
[102] Y. Zhou, H. Qi, J. Huang, and Y. Ma, “Neurvps: Neural vanishing point scanning via conic convolution,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[103] Y. Zhou, H. Qi, Y. Zhai, Q. Sun, Z. Chen, L.-Y. Wei, and Y. Ma, “Learning to reconstruct 3d manhattan wireframes from a single image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7698–7707.
[104] L. Sha, J. Hobbs, P. Felsen, X. Wei, P. Lucey, and S. Ganguly, “End-to-end camera calibration for broadcast videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[105] N. Homayounfar, S. Fidler, and R. Urtasun, “Sports field localization via deep structured models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5212–5220.
[106] J. Lee, M. Sung, H. Lee, and J. Kim, “Neural geometric parser for single image camera calibration,” in European Conference on Computer Vision. Springer, 2020, pp. 541–557.
[107] [Online]. Available: https://developers.google.com/maps/
[108] A. Cramariuc, A. Petrov, R. Suri, M. Mittal, R. Siegwart, and C. Cadena, “Learning camera miscalibration detection,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 4997–5003.
[109] C. Zhang, F. Rameau, J. Kim, D. M. Argaw, J.-C. Bazin, and I. S. Kweon, “Deepptz: Deep self-calibration for ptz cameras,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.
[110] H. Le, F. Liu, S. Zhang, and A. Agarwala, “Deep homography estimation for dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[111] B. Davidson, M. S. Alvi, and J. F. Henriques, “360° camera alignment via segmentation,” in European Conference on Computer Vision. Springer, 2020, pp. 579–595.
[112] Y.-Y. Jau, R. Zhu, H. Su, and M. Chandraker, “Deep keypoint-based camera pose estimation with geometric constraints,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 4950–4957.
[113] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The apolloscape open dataset for autonomous driving and its application,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
[114] Y.-H. Li, I.-C. Lo, and H. H. Chen, “Deep face rectification for 360° dual-fisheye cameras,” IEEE Transactions on Image Processing, vol. 30, pp. 264–276, 2021.
[115] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European conference on computer vision. Springer, 2016, pp. 87–102.
[116] Y. Shi, X. Tong, J. Wen, H. Zhao, X. Ying, and H. Zha, “Position-aware and symmetry enhanced gan for radial distortion correction,” in 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 1701–1708.
[117] H. Zhao, Y. Shi, X. Tong, X. Ying, and H. Zha, “A simple yet effective pipeline for radial distortion correction,” in 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 878–882.
[118] C.-H. Chao, P.-L. Hsu, H.-Y. Lee, and Y.-C. F. Wang, “Self-supervised deep learning for fisheye image rectification,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 2248–2252.
[119] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
[120] H. Zhao, X. Ying, Y. Shi, X. Tong, J. Wen, and H. Zha, “Rdcface: Radial distortion correction for face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[121] F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, and C. C. Loy, “The devil of face recognition is in the noise,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 765–780.
[122] Z.-C. Xue, N. Xue, and G.-S. Xia, “Fisheye distortion rectification from deep straight lines,” arXiv preprint arXiv:2003.11386, 2020.
[123] M. Baradad and A. Torralba, “Height and uprightness invariance for 3d prediction from a single view,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[124] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European conference on computer vision. Springer, 2012, pp. 746–760.
[125] Q. Zheng, J. Chen, Z. Lu, B. Shi, X. Jiang, K.-H. Yap, L.-Y. Duan, and A. C. Kot, “What does plate glass reveal about camera calibration?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[126] [Online]. Available: https://figshare.com/articles/dataset/FocaLens/3399169/2
[127] K. Yuan, Z. Guo, and Z. J. Wang, “Rggnet: Tolerance aware lidar-camera online calibration with geometric deep learning and generative model,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6956–6963, 2020.
[128] J. Shi, Z. Zhu, J. Zhang, R. Liu, Z. Wang, S. Chen, and H. Liu, “Calibrcnn: Calibrating camera and lidar by recurrent convolutional neural network and geometric constraints,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10 197–10 202.
[129] Y. Zhu, C. Li, and Y. Zhang, “Online camera-lidar calibration with sensor semantic information,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 4970–4976.
[130] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[131] W. Wang, S. Nobuhara, R. Nakamura, and K. Sakurada, “Soic: Semantic online initialization and calibration for lidar and camera,” arXiv preprint arXiv:2003.04260, 2020.
[132] S. Wu, A. Hadachi, D. Vivet, and Y. Prabhakar, “Netcalib: A novel approach for lidar-camera auto-calibration based on deep learning,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 6648–6655.
[133] Y. Li, W. Pei, and Z. He, “Srhen: stepwise-refining homography estimation network via parsing geometric correspondences in deep latent space,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 3063–3071.
[134] Y. Gil, S. Elmalem, H. Haim, E. Marom, and R. Giryes, “Online training of stereo self-calibration using monocular depth estimation,” IEEE Transactions on Computational Imaging, vol. 7, pp. 812–823, 2021.
[135] [Online]. Available: http://www.cs.toronto.edu/~harel/TAUAgent/download.html
[136] J. Lee, H. Go, H. Lee, S. Cho, M. Sung, and J. Kim, “Ctrl-c: Camera calibration transformer with line-classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 16 228–16 237.
[137] N. Wakai and T. Yamashita, “Deep single fisheye image camera calibration for over 180-degree projection of field of view,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, October 2021, pp. 1174–1183.
[138] P. Mirowski, A. Banki-Horvath, K. Anderson, D. Teplyashin, K. M. Hermann, M. Malinowski, M. K. Grimes, K. Simonyan, K. Kavukcuoglu, A. Zisserman et al., “The streetlearn environment and dataset,” arXiv preprint arXiv:1903.01292, 2019.
[139] K. Liao, C. Lin, and Y. Zhao, “A deep ordinal distortion estimation approach for distortion rectification,” IEEE Transactions on Image Processing, vol. 30, pp. 3362–3375, 2021.
[140] K. Zhao, C. Lin, K. Liao, S. Yang, and Y. Zhao, “Revisiting radial distortion rectification in polar-coordinates: A new and efficient learning perspective,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2021.
[141] A. Eichenseer and A. Kaup, “A data set providing synthetic and real-world fisheye video sequences,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 1541–1545.
[142] S. Yang, C. Lin, K. Liao, C. Zhang, and Y. Zhao, “Progressively complementary network for fisheye image rectification using appearance flow,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 6348–6357.
[143] Y. Zhao, X. Huang, and Z. Zhang, “Deep lucas-kanade homography for multimodal image alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 15 950–15 959.
[144] R. Shao, G. Wu, Y. Zhou, Y. Fu, L. Fang, and Y. Liu, “Localtrans: A multiscale local transformer network for cross-resolution homography estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 14 890–14 899.
[145] Y. Chen, G. Wang, P. An, Z. You, and X. Huang, “Fast and accurate homography estimation using extendable compression network,” in 2021 IEEE International Conference on Image Processing (ICIP), 2021, pp. 1024–1028.
[146] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Unsupervised deep image stitching: Reconstructing stitched features to images,” IEEE Transactions on Image Processing, vol. 30, pp. 6184–6197, 2021.
[147] S. Garg, D. P. Mohanty, S. P. Thota, and S. Moharana, “A simple approach to image tilt correction with self-attention mobilenet for smartphones,” arXiv preprint arXiv:2111.00398, 2021.
[148] K. Chen, N. Snavely, and A. Makadia, “Wide-baseline relative camera pose estimation with directional learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3258–3268.
[149] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” arXiv preprint arXiv:1709.06158, 2017.
[150] Z. Zhong, Y. Zheng, and I. Sato, “Towards rolling shutter correction and deblurring in dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9219–9228.
[151] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Fast and accurate image super-resolution with deep laplacian pyramid networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 11, pp. 2599–2613, 2018.
[152] X. Lv, B. Wang, Z. Dou, D. Ye, and S. Wang, “Lccnet: Lidar and camera self-calibration using cost volume network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2894–2901.
[153] X. Lv, S. Wang, and D. Ye, “Cfnet: Lidar-camera registration using calibration flow network,” Sensors, vol. 21, no. 23, p. 8112, 2021.
[154] Y. Liao, J. Xie, and A. Geiger, “Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[155] B. Fan and Y. Dai, “Inverting a rolling shutter camera: bring rolling shutter images to high framerate global shutter video,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4228–4237.
[156] B. Fan, Y. Dai, and M. He, “Sunet: symmetric undistortion network for rolling shutter correction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4541–4550.
[157] Z. Liu, H. Tang, S. Zhu, and S. Han, “Semalign: Annotation-free camera-lidar calibration with semantic alignment loss,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 8845–8851.
[158] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
[159] M. Schönbein, T. Strauß, and A. Geiger, “Calibrating and centering quasi-central catadioptric cameras,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 4443–4450.
[160] T. H. Butt and M. Taj, “Camera calibration through camera projection loss,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 2649–2653.
[161] X. Li, F. Flohr, Y. Yang, H. Xiong, M. Braun, S. Pan, K. Li, and D. M. Gavrila, “A new benchmark for vision-based cyclist detection,” in 2016 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2016, pp. 1028–1033.
[162] S.-Y. Cao, J. Hu, Z. Sheng, and H.-L. Shen, “Iterative deep homography estimation,” arXiv preprint arXiv:2203.15982, 2022.
[163] M. Cao, Z. Zhong, J. Wang, Y. Zheng, and Y. Yang, “Learning adaptive warping for real-world rolling shutter correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 785–17 793.
[164] T. Do, O. Miksik, J. DeGol, H. S. Park, and S. N. Sinha, “Learning to detect scene landmarks for camera localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 132–11 142.
[165] T. Do, K. Vuong, S. I. Roumeliotis, and H. S. Park, “Surface normal estimation of tilted images via spatial rectifier,” in European Conference on Computer Vision. Springer, 2020, pp. 265–280.
[166] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.
[167] C. M. Parameshwara, G. Hari, C. Fermüller, N. J. Sanket, and Y. Aloimonos, “Diffposenet: Direct differentiable camera pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6845–6854.
[168] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–4916.
[169] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012, pp. 573–580.
[170] B. J. Pijnacker Hordijk, K. Y. Scheper, and G. C. De Croon, “Vertical landing for micro air vehicles using event-based optical flow,” Journal of Field Robotics, vol. 35, no. 1, pp. 69–90, 2018.
[171] L. Yang, R. Shrestha, W. Li, S. Liu, G. Zhang, Z. Cui, and P. Tan, “Scenesqueezer: Learning to compress scene for camera relocalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8259–8268.
[172] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 year, 1000 km: The oxford robotcar dataset,” The International Journal of Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.
[173] G. Ponimatkin, Y. Labbé, B. Russell, M. Aubry, and J. Sivic, “Focal length and object pose estimation via render and compare,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3825–3834.
[174] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman, “Pix3d: Dataset and methods for single-image 3d shape modeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2974–2983.
[175] Y. Wang, X. Tan, Y. Yang, X. Liu, E. Ding, F. Zhou, and L. S. Davis, “3d pose estimation for fine-grained object categories,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0.
[176] X. Jing, X. Ding, R. Xiong, H. Deng, and Y. Wang, “Dxq-net: Differentiable lidar-camera extrinsic calibration using quality-aware flow,” arXiv preprint arXiv:2203.09385, 2022.
[177] Y. Zhang, X. Zhao, and D. Qian, “Learning-based framework for camera calibration with distortion correction and high precision feature detection,” arXiv preprint arXiv:2202.00158, 2022.
[178] Y. Sun, J. Li, Y. Wang, X. Xu, X. Yang, and Z. Sun, “Atop: An attention-to-optimization approach for automatic lidar-camera calibration via cross-modal object matching,” IEEE Transactions on Intelligent Vehicles, 2022.
[179] G. Wang, J. Qiu, Y. Guo, and H. Wang, “Fusionnet: Coarse-to-fine extrinsic calibration network of lidar and camera with hierarchical point-pixel fusion,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 8964–8970.
[180] C. Ye, H. Pan, and H. Gao, “Keypoint-based lidar-camera online calibration with robust geometric network,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–11, 2021.
[181] N. Wakai, S. Sato, Y. Ishii, and T. Yamashita, “Rethinking generic camera models for deep single image camera calibration to recover rotation and fisheye distortion,” in Proceedings of European
Conference on Computer Vision (ECCV), vol. 13678, 2022, pp. 679– application,” IEEE transactions on pattern analysis and machine
698. intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
[182] S.-H. Chang, C.-Y. Chiu, C.-S. Chang, K.-W. Chen, C.-Y. Yao, R.-R. [206] H. Yu, Y. Luo, M. Shu, Y. Huo, Z. Yang, Y. Shi, Z. Guo, H. Li,
Lee, and H.-K. Chu, “Generating 360 outdoor panorama dataset X. Hu, J. Yuan et al., “Dair-v2x: A large-scale dataset for vehicle-
with reliable sun position estimation,” in SIGGRAPH Asia 2018 infrastructure cooperative 3d object detection,” in Proceedings of
Posters, 2018, pp. 1–2. the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
[183] C. Wu, “Towards linear-time incremental structure from motion,” tion, 2022, pp. 21 361–21 370.
in 2013 International Conference on 3D Vision-3DV 2013. IEEE, [207] J. Kang and N. L. Doh, “Automatic targetless camera–LIDAR
2013, pp. 127–134. calibration by aligning edge with Gaussian mixture model,”
[184] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, Journal of Field Robotics, vol. 37, no. 1, pp. 158–179, 2020.
W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for [208] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu,
mobilenetv3,” in Proceedings of the IEEE/CVF international confer- A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A
ence on computer vision, 2019, pp. 1314–1324. multimodal dataset for autonomous driving,” in Proceedings of
[185] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep the IEEE/CVF conference on computer vision and pattern recognition,
learning on point sets for 3d classification and segmentation,” 2020, pp. 11 621–11 631.
in Proceedings of the IEEE conference on computer vision and pattern [209] J. Mao, M. Niu, C. Jiang, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li,
recognition, 2017, pp. 652–660. J. Yu, C. Xu et al., “One million scenes for autonomous driving:
[186] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Con- Once dataset,” 2021.
volution on x-transformed points,” Advances in neural information [210] K. He, R. Girshick, and P. Dollár, “Rethinking imagenet pre-
processing systems, vol. 31, 2018. training,” in Proceedings of the IEEE International Conference on
[187] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for Computer Vision, 2019, pp. 4918–4927.
optical flow using pyramid, warping, and cost volume,” in [211] Y. Jeong, S. Ahn, C. Choy, A. Anandkumar, M. Cho, and J. Park,
Proceedings of the IEEE Conference on Computer Vision and Pattern “Self-calibrating neural radiance fields,” in Proceedings of the
Recognition (CVPR), June 2018. IEEE/CVF International Conference on Computer Vision, 2021, pp.
[188] R. Hartley and A. Zisserman, Multiple view geometry in computer 5846–5854.
vision. Cambridge university press, 2003.
[189] B. D. Lucas, T. Kanade et al., An iterative image registration technique Kang Liao received his Ph.D. degree from Beijing Jiaotong University in
with an application to stereo vision. Vancouver, 1981, vol. 81. 2023. From 2021 to 2022, he was a Visiting Researcher at Max Planck
[190] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practi- Institute for Informatics in Germany. His current research interests in-
cal guidelines for efficient cnn architecture design,” in Proceedings clude camera calibration, 3D vision, and panoramic vision.
of the European conference on computer vision (ECCV), 2018, pp. 116–
131. Lang Nie is currently pursuing his Ph.D. degree at Beijing Jiaotong
[191] M. A. Fischler and R. C. Bolles, “Random sample consensus: a University. His current research interests include multi-view geometry,
paradigm for model fitting with applications to image analysis image stitching, and computer vision.
and automated cartography,” Communications of the ACM, vol. 24,
no. 6, pp. 381–395, 1981. Shujuan Huang is currently pursuing his Ph.D. degree at Beijing Jiao-
[192] L. Nie, C. Lin, K. Liao, and Y. Zhao, “Learning edge-preserved tong University. His current research interests include camera-LiDAR
image stitching from multi-scale deep homography,” Neurocom- calibration, depth completion, and computer vision.
puting, vol. 491, pp. 533–543, 2022.
[193] S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying Chunyu Lin is a Professor at Beijing Jiaotong University. From 2011 to
framework,” International journal of computer vision, vol. 56, no. 3, 2012, he was a Post-Doctoral Researcher at the Multimedia Laboratory,
pp. 221–255, 2004. Ghent University, Belgium. His research interests include multi-view
[194] J. Nocedal and S. J. Wright, Numerical optimization. Springer, geometry, camera calibration, and virtual reality video processing.
1999.
[195] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for Jing Zhang is currently a Research Fellow at the School of Computer
optical flow,” in European conference on computer vision. Springer, Science, The University of Sydney. His research interests include com-
2020, pp. 402–419. puter vision and deep learning. He has published more than 60 papers
[196] Y. Li, W. Pei, and Z. He, “Ssorn: Self-supervised outlier removal on prestigious conferences and journals, such as CVPR, ICCV, ECCV,
network for robust homography estimation,” arXiv preprint IJCV and IEEE T-PAMI. He is a SPC of the AAAI and IJCAI.
arXiv:2208.14093, 2022.
[197] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Yao Zhao (Fellow, IEEE) is the Director of the Institute of Information
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Science, Beijing Jiaotong University. His current research interests in-
Advances in neural information processing systems, vol. 30, 2017. clude image/video coding and video analysis and understanding. He
[198] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, was named a Distinguished Young Scholar by the National Science
and A. Davison, “gvnn: Neural network library for geometric Foundation of China in 2010 and was elected as a Chang Jiang Scholar
computer vision,” in European Conference on Computer Vision. of Ministry of Education of China in 2013.
Springer, 2016, pp. 67–82.
Moncef Gabbouj (Fellow, IEEE) is a Professor at the Department of
[199] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep
Computing Sciences, Tampere University, Finland. He was an Academy
hierarchical feature learning on point sets in a metric space,”
of Finland Professor. His research interests include Big Data analytics,
Advances in neural information processing systems, vol. 30, 2017.
multimedia analysis, artificial intelligence, machine learning, pattern
[200] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Op-
recognition, video processing, and coding. Dr. Gabbouj is a Fellow
timal speed and accuracy of object detection,” arXiv preprint
of the IEEE and Asia-Pacific Artificial Intelligence Association. He is
arXiv:2004.10934, 2020.
member of the Academia Europaea, the Finnish Academy of Science
[201] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Bei-
and Letters, and the Finnish Academy of Engineering Sciences.
jbom, “Pointpillars: Fast encoders for object detection from point
clouds,” in Proceedings of the IEEE/CVF conference on computer
Dacheng Tao (Fellow, IEEE) is currently the Inaugural Director of the
vision and pattern recognition, 2019, pp. 12 697–12 705.
JD Explore Academy and a Senior Vice President of JD.com, Inc. He
[202] R. Poli, J. Kennedy, and T. Blackwell, “Particle swarm optimiza- mainly applies statistics and mathematics to artificial intelligence and
tion,” Swarm intelligence, vol. 1, no. 1, pp. 33–57, 2007. data science. His research is detailed in one monograph and over 200
[203] S. Gould, R. Hartley, and D. Campbell, “Deep declarative net- publications in prestigious journals and proceedings at leading confer-
works,” IEEE Transactions on Pattern Analysis and Machine Intelli- ences. He is a fellow of the Australian Academy of Science, AAAS,
gence, vol. 44, no. 8, pp. 3988–4004, 2021. and ACM. He received the 2015 Australian Scopus-Eureka Prize, the
[204] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose 2018 IEEE ICDM Research Contributions Award, and the 2021 IEEE
estimation and tracking,” in Proceedings of the European conference Computer Society McCluskey Technical Achievement Award.
on computer vision (ECCV), 2018, pp. 466–481.
[205] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang,
“The apolloscape open dataset for autonomous driving and its
Deep Learning for Camera Calibration and Beyond: A Survey
-Supplementary Material-
Kang Liao, Lang Nie, Shujuan Huang, Chunyu Lin, Jing Zhang, Yao Zhao, Fellow, IEEE, Moncef
Gabbouj, Fellow, IEEE, Dacheng Tao, Fellow, IEEE

1 DEVELOPMENT RECAP

1.1 Milestone
Figure 1 shows a milestone timeline of deep learning-based camera calibration from 2015 to 2022, spanning the main deep learning era. We classify all literature based on the uncalibrated camera model and its extended applications: standard model, distortion model, cross-view model, and cross-sensor model.

1.2 Statistic Analysis
As Figure 2 shows, the number of learning-based camera calibration methods has grown since 2015 and boomed since 2019. The learning targets have also been extended from simple and pure parameters to complicated and hybrid parameters, driven by larger datasets, more reasonable learning strategies, more explicit learning representations, and more solid network architectures.

Figure 2 also shows the distribution of learning strategies used in learning-based camera calibration. Six strategies have been investigated, among which supervised learning accounts for the large majority (more than 90%). Considering the expensive labeling work, some recent research explores relaxing the training demand for camera parameters using semi-supervised learning, weakly-supervised learning, unsupervised learning, and self-supervised learning. Reinforcement learning has also been exploited to dynamically address the camera calibration problem.

2 CAMERA MODEL
Researchers utilize mathematical formulas to establish camera models that describe the imaging process from a point in 3D world coordinates to its projection on a 2D image plane. Different cameras and systems correspond to different types of parametric models. In this section, we first provide a detailed formulation of the basic pinhole camera model. Then, we review more complex and useful camera models, as well as extended models studied in recent literature, to meet the advanced development of cameras and academic/industrial demands.

2.1 Pinhole Camera Model
The most popular and commonly applied camera model in computer vision is the pinhole camera model. It can be regarded as a geometrically accurate first-order approximation of the traditional camera. A pinhole camera has one single effective perspective because the pinhole aperture is thought to be an infinitesimal point through which all projection lines must pass.

Using a mathematical formulation, the camera model depicts the imaging process from a point in the 3D world coordinate to its projection on the 2D image plane. Assume that the homogeneous coordinates $P_w = [X, Y, Z, 1]^T \in \mathbb{R}^{4 \times 1}$ and $P_i = [u, v, 1]^T \in \mathbb{R}^{3 \times 1}$ denote a point in the 3D world coordinate and its corresponding point on the 2D image plane, respectively. Then, a camera model can be described by a projection mapping $M \in \mathbb{R}^{3 \times 4}$ between $P_w$ and $P_i$:

$$P_i = M P_w, \tag{1}$$

where the projection can be further formed by:

$$P_c = [R \mid t] P_w, \tag{2}$$

where $P_c = [x_c, y_c, z_c]^T \in \mathbb{R}^{3 \times 1}$ denotes the transformed point in the camera coordinate using a $3 \times 3$ rotation $R$ and a 3-dimensional translation $t$. The $3 \times 4$ matrix $[R \mid t]$ is generally called the extrinsic camera matrix, in which the camera rotation can be further parameterized by three angles: yaw $\phi$, pitch $\theta$, and roll $\psi$. Subsequently, the point $P_c$ is projected onto a surface. This surface is represented by the pinhole camera model as the plane $z = 1$, and the normalized coordinate of the point in the camera coordinate is expressed by $[x_n, y_n]^T = [x_c / z_c, y_c / z_c]^T$.

Finally, the point on the normalized plane is projected onto the image plane, obtaining a pixel $P_i$ by:

$$P_i = K [x_n, y_n, 1]^T, \tag{3}$$

where $K \in \mathbb{R}^{3 \times 3}$ is the intrinsic camera matrix, which consists of various camera intrinsic parameters such as the focal length, skew coefficient, and image center:

$$K = \begin{bmatrix} f_x m_u & s & c_u \\ 0 & f_y m_v & c_v \\ 0 & 0 & 1 \end{bmatrix}, \tag{4}$$
[Figure 1: timeline of methods by year with publication venue. Legend: Standard, Distortion, Cross-View, Cross-Sensor.]
2015: DeepFocal (ICIP), PoseNet (ICCV)
2016: DeepHorizon (BMVC), DeepVP (CVPR), Rong et al. (ACCV), DHN (RSSW)
2017: CLKN (CVPR), HierarchicalNet (ICCVW), URS-CNN (CVPR), RegNet (IV)
2018: Hold-Geoffroy et al. (CVPR), DeepCalib (CVMP), FishEyeRecNet (ECCV), Shi et al. (ICPR), DeepFM (ECCV), Poursaeed et al. (ECCVW), UDHN (RAL), PFNet (ACCV), CalibNet (IROS), Chang et al. (ICRA)
2019: Lopez et al. (CVPR), UprightNet (ICCV), Zhuang et al. (IROS), SSR-Net (RPL), Abbas et al. (ICCVW), DR-GAN (TCSVT), STD (TCSVT), Deep360Up (VR), UnFishCor (JVCIR), BlindCor (CVPR), RSC-Net (CVPR), Xue et al. (CVPR), Zhao et al. (ICCV), NeurVPS (NeurIPS)
2020: Sha et al. (CVPR), Lee et al. (ECCV), MisCaliDet (ICRA), DeepPTZ (WACV), MHN (CVPR), Davidson et al. (ECCV), CA-UDHN (ECCV), DeepFEPE (IROS), DDM (TIP), FE-GAN (ICASSP), PSE-GAN (ICPR), RDC-Net (ICIP), Li et al. (TIP), RDCFace (CVPR), LaRecNet (arXiv), Baradad et al. (CVPR), Zheng et al. (CVPR), Zhu et al. (ECCV), DeepUnrollNet (CVPR), RGGNet (RAL), CalibRCNN (IROS), SSI-Calib (ICRA), SOIC (arXiv), NetCalib (ICPR), SRHEN (ACM-MM)
2021: StereoCaliNet (TCI), CTRL-C (ICCV), Wakai et al. (ICCVW), SemAlign (IROS), OrdinalDistortion (TIP), PolarRecNet (TCSVT), DQN-RecNet (PRL), Tan et al. (CVPR), PCN (CVPR), DaRecNet (ICCV), DLKFM (CVPR), LocalTrans (ICCV), BasesHomo (ICCV), ShuffleHomoNet (ICIP), DirectionNet (CVPR), DAMG-Homo (TCSVT), SA-MobileNet (BMVC), SPEC (ICCV), JCD (CVPR), LCCNet (CVPRW), CFNet (Sensors), Fan et al. (ICCV), SUNet (ICCV)
2022: Fang et al. (ICRA), CPL (ICASSP), DVPD (CVPR), IHN (CVPR), HomoGAN (CVPR), SS-WPC (CVPR), AW-RSC (CVPR), EvUnroll (CVPR), Do et al. (CVPR), DiffPoseNet (CVPR), SceneSqueezer (CVPR), FocalPose (CVPR), FishFormer (arXiv), DXQ-Net (arXiv), CCS-Net (IROS), SST-Calib (ITSC), SIR (TIP), ATOP (TIV), FusionNet (ICRA), RKGCNet (TIM), GenCaliNet (ECCV), Liu et al. (TPAMI)

Fig. 1. A concise milestone of deep learning-based camera calibration methods. We classify all methods based on the uncalibrated camera model
and its extended applications: standard model, distortion model, cross-view model, and cross-sensor model. Standard model: DeepFocal [1],
PoseNet [2], DeepHorizon [3], DeepVP [4], Chang et al. [5], UprightNet [6], Lee et al. [7], NeurVPS [8], Deep360Up [9], Davidson et al. [10],
DeepFEPE [11], Baradad et al. [12], Zheng et al. [13], Zhu et al. [14], StereoCaliNet [15], SA-MobileNet [16], Fang et al. [17], CPL [18], FocalPose
[19], DirectionNet [20], DVPD [21], Do et al. [22], CTRL-C [23], SPEC [24], DiffPoseNet [25], SceneSqueezer [26]. Distortion model: Rong et
al. [27], Hold-Geoffroy et al. [28], DeepCalib [29], URS-CNN [30], FishEyeRecNet [31], Shi et al. [32], DR-GAN [33], STD [34], UnFishCor [35],
BlindCor [36], RSC-Net [37], Xue et al. [38], Zhuang et al. [39], Lopez et al. [40], Zhao et al. [41], DDM [42], MisCaliDet [43], DeepPTZ [44], FE-GAN [45],
PSE-GAN [46], RDC-Net [47], Li et al. [48], RDCFace [49], LaRecNet [50], DeepUnrollNet [51], OrdinalDistortion [52], PolarRecNet [53],
DQN-RecNet [54], Tan et al. [55], PCN [56], SIR [57], DaRecNet [58], Wakai et al. [59], GenCaliNet [60], JCD [61], SS-WPC [62], AW-RSC [63],
EvUnroll [64], Fan et al. [65], SUNet [66], CCS-Net [67], FishFormer [68]. Cross-View model: DHN [69], CLKN [70], HierarchicalNet [71], DeepFM
[72], Poursaeed et al. [73], UDHN [74], PFNet [75], SRHEN [76], SSR-Net [77], Abbas et al. [78], Sha et al. [79], MHN [80], CA-UDHN [81], DLKFM
[82], LocalTrans [83], BasesHomo [84], ShuffleHomoNet [85], DAMG-Homo [86], IHN [87], HomoGAN [88], Liu et al. [89]. Cross-Sensor model:
RegNet [90], CalibNet [91], RGGNet [92], SSI-Calib [93], SOIC [94], CalibRCNN [95], NetCalib [96], LCCNet [97], CFNet [98], SemAlign [99],
DXQ-Net [100], SST-Calib [101], ATOP [102], FusionNet [103], RKGCNet [104].
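
The projection chain of Eqs. (1)-(4) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from any surveyed method: the function names `intrinsic_matrix` and `project_points` are hypothetical, the focal lengths are assumed to be given in pixel units already (so the $m_u$, $m_v$ factors of Eq. (4) are folded into them), and all points are assumed to lie in front of the camera ($z_c > 0$).

```python
import numpy as np


def intrinsic_matrix(fx, fy, cu, cv, s=0.0):
    """Build K as in Eq. (4); fx, fy are assumed to be in pixel units."""
    return np.array([[fx, s, cu],
                     [0.0, fy, cv],
                     [0.0, 0.0, 1.0]])


def project_points(points_w, R, t, K):
    """Project N x 3 world points to N x 2 pixel coordinates."""
    points_w = np.asarray(points_w, dtype=float)
    # Eq. (2): world -> camera coordinates, P_c = R P_w + t.
    points_c = points_w @ R.T + t
    # Normalized plane z = 1: [x_n, y_n] = [x_c / z_c, y_c / z_c].
    normalized = points_c[:, :2] / points_c[:, 2:3]
    # Eq. (3): P_i = K [x_n, y_n, 1]^T.
    ones = np.ones((normalized.shape[0], 1))
    pixels = np.hstack([normalized, ones]) @ K.T
    return pixels[:, :2]
```

For instance, with `K = intrinsic_matrix(800, 800, 320, 240)`, an identity rotation, and a zero translation, the world point `[0.5, -0.25, 2.0]` has normalized coordinates `(0.25, -0.125)` and lands at pixel `(520, 140)`, while a point on the optical axis maps to the image center `(320, 240)`.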

where $f_x$ and $f_y$ are the focal lengths along the X-axis and Y-axis of the camera, respectively. For most cameras, $f_x = f_y$, and they are unified to $f$. $m_u$ and $m_v$ are the numbers of pixels per unit distance, with $m_u = m_v$ if the image has square pixels. $s$ is the skew coefficient. The pixels of a CCD sensor might not be precisely rectangular, which causes a slight distortion along the X or Y axis; the skew coefficient measures this deviation of the pixel axes from orthogonality and becomes 0 when the X-axis and Y-axis are perpendicular to each other. $[c_u, c_v]^T$ is the coordinate of the image center. According to previous works and factory design, the intrinsic parameters can be simplified by setting $s = 0$, $m_u = m_v$, and expressing the focal length in pixel units; Eq. (3) can then be reformulated as:

$$P_i = \begin{bmatrix} f_x & 0 & c_u \\ 0 & f_y & c_v \\ 0 & 0 & 1 \end{bmatrix} [x_n, y_n, 1]^T. \tag{5}$$

In addition to numerical camera parameters, some geometric representations can provide useful clues for camera calibration, such as vanishing points and horizon lines. These representations establish clear relationships between image features and calibration objectives, which can alleviate the difficulty of learning conventional and implicit camera parameters.

[Figure 2: four panels. Bar chart of the publication number per year, 2015 to 2022, for Standard, Distortion, Cross-View, Cross-Sensor, and All; pie charts of calibration objectives (radial distortion, intrinsics/extrinsics, camera-LiDAR, rolling shutter distortion, projection matrix), dataset simulation (yes/no), and learning strategy (SL, Semi-SL, WSL, USL, SSL, RL).]

Fig. 2. A statistical analysis of deep learning-based camera calibration methods. Specifically, we summarize all literature by the number of publications per year, calibration objectives, simulation of the dataset, and learning strategy.

Lines and points are both represented as three-dimensional vectors in homogeneous coordinates. The line $l$ that connects two points and the point $p$ at the intersection of two lines are given by:

$$l = \frac{p_1 \times p_2}{\|p_1 \times p_2\|}, \qquad p = \frac{l_1 \times l_2}{\|l_1 \times l_2\|}. \tag{6}$$

There are two parameterizations of the horizon line: slope-offset $(\theta, \rho)$ and left-right $(l, r)$. Assume that the viewing orientation is down the negative z-axis, with the positive x-direction to the right and the positive y-direction up. The world viewing direction of the camera can then be described by $R_c^T [0, 0, -1]^T$. Since the world vector $[0, 1, 0]^T$ points in the zenith direction, the horizon line is the set of points $p$ satisfying:

$$p^T K^{-T} R [0, 1, 0]^T = 0. \tag{7}$$

As mentioned in Barnard [105], the normalized line direction vector $d$ can be formulated for the Gaussian sphere representation of a vanishing point $v$. In particular, suppose a 3D ray is described by $o + \lambda d$, where $o$ and $d$ are its origin and unit direction vector, respectively. Then, the vanishing point is obtained as $\lambda \to \infty$, with its image coordinate given by $v = [v_x, v_y]^T := \lim_{\lambda \to \infty} [p_x, p_y]^T \in \mathbb{R}^2$. Thus, the 3D direction of a line based on its vanishing point can be calculated by:

$$d = [v_x - c_x, \; v_y - c_y, \; f]^T \in \mathbb{R}^3. \tag{8}$$

By using $d$ rather than $v$, the degenerate situations where $d$ is parallel to the image plane are eliminated. Additionally, it provides a natural measurement for determining the separation between two vanishing points.

2.2 Wide-angle Camera Model
The perspective projection model, given a typical pinhole camera with focal length $f$, can be expressed as:

$$r = f \tan\theta, \tag{9}$$

where $r$ denotes the projection distance between the principal point and the points in the image, and $\theta$ denotes the angle between the incident ray and the optical axis of the camera. Clearly, $\theta$ must be less than $90^\circ$: otherwise the incoming ray does not intersect the image plane, so it has no projection point, and the pinhole camera cannot view anything behind it. Because of their restricted field of view (FoV), most cameras cannot see all of the points in the 3D environment at the same time.

Due to their wide FoV, wide-angle cameras are increasingly used in computer vision and robotics tasks such as navigation, localization, and tracking. Specifically, a camera with an extra wide-angle lens, called a fisheye camera, is used to create a broad, hemispherical, or panoramic image. Fisheye lenses employ a specific mapping to produce convex and non-rectilinear images, as opposed to images with straight lines of perspective (rectilinear images). However, the wide-angle camera violates the pinhole camera assumption, and the captured image suffers from geometric distortions.

Geometric distortion induced by wide-angle cameras can generally be classified into radial distortion and tangential distortion (de-centering distortion). Radial distortion is the primary distortion in central single-view camera systems, exhibiting circular symmetry with respect to the distortion center. This distortion moves points on the image plane away from their ideal location under the perspective camera model, along the radial axis from the distortion center. Radial distortion models can be formulated as nonlinear functions of the radial distance [106]. Tangential (de-centering) distortion, on the other hand, is primarily caused by the lens assembly not being centered over and parallel to the image plane. Unlike radial distortion, tangential distortion has a
geometric impact that is not solely along the radial axis, and can also cause rotation and skewing of the image plane with respect to the distance from the image center. The camera model with radial distortion and tangential distortion can be parameterized by:

$$\begin{cases} x_r = x_d + \bar{x}(k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + \cdots) + (p_1(r_d^2 + 2\bar{x}^2) + 2 p_2 \bar{x}\bar{y})(1 + p_3 r_d^2 + \cdots) \\ y_r = y_d + \bar{y}(k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + \cdots) + (p_2(r_d^2 + 2\bar{y}^2) + 2 p_1 \bar{x}\bar{y})(1 + p_3 r_d^2 + \cdots) \end{cases}, \tag{10}$$

where $\bar{x} = x_d - c_x$ and $\bar{y} = y_d - c_y$. $K = (k_1, k_2, k_3, \ldots)$ and $P = (p_1, p_2, p_3, \ldots)$ are the radial distortion parameters and decentering distortion parameters, respectively. $r_d$ describes the radial distance from an image point to the distortion center $(c_x, c_y)$. This equation represents the mapping from a point $[x_d, y_d]^T$ in the image captured by the wide-angle camera to the point $[x_r, y_r]^T$ in the rectified image without the geometric distortion.

Previous works demonstrate that tangential distortion is basically insignificant and can be neglected. Moreover, as we surveyed, all learning-based camera calibration methods only consider the radial distortion for calibrating the wide-angle camera. To this end, by dropping the tangential terms and taking the distortion center as the origin, Eq. (10) can be simplified to a Taylor-expansion form in $r_d$:

$$\begin{cases} x_r = x_d (1 + k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + \cdots) \\ y_r = y_d (1 + k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + \cdots) \end{cases}. \tag{11}$$

This equation is known as the even-order polynomial model, which can also be expressed as an odd-order polynomial model by shifting the power. However, according to Wang et al. [107], while the polynomial model is suitable for small distortions, it requires an unreasonably high number of non-zero distortion parameters for severe distortions. As an alternative, Fitzgibbon et al. [108] proposed a division model that more accurately approximates the genuine undistortion function of a common camera. For significant distortion, the division model is preferred over the polynomial model because it requires fewer terms:

$$\begin{cases} x_r = \dfrac{x_d}{1 + k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + \cdots} \\ y_r = \dfrac{y_d}{1 + k_1 r_d^2 + k_2 r_d^4 + k_3 r_d^6 + \cdots} \end{cases}. \tag{12}$$

Some classical works demonstrate that the single-parameter division model (with only the distortion parameter $k_1$ in Eq. (12)) seems to be sufficient for most wide-angle cameras, and it has been widely applied in learning-based wide-angle camera calibration [27], [29], [36], [54].

2.3 Rolling Shutter Camera Model
Due to their compact design, low price, and high frame rate, numerous consumer cameras, including webcams and mobile phones, employ CMOS (complementary metal-oxide-semiconductor) sensors. However, these sensors are restricted to rolling shutter (RS) readout. With a consistent time delay between each row, RS exposes the sensor array row by row from top to bottom, as opposed to global shutter (GS) based on CCD sensors, which simultaneously reads out all rows of the sensor array. If the RS camera is moving while capturing the image, various distortions, such as skew, smear, or wobble, will break the reality of the original scene, which deviates from the pinhole camera paradigm. The unknown camera movements during the capturing process induce the so-called RS effects (also known as the jelly effect). In other words, an RS image is a row-by-row combination of GS images taken by a virtual moving GS camera throughout the camera readout time. The comparison of the RS camera and GS camera is shown in Figure 3.

[Figure 3: exposure and readout timing of Line 1 to Line N over the frame time T for a global shutter and a rolling shutter, marking the readout of the previous frame, the exposure of the current frame, and the readout of the current frame.]

Fig. 3. Comparison of the mechanism of global shutter camera and rolling shutter camera.

The RS camera can be regarded as a high-frequency sensor that produces sparse spatial information with rich temporal coverage conveyed by distortions [109]. Modeling the RS camera faces a common challenge of estimating the transformation between RS and GS images. Assume a 3D latent space-time volume captures the desired scene across the desired time period $[0, t_0]$ and creates a virtual GS image $I^{GS}$. We suppose the readout direction is from top to bottom; then the row-by-row readout RS imaging $I^{RS}$ can be expressed by:

$$I^{RS} = \sum_{y=1}^{H} M(I^{GS}_{t_r}, y), \tag{13}$$

where $H$ is the height of the RS image (total number of rows) and $y$ indicates the vertical coordinate. $M(\cdot, \cdot)$ masks a specific row in the GS image, in which $t_r$ represents the readout (offset) time for each row of RS.

On the other hand, by warping the RS features backward with an estimated displacement field, the GS image can be formulated by:

$$I^{GS}(x) = I^{RS}(x + F_{GS \to RS}(x)), \tag{14}$$

where $F_{GS \to RS} \in \mathbb{R}^2$ denotes the displacement field of the pixel $x$ from the GS image to the RS image.

The above formulations describe the rolling shutter camera model under a short exposure scenario. When the exposure time of the camera increases, motion blur effects occur in the captured image, jointly with the RS distortion:

$${I'}^{RS}_{(t)}[i] = \frac{1}{T} \int_{t - t_h + i t_r - T/2}^{t - t_h + i t_r + T/2} I^{GS}_{(t - t_h + i t_r)}[i] \, dt, \tag{15}$$

where ${I'}^{RS}_{(t)}[i]$ denotes the $i$-th row of the RS distortion image $I^{RS}$ with the middle moment of exposure at time $t$. $T$ indicates the exposure time of the camera and $t_h = (H/2) t_r$.

2.4 Cross-View Camera Model
The cross-view camera model is a type of multi-view camera system used in computer vision. It involves placing two or more cameras at opposite sides of a scene to capture multiple views of the same scene. This setup enables the
creation of 3D reconstructions of the scene by triangulating corresponding points from multiple camera views. The cross-view camera model is commonly used in surveillance, robotics, and augmented reality applications, and provides a more accurate and complete representation of the scene than what can be achieved with a single camera. Alternatively, a camera with stable movement can also be regarded as a cross-view camera model.

In a cross-view camera model, the captured images can be used to calculate the fundamental matrix and homography matrix, which are essential tools for 3D reconstruction, image rectification, and camera calibration.

Fundamental Matrix. Geometric relationships between the 3D points and their projections onto the 2D plane impose constraints on the image points when two cameras capture the same 3D scene from different perspectives. This intrinsic projective geometry can be embodied by a fundamental matrix $F$:

$$F = K_2^{-T} [t]_{\times} R K_1^{-1}. \tag{16}$$

This equation describes the epipolar geometry, where $K_1$ and $K_2$ indicate the intrinsic parameters of the two cameras, $R$ is the relative camera rotation, and $[t]_{\times}$ is the skew-symmetric (cross-product) matrix of the relative camera translation.

The fundamental matrix can be calculated from the correspondences of projected scene points by $q^T F p = 0$, in which $q$ and $p$ are the matching points derived from two views. Specifically, the eight-point algorithm [110] uses 8 point correspondences and enforces the rank-2 constraint using Singular Value Decomposition (SVD), computing a matrix with the minimum Frobenius distance.

Homography Matrix. Estimating a 2D homography matrix (or projective transformation) is an elemental geometric task for a pair of images that are captured from the same planar surface in a 3D scene with different perspectives. An invertible mapping from one image plane to another with eight degrees of freedom (two each for translation, rotation, scale, and lines at infinity) is known as a homography. Suppose that the homogeneous coordinates $x = [u, v, 1]^T \in \mathbb{R}^{3 \times 1}$ and $x' = [u', v', 1]^T \in \mathbb{R}^{3 \times 1}$ are points from two images that indicate the same point in the 3D scene. A non-singular $3 \times 3$ matrix can then represent a linear transformation that maps $x \Leftrightarrow x'$ as a planar projective transformation or homography $H$:

$$\begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} \sim \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \tag{17}$$

where the transformation can be simplified as $x' \sim Hx$. This transformation can be rewritten as the following two equations:

$$u' = \frac{h_{11} u + h_{12} v + h_{13}}{h_{31} u + h_{32} v + h_{33}}; \qquad v' = \frac{h_{21} u + h_{22} v + h_{23}}{h_{31} u + h_{32} v + h_{33}}. \tag{18}$$

describes the homography's rotational term and the vector $[h_{13}, h_{23}]^T$ denotes the translation transformation. Considering that the rotation and shear components typically have a smaller magnitude than the translation component, they will have a negligible impact on the loss function of the component elements, leading to an imbalanced training problem with a neural network. Instead, a 4-point parameterization [111] has been demonstrated to be more learning-friendly for learning-based homography estimation than the $3 \times 3$ parameterization. Suppose that the offsets of the image's vertices are $\Delta u_i = u'_i - u_i$ and $\Delta v_i = v'_i - v_i$; then the 4-point parameterization $\tilde{H}$ can describe a homography by:

$$\tilde{H} = \begin{bmatrix} \Delta u_1 & \Delta v_1 \\ \Delta u_2 & \Delta v_2 \\ \Delta u_3 & \Delta v_3 \\ \Delta u_4 & \Delta v_4 \end{bmatrix}. \tag{19}$$

The 4-point parameterization has eight variables, which are equivalent to the matrix formulation of the homography. It is straightforward to transform from $\tilde{H}$ to $H$ using the normalized Direct Linear Transform (DLT) [112] if the displacements of the four corners are known.

2.5 Cross-Sensor Model
Modern robots are often equipped with various sensors to provide a comprehensive understanding of the environment. These sensors capture scenes using different types of representations. For autonomous cars and robotics, cameras and Light Detection and Ranging (LiDAR) sensors are commonly used for vision tasks. The 3D LiDAR records long-range spatial data as sparse point clouds, while the camera captures texturally dense 2D color RGB images. Combining these sensors can facilitate 3D reconstruction and provide precise and robust perception for the robots, overcoming the limitations of individual sensors.

However, collision and vibration problems can occur when using different sensors in a robot or system. Additionally, the 3D point clouds cannot be effectively projected onto a 2D image without accurate extrinsic parameters, making it difficult to reliably correlate pixels in an image with depth information. Therefore, it is crucial to precisely calibrate the 2D-3D matching correspondences between pairs of temporally synchronized camera and LiDAR data.

The appropriate extrinsic calibration of the transformation (i.e., rotation and translation) between the camera and LiDAR in 6-DoF is a key condition for data fusion. To be more specific, the 3D LiDAR point cloud $PC = [X, Y, Z] \in \mathbb{R}^3$ can be projected onto the image plane by transforming it into the camera coordinate using the extrinsic matrix $T$
Previous methods [69], [74] point out that the above between the camera and LiDAR as well as camera intrinsic
conventional 3 × 3 parameterization H is not desirable for K . The inverse depth and the projected 2D coordinates can
training neural networks. Concretely, it is challenging to be represented as d = 1/Z and p = [u, v] ∈ R2 , respectively.
guarantee the non-singularity of H due to the significant Then, the camera-LiDAR model can be described by:
variance in the size of the members of the 3 × 3 homogra-    
phy matrix. Moreover, the rotation, translation, scale, and u fx (X̂/Ẑ) + cx
shear components of the homography transformation are v  =  
 fy (Ŷ /Ẑ) + cy  , (20)
mixed in H. For instance, the submatrix [h11 h12 ; h21 h22 ] d 1/Ẑ
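The camera-LiDAR model of Eq. 20 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of any surveyed method: the function names (`skew`, `so3_exp`, `se3_to_matrix`, `project_lidar`) and all numeric values are hypothetical. The extrinsic matrix is assembled from a 6-vector using the Rodrigues formula and the convention that the translation is taken directly from the vector, which is the parameterization described by Eqs. 22-24; points behind the camera are discarded, which is an added practical choice.

```python
import numpy as np


def skew(w):
    """Return the 3x3 skew-symmetric matrix of a 3-vector."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])


def so3_exp(w, eps=1e-12):
    """Rodrigues formula: rotation vector in so(3) -> rotation matrix in SO(3)."""
    w = np.asarray(w, dtype=float)
    theta = np.linalg.norm(w)
    if theta < eps:
        return np.eye(3)  # near-zero rotation: identity
    W = skew(w)
    return (np.eye(3)
            + np.sin(theta) / theta * W
            + (1.0 - np.cos(theta)) / theta**2 * (W @ W))


def se3_to_matrix(xi):
    """Build a 4x4 rigid transform from xi = (v, w), taking t = v directly."""
    xi = np.asarray(xi, dtype=float)
    T = np.eye(4)
    T[:3, :3] = so3_exp(xi[3:])
    T[:3, 3] = xi[:3]
    return T


def project_lidar(points, T, fx, fy, cx, cy):
    """Project Nx3 LiDAR points to rows (u, v, d), d being the inverse depth."""
    points = np.asarray(points, dtype=float)
    homo = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = (T @ homo.T).T[:, :3]   # points in the camera frame
    cam = cam[cam[:, 2] > 0]      # keep points in front of the camera
    X, Y, Z = cam[:, 0], cam[:, 1], cam[:, 2]
    return np.stack([fx * X / Z + cx, fy * Y / Z + cy, 1.0 / Z], axis=1)


# Hypothetical extrinsics: 0.1 m shift along x, 90-degree rotation about z.
xi = np.array([0.1, 0.0, 0.0, 0.0, 0.0, np.pi / 2])
T = se3_to_matrix(xi)

# Two hypothetical LiDAR points and pinhole intrinsics.
pts = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, 10.0]])
uvd = project_lidar(pts, T, fx=700.0, fy=700.0, cx=640.0, cy=360.0)
print(uvd)
```

Each output row holds the pixel coordinates and the inverse depth of one LiDAR point, so a sparse inverse-depth map can be filled by writing the third column at the rounded pixel locations.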
In Eq. 20, $(f_x, f_y)$ and $(c_x, c_y)$ indicate the focal lengths and the image center as listed in Eq. 4, and $[\hat{X}, \hat{Y}, \hat{Z}]$ is the transformed point cloud $\hat{P}_C$ using the estimated extrinsic matrix:

\[ [\hat{X}, \hat{Y}, \hat{Z}, 1]^T = T [X, Y, Z, 1]^T. \tag{21} \]

Most deep learning works exploit the Lie algebra to parameterize the camera-LiDAR extrinsic calibration parameters. In particular, the output of the calibration network is a $1 \times 6$ vector $\xi = (v, \omega) \in se(3)$ in which $v$ is the translation vector, and $\omega$ is the rotation vector. To recover the original objectives, the rotation vector in $so(3)$ should be transformed to its corresponding rotation matrix. Suppose that $\omega = (\omega_1, \omega_2, \omega_3)^T$. An element $\omega \in so(3)$ can be transformed to $SO(3)$ using the exponential map by:

\[ \exp : so(3) \rightarrow SO(3); \quad \hat{\omega} \mapsto e^{\hat{\omega}}, \tag{22} \]

where $\hat{\omega}$ and $e^{\hat{\omega}}$ denote the skew-symmetric matrix from $\omega$ and the Taylor series expansion for the matrix exponential function, respectively. Then, the rotation matrix can be formed in $SO(3)$, and its Rodrigues formula is derived from the above equation by:

\[ R = e^{\hat{\omega}} = I + \frac{\hat{\omega}}{\|\omega\|} \sin \|\omega\| + \frac{\hat{\omega}^2}{\|\omega\|^2} (1 - \cos \|\omega\|). \tag{23} \]

Thus, the 3D rigid body transformation $T \in SE(3)$ between camera and LiDAR can be represented by:

\[ T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}, \quad \text{where } R \in SO(3),\ t \triangleq v \in \mathbb{R}^3. \tag{24} \]

3 MORE FUTURE DIRECTIONS

3.1 Dataset
One of the main challenges of learning-based camera calibrations is the difficulty in constructing datasets with high accuracy. This requires laborious manual intervention to obtain real-world data with labels. As we summarized, approximately 70% of the works rely on synthesized datasets. However, the significant differences between synthesized and real-world datasets cannot be ignored, leading to domain gaps in the learned models. Therefore, the construction of a standardized, large-scale calibration dataset would significantly benefit this community. Recent works have demonstrated that well-designed learning strategies, such as semi-supervised learning [62], self-supervised learning [17], [77], and unsupervised learning [74], [81], can help address the demand for annotations in learning-based camera calibrations. These strategies also have the potential to discover additional calibration priors within the data itself.

3.2 Transfer learning
The advancements in deep learning have led to the development of transfer learning techniques, which could facilitate the transfer of knowledge learned from one camera to another. This approach can significantly speed up and streamline the calibration process, making it more efficient and cost-effective. Transfer learning can be especially useful in applications that involve multiple cameras or mobile devices. For example, in a multi-camera system, transfer learning can be used to calibrate all the cameras using the data collected from a single camera, reducing the time and effort required for calibration. Similarly, in mobile devices, transfer learning can enable faster and more accurate calibration of the camera, resulting in improved image quality and performance.

3.3 Robustness to noise and outliers
Another promising application of deep learning in camera calibration is improving the robustness of calibration to noise and outliers in the data. This approach can help ensure accurate calibration even in challenging environments, with low-quality data or noisy sensor readings. Conventionally, camera calibration algorithms are sensitive to noise and outliers in the data, which can lead to significant errors in the estimated camera parameters. However, with the application of deep learning, it is possible to learn more robust and accurate models that can better handle noise and outliers in the data. For instance, regularization techniques can be used to impose constraints on the learned parameters, preventing overfitting and enhancing the generalization ability of the model. Moreover, outlier detection techniques can be used to identify and exclude data points that are likely to be outliers, reducing their impact on the calibration process. This can be achieved using various statistical and machine-learning methods, such as clustering, classification, and regression.

3.4 Online calibration
With the rapid development of deep learning, online camera calibration is becoming more efficient and practical. This technique involves updating the calibration parameters in real-time, allowing for better performance as the camera moves or as the environment changes. This can be achieved using deep learning algorithms that can learn the complex relationships between the camera parameters and the image data. Learning-based camera calibration has the potential to revolutionize various industries, such as robotics and augmented reality. In robotics, online calibration can improve the accuracy of robot vision, which is crucial for tasks such as object detection and manipulation. Similarly, in augmented reality, online calibration can enhance the user experience by ensuring that virtual objects are correctly aligned with the real world. This can help create more realistic and immersive AR applications, which have numerous practical applications in fields such as entertainment, education, and training.

3.5 Multimodal calibration
The potential of deep learning techniques in camera calibration goes beyond traditional photography and computer vision applications. It could also be applied to calibrate cameras with other sensors, such as remote sensing, infrared sensors, or radar. This advancement could lead to more precise and robust perception in various applications, including but not limited to autonomous driving, where multiple sensors are used. Incorporating deep learning-based calibration methods with multiple sensors could enhance the accuracy of the fusion of data from different sources. It could facilitate more accurate perception in challenging
environments such as low-light conditions, occlusions, and adverse weather conditions. Furthermore, the ability to calibrate multiple sensors with deep learning methods could provide more reliable and consistent results compared to traditional calibration techniques.

These are a few potential directions for future research in camera calibration with deep learning. As the field continues to evolve, there may be many other exciting avenues for exploration and innovation. In addition, it is also thrilling to see how this technology will continue to impact various industries in the future.

REFERENCES

[1] S. Workman, C. Greenwell, M. Zhai, R. Baltenberger, and N. Jacobs, “Deepfocal: A method for direct focal length estimation,” in 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 1369–1373.
[2] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
[3] S. Workman, M. Zhai, and N. Jacobs, “Horizon lines in the wild,” arXiv preprint arXiv:1604.02129, 2016.
[4] M. Zhai, S. Workman, and N. Jacobs, “Detecting vanishing points using global image context in a non-manhattan world,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[5] C.-K. Chang, J. Zhao, and L. Itti, “Deepvp: Deep learning for vanishing point detection on 1 million street view images,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 4496–4503.
[6] W. Xian, Z. Li, M. Fisher, J. Eisenmann, E. Shechtman, and N. Snavely, “Uprightnet: Geometry-aware camera orientation estimation from single images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[7] J. Lee, M. Sung, H. Lee, and J. Kim, “Neural geometric parser for single image camera calibration,” in European Conference on Computer Vision. Springer, 2020, pp. 541–557.
[8] Y. Zhou, H. Qi, J. Huang, and Y. Ma, “Neurvps: Neural vanishing point scanning via conic convolution,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[9] R. Jung, A. S. J. Lee, A. Ashtari, and J.-C. Bazin, “Deep360up: A deep learning-based approach for automatic vr image upright adjustment,” in 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), 2019, pp. 1–8.
[10] B. Davidson, M. S. Alvi, and J. F. Henriques, “360° camera alignment via segmentation,” in European Conference on Computer Vision. Springer, 2020, pp. 579–595.
[11] Y.-Y. Jau, R. Zhu, H. Su, and M. Chandraker, “Deep keypoint-based camera pose estimation with geometric constraints,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 4950–4957.
[12] M. Baradad and A. Torralba, “Height and uprightness invariance for 3d prediction from a single view,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[13] Q. Zheng, J. Chen, Z. Lu, B. Shi, X. Jiang, K.-H. Yap, L.-Y. Duan, and A. C. Kot, “What does plate glass reveal about camera calibration?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[14] R. Zhu, X. Yang, Y. Hold-Geoffroy, F. Perazzi, J. Eisenmann, K. Sunkavalli, and M. Chandraker, “Single view metrology in the wild,” in European Conference on Computer Vision. Springer, 2020, pp. 316–333.
[15] Y. Gil, S. Elmalem, H. Haim, E. Marom, and R. Giryes, “Online training of stereo self-calibration using monocular depth estimation,” IEEE Transactions on Computational Imaging, vol. 7, pp. 812–823, 2021.
[16] S. Garg, D. P. Mohanty, S. P. Thota, and S. Moharana, “A simple approach to image tilt correction with self-attention mobilenet for smartphones,” arXiv preprint arXiv:2111.00398, 2021.
[17] J. Fang, I. Vasiljevic, V. Guizilini, R. Ambrus, G. Shakhnarovich, A. Gaidon, and M. R. Walter, “Self-supervised camera self-calibration from video,” arXiv preprint arXiv:2112.03325, 2021.
[18] T. H. Butt and M. Taj, “Camera calibration through camera projection loss,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 2649–2653.
[19] G. Ponimatkin, Y. Labbé, B. Russell, M. Aubry, and J. Sivic, “Focal length and object pose estimation via render and compare,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3825–3834.
[20] K. Chen, N. Snavely, and A. Makadia, “Wide-baseline relative camera pose estimation with directional learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3258–3268.
[21] Y. Lin, R. Wiersma, S. L. Pintea, K. Hildebrandt, E. Eisemann, and J. C. van Gemert, “Deep vanishing point detection: Geometric priors make dataset variations vanish,” arXiv preprint arXiv:2203.08586, 2022.
[22] T. Do, O. Miksik, J. DeGol, H. S. Park, and S. N. Sinha, “Learning to detect scene landmarks for camera localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 132–11 142.
[23] J. Lee, H. Go, H. Lee, S. Cho, M. Sung, and J. Kim, “Ctrl-c: Camera calibration transformer with line-classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 16 228–16 237.
[24] M. Kocabas, C.-H. P. Huang, J. Tesch, L. Müller, O. Hilliges, and M. J. Black, “Spec: Seeing people in the wild with an estimated camera,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 11 035–11 045.
[25] C. M. Parameshwara, G. Hari, C. Fermüller, N. J. Sanket, and Y. Aloimonos, “Diffposenet: Direct differentiable camera pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6845–6854.
[26] L. Yang, R. Shrestha, W. Li, S. Liu, G. Zhang, Z. Cui, and P. Tan, “Scenesqueezer: Learning to compress scene for camera relocalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8259–8268.
[27] J. Rong, S. Huang, Z. Shang, and X. Ying, “Radial lens distortion correction using convolutional neural networks trained with synthesized images,” in Asian Conference on Computer Vision. Springer, 2016, pp. 35–49.
[28] Y. Hold-Geoffroy, K. Sunkavalli, J. Eisenmann, M. Fisher, E. Gambaretto, S. Hadap, and J.-F. Lalonde, “A perceptual measure for deep single image camera calibration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[29] O. Bogdan, V. Eckstein, F. Rameau, and J.-C. Bazin, “Deepcalib: a deep learning approach for automatic intrinsic calibration of wide field-of-view cameras,” in Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, 2018, pp. 1–10.
[30] V. Rengarajan, Y. Balaji, and A. Rajagopalan, “Unrolling the shutter: Cnn to correct motion distortions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2291–2299.
[31] X. Yin, X. Wang, J. Yu, M. Zhang, P. Fua, and D. Tao, “Fisheyerecnet: A multi-context collaborative deep network for fisheye image rectification,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[32] Y. Shi, D. Zhang, J. Wen, X. Tong, X. Ying, and H. Zha, “Radial lens distortion correction by adding a weight layer with inverted foveal models to convolutional neural networks,” in 2018 24th International Conference on Pattern Recognition (ICPR), 2018, pp. 1–6.
[33] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “Dr-gan: Automatic radial distortion rectification using conditional gan in real-time,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 725–733, 2020.
[34] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “Distortion rectification from static to dynamic: A distortion sequence construction perspective,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 3870–3882, 2020.
[35] S. Yang, C. Lin, K. Liao, Y. Zhao, and M. Liu, “Unsupervised fisheye image correction through bidirectional loss with geometric prior,” Journal of Visual Communication and Image Representation, vol. 66, p. 102692, 2020.
[36] X. Li, B. Zhang, P. V. Sander, and J. Liao, “Blind geometric distortion correction on images through deep learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[37] B. Zhuang, Q.-H. Tran, P. Ji, L.-F. Cheong, and M. Chandraker, “Learning structure-and-motion-aware rolling shutter correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[38] Z. Xue, N. Xue, G.-S. Xia, and W. Shen, “Learning to calibrate straight lines for fisheye image rectification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[39] B. Zhuang, Q.-H. Tran, G. H. Lee, L. F. Cheong, and M. Chandraker, “Degeneracy in self-calibration revisited and a deep learning solution for uncalibrated slam,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 3766–3773.
[40] M. Lopez, R. Mari, P. Gargallo, Y. Kuang, J. Gonzalez-Jimenez, and G. Haro, “Deep single image camera calibration with radial distortion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[41] Y. Zhao, Z. Huang, T. Li, W. Chen, C. LeGendre, X. Ren, A. Shapiro, and H. Li, “Learning perspective undistortion of portraits,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[42] K. Liao, C. Lin, Y. Zhao, and M. Xu, “Model-free distortion rectification framework bridged by distortion distribution map,” IEEE Transactions on Image Processing, vol. 29, pp. 3707–3718, 2020.
[43] A. Cramariuc, A. Petrov, R. Suri, M. Mittal, R. Siegwart, and C. Cadena, “Learning camera miscalibration detection,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 4997–5003.
[44] C. Zhang, F. Rameau, J. Kim, D. M. Argaw, J.-C. Bazin, and I. S. Kweon, “Deepptz: Deep self-calibration for ptz cameras,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.
[45] C.-H. Chao, P.-L. Hsu, H.-Y. Lee, and Y.-C. F. Wang, “Self-supervised deep learning for fisheye image rectification,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 2248–2252.
[46] Y. Shi, X. Tong, J. Wen, H. Zhao, X. Ying, and H. Zha, “Position-aware and symmetry enhanced gan for radial distortion correction,” in 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 1701–1708.
[47] H. Zhao, Y. Shi, X. Tong, X. Ying, and H. Zha, “A simple yet effective pipeline for radial distortion correction,” in 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 878–882.
[48] Y.-H. Li, I.-C. Lo, and H. H. Chen, “Deep face rectification for 360° dual-fisheye cameras,” IEEE Transactions on Image Processing, vol. 30, pp. 264–276, 2021.
[49] H. Zhao, X. Ying, Y. Shi, X. Tong, J. Wen, and H. Zha, “Rdcface: Radial distortion correction for face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[50] Z.-C. Xue, N. Xue, and G.-S. Xia, “Fisheye distortion rectification from deep straight lines,” arXiv preprint arXiv:2003.11386, 2020.
[51] P. Liu, Z. Cui, V. Larsson, and M. Pollefeys, “Deep shutter unrolling network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5941–5949.
[52] K. Liao, C. Lin, and Y. Zhao, “A deep ordinal distortion estimation approach for distortion rectification,” IEEE Transactions on Image Processing, vol. 30, pp. 3362–3375, 2021.
[53] K. Zhao, C. Lin, K. Liao, S. Yang, and Y. Zhao, “Revisiting radial distortion rectification in polar-coordinates: A new and efficient learning perspective,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2021.
[54] J. Zhao, S. Wei, L. Liao, and Y. Zhao, “Dqn-based gradual fisheye image rectification,” Pattern Recognition Letters, vol. 152, pp. 129–134, 2021.
[55] J. Tan, S. Zhao, P. Xiong, J. Liu, H. Fan, and S. Liu, “Practical wide-angle portraits correction with deep structured models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 3498–3506.
[56] S. Yang, C. Lin, K. Liao, C. Zhang, and Y. Zhao, “Progressively complementary network for fisheye image rectification using appearance flow,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 6348–6357.
[57] J. Fan, J. Zhang, and D. Tao, “Sir: Self-supervised image rectification via seeing the same scene from multiple different lenses,” IEEE Transactions on Image Processing, 2022.
[58] K. Liao, C. Lin, L. Liao, Y. Zhao, and W. Lin, “Multi-level curriculum for training a distortion-aware barrel distortion rectification model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 4389–4398.
[59] N. Wakai and T. Yamashita, “Deep single fisheye image camera calibration for over 180-degree projection of field of view,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, October 2021, pp. 1174–1183.
[60] N. Wakai, S. Sato, Y. Ishii, and T. Yamashita, “Rethinking generic camera models for deep single image camera calibration to recover rotation and fisheye distortion,” in Proceedings of European Conference on Computer Vision (ECCV), vol. 13678, 2022, pp. 679–698.
[61] Z. Zhong, Y. Zheng, and I. Sato, “Towards rolling shutter correction and deblurring in dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9219–9228.
[62] F. Zhu, S. Zhao, P. Wang, H. Wang, H. Yan, and S. Liu, “Semi-supervised wide-angle portraits correction by multi-scale transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 689–19 698.
[63] M. Cao, Z. Zhong, J. Wang, Y. Zheng, and Y. Yang, “Learning adaptive warping for real-world rolling shutter correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 785–17 793.
[64] X. Zhou, P. Duan, Y. Ma, and B. Shi, “Evunroll: Neuromorphic events based rolling shutter image correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 775–17 784.
[65] B. Fan and Y. Dai, “Inverting a rolling shutter camera: bring rolling shutter images to high framerate global shutter video,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4228–4237.
[66] B. Fan, Y. Dai, and M. He, “Sunet: symmetric undistortion network for rolling shutter correction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4541–4550.
[67] Y. Zhang, X. Zhao, and D. Qian, “Learning-based framework for camera calibration with distortion correction and high precision feature detection,” arXiv preprint arXiv:2202.00158, 2022.
[68] Y. Shangrong, L. Chunyu, L. Kang, and Z. Yao, “Fishformer: Annulus slicing-based transformer for fisheye rectification with efficacy domain exploration,” arXiv preprint arXiv:2207.01925, 2022.
[69] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Deep image homography estimation,” arXiv preprint arXiv:1606.03798, 2016.
[70] C.-H. Chang, C.-N. Chou, and E. Y. Chang, “Clkn: Cascaded lucas-kanade networks for image alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[71] F. Erlik Nowruzi, R. Laganiere, and N. Japkowicz, “Homography estimation from image pairs with hierarchical convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
[72] R. Ranftl and V. Koltun, “Deep fundamental matrix estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[73] O. Poursaeed, G. Yang, A. Prakash, Q. Fang, H. Jiang, B. Hariharan, and S. Belongie, “Deep fundamental matrix estimation without correspondences,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.
[74] T. Nguyen, S. W. Chen, S. S. Shivakumar, C. J. Taylor, and V. Kumar, “Unsupervised deep homography: A fast and robust homography estimation model,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2346–2353, 2018.
[75] R. Zeng, S. Denman, S. Sridharan, and C. Fookes, “Rethinking planar homography estimation using perspective fields,” in Asian Conference on Computer Vision. Springer, 2018, pp. 571–586.
[76] Y. Li, W. Pei, and Z. He, “Srhen: stepwise-refining homography estimation network via parsing geometric correspondences in deep latent space,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 3063–3071.
[77] X. Wang, C. Wang, B. Liu, X. Zhou, L. Zhang, J. Zheng, and X. Bai, “Multi-view stereo in the deep learning era: A comprehensive review,” Displays, vol. 70, p. 102102, 2021.
[78] S. Ammar Abbas and A. Zisserman, “A geometric approach to obtain a bird’s eye view from an image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
[79] L. Sha, J. Hobbs, P. Felsen, X. Wei, P. Lucey, and S. Ganguly, “End-to-end camera calibration for broadcast videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[80] H. Le, F. Liu, S. Zhang, and A. Agarwala, “Deep homography estimation for dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[81] J. Zhang, C. Wang, S. Liu, L. Jia, N. Ye, J. Wang, J. Zhou, and J. Sun, “Content-aware unsupervised deep homography estimation,” in European Conference on Computer Vision. Springer, 2020, pp. 653–669.
[82] Y. Zhao, X. Huang, and Z. Zhang, “Deep lucas-kanade homography for multimodal image alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 15 950–15 959.
[83] R. Shao, G. Wu, Y. Zhou, Y. Fu, L. Fang, and Y. Liu, “Localtrans: A multiscale local transformer network for cross-resolution homography estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 14 890–14 899.
[84] N. Ye, C. Wang, H. Fan, and S. Liu, “Motion basis learning for unsupervised deep homography estimation with subspace projection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 13 117–13 125.
[85] Y. Chen, G. Wang, P. An, Z. You, and X. Huang, “Fast and accurate homography estimation using extendable compression network,” in 2021 IEEE International Conference on Image Processing (ICIP), 2021, pp. 1024–1028.
[86] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Depth-aware multi-grid deep homography estimation with contextual correlation,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2021.
[87] S.-Y. Cao, J. Hu, Z. Sheng, and H.-L. Shen, “Iterative deep homography estimation,” arXiv preprint arXiv:2203.15982, 2022.
[88] M. Hong, Y. Lu, N. Ye, C. Lin, Q. Zhao, and S. Liu, “Unsupervised homography estimation with coplanarity-aware gan,” arXiv preprint arXiv:2205.03821, 2022.
[89] S. Liu, N. Ye, C. Wang, K. Luo, J. Wang, and J. Sun, “Content-aware unsupervised deep homography estimation and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2022.
[90] N. Schneider, F. Piewak, C. Stiller, and U. Franke, “Regnet: Multimodal sensor registration using deep neural networks,” in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 1803–1810.
[91] G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna, “Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1110–1117.
[92] K. Yuan, Z. Guo, and Z. J. Wang, “Rggnet: Tolerance aware lidar-camera online calibration with geometric deep learning and generative model,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6956–6963, 2020.
[93] Y. Zhu, C. Li, and Y. Zhang, “Online camera-lidar calibration with sensor semantic information,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 4970–4976.
[94] W. Wang, S. Nobuhara, R. Nakamura, and K. Sakurada, “Soic: Semantic online initialization and calibration for lidar and camera,” arXiv preprint arXiv:2003.04260, 2020.
[95] J. Shi, Z. Zhu, J. Zhang, R. Liu, Z. Wang, S. Chen, and H. Liu, “Calibrcnn: Calibrating camera and lidar by recurrent convolutional neural network and geometric constraints,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10 197–10 202.
[96] S. Wu, A. Hadachi, D. Vivet, and Y. Prabhakar, “Netcalib: A novel approach for lidar-camera auto-calibration based on deep learning,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 6648–6655.
[97] X. Lv, B. Wang, Z. Dou, D. Ye, and S. Wang, “Lccnet: Lidar and camera self-calibration using cost volume network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2894–2901.
[98] X. Lv, S. Wang, and D. Ye, “Cfnet: Lidar-camera registration using calibration flow network,” Sensors, vol. 21, no. 23, p. 8112, 2021.
[99] Z. Liu, H. Tang, S. Zhu, and S. Han, “Semalign: Annotation-free camera-lidar calibration with semantic alignment loss,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 8845–8851.
[100] X. Jing, X. Ding, R. Xiong, H. Deng, and Y. Wang, “Dxq-net: Differentiable lidar-camera extrinsic calibration using quality-aware flow,” arXiv preprint arXiv:2203.09385, 2022.
[101] K. Akio, Z. Yiyang, Z. Pengwei, Z. Wei, and T. Masayoshi, “Sst-calib: Simultaneous spatial-temporal parameter calibration between lidar and camera,” arXiv preprint arXiv:2207.03704, 2022.
[102] Y. Sun, J. Li, Y. Wang, X. Xu, X. Yang, and Z. Sun, “Atop: An attention-to-optimization approach for automatic lidar-camera calibration via cross-modal object matching,” IEEE Transactions on Intelligent Vehicles, 2022.
[103] G. Wang, J. Qiu, Y. Guo, and H. Wang, “Fusionnet: Coarse-to-fine extrinsic calibration network of lidar and camera with hierarchical point-pixel fusion,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 8964–8970.
[104] C. Ye, H. Pan, and H. Gao, “Keypoint-based lidar-camera online calibration with robust geometric network,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–11, 2021.
[105] S. T. Barnard, “Interpreting perspective images,” Artificial Intelligence, vol. 21, no. 4, pp. 435–462, 1983.
[106] J. Fan, J. Zhang, S. J. Maybank, and D. Tao, “Wide-angle image rectification: a survey,” International Journal of Computer Vision, vol. 130, no. 3, pp. 747–776, 2022.
[107] A. Wang, T. Qiu, and L. Shao, “A simple method of radial distortion correction with centre of distortion estimation,” Journal of Mathematical Imaging and Vision, vol. 35, no. 3, pp. 165–172, 2009.
[108] A. W. Fitzgibbon, “Simultaneous linear estimation of multiple view geometry and lens distortion,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, vol. 1. IEEE, 2001, pp. I–I.
[109] O. Ait-Aider, N. Andreff, J. M. Lavest, and P. Martinet, “Simultaneous object pose and velocity computation using a single view from a rolling shutter camera,” in European Conference on Computer Vision. Springer, 2006, pp. 56–68.
[110] H. C. Longuet-Higgins, “A computer algorithm for reconstructing a scene from two projections,” Nature, vol. 293, no. 5828, pp. 133–135, 1981.
[111] S. Baker, A. Datta, and T. Kanade, “Parameterizing homographies,” Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-06-11, 2006.
[112] B. K. Horn, “The direct linear transformation from comparator coordinates into object-space coordinates in close-range photogrammetry,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 42, no. 3, pp. 125–133, 1987.
