Abstract— Camera calibration involves estimating camera parameters to infer geometric features from captured sequences, which is crucial for computer vision and robotics. However, conventional calibration is laborious and requires dedicated data collection. Recent efforts show that learning-based solutions have the potential to replace the repetitive work of manual calibration. Among these solutions, various learning strategies, networks, geometric priors, and datasets have been investigated. In this paper, we provide a comprehensive survey of learning-based camera calibration techniques, analyzing their strengths and limitations. Our main calibration categories include the standard pinhole camera model, distortion camera model, cross-view model, and cross-sensor model, following the research trend and extended applications. As there is no benchmark in this community, we collect a holistic calibration dataset that can serve as a public platform to evaluate the generalization of existing methods. It comprises both synthetic and real-world data, with images and videos captured by different cameras in diverse scenes. Toward the end of this paper, we discuss the challenges and provide further research directions. To our knowledge, this is the first survey of learning-based camera calibration (spanning eight years). The summarized methods, datasets, and benchmarks are available and will be regularly updated at https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.

Index Terms—Camera calibration, Deep learning, Computational photography, Multiple view geometry, Robotics.
1 INTRODUCTION

Camera calibration has a long research history [1], [2], [3], [4], tracing back to around 60 years ago [5]. The first step for many vision and robotics tasks is to calibrate the camera.

(Figure: illustration of camera types and models — wide-angle camera, global shutter, and rolling shutter — with the projection of a 3D point Pi onto the image plane.)
Traditional calibration typically relies on detecting features of a dedicated target and is therefore constrained by the limits of the feature detectors, which can be influenced by diverse lighting conditions and textures. Since there are many standard techniques for calibrating cameras in an industry or laboratory setting [16], [17], this process is usually ignored in recent developments. However, calibrating single images in the wild remains challenging, especially when images are collected from websites or captured by unknown camera models. This challenge motivates researchers to investigate a new paradigm.

Recently, deep learning has brought new inspiration to camera calibration and its applications. Learning-based methods achieve state-of-the-art performance on various tasks with higher efficiency. In particular, diverse deep neural networks (DNNs) have been developed, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), PointNet, and vision transformers (ViTs), whose high-level semantic features show more powerful representation capability than hand-crafted features. Moreover, diverse learning strategies have been exploited to boost the geometric perception of neural networks. Learning-based methods offer a fully automatic camera calibration solution, without manual intervention or calibration targets, which sets them apart from traditional methods. Furthermore, some of these methods achieve camera-model-free and label-free calibration, showing promising and meaningful applications.

With the rapid increase in the number of learning-based camera calibration methods, it has become increasingly challenging to keep up with new advances. Consequently, there is an urgent need to analyze existing works and foster a community dedicated to this field. Previously, some surveys, e.g., [18], [19], [20], only focused on a specific task or camera, or on one type of approach. For instance, Salvi et al. [18] reviewed traditional camera calibration methods in terms of their algorithms. Hughes et al. [19] provided a detailed review of calibrating fisheye cameras with traditional solutions. While Fan et al. [20] discussed both traditional and deep learning methods, their survey only considers calibrating wide-angle cameras. In addition, due to the small number of reviewed learning-based methods (around 10 papers), readers can hardly picture the development trend of general camera calibration from Fan et al. [20].

In this paper, we provide a comprehensive and in-depth overview of recent advances in learning-based camera calibration, covering over 100 papers. We also discuss potential directions for further improvements and examine various types of cameras and targets. To facilitate future research on different topics, we categorize the current solutions according to calibration objectives and applications. In addition to fundamental parameters such as focal length, rotation, and translation, we also provide detailed reviews of correcting image distortion (radial distortion and rolling shutter distortion), estimating cross-view mapping, calibrating camera-LiDAR systems, and other applications. Such a trend follows the development of cameras and the market demands of virtual reality, autonomous driving, neural rendering, etc.

To our best knowledge, this is the first survey of learning-based camera calibration and its extended applications. It has the following unique contributions. (1) Our work mainly follows recent advances in deep learning-based camera calibration. In-depth analysis and discussion in various aspects are offered, including publications, network architecture, loss functions, datasets, evaluation metrics, learning strategies, implementation platforms, etc. The detailed information of each work is listed in Table 1. (2) Besides the calibration algorithms, we comprehensively review the classical camera models and their extended models. In particular, we summarize the redesigned calibration objectives in deep learning, since some traditional calibration objectives are verified to be hard to learn by neural networks. (3) We collect a dataset containing images and videos captured by different cameras in different environments, which can serve as a platform to evaluate the generalization of existing methods. (4) We discuss the open challenges of learning-based camera calibration and propose some future directions to provide guidance for further research in this field. (5) An open-source repository is created that provides a taxonomy of all reviewed works and benchmarks. The repository will be updated regularly at https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.

In the following sections, we discuss and analyze various aspects of learning-based camera calibration. The remainder of this paper is organized as follows. In Section 2, we provide the concrete learning paradigms and learning strategies of learning-based camera calibration. Subsequently, we introduce and discuss the specific methods based on the standard camera model, distortion model, cross-view model, and cross-sensor model in Section 3, Section 4, Section 5, and Section 6, respectively (see Figure 2). The collected benchmark for calibration methods is described in Section 7. Finally, we conclude the survey and suggest future directions for this community in Section 8.

2 PRELIMINARIES

Deep learning has brought new inspiration to camera calibration, enabling a fully automatic calibration procedure without manual intervention. Here, we first summarize two prevalent paradigms in learning-based camera calibration: regression-based calibration and reconstruction-based calibration. Then, the widely used learning strategies in this research field are reviewed. The detailed definitions of classical camera models and their corresponding calibration objectives are given in the supplementary material.

2.1 Learning Paradigm

Driven by different architectures of the neural network, researchers have developed two main paradigms for learning-based camera calibration and its applications.

Regression-based Calibration Given an uncalibrated input, regression-based calibration first extracts high-level semantic features using stacked convolutional layers. Then, fully connected layers aggregate the semantic features and form a vector of the estimated calibration objective. The regressed parameters are used to conduct subsequent tasks such as distortion rectification, image warping, camera localization, etc. This paradigm is the earliest and plays a dominant role in learning-based camera calibration and its applications. The first works on various objectives, e.g., intrinsics: DeepFocal [21], extrinsics: PoseNet [22], radial distortion: Rong et al. [23], rolling shutter distortion: URS-CNN [24], homography matrix: DHN [25], hybrid parameters: Hold-Geoffroy et al. [26], and camera-LiDAR parameters: RegNet [27], were all achieved with this paradigm.
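To make the regression paradigm concrete, the following is a minimal PyTorch-style sketch, not the architecture of any surveyed method: a small convolutional backbone extracts semantic features, fully connected layers regress a parameter vector, and an L2 loss supervises it with ground-truth labels. The backbone depth, the head sizes, and the choice of five regressed parameters are illustrative assumptions.

```python
# Minimal sketch of the regression-based calibration paradigm (illustrative only):
# a small convolutional backbone extracts semantic features, and fully connected
# layers regress a vector of calibration parameters from them.
import torch
import torch.nn as nn

class RegressionCalibNet(nn.Module):
    def __init__(self, num_params: int = 5):  # e.g., focal length + a 4-vector rotation
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),            # global pooling of semantic features
        )
        self.head = nn.Sequential(              # FC layers aggregate the features
            nn.Flatten(), nn.Linear(128, 64), nn.ReLU(inplace=True),
            nn.Linear(64, num_params),          # vector of calibration objectives
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(image))

# Supervised training step: an L2 loss against the parameter labels.
model = RegressionCalibNet()
image = torch.randn(8, 3, 224, 224)            # a dummy batch
gt_params = torch.randn(8, 5)                  # ground-truth calibration labels
loss = torch.nn.functional.mse_loss(model(image), gt_params)
loss.backward()
```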
Reconstruction-based Calibration On the other hand, the reconstruction-based calibration paradigm discards parameter regression and directly learns the pixel-level mapping function between the uncalibrated input and the target, inspired by conditional image-to-image translation [28] and dense visual perception [29], [30]. The reconstructed results are then evaluated with a pixel-wise loss against the ground truth. In this regard, most reconstruction-based calibration methods [31], [32], [33], [34] design their network architecture based on a fully convolutional network such as U-Net [35]. Specifically, an encoder-decoder network, with skip connections between encoder and decoder features at the same spatial resolution, progressively extracts features from low level to high level and effectively integrates multi-scale features. At the last convolutional layer, the learned features are aggregated into the target channels, reconstructing the calibrated result at the pixel level.

In contrast to the regression-based paradigm, the reconstruction-based paradigm does not require labels of diverse camera parameters. Besides, the loss-imbalance problem can be eliminated since it only optimizes the photometric loss of the calibrated results. Therefore, the reconstruction-based paradigm enables blind camera calibration without a strong camera-model assumption.
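A minimal sketch of the reconstruction-based paradigm, under the same caveat: a tiny U-Net-like encoder-decoder with a single skip connection maps the uncalibrated image directly to the calibrated result and is trained with a pixel-wise photometric loss. The layer sizes are illustrative, not those of [31], [32], [33], or [34].

```python
# Minimal sketch of the reconstruction-based paradigm (illustrative only): a tiny
# U-Net-like encoder-decoder with one skip connection maps the uncalibrated image
# directly to the calibrated result and is supervised by a pixel-wise loss.
import torch
import torch.nn as nn

class TinyRectifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(64, 3, 3, padding=1)     # 64 = decoder + skip features

    def forward(self, x):
        f1 = self.enc1(x)                    # low-level features (kept for the skip)
        f2 = self.down(f1)                   # higher-level features at half resolution
        up = self.up(f2)                     # back to the input resolution
        fused = torch.cat([up, f1], dim=1)   # skip connection at the same spatial scale
        return self.out(fused)               # calibrated result at the pixel level

model = TinyRectifier()
distorted = torch.randn(4, 3, 128, 128)
target = torch.randn(4, 3, 128, 128)         # paired ground-truth image
photometric_loss = (model(distorted) - target).abs().mean()   # pixel-wise L1
photometric_loss.backward()
```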
2.2 Learning Strategies

In the following, we review the learning-based camera calibration literature with regard to different learning strategies.

Supervised Learning Most learning-based camera calibration methods train their networks with a supervised learning strategy, from the classical methods [21], [22], [23], [25], [36], [37] to the state-of-the-art methods [38], [39], [40], [41], [42]. In terms of the learning paradigm, this strategy supervises the network with the ground truth of the camera parameters (regression-based paradigm) or with paired data (reconstruction-based paradigm). In general, these methods synthesize the training dataset from other large-scale datasets, using random parameter/transformation sampling and camera-model simulation. Some recent works [43], [44], [45], [46] establish their training dataset with a real-world setup and label the captured images with manual annotations, thereby fostering advancements in this research domain.

Semi-Supervised Learning Training the network using an annotated dataset under diverse scenarios is an effective learning strategy. However, human annotation can be prone to errors, leading to inconsistent annotation quality or the inclusion of contaminated data. Consequently, enlarging the training dataset to improve performance can be challenging due to the complexity and cost of constructing the dataset. To address this challenge, SS-WPC [47] proposes a semi-supervised method for correcting portraits captured by a wide-angle camera. It employs a surrogate task (segmentation) and a semi-supervised scheme that utilizes direction and range consistency and regression consistency to leverage both labeled and unlabeled data.

Weakly-Supervised Learning Although significant progress has been made, data labeling for camera calibration is a notoriously costly process, and obtaining perfect ground-truth labels is challenging. As a result, it is often preferable to use weak supervision with machine learning methods. Weakly supervised learning refers to the process of building prediction models through learning with inadequate supervision. Zhu et al. [48] present a weakly supervised camera calibration method for single-view metrology in unconstrained environments, where there is only one accessible image of a scene composed of objects of uncertain sizes. This work leverages 2D object annotations from large-scale datasets, where people and buildings are frequently present and serve as useful “reference objects” for determining 3D size.

Unsupervised Learning Unsupervised learning, commonly referred to as unsupervised machine learning, analyzes and groups unlabeled datasets using machine learning algorithms. UDHN [49] is the first work for a cross-view camera model using unsupervised learning; it estimates the homography matrix of a paired image without projection labels. By reducing a pixel-wise intensity error that does not require ground-truth data, UDHN [49] outperforms previous supervised learning techniques. While preserving superior accuracy and robustness to fluctuations in lighting, the proposed unsupervised algorithm also achieves faster inference. Inspired by this work, an increasing number of methods leverage the unsupervised learning strategy to estimate the homography, such as CA-UDHN [50], BasesHomo [51], HomoGAN [52], and Liu et al. [53]. Besides, UnFishCor [54] frees the demand for distortion parameters and designs an unsupervised framework for the wide-angle camera.

Self-supervised Learning Robotics is where the phrase “self-supervised learning” first appeared, as training data is automatically labeled by exploiting relationships between different input sensor signals. Compared to supervised learning, self-supervised learning leverages the input data itself as the supervision. Many self-supervised techniques have been presented to learn visual characteristics from massive amounts of unlabeled photos or videos without the need for time-consuming and expensive human annotations. SSR-Net [55] presents a self-supervised deep homography estimation network, which relaxes the need for ground-truth annotations and leverages the invertibility constraints of the homography. Specifically, SSR-Net [55] utilizes the homography matrix representation in place of the 4-point parameterization typically used by other approaches, in order to apply the invertibility constraints. SIR [56] devises a brand-new self-supervised camera calibration pipeline for wide-angle image rectification, based on the principle that the corrected results of distorted images of the same scene taken with various lenses need to be the same. With self-supervised depth and pose learning as a proxy aim, Fang et al. [57] self-calibrate a range of generic camera models from raw video, offering for the first time a calibration evaluation of camera model parameters learned solely via self-supervision.
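The invertibility idea behind SSR-Net [55] can be sketched in a few lines of PyTorch; this is an illustrative loss in the spirit of the method, not its actual implementation: the homographies predicted in the two directions should compose to the identity, which requires no ground-truth labels.

```python
# Sketch of an invertibility constraint in the spirit of SSR-Net (not its actual
# implementation): the homographies predicted in both directions should compose
# to the identity, which can be enforced without any ground-truth labels.
import torch

def invertibility_loss(h_ab: torch.Tensor, h_ba: torch.Tensor) -> torch.Tensor:
    """h_ab, h_ba: (B, 3, 3) homographies predicted for A->B and B->A."""
    eye = torch.eye(3, device=h_ab.device).expand_as(h_ab)
    return ((torch.bmm(h_ab, h_ba) - eye) ** 2).mean()

# Example with dummy predictions standing in for a (hypothetical) two-branch network.
h_ab = torch.eye(3).repeat(4, 1, 1) + 0.01 * torch.randn(4, 3, 3)
h_ba = torch.eye(3).repeat(4, 1, 1) + 0.01 * torch.randn(4, 3, 3)
print(invertibility_loss(h_ab, h_ba))
```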
Reinforcement Learning Instead of minimizing the error at each stage, reinforcement learning can maximize the cumulative benefits of the learning process as a whole. To date, DQN-RecNet [58] is the first and only work in camera calibration using reinforcement learning. It applies a deep reinforcement learning technique to tackle fisheye image rectification as a single Markov decision process, i.e., a multi-step gradual calibration scheme. In this setting, the current fisheye image represents the state of the environment. The agent, a Deep Q-Network [59], generates the action that should be executed to correct the distorted image.

In the following, we review the specific methods and literature for learning-based camera calibration. The structural and hierarchical taxonomy is shown in Figure 2.

Fig. 2. The structural and hierarchical taxonomy of camera calibration with deep learning: intrinsics calibration (Section 3.1), extrinsics calibration (Section 3.2), joint calibration (Section 3.3), and discussion (Section 3.4); radial distortion (Section 4.1), rolling shutter distortion (Section 4.2), and discussion (Section 4.3); direct solution (Section 5.1), cascaded solution (Section 5.2), iterative solution (Section 5.3), and discussion (Section 5.4); pixel-level (Section 6.1), semantics-level (Section 6.2), object/keypoint-level (Section 6.3), and discussion (Section 6.4). Some classical methods are listed under each category.
3 STANDARD MODEL

Generally, for learning-based calibration works, the objectives of intrinsic calibration contain the focal length and optical center, and the objectives of extrinsic calibration contain the rotation matrix and translation vector.

3.1 Intrinsics Calibration

DeepFocal [21] is a pioneering work in learning-based camera calibration; it aims to estimate the focal length of any image “in the wild”. In detail, DeepFocal considered a simple pinhole camera model and regressed the horizontal field of view using a deep convolutional neural network. Given the width $w$ of an image, the relationship between the horizontal field of view $H_\theta$ and the focal length $f$ can be described by:

$H_\theta = 2 \arctan\left(\frac{w}{2f}\right). \quad (1)$
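Eq. (1) translates directly into code; the snippet below, with the focal length expressed in pixels, is a straightforward restatement of the pinhole relation rather than part of any surveyed method.

```python
# Eq. (1) in code: converting between the horizontal field of view and the focal
# length (in pixels) for a simple pinhole camera model.
import math

def fov_to_focal(h_fov_rad: float, width_px: int) -> float:
    return width_px / (2.0 * math.tan(h_fov_rad / 2.0))

def focal_to_fov(focal_px: float, width_px: int) -> float:
    return 2.0 * math.atan(width_px / (2.0 * focal_px))

# A 640-pixel-wide image with a 90-degree horizontal FoV has f = 320 px.
print(fov_to_focal(math.radians(90.0), 640))   # -> 320.0
print(math.degrees(focal_to_fov(320.0, 640)))  # -> 90.0
```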
constraints and learned representation. Therefore, the neural
Due to component wear, temperature fluctuations, or networks are gradually guided to perceive the geometry-
outside disturbances like collisions, the calibrated param- related features, which are crucial for extrinsic estimation.
eters of a camera are susceptible to change over time. To Considering the privacy concerns and limited storage prob-
this end, MisCaliDet [108] proposed to identify if a camera lem, some recent works compressed the scene and exploited
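As a simplified illustration of an APPD-style metric, the sketch below remaps a pixel grid between an assumed and a reference intrinsic matrix and averages the per-pixel displacement. MisCaliDet's actual APPD also accounts for lens distortion, so this is only a distortion-free approximation.

```python
# A simplified, distortion-free illustration of an APPD-style metric: remap a pixel
# grid from an assumed intrinsic matrix K_est to the reference K_true and average
# the per-pixel displacement. (The real APPD also involves lens distortion.)
import numpy as np

def appd_pinhole(k_true: np.ndarray, k_est: np.ndarray, h: int, w: int) -> float:
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N
    remapped = k_true @ np.linalg.inv(k_est) @ pix        # where the pixels should be
    remapped = remapped[:2] / remapped[2:]
    return float(np.mean(np.linalg.norm(remapped - pix[:2], axis=0)))

k_true = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
k_est = np.array([[510., 0., 322.], [0., 510., 238.], [0., 0., 1.]])
print(appd_pinhole(k_true, k_est, 480, 640))   # mean pixel position difference
```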
3.2 Extrinsics Calibration

In contrast to intrinsic calibration, extrinsic calibration infers the spatial correspondence between the camera and the 3D scene in which it is located. PoseNet [22] first proposed deep convolutional neural networks to regress the 6-DoF camera pose in real time. A pose vector p was predicted by PoseNet, given by the 3D position x and the orientation, represented by a quaternion q, of a camera, namely, p = [x, q]. For constructing the training dataset, the labels are automatically computed from a video of the scene using a structure-from-motion method [183].
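The p = [x, q] parameterization and a PoseNet-style weighted loss can be sketched as follows; the weighting factor beta and the normalization choice are hyperparameters that vary across implementations, so treat this as one illustrative variant rather than the exact PoseNet objective.

```python
# Sketch of the p = [x, q] pose parameterization and a PoseNet-style loss that
# balances the translation and quaternion terms with a weight beta (a hyperparameter).
import torch

def pose_loss(x_pred, q_pred, x_gt, q_gt, beta: float = 250.0):
    """x_*: (B, 3) positions; q_*: (B, 4) quaternions (ground truth unit-norm)."""
    q_pred = q_pred / q_pred.norm(dim=1, keepdim=True)   # compare on the unit sphere
    return (x_pred - x_gt).norm(dim=1).mean() + beta * (q_pred - q_gt).norm(dim=1).mean()

x_pred, q_pred = torch.randn(8, 3), torch.randn(8, 4)
x_gt = torch.randn(8, 3)
q_gt = torch.nn.functional.normalize(torch.randn(8, 4), dim=1)
p = torch.cat([x_pred, q_pred], dim=1)                   # the regressed pose vector p = [x, q]
print(p.shape, pose_loss(x_pred, q_pred, x_gt, q_gt))
```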
Inspired by PoseNet [22], subsequent works improved extrinsic calibration in terms of the intermediate representation, interpretability, data format, learning objective, etc. For example, to optimize the geometric pose objective, DeepFEPE [112] designed an end-to-end keypoint-based framework with learnable modules for detection, feature extraction, matching, and outlier rejection. Such a pipeline imitates the traditional baseline, in which the final performance can be analyzed and improved through the intermediate differentiable modules. To bridge the domain gap between the extrinsic objective and image features, recent works proposed to first learn an intermediate representation from the input, such as surface geometry [86], a depth map [134], a directional probability distribution [148], or normal flow [167]. The extrinsics are then reasoned from geometric constraints and the learned representation. Therefore, the neural networks are gradually guided to perceive geometry-related features, which are crucial for extrinsic estimation. Considering privacy concerns and limited storage, some recent works compress the scene and exploit point-like features to estimate the extrinsics. For example, Do et al. [164] trained a network to recognize sparse but significant 3D points, dubbed scene landmarks, by encoding their appearance as implicit features. The camera pose can then be calculated using a robust minimal solver followed by a Levenberg-Marquardt-based nonlinear refinement.
TABLE 1
Details of the learning-based camera calibration and its extended applications from 2015 to 2022, including the method abbreviation, publication,
calibration objective, network architecture, loss function, dataset, evaluation metrics, learning strategy, platform, and simulation or not (training
data). For the learning strategies, SL, USL, WSL, Semi-SL, SSL, and RL denote supervised learning, unsupervised learning, weakly-supervised
learning, semi-supervised learning, self-supervised learning, and reinforcement learning, respectively.
Method Publication Objective Network Loss Function Dataset Evaluation Learning Platform Simulation
2015 DeepFocal [21] ICIP Standard AlexNet L2 loss 1DSfM [60] Accuracy SL Caffe
PoseNet [22] ICCV Standard GoogLeNet L2 loss Cambridge Landmarks [61] Accuracy SL Caffe
2016 DeepHorizon [62] BMVC Standard GoogLeNet Huber loss HLW [63] Accuracy SL Caffe
DeepVP [36] CVPR Standard AlexNet Logistic loss YUD [64], ECD [65], HLW [63] Accuracy SL Caffe
Rong et al. [23] ACCV Distortion AlexNet Softmax loss ImageNet [66] Line length SL Caffe X
DHN [25] RSSW Cross-View VGG L2 loss MS-COCO [67] MSE SL Caffe X
2017 CLKN [68] CVPR Cross-View CNNs Hinge loss MS-COCO [67] MSE SL Torch X
HierarchicalNet [69] ICCVW Cross-View VGG L2 loss MS-COCO [67] MSE SL TensorFlow X
URS-CNN [24] CVPR Distortion CNNs L2 loss Sun [70], Oxford [71], Zubud [72], LFW [73] PSNR, RMSE SL Torch X
RegNet [27] IV Cross-Sensor CNNs L2 loss KITTI [74] MAE SL Caffe X
2018 Hold-Geoffroy et al. [26] CVPR Standard DenseNet Entropy loss SUN360 [75] Human sensitivity SL -
DeepCalib [37] CVMP Distortion Inception-V3 Logcosh loss SUN360 [75] Mean error SL TensorFlow X
FishEyeRecNet [76] ECCV Distortion VGG L2 loss ADE20K [77] PSNR, SSIM SL Caffe X
Shi et al. [78] ICPR Distortion ResNet L2 loss ImageNet [66] MSE SL PyTorch X
DeepFM [79] ECCV Cross-View ResNet L2 loss T&T [80], KITTI [74], 1DSfM [60] F-score, Mean SL PyTorch X
Poursaeed et al. [81] ECCVW Cross-View CNNs L1 , L2 loss KITTI [74] EPI-ABS, EPI-SQR SL -
UDHN [49] RAL Cross-View VGG L1 loss MS-COCO [67] RMSE USL TensorFlow X
PFNet [82] ACCV Cross-View FCN Smooth L1 loss MS-COCO [67] MAE SL TensorFlow X
CalibNet [83] IROS Cross-Sensor ResNet Point cloud distance, L2 loss KITTI [74] Geodesic distance, MAE SL TensorFlow X
Chang et al. [84] ICRA Standard AlexNet Cross-entropy loss DeepVP-1M [84] MSE, Accuracy SL Matconvnet
2019 Lopez et al. [85] CVPR Distortion DenseNet Bearing loss SUN360 [75] MSE SL PyTorch
UprightNet [86] ICCV Standard U-Net Geometry loss InteriorNet [87], ScanNet [88], SUN360 [75] Mean error SL PyTorch
Zhuang et al. [89] IROS Distortion ResNet L1 loss KITTI [74] Mean error, RMSE SL PyTorch X
SSR-Net [55] PRL Cross-View ResNet L2 loss MS-COCO [67] MAE SSL PyTorch X
Abbas et al. [90] ICCVW Cross-View CNNs Softmax loss CARLA [91] AUC [92], Mean error SL TensorFlow X
DR-GAN [31] TCSVT Distortion GANs Perceptual loss MS-COCO [67] PSNR, SSIM SL TensorFlow X
STD [93] TCSVT Distortion GANs+CNNs Perceptual loss MS-COCO [67] PSNR, SSIM SL TensorFlow X
Deep360Up [94] VR Standard DenseNet Log-cosh loss [95] SUN360 [75] Mean error SL - X
UnFishCor [54] JVCIR Distortion VGG L1 loss Places2 [96] PSNR, SSIM USL TensorFlow X
BlindCor [34] CVPR Distortion U-Net L2 loss Places2 [96] MSE SL PyTorch X
RSC-Net [97] CVPR Distortion ResNet L1 loss KITTI [74] Mean error SL PyTorch X
Xue et al. [98] CVPR Distortion ResNet L2 loss Wireframes [99], SUNCG [100] PSNR, SSIM, RPE SL PyTorch X
Zhao et al. [43] ICCV Distortion VGG+U-Net L1 loss Self-constructed+BU-4DFE [101] Mean error SL - X
NeurVPS [102] NeurIPS Standard CNNs Binary cross entropy, chamfer-L2 loss ScanNet [88], SU3 [103] Angle accuracy SL PyTorch
2020 Sha et al. [104] CVPR Cross-View U-Net Cross-entropy loss World Cup 2014 [105] IoU SL TensorFlow
Lee et al. [106] ECCV Standard PointNet + CNNs Cross-entropy loss Google Street View [107], HLW [63] Mean error, AUC [92] SL -
MisCaliDet [108] ICRA Distortion CNNs L2 loss KITTI [74] MSE SL TensorFlow X
DeepPTZ [109] WACV Distortion Inception-V3 L1 loss SUN360 [75] Mean error SL PyTorch X
MHN [110] CVPR Cross-View VGG Cross-entropy loss MS-COCO [67], Self-constructed MAE SL TensorFlow X
Davidson et al. [111] ECCV Standard FCN Dice loss SUN360 [75] Accuracy SL - X
CA-UDHN [50] ECCV Cross-View FCN + ResNet Triplet loss Self-constructed MSE USL PyTorch
DeepFEPE [112] IROS Standard VGG + PointNet L2 loss KITTI [74], ApolloScape [113] Mean error SL PyTorch
DDM [32] TIP Distortion GANs L1 loss MS-COCO [67] PSNR, SSIM SL TensorFlow X
Li et al. [114] TIP Distortion CNNs Cross-entropy, L1 loss CelebA [115] Cosine distance SL - X
PSE-GAN [116] ICPR Distortion GANs L1 , WGAN loss Place2 [96] MSE SL - X
RDC-Net [117] ICIP Distortion ResNet L1 , L2 loss ImageNet [66] PSNR, SSIM SL PyTorch X
FE-GAN [118] ICASSP Distortion GANs L1 , GAN loss Wireframe [99], LSUN [119] PSNR, SSIM, RMSE SSL PyTorch X
RDCFace [120] CVPR Distortion ResNet Cross-entropy, L2 loss IMDB-Face [121] Accuracy SL - X
LaRecNet [122] arXiv Distortion ResNet L2 loss Wireframes [99], SUNCG [100] PSNR, SSIM, RPE SL PyTorch X
Baradad et al. [123] CVPR Standard CNNs L2 loss ScanNet [88], NYU [124], SUN360 [75] Mean error, RMS SL PyTorch
Zheng et al. [125] CVPR Standard CNNs L1 loss FocaLens [126] Mean error, PSNR, SSIM SL - X
Zhu et al. [48] ECCV Standard CNNs + PointNet L1 loss SUN360 [75], MS-COCO [67] Mean error, Accuracy WSL PyTorch X
DeepUnrollNet [46] CVPR Distortion FCN L1 , perceptual, total variation loss Carla-RS [46], Fastec-RS [46] PSNR, SSIM SL PyTorch X
RGGNet [127] RAL Cross-Sensor ResNet Geodesic distance loss KITTI [74] MSE, MSEE, MRR SL TensorFlow X
CalibRCNN [128] IROS Cross-Sensor RNNs L2 , Epipolar geometry loss KITTI [74] MAE SL TensorFlow X
SSI-Calib [129] ICRA Cross-Sensor CNNs L2 loss Pascal VOC 2012 [130] Mean/standard deviation SL TensorFlow X
SOIC [131] arXiv Cross-Sensor ResNet + PointRCNN Cost function KITTI [74] Mean error SL -
NetCalib [132] ICPR Cross-Sensor CNNs L1 loss KITTI [74] MAE SL PyTorch X
SRHEN [133] ACM-MM Cross-View CNNs L2 loss MS-COCO [67], SUN397 [75] MACE SL - X
2021 StereoCaliNet [134] TCI Standard U-Net L1 loss TAUAgent [135], KITTI [74] Mean error SL PyTorch X
CTRL-C [136] ICCV Standard Transformer Cross-entropy, L1 loss Google Street View [107], SUN360 [75] Mean error, AUC [92] SL PyTorch X
Wakai et al. [137] ICCVW Distortion DenseNet Smooth L1 loss StreetLearn [138] Mean error, PSNR, SSIM SL - X
OrdinalDistortion [139] TIP Distortion CNNs Smooth L1 loss MS-COCO [67] PSNR, SSIM, MDLD SL TensorFlow X
PolarRecNet [140] TCSVT Distortion VGG + U-Net L1 , L2 loss MS-COCO [67], LMS [141] PSNR, SSIM, MSE SL PyTorch X
DQN-RecNet [58] PRL Distortion VGG L2 loss Wireframes [99] PSNR, SSIM, MSE RL PyTorch X
Tan et al. [44] CVPR Distortion U-Net L2 loss Self-constructed Accuracy SL PyTorch
PCN [142] CVPR Distortion U-Net L1 , L2 , GAN loss Place2 [96] PSNR, SSIM, FID, CW-SSIM SL PyTorch X
DaRecNet [33] ICCV Distortion U-Net Smooth L1 , L2 loss ADE20K [77] PSNR, SSIM SL PyTorch X
DLKFM [143] CVPR Cross-View Siamese-Net L2 loss MS-COCO [67], Google Earth, Google Map MSE SL TensorFlow X
LocalTrans [144] ICCV Cross-View Transformer L1 loss MS-COCO [67] MSE, PSNR, SSIM SL PyTorch X
BasesHomo [51] ICCV Cross-View ResNet Triplet loss CA-UDHN [50] MSE USL PyTorch
ShuffleHomoNet [145] ICIP Cross-View ShuffleNet L2 loss MS-COCO [67] RMSE SL TensorFlow X
DAMG-Homo [41] TCSVT Cross-View CNNs L1 loss MS-COCO [67], UDIS [146] RMSE, PSNR, SSIM SL TensorFlow X
SA-MobileNet [147] BMVC Standard MobileNet Cross-entropy loss SUN360 [75], ADE20K [77], NYU [124] MAE, Accuracy SL TensorFlow X
SPEC [45] ICCV Standard ResNet Softargmax-L2 loss Self-constructed W-MPJPE, PA-MPJPE SL PyTorch X
DirectionNet [148] CVPR Standard U-Net Cosine similarity loss InteriorNet [87], Matterport3D [149] Mean and median error SL TensorFlow X
JCD [150] CVPR Distortion FCN Charbonnier [151], perceptual loss BS-RSCD [150], Fastec-RS [46] PSNR, SSIM, LPIPS SL PyTorch
LCCNet [152] CVPRW Cross-Sensor CNNs Smooth L1 , L2 loss KITTI [74] MSE SL PyTorch X
CFNet [153] Sensors Cross-Sensor FCN L1 , Charbonnier [151] loss KITTI [74], KITTI-360 [154] MAE, MSEE, MRR SL PyTorch X
Fan et al. [155] ICCV Distortion U-Net L1 , perceptual loss Carla-RS [46], Fastec-RS [46] PSNR, SSIM, LPIPS SL PyTorch
SUNet [156] ICCV Distortion DenseNet + ResNet L1 , perceptual loss Carla-RS [46], Fastec-RS [46] PSNR, SSIM SL PyTorch
SemAlign [157] IROS Cross-Sensor CNNs Semantic alignment loss KITTI [74] Mean/median rotation errors SL PyTorch X
2022 DVPD [38] CVPR Standard CNNs Cross-entropy loss SU3 [103], ScanNet [88], YUD [64], NYU [124] Accuracy, AUC [92] SL PyTorch X
Fang et al. [57] ICRA Standard CNNs L2 loss KITTI [74], EuRoC [158], OmniCam [159] MRE, RMSE SSL PyTorch
CPL [160] ICASSP Standard Inception-V3 L1 loss CARLA [91], CyclistDetection [161] MAE SL TensorFlow X
IHN [162] CVPR Cross-View Siamese-Net L1 loss MS-COCO [67], Google Earth, Google Map MACE SL PyTorch X
HomoGAN [52] CVPR Cross-View GANs Cross-entropy, WGAN loss CA-UDHN [50] Mean error USL PyTorch X
SS-WPC [47] CVPR Distortion Transformer Cross-entropy, L1 loss Tan et al. [44] Accuracy Semi-SL PyTorch
AW-RSC [163] CVPR Distortion CNNs Charbonnier [151], perceptual loss Self-constructed, FastecRS [46] PSNR, SSIM SL PyTorch
EvUnroll [39] CVPR Distortion U-Net Charbonnier, perceptual, TV loss Self-constructed, FastecRS [46] PSNR, SSIM, LPIPS SL PyTorch
Do et al. [164] CVPR Standard ResNet L2 , Robust angular [165] loss Self-constructed, 7-SCENES [166] Median error, Recall SL PyTorch
DiffPoseNet [167] CVPR Standard CNNs + LSTM L2 loss TartanAir [168], KITTI [74], TUM-RGBD [169] PEE, AEE [170] SSL PyTorch
SceneSqueezer [171] CVPR Standard Transformer L1 loss RobotCar Seasons [172], Cambridge Landmarks [61] Mean error, Recall [170] SL PyTorch
FocalPose [173] CVPR Standard CNNs L1 , Huber loss Pix3D [174], CompCars [175], StanfordCars [175] Median error, Accuracy SL PyTorch
DXQ-Net [176] arXiv Cross-Sensor CNNs + RNNs L1 , geodesic loss KITTI [74], KITTI-360 [154] MSE SL PyTorch X
SST-Calib [42] ITSC Cross-Sensor CNNs L2 loss KITTI [74] QAD, AEAD SL PyTorch X
CCS-Net [177] IROS Distortion U-Net L1 loss TUM-RGBD [169] MAE, RPE SL PyTorch X
FishFormer [40] arXiv Distortion Transformer L2 loss Place2 [96], CelebA [115] PSNR, SSIM, FID SL PyTorch X
SIR [56] TIP Distortion ResNet L1 loss ADE20K [77], WireFrames [99], MS-COCO [67] PSNR, SSIM SSL PyTorch X
ATOP [178] TIV Cross-Sensor CNNs Cross entropy loss Self-constructed + KITTI [74] RRE, RTE SL -
FusionNet [179] ICRA Cross-Sensor CNNs+PointNet L2 loss KITTI [74] MAE SL PyTorch X
RKGCNet [180] TIM Cross-Sensor CNNs+PointNet L1 loss KITTI [74] MSE SL PyTorch X
GenCaliNet [181] ECCV Distortion DenseNet L2 loss StreetLearn [138], SP360 [182] MAE, PSNR, SSIM SL - X
Liu et al. [53] TPAMI Cross-View ResNet Triplet loss Self-constructed MSE, Accuracy USL PyTorch
SceneSqueezer [171] compresses the scene at three levels: the database frames are clustered using pairwise co-visibility information, a point selection module …

Fig. 3. Overview of CTRL-C (ResNet features, multi-head self- and cross-attention, and feed-forward (FFN) heads that predict the vanishing point, horizon line, and FoV). The figure is from [136].

3.3 Joint Intrinsic and Extrinsic Calibration

3.3.1 Geometric Representations
Vanishing Points The intersection of the projections of a set of parallel lines in the world leads to a vanishing point. The detection of vanishing points is a fundamental and crucial challenge in 3D vision. In general, vanishing points reveal the direction of 3D lines, allowing an agent to deduce 3D scene information from a single 2D image.

DeepVP [36] is the first learning-based work for detecting vanishing points in a single image. It reversed the conventional process by scoring horizon line candidates according to the vanishing points they contain. Chang et al. [84] redesigned this task as a CNN classification problem using an output layer with 225 discrete possible vanishing point locations. For constructing the dataset, the camera view is panned and tilted in 5° steps from −35° to 35° in a panoramic scene (225 images in total) from a single GPS location. To directly leverage the geometric properties of vanishing points, NeurVPS [102] proposed a canonical conic space and a conic convolution operator that can be implemented as regular convolutions in this space, where the learning model is capable of computing the global geometric information of vanishing points locally. To overcome the need for a large amount of training data in previous methods, DVPD [38] incorporated two geometric priors into the neural network: the Hough transform and the Gaussian sphere. First, the convolutional features are transformed into a Hough domain, mapping lines to distinct bins. The projection of the Hough bins is then extended to the Gaussian sphere, where lines become great circles and vanishing points are located at the intersections of these circles. Geometric priors are data-efficient because they eliminate the necessity of learning this information from data, which enables an interpretable learning framework and generalizes better to domains with slightly different data distributions.
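The underlying projective geometry is compact enough to state in code: in homogeneous coordinates, the line through two image points and the intersection of two image lines are both cross products, so the projections of parallel 3D lines meet at a single vanishing point. The coordinates below are arbitrary example values.

```python
# The projective geometry behind vanishing points, in a few lines: in homogeneous
# coordinates a line through two points is their cross product, and the
# intersection of two lines is again their cross product. Projections of parallel
# 3D lines therefore meet at a single image point, the vanishing point.
import numpy as np

def line_through(p1, p2):
    return np.cross([*p1, 1.0], [*p2, 1.0])

def intersection(l1, l2):
    p = np.cross(l1, l2)
    return p[:2] / p[2]          # back to inhomogeneous pixel coordinates

# Two image segments that come from parallel 3D lines (e.g., rails) converge:
l1 = line_through((100, 400), (300, 250))
l2 = line_through((500, 400), (320, 250))
print(intersection(l1, l2))      # their common vanishing point
```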
Horizon Lines The horizon line is a crucial contextual attribute for various computer vision tasks, especially image metrology, computational photography, and 3D scene understanding. The horizon line is determined by the projection of the line at infinity of any plane that is perpendicular to the local gravity vector.

Given the FoV, pitch, and roll of a camera, it is straightforward to locate the horizon line in its captured image space. DeepHorizon [62] proposed the first learning-based solution for estimating the horizon line from an image, without requiring any explicit geometric constraints or other cues. To train the network, a new benchmark dataset, Horizon Lines in the Wild (HLW), was constructed, which consists of real-world images with labeled horizon lines. SA-MobileNet [147] proposed image tilt detection and correction with a self-attention MobileNet [184] for smartphones. A spatial self-attention module was devised to learn long-range dependencies and the global context within the input images. To address the difficulty of the regression task, they trained the network to estimate multiple angles within a narrow interval of the ground-truth tilt, penalizing only those values that fall outside this narrow range.
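The claim that the horizon line follows directly from the FoV, pitch, and roll can be sketched by projecting two horizontal world directions and joining them. Sign conventions for pitch and roll differ across papers, so the convention fixed below (pitch as an upward tilt, image v growing downward, roll applied about the optical axis) is only one possible choice, and the sketch is illustrative rather than taken from any surveyed method.

```python
# Locating the horizon line from the FoV, pitch, and roll of a pinhole camera:
# project two horizontal world directions and join them. The sign conventions
# chosen here (pitch = upward tilt, image v grows downward) are one common choice.
import numpy as np

def horizon_points(h_fov_deg, pitch_deg, roll_deg, width, height):
    f = width / (2.0 * np.tan(np.radians(h_fov_deg) / 2.0))        # Eq. (1)
    cx, cy = width / 2.0, height / 2.0
    t, r = np.radians(pitch_deg), np.radians(roll_deg)
    r_pitch = np.array([[1, 0, 0], [0, np.cos(t), np.sin(t)], [0, -np.sin(t), np.cos(t)]])
    r_roll = np.array([[np.cos(r), np.sin(r), 0], [-np.sin(r), np.cos(r), 0], [0, 0, 1]])
    pts = []
    for a in (-0.3, 0.3):                                          # two horizontal rays
        d_world = np.array([np.sin(a), 0.0, np.cos(a)])            # zero gravity component
        d_cam = r_roll @ r_pitch @ d_world
        pts.append((cx + f * d_cam[0] / d_cam[2], cy + f * d_cam[1] / d_cam[2]))
    return pts                                                     # two points on the horizon

print(horizon_points(90.0, 10.0, 5.0, 640, 480))
```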
3.3.2 Composite Parameters

Calibrating composite parameters aims to estimate the intrinsic and extrinsic parameters simultaneously. By jointly estimating composite parameters and training on data from a large-scale panorama dataset [75], Hold-Geoffroy et al. [26] largely outperformed previous independent calibration tasks. Moreover, Hold-Geoffroy et al. [26] performed a human perception study in which participants were asked to evaluate the realism of 3D objects composited with and without accurate calibration. These data were further used to design a new perceptual measure for calibration errors. In terms of the feature category, Lee et al. [106] and CTRL-C [136] considered both semantic features and geometric cues for camera calibration. They showed that making use of geometric features helps the network comprehend the underlying perspective structure of an image. The pipeline of CTRL-C is illustrated in Figure 3. In recent literature, more applications have been jointly studied with camera calibration, for example, single view metrology [48], 3D human pose and shape estimation [45], depth estimation [57], [123], object pose estimation [173], and image reflection removal [125].

Considering the heterogeneity and visual implicitness of different camera parameters, CPL [160] estimated the parameters using a novel camera projection loss, exploiting the camera-model neural network to reconstruct the 3D point cloud. The proposed loss addresses the training imbalance problem by representing the different errors of camera parameters in terms of a unified metric.

3.4 Discussion

3.4.1 Technique Summary

The above methods target automatic calibration without manual intervention or scene assumptions. Early literature [21], [22] separately studied intrinsic calibration or extrinsic calibration. Driven by large-scale datasets and powerful networks, subsequent works [26], [36], [62], [136] considered comprehensive camera calibration, inferring various parameters and geometric representations. To relieve the difficulty of learning the camera parameters, some works [86], [134], [148], [167] proposed to learn an intermediate representation. In recent literature, more applications have been jointly studied with camera calibration [45], [48], [57], [123], [125]. This suggests solving downstream vision tasks and camera calibration jointly.

… [47], [93], [118], [140], [142] developed the displacement … reconstruction-based solutions exploit a U-Net-like architecture …

(Figure: adaptive-warping pipeline with correlation volume, motion estimation, adaptive warping, and central-frame features; Ada-MSA and 3×3 convolution blocks.)
5.2 Cascaded Solution

Cascaded solutions typically build image pyramids and estimate the homography from coarse to fine to address large displacements. However, every image pair at each level requires a unique feature extraction network, resulting in redundant feature maps. To alleviate this problem, some researchers [41], [52], [146], [192] replaced image pyramids with feature pyramids. Specifically, they warped the feature maps directly instead of the images to avoid excessive feature extraction networks. To address the low-overlap homography estimation problem in real-world images [146], Nie et al. [146] modified the unsupervised constraint (Eq. 3) to adapt to low-overlap scenes:

$L'_{PW} = \| I_A(x) \cdot \mathbf{1}(\mathcal{W}(x; p)) - I_B(\mathcal{W}(x; p)) \|_1, \quad (4)$

where $\mathbf{1}$ is an all-ones matrix with the same size as $I_A$ or $I_B$. It solves the low-overlap problem by taking the original images as network input and ablating the pixels of $I_A$ that correspond to the invalid pixels of the warped $I_B$.
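Eq. (4) can be sketched as follows, using kornia's warp_perspective for the differentiable warp (any warp operator would do); the homography convention and the use of the mean rather than the sum for the L1 term are implementation choices in this sketch, not details of [146].

```python
# A sketch of the masked photometric constraint in Eq. (4): warp image B (and an
# all-ones mask) with the predicted homography, then compare only the pixels of A
# that fall inside the valid, overlapping region.
import torch
import kornia.geometry.transform as KT

def low_overlap_photometric_loss(img_a, img_b, h_ab):
    """img_a, img_b: (B, C, H, W); h_ab: (B, 3, 3) homography warping B toward A."""
    h, w = img_a.shape[-2:]
    warped_b = KT.warp_perspective(img_b, h_ab, dsize=(h, w))
    ones = torch.ones_like(img_b)
    valid = KT.warp_perspective(ones, h_ab, dsize=(h, w))   # 1(W(x; p)) in Eq. (4)
    return (img_a * valid - warped_b).abs().mean()

img_a, img_b = torch.rand(2, 3, 128, 128), torch.rand(2, 3, 128, 128)
h_ab = torch.eye(3).repeat(2, 1, 1)          # identity warp as a placeholder
print(low_overlap_photometric_loss(img_a, img_b, h_ab))
```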
To solve the non-planar homography estimation problem, DAMG-Homo [41] proposed backward multi-grid deformation with contextual correlation to align parallax images. Compared with a traditional cost volume, the proposed contextual correlation helps to reach better accuracy with lower computational complexity. Another way to address the non-planar problem is to focus on the dominant plane. In HomoGAN [52], an unsupervised GAN is proposed to impose a coplanarity constraint on the predicted homography, as shown in Figure 9. To implement this approach, a generator is used to predict masks of aligned regions, while a discriminator is used to determine whether two masked feature maps were produced by a single homography.

5.3 Iterative Solution

Compared with cascaded methods, iterative solutions achieve higher accuracy by iteratively optimizing the last estimation. The Lucas-Kanade (LK) algorithm [189] is usually used in image registration to estimate parameterized warps iteratively, such as affine transformations, optical flow, etc. It performs an incremental update of the warp parameters $\Delta p$ at every iteration by minimizing the sum of squared errors between a template image $T$ and an input image $I$:

$E(\Delta p) = \| T(x) - I(\mathcal{W}(x; p + \Delta p)) \|_2^2. \quad (5)$

However, when optimizing Eq. 5 using a first-order Taylor expansion, $\partial I(\mathcal{W}(x; p)) / \partial p$ should be recomputed at every iteration because $I(\mathcal{W}(x; p))$ varies with $p$. To avoid this problem, the inverse compositional (IC) LK algorithm [193], an equivalent of the LK algorithm, can be used to reformulate the optimization goal as follows:

$E'(\Delta p) = \| T(\mathcal{W}(x; \Delta p)) - I(\mathcal{W}(x; p)) \|_2^2. \quad (6)$

After linearizing Eq. 6 with a first-order Taylor expansion, we compute $\partial T(\mathcal{W}(x; 0)) / \partial p$ instead of $\partial I(\mathcal{W}(x; p)) / \partial p$, which does not vary across iterations.

To combine the advantages of deep learning with the IC-LK iterator, CLKN [68] conducted LK iterative optimization on semantic feature maps extracted by CNNs as follows:

$E_f(\Delta p) = \| F_T(\mathcal{W}(x; \Delta p)) - F_I(\mathcal{W}(x; p)) \|_2^2, \quad (7)$

where $F_T$ and $F_I$ are the feature maps of the template and input images. Then, they enforced the network to run a single iteration with a hinge loss, while the network runs multiple iterations at test time until a stopping condition is met. Besides, CLKN stacked three similar LK networks to further boost the performance by treating the output of the last LK network as the initial warp parameters of the next LK network. From Eq. 7, the IC-LK algorithm heavily relies on feature maps, which tend to fail in multi-modal images. Instead, DLKFM [143] constructed a single-channel feature map by using the eigenvalues of the local covariance matrix on the output tensor. To learn DLKFM, it designed two special constraint terms to align multi-modal feature maps and contribute to convergence.

However, LK-based algorithms can fail if the Jacobian matrix is rank-deficient [194]. Additionally, the IC-LK iterator is untrainable, which means this drawback is theoretically unavoidable. To address this issue, a completely trainable iterative homography network (IHN) [162] was proposed. Inspired by RAFT [195], IHN updates the cost volume to refine the estimated homography using the same estimator repeatedly at every iteration. Furthermore, IHN can handle dynamic scenes by producing an inlier mask in the estimator without requiring extra supervision.
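A toy one-dimensional example makes the incremental update in Eqs. (5)-(7) concrete: estimating a pure translation between a signal and a shifted copy by repeatedly linearizing the residual. The signal and the stopping threshold are arbitrary choices for illustration.

```python
# A toy 1-D Lucas-Kanade iteration to make Eqs. (5)-(7) concrete: estimate the
# translation p between a signal I and a template T = I(x + p_true) by repeatedly
# linearizing the residual and solving the least-squares update for delta_p.
import numpy as np

x = np.linspace(0.0, 10.0, 1000)
signal = np.sin(x) + 0.5 * np.sin(3.0 * x)           # the "input image" I
p_true = 0.7
template = np.interp(x + p_true, x, signal)          # the "template" T = I(x + p_true)

p = 0.0
for _ in range(20):
    warped = np.interp(x + p, x, signal)             # I(W(x; p)) for a translation warp
    grad = np.gradient(warped, x)                    # d I(W(x; p)) / dp
    residual = template - warped                     # T(x) - I(W(x; p)), cf. Eq. (5)
    delta_p = np.sum(residual * grad) / np.sum(grad * grad)   # Gauss-Newton step
    p += delta_p                                     # incremental warp update
    if abs(delta_p) < 1e-8:
        break

print(p)   # converges toward p_true = 0.7
```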
5.4 Discussion

5.4.1 Technique Summary

The above works are devoted to exploring different homography parameterizations, such as the 4-point parameterization [25], the perspective field [82], and the motion-basis representation [51], which contribute to better convergence and performance. Other works tend to design various network architectures. In particular, cascaded and iterative solutions are proposed to refine the performance progressively, and they can be further combined to reach higher accuracy. To make the methods more practical, various challenging problems have been preliminarily addressed, such as cross resolutions [144], multiple modalities [143], [162], dynamic objects [110], [162], and non-planar scenes [41], [50], [52], etc.

5.4.2 Challenge and Future Effort

We summarize the existing challenges as follows:

(1) Many homography estimation solutions are designed for fixed resolutions, while real-world applications often involve much more flexible resolutions. When pre-trained models are applied to images with different resolutions, performance can drop dramatically due to the need to resize the input to satisfy the regulated resolution.

(2) Unlike optical flow estimation, which assumes small motions between images, homography estimation often deals with images that have significantly low overlap rates. In such cases, existing methods may exhibit inferior performance due to limited receptive fields.

(3) Existing methods address parallax or dynamic objects by learning to reject outliers in the feature extractor [50], the cost volume [196], or the estimator [162]. However, it is still unclear which stage is more appropriate for outlier rejection.

Based on the challenges we have discussed, some potential research directions for future efforts can be identified:

(1) To overcome the first challenge, we can design various strategies to enhance resolution robustness, such as resolution-related data augmentation and continual learning on multiple datasets with different resolutions. Besides, …

6.1 Pixel-level Solution

… the extrinsic error to the flow prediction network.
6.2 Semantics-level Solution

Semantic features can be well learned and represented by deep neural networks. A perfect calibration makes it possible to accurately align the same instance across different sensors. To this end, some works [42], [129], [131], [157] explored guiding camera-LiDAR calibration with semantic information. SOIC [131] transforms the initialization issue into a PnP problem over semantic centroids using semantic information. Since the 3D semantic centroids of the point cloud and the 2D semantic centroids of the image cannot be matched precisely, a matching constraint cost function based on the semantic components was presented. SSI-Calib [129] reformulated calibration as an optimization problem with a novel calibration quality metric based on semantic features. A non-monotonic subgradient ascent algorithm was then proposed to calculate the calibration parameters. Other works utilized off-the-shelf segmentation networks for the point cloud and the image, and optimized the calibration parameters by minimizing a semantic alignment loss in a single direction [157] or bi-directionally [42].
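All of these camera-LiDAR solutions ultimately rely on the same geometric core: projecting LiDAR points into the image with the extrinsics (R, t) and intrinsics K, so that a miscalibration shows up as misaligned semantics or keypoints. The following is a generic pinhole projection sketch; the intrinsics and the point cloud are placeholders, not values from any surveyed dataset.

```python
# The geometric core shared by camera-LiDAR calibration objectives: projecting
# LiDAR points into the image with the extrinsics (R, t) and intrinsics K. A
# miscalibrated (R, t) shows up directly as misaligned semantics or keypoints.
import numpy as np

def project_lidar_to_image(points, k, r, t):
    """points: (N, 3) LiDAR points; k: (3, 3); r: (3, 3); t: (3,). Returns (M, 2) pixels."""
    cam = points @ r.T + t                     # LiDAR frame -> camera frame
    cam = cam[cam[:, 2] > 1e-6]                # keep points in front of the camera
    pix = cam @ k.T
    return pix[:, :2] / pix[:, 2:3]            # perspective division

k = np.array([[720., 0., 620.], [0., 720., 187.], [0., 0., 1.]])
r, t = np.eye(3), np.array([0.0, 0.0, 0.0])   # placeholder extrinsics
points = np.random.uniform([-10, -2, 2], [10, 2, 40], size=(1000, 3))
print(project_lidar_to_image(points, k, r, t)[:3])
```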
6.3 Object/Keypoint-level Solution

ATOP [178] designed an attention-based object-level matching network, i.e., a Cross-Modal Matching Network, to explore the overlapping FoV between camera and LiDAR, which facilitates generating 2D-3D object-level correspondences. 2D and 3D object proposals were detected by YOLOv4 [200] and PointPillars [201]. Then, two cascaded PSO-based algorithms [202] were devised to estimate the extrinsic calibration parameters in the optimization stage. Using the deep declarative network (DDN) [203], RKGCNet [180] combined standard neural layers and a PnP solver in the same network, formulating the 2D-3D data association and pose estimation as a bilevel optimization problem. Therefore, both the feature extraction capability of the convolutional layers and the conventional geometric solver can be employed. Microsoft's human keypoint extraction network [204] was applied to detect the 2D-3D matching keypoints. Additionally, RKGCNet [180] presented a learnable weight layer that determines the keypoints involved in the solver, enabling the whole pipeline to be trained end-to-end.

6.4 Discussion

6.4.1 Technique Summary

Current methods can be broadly classified based on the principle used to build 2D-3D matching, namely the calibration target. In summary, most pixel-level solutions utilize an end-to-end framework to address this task. While these solutions deliver satisfactory performance on specific datasets, their generalization abilities are limited. Semantics-level and object/keypoint-level methods, derived from traditional calibration, offer both acceptable performance and generalization ability. However, they rely heavily on the quality of the front-end feature extraction.

6.4.2 Research Trend

(1) Network architecture is becoming more complex, with different structures used for feature extraction, matching, and fusion. Current methods employ strategies like multi-scale feature extraction, cross-modal interaction, cost-volume establishment, and confidence-guided fusion.

(2) Directly regressing 6-DoF parameters yields weak generalization ability. To overcome this, intermediate representations like calibration flow have been introduced. Additionally, calibration flow can handle non-rigid transformations that are common in real-world applications.

(3) Traditional methods require specific environments but have well-designed strategies. To balance accuracy and generalization, combinations of geometric solving algorithms and learning methods have been investigated.

6.4.3 Future Effort

(1) Camera-LiDAR calibration methods typically rely on datasets like KITTI, which provide only initial extrinsic parameters. To create a decalibration dataset, researchers add noise transformations to the initial extrinsics, but this approach assumes a fixed-position camera-LiDAR system with miscalibration. In real-world applications, the camera-LiDAR relative pose varies, making it challenging to collect large-scale real data with ground-truth extrinsics. To address this challenge, generating synthetic camera-LiDAR data using simulation systems could be a valuable solution.

(2) To optimize the combination of networks and traditional solutions, a more compact approach is needed. Current methods mainly use networks as feature extractors, resulting in non-end-to-end pipelines with inadequate feature extraction adjustments for calibration. A deep declarative network (DDN) is a promising framework for making the entire pipeline differentiable. The aggregation of learning and traditional methods can be optimized using DDN.

(3) The most important aspect of camera-LiDAR calibration is 2D-3D matching. To achieve this, the point cloud is commonly transformed into a depth image. However, large deviations in extrinsic simulation can result in detail loss. With the great development of Transformers and cross-modal techniques, we believe leveraging Transformers to directly learn the features of the image and point cloud in the same pipeline could facilitate better 2D-3D matching.
7 BENCHMARK

As there is no public and unified benchmark in learning-based camera calibration, we contribute a dataset that can serve as a platform for generalization evaluation. In this dataset, the images and videos are captured by different cameras under diverse scenes, including simulation environments and real-world settings. Additionally, we provide the calibration ground truth, parameter labels, and visual clues in this dataset based on different conditions. Figure 11 shows some samples of our collected dataset.

Fig. 11. Overview of our collected benchmark, which covers all models reviewed in this paper (samples include multi-view camera rigs annotated with front/rear/left/right views and focal lengths, and data drawn from Apollo, KITTI, and ONCE). In this dataset, the images and videos derive from diverse cameras under different environments. Accurate ground truth and labels are provided for each sample.

Standard Model. We collected 300 high-resolution images from the Internet, captured by popular digital cameras such as Canon, Fujifilm, Nikon, Olympus, Sigma, Sony, etc. For each image, we provide the specific focal length of its lens. We have included a diverse range of subjects, including landscapes, portraits, wildlife, architecture, etc. The focal lengths range from 4.5 mm to 600 mm.

Distortion Model. We created a comprehensive dataset for the distortion camera model, with a focus on wide-angle cameras. The dataset comprises three subcategories.
The first is a synthetic dataset, which was generated using the widely used 4th-order polynomial model. It contains both circular and rectangular structures, with 1,000 distortion-rectification image pairs. The second subcategory consists of data captured in real-world settings, derived from the raw calibration data of around 40 types of wide-angle cameras. For each calibration sample, the intrinsics, extrinsics, and distortion coefficients are available. Finally, we exploit a car equipped with different cameras to capture video sequences. The scenes cover both indoor and outdoor environments, including daytime and nighttime footage.

Cross-View Model. We selected 500 testing samples at random from each of four representative datasets (MS-COCO [25], Google Earth [143], Google Map [143], CA-Homo [50]) to create a dataset for the cross-view model. It covers a range of scenarios: MS-COCO provides natural synthetic data, Google Earth contains aerial synthetic data, and Google Map offers multi-modal synthetic data. Parallax is not a factor in these three datasets, while CA-Homo provides real-world data with non-planar scenes. To standardize the dataset, we converted all images to a unified format and recorded the matched points between the two views. In MS-COCO, Google Earth, and Google Map, we used the four vertices of the images as the matched points. In CA-Homo, we identified six matched key points within the same plane.
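The recorded correspondences can be consumed with standard tools; the sketch below fits a ground-truth homography from four matched vertices with OpenCV and scores it by the mean corner error. The coordinates are placeholders, and a robust fit (e.g., RANSAC) would be preferable for the noisier key points of CA-Homo.

```python
# Fitting a ground-truth homography from the recorded matched points and scoring
# it with a mean corner error. The coordinates here are placeholder values.
import numpy as np
import cv2

src = np.array([[0, 0], [639, 0], [639, 479], [0, 479]], dtype=np.float32)      # four vertices
dst = np.array([[12, 8], [630, -5], [650, 470], [-6, 488]], dtype=np.float32)   # their matches

h_gt, _ = cv2.findHomography(src, dst)             # DLT fit from the correspondences
proj_h = np.c_[src, np.ones(4)] @ h_gt.T           # apply the homography manually
proj = proj_h[:, :2] / proj_h[:, 2:]
corner_error = np.linalg.norm(proj - dst, axis=1).mean()   # mean corner error metric
print(h_gt)
print(corner_error)                                 # ~0 for the fitted homography itself
```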
Cross-Sensor Model. We collected RGB and point cloud data from Apollo [205], DAIR-V2X [206], KITTI [74], KUCL [207], NuScenes [208], and ONCE [209]. Around 300 data pairs with calibration parameters are included in each category. The datasets were captured in different countries to provide enough variety. Each dataset has a different sensor setup, providing camera-LiDAR data with varying image resolutions, LiDAR scan patterns, and camera-LiDAR relative locations. The image resolution ranges from 2448×2048 to 1242×375, while the LiDAR sensors are from Velodyne and Hesai, with 16, 32, 40, 64, and 128 beams. They include not only normal surrounding multi-view images but also small-baseline multi-view data. Additionally, we added a random disturbance of around 20 degrees of rotation and 1.5 meters of translation based on classical settings [27] to simulate vibration and collision.
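The decalibration protocol described above can be sketched as follows; the uniform per-axis sampling and the use of SciPy's Rotation are illustrative assumptions rather than the exact procedure used to build the benchmark.

```python
# Sketch of how a decalibration sample can be synthesized for the cross-sensor
# benchmark: compose the ground-truth extrinsics with a random disturbance of up to
# roughly 20 degrees of rotation and 1.5 meters of translation.
import numpy as np
from scipy.spatial.transform import Rotation

def random_decalibration(t_gt, max_rot_deg=20.0, max_trans_m=1.5, rng=np.random.default_rng()):
    """t_gt: (4, 4) ground-truth LiDAR-to-camera transform. Returns (t_noisy, t_delta)."""
    angles = rng.uniform(-max_rot_deg, max_rot_deg, size=3)
    trans = rng.uniform(-max_trans_m, max_trans_m, size=3)
    t_delta = np.eye(4)
    t_delta[:3, :3] = Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
    t_delta[:3, 3] = trans
    return t_delta @ t_gt, t_delta     # a network can then be trained to undo t_delta

t_gt = np.eye(4)                        # placeholder ground-truth extrinsics
t_noisy, t_delta = random_decalibration(t_gt)
print(t_noisy)
```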
8 FUTURE RESEARCH DIRECTIONS

Camera calibration is a fundamental and challenging research topic. From the above technical reviews and limitation analysis, we can conclude that there is still room for improvement with deep learning. From Section 3 to Section 6, specific future efforts are discussed for each model. In this section, we suggest more general future research directions.

8.1 Sequences

Most studies focus on calibrating a single image. However, the rich spatiotemporal correlation among sequences, which offers useful information for calibration, has been overlooked. Learning the spatiotemporal correlation can provide the network with knowledge of structure from motion, which aligns with the principles of traditional calibration. Directly applying existing calibration methods to the first frame and then propagating the calibrated objectives to subsequent frames is a straightforward approach. However, no method can perfectly calibrate every uncalibrated input, and the calibration error will persist throughout the entire sequence. Another solution is to calibrate all frames simultaneously. However, the calibration results of learning-based methods rely heavily on the semantic features of the image. As a result, unstable jitter effects may occur in calibrated sequences when the scenes change.
[16] [Online]. Available: https://docs.opencv.org/4.x/dc/dbb/tutorial_py_calibration.html
[17] [Online]. Available: https://www.mathworks.com/help/vision/camera-calibration.html
[18] J. Salvi, X. Armangué, and J. Batlle, “A comparative review of camera calibrating methods with accuracy evaluation,” Pattern Recognition, vol. 35, no. 7, pp. 1617–1635, 2002.
[19] C. Hughes, M. Glavin, E. Jones, and P. Denny, “Review of geometric distortion compensation in fish-eye cameras,” 2008.
[20] J. Fan, J. Zhang, S. J. Maybank, and D. Tao, “Wide-angle image rectification: A survey,” International Journal of Computer Vision, vol. 130, no. 3, pp. 747–776, 2022.
[21] S. Workman, C. Greenwell, M. Zhai, R. Baltenberger, and N. Jacobs, “DeepFocal: A method for direct focal length estimation,” in IEEE International Conference on Image Processing (ICIP), 2015, pp. 1369–1373.
[22] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-DoF camera relocalization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
[23] J. Rong, S. Huang, Z. Shang, and X. Ying, “Radial lens distortion correction using convolutional neural networks trained with synthesized images,” in Asian Conference on Computer Vision. Springer, 2016, pp. 35–49.
[24] V. Rengarajan, Y. Balaji, and A. Rajagopalan, “Unrolling the shutter: CNN to correct motion distortions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2291–2299.
[25] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Deep image homography estimation,” arXiv preprint arXiv:1606.03798, 2016.
[26] Y. Hold-Geoffroy, K. Sunkavalli, J. Eisenmann, M. Fisher, E. Gambaretto, S. Hadap, and J.-F. Lalonde, “A perceptual measure for deep single image camera calibration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[27] N. Schneider, F. Piewak, C. Stiller, and U. Franke, “RegNet: Multimodal sensor registration using deep neural networks,” in IEEE Intelligent Vehicles Symposium (IV), 2017, pp. 1803–1810.
[28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[29] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[30] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in Neural Information Processing Systems, vol. 27, 2014.
[31] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “DR-GAN: Automatic radial distortion rectification using conditional GAN in real-time,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 725–733, 2020.
[32] K. Liao, C. Lin, Y. Zhao, and M. Xu, “Model-free distortion rectification framework bridged by distortion distribution map,” IEEE Transactions on Image Processing, vol. 29, pp. 3707–3718, 2020.
[33] K. Liao, C. Lin, L. Liao, Y. Zhao, and W. Lin, “Multi-level curriculum for training a distortion-aware barrel distortion rectification …
[37] … in SIGGRAPH European Conference on Visual Media Production, 2018, pp. 1–10.
[38] Y. Lin, R. Wiersma, S. L. Pintea, K. Hildebrandt, E. Eisemann, and J. C. van Gemert, “Deep vanishing point detection: Geometric priors make dataset variations vanish,” arXiv preprint arXiv:2203.08586, 2022.
[39] X. Zhou, P. Duan, Y. Ma, and B. Shi, “EvUnroll: Neuromorphic events based rolling shutter image correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17775–17784.
[40] S. Yang, C. Lin, K. Liao, and Y. Zhao, “FishFormer: Annulus slicing-based transformer for fisheye rectification with efficacy domain exploration,” arXiv preprint arXiv:2207.01925, 2022.
[41] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Depth-aware multi-grid deep homography estimation with contextual correlation,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2021.
[42] K. Akio, Z. Yiyang, Z. Pengwei, Z. Wei, and T. Masayoshi, “SST-Calib: Simultaneous spatial-temporal parameter calibration between LiDAR and camera,” arXiv preprint arXiv:2207.03704, 2022.
[43] Y. Zhao, Z. Huang, T. Li, W. Chen, C. LeGendre, X. Ren, A. Shapiro, and H. Li, “Learning perspective undistortion of portraits,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[44] J. Tan, S. Zhao, P. Xiong, J. Liu, H. Fan, and S. Liu, “Practical wide-angle portraits correction with deep structured models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 3498–3506.
[45] M. Kocabas, C.-H. P. Huang, J. Tesch, L. Müller, O. Hilliges, and M. J. Black, “SPEC: Seeing people in the wild with an estimated camera,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 11035–11045.
[46] P. Liu, Z. Cui, V. Larsson, and M. Pollefeys, “Deep shutter unrolling network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5941–5949.
[47] F. Zhu, S. Zhao, P. Wang, H. Wang, H. Yan, and S. Liu, “Semi-supervised wide-angle portraits correction by multi-scale transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19689–19698.
[48] R. Zhu, X. Yang, Y. Hold-Geoffroy, F. Perazzi, J. Eisenmann, K. Sunkavalli, and M. Chandraker, “Single view metrology in the wild,” in European Conference on Computer Vision. Springer, 2020, pp. 316–333.
[49] T. Nguyen, S. W. Chen, S. S. Shivakumar, C. J. Taylor, and V. Kumar, “Unsupervised deep homography: A fast and robust homography estimation model,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2346–2353, 2018.
[50] J. Zhang, C. Wang, S. Liu, L. Jia, N. Ye, J. Wang, J. Zhou, and J. Sun, “Content-aware unsupervised deep homography estimation,” in European Conference on Computer Vision. Springer, 2020, pp. 653–669.
[51] N. Ye, C. Wang, H. Fan, and S. Liu, “Motion basis learning for unsupervised deep homography estimation with subspace projection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 13117–13125.
[52] M. Hong, Y. Lu, N. Ye, C. Lin, Q. Zhao, and S. Liu, “Unsupervised homography estimation with coplanarity-aware GAN,” arXiv preprint arXiv:2205.03821, 2022.
model,” in Proceedings of the IEEE/CVF International Conference on [53] S. Liu, N. Ye, C. Wang, K. Luo, J. Wang, and J. Sun, “Content-
Computer Vision (ICCV), October 2021, pp. 4389–4398. aware unsupervised deep homography estimation and beyond,”
[34] X. Li, B. Zhang, P. V. Sander, and J. Liao, “Blind geometric IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.
distortion correction on images through deep learning,” in Pro- 1–1, 2022.
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern [54] S. Yang, C. Lin, K. Liao, Y. Zhao, and M. Liu, “Unsupervised fish-
Recognition (CVPR), June 2019. eye image correction through bidirectional loss with geometric
[35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional prior,” Journal of Visual Communication and Image Representation,
networks for biomedical image segmentation,” in International vol. 66, p. 102692, 2020.
Conference on Medical image computing and computer-assisted inter- [55] X. Wang, C. Wang, B. Liu, X. Zhou, L. Zhang, J. Zheng, and X. Bai,
vention. Springer, 2015, pp. 234–241. “Multi-view stereo in the deep learning era: A comprehensive
[36] M. Zhai, S. Workman, and N. Jacobs, “Detecting vanishing points review,” Displays, vol. 70, p. 102102, 2021.
using global image context in a non-manhattan world,” in Pro- [56] J. Fan, J. Zhang, and D. Tao, “Sir: Self-supervised image rectifi-
ceedings of the IEEE Conference on Computer Vision and Pattern cation via seeing the same scene from multiple different lenses,”
Recognition (CVPR), June 2016. IEEE Transactions on Image Processing, 2022.
[37] O. Bogdan, V. Eckstein, F. Rameau, and J.-C. Bazin, “Deepcalib: [57] J. Fang, I. Vasiljevic, V. Guizilini, R. Ambrus, G. Shakhnarovich,
a deep learning approach for automatic intrinsic calibration of A. Gaidon, and M. R. Walter, “Self-supervised camera self-
wide field-of-view cameras,” in Proceedings of the 15th ACM calibration from video,” arXiv preprint arXiv:2112.03325, 2021.
[58] J. Zhao, S. Wei, L. Liao, and Y. Zhao, “Dqn-based gradual fisheye [79] R. Ranftl and V. Koltun, “Deep fundamental matrix estima-
image rectification,” Pattern Recognition Letters, vol. 152, pp. 129– tion,” in Proceedings of the European Conference on Computer Vision
134, 2021. (ECCV), September 2018.
[59] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. [80] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, “Tanks and
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski temples: Benchmarking large-scale scene reconstruction,” ACM
et al., “Human-level control through deep reinforcement learn- Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1–13, 2017.
ing,” nature, vol. 518, no. 7540, pp. 529–533, 2015. [81] O. Poursaeed, G. Yang, A. Prakash, Q. Fang, H. Jiang, B. Har-
[60] K. Wilson and N. Snavely, “Robust global translations with iharan, and S. Belongie, “Deep fundamental matrix estimation
1dsfm,” in European Conference on Computer Vision. Springer, without correspondences,” in Proceedings of the European Confer-
2014, pp. 61–75. ence on Computer Vision (ECCV) Workshops, September 2018.
[61] [Online]. Available: https://www.repository.cam.ac.uk/handle/ [82] R. Zeng, S. Denman, S. Sridharan, and C. Fookes, “Rethinking
1810/251342;jsessionid=90AB1617B8707CD387CBF67437683F77 planar homography estimation using perspective fields,” in Asian
[62] S. Workman, M. Zhai, and N. Jacobs, “Horizon lines in the wild,” Conference on Computer Vision. Springer, 2018, pp. 571–586.
arXiv preprint arXiv:1604.02129, 2016. [83] G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna, “Calibnet:
[63] [Online]. Available: https://mvrl.cse.wustl.edu/datasets/hlw/ Geometrically supervised extrinsic calibration using 3d spatial
transformer networks,” in 2018 IEEE/RSJ International Conference
[64] P. Denis, J. H. Elder, and F. J. Estrada, “Efficient edge-based
on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1110–
methods for estimating manhattan frames in urban imagery,” in
1117.
European conference on computer vision. Springer, 2008, pp. 197–
210. [84] C.-K. Chang, J. Zhao, and L. Itti, “Deepvp: Deep learning for
vanishing point detection on 1 million street view images,” in
[65] O. Barinova, V. Lempitsky, E. Tretiak, and P. Kohli, “Geometric
2018 IEEE International Conference on Robotics and Automation
image parsing in man-made environments,” in European confer-
(ICRA). IEEE, 2018, pp. 4496–4503.
ence on computer vision. Springer, 2010, pp. 57–70.
[85] M. Lopez, R. Mari, P. Gargallo, Y. Kuang, J. Gonzalez-Jimenez,
[66] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, and G. Haro, “Deep single image camera calibration with radial
“Imagenet: A large-scale hierarchical image database,” in IEEE distortion,” in Proceedings of the IEEE/CVF Conference on Computer
Conference on Computer Vision and Pattern Recognition, 2009, pp. Vision and Pattern Recognition (CVPR), June 2019.
248–255.
[86] W. Xian, Z. Li, M. Fisher, J. Eisenmann, E. Shechtman, and
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, N. Snavely, “Uprightnet: Geometry-aware camera orientation
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects estimation from single images,” in Proceedings of the IEEE/CVF
in context,” in European conference on computer vision. Springer, International Conference on Computer Vision (ICCV), October 2019.
2014, pp. 740–755.
[87] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye,
[68] C.-H. Chang, C.-N. Chou, and E. Y. Chang, “Clkn: Cascaded Y. Huang, R. Tang, and S. Leutenegger, “Interiornet: Mega-scale
lucas-kanade networks for image alignment,” in Proceedings of multi-sensor photo-realistic indoor scenes dataset,” arXiv preprint
the IEEE Conference on Computer Vision and Pattern Recognition arXiv:1809.00716, 2018.
(CVPR), July 2017.
[88] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and
[69] F. Erlik Nowruzi, R. Laganiere, and N. Japkowicz, “Homogra- M. Nießner, “Scannet: Richly-annotated 3d reconstructions of
phy estimation from image pairs with hierarchical convolutional indoor scenes,” in Proceedings of the IEEE conference on computer
networks,” in Proceedings of the IEEE International Conference on vision and pattern recognition, 2017, pp. 5828–5839.
Computer Vision (ICCV) Workshops, Oct 2017. [89] B. Zhuang, Q.-H. Tran, G. H. Lee, L. F. Cheong, and M. Chan-
[70] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun draker, “Degeneracy in self-calibration revisited and a deep
database: Large-scale scene recognition from abbey to zoo,” in learning solution for uncalibrated slam,” in 2019 IEEE/RSJ Inter-
2010 IEEE computer society conference on computer vision and pattern national Conference on Intelligent Robots and Systems (IROS), 2019,
recognition. IEEE, 2010, pp. 3485–3492. pp. 3766–3773.
[71] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object [90] S. Ammar Abbas and A. Zisserman, “A geometric approach
retrieval with large vocabularies and fast spatial matching,” in to obtain a bird’s eye view from an image,” in Proceedings of
2007 IEEE conference on computer vision and pattern recognition. the IEEE/CVF International Conference on Computer Vision (ICCV)
IEEE, 2007, pp. 1–8. Workshops, Oct 2019.
[72] H. Shao, T. Svoboda, and L. Van Gool, “Zubud-zurich buildings [91] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun,
database for image based recognition,” Computer Vision Lab, Swiss “Carla: An open urban driving simulator,” in Conference on robot
Federal Institute of Technology, Switzerland, Tech. Rep, vol. 260, learning. PMLR, 2017, pp. 1–16.
no. 20, p. 6, 2003. [92] O. Barinova, V. Lempitsky, E. Tretiak, and P. Kohli, “Geometric
[73] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “La- image parsing in man-made environments,” in European confer-
beled faces in the wild: A database for studying face recognition ence on computer vision. Springer, 2010, pp. 57–70.
in unconstrained environments,” in Workshop on faces in’Real- [93] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “Distortion rectifica-
Life’Images: detection, alignment, and recognition, 2008. tion from static to dynamic: A distortion sequence construction
[74] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for au- perspective,” IEEE Transactions on Circuits and Systems for Video
tonomous driving? the kitti vision benchmark suite,” in 2012 Technology, vol. 30, no. 11, pp. 3870–3882, 2020.
IEEE conference on computer vision and pattern recognition. IEEE, [94] R. Jung, A. S. J. Lee, A. Ashtari, and J.-C. Bazin, “Deep360up:
2012, pp. 3354–3361. A deep learning-based approach for automatic vr image upright
[75] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, “Recognizing adjustment,” in 2019 IEEE Conference on Virtual Reality and 3D
scene viewpoint using panoramic place representation,” in 2012 User Interfaces (VR), 2019, pp. 1–8.
IEEE Conference on Computer Vision and Pattern Recognition. IEEE, [95] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, “Robust
2012, pp. 2695–2702. optimization for deep regression,” in Proceedings of the IEEE
[76] X. Yin, X. Wang, J. Yu, M. Zhang, P. Fua, and D. Tao, “Fishey- international conference on computer vision, 2015, pp. 2830–2838.
erecnet: A multi-context collaborative deep network for fisheye [96] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba,
image rectification,” in Proceedings of the European Conference on “Places: A 10 million image database for scene recognition,” IEEE
Computer Vision (ECCV), September 2018. Transactions on Pattern Analysis and Machine Intelligence, vol. 40,
[77] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, no. 6, pp. 1452–1464, 2018.
“Scene parsing through ade20k dataset,” in Proceedings of the IEEE [97] B. Zhuang, Q.-H. Tran, P. Ji, L.-F. Cheong, and M. Chandraker,
conference on computer vision and pattern recognition, 2017, pp. 633– “Learning structure-and-motion-aware rolling shutter correc-
641. tion,” in Proceedings of the IEEE/CVF Conference on Computer Vision
[78] Y. Shi, D. Zhang, J. Wen, X. Tong, X. Ying, and H. Zha, “Radial and Pattern Recognition (CVPR), June 2019.
lens distortion correction by adding a weight layer with inverted [98] Z. Xue, N. Xue, G.-S. Xia, and W. Shen, “Learning to calibrate
foveal models to convolutional neural networks,” in 2018 24th straight lines for fisheye image rectification,” in Proceedings of the
International Conference on Pattern Recognition (ICPR), 2018, pp. IEEE/CVF Conference on Computer Vision and Pattern Recognition
1–6. (CVPR), June 2019.
[99] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma, “Learn- [120] H. Zhao, X. Ying, Y. Shi, X. Tong, J. Wen, and H. Zha, “Rdcface:
ing to parse wireframes in images of man-made environments,” Radial distortion correction for face recognition,” in Proceedings
in Proceedings of the IEEE Conference on Computer Vision and Pattern of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
Recognition, 2018, pp. 626–635. nition (CVPR), June 2020.
[100] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and [121] F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, and C. C.
T. Funkhouser, “Semantic scene completion from a single depth Loy, “The devil of face recognition is in the noise,” in Proceedings
image,” in Proceedings of the IEEE conference on computer vision and of the European Conference on Computer Vision (ECCV), 2018, pp.
pattern recognition, 2017, pp. 1746–1754. 765–780.
[101] L. Yin, X. Sun, T. Worm, and M. Reale, “A high-resolution 3d [122] Z.-C. Xue, N. Xue, and G.-S. Xia, “Fisheye distortion rectification
dynamic facial expression database, 2008,” in IEEE International from deep straight lines,” arXiv preprint arXiv:2003.11386, 2020.
Conference on Automatic Face and Gesture Recognition, Amsterdam, [123] M. Baradad and A. Torralba, “Height and uprightness invari-
The Netherlands, vol. 126. ance for 3d prediction from a single view,” in Proceedings of the
[102] Y. Zhou, H. Qi, J. Huang, and Y. Ma, “Neurvps: Neural vanishing IEEE/CVF Conference on Computer Vision and Pattern Recognition
point scanning via conic convolution,” Advances in Neural Infor- (CVPR), June 2020.
mation Processing Systems, vol. 32, 2019. [124] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor seg-
[103] Y. Zhou, H. Qi, Y. Zhai, Q. Sun, Z. Chen, L.-Y. Wei, and Y. Ma, mentation and support inference from rgbd images,” in European
“Learning to reconstruct 3d manhattan wireframes from a single conference on computer vision. Springer, 2012, pp. 746–760.
image,” in Proceedings of the IEEE/CVF International Conference on [125] Q. Zheng, J. Chen, Z. Lu, B. Shi, X. Jiang, K.-H. Yap, L.-Y. Duan,
Computer Vision, 2019, pp. 7698–7707. and A. C. Kot, “What does plate glass reveal about camera cal-
[104] L. Sha, J. Hobbs, P. Felsen, X. Wei, P. Lucey, and S. Ganguly, “End- ibration?” in Proceedings of the IEEE/CVF Conference on Computer
to-end camera calibration for broadcast videos,” in Proceedings of Vision and Pattern Recognition (CVPR), June 2020.
the IEEE/CVF Conference on Computer Vision and Pattern Recogni- [126] [Online]. Available: https://figshare.com/articles/dataset/
tion (CVPR), June 2020. FocaLens/3399169/2
[105] N. Homayounfar, S. Fidler, and R. Urtasun, “Sports field local- [127] K. Yuan, Z. Guo, and Z. J. Wang, “Rggnet: Tolerance aware
ization via deep structured models,” in Proceedings of the IEEE lidar-camera online calibration with geometric deep learning and
Conference on Computer Vision and Pattern Recognition, 2017, pp. generative model,” IEEE Robotics and Automation Letters, vol. 5,
5212–5220. no. 4, pp. 6956–6963, 2020.
[106] J. Lee, M. Sung, H. Lee, and J. Kim, “Neural geometric parser
[128] J. Shi, Z. Zhu, J. Zhang, R. Liu, Z. Wang, S. Chen, and H. Liu,
for single image camera calibration,” in European Conference on
“Calibrcnn: Calibrating camera and lidar by recurrent convo-
Computer Vision. Springer, 2020, pp. 541–557.
lutional neural network and geometric constraints,” in 2020
[107] [Online]. Available: https://developers.google.com/maps/ IEEE/RSJ International Conference on Intelligent Robots and Systems
[108] A. Cramariuc, A. Petrov, R. Suri, M. Mittal, R. Siegwart, and (IROS). IEEE, 2020, pp. 10 197–10 202.
C. Cadena, “Learning camera miscalibration detection,” in 2020
[129] Y. Zhu, C. Li, and Y. Zhang, “Online camera-lidar calibration
IEEE International Conference on Robotics and Automation (ICRA),
with sensor semantic information,” in 2020 IEEE International
2020, pp. 4997–5003.
Conference on Robotics and Automation (ICRA). IEEE, 2020, pp.
[109] C. Zhang, F. Rameau, J. Kim, D. M. Argaw, J.-C. Bazin, and 4970–4976.
I. S. Kweon, “Deepptz: Deep self-calibration for ptz cameras,”
[130] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
in Proceedings of the IEEE/CVF Winter Conference on Applications of
and A. Zisserman, “The PASCAL Visual Object Classes
Computer Vision (WACV), March 2020.
Challenge 2012 (VOC2012) Results,” http://www.pascal-
[110] H. Le, F. Liu, S. Zhang, and A. Agarwala, “Deep homography
network.org/challenges/VOC/voc2012/workshop/index.html.
estimation for dynamic scenes,” in Proceedings of the IEEE/CVF
[131] W. Wang, S. Nobuhara, R. Nakamura, and K. Sakurada, “Soic: Se-
Conference on Computer Vision and Pattern Recognition (CVPR),
mantic online initialization and calibration for lidar and camera,”
June 2020.
arXiv preprint arXiv:2003.04260, 2020.
[111] B. Davidson, M. S. Alvi, and J. F. Henriques, “360° camera
alignment via segmentation,” in European Conference on Computer [132] S. Wu, A. Hadachi, D. Vivet, and Y. Prabhakar, “Netcalib: A novel
Vision. Springer, 2020, pp. 579–595. approach for lidar-camera auto-calibration based on deep learn-
ing,” in 2020 25th International Conference on Pattern Recognition
[112] Y.-Y. Jau, R. Zhu, H. Su, and M. Chandraker, “Deep keypoint-
(ICPR). IEEE, 2021, pp. 6648–6655.
based camera pose estimation with geometric constraints,” in
2020 IEEE/RSJ International Conference on Intelligent Robots and [133] Y. Li, W. Pei, and Z. He, “Srhen: stepwise-refining homography
Systems (IROS), 2020, pp. 4950–4957. estimation network via parsing geometric correspondences in
[113] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, deep latent space,” in Proceedings of the 28th ACM International
“The apolloscape open dataset for autonomous driving and its Conference on Multimedia, 2020, pp. 3063–3071.
application,” IEEE transactions on pattern analysis and machine [134] Y. Gil, S. Elmalem, H. Haim, E. Marom, and R. Giryes, “Online
intelligence, vol. 42, no. 10, pp. 2702–2719, 2019. training of stereo self-calibration using monocular depth estima-
[114] Y.-H. Li, I.-C. Lo, and H. H. Chen, “Deep face rectification for tion,” IEEE Transactions on Computational Imaging, vol. 7, pp. 812–
360° dual-fisheye cameras,” IEEE Transactions on Image Processing, 823, 2021.
vol. 30, pp. 264–276, 2021. [135] [Online]. Available: http://www.cs.toronto.edu/∼harel/
[115] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: TAUAgent/download.html
A dataset and benchmark for large-scale face recognition,” in [136] J. Lee, H. Go, H. Lee, S. Cho, M. Sung, and J. Kim, “Ctrl-c: Camera
European conference on computer vision. Springer, 2016, pp. 87– calibration transformer with line-classification,” in Proceedings of
102. the IEEE/CVF International Conference on Computer Vision (ICCV),
[116] Y. Shi, X. Tong, J. Wen, H. Zhao, X. Ying, and H. Zha, “Position- October 2021, pp. 16 228–16 237.
aware and symmetry enhanced gan for radial distortion correc- [137] N. Wakai and T. Yamashita, “Deep single fisheye image camera
tion,” in 2020 25th International Conference on Pattern Recognition calibration for over 180-degree projection of field of view,” in
(ICPR), 2021, pp. 1701–1708. Proceedings of the IEEE/CVF International Conference on Computer
[117] H. Zhao, Y. Shi, X. Tong, X. Ying, and H. Zha, “A simple yet Vision (ICCV) Workshops, October 2021, pp. 1174–1183.
effective pipeline for radial distortion correction,” in 2020 IEEE [138] P. Mirowski, A. Banki-Horvath, K. Anderson, D. Teplyashin,
International Conference on Image Processing (ICIP), 2020, pp. 878– K. M. Hermann, M. Malinowski, M. K. Grimes, K. Simonyan,
882. K. Kavukcuoglu, A. Zisserman et al., “The streetlearn environ-
[118] C.-H. Chao, P.-L. Hsu, H.-Y. Lee, and Y.-C. F. Wang, “Self- ment and dataset,” arXiv preprint arXiv:1903.01292, 2019.
supervised deep learning for fisheye image rectification,” in [139] K. Liao, C. Lin, and Y. Zhao, “A deep ordinal distortion estima-
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, tion approach for distortion rectification,” IEEE Transactions on
Speech and Signal Processing (ICASSP), 2020, pp. 2248–2252. Image Processing, vol. 30, pp. 3362–3375, 2021.
[119] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and [140] K. Zhao, C. Lin, K. Liao, S. Yang, and Y. Zhao, “Revisiting radial
J. Xiao, “Lsun: Construction of a large-scale image dataset us- distortion rectification in polar-coordinates: A new and efficient
ing deep learning with humans in the loop,” arXiv preprint learning perspective,” IEEE Transactions on Circuits and Systems
arXiv:1506.03365, 2015. for Video Technology, pp. 1–1, 2021.
[141] A. Eichenseer and A. Kaup, “A data set providing synthetic and [161] X. Li, F. Flohr, Y. Yang, H. Xiong, M. Braun, S. Pan, K. Li,
real-world fisheye video sequences,” in 2016 IEEE International and D. M. Gavrila, “A new benchmark for vision-based cyclist
Conference on Acoustics, Speech and Signal Processing (ICASSP). detection,” in 2016 IEEE Intelligent Vehicles Symposium (IV). IEEE,
IEEE, 2016, pp. 1541–1545. 2016, pp. 1028–1033.
[142] S. Yang, C. Lin, K. Liao, C. Zhang, and Y. Zhao, “Progressively [162] S.-Y. Cao, J. Hu, Z. Sheng, and H.-L. Shen, “Iterative deep
complementary network for fisheye image rectification using homography estimation,” arXiv preprint arXiv:2203.15982, 2022.
appearance flow,” in Proceedings of the IEEE/CVF Conference on [163] M. Cao, Z. Zhong, J. Wang, Y. Zheng, and Y. Yang, “Learning
Computer Vision and Pattern Recognition (CVPR), June 2021, pp. adaptive warping for real-world rolling shutter correction,” in
6348–6357. Proceedings of the IEEE/CVF Conference on Computer Vision and
[143] Y. Zhao, X. Huang, and Z. Zhang, “Deep lucas-kanade homog- Pattern Recognition, 2022, pp. 17 785–17 793.
raphy for multimodal image alignment,” in Proceedings of the [164] T. Do, O. Miksik, J. DeGol, H. S. Park, and S. N. Sinha, “Learning
IEEE/CVF Conference on Computer Vision and Pattern Recognition to detect scene landmarks for camera localization,” in Proceedings
(CVPR), June 2021, pp. 15 950–15 959. of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
[144] R. Shao, G. Wu, Y. Zhou, Y. Fu, L. Fang, and Y. Liu, “Localtrans: A nition, 2022, pp. 11 132–11 142.
multiscale local transformer network for cross-resolution homog- [165] T. Do, K. Vuong, S. I. Roumeliotis, and H. S. Park, “Surface nor-
raphy estimation,” in Proceedings of the IEEE/CVF International mal estimation of tilted images via spatial rectifier,” in European
Conference on Computer Vision (ICCV), October 2021, pp. 14 890– Conference on Computer Vision. Springer, 2020, pp. 265–280.
14 899. [166] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and
[145] Y. Chen, G. Wang, P. An, Z. You, and X. Huang, “Fast and A. Fitzgibbon, “Scene coordinate regression forests for camera re-
accurate homography estimation using extendable compression localization in rgb-d images,” in Proceedings of the IEEE Conference
network,” in 2021 IEEE International Conference on Image Process- on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.
ing (ICIP), 2021, pp. 1024–1028. [167] C. M. Parameshwara, G. Hari, C. Fermüller, N. J. Sanket, and
[146] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Unsupervised deep Y. Aloimonos, “Diffposenet: Direct differentiable camera pose
image stitching: Reconstructing stitched features to images,” estimation,” in Proceedings of the IEEE/CVF Conference on Computer
IEEE Transactions on Image Processing, vol. 30, pp. 6184–6197, 2021. Vision and Pattern Recognition, 2022, pp. 6845–6854.
[147] S. Garg, D. P. Mohanty, S. P. Thota, and S. Moharana, “A simple [168] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu,
approach to image tilt correction with self-attention mobilenet for A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the
smartphones,” arXiv preprint arXiv:2111.00398, 2021. limits of visual slam,” in 2020 IEEE/RSJ International Conference
[148] K. Chen, N. Snavely, and A. Makadia, “Wide-baseline relative on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–
camera pose estimation with directional learning,” in Proceedings 4916.
of the IEEE/CVF Conference on Computer Vision and Pattern Recog- [169] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers,
nition, 2021, pp. 3258–3268. “A benchmark for the evaluation of rgb-d slam systems,” in 2012
[149] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, IEEE/RSJ international conference on intelligent robots and systems.
M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: IEEE, 2012, pp. 573–580.
Learning from rgb-d data in indoor environments,” arXiv preprint [170] B. J. Pijnacker Hordijk, K. Y. Scheper, and G. C. De Croon,
arXiv:1709.06158, 2017. “Vertical landing for micro air vehicles using event-based optical
[150] Z. Zhong, Y. Zheng, and I. Sato, “Towards rolling shutter cor- flow,” Journal of Field Robotics, vol. 35, no. 1, pp. 69–90, 2018.
rection and deblurring in dynamic scenes,” in Proceedings of the [171] L. Yang, R. Shrestha, W. Li, S. Liu, G. Zhang, Z. Cui, and P. Tan,
IEEE/CVF Conference on Computer Vision and Pattern Recognition, “Scenesqueezer: Learning to compress scene for camera relocal-
2021, pp. 9219–9228. ization,” in Proceedings of the IEEE/CVF Conference on Computer
[151] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Fast and Vision and Pattern Recognition, 2022, pp. 8259–8268.
accurate image super-resolution with deep laplacian pyramid [172] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 year,
networks,” IEEE transactions on pattern analysis and machine in- 1000 km: The oxford robotcar dataset,” The International Journal of
telligence, vol. 41, no. 11, pp. 2599–2613, 2018. Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.
[152] X. Lv, B. Wang, Z. Dou, D. Ye, and S. Wang, “Lccnet: Lidar and [173] G. Ponimatkin, Y. Labbé, B. Russell, M. Aubry, and J. Sivic, “Focal
camera self-calibration using cost volume network,” in Proceed- length and object pose estimation via render and compare,” in
ings of the IEEE/CVF Conference on Computer Vision and Pattern Proceedings of the IEEE/CVF Conference on Computer Vision and
Recognition, 2021, pp. 2894–2901. Pattern Recognition, 2022, pp. 3825–3834.
[153] X. Lv, S. Wang, and D. Ye, “Cfnet: Lidar-camera registration using [174] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B.
calibration flow network,” Sensors, vol. 21, no. 23, p. 8112, 2021. Tenenbaum, and W. T. Freeman, “Pix3d: Dataset and methods
[154] Y. Liao, J. Xie, and A. Geiger, “Kitti-360: A novel dataset and for single-image 3d shape modeling,” in Proceedings of the IEEE
benchmarks for urban scene understanding in 2d and 3d,” IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp.
Transactions on Pattern Analysis and Machine Intelligence, 2022. 2974–2983.
[155] B. Fan and Y. Dai, “Inverting a rolling shutter camera: bring [175] Y. Wang, X. Tan, Y. Yang, X. Liu, E. Ding, F. Zhou, and L. S.
rolling shutter images to high framerate global shutter video,” in Davis, “3d pose estimation for fine-grained object categories,” in
Proceedings of the IEEE/CVF International Conference on Computer Proceedings of the European Conference on Computer Vision (ECCV)
Vision, 2021, pp. 4228–4237. Workshops, 2018, pp. 0–0.
[156] B. Fan, Y. Dai, and M. He, “Sunet: symmetric undistortion [176] X. Jing, X. Ding, R. Xiong, H. Deng, and Y. Wang, “Dxq-net:
network for rolling shutter correction,” in Proceedings of the Differentiable lidar-camera extrinsic calibration using quality-
IEEE/CVF International Conference on Computer Vision, 2021, pp. aware flow,” arXiv preprint arXiv:2203.09385, 2022.
4541–4550. [177] Y. Zhang, X. Zhao, and D. Qian, “Learning-based framework for
[157] Z. Liu, H. Tang, S. Zhu, and S. Han, “Semalign: Annotation-free camera calibration with distortion correction and high precision
camera-lidar calibration with semantic alignment loss,” in 2021 feature detection,” arXiv preprint arXiv:2202.00158, 2022.
IEEE/RSJ International Conference on Intelligent Robots and Systems [178] Y. Sun, J. Li, Y. Wang, X. Xu, X. Yang, and Z. Sun, “Atop: An
(IROS). IEEE, 2021, pp. 8845–8851. attention-to-optimization approach for automatic lidar-camera
[158] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, calibration via cross-modal object matching,” IEEE Transactions
M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle on Intelligent Vehicles, 2022.
datasets,” The International Journal of Robotics Research, vol. 35, [179] G. Wang, J. Qiu, Y. Guo, and H. Wang, “Fusionnet: Coarse-
no. 10, pp. 1157–1163, 2016. to-fine extrinsic calibration network of lidar and camera with
[159] M. Schönbein, T. Strauß, and A. Geiger, “Calibrating and center- hierarchical point-pixel fusion,” in 2022 International Conference
ing quasi-central catadioptric cameras,” in 2014 IEEE International on Robotics and Automation (ICRA). IEEE, 2022, pp. 8964–8970.
Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. [180] C. Ye, H. Pan, and H. Gao, “Keypoint-based lidar-camera online
4443–4450. calibration with robust geometric network,” IEEE Transactions on
[160] T. H. Butt and M. Taj, “Camera calibration through camera projec- Instrumentation and Measurement, vol. 71, pp. 1–11, 2021.
tion loss,” in ICASSP 2022 - 2022 IEEE International Conference on [181] N. Wakai, S. Sato, Y. Ishii, and T. Yamashita, “Rethinking generic
Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 2649– camera models for deep single image camera calibration to
2653. recover rotation and fisheye distortion,” in Proceedings of European
Conference on Computer Vision (ECCV), vol. 13678, 2022, pp. 679– application,” IEEE transactions on pattern analysis and machine
698. intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
[182] S.-H. Chang, C.-Y. Chiu, C.-S. Chang, K.-W. Chen, C.-Y. Yao, R.-R. [206] H. Yu, Y. Luo, M. Shu, Y. Huo, Z. Yang, Y. Shi, Z. Guo, H. Li,
Lee, and H.-K. Chu, “Generating 360 outdoor panorama dataset X. Hu, J. Yuan et al., “Dair-v2x: A large-scale dataset for vehicle-
with reliable sun position estimation,” in SIGGRAPH Asia 2018 infrastructure cooperative 3d object detection,” in Proceedings of
Posters, 2018, pp. 1–2. the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
[183] C. Wu, “Towards linear-time incremental structure from motion,” tion, 2022, pp. 21 361–21 370.
in 2013 International Conference on 3D Vision-3DV 2013. IEEE, [207] J. Kang and N. L. Doh, “Automatic targetless camera–LIDAR
2013, pp. 127–134. calibration by aligning edge with Gaussian mixture model,”
[184] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, Journal of Field Robotics, vol. 37, no. 1, pp. 158–179, 2020.
W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for [208] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu,
mobilenetv3,” in Proceedings of the IEEE/CVF international confer- A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A
ence on computer vision, 2019, pp. 1314–1324. multimodal dataset for autonomous driving,” in Proceedings of
[185] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep the IEEE/CVF conference on computer vision and pattern recognition,
learning on point sets for 3d classification and segmentation,” 2020, pp. 11 621–11 631.
in Proceedings of the IEEE conference on computer vision and pattern [209] J. Mao, M. Niu, C. Jiang, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li,
recognition, 2017, pp. 652–660. J. Yu, C. Xu et al., “One million scenes for autonomous driving:
[186] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Con- Once dataset,” 2021.
volution on x-transformed points,” Advances in neural information [210] K. He, R. Girshick, and P. Dollár, “Rethinking imagenet pre-
processing systems, vol. 31, 2018. training,” in Proceedings of the IEEE International Conference on
[187] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for Computer Vision, 2019, pp. 4918–4927.
optical flow using pyramid, warping, and cost volume,” in [211] Y. Jeong, S. Ahn, C. Choy, A. Anandkumar, M. Cho, and J. Park,
Proceedings of the IEEE Conference on Computer Vision and Pattern “Self-calibrating neural radiance fields,” in Proceedings of the
Recognition (CVPR), June 2018. IEEE/CVF International Conference on Computer Vision, 2021, pp.
[188] R. Hartley and A. Zisserman, Multiple view geometry in computer 5846–5854.
vision. Cambridge university press, 2003.
[189] B. D. Lucas, T. Kanade et al., An iterative image registration technique Kang Liao received his Ph.D. degree from Beijing Jiaotong University in
with an application to stereo vision. Vancouver, 1981, vol. 81. 2023. From 2021 to 2022, he was a Visiting Researcher at Max Planck
[190] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practi- Institute for Informatics in Germany. His current research interests in-
cal guidelines for efficient cnn architecture design,” in Proceedings clude camera calibration, 3D vision, and panoramic vision.
of the European conference on computer vision (ECCV), 2018, pp. 116–
131. Lang Nie is currently pursuing his Ph.D. degree at Beijing Jiaotong
[191] M. A. Fischler and R. C. Bolles, “Random sample consensus: a University. His current research interests include multi-view geometry,
paradigm for model fitting with applications to image analysis image stitching, and computer vision.
and automated cartography,” Communications of the ACM, vol. 24,
no. 6, pp. 381–395, 1981. Shujuan Huang is currently pursuing his Ph.D. degree at Beijing Jiao-
[192] L. Nie, C. Lin, K. Liao, and Y. Zhao, “Learning edge-preserved tong University. His current research interests include camera-LiDAR
image stitching from multi-scale deep homography,” Neurocom- calibration, depth completion, and computer vision.
puting, vol. 491, pp. 533–543, 2022.
[193] S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying Chunyu Lin is a Professor at Beijing Jiaotong University. From 2011 to
framework,” International journal of computer vision, vol. 56, no. 3, 2012, he was a Post-Doctoral Researcher at the Multimedia Laboratory,
pp. 221–255, 2004. Ghent University, Belgium. His research interests include multi-view
[194] J. Nocedal and S. J. Wright, Numerical optimization. Springer, geometry, camera calibration, and virtual reality video processing.
1999.
[195] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for Jing Zhang is currently a Research Fellow at the School of Computer
optical flow,” in European conference on computer vision. Springer, Science, The University of Sydney. His research interests include com-
2020, pp. 402–419. puter vision and deep learning. He has published more than 60 papers
[196] Y. Li, W. Pei, and Z. He, “Ssorn: Self-supervised outlier removal on prestigious conferences and journals, such as CVPR, ICCV, ECCV,
network for robust homography estimation,” arXiv preprint IJCV and IEEE T-PAMI. He is a SPC of the AAAI and IJCAI.
arXiv:2208.14093, 2022.
[197] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Yao Zhao (Fellow, IEEE) is the Director of the Institute of Information
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Science, Beijing Jiaotong University. His current research interests in-
Advances in neural information processing systems, vol. 30, 2017. clude image/video coding and video analysis and understanding. He
[198] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, was named a Distinguished Young Scholar by the National Science
and A. Davison, “gvnn: Neural network library for geometric Foundation of China in 2010 and was elected as a Chang Jiang Scholar
computer vision,” in European Conference on Computer Vision. of Ministry of Education of China in 2013.
Springer, 2016, pp. 67–82.
Moncef Gabbouj (Fellow, IEEE) is a Professor at the Department of
[199] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep
Computing Sciences, Tampere University, Finland. He was an Academy
hierarchical feature learning on point sets in a metric space,”
of Finland Professor. His research interests include Big Data analytics,
Advances in neural information processing systems, vol. 30, 2017.
multimedia analysis, artificial intelligence, machine learning, pattern
[200] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Op-
recognition, video processing, and coding. Dr. Gabbouj is a Fellow
timal speed and accuracy of object detection,” arXiv preprint
of the IEEE and Asia-Pacific Artificial Intelligence Association. He is
arXiv:2004.10934, 2020.
member of the Academia Europaea, the Finnish Academy of Science
[201] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Bei-
and Letters, and the Finnish Academy of Engineering Sciences.
jbom, “Pointpillars: Fast encoders for object detection from point
clouds,” in Proceedings of the IEEE/CVF conference on computer
Dacheng Tao (Fellow, IEEE) is currently the Inaugural Director of the
vision and pattern recognition, 2019, pp. 12 697–12 705.
JD Explore Academy and a Senior Vice President of JD.com, Inc. He
[202] R. Poli, J. Kennedy, and T. Blackwell, “Particle swarm optimiza- mainly applies statistics and mathematics to artificial intelligence and
tion,” Swarm intelligence, vol. 1, no. 1, pp. 33–57, 2007. data science. His research is detailed in one monograph and over 200
[203] S. Gould, R. Hartley, and D. Campbell, “Deep declarative net- publications in prestigious journals and proceedings at leading confer-
works,” IEEE Transactions on Pattern Analysis and Machine Intelli- ences. He is a fellow of the Australian Academy of Science, AAAS,
gence, vol. 44, no. 8, pp. 3988–4004, 2021. and ACM. He received the 2015 Australian Scopus-Eureka Prize, the
[204] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose 2018 IEEE ICDM Research Contributions Award, and the 2021 IEEE
estimation and tracking,” in Proceedings of the European conference Computer Society McCluskey Technical Achievement Award.
on computer vision (ECCV), 2018, pp. 466–481.
[205] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang,
“The apolloscape open dataset for autonomous driving and its
[Fig. 1 graphic: a timeline from 2015 to 2022 listing the methods of each year, color-coded by the legend categories Standard, Distortion, Cross-View, and Cross-Sensor; the individual methods and venues are enumerated in the caption below.]
Fig. 1. A concise milestone of deep learning-based camera calibration methods. We classify all methods based on the uncalibrated camera model
and its extended applications: standard model, distortion model, cross-view model, and cross-sensor model. Standard model: DeepFocal [1],
PoseNet [2], DeepHorizon [3], DeepVP [4], Chang et al. [5], UprightNet [6], Lee et al. [7], NeurVPS [8], Deep360Up [9], Davidson et al. [10],
DeepFEPE [11], Baradad et al. [12], Zheng et al. [13], Zhu et al. [14], StereoCaliNet [15], SA-MobileNet [16], Fang et al. [17], CPL [18], FocalPose
[19], DirectionNet [20], DVPD [21], Do et al. [22], CTRL-C [23], SPEC [24], DiffPoseNet [25], SceneSqueezer [26]. Distortion model: Rong et
al. [27], Hold-Geoffroy et al. [28], DeepCalib [29], URS-CNN [30], FishEyeRecNet [31], Shi et al. [32], DR-GAN [33], STD [34], UnFishCor [35],
BlindCor [36], RSC-Net [37], Xue et al. [38], Zhuang et al. [39], Lopez et al. [40], Zhao et al. [41], DDM [42], MisCaliDet [43], DeepPTZ [44], FE-
GAN [45], PSE-GAN [46], RDC-Net [47], Li et al. [48], RDCFace [49], LaRecNet [50], DeepUnrollNet [51], OrdinalDistortion [52], PolarRecNet [53],
DQN-RecNet [54], Tan et al. [55], PCN [56], SIR [57], DaRecNet [58], Wakai et al. [59], GenCaliNet [60], JCD [61], SS-WPC [62], AW-RSC [63],
EvUnroll [64], Fan et al. [65], SUNet [66], CCS-Net [67], FishFormer [68]. Cross-View model: DHN [69], CLKN [70], HierarchicalNet [71], DeepFM
[72], Poursaeed et al. [73], UDHN [74], PFNet [75], SRHEN [76], SSR-Net [77], Abbas et al. [78], Sha et al. [79], MHN [80], CA-UDHN [81], DLKFM
[82], LocalTrans [83], BasesHomo [84], ShuffleHomoNet [85], DAMG-Homo [86], IHN [87], HomoGAN [88], Liu et al. [89]. Cross-Sensor model:
RegNet [90], CalibNet [91], RGGNet [92], SSI-Calib [93], SOIC [94], CalibRCNN [95], NetCalib [96], LCCNet [97], CFNet [98], SemAlign [99],
DXQ-Net [100], SST-Calib [101], ATOP [102], FusionNet [103], RKGCNet [104].
where fx and fy are the focal lengths along the X-axis and Y-axis of the camera, respectively. Generally, for most cameras, fx = fy, and they are unified to f. mu and mv are the number of pixels per unit distance, in which mu = mv if the image has square pixels. s is the skew coefficient: a CCD sensor's pixels might not be precisely square, which would cause a slight distortion along the X or Y axes, and s models the resulting non-orthogonality of the pixel axes. It becomes 0 when the X-axis and Y-axis are perpendicular to each other. [cu, cv]^T is the coordinate of the image center. According to previous works and factory design, the intrinsic parameters can be refined by s = 0, mu = mv, and the focal length in the pixel unit; then Eq. (3) can be reformulated as:

P_i = \begin{bmatrix} f_x & 0 & c_u \\ 0 & f_y & c_v \\ 0 & 0 & 1 \end{bmatrix} [x_n, y_n, 1]^T. \quad (5)
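As a minimal numerical illustration of Eq. (5), the following NumPy sketch builds the intrinsic matrix K with zero skew and maps a normalized image point to pixel coordinates; all intrinsic values below are assumed example numbers, not values taken from the survey.

```python
import numpy as np

# Assumed example intrinsics: focal lengths and principal point in pixels.
fx, fy = 800.0, 800.0
cu, cv = 320.0, 240.0

# Intrinsic matrix K as in Eq. (5), with zero skew (s = 0).
K = np.array([[fx, 0.0, cu],
              [0.0, fy, cv],
              [0.0, 0.0, 1.0]])

# A normalized image point [x_n, y_n, 1]^T.
p_norm = np.array([0.1, -0.05, 1.0])

# Pixel coordinates P_i = K [x_n, y_n, 1]^T.
p_pix = K @ p_norm
print(p_pix)  # -> [400., 200., 1.]
```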
In addition to numerical camera parameters, some geometric representations can provide useful clues for camera calibration, such as vanishing points and horizon lines. These representations establish clear relationships between image features and calibration objectives, which can alleviate the difficulty of learning conventional and implicit camera parameters.
[Fig. 2 graphics: bar charts of the number of publications per year (2015-2022) and pie charts summarizing the calibration objectives (intrinsics/extrinsics, radial distortion, rolling-shutter distortion, projection matrix, camera-LiDAR), whether the dataset is simulated (yes/no), and the learning strategy (SL, Semi-SL, WSL, USL, SSL, RL).]
Fig. 2. A statistical analysis of deep learning-based camera calibration methods. Specifically, we summarize all literature based on the number of publications per year, calibration objectives, simulation of the dataset, and learning strategy.
Lines and points are both represented as three-dimensional vectors in homogeneous coordinates. The definitions for computing the line l that connects two points and the point p at the intersection of two lines can be given by:

l = \frac{p_1 \times p_2}{\|p_1 \times p_2\|}, \quad p = \frac{l_1 \times l_2}{\|l_1 \times l_2\|}. \quad (6)

There are two parameterizations of the horizon line: slope-offset (θ, ρ) and left-right (l, r). Assume that the viewing orientation is down the negative z-axis, with the positive x-direction to the right and the positive y-direction up. As a result, the world viewing direction of the camera can be described by R_c^T [0, 0, −1]^T. Since the world vector [0, 1, 0]^T points in the zenith direction, a set of points p can represent the horizon line:

p^T K^{-T} R [0, 1, 0]^T = 0. \quad (7)

As mentioned by Barnard [105], the normalized line direction vector d can be formulated for the Gaussian sphere representation of a vanishing point v. In particular, suppose a 3D ray is described by o + λd, where o and d are its origin and unit direction vector, respectively. Then, the vanishing point can be represented by λ → ∞, in which the image coordinate is formed by v = [v_x, v_y]^T := lim_{λ→∞} [p_x, p_y]^T ∈ R^2. Thus, the 3D direction of a line based on its vanishing point can be calculated by:

d = [v_x - c_x, \; v_y - c_y, \; f]^T \in R^3. \quad (8)

By using d rather than v, the degraded situations where d is parallel to the image plane are eliminated. Additionally, it provides a natural measurement for determining the separation between two vanishing points.
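A short NumPy sketch of Eqs. (6) and (8) follows; the points, lines, principal point, and focal length are assumed example values, not data from the survey.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Two image points in homogeneous coordinates (assumed example values).
p1 = np.array([100.0, 200.0, 1.0])
p2 = np.array([300.0, 250.0, 1.0])

# Eq. (6): the line through two points is their (normalized) cross product.
l = normalize(np.cross(p1, p2))

# Likewise, the intersection of two lines is the cross product of the lines.
l1 = np.array([1.0, -1.0, 50.0])
l2 = np.array([0.5, 1.0, -400.0])
p = normalize(np.cross(l1, l2))

# Eq. (8): 3D direction of a line from its vanishing point v = [vx, vy],
# given an assumed principal point (cx, cy) and focal length f.
vx, vy, cx, cy, f = 640.0, 120.0, 320.0, 240.0, 800.0
d = normalize(np.array([vx - cx, vy - cy, f]))
```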
2.2 Wide-angle Camera Model

The perspective projection model, given a typical pinhole camera with focal length f, can be expressed as:

r = f \tan\theta, \quad (9)

where r denotes the projection distance between the principal point and the points in the image, and θ denotes the angle between the incident ray and the optical axis of the camera. It is straightforward to determine that θ should be less than 90°; otherwise, the incoming ray will not cross the image plane, and the pinhole camera will not be able to view anything behind it. Because of their restricted field of view (FoV), most cameras cannot see all of the points in the 3D environment at the same time.

Due to the wide FoV, wide-angle cameras are increasingly used in computer vision and robotics tasks such as navigation, localization, and tracking. Specifically, an extra wide-angle lens, called a fisheye lens, is used to create a broad, hemispherical, or panoramic image. Fisheye lenses employ a specific mapping to produce convex and non-rectilinear images, as opposed to images with straight lines of perspective (rectilinear images). However, the wide-angle camera violates the pinhole camera assumption, and the captured image suffers from geometric distortions.

Geometric distortion induced by wide-angle cameras can generally be classified into radial distortion and tangential distortion (de-centering distortion). Radial distortion is the primary distortion in central single-view camera systems, exhibiting circular symmetry with respect to the distortion center. This distortion results in points on the image plane being moved away from their ideal location under the perspective camera model, along the radial axis from the distortion center. Radial distortion models can be formulated as nonlinear functions of the radial distance [106]. On the other hand, tangential distortion occurs when the lens and image plane are not parallel. Tangential distortion, also known as de-centering distortion, is primarily caused by the lens assembly not being centered over and parallel to the image plane. Unlike radial distortion, tangential distortion has a
Some classical works demonstrate that the single-parameter division model (only with distortion parameter k1 in Eq. (12)) seems to be sufficient for most wide-angle cameras, and it has been widely applied in learning-based wide-angle camera calibration [27], [29], [36], [54].
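Eq. (12) itself is not reproduced in this excerpt; as a rough sketch under the commonly used single-parameter division form x_u = c + (x_d − c)/(1 + k1 r_d^2), with k1 and the distortion center as assumed example values, undistortion reduces to a per-point scaling:

```python
import numpy as np

def undistort_division(points_d, k1, center):
    """Map distorted pixel coordinates to undistorted ones with the
    single-parameter division model: x_u = c + (x_d - c) / (1 + k1 * r_d^2).
    `points_d` is an (N, 2) array; `k1` and `center` are assumed values."""
    d = points_d - center                          # offsets from the distortion center
    r2 = np.sum(d ** 2, axis=1, keepdims=True)     # squared distorted radius r_d^2
    return center + d / (1.0 + k1 * r2)

# Assumed example: a mild barrel distortion around an assumed center.
pts = np.array([[500.0, 400.0], [50.0, 60.0]])
print(undistort_division(pts, k1=-1e-7, center=np.array([320.0, 240.0])))
```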
2.4 Cross-View Camera Model

The cross-view camera model is a type of multi-view camera system used in computer vision. It involves placing two or more cameras at opposite sides of a scene to capture multiple views of the same scene.
This setup enables the creation of 3D reconstructions of the scene by triangulating corresponding points from multiple camera views. The cross-view camera model is commonly used in surveillance, robotics, and augmented reality applications, and provides a more accurate and complete representation of the scene than what can be achieved with a single camera. Alternatively, a camera with stable movement can also be regarded as a cross-view camera model.

In a cross-view camera model, the captured images can be used to calculate the fundamental matrix and homography matrix, which are essential tools for 3D reconstruction, image rectification, and camera calibration.

Fundamental Matrix. Geometric relationships between the 3D points and their projections onto the 2D plane impose constraints on the image points when two cameras capture the same 3D scene from different perspectives. This intrinsic projective geometry can be embodied by a fundamental matrix F:

F = K_2^{-T} [t]_\times R K_1^{-1}. \quad (16)

Such an equation describes the epipolar geometry, where K_1 and K_2 indicate the intrinsic parameters of the two cameras, and R and [t]_\times are the relative camera rotation and translation, respectively.

The fundamental matrix can be calculated from the correspondences of projected scene points by q^T F p = 0, in which q and p are the matching points derived from the two views. Specifically, the eight-point algorithm [110] uses 8 point correspondences and enforces the rank-2 constraint using Singular Value Decomposition (SVD), computing a matrix with the minimum Frobenius distance.
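To make the eight-point procedure concrete, below is a minimal NumPy sketch that solves q_i^T F p_i = 0 in a least-squares sense and enforces the rank-2 constraint via SVD. It omits the Hartley coordinate normalization that is usually applied for numerical stability, and the correspondences are assumed to be given.

```python
import numpy as np

def eight_point(q, p):
    """Linear eight-point estimate of F such that q_i^T F p_i = 0.
    `q` and `p` are (N, 2) arrays of matched pixel coordinates (N >= 8).
    A minimal sketch without the usual coordinate normalization."""
    u1, v1 = p[:, 0], p[:, 1]
    u2, v2 = q[:, 0], q[:, 1]
    # Each correspondence contributes one row of the homogeneous system A f = 0,
    # with f the row-major flattening of F.
    A = np.stack([u2 * u1, u2 * v1, u2,
                  v2 * u1, v2 * v1, v2,
                  u1, v1, np.ones_like(u1)], axis=1)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint: zeroing the smallest singular value of F
    # gives the closest rank-2 matrix in the Frobenius norm.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```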
Homography Matrix. Estimating a 2D homography matrix (or projective transformation) is an elemental geometric task for a pair of images that are captured from the same planar surface in a 3D scene with different perspectives. An invertible mapping from one image plane to another with eight degrees of freedom, two each for translation, rotation, scale, and lines at infinity, is known as a homography. Suppose that the homogeneous coordinates x = [u, v, 1]^T ∈ R^{3×1} and x' = [u', v', 1]^T ∈ R^{3×1} are points from the two images that indicate the same point in the 3D scene; then a non-singular 3 × 3 matrix can represent a linear transformation that maps x ⇔ x' as a planar projective transformation or homography H:

\begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} \sim \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \quad (17)

where the transformation can be simplified as x' ∼ Hx. This transformation can be rewritten as the two following equations:

u' = \frac{h_{11} u + h_{12} v + h_{13}}{h_{31} u + h_{32} v + h_{33}}, \quad v' = \frac{h_{21} u + h_{22} v + h_{23}}{h_{31} u + h_{32} v + h_{33}}. \quad (18)

Previous methods [69], [74] point out that the above conventional 3 × 3 parameterization H is not desirable for training neural networks. Concretely, it is challenging to guarantee the non-singularity of H due to the significant variance in the magnitudes of the elements of the 3 × 3 homography matrix. Moreover, the rotation, translation, scale, and shear components of the homography transformation are mixed in H. For instance, the submatrix [h_{11} h_{12}; h_{21} h_{22}] describes the homography's rotational term and the vector [h_{13}, h_{23}]^T denotes the translation transformation. Since the rotation and shear components typically have smaller magnitudes than the translation component, they will have a negligible impact on the loss function over the component elements, leading to an imbalanced training problem with a neural network. Instead, a 4-point parameterization [111] has been demonstrated to be more learning-friendly for learning-based homography estimation than the 3 × 3 parameterization. Suppose that the offsets of the image's vertices are ∆u_i = u'_i − u_i and ∆v_i = v'_i − v_i; then the 4-point parameterization H̃ can describe a homography by:

\tilde{H} = \begin{bmatrix} \Delta u_1 & \Delta v_1 \\ \Delta u_2 & \Delta v_2 \\ \Delta u_3 & \Delta v_3 \\ \Delta u_4 & \Delta v_4 \end{bmatrix}. \quad (19)

The 4-point parameterization owns eight variables, which are equivalent to the matrix formulation of the homography. It is straightforward to transform from H̃ to H using the normalized Direct Linear Transform (DLT) [112] if the displacement of the four corners is known.
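The following is a minimal NumPy sketch of the conversion from the 4-point parameterization of Eq. (19) back to the 3 × 3 matrix H via a DLT, assuming exactly four corner correspondences and omitting the coordinate-normalization step of the full normalized DLT; the corner layout and offsets are assumed example values.

```python
import numpy as np

def four_point_to_homography(corners, offsets):
    """Convert the 4-point parameterization (Eq. (19)) back to a 3x3 homography
    with a direct linear transform. `corners` is a (4, 2) array of image
    vertices [u_i, v_i]; `offsets` is a (4, 2) array of [du_i, dv_i]."""
    src = corners
    dst = corners + offsets
    A = []
    for (u, v), (up, vp) in zip(src, dst):
        # Each correspondence yields two rows of the system A h = 0,
        # obtained by clearing denominators in Eq. (18).
        A.append([u, v, 1, 0, 0, 0, -up * u, -up * v, -up])
        A.append([0, 0, 0, u, v, 1, -vp * u, -vp * v, -vp])
    A = np.asarray(A)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]   # fix the scale so that h33 = 1

# Assumed example: a 128x128 patch whose corners are shifted by known offsets.
corners = np.array([[0, 0], [127, 0], [127, 127], [0, 127]], dtype=float)
offsets = np.array([[4, -3], [-2, 5], [3, 2], [-5, -4]], dtype=float)
H = four_point_to_homography(corners, offsets)
```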
2.5 Cross-Sensor Model

Modern robots are often equipped with various sensors to provide a comprehensive understanding of the environment. These sensors capture scenes using different types of representations. For autonomous cars and robotics, cameras and Light Detection and Ranging sensors (LiDAR) are commonly used for vision tasks. The 3D LiDAR records long-range spatial data as sparse point clouds, while the camera captures texturally dense 2D color RGB images. Combining these sensors can facilitate 3D reconstruction and provide precise and robust perception for the robots, overcoming the limitations of individual sensors.

However, collision and vibration problems can occur when using different sensors in a robot or system. Additionally, the 3D point clouds cannot be effectively projected onto a 2D image without accurate extrinsic parameters, making it difficult to reliably correlate pixels in an image with depth information. Therefore, it is crucial to precisely calibrate the 2D-3D matching correspondences between pairs of temporally synchronized camera and LiDAR data.

The appropriate extrinsic calibration of the transformation (i.e., rotation and translation) between the camera and LiDAR in 6-DoF is a key condition for data fusion. To be more specific, a 3D LiDAR point cloud P_C = [X, Y, Z] ∈ R^3 can be projected onto the image plane by transforming it into the camera coordinate frame using the extrinsic matrix T between the camera and LiDAR as well as the camera intrinsics K. The inverse depth and the projected 2D coordinates can be represented as d = 1/Z and p = [u, v] ∈ R^2, respectively. Then, the camera-LiDAR model can be described by:

\begin{bmatrix} u \\ v \\ d \end{bmatrix} = \begin{bmatrix} f_x (\hat{X}/\hat{Z}) + c_x \\ f_y (\hat{Y}/\hat{Z}) + c_y \\ 1/\hat{Z} \end{bmatrix}, \quad (20)
where (f_x, f_y) and (c_x, c_y) indicate the focal lengths and the image center as listed in Eq. (4). [X̂, Ŷ, Ẑ] is the transformed point cloud P̂_C obtained using the estimated extrinsic matrix:

[\hat{X}, \hat{Y}, \hat{Z}, 1]^T = T [X, Y, Z, 1]^T. \quad (21)
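The projection of Eqs. (20)-(21) can be sketched in a few lines of NumPy; the extrinsic matrix T, the intrinsic matrix K, and the LiDAR points are assumed inputs, and the in-front-of-camera check is an added implementation detail.

```python
import numpy as np

def project_lidar_to_image(points, T, K):
    """Project LiDAR points into the image plane following Eqs. (20)-(21).
    `points` is an (N, 3) array in the LiDAR frame, `T` a 4x4 extrinsic matrix
    (LiDAR -> camera), and `K` the 3x3 camera intrinsic matrix.
    Returns pixel coordinates and inverse depth."""
    # Eq. (21): transform homogeneous LiDAR points into the camera frame.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    Xc, Yc, Zc = (T @ pts_h.T)[:3]
    # Keep only points in front of the camera.
    valid = Zc > 1e-6
    Xc, Yc, Zc = Xc[valid], Yc[valid], Zc[valid]
    # Eq. (20): pinhole projection plus inverse depth.
    u = K[0, 0] * (Xc / Zc) + K[0, 2]
    v = K[1, 1] * (Yc / Zc) + K[1, 2]
    d = 1.0 / Zc
    return np.stack([u, v], axis=1), d
```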
Most deep learning works exploit the Lie algebra to
parameterize the calibration camera-LiDAR extrinsic pa- 3.3 Robustness to noise and outliers
rameters. In particular, the output of the calibration network Another promising application of deep learning in camera
is a 1 x 6 vector ξ = (v, ω) ∈ se(3) in which v is the calibration is improving the robustness of calibration to
translation vector, and ω is the rotation vector. To recover noise and outliers in the data. This approach can help ensure
the original objectives, the rotation vector in so(3) should be accurate calibration even in challenging environments, with
transformed to its corresponding rotation matrix. Supposed low-quality data or noisy sensor readings. Conventionally,
that ω = (ω1 , ω2 , ω3 )T , an element ω ∈ so(3) can be camera calibration algorithms are sensitive to noise and out-
transformed to SO(3) using the exponential map by: liers in the data, which can lead to significant errors in the
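The projection defined by Eqs. (20)–(21) can be sketched in a few lines of NumPy. The snippet below is a minimal illustration under the pinhole model above, assuming a 4 × 4 extrinsic T and intrinsics (f_x, f_y, c_x, c_y); the function name and the filtering of points behind the camera are our additions.

```python
import numpy as np

def project_lidar_to_image(points, T, fx, fy, cx, cy):
    """Project an Nx3 LiDAR point cloud onto the image plane (Eqs. 20-21).

    points : Nx3 array of [X, Y, Z] in the LiDAR frame.
    T      : 4x4 camera-LiDAR extrinsic matrix (rotation + translation).
    Returns Nx2 pixel coordinates and the inverse depth d = 1/Z_hat,
    keeping only points that lie in front of the camera.
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # [X, Y, Z, 1]
    cam = (T @ pts_h.T).T[:, :3]                             # Eq. (21): [X_hat, Y_hat, Z_hat]
    in_front = cam[:, 2] > 1e-6                               # discard points behind the camera
    X, Y, Z = cam[in_front].T
    u = fx * (X / Z) + cx                                     # Eq. (20)
    v = fy * (Y / Z) + cy
    return np.stack([u, v], axis=1), 1.0 / Z
```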
Most deep learning works exploit the Lie algebra to parameterize the camera-LiDAR extrinsic parameters. In particular, the output of the calibration network is a 1 × 6 vector ξ = (v, ω) ∈ se(3), in which v is the translation vector and ω is the rotation vector. To recover the original objectives, the rotation vector in so(3) must be transformed into its corresponding rotation matrix. Suppose that ω = (ω_1, ω_2, ω_3)^T; an element ω ∈ so(3) can be mapped to SO(3) using the exponential map:

\[
\exp : so(3) \to SO(3); \quad \hat{\omega} \mapsto e^{\hat{\omega}}, \tag{22}
\]

where ω̂ denotes the skew-symmetric matrix formed from ω, and e^{ω̂} the matrix exponential defined by its Taylor series. The rotation matrix in SO(3) then follows from the above equation in closed form via the Rodrigues formula:

\[
R = e^{\hat{\omega}} = I + \frac{\hat{\omega}}{\lVert\omega\rVert}\sin\lVert\omega\rVert + \frac{\hat{\omega}^{2}}{\lVert\omega\rVert^{2}}\bigl(1 - \cos\lVert\omega\rVert\bigr). \tag{23}
\]

Thus, the 3D rigid-body transformation T ∈ SE(3) between the camera and LiDAR can be represented by:

\[
T = \begin{bmatrix} R & t \\ \mathbf{0} & 1 \end{bmatrix}, \quad \text{where } R \in SO(3),\; t \triangleq v \in \mathbb{R}^{3}. \tag{24}
\]
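A compact NumPy version of Eqs. (22)–(24) is sketched below. It maps a 6-vector (v, ω), such as a calibration network might output, to a 4 × 4 transform, using the Rodrigues formula for the rotation and, following Eq. (24), taking the translation directly as t ≜ v. The small-angle guard and the function names are our assumptions.

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix w_hat such that skew(w) @ x == np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_to_SE3(xi):
    """Map a 6-vector xi = (v, omega) in se(3) to a 4x4 transform in SE(3) (Eqs. 22-24).

    The first three entries are the translation v, the last three the rotation
    vector omega; the rotation is recovered with the Rodrigues formula (Eq. 23).
    """
    v, omega = np.asarray(xi[:3], float), np.asarray(xi[3:], float)
    theta = np.linalg.norm(omega)
    W = skew(omega)
    if theta < 1e-8:                              # small-angle fallback: R ~ I + w_hat
        R = np.eye(3) + W
    else:
        R = (np.eye(3)
             + (np.sin(theta) / theta) * W
             + ((1.0 - np.cos(theta)) / theta**2) * (W @ W))
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, v                    # Eq. (24): t is taken directly as v
    return T
```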
3 MORE FUTURE DIRECTIONS

3.1 Dataset
One of the main challenges of learning-based camera calibration is the difficulty of constructing datasets with high accuracy, since obtaining labeled real-world data requires laborious manual intervention. As we summarized, approximately 70% of the works rely on synthesized datasets. However, the significant differences between synthesized and real-world data cannot be ignored and lead to domain gaps in the learned models. Therefore, the construction of a standardized, large-scale calibration dataset would significantly benefit this community. Recent works have demonstrated that well-designed learning strategies, such as semi-supervised learning [62], self-supervised learning [17], [77], and unsupervised learning [74], [81], can help reduce the demand for annotations in learning-based camera calibration. These strategies also have the potential to discover additional calibration priors within the data itself.

3.2 Transfer learning
The advancements in deep learning have led to the development of transfer learning techniques, which could facilitate the transfer of knowledge learned from one camera to another. This approach can significantly speed up and streamline the calibration process, making it more efficient and cost-effective. Transfer learning can be especially useful in applications that involve multiple cameras or mobile devices. For example, in a multi-camera system, transfer learning can be used to calibrate all the cameras using the data collected from a single camera, reducing the time and effort required for calibration. Similarly, in mobile devices, transfer learning can enable faster and more accurate calibration of the camera, resulting in improved image quality and performance.

3.3 Robustness to noise and outliers
Another promising application of deep learning in camera calibration is improving the robustness of calibration to noise and outliers in the data, which can help ensure accurate calibration even in challenging environments with low-quality data or noisy sensor readings. Conventional camera calibration algorithms are sensitive to noise and outliers, which can lead to significant errors in the estimated camera parameters. With deep learning, however, it is possible to learn more robust and accurate models that better handle noise and outliers. For instance, regularization techniques can be used to impose constraints on the learned parameters, preventing overfitting and enhancing the generalization ability of the model. Moreover, outlier detection techniques can be used to identify and exclude data points that are likely to be outliers, reducing their impact on the calibration process. This can be achieved with various statistical and machine-learning methods, such as clustering, classification, and regression.
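As a hedged illustration of this idea, and not a method from any surveyed paper, the short sketch below scores a candidate calibration by discarding gross outliers and applying a Huber penalty to the remaining reprojection errors; the threshold values and function names are arbitrary choices.

```python
import numpy as np

def huber(residuals, delta=1.0):
    """Huber penalty: quadratic for small residuals, linear for outliers."""
    r = np.abs(residuals)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def robust_calibration_score(reproj_errors, delta=1.0, outlier_thresh=10.0):
    """Score a candidate calibration: drop gross outliers, Huber-weight the rest."""
    errors = np.asarray(reproj_errors, float)
    inliers = errors[errors < outlier_thresh]     # simple outlier rejection
    return huber(inliers, delta).mean() if len(inliers) else np.inf
```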
3.4 Online calibration
With the rapid development of deep learning, online camera calibration is becoming more efficient and practical. This technique updates the calibration parameters in real time, allowing for better performance as the camera moves or the environment changes, and it can be achieved with deep learning algorithms that learn the complex relationships between the camera parameters and the image data. Learning-based camera calibration has the potential to revolutionize various industries, such as robotics and augmented reality. In robotics, online calibration can improve the accuracy of robot vision, which is crucial for tasks such as object detection and manipulation. Similarly, in augmented reality, online calibration can enhance the user experience by ensuring that virtual objects are correctly aligned with the real world, helping to create more realistic and immersive AR applications with practical uses in fields such as entertainment, education, and training.

3.5 Multimodal calibration
The potential of deep learning techniques in camera calibration goes beyond traditional photography and computer vision applications. They could also be applied to calibrate cameras against other sensing modalities, such as remote sensing, infrared sensors, or radar. This could lead to more precise and robust perception in various applications, including but not limited to autonomous driving, where multiple sensors are used. Incorporating deep learning-based calibration methods with multiple sensors could improve the accuracy of fusing data from different sources. It could also facilitate more accurate perception in challenging environments such as low-light conditions, occlusions, and adverse weather. Furthermore, the ability to calibrate multiple sensors with deep learning methods could provide more reliable and consistent results than traditional calibration techniques.

These are a few potential directions for future research in camera calibration with deep learning. As the field continues to evolve, there may be many other exciting avenues for exploration and innovation, and it will be fascinating to see how this technology continues to impact various industries.
REFERENCES

[1] S. Workman, C. Greenwell, M. Zhai, R. Baltenberger, and N. Jacobs, "Deepfocal: A method for direct focal length estimation," in 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 1369–1373.
[2] A. Kendall, M. Grimes, and R. Cipolla, "Posenet: A convolutional network for real-time 6-dof camera relocalization," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
[3] S. Workman, M. Zhai, and N. Jacobs, "Horizon lines in the wild," arXiv preprint arXiv:1604.02129, 2016.
[4] M. Zhai, S. Workman, and N. Jacobs, "Detecting vanishing points using global image context in a non-manhattan world," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[5] C.-K. Chang, J. Zhao, and L. Itti, "Deepvp: Deep learning for vanishing point detection on 1 million street view images," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 4496–4503.
[6] W. Xian, Z. Li, M. Fisher, J. Eisenmann, E. Shechtman, and N. Snavely, "Uprightnet: Geometry-aware camera orientation estimation from single images," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[7] J. Lee, M. Sung, H. Lee, and J. Kim, "Neural geometric parser for single image camera calibration," in European Conference on Computer Vision. Springer, 2020, pp. 541–557.
[8] Y. Zhou, H. Qi, J. Huang, and Y. Ma, "Neurvps: Neural vanishing point scanning via conic convolution," Advances in Neural Information Processing Systems, vol. 32, 2019.
[9] R. Jung, A. S. J. Lee, A. Ashtari, and J.-C. Bazin, "Deep360up: A deep learning-based approach for automatic vr image upright adjustment," in 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), 2019, pp. 1–8.
[10] B. Davidson, M. S. Alvi, and J. F. Henriques, "360° camera alignment via segmentation," in European Conference on Computer Vision. Springer, 2020, pp. 579–595.
[11] Y.-Y. Jau, R. Zhu, H. Su, and M. Chandraker, "Deep keypoint-based camera pose estimation with geometric constraints," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 4950–4957.
[12] M. Baradad and A. Torralba, "Height and uprightness invariance for 3d prediction from a single view," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[13] Q. Zheng, J. Chen, Z. Lu, B. Shi, X. Jiang, K.-H. Yap, L.-Y. Duan, and A. C. Kot, "What does plate glass reveal about camera calibration?" in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[14] R. Zhu, X. Yang, Y. Hold-Geoffroy, F. Perazzi, J. Eisenmann, K. Sunkavalli, and M. Chandraker, "Single view metrology in the wild," in European Conference on Computer Vision. Springer, 2020, pp. 316–333.
[15] Y. Gil, S. Elmalem, H. Haim, E. Marom, and R. Giryes, "Online training of stereo self-calibration using monocular depth estimation," IEEE Transactions on Computational Imaging, vol. 7, pp. 812–823, 2021.
[16] S. Garg, D. P. Mohanty, S. P. Thota, and S. Moharana, "A simple approach to image tilt correction with self-attention mobilenet for smartphones," arXiv preprint arXiv:2111.00398, 2021.
[17] J. Fang, I. Vasiljevic, V. Guizilini, R. Ambrus, G. Shakhnarovich, A. Gaidon, and M. R. Walter, "Self-supervised camera self-calibration from video," arXiv preprint arXiv:2112.03325, 2021.
[18] T. H. Butt and M. Taj, "Camera calibration through camera projection loss," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 2649–2653.
[19] G. Ponimatkin, Y. Labbé, B. Russell, M. Aubry, and J. Sivic, "Focal length and object pose estimation via render and compare," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3825–3834.
[20] K. Chen, N. Snavely, and A. Makadia, "Wide-baseline relative camera pose estimation with directional learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3258–3268.
[21] Y. Lin, R. Wiersma, S. L. Pintea, K. Hildebrandt, E. Eisemann, and J. C. van Gemert, "Deep vanishing point detection: Geometric priors make dataset variations vanish," arXiv preprint arXiv:2203.08586, 2022.
[22] T. Do, O. Miksik, J. DeGol, H. S. Park, and S. N. Sinha, "Learning to detect scene landmarks for camera localization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11132–11142.
[23] J. Lee, H. Go, H. Lee, S. Cho, M. Sung, and J. Kim, "Ctrl-c: Camera calibration transformer with line-classification," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 16228–16237.
[24] M. Kocabas, C.-H. P. Huang, J. Tesch, L. Müller, O. Hilliges, and M. J. Black, "Spec: Seeing people in the wild with an estimated camera," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 11035–11045.
[25] C. M. Parameshwara, G. Hari, C. Fermüller, N. J. Sanket, and Y. Aloimonos, "Diffposenet: Direct differentiable camera pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6845–6854.
[26] L. Yang, R. Shrestha, W. Li, S. Liu, G. Zhang, Z. Cui, and P. Tan, "Scenesqueezer: Learning to compress scene for camera relocalization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8259–8268.
[27] J. Rong, S. Huang, Z. Shang, and X. Ying, "Radial lens distortion correction using convolutional neural networks trained with synthesized images," in Asian Conference on Computer Vision. Springer, 2016, pp. 35–49.
[28] Y. Hold-Geoffroy, K. Sunkavalli, J. Eisenmann, M. Fisher, E. Gambaretto, S. Hadap, and J.-F. Lalonde, "A perceptual measure for deep single image camera calibration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[29] O. Bogdan, V. Eckstein, F. Rameau, and J.-C. Bazin, "Deepcalib: a deep learning approach for automatic intrinsic calibration of wide field-of-view cameras," in Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, 2018, pp. 1–10.
[30] V. Rengarajan, Y. Balaji, and A. Rajagopalan, "Unrolling the shutter: Cnn to correct motion distortions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2291–2299.
[31] X. Yin, X. Wang, J. Yu, M. Zhang, P. Fua, and D. Tao, "Fisheyerecnet: A multi-context collaborative deep network for fisheye image rectification," in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[32] Y. Shi, D. Zhang, J. Wen, X. Tong, X. Ying, and H. Zha, "Radial lens distortion correction by adding a weight layer with inverted foveal models to convolutional neural networks," in 2018 24th International Conference on Pattern Recognition (ICPR), 2018, pp. 1–6.
[33] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, "Dr-gan: Automatic radial distortion rectification using conditional gan in real-time," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 725–733, 2020.
[34] ——, "Distortion rectification from static to dynamic: A distortion sequence construction perspective," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 3870–3882, 2020.
[35] S. Yang, C. Lin, K. Liao, Y. Zhao, and M. Liu, "Unsupervised fisheye image correction through bidirectional loss with geometric prior," Journal of Visual Communication and Image Representation, vol. 66, p. 102692, 2020.
[36] X. Li, B. Zhang, P. V. Sander, and J. Liao, "Blind geometric distortion correction on images through deep learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[37] B. Zhuang, Q.-H. Tran, P. Ji, L.-F. Cheong, and M. Chandraker, "Learning structure-and-motion-aware rolling shutter correction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[38] Z. Xue, N. Xue, G.-S. Xia, and W. Shen, "Learning to calibrate straight lines for fisheye image rectification," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[39] B. Zhuang, Q.-H. Tran, G. H. Lee, L. F. Cheong, and M. Chandraker, "Degeneracy in self-calibration revisited and a deep learning solution for uncalibrated slam," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 3766–3773.
[40] M. Lopez, R. Mari, P. Gargallo, Y. Kuang, J. Gonzalez-Jimenez, and G. Haro, "Deep single image camera calibration with radial distortion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[41] Y. Zhao, Z. Huang, T. Li, W. Chen, C. LeGendre, X. Ren, A. Shapiro, and H. Li, "Learning perspective undistortion of portraits," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[42] K. Liao, C. Lin, Y. Zhao, and M. Xu, "Model-free distortion rectification framework bridged by distortion distribution map," IEEE Transactions on Image Processing, vol. 29, pp. 3707–3718, 2020.
[43] A. Cramariuc, A. Petrov, R. Suri, M. Mittal, R. Siegwart, and C. Cadena, "Learning camera miscalibration detection," in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 4997–5003.
[44] C. Zhang, F. Rameau, J. Kim, D. M. Argaw, J.-C. Bazin, and I. S. Kweon, "Deepptz: Deep self-calibration for ptz cameras," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.
[45] C.-H. Chao, P.-L. Hsu, H.-Y. Lee, and Y.-C. F. Wang, "Self-supervised deep learning for fisheye image rectification," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 2248–2252.
[46] Y. Shi, X. Tong, J. Wen, H. Zhao, X. Ying, and H. Zha, "Position-aware and symmetry enhanced gan for radial distortion correction," in 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 1701–1708.
[47] H. Zhao, Y. Shi, X. Tong, X. Ying, and H. Zha, "A simple yet effective pipeline for radial distortion correction," in 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 878–882.
[48] Y.-H. Li, I.-C. Lo, and H. H. Chen, "Deep face rectification for 360° dual-fisheye cameras," IEEE Transactions on Image Processing, vol. 30, pp. 264–276, 2021.
[49] H. Zhao, X. Ying, Y. Shi, X. Tong, J. Wen, and H. Zha, "Rdcface: Radial distortion correction for face recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[50] Z.-C. Xue, N. Xue, and G.-S. Xia, "Fisheye distortion rectification from deep straight lines," arXiv preprint arXiv:2003.11386, 2020.
[51] P. Liu, Z. Cui, V. Larsson, and M. Pollefeys, "Deep shutter unrolling network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5941–5949.
[52] K. Liao, C. Lin, and Y. Zhao, "A deep ordinal distortion estimation approach for distortion rectification," IEEE Transactions on Image Processing, vol. 30, pp. 3362–3375, 2021.
[53] K. Zhao, C. Lin, K. Liao, S. Yang, and Y. Zhao, "Revisiting radial distortion rectification in polar-coordinates: A new and efficient learning perspective," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2021.
[54] J. Zhao, S. Wei, L. Liao, and Y. Zhao, "Dqn-based gradual fisheye image rectification," Pattern Recognition Letters, vol. 152, pp. 129–134, 2021.
[55] J. Tan, S. Zhao, P. Xiong, J. Liu, H. Fan, and S. Liu, "Practical wide-angle portraits correction with deep structured models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 3498–3506.
[56] S. Yang, C. Lin, K. Liao, C. Zhang, and Y. Zhao, "Progressively complementary network for fisheye image rectification using appearance flow," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 6348–6357.
[57] J. Fan, J. Zhang, and D. Tao, "Sir: Self-supervised image rectification via seeing the same scene from multiple different lenses," IEEE Transactions on Image Processing, 2022.
[58] K. Liao, C. Lin, L. Liao, Y. Zhao, and W. Lin, "Multi-level curriculum for training a distortion-aware barrel distortion rectification model," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 4389–4398.
[59] N. Wakai and T. Yamashita, "Deep single fisheye image camera calibration for over 180-degree projection of field of view," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, October 2021, pp. 1174–1183.
[60] N. Wakai, S. Sato, Y. Ishii, and T. Yamashita, "Rethinking generic camera models for deep single image camera calibration to recover rotation and fisheye distortion," in Proceedings of European Conference on Computer Vision (ECCV), vol. 13678, 2022, pp. 679–698.
[61] Z. Zhong, Y. Zheng, and I. Sato, "Towards rolling shutter correction and deblurring in dynamic scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9219–9228.
[62] F. Zhu, S. Zhao, P. Wang, H. Wang, H. Yan, and S. Liu, "Semi-supervised wide-angle portraits correction by multi-scale transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19689–19698.
[63] M. Cao, Z. Zhong, J. Wang, Y. Zheng, and Y. Yang, "Learning adaptive warping for real-world rolling shutter correction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17785–17793.
[64] X. Zhou, P. Duan, Y. Ma, and B. Shi, "Evunroll: Neuromorphic events based rolling shutter image correction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17775–17784.
[65] B. Fan and Y. Dai, "Inverting a rolling shutter camera: bring rolling shutter images to high framerate global shutter video," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4228–4237.
[66] B. Fan, Y. Dai, and M. He, "Sunet: symmetric undistortion network for rolling shutter correction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4541–4550.
[67] Y. Zhang, X. Zhao, and D. Qian, "Learning-based framework for camera calibration with distortion correction and high precision feature detection," arXiv preprint arXiv:2202.00158, 2022.
[68] Y. Shangrong, L. Chunyu, L. Kang, and Z. Yao, "Fishformer: Annulus slicing-based transformer for fisheye rectification with efficacy domain exploration," arXiv preprint arXiv:2207.01925, 2022.
[69] D. DeTone, T. Malisiewicz, and A. Rabinovich, "Deep image homography estimation," arXiv preprint arXiv:1606.03798, 2016.
[70] C.-H. Chang, C.-N. Chou, and E. Y. Chang, "Clkn: Cascaded lucas-kanade networks for image alignment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[71] F. Erlik Nowruzi, R. Laganiere, and N. Japkowicz, "Homography estimation from image pairs with hierarchical convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, October 2017.
[72] R. Ranftl and V. Koltun, "Deep fundamental matrix estimation," in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[73] O. Poursaeed, G. Yang, A. Prakash, Q. Fang, H. Jiang, B. Hariharan, and S. Belongie, "Deep fundamental matrix estimation without correspondences," in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.
[74] T. Nguyen, S. W. Chen, S. S. Shivakumar, C. J. Taylor, and V. Kumar, "Unsupervised deep homography: A fast and robust homography estimation model," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2346–2353, 2018.
[75] R. Zeng, S. Denman, S. Sridharan, and C. Fookes, "Rethinking planar homography estimation using perspective fields," in Asian Conference on Computer Vision. Springer, 2018, pp. 571–586.
[76] Y. Li, W. Pei, and Z. He, "Srhen: stepwise-refining homography estimation network via parsing geometric correspondences in deep latent space," in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 3063–3071.
[77] X. Wang, C. Wang, B. Liu, X. Zhou, L. Zhang, J. Zheng, and X. Bai, "Multi-view stereo in the deep learning era: A comprehensive review," Displays, vol. 70, p. 102102, 2021.
[78] S. Ammar Abbas and A. Zisserman, "A geometric approach to obtain a bird's eye view from an image," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, October 2019.
[79] L. Sha, J. Hobbs, P. Felsen, X. Wei, P. Lucey, and S. Ganguly, "End-to-end camera calibration for broadcast videos," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[80] H. Le, F. Liu, S. Zhang, and A. Agarwala, "Deep homography estimation for dynamic scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[81] J. Zhang, C. Wang, S. Liu, L. Jia, N. Ye, J. Wang, J. Zhou, and J. Sun, "Content-aware unsupervised deep homography estimation," in European Conference on Computer Vision. Springer, 2020, pp. 653–669.
[82] Y. Zhao, X. Huang, and Z. Zhang, "Deep lucas-kanade homography for multimodal image alignment," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 15950–15959.
[83] R. Shao, G. Wu, Y. Zhou, Y. Fu, L. Fang, and Y. Liu, "Localtrans: A multiscale local transformer network for cross-resolution homography estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 14890–14899.
[84] N. Ye, C. Wang, H. Fan, and S. Liu, "Motion basis learning for unsupervised deep homography estimation with subspace projection," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 13117–13125.
[85] Y. Chen, G. Wang, P. An, Z. You, and X. Huang, "Fast and accurate homography estimation using extendable compression network," in 2021 IEEE International Conference on Image Processing (ICIP), 2021, pp. 1024–1028.
[86] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, "Depth-aware multi-grid deep homography estimation with contextual correlation," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2021.
[87] S.-Y. Cao, J. Hu, Z. Sheng, and H.-L. Shen, "Iterative deep homography estimation," arXiv preprint arXiv:2203.15982, 2022.
[88] M. Hong, Y. Lu, N. Ye, C. Lin, Q. Zhao, and S. Liu, "Unsupervised homography estimation with coplanarity-aware gan," arXiv preprint arXiv:2205.03821, 2022.
[89] S. Liu, N. Ye, C. Wang, K. Luo, J. Wang, and J. Sun, "Content-aware unsupervised deep homography estimation and beyond," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2022.
[90] N. Schneider, F. Piewak, C. Stiller, and U. Franke, "Regnet: Multimodal sensor registration using deep neural networks," in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 1803–1810.
[91] G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna, "Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1110–1117.
[92] K. Yuan, Z. Guo, and Z. J. Wang, "Rggnet: Tolerance aware lidar-camera online calibration with geometric deep learning and generative model," IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6956–6963, 2020.
[93] Y. Zhu, C. Li, and Y. Zhang, "Online camera-lidar calibration with sensor semantic information," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 4970–4976.
[94] W. Wang, S. Nobuhara, R. Nakamura, and K. Sakurada, "Soic: Semantic online initialization and calibration for lidar and camera," arXiv preprint arXiv:2003.04260, 2020.
[95] J. Shi, Z. Zhu, J. Zhang, R. Liu, Z. Wang, S. Chen, and H. Liu, "Calibrcnn: Calibrating camera and lidar by recurrent convolutional neural network and geometric constraints," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10197–10202.
[96] S. Wu, A. Hadachi, D. Vivet, and Y. Prabhakar, "Netcalib: A novel approach for lidar-camera auto-calibration based on deep learning," in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 6648–6655.
[97] X. Lv, B. Wang, Z. Dou, D. Ye, and S. Wang, "Lccnet: Lidar and camera self-calibration using cost volume network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2894–2901.
[98] X. Lv, S. Wang, and D. Ye, "Cfnet: Lidar-camera registration using calibration flow network," Sensors, vol. 21, no. 23, p. 8112, 2021.
[99] Z. Liu, H. Tang, S. Zhu, and S. Han, "Semalign: Annotation-free camera-lidar calibration with semantic alignment loss," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 8845–8851.
[100] X. Jing, X. Ding, R. Xiong, H. Deng, and Y. Wang, "Dxq-net: Differentiable lidar-camera extrinsic calibration using quality-aware flow," arXiv preprint arXiv:2203.09385, 2022.
[101] K. Akio, Z. Yiyang, Z. Pengwei, Z. Wei, and T. Masayoshi, "Sst-calib: Simultaneous spatial-temporal parameter calibration between lidar and camera," arXiv preprint arXiv:2207.03704, 2022.
[102] Y. Sun, J. Li, Y. Wang, X. Xu, X. Yang, and Z. Sun, "Atop: An attention-to-optimization approach for automatic lidar-camera calibration via cross-modal object matching," IEEE Transactions on Intelligent Vehicles, 2022.
[103] G. Wang, J. Qiu, Y. Guo, and H. Wang, "Fusionnet: Coarse-to-fine extrinsic calibration network of lidar and camera with hierarchical point-pixel fusion," in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 8964–8970.
[104] C. Ye, H. Pan, and H. Gao, "Keypoint-based lidar-camera online calibration with robust geometric network," IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–11, 2021.
[105] S. T. Barnard, "Interpreting perspective images," Artificial Intelligence, vol. 21, no. 4, pp. 435–462, 1983.
[106] J. Fan, J. Zhang, S. J. Maybank, and D. Tao, "Wide-angle image rectification: a survey," International Journal of Computer Vision, vol. 130, no. 3, pp. 747–776, 2022.
[107] A. Wang, T. Qiu, and L. Shao, "A simple method of radial distortion correction with centre of distortion estimation," Journal of Mathematical Imaging and Vision, vol. 35, no. 3, pp. 165–172, 2009.
[108] A. W. Fitzgibbon, "Simultaneous linear estimation of multiple view geometry and lens distortion," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1. IEEE, 2001, pp. I–I.
[109] O. Ait-Aider, N. Andreff, J. M. Lavest, and P. Martinet, "Simultaneous object pose and velocity computation using a single view from a rolling shutter camera," in European Conference on Computer Vision. Springer, 2006, pp. 56–68.
[110] H. C. Longuet-Higgins, "A computer algorithm for reconstructing a scene from two projections," Nature, vol. 293, no. 5828, pp. 133–135, 1981.
[111] S. Baker, A. Datta, and T. Kanade, "Parameterizing homographies," Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-06-11, 2006.
[112] B. K. Horn, "The direct linear transformation from comparator coordinates into object-space coordinates in close-range photogrammetry," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 42, no. 3, pp. 125–133, 1987.