Scene-aware Learning Network for Radar Object Detection
Zangwei Zheng
Nanjing University
Xiangyu Yue, Kurt Keutzer, Alberto Sangiovanni Vincentelli
UC Berkeley
ABSTRACT
Object detection is essential to safe autonomous or assisted driving.
Previous works usually utilize RGB images or LiDAR point clouds
to identify and localize multiple objects in self-driving. However,
cameras tend to fail in bad driving conditions, e.g. bad weather or
weak lighting, while LiDAR scanners are too expensive to get widely
deployed in commercial applications. Radar has been drawing more
and more attention due to its robustness and low cost. In this paper,
we propose a scene-aware radar learning framework for accurate
and robust object detection. First, the learning framework contains
branches conditioned on the scene category of the radar sequence, with each branch optimized for a specific type of scene. Second,
three different 3D autoencoder-based architectures are proposed
for radar object detection and ensemble learning is performed over
the different architectures to further boost the final performance.
Third, we propose novel scene-aware sequence mix augmentation
(SceneMix) and scene-specific post-processing to generate more
robust detection results. In the ROD2021 Challenge, we achieved a final average precision of 75.0% and an average recall
of 81.0%. Moreover, in the parking lot scene, our framework ranks
first with an average precision of 97.8% and an average recall of
98.6%, which demonstrates the effectiveness of our framework.
CCS CONCEPTS
· Computing methodologies → Object detection; Scene understanding; Neural networks.
KEYWORDS
Auto-driving; Radar Frequency Data; Object Detection; Neural Network; Data Augmentation
ACM Reference Format:
Zangwei Zheng, Xiangyu Yue, Kurt Keutzer, and Alberto Sangiovanni Vincentelli. 2021. Scene-aware Learning Network for Radar Object Detection.
In Proceedings of the 2021 International Conference on Multimedia Retrieval
(ICMR '21), August 21–24, 2021, Taipei, Taiwan. ACM, New York, NY, USA,
7 pages. https://doi.org/10.1145/3460426.3463655
1 INTRODUCTION
Accurate object detection is a fundamental necessity for autonomous or assisted driving. Many previous works [8, 22, 25, 27, 35] have achieved good performance based on visual images or videos captured by RGB cameras. However, camera-based methods can easily fail in bad driving conditions, such as foggy weather, dark nights, and strong lighting. Compared with visible light, LiDAR can provide direct and robust distance measurements of the surrounding environment [29, 31], but LiDAR scanners are so expensive that many autonomous car manufacturers prefer not to use them. Similar to LiDAR, millimeter-wave radar can function reliably and measure range accurately; and similar to an RGB camera, radar sensors are fairly competitive in terms of manufacturing cost. Therefore, object detection based on a frequency modulated continuous wave (FMCW) radar has been considered a more robust and practical choice.

Compared with visual images, radar frequency (RF) data is much harder to annotate. For better representation, RF data are usually transformed into range-azimuth frequency heatmaps (RAMaps), whose horizontal and vertical dimensions denote angle and distance (a bird's-eye view), respectively. Recently, [30] proposed a pipeline for radar object detection: a cross-modal supervision framework that generates labels for RF data without laborious and inconsistent human labeling, which enables neural network training on a large amount of consistently annotated RF data. High-performance detection models are used to label the RGB images, and the detected object positions are transformed into points on RAMaps. To train the models, object annotations are converted into object confidence distribution maps (ConfMaps). During the test phase, the output ConfMaps are processed to generate the detection results. To evaluate the final results, [30] defines an average precision metric similar to the one used in traditional object detection. Our framework follows a similar annotation generation process.

To perform detection on the RF data, [30] directly applies 3D versions of previous models [17, 24] for object detection without considering the inherent properties of radar sequences. For example, more attention should be paid to the velocity information that can be retrieved from the RF data. The unique properties of RF data can provide more understanding of the semantic meaning of an object.

In this paper, we propose a branched scene-aware learning framework for radar object detection. Specifically, the framework consists of two parts: a scene classifier and a radar object detector. We find that RF data in different driving scenes exhibit significant differences. Therefore, we partition all radar sequences into different sets based on the driving scene. The scene classifier predicts the scene category for each input radar sequence, e.g. static or moving background. The object detection branches are trained in two stages. In the first stage, a Scene-aware Learning Network (SLNet) is trained on all the RF sequences to learn a universal, well-behaved object detector. In the second stage, for each type of scene, a scene-specific radar object detector is fine-tuned on top of the universal model with the corresponding radar sequences. As a result, the fine-tuned models are able to learn more scene-specific features for better performance.
Figure 1: An overview of the scene-aware learning framework in the test phase. Radar sequences are first classified into different scenes by a 3D classifier. The scene-based switch then passes each radar snippet to the corresponding SLNet branch (orange: Dynamic, blue: Static). The SLNet trained on the corresponding radar sequences predicts the ConfMaps of objects, and detection results are produced by scene-specific post-processing.
Based on well-performing neural network architectures in video recognition, e.g. Conv(2+1)D [26] and ResNet [11], we build different variants of SLNet. For better model generalization, we design and apply a scene-aware augmentation, SceneMix, to the RF data during training. SceneMix creates a new training radar snippet by inserting a piece of one radar snippet into another snippet of the same scene category. More specifically, two snippets of the same scene can be mixed up, cropped and replaced, or combined by adding the noise extracted from one snippet to the other. To make our results more robust, we further design a new post-processing scheme for the detection results and vary it across scenes.

We train and evaluate our network in the ROD2021 Challenge. The ROD2021 dataset in this challenge contains 40 sequences for training and 10 sequences for testing. We define two scenes, Static and Dynamic, depending on whether the car carrying the radar sensor is moving or not. In this challenge, our SLNet achieves about 75.0% average precision (AP) and 81.0% average recall (AR). Moreover, we achieve 97.8% AP and 98.6% AR in the Parking Lot category, ranking first in the challenge.
In summary, our main contributions are as follows:
• We propose a novel scene-aware learning framework for radar object detection based on the type of driving scene.
• We leverage the spatio-temporal convolutional block R(2+1)D to build the Scene-aware Learning Network (SLNet) for accurate radar object detection.
• We customize several image-processing methods for radar. Specifically, we propose novel augmentation, post-processing, and ensembling schemes for the new data modality.
Figure 2: An overview of the scene-aware learning framework in the training stage. Colors represent the different scenes (orange: Dynamic, blue: Static).
2 RELATED WORK

2.1 Object Detection for Images and Videos
Convolutional neural networks have achieved remarkable performance in various computer vision tasks, including object detection for visual images and videos. Most state-of-the-art image-based object detection methods can be classified into two categories: multiple-stage pipelines and single-stage pipelines. The classic multiple-stage model is R-CNN [8]. In this setting, regions of interest (RoIs) are first generated by neural networks. Then, in each RoI, detection results are obtained from the corresponding features. Variants of this model [7, 10, 23] further improve the speed and accuracy of R-CNN. Single-stage pipelines, on the other hand, directly predict the results with a single convolutional network, e.g. YOLO [22].

To capture the relationship between frames in videos, many methods have been proposed. [28, 35] introduce flow-based methods that combine flow information with features extracted on a single frame to obtain the prediction. Wang et al. [27] propose memory and self-attention to extract information along the temporal dimension. A more direct way to learn spatiotemporal features is proposed in [25], which builds 3D convolutional networks to extract features from video clips.
2.2 Radar Object Detection
To overcome the poor quality of camera data in severe weather or under bad lighting, some prior works [4, 6, 14, 16, 18–20] exploit the data from a Frequency Modulated Continuous Wave (FMCW) radar to detect objects more robustly. Since it is hard to annotate radar data, as humans have little knowledge of what an object looks like in radar images, previous radar object detection
works can be classified into two categories according to whether visual data is required during learning.

The first type of radar object detection fuses radar and vision images to obtain more robust detection results. Nabati et al. [16] fuse the data collected from radars with vision data to obtain faster and more accurate detections. Meanwhile, Nobis et al. [18] extract and combine features of visual images and sparse radar data in the network encoding layers to improve 2D object detection results.

The second type detects objects based on radar data only. To effectively and efficiently collect radar object annotations, some annotations are automatically generated by high-accuracy object detection algorithms running on the data from a calibrated camera or LiDAR sensor [15, 30]. Wang et al. [30] propose a cross-modal supervision pipeline to annotate radar sequences with less human labor and represent the radar frequency data in range-azimuth coordinates (RAMaps). This pipeline facilitates the development of radar object detection algorithms.

2.3 CNN for Radar Processing
In the processing of radar data, a series of research works [1, 3, 13, 21, 30] explores convolutional neural networks to extract features from radar data. To obtain good feature representations, Capobianco et al. [3] apply a convolutional neural network to the range-Doppler signature, while Angelov et al. [1] try out various network structures, including residual networks and combinations of convolutional and recurrent networks, to classify radar objects. To prevent overfitting, Kwon et al. [13] add Gaussian noise to the input radar data. Since radar data are usually represented in the format of complex numbers, [5] proposed to utilize a complex-valued CNN to enhance radar recognition.

To better extract spatiotemporal information for radar object detection, prior works utilize 3D convolution on radar data. Hazra et al. [9] propose a 3D CNN architecture to learn an embedding model with a distance-based triplet-loss similarity metric. In [30], three encoder-decoder-based convolutional network structures are proposed for radar object detection, where the encoder consists of a series of 3D convolution layers and the decoder is composed of several transposed convolution layers.

3 APPROACH
Following [30], we formulate radar object detection as follows: given a training radar sequence in the format of RAMaps $R_{train}$ and its annotation (points with semantic class labels) $y_{train}$, we are required to detect radar objects on the testing sequences $R_{test}$. A ConfMap $C$ is generated from the annotation $y$ for neural network supervision by placing Gaussian distributions around each object location. With a sliding window of length $\tau$, the network is fed a snippet of $R_{train}$ with dimension $(C_{RF}, \tau, w, h)$ and predicts a ConfMap $\hat{C}$ with dimension $(C_{cls}, \tau, w, h)$, where $C_{RF}$ is the number of channels in RAMaps, consisting of the real and imaginary parts [34], and $C_{cls}$ is the number of object classes. In the test phase, the output ConfMaps are processed into point detection results $\hat{y}$, and the model is evaluated by comparing $\hat{y}$ with $y_{test}$.
In this section, we first introduce the scene-aware learning framework for radar object detection. Then the components of the framework, namely the architecture of the Scene-aware Learning Network (SLNet), SceneMix, and the post-processing, are described in detail.
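To make the supervision described above concrete, the following is a minimal sketch of how point annotations could be rasterized into one ConfMap frame with per-class Gaussians. The class list, Gaussian width, and 128×128 map size here are illustrative assumptions rather than the exact challenge settings (in [30] the Gaussian parameters may further depend on the object class and distance).

```python
import numpy as np

CLASSES = ["pedestrian", "cyclist", "car"]  # assumed class list for illustration

def confmap_from_points(points, width=128, height=128, sigma=4.0):
    """Rasterize point annotations into a (C_cls, h, w) ConfMap frame.

    points: list of (class_name, range_idx, azimuth_idx) tuples, i.e. the
    annotated object locations on the RAMap grid.
    """
    conf = np.zeros((len(CLASSES), height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for cls_name, r, a in points:
        c = CLASSES.index(cls_name)
        g = np.exp(-((ys - r) ** 2 + (xs - a) ** 2) / (2.0 * sigma ** 2))
        conf[c] = np.maximum(conf[c], g)  # keep the strongest response per pixel
    return conf

# Example: one car and one pedestrian on a 128x128 RAMap frame.
frame_conf = confmap_from_points([("car", 40, 64), ("pedestrian", 90, 30)])
print(frame_conf.shape)  # (3, 128, 128)
```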
Figure 3: The architectures of our three SLNet models: (a) SLNet-C21D, (b) SLNet-R18D, and (c) SLNet-R18UC, together with (d) the Conv(2+1)D bottleneck block.
3.1 Scene-aware Learning Framework
Radar frequency data under different driving scenes differ greatly.
One reason for this is the inherent velocity information in radar
signals. Therefore, whether the ego car is moving or not will lead to
a great difference in the signals of objects and noise. For example,
the relative velocity of a car on the highway may be zero, while
it can be very high if the ego car is static. Another reason is that objects appear with different likelihoods in different scenes. Thus, we design
a scene-aware learning framework to tackle this problem.
Specifically, we divide all RF data sequences into two scene
categories: Dynamic and Static, depending on whether the ego
car is moving or not. We adopt a two-stage training approach for
the SLNet. In the first phase, all radar snippets are used to train a
universal SLNet (described in Section 3.2). In the second phase, we
create two branches, each responsible for the radar object detection
in one scene. In each branch, we fine-tune the SLNet based on the
universal model obtained in the first phase with radar snippets
of the corresponding scene. The whole framework is shown in
Figure 2.
We also train a scene classifier to classify the input radar snippets into one of these two scenes. With the classifier, the overall
framework for scene-aware learning is shown in Figure 1. During
the test phase, the scene classifier will first classify a test radar
snippet into one of the two scene categories. Based on the predicted scene category, the test radar snippet is then fed into the corresponding SLNet branch to generate the ConfMaps of radar objects. Finally, scene-specific post-processing (described in Section 3.4) is applied to the output ConfMaps to obtain the final detection results.
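Figure 1's test-time flow reduces to a small piece of glue code. The sketch below is a minimal illustration under assumed interfaces: `scene_classifier`, the per-scene `detectors`, and `post_process` are placeholders for the trained 3D classifier, the fine-tuned SLNet branches, and the scene-specific post-processing of Section 3.4.

```python
from typing import Callable, Dict

import torch

def detect_snippet(snippet: torch.Tensor,
                   scene_classifier: torch.nn.Module,
                   detectors: Dict[str, torch.nn.Module],
                   post_process: Callable):
    """Route one radar snippet (C_RF, T, H, W) through the scene-aware framework."""
    with torch.no_grad():
        # 1) Predict the scene category (0: Static, 1: Dynamic in this sketch).
        scene_logits = scene_classifier(snippet.unsqueeze(0))
        scene = "dynamic" if scene_logits.argmax(dim=1).item() == 1 else "static"

        # 2) Run the SLNet branch fine-tuned on that scene to get ConfMaps.
        confmaps = detectors[scene](snippet.unsqueeze(0))  # (1, C_cls, T, H, W)

    # 3) Apply scene-specific post-processing to obtain point detections.
    return post_process(confmaps.squeeze(0), scene=scene)
```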
Figure 4: Example results of the SceneMix augmentation. The two frames on the left are from static scenes; the three frames on the right are the results of VideoMix, VideoCropMix, and NoiseMix, respectively.
3.2 Network Architecture
We build three different network architectures for the ROD2021 Challenge. The architectures, shown in Figure 3, are the (2+1)D Convolution-Deconvolution network (SLNet-C21D), the ResNet(2+1)D-18 Deconvolution network (SLNet-R18D), and the ResNet(2+1)D-18 Upsample-Convolution network (SLNet-R18UC).

SLNet-C21D is adapted from RODNet-CDC [30], but we replace the 3D convolutions in RODNet-CDC with (2+1)D convolutions [26] and add shortcut connections following [17]. Specifically, a (2+1)D convolutional block splits a 3D convolution into a spatial 2D convolution followed by a temporal 1D convolution. Compared with a 3D convolutional layer, a (2+1)D convolutional block introduces an additional nonlinear rectification between the temporal and spatial convolutions. Besides, the decomposition of the temporal-spatial convolution facilitates optimization according to [26].

SLNet-R18D substitutes the encoder of SLNet-C21D with ResNet(2+1)D-18 [11]. We also utilize ResNet(2+1)D-18 as the classifier to discriminate different scenes; this backbone is experimentally strong enough for the classification task. As for SLNet-R18UC, its encoder is the same as that of SLNet-R18D, while its decoder adopts the structure in [2] and is composed of upsampling and convolution layers instead of transposed convolutions.

With sliced RAMap frames and ConfMaps, we train our SLNet with the mean squared error loss:

$$\mathcal{L}_{MSE} = \sum_{cls} \sum_{i,j} \left( \hat{C}^{cls}_{i,j} - C^{cls}_{i,j} \right)^2, \tag{1}$$

where $C$ denotes the ConfMaps generated from annotations, $\hat{C}$ denotes the network prediction, and $C^{cls}_{i,j}$ is the probability that an object of class $cls$ appears at pixel $(i, j)$.

Finally, we use an ensemble method over the aforementioned models to get the final results. Specifically, we average the ConfMaps of each model and then identify detection results from the averaged ConfMaps.
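For reference, the (2+1)D decomposition described above can be written as a spatial convolution followed by a temporal convolution with a nonlinearity in between. The intermediate channel width and normalization placement below are assumptions rather than the exact SLNet configuration.

```python
import torch
from torch import nn

class Conv2Plus1D(nn.Module):
    """A (2+1)D block: spatial (1,k,k) conv -> BN -> ReLU -> temporal (k,1,1) conv."""

    def __init__(self, in_ch, out_ch, k=3, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch  # assumed intermediate width
        p = k // 2
        self.spatial = nn.Conv3d(in_ch, mid_ch, (1, k, k), padding=(0, p, p), bias=False)
        self.bn = nn.BatchNorm3d(mid_ch)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, (k, 1, 1), padding=(p, 0, 0), bias=False)

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

x = torch.randn(2, 2, 16, 128, 128)    # (batch, C_RF=2, tau, h, w)
print(Conv2Plus1D(2, 32)(x).shape)     # torch.Size([2, 32, 16, 128, 128])
```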
3.3 SceneMix
Many data augmentation methods have proven effective in 2D and 3D tasks. VideoMix [33] and CutMix [32] are powerful augmentation strategies that create new training samples from two existing ones. Such augmented samples not only enlarge the training dataset but also force the model to be more robust across scenes.

We propose an augmentation method called scene-aware radar data mixing (SceneMix), which is composed of VideoMix, VideoCropMix, and NoiseMix. Note that mixing radar snippets of different scenes may lead to absurd results: for example, a static pedestrian in a Static scene would appear to move at highway speed if mixed into a Dynamic scene. Hence, only radar snippets of the same scene are mixed together.

Denote a radar snippet by $x \in \mathbb{R}^{C_{RF} \times T \times W \times H}$ and the corresponding ConfMaps by $c \in \mathbb{R}^{C_{cls} \times T \times W \times H}$. The VideoMix algorithm mixes two radar snippets with a random proportion $\lambda \in [0, 1]$. The new radar snippet is generated by:

$$x = \lambda x_A + (1 - \lambda) x_B, \qquad c = \lambda c_A + (1 - \lambda) c_B. \tag{2}$$

The VideoCropMix algorithm mixes two radar snippets in another way: it randomly crops a region of one radar snippet and replaces the cropped area with the corresponding area of another snippet. The same operation is performed on the ConfMaps.

Adding noise to training samples has proven helpful for training more robust neural networks. To generate diverse radar noise, we introduce the NoiseMix augmentation. Notice that each radar snippet naturally contains noisy signals. To extract the noise from a radar snippet, we set the areas in which any semantic class has a probability greater than a threshold in its ConfMaps to zero. The extracted noise is then added to other radar snippets without modifying their ConfMaps.
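A minimal NumPy sketch of the three SceneMix operations on paired (snippet, ConfMap) arrays; the crop size, noise threshold, and random sampling choices are illustrative assumptions.

```python
import numpy as np

def video_mix(x_a, c_a, x_b, c_b, rng=np.random):
    """Blend two same-scene snippets and their ConfMaps with a random lambda (Eq. 2)."""
    lam = rng.uniform(0.0, 1.0)
    return lam * x_a + (1 - lam) * x_b, lam * c_a + (1 - lam) * c_b

def video_crop_mix(x_a, c_a, x_b, c_b, crop=32, rng=np.random):
    """Replace a random spatial crop of snippet A with the same region of snippet B."""
    _, _, w, h = x_a.shape                       # (C_RF, T, W, H)
    i = rng.randint(0, w - crop)
    j = rng.randint(0, h - crop)
    x, c = x_a.copy(), c_a.copy()
    x[..., i:i + crop, j:j + crop] = x_b[..., i:i + crop, j:j + crop]
    c[..., i:i + crop, j:j + crop] = c_b[..., i:i + crop, j:j + crop]
    return x, c

def noise_mix(x_a, c_a, x_b, c_b, thr=0.3):
    """Zero out object regions of snippet B (ConfMap > thr) and add the leftover noise to A."""
    noise = x_b * (c_b.max(axis=0, keepdims=True) <= thr)  # broadcast over C_RF channels
    return x_a + noise, c_a                                 # labels of A are unchanged
```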
3.4 Post-Processing
After SLNet predicts the ConfMaps, post-processing is applied to transform the ConfMaps into final detections. The L-NMS in the pipeline of [30] is a good choice but fails to take the properties of driving scenes into consideration. Apart from using L-NMS to identify detections from the ConfMaps, we introduce a series of constraints to make the results more robust: No Collision, Continuity, and Entering from the border. An illustration of these post-processing constraints is shown in Figure 5.

Figure 5: Examples of post-processing. Each bounding box represents one frame (boxes are drawn at different sizes only for visualization; all frames have the same size). Frames are ordered from left to right in time. Different colors of points represent different detection results, and the number next to a class name is the confidence of the prediction.

No Collision: If two objects of different classes are close to each other, the less confident one is removed to prevent a collision. We measure the distance between two objects by the object location similarity (OLS) [30].

Continuity: If an object appears continuously across frames but goes missing or changes into another class for one or two frames in between, we use linear interpolation to add the object to, or correct its class in, those frames.

Entering from the border: If an object appears suddenly, i.e., it cannot be tracked back through previous frames to the border of the radar image, we consider it noise and delete it.

All three constraints are applied to the outputs of Static scenes. For Dynamic scenes, we find the last two constraints have little effect due to the fast speed of the vehicle carrying the radar sensor; thus, we only apply the first constraint in that case.
Table 1: Radar object detection performance on the ROD2021 dataset.

| Architectures | AP | AP0.5 | AP0.6 | AP0.7 | AP0.8 | AP0.9 | AR | AR0.5 | AR0.6 | AR0.7 | AR0.8 | AR0.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RODNet-CDC | 45.38 | 50.89 | 49.62 | 47.81 | 43.85 | 31.69 | 50.74 | 54.90 | 53.83 | 52.58 | 49.39 | 40.89 |
| RODNet-HG | 41.28 | 47.66 | 46.49 | 44.63 | 39.64 | 25.01 | 47.83 | 52.74 | 51.80 | 50.23 | 46.57 | 35.50 |
| RODNet-HGwI | 38.82 | 44.26 | 42.28 | 40.63 | 37.44 | 27.22 | 45.96 | 50.23 | 48.66 | 47.17 | 44.79 | 37.44 |
| SLNet-C21D | 46.84 | 52.32 | 51.02 | 49.36 | 45.33 | 33.12 | 52.23 | 56.45 | 55.29 | 53.96 | 50.91 | 42.59 |
| SLNet-R18D | 47.22 | 53.50 | 52.16 | 49.63 | 44.99 | 33.30 | 54.49 | 59.39 | 58.33 | 56.45 | 52.57 | 43.67 |
| SLNet-R18UC | 53.41 | 60.00 | 58.51 | 56.18 | 51.06 | 37.59 | 59.52 | 63.84 | 62.72 | 61.33 | 58.18 | 48.73 |
| Ensemble | 54.15 | 59.89 | 58.96 | 56.53 | 52.10 | 40.39 | 60.84 | 65.37 | 64.28 | 62.57 | 59.09 | 50.76 |
Table 2: Teams with high rankings and the corresponding model performance in the ROD2021 Challenge.

| Team | AP (total) | AR (total) | AP (PL) |
|---|---|---|---|
| Baidu-VIS&ITD | 82.2 | 90.1 | 97.0 |
| USTC-NELSLIP | 79.7 | 88.9 | 95.6 |
| No_Bug | 76.1 | 83.9 | 96.1 |
| DD_Vision | 75.1 | 84.9 | 95.2 |
| Ours | 75.0 | 81.0 | 97.8 |
| acvlab | 69.3 | 77.3 | 69.3 |
4 EXPERIMENTS

4.1 Datasets
The ROD2021 dataset used in the ROD2021 Challenge is a subset of the CRUW dataset [30]. There are 50 sequences in total, 40 of which are provided with annotations. Each sequence lasts around 25–60 s with 800–1700 frames, and each frame is a RAMap with dimension 128×128. The provided annotations are created by a camera-radar fusion algorithm [30].

To validate our algorithm, we randomly choose 8 of the 40 annotated sequences as the validation set and use the remaining 32 sequences as the training set. Among the 40 sequences, about 15% are classified as Dynamic and the rest are Static. In addition to reporting the performance on the test set from the ROD2021 competition server, we provide a more detailed analysis by conducting experiments on the validation set.
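A minimal sketch of the validation split described above; the sequence names and scene labels are placeholders, not the actual ROD2021 sequence IDs.

```python
import random

def split_sequences(seq_to_scene, num_val=8, seed=0):
    """Randomly hold out `num_val` annotated sequences for validation.

    seq_to_scene: dict mapping sequence name -> 'static' | 'dynamic'.
    """
    rng = random.Random(seed)
    seqs = sorted(seq_to_scene)
    val = set(rng.sample(seqs, num_val))
    train = [s for s in seqs if s not in val]
    return train, sorted(val)

# Hypothetical usage with placeholder names (6 of 40 sequences marked Dynamic).
scenes = {f"seq_{i:02d}": ("dynamic" if i < 6 else "static") for i in range(40)}
train_seqs, val_seqs = split_sequences(scenes)
print(len(train_seqs), len(val_seqs))  # 32 8
```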
4.2 Evaluation Metrics
To evaluate our methods, we use the average precision (AP) and average recall (AR) metrics proposed in [30]. Specifically, the object location similarity (OLS) [30] between our detection results and the ground truth is calculated. Then, given a threshold $t$, detection results with OLS higher than $t$ are considered correct matches, from which precision and recall can be computed. With $t$ ranging from 0.5 to 0.9 in steps of 0.05, we obtain the AP and AR as our evaluation metrics.
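As a simplified stand-in for the official evaluation script, the sketch below averages precision and recall over the OLS thresholds described above; `ols_fn` is assumed to be an OLS implementation such as the hedged kernel sketched in Section 3.4, and per-frame, per-class bookkeeping is omitted.

```python
import numpy as np

def precision_recall_at(dets, gts, t, ols_fn):
    """Greedy matching: a detection is a true positive if its OLS with an unmatched GT >= t."""
    matched, tp = set(), 0
    for d in sorted(dets, key=lambda d: d["conf"], reverse=True):
        best, best_ols = None, t
        for i, g in enumerate(gts):
            if i not in matched and ols_fn(d, g) >= best_ols:
                best, best_ols = i, ols_fn(d, g)
        if best is not None:
            matched.add(best)
            tp += 1
    return tp / max(len(dets), 1), tp / max(len(gts), 1)

def ap_ar(dets, gts, ols_fn):
    """Average precision/recall over OLS thresholds 0.50, 0.55, ..., 0.90."""
    thresholds = np.arange(0.5, 0.9001, 0.05)
    pr = [precision_recall_at(dets, gts, t, ols_fn) for t in thresholds]
    return float(np.mean([p for p, _ in pr])), float(np.mean([r for _, r in pr]))
```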
Table 3: Ablation study on the different components of the framework. The vanilla version of SLNet-R18UC is trained directly, without SceneMix or scene-specific fine-tuning. AP_S and AR_S denote the AP and AR on sequences of Static scenes, while AP_D and AR_D denote those of Dynamic ones.

| Methods | AP | AP_S | AP_D | AR | AR_S | AR_D |
|---|---|---|---|---|---|---|
| SLNet-R18UC (vanilla) | 47.03 | 70.51 | 18.47 | 46.95 | 75.07 | 27.40 |
| with SceneMix | 49.97 | 73.55 | 23.16 | 55.94 | 78.09 | 32.14 |
| with Fine-tuning on S | 52.69 | 74.23 | 22.91 | 58.85 | 78.00 | 30.98 |
| with Fine-tuning on D | 50.49 | 73.32 | 28.27 | 56.09 | 77.51 | 37.85 |
| SLNet-R18UC | 53.41 | 74.23 | 28.27 | 60.00 | 78.00 | 37.85 |
Table 4: Number of sequences of different scenes and scene-prediction accuracy.

| Scene | # seq in train | # seq in test | Accuracy |
|---|---|---|---|
| Static | 28 | 6 | 100% |
| Dynamic | 4 | 2 | 100% |
4.3 Training Details
Our experiments use Adam [12] to optimize the network, with the learning rate set to $1 \times 10^{-4}$. A cosine annealing scheduler with warm restarts is applied to the optimizer to make the training process smoother. The model is parallelized over 4 GPUs with a total batch size of 64. Given an input radar snippet, the probabilities of applying VideoMix and VideoCropMix are both 1/3, and the chance of augmenting with NoiseMix is 1/2. After 50 epochs of training on all sequences, the model is fine-tuned on the sequences of each scene for 30 epochs. Finally, all trained models are ensembled to get the detection results.
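A minimal PyTorch sketch of the optimizer and scheduler configuration described above; the warm-restart period `T_0`, `T_mult`, and `eta_min` are assumptions, as the paper only specifies Adam with a learning rate of $1 \times 10^{-4}$ and cosine annealing with warm restarts.

```python
import torch
from torch import optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Conv3d(2, 3, kernel_size=3, padding=1)  # stand-in for an SLNet branch

optimizer = optim.Adam(model.parameters(), lr=1e-4)
# Cosine annealing with warm restarts; T_0 and eta_min below are assumed values.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-6)

for epoch in range(50):   # first-stage training on all sequences
    # ... one pass over the training snippets with the MSE loss of Eq. (1) ...
    scheduler.step()
```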
4.4 Results
Table 2 presents the final results of the top-ranking models in the ROD2021 competition. Our model achieves an AP of 75.0%, which outperforms the 69.8% obtained by simply applying RODNet-CDC without the scene-aware learning framework. It is worth noting that our model ranked first in the Parking Lot (PL) scene with respect to the AP score.

To further compare the performance of different models, we use the validation set to compare our methods with the three models proposed in [30]. The results are shown in Table 1. RODNet-CDC is a shallow 3D CNN encoder-decoder network. RODNet-HG is adopted from [17] with only one stack, while RODNet-HGwI replaces the 3D convolution layers in RODNet-HG with temporal inception layers [24].

To complete the scene-aware learning pipeline, we also train a 3D scene classifier. The numbers of sequences of each scene in the training and test sets are shown in Table 4. Our scene classifier obtains 100% accuracy in predicting the driving scenes.

All SLNet results in Table 1 are trained within the scene-aware learning framework. The ensemble version of the scene-aware learning framework outperforms the best baseline result by 8.77% in average precision and 10.10% in average recall. The three SLNet variants (SLNet-C21D, SLNet-R18D, and SLNet-R18UC) outperform the best baseline (RODNet-CDC) by 1.46%, 1.84%, and 8.03% in AP, respectively.
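The Ensemble row in Table 1 corresponds to the ConfMap averaging described in Section 3.2; a minimal sketch, assuming `models` holds the trained SLNet variants for the predicted scene:

```python
import torch

def ensemble_confmaps(models, snippet):
    """Average the ConfMaps predicted by several SLNet variants for one snippet."""
    with torch.no_grad():
        preds = [m(snippet.unsqueeze(0)) for m in models]   # each (1, C_cls, T, H, W)
    return torch.stack(preds, dim=0).mean(dim=0).squeeze(0)
```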
4.5 Ablation Study
Next, we investigate the effectiveness of the different components of the scene-aware learning framework on the validation set of the ROD2021 dataset. Table 3 shows the results with and without the SceneMix augmentation, as well as the results of fine-tuning on different scenes.

The vanilla version of SLNet-R18UC is trained directly, without fine-tuning or SceneMix augmentation. When training with SceneMix, the final AP increases by 1.77%. Based on the SLNet-R18UC trained with SceneMix, we fine-tune on Static and Dynamic scenes separately, and the two fine-tuned models achieve better results by 2.72% and 0.52% in AP, respectively. Besides, fine-tuning on a scene leads to an obvious improvement on that scene.

Finally, by applying the full scene-aware learning framework, we predict each scene with the corresponding model and achieve a final AP of 53.41%. We observe that each added component contributes to the final results without any performance degradation.

5 DISCUSSION
The scene-aware learning framework can remarkably improve the performance of radar object detection, especially for the Static scene. Despite the success of this model, there are also some limitations that need further attention. First, although scene-aware learning achieves high accuracy on the Static scene, the performance on Dynamic scenes is less satisfactory. More analysis should be done to investigate why the method performs worse in scenarios such as campus roads, city streets, and highways. Besides, beyond the two-scene division, the model may generalize to more scene categories, or even to velocity-aware ones. Finally, how to apply this model to an unseen scene is another practical issue. We leave these questions for future work.
6 CONCLUSION
In this paper, we proposed a scene-aware learning framework to detect objects from radar sequences. In the framework, radar sequences are processed by models fine-tuned on data from the same scene. The proposed SLNet can robustly detect objects with high precision. In addition, the paper presents a new augmentation, SceneMix, and a new post-processing method for radar object detection. The proposed method offers a novel and effective way to take advantage of the properties of radar data. Our experiments on the ROD2021 dataset demonstrate that the proposed framework is an accurate and robust method for detecting objects with radar.
REFERENCES
[1] Aleksandar Angelov, Andrew Robertson, Roderick Murray-Smith, and Francesco Fioranelli. 2018. Practical classification of different moving targets using automotive radar and deep neural networks. IET Radar, Sonar & Navigation 12, 10 (2018), 1082–1089.
[2] Adrian Bulat, Jean Kossaifi, Georgios Tzimiropoulos, and Maja Pantic. 2020. Toward fast and accurate human pose estimation via soft-gated skip connections. arXiv preprint arXiv:2002.11098 (2020).
[3] Samuele Capobianco, Luca Facheris, Fabrizio Cuccoli, and Simone Marinai. 2017. Vehicle classification based on convolutional networks applied to FMCW radar signals. In Italian Conference for the Traffic Police. Springer, 115–128.
[4] Andreas Danzer, Thomas Griebel, Martin Bach, and Klaus Dietmayer. 2019. 2D car detection in radar data with PointNets. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, 61–66.
[5] Jingkun Gao, Bin Deng, Yuliang Qin, Hongqiang Wang, and Xiang Li. 2018. Enhanced radar imaging using a complex-valued convolutional neural network. IEEE Geoscience and Remote Sensing Letters 16, 1 (2018), 35–39.
[6] Xiangyu Gao, Guanbin Xing, Sumit Roy, and Hui Liu. 2019. Experiments with mmWave automotive radar test-bed. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers. IEEE, 1–6.
[7] Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448.
[8] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
[9] Souvik Hazra and Avik Santra. 2019. Short-range radar-based gesture recognition system using 3D CNN with triplet loss. IEEE Access 7 (2019), 125623–125633.
[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[13] Jihoon Kwon and Nojun Kwak. 2017. Human detection by neural networks using a low-cost short-range Doppler radar sensor. In 2017 IEEE Radar Conference (RadarConf). IEEE, 0755–0760.
[14] Ankith Manjunath, Ying Liu, Bernardo Henriques, and Armin Engstle. 2018. Radar based object detection and tracking for autonomous driving. In 2018 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM). IEEE, 1–4.
[15] Michael Meyer and Georg Kuschk. 2019. Automotive radar dataset for deep learning based 3D object detection. In 2019 16th European Radar Conference (EuRAD). IEEE, 129–132.
[16] Ramin Nabati and Hairong Qi. 2019. RRPN: Radar region proposal network for object detection in autonomous vehicles. In 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 3093–3097.
[17] Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision. Springer, 483–499.
[18] Felix Nobis, Maximilian Geisslinger, Markus Weber, Johannes Betz, and Markus Lienkamp. 2019. A deep learning-based radar and camera sensor fusion architecture for object detection. In 2019 Sensor Data Fusion: Trends, Solutions, Applications (SDF). IEEE, 1–7.
[19] A. D. Olver and L. G. Cuthbert. 1988. FMCW radar for hidden object detection. In IEE Proceedings F (Communications, Radar and Signal Processing), Vol. 135. IET, 354–361.
[20] Minh-Tan Pham and Sébastien Lefèvre. 2018. Buried object detection from B-scan ground penetrating radar data using Faster R-CNN. In IGARSS 2018 – 2018 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 6804–6807.
[21] Xingshuai Qiao, Tao Shan, and Ran Tao. 2020. Human identification based on radar micro-Doppler signatures separation. Electronics Letters 56, 4 (2020), 195–196.
[22] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
[23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015).
[24] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[25] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[26] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In CVPR. 6450–6459.
[27] Hao Wang, Weining Wang, and Jing Liu. 2021. Temporal memory attention for video semantic segmentation. arXiv preprint arXiv:2102.08643 (2021).
[28] Shiyao Wang, Yucong Zhou, Junjie Yan, and Zhidong Deng. 2018. Fully motion-aware network for video object detection. In Proceedings of the European Conference on Computer Vision (ECCV). 542–557.
[29] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q. Weinberger. 2019. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8445–8453.
[30] Yizhou Wang, Zhongyu Jiang, Xiangyu Gao, Jenq-Neng Hwang, Guanbin Xing, and Hui Liu. 2021. RODNet: Radar object detection using cross-modal supervision. In WACV. 504–513.
[31] Xiangyu Yue, Bichen Wu, Sanjit A. Seshia, Kurt Keutzer, and Alberto L. Sangiovanni-Vincentelli. 2018. A LiDAR point cloud generator: From a virtual world to autonomous driving. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. 458–464.
[32] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. 2019. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6023–6032.
[33] Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, and Jinhyung Kim. 2020. VideoMix: Rethinking data augmentation for video classification. arXiv preprint arXiv:2012.03457 (2020).
[34] Mingmin Zhao, Tianhong Li, Mohammad Abu Alsheikh, Yonglong Tian, Hang Zhao, Antonio Torralba, and Dina Katabi. 2018. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7356–7365.
[35] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. 2017. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision. 408–417.