
Scene-aware Learning Network for Radar Object Detection

Zangwei Zheng (Nanjing University), Xiangyu Yue, Kurt Keutzer, Alberto Sangiovanni Vincentelli (UC Berkeley)

Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR ’21), August 21–24, 2021, Taipei, Taiwan

ABSTRACT

Object detection is essential to safe autonomous or assisted driving. Previous works usually utilize RGB images or LiDAR point clouds to identify and localize multiple objects in self-driving. However, cameras tend to fail in bad driving conditions, e.g. bad weather or weak lighting, while LiDAR scanners are too expensive to get widely deployed in commercial applications. Radar has been drawing more and more attention due to its robustness and low cost. In this paper, we propose a scene-aware radar learning framework for accurate and robust object detection. First, the learning framework contains branches conditioning on the scene category of the radar sequence, with each branch optimized for a specific type of scene. Second, three different 3D autoencoder-based architectures are proposed for radar object detection, and ensemble learning is performed over the different architectures to further boost the final performance. Third, we propose a novel scene-aware sequence mix augmentation (SceneMix) and scene-specific post-processing to generate more robust detection results. In the ROD2021 Challenge, we achieved a final result of 75.0% average precision and 81.0% average recall. Moreover, in the parking lot scene, our framework ranks first with an average precision of 97.8% and an average recall of 98.6%, which demonstrates the effectiveness of our framework.

CCS CONCEPTS
• Computing methodologies → Object detection; Scene understanding; Neural networks.

KEYWORDS
Auto-driving; Radar Frequency Data; Object Detection; Neural Network; Data Augmentation

ACM Reference Format:
Zangwei Zheng, Xiangyu Yue, Kurt Keutzer, and Alberto Sangiovanni Vincentelli. 2021. Scene-aware Learning Network for Radar Object Detection. In Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR ’21), August 21–24, 2021, Taipei, Taiwan. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3460426.3463655

1 INTRODUCTION

Accurate object detection is a fundamental necessity for autonomous or assisted driving. Many previous works [8, 22, 25, 27, 35] have achieved good performance based on visual images or videos captured by RGB cameras. However, camera-based methods can easily fail in bad driving conditions, such as foggy weather, dim night-time scenes, and strong lighting. Compared with visible light, LiDAR can provide direct and robust distance measurements of the surrounding environment [29, 31], but LiDAR scanners are so expensive that many autonomous car manufacturers prefer not to use them. Similar to LiDAR, millimeter-wave radar can function reliably and measure range accurately; and similar to RGB cameras, radar sensors are fairly competitive in terms of manufacturing cost. Therefore, object detection based on a frequency modulated continuous wave (FMCW) radar has been considered a more robust and practical choice.

Compared with visual images, radar frequency (RF) data is much harder to annotate. For a better representation, RF data are usually transformed into range-azimuth frequency heatmaps (RAMaps), whose horizontal and vertical dimensions denote angle and distance (a bird's-eye view), respectively. Recently, [30] proposed a pipeline for radar object detection: a cross-modal supervision framework that generates labels for RF data without laborious and inconsistent human labeling, which enables neural network training on a large amount of consistently annotated RF data. High-performance detection models are used to label the RGB images, and the detected object positions are transformed into points on the RAMaps. To train the models, annotations of objects are transformed into object confidence distribution maps (ConfMaps). During the test phase, the output ConfMaps are processed to generate the detection results. To evaluate the final results, [30] defines an average precision metric similar to the one used in traditional object detection. Our framework follows a similar annotation generation process.

To perform detection on the RF data, [30] directly applies 3D versions of previous models [17, 24] without considering the inherent properties of radar sequences. For example, more attention should be paid to the velocity information that can be retrieved from the RF data. The unique properties of RF data can provide more understanding of the semantic meaning of an object.

In this paper, we propose a branched scene-aware learning framework for radar object detection. Specifically, the framework consists of two parts: a scene classifier and a radar object detector. We find that RF data in different driving scenes exhibit significant differences. Therefore, we partition all radar sequences into different sets based on the driving scene. The scene classifier predicts the scene category for each input radar sequence, e.g. static or moving background. The object detection branches are trained in two stages. In the first stage, a Scene-aware Learning Network (SLNet) is trained on all the RF sequences to learn a universal, well-behaved object detector. In the second stage, for each type of scene, a scene-specific radar object detector is fine-tuned with the corresponding radar sequences on top of the universal model. As a result, the fine-tuned models are able to learn more scene-specific features for better performance.
Figure 1: An overview of the scene-aware learning framework in the test phase. Radar sequences are first classified into different scenes by a 3D classifier. The scene-based switch then passes the radar snippet to the corresponding SLNet branch (orange: Dynamic, blue: Static). The SLNet trained on the corresponding radar sequences predicts the ConfMaps of objects. Detection results are output after a scene-specific post-processing step.

Based on neural network architectures that perform well in video recognition, e.g. Conv(2+1)D [26] and ResNet [11], we build different variants of SLNet. For better model generalization, we design and apply a scene-aware augmentation, SceneMix, to the RF data during training. SceneMix creates a new training radar snippet by inserting a piece of one radar snippet into another snippet of the same scene category. More specifically, snippets of a radar sequence can be mixed up, cropped and replaced, or de-noised and added to other radar snippets. To make our results more robust, we further design a new type of post-processing for the detection results and vary the process across scenes.

We train and evaluate our network in the ROD2021 challenge. The ROD2021 dataset in this challenge contains 40 sequences for training and 10 sequences for testing. We define two scenes, Static and Dynamic, depending on whether the car carrying the radar sensor is moving or not. In this challenge, our SLNet achieves about 75.0% average precision (AP) and 81.0% average recall (AR). Moreover, we achieve 97.8% AP and 98.6% AR in the Parking Lot category, ranking first in the challenge.

Figure 2: An overview of the scene-aware learning framework in the training stage. Coloring represents different scenes (orange: Dynamic, blue: Static).

The main contributions of this work are summarized as follows:
• We propose a novel scene-aware learning framework for radar object detection based on the type of driving scene.
• We propose to leverage the spatio-temporal convolutional block "R(2+1)D" and build the Scene-aware Learning Network (SLNet) for accurate radar object detection.
• We customize several image-processing methods for radar. Specifically, we propose novel augmentation, post-processing, and ensembling schemes for the new data modality.
2 RELATED WORK

2.1 Object Detection for Images and Videos
Convolutional neural networks have achieved remarkable performance in various computer vision tasks, including object detection for visual images and videos. Most state-of-the-art image-based object detection methods can be classified into two categories: multiple-stage pipelines and single-stage pipelines. The classic multiple-stage model is R-CNN [8]. In this setting, regions of interest (RoIs) are first generated by neural networks; then, in each RoI, detection results are obtained from the corresponding features. Variants of this model [7, 10, 23] further improve the speed and accuracy of R-CNN. Single-stage pipelines, on the other hand, directly predict the results with a single convolutional network, e.g. YOLO [22].

To capture the relationship between frames in videos, many works have proposed different methods. [28, 35] introduce flow-based methods, which combine flow information with features extracted on one frame to obtain the prediction. Wang et al. [27] propose memory and self-attention to extract information in the temporal dimension. A more direct way to learn spatiotemporal features is proposed in [25], which builds 3D convolutional networks to extract features from video snippets.

2.2 Radar Object Detection
To overcome the poor quality of camera data in severe weather or unsatisfactory lighting, some prior works [4, 6, 14, 16, 18–20] exploit the data from a frequency modulated continuous wave (FMCW) radar to detect objects more robustly. Considering that it is hard to annotate radar data, since humans have little knowledge of what an object looks like in radar images, previous radar object detection works can be classified into two categories by whether visual data is required during learning.

The first type fuses radar and vision images to obtain more robust detection results. Nabati et al. [16] fuse the data collected from radars with vision data to obtain faster and more accurate detections. Meanwhile, Nobis et al. [18] extract and combine features of visual images and sparse radar data in the network encoding layers to improve 2D object detection results.

The second type detects objects based on radar data only. To effectively and efficiently collect radar object annotations, some annotations are automatically generated by high-accuracy object detection algorithms applied to the data of a calibrated camera or LiDAR sensor [15, 30]. Wang et al. [30] propose a cross-modal supervision pipeline to annotate radar sequences with less human labor and represent the radar frequency data in range-azimuth coordinates (RAMaps). This pipeline facilitates the development of radar object detection algorithms.
2.3 CNN for Radar Processing
In the processing of radar data, a series of works [1, 3, 13, 21, 30] explores convolutional neural networks to extract features from radar data. To obtain good feature representations, Capobianco et al. [3] apply a convolutional neural network to the range-Doppler signature, while Angelov et al. [1] try out various network structures, including residual networks and combinations of convolutional and recurrent networks, to classify radar objects. To prevent overfitting, Kwon et al. [13] add Gaussian noise to the input radar data. Since radar data are usually represented in the format of complex numbers, [5] proposed to utilize a complex-valued CNN to enhance radar recognition.

To better extract spatiotemporal information for radar object detection, prior works utilize 3D convolution on radar data. Hazra et al. [9] propose to use a 3D CNN architecture to learn an embedding model with a distance-based triplet-loss similarity metric. In [30], three encoder-decoder-based convolutional network structures are proposed for radar object detection, where the encoder consists of a series of 3D convolution layers and the decoder is composed of several transposed convolution layers.

3 APPROACH

Following [30], we formulate radar object detection as follows: given a training radar sequence in the RAMap format, R_train, and its annotation (points with semantic class labels), y_train, we are required to detect radar objects in the testing sequences R_test. A ConfMap C is generated from the annotation y for neural network supervision by placing Gaussian distributions around each object location. With a sliding window τ, the network is fed a snippet of R_train with dimensions (C_RF, τ, w, h) and predicts a ConfMap Ĉ with dimensions (C_cls, τ, w, h). C_RF is the number of channels in the RAMaps, which consist of real and imaginary parts [34], while C_cls is the number of object classes. In the test phase, the output ConfMaps are processed into point detection results ŷ. The performance of the model is evaluated by comparing y_test and ŷ.

In this section, we first introduce the scene-aware learning framework for radar object detection. Then the components of the framework, namely the architecture of the Scene-aware Learning Network (SLNet), SceneMix, and the post-processing, are described in detail.
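To make the supervision target concrete, the following is a minimal sketch (not the authors' released code) of how point annotations could be rendered into a ConfMap by placing a Gaussian at each object location. The class list and per-class Gaussian widths below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical class list and Gaussian widths (illustrative only).
CLASSES = ["pedestrian", "cyclist", "car"]
SIGMA = {"pedestrian": 2.0, "cyclist": 3.0, "car": 4.0}  # in RAMap pixels

def generate_confmap(annotations, width=128, height=128):
    """Render point annotations into a ConfMap of shape (C_cls, h, w).

    `annotations` is a list of (class_name, range_idx, azimuth_idx) tuples
    for a single frame; each object contributes a 2D Gaussian centered at
    its RAMap location, and overlapping objects keep the maximum value.
    """
    confmap = np.zeros((len(CLASSES), height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for cls_name, r, a in annotations:
        cls_idx = CLASSES.index(cls_name)
        sigma = SIGMA[cls_name]
        gauss = np.exp(-((ys - r) ** 2 + (xs - a) ** 2) / (2 * sigma ** 2))
        confmap[cls_idx] = np.maximum(confmap[cls_idx], gauss)
    return confmap

# Example: one car and one pedestrian in a 128x128 RAMap frame.
frame_confmap = generate_confmap([("car", 40, 64), ("pedestrian", 20, 90)])
```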
3.1 Scene-aware Learning Framework
Radar frequency data under different driving scenes differ greatly. One reason for this is the inherent velocity information in radar signals: whether the ego car is moving or not leads to a great difference in the signals of objects and noise. For example, the relative velocity of a car on the highway may be zero, while it can be very high if the ego car is static. Another reason is the different likelihood of objects appearing in different scenes. We therefore design a scene-aware learning framework to tackle this problem.

Specifically, we divide all RF data sequences into two scene categories, Dynamic and Static, depending on whether the ego car is moving or not. We adopt a two-stage training approach for the SLNet. In the first stage, all radar snippets are used to train a universal SLNet (described in Section 3.2). In the second stage, we create two branches, each responsible for radar object detection in one scene. In each branch, we fine-tune the SLNet, starting from the universal model obtained in the first stage, with radar snippets of the corresponding scene. The whole training framework is shown in Figure 2.

We also train a scene classifier to classify the input radar snippets into one of these two scenes. With the classifier, the overall framework for scene-aware learning is shown in Figure 1. During the test phase, the scene classifier first classifies a test radar snippet into one of the two scene categories. Based on the scene category, the test radar snippet is then fed into the corresponding SLNet branch to generate the ConfMaps of radar objects. Finally, the scene-specific post-processing (described in Section 3.4) is applied to the output ConfMaps of the SLNet to obtain the final detection results.
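The test-time behaviour in Figure 1 is essentially a classifier-gated dispatch to one of two fine-tuned detectors. Below is a minimal PyTorch-style sketch of that switch under stated assumptions: `scene_classifier`, `slnet_static`, `slnet_dynamic`, and `post_process` are placeholders for the models and functions the paper describes, whose exact interfaces are not published here.

```python
import torch

def detect_snippet(snippet, scene_classifier, slnet_static, slnet_dynamic, post_process):
    """Scene-aware inference for one radar snippet of shape (C_RF, tau, w, h).

    1) classify the snippet as Static (0) or Dynamic (1),
    2) run the SLNet branch fine-tuned for that scene,
    3) apply scene-specific post-processing to the predicted ConfMaps.
    """
    with torch.no_grad():
        batch = snippet.unsqueeze(0)                # add batch dimension
        scene_logits = scene_classifier(batch)      # assumed shape: (1, 2)
        scene = int(scene_logits.argmax(dim=1))     # 0: Static, 1: Dynamic
        branch = slnet_dynamic if scene == 1 else slnet_static
        confmaps = branch(batch)                    # (1, C_cls, tau, w, h)
    # Post-processing (L-NMS plus scene-specific constraints) turns ConfMaps
    # into point detections; the applied constraints differ between scenes.
    return post_process(confmaps.squeeze(0), scene=scene)
```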
3.2 Network Architecture
We build three different network architectures for ROD2021. The architectures, shown in Figure 3, are a (2+1)D Convolution-Deconvolution network (SLNet-C21D), a ResNet(2+1)D18 Deconvolution network (SLNet-R18D), and a ResNet(2+1)D18 Upsample-Convolution network (SLNet-R18UC).

Figure 3: The architectures of our three SLNet models: (a) SLNet-C21D, (b) SLNet-R18D, (c) SLNet-R18UC, and (d) the bottleneck block.

SLNet-C21D is adapted from RODNet-CDC [30], but we replace the 3D convolutions in RODNet-CDC with (2+1)D convolutions [26] and add shortcut connections following [17]. Specifically, a (2+1)D convolutional block splits a 3D convolution into a spatial 2D convolution followed by a temporal 1D convolution. Compared with a 3D convolutional layer, a (2+1)D convolutional block introduces an additional nonlinear rectification between the temporal and spatial convolutions. Besides, the decomposition of temporal-spatial convolution facilitates optimization, according to [26].

SLNet-R18D substitutes the encoder of SLNet-C21D with ResNet(2+1)D18 [11]. We also utilize ResNet(2+1)D18 as the classifier to discriminate different scenes; this backbone is experimentally strong enough for the classification task. As for SLNet-R18UC, its encoder is the same as SLNet-R18D, while we adopt the decoder structure of [2]: the decoder is composed of upsampling and convolution instead of transposed convolution.

With sliced RAMap frames and ConfMaps, we train our SLNet with a mean squared error loss:

L_MSE = Σ_cls Σ_{i,j} ( Ĉ^cls_{i,j} − C^cls_{i,j} )² ,    (1)

where C represents the ConfMaps generated from annotations, Ĉ represents the network prediction, and C^cls_{i,j} represents the probability that an object of class cls appears at pixel (i, j).

Finally, we use an ensemble method on the aforementioned models to get the final results. Specifically, we average the ConfMaps of each model and then identify detection results from the averaged ConfMaps.
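As an illustration of the core building block, here is a minimal sketch of a (2+1)D convolutional block in the spirit of R(2+1)D [26], assuming PyTorch; the intermediate channel width and kernel sizes are free design choices here, not the paper's exact values.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """A (2+1)D convolutional block (sketch only).

    A full 3D convolution (k_t x k x k) is factorized into a spatial 2D
    convolution (1 x k x k) followed by a temporal 1D convolution
    (k_t x 1 x 1), with an extra nonlinearity in between.
    """

    def __init__(self, in_channels, out_channels, mid_channels=None,
                 spatial_kernel=3, temporal_kernel=3):
        super().__init__()
        if mid_channels is None:
            mid_channels = out_channels
        sp, tp = spatial_kernel // 2, temporal_kernel // 2
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, spatial_kernel, spatial_kernel),
                                 padding=(0, sp, sp))
        self.bn1 = nn.BatchNorm3d(mid_channels)
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1, 1),
                                  padding=(tp, 0, 0))
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                            # x: (N, C, T, H, W) snippet
        x = self.relu(self.bn1(self.spatial(x)))     # spatial 2D conv + rectification
        x = self.relu(self.bn2(self.temporal(x)))    # temporal 1D conv
        return x

# Example: a snippet with 2 RF channels (real/imaginary), 16 frames, 128x128 RAMaps.
block = Conv2Plus1D(in_channels=2, out_channels=32)
out = block(torch.randn(1, 2, 16, 128, 128))         # -> (1, 32, 16, 128, 128)
```

For training, Eq. (1) corresponds to a sum-reduced squared error over ConfMaps, e.g. `torch.nn.functional.mse_loss(pred, target, reduction="sum")`.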
3.3 SceneMix
Many data augmentation methods have proven effective in different 2D and 3D tasks. VideoMix [33] and CutMix [32] are powerful augmentation strategies that create new training samples from two existing ones. Such augmented samples not only enlarge our training dataset but also force our model to be more robust across scenes. We propose an augmentation method called scene-aware radar data mixing (SceneMix), which is composed of VideoMix, VideoCropMix, and NoiseMix.

Note that mixing radar snippets of different scenes may lead to absurd results; for example, a static pedestrian from a Static scene would appear to move at highway speed if mixed into a Dynamic scene. Hence, only radar snippets of the same scene are mixed together.

Figure 4: Example results of the SceneMix augmentation. The left two frames are from static scenes. The right three frames are the results of VideoMix, VideoCropMix, and NoiseMix, respectively.

Denote a radar snippet by x ∈ R^{C_RF×T×W×H} and its corresponding ConfMaps by c ∈ R^{C_cls×T×W×H}. The VideoMix algorithm mixes two radar snippets with a random proportion λ ∈ [0, 1]. The new radar snippet is generated by:

x = λ·x_A + (1 − λ)·x_B
c = λ·c_A + (1 − λ)·c_B    (2)

The VideoCropMix algorithm mixes two radar snippets in another way: it randomly crops a region of one radar snippet and replaces the cropped area with the corresponding area of another snippet. The same process is performed on the ConfMaps.

Adding noise to training samples has proven to help train more robust neural networks. To generate diverse radar noise, we introduce the NoiseMix augmentation. Notice that each radar snippet naturally contains noisy signals. To extract the noise from a radar snippet, we set to zero the areas in which any of the semantic classes has a probability greater than a threshold in the ConfMaps. The extracted noise is then added to other radar snippets without modifying their ConfMaps.
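As a concrete illustration, the following is a minimal sketch of the VideoMix-style branch of SceneMix from Eq. (2), plus a crop-and-replace variant in the spirit of VideoCropMix. The crop size is an illustrative assumption, and both inputs are assumed to come from the same scene category, as required by SceneMix.

```python
import numpy as np

def videomix(x_a, c_a, x_b, c_b, rng=None):
    """Blend two same-scene radar snippets and their ConfMaps (Eq. (2))."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.uniform(0.0, 1.0)                 # mixing proportion lambda in [0, 1]
    x = lam * x_a + (1.0 - lam) * x_b           # snippets: (C_RF, T, W, H)
    c = lam * c_a + (1.0 - lam) * c_b           # ConfMaps: (C_cls, T, W, H)
    return x, c

def videocropmix(x_a, c_a, x_b, c_b, crop=32, rng=None):
    """Replace a random spatial crop of snippet A with the same region of B."""
    rng = np.random.default_rng() if rng is None else rng
    _, _, W, H = x_a.shape
    w0 = rng.integers(0, W - crop + 1)
    h0 = rng.integers(0, H - crop + 1)
    x, c = x_a.copy(), c_a.copy()
    x[:, :, w0:w0 + crop, h0:h0 + crop] = x_b[:, :, w0:w0 + crop, h0:h0 + crop]
    c[:, :, w0:w0 + crop, h0:h0 + crop] = c_b[:, :, w0:w0 + crop, h0:h0 + crop]
    return x, c
```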
3.4 Post-Processing
After the SLNet predicts ConfMaps, post-processing needs to be applied to transform the ConfMaps into final detections. The L-NMS [30] used in the original pipeline is a good choice, but it fails to take the properties of driving scenes into consideration. Apart from using L-NMS to identify detections from the ConfMaps, we therefore introduce a series of constraints to make the results more robust: No Collision, Continuity, and Entering from the border. An illustration of these post-processing constraints is shown in Figure 5.

Figure 5: Examples of post-processing. Each bounding box represents one frame (bounding boxes of different sizes all represent frames of the same size; the sizes differ only for visualization). A sequence of frames is placed from left to right in temporal order. Different colors of points represent different detection results. The number next to the class name is the confidence of the prediction.

No Collision: If two objects of different classes are close to each other, the less confident one is removed to prevent a collision. We measure the distance between two objects by the object location similarity (OLS) [30].

Continuity: If an object appears continuously across frames but goes missing or changes into another class in one or two frames among them, we use linear interpolation to add the object to, or correct its class in, those frames.

Entering from the border: If an object appears suddenly (i.e., it cannot be tracked back through previous frames to the border of the radar image), we consider it noise and delete it.

All three constraints are applied to the outputs of Static scenes. For Dynamic scenes, we find that the last two constraints have little effect due to the fast speed of the vehicle carrying the radar sensor; thus, we only apply the first constraint in this scene.
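To make the No Collision rule concrete, here is a minimal sketch of confidence-ordered filtering. The `gaussian_similarity` function is a simplified stand-in for OLS: the exact OLS of [30] also involves per-class constants and the object's distance from the sensor, so both `kappa` and the similarity threshold below are illustrative assumptions.

```python
import numpy as np

def gaussian_similarity(p, q, kappa=5.0):
    """Simplified stand-in for OLS: similarity decays with squared distance
    via a Gaussian kernel over (range, azimuth) locations."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return float(np.exp(-d2 / (2.0 * kappa ** 2)))

def no_collision(detections, threshold=0.3, similarity=gaussian_similarity):
    """Drop the less confident of any two detections of different classes that
    are too similar in location. `detections` is a list of dicts with keys
    'cls', 'loc' (range, azimuth), and 'conf'."""
    kept = []
    for det in sorted(detections, key=lambda d: d["conf"], reverse=True):
        clash = any(det["cls"] != k["cls"] and
                    similarity(det["loc"], k["loc"]) > threshold
                    for k in kept)
        if not clash:
            kept.append(det)
    return kept

# Example: a low-confidence car overlapping a pedestrian is removed.
dets = [{"cls": "pedestrian", "loc": (20, 90), "conf": 0.9},
        {"cls": "car", "loc": (21, 91), "conf": 0.4}]
print(no_collision(dets))   # only the pedestrian survives
```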
4 EXPERIMENTS

4.1 Datasets
The ROD2021 dataset used in the ROD2021 Challenge is a subset of the CRUW dataset [30]. There are 50 sequences in total, 40 of which are provided with annotations. Each sequence lasts around 25–60 s and contains 800–1700 frames. Each frame is a RAMap with dimensions 128×128. The provided annotations are created by a camera-radar fusion algorithm [30]. To validate our algorithm, we randomly choose 8 of the 40 annotated sequences as the validation set and use the remaining 32 sequences as the training set. Among the 40 sequences, about 15% are classified as Dynamic and the rest are Static. In addition to presenting the performance on the test set from the ROD2021 competition server, we also provide a more detailed analysis by conducting experiments on the validation set.

4.2 Evaluation Metrics
To evaluate our methods, we use the average precision (AP) and average recall (AR) metrics proposed in [30]. Specifically, the object location similarity (OLS) [30] between our detection results and the ground truth is calculated. Then, with a threshold t, detection results with OLS higher than t are considered correct matches, and the precision and recall can be computed. With t ranging from 0.5 to 0.9 in steps of 0.05, we obtain AP and AR as our evaluation metrics.

4.3 Training Details
Our experiments use Adam [12] to optimize the network, with the learning rate set to 1 × 10⁻⁴. A cosine annealing schedule with warm restarts is applied to the optimizer to make the training process smoother. The model is parallelized over 4 GPUs with a total batch size of 64. Given an input radar snippet, the probabilities of applying VideoMix and VideoCropMix are both 1/3, and the chance of augmentation with NoiseMix is 1/2. After 50 epochs of training on all sequences, our model is fine-tuned on sequences of each scene for 30 epochs. Finally, all trained models are ensembled to get the detection results.
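For reference, the following is a minimal sketch of a comparable training setup, assuming PyTorch. The learning rate (1e-4) and the sum-reduced MSE objective come from the paper; the restart period `T_0`, the minimum learning rate, and the per-epoch scheduler stepping are illustrative assumptions, since those details are not specified in the text.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

def build_optimizer(model):
    """Adam + cosine annealing with warm restarts (hyperparameters assumed)."""
    optimizer = Adam(model.parameters(), lr=1e-4)
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, eta_min=1e-6)
    return optimizer, scheduler

def train_one_epoch(model, loader, optimizer, scheduler, device="cuda"):
    model.train()
    for snippets, confmaps in loader:   # (N, C_RF, T, W, H), (N, C_cls, T, W, H)
        snippets, confmaps = snippets.to(device), confmaps.to(device)
        pred = model(snippets)
        # Eq. (1): sum of squared errors between predicted and target ConfMaps.
        loss = torch.nn.functional.mse_loss(pred, confmaps, reduction="sum")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```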
4.4 Results
Table 2 presents the final results of the top-ranking models in the ROD2021 competition. Our model achieves an AP of 75.0%, which outperforms the 69.8% baseline obtained by simply applying RODNet-CDC without the scene-aware learning framework. It is worth noting that our model ranked first in the Parking Lot (PL) scene with respect to the AP score.

Table 2: Teams with high rankings and corresponding model performances in the ROD2021 Challenge.
Team            AP (total)  AR (total)  AP (PL)
Baidu-VIS&ITD   82.2        90.1        97.0
USTC-NELSLIP    79.7        88.9        95.6
No_Bug          76.1        83.9        96.1
DD_Vision       75.1        84.9        95.2
Ours            75.0        81.0        97.8
acvlab          69.3        77.3        69.3

To further compare the performance of different models, we use the validation set to compare our methods with the three models proposed in [30]. The results are shown in Table 1. RODNet-CDC is a shallow 3D CNN encoder-decoder network. RODNet-HG is adopted from [17] with only one stack, while RODNet-HGwI replaces the 3D convolution layers in RODNet-HG with temporal inception layers [24].

Table 1: Radar object detection performance on the ROD2021 dataset.
Architectures  AP     AP0.5  AP0.6  AP0.7  AP0.8  AP0.9  AR     AR0.5  AR0.6  AR0.7  AR0.8  AR0.9
RODNet-CDC     45.38  50.89  49.62  47.81  43.85  31.69  50.74  54.90  53.83  52.58  49.39  40.89
RODNet-HG      41.28  47.66  46.49  44.63  39.64  25.01  47.83  52.74  51.80  50.23  46.57  35.50
RODNet-HGwI    38.82  44.26  42.28  40.63  37.44  27.22  45.96  50.23  48.66  47.17  44.79  37.44
SLNet-C21D     46.84  52.32  51.02  49.36  45.33  33.12  52.23  56.45  55.29  53.96  50.91  42.59
SLNet-R18D     47.22  53.50  52.16  49.63  44.99  33.30  54.49  59.39  58.33  56.45  52.57  43.67
SLNet-R18UC    53.41  60.00  58.51  56.18  51.06  37.59  59.52  63.84  62.72  61.33  58.18  48.73
Ensemble       54.15  59.89  58.96  56.53  52.10  40.39  60.84  65.37  64.28  62.57  59.09  50.76

To complete the scene-aware learning pipeline, we also need to train a 3D scene classifier. The numbers of sequences of each scene in the training and test splits are shown in Table 4. Our scene classifier obtains 100% accuracy in predicting the driving scenes.

Table 4: Number of sequences of each scene and scene prediction accuracy.
Scene    # seq in train  # seq in test  Accuracy
Static   28              6              100%
Dynamic  4               2              100%

All SLNet results in Table 1 are trained within the scene-aware learning framework. The ensemble version of the scene-aware learning framework outperforms the best baseline by 8.77% in average precision and 10.10% in average recall. The individual SLNet variants (SLNet-C21D, SLNet-R18D, and SLNet-R18UC) outperform the best baseline by 1.46%, 1.84%, and 8.03% in AP, respectively.

4.5 Ablation Study
Next, we investigate the effectiveness of the different components of the scene-aware learning framework on the validation set of the ROD2021 dataset. Table 3 shows the results with and without SceneMix augmentation, as well as the results of fine-tuning on different scenes. The vanilla version of SLNet-R18UC is trained directly, without fine-tuning or SceneMix augmentation. When training with SceneMix, the final AP increases by 1.77%. Based on the SLNet-R18UC trained with SceneMix, we fine-tune on Static and Dynamic scenes separately, and the two fine-tuned models achieve better results by 2.72% and 0.52% in AP, respectively. Besides, fine-tuning on a given scene leads to an obvious improvement in that scene. Finally, by applying the scene-aware learning framework, we predict each scene with the corresponding model and achieve a final AP of 53.41%. We observe that adding each component contributes to the final results without any performance degradation.

Table 3: Ablation study on different components of the framework. The vanilla version of SLNet-R18UC is trained directly. AP_S and AR_S denote the AP and AR on sequences of Static scenes, while AP_D and AR_D denote those of Dynamic ones.
Methods                  AP     AP_S   AP_D   AR     AR_S   AR_D
SLNet-R18UC (vanilla)    47.03  70.51  18.47  46.95  75.07  27.40
with SceneMix            49.97  73.55  23.16  55.94  78.09  32.14
with Fine-tuning on S    52.69  74.23  22.91  58.85  78.00  30.98
with Fine-tuning on D    50.49  73.32  28.27  56.09  77.51  37.85
SLNet-R18UC              53.41  74.23  28.27  60.00  78.00  37.85

5 DISCUSSION

The scene-aware learning framework can remarkably improve the performance of radar object detection, especially in the Static scene. Despite the success of this model, there are also some limitations that need further attention. First, although scene-aware learning achieves high accuracy in the Static scene, the performance in Dynamic scenes is not yet satisfactory. More analysis should be done to investigate why the method has inferior performance in scenarios such as campus roads, city streets, and highways. Besides, beyond the two-scene division, it may be possible for the model to generalize to more categories, or even to a velocity-aware formulation. Finally, how to apply this model to an unseen scene is another practical issue. We leave these questions for future work.

6 CONCLUSION

In this paper, we proposed a scene-aware learning framework to detect objects from radar sequences. In the framework, radar sequences are processed by models fine-tuned on the same scenes. The proposed SLNet can robustly detect objects with high precision. In addition, the paper presents a new augmentation, SceneMix, and a new post-processing method for radar object detection. The proposed method offers a novel and effective solution that takes advantage of the properties of radar data. Our experiments on the ROD2021 dataset demonstrate that the proposed framework is an accurate and robust method to detect objects based on radar.

REFERENCES
[1] Aleksandar Angelov, Andrew Robertson, Roderick Murray-Smith, and Francesco Fioranelli. 2018. Practical classification of different moving targets using automotive radar and deep neural networks. IET Radar, Sonar & Navigation 12, 10 (2018), 1082–1089.
[2] Adrian Bulat, Jean Kossaifi, Georgios Tzimiropoulos, and Maja Pantic. 2020. Toward fast and accurate human pose estimation via soft-gated skip connections. arXiv preprint arXiv:2002.11098 (2020).
[3] Samuele Capobianco, Luca Facheris, Fabrizio Cuccoli, and Simone Marinai. 2017. Vehicle classification based on convolutional networks applied to FMCW radar signals. In Italian Conference for the Traffic Police. Springer, 115–128.
[4] Andreas Danzer, Thomas Griebel, Martin Bach, and Klaus Dietmayer. 2019. 2D car detection in radar data with PointNets. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, 61–66.
[5] Jingkun Gao, Bin Deng, Yuliang Qin, Hongqiang Wang, and Xiang Li. 2018. Enhanced radar imaging using a complex-valued convolutional neural network. IEEE Geoscience and Remote Sensing Letters 16, 1 (2018), 35–39.
[6] Xiangyu Gao, Guanbin Xing, Sumit Roy, and Hui Liu. 2019. Experiments with mmWave automotive radar test-bed. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers. IEEE, 1–6.
[7] Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448.
[8] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
[9] Souvik Hazra and Avik Santra. 2019. Short-range radar-based gesture recognition system using 3D CNN with triplet loss. IEEE Access 7 (2019), 125623–125633.
[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[12] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[13] Jihoon Kwon and Nojun Kwak. 2017. Human detection by neural networks using a low-cost short-range Doppler radar sensor. In 2017 IEEE Radar Conference (RadarConf). IEEE, 0755–0760.
[14] Ankith Manjunath, Ying Liu, Bernardo Henriques, and Armin Engstle. 2018. Radar based object detection and tracking for autonomous driving. In 2018 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM). IEEE, 1–4.
[15] Michael Meyer and Georg Kuschk. 2019. Automotive radar dataset for deep learning based 3D object detection. In 2019 16th European Radar Conference (EuRAD). IEEE, 129–132.
[16] Ramin Nabati and Hairong Qi. 2019. RRPN: Radar region proposal network for object detection in autonomous vehicles. In 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 3093–3097.
[17] Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision. Springer, 483–499.
[18] Felix Nobis, Maximilian Geisslinger, Markus Weber, Johannes Betz, and Markus Lienkamp. 2019. A deep learning-based radar and camera sensor fusion architecture for object detection. In 2019 Sensor Data Fusion: Trends, Solutions, Applications (SDF). IEEE, 1–7.
[19] AD Olver and LG Cuthbert. 1988. FMCW radar for hidden object detection. In IEE Proceedings F (Communications, Radar and Signal Processing), Vol. 135. IET, 354–361.
[20] Minh-Tan Pham and Sébastien Lefèvre. 2018. Buried object detection from B-scan ground penetrating radar data using Faster-RCNN. In IGARSS 2018 – 2018 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 6804–6807.
[21] Xingshuai Qiao, Tao Shan, and Ran Tao. 2020. Human identification based on radar micro-Doppler signatures separation. Electronics Letters 56, 4 (2020), 195–196.
[22] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
[23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015).
[24] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[25] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[26] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In CVPR. 6450–6459.
[27] Hao Wang, Weining Wang, and Jing Liu. 2021. Temporal memory attention for video semantic segmentation. arXiv preprint arXiv:2102.08643 (2021).
[28] Shiyao Wang, Yucong Zhou, Junjie Yan, and Zhidong Deng. 2018. Fully motion-aware network for video object detection. In Proceedings of the European Conference on Computer Vision (ECCV). 542–557.
[29] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. 2019. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8445–8453.
[30] Yizhou Wang, Zhongyu Jiang, Xiangyu Gao, Jenq-Neng Hwang, Guanbin Xing, and Hui Liu. 2021. RODNet: Radar object detection using cross-modal supervision. In WACV. 504–513.
[31] Xiangyu Yue, Bichen Wu, Sanjit A Seshia, Kurt Keutzer, and Alberto L Sangiovanni-Vincentelli. 2018. A LiDAR point cloud generator: from a virtual world to autonomous driving. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. 458–464.
[32] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. 2019. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6023–6032.
[33] Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, and Jinhyung Kim. 2020. VideoMix: Rethinking data augmentation for video classification. arXiv preprint arXiv:2012.03457 (2020).
[34] Mingmin Zhao, Tianhong Li, Mohammad Abu Alsheikh, Yonglong Tian, Hang Zhao, Antonio Torralba, and Dina Katabi. 2018. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7356–7365.
[35] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. 2017. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision. 408–417.