UNDERSTANDING THE 3D SURROUNDING FOR IMPROVED OBJECT CLASSIFICATION FROM RGBD IMAGE FUSION

Dr. Mehmet Akif Alper1, Dr. Saif Imran2
1 Cybersecurity, Eastern Michigan University, Ypsilanti, MI, USA, 48197
2 Stoneridge Inc., Novi, MI, USA, 48377
[email protected]
[email protected]

Abstract

Object classification algorithms have a wide range of applications in the real world. Autonomous vehicles (AVs) need to detect and classify nearby objects to operate autonomously. In this project, we propose an object classification algorithm that uses color and depth imagery. Our method estimates depth with neural networks from a monocular camera. We then combine the depth estimates with color image features for object classification on AVs. Fusing depth and color image features enhances object classification accuracy. We explain our approach and report quantitative results for the problem.

Keywords: Fusion, object detection, perception, lidar, monocular camera.
1. Introduction

Object detection and classification have a wide range of applications in the field of computer vision [1-2]. Perception of the 3D environment has turned out to be vital for self-driving cars. It can tremendously help in self-driving by estimating how far an object is from the vehicle, estimating the pose of the vehicle, and even enabling behavioral analysis of static and dynamic objects, in addition to robust object detection and classification. To understand the 3D environment more accurately, selecting high-resolution sensors for estimating depth is very important for the vehicle industry. The sensors need to be cheap and portable, provide dense resolution, and cover the entire field of view of the mobile platform. Recently, lidars have emerged as accurate depth-sensing devices for outdoor environments. But lidars are expensive, bulky, and have low resolution that gives sparse data compared to standard color cameras. Calibration and fusion are additional overheads if multiple sensors are to be used for robust object detection. The motivation for this project comes from the idea of using a monocular color camera alone for inferring depth, then using both color and depth information for improved object classification/recognition. Specifically, our algorithm improves classification performance on AVs given depth and color information.
2. Related Study

Many researchers have proposed object detection by processing RGB-D imagery. However, most of the depth images are obtained from active depth sensors like the Kinect or Intel Senz3D cameras [3,4]. Both of these sensors can produce dense, reliable depth maps which can be fused with RGB cameras relatively easily. However, the Kinect and Intel Senz3D cameras work only indoors and fail completely in outdoor environments due to sunlight interference. Fusion of lidars with cameras can also be a good alternative for detection problems, but lidar data is sparse, and even though a reliable dense depth map can be obtained by super-resolution [5,6], lidars are currently very expensive. Our aim in this research is to use monocular cameras for depth map estimation and to explore whether the depth estimates are reliable enough for improved RGB-D object classification. There is plenty of literature in this field, but Eigen et al. [7] changed the landscape of this research by incorporating deep learning to estimate depth. Much subsequent research has built on this deep learning paradigm to estimate depth [8,9]. Most of the previous research relies on aggregate statistics, and it is hard to tell which method would perform best based on those statistics alone. Therefore, in this research we apply state-of-the-art techniques to estimate depth and check whether they can improve RGB-D object classification.
3. Method

Since estimating depth is vital for scene understanding and improved 3D perception, we propose using a single monocular camera for estimating depth and for robust object classification. A single camera is high resolution, portable, and much cheaper than depth cameras. Also, several state-of-the-art methods have already achieved impressive object classification performance based on color imagery alone. The goal of this research is to estimate a depth map from a monocular camera and combine it with color images to improve object classification performance. The idea is that if a monocular camera can infer depth alone, the cost of purchasing depth sensors can come down significantly. However, we need to make sure that depth inferred from a monocular camera is useful for object classification or recognition purposes. Therefore, the project comprises two parts: first, we apply deep learning to images captured from a monocular camera to estimate depth; second, we use the estimated depth along with conventional color features to improve object classification. The reason for fusing depth and color information is that we believe color and depth contain useful complementary characteristics that can be utilized for improved object detection.

For the first part of the research, we use two state-of-the-art methods and compare their performance to select the best method for depth estimation. For the second part, we extract relevant depth and RGB features from the fused RGB and depth images to classify objects. The following subsections describe the approaches in detail.
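To make the two-stage pipeline concrete, the following is a minimal sketch of how the pieces fit together. The component names (`depth_net`, `rgb_net`, `depth_statistics`, etc.) are placeholders for the modules described in Sections 3.1-3.3, not an implementation released with this paper.

```python
import numpy as np

def classify_objects(rgb_image, rois, depth_net, rgb_net, svm):
    """Two-stage RGB-D classification sketch.

    rgb_image : HxWx3 color frame from the monocular camera
    rois      : list of (x, y, w, h) candidate regions (assumed already detected)
    depth_net : monocular depth estimator (Section 3.1 or 3.2)
    rgb_net   : pretrained CNN used as an RGB feature extractor (Section 3.3)
    svm       : classifier trained on the fused feature vector
    """
    # Stage 1: dense depth from a single color image.
    depth_map = depth_net.predict(rgb_image)          # HxW depth estimate

    predictions = []
    for (x, y, w, h) in rois:
        rgb_crop = rgb_image[y:y + h, x:x + w]
        depth_crop = depth_map[y:y + h, x:x + w]

        # Stage 2: fuse appearance and geometry features for this region.
        f_rgb = rgb_net.features(rgb_crop)             # e.g., 2048-D descriptor
        f_depth = depth_statistics(depth_crop)         # covariance / normal stats
        fused = np.concatenate([f_rgb, f_depth])

        predictions.append(svm.predict(fused[None, :])[0])
    return predictions

def depth_statistics(depth_crop):
    # Placeholder for the depth/normal features of Section 3.3.
    return np.array([depth_crop.mean(), depth_crop.var()])
```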
3.1 Depth Estimation from Monocular Camera

Liu et al. [8] show that the continuous nature of depth values can be used to formulate depth estimation as a continuous conditional random field (CRF) learning problem. They propose a deep convolutional neural field model that jointly exploits the capacity of a deep CNN and a continuous CRF. The structured inference model (the CRF) learns the unary and pairwise potentials jointly in a deep CNN framework: the CNN predicts the unary and pairwise potentials on the given superpixels, these are fed to the CRF loss layer, and the loss is backpropagated through the CNN. They showed that the integral of the partition function of the CRF can be calculated in closed form, allowing an exact solution to the maximum-likelihood estimation problem; as a result, no approximate inference is required. To attack the problem, Liu et al. [8] propose over-segmenting the image into many small regions called superpixels. Each superpixel defines a homogeneous region where depth is assumed to be constant. Based on this assumption, the CRF model is built by defining unary and pairwise potentials on superpixels and their neighbors. A rectangular patch is extracted around each superpixel so that it captures the local context of the superpixel, and the patch is then fed into a deep network (an AlexNet architecture) to estimate the depth of that superpixel; the final depth map should stay close to these per-superpixel estimates. The task of the deep network is therefore to infer depth based on the local context of the region around each superpixel. Once a depth value is inferred from the deep network, the method then estimates depth by MAP inference:
$$y^{*} = \arg\max_{y} \Pr(y \mid x)$$

The unary and pairwise potentials of the superpixels define the joint energy function:

$$E(y, x) = \sum_{p \in \mathcal{N}} U(y_p, x) + \sum_{(p,q) \in \mathcal{S}} V(y_p, y_q, x)$$

where the unary potential $U(y_p, x)$ is given by:

$$U(y_p, x; \theta) = \bigl(y_p - z_p(\theta)\bigr)^2, \quad \forall p = 1, 2, \ldots, n$$

The unary potential captures how close the inferred depth value is to the ground truth. Here $y_p$ is the depth measurement obtained from lidar; if no depth measurement is available in a region, it can be filled with a random value, and the optimization should then recover the correct depth value there. The pairwise potential $V(y_p, y_q, x)$ is given by:

$$V(y_p, y_q, x; \beta) = \tfrac{1}{2} R_{pq} (y_p - y_q)^2, \quad \forall p = 1, 2, \ldots, n$$

The pairwise potential captures the context and the properties of neighboring superpixels by measuring the similarity of their color features. It is mainly captured by $R_{pq}$, where

$$R_{pq} = \beta^{T} \bigl[S^{1}_{pq}, \ldots, S^{K}_{pq}\bigr]^{T} = \sum_{k=1}^{K} \beta_k S^{k}_{pq}$$

$\beta$ comes from the weights learnt by a single-layer network, and

$$S^{k}_{pq} = \exp\bigl(-\gamma \| s^{k}_{p} - s^{k}_{q} \|\bigr), \quad k = 1, 2, 3$$

where $s_p$ and $s_q$ are observation values of the superpixels obtained from the color, color histogram, and LBP features, and $\|\cdot\|$ is the $\ell_2$ norm. The MAP inference can be solved once we have closed-form expressions for the energy function and the partition function:

$$\Pr(y \mid x) = \frac{\exp\bigl(-E(y, x)\bigr)}{Z(x)}$$

Liu et al. [8] prove analytically that a closed-form expression is possible. The final optimization is given by:

$$\min_{\theta, \beta} \; \frac{\lambda_1}{2}\|\theta\|_2^2 + \frac{\lambda_2}{2}\|\beta\|_2^2 - \sum_{i=1}^{N} \log \Pr(y_i \mid x_i; \theta, \beta)$$
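Because the energy above is quadratic in $y$, the MAP estimate can be obtained by solving a linear system rather than by iterative inference. The sketch below illustrates this on a toy graph; it assumes each unordered neighbor pair $(p,q)$ appears once in the pairwise sum, so stationarity gives $(2I + D - R)\,y^{*} = 2z$, where $D$ is the diagonal degree matrix of the similarity weights. This is our own illustrative derivation from the equations quoted above, not code from Liu et al.'s implementation.

```python
import numpy as np

def crf_map_depth(z, edges, similarities):
    """Closed-form MAP depth over superpixels (illustrative).

    z            : (n,) CNN-predicted depth per superpixel (unary term)
    edges        : list of (p, q) neighboring superpixel index pairs, each once
    similarities : list of R_pq weights, same length as `edges`
    """
    n = len(z)
    W = np.zeros((n, n))
    for (p, q), r in zip(edges, similarities):
        W[p, q] = W[q, p] = r
    D = np.diag(W.sum(axis=1))
    L = D - W                       # graph Laplacian weighted by R_pq
    # Stationarity of E(y) = sum_p (y_p - z_p)^2 + 0.5 * sum_pq R_pq (y_p - y_q)^2
    # gives (2I + L) y = 2z, which is solved exactly: no approximate inference.
    return np.linalg.solve(2.0 * np.eye(n) + L, 2.0 * np.asarray(z))

# Toy example: three superpixels in a chain with CNN predictions z.
print(crf_map_depth([5.0, 9.0, 10.0], [(0, 1), (1, 2)], [0.8, 0.2]))
```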
3.2 Unsupervised Learning by Deep CNN

The second method we evaluated is Garg et al.'s work [9], which uses an unsupervised learning framework to estimate depth. They evaluate the depth predicted for one image of a stereo pair by how well the corresponding disparities reconstruct the other image. The method is similar to an autoencoder, and it uses both the left and right cameras provided in the KITTI dataset for calculating the reconstruction error. Although no ground truth (absolute depth values) is explicitly needed, two cameras with known focal length and baseline (to fix the scale) are required to train the network efficiently. Images are fed to the deep CNN as input and a predicted depth map is obtained at the output. Training a network normally requires a ground-truth depth map; since one is not available here, the method works around this by reconstructing the input (left) image by warping the right image with the disparity estimated for each pixel. The left and right images are captured from two side-by-side cameras with a baseline of about 0.5 m, so the idea follows directly from the principle of stereo vision. The network comprises two parts: a convolutional encoder and a deconvolutional decoder. The encoder encodes the color image so that a depth map of the same size as the color image can be inferred at the output of the deconvolutional network. A key point is the skip architecture used to sharpen the local details that get blurred as the deconvolutional layers upsample the depth map from the encoded map produced at the end of the convolutional layers. The total error of the network is given by:

$$E = \sum_{i=1}^{N} \bigl( E^{i}_{recon} + \gamma E^{i}_{smooth} \bigr)$$

Here $E$ comprises the photometric reconstruction error $E_{recon}$ and the smoothness error $E_{smooth}$, summed over all training pairs. $E_{recon}$ is obtained by warping the image of the right camera to the left camera with the disparity map, and is given by:

$$E^{i}_{recon} = \int_{\Omega} \bigl\| I^{w}_{i}(x) - I^{1}_{i}(x) \bigr\|^{2}\,dx = \int_{\Omega} \bigl\| I^{2}_{i}\bigl(x + D_{i}(x)\bigr) - I^{1}_{i}(x) \bigr\|^{2}\,dx$$

where $\Omega$ is the region over all pixels. The most interesting part is how the warping function is handled: it can be linearized by expanding $I^{2}(x + D(x))$ as a first-order Taylor series, provided the change in disparity between two iterations at that pixel remains small:

$$I^{2}\bigl(x + D^{n}(x)\bigr) = I^{2}\bigl(x + D^{n-1}(x)\bigr) + \bigl(D^{n}(x) - D^{n-1}(x)\bigr)\, I^{2}_{h}\bigl(x + D^{n-1}(x)\bigr)$$

where $I^{2}_{h}$ is the horizontal gradient of the right image. The disparity is given by $D_{i}(x) = f B / d_{i}(x)$, where $f$ is the focal length of the camera, $B$ is the baseline between the two cameras, and $d_{i}$ is the depth predicted by the network. The smoothness term is a penalty that keeps the gradient of the depth map low, a property satisfied by natural depth maps. The penalty is given by:

$$E^{i}_{smooth} = \bigl\| \nabla D_{i}(x) \bigr\|^{2}$$
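As a concrete illustration of this objective, the following NumPy sketch computes the photometric reconstruction and smoothness terms for a single rectified stereo pair, given a predicted depth map. It is a simplified, non-differentiable version of the training loss (the actual method implements the warp inside the network so gradients can flow); the nearest-pixel warp and array shapes are assumptions made for brevity.

```python
import numpy as np

def unsupervised_depth_loss(left, right, depth, f, B, gamma=0.01):
    """Photometric reconstruction + smoothness loss for one rectified pair.

    left, right : HxW grayscale images (float arrays)
    depth       : HxW predicted depth for the left view (meters)
    f, B        : focal length (pixels) and stereo baseline (meters)
    """
    H, W = left.shape
    disparity = f * B / np.maximum(depth, 1e-3)       # D(x) = f B / d(x)

    # Warp the right image toward the left view by sampling at x + D(x).
    xs = np.arange(W)[None, :] + disparity            # shifted column indices
    xs = np.clip(np.round(xs).astype(int), 0, W - 1)  # nearest-pixel sampling
    rows = np.repeat(np.arange(H)[:, None], W, axis=1)
    warped = right[rows, xs]

    e_recon = np.mean((warped - left) ** 2)           # photometric term

    # Smoothness term: penalize large gradients of the predicted disparity map.
    dx = np.diff(disparity, axis=1)
    dy = np.diff(disparity, axis=0)
    e_smooth = np.mean(dx ** 2) + np.mean(dy ** 2)

    return e_recon + gamma * e_smooth
```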
3.3 Object Classification from RGBD Image

Neural networks have a proven record of excellence in object classification. For our approach, we use the rich semantic feature information obtained from the best-performing pre-built network available. Currently, one of the best pre-built object classifiers available through MatConvNet is ResNet-50 [10]. ResNet-50 is not a conventional convolutional network but a deep residual network. Typical deep convolutional neural networks combine feature-level information extracted at multiple levels of the network in an attempt to merge high-, mid-, and low-level features. Continuously adding layers to make a network deeper only works up to a point, because of vanishing gradients and because there is no guarantee that the additional layers learn anything new of value. Residual networks address both problems: to help ensure that an added layer learns new information, the layer output has access to its input before transformation. The intuition behind this architecture is that it is easier to optimize the residual mapping than the original mapping. Deep residual networks have been shown to outperform plain deep convolutional networks, with the best-performing variant having 152 layers. We extract information from the final fully connected layer of the ResNet-50 architecture and use it as the feature input for our SVM. This ensures that the RGB features we use to classify vehicles and pedestrians are rich and discriminative; each training image is represented by a 2048-dimensional RGB feature descriptor.
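The paper extracts these descriptors with MatConvNet; an equivalent sketch in PyTorch/torchvision (our substitution, not the authors' toolchain) looks like the following, where the classification head is replaced so the network returns the 2048-dimensional pooled feature.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained ResNet-50 with the final classification layer replaced by identity,
# so the forward pass returns the 2048-D pooled descriptor per image crop.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def rgb_descriptor(crop):
    """Return a 2048-D feature vector for an HxWx3 uint8 image crop."""
    with torch.no_grad():
        x = preprocess(crop).unsqueeze(0)        # 1x3x224x224
        return resnet(x).squeeze(0).numpy()      # shape (2048,)
```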
From the depth map, we generate 3D point clouds and extract two sets of features: depth features and normal features. For the depth features, we calculate the covariance matrix of the point cloud within the bounded/masked region and use its entries as features. For the normal features, we calculate the mean and variance of the local surface normals computed for each pixel from the 3D point cloud. Surface normals are calculated by extracting a rectangular box around a pixel and selecting the points whose estimated depth lies within a given threshold; the minimum eigenvector of this point cluster then represents the surface normal for those points. In training each SVM, we make sure to standardize the feature values. We also test performance using k-fold cross-validation to ensure that our model is not overfitting; in our experiments we used k = 10.
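A compact sketch of this feature pipeline is shown below, using NumPy and scikit-learn. It assumes the region's 3D points are already available as an N x 3 array; the neighborhood handling, feature flattening, and SVM kernel are our own choices for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def surface_normal(points):
    """Normal of a local point cluster = eigenvector of the smallest eigenvalue
    of its 3x3 covariance matrix (points: Nx3 array)."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(points.T))   # ascending eigenvalues
    return eigvecs[:, 0]

def region_depth_features(points, neighborhoods):
    """Depth features: flattened covariance of the region's point cloud.
    Normal features: mean and variance of per-pixel surface normals,
    where `neighborhoods` is a list of Nx3 arrays of nearby 3D points."""
    cov_feat = np.cov(points.T).ravel()                            # 9 values
    normals = np.stack([surface_normal(nb) for nb in neighborhoods])
    normal_feat = np.concatenate([normals.mean(axis=0), normals.var(axis=0)])
    return np.concatenate([cov_feat, normal_feat])

def train_and_validate(X, y):
    """X: fused RGB + depth + normal descriptors, y: class labels."""
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    model.fit(X, y)
    return model, scores.mean()
```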
4. Datasets

We use the publicly available KITTI dataset for training, testing, and evaluation. The KITTI dataset contains images from four different cameras at fixed baselines, which can be used as a stereo camera system. In addition, it has a 64-channel Velodyne lidar that can be used for ground truth or evaluation purposes. For estimating depth, we use 56 different video scenes belonging to the categories 'city', 'road', and 'residential'. Frames can be extracted from the videos for training and testing; lidar measurements are used for evaluation only. KITTI also provides a subset meant specifically for object detection. This subset contains 7481 training and 7518 test scenes extracted from their videos; these images are independent, meaning they come from various videos and various points in time within each video. Each scene contains a variety of vehicles, pedestrians, and other points of interest. The ground truth files for this dataset contain tight bounding boxes for each object, as well as a label for the object at that location. Officially, there are eight labeled classes (Car, Truck, Van, Tram, Pedestrian, Person Sitting, Cyclist, and Misc.) plus a Don't Care region type. Both the Misc. and Tram classes contain very few examples, so we mostly ignore them. We bundle the Car, Truck, and Van classes into a single Vehicle class; likewise, the Pedestrian, Person Sitting, and Cyclist classes are combined into a single Pedestrian class. The Don't Care regions are those that cannot be labeled, typically vehicles or pedestrians that are too far from the camera or lidar sensor to yield meaningful information.

We collect a database of 500 samples per class: vehicles, pedestrians, and a class of miscellaneous objects which we refer to as the negative class 'other'. The subregions of vehicles and pedestrians in a scene are provided to us by the ground truth files of the KITTI dataset. Meaningful negative examples of varying sizes were obtained by requiring that each negative share between 5 and 25 percent region overlap with a ground truth area. This ensures that our negative examples are difficult and relevant: each contains fragments belonging to either a vehicle or a pedestrian, which should presumably help our classification performance on partially occluded objects. Additionally, this makes the classification of true vehicle and pedestrian samples more challenging, and therefore the feature descriptors that define them become more robust.
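The overlap constraint used to mine these hard negatives can be expressed as a simple filter over candidate boxes. The sketch below uses the fraction of the ground-truth box covered by the candidate as the overlap measure; the exact overlap definition is not specified in the text, so this is one plausible reading.

```python
def overlap_fraction(box, gt):
    """Fraction of the ground-truth box `gt` covered by `box`;
    boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / gt_area if gt_area > 0 else 0.0

def is_hard_negative(candidate, gt_boxes, lo=0.05, hi=0.25):
    """Keep a candidate ROI as a 'hard' negative if it overlaps some ground-truth
    object by 5-25 percent (so it contains object fragments without being a
    valid positive) and overlaps no object by more than the upper bound."""
    overlaps = [overlap_fraction(candidate, gt) for gt in gt_boxes]
    return any(lo <= o <= hi for o in overlaps) and max(overlaps, default=0.0) <= hi
```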
5. Experiments and Results

5.1 Depth Estimation

Several metrics were used to compare the depth estimates of the two methods: root mean square linear error (RMSE(linear)), root mean square logarithmic error (RMSE(log)), absolute relative difference (AbsRelDiff), and squared relative difference (SquaredRelDiff).

$$RMSE(linear) = \sqrt{\frac{1}{|T|} \sum_{y_i \in T} \| y_i - y_i^{*} \|^{2}}$$

$$RMSE(log) = \sqrt{\frac{1}{|T|} \sum_{y_i \in T} \| \log y_i - \log y_i^{*} \|^{2}}$$

$$AbsRelDiff = \frac{1}{|T|} \sum_{y_i \in T} \frac{| y_i - y_i^{*} |}{y_i^{*}}$$

$$SquaredRelDiff = \frac{1}{|T|} \sum_{y_i \in T} \frac{\| y_i - y_i^{*} \|^{2}}{y_i^{*}}$$
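These four metrics reduce to a few lines of NumPy over the pixels where lidar ground truth is available (a masked comparison is assumed here, since lidar returns are sparse):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Depth-error metrics over pixels with valid ground truth.
    pred, gt: arrays of predicted and ground-truth depth (same shape)."""
    mask = gt > 0                        # lidar ground truth is sparse
    y, y_star = pred[mask], gt[mask]
    diff = y - y_star
    return {
        "RMSE(linear)":   np.sqrt(np.mean(diff ** 2)),
        "RMSE(log)":      np.sqrt(np.mean((np.log(y) - np.log(y_star)) ** 2)),
        "AbsRelDiff":     np.mean(np.abs(diff) / y_star),
        "SquaredRelDiff": np.mean(diff ** 2 / y_star),
    }
```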
We used the 64-channel lidar for evaluation. Table 1 shows the evaluation of the depth estimates for the two methods. It clearly shows that the depth map obtained from unsupervised learning performs considerably better than the supervised learning framework: the RMSE for unsupervised learning is about 6 m, while for supervised learning it is about 12 m, which is a significant difference. Neither method generates a highly precise depth map, but the former can serve as the depth estimator for our purpose. Fig. 1 shows the RMSE depth error at different depth ranges for both methods. Again, the unsupervised method clearly outperforms the supervised one at most ranges; only at the longest range (d > 40 m) does it break down, because the principle of stereo is not good for inferring depth at long ranges. We can also see that the RMSE magnitude of the yellow bars increases quadratically as the range increases, while the supervised framework has a more consistent depth error over the whole image. The depth map estimated from the color image can generate clutter in the 3D world and might turn out to be less useful in practice. To investigate this, we generate the point cloud based on the depth map obtained from the color image. Fig. 2(a) shows the lidar points projected into the RGB image; we selected a depth range of 7-12 m to concentrate on a small region of the point cloud. When the same points are projected back into the world, we can recreate a point cloud as shown in Fig. 2(b), where the roads, poles, and traffic signs can be easily recognized. If we project the dense depth map into the 3D world within the same range, we get a lot of clutter: it is harder to recognize the poles and traffic signs, although the road can still be easily recognized. Based on this, we conclude that the dense depth map generated from the color image introduces a lot of clutter into the 3D environment. In the next section, we examine whether this dense depth map can nevertheless be used to improve object classification.
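Projecting a dense depth map back into 3D, as done for Fig. 2(c), only requires the camera intrinsics. A minimal pinhole-camera sketch (the intrinsic parameter names are generic, not KITTI's calibration fields):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, d_min=7.0, d_max=12.0):
    """Back-project an HxW depth map into 3D camera coordinates,
    keeping only points within a chosen depth range (here 7-12 m)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - cx) * z / fx                 # pinhole model: X = (u - cx) Z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    keep = (points[:, 2] >= d_min) & (points[:, 2] <= d_max)
    return points[keep]
```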
Table 1. Depth-estimation errors of the two methods.

Methods        RMSE    RMS-log   Abs-RelDiff   Sqr-RelDiff
Unsupervised   6.1     0.3       0.22          1.33
Supervised     12.29   0.6       0.74          10.16
Figure 1. Bar diagram showing the depth-estimate error of the two methods at different depth ranges.
Figure 2. Depth points projected into 3D space for Velodyne depth and for monocular depth. (a) Velodyne points projected into the image space; points within a range of 7-12 m are selected for clarity. (b) The same points projected back into 3D space, where the poles, traffic signs, and road are clearly distinguishable from one another. (c) The point cloud generated from the dense depth estimate of the monocular image; there is now a lot of clutter, and structures other than the road are hard to distinguish.
5.2 RGBD Object Classification

In this part, we fuse RGB features with depth features extracted from RGBD images and use them for object classification. We used Garg et al.'s [9] method to generate the depth maps from which the depth and normal features were extracted. Since this problem is about classification, we assume that objects have already been detected (true positives are vehicles and pedestrians; false positives are any other regions of interest), and the task is only to classify among the different objects. We concentrated on three classes: vehicle, pedestrian, and other. The 'other' class contains negative samples obtained by generating random ROIs on images. To make the negative class harder, we made sure that each negative sample's ROI overlaps a positive region by between 5 and 20 percent, while ensuring that neither bounding region is overlapped by more than 40 percent of its own area. Also, a bounding box may contain some background along with the object of interest; therefore, we create tight elliptical masks within the bounding box to free the 3D point cloud as much as possible from any background clutter that might corrupt the depth and normal features.
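The elliptical mask is straightforward to construct: the ellipse is inscribed in the bounding box, and only pixels inside it contribute 3D points to the depth and normal features. A small sketch follows (the inscribed-ellipse shape is our reading of "tight elliptical masks within the bounded box"):

```python
import numpy as np

def elliptical_mask(h, w):
    """Boolean HxW mask of the ellipse inscribed in an h x w bounding box."""
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    y, x = np.mgrid[0:h, 0:w]
    return ((x - cx) / (w / 2.0)) ** 2 + ((y - cy) / (h / 2.0)) ** 2 <= 1.0

# Usage: keep only the 3D points whose pixels fall inside the ellipse.
# mask = elliptical_mask(box_h, box_w); points = points_in_box[mask.ravel()]
```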
While the baseline RGB-feature SVM works quite well for our three-class problem, it is evident that adding depth information helps class discrimination, even if only slightly. The region normals provided more improvement than the plain depth features, and the combination of both normals and depth information gave the largest improvement in classification performance. Because the ground truth bounding boxes were tightly fit around an object, we expected that masking the region would increase the significance of the depth information: pixels near the center should be more relevant than pixels in the far corners, which are less likely to contain the actual target. It turns out that adding this mask hurt our overall performance slightly.
Table 2. SVM accuracy using various features, with masks.

Methods    RGB    RGB+Depth   RGB+Normals   RGB+Depth+Normals
Accuracy   93.5   93.59       93.69         94.38

Table 3. SVM accuracy using various features, without masks.

Methods    RGB    RGB+Depth   RGB+Normals   RGB+Depth+Normals
Accuracy   93.5   93.67       93.77         94.37
6. Conclusion

Our study has shown that fusing depth extracted from monocular images with color image features improves object classification accuracy, even though the estimated depth introduces clutter into the 3D point cloud. The overall improvements are small, but they are consistent with what previous methods have reported. For example, [11] implements a fully convolutional network for object segmentation. The authors briefly report on combining depth information from the NYUDv2 dataset [12], which was collected with a Microsoft Kinect. Some of their earlier attempts to integrate depth information led only to marginal performance increases, which they attribute to the difficulty of passing meaningful depth gradients through their network. However, by following the depth encoding defined in [4], they obtained segmentation improvements of up to 5 percent. That encoding represents each depth map by the horizontal disparity, the height above ground, and the angle of the local surface normal with an inferred gravity direction. Therefore, we think that by encoding the depth map into more meaningful features (e.g., height above ground, angle of the surface normal with the inferred gravity direction), we might be able to lift performance significantly. Li et al. developed AFI-Net for RGB-D saliency detection [13], which, like our proposed method, builds on image features. Additionally, the dataset we worked with contained many ground truth samples of vehicles and pedestrians that were a good distance from the camera, which may have reduced the usefulness of the depth information. One of the reasons depth information might be poor at far ranges is that we used the principle of stereo to train the CNN architecture: stereo depth error increases quadratically with range, as is also evident in Fig. 1, where at ranges beyond 50 m the unsupervised learning approach fails to compete with the supervised approach on depth estimation.

The purpose of this research was to explore the limits of monocular cameras in estimating depth and perceiving the 3D environment based on the enhanced prediction. We mainly studied object classification performance using RGB-D information: we fused depth maps predicted from a monocular camera with RGB imagery for improved object classification and concluded that the depth information improves classification performance. In the future, we plan to build more meaningful depth features, as opposed to statistical measures of the point cloud, to see if that can improve performance further. We also realized that evaluation performance degrades for far-away objects because the current method is poor at depth prediction at long ranges, so we plan to develop an algorithm that works for both short-range and long-range distances. We also plan to train a neural network directly on RGB-D channels so that the network learns the combined feature representation for us, which might improve classification performance further. This could also help us segment ground planes (road surfaces), walls, and buildings by clustering and analyzing the surface normals, giving important insight into the complex field of scene understanding. Analyzing RGB images together with depth maps estimated from a monocular camera alone presents a research opportunity that can be exploited for improved perception of the 3D environment.
7. Funding

This research was partially funded by GameAbove.

8. Conflict of Interest

The authors declare no conflict of interest.

9. Availability of Data and Material

We used the KITTI object detection dataset, which is freely available online.

10. Code Availability

The code for this research will be shared once the paper has been published.
11. Authors' Contributions

Dr. Mehmet Akif Alper designed and developed this research; Dr. Saif Imran contributed prior experience and assisted with writing the paper.

12. Acknowledgements

The authors are grateful to GameAbove for partially funding this research.

13. References
1. S. Pu, W. Zhao, W. Chen, et al., "Unsupervised object detection with scene-adaptive concept learning," Frontiers of Information Technology & Electronic Engineering, vol. 22, pp. 638-651, 2021. https://doi.org/10.1631/FITEE.2000567
2. R. Kaur and S. Singh, "A comprehensive review of object detection with deep learning," Digital Signal Processing, vol. 132, 2023.
3. M. Schwarz, H. Schulz, and S. Behnke, "RGB-D object recognition and pose estimation based on pretrained convolutional neural network features," in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 1329-1335.
4. S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in European Conference on Computer Vision. Springer, 2014, pp. 345-360.
5. X. Song, Y. Dai, and X. Qin, "Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network," arXiv preprint arXiv:1607.01977, 2016.
6. D. Herrera, J. Kannala, J. Heikkilä, et al., "Depth map inpainting under a second-order smoothness prior," in Scandinavian Conference on Image Analysis. Springer, 2013, pp. 555-566.
7. D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650-2658.
8. F. Liu, C. Shen, G. Lin, and I. Reid, "Learning depth from single monocular images using deep convolutional neural fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2024-2039, 2016.
9. R. Garg, G. Carneiro, and I. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue," in European Conference on Computer Vision. Springer, 2016, pp. 740-756.
10. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015. http://arxiv.org/abs/1512.03385
11. E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," CoRR, vol. abs/1605.06211, 2016. http://arxiv.org/abs/1605.06211
12. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012.
13. L. Li, et al., "AFI-Net: Attention-guided feature integration network for RGBD saliency detection," Computational Intelligence and Neuroscience, vol. 2021, 2021.