Garnett Real-Time Category-Based and ICCV 2017 Paper
Abstract
Detecting obstacles, both dynamic and static, with near-to-perfect accuracy and low latency, is a crucial enabler of autonomous driving. In recent years obstacle detection methods increasingly rely on cameras instead of Lidars. Camera-based obstacle detection is commonly solved by detecting instances of known categories. However, in many situations the vehicle faces un-categorized obstacles, both static and dynamic. Column-based general obstacle detection covers all 3D obstacles but does not provide object-instance classification, segmentation and motion prediction. In this paper we present a unified deep convolutional network combining these two complementary functions in one computationally efficient framework capable of real-time performance. Training the network uses both manually and automatically generated annotations using Lidar. In addition, we show several improvements to existing column-based obstacle detection, namely an improved network architecture, a new dataset and a major enhancement of the automatic ground truth algorithm.

Figure 1. Unified method output examples. General obstacles are marked by blue bars. The bottom position of each bar is determined by the max probability bin of the StixelNet branch (See Section 2.1). Bar height is shown for display purposes only, and computed as a linear function of the bottom image position. Cars, pedestrians and cycles are marked as bounding boxes and color coded by their class. Notice in the examples that real world driving involves complex scenarios requiring both types of obstacles to be detected.

1. Introduction

Autonomous vehicles depend on detecting static and dynamic obstacles in real-time and predicting their behavior with no room for error. High resolution Lidar has been the go-to sensor in this domain for most projects aiming at Level 5 automation such as the winning entries in the DARPA Urban Challenge [33] and Google’s self driving car [14]. Recently, more projects aim at a camera-centric approach [23, 2, 1] due to the disadvantages of high-resolution Lidars (cost, packaging, moving parts) and the boost in computer vision accuracy. To fully or partially replace Lidars, vision-based obstacle detection should reach at least the same performance. We divide the task into two main sub-tasks: categorized and general obstacle detection.

Category-based detection (aka detection and classification) has been extensively studied in computer vision [7, 10], leading to dramatic performance improvements in recent years [28, 6, 22]. In the common scenario, during deployment, a bounding box and class is marked for each object belonging to a set of pre-defined classes. In addition, the object’s pose may be assigned [25]. This is particularly useful since knowing the class and pose of the object is instrumental to predicting its behavior and motion. However, objects outside the pre-defined class set are not detected.

A complementary enabling technology is free-space or general obstacle detection [3, 19]. The task is to identify in each image column the row position of the nearest roughly vertical obstacle. In our formulation obstacles consist of every object higher than a typical curb. This allows detecting all relevant static and dynamic obstacles, both on the road and on the sidewalk. Several examples include construction zone delimiters, shopping carts, un-categorized animals and toddlers. We introduce a method performing both types of obstacle detection in one unified network capable of running in real-time (30fps) and sharing most computation between the tasks. Figure 1 shows detection examples of our network in three different scenes. Notice the complementary nature of the two capabilities. In addition, for each categorized object, the network is capable of estimating its pose with negligible additional computation cost.

Our contributions are as follows. First, we introduce a novel unified network architecture for categorized object detection, pose estimation and general obstacle detection running in real time. The architecture is single-shot and learned end-to-end. It combines Single-Shot Multi-Box Detector (SSD) [22] for categorized object detection and pose estimation, with our version of the StixelNet [19] for general obstacles. Training data consists of images with both manually and automatically generated ground truth (GT). Second, we introduce several significant improvements to the StixelNet: a new network architecture, a new automatic ground truth method for generating training data and the use of a newly collected dataset. In our experiments, we improve state-of-the-art general obstacle detection. The combined multi-task network maintains the accuracy of the single-task networks while sharing most computation between the tasks. Finally, we show that column-based general obstacle detection can be generalized to cope with substantially different cameras and viewing angles.

1.1. Related Work

Network architectures for modern object detectors are divided into single shot [22, 27] and region-proposal based [28, 6]. The output of such detectors is a tight bounding box around each object instance. Recently, it has been shown that with little computational overhead rich additional information can be extracted for each instance. This includes object pose [25], instance segmentation [15] and 3D bounding box estimation [4]. For our combined network we built our architecture on the SSD [22] which was shown to have the best accuracy when processing time is the first priority [16].

There exist several approaches for general obstacle or free space detection. We follow the column-based approach [3, 34]. In particular we further develop and improve the StixelNet monocular approach introduced in [19]. This representation of the free space is both compact and useful since it can be efficiently translated to the occupancy grid representation commonly used by autonomous agents. A contrasting approach is based on pixel-level road segmentation [24, 32]. It has the advantage of detecting free space areas not immediately reachable at the expense of an over-parametrization: an output for each pixel. [26] detect unexpected obstacles using a joint geometric and deep learning approach. In [29] stereo-based stixels and pixel-level segmentation are combined to obtain “Semantic Stixels”: Stixels with class labels. The most related to our work is MultiNet [32], a combined network for road segmentation and object detection. In comparison our approach uses column-based detection and operates over two times faster on comparable hardware.

We believe automated ground truth (AGT) methods will be instrumental in next generation automotive computer vision. Their main appeal is the ability to collect large amounts of training data with a relatively low effort. Recent benefits of the approach were shown in learning monocular depth estimation from Lidar [18] and Stereo [11]. [19] have shown the effectiveness of AGT for general obstacle detection. Lidar is very efficient at detecting free-space and accuracy can be obtained by ignoring low confidence columns. In this paper we introduce a new and improved algorithm for this task.

Finally, while numerous datasets with vehicle on board imagery exist [13, 5] only few exist from low mounted fisheye lens cameras [20]. Such setup is in wide use for surround view applications and requires special adaptation of computer vision methods. We introduce a new such dataset with automatically generated GT. The remainder of the paper is organized as follows: We start with the description of the combined network followed by an experimental validation on multiple datasets and conclusions.

2. Combined network for obstacle detection

We next describe our architecture for each of the three tasks we address: general obstacle detection, object detection and pose estimation. We then present the combined network architecture and its training procedure.

2.1. General obstacle detection

Our network for general obstacle detection is derived from the StixelNet [19]. The network gets as input an image of arbitrary width and predefined height Ih = 370. The final desired result is the pixel location y of the closest obstacle bottom point in each image column (with stride s). As in the original StixelNet the network outputs for each column are represented by k position neurons, representing the centers of k equally spaced bins in the y image axis. The output of each neuron represents the predicted probability that the obstacle position is in the respective bin.

An important addition we introduce is the ability to handle two edge cases: the presence of a near obstacle truncated by the lower image boundary (“near obstacle”), and no obstacle presence in the column (“clear”). This was not previously handled since the Automatic Ground Truth (AGT) [19] did not detect such columns and hence they were not included in the training set. Our AGT described below does handle these cases, however since the representation of these in the training set is extremely imbalanced, we introduced the following modification to the network: additional per-column type neurons with three possible
output values: “near obstacle”, “clear” and “regular” according to the aforementioned cases. This dual per-column representation is combined to one position output as follows: if the type is one of the edge cases, the probability of the first or last position neuron is set to the type probability respectively. The rest of the position neurons are re-scaled proportionally so that the probabilities sum to 1. During training, Softmax-loss is used for the type-neurons, while the PL-loss [19] is used for the position neurons. PL-loss has been shown to be effective in regression problems such as ours, which require a multi-modal representation while being able to preserve order information between neighboring position neurons.

Figure 2. Example of the Lidar-based automatic ground truth for obstacle detection. Each obstacle instance is colored uniquely and obstacle bottom contour is marked in white.
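
The way the per-column type and position outputs of Section 2.1 are merged can be illustrated with a short sketch. It is written in plain Python and is not the authors' code. Several details are assumptions made only for illustration: the ordering of the three type neurons, the use of the arg-max type to decide whether a column is an edge case, and the mapping of "near obstacle" to the first bin and "clear" to the last bin. The names `bin_centers` and `merge_type_and_position` are hypothetical.

```python
from typing import List, Sequence

# Hypothetical ordering of the three per-column type neurons.
NEAR, CLEAR, REGULAR = 0, 1, 2


def bin_centers(image_height: int, k: int) -> List[float]:
    """Centers (in pixels) of k equally spaced bins along the y image axis."""
    bin_size = image_height / k
    return [(i + 0.5) * bin_size for i in range(k)]


def merge_type_and_position(position: Sequence[float],
                            column_type: Sequence[float]) -> List[float]:
    """Fold the type output of one column into a single position distribution."""
    winner = max(range(3), key=lambda i: column_type[i])
    if winner == REGULAR:
        # Regular column: the position neurons are used unchanged.
        return list(position)
    # Assumption: first bin stands for "near obstacle", last bin for "clear".
    edge = 0 if winner == NEAR else len(position) - 1
    p_edge = column_type[winner]
    rest = sum(p for i, p in enumerate(position) if i != edge)
    # Re-scale the remaining bins proportionally so that everything sums to 1.
    scale = (1.0 - p_edge) / rest if rest > 0 else 0.0
    merged = [p * scale for p in position]
    merged[edge] = p_edge
    return merged


if __name__ == "__main__":
    position = [0.1, 0.2, 0.4, 0.3]       # k = 4 position neurons
    column_type = [0.7, 0.1, 0.2]         # near, clear, regular
    print(merge_type_and_position(position, column_type))
    print(bin_centers(370, 5))
```

With the illustrative inputs above, the first bin takes the "near obstacle" probability and the other three bins keep their relative proportions while sharing the remaining mass.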

Following [19] we use AGT to detect the obstacle position in each image column using the Lidar point cloud. The original AGT suffers from several drawbacks which we address. The proposed AGT has two main differences: an object-centric obstacle bottom position estimation and column-type detection. The target of AGT is to bring fully reliable and consistent GT while covering as many columns as possible. The new AGT, described in detail next, provides high reliability while covering a much larger percent of the columns as shown in the experimental section.

The AGT is composed of two algorithms: one detects for each obstacle its bottom contour in the image, and the second detects columns which are certain to be clear of obstacles. Both algorithms get as input the 3D point cloud, from which ground plane points are removed by fitting a plane model. We start by describing the remaining stages of the first algorithm.

The 3D point cloud is separated into clusters in 3D. Obstacles (clusters) with maximal height (position above ground) below 20cm are ignored. The algorithm continues processing each cluster separately. 3D points are projected onto the image and dilated, resulting in a dense point blob. Let IB be the binary image of all lowest points in each image column. IB is smoothed by a Gaussian kernel. Finally, a 1-dimensional conditional random field is applied to find an optimal trajectory with maximal value in IB and minimal y-position discontinuity. This trajectory is considered the obstacle’s bottom contour. An example result is depicted in Figure 2.

Note that the Lidar points do not cover the entire image. Therefore a special handling is required for near objects whose bottom is below the Lidar coverage. We first detect such cases, which occur when the bottom-most Lidar point of an obstacle cluster is above the ground. Then, we project these points to the road plane in 3D and add them to the cluster.

The second part of the AGT consists of detecting “clear” columns. There are two conditions to be met for such a column: all points in it (projected from 3D) are lower than 5cm and there exist points beyond 18 meters in distance. The second condition prevents columns from being classified as clear although there is a close dark object absorbing the Lidar beams.

When training on a cropped image patch, if the bottom-most cropping position is above or at the GT bottom of a valid column, then it is classified “near obstacle” (e.g. the right-most object in Figure 2). Compared to the automatic ground truth procedure described in [19], which operates directly in the image domain, our object-centric approach takes advantage of the obstacles' continuity to output a more complete and smooth ground truth annotation.

We next describe the method for object detection and pose estimation, trained using manual annotations, in contrast to the general obstacle detection.

2.2. Category-based object detection

Our object detection is based on the SSD framework [22] which provides an excellent run-time vs. accuracy trade-off. The network is trained with four classes: vehicles, pedestrians, cycles (bicycles + motorcycles) and background. We found a slight modification to the ground truth bounding box association in the training procedure to improve the network accuracy. When a GT bounding box is associated to a proposal, a hard limit on the overlap ratio is originally used to classify the proposal as true or false. In our version a buffer zone in the overlap ratio value (0.4-0.6) is defined in which proposals are ignored. This helps prevent ambiguities in the association process and better defines class and background training samples. In addition we modify the learning such that difficult examples are ignored instead of treated as background. We used a version of the code provided by the authors in which we optimized the network deployment efficiency.

2.3. Object pose estimation

Object pose is defined for a car by its heading direction angle in top view ΘH in camera-centric coordinates. The pose angle is defined as ΘP = ΘH − ΘC where ΘC is the angle of the line passing through the camera and car center position. For representation in the network ΘP is discretized to 8 equally spaced bins between 0 and 2π. For each bounding box proposal, the network outputs the probability of each angle bin. A cyclic version of the PL-loss is used to train the output layer against the continuous ground-truth ΘP. Supporting the cyclic nature of the output, cyclic-PL-loss considers first and last bins as neighboring, such that angles close to zero contribute to both. The SSD architecture is modified by adding, per proposal, 8 pose neurons, similar to those for class and box regression.

2.4. The combined network

The combined network architecture is illustrated in Figure 3. The feature extraction layers are based on the GoogLeNet [31] backbone. From this, two main branches split: object/pose (SSD+Pose) and general obstacle detection (StixelNet). The StixelNet branch is trained with AGT as described previously while the object detection, classification and pose estimation are trained with manually labeled data. Inspired by GoogLeNet [31] and VGG [30], our version of StixelNet uses a deeper architecture than the one described in [19]. Feature map sizes in Figure 3 correspond to an 800 × 370 image. Note however that the two branches may operate on different image sizes by cropping feature maps accordingly when branching out.

The combined network objective loss is defined as a linear combination of the SSD object classification, bounding box regression, pose estimation, stixel-position and stixel-type losses with relative weights of 1, 0.5, 0.5, 1, 1 respectively. These weights were experimentally set to minimize accuracy loss on each task in the combined network. We start all training sessions with the GoogLeNet layers pre-trained on the ImageNet [8] as provided by the authors. We found that fine-tuning the network with the combined objective produces degraded results. Instead, we first train the SSD branch without pose, then fix all weights and train pose neurons only, then fix again and train the StixelNet branch, and finally allow all network weights to freely learn the combined objective loss.

Data-set name / source     Objects  Pose  Stixels  # images  # instances
kitti-objects-train [13]   X        X              5K        15K
Cityscapes [5]             X                       3K        15K
TDCB [21]                  X                       9K        15K
Caltech-peds [9]           X                       32K       53K
internal-objects-train     X                       3K        19K
internal-pose-train                 X              155K      160K
kitti-stixels-train [12]                  X        6K        5M
internal-stixels-train                    X        16K       20M
kitti-objects-test [13]    X        X              891       3.5K
internal-objects-test      X                       1K        8K
internal-stixels-test                     X        910       19K
kitti-stixels-test [12]                   X        760       11K
Table 1. Datasets used in paper

Three datasets in Table 1 were internally collected: object-internal, pose-internal and stixel-internal. The first two, used for object detection and pose estimation respectively, were collected with a roof top mounted camera similar in setting to the kitti-dataset [13] and manually labeled. The stixel-internal dataset is aimed at short range general obstacle detection with a fisheye-lens camera mounted in the position typical of production surround vision systems. To obtain accurate automatic ground truth we mounted a Velodyne HDL64 Lidar right below the camera as depicted in Figure 4. Co-locating the sensors eliminates differences stemming from viewpoint variation. The camera is triggered to capture an image every time the Lidar is pointed directly forward. Each image is corrected before processing by virtually rotating the camera to forward view, and un-distorting it to a cylindrical image plane. Figure 1 bottom left shows detection results on an image from the test portion of this dataset.

3.1. Implementation details
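
The combined objective and the staged training schedule of Section 2.4 can be written down compactly. The sketch below is only an illustration in plain Python, not the authors' training code: the loss weights are the ones quoted in the text, while the dictionary keys, the parameter-group names and the assumption that the first stage also updates the backbone are hypothetical choices made here.

```python
from typing import Dict, List, Tuple

# Relative loss weights quoted in Section 2.4 (key names are hypothetical).
LOSS_WEIGHTS: Dict[str, float] = {
    "ssd_class": 1.0,
    "ssd_box": 0.5,
    "pose": 0.5,
    "stixel_position": 1.0,
    "stixel_type": 1.0,
}

# Staged schedule: (trainable parameter groups, losses active in that stage).
# Group names are placeholders; training the backbone in stage 1 is an assumption.
STAGES: List[Tuple[List[str], List[str]]] = [
    (["backbone", "ssd"], ["ssd_class", "ssd_box"]),          # SSD without pose
    (["pose"], ["pose"]),                                     # pose neurons only
    (["stixelnet"], ["stixel_position", "stixel_type"]),      # StixelNet branch
    (["backbone", "ssd", "pose", "stixelnet"], list(LOSS_WEIGHTS)),  # joint
]


def combined_loss(losses: Dict[str, float], active: List[str]) -> float:
    """Weighted sum of the per-task losses that are active in a stage."""
    return sum(LOSS_WEIGHTS[name] * losses[name] for name in active)


if __name__ == "__main__":
    # Made-up per-task loss values, used only to exercise the function.
    example = {"ssd_class": 2.0, "ssd_box": 1.0, "pose": 4.0,
               "stixel_position": 3.0, "stixel_type": 1.0}
    for trainable, active in STAGES:
        print(trainable, combined_loss(example, active))
```

Keeping the schedule as data makes the order of the four stages explicit and leaves the choice of framework open.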
Figure 3. Combined network architecture (diagram not reproduced). Labels recoverable from the diagram: an 800 × 370 input image; the GoogLeNet backbone through the inception_4 layer; the SSD + Pose layers, with extra layers labeled Conv 1X1*256 / s1, Conv 3X3*512 / s2, Conv 1X1*128 / s1 and Conv 3X3*256 / s2, detection heads labeled Conv 3X3 * (4 (priors) * (3 (class confidences) + 4 (box) + 8 (pose))) and Conv 3X3 * (6*15), and non-maximal suppression; and the StixelNet layers.
DataSet:         {kitti, internal}-objects-test kitti-objects-test   kitti-stixels-test    internal-stixels-test   Run-time
Test:            Car     Pedestrian   Bicycle   Pose                 Max Pr.   Avg. Pr.    Max Pr.   Avg. Pr.      ms/f
SSD-only         0.901   0.574        0.560     -                    -         -           -         -             27
SSD + pose       0.900   0.578        0.541     0.890                -         -           -         -             28
StixelNet-only   -       -            -         -                    0.854     0.824       0.827     0.774         19
Combined net     0.900   0.570        0.536     0.892                0.825     0.788       0.824     0.772         33
Table 2. Accuracy vs. run-time measures (in milliseconds per frame) on the three tasks for the combined net and its subsets. See text for more details on accuracy measures and datasets.
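
As a quick check of the run-time column of Table 2, the per-frame times can be converted to frame rates and relative overheads. The snippet below is a small arithmetic sketch; the function names are hypothetical, and the comparison against running the single-task networks back to back assumes that their run-times would simply add, which the paper does not state.

```python
# Run-times in milliseconds per frame, copied from Table 2.
RUNTIME_MS = {
    "SSD-only": 27,
    "SSD + pose": 28,
    "StixelNet-only": 19,
    "Combined net": 33,
}


def frames_per_second(ms_per_frame: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


def relative_overhead(combined_ms: float, baseline_ms: float) -> float:
    """Extra run-time of the combined network as a fraction of a baseline."""
    return (combined_ms - baseline_ms) / baseline_ms


if __name__ == "__main__":
    combined = RUNTIME_MS["Combined net"]
    print(f"combined net: {frames_per_second(combined):.1f} fps")
    print(f"overhead vs SSD-only: {relative_overhead(combined, RUNTIME_MS['SSD-only']):.0%}")
    # Assumption: separate networks run one after the other, so times add up.
    separate = RUNTIME_MS["SSD + pose"] + RUNTIME_MS["StixelNet-only"]
    print(f"saving vs separate networks: {1 - combined / separate:.0%}")
```

The overhead computed from the rounded table entries (6 ms on top of 27 ms) is slightly above the 20% quoted in the text, which is consistent with rounding of the reported timings.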

[...] network versus the SSD-only adds 20% to the total run-time, meaning most computation is shared between the tasks. At 30 frames per second (33ms/f) the combined network is well suited to real-time operation even with more power efficient GPUs.

3.3. General obstacle detection

Due to the improved AGT, dataset and architecture the general obstacle detection module performs significantly better than the originally proposed StixelNet. Figure 5 illustrates these differences on some examples from the kitti-stixels-test set. The differences are most apparent, not surprisingly, for the edge cases (near object and clear column). The significant qualitative improvement has much to do with the new AGT which provides a much more complete coverage: 81% (internal-stixels-dataset) and 69% (kitti-stixels-dataset) of the columns are provided with valid ground truth. For comparison the original AGT provides a 25% coverage on the kitti-stixels-dataset. Specifically, the edge cases were almost completely excluded in the previous AGT.

In Table 3 we show consistent performance improvement on the new test set compared to the original StixelNet. To have a fairer comparison we also show the results when edge cases are excluded from the test set, since the original StixelNet was not trained with such cases. A full ablation study to single out the different factors for the improvement is prohibitive since the new network architecture, training set, test set and AGT are coupled together, and in fact solve a slightly altered, more complete problem than the original one. We compared the effect of the backbone change in the architecture by altering the original StixelNet to use the same backbone as in our implementation (GoogLeNet [31]). The network was trained and tested on the original test set from [19]. Results show a marginal improvement in the accuracy (1% or less in all measures) indicating the backbone is not an important factor in the improvement.

                      kitti-stixels-test    internal-stixels-test
                      Max Pr.   Avg. Pr.    Max Pr.   Avg. Pr.
Original              0.689     0.671       0.681     0.654
Ours                  0.854     0.824       0.827     0.774
Edge cases excluded from test
Original              0.779     0.758       0.704     0.678
Ours                  0.827     0.807       0.819     0.760
Table 3. Comparison of original StixelNet [19] and ours on stixel test sets

To test the transferability of StixelNet from one camera setup to another we do a cross examination of the method trained and tested on both kitti and internal datasets. Note the large difference: a roof mounted forward camera versus a low mount fisheye-lens tilted downwards. As summarized in Table 4, highest accuracy on each test set is attained when trained only on the corresponding train set or on the combined one. Having negligible degradation in accuracy when trained on the entire dataset suggests that the network learned a general representation transferable to different cameras.

Train / Test          kitti-stixels-test    inter.-stixels-test
                      Max P.    Avg P.      Max P.    Avg P.
kitti-stixels-tr.     0.855     0.827       0.740     0.721
inter.-stixels-tr.    0.685     0.679       0.833     0.776
both                  0.854     0.824       0.827     0.774
Table 4. Generalization across datasets

4. Conclusions and future work

We presented a unified network with real-time detection capability for both categorized and uncategorized objects. The network is trained with a combination of manual and automatic ground truth based on Lidar. Our novel automated ground truth (AGT) algorithm covers most image parts, facilitating the learning of a generic obstacle detection module. Using the new AGT, in combination with a new network architecture and dataset, our version of the StixelNet improves state-of-the-art column-based general obstacle detection. We believe future research should focus on obtaining a unified AGT process that covers all aspects of obstacle detection.

References

[1] https://www.tesla.com/blog/all-tesla-cars-being-produced-now/
Figure 5. StixelNet example results on kitti-test dataset. Left: Ours, Right: [19].
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
[16] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[18] Y. Kuznietsov, J. Stückler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[19] D. Levi, N. Garnett, and E. Fetaya. Stixelnet: A deep convolutional network for obstacle detection and road segmentation. In Proceedings of the British Machine Vision Conference (BMVC), pages 109.1–109.12. BMVA Press, September 2015.
[20] D. Levi and S. Silberstein. Tracking and motion cues for rear-view pedestrian detection. In 2015 IEEE 18th International Conference on Intelligent Transportation Systems, pages 664–671, Sept 2015.
[21] X. Li, F. Flohr, Y. Yang, H. Xiong, M. Braun, S. Pan, K. Li, and D. M. Gavrila. A new benchmark for vision-based cyclist detection. In 2016 IEEE Intelligent Vehicles Symposium (IV), pages 1028–1033, June 2016.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
[23] R. Metz. Autox has built a self-driving car that navigates with a bunch of 50usd webcams. MIT Technology Review, March, 2017.
[24] G. L. Oliveira, W. Burgard, and T. Brox. Efficient deep models for monocular road segmentation. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4885–4891, Oct 2016.
[25] P. Poirson, P. Ammirato, C. Fu, W. Liu, J. Kosecka, and A. C. Berg. Fast single shot detection and pose estimation. CoRR, abs/1609.05590, 2016.
[26] S. Ramos, S. K. Gehrig, P. Pinggera, U. Franke, and C. Rother. Detecting unexpected obstacles for self-driving cars: Fusing deep learning and geometric modeling. CoRR, abs/1612.06573, 2016.
[27] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 779–788. IEEE Computer Society, 2016.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.
[29] L. Schneider, M. Cordts, T. Rehfeld, D. Pfeiffer, M. Enzweiler, U. Franke, M. Pollefeys, and S. Roth. Semantic Stixels: Depth is not enough. In IEEE Intelligent Vehicles Symposium, Proceedings, pages 110–117, Piscataway, NJ, 2016. IEEE.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015.
[32] M. Teichmann, M. Weber, J. M. Zöllner, R. Cipolla, and R. Urtasun. Multinet: Real-time joint semantic reasoning for autonomous driving. CoRR, abs/1612.07695, 2016.
[33] C. Urmson, J. Anhalt, J. A. D. Bagnell, C. R. Baker, R. E. Bittner, J. M. Dolan, D. Duggins, D. Ferguson, T. Galatali, H. Geyer, M. Gittleman, S. Harbaugh, M. Hebert, T. Howard, A. Kelly, D. Kohanbash, M. Likhachev, N. Miller, K. Peterson, R. Rajkumar, P. Rybski, B. Salesky, S. Scherer, Y.-W. Seo, R. Simmons, S. Singh, J. M. Snider, A. T. Stentz, W. R. L. Whittaker, and J. Ziglar. Tartan racing: A multi-modal approach to the darpa urban challenge. Technical Report CMU-RI-TR-, Pittsburgh, PA, April 2007.
[34] J. Yao, S. Ramalingam, Y. Taguchi, Y. Miki, and R. Urtasun. Estimating drivable collision-free space from monocular video. In 2015 IEEE Winter Conference on Applications of Computer Vision, pages 420–427, Jan 2015.