Visual Navigation for Micro Air Vehicles

Abraham Bachrach∗, Albert S. Huang∗, Daniel Maturana†, Peter Henry‡, Michael Krainin‡, Dieter Fox‡, and Nicholas Roy∗
∗ Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
Email: {abachrac,albert,nickroy}@csail.mit.edu
† Department of Computer Science, Pontificia Universidad Católica de Chile, Santiago, Chile. Email: [email protected]
‡ Department of Computer Science & Engineering, University of Washington, Seattle, WA

Email: {peter, mkrainin, fox}@cs.washington.edu

Abstract—Recent advances in state estimation, planning, mapping, and control algorithms, combined with ever-increasing computational power available onboard micro air vehicles, have enabled a number of researchers to demonstrate advanced autonomous capabilities in GPS-denied environments. Much of this work has focused on the use of laser range-finder sensors; however, the use of a 2D sensor in a 3D environment requires a number of assumptions that limit the ability of the MAV to operate in unconstrained environments. In this talk, we describe a system for visual odometry and mapping using a stereo or RGB-D camera. By leveraging results from recent state-of-the-art algorithms and hardware, our system enables 3D flight and planning in cluttered environments using only onboard sensor data.

Fig. 1. Our quadrotor micro-air vehicle (MAV). The RGB-D camera is mounted at the base of the vehicle, tilted slightly down.
I. INTRODUCTION

Stable and precise control of an autonomous micro-air vehicle (MAV) demands fast and accurate estimates of the vehicle's pose and velocity. Previously, we have developed algorithms for MAV flight in cluttered environments using planar LIDAR [2]. LIDAR sensors provide range measurements with unparalleled precision, but are unable to detect objects that do not intersect the sensing plane. Thus, they are most useful in environments characterized by vertical structures, and less so in more complex scenes. We have also developed a system for controlling the vehicle using stereo cameras [1]; however, that algorithm was too slow to run on the vehicle.

Estimating a vehicle's 3D motion from sensor data typically consists of first estimating its relative motion at each time step from successive frames or scans. The 3D trajectory is then obtained by integrating the relative motion estimates. While often useful for local position control and stability, these methods suffer from long-term drift and are not suitable for building large-scale maps. To solve this problem, we incorporate our previous work on RGB-D Mapping [8], which detects loop closures and maintains a representation of consistent pose estimates for previous frames.

This paper presents our approach to providing an autonomous MAV with fast and reliable state estimates and a 3D map of its environment by using an onboard stereo or RGB-D camera and inertial measurement unit (IMU). Together, these allow the MAV to safely operate in cluttered, GPS-denied indoor environments. More detailed coverage of the work presented here can be found in Huang et al. [12].

II. APPROACH

The problem we address is that of a quadrotor helicopter navigating in an unknown environment. The quadrotor must use its onboard sensors to estimate its own position (local estimation), build a dense 3D model of the environment (global simultaneous localization and mapping), and use this model to plan trajectories through the environment.

Our algorithms are implemented on the vehicle shown in Figure 1, a Pelican quadrotor manufactured by Ascending Technologies GmbH. The vehicle has a maximal dimension of 70 cm and a payload of up to 1000 g. We have mounted a stripped-down Microsoft Kinect sensor, which is connected to the onboard flight computer. The flight computer, developed by the Pixhawk project at ETH Zurich [15], has a 1.86 GHz Core2Duo processor with 4 GB of RAM and is powerful enough to allow all of the real-time estimation and control algorithms to run onboard the vehicle.

Following our previous work, we developed a system that decouples the real-time local state estimation from the global simultaneous localization and mapping (SLAM). The local state estimates are computed from visual odometry (Section II-A); to correct for drift in these local estimates, the estimator periodically incorporates position corrections provided by the SLAM algorithm (Section II-B). This architecture allows the SLAM algorithm to use much more processing time than would be possible if its state estimates were used directly to control the vehicle.
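To make the decoupling concrete, the following is a minimal sketch, not the authors' implementation, of a local estimator that integrates visual-odometry increments at frame rate and folds in delayed position corrections from SLAM by retroactively adjusting a stored pose history (the mechanism described in Section II-C). The class and method names are hypothetical, and poses are simplified to 4×4 homogeneous transforms with no IMU fusion.

```python
import numpy as np

class LocalStateEstimator:
    """Illustrative sketch only; names and structure are hypothetical."""

    def __init__(self, initial_pose=None):
        # History of (timestamp, pose) pairs; poses are 4x4 homogeneous transforms.
        start = np.eye(4) if initial_pose is None else initial_pose
        self.history = [(0.0, start)]

    def on_visual_odometry(self, t, delta):
        """Integrate a frame-to-frame relative motion estimate at frame rate."""
        _, last_pose = self.history[-1]
        self.history.append((t, last_pose @ delta))

    def on_slam_correction(self, t_corr, corrected_pose):
        """Fold in a delayed SLAM correction: retroactively fix the pose that
        was current when the matched measurement was taken, then recompute all
        later poses so the real-time estimate stays globally consistent."""
        idx = next((i for i, (t, _) in enumerate(self.history) if t >= t_corr),
                   len(self.history) - 1)
        old_pose = self.history[idx][1]
        # Transform mapping the old estimate onto the corrected one; applying it
        # to every subsequent pose preserves all stored relative motions.
        fixup = corrected_pose @ np.linalg.inv(old_pose)
        self.history = self.history[:idx] + [
            (t, fixup @ pose) for t, pose in self.history[idx:]
        ]

    def current_pose(self):
        """Latest pose, suitable for the real-time position controller."""
        return self.history[-1][1]
```

In the actual system, the local estimate comes from an EKF that also fuses IMU measurements (Section II-C); the sketch only illustrates how delayed SLAM corrections can be applied without stalling the real-time control loop.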
Fig. 2. The input RGB-D data to the visual odometry algorithm alongside the detected feature matches. Inliers are drawn in blue, while outliers are drawn in red.

A. Visual Odometry

The visual odometry algorithm that we have developed is based around a standard stereo visual odometry pipeline, with components adapted from existing algorithms. While most visual odometry algorithms follow a common architecture, a large number of variations and specific approaches exist, each with its own attributes. Our overall algorithm is most closely related to the approaches taken by Mei et al. [14] and Howard [11].

1) Preprocessing: A Gaussian pyramid is constructed to enable more robust feature detection at different scales. Each level of the pyramid corresponds to one octave in scale space. Features at the higher scales generally correspond to larger image structures in the scene, which generally makes them more repeatable and robust to motion blur.

2) Feature Extraction: Features are extracted at each level of the Gaussian pyramid using the FAST feature detector [16]. The threshold for the FAST detector is adaptively chosen to ensure a sufficient number of features are detected in each frame. Feature bucketing is employed to maintain a more uniform distribution of features. The depth corresponding to each feature is computed using either sparse stereo or by querying the depth image provided by the RGB-D camera.

3) Initial Rotation Estimation: For small motions such as those encountered in successive image frames, the majority of a feature's apparent motion in the image plane is caused by 3D rotation. Estimating this rotation allows us to constrain the search window when matching features between frames. We use the technique proposed by Mei et al. [14] to compute an initial rotation by directly minimizing the sum of squared pixel errors between downsampled versions of the current and previous frames.

4) Feature Matching: Each feature is assigned a descriptor consisting of the intensity values of the 9 × 9 pixel patch around the feature. Features are then matched across frames using a mutual-consistency check. The score of two features is the sum of absolute differences (SAD) of their feature descriptors [11]. A feature match is declared when two features have the lowest-scoring SAD with each other and lie within the search window defined by the initial rotation estimate. Once an initial match is found, the feature location in the newest frame is refined to obtain a sub-pixel match. Refinement is computed by minimizing the sum of squared errors of the descriptors, using ESM [3].

5) Inlier Detection: Although the constraints imposed by the initial rotation estimate substantially reduce the rate of incorrect matches, an additional step is necessary to further remove bad matches. We follow Howard's approach of computing a graph of consistent feature matches, and then using a greedy algorithm to approximate the maximal clique in the graph [11, 9]. The graph is constructed to leverage the fact that rigid body motions are distance-preserving operations: the Euclidean distance between two features at one time should match their distance at another time. Thus, each feature match is a vertex in the graph, and an edge is formed between two feature matches if the 3D distance between the features does not change substantially. For a static scene, the set of inliers makes up the maximal clique of consistent matches. The max-clique search is approximated by starting with an empty set of feature matches and iteratively adding the feature match with greatest degree that is consistent with all feature matches already in the clique (Fig. 2). Overall, this algorithm has a runtime quadratic in the number of feature matches, but runs very quickly due to the speed of the consistency checking.

6) Motion Estimation: The final motion estimate is computed from the feature matches in three steps. First, an initial motion is estimated using an absolute orientation method to find the rigid body motion minimizing the Euclidean distance between the inlier feature matches [10]. Second, the motion estimate is refined by minimizing feature reprojection error. This refinement step implicitly accounts for the fact that the depth uncertainty originates from the stereo matching in image space. Finally, feature matches exceeding a fixed reprojection error threshold are discarded from the inlier set and the motion estimate is refined once again. A sketch of the inlier-detection and initial motion-estimation steps is given after this list.

To reduce short-scale drift, we additionally use a keyframe technique. Motion is estimated by comparing the newest frame against a reference frame. If the camera motion relative to the reference frame is successfully computed with a sufficient number of inlier features, then the reference frame is not changed. Otherwise, the newest frame replaces the reference frame after the estimation is finished. This simple heuristic serves to eliminate drift in situations where the camera viewpoint does not vary significantly, a technique especially useful when hovering.
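As an illustration of steps 5 and 6, the sketch below builds the distance-consistency graph over candidate matches, approximates its maximal clique greedily, and then computes an initial rigid-body motion from the surviving inliers. The distance tolerance and the SVD-based absolute-orientation solver (which yields the same least-squares solution as Horn's quaternion method [10]) are illustrative choices, not the exact implementation used in the system.

```python
import numpy as np

def detect_inliers(p_prev, p_curr, dist_tol=0.05):
    """Greedy approximation of the maximal clique of mutually consistent
    feature matches. p_prev, p_curr: (N, 3) arrays holding the 3D positions
    of the matched features in the previous and current frames. dist_tol is
    a hypothetical consistency threshold in meters."""
    n = len(p_prev)
    # Rigid-body motions preserve pairwise distances between features.
    d_prev = np.linalg.norm(p_prev[:, None, :] - p_prev[None, :, :], axis=-1)
    d_curr = np.linalg.norm(p_curr[:, None, :] - p_curr[None, :, :], axis=-1)
    consistent = np.abs(d_prev - d_curr) < dist_tol   # adjacency matrix
    np.fill_diagonal(consistent, False)

    clique = []
    candidates = set(range(n))
    degrees = consistent.sum(axis=1)
    while candidates:
        # Add the candidate of greatest degree that is consistent with every
        # match already in the clique, then prune the candidate set.
        best = max(candidates, key=lambda i: degrees[i])
        clique.append(best)
        candidates = {i for i in candidates if i != best and consistent[best, i]}
    return np.array(clique)

def absolute_orientation(p_prev, p_curr):
    """Closed-form rigid-body motion (R, t) with p_curr ~ R @ p_prev + t,
    minimizing the Euclidean error between corresponding 3D points (SVD form)."""
    c_prev, c_curr = p_prev.mean(axis=0), p_curr.mean(axis=0)
    H = (p_prev - c_prev).T @ (p_curr - c_curr)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # guard against reflection solutions
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_curr - R @ c_prev
    return R, t
```

In the pipeline described above, the (R, t) returned by the absolute-orientation step would then be refined by minimizing reprojection error before the final outlier rejection and re-refinement.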
The algorithms have been heavily optimized, using SIMD instructions to speed up computations where possible. On a 2.6 GHz laptop computer, our algorithm requires roughly 15 ms per frame. The timing per stage is as follows: preprocessing, 2.1 ms; feature extraction, 3.1 ms; initial rotation estimation, 1.0 ms; feature matching, 6.0 ms; inlier detection, 2.2 ms; and motion estimation, less than 0.1 ms. Runtimes on the MAV are slightly higher due to the slower clock speed, but are well within real-time.

Fig. 3. A plot showing the ground truth trajectory of the vehicle during position hold. The red dot near the center is the origin around which the vehicle was hovering. (Plot axes: X- and Y-deviation in meters.) Summary statistics:
Duration: 90 s
Mean speed: 0.10 m/s
Mean pos. deviation: 6.2 cm
Max pos. deviation: 19 cm

B. Mapping

Visual odometry provides locally accurate pose estimates; however, global consistency is needed for metric map generation and navigation over long time-scales. We therefore integrate our visual odometry system with our previous work on RGB-D Mapping [8]. This section focuses on the key decisions required for real-time operation; we refer readers to our previous publication for details on the original algorithm, which emphasizes mapping accuracy [8].

Unlike the local pose estimates needed for maintaining stable flight, map updates and global pose updates are not required at a high frequency and can therefore be processed on an offboard computer. The ground station computer detects loop closures, computes global pose corrections, and constructs a 3D log-likelihood occupancy grid map. For coarse navigation, we found a 10 cm resolution to provide a useful balance between map size and precision. Depth data is downsampled to 128×96 prior to a voxel map update to increase the update speed, resulting in a spacing between rays of approximately 5 cm at a range of 6 m. Incorporating a single frame into the voxel map currently takes approximately 1.5 ms.

We employ a keyframe approach to loop closure: new frames are matched against a small set of keyframes using a fast image matching procedure [8]. When a new keyframe is added, a RANSAC procedure over FAST keypoints [16] with Calonder randomized tree descriptors [5] is run. The correspondence is accepted if there are at least 10 inliers with small enough reprojection errors after RANSAC. The final refined relative pose between keyframes is obtained by solving a two-frame sparse bundle adjustment system [13], which minimizes overall reprojection error. The final pose graph is optimized using TORO [6].
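As a rough illustration of the voxel-map update described above, the sketch below integrates one (already downsampled) depth image into a log-odds occupancy grid by casting a ray from the camera center through each valid pixel. The grid dimensions, log-odds increments, and helper names are assumptions made for the sketch; they are not the parameters of the system.

```python
import numpy as np

class LogOddsVoxelMap:
    """Minimal log-odds occupancy voxel grid (illustrative; parameters assumed).
    Assumes the world frame is chosen so the mapped volume has non-negative
    coordinates that fit inside the grid."""

    def __init__(self, size=(200, 200, 50), resolution=0.1, l_occ=0.85, l_free=-0.4):
        self.res = resolution                      # 10 cm voxels, as in the text
        self.log_odds = np.zeros(size, dtype=np.float32)
        self.l_occ, self.l_free = l_occ, l_free

    def _index(self, p):
        return tuple(np.floor(p / self.res).astype(int))

    def _in_bounds(self, cell):
        return all(0 <= c < n for c, n in zip(cell, self.log_odds.shape))

    def insert_depth_image(self, depth, K, T_world_cam, max_range=6.0):
        """Integrate a depth image (meters). K: 3x3 intrinsics of the
        downsampled image; T_world_cam: 4x4 camera-to-world pose."""
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        origin = T_world_cam[:3, 3]
        for v in range(depth.shape[0]):
            for u in range(depth.shape[1]):
                z = depth[v, u]
                if not np.isfinite(z) or z <= 0 or z > max_range:
                    continue
                # Back-project the pixel and transform it into the world frame.
                p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
                self._update_ray(origin, (T_world_cam @ p_cam)[:3])

    def _update_ray(self, origin, endpoint):
        # March along the ray in half-voxel steps, marking traversed cells free,
        # then mark the endpoint cell occupied.
        direction = endpoint - origin
        steps = max(int(np.linalg.norm(direction) / (0.5 * self.res)), 1)
        for s in range(steps):
            cell = self._index(origin + direction * (s / steps))
            if self._in_bounds(cell):
                self.log_odds[cell] += self.l_free
        end_cell = self._index(endpoint)
        if self._in_bounds(end_cell):
            self.log_odds[end_cell] += self.l_occ
```

A call such as grid.insert_depth_image(depth_128x96, K_downsampled, T_world_cam) marks traversed voxels as free and endpoint voxels as occupied, which is the information a planner needs.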
C. State estimation and control

To control the quadrotor, we integrate the new visual odometry and RGB-D Mapping algorithms into our system previously developed around 2D laser scan-matching and SLAM [2]. The motion estimates computed by the visual odometry are fused with measurements from the onboard IMU in an extended Kalman filter. The filter computes estimates of both position and velocity, which are used by the PID position controller to stabilize the position of the vehicle.

We keep the SLAM process separate from the real-time control loop, instead having it provide corrections for the real-time position estimates. Since these position corrections are delayed significantly from when the measurements on which they are based were taken, we incorporate each correction by retroactively modifying the appropriate position estimate in the state history. All future state estimates are then recomputed from this corrected position, resulting in globally consistent real-time state estimates.

III. EXPERIMENTS

We have conducted a number of autonomous flight experiments, both in a motion capture studio and in larger environments at a number of locations around the MIT campus and the Intel Research office in Seattle. The vehicle flew autonomously with state estimates provided by the algorithms presented in this paper. The vehicle was commanded through the environment by a human operator selecting destination waypoints using a graphical interface.

Figure 3 shows an example where the MAV was commanded to hover at a target point, along with statistics about how well it achieved this goal. Ground truth was recorded by a motion capture system. In larger environments, the SLAM algorithm limits the global drift in its position estimates by detecting loop closures and correcting the trajectory estimates. The trajectory history can then be combined with the depth data to automatically generate maps that are useful both for a human operator's situational awareness and for autonomous path planning and decision making. Figure 5a shows an occupancy voxel map constructed during a flight. Figure 5b shows a rendering of the MAV's internal state estimates as it flew through the environment depicted in Figure 6b, and a path planned using the occupancy map and a simple dynamic programming search strategy. A video demonstrating autonomous flight and incremental mapping is available at: http://groups.csail.mit.edu/rrg/rss2011-mav.
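The "simple dynamic programming search strategy" mentioned above can be illustrated with a wavefront planner over the occupancy grid: a dynamic-programming sweep computes a cost-to-go for every free voxel, and a path is read out by descending that cost field. This sketch assumes a boolean occupancy array, unit step costs, and 6-connected moves; it is not the planner flown on the vehicle.

```python
from collections import deque
import numpy as np

NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def plan_path(occupied, start, goal):
    """Wavefront (dynamic-programming) planner over a 3D occupancy grid.
    `occupied` is a boolean numpy array; `start`/`goal` are (i, j, k) cells.
    Returns a list of cells from start to goal, or None if unreachable."""
    cost = np.full(occupied.shape, np.inf)
    cost[goal] = 0.0
    queue = deque([goal])
    # Dynamic-programming sweep: propagate the cost-to-go outward from the goal.
    while queue:
        cell = queue.popleft()
        for d in NEIGHBORS:
            nxt = (cell[0] + d[0], cell[1] + d[1], cell[2] + d[2])
            if all(0 <= c < n for c, n in zip(nxt, occupied.shape)) \
                    and not occupied[nxt] and np.isinf(cost[nxt]):
                cost[nxt] = cost[cell] + 1.0
                queue.append(nxt)
    if np.isinf(cost[start]):
        return None                       # no collision-free route exists
    # Extract a path by greedily descending the cost-to-go field.
    path, cell = [start], start
    while cell != goal:
        candidates = [(cell[0] + d[0], cell[1] + d[1], cell[2] + d[2])
                      for d in NEIGHBORS]
        candidates = [c for c in candidates
                      if all(0 <= x < n for x, n in zip(c, occupied.shape))]
        cell = min(candidates, key=lambda c: cost[c])
        path.append(cell)
    return path
```

For example, reusing the grid from the earlier sketch, plan_path(grid.log_odds > 0.0, start_cell, goal_cell) returns a list of grid cells that a trajectory generator could then smooth and track.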
Fig. 4. Trajectories flown by the MAV in two navigation experiments.

Fig. 5. (a) Dense maximum-likelihood occupancy voxel map of the environment depicted in Fig. 4a, false-colored by height. Unknown/unobserved cells are also tracked, but not depicted here. (b) Using the voxel map generated for Fig. 4b, the vehicle plans a collision-free 3D trajectory (green).

IV. DISCUSSION AND FUTURE WORK

The system described in this paper enables autonomous MAV flight in many unknown indoor environments. However, there remain a great number of more challenging situations that would severely tax our system's abilities. Motion estimation algorithms based on matching visual features, such as ours and virtually all other visual odometry techniques, do not perform as well in regions with few visual features. In large open areas, the visible structure is often far beyond the maximum range of the stereo-based sensors.

Handling these challenges will likely require the integration of other sensors such as laser range-finders. As these sensors have different failure modes, they serve to complement each other's capabilities. Additional sensing modalities can reduce, but not eliminate, state estimation failures. Further robustness can be gained by designing planning and control systems able to respond appropriately when the state estimates are extremely uncertain [4], or to plan in ways that minimize future uncertainty [7].

Looking further afield, the current system allows the vehicle to reason about the world in places that it has mapped, or that are within the sensing horizon of the range sensors. However, it will usually not have information in the direction of travel. Extending the planning horizon will require us to break the dependence on metric range sensing and leverage the qualitative information about the environment available in monocular imagery. In doing so, we hope to continue to increase the capabilities of the vehicle to plan and explore in large-scale unknown environments.

ACKNOWLEDGMENTS

This research was supported by the Office of Naval Research under MURI N00014-07-1-0749 and the Army Research Office under the MAST CTA.
Fig. 6. Textured surfaces generated using sparse bundle adjustment, with data collected from autonomous flights.

REFERENCES

[1] M. Achtelik, A. Bachrach, R. He, S. Prentice, and N. Roy. Stereo vision and laser odometry for autonomous helicopters in GPS-denied indoor environments. In Proceedings of the SPIE Unmanned Systems Technology XI, volume 7332, Orlando, FL, 2009.
[2] A. Bachrach, R. He, and N. Roy. Autonomous flight in unknown indoor environments. International Journal of Micro Air Vehicles, 1(4):217–228, December 2009.
[3] S. Benhimane and E. Malis. Improving vision-based control using efficient second-order minimization techniques. In IEEE Int. Conf. Robotics and Automation, Apr. 2004.
[4] A. Bry and N. Roy. Rapidly-exploring random belief trees for motion planning under uncertainty. In IEEE Int. Conf. Robotics and Automation, Shanghai, China, 2011.
[5] M. Calonder, V. Lepetit, and P. Fua. Keypoint signatures for fast learning and recognition. In European Conference on Computer Vision, pages 58–71, 2008.
[6] G. Grisetti, C. Stachniss, S. Grzonka, and W. Burgard. A tree parameterization for efficiently computing maximum likelihood maps using gradient descent. In Proceedings of Robotics: Science and Systems, 2007.
[7] R. He, S. Prentice, and N. Roy. Planning in information space for a quadrotor helicopter in a GPS-denied environment. In IEEE Int. Conf. Robotics and Automation, pages 1814–1820, Los Angeles, CA, 2008.
[8] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D Mapping: Using depth cameras for dense 3D modeling of indoor environments. In Int. Symposium on Experimental Robotics, Dec. 2010.
[9] H. Hirschmuller, P. R. Innocent, and J. M. Garibaldi. Fast, unconstrained camera motion estimation from stereo without tracking and robust statistics. In Proc. Int. Conference on Control, Automation, Robotics and Vision, volume 2, pages 1099–1104, Dec. 2002.
[10] B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. J. Optical Society of America, 4(4):629–642, 1987.
[11] A. Howard. Real-time stereo visual odometry for autonomous ground vehicles. In IEEE Int. Conf. on Intelligent Robots and Systems, Sep. 2008.
[12] A. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy. Visual odometry and mapping for autonomous flight using an RGB-D camera. Under review, 2011.
[13] K. Konolige and W. Garage. Sparse Sparse Bundle Adjustment. In Proc. of the British Machine Vision Conference (BMVC), 2010.
[14] C. Mei, G. Sibley, M. Cummins, P. Newman, and I. Reid. A constant time efficient stereo SLAM system. In British Machine Vision Conference, 2009.
[15] L. Meier, P. Tanskanen, F. Fraundorfer, and M. Pollefeys. Pixhawk: A system for autonomous flight using onboard computer vision. In IEEE Int. Conf. Robotics and Automation, May 2011.
[16] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, 2006.
