
Research article

Wide area registration on camera phones for mobile augmented reality applications
Liya Duan
School of Hydropower & Information Engineering, Huazhong University of Science & Technology, Wuhan, China, and

Tao Guan and Yawei Luo
School of Computer Science and Technology, Huazhong University of Science & Technology, Wuhan, China
Abstract
Purpose – The authors aim to present a vision-based wide area registration method for camera phone-based mobile augmented reality applications.
Design/methodology/approach – The tracking system uses a drift-free 6 DOF tracker based on prebuilt multiple maps, and can be initialized using the authors' compacted key-frames recognition engine.
Findings – Given the current location and camera pose, the authors show how the corresponding virtual objects can be accurately superimposed even in the case of varying user positions.
Originality/value – The authors' system can be used in wide area scenarios and provides an accurate registration between real and virtual objects.
Keywords Mobile augmented reality, Camera phones, Wide area, Registration, Cameras, Telephones
Paper type Research paper

1. Introduction
Modern camera phones have become a compelling platform for mobile augmented reality (AR) applications: they contain almost all the equipment typically required for video-based AR. In the past few years, many researchers have implemented algorithms such as natural feature matching (Wagner et al., 2010a, b) and online mapping (Klein and Murray, 2009) on mobile phones to realize real-time registration. By contrast, little attention has been paid to realizing wide area registration on camera phones for mobile AR use. With the use of small and flexible mobile devices, mobile AR allows for more unrestricted user movement; the requirement for wide area registration is thus more urgent for mobile AR than for traditional PC-based AR.
In this paper, we focus on camera phone-based wide area registration, one of the most important problems in the mobile AR field and one that has not yet been solved properly. While great strides have been made to address wide area registration for PC-based AR, this is not the case for camera phone-based mobile AR systems.
There are mainly two factors that limit the usability of camera phones as platforms for wide area registration (Wagner and Schmalstieg, 2009a, b). First, instead of increasing processing speed, mobile phone processors are primarily designed for low power consumption. Even well-written code for a modern mobile phone will still run about 10-15 times slower than on a modern PC. Thus, it is impractical to run time-consuming steps such as wide area localization and camera tracking on these low-end devices. Second, memory size is another important obstacle. Limitations in mobile phone operating systems usually do not allow more than about 15 MB per application. Obviously, 15 MB is not large enough to accommodate the data (such as key-frames and 3D points) of a wide area workspace, since these data consume tens or hundreds of MB of memory even with the most advanced compression techniques (Irschara et al., 2009).
In our research, we design a flexible wide area registration method that overcomes the above difficulties. First, instead of using a single global map, we use multiple maps to represent the workspace. Since the built maps are geometrically independent of each other, we only need to load the map currently being tracked, rather than a global map, into memory to realize camera tracking. The multiple-maps method is therefore flexible in memory management and especially suitable for solving the wide area registration problem on low-memory devices. Second, we modify the traditional ferns classifier and design a compacted key-frames recognition engine that integrates all the built maps into a single tracking system. While providing fast and accurate localization results, the method is also efficient in memory usage, making wide area localization possible on low-memory camera phones.

Sensor Review, Vol. 33 No. 3, 2013, pp. 209-219. © Emerald Group Publishing Limited [ISSN 0260-2288] [DOI 10.1108/02602281311324663]

This research is supported by the National Natural Science Foundation of China (NSFC) under Grants No. 60903095 and 61272202.


2. Related work

In this section, we first briefly review related work in the field of wide area registration for PC-based AR, and then introduce previous work on registration for mobile phones.
Wide area registration for PC-based AR has attracted much attention in the past few years. Initially, filter techniques such as the extended Kalman filter (Davison et al., 2007) and the particle filter (Pupilli and Calway, 2005) were used to implement vision-based real-time SLAM systems for AR registration. While promising, filter-based methods are computationally expensive, which mandates the use of sparse and relatively small maps. To overcome these problems, researchers have more recently used structure from motion (SFM) techniques to obtain denser and larger maps and improve tracking quality. For example, Klein and Murray (2007) employ online SFM to design a system that produces detailed maps with thousands of natural features which can be tracked at frame rate, with an accuracy and robustness rivaling that of SLAM-based systems. However, one important limitation of the above methods is that they attempt to obtain a global 3D map of the observed scene, which is not always obtainable due to the complexity of the workspace or the immaturity of reconstruction techniques. Thus, these systems can only work in a relatively small and straightforward AR workspace. To overcome the problems of single-map systems, the authors previously proposed a wide area registration framework (Guan and Wang, 2007) based on multiple maps. This method partitions the whole scene into different maps according to the requirements of the AR application, and all the built maps are integrated into a single tracking system by a fast scene learning and recognition engine. The resulting system can cope with thousands of maps at frame rate, which greatly enhances the usability of AR systems. In this research, we show that the multiple-maps strategy is also flexible in memory management and especially suitable for solving the wide area registration problem on low-memory devices.
Compared with PCs, camera phones do not have sufficient computational power. The first approaches to mobile phone localization were based on a client-server model: all computation is outsourced to a server connected via a wireless link, and the client device is reduced to a pure display plus camera. Since the typical response time is reported to be about 10 seconds per frame, such methods are not suitable for real-time AR applications. More recently, researchers have shown that important steps such as natural feature matching and online mapping can run in real time on mobile phones. Wagner et al. (2010a, b) created the first real-time six DOF natural feature tracking system running on mobile phones by using modified SIFT and ferns approaches. They also presented a method (Wagner et al., 2009) to track multiple known planar targets in real time while simultaneously detecting new targets for tracking. Arth et al. (2009) propose an efficient wide area localization method for mobile AR applications. The method relies on a previously acquired 3D feature model built manually in advance, and a potentially visible set (PVS) based representation makes it suitable for real-time use on a mobile phone. Experimental results show that robust recovery of the full six DOF pose can be obtained on a current smartphone at about 2-3 Hz, which is sufficient to initialize incremental tracking methods. Wagner et al. (2010a, b) also designed a method to build and track panoramic maps in real time on mobile phones; however, it cannot give the full six DOF pose, which makes it unsuitable for 3D augmentation purposes. Arth et al. (2011) realize wide area localization and tracking by using panoramic maps and pre-built 3D models, but users are limited to the place where the panoramic map was built, which weakens the flexibility of the system. Moreover, for the above methods global localization remains hardware dependent: additional sensors such as GPS, Wi-Fi triangulation, Bluetooth or infrared beacons are needed to provide users with the initial location for the PVS. The approach proposed in this paper overcomes these problems to some degree. With our compacted ferns based scene recognition engine, global localization can be performed in real time without additional sensors.

3. Overview of our method


As shown in Figure 1, our method can be divided into three
stages, namely, map building, localization and real time
registration. We use PCs to deal with the task of map building.
The tasks of localization and real time registration are
performed by using camera phones.
Map building is to get the data (key-frames and 3D
features) that will be used in key-frames learning and real time
registration. We use the camera phone to get the video of the
scene in which virtual objects will be augmented, and then
build the map of this scene on a PC by using key-frames based
SFM technique. Once all the needed maps are built, we
organize all the obtained key-frames by using the compacted
ferns discussed in Section 4 for online localization use.
With the maps and ferns built, we can then transmit them
to the camera phone on which the real time registration will
be carried out. While the compacted ferns are always in the
main memory of the camera phone in the registration process,
this is not the case for the built maps. Different maps will be
loaded individually to make our system flexible in memory
usage. To initialize the tracking, a scene recognition process is
first carried out by using the compacted ferns. If the user
thinks that a valid scene is found, he needs to give a
confirmation to enable the system to load the map of the
corresponding scene for tracking use. When he turns to
another scene, a new confirmation is needed to notify the
system to load the corresponding map to start a new tracking.
Since the map building is carried out on the PC in our
system, we directly use the SFM technique proposed in Klein
and Murray (2007) to get 3D features and key-frames for
each scene. In this research, we mainly focus on the problems
of fast scene recognition and real time registration on camera
phones.
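As an illustration of this load-on-confirmation workflow, the following minimal Python sketch shows one way the map-switching loop could be organized; the class and function names (MapStore, recognize_scene and so on) are illustrative stand-ins, not our phone implementation.

    # A minimal sketch of the confirm-and-load workflow: the compacted ferns
    # stay resident, while only the confirmed scene's map is held in memory.
    # All names here are hypothetical, for illustration only.

    class MapStore:
        """Keeps at most one map (key-frames + 3D points) in memory at a time."""
        def __init__(self, maps_on_disk):
            self.maps_on_disk = maps_on_disk    # scene_id -> map data loader
            self.active_id = None
            self.active_map = None

        def load(self, scene_id):
            if scene_id != self.active_id:      # previous map is dropped here
                self.active_map = self.maps_on_disk[scene_id]()
                self.active_id = scene_id
            return self.active_map

    def session(frames, recognize_scene, user_confirms, make_tracker, store):
        tracker = None
        for frame in frames:
            if tracker is None:
                scene_id = recognize_scene(frame)       # compacted-ferns lookup
                if scene_id is not None and user_confirms(scene_id):
                    tracker = make_tracker(store.load(scene_id))
            else:
                pose = tracker(frame)                   # 6 DOF pose or None
                if pose is None:                        # lost: back to recognition
                    tracker = None

    # Toy run with stand-in components:
    store = MapStore({0: lambda: "map-0 data", 1: lambda: "map-1 data"})
    session(frames=range(5),
            recognize_scene=lambda f: 0,
            user_confirms=lambda sid: True,
            make_tracker=lambda m: (lambda f: (0, 0, 0, 0, 0, 0)),
            store=store)
    print(store.active_id)   # 0: only the confirmed map was loaded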

4. Compacted ferns for key-frames recognition


This section describes the key-frames recognition method we use to achieve fast wide area localization in our mobile AR system. We have found that random ferns (Ozuysal et al., 2010) are eminently suitable for our research because they naturally handle multi-class problems and are fast enough to meet real-time performance. While ferns have been used previously by researchers (Wagner et al., 2010a, b) to perform feature
matching tasks on mobile phones, we show that ferns are also efficient for solving the wide area localization problem. In this section, we first briefly introduce the ferns based key-frames learning and recognition method, and then propose some modifications that make ferns suitable for wide area localization in mobile AR applications.

Figure 1 Overview of our method

4.1 Ferns based key-frames learning and recognition
We build the key-frames classifier R by using a set of ferns constructed by randomly generating all the node tests in a one-off style, as in Ozuysal et al. (2010).
To train the ith key-frame, we first get a certain number of randomly selected patches $\{h_1, h_2, \ldots, h_M\}$ and then simply evaluate each patch according to the binary tests of a fern. Once all the random patches are evaluated, we use equation (1) to compute the log posterior probabilities for this fern:

$$\hat{P}_d(R_h = i) = \log\frac{n_i^d + 1}{n_i + D} \quad (1)$$

where $n_i^d$ is the number of randomly selected patches of class (key-frame) $i$ that evaluate to fern value $d$, $n_i$ is the total number of random patches for class $i$, $D = 2^S$ and $S$ is the depth of the fern.
The training process for the ith key-frame is accomplished by repeating the above operation for each fern independently. With all the key-frames from the different maps trained, R can then be used for wide area localization.
To recognize a key-frame, we first get the random patch set $\{h_1, h_2, \ldots, h_M\}$ from the input frame. Each patch is evaluated by all the ferns belonging to R to compute the posterior probabilities $\sum_{f=1}^{F} \hat{P}_{d(f,h_m)}(R_{h_m} = z)$, $z = 1, \ldots, Z$, for each trained class, where $F$ is the number of ferns and $d(f, h_m)$ is the $f$th fern's value reached by $h_m$. With all the patches evaluated, we use equation (2) to select the trained key-frame with the greatest average posterior probability (subject to a threshold) as the recognition result:

$$\mathrm{class}(h_1, h_2, \ldots, h_M) = \arg\max_z \frac{1}{M}\sum_{m=1}^{M}\sum_{f=1}^{F} \hat{P}_{d(f,h_m)}(R_{h_m} = z) \quad (2)$$

If the user decides that the recognized key-frame belongs to a valid scene, he can make a confirmation to enable the system to load the corresponding map into memory for real-time registration.
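As an illustration of equations (1) and (2), the following Python sketch trains and queries a toy ferns classifier. The random-projection stand-in for the binary pixel tests and all the toy sizes are assumptions made only to keep the sketch self-contained.

    import numpy as np

    S = 4                  # fern depth (binary tests per fern)
    D = 2 ** S             # leaves per fern
    F = 10                 # number of ferns
    Z = 5                  # number of key-frame classes
    M = 50                 # random patches per key-frame / per query
    rng = np.random.default_rng(0)

    # Stand-in for the binary pixel tests: random projections of a flattened
    # 8x8 patch. A real implementation compares pairs of pixel intensities.
    tests = rng.normal(size=(F, S, 64))

    def fern_values(patch):
        bits = tests @ patch > 0                          # (F, S) test outcomes
        return (bits * (2 ** np.arange(S))).sum(axis=1)   # leaf index per fern

    def train(patches_per_class):
        # counts[f, d, i]: patches of class i reaching leaf d of fern f
        counts = np.zeros((F, D, Z))
        for i, patches in enumerate(patches_per_class):
            for p in patches:
                counts[np.arange(F), fern_values(p), i] += 1
        n_i = np.array([len(p) for p in patches_per_class], float)
        return np.log((counts + 1.0) / (n_i + D))    # equation (1)

    def recognize(P_hat, patches):
        score = np.zeros(Z)
        for p in patches:
            score += P_hat[np.arange(F), fern_values(p), :].sum(axis=0)
        return int(np.argmax(score / len(patches)))  # equation (2)

    classes = [rng.normal(loc=i, size=(M, 64)) for i in range(Z)]
    P_hat = train(classes)
    print(recognize(P_hat, rng.normal(loc=3, size=(M, 64))))  # expected: 3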

4.2 Ferns compressing
Ferns require large amounts of memory, which makes them unsuitable for running on low-memory mobile phones. To solve this problem, we make the following modifications to reduce the memory usage of traditional ferns.
First, the original ferns store the log posterior probabilities obtained through equation (1) as four-byte floating point values. However, it has been reported that representing the log probabilities as eight-bit bytes gives enough precision and causes no degradation in performance. We use the method proposed in Calonder et al. (2009) to perform this compression. The compressed probability $\tilde{P}_d(R_h = i)$ of $\hat{P}_d(R_h = i)$ is computed as:

$$\tilde{P}_d(R_h = i) = \left[\frac{\min(\hat{P}_d(R_h = i),\, \hat{P}_{95}) - \hat{P}_{\min}}{\hat{P}_{95} - \hat{P}_{\min}} \cdot (2^8 - 1)\right] \quad (3)$$

where $\hat{P}_{\min}$ denotes the minimum value occurring over all leaves in the current fern and $\hat{P}_{95}$ is the corresponding 95 percent percentile. With this modification, we reduce memory requirements by a factor of 4. Moreover, since recognition can then be performed using integer values instead of floating point values, we can speed up the recognition process by a factor of at least 3 on camera phones, some of which do not support floating point arithmetic.
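The following Python sketch illustrates the quantization of equation (3) for a single fern; the random table is a stand-in for real log posteriors.

    import numpy as np

    def compress_fern(P_hat_fern):
        # P_hat_fern: (D, Z) log posteriors of one fern, from equation (1)
        p_min = P_hat_fern.min()                  # minimum over all leaves
        p_95 = np.percentile(P_hat_fern, 95)      # 95 percent percentile
        clipped = np.minimum(P_hat_fern, p_95)    # clip the long upper tail
        scaled = (clipped - p_min) / (p_95 - p_min) * (2 ** 8 - 1)
        return np.round(scaled).astype(np.uint8)  # equation (3)

    table = np.log(np.random.default_rng(1).random((16, 5)))
    q = compress_fern(table)
    print(q.dtype, int(q.min()), int(q.max()))    # uint8 0 255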
Second, as can be seen from the left part of Figure 2 (we use
a fern which contains three random tests for simplicity), with
the first modification, we need to store all the compressed
probabilities obtained in the training process. However, we find that not all the probabilities play the same role: probabilities with small values in fact contribute little to the recognition process, so a considerable portion of memory is spent storing small values that make little contribution to recognition performance. We abandon these small values generated in the training process to further reduce memory usage, setting a threshold to select the values that need to be stored. For the jth class (key-frame), we set the threshold as:

$$\frac{\tilde{P}_{\max}(R_h = j)}{a} \quad (4)$$

where $\tilde{P}_{\max}(R_h = j)$ is the maximum of all the probabilities deduced from equation (3) in the process of training the jth key-frame. We also use inverted files to store the needed probabilities. As shown in Figure 2, once the compressed probabilities have been obtained, we first compute a threshold by using equation (4) for this class. Only the values larger than the computed threshold are stored in the inverted file, while the others are discarded to save memory.

Figure 2 Using thresholds and inverted files in ferns

We carried out an experiment on the ZUBUD image database to evaluate the efficiency of the above improvements in reducing memory usage and recognition time. The compression ratios and recognition times are shown in Figure 3(a) and (b), respectively, from which we can see that memory usage and recognition time decrease markedly as 1/a increases. However, as shown in Figure 3(c), the recognition performance does not degrade sharply as 1/a changes. In our system, we set 1/a to 0.08, which gives a satisfying compression ratio together with reasonable recognition performance.
Third, we also compress the built inverted files by using index compression, an established method in the text retrieval literature, to further reduce memory consumption. To our knowledge, index compression has never before been applied to compress a key-frames recognition engine for mobile AR use. We use the improved rice coding algorithm proposed in Zhang et al. (2008) to compress the generated inverted files. We compress 32 numbers at a time. For each set of 32 numbers, we first choose a parameter b, restricted to powers of two, as the smallest such integer equal to or greater than 0.69 × avg, where avg is the average of the 32 values to be compressed. Each value n is then encoded in two parts: a quotient q = ⌊n/b⌋ stored as a unary code, and a remainder r = n mod b stored in binary form. We first store all the binary parts, using log2(b) bits each, in a (32 × log2(b))-bit field, and then store the unary parts. During decompression, we first copy these binary values into an integer array, and then parse the unary parts and adjust the decoded binary values accordingly. In fact, we do not need to decompress all the compressed batches during recognition: we only need to find the batches containing the reached values and decompress them to get the probabilities used in equation (2). By using the above method, we can process about two to three times more key-frames than with a standard inverted file when using the same amount of memory.
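The following Python sketch of the batched rice coder is written from the description above and is illustrative only; for clarity it builds a plain bit list rather than a packed (32 × log2(b))-bit field.

    def choose_b(values):
        # smallest power of two >= 0.69 * average of the batch
        avg = sum(values) / len(values)
        b = 1
        while b < 0.69 * avg:
            b *= 2
        return b

    def encode_batch(values):
        assert len(values) == 32
        b = choose_b(values)
        k = b.bit_length() - 1                  # log2(b) bits per remainder
        bits = []
        for n in values:                        # binary parts first (32 * k bits)
            r = n % b
            bits.extend((r >> i) & 1 for i in reversed(range(k)))
        for n in values:                        # then unary parts: q ones, a zero
            bits.extend([1] * (n // b) + [0])
        return b, bits

    def decode_batch(b, bits):
        k = b.bit_length() - 1
        rems, pos = [], 0
        for _ in range(32):                     # fixed-width binary parts
            r = 0
            for _ in range(k):
                r = (r << 1) | bits[pos]
                pos += 1
            rems.append(r)
        out = []
        for r in rems:                          # parse unary, adjust binary part
            q = 0
            while bits[pos] == 1:
                q += 1
                pos += 1
            pos += 1                            # consume terminating zero
            out.append(q * b + r)
        return out

    vals = [3, 7, 1, 0, 12, 5, 9, 2] * 4        # a toy batch of 32 values
    b, bits = encode_batch(vals)
    assert decode_batch(b, bits) == vals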
Figure 3 The compression ratios, recognition time and recognition rates when using different thresholds

5. Fast initializing and tracking on camera phones


5.1 Camera initialization
Once the recognized map is loaded, the next step is to initialize the camera for tracking. We first use the built ferns to find the key-frame most similar to the current frame from the key-frame set of the recognized scene. With a key-frame recognized, we then carry out a matching process using the sum of squared differences (SSD) to get the feature correspondences from which the initial camera pose is computed. We do not perform an exhaustive matching between the current and recognized frames directly, because it would be very time consuming. Instead, we use a homography to assist the matching process. We convert both the input and recognized frames to a size of 40 × 30 pixels, and then compute the homography between the two converted frames using an efficient second-order minimization method (Benhimane and Malis, 2007). For each feature of the recognized key-frame, we first compute its projection on the current frame using the computed homography H as:

$$\begin{bmatrix} x_c \\ y_c \\ 1 \end{bmatrix} = H \begin{bmatrix} x_r \\ y_r \\ 1 \end{bmatrix}$$

where $\mathbf{x}_r = (x_r, y_r, 1)^T$ is the 2D position of a feature on the recognized key-frame and $\mathbf{x}_c = (x_c, y_c, 1)^T$ is its projection on the current frame. Matching is performed only against the features detected within ±20 pixels of the projection. With all the features matched, we then perform a non-linear refinement using a robust M-estimator to compute the camera pose and eliminate outliers simultaneously. We treat a pose as valid only if more than 30 inliers are found, to avoid accepting an invalid pose computed from a small number of mismatches.
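The following Python sketch illustrates the homography-guided SSD search; for simplicity it scans every pixel in the ±20 pixel window rather than only the detected feature positions, and the identity homography in the usage example is a toy value.

    import numpy as np

    def project(H, x, y):
        p = H @ np.array([x, y, 1.0])           # homogeneous projection
        return p[0] / p[2], p[1] / p[2]

    def ssd(a, b):
        d = a.astype(np.float32) - b.astype(np.float32)
        return float((d * d).sum())

    def match_feature(H, key_patch, key_xy, cur_img, radius=20):
        # search for the best 8x8 SSD match within +/- radius pixels
        # of the homography projection of the key-frame feature
        xc, yc = project(H, *key_xy)
        h, w = cur_img.shape
        best, best_xy = float("inf"), None
        for y in range(max(4, int(yc) - radius), min(h - 4, int(yc) + radius)):
            for x in range(max(4, int(xc) - radius), min(w - 4, int(xc) + radius)):
                s = ssd(key_patch, cur_img[y - 4:y + 4, x - 4:x + 4])
                if s < best:
                    best, best_xy = s, (x, y)
        return best_xy, best

    img = np.random.default_rng(2).integers(0, 255, (240, 320), dtype=np.uint8)
    patch = img[100:108, 150:158].copy()        # 8x8 patch centered at (154, 104)
    print(match_feature(np.eye(3), patch, (154, 104), img))   # ((154, 104), 0.0)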
5.2 Tracking
Once a successful initialization has been done, we can start to
track natural features between consecutive frames to compute
camera poses for registration use.
We extract 8 × 8 local patches from the previous frame to represent the features used to estimate the previous camera pose. To track these features in the current frame, we use the SSD measure to estimate patch similarity. To speed up the tracking process, features are only tracked in a search region of 20 × 20 pixels. A motion model is also used to predict the search region for each feature. We use the linear motion model (Wagner et al., 2010a, b), which has been proved effective as long as the camera's motion does not change drastically: we calculate the difference between the poses of the previous two frames to predict the current camera pose. The centers of the search regions are then obtained by projecting the features onto the current frame with the predicted pose.
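A minimal Python sketch of this constant-velocity prediction is given below; the 4 × 4 camera-from-world pose convention and the intrinsic matrix K are illustrative assumptions.

    import numpy as np

    def predict_pose(T_prev2, T_prev1):
        # replay the last inter-frame increment (constant-velocity assumption)
        delta = T_prev1 @ np.linalg.inv(T_prev2)
        return delta @ T_prev1

    def search_center(K, T, X):
        # project 3D map point X through the predicted camera-from-world pose T
        Xc = T[:3, :3] @ X + T[:3, 3]
        u, v, w = K @ Xc
        return u / w, v / w

    K = np.array([[300.0, 0.0, 160.0],
                  [0.0, 300.0, 120.0],
                  [0.0, 0.0, 1.0]])
    T0 = np.eye(4)
    T1 = np.eye(4)
    T1[0, 3] = 5.0                    # camera translated 5 units between frames
    T_pred = predict_pose(T0, T1)     # predicts a further 5 units of motion
    print(search_center(K, T_pred, np.array([0.0, 0.0, 100.0])))  # (190.0, 120.0)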
The predicted camera pose will also be used to recover the lost
features. To do this, we first carry out a key-frame recognition
process as we do in Section 5.1. Only features from the first three
recognized key-frames which belong to the current scene and



are currently not being tracked are considered lost features. The following process is very similar to the tracking process; the only difference is that the matching is carried out using affine-warped patches from the recognized key-frames instead of local patches from the previous frame. This is mainly because we cannot get correct local patches of the lost features in case of occlusion or large illumination changes, the two main causes of losing features.
With the match set obtained, we then perform a non-linear refinement, as in the initialization stage, to compute the camera pose for registration use. Since fixed-point based non-linear refinement is not stable enough when the number of matches changes (Wagner et al., 2010a, b), we use the floating-point method for the refinement process. However, floating-point refinement becomes time consuming when too much data is involved. Thus, we control the number of matches and select at most 120 correspondences for the non-linear refinement. We also select features that cover a large area of the camera image to improve stability. To do this, we partition the input 320 × 240 image into 4 × 5 = 20 equal blocks. In each block, we first try to find the six correspondences with the highest matching scores. If the number of correspondences from all the blocks is less than 120, we select the remaining correspondences from the whole match set according to their matching scores, as sketched below.
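The following Python sketch illustrates this block-based selection; the (x, y, score) match representation is an illustrative choice.

    import random

    def select_correspondences(matches, width=320, height=240,
                               rows=4, cols=5, per_block=6, max_total=120):
        # matches: (x, y, score) tuples; higher score means a better match
        bw, bh = width // cols, height // rows
        blocks = [[] for _ in range(rows * cols)]
        for m in matches:
            col = min(int(m[0]) // bw, cols - 1)
            row = min(int(m[1]) // bh, rows - 1)
            blocks[row * cols + col].append(m)
        chosen = []
        for block in blocks:                    # best per_block matches per block
            block.sort(key=lambda m: m[2], reverse=True)
            chosen.extend(block[:per_block])
        if len(chosen) < max_total:             # top up from the remaining matches
            rest = [m for m in matches if m not in chosen]
            rest.sort(key=lambda m: m[2], reverse=True)
            chosen.extend(rest[:max_total - len(chosen)])
        return chosen[:max_total]

    random.seed(0)
    pts = [(random.uniform(0, 320), random.uniform(0, 240), random.random())
           for _ in range(300)]
    print(len(select_correspondences(pts)))     # 120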

6. Applications
We built a virtual museum exhibition prototype to prove the usability of the proposed method for wide area mobile AR applications. In mobile AR based virtual museum applications, users should be able to walk anywhere they want and observe different virtual objects, such as antiques, calligraphy and paintings, superimposed in different places in the exhibition hall. We built a prototype system using the proposed method on an HTC camera phone (with a 1 GHz CPU) to meet the wide area localization and tracking requirements of these mobile AR applications.
The prototype system is built in our laboratory, which covers an area of about 80 square meters. We build eight maps, each containing nine to 18 key-frames and 295-833 mapped points; the resulting system contains a total of 121 key-frames and 4,045 3D points. Twenty random ferns, each with a depth of 12, are used in our system. While the original ferns take about 38.72 MB, the compressed ferns take only about 0.87 MB, which makes them suitable for running on our low-memory mobile phones. Each map (including key-frames and 3D features) takes 1.3-3.1 MB of memory and is loaded or unloaded according to the user's confirmation. Our system requires about 20 ms to perform scene recognition and about 40 ms to track and augment a single frame, so the application can run at interactive frame rates (12 Hz).
If the user decides to browse a virtual object, he can select a map according to the recognition results returned by the compressed ferns. The selecting operation automatically triggers the loading of the corresponding map into main memory, and the initializing process is then started. With the initial pose obtained, the corresponding virtual object is superimposed using the camera poses provided by the tracking process. If the user decides to browse another virtual object, he can simply turn to another scene and repeat the above operations.

7. Results
7.1 Preliminary feedback on usability
We recruited eight users (one female, seven male, ages 22-35) with no previous knowledge of AR to test the usability of the proposed application prototype. For each user, we conducted two tests: the first immediately after the prototype system was built, and the second 4 hours later, to test the performance of the proposed method under scene structure and illumination changes. After the tests we conducted an informal interview to collect user feedback.
The results show that 61 out of 64 scenes were recognized (95.31 percent) and 59 out of 64 virtual objects were augmented (92.19 percent) in the tests carried out immediately after the prototype system was built. The recognition rate is slightly lower (57 out of 64, 89.06 percent) for the tests carried out 4 hours later. However, this is not the case for the augmentation: only 37 out of 64 virtual objects were correctly augmented (57.81 percent). This is because the illumination and scene structure changes lead to failures in feature point matching.
All users agreed that the registration is stable and fast. They experienced occasional registration failures when the camera moved out of the scope of the target scene; however, these failures were always recovered when the camera returned to the target scene. Moreover, all the users stated that as they became more familiar with the prototype system, they could avoid nearly all these problems.
Finally, the user interface generally received positive comments, especially the map selecting function. All the users agreed that selecting the corresponding map through a simple point-and-click interface does not impair the usability of our prototype system.

7.2 Localization and registration results


Some localization and registration results are shown in Figures 4 and 5. Figure 4 shows the results of the tests carried out immediately after the prototype system was built: Figure 4(a)-(h) shows the registration results when tracking the eight built maps. Figure 5(a) shows the result under illumination changes while tracking the first map, and Figure 5(b) the result under camera shaking while tracking the fourth map. Figure 5(c) and (d) show the results under volume and view angle changes while tracking the second and fifth maps, respectively. Figure 5(e) and (f) show the localization results when initializing the second and first maps, respectively. These results demonstrate the validity of the proposed scene recognition and registration method.
Figure 6 shows the results of the second test. Figure 6(a)-(f) shows the results of tracking the first, third, fourth, fifth, sixth and eighth maps, respectively. Comparing Figures 6(d) and 4(e), we can see that the illumination in the second test changed considerably compared with the first test. However, the registration could still be carried out successfully, which demonstrates the robustness of the proposed method. There were, however, problems in tracking the second and seventh maps: due to obvious scene structure changes, no user could get a successful augmentation in these two maps, even though the compacted ferns gave the correct recognition results.

Figure 4 Results from the first test

Figure 5 Results of illumination changes, camera shaking, view angles and volume changes

Figure 6 Results from the second test


7.3 Tracking accuracy and computation time
We carried out two experiments to test the accuracy of the proposed tracking system, in which the number of matches used in camera pose estimation is controlled. We compare the average RMS errors (Guan and Wang, 2007) between the re-projections of the four specified points (used to define the virtual object coordinate frame) computed from the camera poses calculated using different numbers of features. In these two


experiments, the re-projections obtained using the camera pose calculated from all the tracked features are used as the ground truth for comparison. The map used in these two experiments was built in an offline stage and contains 13 key-frames and 684 mapped points. To facilitate the comparison, we also record the errors obtained using different numbers of matches. Figure 7(a) shows the errors when the camera moves along the z-axis from 468.29 to 1,131.37 mm. Figure 7(b) shows the errors when the camera rotates about the z-axis from 25.21° to 75.21°. We also record the average time taken by feature matching and camera pose estimation when using different numbers of features: it takes about 96 ms when using all the features, and the average time reduces to 32, 21 and 19 ms when using 120, 80 and 50 features, respectively. A reasonable compromise between tracking accuracy and computational cost is obtained with 120 features, which demonstrates the effectiveness of the proposed tracking method.

Figure 7 Comparison of RMS errors
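For clarity, the following Python sketch shows how such a re-projection RMS error can be computed for the four object-frame points; the intrinsics, points and poses are toy values, not those used in the experiments.

    import numpy as np

    def reproject(K, T, X):
        # X: (N, 3) world points; T: 4x4 camera-from-world pose
        Xc = (T[:3, :3] @ X.T).T + T[:3, 3]
        uvw = (K @ Xc.T).T
        return uvw[:, :2] / uvw[:, 2:3]

    def rms_error(K, T_est, T_ref, X):
        d = reproject(K, T_est, X) - reproject(K, T_ref, X)
        return float(np.sqrt((d ** 2).sum(axis=1).mean()))

    K = np.array([[300.0, 0.0, 160.0],
                  [0.0, 300.0, 120.0],
                  [0.0, 0.0, 1.0]])
    X = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0],
                  [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]])  # object-frame points
    T_ref = np.eye(4)
    T_ref[2, 3] = 100.0               # reference pose from all tracked features
    T_est = T_ref.copy()
    T_est[0, 3] += 0.5                # pose from a reduced feature set
    print(rms_error(K, T_est, T_ref, X))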

8. Discussion
This section discusses the limitations of the proposed method and then gives the directions of our future work:
1 We propose to compress the built random ferns to reduce the memory usage of our key-frames recognition engine. As discussed in Section 6, our current system can manage thousands of key-frames using 10 MB of memory. However, this is not enough for city-scale mobile AR applications, in which millions of key-frames would be used to perform location recognition. Quantization based compressing and searching methods (Brandt, 2010; Jegou et al., 2011) may be a good choice to solve this problem, and will be a main direction of our future work.
2 In our current system, we directly use local patches to describe the detected natural feature points. This is mainly because traditional local descriptors such as SIFT and SURF are computationally inefficient on low-power mobile phones. In our future work, we will try to accelerate the SIFT method and use it to further improve the performance of our camera initialization system.
3 In our current tracking system, only feature points are used in the camera tracking process. As can be seen from Figure 7, the average RMS errors are acceptable when more than 100 natural features are used. However, our current method is not very robust to motion blur, which is a common problem for low frame rate camera phones. In our future work, we will try to use other kinds of features, such as lines and contours, to further improve the performance of our tracking system.

9. Conclusions and future work
We have shown how wide area registration can be realized on a mobile phone. We use multiple maps to represent the wide area workspace and use compacted ferns to integrate all the built maps into a single mobile AR system. Preliminary applications show that our method is flexible and that the registration works robustly.
Although our method needs the user's intervention to start tracking in a map, this is reasonable for most mobile AR applications and does not noticeably weaken the usability of the proposed method. In fact, solving the wide area registration problem on low-power mobile phones is feasible precisely because we allow a simple user interaction.
In our future work, we will try to compress the built maps to further reduce memory usage on mobile phones. We also want to use the mobile phone's GPU to accelerate the ferns based key-frames recognition algorithm, to further improve the scalability of our system.


References

Arth, C., Klopschitz, M., Reitmayr, G. and Schmalstieg, D. (2011), "Real-time self-localization from panoramic images on mobile devices", Proc. IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, pp. 37-46.
Arth, C., Wagner, D., Klopschitz, M., Irschara, A. and Schmalstieg, D. (2009), "Wide area localization on mobile phones", Proc. IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 73-82.
Benhimane, S. and Malis, E. (2007), "Homography-based 2D visual tracking and servoing", International Journal of Robotics Research, Vol. 26 No. 7, pp. 661-676.
Brandt, J. (2010), "Transform coding for fast approximate nearest neighbor search in high dimensions", Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1815-1822.
Calonder, M., Lepetit, V., Konolige, K., Mihelich, P., Bowman, J. and Fua, P. (2009), "Compact signatures for high-speed interest point description and matching", Proc. International Conference on Computer Vision, pp. 357-364.
Davison, A., Reid, I., Molton, N.D. and Stasse, O. (2007), "MonoSLAM: real-time single camera SLAM", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29 No. 6, pp. 1052-1067.
Guan, T. and Wang, C. (2007), "Registration based on scene recognition and natural features tracking techniques for wide-area augmented reality systems", IEEE Transactions on Multimedia, Vol. 11 No. 8, pp. 1393-1406.
Irschara, A., Zach, C., Frahm, J.M. and Bischof, H. (2009), "From structure-from-motion point clouds to fast location recognition", Proc. IEEE Conference on Computer Vision and Pattern Recognition.
Jegou, H., Douze, M. and Schmid, C. (2011), "Product quantization for nearest neighbor search", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33 No. 1, pp. 117-128.
Klein, G. and Murray, D.W. (2007), "Parallel tracking and mapping for small AR workspaces", Proc. International Symposium on Mixed and Augmented Reality, pp. 225-234.
Klein, G. and Murray, D.W. (2009), "Parallel tracking and mapping on a camera phone", Proc. IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 83-86.
Ozuysal, M., Calonder, M., Lepetit, V. and Fua, P. (2010), "Fast keypoint recognition using random ferns", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32 No. 3, pp. 448-461.
Pupilli, M. and Calway, A. (2005), "Real-time camera tracking using a particle filter", Proc. British Machine Vision Conference.
Wagner, D. and Schmalstieg, D. (2009a), "Making augmented reality practical on mobile phones, part 1", IEEE Computer Graphics and Applications, Vol. 29 No. 3, pp. 12-15.
Wagner, D. and Schmalstieg, D. (2009b), "Making augmented reality practical on mobile phones, part 2", IEEE Computer Graphics and Applications, Vol. 29 No. 4, pp. 6-9.
Wagner, D., Schmalstieg, D. and Bischof, H. (2009), "Multiple target detection and tracking with guaranteed frame rates on mobile phones", Proc. IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 57-64.
Wagner, D., Mulloni, A., Langlotz, T. and Schmalstieg, D. (2010a), "Real-time panoramic mapping and tracking on mobile phones", Proc. IEEE Virtual Reality Conference, Waltham, MA, pp. 211-218.
Wagner, D., Reitmayr, G., Mulloni, A., Drummond, T. and Schmalstieg, D. (2010b), "Real-time detection and tracking for augmented reality on mobile phones", IEEE Transactions on Visualization and Computer Graphics, Vol. 16 No. 3, pp. 355-368.
Zhang, J., Long, X. and Suel, T. (2008), "Performance of compressed inverted list caching in search engines", Proc. International Conference on World Wide Web, pp. 387-396.

About the authors


Liya Duan received her BSc and MSc degrees in Automation
and Electronic Engineering from Qingdao University of
Science and Technology, Qingdao, China, in 2001 and 2008,
respectively, and she is currently a PhD student at Digital
Engineering and Simulation Centre of Huazhong University
of Science & Technology, Wuhan, China. Her current
research interests are augmented reality, computer vision
and mobile visual search.
Tao Guan received his BSc and MSc degrees in
Automation and Electronic Engineering from Qingdao
University of Science and Technology, Qingdao, China, in
2001 and 2004, respectively, and the PhD degree from the Digital
Engineering and Simulation Centre of Huazhong University
of Science & Technology, Wuhan, China, in 2008. He
is currently an Associate Professor at the School of Computer
Science & Technology of Huazhong University of
Science & Technology, Wuhan, China. His research
interests are augmented reality, mobile visual search and
high-dimensional data quantization and encoding. Tao
Guan is the corresponding author and can be contacted at:
[email protected]
Yawei Luo is currently a BSc student at School of Computer
Science & Technology of Huazhong University of Science
& Technology, Wuhan, China. His research interests are
augmented reality, mobile visual search and high-dimensional
data quantization and encoding.
