First-Person Hyper-Lapse Videos
Johannes Kopf        Michael F. Cohen        Richard Szeliski
Microsoft Research
Figure 1: Our system converts first-person videos into hyper-lapse summaries using a set of processing stages. (a) 3D camera and point cloud recovery, followed by smooth path planning; (b) 3D per-camera proxy estimation; (c) source frame selection, seam selection using an MRF, and Poisson blending.
Abstract
We present a method for converting first-person videos, for example, captured with a helmet camera during activities such as rock climbing or bicycling, into hyper-lapse videos, i.e., time-lapse videos with a smoothly moving camera. At high speed-up rates, simple frame sub-sampling coupled with existing video stabilization methods does not work, because the erratic camera shake present in first-person videos is amplified by the speed-up. Our algorithm first reconstructs the 3D input camera path as well as dense, per-frame proxy geometries. We then optimize a novel camera path for the output video that passes near the input cameras while ensuring that the virtual camera looks in directions that can be rendered well from the input. Finally, we generate the novel smoothed, time-lapse video by rendering, stitching, and blending appropriately selected source frames for each output frame. We present a number of results for challenging videos that cannot be processed using traditional techniques.
CR Categories: I.3.8 [Computer Graphics]: Applications;
Keywords: hyper-lapse, time-lapse, image-based rendering
Links: DL, Web, Video, Data, Code
Introduction
Previous Work
Overview
Scene Reconstruction
4.1 Preprocessing
4.2 Structure-from-Motion
Once these images have been reprojected, our next step is to estimate the extrinsic camera parameters of the input video frames
as well as depth maps for the input images. This problem can
be solved using structure-from-motion algorithms. We use an incremental algorithm similar to the ones described by Snavely et
al. [2006] and Wu [2013]. The algorithm estimates the location and
orientation of the input cameras. In addition, it computes a sparse
3D point cloud, where each point has an associated list of frames
where it is visible.
Figure 2: Removing redundant frames and batch processing. Places where the camera has stopped moving, e.g., at a red light, show up as blocks on the diagonal of the match graph. We detect and remove these, and partition the resulting reduced match graph into overlapping blocks for independent processing.

The algorithm starts by finding feature points in every image and matching them among all pairs. For a small percentage of frames that are strongly affected by motion blur or are completely textureless, the algorithm might not find enough feature points that can be reliably matched. We drop these frames from the reconstruction and ignore them in the subsequent stages. Next, we remove redundant frames by searching for rows and columns in the match table that have large off-diagonal components and removing these from the original set of video frames (Figure 2a–c). We then run the remainder of the incremental structure-from-motion pipeline.
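As an illustration of the redundant-frame test, the sketch below counts, for every frame, how many of its matches fall far from the match-table diagonal; frames exceeding a threshold are dropped before SfM. The band width and threshold values are illustrative assumptions, not the parameters used in our system.

```python
import numpy as np

# Hypothetical sketch: frames where the camera lingers (e.g., at a red light)
# match many temporally distant frames, which shows up as large off-diagonal
# mass in their row/column of the match table.

def find_redundant_frames(match_table, diag_band=30, min_off_diag=20):
    """match_table: (N, N) boolean array, True where two frames matched."""
    n = match_table.shape[0]
    idx = np.arange(n)
    off_diag = np.abs(idx[:, None] - idx[None, :]) > diag_band   # far from diagonal
    # count, per frame, how many matches fall well outside the temporal neighborhood
    off_diag_matches = (match_table & off_diag).sum(axis=1)
    return np.flatnonzero(off_diag_matches > min_off_diag)

# frames returned here would be dropped before running incremental SfM
```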
A difficulty with this approach is that incremental structure-from-motion algorithms do not scale well to problems as large as ours. In fact, we terminated an initial attempt to reconstruct a long video after a week of runtime. To alleviate this problem, we divide our datasets into overlapping batches of 1,400 frames each, with 400 frames of overlap, and reconstruct each batch separately in parallel (Figure 2d), which takes less than an hour per batch.
To combine the batches into a single global coordinate system, we first compute the best rigid alignment between all pairs using Horn's method [1987]. The cameras within the overlaps between batches now have two coordinates, one from each batch. We resolve the ambiguity by linearly combining them, where the blending weight moves from 0 to 1 across the overlap zone. Finally, we run one more round of bundle adjustment on the global system. After having obtained the full reconstruction, we scale it so that the average distance between two consecutive input cameras is 1. Figure 1a shows an example of the estimated camera path and sparse 3D point cloud produced by this pipeline.
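The batch-merging step can be sketched as follows. Note that Horn's method [1987] uses a closed-form quaternion solution; the sketch substitutes an equivalent SVD-based absolute-orientation solve for brevity (an assumption, not our exact formulation), and the 0-to-1 blending of the duplicated overlap cameras follows the description above.

```python
import numpy as np

def rigid_align(src, dst):
    """Find R, t minimizing sum ||R @ src_i + t - dst_i||^2.  src, dst: (N, 3)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

def merge_overlap(cams_a, cams_b_aligned):
    """Blend duplicated camera positions in the overlap, weight ramping 0 -> 1."""
    n = len(cams_a)
    w = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - w) * cams_a + w * cams_b_aligned
```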
4.3 Proxy Geometry
$$\min_{\{d(x)\}} \; \sum_{x \in V} E_{\text{depth}}(x) \tag{1}$$
Figure 3: Rendering quality term computations. (a) The Unstructured Lumigraph angular error φ_ULG measures the angle between the direction vectors ŝ and û from the new p(t) and original p_k^in camera centers and the proxy surface point c_k(x, y, t). The affine transform J can be used to compute the texture stretch φ_TS at the reference pixel (x, y) but depends on the orientation of the new (rendering) camera. (b) The view-invariant texture stretch φ_TS3 can be computed using the condition number of the 3×3 mapping M between the unit vectors ŝ_i and û_i pointing from the camera centers to the vertices v_i of the proxy triangle.
The data and smoothness terms (Eqs. 2 and 3) are balanced by λ = 1; the latter encourages a smooth solution. The solution to Eq. 1 is a set of sparse linear equations, which we solve using a standard linear least-squares solver.

At this point, we have recovered the camera positions and orientations for each frame. In addition, we have a depth map in the form of a dense rectangular grid of points for each frame (Figure 1b).
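Since the exact data and smoothness terms (Eqs. 2 and 3) are not reproduced in this text, the following sketch only illustrates the general shape of the problem: a per-frame depth grid constrained by sparse observations plus a first-order smoothness regularizer, assembled as one sparse linear least-squares system in the spirit of the "sparse linear equations" mentioned above. Function and parameter names are ours.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def densify_depth(h, w, obs, lam=1.0):
    """obs: list of (row, col, depth) from sparse 3D points projected into the frame."""
    n = h * w
    rows, cols, vals, rhs = [], [], [], []
    eq = 0
    for r, c, d in obs:                          # data term: d(x) ~= observed depth
        rows.append(eq); cols.append(r * w + c); vals.append(1.0); rhs.append(d); eq += 1
    for r in range(h):                           # horizontal smoothness: d(x) - d(x+1) ~= 0
        for c in range(w - 1):
            i = r * w + c
            rows += [eq, eq]; cols += [i, i + 1]; vals += [lam, -lam]; rhs.append(0.0); eq += 1
    for r in range(h - 1):                       # vertical smoothness
        for c in range(w):
            i = r * w + c
            rows += [eq, eq]; cols += [i, i + w]; vals += [lam, -lam]; rhs.append(0.0); eq += 1
    A = sp.csr_matrix((vals, (rows, cols)), shape=(eq, n))
    d = lsqr(A, np.asarray(rhs))[0]              # standard linear least-squares solve
    return d.reshape(h, w)
```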
Path Planning

5.1 The Objectives

Let p_k^in, R_k^in be the set of input camera positions and rotation matrices, and let p(t) and f(t) be the desired output camera's continuous position and orientation curves, respectively. f(t) is represented as a unit front vector, i.e., it has two degrees of freedom. We define the remaining right and up vectors by taking cross products with a global world up vector. We assume the field of view of the output camera is fixed.

The output path should be short and smooth in both position and orientation:

$$E_\text{length} = \int \left\| p'(t) \right\|^2 dt, \tag{4}$$

$$E_\text{smooth-p} = \int \left\| p''(t) \right\|^2 dt, \tag{5}$$

$$E_\text{smooth-f} = \int \left\| f''(t) \right\|^2 dt. \tag{6}$$

For the approximation requirement, we seek to minimize the distance of every input camera to the closest point on the path:

$$E_\text{approx} = \sum_k \min_t \left\| p^\text{in}_k - p(t) \right\|^2. \tag{7}$$

Finally, the output cameras should look in directions that can be rendered well from the input. For every output pixel (x, y) at time t, let φ_k(x, y, t) denote the per-pixel penalty for rendering that pixel from input camera k via its proxy surface point c_k(x, y, t). Taking the best available input at every pixel and integrating over the image gives the per-frame penalty

$$\Phi(t) = \iint \min_k \, \phi_k(x, y, t) \; dx \, dy. \tag{8}$$

We now integrate this penalty over the length of the curve to get our rendering quality penalty term:

$$E_\text{quality} = \int \Phi(t) \, dt. \tag{9}$$

How should φ be defined? Many possible choices have been discussed in the literature. In the Unstructured Lumigraph, Buehler et al. [2001] provide an overview and propose using the angular error, i.e.,

$$\phi_\text{ULG} = \cos^{-1}\!\left( \hat{s} \cdot \hat{u} \right), \tag{10}$$

where

$$\hat{s} = \frac{c_k(x, y, t) - p^\text{in}_k}{\left\| c_k(x, y, t) - p^\text{in}_k \right\|}, \qquad \hat{u} = \frac{c_k(x, y, t) - p(t)}{\left\| c_k(x, y, t) - p(t) \right\|} \tag{11}$$

are the unit directions from the input and output camera centers to the proxy surface point (Figure 3a). Another useful measure is the texture stretch induced by reprojecting the proxy into the output view,

$$\phi_\text{TS} = 1 - \frac{\min_i \sigma_i(J)}{\max_i \sigma_i(J)}, \tag{12}$$

where σ_i(J) are the singular values of the Jacobian of the texture coordinates,

$$J = \begin{bmatrix} \dfrac{\partial u}{\partial x} & \dfrac{\partial u}{\partial y} \\[1ex] \dfrac{\partial v}{\partial x} & \dfrac{\partial v}{\partial y} \end{bmatrix}. \tag{13}$$

This measure, however, depends on the orientation of the output (rendering) camera. To obtain a view-invariant alternative, we consider for every proxy triangle with vertices v_i the unit direction vectors

$$\hat{s}_i = \frac{v_i - p^\text{in}_k}{\left\| v_i - p^\text{in}_k \right\|}, \qquad \hat{u}_i = \frac{v_i - p(t)}{\left\| v_i - p(t) \right\|} \tag{14}$$

from the input and output camera centers to the vertices, and the 3×3 mapping M between them (Figure 3b). The view-invariant texture stretch is then based on the condition number of M (illustrated in a short sketch at the end of this subsection):

$$\phi_\text{TS3} = 1 - \frac{\min_i \sigma_i(M)}{\max_i \sigma_i(M)}. \tag{15}$$

The full path objective combines all of the terms above,

$$E = \lambda_1 E_\text{length} + \lambda_2 E_\text{smooth-p} + \lambda_3 E_\text{smooth-f} + \lambda_4 E_\text{approx} + \lambda_5 E_\text{quality}, \tag{16}$$

where λ1 = 100, λ2 = 100, λ3 = 1000, λ4 = 0.1, λ5 = 0.01 are balancing coefficients (recall that the scale of the reconstructed scene is normalized).
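The sketch below evaluates φ_TS3 for a single proxy triangle. Constructing M as the linear map taking the ŝ_i to the û_i follows the description in Figure 3b; the function name and the "1 − min σ / max σ" form mirror the reconstruction of Eq. 15 above and should be read as an assumption, not our exact code.

```python
import numpy as np

def phi_ts3(tri_vertices, p_in, p_out):
    """tri_vertices: (3, 3) proxy triangle vertices; p_in, p_out: camera centers."""
    s = np.asarray(tri_vertices, float) - p_in     # directions from the input camera
    u = np.asarray(tri_vertices, float) - p_out    # directions from the output camera
    s /= np.linalg.norm(s, axis=1, keepdims=True)
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    # M maps the s_i onto the u_i: M @ s_i = u_i for i = 1..3
    # (the condition number is the same if the mapping is defined the other way around)
    M = u.T @ np.linalg.inv(s.T)
    sigma = np.linalg.svd(M, compute_uv=False)
    return 1.0 - sigma.min() / sigma.max()
```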
5.2 Optimization Strategy
First, we optimize the location p(t) of the path while ignoring the
energy terms that depend on the orientation f(t). While this reduced objective is still nonlinear, it can be efficiently optimized by
iteratively solving sparse linear subproblems, as described in Section 5.3.
Next, we optimize the orientation f(t) of the output cameras while
keeping the previously computed position curve p(t) fixed. This
strategy dramatically improves efficiency because we designed our
proxy penalty function φ_TS3 to be rotation invariant. This enables
us to precompute the min expression in Equation 8 once for all
directions. We describe this part of the optimization in Section 5.5.
5.3
In the first stage, our goal is to optimize the path location curve p(t) by minimizing the objectives E_length, E_smooth-p, and E_approx, which do not depend on the orientation. We represent p(t) as a cubic B-spline curve, with the number of control vertices set to 5% of the number of input frames. The reduced objective now becomes

$$E_\text{location} = \lambda_4 \sum_k \left\| p^\text{in}_k - p(t_k) \right\|^2 + \lambda_1 \int \left\| p'(t) \right\|^2 dt + \lambda_2 \int \left\| p''(t) \right\|^2 dt, \tag{17}$$

where $t_k = \arg\min_t \left\| p^\text{in}_k - p(t) \right\|$ is the parameter of the closest curve point to camera k. This is a standard spline fitting problem, which is frequently encountered in the literature.

While this is a non-linear objective, it can be efficiently solved using an iterative strategy. Note that the two integral terms in Eq. 17 have simple quadratic closed-form expressions for cubic B-splines. The solution idea is therefore to fix the t_k during one iteration, which turns Eq. 17 into a quadratic problem that can be optimized by solving a sparse linear set of equations. The overall strategy is then to alternately optimize Eq. 17 and update the t_k. Figure 5 shows a sample mapping of input frames to their corresponding t_k values, while Figure 1a shows an example of a smoothed 3D camera path.
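The alternation between footpoint assignment and a sparse linear solve can be sketched as follows. This is an illustrative implementation under simplifying assumptions: the integral terms of Eq. 17 are approximated by dense parameter sampling rather than by their closed-form expressions, the footpoints t_k are updated by a nearest-sample search, and the function and parameter names (fit_path, n_iters, etc.) are ours.

```python
import numpy as np
from scipy.interpolate import BSpline

def fit_path(cam_pos, n_ctrl=None, lam_length=100.0, lam_smooth=100.0,
             lam_approx=0.1, n_iters=10):
    """cam_pos: (N, 3) input camera positions. Returns a cubic B-spline p(t), t in [0, 1]."""
    N = len(cam_pos)
    n_ctrl = n_ctrl or max(4, N // 20)                 # ~5% of the input frames
    k = 3                                              # cubic
    knots = np.concatenate(([0.0] * k, np.linspace(0, 1, n_ctrl - k + 1), [1.0] * k))
    t_k = np.linspace(0, 1, N)                         # initial footpoint parameters
    t_dense = np.linspace(0, 1, 4 * n_ctrl)            # quadrature samples for the integrals

    def basis_matrix(ts, deriv=0):
        # evaluate every control point's basis function (or its derivative) at ts
        B = np.empty((len(ts), n_ctrl))
        for j in range(n_ctrl):
            c = np.zeros(n_ctrl); c[j] = 1.0
            spl = BSpline(knots, c, k)
            B[:, j] = (spl if deriv == 0 else spl.derivative(deriv))(ts)
        return B

    for _ in range(n_iters):
        A = basis_matrix(t_k)                          # approximation (data) term
        D1, D2 = basis_matrix(t_dense, 1), basis_matrix(t_dense, 2)
        dt = 1.0 / len(t_dense)
        # normal equations of Eq. 17 with the integrals replaced by Riemann sums
        lhs = (lam_approx * A.T @ A
               + lam_length * dt * D1.T @ D1
               + lam_smooth * dt * D2.T @ D2)
        rhs = lam_approx * A.T @ cam_pos
        ctrl = np.linalg.solve(lhs, rhs)               # (n_ctrl, 3) control vertices
        # update footpoints t_k: nearest dense curve sample to each input camera
        p_dense = basis_matrix(t_dense) @ ctrl
        d = np.linalg.norm(cam_pos[:, None, :] - p_dense[None, :, :], axis=2)
        t_k = t_dense[d.argmin(axis=1)]
    return BSpline(knots, ctrl, k), t_k
```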
(Figure: plot of the per-pixel orientation penalty φ_i(f).)
5.4
5.5
Next, we compute the image integrals for the Φ_i(f) IBR fitness terms using these precomputed quantities:

$$\Phi_i(f) = \iint_{I(f)} \phi_i(f') \, df', \tag{20}$$

where I(f) indicates the set of rays f' that are in image I, and again store the results in cube maps (Figure 6).
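The following rough sketch illustrates the image-integral idea of Eq. 20 under strong simplifications: the cube map is stood in for by an arbitrary set of sampled unit ray directions, and the output camera's field of view is approximated as circular. Names and parameters (integrate_penalty, half_fov_deg) are hypothetical.

```python
import numpy as np

def integrate_penalty(dirs, phi, f, half_fov_deg=40.0):
    """dirs: (N, 3) unit ray directions; phi: (N,) per-ray penalties; f: (3,) unit view direction."""
    cos_thresh = np.cos(np.radians(half_fov_deg))
    inside = dirs @ f > cos_thresh            # rays within the (circular) field of view
    return phi[inside].sum()

# Evaluating integrate_penalty on a grid of candidate directions f and caching the
# results corresponds to storing Phi_i in a cube map, as described above.
```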
Rendering
Our final stage is to render images for the novel camera positions
and orientations we discussed in the previous section. We use a
greedy algorithm to select a subset of source images from which to
assemble each output frame (Section 6.1). Then, we render each
of the selected source images and stitch the results together using a
Markov random field (MRF) and reconstruct the final images in the
gradient domain (Section 6.2).
6.1

Figure 7: Source frame selection and stitching. Left: source frame selection (panels "Source 1 of 5" through "Source 5 of 5") to ensure that all pixels are adequately covered with high-quality inputs; Right: Markov random field source pixel selection ("MRF stitched"). (In practice, a spatio-temporal MRF is used; see Figure 1c.)

In the greedy selection criterion (Eq. 23), a_n is an accumulation buffer that contains the previously selected best value for every pixel:

$$a_n(x, y) = \max_{m < n} \, w_{s_m}(x, y). \tag{24}$$

$$\frac{w_k \, b_k}{\max_l b_l}. \tag{25}$$
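A hedged sketch of the greedy selection loop follows. The accumulation-buffer update implements Eq. 24; the per-source improvement score (sum of positive per-pixel weight gains) is an assumption standing in for the selection criterion of Eq. 23, which is not reproduced in this text.

```python
import numpy as np

def select_sources(weights, n_select=5):
    """weights: (K, H, W) per-pixel rendering-quality weights for K candidate sources."""
    acc = np.zeros(weights.shape[1:])        # a_n: best selected weight per pixel so far
    chosen = []
    for _ in range(n_select):
        # assumed score: how much each remaining source would improve the buffer
        gains = np.maximum(weights - acc, 0.0).sum(axis=(1, 2))
        gains[chosen] = -np.inf              # do not pick a source twice
        best = int(np.argmax(gains))
        chosen.append(best)
        acc = np.maximum(acc, weights[best]) # update the accumulation buffer (Eq. 24)
    return chosen, acc
```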
6.2 Fusion
Our last remaining task is to stitch and blend the previously selected source frames together to obtain the final output frames. We first optimize a discrete pixel labeling, where every (space-time) pixel p in the output video chooses its label α_p from one of the rendered source proxies that have been selected for that particular frame [Agarwala et al. 2004]. We define the objective

$$\min_{\{\alpha_p\}} \; \sum_p E_d(p, \alpha_p) \;+\; \lambda_\text{s-s} \sum_{p,\, q \in N(p)} E_s(p, q, \alpha_p, \alpha_q) \;+\; \lambda_\text{s-t} \sum_{p,\, q \in T(p)} E_s(p, q, \alpha_p, \alpha_q), \tag{26}$$

where the data term E_d(p, α_p) = 1 − w_{α_p}(p) encourages selecting high-quality pixels, the smoothness terms E_s, defined below, encourage invisible stitch seams, λ_s-s = 10, and λ_s-t = 0.1. N(p) denotes the set of 4-neighbors within the same output frame, and T(p) denotes the two temporal neighbors in the previous and next frames (which generally will not lie at the same pixel coordinates, and which we obtain by computing the medoid of the motion vectors of all candidate proxies at the given pixel).
Our smoothness terms are defined following previous work [Agarwala et al. 2004]:

$$E_s(p, q, \alpha_p, \alpha_q) = \left\| t_{\alpha_p}(p) - t_{\alpha_q}(p) \right\| + \left\| t_{\alpha_p}(q) - t_{\alpha_q}(q) \right\|, \tag{27}$$

where t_{α_p}(p) denotes the RGB value of the rendered proxy α_p at pixel p. We solve Eq. 26 in a greedy fashion by successively optimizing single frames while keeping the previously optimized frames fixed. Each frame's labels are optimized using the alpha expansion algorithm [Kolmogorov and Zabih 2004] in a coarse-to-fine manner [Lombaert et al. 2005].
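To make the stitching objective concrete, the sketch below evaluates the data term and the spatial seam term of Eqs. 26 and 27 for one output frame; the temporal term and the alpha-expansion solver itself are omitted. Array shapes and names are ours.

```python
import numpy as np

def stitch_energy(t, w, labels, lam_ss=10.0):
    """t: (K, H, W, 3) rendered proxy colors for the K selected sources,
    w: (K, H, W) per-pixel quality weights, labels: (H, W) integer labels alpha_p."""
    H, W = labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    data = (1.0 - w[labels, ys, xs]).sum()                 # data term: E_d = 1 - w_alpha(p)
    seam = 0.0
    for dy, dx in ((0, 1), (1, 0)):                        # right and down 4-neighbors
        pY, pX = ys[:H - dy, :W - dx], xs[:H - dy, :W - dx]
        qY, qX = pY + dy, pX + dx
        ap, aq = labels[pY, pX], labels[qY, qX]
        diff_p = np.abs(t[ap, pY, pX] - t[aq, pY, pX]).sum(-1)   # |t_ap(p) - t_aq(p)|
        diff_q = np.abs(t[ap, qY, qX] - t[aq, qY, qX]).sum(-1)   # |t_ap(q) - t_aq(q)|
        seam += (diff_p + diff_q).sum()
    return data + lam_ss * seam
```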
The optimized labeling hides visible seams as much as possible. However, there might still be significant color differences because of exposure and white balancing changes in the source frames.
Figure 9: Sample rendered output frames from our six test videos (Bike 1, Bike 2, Bike 3, Climbing, Scrambling, Walking).
Table 1: Test sequences, with input duration, input frame count, and number of output frames.

Name          Duration (mm:ss)   Input Frames   Output Frames
Bike 1              6:00            10,787           1111
Bike 2              3:55             7,050            733
Bike 3             13:11            23,700           2189
Scrambling          9:07            16,400           1508
Climbing            3:37             6,495            636
Walking             9:36            17,250           1744
We therefore reconstruct the final output frames r in the gradient domain, minimizing

$$\min_r \; \sum_p \Big[ \lambda_\text{b-d} \left\| r(p) - t_{\alpha_p}(p) \right\|^2 + \lambda_\text{b-s} \Big( \left\| \nabla_{\!x} r(p) - \nabla_{\!x} t_{\alpha_p}(p) \right\|^2 + \left\| \nabla_{\!y} r(p) - \nabla_{\!y} t_{\alpha_p}(p) \right\|^2 \Big) + \lambda_\text{b-t} \left\| \nabla_{\!i} r(p) - \nabla_{\!i} t_{\alpha_p}(p) \right\|^2 \Big], \tag{28}$$

where ∇_x, ∇_y, and ∇_i denote the horizontal, vertical, and temporal finite forward difference operators, respectively, and λ_b-d = 0.001, λ_b-s = 1, and λ_b-t = 0.01 are balancing coefficients. We solve Eq. 28 in a reduced domain using multi-splines with a spacing of 32 pixels [Szeliski et al. 2011] and a standard conjugate gradients solver [Shewchuk 1994].
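A simplified single-frame, single-channel version of the gradient-domain blend in Eq. 28 can be written as a screened Poisson solve. The multi-spline reduced domain and the temporal term from the text are omitted, and the gradient targets are assumed to be supplied per pixel from the individually selected proxies; this is an illustrative stand-in, not our implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def blend_channel(target_color, target_gx, target_gy, lam_d=0.001, lam_s=1.0):
    """target_color: (H, W) stitched colors; target_gx/gy: (H, W) target forward-difference
    gradients taken from the individual selected proxies (so seams are not baked in)."""
    H, W = target_color.shape
    n = H * W

    def diff_op(vertical):
        # forward-difference operator with zero rows at the far border
        d = sp.lil_matrix((n, n))
        for r in range(H - 1 if vertical else H):
            for c in range(W if vertical else W - 1):
                i = r * W + c
                j = i + W if vertical else i + 1
                d[i, i] = -1.0
                d[i, j] = 1.0
        return d.tocsr()

    Dx, Dy = diff_op(False), diff_op(True)
    I = sp.identity(n, format="csr")
    A = lam_d * I + lam_s * (Dx.T @ Dx + Dy.T @ Dy)        # screened Poisson system
    rhs = (lam_d * target_color.ravel()
           + lam_s * (Dx.T @ target_gx.ravel() + Dy.T @ target_gy.ravel()))
    x, _ = cg(A, rhs)                                      # conjugate gradients solve
    return x.reshape(H, W)
```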
We present our experimental results on six video sequences acquired with GoPro Hero2 and Hero3 cameras. All of the videos were taken in a single shot and captured at 29.97 frames per second with a resolution of 1280×960 pixels. Table 1 lists the sequences, which are described by their activities and clip numbers, their input duration and frame count, and the number of output frames. The videos are between 3 and 13 minutes in length.
Table 2: Informal summary of computation times per stage.

Stage                          Computation time
Match graph (kd-tree)          10-20 minutes
Initial SfM reconstruction     1 hour (for a single batch)
Densification                  1 hour (whole dataset)
Path optimization              a few seconds
IBR term precomputation        1-2 minutes
Orientation optimization       a few seconds
Source selection               1 min/frame (95% spent in GMM)
MRF stitching                  1 hour
Poisson blending               15 minutes
Figure 10: Simple smoothing algorithms cannot produce satisfactory camera paths. The original camera path is shown in blue, our path in red, and the alternative path in black. Left: low-pass filtering the input camera positions produces a path that is over-smoothed in some parts while still containing sudden turns in others. Middle: Taubin's method [Taubin et al. 1996] runs into numerical problems. Right: the pull-back term in Snavely et al.'s algorithm [2008] prevents reaching a smooth configuration.
In this work, we were more interested in building a proof-of-concept system than in optimizing for performance. As a result, our current implementation is slow. It is difficult to measure exact timings, as we distributed parts of the SfM reconstruction and the source selection to multiple machines on a cluster. Table 2 lists an informal summary of our computation times. We expect that substantial speedups are possible by replacing the incremental SfM reconstruction with a real-time SLAM system [Klein and Murray 2009] and by finding a faster heuristic for preventing the selection of occluded scene parts during source selection. We leave these speed-ups to future work.
7.1 Evaluating Alternatives
Conclusions
References
Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin, D., and Cohen, M. 2004. Interactive digital photomontage. ACM Trans. Graph. 23, 3, 294–302.

Beardsley, P., Torr, P., and Zisserman, A. 1996. 3D model acquisition from extended image sequences. In Fourth European Conference on Computer Vision (ECCV '96), Springer-Verlag, vol. 2, 683–695.

Buehler, C., Bosse, M., McMillan, L., Gortler, S., and Cohen, M. 2001. Unstructured lumigraph rendering. Proceedings of SIGGRAPH 2001, 425–432.

Chen, F., Cooper, M., and Adcock, J. 2007. Video summarization preserving dynamic content. In Proceedings of the International Workshop on TRECVID Video Summarization (TVS '07), ACM, New York, NY, USA, 40–44.

Crandall, D., Owens, A., Snavely, N., and Huttenlocher, D. 2013. SfM with MRFs: Discrete-continuous optimization for large-scale reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12, 2841–2853.

Detyniecki, M., and Marsala, C. 2008. Adaptive acceleration and shot stacking for video rushes summarization. In Proceedings of the 2nd ACM TRECVid Video Summarization Workshop (TVS '08), ACM, New York, NY, USA, 109–113.

Furukawa, Y., and Ponce, J. 2010. Accurate, dense, and robust multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 8, 1362–1376.

Grundmann, M., Kwatra, V., and Essa, I. 2011. Auto-directed video stabilization with robust L1 optimal camera paths. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Hastie, T., Tibshirani, R., Friedman, J., and Franklin, J. 2005. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27, 2, 83–85.

Horn, B. K. P. 1987. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A 4, 4, 629–642.

Klein, G., and Murray, D. 2009. Parallel tracking and mapping on a camera phone. International Symposium on Mixed and Augmented Reality (ISMAR 2009).