Field
Gunnar Farnebäck
Computer Vision Laboratory, Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
[email protected]

Abstract
In [10] we presented a new velocity estimation algorithm, using orientation tensors and parametric motion models to provide both fast and accurate results. One of the tradeoffs between accuracy and speed was that no attempt was made to obtain regions of coherent motion when estimating the parametric models. In this paper we show how this can be improved by doing a simultaneous segmentation of the motion field. The resulting algorithm is slower than the previous one, but more accurate. This is shown by evaluation on the well-known Yosemite sequence, where already the previous algorithm showed an accuracy substantially better than that of earlier published methods. This result has now been improved further.

The model parameters can in a very straightforward manner be computed from the orientation tensors in the region. The weak point of the previous paragraph is that we would want the region to contain a coherent motion (with respect to the affine motion model). In particular we do not want it to span discontinuities in the motion field, since this would clearly degrade the estimated velocity field. On the other hand, in order to segment out regions of coherent motion, we would first need to know the velocities. The solution to this problem is to do a simultaneous segmentation and velocity estimation. This is accomplished with the help of a region growing based segmentation algorithm.

Spatiotemporal filtering and parametric motion models, in particular affine motion, are today standard components of motion estimation algorithms. The use of orientation tensors is, however, less common. The basic relations between 3D orientation tensors and motion have been explored in e.g. [4, 15, 14, 25, 13]. A more sophisticated tensor based algorithm has been presented by Karlholm [19]. A survey of some other tensor based approaches can be found in [17]. The idea of doing simultaneous segmentation and velocity estimation is also well-known; see e.g. [7] for an overview.
The first two components of this algorithm are shared with the algorithm presented in [10]. More details on the orientation tensor estimation can be found in [11]. An early prototype of the segmentation algorithm used here, with an emphasis on segmentation rather than velocity estimation, can be found in [8]. More details on all aspects of the algorithm presented here can be found in [9].
1. Introduction
The algorithm presented in this paper has three distinct components: estimation of orientation tensors, estimation of parametric motion models, and simultaneous segmentation of the motion field. The estimation of orientation tensors involves spatiotemporal filtering of the volume obtained by stacking the frames of an image sequence onto each other. The filter responses are combined pointwise into 3D orientation tensors, which give a powerful representation of the oriented structures in the volume, and correspondingly of the local constraints on the motion that can be inferred from the intensity variations in the sequence. In particular, this representation is well suited to deal with the aperture problem. To improve the robustness of the velocity estimation, it is assumed that the motion locally conforms to a parametric motion model. In this paper we only discuss the affine model, but the framework works for a larger class of motion models.
2. Preliminaries
Before we can discuss the segmentation part of the algorithm, we need to recapitulate the algorithm presented in [10]. Obviously this will have to be done briefly; for further details we refer to [10].
The parameters $A$, $\mathbf{b}$, and $c$ of a local second degree polynomial model of the signal are computed by a Gaussian weighted least squares approximation. This can be implemented very efficiently by a hierarchical scheme of separable convolutions. From the model parameters, the orientation tensor is constructed by

$$\mathbf{T} = A A^T + \gamma \mathbf{b}\mathbf{b}^T, \quad (2)$$

where $\gamma$ is a non-negative weight factor between the even and the odd parts of the signal. As a further preprocessing step we compute the isotropy compensated tensor

$$\mathbf{T}' = \mathbf{T} - \lambda_{\min} \mathbf{I}, \quad (3)$$

where $\lambda_{\min}$ is the smallest eigenvalue of $\mathbf{T}$.

The orientation tensor computed above has the useful property that for a translating neighborhood it will satisfy the relation $\mathbf{v}^T \mathbf{T} \mathbf{v} = 0$, where $\mathbf{v} = (v_x\ v_y\ 1)^T$ is the spatiotemporal velocity vector (4). In the presence of the aperture problem, all velocity vectors with the correct normal velocity component will satisfy the relation. In principle we could now compute velocity vectors from the orientation tensors point for point. This would, however, have the drawback of yielding rather noisy estimates, and of course we would still have the usual problems with the aperture problem.

Instead we assume that we have a region where the motion is coherent with respect to the affine motion model,

$$v_x(x, y) = ax + by + c, \qquad v_y(x, y) = dx + ey + f, \quad (5)$$

where $x$ and $y$ are image coordinates. This can be rewritten in terms of the spatiotemporal vector (4) as

$$\mathbf{v} = S \mathbf{p}, \quad (6)$$

where

$$S = \begin{pmatrix} x & y & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & x & y & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \quad (7)$$

and

$$\mathbf{p} = (a\ b\ c\ d\ e\ f\ 1)^T. \quad (8)$$

The parameters of the model can be estimated directly from the orientation tensors in the region by using $\mathbf{v}^T \mathbf{T} \mathbf{v}$ as a cost function. Summing over the region and applying the motion model we get

$$d_{tot}(\mathbf{p}) = \sum_i \mathbf{v}_i^T \mathbf{T}_i \mathbf{v}_i = \sum_i \mathbf{p}^T S_i^T \mathbf{T}_i S_i \mathbf{p} = \mathbf{p}^T Q_{tot} \mathbf{p}, \quad (9)$$

where

$$Q_{tot} = \sum_i Q_i, \qquad Q_i = S_i^T \mathbf{T}_i S_i. \quad (10)$$

We obtain the motion model parameters by minimizing (9) under the constraint that the last element of $\mathbf{p}$ be 1. In order to do this we partition $\mathbf{p}$ and $Q_{tot}$ as

$$\mathbf{p} = \begin{pmatrix} \bar{\mathbf{p}} \\ 1 \end{pmatrix}, \qquad Q_{tot} = \begin{pmatrix} \bar{Q} & \bar{\mathbf{q}} \\ \bar{\mathbf{q}}^T & \alpha \end{pmatrix}, \quad (11)$$

turning (9) into

$$d_{tot}(\bar{\mathbf{p}}) = \bar{\mathbf{p}}^T \bar{Q} \bar{\mathbf{p}} + \bar{\mathbf{p}}^T \bar{\mathbf{q}} + \bar{\mathbf{q}}^T \bar{\mathbf{p}} + \alpha, \quad (12)$$

which is minimized by

$$\bar{\mathbf{p}} = -\bar{Q}^{-1} \bar{\mathbf{q}}. \quad (13)$$

Compared to first estimating the velocity vectors on a point by point basis from the orientation tensors and then fitting them to the motion model, this approach has two important advantages. The first advantage is that the effects of noise and inaccuracies in the tensor estimation are reduced significantly. The second advantage is that even if the aperture problem is present in some part of the region, information obtained from other parts can help to fill in the missing velocity component. There does remain a possibility that the motion field cannot be uniquely determined, but that requires the signal structures over the whole region to be oriented in such a way that the motion becomes ambiguous; a generalized aperture problem. Additionally, this approach inherently incorporates all the information we have at each point, whether it is a full velocity vector or only the normal component.
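As a concrete illustration, the parameter estimation of equations (9)–(13) can be sketched in a few lines of NumPy. The tensors used below are synthetic stand-ins, constructed so that the true velocity lies in their null space ($\mathbf{v}^T \mathbf{T} \mathbf{v} = 0$); in the actual algorithm they come from the spatiotemporal filtering step. The function names are illustrative, not from the paper's implementation.

```python
import numpy as np

def affine_S(x, y):
    # S(x, y) from equation (7): v = S p with p = (a, b, c, d, e, f, 1)^T.
    return np.array([[x, y, 1., 0., 0., 0., 0.],
                     [0., 0., 0., x, y, 1., 0.],
                     [0., 0., 0., 0., 0., 0., 1.]])

def estimate_affine(points, tensors):
    # Accumulate Q_tot = sum_i S_i^T T_i S_i (equation 10) over the region.
    Q = np.zeros((7, 7))
    for (x, y), T in zip(points, tensors):
        S = affine_S(x, y)
        Q += S.T @ T @ S
    # Partition Q_tot as in (11) and minimize under p[6] = 1 (equation 13):
    # p_bar = -Qbar^{-1} q_bar.
    Q_bar, q_bar = Q[:6, :6], Q[:6, 6]
    return np.linalg.solve(Q_bar, -q_bar)  # returns (a, b, c, d, e, f)
```

Note that the solve involves only a fixed-size 6x6 system regardless of the region size; the cost of the estimation is dominated by the accumulation of $Q_{tot}$.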
[Figure 1. Illustration of the competitive algorithm: two regions, with the values of the cost function C_R(x) shown around each region and the border values underlined.]
An illustration of this algorithm is provided in figure 1. We have two regions, one to the left marked with vertical bars and one to the right marked with horizontal bars. Around each region are shown the values of the cost function C_R(x) for the respective region. The underlined values belong to each region's border, assuming four-connectivity. The minimum cost C_min(R) is 2 for the left region and 3 for the right region. Hence the circled pixel is added to the left region in this iteration. In the next iteration, the pixel with an underlined 3 will be added to the right region, because the minimum cost for the left region increases to 4 after the addition of the circled pixel.
For each region R, the least expensive pixel in its border is the candidate

$$N(R) = \arg\min_{\mathbf{x} \in \partial R} C_R(\mathbf{x}). \quad (14)$$
The corresponding minimum cost for adding the candidate to the region is denoted C_min(R). In the case of an empty border, N(R) is undefined and C_min(R) is infinite. Assuming that a number of regions {R_n} in some way have been obtained, the rest of the image is partitioned as follows.

1. Find the region R_i for which the cost to add a new pixel is the least, i.e. let i = arg min_n C_min(R_n).
2. Add the least expensive pixel N(R_i) to R_i.
3. Repeat the first two steps until no unassigned pixels remain.

Notice that it does not matter what the actual values of the cost functions are; it is only relevant which of them is smallest. Hence the algorithm is called competitive.
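A minimal sketch of this competitive partitioning, assuming fixed per-region cost maps (in the paper the cost C_R(x) depends on each region's current motion model, so the maps would be recomputed as regions evolve; all names here are illustrative):

```python
import numpy as np

def competitive_growing(cost_maps, seeds, shape):
    # cost_maps[n][x]: cost C_Rn(x) of adding pixel x to region n.
    # seeds: one starting pixel per region. Unassigned pixels carry label -1.
    label = -np.ones(shape, dtype=int)
    for n, s in enumerate(seeds):
        label[s] = n

    def border(n):
        # Unassigned 4-connected neighbours of region n.
        pts = set()
        for i, j in zip(*np.nonzero(label == n)):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                p = (i + di, j + dj)
                if (0 <= p[0] < shape[0] and 0 <= p[1] < shape[1]
                        and label[p] == -1):
                    pts.add(p)
        return pts

    while (label == -1).any():
        # Pick the globally least expensive (region, pixel) pair.
        best = None
        for n in range(len(seeds)):
            for p in border(n):
                c = cost_maps[n][p]
                if best is None or c < best[0]:
                    best = (c, n, p)
        if best is None:  # remaining pixels unreachable
            break
        label[best[2]] = best[1]
    return label
```

This naive version recomputes all borders in every iteration; an efficient implementation would maintain the border pixels incrementally, e.g. in a priority queue per region.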
until the candidate region has grown to a specified size. This size is called the candidate region size, m0, and is a design parameter of the segmentation algorithm. The effect of the regrowing procedure is that the candidate region now consists of the m0 connected pixels, starting from a fixed point, that are most consistent with the candidate region's motion model. When the candidate region has been regrown, new optimal parameters are computed. Each candidate region is regrown twice, a number which seems to be sufficient to obtain reasonably coherent regions. The choice of m0 is a somewhat complex tradeoff: a small value yields higher speed and the possibility to detect smaller coherent regions, while a higher value gives more robust parameter estimation and reduces tendencies to oversegment.
If a candidate region cannot be regrown to its full size m0, it is removed. The same thing happens when a candidate region's starting point becomes occupied by a real region. The cost of the most expensive included pixel is called the maximum cost of the candidate region.

2. Find the candidate region with the least maximum cost. This is the aspirant for inclusion among the real regions.
3. As in the competitive algorithm, find the least expensive pixel that may be added to one of the already existing real regions.
4. Compare the least maximum cost from step 2 with the cost of the least expensive pixel in step 3.
   (a) If the least maximum cost is smallest, raise the corresponding candidate region to the status of a real region.
   (b) Otherwise, add the least expensive pixel to the corresponding region.

In the first iteration there are no real regions yet, so the first thing that happens is that the best candidate region is transformed into the first real region. When a candidate region is converted into a real region in step 4a, all its m0 pixels are immediately included in the real region.

To see how the segmentation algorithm works, one frame of the Yosemite sequence, figures 2 and 3, has been segmented. In figure 4 we can see how the regions develop and how new regions are successively added.
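One step of the arbitration between candidate regions and real regions (steps 2–4 above) can be sketched as a small decision function; the tuple representation of regions is purely illustrative, and the comparison here is unweighted:

```python
def next_action(candidates, regions):
    # candidates: (max_cost, id) pairs for fully regrown candidate regions.
    # regions:    (border_cost, id) pairs, where border_cost is the cost of
    #             the least expensive pixel addable to that real region.
    # Returns ('new_region', id) to promote a candidate (step 4a) or
    # ('grow', id) to extend an existing real region (step 4b).
    best_cand = min(candidates) if candidates else None
    best_grow = min(regions) if regions else None
    if best_cand is None and best_grow is None:
        raise ValueError("nothing left to do")
    if best_grow is None or (best_cand is not None
                             and best_cand[0] < best_grow[0]):
        return ('new_region', best_cand[1])
    return ('grow', best_grow[1])
```

With no real regions yet, the function always promotes the best candidate, matching the behavior of the first iteration described above.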
While the comparison in step 4 can be made directly between the given values, it is beneficial to introduce a design parameter λ, with which the least maximum cost is multiplied before the comparison is made. The effect of λ is that for a large value, new regions are added only if it would be very expensive to enlarge the existing ones. This may be desired e.g. if the segmentation is intended for a video coding application, where excessive fragmentation into many small regions can be costly. A small value means that existing regions are enlarged only if there are pixels available that are very consistent with the motion models, which we in this paper prefer since we are more interested in the velocity field than in the segmentation itself.

The regrowing of candidate regions in step 1 of the algorithm may seem prohibitively computationally expensive. In practice, though, a reasonable shortcut is to assume that the maximum cost always increases when a candidate region has to be regrown. Therefore it is sufficient to regrow candidate regions only when the least maximum cost is smaller than the cost of the least expensive pixel, and also only a few of the top candidate regions need to be regrown.

An algorithm with some basic elements in common with the competitive algorithm can be found in [1], applied to grayscale segmentation. The algorithm presented here was developed later but independently.

4. Evaluation

The algorithm has been implemented in Matlab, with certain parts in the form of C mex files. Typical running times for the algorithm on a 360 MHz SUN Ultra 60, computing the velocity for one frame of the 252 × 316 Yosemite sequence, are in the order of a minute. As a comparison, the faster algorithm presented in [10] takes about 16 seconds using the affine motion model and about 3.5 seconds using the simpler constant motion model. Source code for the implementation is available at http://www.isy.liu.se/gf.

The algorithm has been evaluated on a commonly used test sequence with known velocity field, Lynn Quam's Yosemite sequence [16], figures 2 and 3. This synthetic sequence was generated with the help of a digital terrain map and therefore has a motion field with depth variation and discontinuities at occlusion boundaries. The accuracy of the velocity estimates has been measured using the average spatiotemporal angular error, $\arccos(\hat{\mathbf{v}}_{\mathrm{est}}^T \hat{\mathbf{v}}_{\mathrm{true}})$ [3]. The sky region is excluded from the error analysis because the variations in the cloud textures induce an image flow that is quite different from the ground truth values computed solely from the camera motion.

The orientation tensors have been computed using spatiotemporal filters of effective size 9 × 9 × 9 and a standard deviation σ = 1.4 for the Gaussian weighting function. The tensor computation also involves the parameter γ, equation (2), which has been set to 1/8. With the design parameters for the segmentation algorithm, λ and m0, set to 0.06 and 500 respectively, we obtain an average angular error of 1.30° with standard deviation 2.29°.

An interesting question is how the design parameters affect the results. This has been studied in detail in [9]. It turns out that the result is not very sensitive to changes in the γ and λ values. The value of σ is more critical, but this is predictable since it controls the scale of the structures for which the orientation is estimated. The dependence on m0 is more troublesome, since the results fluctuate quite unpredictably between 1.25° and 1.45° when m0 is varied, as shown in figure 5. While the sensitivity to the value of m0 is disturbing, it points out a possible strategy for improving the algorithm.

By computing the velocity for a number of different m0 values and averaging the results, we can reduce the overall errors and stabilize the estimates. This does of course come at the cost of more computations, but since we in this paper have settled for accuracy being more important than speed, we find the addition worthwhile and include it in the algorithm. Using 11 evenly spaced values 400, 420, ..., 600 of m0 we get an average angular error of 1.14° and a standard deviation of 2.14°. Picking 11 m0 values randomly in the same interval, we consistently get average angular errors between 1.13° and 1.18°. The running time increases to about 5.5 minutes with this modification. Statistics on the distribution of errors are given in table 1. Comparison with previously published results, table 2, shows that the algorithm presented here gives excellent accuracy, clearly improving on the results presented in [10].
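For reference, the spatiotemporal angular error measure of Barron et al. [3] used in the evaluation can be computed as follows; this is a direct transcription of the formula, with an illustrative function name:

```python
import numpy as np

def angular_error_deg(v_est, v_true):
    # Extend each 2D velocity (vx, vy) to the spatiotemporal vector
    # (vx, vy, 1), normalize to unit length, and take the angle between
    # the two unit vectors: arccos(v_est^T v_true).
    a = np.append(np.asarray(v_est, dtype=float), 1.0)
    b = np.append(np.asarray(v_true, dtype=float), 1.0)
    a /= np.linalg.norm(a)
    b /= np.linalg.norm(b)
    return np.degrees(np.arccos(np.clip(a @ b, -1.0, 1.0)))
```

The clip guards against rounding pushing the dot product marginally outside [-1, 1]. Unlike a purely spatial angle, this measure remains well defined for zero velocities.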
[Figure 5. Average angular error as a function of the candidate region size m0.]
Acknowledgments
The author wants to acknowledge the financial support of WITAS: The Wallenberg Laboratory for Information Technology and Autonomous Systems.
References
Table 1. Distribution of errors for the affine motion model.

Average error: 1.14°
Standard deviation: 2.14°
Proportion of estimates with errors below:
  < 0.5°: 32.0%
  < 1°:   64.4%
  < 2°:   87.8%
  < 3°:   94.0%
  < 5°:   98.0%
  < 10°:  99.7%

[1] R. Adams and L. Bischof. Seeded region growing. 16(6):641–647, June 1994.
[2] A. Bab-Hadiashar and D. Suter. Robust optic flow computation. International Journal of Computer Vision, 29(1):59–77, August 1998.
[3] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. Int. J. of Computer Vision, 12(1):43–77, 1994.
[4] J. Bigün. Local Symmetry Features in Image Processing. PhD thesis, Linköping University, Sweden, 1988. Dissertation No 179, ISBN 91-7870-334-4.
[5] M. J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, Jan. 1996.
[6] M. J. Black and A. Jepson. Estimating optical flow in segmented images using variable-order parametric models with local deformations. 18(10):972–986, 1996.
[7] F. Dufaux and F. Moscheni. Segmentation-based motion estimation for second generation video coding techniques. In L. Torres and M. Kunt, editors, Video Coding: The Second Generation Approach, chapter 6, pages 219–263. Kluwer Academic Publishers, 1996.
[8] G. Farnebäck. Motion-based Segmentation of Image Sequences. Master's Thesis LiTH-ISY-EX-1596, Computer Vision Laboratory, SE-581 83 Linköping, Sweden, May 1996.
Method                           Average error   Standard deviation   Density
Lucas & Kanade [22]              2.80°           3.82°                35%
Uras et al. [24]                 3.37°           3.37°                14.7%
Fleet & Jepson [12]              2.97°           5.76°                34.1%
Black & Anandan [5]              4.46°           4.21°                100%
Szeliski & Coughlan [23]         2.45°           3.05°                100%
Black & Jepson [6]               2.29°           2.25°                100%
Ju et al. [18]                   2.16°           2.0°                 100%
Karlholm [19]                    2.06°           1.72°                100%
Lai & Vemuri [21]                1.99°           1.41°                100%
Bab-Hadiashar & Suter [2]        1.97°           1.96°                100%
Farnebäck, constant motion [10]  1.94°           2.31°                100%
Farnebäck, affine motion [10]    1.40°           2.57°                100%
This algorithm                   1.14°           2.14°                100%
Table 2. Comparison with other methods, Yosemite sequence. The sky region is excluded for all results.
[9] G. Farnebäck. Spatial Domain Methods for Orientation and Velocity Estimation. Lic. Thesis LiU-Tek-Lic-1999:13, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, March 1999. Thesis No. 755, ISBN 91-7219-441-3.
[10] G. Farnebäck. Fast and Accurate Motion Estimation using Orientation Tensors and Parametric Motion Models. In Proceedings of 15th International Conference on Pattern Recognition, volume 1, pages 135–139, Barcelona, Spain, September 2000. IAPR.
[11] G. Farnebäck. Orientation Estimation Based on Weighted Projection onto Quadratic Polynomials. In B. Girod, G. Greiner, H. Niemann, and H.-P. Seidel, editors, Vision, Modeling, and Visualization 2000: proceedings, pages 89–96, Saarbrücken, November 2000.
[12] D. J. Fleet and A. D. Jepson. Computation of Component Image Velocity from Local Phase Information. Int. Journal of Computer Vision, 5(1):77–104, 1990.
[13] G. H. Granlund and H. Knutsson. Signal Processing for Computer Vision. Kluwer Academic Publishers, 1995. ISBN 0-7923-9530-1.
[14] L. Haglund. Adaptive Multidimensional Filtering. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, October 1992. Dissertation No 284, ISBN 91-7870-988-1.
[15] L. Haglund, H. Bårman, and H. Knutsson. Estimation of Velocity and Acceleration in Time Sequences. In Proceedings of the 7th Scandinavian Conference on Image Analysis, pages 1033–1041, Aalborg, Denmark, August 1991. Pattern Recognition Society of Denmark.
[16] D. J. Heeger. Model for the extraction of image flow. J. Opt. Soc. Am. A, 4(8):1455–1471, 1987.
[17] B. Jähne, H. Haussecker, and P. Geissler, editors. Handbook of Computer Vision and Applications. Academic Press, 1999. ISBN 0-12-379770-5.
[18] S. X. Ju, M. J. Black, and A. D. Jepson. Skin and bones: Multi-layer, locally affine, optical flow and regularization with transparency. In Proceedings CVPR'96, pages 307–314. IEEE, 1996.
[19] J. Karlholm. Local Signal Models for Image Sequence Analysis. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, 1998. Dissertation No 536, ISBN 91-7219-220-8.
[20] H. Knutsson. Representing Local Structure Using Tensors. In The 6th Scandinavian Conference on Image Analysis, pages 244–251, Oulu, Finland, June 1989. Report LiTH-ISY-I-1019, Computer Vision Laboratory, Linköping University, Sweden, 1989.
[21] S.-H. Lai and B. C. Vemuri. Reliable and efficient computation of optical flow. International Journal of Computer Vision, 29(2):87–105, August/September 1998.
[22] B. Lucas and T. Kanade. An Iterative Image Registration Technique with Applications to Stereo Vision. In Proc. Darpa IU Workshop, pages 121–130, 1981.
[23] R. Szeliski and J. Coughlan. Hierarchical spline-based image registration. In Proc. IEEE Conference on Computer Vision Pattern Recognition, pages 194–201, Seattle, Washington, 1994.
[24] S. Uras, F. Girosi, A. Verri, and V. Torre. A computational approach to motion perception. Biological Cybernetics, pages 79–97, 1988.
[25] C-F. Westin. A Tensor Framework for Multidimensional Signal Processing. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, 1994. Dissertation No 348, ISBN 91-7871-421-4.