
Hand jitter descriptor for mobile video identification

2011 IEEE International Conference on Consumer Electronics (ICCE)


Hand Jitter Descriptor for Mobile Video Identification

Alexander Sibiryakov, Mitsubishi Electric Research Centre Europe, Guildford, United Kingdom

Abstract—Hand jitter usually has a negative effect on recorded video, but combined with intentional camera motion it forms a unique global motion pattern intrinsic to the video sequence. A useful application involving hand jitter is therefore to identify videos taken by hand-held devices by extracting and matching a compact descriptor based entirely on global motion. We confirm this idea via a video matching experiment showing that even a simple and fast global motion extraction method, based on integral projections, provides excellent short-clip identification and localization results despite severe video modifications.

I. INTRODUCTION

We define the video identification task as the process of detecting, in a large dataset, the video that contains a short query clip. The query clip can be distorted by cropping, resizing, filtering, compression, etc. To localize the query clip we determine its start position in the identified video.

Due to the widespread use of hand-held camera devices, specialized video identification and localization methods are of much interest. Although hand jitter is often considered detrimental to hand-held video quality, this paper presents the opposite point of view: hand jitter can be a useful descriptor for video identification. Combined with intentional camera motion, the random hand jitter forms a unique global motion pattern intrinsic to the video sequence. This global motion can be extracted by one of the existing algorithms and stored as a video descriptor. As a consequence, the approach is not applicable to videos without global motion, e.g. videos shot from stationary cameras.

Ideas for video description using motion information have been implemented in the MPEG-7 standard [4]. Those descriptors were proposed to improve higher-level video retrieval or event description (e.g. based on object trajectory). They have not been tested in a scenario of motion-based video identification under severe video modification, which is the main topic of this paper.

A typical computer vision algorithm for global motion extraction (see [2] for example) is based on feature extraction, local descriptor matching, and parametric transformation estimation by a robust estimator such as RANSAC. Although a parametric model describes global motion accurately, efficient approximation methods have also been proposed. The centroid motion method [1], which has also been applied to video description, approximates global motion by the movement of high- and low-intensity areas from frame to frame. Other methods replace the high-order motion model with a translation. For example, image projection matching by 1D phase correlation [3] demonstrated reliable motion estimation in the case of significant blur caused by rapid camera motion. The method converts each frame I_k(x, y) of size N×M into integral projections X_k and Y_k:

X_k(x) = \sum_{y=0}^{M-1} I_k(x, y), \qquad Y_k(y) = \sum_{x=0}^{N-1} I_k(x, y).    (1)

Then the global motion is computed by 1D phase correlation:

\Delta x_k = \arg\max_x F^{-1}\left[ \frac{F(w(X_k)) \, F^*(w(X_{k-1}))}{\left| F(w(X_k)) \, F^*(w(X_{k-1})) \right|} \right],    (2)

where F and F^{-1} are the forward and inverse Fourier transforms, '*' denotes complex conjugation, and w(x) is a windowing function that reduces frequency artifacts caused by signal boundaries. The component Δy_k is computed similarly from the other projection pair, Y_k(y) and Y_{k-1}(y).
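As a concrete illustration of Eqs. (1)-(2), the following Python/NumPy sketch implements the projection-based translation estimator. This is not the authors' code: the function names are ours, and the half-cosine (sine) window is one plausible realization of w(x) (Section III reports that a half-cosine window gave the best results).

```python
import numpy as np

def integral_projections(frame):
    """Eq. (1): collapse an (M, N) grayscale frame (M rows = y, N columns = x)
    into column and row sums."""
    X = frame.sum(axis=0)   # X_k(x): sum over y for each column x
    Y = frame.sum(axis=1)   # Y_k(y): sum over x for each row y
    return X, Y

def projection_shift(p_cur, p_prev):
    """Eq. (2): 1D phase correlation between two integral projections."""
    n = len(p_cur)
    w = np.sin(np.pi * np.arange(n) / (n - 1))   # half-cosine window (assumed form of w(x))
    F1 = np.fft.fft(w * p_cur)
    F0 = np.fft.fft(w * p_prev)
    cross = F1 * np.conj(F0)
    cross /= np.abs(cross) + 1e-12               # normalized cross-power spectrum
    corr = np.real(np.fft.ifft(cross))
    shift = int(np.argmax(corr))                 # peak location = translation
    return shift - n if shift > n // 2 else shift  # map wrap-around to negative shifts

def global_motion(frame_cur, frame_prev):
    """Per-frame global translation (dx_k, dy_k) from integral projections."""
    X1, Y1 = integral_projections(frame_cur)
    X0, Y0 = integral_projections(frame_prev)
    return projection_shift(X1, X0), projection_shift(Y1, Y0)
```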
II. HAND JITTER DESCRIPTOR

Consider (Δx_k, Δy_k) to be the relative translation of video frame k with respect to the previous frame k-1, extracted by one of the global motion estimation algorithms (e.g. [1]-[3]). In the proposed method, the concatenated sequence from several frames,

D_{n,m} = \{\Delta x_n, \Delta y_n, \Delta x_{n+1}, \Delta y_{n+1}, \dots, \Delta x_{n+m-1}, \Delta y_{n+m-1}\},

forms a compact descriptor (two numbers per frame). In this notation, D_{n,m} is the descriptor of a video clip of length m+1 starting from frame n; thus D_{1,N-1} is the descriptor of an entire video of length N.

To compare two fragments of the same length m, taken from videos 0 and 1 and starting from frames n_0 and n_1 respectively, we compute the distance between their descriptors as a sum of absolute differences between normalized descriptor components:

\mathrm{Dist}\left(D^{(0)}_{n_0,m}, D^{(1)}_{n_1,m}\right) = \sum_{i=1}^{m} \left| \Delta X^{(0)}_{i+n_0} - \Delta X^{(1)}_{i+n_1} \right| + \left| \Delta Y^{(0)}_{i+n_0} - \Delta Y^{(1)}_{i+n_1} \right|,    (3)

where \Delta X_i = \Delta x_i / \sigma_x(D) and \Delta Y_i = \Delta y_i / \sigma_y(D) are normalized motion components and \sigma_x(D), \sigma_y(D) are normalization factors. The authors of [1] propose to normalize each motion component by the corresponding image dimension, but such normalization works well only against video resizing; it does not help with videos that are cropped or have added borders. To handle the full range of geometric video modifications, we instead use the standard deviations of the motion components as normalization factors.

For identification and localization we use a sliding window approach. In (3), we consider all possible start positions n and find the index s* of the video among D^{(1)}, D^{(2)}, D^{(3)}, … with minimal distance, as well as the start position n* of the matching fragment:

s^* = \arg\min_s \min_n \mathrm{Dist}\left(D^{(0)}_{n_0,m}, D^{(s)}_{n,m}\right), \quad s = 1, 2, \dots    (4)

n^* = \arg\min_n \mathrm{Dist}\left(D^{(0)}_{n_0,m}, D^{(s^*)}_{n,m}\right)    (5)

The normalized motion components of the longer sequence D^{(s)} are recomputed for each D^{(s)}_{n,m} by a sliding window algorithm.

III. EXPERIMENTAL RESULTS

For experimental testing we collected 192 videos, mainly taken by hand-held cameras. The main data sources are the author's personal collection and internet video services. A few standard sequences with prominent camera motion, such as "Basketball" (Fig. 1), were also included in the dataset. The total duration of all videos is 1 hour 40 minutes; individual durations range from 5 seconds to 15 minutes (50 to 23,042 frames).

Fig. 1. Video modifications applied to the "Basketball" sequence. Only modifications of the heavy preset, applied to the top-left quarter of a video frame, are shown: (a) adding a border of 20% of image size; (b) scaling to 25% of image size; (c) cropping 20% of image size; (d) blurring by a 3×3 box filter three times; (e) sharpening by a 3×3 box filter two times; (f) brightness change by gamma correction (γ = 1/4); (g) compression by an MPEG-4 codec at a 100 kb/s bit rate.

A video descriptor's discriminability and robustness are usually tested by modifying the videos and matching the modified fragments against the original set. We used the sliding window algorithm (4), (5) to search for each query clip in the original dataset. To emphasize the robustness of the proposed descriptor, we use very strict matching criteria: if descriptors D^{(0)} and D^{(s)} represent the original video and its modified version, then applying algorithm (4), (5) to the entire dataset should result in s = s* and n_0 = n*, i.e. the matched position is expected to correspond exactly to the ground truth.
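For concreteness, a minimal Python/NumPy sketch of the matching stage (Eqs. (3)-(5)) is given below. The names are ours, and for brevity the per-window normalization is recomputed naively rather than by the efficient sliding-window update referred to above.

```python
import numpy as np

def normalize(desc):
    """Normalize (dx, dy) components by their standard deviations (Sec. II)."""
    dx, dy = desc[:, 0], desc[:, 1]
    return dx / (dx.std() + 1e-12), dy / (dy.std() + 1e-12)

def dist(query, ref):
    """Eq. (3): sum of absolute differences of normalized components."""
    qx, qy = normalize(query)
    rx, ry = normalize(ref)
    return np.sum(np.abs(qx - rx)) + np.sum(np.abs(qy - ry))

def search(query, dataset):
    """Eqs. (4)-(5): return the best-matching video index s* and start frame n*.
    `query` is an (m, 2) array of per-frame (dx, dy); `dataset` is a list of
    longer (N_s, 2) arrays, one per database video."""
    m = len(query)
    best_d, s_star, n_star = np.inf, -1, -1
    for s, ref in enumerate(dataset):
        for n in range(len(ref) - m + 1):     # sliding window over D^(s)
            d = dist(query, ref[n:n + m])     # window re-normalized each step
            if d < best_d:
                best_d, s_star, n_star = d, s, n
    return s_star, n_star
```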
The matching accuracy is computed as the proportion of successfully matched query clips. In our experiment we applied seven video modifications: black border addition, scaling, cropping, filtering (blur and sharpening), brightness change, and compression. For each modification we used two parameter sets: a light and a heavy preset. Examples of the heavy modifications are shown in Fig. 1.

We tested three global motion extraction methods: combined motion of low- and high-intensity centroids [1]; frame-centre translation under an affine motion model estimated by feature extraction and RANSAC [2]; and projection-based translation of the entire frame [3]. We found that a half-cosine window w(x) in (2) gives the best results for the projection-based method.

The results of modified clip matching are shown in Fig. 2. We increased the query clip length from 5 to 200 frames which, taking video durations into account, is equivalent to a nearest-neighbor search among roughly 2·10^6 sub-clips. As seen in Fig. 2, the performance of each method increases with the length of the query clip; the graph descriptions in Fig. 2 give further details of the light and heavy modification parameters. The performance of all methods under heavy modifications is lower than under light modifications.

Fig. 2. Results of matching modified video clips against the original video set, for three global motion extraction methods, seven video modifications (light and heavy presets) and variable query clip length (from 5 to 200 frames).

The centroid-based descriptor is not robust even to light video modifications, because they significantly distort centroid locations and thus alter the descriptor values. The features+RANSAC method is reasonably robust to video modifications, although interest point locations can be seriously distorted by image filtering. The projection-based descriptor outperforms the other two: it has higher matching accuracy even for video clips as short as 25 frames (corresponding to an average clip duration of 1 second), and even very short, 5-frame query clips yield more than 50 percent matching accuracy under all modifications except scale change.

We also performed other experiments not shown in Fig. 2. Robustness to horizontal flipping of the video frame can be achieved by using the absolute value of the horizontal motion component (|ΔX| instead of ΔX in (3)); this lowers the matching accuracy of all methods by about 1 to 5 percent, but the behavior of the performance curves remains the same. An aspect-ratio-change experiment did not demonstrate any new behavior of the performance curves compared with uniform scale change.
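The flip-robust variant mentioned above amounts to a one-line change in the distance of Eq. (3); a sketch, reusing the hypothetical `normalize` helper from the earlier matching sketch:

```python
import numpy as np

def dist_flip_invariant(query, ref):
    """Eq. (3) with |dX| in place of dX: a horizontal flip only negates the
    horizontal motion component, so this distance is unchanged by mirroring.
    Reuses normalize() from the matching sketch above."""
    qx, qy = normalize(query)
    rx, ry = normalize(ref)
    return np.sum(np.abs(np.abs(qx) - np.abs(rx))) + np.sum(np.abs(qy - ry))
```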
IV. CONCLUSION

This paper proposed a useful application of the hand jitter effect: videos from hand-held cameras can be identified entirely from a global motion descriptor extracted by one of the prior-art algorithms, preferably one based on image projections. The superiority of the projection-based method over the other methods can be explained by the fact that it uses all pixels of the image, whereas the other two methods use only a small fraction of them. Integration over a large number of pixels yields image projections that are robust to local changes, while the other methods rely on local image information that such modifications distort.

REFERENCES

[1] T. Hoad and J. Zobel, "Detection of video sequences using compact signatures," ACM Trans. on Information Systems, vol. 24, no. 1, pp. 1-50, Jan. 2006.
[2] K. Pulli, M. Tico and Y. Xiong, "Mobile panoramic imaging system," 6th IEEE Workshop on Embedded Computer Vision, San Francisco, Jun. 13, 2010.
[3] A. Sibiryakov and M. Bober, "Real-time multi-frame analysis of dominant translation," Int. Conf. on Pattern Recognition (ICPR'06), Aug. 2006.
[4] B. S. Manjunath, P. Salembier and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Standard. New York: Wiley, 2001.