Marco La Cascia, Alessandro Bruno
Università degli Studi di Palermo
ICPRAM 2014 - International Conference on Pattern Recognition Applications and Methods
Abstract: In this paper we present a novel technique for object modeling and object recognition in video. Given a set of videos containing 360 degree views of objects, we compute a model for each object; then we analyze short videos to determine whether the object depicted in the video is one of the modeled objects. The object model is built from a video spanning a 360 degree view of the object taken against a uniform background. In order to create the object model, the proposed technique selects a few representative frames from each video and extracts local features from such frames. Object recognition is performed by selecting a few frames from the query video, extracting local features from each frame, and looking for matches in all the representative frames constituting the models of all the objects. If the number of matches exceeds a fixed threshold, the corresponding object is considered the recognized object. To evaluate our approach we acquired a dataset of 25 videos representing 25 different objects and used these videos to build the object models. Then we took 25 test videos containing only one of the known objects and 5 videos containing only unknown objects. Experiments showed that, despite a significant compression in the model, recognition results are satisfactory.
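The decision rule sketched in the abstract — count local-feature matches between the query frames and each object's model frames, and accept only when the count exceeds a fixed threshold — can be illustrated with a minimal sketch. All names, counts and the threshold value below are illustrative, not taken from the paper:

```python
# Hedged sketch of the thresholded recognition decision described above.
# The per-object match counts and the threshold are illustrative values.

def recognize(query_matches, threshold):
    """query_matches: dict mapping object name -> number of local-feature
    matches between the query frames and that object's model frames.
    Returns the best-matching object, or None if no count exceeds the
    threshold (i.e. the query is treated as an unknown object)."""
    best_obj, best_count = None, 0
    for obj, count in query_matches.items():
        if count > best_count:
            best_obj, best_count = obj, count
    return best_obj if best_count > threshold else None

# Only "panda" exceeds the (illustrative) threshold of 20 matches.
print(recognize({"panda": 47, "watch": 8, "beer": 3}, threshold=20))  # panda
print(recognize({"panda": 5, "watch": 8, "beer": 3}, threshold=20))   # None
```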
Video Object Recognition and Modeling by SIFT Matching Optimization

The contributions of our paper are: a new technique for object modeling; a new method for video object recognition based on object matching; a new video dataset that consists of a 360 degree video collection of thirty objects (CVIPLab, 2013).

We consider the case in which a person takes a video of an object with a video camera and then wants to know information about the object. The scenario of our system is to upload the video to a system able to recognize the video object captured by the camera.

We developed a new model for video objects that gives a very compact and complete description of the object. We also developed a new video object recognition method based on object matching that achieves very good results in terms of accuracy.

The rest of this paper is organized as follows: in Section 2 we describe the related work of the state of the art in object modeling and object recognition; in Section 3 a detailed description of the video object models dataset is given; in Section 4 we describe the proposed method for object recognition; in Section 5 we show the experimental results; Section 6 ends the paper with some conclusions and future works.

2 RELATED WORKS

In this section we review the most popular methods for object modeling and object recognition, with particular attention to video oriented methods.

2.1 Object Modeling

The most important factors in object retrieval are the data representation (modeling) and the search (matching) strategy. In (Li, 1999) the authors use multiresolution modeling because it preserves necessary details when they are appropriate at various scales. Features such as color, texture and shape are used to build object models; in particular, the GHT (Generalized Hough Transform) is adopted over other shape representations because it is robust against noise and occlusion. Moreover, it can be applied hierarchically to describe the object at multiple resolutions.

In the recognition kernel based method (Li, 1996), the features of an object are extracted at the levels that are most appropriate to yield only the necessary details. In (Day, 1995) the authors proposed a graphical data model for specifying the spatio-temporal semantics of video data for object detection and recognition. The most important information used in (Chen, 2002) is the relative spatial relationships of the objects as a function of time evolution. The model is based on capturing the video content in terms of video objects. The authors differentiate foreground video objects and background video objects. The method includes the detection of background video objects, foreground video objects, static video objects, moving video objects and motion vectors. In (Sivic, 2006) Sivic et al. developed an approach to object retrieval which localizes all the occurrences of an object in a video. Given a query image of the object, the object is represented by a set of viewpoint invariant region descriptors.

2.2 Object Recognition

Object recognition is one of the most important issues in the computer vision community. Some works use video to detect moving objects by motion. In (Kavitha, 2007), for example, the authors use two consecutive frames to first estimate motion vectors, and then they perform edge detection using the Canny detector. Estimated moving objects are updated with a watershed based transformation and finally merged to prevent over-segmentation.

In geometry based approaches (Mundy, 2006) the main idea is that the geometric description of a 3D object allows the projected shape to be accurately analyzed in a 2D image under projective projection, thereby facilitating the recognition process using edge or boundary information.

The most notable appearance-based algorithm is the eigenface method (Turk, 1991) applied in face recognition. The underlying idea of this algorithm is to compute eigenvectors from a set of vectors, where each one represents one face image as a raster scan vector of gray-scale pixel values. The central idea of feature-based object recognition algorithms lies in finding interesting points, often occurring at intensity discontinuities, that are invariant to changes due to scale, illumination and affine transformations.

Object recognition algorithms based on views, or appearances, are still a hot research topic (Zhao, 2004) (Wang, 2007). In (Pontil, 1998) Pontil et al. proposed a method that recognizes objects even if they are overlapped. In view based recognition systems, the dimensionality of the extracted features may be of several hundreds. After obtaining the features of a 3D object from 2D images, 3D object recognition is reduced to a classification problem and features can be considered from the perspective of pattern recognition. In (Murase, 1995) the recognition problem is formulated as one of appearance matching rather than shape matching. The appearance of an object depends on its
shape, reflectance properties, pose in the scene and the illumination conditions. Shape and reflectance are intrinsic properties of the object; on the contrary, pose and illumination vary from scene to scene. In (Murase, 1995) the authors developed a compact representation of objects, parameterized by object pose and illumination (the parametric eigenspace, constructed by computing the most prominent eigenvectors of the set), and the object is represented as a manifold. The exact position of the projection on the manifold determines the object's pose in the image. The authors suppose that the objects in the image are not occluded by other objects and therefore can be segmented from the remaining scene.

In (Lowe, 1999) the author developed an object recognition system based on SIFT descriptors (Lowe, 2004); more particularly, the author used SIFT keypoints and descriptors as input to a nearest-neighbor indexing method that identifies candidate object matches. SIFT descriptors are invariant to image scaling, translation and rotation, and partially invariant to illumination changes and affine or 3D projection.

In (Wu, 2011) the authors analyzed the features which characterize the difference of similar views to recognize 3D objects. Principal Component Analysis (PCA) and Kernel PCA (KPCA) are used to extract features, and the 3D objects are then classified with a Support Vector Machine (SVM). The performances of the SVM, tested on the Columbia Object Image Library (COIL-100), have been compared; the best performance is achieved by SVM with KPCA, which is used for feature extraction in view-based 3D object recognition. In (Wu, 2011) the different algorithms are compared only for four angles of rotation (10°, 20°, 45°, 90°); furthermore, the experimental results are based only on images of size 128 x 128.

Chang et al. (Chang, 1999) used the color co-occurrence histogram (which adds geometric information to the usual color histogram) for recognizing objects in images. The authors computed models of color co-occurrence histograms based on images of known objects taken from different points of view. The models are then matched to sub-regions in test images to find the object. Moreover, they developed a mathematical probabilistic model for adjusting the number of colors in the color co-occurrence histogram.

In (Jinda-Apiraksa, 2013) the focus is on the problem of near-duplicates (ND), that is, similar images, which can be divided into identical (IND) and non-identical (NIND). IND consists of transformed versions of an initial image (e.g. blurred, cropped, filtered), NIND of pictures containing the same scene or objects. In this case, the subjectivity of "how much" two images are similar is a hard problem to face. They present a NIND ground truth derived by directly asking ten subjects, and they make it available on the web.

A high-speed and high-performance ND retrieval system is presented in the work of (Dong, 2012). They use entropy-based filtering to eliminate points that can lead to false positives, like those associated with near-empty regions, and a sketch representation for the filtered descriptors. Then they use a query expansion method based on graph cut.

Recognizing objects in video includes the problem of detection and, in some cases, tracking of the object. The paper of (Chau, 2013) is an overview of tracking algorithm classification, where the authors divide the different approaches into point, appearance and silhouette tracking.

In our method we use SIFT to obtain the object model from multiple views (multiple frames) of the object in the video, and recognition is performed by matching the keypoints of the frames sampled from the query video with the keypoints of the object models. Similarly to the method of Peng Chang et al. (Chang, 1999), we use object modeling for object recognition, but we prefer to extract local features (SIFT) rather than global features such as the color co-occurrence histogram.

3 DATASET CREATION AND OBJECT MODELING

The recognition algorithm is based on a collection of models built from videos of known objects. To test the performance of the proposed method we first constructed a dataset of videos representing several objects; then the modeling method is applied.

3.1 Dataset

3.1.1 Video Description of the Object

For each object of the dataset, the related video contains a 360 degree view of the object starting from a frontal position. This is done using a turntable, a fixed camera and a uniform background. Video resolution is 1280 x 720 (HD) at 30 fps and the
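The modeling stage selects a compact subset of representative views per object; Figure 5 shows a smoothed per-frame chart for the Panda object with its local maxima and minima marked. The exact selection criterion falls on pages not reproduced here, so the following is only an illustrative sketch of selecting keyframes at the extrema of a smoothed per-frame measure; the signal, the window size and all names are assumptions:

```python
def smooth(signal, w=3):
    """Moving-average smoothing with an odd window w (illustrative only)."""
    h = w // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - h), min(len(signal), i + h + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def local_extrema(signal):
    """Indices of interior local maxima and minima of the smoothed chart."""
    idx = []
    for i in range(1, len(signal) - 1):
        if (signal[i] > signal[i - 1] and signal[i] > signal[i + 1]) or \
           (signal[i] < signal[i - 1] and signal[i] < signal[i + 1]):
            idx.append(i)
    return idx

# Toy per-view measure (e.g. keypoints per sampled view, made up here).
per_frame = [5, 9, 14, 11, 7, 4, 6, 12, 10, 8]
keyframes = local_extrema(smooth(per_frame))
print(keyframes)  # [2, 5, 8]
```

Keeping only the extrema of the smoothed curve is what produces the "smoothed model" compression reported in Table 1.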
4 PROPOSED RECOGNITION METHOD

Given the dataset and the extracted object models, we propose a method that performs recognition using a video as query input. The input query video may or may not contain one of the known objects; the only hypothesis on the video is that, if it contains an object of the database, then the object is almost always visible in the video, even if subject to changes in scale and orientation.

Figure 5: The smoothed chart for the Panda object (blue line). Yellow stars are local maxima and minima. The red dashed line is the original chart.
Table 1: Experimental results and statistical values for the video object modeling: object ID, object name, the number of views composing the object 'full model' and the object 'smoothed model', and the compression factor (i.e. the ratio between the number of object model views and the number of all the object views in the dataset).
obj. ID name full model compression smoothed model compression
1 Dancer 14 38.89% 10 27.78%
2 Bible 15 41.67% 9 25.00%
3 Beer 7 19.44% 5 13.89%
4 Cipster 12 33.33% 5 13.89%
5 Tour Eiffel 17 47.22% 10 27.78%
6 Energy Drink 17 47.22% 7 19.44%
7 Paper tissue 13 36.11% 13 36.11%
8 Digital camera 13 36.11% 7 19.44%
9 iPhone 13 36.11% 9 25.00%
10 Statue of Liberty 17 47.22% 11 30.56%
11 Motorcycle 9 25.00% 7 19.44%
12 Nutella 19 52.78% 9 25.00%
13 Sunglasses 23 63.89% 15 41.67%
14 Watch 16 44.44% 9 25.00%
15 Panda 15 41.67% 7 19.44%
16 Cactus 17 47.22% 11 30.56%
17 Plastic plant 19 52.78% 9 25.00%
18 Bottle of perfume 13 36.11% 5 13.89%
19 Shaving foam 10 27.78% 8 22.22%
20 Canned meat 20 55.56% 9 25.00%
21 Alarm clock (black) 15 41.67% 11 30.56%
22 Alarm clock (red) 15 41.67% 8 22.22%
23 Coffee cup 20 55.56% 11 30.56%
24 Cordless phone 15 41.67% 7 19.44%
25 Tuna can 17 47.22% 7 19.44%
Tot. 381 219
Mean Value 15.24 42.33% 8.76 24.33%
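The compression factors in Table 1 follow directly from the view counts. Note that the divisor of 36 total sampled views per object (one every 10 degrees) is inferred from the reported percentages, not stated in this excerpt:

```python
# Compression factor from Table 1: model views / total sampled views.
# TOTAL_VIEWS = 36 is an inference from the reported ratios (e.g.
# 14 / 36 = 38.89%), not a figure given in this part of the paper.
TOTAL_VIEWS = 36

def compression(model_views, total=TOTAL_VIEWS):
    """Percentage of the sampled views retained in the model."""
    return round(100.0 * model_views / total, 2)

print(compression(14))  # Dancer, full model: 38.89
print(compression(10))  # Dancer, smoothed model: 27.78
```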
Figure 9: Matching results with different thresholds: 2 (lower) and the default value, 1.5 (upper).

Figure 10: The best match results for each object and for unknown objects (NO OBJ).

ACKNOWLEDGEMENTS

The authors wish to thank Christian Caruso for helping us in the implementation and experimental phases.