Marco La Cascia, Alessandro Bruno
Università degli Studi di Palermo
ICPRAM 2014 - International Conference on Pattern Recognition Applications and Methods
Abstract: In this paper we present a novel technique for object modeling and object recognition in video. Given a set of videos containing 360 degree views of objects, we compute a model for each object; then we analyze short videos to determine whether the object depicted in the video is one of the modeled objects. The object model is built from a video spanning a 360 degree view of the object taken against a uniform background. In order to create the object model, the proposed technique selects a few representative frames from each video and extracts local features from such frames. Object recognition is performed by selecting a few frames from the query video, extracting local features from each frame, and looking for matches in all the representative frames constituting the models of all the objects. If the number of matches exceeds a fixed threshold, the corresponding object is considered the recognized object. To evaluate our approach we acquired a dataset of 25 videos representing 25 different objects and used these videos to build the object models. Then we took 25 test videos containing only one of the known objects and 5 videos containing only unknown objects. Experiments showed that, despite a significant compression in the model, recognition results are satisfactory.
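The decision rule sketched in the abstract — count local-feature matches between the query frames and each object's model frames, and accept only when the count exceeds a fixed threshold — can be illustrated with a minimal sketch. All names, counts and the threshold value below are illustrative, not taken from the paper:

```python
# Hedged sketch of the thresholded recognition decision described above.
# The per-object match counts and the threshold are illustrative values.

def recognize(query_matches, threshold):
    """query_matches: dict mapping object name -> number of local-feature
    matches between the query frames and that object's model frames.
    Returns the best-matching object, or None if no count exceeds the
    threshold (i.e. the query is treated as an unknown object)."""
    best_obj, best_count = None, 0
    for obj, count in query_matches.items():
        if count > best_count:
            best_obj, best_count = obj, count
    return best_obj if best_count > threshold else None

# Only "panda" exceeds the (illustrative) threshold of 20 matches.
print(recognize({"panda": 47, "watch": 8, "beer": 3}, threshold=20))  # panda
print(recognize({"panda": 5, "watch": 8, "beer": 3}, threshold=20))   # None
```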
Video Object Recognition and Modeling by SIFT Matching Optimization

The contributions of our paper are: a new technique for object modeling; a new method for video object recognition based on object matching; a new video dataset that consists of a 360 degree video collection of thirty objects (CVIPLab, 2013).

We consider the case in which a person takes a video of an object with a video camera and then wants to know information about the object. The scenario of our system is to upload the video to a system able to recognize the video object captured by the camera.

We developed a new model for video objects that gives a very compact and complete description of the object. We also developed a new video object recognition method based on object matching that achieves very good results in terms of accuracy.

The rest of this paper is organized as follows: in Section 2 we describe the related work of the state of the art in object modeling and object recognition; in Section 3 a detailed description of the video object models dataset is given; in Section 4 we describe the proposed method for object recognition; in Section 5 we show the experimental results; Section 6 ends the paper with some conclusions and future works.

2 RELATED WORKS

In this section we review the most popular methods for object modeling and object recognition, with particular attention to video oriented methods.

2.1 Object Modeling

The most important factors in object retrieval are the data representation (modeling) and the search (matching) strategy. In (Li, 1999) the authors use multiresolution modeling because it preserves necessary details when they are appropriate at various scales. Features such as color, texture and shape are used to build object models; in particular, the GHT (Generalized Hough Transform) is adopted over other shape representations because it is robust against noise and occlusion. Moreover, it can be applied hierarchically to describe the object at multiple resolutions.

In the recognition kernel based method (Li, 1996), the features of an object are extracted at the levels that are most appropriate to yield only the necessary details. In (Day, 1995) the authors proposed a graphical data model for specifying the spatio-temporal semantics of video data for object detection and recognition. The most important information used in (Chen, 2002) is the relative spatial relationships of the objects as a function of time evolution. The model is based on capturing the video content in terms of video objects. The authors differentiate foreground video objects and background video objects. The method includes the detection of background video objects, foreground video objects, static video objects, moving video objects and motion vectors. In (Sivic, 2006) Sivic et al. developed an approach to object retrieval which localizes all the occurrences of an object in a video. Given a query image of the object, the object is represented by a set of viewpoint invariant region descriptors.

2.2 Object Recognition

Object recognition is one of the most important issues in the computer vision community. Some works use video to detect moving objects by motion. In (Kavitha, 2007), for example, the authors use two consecutive frames to first estimate motion vectors, and then they perform edge detection using the Canny detector. Estimated moving objects are updated with a watershed based transformation and finally merged to prevent over-segmentation.

In geometry based approaches (Mundy, 2006) the main idea is that the geometric description of a 3D object allows the projected shape to be accurately analyzed in a 2D image under projective projection, thereby facilitating the recognition process using edge or boundary information.

The most notable appearance-based algorithm is the eigenface method (Turk, 1991) applied in face recognition. The underlying idea of this algorithm is to compute eigenvectors from a set of vectors, where each one represents one face image as a raster scan vector of gray-scale pixel values. The central idea of feature-based object recognition algorithms lies in finding interesting points, often occurring at intensity discontinuities, that are invariant to changes due to scale, illumination and affine transformations.

Object recognition algorithms based on views, or appearances, are still a hot research topic (Zhao, 2004) (Wang, 2007). In (Pontil, 1998) Pontil et al. proposed a method that recognizes objects even if they are overlapped. In view based recognition systems, the dimensionality of the extracted features may be of several hundreds. After obtaining the features of a 3D object from 2D images, 3D object recognition is reduced to a classification problem and features can be considered from the perspective of pattern recognition. In (Murase, 1995) the recognition problem is formulated as one of appearance matching rather than shape matching. The appearance of an object depends on its
shape, reflectance properties, pose in the scene and the illumination conditions. Shape and reflectance are intrinsic properties of the object; on the contrary, pose and illumination vary from scene to scene. In (Murase, 1995) the authors developed a compact representation of objects, parameterized by object pose and illumination (the parametric eigenspace, constructed by computing the most prominent eigenvectors of the set), and the object is represented as a manifold. The exact position of the projection on the manifold determines the object's pose in the image. The authors suppose that the objects in the image are not occluded by other objects and therefore can be segmented from the remaining scene.

In (Lowe, 1999) the author developed an object recognition system based on SIFT descriptors (Lowe, 2004); more particularly, the author used SIFT keypoints and descriptors as input to a nearest-neighbor indexing method that identifies candidate object matches. SIFT descriptors are invariant to image scaling, translation and rotation, and partially invariant to illumination changes and affine or 3D projection.

In (Wu, 2011) the authors analyzed the features which characterize the difference of similar views to recognize 3D objects. Principal Component Analysis (PCA) and Kernel PCA (KPCA) are used to extract features, and the 3D objects are then classified with a Support Vector Machine (SVM). The performances of the SVM, tested on the Columbia Object Image Library (COIL-100), have been compared; the best performance is achieved by SVM with KPCA, which is used for feature extraction in view-based 3D object recognition. In (Wu, 2011) the different algorithms are compared only for four angles of rotation (10°, 20°, 45°, 90°); furthermore, the experimental results are based only on images of size 128 x 128.

Chang et al. (Chang, 1999) used the color co-occurrence histogram (which adds geometric information to the usual color histogram) for recognizing objects in images. The authors computed models of color co-occurrence histograms based on images of known objects taken from different points of view. The models are then matched to sub-regions in test images to find the object. Moreover, they developed a mathematical probabilistic model for adjusting the number of colors in the color co-occurrence histogram.

In (Jinda-Apiraksa, 2013) the focus is on the problem of near-duplicates (ND), that is, similar images, which can be divided into identical (IND) and non-identical (NIND). IND consists of transformed versions of an initial image (e.g. blurred, cropped, filtered), NIND of pictures containing the same scene or objects. In this case, the subjectivity of "how much" two images are similar is a hard problem to face. They present a NIND ground truth derived by directly asking ten subjects, and they make it available on the web.

A high-speed and high-performance ND retrieval system is presented in the work of (Dong, 2012). They use entropy-based filtering to eliminate points that can lead to false positives, like those associated with near-empty regions, and a sketch representation for the filtered descriptors. Then they use a query expansion method based on graph cut.

Recognizing objects in video includes the problem of detection and, in some cases, tracking of the object. The paper of (Chau, 2013) is an overview of tracking algorithm classification, where the authors divide the different approaches into point, appearance and silhouette tracking.

In our method we use SIFT to obtain the object model from multiple views (multiple frames) of the object in the video, and recognition is performed by matching the keypoints of the frames sampled from the query video with the keypoints of the object models. Similarly to the method of Peng Chang et al. (Chang, 1999), we use object modeling for object recognition, but we prefer to extract local features (SIFT) rather than global features such as the color co-occurrence histogram.

3 DATASET CREATION AND OBJECT MODELING

The recognition algorithm is based on a collection of models built from videos of known objects. To test the performance of the proposed method we first constructed a dataset of videos representing several objects; then the modeling method is applied.

3.1 Dataset

3.1.1 Video Description of the Object

For each object of the dataset, the related video contains a 360 degree view of the object starting from a frontal position. This is done using a turntable, a fixed camera and a uniform background. Video resolution is 1280 x 720 (HD) at 30 fps and the
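The modeling stage selects a compact subset of representative views per object; Figure 5 shows a smoothed per-frame chart for the Panda object with its local maxima and minima marked. The exact selection criterion falls on pages not reproduced here, so the following is only an illustrative sketch of selecting keyframes at the extrema of a smoothed per-frame measure; the signal, the window size and all names are assumptions:

```python
def smooth(signal, w=3):
    """Moving-average smoothing with an odd window w (illustrative only)."""
    h = w // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - h), min(len(signal), i + h + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def local_extrema(signal):
    """Indices of interior local maxima and minima of the smoothed chart."""
    idx = []
    for i in range(1, len(signal) - 1):
        if (signal[i] > signal[i - 1] and signal[i] > signal[i + 1]) or \
           (signal[i] < signal[i - 1] and signal[i] < signal[i + 1]):
            idx.append(i)
    return idx

# Toy per-view measure (e.g. keypoints per sampled view, made up here).
per_frame = [5, 9, 14, 11, 7, 4, 6, 12, 10, 8]
keyframes = local_extrema(smooth(per_frame))
print(keyframes)  # [2, 5, 8]
```

Keeping only the extrema of the smoothed curve is what produces the "smoothed model" compression reported in Table 1.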
4 PROPOSED RECOGNITION METHOD

Given the dataset and the extracted object models, we propose a method that performs recognition using a video as query input. The input query video may or may not contain one of the known objects; the only hypothesis on the video is that, if it contains an object of the database, then the object is almost always visible in the video, even if subject to changes in scale and orientation.

Figure 5: The smoothed chart for the Panda object (blue line). Yellow stars are local maxima and minima. The red dashed line is the original chart.
Table 1: Experimental results and statistical values for the video object modeling: object ID, object name, the number of views composing the object 'full model' and the object 'smoothed model', and the compression factor (i.e. the ratio between the number of object model views and the number of all the object views in the dataset).
obj. ID name full model compression smoothed model compression
1 Dancer 14 38.89% 10 27.78%
2 Bible 15 41.67% 9 25.00%
3 Beer 7 19.44% 5 13.89%
4 Cipster 12 33.33% 5 13.89%
5 Tour Eiffel 17 47.22% 10 27.78%
6 Energy Drink 17 47.22% 7 19.44%
7 Paper tissue 13 36.11% 13 36.11%
8 Digital camera 13 36.11% 7 19.44%
9 iPhone 13 36.11% 9 25.00%
10 Statue of Liberty 17 47.22% 11 30.56%
11 Motorcycle 9 25.00% 7 19.44%
12 Nutella 19 52.78% 9 25.00%
13 Sunglasses 23 63.89% 15 41.67%
14 Watch 16 44.44% 9 25.00%
15 Panda 15 41.67% 7 19.44%
16 Cactus 17 47.22% 11 30.56%
17 Plastic plant 19 52.78% 9 25.00%
18 Bottle of perfume 13 36.11% 5 13.89%
19 Shaving foam 10 27.78% 8 22.22%
20 Canned meat 20 55.56% 9 25.00%
21 Alarm clock (black) 15 41.67% 11 30.56%
22 Alarm clock (red) 15 41.67% 8 22.22%
23 Coffee cup 20 55.56% 11 30.56%
24 Cordless phone 15 41.67% 7 19.44%
25 Tuna can 17 47.22% 7 19.44%
Tot. 381 219
Mean Value 15.24 42.33% 8.76 24.33%
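The compression factors in Table 1 follow directly from the view counts. Note that the divisor of 36 total sampled views per object (one every 10 degrees) is inferred from the reported percentages, not stated in this excerpt:

```python
# Compression factor from Table 1: model views / total sampled views.
# TOTAL_VIEWS = 36 is an inference from the reported ratios (e.g.
# 14 / 36 = 38.89%), not a figure given in this part of the paper.
TOTAL_VIEWS = 36

def compression(model_views, total=TOTAL_VIEWS):
    """Percentage of the sampled views retained in the model."""
    return round(100.0 * model_views / total, 2)

print(compression(14))  # Dancer, full model: 38.89
print(compression(10))  # Dancer, smoothed model: 27.78
```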
Figure 9: Matching results with different thresholds: 2 (lower) and the default value, 1.5 (upper).

Figure 10: The best match results for each object and for unknown objects (NO OBJ).

ACKNOWLEDGEMENTS

The authors wish to thank Christian Caruso for helping us in the implementation and experimental phases.