
A Mathematical Approach Towards Semi-Automatic Image Annotation

2011

Publication in the conference proceedings of EUSIPCO, Barcelona, Spain, 2011

19th European Signal Processing Conference (EUSIPCO 2011), Barcelona, Spain, August 29 - September 2, 2011

L. Seneviratne and E. Izquierdo

Multimedia and Vision Research Group, School of Electronic Engineering and Computer Science, Queen Mary, University of London, Mile End Road, London, E1 4NS, UK. Phone: +44(0)2078827880, fax: +44(0)2078827997, email: {lasantha.s, ebroul.izquierdo}@elec.qmul.ac.uk

In this paper, an interactive approach to obtaining semantic annotations for images is presented. The proposed approach builds on what millions of single, online and cooperative gamers are keen to do: enjoy themselves in a competitive environment. It focuses on computer gaming and the use of humans in a widely distributed fashion. This approach deviates from the conventional "content-based image retrieval (CBIR)" paradigm favoured by the research community to tackle the problems related to the semantic annotation and tagging of multimedia content. The proposed approach uses a multifaceted mathematical model based on game theory to aggregate a number of different key paradigms, such as image processing, machine learning and game-based approaches, to generate accurate annotations. As a consequence, this approach is capable of identifying less-rational (cheating-oriented) players and thus prevents them from generating incorrect annotations. The performance of the proposed framework is tested with a number of game players. Results show that this approach is capable of obtaining correct annotations in practice.

Keywords: semantic annotation, interactive gaming, MPEG-7 features, object recognition

Object recognition and semantic image representation are predominant research topics in the computer vision community. Although the technology for mapping low-level features to high-level concepts has improved in recent years, the "semantic gap" remains an open challenge in the computer vision community.
© EURASIP, 2011 - ISSN 2076-1465

Over the last decade, the challenges of the semantic gap have attracted researchers from different communities. As a result, a large number of approaches to image annotation have been developed. One such approach is to accomplish image annotation through collaborative effort. Collaborative annotation aims at splitting an activity into reasonable chunks to be divided among people who are willing to contribute their resources or effort [1]. Another approach is to design a user-centric interactive framework that is instrumental in harvesting human intelligence. The "ESP game" [2] and "Manhattan Story Mashup" [3] are two different innovative game strategies that harvest human brainpower for annotating images. Since humans can inherently describe image semantics, the games exploit human cognitive intelligence to annotate images. Furthermore, this activity is hidden: users often do not realize the contribution they have made while playing the games.

The Entertainment Software Association recently reported that in the United States alone more than 200 million hours are spent each day playing computer and video games [4]. Moreover, the statistics show that by the age of 21 an average American will have spent more than 10,000 hours playing computer games, which is equivalent to five years of full-time work. What if this time and effort could be used to address the semantic gap issue in the computer vision community? By combining computer gaming and computer vision techniques, we believe that there are numerous ways to overcome the issue of manual image annotation. One of the most effective is to design interactive frameworks that are able to captivate a large number of game players.
This has to be done carefully, taking the players' psychology into account: in general, players do not play games to solve computational problems; they play to entertain themselves.

The remainder of this paper is organized as follows. Section 2 gives an overview of the proposed approach. Section 3 introduces the proposed outcome prediction approach. In Section 4, the two-player game model and the payoff calculation are introduced. In Section 5, the evaluation of the gaming environment is discussed. Finally, Section 6 summarizes the paper along with future research goals.

A diagrammatic overview of the proposed approach is given in Figure 1. The complete framework comprises two modules. The first module (right) analyses the annotations made for fully annotated images, and the second module (left) analyses the annotations made for partially annotated and non-annotated images. Hence, the entire framework consists of three databases: fully annotated, partially annotated and non-annotated images. The first module, which analyses fully annotated content, is used to understand the player's behaviour, to confirm results from statistical inference, and to estimate the model parameters and the shape of the payoff functions. The second module, which analyses partially annotated and non-annotated content, is the actual annotation engine, providing semantic metadata for partially annotated and non-annotated images. The image subject to annotation is displayed by the visual game interface, where players are expected to describe it using a single character string. More details about the interface are given in [5].

Figure 1 - A complete block diagram of the framework.

At the start, a small set of fully annotated content is fed to the game to initiate the process of learning the player's behaviour and the model parameters. Next, content is extracted from one of the three available databases (fully, partially or non-annotated) and uploaded into the system. Database selection for content extraction depends on the predicted player behaviour. However, a module that forces extraction of content from the fully annotated multimedia database at random time intervals is also used. The payoff calculation is used to aggregate all the available information. Here, equilibrium analysis [6] provides valuable output on player confidence, i.e., it decides whether to accept a player annotation as correct or incorrect based on a fair trade solution. Score computation is used to motivate players by giving them points, thus acknowledging their contribution.

There are players with different attitudes: cheating-oriented and rational-minded. It is always difficult to distinguish all players correctly. To address this problem, we propose an approach based on sequential sampling plans to predict the player's outcome. This approach uses an operating characteristic (OC) curve to describe the player's distribution in gaming; hence it represents the picture of the sampling plan. As shown in Equation (1), the player's distribution in gaming is measured by the binomial distribution, which gives the probability of exactly $x$ defective annotations in $n$ images.

$$P(x) = \binom{n}{x}\, p^{x} (1 - p)^{n - x} \qquad (1)$$

where the variable $p$ represents the proportion of non-conforming outcomes (bad annotations) among the incoming annotations.

In the proposed sampling plan, the acceptance quality level (AQL) is measured by (2), based on how well the framework detects correct keywords, i.e., using the dictionary mechanism. In practice, there is a risk that a player's annotation is rejected by the framework. This can happen when the dictionary mechanism fails to detect an existing word of the English language. The risk of a correct annotation being rejected is the risk that the player faces in this game and is denoted by $\alpha$. This risk is measured by the OC curve and its associated AQL parameter.

[Equation (2)]

The two quantities in (2) are the number of valid labels, i.e., annotations with correct spelling that are correctly detected by the framework, and the number of correct labels given by all players.

The rejectable quality level (RQL) is measured by (3), based on the player's commitment to the game. Here, $\beta$ denotes the probability of accepting a lot of RQL quality.

[Equation (3)]

The two quantities in (3) are the number of wrong annotations given by the player and the number of fully annotated contents exposed to the player. Figure 2 shows the proposed OC curve. The RQL is updated whenever the player annotates fully annotated content.

Figure 2 - Operating characteristic curve.

In the proposed approach, item-by-item sequential sampling is used. Here, hitting or crossing a line results in a decision [7]. Given a set of quality levels, AQL ($p_1$), $\alpha$, RQL ($p_2$), $\beta$ and $n$ (the number of exposed images), the acceptance and rejection lines are calculated by (4) and (5).

Acceptance line:
$$X_A = -h_1 + s\,n \qquad (4)$$

Rejection line:
$$X_R = h_2 + s\,n \qquad (5)$$

The origin of the acceptance line is computed as
$$h_1 = \frac{\log\left(\frac{1 - \alpha}{\beta}\right)}{k_1}$$

The origin of the rejection line is computed as
$$h_2 = \frac{\log\left(\frac{1 - \beta}{\alpha}\right)}{k_1}$$

The line slope is computed as
$$s = \frac{\log\left(\frac{1 - p_1}{1 - p_2}\right)}{k_1}$$

where
$$k_1 = \log\left(\frac{p_2\,(1 - p_1)}{p_1\,(1 - p_2)}\right)$$

The proposed prediction mechanism works as follows. Before non-annotated content is exposed, the number of defective annotations and the number of exposed images are both increased by 1. This simulates the worst outcome in this game, namely a wrong annotation. Next, the OC curve and the other parameters, such as RQL, $\beta$, $\alpha$, $X_A$, $X_R$ and the plotted point in the sequential sampling plan, are updated. At this instant, as shown in Figure 3, if the plotted point falls on or below the lower line, i.e., the acceptance line, non-annotated content is exposed to the player, but only when the player's average good contribution satisfies $P(C) \geq \tau_1$; otherwise, partially annotated content is exposed.
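As an illustration of the sequential sampling plan, the following Python sketch computes the acceptance and rejection lines of Equations (4) and (5) and applies the content selection rule of this section. It is a minimal sketch under stated assumptions, not the authors' implementation: the acceptance line is taken in the standard item-by-item form with intercept -h1, the names `SequentialPlan`, `make_plan` and `choose_content` are hypothetical, and the values of alpha, beta and RQL in the usage lines are purely illustrative (the AQL of 0.03 and the threshold of 0.63 are the values reported in the evaluation).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SequentialPlan:
    """Parameters of an item-by-item sequential sampling plan."""
    h1: float  # intercept magnitude of the acceptance line
    h2: float  # intercept of the rejection line
    s: float   # common slope of both lines

    def acceptance_line(self, n: int) -> float:
        # Equation (4), assumed standard form: X_A = -h1 + s * n
        return -self.h1 + self.s * n

    def rejection_line(self, n: int) -> float:
        # Equation (5): X_R = h2 + s * n
        return self.h2 + self.s * n


def make_plan(p1: float, alpha: float, p2: float, beta: float) -> SequentialPlan:
    """Build the plan from AQL (p1), alpha, RQL (p2) and beta."""
    if not (0.0 < p1 < p2 < 1.0):
        raise ValueError("expected 0 < p1 (AQL) < p2 (RQL) < 1")
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    # The base of the logarithm cancels in h1, h2 and s.
    k1 = math.log((p2 * (1.0 - p1)) / (p1 * (1.0 - p2)))
    h1 = math.log((1.0 - alpha) / beta) / k1
    h2 = math.log((1.0 - beta) / alpha) / k1
    s = math.log((1.0 - p1) / (1.0 - p2)) / k1
    return SequentialPlan(h1=h1, h2=h2, s=s)


def choose_content(plan: SequentialPlan, n_exposed: int, n_defective: int,
                   p_correct: float, tau1: float = 0.63) -> str:
    """Pick the content type to expose next.

    Both counters are first increased by 1 to simulate the worst outcome
    (a wrong annotation), as in the prediction mechanism described above.
    """
    n = n_exposed + 1
    d = n_defective + 1
    if d <= plan.acceptance_line(n):
        # Lot accepted: trust the player with unlabelled content.
        return "non-annotated" if p_correct >= tau1 else "partially annotated"
    # Between the lines or above the rejection line: keep testing the player.
    return "fully annotated"


if __name__ == "__main__":
    # AQL = 0.03 as in the evaluation; alpha, beta and RQL are illustrative.
    plan = make_plan(p1=0.03, alpha=0.05, p2=0.15, beta=0.10)
    print(plan)
    print(choose_content(plan, n_exposed=60, n_defective=1, p_correct=0.8))
```

Keeping the plan parameters in an immutable object makes it cheap to rebuild the plan each time the RQL estimate changes, which the text states happens whenever the player annotates fully annotated content.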
If the plotted point falls between the parallel lines or above the upper line, fully annotated content is exposed to the player. This process keeps exposing fully annotated content until the lot has been accepted.

Figure 3 - Proposed sequential sampling plan.

Initially the framework feeds players with a number of fully annotated images; it then analyzes all the annotations in order to measure player confidence and, from it, the transition probabilities. This is done using a Markovian model [8]. The two states of the Markov model (MM) are that a "correct" or an "incorrect" tag or annotation is entered; they are represented by the variables $C$ and $I$, respectively. The outcomes for fully annotated content are sequentially ordered and segmented into sets of tags for the purpose of calculating the conditional probabilities in the transition matrix. For example, the probability $P(C_{t+1} \mid I_t)$ is estimated by dividing the number of sets in which the label 'correct' occurs before 'incorrect' by the total number of tag sets containing 'incorrect'. An overview of the segmenting process is given in Figure 4.

Figure 4 - Segmenting player's outcome into sets of tags.

Given the player outcome at time $t$ on the preceding multimedia content, the probabilistic outcome at time $t+1$ is estimated using the transition matrix $\mathbf{T}$. This matrix describes how the player's behaviour changes along the Markov chain.

$$\mathbf{T} = \begin{bmatrix} P(C_{t+1} \mid C_t) & P(I_{t+1} \mid C_t) \\ P(C_{t+1} \mid I_t) & P(I_{t+1} \mid I_t) \end{bmatrix}$$

where $P(C_{t+1} \mid C_t)$ denotes the probability of obtaining a correct annotation at time $t+1$ when the player has given a correct annotation at time $t$. Similarly, the other probabilities $P(C_{t+1} \mid I_t)$, $P(I_{t+1} \mid C_t)$ and $P(I_{t+1} \mid I_t)$ are measured using the player's historical data, i.e., using the segmented outcomes. A diagrammatic overview of the proposed MM is given in Figure 5.

Figure 5 - Player's probability distribution in gaming.

Since players do not know what type of content they are exposed to, it is sensible to assume that they respond in the same way to any of the three types of content: fully annotated, partially annotated or non-annotated. Using this assumption, Player 1's payoff is always estimated by calculating the overall good ($G_i$) and bad ($B_i$) contributions in gaming.

$$P(G_1) = P(C_{t+1} \mid C_t)\,P(C) + P(C_{t+1} \mid I_t)\,P(I)$$

$$P(B_1) = P(I_{t+1} \mid C_t)\,P(C) + P(I_{t+1} \mid I_t)\,P(I)$$

where $P(C_{t+1} \mid C_t)\,P(C)$ is the probability of having a correct annotation at $t+1$ when the state 'correct' is considered, and $P(C_{t+1} \mid I_t)\,P(I)$ is the probability of having a correct annotation at $t+1$ when the state 'incorrect' is considered. Likewise, $P(I_{t+1} \mid C_t)\,P(C)$ is the probability of having an incorrect annotation at $t+1$ when the state 'correct' is considered, and $P(I_{t+1} \mid I_t)\,P(I)$ is the probability of having an incorrect annotation at $t+1$ when the state 'incorrect' is considered.

When Player 2 is considered, $P(G_2)$ in gaming is estimated as

$$P(G_2) = \frac{U_1(a_1, a_2) + P_{\mathrm{ann}} + P_{\mathrm{cls}}}{k_2}$$

where $U_1(a_1, a_2)$ is the overall payoff of Player 1 in gaming; $P_{\mathrm{ann}}$ is the probability of entering a given annotation, i.e., the number of annotations similar to the player's input keyword divided by the total number of annotations; $P_{\mathrm{cls}}$ is the outcome of low-level feature classification [9]; and $k_2$ is the normalising constant that defines the availability of $P_{\mathrm{ann}}$ and $P_{\mathrm{cls}}$: it takes one value when both are available, another when only one of them is available, and a third otherwise. In practice, classification outcomes are not entirely accurate and are therefore used only when they are greater than the F-measure of the concept.

When considering Player 2, $P(B_2)$ in gaming is estimated as

$$P(B_2) = N_d \times \tau_2$$

where $N_d$ is the number of dissimilar annotations assigned to an image (measured by the WordNet dictionary tool) and $\tau_2$ is the allocated cost per annotation. If the framework performs well in annotation, $N_d$ can be assumed to be smaller.

The profile of actions is estimated under the assumption that the player is given non-annotated or partially annotated content.
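The Markov-based estimate of Player 1's good and bad contributions can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the transition probabilities are estimated by counting consecutive outcome pairs, which is a common simplification and not the tag-set segmentation of Figure 4; P(C) and P(I) are assumed to be the relative frequencies of correct and incorrect outcomes; and the function names `estimate_transitions` and `contributions` are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

STATES = ("C", "I")  # 'C' = correct annotation, 'I' = incorrect annotation


def estimate_transitions(outcomes: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """Estimate P(next | previous) from an ordered outcome history.

    Keys are (previous_state, next_state). Assumption: consecutive-pair
    counting replaces the tag-set segmentation used in the paper.
    """
    pair_counts = Counter(zip(outcomes, outcomes[1:]))
    probs: Dict[Tuple[str, str], float] = {}
    for prev in STATES:
        total = sum(pair_counts[(prev, nxt)] for nxt in STATES)
        for nxt in STATES:
            # If a state was never followed by anything, fall back to 0.
            probs[(prev, nxt)] = pair_counts[(prev, nxt)] / total if total else 0.0
    return probs


def contributions(outcomes: Sequence[str]) -> Tuple[float, float]:
    """Return (P(G1), P(B1)) for Player 1 from the outcome history."""
    if len(outcomes) == 0:
        raise ValueError("at least one outcome is required")
    t = estimate_transitions(outcomes)
    # Assumption: P(C) and P(I) are relative frequencies in the history.
    p_c = sum(1 for o in outcomes if o == "C") / len(outcomes)
    p_i = 1.0 - p_c
    p_good = t[("C", "C")] * p_c + t[("I", "C")] * p_i
    p_bad = t[("C", "I")] * p_c + t[("I", "I")] * p_i
    return p_good, p_bad


if __name__ == "__main__":
    history = "CCICCCIC"  # illustrative outcomes on fully annotated images
    print(estimate_transitions(history))
    print(contributions(history))
```

When every state in the history has at least one successor, the two returned values sum to 1, because each row of the estimated transition matrix sums to 1.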
Let the action taken by player $i$ at each round be $a_i$. Action $a_1$ indicates whether the outcome of Player 1 in a game round is good or bad and is observed by the output prediction module. When the prediction says that the player will enter a good annotation, $a_1 = 1$ is assigned.

$$a_1 = \begin{cases} 1, & \text{if the prediction says a good annotation} \\ 0, & \text{otherwise} \end{cases}$$

$a_2$ is Player 2's action property and is calculated using a threshold score. When Player 1's game score is less than or equal to a certain threshold score $\tau_3$ (Player 1 score ≤ threshold score), action $a_2$ is assigned 0. $a_2$ is assigned 1 when the player score is greater than the threshold score (Player 1 score > threshold score). Although Player 1 increases his or her score by feeding in 'correct' annotations, the framework keeps a difference in game points between the player's score and the threshold score. Whenever the player cheats, his or her score is reduced according to the score computation module, while the threshold score is kept unchanged. Additionally, whenever Player 1's score is less than the threshold score, the threshold score is kept unchanged until the player score becomes greater than the threshold score with a lead of $\tau_3$. Therefore, it can be assumed that this process represents the long-term contribution of the player in gaming.

$$a_2 = \begin{cases} 1, & \text{if player score} > \text{threshold score} \\ 0, & \text{otherwise} \end{cases}$$

If players are cooperative, Table 1 shows that the action pair (short good, long good) forms the unique Nash equilibrium. It can be found simply by analyzing the game outcomes for each action configuration.

Table 1 - Payoff representation for all actions.

                                        Player 2's long-term contribution level
Player 1's short-term contribution      bad ($a_2 = 0$)      good ($a_2 = 1$)
bad ($a_1 = 0$)                         (0, 0)               (-, +)
good ($a_1 = 1$)                        (+, 0)               (+, +)

The score computation module is used for two purposes: firstly, to reward players for their contribution in gaming and thus to yield game points; secondly, to measure the action property of Player 2 ($a_2$). Player 1's score is obtained by scaling Player 1's payoff by a constant factor. For each round, given all information including the action profile $(a_1, a_2)$, a general function for calculating the payoffs of Players 1 and 2 can be defined by (6) and (7).

Payoff of Player 1:
$$U_1(a_1, a_2) = a_1\,P(G_1) - a_2\,P(B_1) \qquad (6)$$

Payoff of Player 2:
$$U_2(a_1, a_2) = a_2\,P(G_2) - a_1\,P(B_2) \qquad (7)$$

The performance of the proposed framework is studied for 96 fully annotated, 48 partially annotated and 60 non-annotated images. All experiments were conducted with the following threshold parameters, which were chosen using a validation set of images that is not part of the test set used to measure the performance of the proposed model. The threshold $\tau_1$, which governs the exposure of partially or non-annotated content, is assigned a value of 0.63. It has been found that the good contribution of true game players is always greater than 0.63. By assigning this number, partially annotated content is mostly exposed to the true game players. As a consequence, more accurate annotations are extracted. AQL is assigned a value of 0.03; threshold $\tau_2$ (which limits the maximum number of annotations per content) is assigned a value of 0.166. In practice, it has been found that a minimum of 1 and a maximum of 5 annotations are needed to find a single correct annotation. By assigning 0.166, the framework allows players to annotate images with 5 different annotations. Threshold $\tau_3$ (used for the $a_2$ calculation) is assigned a value of 301. This is an acceptable number for the game because it has been found that rational-minded players do not enter 3 incorrect annotations in a row. In Figure 8, a correct annotation detected by the framework, i.e., a true positive, is shown by a square; an incorrect annotation accepted by the framework, i.e., a false positive, is shown by a triangle.

Figure 6 - Performance of a classical player. [Plot of the player output distribution, P(C) and P(I), against the number of images.]

Figure 7 - Performance of a random player. [Plot of the player output distribution, P(C) and P(I), against the number of images.]

Figure 8 - Performance of a true player. [Plot of the player output distribution, P(C) and P(I), against the number of images.]

The proposed framework is evaluated using 310 human players. Figures 6-8 show the framework outcomes for a selected classical cheater (a player who gives good annotations at the beginning of the game and then cheats), a random player and a genuine player. For classical cheaters, the overall accuracy of the system is 84%, and for random cheaters it is about 79%. For true game players the accuracy is about 89%. This leads to an overall system accuracy of 84% in image annotation.

Figure 9 shows comparison results of the proposed method against a conventional framework, i.e., a framework in which no mathematical techniques are involved and which therefore collects all the annotations given by the player, and against another framework, proposed in [9], based on a Markovian model. Here, some modifications have been made to the Markovian-based system to improve its performance. The comparison shows that, in terms of accuracy, the proposed framework outperforms both other frameworks in image annotation.

Figure 9 - Comparison of different frameworks. [Bar chart of accuracy, 0% to 100%, for the conventional, Markovian and proposed approaches.]

We have presented a labour-intensive approach to harvesting human brain power for addressing the semantic gap problem in the computer vision community. The proposed approach is a standalone game that is capable of entertaining and motivating players and is used to provide valuable information on image content. Besides the fun factor, this framework provides high accuracy in image annotation. Results show that the proposed framework outperforms the Markovian-based framework and the conventional labour-intensive framework. This approach extends the behaviour of a conventional labour-intensive game by eliminating annotations from less-rational gamers. As a result, accuracy in image annotation is significantly improved. Future research will focus mainly on improving the framework's efficiency and on obtaining a large number of annotations using a small number of players.

[1] L. Kristina and A. Laurie, "Social Browsing on Flickr," in Fifth International AAAI Conference on Weblogs and Social Media, 2007.
[2] L. von Ahn and L. Dabbish, "Labelling images with a computer game," in Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI), 2004.
[3] V. Tuulos, J. Scheible, and H. Nyholm, "Combining Web, Mobile Phones and Public Displays in Large-Scale: Manhattan Story Mashup," in Proceedings of the 5th International Conference on Pervasive Computing, 2007, pp. 37-54.
[4] L. von Ahn and L. Dabbish, "Designing games with a purpose," Communications of the ACM, vol. 51, no. 8, 2008.
[5] L. Seneviratne and E. Izquierdo, "Image annotation through gaming (TAG4FUN)," in 16th International Conference on Digital Signal Processing, 2009.
[6] G. Scutari, D. P. Palomar, and S. Barbarossa, "Optimal Linear Precoding Strategies for Wideband Noncooperative Systems Based on Game Theory-Part I: Nash Equilibria," IEEE Transactions on Signal Processing, vol. 56, no. 3, 2008, pp. 1230-1249.
[7] A. K. M. Abdul and F. A. Burney, "Program for Item-by-Item Sequential Sampling by Attributes," King Abdulaziz University, Technical Report, 1992.
[8] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1998.
[9] L. Seneviratne and E. Izquierdo, "An interactive framework for image annotation through gaming," in Proceedings of the 11th ACM SIGMM International Conference on Multimedia Information Retrieval, 2010.