
A novel user-centered design for personalized video summarization

2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2014


A NOVEL USER-CENTERED DESIGN FOR PERSONALIZED VIDEO SUMMARIZATION

Gheorghita Ghinea1, Rajkumar Kannan2, Sridhar Swaminathan2, Suresh Kannaiyan2
1 Information Systems and Computing, Brunel University, Middlesex UB8 3PH, United Kingdom, [email protected]
2 Department of Computer Science, Bishop Heber College (Autonomous), Tiruchirappalli, India, [email protected], [email protected], [email protected]

ABSTRACT

In the past, several automatic video summarization systems have been proposed to generate video summaries. However, a generic video summary that is generated based only on audio, visual and textual saliencies will not satisfy every user. This paper proposes a novel system for generating semantically meaningful personalized video summaries, which are tailored to the individual user's preferences over video semantics. Each video shot is represented using a semantic multinomial, a vector of posterior semantic concept probabilities. The proposed system stitches together a video summary based on the summary time span and the top-ranked shots that are semantically relevant to the user's preferences. The proposed summarization system is evaluated using both quantitative and subjective evaluation metrics. The experimental results on the performance of the proposed video summarization system are encouraging.

Index Terms— Video summarization, video semantics, semantic multinomial, personalization, user preferences, multimedia information systems, multimedia retrieval and browsing

1. INTRODUCTION

Video summarization aims at producing a compact version of a full-length video while preserving the significant content of the original video. To generate a summary, most automatic video summarization methodologies detect interesting or significant segments of a video based on certain criteria, mostly audio, visual and textual saliencies [1]. However, users tend to summarize videos based on the semantic content of a video rather than on saliencies alone, so the well-known semantic gap problem is shown to exist in video summarization as well. The aim of this paper is to propose a novel system for personalized video summarization that produces customized video summaries by adapting to the user's interests.

Generic video summarization is not sufficient when users' needs and interests change over time. Users are seldom satisfied by a common video summary, as it may not contain content of the particular semantic concepts or genres a user likes. The criterion used to summarize a video should therefore be the user's preferences and interests over the video semantics. Personalized video summarization is a technique for producing customized video summaries for users based on their interests over video semantics. Our hypothesis is thus that a generic video summary does not satisfy every user, and that multiple video summaries of the same video should be generated depending on the preferences and needs of individual users. The important requirements for the proposed summarization system are:

• To support efficient video content management and personalized video browsing for individual users in a large-scale video information system.
• To support effective resource management, such as internet bandwidth in a large-scale web video information system, based on whether the user's interest is to watch an entire video or not.
The contributions of this paper are:

• A novel summarization methodology that generates semantically meaningful personalized video summaries using semantic multinomials and user preferences.
• A novel personalized video summarization system that tailors the summaries to individual users' preferences over video semantics.

2. RELATED WORK

A large number of contributions can be found in the area of video summarization. Various user-attention-based models have been proposed for summarization that exploit users' perceptual response to low-level audio, visual and textual features [1-4], or users' response while watching a video [5-7]. A multimodal saliency curve is constructed for movie summarization using a spatiotemporal saliency model, an AM-FM speech model and Part-of-Speech (POS) tagging for computing visual, aural and textual saliencies, respectively [1]. Video features that easily attract users' attention and influence human perception, such as motion, contrast, special scenes and statistical rhythm, are extracted and modelled for summarization [2]. Audio, visual and linguistic attention models were used to generate the attention curve of a video for both static and dynamic video summarization [3]; the authors utilized both low-level attention models (motion, static, camera and audio attention) and mid-level attention models (face, speech and music attention). Attention scores are computed using a motion attention model and attached to scenes, clusters, shots and subshots in a temporal graph for video summarization [4].

Most attention (or saliency) based video summarization methods utilize users' task-independent response or attention to the audio, visual and textual modalities of a video. A major limitation of this class of approaches is that they often fail to work well on videos with semantically rich content (such as movie and sports videos). As they do not consider high-level video semantics, summaries generated based on saliency might not contain the semantically interesting or significant content of a video. For goal-oriented, task-specific video summarization, it is necessary to consider the semantics underlying a video and the users' requirements on summarization.

Variations in a user's eye movement, blinking and head motion have been used to identify interesting segments of a video [5]. Affective segments of videos and regions of interest (ROIs) are discovered by analyzing the viewers' eye gaze [6]. The authors in [7] presented an affective video summarization approach based on the facial expressions of viewers while watching the video; facial expressions were analyzed to infer affective scenes from videos. The effectiveness of these methods depends on the ability to capture users' responses and to map those responses to the corresponding video segments. This class of methods also requires controlled summarization setups. Like the attention-based methods, these methods do not consider video semantics and users' requirements, and thus suffer from limited generalizability.

High-level video semantics can be offered to users as preferences for personalized video summarization [8-11]. Users' degree of interest in events, persons and objects was used for personalized summarization of life-log videos in a multi-camera office environment [8]. This approach relies entirely on manual annotation of events such as working, eating, printing and meeting.
High-level semantic concepts such as humans, explosion, indoor, outdoor, close-up, zoom-in and moving objects were automatically detected from videos for personalized summarization [9]; the authors formulated the selection of shots relevant to the users' preferences as a constrained optimization problem. The importance of a video segment is measured using users' constraints and preferences over audio-visual semantic concepts [10]. IBM Research has proposed a personalized video summarization system for pervasive mobile devices such as PDAs [11]. User, device and transmission profiles were used for adaptive personalized video summarization and transmission; however, the system allows only a single visual semantic concept as a binary preference at a time.

Most of these approaches explicitly obtain users' preferences over video semantics for personalized video shot summarization. Their limitation is that they support only binary-valued preferences, which may be inadequate when the user wants to prioritize preferences relative to one another. Moreover, the computation of the similarity between video semantics and users' preferences, and the selection of ranked shots, were only marginally discussed in these works. In contrast to previous personalized video summarization systems, the proposed system supports multiple real-valued preferences at a time. It also provides a wide variety of semantic concepts as preferences and uses efficient similarity measures as well as a constrained shot selection scheme for personalized video summarization.

3. SYSTEM OVERVIEW

The architecture of the proposed video summarization system is shown in Figure 1. The system consists of three modules: pre-processing, user interface and video summarization. The database contains a collection of videos and their metadata.

Fig. 1. Architecture of the personalized video summarization system.

The proposed system uses IBM's Multimedia Analysis and Retrieval System (IMARS) [12] for pre-processing. First, videos are segmented into a set of shots using a combination of audio-visual features; this ensures smooth audio-visual transitions in the summary presentation. Since a keyframe can represent a shot, a single keyframe is extracted from each shot. Near-duplicate shots in a video are identified using these keyframes: in a set of visually similar shots, the shot that appears first in the video is kept and the other near-duplicates are removed, so that the summary will not contain more than one shot with similar visual content. Twenty-five semantic concepts, such as beach, flower scene and indoors, are detected from each keyframe, and a relevance score for each semantic concept is assigned to each video shot. This score is a posterior concept probability ranging from -1 to +1 that expresses the relevance between a shot and a particular semantic concept, where -1 implies highly irrelevant and +1 highly relevant. The vector of semantic weights (i.e. relevance scores), called the semantic multinomial, represents each keyframe in a semantic space. IMARS provides a diverse set of semantic concept detectors for visual scene categories covering places, people, objects, settings, activities and events, so the semantic concepts used are sufficient to represent a keyframe in a semantic space. Hence, the proposed summarization system can be used with videos of any genre from any domain.
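The paper does not detail how IMARS performs the near-duplicate removal step. The following minimal Python sketch shows one plausible realization using color-histogram intersection over keyframes; the function names, the histogram features and the 0.9 threshold are illustrative assumptions, not the system's actual method.

```python
import numpy as np

def color_histogram(keyframe: np.ndarray, bins: int = 8) -> np.ndarray:
    """Coarse, normalized RGB color histogram of a keyframe (H x W x 3 uint8)."""
    hist, _ = np.histogramdd(keyframe.reshape(-1, 3).astype(float),
                             bins=(bins,) * 3, range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def deduplicate_shots(keyframes, threshold: float = 0.9) -> list:
    """Keep the first shot of each group of visually similar shots.

    Two keyframes are treated as near-duplicates when the histogram
    intersection of their color histograms exceeds `threshold`.
    """
    kept, kept_hists = [], []
    for i, kf in enumerate(keyframes):
        h = color_histogram(kf)
        # histogram intersection similarity against every shot kept so far
        if all(np.minimum(h, g).sum() <= threshold for g in kept_hists):
            kept.append(i)        # first occurrence wins, later duplicates dropped
            kept_hists.append(h)
    return kept
```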
Figure 2 depicts the user interface of the proposed video summarization system. The interface allows users to specify their preferences over the 25 semantic concepts and to give a preferred summary length for a video; the set of semantic concepts preferred by a user is treated as a user profile. Users can specify preferences using either a list or sliders. When using the list, each chosen preference assumes the value 1 and the rest assume the value 0. Since the list allows only binary preferences, sliders can be used instead to select real values between 0 and 1 (Figure 3). In the video summarization module, Dot Product and Cosine Similarity measures are employed to determine the likeness between the user's profile and the semantic multinomials. A summary is skimmed from the top-ranked video shots and shown to the user.

Fig. 2. User interface of the video summarization system.
Fig. 3. Relative user preference panel.

4. SUMMARIZATION METHODOLOGY

Let a video $V = \{u_i, 1 \le i \le n\}$ consist of $n$ shots, where each shot $u_i$ has duration $d_i$ seconds. Let $C = \{c_j, 1 \le j \le m\}$ denote a set of $m$ semantic concepts. For a shot $u_i$, let $R_i = [r_{i1}, r_{i2}, r_{i3}, \dots, r_{im}]$ denote the semantic multinomial containing the relevance scores of the $m$ semantic concepts. Let $P = [p_1, p_2, p_3, \dots, p_m]$ denote the user profile vector, which consists of the weights of the semantic concepts selected as preferences by the user; each preference $p_j$ takes a value between 0 and 1. Let $T$ denote the summary time (in seconds) given by the user, and let $S_i$ be the similarity score computed for shot $u_i$.

4.1. Shot Ranking

The similarity between the semantic multinomial and the user preferences can be computed using any vector similarity measure; in this paper we compare inner-product similarity (Dot Product) and angular similarity (Cosine Similarity). Shots are ranked by the similarity between the relevance scores and the user profile under either measure. For a shot $u_i$, the Dot Product between the semantic multinomial $R_i$ and the preference vector $P$ gives the similarity score

$$S_i = \vec{R}_i \cdot \vec{P} \quad (1)$$

Alternatively, the Cosine Similarity between $R_i$ and $P$ is

$$S_i = \frac{\vec{R}_i \cdot \vec{P}}{\|\vec{R}_i\| \, \|\vec{P}\|} \quad (2)$$

4.2. Shot Selection

The objective of shot selection is to select shots that maximize the cumulative similarity score of the summary while not exceeding the time constraint $T$. This can be cast as an instance of the 0-1 knapsack problem:

$$\max \sum_{i=1}^{n} S_i \cdot x_i \quad \text{subject to} \quad \sum_{i=1}^{n} d_i \cdot x_i \le T \quad (3)$$

Here, $S_i$ is the similarity score computed for shot $u_i$, and $x_i$ is a binary decision variable that takes the value 1 if $u_i$ is selected for the summary and 0 otherwise. For quantitative evaluation, the selected shots are ordered decreasingly by similarity score; for subjective evaluation, they follow the original video order. A summary is skimmed by concatenating the selected shots and is shown to the user with their corresponding audio.
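A minimal Python sketch of the ranking and selection steps follows, assuming NumPy arrays. The function names are illustrative, and the dynamic-programming knapsack solver is one standard way to solve Eq. (3); the paper does not state which solver the authors used.

```python
import numpy as np

def rank_shots(R: np.ndarray, P: np.ndarray, measure: str = "dot",
               clip_negative: bool = True) -> np.ndarray:
    """Similarity score S_i for every shot (Eqs. 1 and 2).

    R is the (n, m) matrix of semantic multinomials (one row per shot),
    P the (m,) user-profile vector of preference weights in [0, 1].
    clip_negative=True corresponds to the DP+/CS+ variants of Section 5.1,
    which set negative relevance scores to zero.
    """
    if clip_negative:
        R = np.maximum(R, 0.0)
    scores = (R @ P).astype(float)                  # Eq. (1): dot product
    if measure == "cosine":                         # Eq. (2): cosine similarity
        norms = np.linalg.norm(R, axis=1) * np.linalg.norm(P)
        scores = np.divide(scores, norms,
                           out=np.zeros_like(scores), where=norms > 0)
    return scores

def select_shots(scores, durations, T: int) -> list:
    """0-1 knapsack shot selection of Eq. (3) by dynamic programming.

    Maximizes the cumulative similarity score without exceeding the
    summary time budget T; durations are assumed to be whole seconds.
    """
    best = [(0.0, []) for _ in range(T + 1)]        # best[t] = (score, shot indices)
    for i, (s, d) in enumerate(zip(scores, durations)):
        d = int(d)
        if d <= 0 or d > T:
            continue
        for t in range(T, d - 1, -1):               # backwards: each shot used at most once
            cand = best[t - d][0] + float(s)
            if cand > best[t][0]:
                best[t] = (cand, best[t - d][1] + [i])
    return best[T][1]
```

For example, `select_shots(rank_shots(R, P, measure="cosine"), durations, T=60)` would return the shot indices of a one-minute summary under the CS+ variant.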
5. EXPERIMENTS AND RESULTS

The proposed summarization system is evaluated both quantitatively and subjectively with song videos. Experiments were conducted with 10 song videos with a total duration of 52 minutes, collected from various web sources. A total of 1240 video shots were manually labelled for validation.

5.1. Experimental Setup

Since the relevance scores lie in the range [-1,+1] and the system allows multiple preferences, a strongly negative relevance score for one preferred semantic concept can lower the Dot Product similarity score even when a shot has strongly positive scores for the other preferred concepts (false negatives). It also increases the chance that shots with only mildly negative relevance scores enter the summary (false positives). Normalizing the relevance scores from [-1,+1] into [0,1] does not help: since range normalization is a linear transformation of the values, the effect remains the same. Instead, negative relevance scores are ignored and assumed to be zero. Four summarization techniques were therefore compared:

• Dot Product without negative relevance scores (DP+)
• Dot Product with negative relevance scores (DP-)
• Cosine Similarity without negative relevance scores (CS+)
• Cosine Similarity with negative relevance scores (CS-)

In the DP+ and CS+ techniques, negative relevance scores are set to zero.

5.2. Quantitative Evaluation

For each video, four different queries (single or multiple preferences) of average query complexity were considered. For this quantitative evaluation, the selected shots are ordered decreasingly by their similarity scores. Results are evaluated using Ranked Precision, where precisions at rank intervals n are averaged over all queries.

Figure 4 compares the averaged precisions of the different similarity measures when a single semantic concept is used as preference: Dot Product without negative relevance scores (DP+) performs better than the others. The averaged precisions when multiple semantic concepts are used as preferences are shown in Figure 5, where Dot Product with negative relevance scores (DP-) performs better than the other summarization techniques.

Fig. 4. Average of precisions for different single preferences.
Fig. 5. Average of precisions for different multiple preferences.

Different multiple-preference queries for all test videos were then given to the system with DP- as the similarity measure. Averages of precisions at different average recalls were calculated (Figure 6); precision falls abruptly when recall reaches 0.3. To measure ranking efficiency, the top-ranked shots were manually graded on a scale of 0-3 and the average Normalized Discounted Cumulative Gain (nDCG) was computed at various positions. Figure 7 shows that nDCG decreases gradually as the result set grows.

Fig. 6. Precision-Recall curve for different multiple preferences.
Fig. 7. Normalized Discounted Cumulative Gain for different multiple preferences.

Figure 8 shows the keyframes of the top 25 shots retrieved from a test video for the user preference "flower scene"; it also reveals some false positives in the resulting summary. These wrong predictions occur because the color values and distribution of the actors' costumes in those keyframes somewhat resemble flowers.

Fig. 8. Keyframes of the top 25 shots for the preference 'flower scene'.
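The nDCG measurement used above can be made concrete with a short sketch. The formulation below is the common graded-relevance DCG with a logarithmic rank discount; the paper does not state whether this or the exponential-gain variant was used, so this is an assumption for illustration.

```python
import numpy as np

def ndcg_at_k(grades, k: int) -> float:
    """nDCG@k over manually graded shots (the paper's 0-3 relevance scale).

    Uses the common DCG form rel_i / log2(i + 1), normalized by the DCG
    of the ideal (descending-grade) ordering.
    """
    rel = np.asarray(grades[:k], dtype=float)
    dcg = float((rel / np.log2(np.arange(2, rel.size + 2))).sum())
    ideal = np.sort(np.asarray(grades, dtype=float))[::-1][:k]
    idcg = float((ideal / np.log2(np.arange(2, ideal.size + 2))).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

For instance, `ndcg_at_k([3, 2, 3, 0, 1], k=5)` scores a ranked list of five graded shots against its ideal reordering.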
5.3. Subjective Evaluation

The performance of the system was subjectively evaluated using a questionnaire with 20 test subjects (12 male and 8 female). For comparison, a generic summary consisting of significant shots was created for each video, with a length of 1 minute. The subjects were asked to use the system to generate 1-minute summaries of the 10 test videos with different preferences. The summarization methodology was set to DP-, and the subjects were not informed which methodology was used. For this subjective evaluation, the selected shots are temporally ordered so that they follow the original video order. While using the system, subjects were also shown the generic summaries of the test videos.

A questionnaire was prepared to comparatively evaluate the tailored and generic video summaries. The questions asked were:

• How informative were these summaries?
• How enjoyable were these summaries?
• Are these summaries relevant to your interests?
• How willing would you be to accept these summaries?

These questions evaluate the summarization performance measures informativeness, enjoyability, relevance and acceptance, respectively. For each measure, subjects were asked to rate the summaries of a specific type on a scale of 1-100. Figure 9 shows the Mean Opinion Score (MOS) given by the subjects for the tailored and generic summaries under each qualitative evaluation measure. Since the proposed system meets the users' requirements for summarization, the tailored summarization method performs better than generic summarization under all qualitative performance measures.

Fig. 9. Mean opinion scores for tailored and generic summaries.

The quality of the summaries was also assessed using two Quality of Perception measures: Quality of Perception - Information Assimilation (QoP-IA) and Quality of Perception - Satisfaction (QoP-S) [13]. In these experiments, QoP-IA denotes the user's ability to assimilate information from a video summary, and QoP-S the user's satisfaction with it. QoP-IA is measured by averaging the scores for relevance and informativeness, while QoP-S is calculated by averaging the scores for enjoyability and acceptance. For the tailored summaries, QoP-IA is 90.4 and QoP-S is 87.1; for the generic summaries, QoP-IA is 70.2 and QoP-S is 74.9. The tailored summaries are thus both more informative and more satisfying than the generic ones.
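The QoP aggregation described above is a simple pairwise averaging of the questionnaire measures; the sketch below makes it explicit. The dictionary keys are illustrative names for the four measures, not identifiers from the paper.

```python
def qop_scores(mos: dict) -> tuple:
    """Compute QoP-IA and QoP-S from per-measure MOS values (scale 1-100).

    QoP-IA averages the informativeness and relevance ratings, and
    QoP-S averages the enjoyability and acceptance ratings (Section 5.3).
    """
    qop_ia = (mos["informativeness"] + mos["relevance"]) / 2.0
    qop_s = (mos["enjoyability"] + mos["acceptance"]) / 2.0
    return qop_ia, qop_s
```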
The usability of the system was subjectively evaluated with the test subjects. The Computer System Usability Questionnaire (CSUQ) [14] was used to measure the subjects' degree of satisfaction (on a scale of 1 to 7) with the system and the user interface. The statements used were:

• The interface of this system is pleasant.
• I feel comfortable using this system.
• It was simple to use this system.
• It is easy to find the information I needed.
• Overall, I am satisfied with how easy it is to use this system.
• Overall, I am satisfied with this system.

These statements evaluate the system's usability under the criteria appearance, comfort, simplicity, user friendliness, ease of use and overall satisfaction, respectively. Figure 10 shows the box plot of the mean opinion scores given by the subjects on system usability. The results show that the system performs fairly well under all usability criteria.

Fig. 10. Subjects' opinion on summarization system usability.

6. CONCLUSION AND FUTURE WORK

This paper presented a novel preference-aware video summarization system that produces semantically meaningful personalized video summaries by adapting to individual users' interests. Utilizing high-level feature extraction techniques reduces the manual effort of annotating video semantics. The video shot similarity measures and the constrained selection scheme guarantee efficient ranking and selection of relevant video segments for customization. The experimental results demonstrate the effectiveness of the proposed system and the need for personalization. The proposed user-centered design can be extended to other specific video domains by incorporating domain-specific semantic annotation techniques. Although the system currently generates video summaries based on visual semantics, users may also be interested in choosing high-level audio and textual semantics as preferences; in future work, the system will consider audio and textual semantics as user preferences for personalized multimodal video summarization. The performance of the proposed system will also be assessed on videos from a more diverse range of genres.

7. REFERENCES

[1] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis, "Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention," IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1553-1568, Nov. 2013.
[2] J. You, G. Liu, L. Sun, and H. Li, "A multiple visual models based perceptive analysis framework for multilevel video summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 3, pp. 273-285, Mar. 2007.
[3] Y.F. Ma, X.S. Hua, L. Lu, and H.J. Zhang, "A generic framework of user attention model and its application in video summarization," IEEE Transactions on Multimedia, vol. 7, no. 5, pp. 907-919, Oct. 2005.
[4] C.W. Ngo, Y.F. Ma, and H.J. Zhang, "Video summarization and scene detection by graph modeling," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 2, pp. 296-305, Feb. 2005.
[5] W.T. Peng, W.T. Chu, C.H. Chang, C.N. Chou, W.J. Huang, W.Y. Chang, and Y.P. Hung, "Editing by viewing: automatic home video summarization by viewing behavior analysis," IEEE Transactions on Multimedia, vol. 13, no. 3, pp. 539-550, Jun. 2011.
[6] H. Katti, K. Yadati, M. Kankanhalli, and C. Tat-Seng, "Affective video summarization and story board generation using pupillary dilation and eye gaze," in IEEE International Symposium on Multimedia, pp. 319-326, 2011.
[7] H. Joho, J.M. Jose, R. Valenti, and N. Sebe, "Exploiting facial expressions for affective video summarisation," in Proceedings of ACM CIVR 2009, Article No. 31, 2009.
[8] H.S. Park and S.B. Cho, "A personalized summarization of video life-logs from an indoor multi-camera system using a fuzzy rule-based system with domain knowledge," Information Systems, vol. 36, no. 8, pp. 1124-1134, 2011.
[9] W.N. Lie and K.C. Hsum, "Video summarization based on semantic feature analysis and user preference," in IEEE International Conference on Sensor Networks, Ubiquitous and Trustworthy Computing, pp. 486-491, 2008.
[10] V. Parshin and L. Chen, "Video summarization based on user-defined constraints and preferences," in Proceedings of RIAO, pp. 18-24, 2004.
[11] B.L. Tseng, C.Y. Lin, and J.R. Smith, "Video summarization and personalization for pervasive mobile devices," in Proceedings SPIE 4676, Storage and Retrieval for Media Databases 2002, pp. 359-370, 2002.
[12] IBM's Multimedia Analysis and Retrieval System (IMARS), http://researcher.watson.ibm.com/researcher/view_project.php?id=877, accessed Feb. 2014.
[13] S.R. Gulliver and G. Ghinea, "Defining the users' perception of distributed multimedia quality," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 2, no. 4, pp. 241-257, Nov. 2006.
[14] J.R. Lewis, "IBM computer usability satisfaction questionnaires: psychometric evaluation and instructions for use," International Journal of Human-Computer Interaction, vol. 7, no. 1, pp. 57-78, 1995.