
Modeling visual attention on scenes


Brice Follet (*) (***) — Olivier Le Meur (**) — Thierry Baccino (***)
(*) Thomson, Compression Lab., 1 av. de Belle Fontaine, 35576 Cesson-Sevigne, France, [email protected]
(**) University of Rennes 1, Campus universitaire de Beaulieu, 35042 Rennes, France, [email protected]
(***) LUTIN (UMS-CNRS 2809), Cité des sciences et de l'industrie de la Villette, 30 av. Corentin Cariou, 75930 Paris, [email protected]

ABSTRACT. Research in the computational modelling of visual attention has mushroomed in recent years. The first generation of computational models, called bottom-up models, computes a saliency map indicating the degree of interest of each area of a picture. These models are based purely on low-level visual features. However, it is now well known that visual perception is not a purely bottom-up process. To improve the quality of the prediction significantly, top-down information (prior knowledge, expectations, contextual guidance) has to be taken into account. In this article we describe several bottom-up models and the metrics used to assess their performance. A new generation of models, based on both low-level and high-level information, is also briefly described. To go one step further in the understanding of the cognitive processes involved, new computational tools have recently been proposed; they are listed in the final section.

RÉSUMÉ. The computational modelling of visual attention is currently experiencing considerable growth. The first models, based purely on so-called exogenous attention, compute a saliency map indicating the areas of visual interest in an image. However, in order to improve this prediction, it is necessary to take into account higher-level information relating to endogenous attention, that is, information linked to cognitive processes. To address this issue, this article describes a number of exogenous models as well as models integrating prior knowledge. Methods for evaluating their performance are also described. Finally, new perspectives and directions of study are presented in order to go further in the modelling and understanding of cognitive processes.

KEYWORDS: Attention, eye movements, saliency map, scanpath, bottom up, top down, computational modeling, EFRP, eye tracking.

MOTS-CLÉS: Attention, eye movements, saliency map, visual scanpath, endogenous and exogenous mechanisms, computational modelling, EFRP, eye tracking.

1. Introduction

The computational modeling of visual attention is an important challenge for computer vision researchers. Potential applications are numerous and varied (video compression, surveillance, retargeting...). The goal is to predict automatically, from an input picture or video sequence, the locations where an observer would gaze. For that, most computational models rest on the assumption that there exists a single saliency map in the brain. A saliency map is a map indicating the degree of interest of each area. However, it seems that there is no single saliency map in the brain but rather a set of saliency maps distributed throughout the different visual areas of the brain. The assumption of a unique saliency map is therefore a strong shortcut, but a very convenient one for designing a computational model.
Indeed, under this assumption, the brain can be compared to a computer whose inputs are the sensory information and whose output is the saliency map. Based on the assumption that there is a unique saliency map in the brain, Koch and Ullman proposed in 1985 a plausible architecture for a visual attention model [KU85]. From an input picture, several early visual features are extracted in a massively parallel manner, leading to one feature map per channel. A filtering operation is then applied to these maps in order to discard most of the visually irrelevant information. Then, these maps are combined to form a saliency map. Following this influential work, a number of computational models have been proposed. They can be grouped into two categories, which could be called hierarchical models and probabilistic (or statistical) models. The architecture of these two categories of models is almost the same; they differ in the mathematical operations used to determine the salience. This first generation of models rests mainly on bottom-up mechanisms. Early visual stages are indeed purely bottom-up. However, a number of other factors, such as prior knowledge, our expectations and the goals we have to perform, may have a significant impact on the way we scan our visual field. In this paper, we tackle these points. The first part is a brief review of bottom-up models. We also briefly review methods to assess the degree of similarity between predicted and human data. In the second part, we emphasize the factors that a second generation of computational models of visual attention should take into account. In this same part, some models based on prior knowledge are described. In the last part, new perspectives as well as new computational tools are described.

2. Computational models

2.1. Hierarchical models

Figure 1 (a) gives the architecture of Itti's model [IKN98]. This computational model, proposed in 1998, was the first model able to compute efficiently a saliency map from a given picture. The saliency map is built from 3 separate feature maps coming from 3 channels (color, orientation and luminance). As depicted in figure 1 (a), a Gaussian pyramid is used for each channel to split the input signal into several spatial scales. Center-surround differences are used to yield a feature map. The final saliency map is then computed. A map normalization operator N(.) is defined. The definition given by Itti et al. is the following operation (a minimal code sketch of this operator is given at the end of this subsection):
1) Normalizing the values to a fixed range in order to eliminate modality-dependent amplitude differences;
2) Finding the location of the map's global maximum and computing the average of all its other local maxima;
3) Multiplying the map by a factor depending on the previous values.
Figure 1 (b) gives an extension of Itti's model. Le Meur's model [MCBT06] improves two aspects of Itti's model. First, the extraction of the early visual features in Itti's model is achieved by different operators that can produce dramatically different dynamic ranges. To cope with this problem, several normalization schemes are applied at different stages of the model in order to eliminate modality-dependent amplitude differences. Second, the final saliency map is the result of a linear combination: the feature maps are first normalized to the same dynamic range and then combined.
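As a rough illustration, here is a minimal Python/NumPy sketch of the normalization operator N(.) described above. It is not the authors' implementation: the (M - m)^2 weighting factor follows the original paper [IKN98], while the neighbourhood size used to detect local maxima is an arbitrary choice here.

import numpy as np
from scipy.ndimage import maximum_filter

def normalize_map(fmap, M=1.0, neighborhood=7):
    """Sketch of the map normalization operator N(.)."""
    fmap = fmap.astype(float)
    span = fmap.max() - fmap.min()
    if span == 0:
        return np.zeros_like(fmap)
    # step 1: rescale the map to the fixed range [0, M]
    fmap = M * (fmap - fmap.min()) / span
    # step 2: locate local maxima with a sliding maximum filter and
    # average all of them except the global maximum
    is_local_max = fmap == maximum_filter(fmap, size=neighborhood)
    maxima = fmap[is_local_max]
    others = maxima[maxima < M]
    m_bar = others.mean() if others.size else 0.0
    # step 3: weight the whole map; (M - m_bar)**2 is the factor used in [IKN98]
    return fmap * (M - m_bar) ** 2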
Figure 1: Two examples of hierarchical models of bottom-up visual attention: (a) Itti's model, (b) Le Meur's model.

To deal with the two aforementioned points, Le Meur et al. proposed to extract the early visual features by using psychophysical models. A contrast sensitivity function is used to normalize the incoming information to its visibility threshold. A perceptual subband decomposition is then applied to the Fourier spectrum. Each resulting subband can be seen as a population of visual cells sensitive to a range of orientations and a range of spatial frequencies. Finally, a visual masking operation modulates the visibility thresholds of a given subband by taking into account information coming from other subbands. The goal of these first steps is to provide homogeneous feature maps that can be directly compared. At this stage, a single normalization is required. A center-surround operation is then applied to each subband in order to eliminate irrelevant information and to promote contrasted areas. After that, a unique saliency map is computed by using a normalization scheme called long-term normalization.

2.2. Statistical models

Statistical models differ from the previous ones because they rest on a probabilistic approach. The assumption is that a rare event is more salient than a non-rare event. The mathematical tool that can simply simulate this behavior is self-information. Self-information is a measure of the amount of information provided by an event. For a discrete random variable X taking its values in A = {x_1, ..., x_N} with probability mass function p, the amount of information of the event X = x_i is given by I(X = x_i) = -log2 p(X = x_i) bit/symbol. The first model based on this approach was proposed by Oliva et al. [OTCH03]. The bottom-up saliency is given by S = 1 / p(F|G), where F denotes a vector of local visual features observed at a given location while G represents the same visual features computed over the whole image. More recently, this approach has been reused by a number of authors. The proposed models differ in the support used to compute the saliency (a minimal code sketch of the self-information principle is given below, just after figure 2):
– The probability density function is learnt on the whole picture [OTCH03];
– A local surround is used to compute the saliency, which can be computed either from the self-information [BT09] or from the mutual information [GV09];
– A probability density function is learned on a number of natural image patches; features extracted at a given location are compared to this prior knowledge [ZTM+08, BT09];
– A probability density function is learnt from what happened in the past. This is the theory of surprise proposed by Itti and Baldi [IB06]. This approach computes how incoming information affects our perception; the idea is to compute the difference between the posterior and prior beliefs of observers.

2.3. Performances

The assessment of computational models is commonly performed by measuring the degree of similarity between predicted saliency maps and data coming from eye tracking experiments. Eye tracking experiments are commonly conducted in a free-viewing task, meaning that observers are allowed to examine the pictures freely, without any particular objective. The goal is to lessen top-down feedback and to favor a bottom-up gaze deployment.

Figure 2: Predicted saliency maps for different original pictures and for different computational models.
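The following is a minimal sketch of the self-information principle mentioned in section 2.2, not a reimplementation of any of the published models: here the feature is simply the quantized pixel intensity and its distribution is estimated over the whole image, so that rare intensities receive a high saliency. The cited models use much richer local features and supports.

import numpy as np

def self_information_saliency(image, n_bins=64):
    """Toy statistical saliency: rare (low-probability) features are salient."""
    img = image.astype(float)
    img = (img - img.min()) / (img.max() - img.min() + 1e-12)
    # the "feature" F is the quantized intensity of each pixel
    bins = np.minimum((img * n_bins).astype(int), n_bins - 1)
    # estimate p(F | G) with a histogram computed over the whole image
    hist = np.bincount(bins.ravel(), minlength=n_bins).astype(float)
    p = hist / hist.sum()
    # saliency is the self-information -log2 p(F), normalized to [0, 1]
    saliency = -np.log2(p[bins] + 1e-12)
    return saliency / saliency.max()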
The collected raw eye tracking data is then parsed into fixations and saccades. A velocity-based algorithm can be used to perform this parsing (see [SG00] for more details and for a taxonomy of parsing methods). Several databases are available on the internet (see for instance http://www.irisa.fr/temics/staff/lemeur/). The degree of similarity between the prediction of a computational model and the ground truth can be obtained by two methods. These methods, called the saliency-map-based method and the fixation-point-based method, are described in the following subsections.

2.3.1. Saliency-map-based method

The saliency-map-based method, as its name suggests, rests on the use of maps. From the collected eye-tracking data, a human saliency map is computed by taking into account the fixation points of all observers. The building of this kind of map is described in [Woo02]. Figure 3 gives an example of a human and a predicted saliency map. The degree of similarity between these two maps can be assessed by different methods. However, the most used, and probably the most relevant, is the ROC (Receiver Operating Characteristic) analysis. The idea is to use a binary classifier in order to label each pixel as being fixated or not. Different thresholds are used. For each threshold the true positive rate (TPR) and the false positive rate (FPR) are deduced (an example is given in figure 3 (d)). A ROC graph depicting the tradeoff between TPR and FPR is plotted, with the TPR on the Y axis and the FPR on the X axis. On this graph, the point (0,1) represents a perfect similarity. The closer the curve is to the top left-hand corner, the better the classification. The diagonal line (on a linear-linear graph) indicates that the classification is a purely random process. One interesting indicator is the area under the curve, called the AUC. This indicator is very useful to compare the quality of the predictions. The AUC value is always between 0 and 1. An AUC value equal to 0.5 indicates that there is no similarity between the two sets of data; a value of 1 is obtained for a perfect classification.

Figure 3: Example of an original picture (a), a human (experimental) saliency map (b) and a predicted saliency map (c). (d) corresponds to a ROC classification (example for a given threshold): original pixels stand for true negatives, green pixels for true positives, and the remaining pixels for false positives and false negatives.

2.3.2. Fixation-point-based method

Rather than considering a saliency map, this method uses the list of human fixation points [PLN02, PIIK05]. The predicted saliency for each fixation point is extracted from the predicted saliency map at the spatial coordinates of the considered fixation point. At the end of the list of fixation points, a value called the NSS, standing for Normalized Scanpath Saliency, is obtained. An NSS equal to zero means that there is no similarity between the predicted saliency map and the visual scan paths. A positive NSS indicates that there is a similarity, whereas a negative one indicates an anti-correspondence. Figure 4 gives an example. The three steps of the NSS method are:
1) Each predicted saliency map is transformed into a Z-score (normalized to zero mean and unit standard deviation);
2) The predicted saliency is extracted in a local neighbourhood centered on a human fixation point;
3) The obtained values are averaged.

Figure 4: NSS computation (extracted from [PIIK05]).
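The two comparison methods can be summarized by the following minimal Python sketch (an illustrative reimplementation, not the code used in the cited works). It assumes the predicted saliency map is a 2-D array and that the human data is given either as a binary fixation mask (for the ROC/AUC) or as a list of (row, col) fixation coordinates (for the NSS); for brevity the NSS reads a single pixel per fixation instead of a small neighbourhood.

import numpy as np

def roc_auc(pred_sal, fixation_mask, n_thresholds=100):
    """Area under the ROC curve: fixated pixels are the positive class and the
    predicted map is thresholded at several levels (0.5 = chance, 1 = perfect)."""
    s = (pred_sal - pred_sal.min()) / (pred_sal.max() - pred_sal.min() + 1e-12)
    pos = fixation_mask.astype(bool)
    tpr, fpr = [], []
    for t in np.linspace(1.0, 0.0, n_thresholds):
        labelled = s >= t                                  # pixels classified as "fixated"
        tpr.append((labelled & pos).sum() / max(pos.sum(), 1))
        fpr.append((labelled & ~pos).sum() / max((~pos).sum(), 1))
    return np.trapz(tpr, fpr)                              # area under the (FPR, TPR) curve

def nss(pred_sal, fixations):
    """Normalized Scanpath Saliency, following the three steps listed above."""
    z = (pred_sal - pred_sal.mean()) / (pred_sal.std() + 1e-12)   # step 1: Z-score
    values = [z[r, c] for (r, c) in fixations]                    # step 2: read at fixations
    return float(np.mean(values))                                 # step 3: average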
3. The necessity of cognitive models

A purely contrast-based guidance process is a basic view of how our attention is deployed. A second form of attention, controlled by higher areas of the brain, can override, or at least influence, the bottom-up process.

3.1. Top-down priming

To improve visual attention models significantly, top-down sources of guidance should be taken into account. Numerous studies indeed show that attentional deployment is driven not by a single source but rather by a set of guidance sources. The seminal experiment of Yarbus [Yar67] is the perfect illustration of the strong relationship between bottom-up and top-down mechanisms. Yarbus recorded eye movements while observers watched a painting. Observers were given different questions to answer prior to viewing the painting. The fact that the recorded visual scan paths were radically different demonstrated that visual attention deployment is not a purely bottom-up process. Since 1967, more and more experiments have strengthened and refined Yarbus's conclusions (see [?] for a nice and short review).

3.2. Two visual pathways

Behavioural research has proposed many functional splits, such as evaluating versus orienting, what versus where processing, focal versus ambient treatment, examining versus noticing, and figural/spatial versus foveal/ambient, to conceptualize a dual processing that reconciles opposing classical theoretical approaches in the behavioural sciences. Functional neuroanatomy has confirmed this functional dichotomy by showing two distinct visual pathways working in parallel. These findings support the idea that the cortex processes visual information by separating object identification from scene spatialization. Both pathways start from the early visual areas (V1, V2): a dorsal pathway directed to the posterior parietal cortex computes the spatial attributes of the scene, while a ventral pathway, responsible for object identification, is directed to the infero-temporal cortex [Bar04].

3.2.1. A dichotomy in spatial frequencies

Thorpe et al. [TFM96] have shown that the visual system is able to categorize a visual scene in 160 ms. This very rapid processing, probably due to a feed-forward mechanism, could be explained by the brain's ability to rapidly categorize a visual scene from low-frequency information [SO94]. Indeed, many studies have claimed that visual processing follows a coarse-to-fine process. This coarse-to-fine process seems to rely on the processing of low versus high spatial frequencies of visual contrast: the coarse process might be based on low frequencies, whereas the fine process would be based on high frequencies. Thus, it has been shown that fast categorization of a visual scene is related to the processing of low frequencies [SO94], while the finer information given by high frequencies corresponds to time-consuming processes for object recognition [Bar04]. Emotional information in faces [SO99], and perhaps in scene perception, is conveyed by low frequencies, most likely through subcortical pathways. The gist is fundamental for scene recognition [OT06]; it may correspond to the low-frequency spatial invariant of each category [Oli05]. These findings led to the idea that low-frequency coarse processing could be carried by the magnocellular dorsal pathway, while high-frequency fine processing might be related to the parvocellular ventral pathway [Bar04].
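As a concrete illustration of this low versus high spatial frequency split (and of the hybrid stimuli described in section 3.2.2 below), here is a minimal sketch using a Gaussian low-pass filter. It is only an approximation of the filtering used in the cited studies, and the cut-off parameter sigma is an arbitrary illustrative choice.

import numpy as np
from scipy.ndimage import gaussian_filter

def split_spatial_frequencies(image, sigma=8.0):
    """Separate an image into a coarse (low-frequency) and a fine (high-frequency)
    component; sigma controls the cut-off of the Gaussian low-pass filter."""
    img = image.astype(float)
    low = gaussian_filter(img, sigma=sigma)
    high = img - low
    return low, high

def hybrid_picture(image_a, image_b, sigma=8.0):
    """Hybrid stimulus: low frequencies of one picture mixed with the high
    frequencies of another (in the spirit of section 3.2.2)."""
    low_a, _ = split_spatial_frequencies(image_a, sigma)
    _, high_b = split_spatial_frequencies(image_b, sigma)
    return low_a + high_b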
Other studies suggest that this functional distinction corresponds to an asymmetry between the two cerebral hemispheres. However, behavioural data support a dual model based on a functional dichotomy, and Baddeley and Tatler (2006) have shown that high-frequency edges can predict fixation locations. The question is thus whether this functional parallelism is able to influence attentional guidance.

3.2.2. A dichotomy in the visual scan path

Yarbus's experiments [Yar67] indicated that fixation duration increases during scene viewing, reflecting a stronger attentional focus. This result has been confirmed by recent studies on eye movements under various conditions (different tasks such as searching, recognition [USP05], memorization [CMH09] or free viewing [PHR+08]). These same experiments also indicated a decrease in saccade amplitude with viewing time. An important aspect outlined by Unema's study [USP05] concerns the relationship between the fixation duration and the amplitude of the subsequent saccade. Their graph (see figure 5) clearly shows that larger saccade amplitudes corresponded to fixation durations shorter than 200 ms, and smaller saccade amplitudes to fixation durations greater than 200 ms. This relationship between fixation and saccade allowed them to categorize ambient and focal fixations: ambient fixations are defined by large saccades and short fixation durations, and the other way around for focal fixations. These two kinds of fixation/saccade pairs reveal a coarse-to-fine eye movement strategy [OHVE07], in accordance with the cerebral visual processing modes [PV09].

Figure 5: Saccade amplitude as a function of fixation duration (extracted from [USP05]).

To know whether the ambient-to-focal strategy is correlated with a coarse-to-fine process or not, Follet et al. [FMB09] conducted an eye tracking experiment with hybrid stimuli from four different visual scene categories (Mountain, Street, Coast, Open Country). Hybrid pictures were built by mixing the low frequencies of one image with the high frequencies of another. The principle was to compare the visual scan patterns (i.e. the fixation duration/saccade amplitude relationship). They also measured, for each category, the similarity of gaze allocation between original and hybrid pictures. The idea is to assess the extent to which low and high spatial frequencies contribute to the guidance of eye movements. The results indicated, firstly, that eye movements were more or less driven by low or high frequencies and, secondly, that this depended on the visual scene category. These results suggest a relationship between the processing of low frequencies and the ambient-to-focal strategy.

3.3. Cognitive models of visual attention

Most researchers are now convinced that high-level factors coming from cognitive operations play a larger role in visual guidance [HBCM07]. The saliency map models described in section 2 are very efficient at determining the early fixations during scene viewing, but they are weaker at simulating subsequent fixations, when semantic information becomes more essential. In fact, fixation positions are less strongly tied to visual saliency when meaningful scenes are involved [OTCH03]. Gaze control becomes knowledge-driven over time, replacing or modulating visual saliency effects. But what kind of knowledge is involved in this visual guidance?
– Memory scene knowledge: this memory can involve either episodic knowledge or semantic knowledge.
Episodic memory stores personal events (times, places...) that can be explicitly recalled. The use of this memory in visual guidance can explain why we restrict the search region when we are looking for a specific object in a well-known scene. Semantic memory can be related to the concept of schemas or frames put forward earlier by Minsky (1974) [Min74]. These represent generic semantic and spatial knowledge about a particular type of scene, including information about the objects likely to be found (a bed in a bedroom), spatial regularities (a painting on a wall) and real-world knowledge about scenes (logs can float in a river). These examples show that long-term memory affects gaze control, but short-term memory does as well. For example, short-term knowledge can explain why a viewer refixates regions of the current scene that are semantically informative or interesting [LM78];
– Task-related knowledge: since Yarbus's experiment, there has been a lot of evidence showing that tasks involve selective fixations. Gaze control changes with reading, driving or searching. Even saliency maps are sensitive to the task, and simulate recognition experiments better than search experiments [UF06].
Modeling these types of high-level information is not trivial, and the question arises of how they can be combined with low-level factors (saliency maps). One approach is to consider that knowledge structures modify the bottom-up saliency [RZB02]. For example, looking for a car in a street does not require looking at the tops of buildings. Prior information about a scene maintained in long-term memory serves to constrain the visual field of search and, consequently, the effect of the saliency map. This notion forms the basis of the Contextual Guidance model, which highlights scene regions likely to contain a specific class of object [TOCH06]. This model predicts fixation positions during object search significantly better than a saliency map model. However, memory as well as task knowledge is very complex to integrate into a computational model, given the lack of an accurate definition. While Henderson and others have only built theoretical models, avoiding coding the semantic information, Navalpakkam and Itti [NI05] attempted to create a semantic memory by means of an ontology. This solution is far from satisfying since they first had to manually segment the objects and label them for inclusion in the ontology. The ontology is therefore not exhaustive and needs to be updated for each new stimulus. In the future, a challenge would be to automatically associate objects with their visual context, as is done in text processing by Latent Semantic Analysis [LFL98].

4. Perspectives regarding visual attention deployment

All these questions about computational models of visual attention rest on the status of the eye fixation and on whether it is possible to identify the main factors determining that fixation. Broadly speaking, models have to predict where and for how long we look at a visual scene (i.e., the spatial and temporal determinants of the fixation). As we have seen in section 2, both scene-statistics and saliency-map approaches have attempted to describe the location of fixations rather than their durations. However, fixation locations can also be related to the meaning of a fixated region, since meaningful objects are likely to differ from the scene background in their image properties.
The human gaze is under the control of a brain that processes data not only from the available visual input but also according to its mental structures (memory, goals, plans, beliefs...). As an example of that semantic influence, Henderson et al. [HP08] showed that fixation locations differed in both their image statistics and their semantic content. One of the future challenges for computational modeling of visual attention is therefore to address that semantic information and to investigate how it might be implemented. A promising way would be to consider another characteristic of gaze behavior, namely the variability of fixation durations. This variable has been largely ignored in gaze control research, whereas in other activities, such as reading, fixation duration represents the main measure of cognitive processing. Thus, it has been shown that fixation duration is affected by lexical frequency, predictability and semantic content [Ray98]. Obviously, looking at a visual scene does not involve the same processes as reading, in which fixation duration reflects a large part of the processing of the fixated region. However, this effect has also been found when viewing a visual scene [HP08]. The average fixation duration during scene viewing is around 330 ms [Hen07]. Visual scenes may contain several meaningful objects, and that meaning modulates the duration of fixations [USP05]. There are at least two possibilities for using fixation duration in a computational model of visual attention:
1) By weighting fixation position with fixation duration. Several open questions are still pending: whether this weighting should be applied linearly or not; what the main semantic factors are that modulate fixation duration during object viewing; and how to define the semantics of an object, by its own meaning, by its context of use, or probably by both. This approach is still a challenging direction for research;
2) By separating the different components of fixation duration and associating them with specific functions. One way that has already been explored is to combine eye fixations with other physiological variables such as the EEG [Bac10]. This technique, first introduced by Baccino and Manunta [BM05], is called Eye-Fixation-Related Potentials (EFRP) and consists in analyzing the EEG only during the fixation phases. One of the main advantages of this technique is that the interpretation of fixation duration can be enriched by several components coming from the separation of the EEG signal. That separation can be carried out by statistical procedures (PCA, ICA, PARAFAC) in order to detect whether any attentional or semantic component may be associated with the fixation, and to categorize it.

Modeling visual attention on scenes has received great interest from cognitive scientists in the last decade. Models were purely visual at the beginning; neurophysiological and cognitive information has progressively been added over the years. But there is still a long way to go before we have a satisfying model that is cognitively plausible.

References

[Bac10] T. Baccino. Eye movements and concurrent ERPs: EFRP investigations in reading. In S. Liversedge, I. D. Gilchrist & S. Everling (Eds.), Handbook on Eye Movements. Oxford University Press, 2010.
[Bar04] M. Bar. Visual objects in context. Nature Reviews Neuroscience, 5:617–629, 2004.
[BM05] T. Baccino and Y. Manunta. Eye-fixation-related potentials: Insight into parafoveal processing. Journal of Psychophysiology, 19(3):204–215, 2005.
[BT09] N.D.B. Bruce and J.K. Tsotsos. Saliency, attention and visual search: an information theoretic approach. Journal of Vision, 9:1–24, 2009.
[CMH09] M. Castelhano, M. Mack, and J.M. Henderson. Viewing task influences eye movement control during active scene perception. Journal of Vision, 9:1–15, 2009.
[FMB09] B. Follet, O. Le Meur, and T. Baccino. Relationship between coarse-to-fine process and ambient-to-focal visual fixations. In ECEM, http://www.irisa.fr/temics/staff/lemeur/, 2009.
[GV09] D. Gao and N. Vasconcelos. Bottom-up saliency is a discriminant process. In IEEE ICCV, 2009.
[HBCM07] J.M. Henderson, J.R. Brockmole, M.S. Castelhano, and M. Mack. Visual saliency does not account for eye movements during visual search in real-world scenes. In R. P. G. van Gompel, M. H. Fischer, W. S. Murray & R. L. Hill (Eds.), Eye Movements: A Window on Mind and Brain. Amsterdam, Netherlands: Elsevier, 2007.
[Hen07] J.M. Henderson. Regarding scenes. Current Directions in Psychological Science, 16(4):219–222, 2007.
[HP08] J.M. Henderson and G. Pierce. Eye movements during scene viewing: Evidence for mixed control of fixation durations. Psychonomic Bulletin and Review, 15(3):566, 2008.
[IB06] L. Itti and P. Baldi. Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems, 2006.
[IKN98] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on PAMI, 20:1254–1259, 1998.
[KU85] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4:219–227, 1985.
[LFL98] T.K. Landauer, P.W. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259–284, 1998.
[LM78] G.R. Loftus and N.H. Mackworth. Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology: Human Perception and Performance, 4:565–572, 1978.
[MCBT06] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau. A coherent computational approach to model the bottom-up visual attention. IEEE Trans. on PAMI, 28, 2006.
[Min74] M. Minsky. A framework for representing knowledge, 1974. Reprinted in P. H. Winston (Ed.), The Psychology of Computer Vision, McGraw-Hill, 1975.
[NI05] V. Navalpakkam and L. Itti. Modeling the influence of task on attention. Vision Research, 45:205–231, 2005.
[OHVE07] E.A.B. Over, I.T.C. Hooge, B.N.S. Vlaskamp, and C.J. Erkelens. Coarse-to-fine eye movement strategy in visual search. Vision Research, 47:2272–2280, 2007.
[Oli05] A. Oliva. Gist of the scene. In L. Itti, G. Rees, and J.K. Tsotsos (Eds.), Neurobiology of Attention, Elsevier, pages 251–256, 2005.
[OT06] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research: Visual Perception, pages 23–26, 2006.
[OTCH03] A. Oliva, A. Torralba, M.S. Castelhano, and J.M. Henderson. Top-down control of visual attention in object detection. In IEEE ICIP, 2003.
[PHR+08] S. Pannasch, J.R. Helmert, K. Roth, A.K. Herbold, and H. Walter. Visual fixation durations and saccade amplitudes: Shifting relationship in a variety of conditions. Journal of Eye Movement Research, 2:1–19, 2008.
[PIIK05] R. Peters, A. Iyer, L. Itti, and C. Koch. Components of bottom-up gaze allocation in natural images. Vision Research, 2005.
[PLN02] D. Parkhurst, K. Law, and E. Niebur. Modelling the role of salience in the allocation of overt visual attention. Vision Research, 42:107–123, 2002.
[PV09] S. Pannasch and B.M. Velichkovsky. Does the distractor effect allow identifying different modes of processing? In ECEM, 2009.
[Ray98] K. Rayner. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372–422, 1998.
[RZB02] R.P.N. Rao, G.J. Zelinsky, M.M. Hayhoe, and D.H. Ballard. Eye movements in iconic visual search. Vision Research, pages 1447–1463, 2002.
[SG00] D.D. Salvucci and J.H. Goldberg. Identifying fixations and saccades in eye-tracking protocols. In ETRA, 2000.
[SO94] P.G. Schyns and A. Oliva. From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition. Psychological Science, 5, 1994.
[SO99] P.G. Schyns and A. Oliva. Dr. Angry and Mr. Smile: when categorization flexibly modifies the perception of faces in rapid visual presentations. Cognition, 69:243–265, 1999.
[TFM96] S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381:520–522, 1996.
[TOCH06] A. Torralba, A. Oliva, M.S. Castelhano, and J.M. Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, 113(4):766–786, 2006.
[UF06] G. Underwood and T. Foulsham. Visual saliency and semantic incongruency influence eye movements when inspecting pictures. Quarterly Journal of Experimental Psychology, 59(11):1931–1949, 2006.
[USP05] P. Unema, S. Pannasch, M. Joos, and B.M. Velichkovsky. Time-course of information processing during scene perception: the relationship between saccade amplitude and fixation duration. Visual Cognition, 12:473–494, 2005.
[Woo02] D. Wooding. Fixation maps: quantifying eye-movement traces. In Eye Tracking Research and Applications, 2002.
[Yar67] A. Yarbus. Eye Movements and Vision. Plenum Press, New York, 1967.
[ZTM+08] L. Zhang, M.H. Tong, T.K. Marks, H. Shan, and G.W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8:1–20, 2008.