Biologically Inspired Cognitive Architectures (2013) 5, 10– 28
RESEARCH ARTICLE
Saliency prediction in the coherence theory of attention
Valsamis Ntouskos a, Fiora Pirri a,*, Matia Pizzoli b,1, Arnab Sinha a, Bruno Cafaro a

a ALCOR Lab., DIIAG, University of Rome, ''Sapienza'', Rome, Italy
b Artificial Intelligence Lab., University of Zurich, Zurich, Switzerland
KEYWORDS
Visual attention;
Saliency prediction;
Proto-objects;
Visual search;
Cognitive robotics;
Cognitive vision
Abstract
In the coherence theory of attention, introduced by Rensink, O'Regan, and Clark (2000), a coherence field is defined by a hierarchy of structures supporting the activities taking place across the different stages of visual attention. At the interface between low-level and mid-level attention processing stages are the proto-objects; these are generated in parallel and collect features of the scene at a specific location and time. These structures fade away if the region is no longer attended. We introduce a method to computationally model these structures. Our model is based experimentally on data collected in dynamic 3D environments via the Gaze Machine, a gaze measurement framework. This framework allows recording pupil motion at the required speed and projecting the point of regard in the 3D space (Pirri, Pizzoli, & Rudi, 2011; Pizzoli, Rigato, Shabani, & Pirri, 2011). To generate proto-objects the model is extended to vibrating circular membranes whose initial displacement is generated by the features that have been selected by classification. The energy of the vibrating membranes is used to predict saliency in visual search tasks.
© 2013 The Authors. Published by Elsevier B.V. All rights reserved.
This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-No Derivative Works License, which permits non-commercial use, distribution, and reproduction in any medium, provided the original author and source are credited.
The research has been supported by the EU project NIFTi.
* Corresponding author.
E-mail addresses: [email protected] (V. Ntouskos), [email protected] (F. Pirri), [email protected] (M. Pizzoli), [email protected] (A. Sinha), [email protected] (B. Cafaro).
1 The author contributed to the paper while he was at the Alcor Lab, in Rome.
1. Introduction
Saliency prediction in visual search requires understanding which features of the scene are processed and how, and in which way this processing delivers a structure that is taken over by attention, which then induces focusing on a selected region of the scene.
In artificial systems this is a crucial concept. There are two
main reasons for that. On the one hand the complexity of
searching the visual field is too high to be managed by
processing the whole visual input at the resolution of the fovea, as indicated by Tsotsos et al. (1995). On the other hand
feature detectors and orientation filters handle pre-attentive
processing by partially discarding the visual input, but they
cannot handle the further integration processing required
to lift up the low-level structures to focused attention.
We should note that artificial systems suffer from several limitations due to their mechanical, electronic and software components. Yet artificial systems need to learn to predict saliency to find targets in crowded scenes, without overloading their resources. This is a necessary step in the design of efficient cognitive systems, to avoid memory or reasoning being clogged and paralyzed by the huge amount of visual information acquired at possibly high frame rates. A tacit assumption is that artificial computational models rely on psychophysical, neurophysiological and psychological studies (PNP) on pre-attentional and attentional processing, and then add further constraints to these models to cope with the above-mentioned limitations.
This is the line of research mainly taken so far, though
following two main directions, namely predicting saccade directions and predicting saliency from the features standpoint. Predicting saccade directions has been analyzed in Koch and Ullman (1985), Tsotsos et al. (1995), Itti, Koch, and Niebur (1998), Minato and Asada (2001), and Belardinelli, Pirri, and Carbone (2007). Prediction of saccade targets with a number of features, via bottom-up models, has been tested in Carmi and Itti (2006).
In general, approaches have exploited the simulation of
saccades either by active cameras, as in Butko, Zhang, Cottrell, and Movellan (2008), Mancas, Pirri, and Pizzoli (2011),
or via biologically founded prior models of saliency as in Pichon and Itti (2002), Ackerman and Itti (2005), Hügli, Jost,
and Ouerhani (2005), Cerf, Harel, Einhäuser, and Koch
(2007), Sala, Sim, Shokoufandeh, and Dickinson (2006),
Mahadevan and Vasconcelos (2010), to cite some of the
works from the wide literature on saliency prediction.
In this paper we focus on the steps between feature analysis and collection and their integration into a coherent structure that is then passed to attention, basing our approach purely on collected data and on the concept of proto-object developed within the coherence theory of attention by Rensink (2000).
Indeed, since Treisman and Gelade's (1980) foundational work on feature integration, it became clear that in the pre-attentive, early vision phase, primitive visual features can be rapidly accessed in search tasks. For example colors, motion, and orientation can be processed in parallel and effortlessly, and the underlying operations occur within hundreds of milliseconds. So the pre-attentive level of vision is based on a small set of primitive visual features organized in maps, which are extracted in parallel, while the attentive phase serves to group these features into coherent descriptions of the surrounding scene. When attention takes control, processing passes from parallel to serial.
Since Treisman's feature integration theory, several further models for feature integration have been provided in the literature. Among those that led to a concept of representation we consider Duncan and Humphreys (1989), who observed a large differentiation in search difficulty across different stimulus material. On this basis Duncan introduces the theory of visual selection as distinguished into three stages: a parallel one, which produces an internal structured representation; a selective one, matching the internal representation; and a transduction one, providing the input of selected information to the visual short term memory. This theory relies on the evidence of low efficiency of parallel processing of basic features in the presence of heterogeneous distractors. On the basis of this observation Duncan introduces the concept of structural unit as an internal representation given to the visual input (close to the 3-D model of Marr & Nishihara (1978)). Further, Wolfe (1992) has supported the concept of structural units, by noting that visual search might
need grouping and categorization. Indeed, Wolfe, Friedman-Hill, Stewart, and O’Connell (1992) suggest that categorization is a strategy that is invoked when it is useful
and that it could affect different features of the visual input. Wolfe (1994) makes clear that attentional deployment
is guided by the output of earlier parallel processes, but its
control can be exogenous, based on the properties of the visual stimulus, or endogenous, based on the subject's task, and he introduces the notion of feature maps (see also Treisman, 1985) as independent parallel representations for a set of basic limited visual features. Finally, activation maps, both bottom-up and top-down, serve in Wolfe's (1994) model
to guide attention toward distinctive items in the field of
view. In summary Wolfe suggests that information extracted
in parallel, with loss of details, serves to create a representation for the purpose of guiding attention.
The huge amount of literature that has studied how,
from parallel processing, across large areas of the visual
field, focused attention emerges (see also Neisser & Becklen, 1975; Julesz, 1986) has led to the quest for a virtual
representation that could explain the way input is discarded
and selected features are integrated in a coherent
representation.
According to these principles, in this paper we propose a methodology, suitable for computational artificial attention, to study saliency for visual search in dynamic complex scenes, motivated by the concept of virtual representation developed in the coherence theory of attention of Rensink (2000), Rensink et al. (2000), and Rensink (2002). Rensink introduces the concept of proto-object as a volatile support for
focused attention, which is actually needed to see changes,
see Rensink, O’Regan, and Clark (1997). Rensink (2000)
assumes that proto-objects are formed in parallel across
the visual field and form a continuously renovating flux that
is accessed by focused attention. Proto-objects are
collected by focused attention to form a stable object
temporally and spatially coherent, which provides a
structure for perceiving changes.
In Fig. 1 Rensink’s triadic architecture is illustrated. In
this architecture the lower level corresponds to the retinotopic mapping and, going up, proto-objects are structures
for more complex feature configurations formed in parallel
across the visual field and lying at the interface between
low-level vision and higher attentional operations. These
structures are said to be volatile, and fading away as new
stimuli occur, within ‘‘few hundreds of milliseconds’’, as
detailed in Rensink et al. (2000). Focused attention, in
Rensink’s triadic architecture, accesses some of the
generated proto-objects to stabilize them and form individual objects ‘‘with both temporal and spatial coherence’’,
(Rensink, 2000).

Fig. 1 The image above, taken from Rensink (2000), illustrates Rensink's low-level vision architecture whose outputs are proto-objects that become the operands for attentional objects.

Proto-objects are linked within a coherence
field to the nexus, a structure coarsely summarizing the
properties of the stabilized ones. Proto-objects have been
explored in computational attention for modeling how
object recognition can use their representation and generation, thus at the high-level interface, in Walther and Koch
(2006), and in Orabona, Metta, and Sandini (2008). Here,
instead, we are interested in the other side of the interface,
namely we model their generation and study their spatial
and temporal persistence across the visual field in visual
search tasks. Note that we take into account real dynamic
environments. Furthermore we show that these structures
can be used to learn the parameters of the underlying process and predict saliency distribution across the scene.
The paper is organized around the problem of modeling
the data acquisition, for a freely moving subject, the recovery of the point of regard in the scene and the proto-object
generation, as follows. In the next section we illustrate how
to obtain the scanpath of a subject searching for some
objects in the scene, namely, how to obtain the position of the head and the direction of the gaze in the scene, using a wearable device, the Gaze Machine (GM). In the section Coherent features for point saliency, we illustrate how features are learned from the data acquired by the GM, specifically for a set of search tasks. Then, in the section Generating
Proto-Objects, we introduce a model for the generation of
proto-objects based on vibrating membranes to account
for their volatility, according to the learned features. Finally we provide some experimental validation.
2. Acquisition model for search strategy
estimation
To model saliency prediction, computational studies have quite limited resources available, as data acquisition is based on uncertain measurements and ground truth is available only if experiments are rather constrained. The realization of a wearable device that allows registering the Point of Regard of a subject in an unconstrained condition has made it possible to collect a great amount of data, see Fig. 2.
We aim at exploiting these data for modeling the features that are selected during a search task, whether these
specify general properties that are preserved across tasks or
local properties closely related to the target. These properties characterize the spatial and temporal relations inducing
the stimulus to be triggered. As highlighted in Serences and
Yantis (2006) the V4 area displays neural activity with features similar to the target, and this is the area involved in
the formation of a coherence field, according to the coherence theory of attention. Indeed, the interaction between
stimuli-driven and voluntary factors becomes further and
further relevant in the later stages of attentional processing, where more complex coherent fields of feature configurations are formed. From the standpoint of computational
attention a proto-object can be described as a configuration
of features having relative time and spatial coherence, directly affected by attention, and generating a motion field
pulling the gaze toward the target.
Proto-objects in this sense are dynamic and relatively
volatile feature structures related both to fast eye movements, namely saccades, and to saliency. These feature
structures are precursors of attention and further used by
attention to drive recognition – this is the double face of
proto-objects between pre-attentive and selective attention, as highlighted in Duncan and Humphreys (1989) and
Rensink (2000) – and can be localized in time and space:
proto-objects may last from a few milliseconds up to hundreds of milliseconds.
We recall that the POR, namely the Point of Regard, is the point on the retina at which the rays coming from an object regarded directly are focused. In particular, we assume that PORs are points on the fovea, subtending a visual angle of about 1.7°.
Saccades are fast eye movements that can reach peak velocities of 1000°/s. While a subject is moving, as in our framework, saccades do not exceed 30°, but the velocity follows an exponential function. According to Bahill and Stark (1979), the duration of 30° saccades can be up to 100 ms. Saccade models rarely explain the role of saliency, being mainly motivated by the need to model
motion control (see Bahill, Bahill, Clark, & Stark, 1975; Bahill & Stark, 1979; Zhou, Chen, & Enderle, 2009, and for a review see Kowler, 2011 and the references therein).

Fig. 2 The Gaze Machine (GM) worn by the subject collecting PORs in an outdoor search task.

It
follows that saccade models do not contribute to the
interpretation of proto-objects, although saccades
direction and speed are substantial to explain the motion
field a proto-object generates and how it fades away.
Similarly, saliency models not grounded in the 3D visual
scene fail to explain the coherence of proto-objects, their
motion field, hence their dynamics. To measure the volatility of proto-objects we rely on two models: a model of the
scan path, and a model of the surface response to the POR.
To obtain meaningful data from which parameters can be
estimated, we use an acquisition device, the Gaze Machine
specified in Pirri et al. (2011), here denoted GM. In particular we present below a novel method to recover the scan
path of the head and eyes of a subject wearing the
device.
2.1. Scan path estimation
The formal model for scene acquisition, PORs projections
into the retinal plane (image plane) and their registration
into the scene structure, while the subject explores the
environment, is the Gaze Machine (GM) model, described
in Pizzoli et al. (2011) and Pirri et al. (2011). Here we are
mainly concerned with the scan path of the head; namely
of the subject’s head, while she/he is moving across the
environment to perform a search task. The task implies possible return to previously focused regions, in so inducing
relations among the PORs at different time periods. In other
words the scan path model has to establish whether a set of
PORs belongs to the same saliency region, according to the
process deployed during search. Some results of scanpath
estimation, namely of the projection of the gaze on the visual field, are illustrated in Fig. 5.
First note that the GM enables well-controlled experiments: the device can be fitted well on the head, and the pupil acquisition rate can reach 180 Hz, ensuring a good approximation of saccades, while the visual field can be acquired at a rate of up to 30 Hz; the association with the much faster acquisition of gaze is maintained by time-stamping.
The GM calibrated stereo rig records the experimental stimuli, allowing for dense 3D reconstruction from multiple
views. Moreover, the localization of the subject in the 3D
experimental scenario is based on the visual data acquired
by the GM scene cameras.
The above statement asserts that the model we propose is quite general and allows a calibration procedure that is efficient and easy to perform in the field, with little intervention from the subject. After the calibration, the parameters for the model of eye positions are recovered and the gaze direction $\hat{q}(t)$ is computed, on the basis of the imaged pupil at time t, and the geometry of the multi-camera system.
The estimated POR is relative to the acquisition device
and a localization step is needed in order to measure gaze
behaviors in the 3D world taking into account the changes
in the pose of the subject’s head.
To build a map of gazed 3D points requires the following steps:

1. estimating the 3D POR $p_c$ in the reference frame of the GM left scene camera;
2. estimating the 3D pose (6 degrees of freedom) of the GM left scene camera in the reference frame of the experiment at hand, in terms of translation t and orientation R;
3. computing the 3D POR in the world reference frame as $p_w = R p_c + t$.
Note that the 3D PORs are naturally attached to 3D
points that are imaged in the retinal plane, and the 3D
points generate the 3D global map. For an abstract structure
of the hierarchical construction see Fig. 4.
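As a concrete illustration of step 3 above, the following minimal numpy sketch maps a POR expressed in the left scene camera frame into the world reference frame. The function name and the variables R_wc, t_wc and p_c are hypothetical, chosen only for this example.

```python
import numpy as np

def por_to_world(p_c: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray) -> np.ndarray:
    """Map a 3D POR from the camera frame to the world frame: p_w = R p_c + t.

    p_c  : (3,) POR in the left scene camera reference frame.
    R_wc : (3, 3) rotation of the camera with respect to the world frame.
    t_wc : (3,) translation of the camera with respect to the world frame.
    """
    return R_wc @ p_c + t_wc

# Example: a POR one metre in front of the camera, with the camera rotated
# 90 degrees about the vertical axis and displaced in the world frame.
R = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0]])
t = np.array([2.0, 0.0, 1.5])
p_w = por_to_world(np.array([0.0, 0.0, 1.0]), R, t)
print(p_w)  # -> [3.  0.  1.5]
```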
2.2. Subject localization
Most of the issues affecting the localization of a camera system, see Hartley and Zisserman (2000), Faugeras, Luong, and Papadopoulou (2001), also apply to the GM, with some notable differences. Indeed, the main concern of the GM localization is high precision in the estimation of the whole trajectory, needed to correctly estimate the 3D POR; see Fig. 3 for the head poses of a subject performing a search task.
We follow an efficient hierarchical approach subdividing
the whole trajectory into sets of frames, that we specify as
coherent subsequences. Indeed, subsequences are characterized by a high level of coherence in terms of what the
subject is attending in the course of the experiment. More
specifically, the pose estimation is performed sequentially, adding a new frame to the last acquired set, denoted subpath, as long as the estimation is sufficiently accurate, performing sparse bundle adjustment to enforce consistency and to avoid drifting; see Triggs, McLauchlan, Hartley, and Fitzgibbon (2000) and Hartley and Zisserman (2000).

Fig. 3 The figure illustrates the reconstruction of the scene where the subject is performing the experiment searching for the J, wearing the GM. The head poses are projected on the scene; the head poses are computed with the described localization algorithm.
Subsequences are induced by the selection of a keyframe
to delimit the coherence of head poses. Namely, the set of
keyframes constitutes a subset of the whole frame sequence
and a new keyframe, eliciting a new subsequence, is created upon the event of a change in the visual scene.
The sequence of images collected by the GM scene cameras is used to localize the subject in the experimental environment. The estimation of the subject’s pose relies on
matching descriptors from visual features corresponding to
the current view with those recorded in the map built so
far. The overall process is summarized as follows:
1. Take the first frame of the sequence as the first keyframe. A map of 3D feature points is initialized by triangulating matched image features in the first pair of
stereo frames.
2. For each new pair of stereo frames, compute matched
feature points and descriptors among left and right
views; triangulate to get a new set of unoptimized 3D
points. Match the computed descriptors with the current
map. Estimate the pose w.r.t. the current map and compute the POR in 3D. Check if a new keyframe has to be
selected, if not repeat 2.
3. Upon the selection of a new keyframe, add the current
frame to the keyframe list. Optimize by a local bundle
adjustment w.r.t. unoptimized 3D points and cameras
from the subsequence. Add the optimized points to the
map and empty the set of unoptimized points.
Let us call $(\tilde{x}_i, \tilde{X}_i)$, $i = 1, \ldots, N$, the N pairs of matched retinal plane and map points, $\tilde{x}_i \in \mathbb{R}^2$ and $\tilde{X}_i \in \mathbb{R}^3$ respectively. The pairs $(x_i, X_i)$ represent the same points in homogeneous coordinates: $x_i \in \mathbb{R}^3$ and $X_i \in \mathbb{R}^4$. The goal is to compute the pose, expressed by the rotation matrix R and translation vector t, of the camera that is projecting the 3D points $X_i$ into the retinal points $x_i$. We refer in general to cameras specified by a translation t, a rotation R and a calibration matrix K as $P = K[R\,|\,t]$. The rotation, translation and calibration might be decorated by superscripts specifying whether they involve the left (l), the right (r), or the scene (s) cameras. According to Hartley and Zisserman (2000), let us define the matrix K expressing the intrinsic camera parameters, namely the focal lengths $f_x$ and $f_y$ and the position of the principal point in image coordinates $(p_x, p_y)$, as

$$K = \begin{pmatrix} f_x & 0 & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{pmatrix}. \qquad (1)$$
Fiore’s linear algorithm for exterior orientation Fiore (2002)
has been used to generate multiple hypotheses in a RANSACbased, robust estimation process (Fischler & Bolles, 1981).
The core routine estimates the camera pose by solving
0 1
xi
B C
e i þ tÞ
Zi @ y i A ¼ sKRð X
i ¼ 1; . . . ; N:
ð2Þ
1
Here Zi, i = 1, . . . , N are the depth parameters and s is the
scale parameter. Note that these last parameters can be
recovered up to an arbitrary common scale factor, and that
the calibration matrices (likewise those of the eye cameras)
are pre-estimated. The algorithm first estimates Zi in order
to subsequently solve the problem of absolute orientation
with scale. The model selection process makes use of an error function that takes into account re-projection errors in
both the left and right retinal planes of the stereo pair.
Using the l and r superscripts to identify quantities related
to the left and right scene cameras, respectively, and
assuming the relative pose Rs and ts of the scene cameras
fixed to the GM stereo rig known from calibration, the error
function is:
$$\epsilon_i = d\!\left(sK^l R(\tilde{X}_i + t),\, x^l_i\right)^2 + d\!\left(sK^r R^s\!\left[R(\tilde{X}_i + t) - t^s\right],\, x^r_i\right)^2 \qquad (3)$$
where d is the Euclidean distance and Kl, Kr are the calibration matrices of the left and right scene cameras, see Pirri
et al. (2011). The two distance terms in Eq. (3) account
for reprojection errors in the left and right scene camera
planes. The largest consensus set is selected by RANSAC
according to Eq. (3) and used to estimate a model. A final
Levenberg-Marquardt optimization is carried out to refine
the linearly estimated pose by iteratively minimizing ei with
respect to R and t:
$$(R, t) = \arg\min_{R,t} \sum_i \epsilon_i. \qquad (4)$$

Fig. 4 Visual localization of the subject. Local consistency is enforced by optimization on frame subsequences, limited by keyframes. Frame registration with the 3D map ensures global consistency.
Details of the suggested minimization can be found, for
example, in Hartley and Zisserman (2000).
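The sketch below is a hedged stand-in for this pipeline: instead of Fiore's linear algorithm combined with the stereo error of Eq. (3), it uses OpenCV's solvePnPRansac followed by a Levenberg-Marquardt refinement on a single (left) retinal plane. It reproduces the RANSAC-plus-refinement structure, not the exact objective used in the paper.

```python
import numpy as np
import cv2

def estimate_pose_ransac(X_map, x_img, K, dist=None):
    """Robust camera pose from 3D map points and their 2D matches.

    X_map : (N, 3) 3D points in the experiment reference frame.
    x_img : (N, 2) matched pixel coordinates in the current frame.
    K     : (3, 3) intrinsic calibration matrix, as in Eq. (1).
    """
    X = np.ascontiguousarray(X_map, dtype=np.float64)
    x = np.ascontiguousarray(x_img, dtype=np.float64)
    dist = np.zeros((4, 1)) if dist is None else dist
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        X, x, K, dist, reprojectionError=2.0,
        iterationsCount=200, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("pose estimation failed")
    # Final Levenberg-Marquardt refinement on the consensus set, as in Eq. (4).
    idx = inliers[:, 0]
    rvec, tvec = cv2.solvePnPRefineLM(X[idx], x[idx], K, dist, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)          # rotation matrix from the Rodrigues vector
    return R, tvec.ravel(), idx
```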
Keyframe selection Upon the acquisition of a new pair
of scene frames, the pose of the subject is estimated from
matched features among the current frames and the 3D
map. This method guarantees a global consistency across
the whole experiment and it is accurate as long as the global
map is accurate.
At this point the goal is to detect the change in space of
the focus of overt attention in order to identify sequences
of PORs that exhibit a coherence in space and time.
The collected scene frames are clustered into subsequences according to the subject’s POR and keyframes are
used to delimit coherent subsequences. Roughly speaking,
keyframes consist of scene frames corresponding to time
steps in which the focus of overt attention changes and a
new sequence of PORs starts. Therefore, a strategy is required to select keyframes when no knowledge of the pose
and, thus, of the 3D point of regard of the subject is retained. We introduce a keyframe selection method that
evaluates the novelty of a view in the experiment by measuring how different it is from the last selected keyframe.
The quantities involved in the keyframe selection are the n matched pairs of visual features $\{(x, x')_i,\ i = 1, \ldots, n\}$ between the current scene frame and the last keyframe, and the pair $(c, c')$ of gaze positions as projected into the current frame and into the last keyframe. Note that in this phase the correspondences $(x, x')_i$ are drawn among frames collected by one of the scene cameras at different time steps and the pair $(c, c')$ refers to coordinates on the image plane.
A change in the subject’s vantage point induces a motion
of the camera acquiring the scene and a variation of the
POR in space. Suppose that the subject, during a search
task, is focusing on a particular object in the scene and that
her pose, in the experiment frame, can be described by a
certain motion model. This will induce a sequence of PORs
that is consistent with the given motion model. Therefore,
we evaluate the opportunity to instantiate a new keyframe
by checking the consistency of the current POR with a motion model estimated on the basis of frame to keyframe correspondences. We characterize the subject’s change in
head pose by means of two types of motion models that
can be estimated from the scene frames: a planar homography, represented by the H matrix, and the fundamental matrix F (see Hartley & Zisserman, 2000 for a comprehensive
treatment).

Fig. 5 A panoramic stitching and the PORs collected in 20 s; the stitching has been realized with 30 images over a collection of 600 left images of the scene. The acquisition of the scene is at 30 Hz while the acquisition of the eye is at 120 Hz. The PORs are measured on the scene via dense structure from motion and further reprojected on the retinal plane (image plane).

Fig. 6 Keyframe selection criterion. Left: C(F) (red), C(H) (blue) and C(F) − C(H) (green). Right: C(F) − C(H) (green) and d (magenta). Keyframes are selected in correspondence of the dashed lines.

A motion characterized by a small baseline between the current frame and the last keyframe is best described by a plane homography H. In contrast, when the
subject’s head undergoes a translational motion, the fundamental matrix F is more suitable to describe a general camera motion.
Building on the Geometric Robust Information Criterion (GRIC, Torr, 1998), a score function is evaluated for both the F and H motion models at every frame in order to quantitatively measure the fitness of each model to the data. The score function takes into account the n matched features with the last keyframe, the residuals $e_i$, the number k of model parameters, the error standard deviation $\sigma$, the dimension r of the data and the dimension q of the model:

$$C = \sum_{i=1}^{n} \rho(e_i^2) + \left[ nq\ln(r) + k\ln(rn) \right], \qquad (5)$$

where

$$\rho(e_i^2) = \min\left(\frac{e_i^2}{\sigma^2},\ 2(r - q)\right). \qquad (6)$$
Eq. (5) returns the lowest score for the model that best
fits the data. Once the motion model has been selected, it
is used to evaluate the gaze variation, see Fig. 6. According
to the selected motion model, changes in the subject's vantage point involving the gaze projections c and c′ can be detected and new keyframes are instantiated on the basis of the following criterion, balancing between the choice of a homography H and of the fundamental matrix F:

$$(C(F) - C(H))\, d < 0, \qquad d = \begin{cases} c'^{\top} F c & \text{if } C(F) < C(H) \\ \| Hc - c' \| & \text{otherwise.} \end{cases} \qquad (7)$$
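A minimal sketch of the keyframe decision of Eqs. (5)–(7) is given below; the parameter counts (k, q) and the data dimension r are standard GRIC choices and therefore assumptions here, and the residuals of the fitted F and H models are assumed to be computed elsewhere.

```python
import numpy as np

def gric(residuals, sigma, k, q, r=4):
    """GRIC score of Eq. (5) with the robust term of Eq. (6).

    residuals : per-match residuals e_i of the fitted motion model.
    sigma     : residual standard deviation.
    k         : number of model parameters (e.g. 8 for H, 7 for F).
    q         : model dimension (2 for a homography, 3 for F).
    r         : data dimension (4 for point matches across two views).
    """
    n = len(residuals)
    rho = np.minimum((np.asarray(residuals) / sigma) ** 2, 2.0 * (r - q))
    return rho.sum() + n * q * np.log(r) + k * np.log(r * n)

def new_keyframe(res_F, res_H, sigma, c, c_prime, F, H):
    """Keyframe criterion of Eq. (7): compare motion models, then check the gaze."""
    C_F = gric(res_F, sigma, k=7, q=3)
    C_H = gric(res_H, sigma, k=8, q=2)
    if C_F < C_H:                          # F explains the frame-to-keyframe motion better
        d = float(np.append(c_prime, 1.0) @ F @ np.append(c, 1.0))
    else:                                  # small-baseline motion: use the homography
        hc = H @ np.append(c, 1.0)
        d = float(np.linalg.norm(hc[:2] / hc[2] - c_prime))
    return (C_F - C_H) * d < 0.0
```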
Upon the instantiation of a new keyframe at time t, the
following steps are performed:
Subsequence Optimization. Let X be the set of unoptimized points, then this set is optimized by Sparse Bundle
Adjustment (SBA) (Lourakis & Argyros, 2009) on the
sequence of the last k camera poses, using a reprojection
error $\epsilon_{ij}$ as objective function

$$\min_{R_i, t_i, X_j} \sum_{ij} \epsilon_{ij}, \qquad (8)$$

with

$$\epsilon_{ij} = d\!\left(sK^l R_i(\tilde{X}_j + t_i),\, x^l_{ij}\right)^2 + d\!\left(sK^r R^s\!\left[R_i(\tilde{X}_j + t_i) - t^s\right],\, x^r_{ij}\right)^2. \qquad (9)$$

Here $i = t-1, \ldots, t-k$, $\tilde{X}_j \in X$ and $x^c_{ij}$, $c \in \{l, r\}$, is the point $\tilde{X}_j$ imaged by the i-th left or right camera respectively.
Map Upgrade. Let M be the global 3D map built so far; then M is updated with the new set of optimized points X: $M = M \cup X$.
Subsequence Initialization. The set of optimized points is
emptied and the number k of camera poses is set to 0.
When a new keyframe is selected, the previous subsequence is terminated, the corresponding points and cameras are optimized and the resulting structure is added to
the global map. Each subsequence as defined above is a
coherent subsequence as it collects a coherent set of PORs,
on a specific region in space.
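A simplified sketch of the windowed refinement follows; it holds the 3D points fixed and refines only the k camera poses with scipy's least_squares, and it composes the stereo extrinsics with the common convention X_r = R_s X_l + t_s, so it is an approximation of the SBA step rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(K, R, t, X):
    """Pinhole projection of 3D points X (N, 3) with pose (R, t) and intrinsics K."""
    Xc = X @ R.T + t
    x = Xc @ K.T
    return x[:, :2] / x[:, 2:3]

def window_residuals(params, K_l, K_r, R_s, t_s, points, obs_l, obs_r):
    """Stacked left/right reprojection residuals, in the spirit of Eq. (9), for k poses.

    params : (k*6,) axis-angle + translation for each of the k cameras.
    points : list of (N_i, 3) 3D points seen by camera i (held fixed here).
    obs_l, obs_r : lists of (N_i, 2) matched pixels in the left/right frames.
    """
    res = []
    for i, (X, xl, xr) in enumerate(zip(points, obs_l, obs_r)):
        w = params[6 * i: 6 * i + 6]
        R = Rotation.from_rotvec(w[:3]).as_matrix()
        t = w[3:]
        res.append((project(K_l, R, t, X) - xl).ravel())
        # Right camera: compose the left pose with the fixed stereo extrinsics.
        res.append((project(K_r, R_s @ R, R_s @ t + t_s, X) - xr).ravel())
    return np.concatenate(res)

# Usage (shapes only):
# result = least_squares(window_residuals, x0,
#     args=(K_l, K_r, R_s, t_s, points, obs_l, obs_r), method="lm")
```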
Fig. 7 illustrates the head pose and the PORs related to
the scanpath elicited during the search task looking for J
(see the Experimental validation section).
3. Coherent features for point saliency
In the previous section we illustrated how to compute the
head scanpath, leading to coherent subsequences of head
poses and gaze directions. Once the head poses are retrieved, the scene structure can be recovered using
the computed camera poses. The scene structure, even if
partial, is needed to collect the features of the attended regions. For example, a crucial feature is the space range of
PORs, and this is available only if the scene structure is
available. Note that by estimating the scene depth, using
the computed cameras, a point cloud of the scene structure
is obtained.
In this section we illustrate how the coherent subsequence of frames, the point of regard in space and the fixations on the retinal plane can contribute to the definition
of the set of features that best specify the visual search
task. We remark, though, that each search task experiment cleaves the feature set into some unknown prior component; this prior component cannot be recovered experimentally from the PORs data, as it is embedded in the prior knowledge the subject has about the shape, dimension and color of both the environment and the object, while she is performing the search.

Fig. 7 Head poses of the subject during the experiment searching for the J, computed with the described localization algorithm, and the rays joining the head pose with the PORs (the red circles) projected on the scene point cloud. The lines represent, ideally, the intersection of the visual axes.
Now, in our experimental approach, we build an inverse
problem, namely given the PORs, the head scan path and
the points in the image, we want to determine the properties that are common to all of the experiments. Once these
properties are identified then, as described in the following
section, we can use them to attempt to define a forward
model.
Here we want to recover the features that elicited the
PORs, from the scene structure, as computed from different
experiments. Features are specific for both the space geometry, such as position on a surface and orientation, and the
image, such as color and intensity variation. Slightly changing the notation adopted in the previous section, in the following we shall denote a non-homogeneous point in space or on the retinal plane as X and x, respectively, while in the previous section they were denoted by $\tilde{X}$ and $\tilde{x}$. On the other hand, when a homogeneous point is needed we shall denote it $\hat{X}$ or $\hat{x}$.
Let us consider a coherent subsequence of frames in terms of the set of collected PORs $\mathcal{X} = \{(X_1, t_0), \ldots, (X_m, t_q)\}$, $X_j \in \mathbb{R}^3$, labeled with the time stamp of their acquisition. It is easy to show that two PORs, even if the same region has been observed at time t and t′, cannot coincide, as no one is able to observe exactly the same point in space twice. Therefore, given the camera $P_j = K[R_j\,|\,t_j]$, there is only one retinal plane $I_h$ where the POR $X_h$ is imaged. However, if we consider the region around the POR, then the points in the region can be imaged into different retinal planes.
Now, for each coherent subsequence, define a monotonic grid of about $12 \times 10^3$ nodal points $n_X = (X, Y, Z)^{\top}$; then we approximate the point cloud with a thin plate surface $S : V \rightarrow \mathbb{R}^3$, $V \subset \mathbb{R}^2$, minimizing the energy functional:

$$M_a(S) = \sum_{i=1}^{n}\left(S(X_i, Y_i) - \hat{Z}_i\right)^2 + g\int_{\Omega}\left( S_{XX}(X, Y)^2 + 2 S_{XY}(X, Y)^2 + S_{YY}(X, Y)^2 \right) dX\, dY \qquad (10)$$

Here $S(X, Y) = Z$, $\hat{Z}_i$ is the depth of the i-th point in the point cloud, $S_{XX}(v)$, $S_{YY}(v)$, $S_{XY}(v)$ are the second order derivatives of S, g is a stabilization parameter, and $\Omega \subset \mathbb{R}^2$ is the surface domain; the first term on the rhs of (10) is the penalty term and the second one is the stabilizing functional; for the energy functional, see Hegland, Roberts, and Altas (1997).
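As a rough stand-in for the thin plate approximation of Eq. (10), the following sketch fits a smoothed thin-plate-spline surface to a placeholder point cloud with scipy; the smoothing parameter plays the role of the stabilization parameter g, although the numerical scheme differs from the finite-element method of Hegland et al. (1997).

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def fit_depth_surface(cloud_xyz: np.ndarray, smoothing: float = 1.0):
    """Approximate a point cloud with a smooth depth surface S(X, Y) = Z.

    cloud_xyz : (N, 3) point cloud expressed in the experiment frame.
    smoothing : stand-in for the stabilization parameter g of Eq. (10).
    """
    xy, z = cloud_xyz[:, :2], cloud_xyz[:, 2]
    return RBFInterpolator(xy, z, kernel="thin_plate_spline", smoothing=smoothing)

# Evaluate the surface on a regular grid of nodal points (about 12e3 nodes in the text).
cloud = np.random.default_rng(0).normal(size=(500, 3))      # placeholder point cloud
S = fit_depth_surface(cloud, smoothing=5.0)
gx, gy = np.meshgrid(np.linspace(-2, 2, 120), np.linspace(-2, 2, 100))
nodes = np.column_stack([gx.ravel(), gy.ravel()])
Z = S(nodes).reshape(gx.shape)                               # S(X, Y) on the grid
```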
A ray $X(\lambda) = P^{+}x + \lambda C$ backprojecting a point $x = (x, y, 1)^{\top}$, where $P^{+}$ is the pseudo-inverse of the current camera matrix and C its center, intersects the surface S at a point $p = (X, Y, S(X, Y))^{\top}$; when this point is a POR, it is denoted $p_w$. The surface patch around such a point $p_w$ is defined according to a distance threshold a; this surface patch is reprojected on the retinal planes of the subsequence, and forms a patch on the retinal planes which is called the coherent region. Therefore a coherent region is the foveated area in the image surrounding a gaze direction. Coherent regions in images are illustrated in Fig. 8.
Given the surface approximating the point cloud, we can sample from the whole data set retrieved from an experiment two different sets of points: on one hand the points on the surface patches centered at $p_w$ and the pixels of the coherent regions on the retinal planes, and on the other hand those points, on S and on the retinal planes, which have never been observed, according to the current subsequence. Once these points have been transformed into a feature space, we can obtain a training set $(W, h)$ such that $h = 1$ if the back-transformed item comes from a POR region and $h = -1$ otherwise.
Given a coherent subsequence $I_1, \ldots, I_q$ in a time interval $(t_0, t_0 + \Delta t)$, and its associated collection of PORs $\mathcal{X} = \{(X_1, t_0), \ldots, (X_m, t_0 + \Delta t)\}$, $X_j \in \mathbb{R}^3$, labeled with their time stamp, a surface S, and a region $s_P = \{p \in S \mid \|X - p\| \leq a\}$, with a the distance threshold indicated above, then for each point in $s_P$ there is a pixel x and a retinal plane $I_s$, $1 \leq s \leq q$, imaging it. Therefore the set of data obtained from the POR regions, given a coherent subsequence in a time interval $(t_0, t_0 + \Delta t)$ and the surface S, is:

$$\{(p, (x_1, \ldots, x_m)) \mid p \in S,\ \|X_P - p\| < a,\ \hat{x}_j = P_j \hat{X}(\lambda),\ 1 \leq j \leq m, \text{ with } x_j \text{ on some retinal plane } I_j \text{ in the subsequence}\} \qquad (11)$$

Here $\hat{x}$ and $\hat{X}$ are the homogenized versions of x and p, respectively. Points not in this set are the non-observed ones, and are sampled uniformly on the surface and projected onto the corresponding retinal plane points.
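A minimal sketch of how the sample set of Eq. (11) could be gathered is given below, assuming the fitted surface nodal points and the camera poses of the subsequence are available; the function name, argument layout and threshold value are illustrative assumptions.

```python
import numpy as np

def coherent_region_samples(surface_pts, por, cameras, K, a=0.15):
    """Collect the POR-region samples of one coherent subsequence, as in Eq. (11).

    surface_pts : (N, 3) nodal points p = (X, Y, S(X, Y)) of the fitted surface.
    por         : (3,) 3D point of regard X_P on the surface.
    cameras     : list of (R, t) poses of the frames in the subsequence.
    a           : distance threshold defining the surface patch around the POR.
    Returns the patch points and, per camera, their pixel projections.
    """
    patch = surface_pts[np.linalg.norm(surface_pts - por, axis=1) < a]
    projections = []
    for R, t in cameras:
        Xc = patch @ R.T + t                       # patch in the camera frame
        x = Xc @ K.T
        projections.append(x[:, :2] / x[:, 2:3])   # pixels of the coherent region
    return patch, projections
```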
Given the above sample set, it is possible to introduce a set of functions mapping points $p \in S$ and points $x \in \mathbb{R}^2$ to a suitable feature space. In feature space it is then possible to learn the function f separating points belonging to salient regions from all the other ones. More precisely, we introduce a set of transformations $\mathcal{F}$ mapping $p \in \mathbb{R}^3$ and $x_j \in \mathbb{R}^2$, $j = 1, \ldots, m$, into a feature space; the learned function f is then such that $f(\mathcal{F} \circ (p, (x_1, \ldots, x_m))) = h$, $h \in \{1, -1\}$. Here $\circ$ indicates that a transformation in $\mathcal{F}$ is applied to the specific set of points, as specified below.

Fig. 8 The sequence of images illustrates the notion of coherent region. Here the coherent regions induced by a subsequence of PORs are highlighted in red. They are identified among the frames collected during a search experiment with the GM on the street. In this case the experiment was ''looking for a fine''. The PORs are shown as white circles, while the current POR is shown as a white cross.
We aim at: (1) identifying the optimal set of features characterizing a search task and (2) defining the function f that separates regions that can or should be attended, according to the search task, from the unattended ones.
A large amount of literature on feature selection (see for
example Guyon & Elisseeff, 2003 and references therein)
uses a discriminative model, based on the well-known family of Support Vector Machines (SVMs; Vapnik, 1995), to select
the most significant features among a starting base set.
Given the set of all possible separating hyperplanes, there
are two main optimality criteria for identifying the best
one: the ℓ1- and the ℓ2-norm. In the former case the 1-norm SVM (Mangasarian, 2005) is obtained, with the ℓ1-norm known as the lasso penalty. In the latter case the standard SVM (Cristianini & Shawe-Taylor, 2004; Smola, Bartlett, Schölkopf, & Schuurmans, 2000) is obtained, and the ℓ2-norm is indicated as the ridge penalty. In Zhu, Rosset, Hastie, and Tibshirani (2003) it is argued that the 1-norm SVM has advantages over the standard 2-norm when there are redundant features.
The simplest method for achieving feature selection is recursive feature elimination (Guyon & Elisseeff, 2003), assigning a relative importance to a feature according to its weight within the SVM classifier (see Eq. (17) below). This method allows removing more than a single feature at a time, once a threshold has been identified.
A first observation for feature selection is that the data
collected by the Gaze Machine are available only for training and feature selection, while in general data are taken with a freely moving camera, maybe mounted on a robot pan-tilt head. In general we expect that visual search is performed by a single moving camera, that the camera localization and the camera parameters are available during search, and that a surface patch S for each coherent subsequence is available, though obviously the PORs are available only for the training dataset. Therefore no data specific to the GM can be selected.
Given the surface S, a point p = (X, Y, S(X, Y))> on it and
its projection x, we consider different surface parameters
that can be obtained from the first and second derivatives
of S, in space, and of the image intensity L. The surface
S(X, Y) = Z is parametric; let SX, SY be the first order partial
derivatives and SXX,SYY,SXY be the second order ones. In the
following we identify the surface S with its parametrization.
Let p be a point on S; the normal N at p is:

$$N = \frac{S_X \times S_Y}{|S_X \times S_Y|} \qquad (12)$$

Let v be a vector on the tangent plane at p; the matrices of the first and second fundamental forms of S are:

$$g = \begin{bmatrix} S_X^{\top} S_X & S_X^{\top} S_Y \\ S_X^{\top} S_Y & S_Y^{\top} S_Y \end{bmatrix}, \qquad H = \begin{bmatrix} S_{XX}^{\top} N & S_{XY}^{\top} N \\ S_{XY}^{\top} N & S_{YY}^{\top} N \end{bmatrix} \qquad (13)$$

The above matrices are both symmetric and det(g) > 0. Then we consider the Gaussian curvature $K_G = \det(H)/\det(g)$, namely:

$$K_G = \frac{H_{11}H_{22} - H_{12}^2}{g_{11}g_{22} - g_{12}^2} \qquad (14)$$

Actually we also considered the mean curvature. Namely, let the extremal values of H(v) be obtained, with $\|v\| = 1$, by maximizing the quadratic form $v^{\top}Hv$ under the constraint that $v^{\top}gv = 1$, and call these extremal values $\kappa_1$ and $\kappa_2$. Then the mean curvature is:

$$K_M = \frac{\kappa_1 + \kappa_2}{2} \qquad (15)$$

We have verified that $K_G$ is more influential than $K_M$; we indicate the Gaussian curvature of the surface S as $\sigma_S$.

Similarly, consider the patches with points $x = (x, y)^{\top}$ corresponding to the surface patch, with each x the projection of p according to the current camera. The Gaussian curvature for the RGB intensity surface is specified as:

$$\sigma_L = \gamma_1 \gamma_2 \qquad (16)$$

Here $\gamma_1$ and $\gamma_2$ are obtained as $\kappa_1$ and $\kappa_2$, considering the RGB intensity surface. Therefore also for the intensity surface we have considered the principal curvatures. Both $\sigma_S$ and $\sigma_L$ are invariant to rotation.
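A minimal sketch of Eqs. (12)–(14) on a depth map is given below, using finite-difference derivatives of the Monge parametrization (X, Y, S(X, Y)); it is an approximation of the curvature computation described above, not the authors' code.

```python
import numpy as np

def gaussian_curvature(Z: np.ndarray, spacing: float = 1.0) -> np.ndarray:
    """Gaussian curvature of a depth map Z = S(X, Y), per Eqs. (12)-(14).

    First and second fundamental forms are built from finite-difference
    derivatives of the surface parametrization.
    """
    Zy, Zx = np.gradient(Z, spacing)           # first order partials S_Y, S_X
    Zyy, Zyx = np.gradient(Zy, spacing)
    Zxy, Zxx = np.gradient(Zx, spacing)
    # First fundamental form g: E = 1 + Zx^2, F = Zx*Zy, G = 1 + Zy^2
    E, F, G = 1.0 + Zx ** 2, Zx * Zy, 1.0 + Zy ** 2
    # Second fundamental form H (components projected on the unit normal)
    denom = np.sqrt(1.0 + Zx ** 2 + Zy ** 2)
    L, M, N = Zxx / denom, Zxy / denom, Zyy / denom
    return (L * N - M ** 2) / (E * G - F ** 2)   # K_G = det(H) / det(g)

# Example: a unit-sphere cap z = sqrt(1 - x^2 - y^2) has K_G = 1 everywhere.
x, y = np.meshgrid(np.linspace(-0.4, 0.4, 200), np.linspace(-0.4, 0.4, 200))
Zs = np.sqrt(1.0 - x ** 2 - y ** 2)
K = gaussian_curvature(Zs, spacing=x[0, 1] - x[0, 0])
print(float(K[100, 100]))   # close to 1.0
```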
The last feature that turned out to be important is the task domain, namely the range of the values p corresponding to PORs. Its importance, as gathered above, is quite intuitive, since in general we do not search for an item in the sky unless we know in advance that it can challenge gravity. Clearly the constraints on the range can be given only on S. We define $R_s$ to be the plausibility interval $((X_{min}, X_{max}), (Y_{min}, Y_{max}), (Z_{min}, Z_{max}))$ for a search task s.
We can now list the features we have inferred. For the scene structure:

F1: the surface points on $S_i$, given in global coordinates, whose center 0 is the search task starting point; the surfaces are $n \times 3$ matrices;
F2: $\sigma_S$ for each patch corresponding to nodal points p on the surface;
F3: the plausible interval $R_s$ on the surface domain;
F4: the timestamp.

For the image structure, for each point x, image of p in frame I, the features are defined as follows:

F5: an image patch, centered at x and having size consistent with a meaningful distance Z of the projected point p; namely we fix the maximum depth to 3 m and the angle of acute vision to about 15 degrees;
F6: $\sigma_L$ for each image patch;
F7: the contrast sensitivity function (see Watson & Ahumada (2005)).
This concludes the set of feature operators. We consider a feature point $W = \mathcal{F} \circ (p, (x_1, \ldots, x_m))$. Following the approach of Schölkopf, Platt, Shawe-Taylor, Smola, and Williamson (2001), we map this set into the vector space defined by a kernel function and set a maximum margin classification problem to separate the data from the origin. Let $\Phi : D^n \rightarrow V_k$ represent a mapping to the vector space $V_k$ corresponding to the kernel function K. The separating hyperplane in $V_k$ space is computed by solving the quadratic program

$$\min_{w \in V_k,\ \xi \in \mathbb{R}^n_{+},\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{tn}\sum_i \xi_i - \rho \qquad (17)$$

$$\text{s.t.} \quad (w \cdot \Phi(W_i)) \geq \rho - \xi_i, \quad \xi_i \geq 0. \qquad (18)$$

Here the $\xi_i$ are slack variables, while t is a regularization parameter controlling the trade-off between the goals of maximizing the width of the margin and minimizing the training error at the points $\mathcal{F} \circ (p, (x_1, \ldots, x_m))$, which take value 1. So for a new point W the side of the hyperplane it falls on in $V_k$ can be determined by evaluating

$$f(W) = \mathrm{sgn}\left((w \cdot \Phi(W)) - \rho\right). \qquad (19)$$
The learned function, in principle, separates salient regions from non-salient ones. More precisely, given a set of corresponding points $\{(p, (x_1, \ldots, x_m))\}$, according to some cameras $P_1, \ldots, P_m$ mapping $\hat{p}$ into a point $\hat{x}$ in different scene images of the same bundle, given that $(X, Y, S(X, Y))^{\top}$ is the point on the surface corresponding to $X(\lambda)$, and given the feature transformation set $\mathcal{F}$, then $f(\mathcal{F} \circ (p, (x_1, \ldots, x_m))) = 1$ if this is a point in a possible salient region and $-1$ otherwise.
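A hedged sketch of the classification step of Eqs. (17)–(19), using scikit-learn's OneClassSVM, which implements the Schölkopf et al. (2001) formulation, follows; the feature matrix here is synthetic and only stands in for the features F1–F7.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the feature points W built from POR regions.
rng = np.random.default_rng(0)
W_train = rng.normal(loc=2.0, scale=0.5, size=(2000, 6))

clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)   # nu plays the role of t
clf.fit(W_train)

W_new = np.vstack([rng.normal(2.0, 0.5, size=(5, 6)),    # salient-like points
                   rng.normal(-3.0, 0.5, size=(5, 6))])  # non-observed-like points
h = clf.predict(W_new)                                    # +1 salient side, -1 otherwise
print(h)
```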
Results on the classification performed on the above devised feature set are illustrated in the section on Experimental validation. We can note that for a 50 s search experiment we collect about 1500 frames; since each image has dimension 480 × 640, we then have a number of points of the order of $10^{8.5}$. On the other hand, as at most 7 PORs are gathered in a single frame and for each POR we collect a surface of about 31 × 31 pixels, we have positive examples of the order of $10^{7}$, since PORs are often in the same region. Therefore we have rather sparse matrices.
The outcome of these experiments is to validate the feature
set across different search tasks and to understand what is
missing, what is actually part of a prior ability of the searcher and cannot be recovered from the data.
4. Generating proto-objects
In the previous sections we have illustrated a model for
head and point of regard localization in space for a gaze machine that can be worn by a subject looking for specific objects in the environment. Using the model we have
identified several features, among which we sorted out
the most relevant ones for learning a function that can separate the attended regions from the unattended ones, given
a specific search task. Note that the function needs to be
learned for each task, to cope with the PORs elicited during
the specific visual search experiment, though the set of features remain fixed: it is like a continuous recalibration
process.
This lack of generalization is to be expected: human visual search relies on an inner model able to generalize search, abstracting from the context and the specific task.
We argued in the introduction that this might be a consequence of the way features are aggregated into a coherent
structure, that is, a proto-object.
If the unknown function to be learned has to be one generalizing all the learned functions for all the search tasks,
then it should be a function minimizing a distance from all
the learned functions, for all the experimented tasks. This
function u should be one minimizing the following
functional:

$$E(u) = \int_{L}\int_{\Omega} w(X)\,\| u_X(W) - f(W) \|^2\, dX\, df \qquad (20)$$

Here f is any function learned for the task of visual search, with L its domain, w is a weight given to the features selected within classification, and X the observations. In other words, given a search task, the observations, the models specified by the features and the learned function space, E(u) returns the function u which is as close as possible to the value of any possible function selected by the learning process, where the distance is weighted by the features.
Here, however, rather than deriving the function u we
propose a forward model, based on the previously selected
features, which generalizes the learning results. The model
is based on wave motion, more specifically it is governed by
the equations of a vibrating membrane, with the membranes distributed on the surface S and having an initial displacement induced by the selected features at the specific
location.
The main idea of the model is to mimic the stimulus activation, during search, by integrating the features into a
vibrational energy. Indeed, due to the initial displacement,
the vibration model returns a vibrational energy that is higher where proto-objects are expected to be generated and
lower or null elsewhere.
In the following, after recalling the model of the finite
circular membrane we show how its motion is determined
by its initial displacement, induced by the features integration strength. Note that here we do not consider possible
interferences between two or more membranes. This will
be considered in future works. In Fig. 9 we illustrate the
underlying structure of the proposed model.
The general equation for a vibrating circular membrane, occupying a finite region, is the following:

$$\frac{\partial^2 u}{\partial t^2} = c^2\left( \frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r} + \frac{1}{r^2}\frac{\partial^2 u}{\partial \theta^2} \right), \qquad 0 \leq r < a,\ \theta \leq 2\pi,\ t > 0 \qquad (21)$$
This admits a solution by separating variables, and using the positive roots of the Bessel functions of the first and second kind. In particular, if the membrane is finite, as in our case, the Bessel functions of the second kind, of any order, are excluded from the solution. Indeed, the general solution of (21), for a membrane that is held fixed at the boundary r = a and is finite, is obtained using the Bessel functions of the first kind of any order as follows:

$$u(r, \theta, t) = \sum_{m=0}^{\infty}\sum_{n=1}^{\infty} \left\{ a_{mn}\sin(\kappa_{mn} t) + b_{mn}\cos(\kappa_{mn} t) \right\}\left\{ a^{H}_{mn}\sin(m\theta) + b^{H}_{mn}\cos(m\theta) \right\} J_m(\kappa_{mn} r) \qquad (22)$$

Here $J_m$ is the Bessel function of the first kind of order m, $\kappa_{mn}$ is the nth root of $J_m$, and $a_{mn}$, $b_{mn}$, $a^{H}_{mn}$ and $b^{H}_{mn}$ are constants that can be determined by the initial conditions of the membrane. We recall that the Bessel functions are the solutions of the second order differential equation

$$z^2\frac{d^2y}{dz^2} + z\frac{dy}{dz} + (z^2 - m^2)y = 0 \qquad (23)$$

with two classes of solutions, the $J_m$ of the first kind and the $Y_m$ of the second kind. Though, as observed above, here the Bessel functions of the second kind are disregarded.
The interest of the membrane lies in its vibration modes: they provide a plausible model for integrating features and, accordingly, they release energy via their displacement, and because of the Bessel function the energy vanishes in time.
The main aspect of the model is to provide the right initial displacement so that a solution is found in closed form,
for up to a certain order, and the energy induced pulls
attention or it fades away, as suggested in the coherence
theory.
Let $(r, \theta, Z)$ be the cylindrical coordinates of a nodal point X on the surface. Let c be the contrast sensitivity, and let $\sigma = \sigma_S + \sigma_L + \epsilon$ be the surface variations introduced in the previous section (see Eqs. (14), (16)). We assume that the initial velocity is zero, namely $\partial u/\partial t|_{t=0} = 0$; therefore the general solution becomes:
$$u(r, \theta, t) = c\sum_{m=0}^{\infty}\sum_{n=1}^{\infty} \left[ a_{mn} J_m(\kappa_{mn} r)\sin(m\theta) + b_{mn} J_m(\kappa_{mn} r)\cos(m\theta) \right]\cos(c\,\kappa_{mn} t) \qquad (24)$$

Fig. 9 The figure illustrates the model for generating proto-objects based on wave motion. The model generates vibrations at nodal points where, according to the integrated features, a stimulus should occur.
Using the initial condition $c(r, \theta, 0)$, we can separate the inner summations of the above Eq. (24), for t = 0, as follows:

$$C_m = \sum_{n=1}^{\infty} a_{mn} J_m(\kappa_{mn} r), \qquad D_m = \sum_{n=1}^{\infty} b_{mn} J_m(\kappa_{mn} r) \qquad (25)$$

and by Fourier series obtain:

$$C_m = \begin{cases} \dfrac{1}{\pi}\displaystyle\int_0^{2\pi} c(r, \theta, 0)\cos(m\theta)\, d\theta, & m \geq 1 \\[2mm] \dfrac{1}{2\pi}\displaystyle\int_0^{2\pi} c(r, \theta, 0)\, d\theta, & m = 0 \end{cases} \qquad D_m = \frac{1}{\pi}\int_0^{2\pi} c(r, \theta, 0)\sin(m\theta)\, d\theta, \quad m \geq 1 \qquad (26)$$

Now, we let the initial displacement be given by the following equation:

$$c(r, \theta, 0) = \frac{1}{2}\,\sigma\, r\, \exp\!\left(-\frac{z^2}{2\sigma^2}\right)\sin\!\left(\frac{1}{z}\,\theta\right) \qquad (27)$$

This initial displacement ensures that where the surface variations $\sigma$ increase the energy increases too, while the frequency at which the energy is released depends on the radius and on the $\theta$ values, in such a way that distant points on the surface, namely for increasing values of Z, are penalized. Using Eq. (26) we obtain:

$$C_0 = \frac{4 z \sigma r \exp\!\left(-\frac{z^2}{2\sigma^2}\right)\sin^2\!\left(\frac{\pi}{z}\right)}{\pi} \qquad (28)$$

$$C_m = \frac{4 z \sigma r \exp\!\left(-\frac{z^2}{2\sigma^2}\right)\left[\cos\!\left(\frac{2\pi}{z}\right)\sin(2m\pi) - mz\cos(2m\pi)\sin\!\left(\frac{2\pi}{z}\right)\right]}{\pi(m^2 z^2 - 1)}, \quad m > 0 \qquad (29)$$

and

$$D_m = \frac{8 z \sigma r \exp\!\left(-\frac{z^2}{2\sigma^2}\right)\left[1 + \cos(2m\pi)\cos\!\left(\frac{2\pi}{z}\right) + mz\sin(2m\pi)\sin\!\left(\frac{2\pi}{z}\right)\right]}{\pi(m^2 z^2 - 1)}, \quad m \geq 1$$

Finally the coefficients $a_{mn}$ and $b_{mn}$ are obtained as follows:

$$a_{mn} = \frac{2}{\pi a^2 J_{m+1}(\kappa_{mn} a)^2}\int_0^a r J_m(\kappa_{mn} r)\, C_m\, dr = \frac{2^{2-m}\sigma z\,\Gamma\!\left(\frac{m+3}{2}\right)\exp\!\left(-\frac{z^2}{2\sigma^2}\right)\left[mz\sin(2\pi m)\sin\!\left(\frac{2\pi}{z}\right) + \cos\!\left(\frac{2\pi}{z}\right) - 1\right]\left(\frac{\kappa_{m,n}}{2}\right)^{m} K}{\pi(m^2 z^2 - 1)\, J_{m+1}(\kappa_{m,n})^2} \qquad (30)$$

Here $\Gamma$ is the Gamma function and $K = {}_1\tilde{F}_2\!\left(\frac{m+3}{2};\ \frac{m+5}{2},\ m+1;\ -\frac{1}{4}\kappa_{m,n}^2\right)$, where ${}_p\tilde{F}_q(a; b; z)$ is the regularized generalized hypergeometric function. The second coefficient $b_{mn}$ is given below:

$$b_{mn} = \frac{2}{\pi a^2 J_{m+1}(\kappa_{mn} a)^2}\int_0^a r J_m(\kappa_{mn} r)\, D_m\, dr = \frac{2^{2-m}\sigma z\,\Gamma\!\left(\frac{m+3}{2}\right)\exp\!\left(-\frac{z^2}{2\sigma^2}\right)\left[mz\cos(2\pi m)\sin\!\left(\frac{2\pi}{z}\right) - \sin(2\pi m)\cos\!\left(\frac{2\pi}{z}\right)\right]\left(\frac{\kappa_{m,n}}{2}\right)^{m} K}{\pi(m^2 z^2 - 1)\, J_{m+1}(\kappa_{m,n})^2} \qquad (31)$$

Analogously, here $\Gamma$ is the Gamma function and $K = {}_1\tilde{F}_2\!\left(\frac{m+3}{2};\ \frac{m+5}{2},\ m+1;\ -\frac{1}{4}\kappa_{m,n}^2\right)$, where ${}_p\tilde{F}_q(a; b; z)$ is the regularized generalized hypergeometric function.

Fig. 10 Vibrations generated by different initial displacements, according to the initial feature values. The interface, made in Mathematica, allows understanding the influence of the Gaussian curvatures $\sigma_S$ and $\sigma_L$, for S and L, specified in the GUI as variance, and of the distance Z, on the vibration frequency.

Noting that the roots
of the Bessel Jm are easily computed with Mathematica,
Matlab or Maple, it follows that up to a given order and to
a given root, the vibrating membrane takes a solution for
varying feature values in closed form. Some of the computed membranes, with vibrations varying according to the features inducing the initial displacement $c(r, \theta, 0)$, are illustrated in Fig. 10, showing some of the vibration modes.
The full algorithm to compute the energy elicited by the features structured by the vibrating membrane and to generate proto-objects is as follows. First of all, let us define $D = \bigcup S \setminus R$ to be the domain of all the experiments, in terms of the plausible regions R. Let Q be a coherent subsequence of frames, and $\{\hat{Z}\}_{i=1,\ldots,n}$ the point cloud for Q; note that a coherent subsequence includes no more than 15 frames, hence it is labeled by a time interval $(t_0, t_0 + \Delta t)$ of less than half a second. Let $K[I\,|\,0]$ be the reference camera and $[R\,|\,t]_1, \ldots, [R\,|\,t]_m$ the poses of the other views with respect to the reference one.
1. For each nodal point p of S, such that p 2 D, and for each
projected pixel, according to the camera poses, select
the regions generated by the points (p, (x1, . . . , xm))
restricted to the domain D.
2. Compute the feature set W for the sampled set.
3. Using the above equations, and the obtained features W
at each nodal point, compute the vibrating membrane,
allowing the radius r to vary about the membrane distribution on S, between 1 and 5. Here we exploit the precomputation of the Bessel roots in a lookup table.
4. Compute Eq. (24) for each $0 \leq m \leq 12$ and for $1 \leq n \leq 9$. Define the membrane surface as:

$$(r_m\cos(\theta),\ r_m\sin(\theta),\ u(r, \theta, t)) \qquad (32)$$

with t varying from zero to the maximum time lapse of the subsequence interval. Some examples with varying r, z, and $\sigma$ are illustrated in Fig. 10. Sum the membrane surface absolute values for each time $t \in (t_0, t_0 + \Delta t)$ and, using gradient descent, find the membranes that have maximal energy at $t_0 + \Delta t$.
5. The nodal points with maximal energy are generators of
proto-objects.
6. Consider the energy of all the neighbors of these selected nodal points, according to the maximal radius a, and identify these patches in S and their projections on the retinal planes of the subsequence as the proto-objects predicting saliency.
Results of this algorithm, for the indoor experiments
looking for J and looking for the pink elephant are illustrated in Fig. 11.
5. Experimental validation
Experiments are at the basis of our experimental model of
saliency, whose main stages are shown in the left panel of
Fig. 12.
Fig. 11 Comparison between PORs taken from a coherent subsequence and the inferred proto-objects. We can see that on the whole the generated proto-objects are plausible.

An experiment begins with a calibration phase, in which the subject moves her/his eyes, head and body while fixating a specified target. This phase is needed to calibrate
the wearable device with the subject's eye motion manifold and scene cameras, as illustrated in Pirri et al. (2011). Thereafter, according to the search task, the search experiment lasts a certain amount of time T, 120 s ≤ T ≤ 180 s, and it collects the frame sequence F, of the left and right images, at a frequency $f_T \in [15, 30]$ Hz; frames are gathered in bundles specifying the local coherence of the gaze motion. Further it collects the pupil sequence P at a frequency $f_t \in [120, 180]$ Hz and the head motion H via a compact inertial device that is part of the acquisition device. Data are processed off-line and the following set of data is returned, together with a synchronization of images, visual axes and head poses: the head pose in global coordinates H via the localization (Pizzoli et al., 2011); the point cloud M in global coordinates; the visual axes of the eye manifolds, namely the PORs directions, projected as points in the global coordinates of the scene P; and the reprojection of the PORs in the images $R_{POR}$, synchronized, so that in each image a certain amount of PORs, between 7 and 15, is reprojected. Finally, B are the relative positions of the observer with respect to the scene.

An experiment, therefore, comes with the following formal structure:

$$E = \langle H, M, (B, \Delta T), (P, \Delta t), R_{POR} \rangle \qquad (33)$$

Here $\Delta T$ is the time lapse between two measurements of the scene, $\Delta T \approx 60$ ms; $\Delta t$ is the time lapse between two measurements of the PORs direction in the scene, $\Delta t \approx 8$ ms, exploiting the scene constancy – namely, the speed of the eyes is faster than any meaningful motion in the scene and of the head and body motion. To these data we add
the membrane structures to support the proto-objects.
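A minimal container mirroring the tuple of Eq. (33) could be sketched as follows; the field types and shapes are assumptions made only for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Experiment:
    """Record mirroring the formal structure E of Eq. (33)."""
    H: np.ndarray        # head poses in global coordinates, one 6-DoF pose per scene frame
    M: np.ndarray        # (N, 3) dense point cloud of the scene in global coordinates
    B: np.ndarray        # relative positions of the observer w.r.t. the scene
    dT: float            # time lapse between two scene measurements (~60 ms)
    P: np.ndarray        # 3D PORs (visual axes projected into the global frame)
    dt: float            # time lapse between two POR measurements (~8 ms)
    R_POR: np.ndarray    # PORs reprojected onto the synchronized images
```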
The principal outcomes of an experiment E are the PORs
and their localization in the 3D space together with the
localization of the head pose in the dense map reconstruction of the scene. These are illustrated in Figs. 3 and 7, showing
the dense map, the path of the head poses, together with
PORs as located in the natural scenes, and in Fig. 5, showing
a meaningful part of an experiment, via a stitched panorama, with the PORs reprojected on the images. A typical
dataset with the tracked head poses, a dense point cloud
with the projected PORs is illustrated in Fig. 13.
5.1. Experimental validation of the acquisition
model
Investigating the accuracy of the proposed acquisition model involves different aspects. Localization and mapping of
the POR in the 3D scene rely on the estimation of the POR
relative position and the localization of the subject in the
reference frame of the experiment. In addition, the identification of coherent regions depends on the effectiveness of
the keyframe-based mechanism to detect changes in the
POR sequence.
A first evaluation investigates the accuracy of the proposed method in localizing and mapping the PORs.
The ground truth has been produced as follows: five visual
landmarks have been placed in the experimental scenario
and their position has been measured with respect to a fixed
reference frame; six subjects have been instructed to fixate
the visual landmarks while freely moving in the scenario,
annotating (by voice) the starting and ending of the landmark observations. In each sequence, an average of 60 PORs
were produced for each landmark. The validation sequences
comprise about 6000 frames each. After registration of the subject's initial pose with the fixed reference system, the PORs in the annotated frames were computed and compared with the ground truth, producing a Root Mean Square (RMS) error of 0.094 m.
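For reference, the RMS comparison against the landmark ground truth can be sketched as follows, under the assumption that the mapped PORs and the measured landmark positions are expressed in the same fixed reference frame; variable names are illustrative.

import numpy as np

def por_rms_error(mapped_pors, landmark_positions):
    # mapped_pors, landmark_positions: arrays of shape (N, 3) in the fixed
    # reference frame, paired through the voice annotations.
    residuals = np.linalg.norm(mapped_pors - landmark_positions, axis=1)
    return np.sqrt(np.mean(residuals ** 2))  # e.g. 0.094 m in this validation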
For a quantitative analysis of the keyframe selection
strategy we relied on a manual coding to produce ground
truth data: after the acquisition, subjects were shown the
scene sequence overlaid with the POR projection on
the image plane and used their innate human pattern recognition skill to select coherent subsequences, annotating for
each one the starting keyframe. The performance measure
is the agreement, defined as the ratio between the number of keyframes recognized by the system and the number of keyframes identified by the subject. Experiments on sequences characterized by a number of frames in the range 4000–6000, yielding a number of keyframes in the range 120–200, produced an average agreement of 85%.
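A minimal sketch of the agreement score is given below; counting per sequence and averaging across sequences is an assumption about how the average figure is obtained.

def agreement(n_system_keyframes, n_subject_keyframes):
    # Ratio between the number of keyframes recognized by the system and the
    # number of keyframes identified by the subject, for one sequence.
    return n_system_keyframes / n_subject_keyframes

def average_agreement(per_sequence_counts):
    # per_sequence_counts: iterable of (system_count, subject_count) pairs.
    scores = [agreement(s, h) for s, h in per_sequence_counts]
    return sum(scores) / len(scores)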
Validation of the coherent subsequence. Coherent regions constitute the support for the attended proto-objects
during an experiment. Each coherent region also selects, in
the related sequence of frames, the appearance of the attended structure that is used to train the saliency model.
To validate the method introduced in Section Coherent features for point saliency, we quantified the extent of the
coherent region projections in each of the related bundle
images. The result for an experiment producing 16 regions,
with centroid distances ranging from 1.8 to 8 m from the observer, is shown in Fig. 14. For each region, the extent of its projection to the frames of the sequence is evaluated as a percentage of the total number of pixels in a frame. Scene frames have size 640 × 480 pixels in the experiments. Fig. 14 shows the median values, with the boxes representing the 25th and 75th percentiles and the minimum and maximum values. The validation confirms that the extent of the projections is mostly confined between 1% and 10% of the image
area, and is thus suitable for the proposed feature model.
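The extent measure summarized in Fig. 14 can be sketched as follows, assuming each coherent region projection is available as a boolean mask per frame; the mask representation and names are assumptions.

import numpy as np

def projection_extent_percent(region_mask):
    # region_mask: boolean array (480 x 640) marking the projection of a
    # coherent region in one frame; returns the extent as a percentage.
    return 100.0 * region_mask.sum() / region_mask.size

def extent_box_stats(extents):
    # Median and 25th/75th percentiles of the per-frame extents of one region,
    # as displayed by the boxes in Fig. 14.
    return np.median(extents), np.percentile(extents, 25), np.percentile(extents, 75)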
5.2. Validation of the features model
Given a visual search task, we have implemented both a slightly varied version of Mangasarian and Wild (2007) and the simpler selection method addressed in Guyon and Elisseeff (2003). Focusing on sets of features we obtain the balanced error rate as follows:
\mathrm{ber} = \frac{1}{2}\left(\frac{wp^{+}}{|D|^{+}} + \frac{wp^{-}}{|D|^{-}}\right) \qquad (34)
Here |D|+ are the positive instances and |D|− are the negative ones, while wp+ and wp− are, respectively, the false negatives and false positives. In the case of the approach of Mangasarian and Wild (2007), to keep track of the decrease of the objective function on feature groups, we generate k!/((k − m)!m!) m-tuples of features, up to k = 5, so as to assign a ber value to each feature group.
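A hedged sketch of Eq. (34) and of the enumeration of the m-tuples is given below; evaluate_group is a placeholder standing for the training and testing of a classifier restricted to a feature group, not the implementation of Mangasarian and Wild (2007).

from itertools import combinations

def balanced_error_rate(wp_pos, wp_neg, n_pos, n_neg):
    # Eq. (34): wp_pos / wp_neg are the false negatives / false positives,
    # n_pos / n_neg the numbers of positive / negative instances.
    return 0.5 * (wp_pos / n_pos + wp_neg / n_neg)

def ber_per_feature_group(features, evaluate_group, k=5):
    # Assign a ber value to each m-tuple of features, m = 1..k.
    # evaluate_group(group) is assumed to return (wp_pos, wp_neg, n_pos, n_neg).
    scores = {}
    for m in range(1, k + 1):
        for group in combinations(features, m):
            scores[group] = balanced_error_rate(*evaluate_group(group))
    return scores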
A model trained on the complete set of features, selected as described in Section Coherent features for point saliency, is able to predict if a new sample point is likely to be attended, i.e., if it belongs to a coherent region, when the experiment is fixed. To validate this assumption, we ran maximum margin classification experiments. A K-fold cross-validation strategy has been followed: we divided the available data, comprising more than 6 million points, into 3 subsets; in turn, two of the three subsets have been used to train the classifier and the remaining one for validation.
The process is iterated until every subset is used for validation. As expected, classification accuracy is very high, as
reported in Table 1.
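A minimal sketch of the 3-fold procedure follows, using a linear SVM as a generic maximum margin classifier; scikit-learn is an assumption and not the toolchain used for the reported results.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

def three_fold_accuracies(X, y):
    # X: (n_points, n_features) selected features; y: 1 if the point belongs
    # to a coherent region (attended), 0 otherwise.
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=3, shuffle=True).split(X):
        clf = LinearSVC().fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    return accuracies  # one accuracy per iteration, cf. Table 1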
Fig. 13 Dataset E of a typical visual search experiment with the GM device; the dataset includes: point cloud, head scan-path,
projection of PORs in space and on the retinal planes.
Fig. 14 Box plot for the extent of 16 coherent regions identified in a GM experiment on the street. The extent of the coherent
regions is given as a percentage of the frame dimension in pixels.
Table 1 Results from the k-fold cross-validation of the maximum margin classification using the complete image+bundle feature set. Here wp+ and wp− are, respectively, the false negatives and false positives.

Iteration   Number of positives   wp+/|D|+   wp−/|D|−   Accuracy (%)
1           44,707                0.0127     0.0318     95.334
2           46,881                0.01883    0.0206     93.591
3           420,034               0.0093     0.0157     93.019
The accuracy is illustrated, in particular, in the tables of Fig. 15, where the outcome of the classification and the measured PORs are highlighted, the former in green and the latter in red.
5.3. Validation of the vibration model
To validate the vibration model we have tested the algorithm described in Section Generating Proto Objects. The
26
V. Ntouskos et al.
Fig. 15 Results of features and classification validation for the outdoor experiment looking for car fines. In red the PORs, and the
coherent patches, in green the estimated point saliency, for the specific task.
Fig. 16 Results for computed POR as functions of energy vibration at time t0 + Dt, given the domain of the specified experiments,
and given the limited domain of selected experiments.
The implementation of the membrane has been done both in Mathematica, where a GUI is implemented to study the variations according to the initial displacement conditions (see Fig. 10), and in Matlab, exploiting a look-up table of the Bessel roots computed in Mathematica. We also used the gridfit implementation by D'Errico (2013) for surface approximation. After classification, we have collected the domain elicited by the learned function and generated two sets. In the first set the domains are free, namely the range of the p values is given by the domains of all experiments; in the second set we have limited the range to similar domains. The results are illustrated in Fig. 16. Here the number of PORs per experiment indicates the pw collected by the GM, with varying experiments, both indoor and outdoor. The number of proto-objects in coherent regions indicates the regions of maximal energy at t0 + Δt, computed at the time steps given for the end of a coherent subsequence.
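For completeness, a minimal sketch of the vibrating circular membrane used by the forward model is given below; it relies on scipy for the Bessel functions and their roots instead of the Mathematica look-up table, and the radius, wave speed and modal coefficients are illustrative assumptions.

import numpy as np
from scipy.special import jv, jn_zeros

def membrane_displacement(r, theta, t, coeffs, a=1.0, c=1.0):
    # coeffs[(m, n)] -> amplitude of mode (m, n), fixed by the initial
    # displacement generated from the classified features.
    u = np.zeros_like(r, dtype=float)
    for (m, n), amplitude in coeffs.items():
        lam = jn_zeros(m, n)[-1] / a  # n-th positive root of J_m, scaled by a
        u += amplitude * jv(m, lam * r) * np.cos(m * theta) * np.cos(c * lam * t)
    return u

def membrane_energy(r, theta, coeffs, t0, dt, n_steps=20):
    # Sum of the absolute surface values over (t0, t0 + dt), used to rank the
    # membranes and select the proto-objects of maximal energy.
    times = np.linspace(t0, t0 + dt, n_steps)
    return sum(np.abs(membrane_displacement(r, theta, t, coeffs)).sum() for t in times)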
6. Conclusions
The computational theory of visual attention aims at mimicking the human capability to select, among stimuli
acquired in parallel, those that are relevant for the task
at hand. Similar to the biological counterpart, artificial systems can accomplish this by orienting the vision sensors toward regions of space that are more promising. 3D saliency
prediction resides in defining a quantitative measure of how
attention should be deployed in the three-dimensional
scene. Current state-of the art does not model the integration of features in space and time, which is required when
dealing with a three-dimensional, dynamic scene. In the
coherence theory of attention, as introduced in Rensink
et al. (2000), the concept of proto-object emerged to explain how focused attention collects features to form a stable object that is temporally and spatially coherent. In this
work we address the problem of modeling the process of formation of proto-objects and their relative spatial and temporal coherence according to a double process. At first a purely experimental setting allows us to identify the best features, which are stable across different experiments and different contexts. We show their stability using a classifier that has also been exploited to select the best features. Further, we define a forward model based on the selected features. The forward model defines a vibrational
energy capturing coherent proto-objects. These encapsulate the information about the search task, and we show that good approximation results are possible. We have thus shown a whole process which, starting from three-dimensional gaze tracking experiments, extracts features that are relevant to predict saliency and introduces a novel energy-based model to indicate the salient regions in space. A drawback of the proposed method is the lack of motion features. We intend to address this aspect in future research; note that for an experimental method such as the one proposed here it is necessary to deal with the reconstruction of motion, which is still a hard problem.
Acknowledgments
The research has been funded by the EU-FP7 NIFTI Project, Contract No. 247870.
References
Ackerman, C., & Itti, L. (2005). Robot steering with spectral
image information. IEEE Transactions on Robotics, 21(2),
247–251.
Bahill, T., Bahill, K. A., Clark, M., & Stark, L. (1975). Closely spaced
saccades. Investigative Ophthalmology, 14(4), 317–321.
Bahill, T., & Stark, L. (1979). The trajectories of saccadic eye
movements. Scientific American, 240(1), 1–12.
Belardinelli, A., Pirri, F., & Carbone, A. (2007). Bottom-up gaze
shifts and fixations learning by imitation. IEEE Transactions
Systems, Man and Cybernetics B, 37, 256–271.
Butko, N.J., Zhang, L., Cottrell, G.W., & Movellan, J.R. (2008).
Visual saliency model for robot cameras. In ICRA (pp. 2398–
2403).
Carmi, R., & Itti, L. (2006). Visual causes versus correlates of
attentional selection in dynamic scenes. Vision Research,
46(26), 4333–4345.
Cerf, M., Harel, J., Einhäuser, W., & Koch, C. (2007). Predicting
human gaze using low-level saliency combined with face
detection. In NIPS.
Cristianini, N., & Shawe-Taylor, J. (2004). Kernel methods for
pattern analysis. CUP.
D’Errico, J. (2013). Surface fitting using gridfit. Tech. rep., Matlab
File Exchange. <http://www.mathworks.com/matlabcentral/
fileexchange/authors/679>.
Duncan, J., & Humphreys, G. W. (1989). Visual search and stimulus
similarity. Psychological Review, 96(3), 433–458.
Faugeras, O., Luong, Q., & Papadopoulou, T. (2001). The geometry
of multiple images. MIT Press.
Fiore, P. (2002). Efficient linear solution of exterior orientation.
IEEE PAMI, 23(2), 140–148.
Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: A
paradigm for model fitting with applications to image analysis
and automated cartography. Communications of the ACM, 24(6),
381–395.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and
feature selection. JMLR, 3(7-8), 1157–1182.
Hartley, R., & Zisserman, A. (2000). Multiple view geometry in
computer vision. CUP.
Hegland, M., Roberts, S., & Altas, I. (1997). Finite element thin
plate splines for surface fitting. Tech. Rep. TR-CS-97-20,
Department of Computer Science, Faculty of Engineering and
Information Technology, The Australian National University
Canberra, ACT 0200.
27
Hügli, H., Jost, T., & Ouerhani, N. (2005). Model performance for
visual attention in real 3d color scenes. In IWINAC-2005 (pp.
469–478).
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based
visual attention for rapid scene analysis. IEEE PAMI, 20(11),
1254–1259.
Julesz, B. (1986). Texton gradients: The texton theory revisited.
Biological Cybernetics, 54, 245–251.
Koch, C., & Ullman, S. (1985). Shifts in selective visual-attention:
Towards the underlying neural circuitry. Human Neurobiology,
4(4), 219–227.
Kowler, E. (2011). Eye movements: The past 25 years. Vision
Research, 1–27.
Lourakis, M., & Argyros, A. (2009). Sba: A software package for
generic sparse bundle adjustment. ACM Transactions on Mathematical Software (TOMS), 36(1), 2.
Mahadevan, V., & Vasconcelos, N. (2010). Spatiotemporal saliency
in dynamic scenes. IEEE PAMI, 32, 171–177.
Mancas, M., Pirri, F., & Pizzoli, M. (2011). From saliency to eye
gaze: Embodied visual selection for a pan-tilt-based robotic
head. In ISVC (1) (pp. 135–146).
Mangasarian, O. L. (2005). Exact 1-norm support vector machines
via unconstrained convex differentiable minimization. Tech.
rep., Data Mining Institute TR 05-03.
Mangasarian, O. L., & Wild, E. W. (2007). Feature selection for
nonlinear kernel support vector machines. In IEEE-ICDM workshops (pp. 231–236).
Marr, D., & Nishihara, H. K. (1978). Representation and recognition
of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B. Biological
Sciences, 200, 269–294.
Minato, T., & Asada, M. (2001). Image feature generation by visiomotor map learning towards selective attention. In Proc. of IROS
2001 (pp. 1422–1427).
Neisser, U., & Becklen, R. (1975). Selective looking: Attending to
visually specified events. Cognitive Psychology, 7(4), 480–494.
Orabona, F., Metta, G., & Sandini, G. (2008). A proto-object based
visual attention model. In L. Paletta & E. Rome (Eds.), Attention
in Cognitive Systems. Theories and Systems from an Interdisciplinary Viewpoint (pp. 198–215). Berlin, Heidelberg: Springer-Verlag.
Pichon, E., & Itti, L. (2002). Real-time high-performance attention
focusing for outdoors mobile beobots. In Proceedings of AAAI
spring symposium (AAAI-TR-SS-02-04) (p. 63).
Pirri, F., Pizzoli, M., & Rudi, A. (2011). A general method for the point of regard estimation in 3D space. In CVPR (pp. 921–928).
Pizzoli, M., Rigato, D., Shabani, R., & Pirri, F. (2011). 3d saliency
maps. In Computer Vision and Pattern Recognition Workshops
(CVPRW), 2011 IEEE Computer Society Conference on (pp. 9–14).
Rensink, R. (2000). The dynamic representation of scenes. Visual
Cognition, 7, 17–42.
Rensink, R. (2002). Change detection. Annual Review of Psychology, 53, 245–277.
Rensink, R., O’Regan, J. K., & Clark, J. (1997). To see or not to see:
The need for attention to perceive changes in scenes. Psychological Science, 8, 368–373.
Rensink, R., O’Regan, J., & Clark, J. (2000). On the failure to detect
changes in scenes across brief interruptions. Visual Cognition, 7,
127–145.
Sala, P. L., Sim, R., Shokoufandeh, A., & Dickinson, S. J. (2006).
Landmark selection for vision-based navigation. IEEE Transactions on Robotics, 22(2), 334–349.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J. C., Smola, A. J., &
Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443–1471.
Serences, J. T., & Yantis, S. (2006). Selective visual attention and
perceptual coherence. Trends in Cognitive Sciences, 10(1),
38–45.
Smola, A., Bartlett, P., Schölkopf, B., & Schuurmans, D. (2000).
Advances in large margin classifiers. Cambridge, MA: MIT Press.
Torr, P. H. S. (1998). Geometric motion segmentation and model
selection. Philosophical Transactions of the Royal Society of
London, Series A, 356(1740), 1321–1340.
Treisman, A. (1985). Preattentive processing in vision. Computer Vision, Graphics, and Image Processing, 31(2), 156–177.
Treisman, A., & Gelade, G. (1980). A feature-integration theory of
attention. Cognitive Psychology, 12, 97–136.
Triggs, B., McLauchlan, P.F., Hartley, R., & Fitzgibbon, A. (2000).
Bundle adjustment – A modern synthesis. In ICCV.
Tsotsos, J., Culhane, S., Wai, W., Lai, Y., Davis, N., & Nuflo, F.
(1995). Modeling visual attention via selective tuning. Artificial
Intelligence, 78, 507–547.
Vapnik, V. N. (1995). The nature of statistical learning theory. Springer.
Walther, D., & Koch, C. (2006). Modeling attention to salient proto-objects. Neural Networks, 19(9), 1395–1407.
Watson, A. B., & Ahumada, A. J. (2005). A standard model for
foveal detection of spatial contrast. Journal of Vision, 5(9),
717–740.
Wolfe, J. M. (1992). The parallel guidance of visual attention.
Current Directions in Psychological Science, 4, 124–128.
Wolfe, J. M. (1994). Guided search 2.0: A revised model of visual search. Psychonomic Bulletin and Review, 2, 202–238.
Wolfe, J. M., Friedman-Hill, S. R., Stewart, M. L., & O’Connell, K.
M. (1992). The role of categorization in visual search for
orientation. Journal of Experimental Psychology: Human Perception and Performance, 18, 34–49.
Zhou, W., Chen, X., & Enderle, J. (2009). An updated time-optimal
3rd-order linear saccadic eye plant model. International Journal
of Neural Systems, 19(5).
Zhu, J., Rosset, S., Hastie, T., & Tibshirani, R. (2003). 1-norm
support vector machines. Neural Information Processing Systems, 16.