Transactions on Systems, Man, and Cybernetics, Part B

A Bayesian Framework for Active Artificial Perception

Final Preprint Version

Citable as: J. F. Ferreira, J. Lobo, P. Bessière, M. Castelo-Branco, and J. Dias, "A Bayesian Framework for Active Artificial Perception," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. PP, no. 99 (Early Access Article), pp. 1–13, 2012.

A Bayesian Framework for Active Artificial Perception

João Filipe Ferreira, Member, IEEE, Jorge Lobo, Member, IEEE, Pierre Bessière, Miguel Castelo-Branco, and Jorge Dias, Senior Member, IEEE

Abstract—In this text, we present a Bayesian framework for active multimodal perception of 3D structure and motion. The design of this framework finds its inspiration in the role of the dorsal perceptual pathway of the human brain. Its composing models build upon a common egocentric spatial configuration that is naturally fitting for the integration of readings from multiple sensors using a Bayesian approach. In the process, we contribute efficient and robust probabilistic solutions for cyclopean geometry-based stereovision and for auditory perception based only on binaural cues, modelled using a consistent formalisation that allows their hierarchical use as building blocks for the multimodal sensor fusion framework. We explicitly or implicitly address the most important challenges of sensor fusion using this framework, for vision, audition and vestibular sensing. Moreover, interaction and navigation require maximal awareness of spatial surroundings, which in turn is obtained through active attentional and behavioural exploration of the environment. The computational models described in this text will support the construction of a simultaneously flexible and powerful robotic implementation of multimodal active perception to be used in real-world applications, such as human-machine interaction or mobile robot navigation.

Index Terms—Sensing and Perception, Computer Vision, Sensor Fusion, Biologically-Inspired Robots, Multisensory Exploration, Active Perception, Multimodal Perception, Bayesian Programming.

I. INTRODUCTION

Humans and robots alike have to deal with the unavoidable reality of sensory uncertainty. Consider the following scenario: a static or moving observer is presented with a non-static 3D scene containing several static and moving entities, possibly generating some kind of sound. How does this observer perceive the 3D location, motion trajectory and velocity of all entities in the scene, while taking into account the ambiguities and conflicts inherent to the perceptual process?

Author affiliations: J. F. Ferreira ([email protected]), J. Lobo and J. Dias are with the Institute of Systems and Robotics and the Faculty of Science and Technology, University of Coimbra, Coimbra, Portugal. J. Dias is also with the Khalifa University of Science, Technology and Research (KUSTAR), Abu Dhabi, UAE. P. Bessière is with the CNRS-Collège de France, Paris, France. M. Castelo-Branco is with the Biomedical Institute of Research on Light and Image, Faculty of Medicine, University of Coimbra, Coimbra, Portugal.
Within the human brain, both the dorsal and ventral visual systems process information about spatial location, but in very different ways: allocentric spatial information about how objects are laid out in the scene is computed by ventral stream mechanisms, while precise egocentric spatial information about the location of each object in a body-centred frame of reference is computed by the dorsal stream mechanisms, and also by the phylogenetically preceding superior colliculus (SC), both of which mediate the perceptual control of action [1]. On the other hand, several authors argue that recent findings strongly suggest that the brain codes complex patterns of sensory uncertainty in its internal representations and computations [2, 3]. Finally, direction and distance in egocentric representations are believed to be separately specified by the brain [4, 5]. Considering distance in particular, just-discriminable depth thresholds have usually been plotted as a function of the log of distance from the observer, by analogy with contrast sensitivity functions based on Weber's fraction [6].

These findings inspired the construction of a probabilistic framework that allows fast processing of multisensory-based inputs to build a perceptual map of space so as to promote immediate action on the environment (as in the dorsal stream and superior colliculus), effectively postponing data association such as object segmentation and recognition to higher-level stages of processing (as in the ventral stream) — this would be analogous to a tennis player being required to hit a ball regardless of perception of its texture properties. Our framework bears a spherical spatial configuration (i.e. encoding 3D distance and direction), and constitutes a short-term perceptual memory performing efficient, lossless compression through log-partitioning of depth.

Figure 1. View of the current version of the Integrated Multimodal Perception Experimental Platform (IMPEP). The active perception head mounting hardware and motors were designed by the Perception on Purpose (POP - EC project number FP6-IST-2004-027268) team of the ISR/FCT-UC, and the sensor systems were mounted at the Mobile Robotics Laboratory of the same institute, within the scope of the Bayesian Approach to Cognitive Systems project (BACS - EC project number FP6-IST-027140).

A. Contributions of this work

In this text, we intend to present an integrated account of our work, which has been partially documented in previous publications [7–13], together with unpublished results. We will start in section II by presenting a bio-inspired perceptual solution with a focus on Bayesian visuoauditory integration. This solution serves as a short-term spatial memory framework for active perception and also for sensory control of action, with no immediate interest in object perception. We will try to explicitly or implicitly address each of the challenges of sensor fusion as described by Ernst and Bülthoff [14] using the Bayesian Volumetric Map (BVM) framework for vision, audition and vestibular sensing. It is our belief that perceptual systems are unable to yield useful descriptions of their environment without resorting to a temporal series of sensory fusion processed on some kind of short-term memory such as the BVM. We mainly expect to contribute with a solution which:

• Deals inherently with perceptual uncertainty and ambiguity.

• Deals with the geometry of sensor fusion in a natural fashion.
• Allows for fast processing of perceptual inputs to build a spatial representation that promotes immediate action on the environment.

The Bayesian models for visuoauditory perception which form the backbone of the framework are presented next in section II. We propose to use proprioception (e.g. vestibular sensing) as ancillary information to promote visual and auditory sensing so as to satisfy the requirements for integration.

To support our research work and to provide a testbed for some of the possible applications of the BVM, an artificial multimodal perception system (IMPEP — Integrated Multimodal Perception Experimental Platform) has been constructed at the ISR/FCT-UC. The platform consists of a stereovision, binaural and inertial measuring unit (IMU) setup mounted on a motorised head, with gaze control capabilities for image stabilisation and perceptual attention purposes — see Fig. 1. It presents the same sensory capabilities as a human head, thus conforming to the biological conditions that originally inspired our work. We believe IMPEP has great potential for use in applications as diverse as active perception in social robots or even robotic navigation. We present a brief description of its implementation, its sensory processing modules and system calibration in section III.

As an illustration of the particular application of active perception, and also, more importantly, to test the performance of our solution, in section IV we will present an algorithm that implements an active exploration behaviour based on the entropy of the BVM framework, together with results of using this algorithm in real-time. Finally, in section V conclusions will be drawn and related ongoing work will be mentioned, and in section VI future work based on what is presented in this article will be discussed.

B. Related work

Fusing computer vision, binaural sensing and vestibular sensing using a unified framework has, to the authors' knowledge, never been addressed. Moreover, as far as is known by the authors, the application of the well-known probabilistic inference grid model [15] to an egocentric, log-spherical spatial configuration as a solution to problems remotely similar to the ones presented in this text is also unprecedented. In our specific application domain, where a 3D metric and egocentric representation is required, common inference grid configurations which assume regularly partitioned Euclidean space to build the cell lattice are not appropriate:

1) Most sensors, vision and audition being notable examples, are based on a process of energy projection onto transducers, ideally yielding a pencil of projection lines that converge at the egocentric reference origin; consequently, they are naturally disposed to be directly modelled in polar or spherical coordinates. The only example of the use of a spherical configuration known to the authors was presented by Zapata et al. [16].

2) Implementation-wise, regular partitioning in Euclidean space, while still manageable in 2D, renders temporal performance impractical in 3D when fully updating a panoramic grid (i.e. performing both prediction and estimation for all cells on the grid) with satisfactory size and resolution (typically grids with much more than a million cells). There are, in fact, two solutions for this problem: either non-regular partitioning of space (e.g. octree compression), or regular partitioning of log-distance space.
Interestingly enough, the latter also accounts for the just-discriminable depth thresholds found in human visual perception — an example of a Euclidean solution following a similar rationale was presented by Dankers, Barnes, and Zelinsky [17].

An important part of recent work in active vision, contrary to our solution, either uses an explicit representation for objects to implement active perception or multisensory fusion (e.g. [18, 19]) or relies on object detection/recognition to establish targets for active object search (e.g. [20–22]). On the other hand, several solutions for applications similar to ours (e.g. [23–25]) avoid explicit object representation by resorting to a bottom-up saliency approach such as defined by Itti, Koch, and Niebur [26]. The underlying rationale is that postponing data association processing allows for the implementation of fast automatic mechanisms of active exploration that resemble what is believed to occur within the human brain, as explained in the introductory section of this paper.

Our solution implements active visuoauditory perception using an egocentric spatial representation, adding to it vestibular sensing/proprioception so as to allow for efficient sensor fusion given a rotational egomotion. Moreover, the log-partitioning of the spatial representation intrinsically deals with just-discriminable depth thresholds, while avoiding the use of complex error dispersion models. Complementing the possibility of extending our system to include sensory saliency-fuelled behaviours [13], our solution differs from purely saliency-based approaches in that it inherently implements an active exploration behaviour based on the entropy of the occupancy grid (inspired by research work such as [27]), so as to promote gaze shifts to regions of high uncertainty. In summary, our framework elicits an automatic behaviour of fixating interesting (i.e. salient) and unexplored regions of space, without the need to resort to active object search, as in [20–22].

II. BAYESIAN MODELS FOR MULTIMODAL PERCEPTION

A. Background and definitions

Taking into account the goals stated in the introductory section, the framework for spatial representation that will be presented in the rest of this section satisfies the following criteria:

• It is egocentric and metric in nature.

• It is an inference grid, allowing for a probabilistic representation of dynamical spatial occupation of the environment. It therefore encompasses positioning, structure and motion of objects, avoiding any need for assumptions on the nature of those objects, or, in other words, for data association.

Given these requirements, we chose a log-spherical coordinate system spatial configuration (see Fig. 2) for the occupancy grid that we have developed and will refer to as the BVM, thus promoting an egocentric trait in agreement with biological perception.

Figure 2. Egocentric, log-spherical configuration of the Bayesian Volumetric Map.

The BVM is primarily defined by its range of azimuth and elevation angles, and by its maximum reach in distance ρ_Max, which in turn determines its log-distance base through b = a^(log_a(ρ_Max − ρ_Min)/N), ∀a ∈ R, where ρ_Min defines the egocentric gap, for a given number of partitions N, chosen according to application requirements. The BVM space is therefore effectively defined by

Y ≡ ]log_b ρ_Min; log_b ρ_Max] × ]θ_Min; θ_Max] × ]φ_Min; φ_Max]   (1)

In practice, the BVM is parametrised so as to cover the full angular range for azimuth and elevation.
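To make this parametrisation concrete, the following minimal sketch (in Python; function names such as bvm_grid and cell_index are illustrative, and this is not the authors' implementation) builds the radial partition of (1) and maps an egocentric reading (ρ, θ, φ) to a cell index. It assumes that the log-distance axis is divided into N equal bins between log_b ρ_Min and log_b ρ_Max; such a division yields geometrically spaced radial boundaries regardless of the particular base b.

```python
# Minimal sketch of a BVM-style log-spherical partition (not the authors' code).
import numpy as np

def bvm_grid(rho_min, rho_max, n_rho, dtheta_deg, dphi_deg):
    """Radial bin edges (geometrically spaced) plus angular steps in radians."""
    # N equal bins in log-distance <=> geometric spacing in metric distance.
    edges = rho_min * (rho_max / rho_min) ** (np.arange(n_rho + 1) / n_rho)
    return edges, np.deg2rad(dtheta_deg), np.deg2rad(dphi_deg)

def cell_index(rho, theta, phi, edges, dtheta, dphi):
    """Map an egocentric reading (rho, theta, phi) to (k_rho, k_theta, k_phi)."""
    if not (edges[0] < rho <= edges[-1]):
        return None                                  # inside the egocentric gap or beyond rho_Max
    k_rho = int(np.searchsorted(edges, rho)) - 1     # half-open ]edge_k, edge_k+1] cells
    k_theta = int((theta + np.pi) // dtheta)         # azimuth assumed in [-pi, pi[
    k_phi = int((phi + np.pi / 2) // dphi)           # elevation assumed in [-pi/2, pi/2[
    return k_rho, k_theta, k_phi

# Parameters used in the experiments of Section IV-B: N = 10, 1 m to 2.5 m reach.
edges, dth, dph = bvm_grid(1000.0, 2500.0, 10, 1.0, 2.0)
print(np.round(edges))                               # radial edges, finer near the observer
print(cell_index(1200.0, np.deg2rad(-41.0), 0.0, edges, dth, dph))
```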
This configuration virtually delimits a horopter for sensor fusion. Each BVM cell is defined by two limiting log-distances, log_b ρ_min and log_b ρ_max, two limiting azimuth angles, θ_min and θ_max, and two limiting elevation angles, φ_min and φ_max, through:

Y ⊃ C ≡ ]log_b ρ_min; log_b ρ_max] × ]θ_min; θ_max] × ]φ_min; φ_max]   (2)

Figure 3. Bayesian Program for the estimation of the Bayesian Volumetric Map current cell state (on the left), and corresponding Bayesian filter diagram (on the right – it considers only a single measurement Z for simpler reading, with no loss of generality). The respective filtering equation is given by (3) and (4), using two different formulations. The Bayesian Program is as follows:

Relevant variables:
  C ∈ Y: indexes a cell on the BVM;
  A_C: identifier of the antecedents of cell C (stored as with C);
  Z_1, ..., Z_S ∈ {"No Detection"} ∪ Z: independent measurements taken by S sensors;
  O_C, O_C^{-1}: binary values describing the occupancy of cell C, for current and preceding instants, respectively;
  V_C: velocity of cell C, discretised into N+1 possible cases ∈ V ≡ {v_0, ..., v_N}.
Decomposition:
  P(C A_C O_C O_C^{-1} V_C Z_1 ... Z_S) = P(A_C) P(V_C|A_C) P(C|V_C A_C) P(O_C^{-1}|A_C) P(O_C|O_C^{-1}) ∏_{i=1..S} P(Z_i|V_C O_C C)
Parametric forms:
  P(A_C): uniform;
  P(V_C|A_C): histogram;
  P(C|V_C A_C): Dirac, 1 iff c_{log_b ρ} = a_{log_b ρ} + v_{log_b ρ} δt, c_θ = a_θ + v_θ δt and c_φ = a_φ + v_φ δt (constant velocity assumption);
  P(O_C^{-1}|A_C): probability of the preceding state of occupancy given the set of antecedents;
  P(O_C|O_C^{-1}): defined through the transition matrix T = [[1−ε, ε], [ε, 1−ε]], where ε represents the probability of non-constant velocity;
  P(Z_i|V_C O_C C): direct measurement model for each sensor i, given by the respective sub-BP.
Identification:
  None.
Questions:
  P(O_c V_c|z_1 ... z_S c) → P(O_c|z_1 ... z_S c) and P(V_c|z_1 ... z_S c)

The filtering equation, in its joint-distribution (estimation) formulation, reads

P(V_C O_C Z_1 ... Z_S C) = [∏_{i=1..S} P(Z_i|V_C O_C C)] · [Σ_{A_C, O_C^{-1}} P(A_C) P(V_C|A_C) P(C|V_C A_C) P(O_C^{-1}|A_C) P(O_C|O_C^{-1})]   (3)

where the product over sensors is the observation term and the sum over antecedents is the prediction term; the corresponding normalised formulation is

P(V_C O_C|Z_1 ... Z_S C) = [∏_{i=1..S} P(Z_i|V_C O_C C)] · [Σ_{A_C, O_C^{-1}} P(A_C) P(V_C|A_C) P(C|V_C A_C) P(O_C^{-1}|A_C) P(O_C|O_C^{-1})] / [Σ_{A_C, O_C^{-1}, O_C, V_C} P(A_C) P(V_C|A_C) P(C|V_C A_C) P(O_C^{-1}|A_C) P(O_C|O_C^{-1}) ∏_{i=1..S} P(Z_i|V_C O_C C)]   (4)

Constant values for the log-distance base b, and angular ranges Δθ = θ_max − θ_min and Δφ = φ_max − φ_min, chosen according to application resolution requirements, ensure BVM grid regularity. Finally, each BVM cell is formally indexed by the coordinates of its far corner, defined as C = (log_b ρ_max, θ_max, φ_max).

The main hypothesis of inference grids is that the state of each cell is considered independent of the states of the remaining cells on the grid. This assumption effectively breaks down the complexity of state estimation.
As a matter of fact, complete estimation of the state of the grid amounts to applying the cell state estimation process N times, N being the total number of cells that compose the grid. To compute the probability distributions for the current states of each cell, the Bayesian Program (BP) formalism, consolidated by Bessière et al. [28], will be used throughout this text.

B. Multimodal Sensor Fusion Using Log-Spherical Bayesian Volumetric Maps

1) Using Bayesian filtering for visuoauditory integration: The independency hypothesis postulated earlier allows for the independent processing of each cell, and hence the Bayesian Program should be able to perform the evaluation of the state of a cell knowing an observation of a particular sensor. The Bayesian Program presented in Fig. 3 is based on the solution presented by Tay et al. [29], the Bayesian Occupancy Filter (BOF), adapted so as to conform to the BVM egocentric, three-dimensional and log-spherical nature. In the spirit of Bayesian programming, we start by stating and defining the relevant variables:

• C ≡ (log_b ρ_max, θ_max, φ_max) ∈ Y is a random variable denoting a log-spherical index which simultaneously localises and identifies the reference BVM cell, as defined in section II-A. It is used as a subscript of most of the random variables defined in this text, so as to explicitly state their relation to cells in the grid.

• A_C ≡ (log_b ρ_max, θ_max, φ_max) ∈ A_C ⊂ Y is a random variable that denotes the hypothetical antecedent cell of reference cell C. The set of allowed antecedents A_C of reference cell C is composed of the N+1 cells on the BVM grid from which an object might have moved, within the time interval going from the previous inference step t−1 to the present time t. The number of possible antecedents of any cell is arbitrary; in the case of the present work, we considered N+1 = 7 antecedents: two immediate neighbours in distance, two immediate neighbours in azimuth, two immediate neighbours in elevation, and cell C itself (which would represent the hypothesis of an object occupying the reference cell remaining still).

• O_C is a binary variable denoting the occupancy [O_C = 1] or emptiness [O_C = 0] of cell C; O_C^{-1} denotes the occupancy state of the effective antecedent of C, A_C, in the previous inference step, which will propagate to the reference cell as the object occupying a specific A_C is moved to C.

• V_C denotes the dynamics of the occupancy of cell C as a vector signalling local motion to this cell from its antecedents, discretised into N+1 possible cases for velocities ∈ V ≡ {v_0, ..., v_N}, with v_0 signalling that the most probable antecedent of A_C is C, i.e. no motion between two consecutive time instants.

• Z_1, ..., Z_S ∈ {"No Detection"} ∪ Z are independent measurements taken by S sensors.

The estimation of the joint state of occupancy and velocity of a cell is answered through Bayesian inference on the decomposition equation given in Fig. 3. This inference effectively leads to the Bayesian filtering formulation as used in the BOF grids. Using the decomposition equation given in Fig. 3, we also have a more familiar formulation of the Bayesian filter in (3), given that ∏_{i=1..S} P(Z_i|V_C O_C C) depends neither on A_C nor on O_C^{-1}. Applying marginalisation and Bayes' rule, we obtain the answer to the Bayesian Program question, the global filtering equation (4).
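As an aside, the following short sketch (Python; illustrative only, not the authors' code) enumerates the N+1 = 7 candidate antecedents just described for a reference cell index. Azimuth is assumed to wrap around since the map is panoramic, while radial and elevation neighbours are simply dropped at the grid borders.

```python
# Illustrative enumeration of the 7-cell antecedent set A_C (not the authors' code).
def antecedents(c, n_rho, n_theta, n_phi):
    """Candidate antecedents of cell c = (k_rho, k_theta, k_phi): itself plus
    its immediate neighbours along each log-spherical axis."""
    k_rho, k_theta, k_phi = c
    cands = [c]                                                 # v0: object stayed put
    for d in (-1, +1):
        if 0 <= k_rho + d < n_rho:                              # radial neighbours (no wrap)
            cands.append((k_rho + d, k_theta, k_phi))
        cands.append((k_rho, (k_theta + d) % n_theta, k_phi))   # azimuth wraps (panoramic map)
        if 0 <= k_phi + d < n_phi:                              # elevation neighbours (no wrap)
            cands.append((k_rho, k_theta, k_phi + d))
    return cands

print(antecedents((5, 0, 45), n_rho=10, n_theta=360, n_phi=90))  # 7 candidates here
```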
In practice, the process of solving the global filtering equation can be separated into three stages. The first stage consists of the prediction of the probabilities of each occupancy and velocity state for cell [C = c], ∀k ∈ N_0, 0 ≤ k ≤ N:

α_c([O_C = 1], [V_C = v_k]) = Σ_{A_C, O_C^{-1}} P(A_C) P(v_k|A_C) P(C|v_k A_C) P(O_C^{-1}|A_C) P(o_c|O_C^{-1})   (5a)

α_c([O_C = 0], [V_C = v_k]) = Σ_{A_C, O_C^{-1}} P(A_C) P(v_k|A_C) P(C|v_k A_C) P(O_C^{-1}|A_C) P(¬o_c|O_C^{-1})   (5b)

with o_c and ¬o_c used as shorthand notations for [O_C = 1] and [O_C = 0], respectively. The prediction step thus consists of performing the computations represented by (5) for each cell, essentially by taking into account the velocity probability P([V_C = v_k]|A_C) and the occupation probability of the set of antecedent cells represented by P(O_C^{-1}|A_C), therefore propagating occupancy states as a function of the velocities of each cell.

The second stage of the BVM Bayesian filter estimation process is multiplying the results given by the previous step with the observation from the sensor model, yielding, ∀k ∈ N_0, 0 ≤ k ≤ N:

β_c([O_C = 1], [V_C = v_k]) = [∏_{i=1..S} P(Z_i|v_k [O_C = 1] C)] α_c([O_C = 1], v_k)   (6a)

β_c([O_C = 0], [V_C = v_k]) = [∏_{i=1..S} P(Z_i|v_k [O_C = 0] C)] α_c([O_C = 0], v_k)   (6b)

Performing these computations for each cell [C = c] gives a non-normalised estimate of velocity and occupancy for each cell. The marginalisation over occupancy values gives the likelihood of each velocity, ∀k ∈ N_0, 0 ≤ k ≤ N:

l_c(v_k) = β_c([O_C = 1], [V_C = v_k]) + β_c([O_C = 0], [V_C = v_k])   (7)

The final normalised estimate for the joint state of occupancy and velocity for cell [C = c] is given by

P(O_C [V_C = v_k]|Z_1 ... Z_S C) = β_c(O_C, [V_C = v_k]) / Σ_{V_C} l_c(V_C)   (8)

The related remaining questions of the BP for the BVM cell states, the estimation of the probability of occupancy and the estimation of the probability of a given velocity, are given through marginalisation of the free variable by

P(O_C|Z_1 ... Z_S C) = Σ_{V_C} P(V_C O_C|Z_1 ... Z_S C)   (9a)

P(V_C|Z_1 ... Z_S C) = Σ_{O_C} P(V_C O_C|Z_1 ... Z_S C)   (9b)

In summary, prediction propagates cell occupancy probabilities for each velocity and cell in the grid — P(O_C V_C|C). During estimation, P(O_C V_C|C) is updated by taking into account the observations yielded by the sensors, ∏_{i=1..S} P(Z_i|V_C O_C C), to obtain the final state estimate P(O_C V_C|Z_1 ... Z_S C). The result from the Bayesian filter estimation will then be used for the prediction step in the next iteration.

2) Using the BVM for sensory combination of vision and audition with vestibular sensing: Consider the simplest case, where the sensors may only rotate around the egocentric axis and the whole perceptual system is not allowed to perform any translation. In this case, the vestibular sensor models, described ahead, will yield measurements of angular velocity and position. These can then be easily used to manipulate the BVM, which is, by definition, in spherical coordinates. Therefore, to compensate for this kind of egomotion, instead of rotating the whole map, the most effective solution is to perform the equivalent index shift. This process is described by redefining C: C ∈ Y indexes a cell in the BVM by its far corner, defined as C = (log_b ρ_max, θ_max + θ_inertial, φ_max + φ_inertial). This process relies on the uncontroversial assumption that inertial precision on angular measurements is greater than the chosen resolution parameters for the BVM.
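To make the three-stage update of (5)–(8) concrete, the sketch below (Python; not the authors' CUDA implementation) performs the prediction, estimation and normalisation stages for a single cell. It assumes that the Dirac dynamic model P(C|V_C A_C) has already selected, for each velocity hypothesis v_k, the corresponding antecedent cell, whose previous occupancy probability is passed in; all argument names are illustrative.

```python
# Per-cell sketch of the prediction/estimation/normalisation stages (5)-(8).
import numpy as np

def bvm_cell_update(prior_occ, vel_hist, lik_occ, lik_emp, eps=0.05):
    """
    prior_occ[k]: P(O^-1 = 1) of the antecedent selected by velocity v_k.
    vel_hist[k]:  P(V_C = v_k | A_C), the velocity histogram.
    lik_occ[k], lik_emp[k]: prod_i P(z_i | v_k, [O_C = 1/0], C) from the sensor sub-BPs.
    eps: probability of non-constant velocity (transition matrix T of Fig. 3).
    Returns the posterior P(O_C, V_C | z_1..z_S, c) as a (2, N+1) array [empty; occupied].
    """
    prior_occ, vel_hist = np.asarray(prior_occ), np.asarray(vel_hist)
    # Prediction (5): propagate occupancy through T = [[1-eps, eps], [eps, 1-eps]].
    alpha_occ = vel_hist * ((1 - eps) * prior_occ + eps * (1 - prior_occ))
    alpha_emp = vel_hist * (eps * prior_occ + (1 - eps) * (1 - prior_occ))
    # Estimation (6): multiply by the observation term from the sensor models.
    beta = np.vstack([np.asarray(lik_emp) * alpha_emp,
                      np.asarray(lik_occ) * alpha_occ])
    # Marginalisation over occupancy (7) and normalisation (8).
    return beta / beta.sum()

# Toy call: 7 velocity hypotheses and a sensor that strongly favours occupancy.
post = bvm_cell_update(prior_occ=[0.5] * 7, vel_hist=[1.0 / 7] * 7,
                       lik_occ=[0.9] * 7, lik_emp=[0.1] * 7)
print(post[1].sum())        # P(O_C = 1 | z, c), cf. (9a)
print(post.sum(axis=0))     # P(V_C | z, c), cf. (9b)
```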
3) Dealing with sensory synchronisation: The BVM model presented earlier assumes that the state of a cell C, given by (O_C, V_C), and the observation by any sensor i, given by Z_i, correspond to the same time instant t. In accordance with the wide multisensory integration temporal window theory for human perception reviewed in [30], the BVM may be used safely to integrate auditory and vision measurements as soon as they become available; local motion estimation using the BVM enforces a periodical state update with constant rate to ensure temporal consistency. Consequently, the modality with the highest measurement rate is forced to set the update pace (i.e. by means of measurement buffers) in order to satisfy the constant update requirement. The velocity estimates for the local motion states of the BVM are thus a function of this update rate. Promotion through vestibular sensing is also perfectly feasible, since inertial readings are available at a much faster rate than visuoauditory perception.

Figure 4. Cyclopean geometry for stereovision — b stands for the baseline and f is the focal length. The use of cyclopean geometry (pictured on the left for an assumed frontoparallel configuration) allows direct use of the egocentric reference frame for depth maps taken from the disparity maps yielded by the stereovision system (of which an example is shown on the right).

C. Bayesian sensor models

Next, the sensor models that are used as observations for the Bayesian filter of the BVM will be presented. C as a random variable and P(C), although redundant in this context, will be used in the following models to maintain consistency with the Bayesian filter formulation and also with cited work.

1) Vision sensor model: We have decided to model these sensors in terms of their contribution to the estimation of cell occupancy in a similar fashion to the solution proposed by Yguel, Aycard, and Laugier [31]. This solution incorporates a complete formal definition of the physical phenomenon of occlusion (i.e., in the case of visual occlusion, light reflecting from surfaces occluded by opaque objects does not reach the vision sensor's photoreceptors). Our motivations suggest a tentative data structure analogous to neuronal population activity patterns to represent uncertainty in the form of probability distributions [32]. Thus, a spatially organised 2D grid may have each cell (corresponding to a virtual photoreceptor in the cyclopean view — see Fig. 4) associated to a "population code" extending to additional dimensions, yielding a set of probability values encoding an N-dimensional probability distribution function or pdf. Given the first occupied cell [C = k] on the line-of-sight, the likelihood functions yielded by the population code data structure can be finally formalised as

P_k(Z) = L_k(Z, μ_ρ(k), σ_ρ(k)),   with   μ_ρ(k) = ρ̂(δ̂)   and   σ_ρ(k) = (1/λ) σ_min   (10)

a discrete probability distribution with mean μ_ρ and standard deviation σ_ρ, both a function of the cell index k, which directly relates to the log-distance from the observer ρ. Values δ̂ and λ represent the disparity reading and its corresponding confidence rating, respectively; σ_min and the expression for ρ̂(δ̂) are taken from calibration, the former as the estimate of the smallest error in depth yielded by the stereovision system and the latter from the intrinsic camera geometry (see camera calibration description later in this text).
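For illustration, the sketch below (Python with NumPy/SciPy; not the authors' implementation) builds the soft evidence of (10) for a single cyclopean pixel: a normal likelihood over the radial cells of its line of sight, centred on the calibrated depth estimate and broadened as the confidence rating drops. The closed form ρ̂ = f·b/δ̂ used here is the standard frontoparallel stereo relation and merely stands in for the calibrated expression; edges plays the role of the radial cell boundaries.

```python
# Sketch of the per-pixel depth soft evidence of (10) (not the authors' code).
import numpy as np
from scipy.stats import norm

def depth_likelihood(disp, conf, edges, f, b, sigma_min):
    """Likelihood mass L_k over the radial cells of one cyclopean line of sight."""
    rho_hat = f * b / disp                   # stand-in for the calibrated rho_hat(delta)
    sigma = sigma_min / max(conf, 1e-6)      # sigma_rho = sigma_min / lambda, as in (10)
    cdf = norm.cdf(edges, loc=rho_hat, scale=sigma)
    mass = np.diff(cdf)                      # probability mass falling inside each cell
    total = mass.sum()
    return mass / total if total > 0 else np.full(len(edges) - 1, 1.0 / (len(edges) - 1))

# f = 1000 px, b = 70 mm, 40 px of disparity => depth estimate of 1750 mm.
edges = 1000.0 * 2.5 ** (np.arange(11) / 10)     # radial edges between 1 m and 2.5 m
print(np.round(depth_likelihood(40.0, 0.8, edges, f=1000.0, b=70.0, sigma_min=30.0), 3))
```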
The likelihood function constitutes, in fact, the elementary sensor model as defined above for each vision sensor, and formally represents soft evidence, or "Jeffrey's evidence" in reference to Jeffrey's rule [33], concerning the relation between vision sensor measurements denoted generically by Z and the corresponding readings δ̂ and λ, described by the calibrated expected value ρ̂(δ̂) and standard deviation σ_ρ(λ) for each sensor. Equation (10) only partially defines the resulting probability distribution by specifying the random variable over which it is defined and an expected value plus a standard deviation/variance — a full definition requires the choice of a type of distribution that best fits the noisy pdfs taken from the population code data structure. The traditional choice, mainly due to the central limit theorem, favours normal distributions N(Z, μ_ρ(k), σ_ρ(k)). Considering what happens in the mammalian brain, this choice appears to be naturally justified — biological population codes often yield bell-shaped distributions around a preferred reading [34, 35]. For more details on our adaptation of such a distribution, please refer to [7]. To correctly formalise the Bayesian inference process, a formal auxiliary definition with respective properties is needed — for details, please refer to [7]. The Bayesian Program that summarises this model is presented in Fig. 5.

Figure 5. Bayesian Program for vision sensor model of occupancy:

Relevant variables:
  C: cell identifier, stored as a 3-tuple of cell coordinates (log_b ρ_C, θ, φ);
  Z ∈ {"No Detection"} ∪ Z_VisDepth: sensor depth measurement along line-of-sight (θ, φ);
  O_C: binary value describing the occupancy of cell C;
  G_C ∈ G_C ≡ O^{N−1}: state of all cells in the line-of-sight except for C.
Decomposition:
  P(Z C O_C G_C) = P(C) P(O_C|C) · P(G_C|O_C C) P(Z|G_C O_C C)
  (the last two factors give P(Z|O_C C) through summation over G_C).
Parametric forms:
  P(C): uniform;
  P(O_C|C): uniform or prior estimate;
  P(G_C|O_C C): unknown, apart from dependency on the number of occupied cells;
  P(Z|G_C O_C C): probability of a measurement by the sensor, knowing the first occupied cell is [C = k] ≡ elementary sensor model P_k(Z).
Identification:
  Calibration for P_k(Z) ⇒ P(Z|G_C O_C C).
Question:
  P(Z|o_c c)
2) Audition sensor model: Current trends in robotic implementations of sound-source localisation rely on microphone arrays with more than a couple of sensors, either by resorting to steerable beamformers, high-resolution spectral estimation, time difference of arrival (TDOA) information, or fusion methods (i.e., the integration of estimates from several distributed arrays). Generically, it is found that increasing the number of microphones also increases estimation accuracy. In fact, there is theoretical (see the Cramer-Rao bound analyses presented by Yang and Scheuing [36] for TDOA and Chen et al. [37] for beamforming) and practical (see Loesch et al. [38]) evidence supporting this notion. Conversely, it is also generally accepted that the computational burden of sound-source localisation increases with the number of sensors involved; in fact, this provides the support for the use of fusion methods, and is also speculated to be one of the reasons why humans, like many animals, have only two ears [39].

The direct audition sensor model used in this work, first presented in [8, 10], relies on binaural cues alone to fully localise a sound-source in space. The reason for its use resides in our biological inspiration mentioned in section I-A and our desire to use the BVM framework in future experiments where comparisons to human sensory systems are to be made in "fair terms" (see future work referred to in [13]); however, the inclusion within this framework of any alternative model supporting the use of more microphones to increase estimation accuracy would be perfectly acceptable. The model is formulated as the first question of the Bayesian Program in Fig. 6, where all relevant variables and distributions and the decomposition of the corresponding joint distribution, according to Bayes' rule and dependency assumptions, are defined.

Figure 6. Bayesian Program for binaural sensor model:

Relevant variables:
  C ≡ (log_b ρ_max, θ_max, φ_max) ∈ C: cell identifier;
  Z ∈ Z_BinauralMeasurements: sensor measurement vectors [τ, ΔL(f_c^1) ... ΔL(f_c^m)]; τ ≡ ITD and ΔL(f_c^k) ≡ ILD; f_c^k denotes each (k ∈ N, 1 ≤ k ≤ m) frequency band in m frequency channels;
  S_C: binary value describing the presence of a sound-source in cell C, [S_C = 1] if a sound-source is present at C, [S_C = 0] otherwise;
  O_C: binary value describing the occupancy of cell C, [O_C = 1] if cell C is occupied by an object, [O_C = 0] otherwise.
Decomposition:
  P(Z C S_C O_C) = P(C) P(O_C|C) · P(S_C|O_C C) P(τ|S_C O_C θ_max) ∏_{k=1..m} P(ΔL(f_c^k)|τ S_C O_C C)
  (the last three factors give P(Z|O_C C) through summation over S_C).
Parametric forms:
  P(C): uniform;
  P(O_C|C): uniform or prior estimate;
  P(S_C|O_C C) ≡ P(S_C|O_C): probability table (see below);
  P(Z|O_C C): probability of a measurement [τ, ΔL(f_c^1) ... ΔL(f_c^m)];
  P(τ|S_C O_C θ_max) ≡ P(τ|S_C θ_max): normal distribution, yielding the probability of a measurement τ by the sensor for cell C, given its azimuth θ_max and the presence or absence of a sound-source S_C in that cell;
  P(ΔL(f_c^k)|τ S_C O_C C) ≡ P(ΔL(f_c^k)|τ S_C C): normal distribution, yielding the probability of a measurement ΔL(f_c^k) by the sensor for cell C, given the presence or absence of a sound-source S_C in that cell.
Identification:
  Calibration for P(τ|S_C O_C θ_max).
  Calibration for P(ΔL(f_c^k)|τ S_C O_C C) ≈ P(ΔL(f_c^k)|S_C O_C C).
Questions:
  P(Z|o_c c)
  max, arg max_C P([S_C = 1]|z C)

Probability table for P(S_C|O_C C) ≡ P(S_C|O_C):

              [O_C = 0]   [O_C = 1]
  [S_C = 0]   1           .5
  [S_C = 1]   0           .5

The probability table was empirically chosen so as to reflect the indisputable fact that there is no sound-source in a cell that is not occupied (left column), and the safe assumption that when a cell is known to be occupied there is little way of telling whether it is in this condition due to a sound-source or not (right column).

The use of the auxiliary binary random variable S_C, which signals the presence or absence of a sound-source in cell C, and the corresponding family of probability distributions P(S_C|O_C C) ≡ P(S_C|O_C), promotes the assignment of probabilities of occupancy close to 1 for cells for which the binaural cue readings seem to indicate the presence of a sound-source, and close to .5 otherwise (i.e. the absence of a detected sound-source in a cell does not mean that the cell is empty). The second question corresponds to the estimation of the position of the cells most probably occupied by sound-sources, through the inversion of the direct model by Bayesian inference on the joint distribution decomposition equation.
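The decomposition above suggests the following hedged sketch (Python; not the authors' implementation) of the observation term P(Z|O_C C) for one cell: the calibrated ITD/ILD Gaussians are evaluated for the cell's azimuth and then marginalised over S_C using the probability table of Fig. 6. The stand-in calibration curves, the flat "no source" density and all names are assumptions for illustration; the real families are learned as described in the calibration section and in [10], and the ILD families are treated as independent of τ, in line with the calibration approximation above.

```python
# Hedged sketch of the binaural observation model of Fig. 6 (not the authors' code).
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def binaural_likelihood(tau, ilds, cell_azimuth, itd_model, ild_models, flat=1e-3):
    """P(Z | O_C, C) for one cell, with Z = [tau, DeltaL(f_1) .. DeltaL(f_m)].
    itd_model(az) -> (mu, sigma); ild_models[k](az) -> (mu, sigma): placeholder
    calibrated families; `flat` is an assumed broad density used when no source is present."""
    mu_t, sg_t = itd_model(cell_azimuth)
    lik_source = gauss(tau, mu_t, sg_t)
    for k, ild in enumerate(ilds):
        mu_k, sg_k = ild_models[k](cell_azimuth)
        lik_source *= gauss(ild, mu_k, sg_k)
    lik_no_source = flat ** (1 + len(ilds))                 # one flat factor per cue
    # Marginalisation over S_C with the probability table of Fig. 6.
    p_z_empty = 1.0 * lik_no_source                         # [O_C = 0]: surely no source
    p_z_occupied = 0.5 * lik_no_source + 0.5 * lik_source   # [O_C = 1]: source or not, 50/50
    return p_z_empty, p_z_occupied

# Toy use: one frequency channel and crude sinusoidal stand-ins for the calibration curves.
itd_model = lambda az: (0.6e-3 * np.sin(az), 5e-5)          # ITD in seconds
ild_models = [lambda az: (6.0 * np.sin(az), 1.5)]           # ILD in dB
print(binaural_likelihood(-4.0e-4, [-4.0], np.deg2rad(-41.0), itd_model, ild_models))
```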
Figure 7. Bayesian Program for processing of inertial data:

Relevant variables:
  ξ^t = (Θ^t, Ω^t, A^t): state variables;
  S^t = (Φ^t, Υ^t): sensor variables.
Decomposition:
  P(ξ^t ξ^{t−δt} S^t ... S^0) = P(S^t|ξ^t) · P(Ω^t) · P(A^t) · P(Θ^t|Θ^{t−δt} Ω^t) · P(ξ^{t−δt} S^{t−δt} ... S^0)
Parametric forms:
  P(S^t|ξ^t) = P(Φ^t|Ω^t) · P(Υ^t|F^t) · P(F^t|Θ^t A^t): sensor model, Gaussians and Dirac;
  P(Ω^t), P(A^t): a priori for state, Gaussians;
  P(Θ^t|Θ^{t−δt} Ω^t): state dynamic model, Diracs;
  P(ξ^{t−δt} S^{t−δt} ... S^0): previous iteration, distribution computed at the last time step.
Identification:
  Parameters of the Gaussians: σ_Φ, σ_Υ, σ_A and σ_Ω.
Question:
  P(ξ^t|S^t ... S^0)
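As a rough illustration of the kind of recursive inference the Bayesian Program of Fig. 7 describes, the sketch below (Python; not the authors' implementation, which is detailed in [11, 12]) maintains a small set of orientation hypotheses, propagates them with the noisy gyro rate (the Dirac dynamic model) and re-weights them with a Gaussian accelerometer likelihood around the gravity direction each hypothesis predicts. Linear acceleration A is neglected here (quasi-static assumption), and all parameter values are arbitrary.

```python
# Particle-style sketch of the inertial/vestibular filter of Fig. 7 (illustrative only).
import numpy as np

def rot_small(omega, dt):
    """First-order rotation matrix for the small rotation omega*dt (rad)."""
    wx, wy, wz = omega * dt
    return np.eye(3) + np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])

def vestibular_step(particles, weights, gyro, accel, dt, sigma_gyro=0.02, sigma_acc=0.5):
    """One step: particles are body-to-world rotation matrices Theta; returns updated set."""
    g_world = np.array([0.0, 0.0, -9.81])
    new_w = np.empty(len(particles))
    for i, (R, w) in enumerate(zip(particles, weights)):
        omega = gyro + np.random.randn(3) * sigma_gyro   # sample the true rate around the gyro reading
        particles[i] = R @ rot_small(omega, dt)          # Dirac dynamic model on Theta
        f_pred = -particles[i].T @ g_world               # predicted specific force if static (A = 0)
        new_w[i] = w * np.exp(-0.5 * np.sum((accel - f_pred) ** 2) / sigma_acc ** 2)
    return particles, new_w / new_w.sum()

particles = [np.eye(3) for _ in range(200)]
weights = np.full(200, 1.0 / 200)
particles, weights = vestibular_step(particles, weights, gyro=np.array([0.0, 0.0, 0.3]),
                                     accel=np.array([0.0, 0.0, 9.81]), dt=0.01)
print(weights.max())
```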
The former is used as a sub-BP for the BVM multimodal sensor fusion framework described earlier, while the answer to the latter yields a gaze direction of interest in terms of auditory features, which can be used by a multimodal attention system through a maximum a posteriori (MAP) method.

3) Vestibular sensor model: To process the inertial data, we follow the Bayesian model of the human vestibular system proposed by Laurens and Droulez [40, 41], adapted here to the use of inertial sensors. The aim is to provide an estimate of the current angular position and angular velocity of the system that mimics human vestibular perception. At time t, the Bayesian Program of Fig. 7 allows the inference of the probability distribution of the current state ξ^t = (Θ^t, Ω^t, A^t) — where the orientation of the system in space is encoded using a rotation matrix Θ, the instantaneous angular velocity is defined as the vector Ω, and linear acceleration by A — given all the previous sensory inputs until the present instant, represented by S^{0→t} = (Φ^{0→t}, Υ^{0→t}) — where Φ denotes Ω with added Gaussian noise, measured by the gyros, and Υ denotes the gravito-inertial acceleration F with added Gaussian noise, measured by the accelerometers — and the initial distribution ξ^0. Details regarding this model, the respective implementation issues and preliminary results have been presented in [11, 12].

III. THE INTEGRATED MULTIMODAL PERCEPTION EXPERIMENTAL PLATFORM

A. Platform description

To support our research work, we have developed an artificial multimodal perception system, of which an implementation diagram is presented in Fig. 8, consisting of a stereovision, binaural and inertial measuring unit (IMU) setup mounted on a motorised head, with gaze control capabilities for image stabilisation and perceptual attention purposes.

Figure 8. Implementation diagram for the BVM-IMPEP multimodal perception framework.

The stereovision system is implemented using a pair of Guppy IEEE 1394 digital cameras from Allied Vision Technologies (http://www.alliedvisiontec.com), and the binaural setup using two AKG Acoustics C417 linear microphones (http://www.akg.com/) and an FA-66 Firewire Audio Capture interface from Edirol (http://www.edirol.com/); the miniature inertial sensor, an Xsens MTi (http://www.xsens.com/), provides digital output of 3D acceleration, 3D rate of turn (rate gyro) and 3D earth-magnetic field data for the IMU. Full implementation details can be found in [12].

Figure 9. The IMPEP Bayesian sensor systems.
B. Sensory processing

In the following text, the foundations of the sensory processing systems depicted in Fig. 9, which feed the Bayesian sensor models defined in the previous text, will be presented.

1) Vision system: The stereovision algorithm used yields an estimated disparity map δ̂(k, i) and a corresponding confidence map λ(k, i), and is thus easily converted from its deterministic nature into a probabilistic implementation simulating the population code-type data structure, as defined earlier.

2) Auditory system: The Bayesian binaural system presented herewith is composed of three distinct and consecutive modules (Fig. 9): the monaural cochlear unit, which processes the pair of monaural signals {x1, x2} coming from the binaural audio transducer system by simulating the human cochlea, so as to achieve a tonotopic representation (i.e. a frequency band decomposition) of the left and right audio streams; the binaural unit, which correlates these signals and consequently estimates the binaural cues and segments each sound-source; and, finally, the Bayesian 3D sound-source localisation unit, which applies a Bayesian sensor model so as to perform localisation of sound-sources in 3D space. We have adapted the real-time software by the Speech and Hearing Group at the University of Sheffield [42] to implement the solution by Faller and Merimaa [43] as the binaural processor — for more details, please refer to [8, 10].

3) Inertial sensing system: The calibrated inertial sensors in the IMU provide direct egocentric measurements of body angular velocity and linear acceleration. The gyros and the accelerometers provide noise-corrupted measurements of the angular velocity Ω^t and the gravito-inertial acceleration F as input for the sensor model of Fig. 7.

C. System calibration

Camera calibration was performed using the Camera Calibration Toolbox by Bouguet [44], therefore allowing the application of the reprojection equation:

[W X, W Y, W Z, W]^T = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, f], [0, 0, 1/b, 0]] · [u_l − δ̂/2, v_l, δ̂, 1]^T   (11)

where u_l and v_l represent the horizontal and vertical coordinates of a point on the left camera, respectively, and δ̂ is the disparity estimate for that point, all of which are in pixels; f and b are the estimated focal length and baseline, respectively, both in metric distance; and X, Y and Z are the 3D point coordinates with respect to the egocentric/cyclopean referential system {E}.
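A minimal sketch of applying (11) as reconstructed above (Python; not the authors' calibration code). Principal-point offsets, which the fully calibrated matrix would contain, are omitted for brevity, and the disparity offset δ̂/2 places the result in the cyclopean frame.

```python
# Sketch of cyclopean reprojection following (11), principal point omitted.
import numpy as np

def reproject(u_l, v_l, disp, f, b):
    """Return egocentric/cyclopean (X, Y, Z); units follow those of the baseline b."""
    Q = np.array([[1.0, 0.0, 0.0,     0.0],
                  [0.0, 1.0, 0.0,     0.0],
                  [0.0, 0.0, 0.0,     f  ],
                  [0.0, 0.0, 1.0 / b, 0.0]])
    hom = Q @ np.array([u_l - disp / 2.0, v_l, disp, 1.0])   # cyclopean x = u_l - disparity/2
    return hom[:3] / hom[3]

# f = 1000 px and b = 70 mm with 40 px of disparity: Z = f*b/disp = 1750 mm.
print(reproject(u_l=120.0, v_l=-35.0, disp=40.0, f=1000.0, b=70.0))
```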
The joint entropy of the random variables VC and OC that compose the state of each BVM cell [C = c] is defined as follows: H(c) ≡ H(Vc , Oc ) = − P oc ∈O vc ∈V P (vc oc |z c) log P (vc oc |z c) (12) The joint entropy value H(c) is a sample of a continuous joint entropy field H : Y → R, taken at log-spherical (13) → − with magnitude k ∇H(c)k. A great advantage of the BVM over Cartesian implementations of occupancy maps is the fact that the log-spherical configuration avoids the need for timeconsuming ray-casting techniques when computing a gaze direction for active exploration, since the logspherical space is already defined based on directions (θ, φ). In case there is more than one global joint entropy gradient maximum, the cell corresponding to the direction closest to the current heading is chosen, so as to deal with equiprobability, while simultaneously ensuring minimum gaze shift rotation effort (see Fig. 10). The full description of the active exploration heuristics was presented in [9, 12]. B. Results The realtime implementation of all the processes of the framework was subjected to performance testing for each individual module — please refer to [12] for further details. To avoid Bayesian update deadlocks due to 0 or 1-probabilities, a simple error model analogous to what is proposed in [31] was implemented, both for occupancy and for velocities. Addtionally, log-probabilities are used to increase both the execution performance and the numerical stability of the system. The full active exploration system runs at about 6 to 10 Hz. This is ensured by forcing the main BVM thread to pause for each time-step when no visual measurement is available (i.e. during 40 ms for N = 10, ∆φ = 2o ). This guarantees that BVM time-steps are regularly spaced, which is a very important requirement for correct implementation of prediction/dynamics, and also ensures that processing and memory resources are freed and unlocked regularly. In Fig. 11 a qualitative comparison is made between the outcome of using each sensory modality individually, and also with the result of multimodal fusion, using a single speaker scenario, showcasing the advantages of visuoauditory integration in the effective use of both the spatial precision of visual sensing, and the temporal precision and panoramic capabilities of auditory sensing. The BVM renderings were produced from screenshots 11 (a) Left camera snapshot of a male speaker, at −41o azimuth relatively to the Z axis, which defines the frontal heading respective to the IMPEP “neck”. (b) BVM results for binaural processing only. Interpretation, from left to right: 1) sound coming from speaker triggers an estimate for occupancy from the binaural sensor model, and a consecutive exploratory gaze shift at approximately 1.6 seconds; 2) At approximately 10 seconds, noise coming from the background introduce a false positive, that is never again removed from the map (i.e. no sound does not mean no object, only no audible sound-source). (c) BVM results for stereovision processing only. Notice the clean cut-out of speaker silhouette, as comparing to results in (b). On the other hand, active exploration using vision sensing alone took approximately 15 seconds longer to start scanning the speaker’s position in space, while using binaural processing the speaker was fixated a couple of seconds into the experiment. (d) BVM results for visuoauditory fusion. 
B. Results

The real-time implementation of all the processes of the framework was subjected to performance testing for each individual module — please refer to [12] for further details. To avoid Bayesian update deadlocks due to 0- or 1-probabilities, a simple error model analogous to what is proposed in [31] was implemented, both for occupancy and for velocities. Additionally, log-probabilities are used to increase both the execution performance and the numerical stability of the system.

The full active exploration system runs at about 6 to 10 Hz. This is ensured by forcing the main BVM thread to pause for each time-step when no visual measurement is available (i.e. during 40 ms for N = 10, Δφ = 2°). This guarantees that BVM time-steps are regularly spaced, which is a very important requirement for correct implementation of prediction/dynamics, and also ensures that processing and memory resources are freed and unlocked regularly.

In Fig. 11, a qualitative comparison is made between the outcome of using each sensory modality individually and the result of multimodal fusion, using a single-speaker scenario, showcasing the advantages of visuoauditory integration in the effective use of both the spatial precision of visual sensing and the temporal precision and panoramic capabilities of auditory sensing. The BVM renderings were produced from screenshots of an online OpenGL-based viewer running throughout the experiments.

Figure 11. Online results for the real-time prototype for multimodal perception of 3D structure and motion using the BVM — three reenactments (binaural sensing only, stereovision sensing only and visuoauditory sensing) of a single-speaker scenario. A scene consisting of a male speaker talking in a cluttered lab is observed by the IMPEP active perception system and processed online by the BVM Bayesian filter, using the active exploration heuristics described in the main text, in order to scan the surrounding environment. The heading arrow together with an oriented 3D sketch of the IMPEP perception system depicted in each map denote the current gaze orientation. All results depict frontal views, with Z pointing outward. The parameters for the BVM are as follows: N = 10, ρ_Min = 1000 mm and ρ_Max = 2500 mm, θ ∈ [−180°, 180°] with Δθ = 1°, and φ ∈ [−90°, 90°] with Δφ = 2°, corresponding to 10 × 360 × 90 = 324,000 cells, approximately delimiting the so-called "personal space" (the zone immediately surrounding the observer's head, generally within arm's reach and slightly beyond, within 2 m range [6]). (a) Left camera snapshot of a male speaker, at −41° azimuth relative to the Z axis, which defines the frontal heading respective to the IMPEP "neck". (b) BVM results for binaural processing only. Interpretation, from left to right: 1) sound coming from the speaker triggers an estimate for occupancy from the binaural sensor model, and a consecutive exploratory gaze shift at approximately 1.6 seconds; 2) at approximately 10 seconds, noise coming from the background introduces a false positive, which is never again removed from the map (i.e. no sound does not mean no object, only no audible sound-source). (c) BVM results for stereovision processing only. Notice the clean cut-out of the speaker silhouette, as compared to the results in (b). On the other hand, active exploration using vision sensing alone took approximately 15 seconds longer to start scanning the speaker's position in space, while using binaural processing the speaker was fixated a couple of seconds into the experiment. (d) BVM results for visuoauditory fusion. In this case, the advantages of both binaural sensing (immediacy from panoramic scope) and stereovision (greater spatial resolution and the ability to clean empty regions in space) influence the final outcome of this particular instantiation of the BVM, taken at 1.5 seconds.

Figure 12. Temporal evolution of average information gain (i.e. average Kullback-Leibler divergence for the full set of cells which were updated, either due to observations or propagation from prediction) and corresponding exploration span for the auditory-only, visual-only and visuoauditory versions of the single-speaker scenario (see Fig. 11), for a 30-second period since the start of each experiment. (a) Instantaneous average information gain for updated cells (KL divergence, in bits, vs. time in seconds). (b) Cumulative sum of average information gain for updated cells (bits vs. time in seconds). (c) Maximum angle covered by the explored volumetric convex hull (explored region, in degrees, vs. time in seconds).

Fig. 12 presents a study based on information gain and exploration span, yielding a quantitative comparison of these advantages and capabilities, and demonstrating the superior results of visuoauditory fusion as compared to using each sensory modality separately. In [12], results were presented concerning the effectiveness of active exploration when having to deal with the ambiguity and uncertainty caused by multiple sensory targets and complex noise in a two-speaker scenario. In order to provide a better view of the BVM configuration, the supplemental MPEG file presents a higher-resolution rendering of a reenactment of this scenario, processed offline from a saved instantiation of the BVM occupancy grid.
The advantages of the log-spherical configuration were made apparent in these experiments when comparing to other solutions in the literature: (a) an efficiency advantage: as mentioned in the introductory section, fewer cells are used to represent the environment — for example, to achieve the resolution equivalent to approximately the distance obtained half-way through the log-scale using a regular Euclidean partition for the examples in Fig. 11 and the supplemental video, i.e. 40 mm-side cells and removing the egocentric gap, around 1,937,500 cells would be needed, while, using a similar rationale, around 1,215,000 cells would be needed for a solution such as the one presented by [17], roughly constituting at least a 3-fold reduction in total cell count; (b) a robustness advantage: the fact that sensor readings are directly referred to in spherical coordinates, and consequently no ray-tracing is needed, leads to inherent anti-aliasing, therefore avoiding the Moiré effects which are present in other grid-based solutions in the literature, as reported by Yguel et al. [31]. Moreover, the benefits of using an egocentric spherical configuration have also been made clear: sensory fusion is seamlessly performed, avoiding the consequences of transformations between referentials and the respective complex registration and integration processes proposed in related work, such as [20–22].

V. CONCLUSIONS

In this text we introduced Bayesian models for visuoauditory perception and inertial sensing emulating vestibular perception, which form the basis of the probabilistic framework for multimodal sensor fusion — the Bayesian Volumetric Map. These models build upon a common spatial configuration that is naturally fitting for the integration of readings from multiple sensors. We also presented the robotic platform that supports the use of these computational models for implementing an entropy-based exploratory behaviour for multimodal active perception. In the future, the computational models described in this text will allow the construction of a simultaneously flexible and powerful robotic implementation of multisensory active perception to be used in real-world applications. Regarding its future use in applications such as human-machine interaction or mobile robot navigation, the following conclusions may be drawn:

• The results presented in the previous section show that the active exploration algorithm successfully drives the IMPEP-BVM framework to explore areas of the environment mapped with high uncertainty in real-time, with an intelligent heuristic that minimises the effects of local minima by attending to the closest regions of high entropy first.

• Moreover, since the human saccade-generation system promotes fixation periods (i.e. time intervals between gaze shifts) of a few hundred milliseconds on average [47, 48], the overall rates of 6 to 10 Hz achieved with our CUDA implementation, in our opinion, back up the claim that our system does, in fact, achieve satisfactory real-time performance.

• Effective use of visual spatial accuracy and auditory panoramic capabilities and temporal accuracy by our system constitutes a powerful solution for attention allocation in realistic settings, even in the presence of ambiguity and uncertainty caused by multiple sensory targets and complex noise.
• Although not explicitly providing for object representation, many of the scene properties that are already represented by the Bayesian filter allow for clustering and tracking of neighbouring cells sharing similar states, which in turn provides a fast prior/cue generator for an additional object detection and recognition module. An active object search could then be implemented, as in related work such as [20–22].

The BVM and its egocentric log-spherical configuration carry with them, however, in their current state, a few important limitations. In decreasing order of importance, these would be the following:

1) The non-regular tessellation of space might introduce perceptual distortions due to motion-based prediction when a moving object becomes occluded: a big object moving towards the observer will appear to shrink or, conversely, a small object moving away from the observer will appear to inflate. These perceptual illusions will of course disappear as soon as the object returns to the observer's line-of-sight.

2) If an object happens to get outside of the robotic observer's field-of-view, either due to a gaze shift or to object motion, the effect of the BVM representing a persistent memory might result in "ghost occupancies" and consequent cluttering of the spatial map. This did not visibly happen during the experiments described in section IV-B; however, it is a definite concern, and will be addressed in future work (see section VI).

3) If this system is to be used by an autonomous mobile robot to perform allocentric mapping, then some care must be taken when dealing with the non-trivial integration of reconstructions taken from different points of view. This is, however, beyond the scope of our current research.

We are currently developing a complex artificial active perception system that follows human-like bottom-up driven behaviours using vision, audition and vestibular sensing, building upon the work presented in this text. More specifically, we have devised a hierarchical modular probabilistic framework that allows the combination of active perception behaviours, adding to the active exploration behaviour described in this text automatic orientation based on sensory saliency. This research work has demonstrated in practice the usefulness rendered by the extensibility, adaptivity and scalability of the proposed framework – for more details, please refer to [13]. Further details on the development and application of these models can be found at http://paloma.isr.uc.pt/~jfilipe/BayesianMultimodalPerception.

VI. FUTURE WORK

Long-term improvements to the BVM-IMPEP framework would include sensor models specifically for local motion, in contrast to the occupancy-only-based sensor models presented in this paper. These models could be built upon concepts such as optical flow processing for vision (which could be enhanced by visuoinertial integration), the Doppler effect for audition, etc., and perceptual grouping solutions, through clustering processes similar to what was presented by Tay et al. [29], but in our case using prior distributions based on multimodal perceptual integration processes, some of which are currently being studied in psychophysical experiments performed by our research group, to be concluded soon. Another important addition would be the introduction of a decay factor to the BVM – in other words, a "forgetfulness" factor – thus avoiding the cluttering limitation of the framework pointed out in section V.
ACKNOWLEDGEMENTS

This publication has been partially supported by the European Commission within the Seventh Framework Programme FP7, as part of theme 2: Cognitive Systems, Interaction, Robotics, under grant agreement 231640. The work presented herewith was also supported by EC contract number FP6-IST-027140, Action line: Cognitive Systems. The contents of this text reflect only the authors' views. The European Community is not liable for any use that may be made of the information contained herein. This research has also been supported by the Portuguese Foundation for Science and Technology (FCT) [postdoctoral grant number SFRH/BPD/74803/2010]. The authors would like to thank the reviewers and the Associate Editor for their kind and useful suggestions, which we gratefully acknowledge as having substantially improved the quality of our manuscript.

REFERENCES

[1] K. J. Murphy, D. P. Carey, and M. A. Goodale, “The Perception of Spatial Relations in a Patient with Visual Form Agnosia,” Cognitive Neuropsychology, vol. 15, no. 6/7/8, pp. 705–722, 1998.
[2] D. C. Knill and A. Pouget, “The Bayesian brain: the role of uncertainty in neural coding and computation,” TRENDS in Neurosciences, vol. 27, no. 12, pp. 712–719, December 2004.
[3] M. J. Barber, J. W. Clark, and C. H. Anderson, “Neural representation of probabilistic information,” Neural Computation, vol. 15, no. 8, pp. 1843–1864, August 2003.
[4] J. Gordon, M. F. Ghilardi, and C. Ghez, “Accuracy of planar reaching movements. I. Independence of direction and extent variability,” Experimental Brain Research, vol. 99, no. 1, pp. 97–111, 1994.
[5] J. McIntyre, F. Stratta, and F. Lacquaniti, “Short-Term Memory for Reaching to Visual Targets: Psychophysical Evidence for Body-Centered Reference Frames,” Journal of Neuroscience, vol. 18, no. 20, pp. 8423–8435, October 15, 1998.
[6] J. E. Cutting and P. M. Vishton, “Perceiving layout and knowing distances: The integration, relative potency, and contextual use of different information about depth,” in Handbook of perception and cognition, W. Epstein and S. Rogers, Eds. Academic Press, 1995, vol. 5; Perception of space and motion.
[7] J. F. Ferreira, P. Bessière, K. Mekhnacha, J. Lobo, J. Dias, and C. Laugier, “Bayesian Models for Multimodal Perception of 3D Structure and Motion,” in International Conference on Cognitive Systems (CogSys 2008), University of Karlsruhe, Karlsruhe, Germany, April 2008, pp. 103–108.
[8] C. Pinho, J. F. Ferreira, P. Bessière, and J. Dias, “A Bayesian Binaural System for 3D Sound-Source Localisation,” in International Conference on Cognitive Systems (CogSys 2008), University of Karlsruhe, Karlsruhe, Germany, April 2008, pp. 109–114.
[9] J. F. Ferreira, C. Pinho, and J. Dias, “Active Exploration Using Bayesian Models for Multimodal Perception,” in Image Analysis and Recognition, Lecture Notes in Computer Science series (Springer LNCS), International Conference ICIAR 2008, A. Campilho and M. Kamel, Eds., June 25–27, 2008, pp. 369–378.
[10] ——, “Implementation and Calibration of a Bayesian Binaural System for 3D Localisation,” in 2008 IEEE International Conference on Robotics and Biomimetics (ROBIO 2008), Bangkok, Thailand, February 21–26, 2009, pp. 1722–1727.
[11] J. Lobo, J. F. Ferreira, and J. Dias, “Robotic Implementation of Biological Bayesian Models Towards Visuo-inertial Image Stabilization and Gaze Control,” in 2008 IEEE International Conference on Robotics and Biomimetics (ROBIO 2008), Bangkok, Thailand, February 21–26, 2009, pp. 443–448.
[12] J. F. Ferreira, J. Lobo, and J. Dias, “Bayesian real-time perception algorithms on GPU — Real-time implementation of Bayesian models for multimodal perception using CUDA,” Journal of Real-Time Image Processing, vol. 6, no. 3, pp. 171–186, September 2011.
[13] J. F. Ferreira, M. Castelo-Branco, and J. Dias, “A hierarchical Bayesian framework for multimodal active perception,” Adaptive Behavior, vol. 20, no. 3, pp. 172–190, June 2012.
[14] M. O. Ernst and H. H. Bülthoff, “Merging the senses into a robust percept,” TRENDS in Cognitive Sciences, vol. 8, no. 4, pp. 162–169, April 2004.
[15] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” IEEE Computer, vol. 22, no. 6, pp. 46–57, 1989.
[16] R. Zapata, B. Jouvencel, and P. Lépinay, “Sensor-based motion control for fast mobile robots,” in IEEE Int. Workshop on Intelligent Motion Control, Istanbul, Turkey, 1990, pp. 451–455.
[17] A. Dankers, N. Barnes, and A. Zelinsky, “Active Vision for Road Scene Awareness,” in IEEE Intelligent Vehicles Symposium (IVS05), Las Vegas, USA, June 2005, pp. 187–192.
[18] J. Tsotsos and K. Shubina, “Attention and Visual Search: Active Robotic Vision Systems that Search,” in The 5th International Conference on Computer Vision Systems, Bielefeld, March 21–24, 2007.
[19] D. Roy, “Integration of Speech and Vision using Mutual Information,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 6, 2000, pp. 2369–2372.
[20] A. Andreopoulos, S. Hasler, H. Wersing, H. Janssen, J. K. Tsotsos, and E. Körner, “Active 3D Object Localization Using A Humanoid Robot,” IEEE Transactions on Robotics, vol. 27, no. 1, pp. 47–64, 2011.
[21] J. Ma, T. H. Chung, and J. Burdick, “A probabilistic framework for object search with 6-DOF pose estimation,” International Journal of Robotics Research, vol. 30, no. 10, pp. 1209–1128, 2011.
[22] F. Saidi, O. Stasse, K. Yokoi, and F. Kanehiro, “Online object search with a humanoid robot,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2007), 2007, pp. 1677–1682.
[23] C. Breazeal, A. Edsinger, P. Fitzpatrick, and B. Scassellati, “Active Vision for Sociable Robots,” IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans, vol. 31, no. 5, pp. 443–453, September 2001.
[24] A. Dankers, N. Barnes, and A. Zelinsky, “A Reactive Vision System: Active-Dynamic Saliency,” in 5th International Conference on Computer Vision Systems, Bielefeld, Germany, March 21–24, 2007.
[25] A. Koene, J. Morén, V. Trifa, and G. Cheng, “Gaze shift reflex in a humanoid active vision system,” in 5th International Conference on Computer Vision Systems, Bielefeld, Germany, 2007.
[26] L. Itti, C. Koch, and E. Niebur, “A Model of Saliency-Based Visual Attention for Rapid Scene Analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, November 1998.
[27] R. Rocha, J. Dias, and A. Carvalho, “Cooperative Multi-Robot Systems: a study of Vision-based 3-D Mapping using Information Theory,” Robotics and Autonomous Systems, vol. 53, no. 3–4, pp. 282–311, December 2005.
[28] P. Bessière, C. Laugier, and R. Siegwart, Eds., Probabilistic Reasoning and Decision Making in Sensory-Motor Systems, ser. Springer Tracts in Advanced Robotics. Springer, 2008, vol. 46, ISBN: 978-3-540-79006-8.
[29] C. Tay, K. Mekhnacha, C. Chen, M. Yguel, and C. Laugier, “An efficient formulation of the Bayesian occupation filter for target tracking in dynamic environments,” International Journal of Autonomous Vehicles, vol. 6, no. 1–2, pp. 155–171, 2008.
[30] C. Spence and S. Squire, “Multisensory integration: maintaining the perception of synchrony,” Current Biology, vol. 13, pp. R519–R521, July 2003.
[31] M. Yguel, O. Aycard, and C. Laugier, “Efficient GPU-based Construction of Occupancy Grids Using several Laser Range-finders,” International Journal of Autonomous Vehicles, vol. 6, no. 1–2, pp. 48–83, 2008.
[32] A. Pouget, P. Dayan, and R. Zemel, “Information processing with population codes,” Nature Reviews Neuroscience, vol. 1, pp. 125–132, 2000, review.
[33] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, revised second printing ed., M. B. Morgan, Ed. Morgan Kaufmann Publishers, Inc. (Elsevier), 1988.
[34] S. Treue, K. Hol, and H.-J. Rauber, “Seeing multiple directions of motion — physiology and psychophysics,” Nature Neuroscience, vol. 3, no. 3, pp. 270–276, March 2000.
[35] R. T. Born and D. C. Bradley, “Structure and Function of Visual Area MT,” Annual Review of Neuroscience, vol. 28, pp. 157–189, July 2005.
[36] B. Yang and J. Scheuing, “Cramer-Rao bound and optimum sensor array for source localization from time differences of arrival,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP ’05), vol. 4. IEEE, 2005, pp. 961–964.
[37] J. C. Chen, K. Yao, and R. E. Hudson, “Source Localization and Beamforming,” IEEE Signal Processing Magazine, vol. 1053, no. 5888/02, pp. 30–39, 2002.
[38] B. Loesch, P. Ebrahim, and B. Yang, “Comparison of different algorithms for acoustic source localization,” in ITG-Fachbericht-Sprachkommunikation 2010, 2010. [Online]. Available: http://www.vde-verlag.de/proceedings-en/453300039.html
[39] T. Mukai, “Developing sensors that give intelligence to robots,” Riken Research, vol. 1, no. 6, pp. 13–16, Jun. 2006. [Online]. Available: http://www.rikenresearch.riken.jp/eng/frontline/4428
[40] J. Laurens and J. Droulez, “Bayesian processing of vestibular information,” Biological Cybernetics, vol. 96, no. 4, pp. 389–404, December 2007 (published online: 5 December 2006). [Online]. Available: http://dx.doi.org/10.1007/s00422-006-0133-1
[41] ——, “Bayesian modeling of inertial self-motion perception,” 2005, section 3.2 of Bayesian IBA Project Workpackage 2 deliverable D15.
[42] Y.-C. Lu, H. Christensen, and M. Cooke, “Active binaural distance estimation for dynamic sources,” in Interspeech 2007, Antwerp, Belgium, 2007, pp. 574–577.
[43] C. Faller and J. Merimaa, “Source localization in complex listening situations: Selection of binaural cues based on interaural coherence,” Journal of the Acoustical Society of America, vol. 116, no. 5, pp. 3075–3089, November 2004.
[44] J.-Y. Bouguet, “Camera Calibration Toolbox for Matlab,” 2006. [Online]. Available: http://www.vision.caltech.edu/bouguetj/calib_doc/index.html
[45] J. Lobo, “InerVis Toolbox for Matlab,” 2006. [Online]. Available: http://www.deec.uc.pt/~jlobo/InerVis_WebIndex/
[46] J. Lobo and J. Dias, “Relative Pose Calibration Between Visual and Inertial Sensors,” International Journal of Robotics Research, Special Issue 2nd Workshop on Integration of Vision and Inertial Sensors, vol. 26, no. 6, pp. 561–575, June 2007.
[47] R. H. S. Carpenter, “The saccadic system: a neurological microcosm,” Advances in Clinical Neuroscience and Rehabilitation, vol. 4, pp. 6–8, 2004, review article.
[48] A. Caspi, B. R. Beutter, and M. P. Eckstein, “The time course of visual information accrual guiding eye movement decisions,” Proceedings of the National Academy of Sciences U.S.A., vol. 101, no. 35, pp. 13086–13090, 31 August 2004.

João Filipe de Castro Cardoso Ferreira (M’12) was born in Coimbra, Portugal, in 1973. He received his B.Sc. (five-year course), M.Sc. and Ph.D. degrees in Electrical Engineering and Computers from the University of Coimbra, Portugal, in 2000, 2005 and 2011, respectively. He has been an Invited Assistant Professor at the University of Coimbra and a post-doctoral researcher at the Institute of Systems and Robotics (ISR) since 2011. He has also been a staff researcher at the ISR since 1999. His main research interests are human and artificial perception, robotics, Bayesian modelling and 3D scanning. Dr. Ferreira is a member of the IEEE Robotics and Automation Society (RAS).

Jorge Nuno de Almeida e Sousa Almada Lobo (M’08) was born in Cambridge, UK, in 1971. He completed his five-year course in Electrical Engineering and received the M.Sc. and Ph.D. degrees from the University of Coimbra, Portugal, in 1995, 2002 and 2007, respectively. He was a junior teacher in the Computer Science Department of the Coimbra Polytechnic School, and later joined the Electrical and Computer Engineering Department of the Faculty of Science and Technology at the University of Coimbra, where he currently works as an Assistant Professor. His current research is carried out at the Institute of Systems and Robotics, University of Coimbra, and focuses on inertial sensor data integration in computer vision systems, Bayesian models for multimodal perception of 3D structure and motion, and real-time performance using GPUs and reconfigurable hardware. Dr. Lobo is a member of the IEEE Robotics and Automation Society (RAS).

Pierre Bessière was born in 1958. He received the engineering degree and the Ph.D. degree in computer science from the Institut National Polytechnique de Grenoble (INPG), France, in 1981 and 1983, respectively. He did a post-doctorate at SRI International (Stanford Research Institute), working on a project for the National Aeronautics and Space Administration (NASA). He then worked for five years in an industrial company as the leader of different artificial intelligence projects. He came back to research in 1989 and has been a senior researcher at the Centre National de la Recherche Scientifique (CNRS) since 1992. His main research concerns have been evolutionary algorithms and probabilistic reasoning for perception, inference and action. He leads the Bayesian Programming research group (Bayesian-Programming.org) on these two subjects; fifteen Ph.D. diplomas and numerous international publications are fruits of the activity of this group during the last 15 years. He also led the BIBA (Bayesian Inspired Brain and Artefact) project and was a partner in the BACS (Bayesian Approach to Cognitive Systems) European project. He is a co-founder and scientific adviser of the ProBAYES company, which develops and sells Bayesian solutions for industry.

Miguel de Sá e Sousa Castelo-Branco received the M.D. degree from the University of Coimbra, Coimbra, in 1991, and the Ph.D. degree from the Max-Planck-Institut für Hirnforschung, Frankfurt, and the University of Coimbra, in 1998. He is currently an Assistant Professor at the University of Coimbra, Portugal, and held a similar position in 2000 at the University of Maastricht, the Netherlands. Before that (1998–1999), he was a post-doctoral fellow at the Max-Planck-Institut für Hirnforschung, Germany, where he had also performed his Ph.D. work (1994–1998).
He is also the Director of IBILI (Institute for Biomedical Research on Light and Image), Faculty of Medicine, Coimbra, Portugal, which is part of the European Network Evi-Genoret, and is involved in the Portuguese National Functional Brain Imaging Network. He has made contributions in the fields of ophthalmology, neurology, visual neuroscience, human psychophysics, functional brain imaging and human and animal neurophysiology.

Jorge Manuel Miranda Dias (M’96-SM’10) received his Ph.D. in Electrical Engineering, with specialisation in Control and Instrumentation, from the University of Coimbra, Portugal, in 1994. He carries out his research activities at the Institute of Systems and Robotics (Instituto de Sistemas e Robótica), University of Coimbra, and also at the Khalifa University of Science, Technology and Research (KUSTAR), Abu Dhabi, UAE. His current research areas are computer vision and robotics, with activities and contributions in these fields since 1984. He has been the main researcher in several projects financed by the European Commission (Framework Programmes 6 and 7) and by the Portuguese Foundation for Science and Technology (FCT). Dr. Dias is currently the officer in charge of the Portuguese Chapter of the IEEE-RAS (Robotics and Automation Society), and also the vice-president of the “Sociedade Portuguesa de Robótica - SPR”.