provide any algorithmic description of attention, but merely a metaphorical explanation. A theater spotlight simile is used to represent the focus of consciousness: the spotlight illuminates only a small part of the scene, which is considered the conscious content of the mind. The scene itself is built upon the subject's working memory. The movement of the spotlight, i.e. the selection of the contents that will be used for volition and action, is directed by unconscious contextual systems. The aim of the work described in this paper is to design and test an implementation of such contextual systems, which are able to adaptively direct attention toward the interesting areas of the robot's sensorimotor space.
From the point of view of perception, contexts are sets of percepts retrieved from the sensors. Percepts are considered the minimal information units obtained by the robot's sensory machinery [5]. Therefore, a sensory context can be used to build a complex percept composed of related single percepts. From the point of view of behavior, contexts define sets of actions available for execution; hence, we can define behavioral contexts as possible compositions of related actions. In order to generate efficient robot behavior, both sensory contexts and behavioral contexts have to be adaptively generated.
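To make these definitions concrete, the sketch below shows one possible encoding of single percepts and their composition into complex percepts. The field names and types are illustrative assumptions on our part, not the actual CERA data structures.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class SinglePercept:
    """Minimal information unit produced by the robot sensory machinery."""
    timestamp: float               # estimated acquisition time (seconds)
    location: Tuple[float, float]  # position relative to the robot body
    color: Optional[str] = None    # dominant color label, if visual
    moving: bool = False           # visual motion flag

@dataclass
class ComplexPercept:
    """Composition of related percepts selected by a sensory context."""
    parts: list = field(default_factory=list)  # single or nested complex percepts
```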
A. Visual Field Segmentation

The first stages in visual sensor data processing are concerned with attentional context definition. Concretely, instead of applying a full preprocessing task to the entire image captured by the camera sensor, each incoming frame is fragmented into smaller regions. Subsequently, only one selected fragment (the foveal region) is further processed, thus greatly reducing the processing requirements of the visual sensor preprocessor. Additionally, as explained below, this strategy allows the robot to focus attention on specific visual regions in further processing stages as well. Nevertheless, before the preprocessing stage, when context criteria are evaluated, all visual data packages (frame segments) are processed equally. This strategy is similar to the way the human visual system processes the foveal region, which is much richer in resolution and detail than the retinal periphery. Humans use the fovea to fixate on an object and specifically process its image while maintaining a much less demanding process for peripheral regions [7]. This very same strategy has also been successfully applied in other artificial systems, e.g. [8].
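A minimal sketch of this segmentation strategy follows. The 8x8 grid size and the red-saliency criterion are assumptions chosen for illustration (the paper labels segments such as S5,5, suggesting grid indexing, but does not fix these parameters here).

```python
import numpy as np

def fragment_frame(frame: np.ndarray, rows: int = 8, cols: int = 8):
    """Split a camera frame (H x W x 3) into a rows x cols grid of segments."""
    h, w = frame.shape[0] // rows, frame.shape[1] // cols
    return {(i, j): frame[i*h:(i+1)*h, j*w:(j+1)*w]
            for i in range(rows) for j in range(cols)}

def red_saliency(segment: np.ndarray) -> float:
    """Cheap per-segment criterion: mean red-channel dominance."""
    r, g, b = segment[..., 0], segment[..., 1], segment[..., 2]
    return float(np.mean(r.astype(int) - (g.astype(int) + b.astype(int)) / 2))

def select_fovea(frame: np.ndarray):
    """Evaluate all segments cheaply, then fully process only the winner."""
    segments = fragment_frame(frame)
    winner = max(segments, key=lambda k: red_saliency(segments[k]))
    return winner, segments[winner]  # only this fragment gets full preprocessing
```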
B. Context Criteria

We have designed the process of context formation as the application of predefined criteria in order to calculate the degree of relation between the potential elements of a given context. Basically, a context should be constructed in such a way that it becomes a meaningful representation of reality, i.e. the interplay between agent and situation must be enforced by a proper definition of both sensory and behavioral contexts. The most basic factors that need to be considered in a correct representation of the robot's situation in the world are time and location. Nevertheless, other factors can be considered depending on the problem domain and the richness of the internal state representation. In the work described here, color and movement properties have been considered as additional criteria; therefore, four criteria have been used for context formation in the experiments described below.

The time criterion refers to the exact moment at which a stimulus is perceived. Therefore, it should be taken as an important criterion for relating one percept to another. Given that different sensors and their associated device drivers can take different time intervals to process the sensory information, a mechanism for time alignment is required. It has been demonstrated that such a time alignment mechanism is present in biological brains [9][10]. Although visual and auditory stimuli are processed at different speeds, the time gap between different processed signals, whose physical originators were acquired at the same time, is automatically removed by the brain [11]. An analogous artificial mechanism has been implemented in the proposed architecture.
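The paper does not detail the artificial time alignment mechanism, so the following is only a plausible sketch: each reading is back-dated by a per-driver latency estimate, and two percepts satisfy the time criterion when their corrected acquisition times fall within a common window. The latency values and window width are invented for illustration.

```python
# Hypothetical per-driver latency estimates (seconds), not measured values.
DRIVER_LATENCY = {"camera": 0.120, "sonar": 0.015}

def acquisition_time(receipt_time: float, sensor: str) -> float:
    """Estimate when the stimulus actually occurred by removing driver delay."""
    return receipt_time - DRIVER_LATENCY[sensor]

def time_related(t_a: float, t_b: float, window: float = 0.05) -> bool:
    """Two percepts satisfy the time criterion if their corrected
    acquisition times fall within the same alignment window."""
    return abs(t_a - t_b) <= window
```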
Location is another fundamental criterion for context formation, as the representation of the position of objects in the world is a requirement for situatedness. Furthermore, the location of an object relative to the robot body (or any other reference frame) is required for generating adaptive behaviors. The relative location of any element in the sensory world is necessary for the integration of complex percepts; additionally, it allows the selection of a given direction of attention toward the most relevant places. The presence of space-coding neurons and the use of reference frames (like somatotopic or head-centered frames) have been demonstrated in the mammal brain [12][13].

In a world where color patterns can be associated with particular objects, this property of entities should be taken into account. Similarly, some objects are mobile while others remain static; consequently, movement is a property that should also be considered as a criterion for relevant context formation. In particular, the task of counterpart recognition has been simplified in the research under discussion by characterizing other peer robots as autonomously moving red and black objects. The presence of specialized areas for color and movement detection has been demonstrated in the visual cortex of the human brain [14].

Following the principles presented above, we have used time, location, color, and motion as fundamental contextualization criteria (jointly sketched after the list below) for the formation of:
• Sensory contexts as compositions of single percepts (complex percepts), and
• Behavioral contexts as compositions of simple actions.
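Building on the earlier SinglePercept sketch, the fragment below illustrates how these four criteria could jointly gate the grouping of percepts. The thresholds and the greedy grouping policy are our assumptions, not the paper's algorithm.

```python
import math

def affinity(p, q, max_dist: float = 0.5, window: float = 0.05) -> bool:
    """Two single percepts are context-related when they agree on
    time, location, color, and motion (assumed thresholds)."""
    close_in_time = abs(p.timestamp - q.timestamp) <= window
    close_in_space = math.dist(p.location, q.location) <= max_dist
    return close_in_time and close_in_space and p.color == q.color and p.moving == q.moving

def group_percepts(percepts):
    """Greedily cluster mutually related percepts; each resulting group
    would then become a ComplexPercept as in the earlier sketch."""
    groups = []
    for p in percepts:
        for g in groups:
            if all(affinity(p, q) for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups
```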
In order to generate these contexts, both single percepts (which are built from data packages obtained from sensors) and simple actions (which are defined as part of the robot control system) are required to incorporate estimated time, location, motion, and color parameters (see Fig. 4). Motion properties could obviously be derived from time and location parameters; however, we have decided to use a natively visual motion detection approach in which motion properties are directly obtained from visual input analysis. In our proposed architecture there are specialized modules designed to calculate time, color, motion, and location parameters: the Timer module maintains a precision clock (less than 1 millisecond resolution) that represents the robot's age, the Proprioception
Fig. 10. Formation of complex percepts and complex behaviors.
Fig. 11. Vectors calculated to build the J-Index of a complex bumper contact percept.
C. Contextualization Hierarchy

The proposed contextualization mechanism supports hierarchical composition; hence, as sketched below, complex percepts can be built by combining:
• A number of single percepts.
• A number of complex percepts.
• Both single and complex percepts.
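Continuing the earlier sketch, this hierarchical composition can be expressed as a percept tree in which a complex percept nests any mix of single and complex percepts; the helper names are ours, not CERA's.

```python
from typing import Union

# A node is either a leaf (SinglePercept) or an inner node (ComplexPercept),
# both as defined in the earlier sketch.
PerceptNode = Union["SinglePercept", "ComplexPercept"]

def compose(*parts: PerceptNode) -> "ComplexPercept":
    """Build a higher-level complex percept from any mix of percepts."""
    return ComplexPercept(parts=list(parts))

def leaves(node: PerceptNode):
    """Recursively enumerate the single percepts under a percept tree."""
    if isinstance(node, ComplexPercept):
        for part in node.parts:
            yield from leaves(part)
    else:
        yield node
```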
complex percept J-Index, and simple actions are generated that will cause the chaser to head towards the target.

Keeping a constant distance to the target is facilitated by the ranging data obtained from sonar. As single percepts from vision and single percepts from sonar share location parameters (j referent vectors), the distance to the target can be estimated by multimodal contextualization. Indeed, the complex percepts that represent the target are composed of both visual and sonar single percepts. These single percepts were associated in the same complex percept because of their affinity in relative location. This means that target complex percepts include sonar ranging data in addition to the visual pose estimation.
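As an illustration of this multimodal estimate (with an invented helper, not the CERA API): a visual complex percept contributes a bearing but no range, so the range is borrowed from the best-aligned sonar single percept.

```python
def fuse_distance(visual_bearing_deg: float, sonar_percepts) -> float:
    """Pick the sonar reading whose bearing is closest to the visual bearing.

    sonar_percepts: iterable of (bearing_deg, range_mm) pairs.
    """
    bearing, range_mm = min(
        sonar_percepts,
        key=lambda s: abs(s[0] - visual_bearing_deg),
    )
    return range_mm  # distance estimate assigned to the multimodal percept
```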
Fig. 13 shows an example of both visual and sonar data as ingested in the CERA Physical Layer, where the associated time and location parameters are calculated. Then, sensor preprocessors build single percepts including timestamps and J-Indexes. All generated single percepts enter the CERA Physical workspace, where the complex percepts are built based on the currently active contexts. Active contexts are established by the higher control system running in the CERA Core Layer. The scenario depicted in Fig. 13 corresponds to the chasing task; hence a context for red objects is active (in addition to location and time, which are always used as contextualization criteria). The right side of the figure shows an example of the complex percepts that are formed by the application of the mentioned contextualization mechanism. Single percepts corresponding to visual segments S5,5 and S6,5 are selected because they present saliency in terms of the red color contextualization criterion. Given that their j referent vectors happen to be contiguous, a new monomodal (visual) complex percept is built as a composition of them. As shown in the picture, the J-Index of the new monomodal complex percept points to the geometrical center of the visual segment formed as a combination of the two former single percept segments. It can be noticed that the J-Index of this visual complex percept does not spot the actual center of the target, but the approximation is good enough for the real-time chasing task. Once monomodal complex percepts have been built, time and location contextualization is applied amongst different modalities.

Fig. 13. Upper left image is a representation of the segmented visual sensory information acquired by the simulated onboard camera. Lower left graph is a capture from the CERA graphical user interface displaying real-time sonar transducer measurements. The sonar ranging capture corresponds to the particular instant when the camera acquired the image depicted above. The right hand side of the picture shows a representation of the multimodal complex percept being built with this sensory information.

The bottom right representation in the picture (Fig. 13) corresponds to the sonar j referent vectors, including a highlighted single percept (the one obtained from the reading of the sonar transducer oriented at +10°). The projection outlined top-down from the visual complex percept to this sonar single percept indicates that both percepts are to be associated and will form a multimodal complex percept. Time association is obvious; however, the location contextualization between visual and sonar percepts requires some additional parametric alignment, as these different sensor modalities present particular orientations and spans. Furthermore, as explained above, only the X coordinate is considered for visual percepts (the visual horizontal axis). While we have used a 90° field of view camera, the Pioneer 3DX robot frontal sonar ring covers a total field of 195° (including blind angles between transducer cones). Therefore, only percepts originated from the central 90° of sonar coverage are taken into account for visual-to-sonar contextualization. Black dashed lines on the right hand side of the figure represent the alignment between the visual horizontal axis and the central −45° to +45° angular span of the frontal sonar. In this case, the value of SH in the visual complex percept corresponds to the sonar percept originated from the sonar transducer at +10°. The measurement represented in this particular sonar percept (2493 millimeters) is directly the distance estimate assigned to the multimodal complex percept, as the visual percept itself does not provide any distance estimate.
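The parametric alignment described above can be sketched as follows, assuming the Pioneer 3DX frontal sonar bearings (±10°, ±30°, ±50°, ±90°) and a camera whose 90° field of view maps linearly onto the image width; only transducers within the central ±45° span are candidates.

```python
# Assumed Pioneer 3DX frontal sonar ring layout (degrees, left negative).
SONAR_BEARINGS_DEG = [-90, -50, -30, -10, 10, 30, 50, 90]

def visual_x_to_bearing(x: float, image_width: float, fov_deg: float = 90.0) -> float:
    """Map a horizontal image coordinate to a bearing in degrees."""
    return (x / image_width - 0.5) * fov_deg

def matching_transducer(x: float, image_width: float) -> float:
    """Return the bearing of the sonar transducer aligned with pixel column x."""
    bearing = visual_x_to_bearing(x, image_width)
    central = [b for b in SONAR_BEARINGS_DEG if -45.0 <= b <= 45.0]
    return min(central, key=lambda b: abs(b - bearing))
```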
Preliminary results obtained applying the proposed attention mechanism to the human-controlled target chasing task are promising; however, more complex environments have to be tested in order to appreciate the real potential of the cognitive approach. In addition to the manual control of P3DX-Target, which produces very variable results, three simple autonomous behaviors have been implemented with the aim of testing the capability of the attention mechanism when confronted with different movement patterns (scenarios a, b, and c depicted in Fig. 14). Fig. 14 shows the typical trajectories of the autonomous control schemes implemented in P3DX-Target. The initial time to the target engaged state varies and basically depends on the start positions of both the P3DX-Chaser and P3DX-Target robots. Therefore, in the present case, the performance of the attention mechanism is measured in terms of the overall duration of the target engaged state (the percentage of total navigation time during which the target is engaged). The performance when chasing targets in open space (wide corridors) is 100% in scenario (a), and close to 100% in scenarios (b) and (c). However, when the target (a, b, or c) performs obstacle avoidance maneuvers close to walls, performance usually falls to 50-70%. In these situations the chaser also has to avoid obstacles, eventually causing the loss of the target.
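The metric just described reduces to a simple ratio; the helper below is an assumed evaluation utility, not part of the published system.

```python
def engaged_ratio(engaged_intervals, total_time: float) -> float:
    """Percentage of navigation time spent in the target engaged state.

    engaged_intervals: list of (start, end) times with the target engaged.
    """
    engaged = sum(end - start for start, end in engaged_intervals)
    return 100.0 * engaged / total_time
```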
VI. CONCLUSION AND FUTURE WORK

A novel attention mechanism for autonomous robots has been proposed, and preliminary testing has been done in the domain of simple mobile object recognition and chasing. The integration of the attention cognitive function into a layered control architecture has been demonstrated. Additionally, the problem of multimodal sensory information fusion has been addressed in the proposed approach using a generic context formation mechanism. Preliminary results obtained with the simulator show that this account is applicable to classical mobile robotics problems. Nevertheless, moving to a real world environment and facing more demanding missions including localization requirements would imply dealing with the problem of imperfect odometry [18]. In such a scenario, our proposed attention mechanism would have to be integrated into a SLAM (Simultaneous Localization and Mapping) system.

The attention mechanism proposed in this work is designed to be highly dynamic and configurable. Following the same principles described above, more contexts can be created as more contextualization criteria are defined in the system. The concrete definition of criteria and contexts is to be selected based on the specific problem domain.
The system described in this paper is work in progress. Counterpart recognition is currently based on color and movement detection; however, we are working on adding the recognition of other visual and higher cognitive properties in order to build a more robust mechanism. Concretely, visual texture recognition, vertical symmetry detection, and movement pattern identification are expected to greatly improve robustness in real world environments. Furthermore, a mechanism for the detection of structurally coherent visual information could be implemented as part of the proposed attention mechanism. Complex percepts formed to represent unique objects could be evaluated in terms of their structural coherence, as the human brain seems to do [19]. More complex attentional contexts (and therefore more contextual criteria) have to be defined in order to face other problem domains. Perception is well covered for the sonar range finder and bumpers. However, additional development is required in the CERA Physical Layer in order to add more functionality to visual perception, e.g. visual distance estimation. The definition of behavioral contexts and complex behaviors should also be enhanced to cope with more complex actuators and to generate more efficient behaviors. At the level of the CERA Core Layer, learning mechanisms could be applied in order to improve the attention selection technique. Moreover, the attention mechanism is to be integrated with other Core Layer modules, like the memory and self-coordination modules, in order to use the required related information for the activation of appropriate contexts in the Instantiation and Physical layers.

Given the need to define complex spatiotemporal relations in the process of attentional context formation, the application of fuzzy temporal rules will be considered, as they have proved to be an effective method in landmark detection (like doors) for mobile robots [20].

ACKNOWLEDGMENT

This research work has been supported by the Spanish Ministry of Education and Science CICYT under grant TRA2007-67374-C02-02.

REFERENCES

[1] P. J. Burt, "Attention mechanisms for vision in a dynamic world," pp. 977-987, vol. 2, 1988.