provide any algorithmic description of attention, but merely a metaphorical explanation. A theater spotlight simile is used to represent the focus of consciousness: the spotlight illuminates only a small part of the scene, which is considered the conscious content of the mind. The scene itself is built upon the subject's working memory. The movement of the spotlight, i.e. the selection of the contents that will be used for volition and action, is directed by unconscious contextual systems. The aim of the work described in this paper is to design and test an implementation of such contextual systems, which are able to adaptively direct attention toward the interesting areas of the robot's sensorimotor space.
From the point of view of perception, contexts are sets of percepts retrieved from the sensors. Percepts are considered the minimal information units obtained by the robot's sensory machinery [5]. Therefore, a sensory context can be used to build a complex percept composed of related single percepts. From the point of view of behavior, contexts define sets of actions available for execution; hence, we can define behavioral contexts as possible compositions of related actions. In order to generate efficient robot behavior, both sensory contexts and behavioral contexts have to be adaptively generated.
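To make these definitions concrete, the sketch below shows one possible encoding of single percepts and their composition into complex percepts. The field names and types are illustrative assumptions on our part, not the actual CERA data structures.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class SinglePercept:
    """Minimal information unit produced by the robot sensory machinery."""
    timestamp: float               # estimated acquisition time (seconds)
    location: Tuple[float, float]  # position relative to the robot body
    color: Optional[str] = None    # dominant color label, if visual
    moving: bool = False           # visual motion flag

@dataclass
class ComplexPercept:
    """Composition of related percepts selected by a sensory context."""
    parts: list = field(default_factory=list)  # single or nested complex percepts
```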
A. Visual Field Segmentation

The first stages in visual sensor data processing are concerned with attentional context definition. Concretely, instead of applying a full preprocessing task to the entire image captured by the camera sensor, each incoming frame is fragmented into smaller regions. Subsequently, only one selected fragment (the foveal region) is further processed, thus greatly reducing the processing requirements of the visual sensor preprocessor. Additionally, as explained below, this strategy allows the robot to focus attention on specific visual regions in further processing stages as well. Nevertheless, before the preprocessing stage, when context criteria are evaluated, all visual data packages (frame segments) are processed equally. This strategy is similar to the way the human visual system processes the foveal region, which is much richer in resolution and detail than the retinal periphery. Humans use the fovea to fixate on an object and specifically process its image while maintaining a much less demanding process for peripheral regions [7]. This very same strategy has also been successfully applied in other artificial systems, e.g. [8].
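A minimal sketch of this segmentation strategy follows. The 8x8 grid size and the red-saliency criterion are assumptions chosen for illustration (the paper labels segments such as S5,5, suggesting grid indexing, but does not fix these parameters here).

```python
import numpy as np

def fragment_frame(frame: np.ndarray, rows: int = 8, cols: int = 8):
    """Split a camera frame (H x W x 3) into a rows x cols grid of segments."""
    h, w = frame.shape[0] // rows, frame.shape[1] // cols
    return {(i, j): frame[i*h:(i+1)*h, j*w:(j+1)*w]
            for i in range(rows) for j in range(cols)}

def red_saliency(segment: np.ndarray) -> float:
    """Cheap per-segment criterion: mean red-channel dominance."""
    r, g, b = segment[..., 0], segment[..., 1], segment[..., 2]
    return float(np.mean(r.astype(int) - (g.astype(int) + b.astype(int)) / 2))

def select_fovea(frame: np.ndarray):
    """Evaluate all segments cheaply, then fully process only the winner."""
    segments = fragment_frame(frame)
    winner = max(segments, key=lambda k: red_saliency(segments[k]))
    return winner, segments[winner]  # only this fragment gets full preprocessing
```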
B. Context Criteria

We have designed the process of context formation as the application of predefined criteria in order to calculate the degree of relation between the potential elements of a given context. Basically, a context should be constructed in such a way that it becomes a meaningful representation of reality, i.e. the interplay between agent and situation must be enforced by a proper definition of both sensory and behavioral contexts. The most basic factors that need to be considered in a correct representation of the robot's situation in the world are time and location. Nevertheless, other factors can be considered depending on the problem domain and the richness of the internal state representation. In the work described here, color and movement properties have been considered as additional criteria; therefore, four criteria have been used for context formation in the experiments described below.

The time criterion refers to the exact moment at which a stimulus is perceived. Therefore, it should be taken as an important criterion for relating one percept to another. Given that different sensors and their associated device drivers can take different time intervals to process the sensory information, a mechanism for time alignment is required. It has been demonstrated that such a time alignment mechanism is present in biological brains [9][10]. Although visual and auditory stimuli are processed at different speeds, the time gap between different processed signals, whose physical originators were acquired at the same time, is automatically removed by the brain [11]. An analogous artificial mechanism has been implemented in the proposed architecture.
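The paper does not detail the artificial time alignment mechanism, so the following is only a plausible sketch: each reading is back-dated by a per-driver latency estimate, and two percepts satisfy the time criterion when their corrected acquisition times fall within a common window. The latency values and window width are invented for illustration.

```python
# Hypothetical per-driver latency estimates (seconds), not measured values.
DRIVER_LATENCY = {"camera": 0.120, "sonar": 0.015}

def acquisition_time(receipt_time: float, sensor: str) -> float:
    """Estimate when the stimulus actually occurred by removing driver delay."""
    return receipt_time - DRIVER_LATENCY[sensor]

def time_related(t_a: float, t_b: float, window: float = 0.05) -> bool:
    """Two percepts satisfy the time criterion if their corrected
    acquisition times fall within the same alignment window."""
    return abs(t_a - t_b) <= window
```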
Location is another fundamental criterion for context formation, as the representation of the position of objects in the world is a requirement for situatedness. Furthermore, the location of an object relative to the robot body (or any other reference frame) is required for generating adaptive behaviors. The relative location of any element in the sensory world is necessary for the integration of complex percepts; additionally, it allows the selection of a given direction of attention toward the most relevant places. The presence of space-coding neurons and the use of reference frames (like somatotopic or head-centered frames) have been demonstrated in the mammal brain [12][13].

In a world where color patterns can be associated with particular objects, this property of entities should be taken into account. Similarly, some objects are mobile while others remain static; consequently, movement is a property that should also be considered as a criterion for relevant context formation. In particular, the task of counterpart recognition has been simplified in the research under discussion by characterizing other peer robots as autonomously moving red and black objects. The presence of specialized areas for color and movement detection has been demonstrated in the visual cortex of the human brain [14].

Following the principles presented above, we have used time, location, color, and motion as fundamental contextualization criteria (jointly sketched after the list below) for the formation of:
• Sensory contexts as compositions of single percepts (complex percepts), and
• Behavioral contexts as compositions of simple actions.
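Building on the earlier SinglePercept sketch, the fragment below illustrates how these four criteria could jointly gate the grouping of percepts. The thresholds and the greedy grouping policy are our assumptions, not the paper's algorithm.

```python
import math

def affinity(p, q, max_dist: float = 0.5, window: float = 0.05) -> bool:
    """Two single percepts are context-related when they agree on
    time, location, color, and motion (assumed thresholds)."""
    close_in_time = abs(p.timestamp - q.timestamp) <= window
    close_in_space = math.dist(p.location, q.location) <= max_dist
    return close_in_time and close_in_space and p.color == q.color and p.moving == q.moving

def group_percepts(percepts):
    """Greedily cluster mutually related percepts; each resulting group
    would then become a ComplexPercept as in the earlier sketch."""
    groups = []
    for p in percepts:
        for g in groups:
            if all(affinity(p, q) for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups
```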
In order to generate these contexts, both single percepts (which are built from data packages obtained from sensors) and simple actions (which are defined as part of the robot control system) are required to incorporate estimated time, location, motion, and color parameters (see Fig. 4). Motion properties could obviously be derived from time and location parameters; however, we have decided to use a natively visual motion detection approach in which motion properties are directly obtained from visual input analysis. In our proposed architecture there are specialized modules designed to calculate time, color, motion, and location parameters: the Timer module maintains a precision clock (less than 1 millisecond resolution) that represents the robot's age, the Proprioception
Fig. 10. Formation of complex percepts and complex behaviors.
Fig. 11. Vectors calculated to build the J-Index of a complex bumper contact percept.
C. Contextualization Hierarchy

The proposed contextualization mechanism supports hierarchical composition; hence, as sketched below, complex percepts can be built by combining:
• A number of single percepts.
• A number of complex percepts.
• Both single and complex percepts.
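Continuing the earlier sketch, this hierarchical composition can be expressed as a percept tree in which a complex percept nests any mix of single and complex percepts; the helper names are ours, not CERA's.

```python
from typing import Union

# A node is either a leaf (SinglePercept) or an inner node (ComplexPercept),
# both as defined in the earlier sketch.
PerceptNode = Union["SinglePercept", "ComplexPercept"]

def compose(*parts: PerceptNode) -> "ComplexPercept":
    """Build a higher-level complex percept from any mix of percepts."""
    return ComplexPercept(parts=list(parts))

def leaves(node: PerceptNode):
    """Recursively enumerate the single percepts under a percept tree."""
    if isinstance(node, ComplexPercept):
        for part in node.parts:
            yield from leaves(part)
    else:
        yield node
```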
complex percept J-Index, and simple actions are generated that will cause the chaser to head towards the target.

Keeping a constant distance to the target is facilitated by the ranging data obtained from sonar. As single percepts from vision and single percepts from sonar share location parameters (j referent vectors), the distance to the target can be estimated by multimodal contextualization. Indeed, the complex percepts that represent the target are composed of both visual and sonar single percepts. These single percepts were associated in the same complex percept because of their affinity in relative location. This means that target complex percepts include sonar ranging data in addition to the visual pose estimation.
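As an illustration of this multimodal estimate (with an invented helper, not the CERA API): a visual complex percept contributes a bearing but no range, so the range is borrowed from the best-aligned sonar single percept.

```python
def fuse_distance(visual_bearing_deg: float, sonar_percepts) -> float:
    """Pick the sonar reading whose bearing is closest to the visual bearing.

    sonar_percepts: iterable of (bearing_deg, range_mm) pairs.
    """
    bearing, range_mm = min(
        sonar_percepts,
        key=lambda s: abs(s[0] - visual_bearing_deg),
    )
    return range_mm  # distance estimate assigned to the multimodal percept
```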
Fig. 13 shows an example of both visual and sonar data as ingested in the CERA Physical Layer, where the associated time and location parameters are calculated. Then, sensor preprocessors build single percepts including timestamps and J-Indexes. All generated single percepts enter the CERA Physical workspace, where the complex percepts are built based on the currently active contexts. Active contexts are established by the higher control system running in the CERA Core Layer. The scenario depicted in Fig. 13 corresponds to the chasing task; hence a context for red objects is active (in addition to location and time, which are always used as contextualization criteria). The right side of the figure shows an example of the complex percepts that are formed by the application of the mentioned contextualization mechanism. Single percepts corresponding to visual segments S5,5 and S6,5 are selected because they present saliency in terms of the red color contextualization criterion. Given that their j referent vectors happen to be contiguous, a new monomodal (visual) complex percept is built as a composition of them. As shown in the picture, the J-Index of the new monomodal complex percept points to the geometrical center of the visual segment formed as a combination of the two former single percept segments. It can be noticed that the J-Index of this visual complex percept does not spot the actual center of the target, but the approximation is good enough for the real-time chasing task. Once monomodal complex percepts have been built, time and location contextualization is applied amongst different modalities.

Fig. 13. Upper left image is a representation of the segmented visual sensory information acquired by the simulated onboard camera. Lower left graph is a capture from the CERA graphical user interface displaying real-time sonar transducer measurements. The sonar ranging capture corresponds to the particular instant when the camera acquired the image depicted above. The right hand side of the picture shows a representation of the multimodal complex percept being built with this sensory information.

The bottom right representation in the picture (Fig. 13) corresponds to the sonar j referent vectors, including a highlighted single percept (the one obtained from the reading of the sonar transducer oriented at +10°). The projection outlined top-down from the visual complex percept to this sonar single percept indicates that both percepts are to be associated and will form a multimodal complex percept. Time association is obvious; however, the location contextualization between visual and sonar percepts requires some additional parametric alignment, as these different sensor modalities present particular orientations and spans. Furthermore, as explained above, only the X coordinate is considered for visual percepts (the visual horizontal axis). While we have used a 90° field of view camera, the Pioneer 3DX robot frontal sonar ring covers a total field of 195° (including blind angles between transducer cones). Therefore, only percepts originated from the central 90° of sonar coverage are taken into account for visual-to-sonar contextualization. Black dashed lines on the right hand side of the figure represent the alignment between the visual horizontal axis and the central −45° to +45° angular span of the frontal sonar. In this case, the value of SH in the visual complex percept corresponds to the sonar percept originated from the sonar transducer at +10°. The measurement represented in this particular sonar percept (2493 millimeters) is directly the distance estimate assigned to the multimodal complex percept, as the visual percept itself does not provide any distance estimate.
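The parametric alignment described above can be sketched as follows, assuming the Pioneer 3DX frontal sonar bearings (±10°, ±30°, ±50°, ±90°) and a camera whose 90° field of view maps linearly onto the image width; only transducers within the central ±45° span are candidates.

```python
# Assumed Pioneer 3DX frontal sonar ring layout (degrees, left negative).
SONAR_BEARINGS_DEG = [-90, -50, -30, -10, 10, 30, 50, 90]

def visual_x_to_bearing(x: float, image_width: float, fov_deg: float = 90.0) -> float:
    """Map a horizontal image coordinate to a bearing in degrees."""
    return (x / image_width - 0.5) * fov_deg

def matching_transducer(x: float, image_width: float) -> float:
    """Return the bearing of the sonar transducer aligned with pixel column x."""
    bearing = visual_x_to_bearing(x, image_width)
    central = [b for b in SONAR_BEARINGS_DEG if -45.0 <= b <= 45.0]
    return min(central, key=lambda b: abs(b - bearing))
```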
Preliminary results obtained applying the proposed attention mechanism to the human-controlled target chasing task are promising; however, more complex environments have to be tested in order to appreciate the real potential of the cognitive approach. In addition to the manual control of P3DX-Target, which produces very variable results, three simple autonomous behaviors have been implemented with the aim of testing the capability of the attention mechanism when confronted with different movement patterns (scenarios a, b, and c depicted in Fig. 14). Fig. 14 shows the typical trajectories of the autonomous control schemes implemented in P3DX-Target. The initial time to the target engaged state varies and basically depends on the start positions of both the P3DX-Chaser and P3DX-Target robots. Therefore, in the present case, the performance of the attention mechanism is measured in terms of the overall duration of the target engaged state (the percentage of total navigation time during which the target is engaged). The performance when chasing targets in open space (wide corridors) is 100% in scenario (a), and close to 100% in scenarios (b) and (c). However, when the target (a, b, or c) performs obstacle avoidance maneuvers close to walls, performance usually falls to 50-70%. In these situations the chaser also has to avoid obstacles, eventually causing the loss of the target.
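The metric just described reduces to a simple ratio; the helper below is an assumed evaluation utility, not part of the published system.

```python
def engaged_ratio(engaged_intervals, total_time: float) -> float:
    """Percentage of navigation time spent in the target engaged state.

    engaged_intervals: list of (start, end) times with the target engaged.
    """
    engaged = sum(end - start for start, end in engaged_intervals)
    return 100.0 * engaged / total_time
```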
VI. CONCLUSION AND FUTURE WORK

A novel attention mechanism for autonomous robots has been proposed, and preliminary testing has been done in the domain of simple mobile object recognition and chasing. The integration of the attention cognitive function into a layered control architecture has been demonstrated. Additionally, the problem of multimodal sensory information fusion has been addressed in the proposed approach using a generic context formation mechanism. Preliminary results obtained with the simulator show that this account is applicable to classical mobile robotics problems. Nevertheless, moving to a real world environment and facing more demanding missions including localization requirements would imply dealing with the problem of imperfect odometry [18]. In such a scenario, our proposed attention mechanism would have to be integrated into a SLAM (Simultaneous Localization and Mapping) system.

The attention mechanism proposed in this work is designed to be highly dynamic and configurable. Following the same principles described above, more contexts can be created as more contextualization criteria are defined in the system. The concrete definition of criteria and contexts is to be selected based on the specific problem domain.
The system described in this paper is work in progress. Counterpart recognition is currently based on color and movement detection; however, we are working on adding the recognition of other visual and higher cognitive properties in order to build a more robust mechanism. Concretely, visual texture recognition, vertical symmetry detection, and movement pattern identification are expected to greatly improve robustness in real world environments. Furthermore, a mechanism for the detection of structurally coherent visual information could be implemented as part of the proposed attention mechanism. Complex percepts formed to represent unique objects could be evaluated in terms of their structural coherence, as the human brain seems to do [19]. More complex attentional contexts (and therefore more contextual criteria) have to be defined in order to face other problem domains. Perception is well covered for the sonar range finder and bumpers. However, additional development is required in the CERA Physical Layer in order to add more functionality to visual perception, e.g. visual distance estimation. The definition of behavioral contexts and complex behaviors should also be enhanced to cope with more complex actuators and to generate more efficient behaviors. At the level of the CERA Core Layer, learning mechanisms could be applied in order to improve the attention selection technique. Moreover, the attention mechanism is to be integrated with other Core Layer modules, like the memory and self-coordination modules, in order to use the required related information for the activation of appropriate contexts in the Instantiation and Physical layers.

Given the need to define complex spatiotemporal relations in the process of attentional context formation, the application of fuzzy temporal rules will be considered, as they have proved to be an effective method in landmark detection (like doors) for mobile robots [20].

ACKNOWLEDGMENT

This research work has been supported by the Spanish Ministry of Education and Science CICYT under grant TRA2007-67374-C02-02.

REFERENCES

[1] P. J. Burt, "Attention mechanisms for vision in a dynamic world," pp. 977-987, vol. 2, 1988.