Intention-based Robot Control in Social Games
Christopher Crick ([email protected])
Yale Department of Computer Science, 51 Prospect St.
New Haven, CT 06511 USA

Brian Scassellati ([email protected])
Yale Department of Computer Science, 51 Prospect St.
New Haven, CT 06511 USA
Abstract
We present a novel, sophisticated intention-based control system for a mobile robot built from an extremely inexpensive
webcam and radio-controlled toy vehicle. The system visually
observes humans participating in various playground games
and infers their goals and intentions by analyzing their spatiotemporal activity in relation to the robot and to each other, and
then builds a coherent narrative out of the succession of these
intentional states. Starting from zero information about the
room, the rules of the games, or even which vehicle it controls,
it learns rich relationships between players, their goals and intentions, probing uncertain situations with its own behavior.
The robot is able to watch people playing various playground
games, learn the roles and rules that apply to specific games,
and participate in the play. The narratives it constructs capture
essential information about the observed social roles and types
of activity. After watching play for a short while, the system is
able to participate appropriately in the games. We demonstrate
how the system acts appropriately in scenarios such as chasing,
follow-the-leader, and variants of tag.
Keywords: Artificial Intelligence, Interactive Behavior,
Learning, Social Cognition, Robotics
Introduction
Humans have a powerful ability to make sense of the world
using very rudimentary sensory cues. We can watch children
from down the street, and know instantly whether they’re
playing amicably or if we need to prepare to deal with torn
jeans and tears. We can sit in the nosebleed bleachers and
enjoy a football game, even though the players are nothing
more than small colored blobs. We can navigate the house by
a four-watt nightlight and (usually) pilot automobiles through
traffic in the dark and the fog. We usually can make do with
even less. Two-thirds of a century ago, Heider and Simmel
found that animated boxes on a flat white screen are enough to
trigger this inference process (Heider & Simmel, 1944). We
easily spin stories about sterile geometric shapes, assigning
them intentions, personalities and goals. Given the chance,
we happily take control of these nondescript avatars to play
out our own intentions and desires, whether in the context of
psychological research (Gigerenzer & Todd, 1999), or simply
in relaxing video games.
Making sense of very low-context motion data is an important cognitive task that we perform every day, an irrepressible instinct that develops quickly in children, around the age
of 9 months (Rochat, Striano, & Morgan, 2004). This low-level processing skill is quickly followed by the development
of other social skills (Csibra, Gergely, Biro, Koos, & Brockbank, 1999), such as the attribution of agency and intentionality. It depends on very little information from the world
– so little, in fact, that we can have some hope of designing computational processes that can manipulate this manageable quantity of data to accomplish similar results. What’s more, this can be accomplished quickly enough to serve as a control system for a robot, enabling us to explore the relationship between watching a game and participating in it. When taking an active part, the system can probe uncertainties in its learning, collapse ambiguity by performing experiments, and explore how motor control relates to social interaction (Wolpert, Doya, & Kawato, 2003).

Figure 1: The robot-controlled toy truck
Our work also draws from and contributes to investigations
of the fundamental cognitive processing modules underpinning perception and interpretation of motion. These modules appear responsible for our rapid and irresistible computation of physics-based causality (Choi & Scholl, 2006), as well as facile, subconscious individuation of objects in motion independently of any association with specific contextual features (Leslie, Xu, Tremoulet, & Scholl, 1998; Scholl, 2004; Mitroff & Scholl, 2004). Furthermore, different processing modules appear to attend to different levels of detail in a scene, including the global, low-context motion used by our system (Loucks & Baldwin, 2008).
The specific analysis undertaken by our system, hypothesizing vectors of attraction and repulsion between agents and
objects in the world in order to explain the causal relationships we note in an interaction, relates to the dynamics-based
model of causal representation proposed by Wolff (Wolff,
2007) and on Talmy’s theory of force dynamics (Talmy,
1988). As Talmy notes, the application of force has a great
impact (no pun intended) on our understanding of the semantics of interaction, and on our ideas about causality, intention
and influence. Humans can explain many events and interactions by invoking a folk-physics notion of force vectors acting upon objects and agents. This holds not only for obviously physical systems (we talk easily of how wind direction affects the motion of a sailboat), but for social interactions as well (the presence of a policeman can be interpreted – and in fact is described by the same vocabulary – as a force opposing our desire to jaywalk). Our system explicitly generates these systems of forces in order to make sense of the events it witnesses.

Figure 2: System components and information flow

Figure 3: Circuit diagram for computer-controlled toy RC car
This work represents the latest step in our efforts to model
a computationally tractable piece of human social cognition
and decision-making. Within the constraints of its conceptual
framework, our robot comprises a complete functional entity,
from perception to learning to social interaction to mobility.
Earlier versions of this system – lacking the ability to participate bodily in the observed games – are fully described in
(Crick, Doniec, & Scassellati, 2007) and (Crick & Scassellati,
2008).
System Description
The system involves a number of interconnected pieces, depicted in Figure 2. Each component is described below in
turn.
Vision
The system employs a simple but robust method of tracking
the players as they move through the play space. Using an
inexpensive USB webcam mounted on a stand in such a way
as to provide a complete image of the floor of the room, the
system uses naive background subtraction and color matching
to track the brightly-colored vehicles. Before play begins,
the camera captures a 640x480 pixel array of the unoccupied
room for reference. During a game, 15 times a second, the
system examines the raster of RGB values from the webcam
and looks for the maximum red, green and blue values that
differ substantially from the background and from the other
two color channels of the same pixel. These maximum color
values are taken to be the positions within the visual field
of the three vehicles – one painted red, one blue, and one
green (by design). Obviously, this is not a general-purpose
or sophisticated visual tracking algorithm, but it is sufficient to generate the low-context percepts that are all our cognitive model requires.

Note that the camera is not overhead. The information coming to the robot is a trapezoid with perspective foreshortening. It would be possible to perform a matrix transformation to convert pixel positions to Cartesian geospatial ones, but our system does not go to the computational expense of doing so. The image may be distorted, but only in a linear way, and the vector calculations described below work the same, whether in a perspective frame or not.
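As a rough illustration, the sketch below shows how such per-frame color tracking might be implemented with numpy, assuming 640x480 RGB arrays from the webcam. The difference threshold and the channel-dominance margin are illustrative values, not the system's actual constants.

```python
import numpy as np

def track_vehicles(frame, background, threshold=40, dominance=30):
    """Return (x, y) pixel positions of the red, green and blue cars.

    frame, background: H x W x 3 uint8 RGB arrays (640x480 in our setup).
    threshold and dominance are illustrative tuning values (assumptions).
    """
    diff = frame.astype(int) - background.astype(int)
    positions = {}
    for idx, name in enumerate(("red", "green", "blue")):
        channel = diff[:, :, idx]
        other_channels = np.delete(frame.astype(int), idx, axis=2)
        # Pixels that differ from the background and dominate the other two channels
        mask = (channel > threshold) & \
               (frame[:, :, idx].astype(int) - other_channels.max(axis=2) > dominance)
        score = np.where(mask, channel, -1)
        y, x = np.unravel_index(np.argmax(score), score.shape)
        positions[name] = (x, y) if mask.any() else None
    return positions
```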
Motor Control
In order not only to observe but to participate in activities,
we provided our system with a robotic avatar in the form of a
$20 toy remote-controlled car. By opening up the plastic radio controller and wiring in transistors to replace the physical
rocker switches that control the car’s driving and steering, and
connecting these wires to controllable voltage pins on a computer’s serial ports, we turned the system into a high-speed (7
m/s) robot. See Figure 3 for wiring details.
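For concreteness, a minimal pyserial sketch of this style of control follows. The assignment of forward, reverse, left and right to the RTS and DTR lines of two serial ports follows the spirit of Figures 2 and 3, but the exact mapping, polarity, and port names here are assumptions for illustration.

```python
import serial

# Hypothetical pin assignments: the toy controller's rocker switches are
# replaced by transistors driven from the serial ports' RTS/DTR lines.
drive_port = serial.Serial("COM1")   # RTS -> forward, DTR -> reverse (assumed)
steer_port = serial.Serial("COM2")   # RTS -> left,    DTR -> right   (assumed)

def command(forward=False, reverse=False, left=False, right=False):
    """Assert the serial control lines corresponding to the desired command."""
    drive_port.rts = forward
    drive_port.dtr = reverse
    steer_port.rts = left
    steer_port.dtr = right

command(forward=True, left=True)   # drive forward while steering left
command()                          # release all switches
```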
The controller is quick and reactive. The system maintains
the position history over the previous 220 milliseconds – three position reports, including the current one. With this information,
it computes an average velocity vector and compares it with
the intended vector given by the own-goal system described
further below. Depending on the current direction of drive
and the angle of difference between the actual and intended
vectors, a set of commands is sent to the robot as shown in
Figure 4.

Figure 4: Robot directional control. The goal vector is compared to the computed vector of motion.
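A sketch of this angle-based command selection is given below. The 45- and 135-degree sector boundaries and the steering sign convention are assumptions for illustration; Figure 4 gives the actual mapping used.

```python
import numpy as np

def choose_command(goal_vec, motion_vec, driving_forward=True):
    """Map the angle between the intended (goal) vector and the recent motion
    vector to a drive/steer command, in the spirit of Figure 4."""
    diff = np.degrees(np.arctan2(goal_vec[1], goal_vec[0]) -
                      np.arctan2(motion_vec[1], motion_vec[0]))
    diff = (diff + 180) % 360 - 180            # normalize to [-180, 180)
    if abs(diff) < 45:                         # goal roughly ahead: keep going
        return ("forward", None) if driving_forward else ("reverse", None)
    if abs(diff) > 135:                        # goal roughly behind: back up
        return ("reverse", None) if driving_forward else ("forward", None)
    steer = "left" if diff > 0 else "right"    # goal off to one side
    if not driving_forward:                    # steering geometry flips in reverse
        steer = "right" if diff > 0 else "left"
    return ("forward" if driving_forward else "reverse", steer)
```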
Reactive collision avoidance
The room’s walls obviously have an effect on the motions of
the players, since their actions are constrained by the physical dimensions of the space. We chose to deal with wall
avoidance in simple fashion. If the robot approaches too near
the edge of the play area, a reactive behavior emerges that is
independent of the goal state: if the robot is located within
a certain number of pixels of the edge of the play area, an
emergency goal vector pointing straight out from the wall or corner supersedes whatever the robot had been trying to do beforehand. This danger area ranged from 30 pixels wide at the bottom of the image (closest to the camera) to 18 pixels near the top. Interestingly, several study participants noted the robot’s ability to avoid running into walls, claiming that the robot was a much better driver than they were!
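A minimal sketch of this wall-avoidance override, assuming image coordinates with the origin at the top of the frame and a linearly interpolated margin between the 18- and 30-pixel values given above:

```python
import numpy as np

def emergency_vector(pos, width=640, height=480):
    """Return a goal vector pointing away from any wall the robot is too close
    to, or None if it is safely inside the play area."""
    x, y = pos
    # Margin shrinks with height in the image to roughly compensate for
    # perspective: 30 px near the bottom of the frame, 18 px near the top.
    margin = 18 + (30 - 18) * (y / height)     # linear interpolation (an assumption)
    push = np.zeros(2)
    if x < margin:            push[0] = 1.0    # too close to the left wall
    elif x > width - margin:  push[0] = -1.0   # too close to the right wall
    if y < margin:            push[1] = 1.0    # too close to the top edge
    elif y > height - margin: push[1] = -1.0   # too close to the bottom edge
    return push if push.any() else None
```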
Self-other Identification
The system does not immediately know which salient object in
its visual field “belongs” to itself. The playing area contains
three different-colored toy cars, but it controls only one. Using a technique described in (Gold & Scassellati, 2005) for
robotic self-recognition, the system sends out a few random
motor commands and detects which of the perceived objects
responds in a correlated fashion. The system sends a brief
pulse (200 ms) of the command for “forward”, followed by
a similar command for “back”, repeating as necessary. At
the same time, the system inspects the visual field for the positions of the three salient colorful objects, looking for one
moving predictably forward and back in time with the commands (finding and computing the necessary motion vectors
are a byproduct of the analysis described in the next section).
In this way, the system identifies itself for the duration of the
exercise. Although the process would theoretically continue
for as long as necessary, we found that throughout our experiments it never took more than one forward and reverse
command for reliable identification.
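A simplified sketch of this contingency-based self-identification follows; the motor and vision interfaces (send_command, get_positions) are hypothetical stand-ins for the components described above.

```python
import time
import numpy as np

def identify_self(send_command, get_positions, pulse=0.2, trials=1):
    """Find which tracked car responds to our motor commands.

    send_command("forward" | "reverse" | "stop") and
    get_positions() -> {color: (x, y)} are assumed interfaces.
    """
    scores = {}
    for direction, sign in [("forward", 1), ("reverse", -1)] * trials:
        before = get_positions()
        send_command(direction)
        time.sleep(pulse)                      # 200 ms pulse, as in the text
        send_command("stop")
        after = get_positions()
        for color in before:
            dx = np.subtract(after[color], before[color])
            # Displacement that flips sign with the command accumulates for
            # the car we control and tends to cancel for the others.
            scores[color] = scores.get(color, np.zeros(2)) + sign * dx
    return max(scores, key=lambda c: np.linalg.norm(scores[c]))
```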
Notably, this is precisely the same procedure
used by the human participants, who were each handed a remote controller without being told which of the three cars
they were to drive. Invariably, the participant worked the
controls forward and backward, watching the playing area to
note which car acted as directed. The system has access to
no privileged information about what it sees, no more than an
undergraduate test subject walking into the lab space for the
first time.
Motion Vector Analysis
Having determined which vehicle it is driving, the system begins to observe the behavior of the others to work
out the rules of the game. For each of the other two participants in the game, the system calculates the “influence” of
the remaining players (including itself) on the first person’s
perceived two-dimensional motion, expressed as constants in
a pair of differential equations:
$$V_{x_i}^n = \frac{c_{x_j}\,(x_j^n - x_i^n)}{d_{ij}^n} + \frac{c_{x_k}\,(x_k^n - x_i^n)}{d_{ik}^n} + \cdots \tag{1}$$
(and similarly for the y dimension). It obtains the (noisy)
velocities in the x and y direction, as well as the positions of
the other vehicles, directly from the visual data:
$$V_{x_i}^n = \frac{x_i^{n+1} - x_i^n}{t^{n+1} - t^n} \tag{2}$$
(again, also for the y dimension). Here, $V_{x_i}^n$ represents the x component of agent i’s velocity at time n. $x_i^n$, $x_j^n$ and $x_k^n$ are the x coordinates of agents i, j and k respectively, at time n. Likewise, $d_{ij}^n$ and $d_{ik}^n$ are the Euclidean distances between i and j or i and k at time n.
This results in an underconstrained set of equations; thus to
solve for the constants we collect all of the data points falling
within a short window of time and find a least-squares best fit.
The visual system runs at 15 Hz; we found that a window of
220 milliseconds (about three position reports) worked best –
coincidentally near the accepted average human reaction time
(Laming, 1968).
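A sketch of this windowed least-squares fit, applying Equations 1 and 2 directly; the small epsilon guarding against division by zero is a detail of the sketch, not the paper.

```python
import numpy as np

def influence_constants(positions, times, agent_i):
    """Estimate the attraction/repulsion constants acting on agent_i
    (Equation 1) from a short window of tracked positions.

    positions: dict agent id -> (T, 2) array of (x, y) over the window
    times: length-T array of timestamps
    Returns a dict mapping every other agent id -> (c_x, c_y).
    """
    others = [a for a in positions if a != agent_i]
    p_i = positions[agent_i]
    rows_x, rows_y, vx, vy = [], [], [], []
    for n in range(len(times) - 1):
        dt = times[n + 1] - times[n]
        v = (p_i[n + 1] - p_i[n]) / dt         # observed velocity (Equation 2)
        vx.append(v[0]); vy.append(v[1])
        row_x, row_y = [], []
        for j in others:
            d = np.linalg.norm(positions[j][n] - p_i[n]) + 1e-6
            row_x.append((positions[j][n, 0] - p_i[n, 0]) / d)
            row_y.append((positions[j][n, 1] - p_i[n, 1]) / d)
        rows_x.append(row_x); rows_y.append(row_y)
    cx, *_ = np.linalg.lstsq(np.array(rows_x), np.array(vx), rcond=None)
    cy, *_ = np.linalg.lstsq(np.array(rows_y), np.array(vy), rcond=None)
    return {j: (cx[k], cy[k]) for k, j in enumerate(others)}
```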
Belief State Calculation
Each constant determined by the process described above represents in some fashion the influence of one particular player
on the motion of another at a particular point in time. Some
of these may be spurious relationships, while others capture
something essential about the motivations and intentions of
the agents involved.
To determine the long-term relationships that do represent
essential motivational information, we next assemble these
basic building blocks – the time-stamped pairwise constants
that describe instantaneous attraction and repulsion between
each agent and object in the room – into a probabilistic finite
state automaton, each state representing a set of intentions
that extend over time. At any particular point in time, any
particular agent may be attracted or repelled or remain neutral with respect to each other object and agent in the room;
this is characterized by the pairwise constants found in the
previous step. The system assumes that the actors in the room
remain in a particular intentional state as long as the pattern of
hypothesized attractions, repulsions and neutralities remains
constant, discounting noise. A particular state, then, might
be that Red is attracted by Blue and neutral toward Green,
Blue is repelled by Red and neutral toward Green, and Green
is repelled by Red and neutral toward Blue. This state might
occur, for instance, in the game of tag when Red is “it” and
has decided to chase Blue.
The system maintains an evolving set of beliefs about the
intentions of the people it observes, modeled as a probability distribution over all of these possible states. As new data
comes in, the current belief distribution is adjusted, and the
system assumes that the most likely alternative reflects the
current state of the game.
$$\mathrm{Bel}_n(S) = \frac{\mathrm{Bel}_{n-1}(S)\,\left(1 + \lambda \sum_{c \in S} s(c_n)\right)}{Z} \tag{3}$$
Here, the belief in any particular state $S$ at time $n$ is the belief in that state at time $n-1$, modified by the current observation. $c_n$ is the value at time $n$ of one of the pairwise relationship constants derived from the data in the previous step; the function $s$ is a sign function that returns 1 if the constant’s sign and the intention represented by the current state agree, -1 if they disagree, and 0 if the state is neutral toward the pairwise relationship represented by the constant. $\lambda$ is a “learning rate” constant which affects the tradeoff between the system’s sensitivity to error and its decision-making speed. The magnitude of this factor ranges between 0.04 and 0.12, depending on whether the system is simply observing or is actively participating and trying out hypotheses (see the following section). Finally, $Z$ is a normalizing constant obtained by summing the updated belief values across all states.
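A minimal sketch of the update in Equation 3, assuming the belief distribution, the fitted constants, and the candidate states are stored in plain dictionaries:

```python
def update_beliefs(beliefs, constants, states, lam=0.08):
    """One step of the belief update in Equation 3.

    beliefs: dict state_name -> probability
    constants: dict (agent, target) -> fitted constant c_n from the motion analysis
    states: dict state_name -> {(agent, target): +1 (attract), -1 (repel), 0 (neutral)}
    lam: learning rate; the paper uses 0.04 to 0.12 depending on whether the
    system is merely observing or actively participating.
    """
    new_beliefs = {}
    for name, intentions in states.items():
        agreement = 0
        for pair, intention in intentions.items():
            if intention == 0 or pair not in constants:
                continue                     # neutral relationships contribute 0
            c = constants[pair]
            agreement += 1 if (c > 0) == (intention > 0) else -1
        new_beliefs[name] = beliefs[name] * (1 + lam * agreement)
    z = sum(new_beliefs.values())            # normalizing constant Z
    return {name: b / z for name, b in new_beliefs.items()}
```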
Own Goal Determination
As the system begins to observe its human partners, it develops a belief distribution over their possible intentional states.
Because it controls a robot of its own, the system is then able
to probe the likeliest candidate states. It chooses the belief
state it has rated most likely, and acts in such a way as to confirm or reject the hypothesis. It adjusts its beliefs accordingly, and more decisively than if it were not participating.
For example, say that the system had the highest degree
of belief in the following state: Green was chasing Red and
ignoring Blue, while Red was fleeing from both Green and
Blue. To probe this state of affairs, the system would drive
Blue toward Red. If Red continued to move away from Blue
and Green did not react, the system’s degree of belief in this
state would further increase; if the other players reacted in
some other way, the belief would subside, eventually to be
replaced by another belief state judged more likely.
The ability to participate in and change the course of the
game is a powerful tool for efficient learning. Machine learning theory is full of algorithms which perform much better
when they are allowed to pose queries, rather than simply
passively receiving examples (Angluin, 1988). Our system
possesses an analogous ability, querying its environment and settling ambiguities in its beliefs by manipulating its own
intentions and behaviors. At the same time, it watches for
the effects on others’ behaviors of the social forces brought
into play by its actions. We show the effectiveness of such
participation below.
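A rough sketch of one way such a probe might be chosen, using the same state representation as in the belief-update sketch above. The heuristic of driving toward a player hypothesized to be fleeing follows the example in the text, but the policy details here are illustrative assumptions.

```python
import numpy as np

def choose_probe(beliefs, states, positions, me="blue"):
    """Pick a goal vector that tests the currently most-believed state."""
    best = max(beliefs, key=beliefs.get)
    for (agent, target), intention in states[best].items():
        if agent != me and intention == -1:     # hypothesis: 'agent' is fleeing 'target'
            # Drive toward the supposedly fleeing player and watch the reaction.
            goal = np.subtract(positions[agent], positions[me])
            return best, goal / (np.linalg.norm(goal) + 1e-6)
    return best, np.zeros(2)                    # nothing to probe; hold position
```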
Narrative Construction
The process described in the preceding sections converts instantaneous velocity vectors derived from somewhat noisy video
into sustained beliefs about the intentional situation that pertains during a particular phase of an interaction. As the action progresses, so too do the system’s beliefs evolve, and as
those beliefs change, the sequence of states becomes a narrative describing the scenario in progress. This narrative can be
analyzed statistically to identify the action in progress, differentiate it from other possible activities, and also provide
the system with clues to use in unsupervised feature detection. It can collect statistics about which states commonly
follow which others (a prerequisite for developing the ability to recognize distinct activities). And it identifies points in
time where important events take place, which will allow the
system to notice information about the events themselves.
For this particular set of scenarios involving playground-like games, we set the system to look for game rules by observing the relative positions of the participants during the
crucial moments of a belief state change, and to search for
correlations between the observed distances and the particular state change. Distance is only one feature that could be
considered, of course, but it is a common-enough criterion in
the world of playground games to be a reasonable choice for
the system to focus on. If the correlations it observes between
a particular state transition and a set of relative distances are
strong enough, it will preemptively adjust its own behavior
according to the transition it has learned, thus playing the
game and not only learning it.
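As a bookkeeping sketch, the following illustrates how the distances recorded at belief-state changes might be grouped by transition so that such correlations can later be tested; the data structures are assumptions for illustration.

```python
def transition_distance_stats(narrative, distance_log):
    """Group pairwise distances observed at the moment of each belief-state
    change, keyed by the transition (old state, new state).

    narrative: list of (time, state_name) belief-state changes, in order
    distance_log: dict time -> {(agent, agent): pixel distance at that time}
    """
    stats = {}
    for (t0, s0), (t1, s1) in zip(narrative, narrative[1:]):
        stats.setdefault((s0, s1), []).append(distance_log[t1])
    return stats
```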
Experiments
We tested the system in a 20x20-foot lab space with an open
floor. We ran trials on three separate occasions, with two human subjects driving the red and green remote-controlled cars
and the system controlling the blue one. We also ran one additional control trial with three human drivers and no robot-controlled car. The subjects themselves were in the room with
the vehicles, but seated against the wall behind the camera’s
field of view. Each set of trials involved different people as
drivers. Data from the first experiment were collected during
each trial; the final experiment involving modified tag was
conducted only during the last trial.
Chasing and Following
The first game we tested was simple. Each player had only
one unchanging goal. The driver of the red car was asked to
stay as far away from the others as possible, while the green
car gave chase. In each trial, the behavior of the system was
consistent. Within less than a second, the system determined
the intentional states of Green and Red with respect to each
other. It then proceeded to generate and test hypotheses regarding their intentions toward itself, by approaching each
of the two cars. Within a few seconds more, it was able to
determine that Red was fleeing from both, and Green was indifferent to Blue. Since the intentional state never changed,
no positional information was ever recorded or analyzed.
Table 1: Results from Chase and Follow

                  No. of Games   Average Time   σ
Chase             6              7.5 sec        1.41
Follow            4              33.5 sec       4.66
Chase (observe)   3              29.3 sec       10.69
The fact that the robot can participate in the game provides it with significant added power to probe the players’ intentional states. For comparison, we also ran versions of the
game that involved three human drivers, relegating the robot
system to the role of passive observer. Still, the system applied the same algorithms to hypothesize the intentions of the
players, and eventually converged on a stable, correct belief
state. But it took much longer.
The second game, Follow the Leader, increased the complexity somewhat. The driver of the red car was instructed to
drive wherever he or she wished, and the green car was to
follow, but remain a foot or two away – stopping when the
red car stopped, reversing if it got too close. Success in this
game came when the system understood this: it should approach the red car from across the room, but avoid it close in.
In this game, the system was successful in only four of six runs. In the other two trials, it formed
the belief that the game was Chase, just as in the previous
experiment, and never noticed the change from following to
ignoring or avoiding.
Tag
Having confirmed that the system was able to understand
and participate in simple games, we asked our subjects to
play the somewhat more sophisticated game of tag. In previous research that involved the system merely watching people play, rather than attempting to participate, we enjoyed a
great deal of success (Crick et al., 2007; Crick & Scassellati,
2008). However, several factors conspired against us. The
RC cars are not nearly as agile as actual humans, and our
subject drivers had significant difficulties controlling the vehicles well enough to conduct the game. In addition, one of
the three participants in the game – the robot – had no idea
how it should be played, and the two human players were unable to demonstrate gameplay adequately by themselves. We
asked a pair of students unconnected to our tests to watch the
videos of the tag attempts, and neither of them was able to identify the game being played, either.
Since freeform tag was too difficult for all involved, we
developed rules for a tag-like game in order to test the system’s ability to understand turn-taking and role shifts within
the context of a game. In the modified game, only one person
was supposed to move at a time. The player designated as
“it” picked a victim, moved toward it, tagged and retreated.
Then the new “it” repeated the process. Figure 5 depicts a set
of stills from one of these modified tag games. A frame-by-frame description of the game is given in Table 2.

Figure 5: Succession of images from modified tag (stills at t = 8, 19, 26, 35, 42, 49, 52 and 55 seconds). See text and Table 2 for details.

At each time point, the table includes a human-constructed
verbal description of the action of the game, as well as the
textual description produced by the system itself. This comes
from the robot’s own actions (which it knows absolutely and
need form no beliefs about), and its belief in the intentional
states of the players during a particular narrative episode.
We can evaluate the system’s success in ascribing intentions
by comparing these human descriptions with the intentional
states posited by the robot. Furthermore, we can identify
points at which the system establishes rules that coincide with
human understanding of the game. At the start, the robot
watches the other two players each tag one another, without
participating. Then, not knowing what its own role in the
game is, it begins to move toward and away from the other
players, observing their reactions. Because both of the human players are currently ignoring the robot, these actions
are inconclusive. However, by second 42, the system has accumulated enough data to know that intentional shifts are signalled by close proximity. In the fifth frame, it recognizes the
tag and reverses its own direction at the same time. By the
seventh frame, it is testing to see whether approaching the
red car will cause it to reverse course. By the end of this sequence, the system still has not determined that there is only
one player (“it”) with the chasing role, but it is well along the
way – it understands tagging and the ebb and flow of pursuit
and evasion.
Table 2: Action and narrative during modified tag. Each time point corresponds to a still from Figure 5.

t (sec)   Action Description             System narrative
8         R approaches and tags G        R chases G.
19        R withdraws;                   G chases R, and
          G approaches and tags R        R runs from G.
26        G withdraws;                   R chases G, and
          R approaches and tags G;       G runs from R and me, and
          B approaches G and R           I chase G.
35        R withdraws;                   G chases R, and
          G approaches R;                R runs from G, and
          B approaches G and R           I chase R.
42        G tags R;                      G chases R, and
          B approaches G and R           I chase R.
49        G and B run from R;            R chases G, and
          R tags G                       G runs from R, and
                                         I run from R.
52        R withdraws;                   R chases G, and
          B approaches and tags R        G runs from R, and
                                         I run from R.
55        G approaches B and R;          R runs from me, and
          B withdraws                    G runs from R, and
                                         I chase R.

Conclusion
Biological beings excel at making snap decisions and acting in a complex world using noisy sensors providing information both incomplete and incorrect. In order to survive, humans must engage and profit from not only their physical environment, but a yet-more-complex social milieu erected on
top. One of our most powerful and flexible cognitive tools
for managing this is our irrepressible drive to tell stories to
ourselves and to each other. This is true even, or perhaps especially, when we have only sparse information to go on. And
beyond the telling, we take great delight in participating. We
play games, we act, and the stories we love most are the ones
in which we are the central characters.
We have developed a system that takes advantage of the
very fact that it receives only rudimentary sensory impressions, and uses them to weave a story in which it can take
part. The relative positions of moving objects are more than
enough data for a human observer to begin making sense of
the interaction by imagining their intentions and goals. By
applying force dynamics to hypothesize about such human
intentions, and by acting on those hypotheses to explore and
verify its beliefs about the world, our system attempts to do
the same.
The system has to figure out for itself how its motor controls correspond to action in the world. It theorizes about
and tries to learn the intricate rules of games it knows nothing about. The verisimilitude of the data thus collected enables us to draw stronger conclusions with respect to real human interaction and interpretation, in contrast to data derived from simulation or computer-mediated play. And it does so in the
real world, in real time, at human speed, using few shortcuts.
Acknowledgements
Support for this work was provided by National Science Foundation awards #0534610
(Quantitative Measures of Social Response in Autism), #0835767 (Understanding Reg-
ulation of Visual Attention in Autism through Computational and Robotic Modeling)
and CAREER award #0238334 (Social Robots and Human Social Development). Some
parts of the architecture used in this work were constructed under the DARPA Computer
Science Futures II program. This research was supported in part by a software grant
from QNX Software Systems Ltd, hardware grants by Ugobe Inc., and generous support from Microsoft and the Sloan Foundation.
References
Angluin, D. (1988). Queries and concept learning. Machine Learning, 2, 319–342.

Choi, H., & Scholl, B. J. (2006). Perceiving causality after the fact: Postdiction in the temporal dynamics of causal perception. Perception, 35, 385–399.

Crick, C., Doniec, M., & Scassellati, B. (2007). Who is it? Inferring role and intent from agent motion. In Proceedings of the 11th IEEE Conference on Development and Learning. London, UK: IEEE Computational Intelligence Society.

Crick, C., & Scassellati, B. (2008). Inferring narrative and intention from playground games. In Proceedings of the 12th IEEE Conference on Development and Learning. Monterey, CA: IEEE Computational Intelligence Society.

Csibra, G., Gergely, G., Biro, S., Koos, O., & Brockbank, M. (1999). Goal attribution without agency cues: The perception of 'pure reason' in infancy. Cognition, 72, 237–267.

Gigerenzer, G., & Todd, P. M. (1999). Simple heuristics that make us smart. New York, NY: Oxford University Press.

Gold, K., & Scassellati, B. (2005). Learning about the self and others through contingency. In AAAI Spring Symposium on Developmental Robotics. Palo Alto, CA: AAAI.

Heider, F., & Simmel, M. (1944). An experimental study of apparent behavior. American Journal of Psychology, 57, 243–259.

Laming, D. R. J. (1968). Information theory of choice-reaction times. London: Academic Press.

Leslie, A. M., Xu, F., Tremoulet, P. D., & Scholl, B. J. (1998). Indexing and the object concept: Developing 'what' and 'where' systems. Trends in Cognitive Sciences, 2(1), 10–18.

Loucks, J., & Baldwin, D. (2008). Sources of information in human action. In Proceedings of the 30th Annual Conference of the Cognitive Science Society (pp. 121–126). Austin, TX: Cognitive Science Society.

Mitroff, S. R., & Scholl, B. J. (2004). Forming and updating object representations without awareness: Evidence from motion-induced blindness. Vision Research, 45, 961–967.

Rochat, P., Striano, T., & Morgan, R. (2004). Who is doing what to whom? Young infants' developing sense of social causality in animated displays. Perception, 33, 355–369.

Scholl, B. J. (2004). Can infants' object concepts be trained? Trends in Cognitive Sciences, 8(2), 49–51.

Talmy, L. (1988). Force dynamics in language and cognition. Cognitive Science, 12, 49–100.

Wolff, P. (2007). Representing causation. Journal of Experimental Psychology, 136, 82–111.

Wolpert, D. M., Doya, K., & Kawato, M. (2003). A unifying computational framework for motor control and social interaction. Philosophical Transactions of the Royal Society B, 358(1431), 593–602.