Fuzzy Q-Learning For First Person Shooters

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

FUZZY Q-LEARNING FOR FIRST PERSON SHOOTERS

Youssef S. G. Nashed Darryl N. Davis


Department of Information Engineering Department of Computer Science
University of Parma University of Hull
43124, Parma - Italy HU6 7RX, UK
[email protected] [email protected]

KEYWORDS FPS games are characterized by the player viewing the world
Reinforcement Learning, Q-Learning, Fuzzy Systems, First from the eyes or the perspective of the main playing
Person Shooters, Quake-II. character. These games usually provide the player with
various weapons to defeat increasingly challenging
ABSTRACT opponents.
This study requires aspects of Champandard‟s AI extensions
Here machine learning techniques in the context of their to Quake-II (Champandard 2003) to be built into the Quake-
application within computer games is examined. The scope II demonstrator code and then extended to allow learning in
of the study was that of reinforcement learning [RL] the game. This will enable studying the effect of applying
algorithms as applied to the control of enemies in a video machine learning algorithms to Non-player characters
game dynamic environment thus providing interesting new [NPCs] in terms of emerging behaviors, diversity,
experiences for different game players. optimality, and believability.
The project proved reinforcement learning algorithms are
suitable and useful for simulating intelligence of agents ARTIFICIAL INTELLIGENCE IN GAMES
within a game. The implemented learning bot exhibited
interesting capabilities to adapt to a human player playing Board games [chess, checkers, go, etc…] have always been
style in addition to outperforming other implemented and studied by AI researchers. These games act as a problem
downloaded bots. Test results showed that incorporating generator to be solved by AI algorithms especially when
learning into a game can reduce the extensive scripting and playing against a human professional player (Campbell
tuning phase, while retaining the ability to guide the NPCs 1999).
not to exhibit totally unrealistic or unexpected behaviors. Graph machines describe the family of search techniques
This gives the designers the scope to explore new strategies. like breadth first, depth first, iterative deepening, or A*
The effect of the exploration-exploitation policy on the algorithms. Graph machines are particularly effective for
learning convergence was studied and tested in depth. The implementing AI for board games. The game discrete state
combination of the two most popular reinforcement space can be represented as a tree or graph and any of these
algorithms [Q-Learning and TD(λ)] resulted in faster search mechanisms can be used to traverse the tree to find
learning rates and realistic behavior through the exploration the optimal solution for a certain situation. The use of
period. heuristics about the game or the opponent moves allows the
tree branches and the searching time to be considerably
INTRODUCTION decreased thus making for efficient search algorithms.
Modern video games often comprise of complex dynamic
Video games or electronic games can be defined as computer environments. The attempt to store the continuous state
games that involve handling a user interface to provide space of such a game using a graph is computationally
visual feedback. Recently the video game industry has been impossible. Hence other world and agent models were
growing exponentially and AI is playing an important role in adopted for complex video games, and other machines are
the success of a game. used to work on these models like state machines and
Nowadays video games are required to deliver stunning reactive machines (Brooks 1991). Graph machines are still
graphics, real world physics, impressive sound, believable used for path finding [A*] and planning (Orkin 2006).
AI, and gripping storylines to be able to survive in this fierce There has always been a gap between AI research in
market. The lifetime of a game title on retailers‟ shelves academia and AI applied in video games. In the past game
depend on how much fun the game is, which in turn depends AI programmers often complained about being given
on many aspects including the repeatability of in-game insufficient CPU resources compared to graphics which
situations; the less predictable the more fun the game. often constrained them from employing any significant AI
First person shooters [FPS] games can be grouped under the techniques. With the recent breakthroughs in computer and
Action genre which generally involves the player guiding a console hardware, game AI programmers can borrow well
character through a set of virtual environments encountering established academic research like neural networks and
several types of enemies. These games depend on quick genetic algorithms (Zanetti and El Rhalibi 2004).
responses and eye-hand coordination (Johnson and Wiles Furthermore, academic researchers have started to take video
2001). games seriously and use it as a test-bed for novel AI research
as games provide an increasingly realistic and complex

© EUROSIS-ETI
environments bundled in a reliable, cheap, and extendible The AI in the game F.E.A.R by Monolith Productions uses a
package. There are also commercial games used and method very similar to STRIPS (Nilsson 1998) to describe
developed by the US army national guard to test new agent goals and actions.
weapons and surveillance technologies [America's Army,
AirForce Delta Storm, Microsoft Flight Simulator, etc...]. AI in Quake
Game AI can be categorized into two broad groups:
Deterministic Agents: Not necessarily meaning that the A brief overview of the bots developed in the Quake game
NPCs are driven by designer scripts but that their behavior is series or for research reasons is given in this section.
deterministic i.e. when facing the same situation agents will The Omicron Bot: This bot plays the multiplayer version of
always perform the same action or behavior. Scripted agents Quake. It was created to emulate a human player in a
are easier to implement, debug, and extend but struggle to multiplayer or online game. Omicron starts the game with no
display complex behaviors. prior knowledge of the level structure then it drops
Learning Agents: In contrast to scripted or rule based agents, waypoints till it knows its way around a level (Van Waveren
learning agents can adapt to human players and game 2001). The bot uses fuzzy logic for action preferences and
environments and can even learn new behaviors. Learning weapon selection.
can speed up the development time and provide extra levels The Gladiator Bot: Like Omicron, Gladiator is a multiplayer
of intelligence for NPCs that are increasingly challenging for bot for Quake-II. The only difference between both bots is
the player. that Gladiator has a whole range of characters each having
its own personality by varying the fuzzy logic parameters
AI in First Person Shooters (Van Waveren 2001).
Quake-III Arena Bot: The Quake-III bot uses a layered
Game AI is all about believability. If the NPCs in a game steering behavior approach which employs the type of
can convince the player that they are not artificially subsumption architecture proposed by (Brooks 1986).
controlled characters then the AI is said to be successful. Quake-II is used as a research environment for developing
FPS games have been around for many years and their AI autonomous agents. Anticipation bot (Laird 2001) uses the
systems are well established. However, players still prefer SOAR engine (Laird et al. 1987) with Quake-II to
the multiplayer and online versions of these games which demonstrate dynamical hierarchical task decomposition and
means that the AI in FPS is still falling short from prediction. Emobot (Hooley et al. 2004) was developed to
mimicking the skills of human players. prove that simple emotions like fear and anger can have a
This subsection contains an overview of the AI methods that great effect on agents‟ actions and believability
are widely used in FPS games giving examples from
commercial games. REINFORCEMENT LEARNING

Finite State Machines The standard RL model is depicted in Figure 1. At each time
Finite state Machines [FSMs] are a natural way of thinking step t+1 the agent receives s as the state of the environment
about synthetic creatures, consisting of a set of states, inputs, which can be separated to i as input to the agent and r as the
and outputs. An enemy is always in a given state [e.g. scalar reward signal for the action a chosen at the previous
patrol]; when this state receives a specific input [e.g. player time step t. The agent‟s behavior B should choose the action
sighted] the state changes [e.g. attack] and an action is that is expected to increase the long-run sum of rewards. The
performed [e.g. fire]. Games like Doom, Descent, and Quake input state s is perceived by the agent as input according to
use FSMs to model the enemies (Bourg and Seeman 2004). the input function I, which is the way the agent views the
environment state. Typically I is the identity function [that is
Fuzzy Logic the agent perceives the exact state of the environment],
Fuzzy logic replaces the standard Boolean logic concept of however this function can change in the case of partially
absolute True or False with degrees of membership to a observable environments (Kaelbling et al. 1996).
specified fuzzy set (Zadeh 1965). Fuzzy logic maps
perceived input [crisp input] to a real number describing how
close this input to a certain fuzzy variable [e.g. dangerous].
The game Unreal by Epic Games uses fuzzy logic to
implement Fuzzy State Machines [FuSM]. These fuzzy
machines provide smoother state transition than normal
FSMs. „S.W.A.T. 2‟ uses fuzzy logic to give each enemy a
personality by parameterizing different fuzzy sets like
aggression, courage, intelligence and cooperation (Johnson
and Wiles 2001).

Planning
In planning actions and goals are decoupled which allows
more modularity in the design and enables designers to Figure 1: The Standard Reinforcement Learning model
produce more complex behaviors that would require a [taken from (Kaelbling et al. 1996)]
considerable amount of effort and computational memory to
be expressed using FSMs.

© EUROSIS-ETI
(Sutton and Barto 1998) identify four main components of a Where A is the set of all actions, P(a) is the selection
RL system: probability of action a, E(a) is the estimated pay off of action
A policy: often denoted as π, this is what determines what a, and T is the temperature of the system.
action the agent should choose in a given state. The temperature parameter T is decreased over time to
A reward function: this is an external function to the agent decrease exploration and allow exploitation of higher
providing the reward signals [usually a floating point number estimated actions. T values need careful tuning as they are
between -1 and 1]. used to control convergence.
A value function: also known as the expected reward E,
which is the long term value of a state when operating under Delayed Reward
the current policy. The value function calculation depends on
the model of optimality used. The two main problems facing RL methods are temporal
A model for the environment: some RL algorithms start with credit assignment and the curse of dimensionality.
a complete model of the environment; other algorithms learn In the standard RL model the agent performs an action then
this model; while some algorithms do not need such a model receives an immediate reward from the environment; this is
at all. not always applicable in the case of complex dynamic
environments where the agent has to execute a series of
Exploration versus Exploitation actions to reach a state of the environment with a high
reward signal [goal]. This kind of reward functions is called
The exploration-exploitation dilemma is very popular in the a sparse reward function.
field of RL algorithms. This problem is studied in depth Sparse reward functions are zero everywhere, except for a
within statistical decision theory and the k-armed bandit few places. The contrast with dense reward functions, which
problems (Berry and Fristedt 1985). give non-zero rewards most of the time.
In the k-armed bandit problems the agent is presented with k To cope with such environments the agent must learn from
number of gambling machines, each having an arm that delayed rewards. When a reinforcement signal is received
when pulled pays off 1 or 0, based on a certain probability. the value function of all the actions leading to the current
The agent is allowed a limited number of pulls. What should state should be updated with a discounted version of the
be the strategy undertaken to maximize the payoff in this reward. Temporal credit assignment [delayed reward]
situation? problems are well modeled as Markov Decision Processes
The agent should try out every option in the first few tries [MDPs].
[exploration], then - after forming a model that is close to MDPs consist of a set of actions A, a set of states S, a reward
being correct - be greedy and make use of highest paying off function R, and a state transition function T with which T(s,
machines [exploitation]. The balance between the a, s’) means the probability of the environment being in state
exploration period and the exploitation period is called the s’ after executing action a in state s.
exploration policy. It is extremely difficult to find an optimal Some RL algorithms use a model of the environment. These
exploration policy, nevertheless various researches proposed use MDPs and learn the state transition function or construct
reasonable policies that produce good results. the model while learning.
Some approaches to this problem use dynamic programming To form a model of a video game environment is extremely
incorporating basic Bayesian reasoning (Berry and Fristedt difficult; both sets of states and actions would be huge.
1985), or learning automata techniques (Kaelbling et al. Consequently this study is concerned with model free
1996). In (Gu et al. 2003) a genetic algorithm was used for techniques like Q-learning and Temporal difference (TD)
exploration, while Jouffe proposed a heuristic based method algorithms.
based on the frequencies and success of the choices (Jouffe
1998). However the most successful and used approaches are Temporal Difference TD(λ)
that of Ad-Hoc strategies.
Ad-Hoc strategies are simple undirected [randomized] TD methods introduced in (Sutton 1988) do not wait until
algorithms. They are far from optimal but are proven the end of an episode like Monte Carlo methods to update
reasonable, traceable, and computationally efficient for most the value function or estimate. Instead it uses the temporal
applications. The two main exploration stratigies that are difference in the expected return V( ) value of a state
commonly used are the ε-greedy exploration and the to the previous one(s).
Boltzmann exploration. The parameter λ controls how many steps or states to look
In the ε-greedy exploration algorithm actions with the back when adjusting the state expected return values, when
highest estimated payoff are always chosen. The downside λ=0 the algorithm only looks one step back. The update rule
of this approach is that the algorithm can get stuck in for a TD(0) algorithm is:
suboptimal actions while the optimal solution is starved and
never found. A useful heuristic is optimism in the face of Where is the reward signal received at state and
uncertainty in which actions are initialized with high is the discount factor.
optimistic estimations so that negative feedback is needed to
remove the action from selection. Where is the learning rate.
The Boltzmann exploration algorithm uses the Boltzmann In addition to the state expected return value V, (Sutton
distribution to assign action selection probabilities. 1988) defines eligibility traces of a state e(s) as the degree
to which a state has been visited in the recent past. These
(1) traces are used when updating the expected value function of

© EUROSIS-ETI
a state each according to its eligibility. The update rule for pitch, and roll angles for the bot orientation. Other classes of
e(s) is as follows: steering behaviors are described for each agent in turn.
(2)
Rule Based FSM Bot “Tom”

Q-Learning Tom is the first and simplest bot, implemented using scripts.
Tom uses a FSM to control his actions and decisions. Here
Q-Learning (Watkins and Dayan 1992) is probably the most FSMs are designed as a form of Rule Based Systems [RBS].
used RL algorithm. Q-Learning can be considered a TD(0) The implemented RBS is a forward chaining RBS with a
method but instead of estimating a value function for each First Applicable conflict resolution mechanism (Bourg et al.
state Q-learning maintains a value function Q(s,a) for each 2004). This system can be described in terms of three
state-action pair. mutually exclusive sets namely: states, actions, and inputs,
Let‟s say the agent has performed an action a, received a and a rule base.
reward signal r, and moved from state s to s’. The Q value
for the stat-action pair (s, a) is updated using an equation States
very similar to the TD update equation, except that the They dictate the action that is chosen in a given point in
highest Q value for the pair (s’, a’) is used. Where a’ is a time. Tom can be in one and only one of the following
possible action that can be executed in state s’. states:
(3) "Searching" Looking around exploring as much possible
Where α is the learning rate and γ is the discount factor. from the level terrain.
"Wandering" Randomly moving around an area.
Generalization "Attacking" Firing the held weapon at an enemy.
"Hunting" Pursuing the enemy's current position.
In a simple discrete state space environment, the state and "Gathering" Collecting Health, Armor, Weapons, or Ammo.
action spaces can be enumerated and the Q values stored in a "Fleeing" Running away from the enemy to a safe position.
lookup table. In the case of a large continuous state space
environment the use of such a data structure is impractical. Actions
This problem is called the curse of dimensionality. Actions are the outputs of an AI system. They are considered
To address this problem some sort of input generalization the bot's effectors to interact with the environment. Tom‟s
should be employed. Generalization uses compact storage of effectors are outlined below:
learned experience and applies it to similar states and "Turn" Keep yawing with a certain angle.
actions. "Move" Step in a given direction.
Input generalization is achieved by the use of function "Shoot" Fire the held weapon.
approximators such as Cerebellar Model Articulation "Seek" Go to a given location on the map.
Controller [CMAC] (Sutton 1996), neural networks (Lin "Avoid" Go away from a given location on the map.
1992), and fuzzy inference systems (Berenji 1996) (Jouffe
1998). Inputs
They represent the values and observations that are fed into
DESIGN AND IMPLEMENTATION the system through sensors. Tom has the following Boolean
sensor inputs:
Quake-II has been used as a test bed for AI researchers for "Enemy Sighted" Another player or bot is in the field of
some time because it is easily extendible, multiplatform, view.
open source, and has a huge amount of community support "Item Sighted" An item health, weapon, or ammo is in the
[maps, levels, tools, etc…]; also the graphics and physics field of view.
requirements are very modest compared to today‟s "Is Confident" Health is above 20%.
processing power leaving computational resources free for "Is Hit" A projectile or weapon damage has been inflicted.
AI calculations. Being a FPS game, Quake-II players move
in a 3D virtual world viewing it from the eyes of the main Rules
character. The main task of a player is staying alive and The rule base consists of production rules. These rules are
eliminating his/her opponents. basically in the form of “If … then” statements. They are
The project involved the design and implementation of three crafted by an expert to transfer a human knowledge to the
agent types. Each agent [bot] was designed and implemented system. Table 1 illustrates the rules at each state.
in an incremental fashion, its action was built and tested
individually and every bot contributed in the implementation Fuzzy Rule Based Agent “Yianni”
of its successor. The navigation module used a reactive
architecture and is divided into two levels: Low level actions Yianni uses Tom‟s RBS redesigned as a Fuzzy Rule Based
[short term goals] like obstacle avoidance and turning; and System (FRBS). Instead of Boolean sensors Yianni uses
high level actions [long term goals] like taking cover, fuzzy logic to classify sensor crisp values as degrees of
gathering, investigating, chasing, fleeing, and dodging. membership to a specific fuzzy set.
There are only two effectors for the navigation module: Step: There are two types of FRBSs; The Mamdani FRBS and the
which takes a 3D vector specifying the direction as input, Takagi-Sugeno-Kang (TSK) FRBS. The difference between
where only the first two dimensions are used; Turn: which them is the types of inputs [antecedents] and outputs
takes three floating point values as input, specifying the yaw, [consequents] used in both of them (Alcalá et al. 2000).

© EUROSIS-ETI
Table 1: Rules for the FSM Scripted Bot powers, so the multiplication by 10 gives the weapon
strength out of 100 [a bot switches automatically to a higher
Rank State Input New State Action rank weapon when acquired].
1 Searching Enemy Sighted Attacking Shoot "Got Shot" A Boolean variable indicating that the bot has
2 Searching Is Hit Searching Turn
(Default) Searching Any Other Wandering Move
been hit by a projectile or any enemy weapon.
1 Wandering Enemy Sighted Attacking Shoot "Enemy Present" A Boolean variable indicating that the bot
2 Wandering Item Sighted & ¬Is Gathering Seek can see an enemy.
Confident "Enemy Disappeared" A Boolean variable indicating that the
3 Wandering Is Hit Searching Turn
(Default) Wandering Any Other Wandering Move
bot was seeing an enemy but then the enemy ran away or
1 Attacking Enemy Sighted & Is Hunting Seek took cover.
Confident
2 Attacking Enemy Sighted & ¬Is Fleeing Avoid Fuzzy Output Variables [Consequents]
Confident TSK consequents are assigned the value of the minimum
(Default) Attacking Any Other Wandering Move
1 Hunting Enemy Sighted Attacking Shoot [sometimes the product] of the truth values computed for the
(Default) Hunting Any Other Wandering Move antecedents in a fuzzy rule. If more than one rule has the
1 Gathering Enemy Sighted Attacking Shoot same consequent the output variable will have multiple
2 Gathering Item Sighted & ¬Is Gathering Seek confidence values to choose from. This can be solved in
Confident
3 Gathering Is Hit Searching Turn various ways; the most popular techniques are bounded sum
(Default) Gathering Any Other Wandering Move [sum and bound to 1], and maximum value [equivalent to
1 Fleeing Enemy Sighted Attacking Shoot OR-ing the confidences together].
(Default) Fleeing Any Other Wandering Move The maximum value method is used here to resolve the
multiple confidences per output conflict. Yianni has the
The Mamdani system rules are constructed in the following following output variables each representing a tendency to a
form: certain action:
IF is and … and is THEN Y is B "Dodge" Try to avoid incoming visible projectiles [bullets,
Where X = { , …, } is the set of real value input rockets, etc...]
variables, A = { , , …, } is the set of fuzzy linguistic "Flee" Moving away [opposite direction] from an attacking
variables for the inputs, Y is an output variable, and B is also enemy.
a fuzzy linguistic variable. "Attack" Aiming and shooting the held weapon at a visible
The TSK consequents are a function of the input variables, enemy.
taking the form: "Chase" Moving towards or pursuing an enemy.
IF is and … and is THEN Y is ( … ) "Gather" Moving towards a visible item [Health Kit, Armor,
Where is the truth value for input variable in the fuzzy Weapon, or Ammo]
set defined by , and is the disjunction [min] operator. "Search" Turning around looking for enemies.
This implementation uses the TSK FRBS which can be "Investigate" Moving towards a location in the level and
expressed in terms of its input and output variables along searching the area.
with the rule base. "Wander" Random movement and turning while avoiding
obstacles.
Fuzzy Input Variables [Antecedents]
Sensor Input fuzzification provides smooth transition Actions
between states and emulates the way humans think. Crisp The agent‟s overall action is a weighted vector for a
floating point values are fed into the fuzzification component movement direction vector and a weighted orientation angle
to be mapped to a degree of membership of a linguistic for turning.
variable (e.g. how close an enemy is). Each output fuzzy variable contributes to the action with the
Yianni has the following list of sensor inputs some with an value calculated after firing the fuzzy rule where this
accompanying fuzzy membership function and others are variable is a consequent. This provides smooth control
Boolean sensors. implementing the architecture proposed by Saffiotti for
"Projectile Distance" The length of the vector from the bot to controlling autonomous robots (Saffiotti et al. 1993).
the closest projectile observed.
"Enemy Distance" The length of the vector from the bot to Fuzzy Rules
the closest enemy observed. TSK rules are used to control Yianni‟s behavior and can be
"Item Distance" The length of the vector from the bot to the seen in Table 2. It is noticeable the amount of rules required
closest item observed. for Yianni is far less than that for Tom. This proves to be
"Confidence" The crisp value for this fuzzy variable is more efficient for the inference system as fewer rules have to
gathered from the agent‟s personal status and calculated as be matched at every time step.
follows:
Confidence (out of 400) = Health (%) + Armor (%) + Fuzzy Q-Learning Agent “Babis”
(Ammo/2) + (Weapon Rank * 10)
Four different game variables form the crisp value for the Babis is a learning agent and is built upon Yianni‟s fuzzy
fuzzy Confidence input variable. Each variable is clamped to system. The learning algorithm implemented is a Fuzzy Q-
100 then added together to give the final value out of 400. Learning method (Glorennec and Jouffe 1997) used by many
In Quake-II, maximum ammo for a weapon is 200, hence the researchers in the robotics field (Deng et al. 2003)
division by 2. Weapons rank from 1-10 with increasing

© EUROSIS-ETI
Table 2: Fuzzy Agent Behavior Rules (7)
Output Condition Where is the amount added to the value of action ,
and is the learning rate.
Dodge Projectile Distance is near To speed up the learning process and make it exploration
Flee (Enemy Distance is near OR Enemy insensitive we use the eligibility traces from equation (2) in
Distance is close) AND Confidence is weak the form:
Attack Enemy Present
Chase (Enemy Distance is near OR Enemy (8)
Distance is close) AND Confidence is
strong Where λ is the discount parameter of TD(λ)
Gather (Item Distance is near OR Item Distance is After introducing eligibility traces equation (7) becomes:
close) AND Confidence is weak (9)
Search Got Shot AND ¬Enemy Present Finally, the learning algorithm steps are:
Investigate Enemy Disappear
Wander Always [Default Behavior if none of the 1. Collect sensor information and fuzzify them to state x.
above Rules Fire] 2. For each rule, choose the consequent action using
Boltzmann EEP.
Using Fuzzy logic for input state generalization in 3. Apply the global action a from blending all rules actions
reinforcement learning methods has been used for a while and receive reinforcement.
now (Berenji 1996). While most fuzzy reinforcement 4. Compute the overall quality of the system Q(x,a) using
learning algorithms used Actor-Critic architectures (Berenji equation (4).
et al. 1992) (Wang et al. 2007), the method implemented 5. Observe the new state y.
here has the advantage of being exploration insensitive 6. Compute the value of the new state V(y) using (5).
because of the use of eligibility traces of (Sutton 1988). 7. Calculate the error signal using formula (6).
Fuzzy Q-learning is specifically designed to work with TSK 8. Update the eligibility traces for all actions using (8).
FRBSs which makes it suitable for adding learning to the 9. Use the and the eligibility traces to update all the
existing system built for Yianni. Firstly the TSK rule base action q values using (9).
has to be amended to allow J competing actions defined for
every combination of the state [input] vector. Then a q value It is worth noting that the convergence of this algorithm
is associated with each of the competing action as follows: depends on many parameters:
IF is then with The number of input variables forming the state vector and
OR with the number of linguistic variables describing each input
... decide how many rules are specified. The temperature
OR with parameter in the Boltzmann probability equation used for
Where x is the vector of input variables { , , …, }, EEP decides how long is the exploration period and when to
is the state vector defined “ is AND is AND … start choosing only actions with maximum q values. The
AND is ”, are fuzzy sets, , and is discount and learning rate factors affect the speed of
the set of possible actions that can be chosen in state . convergence of the learning algorithm.
The learning agent has to find the best consequent [action] Action q values can be initialized based on prior knowledge
for each rule using the quality value q. The q values are giving some sort of bias towards certain actions in specific
initialized to zeros at the beginning. At each time step an state input configurations. This provides more human control
Exploration/Exploitation Policy [EEP] is used to choose an on the learning process but might prevent the agent from
action for each rule using the Boltzmann Probability exploring new solutions.
assignment from equation (1). Babis has the same fuzzy input and output variables as
Let be the index of the selected action in rule using the Yianni. Additional rules were added to cover the remaining
EEP, and is the index of the action having the maximum q input space. Three input variables where considered in the
value in rule . Therefore the overall quality Q of the inferred learning process: Enemy Distance, Item Distance, and
global action in state is given by: Confidence; Making Chase, Flee, and Gather the three
competing actions for the bot to choose from.
(4)
TESTING AND RESULTS
Where N is the total number of rules, and is the
inferred truth value for input vector x in rule i.
The three bots were developed within the FEAR platform
The value V for state is given by:
(Champandard 2003). The platform is developed to support
(5) AI research and teaching through building synthetic
creatures in virtual environments. Theoretically it can
To update the action values, we have to calculate the error interface with any game engine as a backend, yet its only
signal . Let be a state, the action applied to the implementation is with the Quake-II engine.
system, the new state and the reinforcement signal. By Bots are written in C++, compiled into Dynamic Link
substituting in equation (3) we get: Libraries (DLLs) and are defined with XML files.
(6)
By using an ordinary gradient descent we get:

© EUROSIS-ETI
The DLL file is the implementation of the control processes,
150
using FEAR components specified in the XML files through
interfaces called hooks to interact with the game engine. 100
The knowledge in the rule base of the implemented bots is 50
based upon experience in the game and trial and error. In the
Fuzzy Q-Learning algorithm we use as the discount 0
factor and as the learning rate. -50 1 2 3 4 5 6 7 8 9 10
The reward function is defined as follows: -100
On pain: , on item pickup: , on death:
, hurting an enemy: , and killing an -150
enemy: -200

To evaluate the performance of the implemented bots, a T = 0.5 T=1 T = 0.1


simple map was created comprising of one large room,
scattered obstacles, and various power-ups. Performance Figure 2: Temperature Effect On The Frag Difference
evaluation is achieved through death matches between
implemented bots, the difference between how many times a To speed up learning TD(λ) methods were combined with
bot has killed the other is called the frag difference. the Fuzzy Q-learning algorithm by using the eligibility traces
The frag difference is used as the main performance for each rule action. Another ten death match simulations
evaluation factor. It conveys superiority of a bot‟s actions were run to show the effect of changing the λ value on
over the other. Babis‟ performance. Results are plotted in Figure 3.

Tom VS Yianni 100

The winner from the first two implemented bots will have 50
the chance to compete against Babis the learning bot to 0
assess whether learning has added any power to the agent‟s
behaviors or not. -50 1 2 3 4 5 6 7 8 9 10
Results from ten death matches [each has a time limit of 3
-100
minutes in accelerated mode] were collected and presented
in Table 3. -150

Table 3: Death Match Results for Tom VS Yianni -200

# Tom’s Kills Yianni’s Kills Frag Difference λ = 0.3 λ = 0.9 λ=1


1 974 1028 54 Figure 3: λ Effect On The Frag Difference
2 900 1002 102
3 832 928 96 ANALYSIS AND FUTURE WORK
4 894 942 48
5 833 880 47 From the results presented and from observing the bot
6 900 1003 103 actions, the following points are noticed: Babis the learning
7 918 1018 100 bot is evidently superior to both the scripted bots [and
8 947 1050 103 Champandard‟s bots]. The learnt modified rule base
9 815 853 38 presented surprising preferences. For instance Babis prefers
10 839 890 51 chasing when confidence is weak and flees when confidence
is strong. This may seem to be counterintuitive. However,
Yianni VS Babis after careful observation this strategy proved to be very
useful. At the beginning of a confrontation Babis is usually
Babis‟ performance depends on the learning algorithm and cautious inflicting damage while avoiding the opponent, then
the time it takes to converge to an optimum policy, which in after dealing some damage and getting hurt in the process,
turn depends on many parameters. Babis becomes aggressive and finishes the badly wounded
During the exploration period the agent may behave fleeing enemy.
unrealistically, so the temperature parameter was chosen to An advantage of Fuzzy Q-learning is the amount of human
show the effect of the exploration period on Babis' control over the learning system. This is a desirable feature
performance through ten death matches [each has a time in the game AI field, as game designers do not usually wish
limit of 5 minutes in accelerated mode] against Yianni. for completely unexpected behaviors from the NPCs; they
Results are plotted in Figure 2. T=0.5 yields a good demand tools to script or at least guide the NPC in the course
exploration policy but with one loss while a lower of the game. Though the proposed learning technique works
exploration time does not find the optimal behavior policy. well in adapting to the environment and the opponents, the
basic fuzzy membership functions still have to be crafted by
a designer or an experienced player. Fuzzy tuning addresses
this problem where not only the consequents are learnt but

© EUROSIS-ETI
also the fuzzy membership function boundaries and even the Deng, C.; and Er, M. J. 2003. "Real-time dynamic fuzzy Q-learning
functions themselves (Alcalá et al. 2000) (Er et al. 2004). and control of mobile robots". In 2nd WSEAS International
The reward function was another problem with the Conference on Electronics, Control and Signal Processing.
implementation. The reward signal is general to the agent‟s Singapore.
Glorennec, P. Y.; and Jouffe, L. 1997. "Fuzzy Q-Learning". In
global action, which might be deceiving especially in the Proceedings of the 6th IEEE International Conference on Fuzzy
case of multiple opponents. Hierarchical reinforcement Systems.
methods (Ponsen et al. 2006) separate the reward signal for Gu, D.; Hu, H.; and Spacek, L. 2003. "Learning Fuzzy Logic
every action, so the agent receives accurate feedback. A Controller for Reactive Robot Behaviours". In Proceedings of
combination of the subsumption architecture and hierarchical IEEE/ASME International Conference on Advanced Intelligent
reinforcement learning can be an interesting experiment Mechatron. Kobe, Japan.
within game AI. Hooley, T.; Hunking, B.; Henry, M.; and Inoue, A. 2004.
"Generation of Emotional Behavior for Non-Player Characters:
CONCLUSION Development of EmoBot for Quake II". In Proceedings of the
19th National Conference on Artificial Intelligence (AAAI04),
San Jose, California, 954-955.
Reinforcement learning, while popular in the robotics field, Johnson, D. and Wiles, J. 2001. "Computer Games with
has never found its way through to the video game AI Intelligence". 10th IEEE International Conference on Fuzzy
applications. The project proved reinforcement learning systems. Melbourne, Australia.
algorithms are suitable and useful for simulating intelligence Jouffe, L. 1998. "Fuzzy Inference System Learning by
of NPCs and agents within a game. The implemented Reinforcement Methods". IEEE Transactions On Systems, Man,
learning bot exhibited interesting capabilities to adapt to a And Cybernetics.
human player playing style in addition to outperforming the Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996.
other bots. "Reinforcement Learning: A Survey". Journal of Artificial
Intelligence Research 4 , 237-285.
Testing results showed that incorporating learning into a Laird, J. E. 2001. "It Knows What You‟re Going To Do: Adding
game can reduce the extensive scripting and tuning phase Anticipation to a Quakebot". Agents 2001 conference.
usually following the AI development, while retaining the Laird, J. E.; Newell, A.; and Rosenbloom, P. S. 1987. "Soar: An
ability to guide the NPCs not to exhibit totally unrealistic or architecture for general intelligence". Artificial Intelligence, 33,
unexpected behaviors. This gives the game designers the 1-64.
prospect to explore new strategies. Lin, L.-J. 1992. "Self-Improving Reactive Agents Based On
The effect of the exploration-exploitation policy on the Reinforcement Learning, Planning and Teaching". Machine
learning convergence was studied and tested in depth. The Learning , 8, 293-321.
combination between the two most popular reinforcement Nilsson, N. J. 1998. "STRIPS Planning Systems". In Artificial
Intelligence: A New Synthesis, 373-400
algorithms (Q-Learning and TD (λ)) resulted in faster Orkin, J. 2006. "3 States and a Plan:The AI of F.E.A.R". Game
learning rates and realistic behavior through the exploration Developers Conference GDC.
period. Ponsen, M.; Spronck, P.; and Tuyls, K. 2006. "Hierarchical
Reinforcement Learning in Computer Games". ALAMAS’06
REFERENCES Adaptive Learning Agents and Multi-Agent Systems, 49–60.
Brussels: Vrije Universiteit.
Alcalá, R.; Casillas, J.; Cordón, O.; Herrera, F.; and Zwir, S. J. Saffiotti, A.; Ruspini, E. H.; and Konolige, K. 1993. "Blending
2000. "Learning and tuning fuzzy rule-based systems for Reactivity and Goal-Directedness in a Fuzzy Controller". In
linguistic modeling". Knowledge-based Systems: Computer Proceedings of the Second IEEE Conference on Fuzzy Systems,
techniques 3 (29), 889–941. San Francisco, CA, 134-139.
Berenji, H. R. 1996. "Fuzzy Q-Learning for Generalization of Sutton, R. S. 1988. "Learning to Predict by the Methods of
Reinforcement Learning". In Proceedings of the Fifth IEEE Temporal Differences". Machine learning, 3, 9-44.
International Conference on Fuzzy Systems, 2208-2214. Sutton, R. S. 1996. "Generalization in reinforcement learning:
Berenji, H. R.; and Khedkar, P. 1992. "Learning and Tuning Fuzzy Successful examples using sparse coarse coding". Advances in
Logic Controllers Through Reinforcements". IEEE Neural Information, 8.
Transactions on Neural Networks, 3, 724-740. Sutton, R. S. and Barto, A. G. 1998. Reinforcement Learning: An
Berry, D. A.; and Fristedt, B. 1985. Bandit Problems: Sequential Introduction. MIT Press, Cambridge, MA.
Allocation of Experiments. Prentice-Hall, Englewood Cliffs, Van Waveren, J. 2001. "The Quake III Arena Bot". University of
NJ. Technology Delft.
Bourg, D. M.; and Seeman, G. 2004. AI for Game Developers. Wang, X.-S.; Cheng, Y.-H.; and Yi, J.-Q. 2007. "A fuzzy Actor–
O'Reilly , California. Critic reinforcement learning network". Information Sciences,
Brooks, R. A. 1986. "A robust layered control system for a mobile 177, 3764–3781.
robot". IEEE Journal of Robotics and Automation, 14-23. Watkins, C. J. and Dayan, P. 1992. "Q-learning". Machine
Brooks, R. A. 1991. "Intelligence without representation". Artificial Learning, 8 (3), 279-292.
Intelligence 47 , 139-159. Zadeh, L. A. 1965. "Fuzzy sets". Information and Control, 8, 338-
Campbell, M. 1999. "Knowledge Discovery in Deep Blue". 353.
Communication of the ACM , 42 (11), 65-67. Zanetti, S. and El Rhalibi, A. 2004. "Machine learning techniques
Champandard, A. J. (2003). AI Game Development: Synthetic for FPS in Q3". In Proceedings of the 2004 ACM SIGCHI
Creatures with Learning and Reactive Behaviors. New Riders International Conference on Advances in computer
Publishing. entertainment technology, Singapore, 239 - 244.

© EUROSIS-ETI

You might also like