AZUL Report Team 7
1. DEEP Q-LEARNING
Why we chose this method: Q-learning is a simple but powerful algorithm. However, it requires a prohibitively large Q-table when the state and action spaces are large, and in our case both are. Therefore, we tried deep Q-learning, which handles large-scale problems better because a neural network is used to approximate the Q-value function. The basic framework used in the code was based on OpenAI Gym. The formula for updating the parameters in Q-learning is shown below, to introduce the network structure in the next section:
\theta \leftarrow \theta + \alpha \left[ r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right] \nabla_{\theta} Q(s, a; \theta)
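A minimal sketch of this update with a neural-network approximator of the Q-function, written in PyTorch (the layer sizes, learning rate, and number of actions below are illustrative assumptions, not our exact configuration):

    import torch
    import torch.nn as nn

    # Illustrative sizes: 47 features per player, concatenated for both players; a fixed action space.
    STATE_DIM, NUM_ACTIONS, GAMMA, LR = 94, 180, 0.9, 1e-3

    q_net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, NUM_ACTIONS))
    optimizer = torch.optim.SGD(q_net.parameters(), lr=LR)

    def q_learning_update(state, action, reward, next_state):
        """One gradient step of the rule above: theta <- theta + alpha * TD-error * grad_theta Q(s, a)."""
        q_sa = q_net(state)[action]                            # Q(s, a; theta)
        with torch.no_grad():
            target = reward + GAMMA * q_net(next_state).max()  # r + gamma * max_a' Q(s', a'; theta)
        loss = 0.5 * (target - q_sa) ** 2                      # squared TD error; minimising it applies the rule
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()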
• Pattern line (15): ['W', 'Y', 'Y', 'B', 'B', 'B', 'K', 'K', 'K', 0, 'K', 'K', 'K', 'K', 'K']
• Grid (25): ['B', 'Y', 'R', 0, 0, 'W', 'B', 0, 'R', 'K', 'K', 'W', 0, 'Y', 0, 0, 0, 0, 0, 'Y', 0, 'R', 0, 0, 0]
• Floor (7): [-1, -1, -1, -1, 0, 0, 0]
• State_feature (47) = Pattern_line (15) + Floor (7) + Grid (25)
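As an illustration, a minimal sketch of how such a feature vector might be assembled (the colour-to-number mapping and the function name are assumptions for the example, not necessarily our exact encoding):

    # Map tile colours to integers so the feature vector is purely numeric; 0 means an empty slot.
    COLOUR_TO_INT = {0: 0, 'B': 1, 'Y': 2, 'R': 3, 'K': 4, 'W': 5}

    def encode_player_state(pattern_lines, floor_line, grid):
        """Build the 47-d state feature: 15 pattern-line slots + 7 floor slots + 25 grid cells."""
        assert len(pattern_lines) == 15 and len(floor_line) == 7 and len(grid) == 25
        features = [COLOUR_TO_INT[c] for c in pattern_lines]  # Pattern line (15)
        features += list(floor_line)                          # Floor (7): -1 per penalty tile, 0 otherwise
        features += [COLOUR_TO_INT[c] for c in grid]          # Grid (25)
        return features                                       # State_feature (47)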
During the implementation, we found that the information about the factories was not that useful, so it was not included in the game state representation. The player's state features are combined with the opponent's state and the round information.
1.3. REWARDS
The basic reward was player_score - enemy_score. More specifically, we used the difference between the players' score changes within a round to train on that round, and the difference of the final total scores, including bonuses, to train on the replay of the entire game. Therefore, the larger the margin of victory, the higher the reward, and the reward is negative when the player loses. This evaluation was more effective than giving a fixed reward.
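A minimal sketch of this reward scheme (function names are illustrative; the exact bookkeeping in our code may differ):

    def round_reward(player_score_change, enemy_score_change):
        """Per-round reward: the difference of the score changes gained in that round."""
        return player_score_change - enemy_score_change

    def game_reward(player_total, enemy_total):
        """End-of-game reward used when replaying the whole game; totals include end-game bonuses."""
        return player_total - enemy_total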
1.4. RESULTS AND THOUGHTS
The agent beats the naive player with a win rate of 60%, which is not bad but not particularly strong. One likely reason is that the action space is too large to train effectively. We first tried to represent state-action pair features by enumerating only the actions available in each state. However, the set of available actions changes from state to state, which made it hard to update the values of the right outputs in the network.
2. MINIMAX
Why we chose this method: Minimax is a decision rule used in game theory for minimizing the worst-case loss and maximizing the gain. Although the game used for this project, AZUL, can be played with more than two players, there are only two players in the competition. For two-player games, minimax is a good tactic because the game is effectively zero-sum: the two players work towards opposite goals. In this game, no matter how many points we score, we lose if the opponent scores more. Therefore, we thought minimax would be one of the most effective algorithms for this game.
As seen in the figure below, alternating layers have minimizing or maximizing objectives, so that the search can choose the action that maximizes the gain for the current player.
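A minimal sketch of these alternating layers in Python (the GameState interface used here, with get_legal_actions, apply, is_terminal, and evaluate, is an assumption for illustration, not our exact class):

    def minimax(state, depth, maximizing):
        """Return the best achievable evaluation, alternating maximizing and minimizing layers."""
        if depth == 0 or state.is_terminal():
            return state.evaluate()  # heuristic value from the current player's perspective
        values = [minimax(state.apply(a), depth - 1, not maximizing)
                  for a in state.get_legal_actions()]
        return max(values) if maximizing else min(values)

    def choose_action(state, depth):
        """Pick the root action whose minimax value is best for the current (maximizing) player."""
        return max(state.get_legal_actions(),
                   key=lambda a: minimax(state.apply(a), depth - 1, False))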
4. FINAL AGENT
Although we could not test the final agent in the practice tournament, compared to our minimax player from the initial stage it has the following strengths. 1) The search depth was increased from 1 to 4 by pruning implausible actions. 2) Richer evaluation metrics were used on top of the initial ones; a few examples are the number of tiles on the grid and the number of tiles remaining in the pattern line. With these additions, the agent tends to fill a pattern line in one go and to place as many tiles on the grid as possible. One weakness is that the number of actions explored was sacrificed to increase the depth. In addition, although it is unlikely, the agent may time out with bad luck due to the increased depth. Apart from weaknesses and strengths, there were further ideas we implemented (explained in the evaluation section of 3. Minimax) that could not be included in the final version, because they performed worse than we expected and we did not have enough time to test the changed model in the last phase.
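As a hedged illustration of the richer evaluation and pruning described above (the weights, attribute names, and cutoff are assumptions, not our tuned values):

    def evaluate(player, enemy, w_grid=0.5, w_line=0.3):
        """Leaf heuristic: score difference plus board-shape terms (weights are illustrative).

        player/enemy are assumed to expose: score (int), grid (25 cells, 0 = empty),
        pattern_lines (15 slots, 0 = empty).
        """
        score_term = player.score - enemy.score
        grid_term = sum(1 for cell in player.grid if cell != 0)            # tiles already on the grid
        line_term = -sum(1 for slot in player.pattern_lines if slot == 0)  # unfilled pattern-line slots
        return score_term + w_grid * grid_term + w_line * line_term

    def prune_actions(actions, value_fn, keep=8):
        """Keep only the `keep` most promising actions so a depth-4 search fits in the time limit."""
        return sorted(actions, key=value_fn, reverse=True)[:keep]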
5. SELF-REFLECTION
5.1. HONGSANG
5.1.1. WHAT DID I LEARN ABOUT WORKING IN A TEAM?
I learned how to work on a group project. Since I had not had experience with group projects before, participating in group work was a big challenge for me. In this special situation where we could not meet in person, I learned how to collaborate via online team meetings and by utilising tools such as Trello and Slack. Using Google Docs, we wrote down our ideas and gave feedback to each other. Even though not all of this resulted in actual implementation, discussing ideas was itself something I could not have obtained from lectures and tutorials.