Deep Imitation Learning For Playing Real Time Strategy Games
For real-time games, much work has been done, most popularly on Atari games by DeepMind [4], where they used a deep Q-learning approach with experience replay.

StarCraft: Brood War is the predecessor to StarCraft II and remains a popular AI testing environment. Much work has been done on this game because of its accessibility and APIs. Approaches to playing the game include work on macro-management [3], the management of all unit-creating and resource-gathering units and structures, and on grouping units for movement around the map [7]. However, virtually no computer vision techniques have been applied to the game despite the large amount of work and infrastructure, including an agent tournament circuit, created for it.

Despite the work mentioned above, little has been done within the space of StarCraft II, limited to a few papers on datasets for macro-management [9] and on using logistic regression to evaluate game balance [10]. These papers do not propose or evaluate any strategies for actually playing the game; they only give meta-analysis of the game itself in the form of datasets or regression. This lack of work is due to the relative difficulty of working on the game: until a few months ago, there was no available interface to the game, and its creators worked hard to make sure there was no entry point that would enable cheating by human players.

The only paper we could find on this topic is DeepMind's own paper [8], which introduced PySC2 [2], an interface between StarCraft II and Python, and also provided some baseline reinforcement learning algorithms. These algorithms did not perform very well on the various minigames proposed in the paper and available from PySC2. However, that paper was simply an introduction to the game and the environment itself, and StarCraft II is an active area of research for DeepMind, so further progress and publications are expected on this front.

3. Problem Definition

Because StarCraft II is a game composed of multiple elements, we decided to focus on only a certain aspect of the game. This allowed us to set up our framework more easily so that we can work on more complicated tasks in the future. Specifically, we focused on the DefeatZerglingsAndBanelings minigame to model the complexities of battling with a group of Marines. At the start, the agent is given 9 preselected Marines and must defeat 6 Zerglings and 4 Banelings placed on the other side of the map. The agent can see the entire map, and no 'Fog of War' mechanism is in place. When the agent defeats all the Zerglings and Banelings, a new group is respawned (6 Zerglings and 4 Banelings) and the agent is given 4 additional Marines at full health. Destroyed Marines are not respawned, and the remaining Marines do not recover any health. This cycle continues for 120 seconds or until the agent loses all its Marines. The agent is rewarded 5 points for each enemy defeated and loses one point for each Marine lost.

Figure 1: Lack of splitting leads to quick defeat. (a) When the marines are bunched up, they take splash damage from the baneling's dangerous area-of-effect (AOE) attack. (b) When the banelings (AOE melee attackers) are allowed to connect with many marines, it is much more difficult to win.

The approach shown in Figure 1 is suboptimal for the problem at hand; a smarter approach is needed, as shown in Figure 2. Not only is splitting a good way to improve the outcome on this particular map, it is also a good way to demonstrate many skills in the game of StarCraft II: unit management and prioritization, in addition to precision with unit control. We set out to design and build an imitation learning algorithm to tackle this problem, described in more detail below.

Figure 2: The second part of a skirmish between Zerglings and Banelings versus Marines, including the use of splitting and kiting to keep more marines alive. (a) The starting position of the minigame, with 9 marines and 10 enemies. (b) The marines can start out by attacking the more dangerous banelings first. (c) Splitting the marines into smaller groups mitigates the effect of the splash damage inflicted by the AOE banelings, and takes precise mouse movement. (d) Since zerglings are melee units, kiting (alternating between retreating and attacking with ranged units) can be used to trade more efficiently.
4. Dataset and Features

4.1. Data Collection

As with traditional supervised learning, we needed a sufficient amount of data in order to train our model. For games such as StarCraft II, this would involve collecting replay data. However, since this minigame is new and was created for deep learning purposes (specifically Deep Reinforcement Learning, which is the topic of our CS 221 counterpart), replay data was not readily available. Therefore, we had to collect our own replays by using the PySC2 library to record the replays of a human agent playing the minigame. Overall, we have so far collected around 10,000 frame-action pairs worth of data (after pre-processing).

4.2. Data Preprocessing

StarCraft II replay data files are binary files that can't be trained on directly. Therefore, we used the PySC2 library to establish an interface between Python data structures and StarCraft II raw data at each given time step. The current state of the game is represented as an 84 × 84 × 17 feature tensor, and is associated with a ground-truth label representing the action to take. However, we still cannot train on this directly; we need to process the ground-truth labels even further. Internally, PySC2 represents actions as function calls. For example, the select_box action (which selects a rectangle on the screen) can be represented as:

func select_box(x, y, height, width);

There are hundreds of possible actions, each with a variable number of parameters. Sometimes, a replay will contain actions that have no effect on the game but exist only for logging or human convenience. Therefore, for the scope of this project, we focus only on the five actions shown in Table 1; for the purpose of our minigame, these are the only actions necessary. Our parser works by taking in a valid action and converting it into a form the network can train on (a sketch of one such encoding follows Table 1).

Action       Parameters   Notes
noop         NONE         Do Nothing
select_box   x, y, w, h   Multi-Select
select       x, y         Single Select
move         x, y         Move To (x, y)
attack       x, y         Move and Attack

Table 1: A description of the actions used in this project.
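To make the parser concrete, the following is a minimal sketch of one way to encode a parsed action as the 5 × 5 label matrix used by the networks in Section 5. The layout (the first row as a one-hot vector over the five actions, the remaining rows holding the chosen action's parameters) is our reading of the loss in Section 5.1, and the helper name encode_action is illustrative rather than part of PySC2.

import numpy as np

# Fixed ordering of the five actions from Table 1 (the ordering itself is an assumption).
ACTIONS = ["noop", "select_box", "select", "move", "attack"]

def encode_action(name, params):
    """Encode a parsed action and its parameter list as a 5x5 label matrix.

    Row 0 is a one-hot vector over the five actions; rows 1-4 hold that
    action's parameters (e.g. x, y, w, h for select_box) in the chosen
    action's column, left as zero when fewer than four parameters exist.
    """
    label = np.zeros((5, 5), dtype=np.float32)
    col = ACTIONS.index(name)
    label[0, col] = 1.0
    for row, value in enumerate(params, start=1):
        label[row, col] = float(value)
    return label

# Example: a select_box of width 10 and height 12 anchored at (20, 30).
label = encode_action("select_box", [20, 30, 10, 12])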
4.3. Features

Unlike actions, which need to be transformed into network-compatible matrices, the map data provided by PySC2 can be used directly. Figure 3 shows the features provided by the PySC2 interface. We see that the information is contained spatially. Therefore, it is intuitive to use Convolutional Neural Networks to approach these features.

Figure 3: Features and their image representations

5. Methods

Figure 3 also shows us that the feature information is contained spatially within the state. Therefore, it is intuitive to use CNNs to approach these features. The parameter-sharing properties of a CNN are also useful in the context of Marines and enemy units being in different locations.

5.1. Vanilla Model

The initial version of our model is shown in Figure 4. We have four convolution layers, each with a ReLU activation layer, two batch normalization layers (one at the start and one in the middle), and one fully-connected layer at the end. This network outputs a 5 × 5 action matrix. The action with the highest score is chosen, and its respective parameters are selected.
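For reference, below is a minimal PyTorch sketch of an architecture matching this description; the framework choice, channel counts, and kernel sizes are our assumptions, since only the layer types and counts are specified above.

import torch.nn as nn

class VanillaNet(nn.Module):
    """Sketch of the vanilla model: four convolution layers with ReLU
    activations, batch normalization at the start and in the middle,
    and one fully-connected layer producing the 5x5 action matrix."""

    def __init__(self, in_channels=17, screen_size=84):
        super().__init__()
        self.features = nn.Sequential(
            nn.BatchNorm2d(in_channels),                   # batch norm at the start
            nn.Conv2d(in_channels, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.BatchNorm2d(32),                            # batch norm in the middle
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * screen_size * screen_size, 25)

    def forward(self, screen):                             # screen: (batch, 17, 84, 84)
        x = self.features(screen)
        return self.fc(x.flatten(1)).view(-1, 5, 5)        # the 5x5 action matrix

Under the label encoding sketched in Section 4.2, choosing the action then amounts to an argmax over the first row of the output, with that action's parameters read from the remaining rows of the corresponding column.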
The network outputs probabilities for each action, and the loss calculates the deviation from ground truth. The mean squared loss is simply an average of squared Euclidean distances. The loss is shown below, where we denote y_{c,k}^{(i)} as the element in row c, column k of the i-th label matrix and y_c^{(i)} as its c-th row:

\[ L = \frac{1}{m} \left( \sum_{i=1}^{m} \sum_{c=2}^{5} \left\lVert y_c^{(i)} - \hat{y}_c^{(i)} \right\rVert_2^2 \; - \; \sum_{i=1}^{m} \sum_{k=1}^{5} y_{1,k}^{(i)} \log \hat{y}_{1,k}^{(i)} \right) \]
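A direct transcription of this loss, again assuming PyTorch and with the first row of each 5 × 5 matrix holding the action scores, might look as follows:

import torch.nn.functional as F

def combined_loss(y_hat, y):
    """Cross-entropy over the action row plus squared error over the
    parameter rows, averaged over the batch of m examples.

    y_hat, y: tensors of shape (batch, 5, 5); row 0 is the action row
    (one-hot in y, unnormalized scores in y_hat), rows 1-4 are parameters.
    """
    log_probs = F.log_softmax(y_hat[:, 0, :], dim=1)             # log y_hat_{1,k}
    ce = -(y[:, 0, :] * log_probs).sum(dim=1)                    # -sum_k y_{1,k} log y_hat_{1,k}
    sq = ((y[:, 1:, :] - y_hat[:, 1:, :]) ** 2).sum(dim=(1, 2))  # sum_c ||y_c - y_hat_c||^2
    return (ce + sq).mean()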
We also opted to use the relatively new batch normalization technique throughout training. As data passes through the network, activations may become extremely large or small, an issue known as internal covariate shift. Batch normalization accounts for this by adjusting layer outputs, which helps lead to better training and improved accuracy. Lastly, we used regularization and compared regularized versus non-regularized models.

5.2. Model With Previous Actions

Our next model involved passing in previous actions as inputs along with the current state. The motivating reason behind this is that the Vanilla Network is completely reflex-based: it takes a state and outputs an action without any context of possible previous actions. In StarCraft II, it is common to play certain actions in a sequence; for example, given that the previous action was a selection, it is more likely that the following action is attacking or moving.

The structure behind the Vanilla Network is completely maintained. The previous 5 × 5 action matrix is flattened to a vector in R^25 and fed through two fully connected layers with 100 hidden units each. The output is also a vector in R^25 that is reshaped into a 5 × 5 action matrix and appended to the CNN output. We did not use convolutions for the actions because there is no spatial information to learn from; therefore, a simple fully connected network (FCN) sufficed.
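One way to realize this, sketched below in the same assumed PyTorch setting, feeds the previous 5 × 5 action matrix through the fully connected branch and adds the result to the CNN's 5 × 5 output; reading "appended" as element-wise addition (rather than concatenation) and the exact layer sizes are our assumptions.

import torch.nn as nn

class PrevActionNet(nn.Module):
    """Sketch of the model with previous actions: the vanilla CNN is kept
    unchanged, and a small fully connected branch embeds the previous action."""

    def __init__(self, cnn):
        super().__init__()
        self.cnn = cnn                            # e.g. a VanillaNet instance
        self.action_fc = nn.Sequential(           # R^25 -> 100 -> 100 -> R^25
            nn.Linear(25, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 25),
        )

    def forward(self, screen, prev_action):       # prev_action: (batch, 5, 5)
        state_out = self.cnn(screen)                          # (batch, 5, 5)
        prev_out = self.action_fc(prev_action.flatten(1))     # (batch, 25)
        return state_out + prev_out.view(-1, 5, 5)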
5.2.1 Variation on Number of Hidden Layers

Additionally, we experimented with the number of hidden convolutional and fully connected layers in the network. Specifically, we experimented with a "shallow" network, which uses only one convolutional and one fully-connected layer, as well as a "deep" network that uses five of each. The idea was to analyze the tradeoff between faster reflexes and more "thinking", which is important in a real-time game.

6. Experiments

6.1. Model Evaluation

We evaluated our network through a combination of several techniques. Our primary test involved performing playouts on the minigame, which allows us to gauge the practical performance of the model. To do this, we wrote a custom Python script that gathers score data at the end of every episode. In order to evaluate our performance during the learning process, we ran these playouts after every epoch.

Since our problem can be divided into a classification problem (for actions) and a regression problem (for parameter selection), our next set of evaluations involved looking at the accuracy of selecting the correct action and of choosing the correct parameters, respectively. These results provide insight into the training process and let us know whether training is actually successful.

For accuracy and loss evaluation, we also used a validation set, which we ran at the end of every epoch. This allowed us to compare performance between training and validation and gave us insight into whether or not we were overfitting to our training set. Our train-validation-test split was roughly 80-10-10.
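The evaluation script itself is not reproduced here; the following is a minimal sketch of such a playout loop against the PySC2 environment. The constructor arguments follow the current PySC2 API (which has changed across versions), and choose_action is a hypothetical stand-in for running the trained network on the observation and decoding its 5 × 5 output into a PySC2 function call.

from pysc2.env import sc2_env
from pysc2.lib import features

def run_playouts(choose_action, episodes=10):
    """Play full episodes of the minigame and return the score of each one."""
    scores = []
    with sc2_env.SC2Env(
            map_name="DefeatZerglingsAndBanelings",
            players=[sc2_env.Agent(sc2_env.Race.terran)],
            agent_interface_format=features.AgentInterfaceFormat(
                feature_dimensions=features.Dimensions(screen=84, minimap=64)),
            step_mul=8) as env:
        for _ in range(episodes):
            timestep = env.reset()[0]
            total = 0
            while not timestep.last():
                action = choose_action(timestep.observation)  # forward pass + decoding
                timestep = env.step([action])[0]
                total += timestep.reward                      # minigame reward tracks the score
            scores.append(total)
    return scores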
6.2. Hyperparameter Selection

We chose our learning rate by running minibatch SGD and examining the learning curve on just the training set. An example of the learning curves is shown in Figure 5. For all our networks, we found that a learning rate of around 0.001 was sufficient for around 30 epochs' worth of training, and the results were more consistent than with lower rates.

Figure 5: Learning Rates (Vanilla Network)

Due to the time limitations imposed and the use of personal GPUs, we capped our mini-batch size at 25. We selected a regularization parameter of λ = 0.1 as a preliminary value with which to experiment with regularization; batch normalization also has regularizing effects.
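Put together, the configuration above corresponds to a training step along the following lines (PyTorch again being our assumed framework; treating λ as L2 weight decay is one possible reading of the regularizer, not something specified above):

import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, states, labels, epochs=30, lr=1e-3, batch_size=25, weight_decay=0.1):
    """Minibatch SGD with the hyperparameters chosen in Section 6.2."""
    loader = DataLoader(TensorDataset(states, labels),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                weight_decay=weight_decay)
    for _ in range(epochs):
        for screen, y in loader:
            optimizer.zero_grad()
            loss = combined_loss(model(screen), y)  # the loss from Section 5.1
            loss.backward()
            optimizer.step()
    return model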
7. Results and Analysis

Figure 6 shows average performance after every epoch. We observe that all our models exhibit some form of learning. A random agent produces a score of around 21.664 over 155 games; therefore, all our networks pass the baseline of being better than random. However, we note that after around 10 epochs, training seems to stagnate significantly. Judging by the error bars, most models seem to maintain the same level of performance after every iteration. Therefore, none of the models were able to reach the average of 242.885 achieved by the human expert. To see why, it is best to observe what the agent was doing once it "converged." Figure 7 depicts a common strategy that the agent would consistently take, which is very similar to the strategy shown in Figure 1: the agent typically would not split its Marines.

[…] overfit more. In terms of performance, taking into account previous actions was beneficial in raising overall scores. This is due to the Vanilla Network occasionally getting stuck in an infinite-loop scenario, where the agent would go back and forth between two states.
References

[1] J. Barratt and C. Pan. Playing Go without game tree search using convolutional neural networks. 2016.
[2] Google DeepMind. PySC2 – StarCraft II Learning Environment, 2017–. [Online].
[3] N. Justesen and S. Risi. Learning macromanagement in StarCraft from replays using deep learning. 2017.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. 2013.
[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
[6] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 2017.
[7] G. Synnaeve and P. Bessiere. A dataset for StarCraft AI & an example of armies clustering. 2012.
[8] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, and R. Tsing. StarCraft II: A new challenge for reinforcement learning. 2017.
[9] H. Wu, J. Zhang, and K. Huang. MSC: A dataset for macro-management in StarCraft II. 2017.
[10] H. Yun. Using logistic regression to analyze the balance of a game: The case of StarCraft II. 2011.

9. Contributions

Name             SUNet
Jeffrey Barratt  jbarratt
Chuanbo Pan      chuanbo

Although Jeffrey has significantly more experience with StarCraft II, the majority of the work involved with this project does not require knowledge of StarCraft II concepts. The only exception to this is data gathering, which involved Jeffrey, an experienced StarCraft II player, playing the minigame repeatedly until enough data was gathered. Aside from that, both have contributed equally to the progress of the project, often working at the same time through TeamViewer.

10. Combined Project

As per Piazza Post @2367, we will immediately release our CS 221 paper, titled Deep Reinforcement Learning for Playing Real Time Strategy Games, to the Teaching Assistants upon request.