
Deep Imitation Learning for Playing Real Time Strategy Games

Jeffrey Barratt, Stanford University, 353 Serra Mall, jbarratt@cs.stanford.edu
Chuanbo Pan, Stanford University, 353 Serra Mall, chuanbo@cs.stanford.edu

Abstract

Competitive Computer Games, such as StarCraft II, remain a largely unexplored and active application of Machine Learning, Artificial Intelligence, and Computer Vision. These games are highly complex, as they typically 1) involve incomplete information, 2) include multiple strategies and elements that usually happen concurrently, and 3) run in real time. For this project, we dive into a minigame for StarCraft II that involves many engagement skills, such as focus fire, splitting, and kiting, to win battles. This paper details the implementation of an algorithm using behavioral cloning, a subset of imitation learning, to tackle the problem. Human expert replay data is used to train different systems that are evaluated on the minigame. Supervised learning, Convolutional Neural Networks, and combined loss functions are all used in this project. While we have created an agent that shows some basic understanding of the game, the strategies it performs are rather primitive. Nevertheless, this project establishes a useful framework for future expansion. (This project was completed in tandem with a related CS221 project.)

1. Introduction

1.1. Background and Motivation

Competitive Computer Games, despite recent progress in the area, still remain a largely unexplored application of Machine Learning, Artificial Intelligence, and Computer Vision. Board games have long been the standard for advancements in game playing. However, the structure of these games is limited; they are for the most part turn-based, full-information games in which two players alternate moves and know the full state of the game. This has limited application to the real world, which operates in real time and where it is impractical for one computer to know the whole state of the world.

Thus, Real Time Strategy (RTS) games such as StarCraft II provide an ideal testing environment for artificial intelligence and machine learning techniques, as they are played in real time and the player cannot see the whole battlefield at once (incomplete information). Players must balance making units, controlling those units, executing a strategy, and hiding information from their enemy to successfully win a game of StarCraft II.

For our project, we take as input the current state of the game. Using the PySC2 library released by DeepMind, we can represent game states as "images", where each layer contains spatial information about the map (health, location, etc.). We use a Convolutional Neural Network (CNN) to produce outputs that represent the optimal action for the given input state.

1.2. Joint Project with CS221

While this paper will be submitted solely to CS229, we have used the same general infrastructure for both projects. However, we have applied different techniques and models to the problem between the two classes: deep reinforcement learning for CS221 and deep imitation learning for CS229. These models each present their own difficulties (data collection, algorithms, convergence, etc.).

2. Related Work

Games have long been the study of researchers in computer science, and machine learning has recently grown in popularity as an approach to these problems, with computer vision and CNNs among the most frequently applied methods. For example, Go, an ancient Chinese board game, has gained popularity because of DeepMind's work in the field, with their recent success AlphaGo [5] being able to beat the best player in the world. In fact, some research has looked into playing the game without exploring the game tree at all [1], an aspect of the game previously thought to require extensive state exploration. Beyond Go, there has been recent news about AlphaZero, a general reinforcement learning player that was able to beat top bots in Chess, Shogi, and even Go [6] using the same learning infrastructure and model for all games.

For real-time games, much work has been done, most popularly on Atari games by DeepMind [4], who used a Deep Q-learning approach with experience replay.

StarCraft: Brood War is the predecessor to StarCraft II and remains a popular AI testing environment. Much work has been done on this game because of its accessibility and APIs. Approaches to playing the game include work on macro-management [3], the management of all unit-creating and resource-gathering units and structures, and on grouping units for movement around the map [7]. However, virtually no computer vision techniques have been applied to the game despite the large amount of work and infrastructure, including an agent tournament circuit, created for it.

Despite this, little work has been done within the space of StarCraft II, limited to a few papers on datasets for macro-management [9] and on using logistic regression to evaluate game balance [10]. These papers do not propose or evaluate any strategies for actually playing the game; they only give meta-analysis of the game itself in the form of datasets or regression. This lack of work is due to the relative difficulty of working on the game: until a few months ago, there was no available interface to the game, and its creators worked hard to ensure there was no entry point that could enable cheating by human players.

The only paper we could find on this topic is DeepMind's own paper [8], which introduced PySC2 [2], an interface between StarCraft II and Python, and also provided some baseline reinforcement learning algorithms. These algorithms did not perform very well on the various minigames proposed in the paper and available from PySC2. However, that paper was simply an introduction to the game and the environment itself, and StarCraft II is an active area of research for DeepMind, so further progress and publications are expected on this front.

3. Problem Definition

Because StarCraft II is a game comprised of multiple elements, we have decided to focus on only a certain aspect of the game. This allowed us to set up our framework more easily so we can work on more complicated tasks in the future. Specifically, we focused on the DefeatZerglingsAndBanelings minigame to model the complexities of battling with a group of Marines. At the start, the agent is given 9 preselected Marines and must defeat 6 Zerglings and 4 Banelings placed on the other side of the map. The agent can see the entire map, and no 'Fog of War' mechanism is in place. When the agent defeats all the Zerglings and Banelings, a new group is respawned (6 and 4 of each, respectively) and the agent is given 4 additional Marines at full health. Destroyed Marines are not respawned, and the remaining Marines do not recover any health. This cycle continues for 120 seconds or until the agent loses all its Marines. The agent is rewarded 5 points for each enemy defeated and loses one point for each Marine lost.

Figure 1: Lack of splitting leads to quick defeat. (a) When the marines are bunched up, they take splash damage from the baneling's dangerous area-of-effect (AOE) attack. (b) When the banelings (AOE melee attackers) are allowed to connect with many marines, it is much more difficult to win.

The approach shown in Figure 1 is suboptimal for the problem at hand; a smarter approach is needed, as shown in Figure 2. Not only is this one good way to improve the outcome of this particular map, it is also a good way to demonstrate many skills in the game of StarCraft II: unit management and prioritization, in addition to precision with unit control. We set out to design and build an imitation learning algorithm to tackle this problem, described in more detail below.

Figure 2: The second part of a skirmish between Zerglings and Banelings versus Marines, including the use of splitting and kiting to keep more marines alive. (a) The starting position of the minigame, with 9 marines and 10 enemies. (b) The marines can start out by attacking the more dangerous banelings first. (c) Splitting the marines into smaller groups mitigates the effect of the splash damage inflicted by the AOE banelings, and takes precise mouse movement. (d) Since zerglings are melee units, kiting (alternating between retreating and attacking with ranged units) can be used to trade more efficiently.

4. Dataset and Features

4.1. Data Collection

As with traditional supervised learning, we needed a sufficient amount of data to train our model. For games such as StarCraft II, this means collecting replay data. However, since this minigame is new and was created for deep learning purposes (specifically deep reinforcement learning, which is the topic of our CS221 counterpart), replay data was not readily available. Therefore, we collected our own replays by using the PySC2 library to record a human agent playing the minigame. Overall, we have so far collected around 10,000 frame-action pairs worth of data (after pre-processing).

4.2. Data Preprocessing

StarCraft II replay files are binary files that cannot be trained on directly. Therefore, we used the PySC2 library to establish an interface between Python data structures and StarCraft II raw data at each time step. The current state of the game is represented as an 84 x 84 x 17 feature tensor and is associated with a ground-truth label representing the action to take. However, we still cannot train on this directly; we need to process the ground-truth labels further. Internally, PySC2 represents actions as function calls. For example, the select box action (which selects a rectangle on the screen) can be represented as:

func select box(x, y, height, width);

There are hundreds of possible actions, each with a variable number of parameters. Sometimes a replay contains actions that have no effect on the game but exist for logging or human convenience. Therefore, for the scope of this project, we focus only on the five actions shown in Table 1; for the purposes of our minigame, these are the only actions necessary. Our parser works by taking in a valid action and transforming the function-parameter representation into a 5 x 5 sparse matrix. The first column is a one-hot vector representing the action, whereas the following four columns each represent a parameter, if necessary. Additionally, we wrote a reverse-parser capable of transforming our network's output back into a function-parameter format for StarCraft II. For convenience, we saved the pre-processed data as NumPy arrays for easier access.

Action        Parameters    Notes
noop          NONE          Do Nothing
select box    x, y, w, h    Multi-Select
select        x, y          Single Select
move          x, y          Move To (x, y)
attack        x, y          Move and Attack

Table 1: A description of the actions used in this project.
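To make the 5 x 5 encoding concrete, the following sketch shows one way the parser and reverse-parser described above could be written in Python with NumPy. The action ordering, the underscored names, and the exact placement of the parameters within the matrix are our own illustrative assumptions; the project's actual parser may differ.

import numpy as np

# Fixed, illustrative ordering of the five actions from Table 1.
ACTIONS = ["noop", "select_box", "select", "move", "attack"]
N_PARAMS = {"noop": 0, "select_box": 4, "select": 2, "move": 2, "attack": 2}

def encode_action(name, params):
    """Parse (action name, parameter list) into a 5 x 5 sparse matrix.

    Column 0 is a one-hot over the five actions; the remaining four
    columns of the chosen action's row hold its parameters.
    """
    mat = np.zeros((5, 5), dtype=np.float32)
    row = ACTIONS.index(name)
    mat[row, 0] = 1.0
    for j, p in enumerate(params):
        mat[row, j + 1] = p
    return mat

def decode_action(mat):
    """Reverse-parse a 5 x 5 matrix back into (action name, parameters)."""
    row = int(np.argmax(mat[:, 0]))
    name = ACTIONS[row]
    return name, mat[row, 1:1 + N_PARAMS[name]].tolist()

# Example: a multi-select box at (20, 30) of size 10 x 10.
matrix = encode_action("select_box", [20, 30, 10, 10])
assert decode_action(matrix) == ("select_box", [20.0, 30.0, 10.0, 10.0])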
4.3. Features

Unlike actions, which need to be transformed into network-compatible matrices, the map data provided by PySC2 can be used directly. Figure 3 shows the features provided by the PySC2 interface. We see that the information is contained spatially; therefore, it is intuitive to approach these features with Convolutional Neural Networks.

Figure 3: Features and their image representations.

5. Methods

As Figure 3 shows, the feature information is contained spatially within the state, so it is intuitive to use CNNs on these features. The parameter-sharing property of a CNN is also useful given that Marines and enemy units can appear in different locations.

5.1. Vanilla Model

The initial version of our model is shown in Figure 4. We have four convolution layers, each with a ReLU activation; two batch normalization layers, one at the start and one in the middle; and one fully-connected layer at the end. This network outputs a 5 x 5 action matrix. The action with the highest score is chosen, and its respective parameters are selected.

Figure 4: The vanilla actor model.
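Since the paper does not include the network code or name a deep learning framework, the following is a minimal PyTorch sketch consistent with the description above: four convolutions with ReLU activations, batch normalization at the start and in the middle, and a final fully-connected layer producing a 5 x 5 output. The channel widths, kernel sizes, and strides are assumptions made only so the example runs.

import torch
import torch.nn as nn

class VanillaActor(nn.Module):
    """CNN that maps an 84 x 84 x 17 PySC2 state to a 5 x 5 action matrix."""

    def __init__(self, in_channels=17):
        super().__init__()
        self.features = nn.Sequential(
            nn.BatchNorm2d(in_channels),                            # batch norm at the start
            nn.Conv2d(in_channels, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.BatchNorm2d(64),                                     # batch norm in the middle
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Spatial size shrinks 84 -> 42 -> 21 -> 11 -> 6 through the strided convs.
        self.fc = nn.Linear(32 * 6 * 6, 25)

    def forward(self, x):
        # x: (batch, 17, 84, 84); returns (batch, 5, 5) action matrices.
        h = self.features(x).flatten(start_dim=1)
        return self.fc(h).view(-1, 5, 5)

if __name__ == "__main__":
    state = torch.zeros(1, 17, 84, 84)
    print(VanillaActor()(state).shape)   # torch.Size([1, 5, 5])

At playout time, the highest-scoring action entry is taken and the reverse-parser from Section 4.2 converts the 5 x 5 output back into a game action.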

We defined the loss function as a combination of the softmax cross-entropy loss on the action column and the mean squared loss on the parameters. The softmax function outputs probabilities for each action, and the cross-entropy term measures the deviation from the ground truth; the mean squared loss is simply an average of squared Euclidean distances. The loss is shown below, where y^{(i)} is the 5 x 5 label matrix for training example i, y_c^{(i)} is its c-th column, y_{1,k}^{(i)} is the k-th entry of its first (action) column, and \hat{y}^{(i)} is the corresponding network output:

L = \frac{1}{m}\left( \sum_{i=1}^{m} \sum_{c=2}^{5} \left\| y_c^{(i)} - \hat{y}_c^{(i)} \right\|_2^2 \;-\; \sum_{i=1}^{m} \sum_{k=1}^{5} y_{1,k}^{(i)} \log \hat{y}_{1,k}^{(i)} \right)

We also opted to use the relatively new batch normalization technique during training. As data passes through the network, activations can become extremely large or small, an issue known as internal covariate shift. Batch normalization accounts for this by adjusting layer outputs, which helps lead to better training and improved accuracy. Lastly, we used regularization and compared regularized and non-regularized models.
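For illustration, the combined loss defined above could be written as follows in PyTorch, under the same layout assumptions as the earlier sketches (batches of 5 x 5 matrices whose first column carries the action and whose remaining columns carry parameters). The mean reductions used here normalize slightly differently from the written formula; this is a sketch of the stated loss, not the project's exact implementation.

import torch
import torch.nn.functional as F

def combined_loss(pred, target):
    """Softmax cross-entropy on the action column plus MSE on the parameters.

    pred, target: (batch, 5, 5) tensors laid out as in Section 4.2
    (column 0: action logits / one-hot label, columns 1-4: parameters).
    """
    action_logits = pred[:, :, 0]                  # (batch, 5)
    action_labels = target[:, :, 0].argmax(dim=1)  # one-hot label -> class index
    ce = F.cross_entropy(action_logits, action_labels)
    mse = F.mse_loss(pred[:, :, 1:], target[:, :, 1:])
    return ce + mse

# Quick check with random predictions and all-"noop" labels.
pred = torch.randn(8, 5, 5)
target = torch.zeros(8, 5, 5)
target[:, 0, 0] = 1.0
print(combined_loss(pred, target).item())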
5.2. Model With Previous Actions

Our next model involved passing in the previous action as an input along with the current state. The motivation is that the Vanilla Network is completely reflex based: it takes a state and outputs an action without any context of previous actions. In StarCraft II, it is common to play certain actions in a sequence; for example, if the previous action was a selection, it is more likely that the following action is an attack or a move.

The structure of the Vanilla Network is completely maintained. The previous 5 x 5 action matrix is flattened to a vector in R^25 and fed through two fully-connected layers with 100 hidden units each. The output is also a vector in R^25, which is reshaped into a 5 x 5 action matrix and appended to the CNN output. We did not use convolutions for the actions because there is no spatial information to learn from; a simple fully-connected network (FCN) sufficed.
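A minimal sketch of this extension follows, reusing a CNN such as the VanillaActor sketched in Section 5.1 (passed in as cnn). The two 100-unit hidden layers are as stated above; how the two 5 x 5 outputs are merged ("appended") is not fully specified in the text, so the sketch simply sums them, which is one possible reading.

import torch.nn as nn

class PrevActionActor(nn.Module):
    """Vanilla CNN plus a fully-connected branch over the previous action."""

    def __init__(self, cnn):
        super().__init__()
        self.cnn = cnn                    # e.g. the VanillaActor sketched earlier
        self.prev_fc = nn.Sequential(     # 5 x 5 -> 25 -> 100 -> 100 -> 25
            nn.Flatten(),
            nn.Linear(25, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 25),
        )

    def forward(self, state, prev_action):
        # state: (batch, 17, 84, 84); prev_action: (batch, 5, 5)
        cnn_out = self.cnn(state)                              # (batch, 5, 5)
        prev_out = self.prev_fc(prev_action).view(-1, 5, 5)    # (batch, 5, 5)
        return cnn_out + prev_out   # merge of the two branches (our assumption)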
5.2.1 Variation on Number of Hidden Layers

Additionally, we experimented with the number of hidden convolutional and fully-connected layers in the network. Specifically, we tried a "shallow" network that uses only one convolution layer and one fully-connected layer, as well as a "deep" network that uses five of each. The idea was to analyze the tradeoff between faster reflexes and more "thinking", which matters in a real-time game.

6. Experiments

6.1. Model Evaluation

We evaluated our networks through a combination of several techniques. Our primary test involved performing playouts on the minigame, which lets us gauge the practical performance of a model. To do this, we wrote a custom Python script that gathers score data at the end of every episode. In order to evaluate performance during the learning process, we ran these playouts after every epoch.

Since our problem can be divided into a classification problem (for actions) and a regression problem (for parameter selection), our next set of evaluations involved measuring the accuracy of selecting the correct action and of choosing the correct parameters, respectively. These results provide insight into the training process and let us know whether training is actually succeeding.

For accuracy and loss evaluation, we also used a validation set, which we ran at the end of every epoch. This allowed us to compare performance between training and validation and gave us insight into whether or not we were overfitting to our training set. Our train-validation-test split was roughly 80-10-10.

6.2. Hyperparameter Selection

We chose our learning rate by running minibatch SGD and examining the learning curve on just the training set. An example of these learning curves is shown in Figure 5. For all our networks, we found that a learning rate of around 0.001 was sufficient for around 30 epochs worth of training and gave consistent results, as opposed to lower rates.

Figure 5: Learning Rates (Vanilla Network).

Due to time limitations and the use of personal GPUs, we capped our mini-batch size at 25. We selected a regularization parameter of λ = 0.1 as a preliminary value with which to experiment; batch normalization also has regularizing effects.
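Putting the pieces together, a training loop matching the choices described in this section (minibatch SGD, learning rate 0.001, batch size 25, roughly 30 epochs) might look like the sketch below, assuming a combined_loss like the one sketched earlier. Treating λ = 0.1 as an L2 weight-decay term on the optimizer is our assumption, since the paper does not state how the regularizer was applied; the per-epoch playout and validation evaluation is only indicated by a comment.

import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, states, labels, epochs=30, lr=1e-3, batch_size=25, weight_decay=0.1):
    """Minibatch SGD over (state, 5 x 5 label) pairs using the combined loss."""
    loader = DataLoader(TensorDataset(states, labels),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                weight_decay=weight_decay)  # lambda as L2 (assumed)
    for epoch in range(epochs):
        running = 0.0
        for x, y in loader:
            optimizer.zero_grad()
            loss = combined_loss(model(x), y)   # vanilla model: state -> 5 x 5 output
            loss.backward()
            optimizer.step()
            running += loss.item() * x.size(0)
        print(f"epoch {epoch}: train loss {running / len(loader.dataset):.4f}")
        # After each epoch, the paper also runs playouts on the minigame (via
        # PySC2) and evaluates the held-out validation set; omitted here.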

7. Results and Analysis

Figure 6 shows average playout performance after every epoch. We observe that all of our models exhibit some form of learning: a random agent produces a score of around 21.664 over 155 games, and all of our networks beat this baseline. However, after around 10 epochs training stagnates significantly; judging by the error bars, most models maintain the same level of performance from iteration to iteration. As a result, none of the models were able to reach the average of 242.885 achieved by the human expert.

Figure 6: Playout results after every epoch.

To see why, it is best to observe what the agent was doing once it "converged". Figure 7 depicts a common strategy that the agent would consistently take, which is very similar to the strategy shown in Figure 1: the agent typically would not split after attacking. A possible explanation involves analyzing the starting pattern. At the start of each episode, the agent's Marines are laid out in a very predictable way, and, as seen in Figure 2, it is customary to start the episode by charging. Therefore, going from a relatively consistent state to charging is expected. After charging, however, the states become inconsistent and more varied. Although moves and actions are deterministic, outcomes are not, as there is some randomness when firing, so outcomes can diverge rather quickly. The agent was able to imitate how to initiate the battle, but was unable to go any further.

Figure 7: Charging without splitting.

This points to a bigger issue, namely overfitting. As seen in Table 2 and Table 3, all our models show characteristics of high variance. Unsurprisingly, deeper networks tended to overfit more. In terms of playout performance, taking previous actions into account was beneficial in raising overall scores, largely because the Vanilla Network occasionally got stuck in an infinite loop, with the agent going back and forth between two states.

Network        Reg    Train    Val      Test
VanillaNet     N      0.994    0.649    0.671
VanillaNet     Y      0.994    0.753    0.732
PrevAction     N      0.978    0.800    0.751
PrevAction     Y      0.904    0.691    0.718
PrevActTiny    Y      0.966    0.748    0.708
PrevActDeep    Y      0.938    0.682    0.647

Table 2: Action Accuracy Results (Accuracy Evaluation)

Network        Reg    Train     Val       Test
VanillaNet     N      4.719     139.589   125.173
VanillaNet     Y      7.826     100.374   112.902
PrevAction     N      14.149    90.891    103.006
PrevAction     Y      29.466    123.232   137.216
PrevActTiny    Y      18.589    102.945   101.718
PrevActDeep    Y      20.5747   126.859   124.518

Table 3: Final Parameter Loss Results (Parameter Evaluation)

Given that our issue appears to be high variance, we should look into gathering more data for this minigame, as 10,000 state-action pairs does not seem to be enough. Since no professional replays exist and we had to gather data on our own time, we were limited in this regard.

8. Conclusion and Future Work

We end our report with some general observations. Overall, we are not near the maximum potential of this approach. Although the agent learned how to find and attack the enemies, it did not achieve an advanced understanding of splitting and kiting. Additionally, the agent sometimes gets stuck moving back and forth between two states. That said, we are excited about what we can potentially achieve: just by adding in previous actions, we were able to slightly increase performance. We hope to analyze features and develop better heuristics to improve performance even more.

With more time and resources, we would consider evaluating more advanced models, such as a Long Short-Term Memory Recurrent Neural Network (LSTM RNN), to help determine actions, especially in sequences. We would work towards refining our abstraction to ensure stability during training. Finally, we would recruit more people to gather more training data to cover more possible cases, especially those that exhibit the desired behavior.

References

[1] J. Barratt and C. Pan. Playing Go without game tree search using convolutional neural networks. 2016.
[2] Google DeepMind. PySC2 - StarCraft II Learning Environment, 2017-. [Online].
[3] N. Justesen and S. Risi. Learning macromanagement in StarCraft from replays using deep learning. 2017.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. 2013.
[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, January 2016.
[6] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 2017.
[7] G. Synnaeve and P. Bessiere. A dataset for StarCraft AI & an example of armies clustering. 2012.
[8] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, and R. Tsing. StarCraft II: A new challenge for reinforcement learning. 2017.
[9] H. Wu, J. Zhang, and K. Huang. MSC: A dataset for macro-management in StarCraft II. 2017.
[10] H. Yun. Using logistic regression to analyze the balance of a game: The case of StarCraft II. 2011.

9. Contributions

Name              SUNet
Jeffrey Barratt   jbarratt
Chuanbo Pan       chuanbo

Although Jeffrey has significantly more experience with StarCraft II, the majority of the work involved in this project does not require knowledge of StarCraft II concepts. The only exception is data gathering, which involved Jeffrey, an experienced StarCraft II player, playing the minigame repeatedly until enough data was gathered. Aside from that, both authors contributed equally to the progress of the project, often working at the same time through TeamViewer.

10. Combined Project

As per Piazza Post @2367, we will immediately release our CS221 paper, titled Deep Reinforcement Learning for Playing Real Time Strategy Games, to the Teaching Assistants upon request.
