Jkuilui Using Learning by Imitation: Dapeng Zhang, Zhongjie Cai, Bernhard Nebel

Research Group on the Foundations of Artificial Intelligence
University of Freiburg
Georges-Köhler-Allee, Geb. 52, Freiburg
Email: zhangd, caiz, nebel@informatik.uni-freiburg.de
ABSTRACT
Tetris is a stochastic and open-end board game. Several artificial players can play Tetris. These players perform well in single games. Our artificial player employs learning by imitation, which is novel in Tetris. The imitation tasks are mapped to a standard data classification problem. The experiments showed that the performance of the player can be significantly improved when our player acquires game skills similar to those of the imitated human. Our player can play Tetris in diverse ways by imitating different players, and has chances to defeat the best-known artificial player in the world. The framework supports incremental learning because the artificial player can find stronger players and imitate their skills.

A. INTRODUCTION
Tetris was first invented by Alexey Pajitnov et al. in 1984, and remains one of the most popular video games today. It can be found on many game consoles and in several PC desktop systems, such as KDE and GNOME.
Tetris is a stochastic and open-end board game. A piece made of blocks is dropped from the top of the board. The piece is randomly chosen from seven predefined ones, and it falls down step by step. The player can move and rotate the current piece to place it in a proper position. A new piece appears at the top of the board after the current one touches the ground. A fully occupied row will be cleared and the blocks above it will automatically fall down one step. The goal of the game is to build as many such rows as possible.
Two players can compete against each other in Tetris. When one player places the current piece to clear n rows, the other player will receive an attack of n − 1 rows, each of which contains n − 1 empty cells. The attacks are pushed into the game board from the bottom, raising all the accumulated blocks by n − 1 steps in the board. The player who has no more space to accommodate the next piece loses the game.
The single Tetris game has been used as a test-bed in artificial intelligence research. Researchers developed artificial players using different approaches [1]. Fahey created a hand-coded player [2], Böhm et al. employed genetic algorithms [3], and Szita et al. used cross-entropy methods in Tetris [4]. These players can play the single game, clearing hundreds of thousands of rows, which would take several weeks or even months for a human player.
The competition in Tetris is certainly an interesting topic. In theory, two-player Tetris is much more complex than the single game [5]. Assuming both the human and the artificial player handle the piece with the same speed, human players can defeat the best artificial player with ease in the competition mode. To our knowledge, the existing artificial players cannot create many attacks in the competitions. Researchers evaluate their players mainly in single games.
Imitation is essential in social learning [6]. Assuming similarities between the observations and themselves, humans acquire various skills via imitation. Imitation learning can be applied in robotics and automatic systems in several ways [7]. For instance, Billard et al. built a system according to the structure of the human brain [8]. Atkeson et al. developed a method to explain the actions of a demonstrator, and to use the explanations in an agent [9].
This paper was motivated by building an artificial player for the competitions in Tetris. As a human is superior in the competitions, we employed learning by imitation to clone the game skills of human players. The highlights of this paper can be summarized as follows:
• We developed an open source platform for the competitions.
• To our knowledge, learning by imitation is novel in Tetris.
• Our artificial player can acquire diverse game behaviors by imitating different players.
• Our player has chances to defeat the best-known artificial player in the competitions.
• The framework supports incremental learning.
This paper is structured in the following manner: first, the relation between this work and the literature is addressed in the next section. Then, an open source platform for Tetris competitions is introduced in Section B. Next, a method is developed to map the imitation to a standard data classification problem in Section C. After that, the performance of the developed methods is shown in Section D. Finally, we draw the conclusion and discuss future work in Section E.

Related Works
Learning by imitation has been widely applied in robotics, especially in humanoid robots [8]. The core idea of imitation is to improve the similarity between the imitated system and the imitator, even if certain physical or virtual dissimilarities exist. In this paper, a framework is developed to imitate both human and artificial players. The structure of our approach is certainly different from human brains or the models of the other artificial players. Generally, we follow the idea of learning by imitation. To our knowledge, it is the first time that imitation learning has been applied in Tetris.
The single Tetris games have been used as test-beds in several branches of artificial intelligence [1]. For example, the standard 10×20 Tetris game is still a challenging task for the methods in reinforcement learning [10], [11]. The number of rows that a player can clear is widely accepted as a criterion for the evaluation. So far, several successful artificial players, e.g. in [3], [4], and [2], are based on building an evaluation function with linear combinations of the weighted features. These features were listed in [1]. We also employ 19 hand-coded features in our approach, some of which cannot be found in the list. Instead of a linear evaluation function, we use multiple support vector machines in our framework.
The Support Vector Machine (SVM) was first proposed by Cortes and Vapnik in 1995 [12], and became an important method for data classification. SVM is well developed. It was implemented in several open source packages which are available on the Internet. In this paper, SVM is used as a tool. Our implementation is based on LIBSVM [13]. We modeled the imitation tasks in Tetris as a standard data classification problem which can finally be solved by SVMs.
Incremental learning is mainly about a series of machine learning issues in which the training data become available gradually [14]. It is a special learning method with which a certain evaluation can be improved by the learning process over a fairly long period. In order to do that, we defined a learning paradigm: switching attention learning [15]. In the paradigm, there are multiple learners with their inputs and outputs forming a loop. The performance of one learner generates potential improvement space for the others. Following this idea, Tetris is used as a test-bed. Our artificial player can choose a game played by a stronger player as its target to imitate.

B. AN OPEN SOURCE FRAMEWORK FOR TETRIS
KDE¹ is an advanced desktop platform which provides a user-friendly graphical interface. It is an open source project. KBlocks is the Tetris game in KDE. We developed KBlocks into a platform for research in artificial intelligence. The system components of KBlocks are shown in Figure 1.

Fig. 1. The System Components of KBlocks (a game engine linked by socket connections to GUIs and players such as Player1 and Player2)

KBlocks can be run in two modes: KDE users can use it as a normal desktop game; researchers can choose to start a game engine, a GUI, or a player. The GUIs and the players are connected to the game engine via UDP sockets. The components can be run on one or several computers.
KBlocks can be configured with parameters defined in a text file. The game engine (and the GUI) supports game competitions among up to 8 players, in which one player could be a human. A hand-coded artificial player is integrated [16]. It can clear on average 2000−3000 lines in single games. The competitions can be done in a synchronized mode, in which each player gets the new piece after the slowest player finishes the current placement.
A new artificial player can be integrated into the platform with ease. We provide a source code package on the Internet², in which the class KBlocksDummyAI is a clean and simple interface for further development. Graduate students can simply change the source code for their internship or thesis. Researchers can play around with some ideas or organize competitions.

C. LEARNING BY IMITATION
In Section B, we addressed the functionality of the Tetris platform. In this section, learning by imitation is discussed in detail. First, we give a brief introduction to the system components. Then, the patterns, which are used in the filters and support vector machines (SVMs), are explained. Last, we address how the SVMs are used for data classification in our imitation learning.
The training data of the imitation learning are obtained from the imitated system. In this paper, they are the Tetris games played by the imitated player. We created several models to obtain the skills of the imitated player. The training process receives positive feedback if the models make the same decision as the imitated system. Otherwise, it receives negative feedback. The imitation learning is successful if the trained models keep the similarity even on data that never appear in the training set.
The learning system consists of several components, as shown in Figure 2. We created three categories for these components: the data representation, the algorithms, and the learners. They are illustrated as the gray rectangles, the regular rectangles, and the round-cornered rectangles in the figure.
Figure 2 also shows the relations among the components. We align these components vertically according to the categories. A lower algorithm uses the outputs of the upper one as its inputs. The learners compute the models which are used in the algorithms.
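The feedback rule described in this section (agreement with the imitated player counts as positive, disagreement as negative) can be illustrated with a small sketch. It is not the authors' implementation: the record format, the function name `similarity`, and the toy data are assumptions introduced here only to show how an agreement rate on held-out decisions could be computed.

```python
from typing import Callable, Hashable, Sequence, Tuple

# One recorded decision: (game situation, placement chosen by the imitated player).
# Both element types are placeholders; the paper does not prescribe a representation.
Record = Tuple[Hashable, Hashable]


def similarity(choose: Callable[[Hashable], Hashable],
               records: Sequence[Record]) -> float:
    """Fraction of recorded decisions on which the model agrees with the imitated player."""
    if not records:
        raise ValueError("need at least one recorded decision")
    # Each agreement is a positive feedback, each disagreement a negative one.
    matches = sum(1 for situation, chosen in records if choose(situation) == chosen)
    return matches / len(records)


# Hypothetical held-out decisions that were never used for training.
held_out = [("s1", 3), ("s2", 7), ("s3", 1), ("s4", 4)]

# A toy "trained model": a lookup table that disagrees with the player on "s3".
toy_model = {"s1": 3, "s2": 7, "s3": 2, "s4": 4}.get

print(similarity(toy_model, held_out))  # 0.75
```

Measuring the rate on decisions outside the training set mirrors the success criterion stated above: the trained models should keep the similarity on data they have never seen.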
¹ Official site: http://kde.org
² http://www.informatik.uni-freiburg.de/~kiro/KBlocksDummyPlayer.tgz
The middle column with the dotted arrow lines shows the sequence of the computation in the games. With the current board state and the piece (s, p), the data preprocessor can generate up to 34 candidate placements by enumerating all the rotations and positions of p. The candidates are filtered because of the heavy computational power required by training the SVMs. The rest of the candidates are passed to the pattern calculator and the hand-coded features. Each candidate is transformed into a vector of the values of the patterns and the features. The vectors are used as the input of the SVM for the prediction. The output of the SVM can be described as how similar a candidate is to the choice of the imitated player. Consequently, the most similar one is labeled as the final choice.

Fig. 2. System Components (labels in the figure: Data Preprocessor, Counting Filters, Filtered Data, Pattern Calculator, Hand Coded Features, Activated Patterns, Info. Gain, SVM Learner, Support Vector Machine, Class Labels)

The Learning of the Patterns
Training the SVMs is time consuming. There are 7 different pieces in Tetris: L, J, O, I, T, Z, and S. To place one of L, J, or T, there will be 34 candidates by combining all the possible rotations and positions; O has 9 combinations; I, Z, or S have 17. The candidate chosen by the imitated player is regarded as the positive case. The others are the negative cases. If the size of the training set is 10000, there are about 220000 tuples (cases) in the set. If each tuple is a vector of 39 values, training an SVM from these data would take more than a week using a 2.3GHz PC.
In order to reduce the data set, the types of pieces are used in the data preprocessor to separate the data into 7 subsets. Each subset is used to train its own filter, patterns, and SVM. In other words, seven SVMs work together in the artificial player.
When placing the current piece, human players can first reduce the candidates to a limited number by observing the surface of the accumulated blocks. Then, they choose one from the filtered candidates as their final decision. This idea was used to develop the filters for reducing the amount of data in the learning.
A filter consists of a set of patterns. Figure 3 shows the concept of the patterns. The current piece is denoted by 'c'; it is an 'L' in the figure. We use 'x' to denote the already occupied cells. Around the placement, a small field, which is marked in gray, is chosen as the activated area for the patterns. The patterns are smaller than the small field. For example, the deeper gray area in the figure shows a pattern. It contains 5×2 cells. The cells with a 'c' or 'x' inside are occupied.

Fig. 3. An example of the Pattern

Fig. 4. The Illustrations of the Features (labels in the figure: column, flat, hole, buried well)

A pattern can be activated by a placement. As mentioned above, the small field is activated by the placement. All the 5×2 patterns can be enumerated. We move a pattern around the small field. It is activated by the placement if the occupied cells in the pattern match the occupied cells in the background (the small field).
Filters can thus be learned by counting. If a pattern is never activated by the placements of the imitated player, it can be used to reduce the candidates. Each filter is a set of such patterns. It can be learned by running the activation tests over all the training data.

Support Vector Machines
The patterns are useful not only in the filters but also for modeling the skills of the imitated players. For instance, suppose a pattern was activated 1000 times over the training set, among which 900 activations came from positive cases. This pattern cannot be used in a filter because it is activated by both negative and positive cases. However, its activation indicates that the placement tends to be positive, because of the ratio of positive to negative cases in the training data. Therefore, the patterns are also used in this section to compute the inputs of the SVMs.
However, the patterns can only capture "local" information. They are checked within the small field around the placement. On the other hand, it is important to consider some "global" parameters in Tetris. For example, a candidate placement can clear 4 rows. This would be important for the game. The patterns, however, cannot express this occurrence.
We designed hand-coded features to acquire "global" information. If the patterns can define the tactics of the games, the features can be used to describe the strategies. In order to define these features, we use Figure 4 to illustrate some phrases:
hole, flat, column, and well. A well or a hole is buried if it is no deeper than three cells from the surface.
The features are listed in Table I. Items 2 and 3 are for the column. Items 4 − 6 are about the flat. Items 9 − 11 are for the hole. Items 14 − 18 are about the well. Our features are compared with the features listed in [1]; the items with * were not mentioned there. There are differences in the descriptions of the features because we use them as the inputs of the SVMs. The other researchers developed the evaluation function with linear combinations of the weighted features.

TABLE I
List of Hand-Coded Features
1*   How many attacks are possible after the current placement.
2    The number of the columns.
3    The increased height of the column.
4    The increased number of the flat.
5    The decreased number of the flat.
6    The maximum length of the flat.
7    The increased height of accumulated blocks.
8    The height difference between the current placement and the highest position of the accumulated blocks.
9    How many holes will be created after the current placement.
10   How many holes will be removed after the current placement.
11*  How many occupied cells are added over a hole.
12   The number of removed lines of the current placement.
13*  How well the next piece will be accommodated.
14   If a well is closed by the current placement, how deep the well is.
15   If a well is opened by the current placement, how deep the well is.
16*  How many occupied cells are added over a buried well.
17   The number of the open wells.
18   How deep the well is, if it is created by the current placement.
19   Whether a well is removed by the current placement.

A large number of patterns can be created by enumeration. For example, an enumeration of 5×2 cells will create 1024 patterns. It is difficult to consider all these patterns as the inputs of the SVMs because of the required computational power. To our knowledge, there is no trivial way to compute a subset of the patterns which yields the optimal performance of the SVMs. Therefore, we employ the information gain used in decision trees for computing a subset of 20 patterns for each SVM.
SVMs are a popular method in data classification, in which the whole data set is globally classified with a set of labels. Nevertheless, the data in Tetris are grouped by the current piece. Among the candidate placements of the current piece, the algorithm needs to choose the one which is closest to the choice of the imitated player. LIBSVM [13] provides an API to compute this probability, which is used in our implementation.
The values of the inputs should be within the same range in the SVMs. A pattern always has a value of 0 or 1, which denotes whether or not it is activated by the current placement. The values of the features, however, can be much bigger. For example, the maximum length of the flat can be up to 9 in a standard Tetris game. In order to avoid this situation, the values of the features were mapped to 0, 0.5, or 1 in our implementation.

Fig. 5. The Training Process (three plots over the training process, each with the curves "Imitating Human" and "Imitating AI")

D. EXPERIMENTS
The experiments were done in a grid system. There are 8 computers in the grid. Each computer has eight 2.3GHz AMD CPUs and 32G memory. 64 processes can be run in parallel in the grid.
We recorded 10 games of a human player. Each game lasted more than one hour. The game speed was limited, so that the player had enough time for the game. The player can play Tetris at an amateur level. In total 6720 rows were cleared in these games. The human player was regarded as the first imitated player.
Fahey's artificial player [2] was run for about 1 hour. It cleared 6774 lines without a restart. The game was recorded as the training set. Fahey's artificial player was the second imitated player.
The two imitated players had very different behaviors in the games. If the human player competes with the artificial player in the synchronized mode, the artificial player has very little chance to win, because it attacks only a few times.
The recorded data were divided into 150 subsets, 120 of which were used as the training set. The rest comprised the testing set, through which the similarity between the trained models and the imitated players can be calculated as the rate at which the trained model chooses the same placements as the imitated player. The results are shown in the upper plot of Figure 5. The data were averaged over 10 slices.
The solid lines show the performance of the player that imitates the human player. The dotted lines show the player that imitates Fahey's player. Both imitations achieved a similarity of about 0.7. The curves resemble a typical learning curve because the similarity is regarded as the evaluation in the
learning. The similarity cannot be higher because of the differences in the data representation and the models between the imitating system and the imitated systems.
The trained models compete against Fahey's player in the synchronized two-player games. 200 random piece sequences were generated for the 200 games, so that each model was evaluated on the same set of games. The middle plot in Figure 5 shows the winning rates of the imitating players. The player imitating the human finally achieved 0.25 as its rate of wins in the competitions against Fahey's player. The other imitator did not perform well because the competitions were between the imitating and imitated systems. As the similarity cannot be very high in our implementation, the imitated system should in principle be better than the imitating system.
The trained models also play the single games. The piece sequences used in the games were generated and fixed. The number of handled pieces was used as the evaluation of the player. The results are shown in the lower plot in Figure 5. Fahey's artificial player is better than the human player in the single games, which explains the observation that the imitator of Fahey's player is in the end better than the other imitator.
The training process was designed to search for the maximum rate of the similarity. The rate reached 0.68 at the 30th data slice, and kept this value after that. The performance in the competitions and single games can still be improved after the 30th data slice. This observation indicates that a bigger training set helps to improve the game skills, though it does not improve the similarity in the imitation.
The human player, Fahey's player, and their imitators have different behaviors in the games. In order to show the difference, we designed the evaluations for the attack, defense, and risk. Each player played the same sequences of the pieces in the single games. Attack is the average number of attacks that the player made to clear 100 lines. Defense is evaluated by the average number of cleared lines of each game. Risk is measured by the average height of the placements. The results are shown in Figure 6.

Fig. 6. Behaviours of Different Players

Fahey's player has a defense ability that is better than that of the other players by several orders of magnitude. This information was shown as the open-end column in the figure. The other evaluations were mapped to a comparable range. The human player has the best attack ability, which explains how its imitator has chances to defeat Fahey's player in the competitions. The two imitators show quite different behaviors according to the evaluations, which means our imitation learning can generate various artificial players according to the imitated systems.

E. Conclusions
In this paper, we developed a platform for Tetris competitions. The platform is based on an open-source project. The GUIs and players can connect with the game engine via the socket connections. A dummy player was provided as an interface for further development.
We implemented a framework by using learning by imitation. The framework consists of several sets of filters, pattern calculators, and SVMs. The imitation tasks were mapped to a standard data classification problem. The experiments show that our imitators have chances to defeat Fahey's player, which is the best-known artificial player in single Tetris games. The imitation learning can also acquire diverse skills in Tetris games.
There are multiple learners in the framework. The learned artificial player can be used to select an interesting game for further training. The inputs and outputs of the learners form a loop, so that the performance of each learner creates improvement space for the incremental learning.

Discussions
The imitator did not win many games in the competitions. In the next step, we will develop an extra learner for better results in the competitions. The initial experiments showed that the wins in the competitions can be significantly improved by using the rate of wins as the evaluation in the learning.
Tetris was studied mainly in single games. If the sequence of the pieces is known, how can a player win the competitions? AI planning is an interesting direction for further development. We are going to implement bandit-based Monte-Carlo planning in Tetris.

References
[1] S. B. T. Christophe, “Building Controllers for Tetris,” International Computer Games Association Journal, vol. 32, pp. 3–11, 2009.
[2] C. Fahey, “Tetris AI,” http://www.colinfahey.com/tetris/tetrisen.html, 2003, accessed on 02-August-2010.
[3] G. K. S. M. N. Böhm, “An evolutionary approach to Tetris,” in Proceedings of the Sixth Metaheuristics International Conference (MIC2005), 2005.
[4] I. A. Lőrincz, “Learning Tetris using the noisy cross-entropy method,” Neural Computation, vol. 18, pp. 2936–2941, 2006.
[5] L. Reyzin, “2 player Tetris is PSPACE hard,” in Proceedings of 16th Fall Workshop on Computational and Combinatorial Geometry, 2006.
[6] A. Bandura, “Social learning theory,” New York: General Learning Press, 1971.
[7] B. S. C. Breazeala, “Robots that imitate humans,” Trends in Cognitive Sciences, vol. 6-11, pp. 481–487, 2002.
[8] M. M. A. Billard, “Human arm movement by imitation: evaluation of biologically inspired connectionist architecture,” Robotics and Autonomous Systems, vol. 35, pp. 145–160, 1998.
[9] S. S. C.G. Atkeson, “Learning from demonstration,” in Proceedings of 14th International Conference on Machine Learning (ICML97), 1997, pp. 12–20.
[10] K. Driessens, “Relational reinforcement learning,” PhD thesis, Catholic University of Leuven, 2004.
[11] D. Carr, “Adapting reinforcement learning to Tetris,” Bachelor thesis, Rhodes University, Grahamstown 6139, South Africa, 2005.
[12] C. Cortes and V. Vapnik, “Support vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.
[13] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[14] Z. S. A. Bouchachia, B. Gabrys, “Overview of some incremental learning algorithms,” in Proceedings of Fuzzy Systems Conference, 2007. FUZZ-IEEE, 2007, pp. 1–6.
[15] A. H. D. Zhang, B. Nebel, “Switching attention learning - a paradigm for introspection and incremental learning,” in Proceedings of Fifth International Conference on Computational Intelligence, Robotics and Autonomous Systems (CIRAS 2008), 2008, pp. 99–104.
[16] J. Tang, “AI agent for Tetris,” Bachelor thesis, University of Freiburg, Germany, 2009.
