Zhang 2019

2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)
A Novel Neural Source Code Representation Based

on Abstract Syntax Tree
Jian Zhang†‡ , Xu Wang†‡∗ , Hongyu Zhang§ , Hailong Sun†‡ , Kaixuan Wang†‡ and Xudong Liu†‡
† SKLSDE Lab, School of Computer Science and Engineering, Beihang University, Beijing, China
‡ Beijing Advanced Innovation Center for Big Data and Brain Computing, Beijing, China
§ The University of Newcastle, Australia
{zhangj, wangxu}@act.buaa.edu.cn, hongyu.zhang@newcastle.edu.au, {sunhl, wangkx, liuxd}@act.buaa.edu.cn
Abstract—Exploiting machine learning techniques for analyz- the source code) is composed of natural language texts. Even
ing programs has attracted much attention. One key problem though code fragments have something in common with plain
is how to represent code fragments well for follow-up analysis. texts, they should not be simply dealt with text-based or token-
Traditional information retrieval based methods often treat
programs as natural language texts, which could miss important
based methods due to their richer and more explicit structural
semantic information of source code. Recently, state-of-the-art information [2], [18].
studies demonstrate that abstract syntax tree (AST) based neural Recent work [2], [5], [6] provides the strong evidence that
models can better represent source code. However, the sizes of syntactic knowledge contributes more in modeling source code
ASTs are usually large and the existing models are prone to and can obtain better representation than traditional token-
the long-term dependency problem. In this paper, we propose
a novel AST-based Neural Network (ASTNN) for source code
based methods. These approaches combine Abstract Syntax
representation. Unlike existing models that work on entire ASTs, Tree (AST) and Recursive Neural Network (RvNN) [5], Tree-
ASTNN splits each large AST into a sequence of small statement based CNN [2] or Tree-LSTM [6] to capture both the lexical
trees, and encodes the statement trees to vectors by capturing (i.e., the leaf nodes of ASTs such as identifiers) and syntactical
the lexical and syntactical knowledge of statements. Based on the (i.e., the non-leaf nodes of ASTs like the grammar construct
sequence of statement vectors, a bidirectional RNN model is used
to leverage the naturalness of statements and finally produce the
WhileStatement) information. Such AST-based neural models
vector representation of a code fragment. We have applied our are effective, yet they have two major limitations. First, similar
neural network based source code representation method to two to long texts in NLP, these tree-based neural models are
common program comprehension tasks: source code classification also vulnerable to the gradient vanishing problem that the
and code clone detection. Experimental results on the two tasks gradient becomes vanishingly small during training, especially
indicate that our model is superior to state-of-the-art approaches.
when the tree is very large and deep [19], [20], [21]. For
Keywords-Abstract Syntax Tree; source code representation; example, as our experiments show (Section V), the maximal
neural network; code classification; code clone detection node number/depth of ASTs of common code fragments in
C and Java are 7,027/76 and 15,217/192, respectively. As
I. I NTRODUCTION a result, traversing and encoding entire ASTs in a bottom-
Many software engineering methods, such as source code up way [5], [6] or using the sliding window technique [2]
classification [1], [2], code clone detection [3], [4], [5], [6], may lose long-term context information [19], [22]; Second,
defect prediction [7], [8] and code summarization [9], [10] these approaches either transform ASTs to or directly view
have been proposed to improve software development and ASTs as full binary trees for simplification and efficiency,
maintenance. One main challenge that is common across all which destroys the original syntactic structure of source code
these methods is how to represent source code, in order and even make ASTs much deeper. The transformed and
to effectively capture syntactical and semantic information deeper ASTs further weaken the capability of neural models
embedded in the source code. to capture more real and complex semantics [23].
Traditional approaches such as Information Retrieval (IR) In order to overcome the limitations of the above AST-
usually treat code fragments as natural language texts and based neural networks, one solution is to introduce explicit
model them based on tokens. For example, programs are (long-term) control flow and data dependencies graphs and
represented by token sequences or bag of tokens for code clone employ a Graph Embedding technique [24] to represent source
detection [3], [4], bug localization[11], and code authorship code. For instance, one recent study considers the long-
classification [1]. In addition, a number of researchers use range dependencies induced by the same variable or function
Latent Semantic Indexing (LSI) [12] and Latent Dirichlet in distant locations [25]. Another study directly constructs
Allocation (LDA) [13] to analyze source code [14], [15], [16]. control flow graphs (CFGs) of code fragments [26]. However,
However, according to [17], the common problem of these as depicted in the above work, precise and inter-procedural
approaches is that they assume the underlying corpus (i.e., program dependency graphs (PDGs) (i.e. control flow and data
flow dependencies) [27] usually rely on compiled intermediate
∗ Corresponding author: Xu Wang, wangxu@act.buaa.edu.cn. representations or bytecodes [28], [29], and are not applicable
1558-1225/19/$31.00 ©2019 IEEE 783

DOI 10.1109/ICSE.2019.00086
Modifier public
1. static public String readText(final String path)
2. throws IOException { Method MethodDeclaration …
readText static
3. final InputStream stream Declaration Type InputStream
4. = FileLocator.getAsStream(path); … LocalVariable …
5. final StringBuilder sb LocalVariable ...
6. = new StringBuilder(); LocalVariable …
7. try (BufferedReader reader = throws Type StringBuilder
8. new BufferedReader(
new InputStreamReader(stream))){ body LocalVariable ... TryStatement …
9.
10. String line;
11. while ((line=reader.readLine())!=null) TryResource ... LocalVariable …
12. {
13. sb.append(line).append("\n"); TryStatement block LocalVariable
WhileStatement …
14. }
WhileStatement
…
15. }
16. return sb.toString();
17. } ReturnStatement ... ReturnStatement …
(a) Code fragment and statements (b) AST and statement trees (c) Statement naturalness
Fig. 1. An example of AST Statement nodes (marked in red)
to uncompilable and incomplete code fragments. Such a lim- vectors, we use bidirectional Gated Recurrent Unit (GRU)
itation hinders the applications of the code representations in [34], [35], one type of recurrent neural network, to leverage
many areas that involve arbitrary code fragments. the sequential naturalness of statements and finally obtain the
In this paper, we propose a novel approach for representing vector representation of an entire code fragment.
code fragments that do not have to be compilable, called In summary, our proposed neural source code representation
AST-based Neural Network (ASTNN), which splits the large aims to learn more syntactical and semantic information about
AST of one code fragment into a set of small trees at source code than the state-of-the-art AST-based neural models.
the statement level and performs tree-based neural embed- It is general-purpose and can be used in many program
dings on all statement trees. It produces statement vectors comprehension related tasks such as source code classification
which can represent the lexical and statement-level syntactical and code clone detection. We have conducted experiments
knowledge. Here statements refer to the Statement AST on the two tasks on public benchmarks and compared with
nodes defined in program language specification [30]. We state-of-the-art approaches. The experimental results show that
also treat MethodDeclaration as a special statement node. As our model is more effective. For example, for source code
an example, Figure 1 shows a code fragment from an open classification, our approach improves the accuracy from 94%
source project1 . The code snippet between line 7 and line to 98.2%. For clone detection, our approach improves the
15 contains a whole T ry statement and the code snippet results of F1 values from 82% to 93.8% and 59.4% to 95.5%
between line 5 and line 6 includes only the LocalV ariable on two benchmark datasets, respectively.
statement initializing variable sb. For each statement like the Our main contributions are as follows:
T ry statement that includes the header and other statements in • We propose a novel neural source code representation,
the body, we split the header of the statement and all included which can capture the lexical, statement-level syntactical
statements. In this way, the large AST is decomposed to a knowledge, and the naturalness of statements;
short sequence of small statement trees. We use Recurrent • We have applied our representation to two common
Neural Network (RNN) [31] to encode statements and the program comprehension tasks (code classification and
sequential dependency between the statements into a vector. clone detection). The experimental results show that our
Such a vector captures the naturalness of source code [32], approach can improve the state-of-the-art methods.
[33] and can serve as a neural source code representation. The remainder of this paper is structured as follows. Sec-
More specifically, first, we build an AST from the code tion II introduces the background. Section III presents our
fragment and split the whole AST to small statement trees (one approach. Section IV describes the applications of our neural
tree consisting of AST nodes of one statement and rooted at the source code representation. Section V provides our exper-
Statement node). For example, in Figure 1, the statement trees imental results. Related work and discussion about threats
are denoted by dashed lines and the corresponding statements to validity are presented in Section VI and Section VII,
(or statement headers) in the original code fragment are also respectively. Finally, Section VIII concludes our work.
marked by dashed lines. Second, we design a recursive encoder
on multi-way statement trees to capture the statement-level II. BACKGROUND
lexical and syntactical information and then represent them in A. Abstract Syntax Tree
statement vectors. Third, based on the sequence of statement Abstract Syntax Tree (AST) is a kind of tree aimed at
1 https://github.com/apache/ctakes/blob/ representing the abstract syntactic structure of the source code
9c552c5c4f92af00d9d008b8c7f9e9d326a2450a/ctakes-core/src/main/java/org/ [36]. It has been widely used by programming language and
apache/ctakes/core/resource/FileReadWriteUtil.java#L32 software engineering tools. As illustrated in Figure 1(b), nodes
784
of an AST are corresponding to constructs or symbols of the some global information for improving its localness. Although
source code. On the one hand, compared with plain source nodes in the original AST may have more than two children,
code, ASTs are abstract and do not include all details such as TBCNN treats ASTs as continuous full binary trees because
the punctuation and delimiters. On the other hand, ASTs can of the fixed size of convolution.
be used to describe the lexical information and the syntactic 3) Tree-based Long Short-Term Memory: Tree-LSTM is a
structure of source code, such as the method name readText generalization of LSTMs to model tree-structured topologies.
and the control flow structure WhileStatement in Figure 1(b). Different from standard LSTM, Child-Sum Tree-LSTM [41]
Some studies directly use ASTs in token-based methods recursively combines current input with its children states
for source code search [37], program repair [38] and source for state updating across the tree structure. CDLH [6] uses
code differencing [39]. Due to the limitation of token-based Tree-LSTM to learn representations of code fragments for
approaches [17], these methods can catch little syntactical clone detection where code fragments are parsed to ASTs.
information of source code. To deal with the variable number of children nodes, ASTs
are transformed to full binary trees. After a bottom-up way
B. Tree-based Neural Networks of computation, the root node vectors of ASTs are used to
Recently Tree-based Neural Networks (TNNs) have been represent the code fragments.
proposed to accept ASTs as the input. Given a tree, TNNs
learn its vector representation by recursively computing node C. The limitations of the existing work
embeddings in a bottom-up way. The most representative These three tree-based methods have two major limitations.
tree-based models for ASTs are Recursive Neural Network First, during gradient-based training of tree topologies, the gra-
(RvNN), Tree-based CNN (TBCNN) and Tree-based Long dients are calculated via backpropagation over structures [41],
Short-Term Memory (Tree-LSTM). [23]. However, the structures of ASTs are usually large and
1) Recursive Neural Network: RvNN was first proposed for deep due to the complexity of programs, especially the nested
the recursive structure in natural language and image parsing structures. Thus the bottom-up computations from the leaf
[40]. Specifically, given a tree structure, suppose that one nodes to the root nodes may experience the gradient vanishing
parent node y1 has two children nodes (c1 , c2 ), where c1 and problem and are difficult to capture long-range dependencies
c2 are word embeddings or intermediate vector representations [19], [22], which will miss some of the semantics carried by
of nodes. The vector of node y1 is computed by: distant nodes from the root nodes such as identifiers in the leaf
nodes. Second, the existing tree-based methods view ASTs as
p = f (W (1) [c1 ; c2 ] + b(1) ) binary trees by moving three or more children nodes of a
where W (1) is a matrix of parameters, f is an element-wise parent node to new subtrees for simplification, which changes
activation function, and b(1) is a bias term. To assess the the original semantics of source code and makes the long-term
quality of this vector representation, a decoding layer is used dependency problem more serious. For example, CDLH [6]
to reconstruct the children: can only have the F1 value of 57% in one public benchmark
for clone detection, and the studies in NLP [23], [41], [21]
[c1 ; c2 ] = W (2) p + b(2) show that the tree size and depth do matter and have significant
impact on the performance.
Then the training loss is measured by E(θ) = ||c1 − c1 ||22 +
||c2 − c2 ||22 . In this way, RvNN can recursively compute and III. O UR A PPROACH
optimize parameters across the tree, and the final vector of
We introduce our AST-based Neural Network (ASTNN) in
the root node will represent the given tree. Based on RvNN, a
this section. The overall architecture of ASTNN is shown in
recursive autoencoder (RAE) is incorporated for automatically
Figure 2. First, we parse a source code fragment into an AST,
encoding ASTs to detect code clones [5], where ASTs are
and design a preorder traversal algorithm to split each AST
transformed to full binary trees due to the fixed-size inputs
to a sequence of statement trees (ST-trees, which are trees
for simplification.
consisting of statement nodes as roots and corresponding AST
2) Tree-based Convolutional Neural Network: TBCNN
nodes of the statements), as illustrated in Figure 1. All ST-trees
performs convolution computation over tree structures for
are encoded by the Statement Encoder to vectors, denoted as
supervised learning such as source code classification [2].
e1 , · · · , et . We then use Bidirectional Gated Recurrent Unit
Its core module is an AST-based convolutional layer, which
[35] (Bi-GRU), to model the naturalness of the statements.
applies a set of fixed-depth feature detectors by sliding over
The hidden states of Bi-GRU are sampled into a single vector
entire ASTs. This procedure can be formulated by:
by pooling, which is the representation of the code fragment.
n
y = tanh( Wconv,i · xi + bconv ) A. Splitting ASTs and Constructing ST-tree Sequences
i=1 At first, source code fragments can be transformed to large
where x1 , · · · , xn are the vectors of nodes within each sliding ASTs by existing syntax analysis tools. For each AST, we split
window, Wconv,i are the parameter matrices and bconv is the it by the granularity of statement and extract the sequence of
bias. TBCNN adopts a bottom-up encoding layer to integrate statement trees with a preorder traversal.
785
Statement Vector
Representation
ℎ ℎ ℎ ℎ ℎ ℎ Pooling Layer Pooling
h1 h2 h3 h4 … hN-2 hN-1 hN
←
Recurrent Layer
→
Method
Declaration
…
Reference Formal throws body

Encoding Layer Modifier Type readText Parameter
Statement Encoder
Fig. 3. The statement encoder, where blue, orange and green circles represent
the initial embeddings, hidden states and statement vector, respectively.
…
- - - - - - ST-trees
… … … … … … … … … … … … … … … … … …
Fig. 2. The architecture of AST-based Neural Network as the node level of ASTs, the code blocks within brace
pairs, and the full ASTs. We will discuss these experiments
in Section V. If the size of selected granularity is too large
(e.g., the full AST), similar to the related work [5], [6], we
Formally, given an AST T and a set of Statement AST may also experience the gradient vanishing problem mentioned
nodes S, each statement node s ∈ S in T corresponds one in Section II. But if it is too small (e.g., the node level of
statement of source code. We treat MethodDeclaration as a AST), the model will become a token-based RNN that may
special Statement node, thus S = S∪{M ethodDeclaration}. capture less syntactical knowledge of statements than ours.
For nested statements, as shown in Figure 1, we define the Our experimental results show that proposed statement-level
set of separate nodes P = {block, body} where block is granularity is better since it has a good trade-off between the
for splitting the header and body of nested statements such size of ST-tree and the richness of syntactical information.
as T ry and W hile statements, and body for the method
declaration. All of the descendants of statement node s ∈ S B. Encoding Statements on Multi-way ST-trees
is denoted by D(s). For any d ∈ D(s), if there exists 1) Statement Vectors: Given the ST-trees, we design a
one path from s to d through one node p ∈ P , it means RvNN based statement encoder, which is used for learning
that the node d is included by one statement in the body vector representations of statements.
of statement s. We call node d one substatement node of Since there are a variety of special syntactic symbols in
s. Then a statement tree (ST-tree) rooted by the statement ASTs, we obtain all the symbols by preorder traversal of
node s ∈ S is the tree consisting of node s and all of ASTs as the corpus for training. The word2vec [42] is used
its descendants excluding its substatement nodes in T . For to learn unsupervised vectors of the symbols, and the trained
example, the first ST-tree rooted by M ethodDeclaration is embeddings of symbols are served as initial parameters in
surrounded by dashed lines in Figure 1(b), which includes the statement encoder. Because all the leaf nodes of ASTs
the header part such as “static”, “public” and “readText” and representing the lexical information such as identifiers are
excludes the nodes of the two LocalV ariable, one T ry and also incorporated in the leaf nodes of ST-trees, our symbol
one Return statement in the body. Since nodes of one ST- embeddings can capture the lexical knowledge well.
tree may have three or more children nodes, we also call it Taking the first ST-tree rooted by the node of MethodDec-
multi-way ST-tree for distinguishing it from a binary tree. In laration in Figure 1 as an example, the encoder traverses the
this way, one large AST can be decomposed to a sequence of ST-tree and recursively takes the symbol of current node as
non-overlapping and multi-way ST-trees. new input to compute, together with the hidden states of its
The splitting of ASTs and the construction of ST-tree children nodes. This is illustrated in Figure 3. We only show
sequences are straightforward by a traverser and a constructor. the first two levels here. In the ST-tree, the two children nodes
The traverser visits each node through the ASTs in a depth- readText (i.e., the method name) and FormalParameter (i.e.,
first walk in preorder and the constructor recursively creates the grammar structure defining parameters of the method) as
a ST-tree to sequentially add to the ST-tree sequences. Such a well as other siblings enrich the meaning of MethodDeclara-
practice guarantees that one new ST-tree are appended by the tion. If we transform the ST-tree to one binary tree as described
order in the source code. In this way, we get the sequence of in [5], [6], for example, moving the node of readText to one
ST-trees as the raw input of ASTNN. child node or descendant of the FormalParameter node, the
Note that the selection of splitting granularity of ASTs original semantics may be destroyed. Instead, we take original
is not trivial. We choose the statement trees in this work multi-way ST-trees as input.
since statements are essential units for carrying source code Specifically, given a ST-tree t, let n denote a non-leaf
semantics. We also experimented with other granularities such node and C denote the number of its children nodes. At
786
the beginning, with the pre-trained embedding parameters Algorithm 1 Dynamic batching algorithm of ST-trees
We ∈ R|V |×d where V is the vocabulary size and d is the Input: The array of root nodes in batched ST-trees, B;
embedding dimension of symbols, the lexical vector of node Output: The vectors of batched ST-trees, BV ;
n can be obtained by: 1: L ← len(B);
vn = We xn (1) 2: BI ← [1, · · · , L]; // ST-tree indexes in the batch
3: S ∈ RN ×L×k ← φ; // record all node vectors
where xn is the one-hot representation of symbol n and vn is 4: Call DynamicBatch(B,BI);
the embedding. Next, the vector representation of node n is 5: Perform pooling on S by Eq. 3 to get BV ∈ RL×k ;
computed by the following equation: 6: return BV ;
7: function DYNAMIC BATCH (ns, ids) The batched
h = σ(Wn vn + h i + bn ) (2) current nodes ns and their indexes ids
i∈[1,C]
8: l ← len(ns);
where Wn ∈ R d×k
is the weight matrix with encoding 9: BC ∈ Rl×d ← 0; // initialize the matrix
dimension k, bn is a bias term, hi is the hidden state for each 10: Calculate Eq. 1 in batch for ns and fill into BC
children i, h is the updated hidden state, and σ is the activation according to ids;
function such as tanh or the identity function. We use the 11: Initialize two array list C, CI ← φ to record children
identity function in this paper. Similarly, we can recursively nodes and their batch indexes;
compute and optimize the vectors of all nodes in the ST-tree t. 12: for each node ∈ ns do
In addition, in order to determine the most important features 13: for each children node child of node do
of the node vectors, all nodes are pushed into a stack and 14: group child by its position, and record child
then sampled by the max pooling. That is, we get the final to C and its batch index to CI ;
representation of the ST-tree and corresponding statement by 15: end for
Equation 3, where N is the number of nodes in the ST-tree. 16: end for
17: for i = 0 → len(C) − 1 do
et = [max(hi1 ), · · · , max(hik )], i = 1, · · · , N (3) 18: h̃ ∈ Rl×k ← 0;
19: η ← DynamicBatch(C[i], CI[i]);
These statement vectors can capture both lexical and 20: h̃ ← h̃ + η;
statement-level syntactical information of statements. 21: end for
2) Batch Processing: For improving the training efficiency 22: Calculate h by Eq. 2 in batch;
on large datasets, it is necessary to design the batch processing 23: BZ ∈ RL×k ← 0; // for calculating BV
algorithm to encode multiple samples (i.e., code fragments) 24: Fill h into BZ according to ids and add BZ to S;
simultaneously. However, generally batch processing on multi- 25: return h;
way ST-trees makes it difficult since the number of children 26: end function
nodes varies for the parent nodes in the same position of one
batch [2], [6]. For example, given two parent nodes ns1 with
3 children nodes and ns2 with 2 children nodes in Figure
4, directly calculating Equation 2 for the two parents in one
batch is impossible due to different C values. To tackle this
problem, we design an algorithm that dynamically processes
batch samples in Algorithm 1.
Intuitively, although parent nodes have different number of 1, 2 1, 2 1
children nodes, the algorithm can dynamically detect and put
all possible children nodes with the same positions to groups, Fig. 4. An example of dynamically batching children nodes
and then speed up the calculations of Equations 2 of each
group in a batch way by leveraging matrix operations. In
algorithm 1, we batch L samples of ST-trees and then breadth- batched ST-trees to the stack S (line 24). Finally we can obtain
first traverse them starting from the root nodes (line 4). For the vectors of ST-tree samples and corresponding statements
the current nodes ns in the same position of the batch, we by pooling described in Equation 3 (line 5).
firstly calculate Equation 1 in batch (line 10), then detect and C. Representing the Sequence of Statements
group all their children nodes by the node positions (line 12-
Based on the sequences of ST-tree vectors, we exploit GRU
16). As shown in Figure 4, we separate the children nodes to
[34] to track the naturalness of statements. We also considered
three groups by their positions and record the groups in the
the choice of using LSTM and the comparison between LSTM
array lists C and CI. Based on these groups, we recursively
and GRU will be discussed in our experiment.
perform batch processing on all children nodes (line 17-21).
Given one code fragment, suppose there are T ST-
After getting the results of all children nodes, we compute
tree extracted from its AST and let Q ∈ RT ×k =
Equation 2 in batch (line 22), and push all node vectors of
[e1 , · · · , et , · · · , eT ], t ∈ [1, T ] denote the vectors of encoded
787
ST-trees in the sequence. At time t, the transition equations which is to detect whether two code fragments implement the
are as follows: same functionality. Suppose there are code fragment vectors
rt = σ(Wr et + Ur ht−1 + br ) r1 and r2 , and their distance is measured by r = |r1 − r2 | for
semantic relatedness [41]. Then we can treat the output ŷ =
zt = σ(Wz et + Uz ht−1 + bz )
(4) sigmoid(x̂) ∈ [0, 1] as their similarity where x̂ = Wo r + bo .
h˜t = tanh(Wh et + rt (Uh ht−1 ) + bh ) The loss function is defined as binary cross-entropy:
ht = (1 − zt ) ht−1 + zt h˜t
J(Θ, ŷ, y) = (−(y · log(ŷ) + (1 − y) · log(1 − ŷ))) (7)
where rt is the reset gate to control the influence of previous
state, zt is the update gate to combine past and new infor- To train ASTNN models for the two tasks, the goal is to
mation, h˜t is the candidate state and used to make a linear minimize the loss. We use AdaMax [46] in this paper because
interpolation together with previous state ht−1 to determine it is computationally efficient.
the current state ht . Wr , Wz , Wh , Ur , Uz , Uh ∈ Rk×m are After all the parameters are optimized, the trained models
weight matrices and br , bz , bh are bias terms. After iteratively are stored. For new code fragments, they should be pre-
computing hidden states of all time steps, the sequential processed into sequences of ST-trees and then fed into the
naturalness of these statements can be obtained. reloaded models for prediction. The output are probabilities
In order to further enhance the capability of the recurrent p for different labels. For code classification, since there are
layer for capturing the dependency information, we adopt multiple categories, the inferred value can be obtained by:
a bidirectional GRU [34], where the hidden states of both
prediction = arg max(pi ), i = 1, · · · , M (8)
directions are concatenated to form the new states as follows: i
→
− −−−→
ht = GRU (et ), t ∈ [1, T ] While for clone detection, p is a single value in the range
←
− ←−−− [0,1], thus we get the prediction by:
ht = GRU (et ), t ∈ [T, 1] (5)

→ ←
− − T rue, p > δ
ht = [ ht , ht ], t ∈ [1, T ] prediction = (9)
F alse, p ≤ δ
Similar to the statement encoder, the most important fea-
tures of these states are then sampled by the max pooling where δ is the threshold.
or average pooling. Considering the importance of different V. E XPERIMENTS
statements are intuitively not equal, for example, API calls in
the MethodInvocation statements may contain more functional In this section, we evaluate the proposed source code
information [43], thus we use max pooling for capturing representation on two tasks of code classification (Task 1) and
the most important semantics by default. The model finally clone detection (Task 2), and compare it with several state-of-
produces a vector r ∈ R2m , which is treated as the vector the-art approaches.
representation of the code fragment. A. Dataset Description
IV. A PPLICATIONS OF THE P ROPOSED M ODEL There are two public dataset benchmarks for code classifica-
The proposed ASTNN model is general-purpose. It can be tion and clone detection. One dataset consists of simple C pro-
trained for task-specific vector representations of source code grams collected from the Online Judge (OJ) system and made
fragments to characterize different source code semantics for public by Mou et al. [2]2 . The programs in OJ benchmark are
many program comprehension tasks. In this work, we take two for 104 different programming problems. Programs have the
common tasks including source code classification and code same functionality if they aim to solve the same problem.
clone detection as examples to show the applications of the The other dataset BigCloneBench (BCB) was provided by
proposed model. Svajlenko et al. [47] for evaluating code clone detection tools.
Source code classification. This task aims to classify code BCB consists of known true and false positive clones from
fragments by their functionalities, which is useful for program a big data inter-project Java repository. As benchmarks, the
understanding and maintenance [2], [44], [45]. Given the code two datasets have been used by many researchers concerning
fragment vector r and the number of categories M , we obtain on code similarity [48], [49] and clone detection [5], [6]. The
the logits by x̂ = Wo r+bo , where Wo ∈ R2m×M is the weight basic statistics corresponding to our two tasks are summarized
matrix and bo is the bias term. We define the loss function as in Table I.
the widely used cross-entropy loss:
B. Experiment Settings
exp(x̂y )
J(Θ, x̂, y) = −log (6) We used the pycparser3 and javalang tools4 to obtain ASTs
j exp(x̂j ) for C and Java code, respectively. For both tasks, we trained
where Θ denotes parameters of all the weight matrices in our embeddings of symbols using word2vec [42] with Skip-gram
model and y is the true label. 2 https://sites.google.com/site/treebasedcnn/
Code clone detection. Detecting code clones is widely 3 https://pypi.python.org/pypi/pycparser
studied in software engineering research [3], [4], [5], [6], [26], 4 https://github.com/c2nes/javalang
788
TABLE I tool Frama-C5 to get their PDGs. Based on the PDGs,
T HE STATISTICS OF DATASETS USED FOR OUR TWO TASKS
we represents nodes of PDGs by the numerical ID of
Code Classification Clone Detection
statements in HOPE [26], and average the embeddings
of all tokens in each PDG node as its initial embedding
Dataset OJ Dataset OJ BCB
#Programs 52,000 #Code fragments 7,500 59,688 [25] in GGNN6 . After graph embedding, we add a max
#Classes 104 %True clone pairs 6.6% 95.7% pooling layer on all nodes of PDGs to obtain the final
Max tokens 8,737 Max tokens 2,271 16,253 code representation.
Avg. tokens 245 Avg. tokens 244 227
Max AST depth 76 Max AST depth 60 192 To evaluate the effectiveness of source code classification, we
Avg. AST depth 13.3 Avg. AST depth 13.2 9.9 use the test accuracy metric, which computes the percentage
Max AST nodes 7,027 Max AST nodes 1,624 15,217
Avg. AST nodes 190 Avg. AST nodes 192 206
of correct classifications for the test set.
2) Code Clone Detection: There are generally four different
types of code clones [54]. Type-1 is identical code fragments
in addition to variations in comments and layout; Apart from
algorithm and set the embedding size to be 128. The hidden Type-1, Type-2 is identical code fragments except for different
dimension of ST-tree encoder and bidirectional GRU is 100. identifier names and literal values; Type-3 is syntactically
We set the mini-batch size to 64 and a maximum of 15 and 5 similar code snippets that differ at the statement level; Type-
epochs for the two tasks, respectively. The threshold is set to 4 is syntactically dissimilar code snippets that implement the
0.5 for clone detection. For each dataset, we randomly divide it same functionality. For BCB, the similarity of clone pairs is
into three parts, of which the proportions are 60%, 20%, 20% defined as the average result of line-based and token-based
for training, validation and testing. We use the optimizer metrics [47]. The similarity of two fragments of Type-1 and
AdaMax [46] with learning rate 0.002 for training. All the Type-2 is 1. Type-3 is further divided into Strongly Type-3
experiments are conducted on a server with 16 cores of and Moderately Type-3, of which the similarities are in range
2.4GHz CPU and a Titan Xp GPU. [0.7, 1) and [0.5, 0.7), respectively. The similarity of Type-4
is in range [0, 0.5) and its clone pairs take up more than 98%
C. Evaluation on Two Tasks
over all clone types. While in OJ, two programs for the same
1) Source Code Classification: We conduct extensive ex- problem form a clone pair of unknown type.
periments on the OJ dataset. Apart from the state-of-the-art As Table I shows, similar to the previous work [6], we
model TBCNN [2], we also take into account of traditional choose 500 programs from each of the first 15 programming
and other neural network based approaches including SVMs problems in OJ, namely OJClone. It will produce more than
with statistical features, TextCNN [50], LSTM [51], LSCNN 28 million clone pairs which is extremely time-consuming for
[52] and PDG-based Graph embedding approaches [25], [26] comparison, thus we randomly select 50 thousand samples
as follows: instead. Likewise, we parsed nearly 6 million true clone pairs
• SVMs. We use the linear SVMs with traditional IR and 260 thousand false clone pairs from BCB. We compare
methods. TF-IDF, N-gram and LDA are used to extract our approach with existing state-of-the-art neural models for
textual features. The corpus are tokens extracted from clone detection including RAE [5] and CDLH [6]. For RAE,
the source code files. For N-gram, we set the number the unsupervised vectors of programs are obtained by the
of grams to 2 and the number of max features to 20 authors’ open-source tool7 and we use them for supervised
thousand. The number of topics for LDA is 300. training, namely RAE+. Its configurations are set according
• TextCNN and LSTM. These two models are widely used to their paper. CDLH is not made public by the authors, so we
for sentence classification in NLP. We adapt them for this directly cite their results from the paper since their experiments
task with token sequences by treating code fragments as share the same datasets with ours. Since other traditional
plain texts. For TextCNN, the kernel size is set to 3 and clone detection methods like DECKARD [55] and common
the number of filters is 100. For LSTM, The dimension neural models such as doc2Vec8 have been compared in RAE
of hidden states is set to 100. and CDLH, we omit them in our experiment. Similar to the
• LSCNN. Originally proposed for bug location [52], experiments on code classification, we also compare with the
LSCNN extracts program features with CNN for state- two PDG-based Graph embedding approaches in OJClone.
ment embedding and uses LSTM for statement sequences. However, BCB mainly contains incomplete and uncompilable
• PDG based Graph Embedding. Most recently some method-level code fragments, we fail to get their PDGs.
studies [25], [26] construct program graphs by consid- Since the code clone detection can be formulated as a
ering control flow and data flow dependencies, and adopt binary classification problem (clone or not), we choose the
graph embedding techniques such as HOPE [24] and commonly-used Precision (P), Recall (R) and F1-measure (F1)
Gated Graph Neural Network (GGNN) [53] for code as evaluation metrics.
representation. Although original code fragments in the 5 https://frama-c.com/
OJ dataset is incomplete and uncompilable, they can be 6 https://github.com/Microsoft/gated-graph-neural-network-samples

7 https://github.com/micheletufano/AutoenCODE
manually complemented by adding standard C header
8 https://radimrehurek.com/gensim/models/doc2vec.html
files and third-party libraries and we use an open-source
789
TABLE II TABLE III
COMPARED APPROACHES FOR CODE CLASSIFICATION CODE CLONE DETECTION MODELS ON BCB
Groups Methods Test Accuracy(%) Type

RAE+ CDLH ASTNN
SVMs SVM+TF-IDF 79.4 P R F1 P R F1 P R F1

SVM+N-gram 84.7 BCB-T1 100 100 100 - - 100 100 100 100
SVM+LDA 47.9 BCB-T2 86.5 97.2 91.5 - - 100 100 100 100
BCB-ST3 79.9 72.2 75.9 - - 94 99.9 94.2 97.0
Neural models TextCNN 88.7 BCB-MT3 66.4 74.8 70.3 - - 88 99.5 91.7 95.5
LSTM 88.0 BCB-T4 76.3 58.7 66.3 - - 81 99.8 88.3 93.7
TBCNN 94.0 BCB-ALL 76.4 59.1 66.6 92 74 82 99.8 88.4 93.8
LSCNN 90.9
PDG+HOPE 4.2
PDG+GGNN 79.6 TABLE IV
CODE CLONE DETECTION MODELS ON OJC LONE
Our approach ASTNN 98.2
Metric RAE+ CDLH PDG+HOPE PDG+GGNN ASTNN
P 52.5 47 76.2 77.3 98.9

R 68.3 73 7.0 43.6 92.7
D. Research Questions and Results F1 59.4 57 12.9 55.8 95.5
Based on the evaluation on the two tasks, we aim to

investigate the following research questions:
RQ1: How does our approach perform in source code transformation. Instead, it captures knowledge about AST
classification? In the task of code classification, the samples statements and the sequential dependencies between state-
are strictly balanced among the 104 classes and our evaluation ments.
metric is the test accuracy. Experimental results are provided RQ2: How does our approach perform in code clone
in Table II. The best results are shown in bold. detection? In this research question, we want to explore
We can see that traditional methods such as SVMs perform whether our model is effective on the challenging problem of
poorly in our experiment. These methods mainly rely on the code clone detection. We conduct experiments on the OJClone
semantics of tokens or shallow semantics features of programs and BCB datasets.
to distinguish the code functionalities, but the tokens used in From OJClone, we sample 50 thousand clone pairs for
OJ dataset are mostly arbitrary. For example, the names of experiments. While from BCB, we firstly sample 20 thousand
identifiers are usually a, i, j , etc. Thus most tokens contribute false clone pairs as the negative samples. For Type-1 to
little to recognize the functionalities. Strongly Type-3, we fetch all the true clone pairs belonging
For those neural models, the results of TextCNN and to that type as positive samples since their numbers are less
LSTM are better than token-based methods above because than 20 thousand. For other types, we sample 20 thousand true
they can capture some local functional features. For example, clone pairs. Then we turn them into five groups to detect each
the semantics of a short scanf statement can be captured by type.
the sliding window of TextCNN or the memory cell unit in The detailed results are shown in Table III (BCB) and IV
LSTM. As a neural network based on entire ASTs, TBCNN (OJClone). In Table III, as mentioned before, we cite the
significantly improves the accuracy since it uses the convo- results of CDLH from [6]. Since there are no P and R values
lution sliding window over ASTs to capture tree structural reported for detailed clone types, we fill them by “-” instead.
features. LSCNN has a relatively competitive performance BCB-ST3 and BCB-MT3 represent Strongly Type-3 and Mod-
among existing neural models. This can be inferred that erately Type-3 in BCB, respectively, and so on. The BCB-
the sequential information of statements does contribute to ALL is a weighted sum result according to the percentage
recognize the functionality, but the accuracy is still lower than of various clone types [6]. In Table IV, we compare with two
TBCNN because it cannot capture the rich structural seman- more PDG-based Graph embedding methods PDG+HOPE and
tics. Graph-based approaches including HOPE and GGNN PDG+GGNN as described before.
on PDGs perform poorly among the above approaches. In We first discuss the performances of RAE+, CDLH, and our
particular, PDG with HOPE gets an accuracy of only 4.2%, ASTNN model on BCB. Obviously, all the three approaches
because the nodes of PDGs are represented by their numerical are quite effective in recognizing the similarities of two code
ID which miss lexical knowledge and only focuses on the fragments in Type-1 and Type-2, since both code fragments
explicit dependency information in a high abstraction level are almost the same excepting different identifier names,
[26]. PDG with GGNN uses tokens for node embedding and comments and so on. While for other types of BCB, RAE+
has an improvement, but still lacks the syntactical information. performs much worse than the other two approaches since it
Among all the approaches, our model achieves the best has no mechanism on memorizing history information such as
accuracy. Specifically, our model improves TBCNN by 4.2%, LSTM or GRU in CDLH and ASTNN. Comparing CDLH with
since our ASTNN model performs RvNN on much smaller our approach, we can see that ASTNN outperforms CDLH in
ST-trees than original ASTs. Unlike existing neural models, terms of F1-measure especially for Type-4. In BCB Type-4,
our model does not use the sliding window and binary tree false clone pairs share syntactical similarity as well, which
790
TABLE V TWB PBR DBA
C OMPARISON BETWEEN THE PROPOSED MODEL AND I TS DESIGN 80.0
71.2 71.2 71.2 71.2 71.2
ALTERNATIVES 70.0
Time Cost (min/epoch)

60.0
Code Classification Clone Detection
Description 49.6
Accuracy F1( OJClone) F1(BCB) 50.0 45.4 45.1 44.9 44.8
AST-Full 96.2 87.7 85.7
40.0
AST-Block 97.3 92.9 90.2
AST-Node 96.7 92.1 87.4 30.0
Removing Pooling-I 98.1 95.2 93.1
20.0
Removing Pooling-II 96.0 94.0 90.0 11.9 9.4 6.8
LSTM instead of GRU 97.8 95.6 92.3 10.0 5.0 3.5
Long code fragments 95.7 92.9 93.5 0.0
ASTNN 98.2 95.5 93.8
4 8 16 32 64
Batch Size
Fig. 5. The time cost of different batching methods

is validated to be coincidental [47] and is challenging to be
distinguished. This indicates our ASTNN model can capture
more subtle syntactical difference and complex semantics than states. From the table, we can see that the pooling on statement
CDLH by overcoming its limitations described in Section III sequences provides a comparatively significant performance
and capturing sequential naturalness of statements. boost, whereas pooling on ST-trees matters little. This shows
In OJClone, similar results can be observed by RAE+ and that different statements of the same code fragments actually
our model. However, CDLH performs much worse than on have different weights.
BCB. Unlike BCB, the variable names of programs in OJClone LSTM. In the recurrent layer of our proposed model, we
are usually meaningless, thus CDLH may miss a lot of lexical use GRU by default. If replacing GRU by LSTM, the results
information and can only capture some syntactical informa- indicate that overall LSTM has a slightly poor but comparative
tion. By contrast, our ASTNN model can further measure performance with GRU. We prefer GRU in our ASTNN model
the functional similarities by learning more local syntactical since it can achieve more efficient training.
knowledge and the global sequential naturalness among state- Long code fragments. Considering only long code frag-
ments. Similar to code classification, we also compare with ments which have more than 100 statements, the percentage
PDG-based Graph embedding techniques HOPE and GGNN. of long code fragments is 1.6% in OJ, and the percentage of
They achieve worse performance than our ASTNN model due clone pairs which include at least one long code fragment is
to the facts mentioned in the last research question. 4.1% in BCB. From the table, we can see that our ASTNN
RQ3: What are the effects of different design choices for model can also deal with long sequences of statements well
the proposed model? We conduct experiments to study how and the performance on long code fragments remains good.
different design choices affect the performance of the ASTNN RQ4: To what extent does the proposed dynamic batching
model on the two tasks. As shown in Table V, we consider algorithm contribute to the efficiency? As we described in
the following design choices: Section III, our statement encoder can accept batch samples
Splitting granularities of ASTs. Given a large AST, there with arbitrary tree structures as inputs, thus can accelerate the
are many ways to split it into different sequences of non- speed of training. However, it is still unknown how efficient
overlapping small trees. The two extreme ways are treating this algorithm is. In order to test and verify its capability,
the original full AST as one special subtree (AST-Full), or we train our model in three different ways: totally without
extracting all nodes of the AST as special “trees” (AST- batching (TWB), batching only on the recurrent layer (PBR),
Node). Besides the statement level splitting, another possible batching on the recurrent layer and the encoding layer by using
way (AST-Block) is to split the AST according to blocks our dynamic batching algorithm in Algorithm 1 (DBA). In
(compound statements that include multiple statements within detail, TWB means calculating one sample each time; PBR
the same brace pairs). After splitting, the follow-up encoding accepts batch samples, but encodes only one ST-tree at each
and bidirectional GRU processings are the same as those in time and performs batching on the recurrent layer by padding
ASTNN. We can see that AST-Block and ASTNN outperform sequences of ST-tree vectors; DBA encodes all batch samples
both extreme splitting approaches of AST-Full and AST-Node. of ST-trees at once and then deals with ST-tree sequences as
Our ASTNN model achieves the best performance, as analyzed PBR does. The experiment is conducted for Task 1 and the
in Section III, this is because it has a good trade-off between time cost is the average running time of training and testing
the size of ST-tree and the richness of syntactical information. for each epoch.
Pooling. In our ASTNN model, we use the max pooling In the experiment, the batching has no effect on the perfor-
on ST-trees in the statement encoder (Pooling-I) and the max mance but changes the efficiency a lot. We find that the average
pooling layer on the statement sequences after the recur- time required by TWB is 71.2 minutes per epoch. From Figure
rent layer (Pooling-II) as described in Section III. We study 5 we can see that both PBR and DBA speed up the training
whether the two pooling components affect the performance or and testing process when compared with TWB. DBA shows
not by removing them and directly using the last layer hidden a significant improvement over the others. Furthermore, DBA
791
runs 12+ times faster than PBR and 20+ times faster than TWB network to predict the related questions in StackOverflow. The
when the batch size is 64. This confirms that the proposed neural machine translation is used to automatically generate
batching algorithm makes our ASTNN model more efficient. commit messages [10]. Guo et al. [65] proposes a RNN based
neural network to generate trace links. A joint embedding
VI. R ELATED W ORK model is used in code search to map source code and natural
A. Source Code Representation language descriptions into a unified vector space for evaluating
How to effectively represent source code is a fundamental semantics similarity [66]. The above related work mainly uses
problem for software engineering research. neural network models to understand software-related natural
Traditional IR and machine learning methods have been language texts for various tasks while we focus on the neural
widely used for textual token-based code representation. Pro- representation of source code.
grams are transformed to regularized token sequences for code
clone detection [3]. SourcererCC [4] has an improvement by VII. T HREATS TO VALIDITY
exploiting token ordering along with an optimized inverted- There are three main threats to the validity. First, the OJ
index technique. Besides the lexical information, Deckard [55] dataset is not collected from the real production environment.
enriches programs with some syntax-structured information However, BigCloneBench includes code snippets of real-world
for clone detection as well. Based on the statistical and Java repositories from SourceForge [47], which reduces this
machine learning methods, the n-gram model [1] and SVM threat. The second threat is about the construction of OJClone.
[45] are used for classifying source code authorship and As we described, programs under the same problem belong
domains. Maletic et al. [56] adopts LSI to identify semantic to a clone pair. This leads to the uncertainty about whether
similarities of code fragments, and the cohesion of classes in they are true clone pairs, although similar practice has been
software is evaluated by LDA [15]. done by previous work [6]. Nevertheless, BigCloneBench is
Recently deep learning based approaches have attracted also used for validation where the code clones are inspected
much attention to learn distributed representation of source manually. Therefore, we believe it is of little influence on ex-
code [57]. Raychev et al. [58] adopts RNN and n-gram model perimental results. The last threat is that we cannot reproduce
for code completion. Allamanis et al. [59] uses a neural the approach of CDLH due to some details missed in that
context model to suggest method and class names. For defect paper. Alternatively, we construct the same datasets described
prediction, semantic features are extracted from source code by in their paper to reduce this threat. We will make supplement
a deep belief network [60]. DeepBugs [61] represents code via for comparison when the CDLH tool is available.
word2vec for detecting name-based bugs. In order to capture
the syntactical information of ASTs, White et al. [5] exploits VIII. C ONCLUSION
a recursive auto-encoder over the ASTs with pre-trained token
In this work, we have presented an efficient AST-based Neu-
embeddings. TBCNN [2] uses custom convolutional neural
ral Network (ASTNN) to learn vector representations of source
network on ASTs to learn vector representations of code
code fragments, which can capture the lexical, statement-
snippets. CDLH [6] incorporates Tree-LSTM to represent
level syntactical knowledge and naturalness of statements.
the functionality semantics of code fragments. Furthermore,
The model decomposes large ASTs of code fragments into
Allamanis et al. [25] performs Gated Graph Neural Networks
sequences of small statement trees, obtains statement vectors
on program graphs which track the dependencies of the same
by recursively encoding multi-way statement trees, and finally
variables and functions to predict variable names and detect
learns the vector representations of code fragments by leverag-
variable misuses. DeepSim [62] encodes code control flow
ing the naturalness of statements. We have evaluated ASTNN
and data flow into a semantic matrix for measuring code
on two common program comprehension tasks: source code
functional similarity. Multiple different code representations
classification and code clone detection. The experimental
such as identifiers, CFGs and bytecodes can also be integrated
results show that our model significantly outperforms existing
by the ensemble learning technique [26]. Compared with these
approaches. Our code and experimental data are publicly
neural networks, our model focuses on improving existing
available at https://github.com/zhangj1994/astnn.
AST-based methods and can capture the lexical, statement-
In the future, we will further evaluate the proposed model
level syntactical knowledge and the sequential naturalness of
on larger-scale datatsets in different programming languages
statements.
and for a variety of software engineering tasks. We will also
B. Deep Learning in Software Engineering explore other neural models to capture more deep semantics
In recent years, there are many emerging deep learning of source code.
applications in software engineering. DeepAPI [43] uses a
ACKNOWLEDGMENT
sequence-to-sequence neural network to learn representations
of natural language queries and predict relevant API se- This work was supported partly by National Key Research
quences. Lam et al. [63] combines deep neural network with and Development Program of China (No.2016YFB1000804),
IR technique to recommend potential buggy files. Xu et partly by National Natural Science Foundation of China
al. [64] adopts word embeddings and convolutional neural (No.61702024, 61828201 and 61421003).
792
R EFERENCES Fuzziness Knowl.-Based Syst., vol. 6, no. 2, pp. 107–116, Apr. 1998.
[Online]. Available: http://dx.doi.org/10.1142/S0218488598000094
[1] G. Frantzeskou, S. MacDonell, E. Stamatatos, and S. Gritzalis, “Exam- [21] P. Le and W. H. Zuidema, “Quantifying the vanishing gradient and long
ining the significance of high-level programming features in source code distance dependency problem in recursive neural networks and recur-
author classification,” Journal of Systems and Software, vol. 81, no. 3, sive LSTMs,” in Proceedings of the 1st Workshop on Representation
pp. 447–460, 2008. Learning for NLP, Berlin, Germany, 2016, p. 8793.
[2] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional neural [22] S.-Y. Cho, Z. Chi, W.-C. Siu, and A. C. Tsoi, “An improved algorithm
networks over tree structures for programming language processing,” in for learning long-term dependency problems in adaptive processing of
AAAI, vol. 2, no. 3, 2016, p. 4. data structures,” IEEE Transactions on Neural Networks, vol. 14, no. 4,
[3] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: a multilinguistic pp. 781–793, 2003.
token-based code clone detection system for large scale source code,” [23] X. Zhu, P. Sobhani, and H. Guo, “Long short-term memory over
IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654–670, recursive structures,” in Proceedings of the 32Nd International
2002. Conference on International Conference on Machine Learning -
[4] H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes, Volume 37, ser. ICML’15. JMLR.org, 2015, pp. 1604–1612. [Online].
“SourcererCC: Scaling code clone detection to big-code,” in Software Available: http://dl.acm.org/citation.cfm?id=3045118.3045289
Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. [24] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity
IEEE, 2016, pp. 1157–1168. preserving graph embedding,” in Proceedings of the 22Nd ACM
[5] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning SIGKDD International Conference on Knowledge Discovery and Data
code fragments for code clone detection,” in Proceedings of the 31st Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 1105–
IEEE/ACM International Conference on Automated Software Engineer- 1114. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939751
ing. ACM, 2016, pp. 87–98. [25] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning
[6] H.-H. Wei and M. Li, “Supervised deep features for software functional to represent programs with graphs,” in International Conference
clone detection by exploiting lexical and syntactical information in on Learning Representations, 2018. [Online]. Available: https:
source code,” in Proceedings of the 26th International Joint Conference //openreview.net/forum?id=BJOFETxR-
on Artificial Intelligence. AAAI Press, 2017, pp. 3034–3040. [26] G. B. M. D. P. M. W. Michele Tufano, Cody Watson and D. Poshyvanyk,
[7] M. DAmbros, M. Lanza, and R. Robbes, “Evaluating defect prediction “Deep learning similarities from different representations of source
approaches: a benchmark and an extensive comparison,” Empirical code,” in 15th International Conference on Mining Software Reposi-
Software Engineering, vol. 17, no. 4-5, pp. 531–577, 2012. tories, 2018.
[8] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, [27] J. Ferrante, K. J. Ottenstein, and J. D. Warren, “The program
“An empirical comparison of model validation techniques for defect dependence graph and its use in optimization,” ACM Trans. Program.
prediction models,” IEEE Transactions on Software Engineering, vol. 43, Lang. Syst., vol. 9, no. 3, pp. 319–349, Jul. 1987. [Online]. Available:
no. 1, pp. 1–18, 2017. http://doi.acm.org/10.1145/24039.24041
[9] S. Haiduc, J. Aponte, and A. Marcus, “Supporting program compre- [28] E. M. Myers, “A precise inter-procedural data flow algorithm,”
hension with source code summarization,” in Proceedings of the 32nd in Proceedings of the 8th ACM SIGPLAN-SIGACT Symposium
ACM/IEEE International Conference on Software Engineering-Volume on Principles of Programming Languages, ser. POPL ’81. New
2. ACM, 2010, pp. 223–226. York, NY, USA: ACM, 1981, pp. 219–230. [Online]. Available:
[10] S. Jiang, A. Armaly, and C. McMillan, “Automatically generating http://doi.acm.org/10.1145/567532.567556
commit messages from diffs using neural machine translation,” in Pro- [29] “Llvms analysis and transform passes,” https://llvm.org/docs/Passes.
ceedings of the 32nd IEEE/ACM International Conference on Automated html, accessed August 3, 2018.
Software Engineering. IEEE Press, 2017, pp. 135–146. [30] J. Gosling, B. Joy, G. L. Steele, G. Bracha, and A. Buckley, The Java
[11] J. Zhou, H. Zhang, and D. Lo, “Where should the bugs be fixed? Language Specification, Java SE 8 Edition, 1st ed. Addison-Wesley
more accurate information retrieval-based bug localization based on bug Professional, 2014.
reports,” in 2012 34th International Conference on Software Engineering [31] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur,
(ICSE), June 2012, pp. 14–24. “Recurrent neural network based language model,” in Eleventh Annual
[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and Conference of the International Speech Communication Association,
R. Harshman, “Indexing by latent semantic analysis,” Journal of the 2010.
American society for information science, vol. 41, no. 6, p. 391, 1990. [32] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, “On the
[13] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” naturalness of software,” in Software Engineering (ICSE), 2012 34th
Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, International Conference on. IEEE, 2012, pp. 837–847.
2003. [33] B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and
[14] R. Tairas and J. Gray, “An information retrieval process to aid in the P. Devanbu, “On the ”naturalness” of buggy code,” in Proceedings
analysis of code clones,” Empirical Software Engineering, vol. 14, no. 1, of the 38th International Conference on Software Engineering, ser.
pp. 33–56, 2009. ICSE ’16. New York, NY, USA: ACM, 2016, pp. 428–439. [Online].
[15] Y. Liu, D. Poshyvanyk, R. Ferenc, T. Gyimóthy, and N. Chrisochoides, Available: http://doi.acm.org/10.1145/2884781.2884848
“Modeling class cohesion as mixtures of latent topics,” in Software [34] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
Maintenance, 2009. ICSM 2009. IEEE International Conference on. jointly learning to align and translate,” arXiv preprint arXiv:1409.0473,
IEEE, 2009, pp. 233–242. 2014.
[16] A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella, and S. Panichella, [35] D. Tang, B. Qin, and T. Liu, “Document modeling with gated recurrent
“Using IR methods for labeling source code artifacts: Is it worthwhile?” neural network for sentiment classification,” in Proceedings of the 2015
in Program Comprehension (ICPC), 2012 IEEE 20th International conference on empirical methods in natural language processing, 2015,
Conference on. IEEE, 2012, pp. 193–202. pp. 1422–1432.
[17] A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshynanyk, and [36] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, “Clone
A. De Lucia, “How to effectively use topic models for software detection using abstract syntax trees,” in Software Maintenance, 1998.
engineering tasks? an approach based on genetic algorithms,” in Software Proceedings., International Conference on. IEEE, 1998, pp. 368–377.
Engineering (ICSE), 2013 35th International Conference on. IEEE, [37] S. Paul and A. Prakash, “A framework for source code search using
2013, pp. 522–531. program patterns,” IEEE Transactions on Software Engineering, vol. 20,
[18] J. F. Pane, B. A. Myers et al., “Studying the language and structure in no. 6, pp. 463–475, 1994.
non-programmers’ solutions to programming problems,” International [38] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest, “Automatically
Journal of Human-Computer Studies, vol. 54, no. 2, pp. 237–264, 2001. finding patches using genetic programming,” in Proceedings of the 31st
[19] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies International Conference on Software Engineering. IEEE Computer
with gradient descent is difficult,” IEEE transactions on neural networks, Society, 2009, pp. 364–374.
vol. 5, no. 2, pp. 157–166, 1994. [39] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus,
[20] S. Hochreiter, “The vanishing gradient problem during learning “Fine-grained and accurate source code differencing,” in Proceedings
recurrent neural nets and problem solutions,” Int. J. Uncertain. of the 29th ACM/IEEE international conference on Automated software
engineering. ACM, 2014, pp. 313–324.
793
[40] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, “Parsing natural scenes [54] C. K. Roy and J. R. Cordy, “A survey on software clone detection
and natural language with recursive neural networks,” in Proceedings of research,” Queens School of Computing TR, vol. 541, no. 115, pp. 64–
the 28th international conference on machine learning (ICML-11), 2011, 68, 2007.
pp. 129–136. [55] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, “DECKARD: Scalable
[41] K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic rep- and accurate tree-based detection of code clones,” in Proceedings of the
resentations from tree-structured long short-term memory networks,” 29th International Conference on Software Engineering, ser. ICSE ’07.
in Proceedings of the 53rd Annual Meeting of the Association for Washington, DC, USA: IEEE Computer Society, 2007, pp. 96–105.
Computational Linguistics and the 7th International Joint Conference [Online]. Available: http://dx.doi.org/10.1109/ICSE.2007.30
on Natural Language Processing (Volume 1: Long Papers), vol. 1, 2015, [56] J. I. Maletic and A. Marcus, “Supporting program comprehension
pp. 1556–1566. using semantic and structural information,” in Proceedings of the 23rd
[42] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, International Conference on Software Engineering. IEEE Computer
“Distributed representations of words and phrases and their composi- Society, 2001, pp. 103–112.
tionality,” in Advances in neural information processing systems, 2013, [57] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, “A survey
pp. 3111–3119. of machine learning for big code and naturalness,” arXiv preprint
[43] X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep API learning,” in arXiv:1709.06182, 2017.
Proceedings of the 2016 24th ACM SIGSOFT International Symposium [58] V. Raychev, M. Vechev, and E. Yahav, “Code completion with statistical
on Foundations of Software Engineering. ACM, 2016, pp. 631–642. language models,” in Acm Sigplan Notices, vol. 49, no. 6. ACM, 2014,
[44] S. Kawaguchi, P. K. Garg, M. Matsushita, and K. Inoue, “Mudablue: An pp. 419–428.
automatic categorization system for open source repositories,” Journal [59] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, “Suggesting accurate
of Systems and Software, vol. 79, no. 7, pp. 939–953, 2006. method and class names,” in Proceedings of the 2015 10th Joint Meeting
[45] M. Linares-Vásquez, C. McMillan, D. Poshyvanyk, and M. Grechanik, on Foundations of Software Engineering. ACM, 2015, pp. 38–49.
“On using machine learning to automatically classify software applica- [60] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features
tions into domain categories,” Empirical Software Engineering, vol. 19, for defect prediction,” in Proceedings of the 38th International Confer-
no. 3, pp. 582–618, 2014. ence on Software Engineering. ACM, 2016, pp. 297–308.
[46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” [61] M. Pradel and K. Sen, “DeepBugs: A learning approach to
arXiv preprint arXiv:1412.6980, 2014. name-based bug detection,” Proc. ACM Program. Lang., vol. 2,
[47] J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, no. OOPSLA, pp. 147:1–147:25, Oct. 2018. [Online]. Available:
“Towards a big data curated benchmark of inter-project code clones,” in http://doi.acm.org/10.1145/3276517
Software Maintenance and Evolution (ICSME), 2014 IEEE International [62] G. Zhao and J. Huang, “DeepSim: Deep learning code functional
Conference on. IEEE, 2014, pp. 476–480. similarity,” in Proceedings of the 2018 26th ACM Joint Meeting on
[48] A. Charpentier, J.-R. Falleri, D. Lo, and L. Réveillère, “An empirical European Software Engineering Conference and Symposium on the
assessment of bellon’s clone benchmark,” in Proceedings of the 19th Foundations of Software Engineering, ser. ESEC/FSE 2018. New
International Conference on Evaluation and Assessment in Software York, NY, USA: ACM, 2018, pp. 141–151. [Online]. Available:
Engineering. ACM, 2015, p. 20. http://doi.acm.org/10.1145/3236024.3236068
[49] P. Accioly, P. Borba, and G. Cavalcanti, “Understanding semi-structured [63] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, “Combining
merge conflict characteristics in open-source java projects,” Empirical deep learning with information retrieval to localize buggy files for
Software Engineering, pp. 1–35, 2017. bug reports (n),” in Automated Software Engineering (ASE), 2015 30th
[50] Y. Kim, “Convolutional neural networks for sentence classification,” in IEEE/ACM International Conference on. IEEE, 2015, pp. 476–481.
Proceedings of the 2014 Conference on Empirical Methods in Natural [64] B. Xu, D. Ye, Z. Xing, X. Xia, G. Chen, and S. Li, “Predicting seman-
Language Processing (EMNLP), 2014, pp. 1746–1751. tically linkable knowledge in developer online forums via convolutional
[51] W. Zaremba and I. Sutskever, “Learning to execute,” arXiv preprint neural network,” in Proceedings of the 31st IEEE/ACM International
arXiv:1410.4615, 2014. Conference on Automated Software Engineering. ACM, 2016, pp. 51–
[52] X. Huo and M. Li, “Enhancing the unified features to locate buggy files 62.
by exploiting the sequential nature of source code,” in Proceedings of [65] J. Guo, J. Cheng, and J. Cleland-Huang, “Semantically enhanced soft-
the 26th International Joint Conference on Artificial Intelligence, ser. ware traceability using deep learning techniques,” in Proceedings of the
IJCAI’17. AAAI Press, 2017, pp. 1909–1915. [Online]. Available: 39th International Conference on Software Engineering. IEEE Press,
http://dl.acm.org/citation.cfm?id=3172077.3172153 2017, pp. 3–14.
[53] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph [66] X. Gu, H. Zhang, and S. Kim, “Deep code search,” in Proceedings of
sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015. the 2018 40th International Conference on Software Engineering (ICSE
2018). ACM, 2018.
794

Zhang 2019

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Zhang 2019

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Zhang 2019

Uploaded by

Copyright:

Available Formats

2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)

A Novel Neural Source Code Representation Based

{zhangj, wangxu}@act.buaa.edu.cn, hongyu.zhang@newcastle.edu.au, {sunhl, wangkx, liuxd}@act.buaa.edu.cn

1558-1225/19/$31.00 ©2019 IEEE 783

Fig. 1. An example of AST Statement nodes (marked in red)

ℎ ℎ ℎ ℎ ℎ ℎ Pooling Layer Pooling

Reference Formal throws body

OJ dataset is incomplete and uncompilable, they can be 6 https://github.com/Microsoft/gated-graph-neural-network-samples

Groups Methods Test Accuracy(%) Type

SVMs SVM+TF-IDF 79.4 P R F1 P R F1 P R F1

P 52.5 47 76.2 77.3 98.9

Based on the evaluation on the two tasks, we aim to

Time Cost (min/epoch)

Fig. 5. The time cost of different batching methods

You might also like