Kernels for Structured Natural Language Data
Jun Suzuki, Yutaka Sasaki, and Eisaku Maeda
NTT Communication Science Laboratories, NTT Corp.
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237 Japan
{jun, sasaki, maeda}@cslab.kecl.ntt.co.jp
Abstract
This paper devises a novel kernel function for structured natural language
data. In the field of Natural Language Processing, feature extraction
consists of the following two steps: (1) syntactically and semantically
analyzing raw data, i.e., character strings, then representing the results
as discrete structures, such as parse trees and dependency graphs with
part-of-speech tags; (2) creating (possibly high-dimensional) numerical
feature vectors from the discrete structures. The new kernels, called Hierarchical Directed Acyclic Graph (HDAG) kernels, directly accept DAGs
whose nodes can contain DAGs. HDAG data structures are needed to
fully reflect the syntactic and semantic structures that natural language
data inherently have. In this paper, we define the kernel function and
show how it permits efficient calculation. Experiments demonstrate that
the proposed kernels are superior to existing kernel functions, e.g., sequence kernels, tree kernels, and bag-of-words kernels.
1 Introduction
Recent developments in kernel technology enable us to handle discrete structures, such as
sequences, trees, and graphs. Kernel functions suitable for Natural Language Processing
(NLP) have recently been proposed. Convolution Kernels [4, 12] demonstrate how to build
kernels over discrete structures. Since texts can be analyzed as discrete structures, these
discrete kernels have been applied to NLP tasks, such as sequence kernels [8, 9] for text
categorization and tree kernels [1, 2] for (shallow) parsing.
In this paper, we focus on tasks in the application areas of NLP, such as Machine Translation, Text Summarization, Text Categorization and Question Answering. In these tasks,
richer types of information within texts, such as syntactic and semantic information, are required for higher performance. However, syntactic information and semantic information
are formed by very complex structures that cannot be written in simple structures, such as
sequences and trees. The motivation of this paper is to propose kernels specifically suited to
structured natural language data. The proposed kernels can handle several of the structures
found within texts, and the kernel values over these structures can be computed at a practical cost
and in practical time. Accordingly, these kernels can be efficiently applied to learning and clustering
problems in NLP applications.
Figure 1: Examples of structures within texts as determined by basic NLP tools, for the sentence "Junichiro Koizumi is prime minister of Japan.": (1) result of a part-of-speech tagger, (2) result of a noun phrase chunker, (3) result of a named entity tagger, (4) result of a dependency structure analyzer, (5) semantic information from a dictionary (e.g., WordNet), and the combination of (1)-(5).
2 Structured Natural Language Data for Application Tasks in NLP
In general, natural language data contain many kinds of syntactic and semantic structures.
For example, texts have several levels of syntactic and semantic chunks, such as part-of-speech (POS) chunks, named entities (NEs), noun phrase (NP) chunks, sentences, and discourse segments, and these are bound by relation structures, such as dependency structures,
anaphora, discourse relations and coreference. These syntactic and semantic structures can
provide important information for understanding natural language and, moreover, tackling
real tasks in application areas of NLP. The accuracies of basic NLP tools such as POS taggers, NP chunkers, NE taggers, and dependency structure analyzers have improved to the
point that they can help to develop real applications.
This paper proposes a method to handle these syntactic and semantic structures in a single
framework: We combine the results of basic NLP tools to make one hierarchically structured data set. Figure 1 shows an example of structures within texts analyzed by basic NLP
tools that are currently available and that offer easy use and high performance. As shown
in Figure 1, structures in texts can be hierarchical or recursive “graphs in graph”. A certain node can be constructed or characterized by other graphs. Nodes usually have several
kinds of attributes, such as words, POS tags, semantic information such as WordNet [3],
and classes of the named entities. Moreover, the relations between nodes are usually directed. Therefore, we should employ a (1) directed, (2) multi-labeled, and (3) hierarchically
structured graph to model structured natural language data.
Let V be a set of vertices (or nodes) and E be a set of edges (or links). Then, a graph
G = (V, E) is called a directed graph if E is a set of directed links E ⊂ V × V .
Definition 1 (Multi-Labeled Graph) Let Γ be a set of labels (or attributes) and M ⊂ V ×Γ
be label allocations. Then, G = (V, E, M ) is called a multi-labeled graph.
Definition 2 (Hierarchically Structured Graph) Let Gi = (Vi , Ei ) be a subgraph in G =
(V, E) where Vi ⊆ V and Ei ⊆ E, and G = {G1 , . . . , Gn } be a set of subgraphs in G.
F ⊂ V × G represents a set of vertical links from a node v ∈ V to a subgraph Gi ∈ G.
Then, G = (V, E, G, F ) is called a hierarchically structured graph if each node has at most
one vertical edge. Intuitively, vertical link fi,Gj ∈ F from node vi to graph Gj indicates
that node vi contains graph Gj .
Finally, in this paper, we represent structured natural language data by using a multi-labeled hierarchical directed graph.
Definition 3 (Multi-Labeled Hierarchical Directed Graph) G = (V, E, M, G, F ) is a
multi-labeled hierarchical directed graph.
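As a concrete illustration (our own sketch, not code from the paper), a multi-labeled hierarchical directed graph can be represented with a few containers; all class and field names below are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

@dataclass
class HDGraph:
    """Minimal multi-labeled hierarchical directed graph G = (V, E, M, G, F).

    Illustrative sketch only; the names are ours, not from the paper.
    """
    nodes: Set[str] = field(default_factory=set)                   # V
    edges: Set[Tuple[str, str]] = field(default_factory=set)       # E: directed links
    labels: Dict[str, Set[str]] = field(default_factory=dict)      # M: node -> attributes
    subgraphs: Dict[str, "HDGraph"] = field(default_factory=dict)  # G, keyed by subgraph id
    vertical: Dict[str, str] = field(default_factory=dict)         # F: node -> subgraph id

    def add_node(self, v: str, attributes: Set[str]) -> None:
        self.nodes.add(v)
        self.labels[v] = set(attributes)

    def add_edge(self, src: str, dst: str) -> None:
        self.edges.add((src, dst))

    def add_vertical(self, v: str, gid: str, sub: "HDGraph") -> None:
        # each node has at most one vertical link (Definition 2)
        assert v not in self.vertical
        self.subgraphs[gid] = sub
        self.vertical[v] = gid

# Example: the chunk "prime minister" as a subgraph contained in one NP node
inner = HDGraph()
inner.add_node("v_prime", {"prime", "JJ"})
inner.add_node("v_minister", {"minister", "NN"})
inner.add_edge("v_prime", "v_minister")

outer = HDGraph()
outer.add_node("v_np", {"NP"})
outer.add_vertical("v_np", "G_1", inner)
```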
Figure 2: Examples of Hierarchical Directed Graph structures (these are also HDAGs); each letter represents an attribute. Left: a graphical model of structures within a text, with chunks and relations between chunks. Right: the corresponding multi-labeled hierarchical directed graphs G^1 and G^2, with nodes qi and ri, subgraphs, directed links ei,j, and vertical links fi,Gj.
Figure 2 shows examples of multi-labeled hierarchical directed graphs. In this paper, we
call a multi-labeled hierarchical directed graph a hierarchical directed graph.
3 Kernels on Hierarchical Directed Acyclic Graph
In order to calculate kernels efficiently, we first add one constraint: the hierarchical directed graph must contain no cyclic paths. We begin by defining a path on a hierarchical directed graph.
If a node has no vertical link, then the node is called a terminal node, which is denoted as
T ⊂ V ; otherwise it is a non-terminal node, which is denoted as T̄ ⊂ V .
Definition 4 (Hierarchical Path (HiP)) Let p = ⟨vi, ei,j, vj, . . . , vk, ek,l, vl⟩ be a path. Let Υ(v) be a function that returns the subgraph Gi linked to v by a vertical link if v ∈ T̄. Let P(G) be a function that returns the set of all HiPs in G, where links between v ∈ G and v ∉ G are ignored. Then, p^h = ⟨h(vi), ei,j, h(vj), . . . , h(vk), ek,l, h(vl)⟩ is defined as a HiP, where h(v) returns v⟨p^h_x⟩ with p^h_x ∈ P(Gx) s.t. Gx = Υ(v) if v ∈ T̄, and returns v otherwise. Intuitively, a HiP is a path whose nodes may themselves contain paths, e.g., p^h = ⟨vi, ei,j, vj⟨vm, em,n, vn⟩, . . . , vk, ek,l, vl⟩.
Definition 5 (Hierarchical Directed Acyclic Graph (HDAG)) A hierarchical directed graph G = (V, E, M, G, F) is an HDAG if there is no HiP from any node v back to the same node v.
A primitive feature for defining kernels on HDAGs is a hierarchical attribute subsequence.
Definition 6 (Hierarchical Attribute Subsequence (HiAS)) A HiAS is defined as a list of
attributes with hierarchical information extracted from nodes on HiPs.
For example, let p^h = ⟨vi, ei,j, vj⟨vm, em,n, vn⟩, . . . , vk, ek,l, vl⟩ be a HiP; then the HiASs in p^h are written as τ(p^h) = ⟨ai, aj⟨am, an⟩, . . . , ak, al⟩, taking all combinations of ai ∈ τ(vi), where τ(v) of node v is a function that returns the set of attributes allocated to node v, and τ(p^h) of HiP p^h is a function that returns all possible HiASs extracted from HiP p^h.
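To make Definition 6 concrete, the following hedged sketch (our own encoding of a HiP as a nested list of (attribute set, nested HiP) pairs; none of these names come from the paper) enumerates the HiASs of a small HiP by taking every combination of one attribute per node.

```python
from itertools import product
from typing import List, Optional, Set, Tuple

# A HiP element: (attributes of the node, optional nested HiP of its subgraph)
HiPNode = Tuple[Set[str], Optional["HiP"]]
HiP = List[HiPNode]

def hias(hip: HiP) -> List[list]:
    """Return all HiASs of a HiP: every way of picking one attribute per node,
    keeping the hierarchical bracketing as nested lists (illustrative sketch)."""
    per_node = []
    for attrs, child in hip:
        if child is None:
            per_node.append([[a] for a in attrs])
        else:
            # the node's attribute followed by one HiAS of the nested HiP
            per_node.append([[a, sub] for a in attrs for sub in hias(child)])
    # concatenate one choice per node, in path order
    return [sum(choice, []) for choice in product(*per_node)]

# p^h = <v_i, v_j<v_m, v_n>> with one attribute per node
ph = [({"a_i"}, None), ({"a_j"}, [({"a_m"}, None), ({"a_n"}, None)])]
print(hias(ph))  # [['a_i', 'a_j', ['a_m', 'a_n']]]
```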
Γ* denotes the set of all possible HiASs constructed from the attributes in Γ, and γi ∈ Γ* denotes the i'th HiAS. An explicit representation of a feature vector of an HDAG kernel is defined as φ(G) = (φ1(G), . . . , φ|Γ*|(G)), where φ represents the explicit feature mapping from an HDAG to the numerical feature space. The value of φi(G) is the weighted number of occurrences of γi in G. Under this representation, the HDAG kernel

$$K(G^1, G^2) = \sum_{i=1}^{|\Gamma^*|} \phi_i(G^1)\,\phi_i(G^2)$$

calculates the inner product of the weighted common HiASs in the two HDAGs, G^1 and G^2.
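When |Γ*| is small enough that the feature map can be built explicitly, the kernel is simply a sparse dot product of weighted HiAS counts; a minimal sketch under that assumption (the string encodings of HiASs are ours):

```python
from collections import Counter
from typing import Dict, Hashable

def explicit_kernel(phi1: Dict[Hashable, float], phi2: Dict[Hashable, float]) -> float:
    """Inner product of two sparse feature vectors phi(G1), phi(G2),
    each mapping a HiAS (some hashable encoding) to its weighted count."""
    smaller, larger = (phi1, phi2) if len(phi1) <= len(phi2) else (phi2, phi1)
    return sum(v * larger.get(k, 0.0) for k, v in smaller.items())

# toy example with HiASs encoded as strings
phi_g1 = Counter({"<N>": 1.0, "<N,b>": 1.0, "P<a>": 2.0})
phi_g2 = Counter({"<N>": 1.0, "P<a>": 2.0})
print(explicit_kernel(phi_g1, phi_g2))  # 1.0*1.0 + 2.0*2.0 = 5.0
```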
Figure 3: An example of a Hierarchical Directed Graph G with weight factors: every node vi, label Li, directed link ei,j, and vertical link fi,Gj carries its own weight (e.g., v1:1.0, f1,G1:0.8, e3,4:0.7).
In this paper, we use "|" to mean "such that," for simplicity.
$$K_{\mathrm{HDAG}}(G^1, G^2) = \sum_{\gamma_i \in \Gamma^*} \; \sum_{\gamma_i \in \tau(p^h_1) \,|\, p^h_1 \in P(G^1)} \; \sum_{\gamma_i \in \tau(p^h_2) \,|\, p^h_2 \in P(G^2)} W_{\gamma_i}(p^h_1)\, W_{\gamma_i}(p^h_2), \qquad (1)$$

where W_{γi}(p^h) represents the weight value of HiAS γi in HiP p^h. The weight of HiAS γi in HiP p^h is determined by

$$W_{\gamma_i}(p^h) = \prod_{v \in V(p^h)} W_V(v) \prod_{e_{i,j} \in E(p^h)} W_E(v_i, v_j) \prod_{f_{i,G_j} \in F(p^h)} W_F(v_i, G_j) \prod_{a \in \tau(\gamma_i)} W_\Gamma(a), \qquad (2)$$
where WV (v), WE (vi , vj ), WF (vi , Gj ), and WΓ (a) represent the weight of node v, link
from vi to vj , vertical link from vi to subgraph Gj , and attribute a, respectively. An example
of how each weight factor is given is shown in Figure 3. In the case of NL data, for example,
WΓ (a) might be given by the score of tf ∗ idf from large scale documents, WV (v) by the
type of chunk such as word, phrase or named entity, WE (vi , vj ) by the type of relation
between vi and vj , and WF (vi , Gj ) by the number of nodes in Gj .
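Equation (2) is just a product of the individual weight factors collected along the HiP; a minimal sketch, with all argument names and the toy weights being our own assumptions:

```python
from math import prod
from typing import Dict, Iterable, Tuple

def hias_weight(nodes: Iterable[str],
                edges: Iterable[Tuple[str, str]],
                vertical_links: Iterable[Tuple[str, str]],
                attributes: Iterable[str],
                w_v: Dict[str, float],
                w_e: Dict[Tuple[str, str], float],
                w_f: Dict[Tuple[str, str], float],
                w_gamma: Dict[str, float]) -> float:
    """W_{gamma_i}(p^h) as in equation (2): the product of the node, directed-link,
    vertical-link, and attribute weights collected along the HiP (illustrative)."""
    return (prod(w_v[v] for v in nodes)
            * prod(w_e[e] for e in edges)
            * prod(w_f[f] for f in vertical_links)
            * prod(w_gamma[a] for a in attributes))

# toy example loosely mirroring the weight factors of Figure 3
print(hias_weight(nodes=["v1", "v2"], edges=[], vertical_links=[("v1", "G1")],
                  attributes=["L5", "L1"],
                  w_v={"v1": 1.0, "v2": 0.8}, w_e={}, w_f={("v1", "G1"): 0.8},
                  w_gamma={"L5": 1.0, "L1": 0.4}))  # 1.0*0.8 * 0.8 * 1.0*0.4 = 0.256
```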
Soft Structural Matching Frameworks
So that HDAG kernels permit not only exact matching but also approximate matching of substructures, we add the frameworks of the node skip and the relaxation of hierarchical information.
First, we discuss the framework of the node skip. We introduce a decay function ΛV(v) (0 < ΛV(v) ≤ 1), which represents the cost of skipping node v when extracting HiASs from the HiPs; this is almost the same mechanism as in [8]. For example, a HiAS under node skips is written as ⟨∗⟨a2, a3⟩, ∗, ⟨a5⟩⟩ from HiP ⟨v1⟨v2, v3⟩, v4, ⟨v5⟩⟩, where ∗ is the explicit representation of a skipped node.
Next, for the relaxation of hierarchical information, we perform two processes: (1) we merge into one hierarchy any multiple hierarchy information at the same point, for example, ⟨⟨⟨ai, aj⟩⟩, ak⟩ becomes ⟨⟨ai, aj⟩, ak⟩; and (2) we delete hierarchical information around a single node, for example, ⟨⟨ai⟩, aj, ak⟩ becomes ⟨ai, aj, ak⟩.
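A hedged sketch of these two rewriting rules, using our own encoding of HiASs as nested tuples of attribute strings:

```python
def relax(hias):
    """Relaxation of hierarchical information (illustrative sketch):
    (1) <<<a_i, a_j>>, a_k> -> <<a_i, a_j>, a_k>  (merge stacked hierarchies at one point)
    (2) <<a_i>, a_j, a_k>   -> <a_i, a_j, a_k>    (drop brackets around a single node)
    HiASs are encoded as nested tuples; attributes are strings."""
    items = []
    for x in hias:
        if isinstance(x, tuple):
            x = relax(x)
            while len(x) == 1 and isinstance(x[0], tuple):
                x = x[0]                 # rule (1): collapse multiple hierarchies
            if len(x) == 1:
                items.append(x[0])       # rule (2): single element loses its brackets
            else:
                items.append(x)
        else:
            items.append(x)
    return tuple(items)

print(relax(((("a_i", "a_j"),), "a_k")))   # (('a_i', 'a_j'), 'a_k')
print(relax((("a_i",), "a_j", "a_k")))     # ('a_i', 'a_j', 'a_k')
```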
These two frameworks achieve approximate substructure matching automatically. Table 1
shows an explicit representation of the common HiASs (features) of G 1 and G 2 in Figure 2.
For the sake of simplicity, all the weights WV(v), WE(vi, vj), WF(vi, Gj), and WΓ(a) are taken as 1, and ΛV(v) = λ if v has at least one attribute, otherwise ΛV(v) = 1.
Efficient Recursive Computation
In general, when the dimension of the feature space |Γ*| becomes very high, it is computationally infeasible to generate the feature vector φ(G) explicitly. We therefore define an efficient calculation formula between HDAGs G^1 and G^2:

$$K_{\mathrm{HDAG}}(G^1, G^2) = \sum_{q \in Q} \sum_{r \in R} K(q, r), \qquad (3)$$
Table 1: Common HiASs of G^1 and G^2 in Figure 2 (N.S. denotes the node skip; H.R. denotes the relaxation of hierarchical information). For each of G^1 and G^2 the table lists the extracted HiASs, both with explicit skip markers ∗ and without them, together with their weights (powers of λ); the remaining columns give the common HiASs and their values under N.S. alone and under N.S. combined with H.R.
where Q = {q1, . . . , q|Q|} and R = {r1, . . . , r|R|} represent the nodes in G^1 and G^2, respectively. K(q, r) represents the sum of the weighted common HiASs that are extracted from the HiPs whose sink nodes are q and r:

$$K(q, r) = J''_{G^1, G^2}(q, r)\, H(q, r) + \hat{H}(q, r)\, I(q, r) + I(q, r) \qquad (4)$$
The function I(q, r) returns the weighted number of common attributes of nodes q and r,

$$I(q, r) = W_V(q)\, W_V(r) \sum_{a_1 \in \tau(q)} \sum_{a_2 \in \tau(r)} W_\Gamma(a_1)\, W_\Gamma(a_2)\, \delta(a_1, a_2), \qquad (5)$$

where δ(a1, a2) = 1 if a1 = a2, and 0 otherwise. Let H(q, r) be a function that returns the sum of the weighted common HiASs between q and r, including Υ(q) and Υ(r):

$$H(q, r) = \begin{cases} I(q, r) + \bigl(I(q, r) + \Lambda_V(q)\,\Lambda_V(r)\bigr)\, \hat{H}(q, r), & \text{if } q, r \in \bar{T} \\ I(q, r), & \text{otherwise} \end{cases} \qquad (6)$$
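Equation (5), taken on its own, is straightforward to compute directly; a minimal sketch with our own argument names:

```python
from typing import Dict, Set

def node_similarity(attrs_q: Set[str], attrs_r: Set[str],
                    w_v_q: float, w_v_r: float,
                    w_gamma: Dict[str, float]) -> float:
    """I(q, r) of equation (5): the weighted number of attributes shared by q and r."""
    return w_v_q * w_v_r * sum(w_gamma[a] ** 2 for a in attrs_q & attrs_r)

# toy example: two nodes sharing the attribute "NNP"
print(node_similarity({"Japan", "NNP"}, {"Koizumi", "NNP"}, 1.0, 1.0,
                      {"NNP": 0.5, "Japan": 1.0, "Koizumi": 1.0}))  # 1.0*1.0*0.5**2 = 0.25
```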
$$\hat{H}(q, r) = \sum_{s \in G^1_i \,|\, G^1_i = \Upsilon(q)} \; \sum_{t \in G^2_j \,|\, G^2_j = \Upsilon(r)} W_F(q, G^1_i)\, W_F(r, G^2_j)\, J_{G^1_i, G^2_j}(s, t) \qquad (7)$$

Let J_{x,y}(q, r), J'_{x,y}(q, r), and J''_{x,y}(q, r), where x and y are (sub)graphs, be the recursive functions used to calculate H(q, r) and K(q, r):

$$J_{x,y}(q, r) = J''_{x,y}(q, r)\, H(q, r) + H(q, r) \qquad (8)$$

$$J'_{x,y}(q, r) = \begin{cases} \displaystyle\sum_{t \in \{\psi(r) \cap V(y)\}} W_E(q, t)\bigl(\Lambda'_V(t)\, J'_{x,y}(q, t) + J_{x,y}(q, t)\bigr), & \text{if } \psi(r) \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$

$$J''_{x,y}(q, r) = \begin{cases} \displaystyle\sum_{s \in \{\psi(q) \cap V(x)\}} W_E(s, r)\bigl(\Lambda'_V(s)\, J''_{x,y}(s, r) + J'_{x,y}(s, r)\bigr), & \text{if } \psi(q) \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

where $\Lambda'_V(v) = \Lambda_V(v) \prod_{t \in G_i | G_i = \Upsilon(v)} \Lambda_V(t)$ if v ∈ T̄, and $\Lambda'_V(v) = \Lambda_V(v)$ otherwise. The function ψ(q) returns the set of nodes that have direct links to node q; ψ(q) = ∅ means that no node has a direct link to q.
Next, we show the formulas used under the framework of relaxation of hierarchical information. The functions have the same meanings as in the previous formulas. We denote

$$\tilde{H}(q, r) = H(q, r) + H'(q, r). \qquad (11)$$

$$K(q, r) = J''_{G^1, G^2}(q, r)\, \tilde{H}(q, r) + \bigl(H'(q, r) + H''(q, r)\bigr)\, I(q, r) + I(q, r) \qquad (12)$$

$$H(q, r) = \bigl(H'(q, r) + H''(q, r)\bigr)\, I(q, r) + H''(q, r) + I(q, r) \qquad (13)$$

$$H'(q, r) = \begin{cases} \displaystyle\sum_{t \in G^2_j | G^2_j = \Upsilon(r)} W_F(r, G^2_j)\, \tilde{H}(q, t), & \text{if } r \in \bar{T} \\ 0, & \text{otherwise} \end{cases} \qquad (14)$$

$$H''(q, r) = \begin{cases} \displaystyle\sum_{s \in G^1_i | G^1_i = \Upsilon(q)} W_F(q, G^1_i)\, H(s, r) + \hat{H}(q, r), & \text{if } q, r \in \bar{T} \\ \displaystyle\sum_{s \in G^1_i | G^1_i = \Upsilon(q)} W_F(q, G^1_i)\, H(s, r), & \text{if } q \in \bar{T} \\ 0, & \text{otherwise} \end{cases} \qquad (15)$$

$$J_{x,y}(q, r) = J''_{x,y}(q, r)\, \tilde{H}(q, r)$$

$$J'_{x,y}(q, r) = \begin{cases} \displaystyle\sum_{t \in \{\psi(r) \cap V(y)\}} W_E(q, t)\bigl(\Lambda'_V(t)\, J'_{x,y}(q, t) + J_{x,y}(q, t) + \tilde{H}(q, t)\bigr), & \text{if } \psi(r) \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (16)$$

The functions I(q, r), J''_{x,y}(q, r), and Ĥ(q, r) are the same as those shown above.
According to equation (3), given the recursive definition of K(q, r), the value between two HDAGs can be calculated in time O(|Q||R|). In actual use, we may want to evaluate only the subset of all HiASs whose sizes are at most n when determining the kernel value, because of the problem discussed in [1]. This can simply be realized by not calculating those HiASs whose size exceeds n when calculating K(q, r); the calculation cost then becomes O(n|Q||R|).
Finally, we normalize the values of the HDAG kernels to remove any bias introduced by
the number of nodes in the graphs. This normalization corresponds to the standard unit
norm normalization of examples in the feature space corresponding to the kernel: K̂(x, y) = K(x, y) · (K(x, x) K(y, y))^{-1/2} [4].
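This normalization is the usual cosine normalization in feature space; a small sketch, assuming only that k is a kernel callable:

```python
from math import sqrt

def normalized_kernel(k, x, y) -> float:
    """Unit-norm (cosine) normalization: K^(x, y) = K(x, y) / sqrt(K(x, x) K(y, y))."""
    return k(x, y) / sqrt(k(x, x) * k(y, y))

# toy usage with a plain dot-product kernel on lists of numbers
dot = lambda a, b: sum(u * v for u, v in zip(a, b))
print(normalized_kernel(dot, [1.0, 2.0], [2.0, 4.0]))  # 1.0 (parallel vectors)
```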
We will now elucidate an efficient processing algorithm. First, as a pre-process, the nodes
are sorted under two conditions, V (Υ(v)) ≺ v and Ψ(v) ≺ v, where Ψ(v) represents all
nodes that have a path to v. The dynamic programming technique can then be used to compute HDAG kernels very efficiently: by following the sorted order, the values needed to calculate K(q, r) have already been computed in earlier steps.
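The same bottom-up idea can be illustrated on a plain DAG with single node labels, where K(q, r) counts the common label paths ending at q and r. The following is a simplified sketch of the dynamic-programming pattern only, not the full HDAG recursion (it ignores hierarchy, multiple attributes, weights, and node skips):

```python
from itertools import product
from typing import Dict, List, Tuple

def topological_order(nodes: List[str], edges: List[Tuple[str, str]]) -> List[str]:
    """Kahn's algorithm: parents before children (assumes the graph is acyclic)."""
    indeg = {v: 0 for v in nodes}
    for _, dst in edges:
        indeg[dst] += 1
    order, stack = [], [v for v in nodes if indeg[v] == 0]
    while stack:
        v = stack.pop()
        order.append(v)
        for src, dst in edges:
            if src == v:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    stack.append(dst)
    return order

def dag_path_kernel(nodes1, edges1, labels1, nodes2, edges2, labels2) -> float:
    """Count common label paths of two DAGs by dynamic programming over node pairs.

    C[(q, r)] = number of common paths ending at q and r; following the
    topological order guarantees every predecessor pair is already computed."""
    pred1 = {v: [s for s, d in edges1 if d == v] for v in nodes1}
    pred2 = {v: [s for s, d in edges2 if d == v] for v in nodes2}
    C: Dict[Tuple[str, str], float] = {}
    for q, r in product(topological_order(nodes1, edges1),
                        topological_order(nodes2, edges2)):
        if labels1[q] != labels2[r]:
            C[(q, r)] = 0.0
            continue
        C[(q, r)] = 1.0 + sum(C[(s, t)] for s in pred1[q] for t in pred2[r])
    return sum(C.values())

# toy example: both graphs contain the labelled chain a -> b
k = dag_path_kernel(["1", "2"], [("1", "2")], {"1": "a", "2": "b"},
                    ["x", "y", "z"], [("x", "y")], {"x": "a", "y": "b", "z": "c"})
print(k)  # common paths: <a>, <b>, <a,b>  ->  3.0
```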
4 Experiments
Our aim was to test the effectiveness of using the richer syntactic and semantic structures available within texts, which our proposed method can now treat for the first time. We evaluated the performance of the proposed method on the real NLP task of Question Classification, which is similar to the Text Classification task except that it requires many more semantic features within texts [7, 10]. We used three different QA data sets written in Japanese [10].
Figure 4: Examples of the input data of the comparison methods, for the question "Who is prime minister of Japan?": word order of attributes (Seq-K), dependency structures of attributes (DS-K, DAG-K), and hierarchical chunks and their relations (HDAG-K); attributes include words, POS tags, NE classes, and semantic tags.
Table 2: Results of question classification by SVM with the comparison kernel functions, evaluated by F-measure

                 TIME TOP               LOCATION              ORGANIZATION            NUMEX
n             1    2    3    4       1    2    3    4       1    2    3    4       1    2    3    4
HDAG-K        -  .951 .942 .926      -  .802 .813 .784      -  .716 .712 .697      -  .916 .922 .874
DAG-K         -  .946 .913 .869      -  .803 .774 .729      -  .704 .671 .610      -  .912 .880 .813
DS-K          -  .615 .564 .403      -  .544 .507 .466      -  .535 .509 .419      -  .602 .504 .424
Seq-K         -  .946 .910 .866      -  .792 .774 .733      -  .706 .668 .595      -  .913 .885 .815
BOW-K       .899 .906 .885 .853    .748 .772 .757 .745    .638 .690 .633 .571    .841 .846 .804 .719
We compared the performance of the proposed kernel, the HDAG Kernel (HDAG-K), with
DAG kernels (DAG-K), Dependency Structure kernels (DS-K) [2], and sequence kernels
(Seq-K) [9]. Moreover, we evaluated the bag-of-words kernel (BOW-K) [6], that is, the
bag-of-words with polynomial kernels, as the baseline method. The main difference between the methods is their ability to treat syntactic and semantic information within texts. Figure 4 shows the differences in input objects between the methods. For better understanding, these examples are shown in English. We used words, named entity tags, and semantic information [5] as attributes. Seq-K treats only word order, DS-K and DAG-K treat dependency structures, and HDAG-K treats the NP and NE chunks with their dependency structures. We used the same formulas as our proposed method for DAG-K. Comparing
HDAG-K to DAG-K shows the difference in performance between handling the hierarchical structures and not handling them. We extended Seq-K and DS-K to improve the total
performance and to establish a more equal evaluation, with the same conditions, against our
proposed method. Note that though DAG-K and DS-K handle input objects of the same
form, their kernel calculation methods differ as do their return values. We used node skip
parameter ΛV (v) = 0.5 for all nodes v in each comparison.
We used SVM [11] as the kernel-based machine learning algorithm. We evaluated the performance of the comparison methods on the question types TIME TOP, ORGANIZATION, LOCATION, and NUMEX, which are defined in the CRL QA-data¹.
¹ http://www.cs.nyu.edu/~sekine/PROJECT/CRLQA/
Table 2 shows the average F-measure as evaluated by 5-fold cross validation. n in Table 2 indicates the threshold on the number of attributes; that is, for each kernel calculation we evaluated only those HiASs that contain no more than n attributes. As shown in the table, HDAG-K showed the best performance in the experiments. The experiments in this paper were
designed to investigate how to improve the performance by using the richer syntactic and
semantic structures within texts. In the task of Question Classification, a given question
is classified into Question Type, which reflects the intention of the question. These results
indicate that our approach, which incorporates richer structural features within texts, is well suited to tasks in NLP applications.
The original DS-K requires exact matching of the tree structure, even when it is extended
for more flexible matching. This is why DS-K showed the worst performance in our experiments. The sequence, DAG, and HDAG kernels offer approximate matching by the
framework of node skip, which produces better performance in the tasks that evaluate the
intention of the texts.
The structure of HDAG approaches that of DAG if we do not consider the hierarchical
structure. In addition, the structures of sequences and trees are entirely included in that of
DAG. Thus, the HDAG kernel subsumes some of the discrete kernels, such as sequence,
tree, and graph kernels.
5 Conclusions
This paper proposed HDAG kernels, which can handle more of the rich syntactic and
semantic information present within texts. Our proposed method is a very generalized
framework for handling structured natural language data. We evaluated the performance of
HDAG kernels with the real NLP task of question classification. Our experiments showed
that HDAG kernels offer better performance than sequence kernels, tree kernels, and the
baseline method bag-of-words kernels if the target task requires the use of the richer information within texts.
References
[1] M. Collins and N. Duffy. Convolution Kernels for Natural Language. In Proc. of Neural
Information Processing Systems (NIPS’2001), 2001.
[2] M. Collins and N. Duffy. Parsing with a Single Neuron: Convolution Kernels for Natural
Language Problems. In Technical Report UCS-CRL-01-10. UC Santa Cruz, 2001.
[3] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[4] D. Haussler. Convolution Kernels on Discrete Structures. In Technical Report UCS-CRL-99-10.
UC Santa Cruz, 1999.
[5] S. Ikehara, M. Miyazaki, S. Shirai, A. Yokoo, H. Nakaiwa, K. Ogura, Y. Oyama, and Y. Hayashi,
editors. The Semantic Attribute System, Goi-Taikei — A Japanese Lexicon, volume 1. Iwanami
Publishing, 1997. (in Japanese).
[6] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant
Features. In Proc. of European Conference on Machine Learning (ECML '98), pages 137–142,
1998.
[7] X. Li and D. Roth. Learning Question Classifiers. In Proc. of the 19th International Conference
on Computational Linguistics (COLING 2002), pages 556–562, 2002.
[8] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text Classification using String Kernels. Journal of Machine Learning Research, 2:419–444, 2002.
[9] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders. Word-Sequence Kernels. Journal of Machine Learning Research, 3:1059–1082, 2003.
[10] J. Suzuki, H. Taira, Y. Sasaki, and E. Maeda. Question Classification using HDAG Kernel. In
Workshop on Multilingual Summarization and Question Answering (2003), pages 61–68, 2003.
[11] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[12] C. Watkins. Dynamic Alignment Kernels. In Technical Report CSD-TR-98-11. Royal Holloway,
University of London Computer Science Department, 1999.