TransG: A Generative Model for Knowledge Graph Embedding
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2316–2325,
Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics
Figure 1: … indicate the embedding vectors of triple (h, r, t), with the head and tail entity vectors projected with respect to the relation space.

In spite of the success of these models, none of the previous models has formally discussed the issue of multiple relation semantics: that a relation may have multiple meanings revealed by the entity pairs associated with the corresponding triples. As can be seen from Fig. 1, visualization results on embedding vectors obtained from TransE (Bordes et al., 2013) show that there are different clusters for a specific relation, and different clusters indicate different latent semantics. For example, the relation HasPart has at least two latent semantics: composition-related, as in (Table, HasPart, Leg), and location-related, as in (Atlantics, HasPart, NewYorkBay). As another example, in Freebase, (Jon Snow, birth place, Winter Fall) and (George R. R. Martin, birth place, U.S.) are mapped to the schemas /fictional universe/fictional character/place of birth and /people/person/place of birth respectively, indicating that birth place has different meanings.

This phenomenon is quite common in knowledge bases, for two reasons: artificial simplification and the nature of knowledge. On the one hand, knowledge base curators cannot include too many similar relations, so abstracting multiple similar relations into one specific relation is a common trick. On the other hand, both language and knowledge representations often involve ambiguous information. The ambiguity of knowledge implies a semantic mixture. For example, when we mention "Expert", we may refer to a scientist, a businessman or a writer, so the concept "Expert" may be ambiguous in a specific situation, or generally a semantic mixture of these cases.

However, since previous translation-based models adopt h_r + r ≈ t_r, they assign only one translation vector per relation, so they cannot deal with the issue of multiple relation semantics. To illustrate this more clearly, as shown in Fig. 2, there is only one representation for the relation HasPart in traditional models, so those models make more errors when embedding the triples of that relation. Instead, our proposed model leverages a Bayesian non-parametric infinite mixture model to handle multiple relation semantics by generating multiple translation components for each relation. Thus, different semantics are characterized by different components in our embedding model. For example, we can distinguish the two clusters HasPart.1 and HasPart.2, where the relation semantics are automatically clustered to represent the meanings of the associated entity pairs.

To summarize, our contributions are as follows:

• We propose a new issue in knowledge graph embedding, multiple relation semantics: a relation in a knowledge graph may have different meanings revealed by the associated entity pairs. This issue has never been studied previously.

• To address this issue, we propose a novel Bayesian non-parametric infinite mixture embedding model, TransG. The model can automatically discover the semantic clusters of a relation and leverage a mixture of multiple relation components for translating an entity pair. Moreover, we present new insights from the generative perspective.

• Extensive experiments show that our proposed model obtains substantial improvements over the state-of-the-art baselines.

2 Related Work

Translation-Based Embedding. Existing translation-based embedding methods share the same translation principle h + r ≈ t, with the score function designed as

f_r(h, t) = ||h_r + r − t_r||₂²

where h_r, t_r are the entity embedding vectors projected into the relation-specific space. TransE (Bordes et al., 2013) keeps the entities in the original entity space: h_r = h, t_r = t. TransH (Wang et al., 2014) projects entities onto a relation-specific hyperplane to address the issue of complex relations: h_r = h − w_r^T h w_r, t_r = t − w_r^T t w_r. To address the same issue, TransR (Lin et al., 2015b) transforms the entity embeddings with a relation-specific matrix: h_r = M_r h, t_r = M_r t. TransR also proposes an ad-hoc clustering-based method, CTransR, where the entity pairs for a relation are clustered into different groups and the pairs in the same group share the same relation vector. In comparison, our model addresses this issue in a theoretically more principled way and does not require clustering as a pre-process. Furthermore, our model performs much better than CTransR, as expected. TransM (Fan et al.,
2014) leverages the structure of the knowledge graph by pre-computing a distinct weight for each training triple to enhance embedding. KG2E (He et al., 2015) is a probabilistic embedding method for modeling the uncertainty in the knowledge graph.

Figure 2: Visualization of multiple relation semantics. The data are selected from Wordnet. The dots are correct triples that belong to the HasPart relation, while the circles are incorrect ones. Each point's coordinate is the difference vector between the tail and head entity, which should lie near the centre. (a) The correct triples are hard to distinguish from the incorrect ones. (b) By applying multiple semantic components, our proposed model can discriminate the correct triples from the wrong ones.

Many works improve translation-based methods by incorporating additional information. For instance, (Guo et al., 2015) aims at discovering the geometric structure of the embedding space to make it semantically smooth. (Wang et al., 2014) focuses on bridging the gap between knowledge and texts, with a loss function for jointly modeling the knowledge graph and text resources. (Wang et al., 2015) incorporates rules related to relation types such as 1-N and N-1. PTransE (Lin et al., 2015a) takes path information in the knowledge graph into account.

Since the previous models are point-wise modeling methods, ManifoldE (Xiao et al., 2016) proposes a novel manifold-based approach for knowledge graph embedding. With the aid of kernel tricks, manifold-based methods can improve embedding performance substantially.

Structured & Unstructured Embedding. The structured embedding model (Bordes et al., 2011) transforms the entity space with head-specific and tail-specific matrices. The score function is defined as f_r(h, t) = ||M_{h,r} h − M_{t,r} t||. According to (Socher et al., 2013), this model cannot capture the relationship between entities. Semantic Matching Energy (SME) (Bordes et al., 2012) (Bordes et al., 2014) can handle the correlations between entities and relations through matrix and Hadamard products. The unstructured model (Bordes et al., 2012) can be seen as a simplified version of TransE that ignores all relation-related information; its score function is simply f_r(h, t) = ||h − t||₂².

Neural Network Based Embedding. The Single Layer Model (SLM) (Socher et al., 2013) applies a neural network to knowledge graph embedding. Its score function is defined as f_r(h, t) = u_r^T g(M_{r,1} h + M_{r,2} t), where M_{r,1}, M_{r,2} are relation-specific weight matrices. The Neural Tensor Network (NTN) (Socher et al., 2013) defines a very expressive score function by applying a tensor: f_r(h, t) = u_r^T g(h^T W_{··r} t + M_{r,1} h + M_{r,2} t + b_r), where u_r is a relation-specific linear layer, g(·) is the tanh function, and W ∈ R^{d×d×k} is a 3-way tensor.

Factor Models. The latent factor models (Jenatton et al., 2012) (Sutskever et al., 2009) attempt to capture the second-order correlations between entities with a quadratic form; the score function is defined as f_r(h, t) = h^T W_r t. RESCAL is a collective matrix factorization model that is also commonly used for knowledge base embedding (Nickel et al., 2011) (Nickel et al., 2012).

3 Methods

3.1 TransG: A Generative Model for Embedding

As just mentioned, a single translation vector per relation may be insufficient to model multiple relation semantics. In this paper, we propose to use a Bayesian non-parametric infinite mixture embedding model (Griffiths and Ghahramani, 2011). The generative process of the model is as follows:

1. For an entity e ∈ E:

   (a) Draw each entity embedding mean vector from a standard normal distribution as a prior: u_e ∼ N(0, 1).

Table 1: Statistics of datasets

Data     WN18     FB15K    WN11     FB13
#Rel     18       1,345    11       13
#Ent     40,943   14,951   38,696   75,043
#Train   141,442  483,142  112,581  316,232
#Valid   5,000    50,000   2,609    5,908
#Test    5,000    59,071   10,544   23,733
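For concreteness, the translation-based score functions reviewed in Related Work differ only in how the entities are projected before applying the translation principle h + r ≈ t. A minimal NumPy sketch (illustrative only; the vectors and dimensions are made up, and this is not the paper's implementation):

```python
import numpy as np

def score_transe(h, r, t):
    # TransE: entities stay in the original space, f_r(h,t) = ||h + r - t||_2^2
    return np.sum((h + r - t) ** 2)

def score_transh(h, r, t, w):
    # TransH: project entities onto the relation hyperplane with normal w
    w = w / np.linalg.norm(w)            # unit normal of the hyperplane
    h_r = h - np.dot(w, h) * w
    t_r = t - np.dot(w, t) * w
    return np.sum((h_r + r - t_r) ** 2)

def score_transr(h, r, t, M):
    # TransR: map entities with a relation-specific matrix M_r
    return np.sum((M @ h + r - M @ t) ** 2)

# Toy check: a triple that satisfies h + r = t exactly scores 0 under TransE.
h = np.array([1.0, 0.0]); r = np.array([0.0, 1.0]); t = np.array([1.0, 1.0])
assert score_transe(h, r, t) == 0.0
```

TransG departs from all three by replacing the single translation vector r with a mixture of translation components per relation.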
Table 2: Evaluation results on link prediction

Datasets                                      WN18                        FB15K
Metric                             Mean Rank      HITS@10(%)    Mean Rank     HITS@10(%)
                                   Raw   Filter   Raw   Filter  Raw   Filter  Raw   Filter
Unstructured (Bordes et al., 2011) 315   304      35.3  38.2    1,074 979     4.5   6.3
RESCAL (Nickel et al., 2012)       1,180 1,163    37.2  52.8    828   683     28.4  44.1
SE (Bordes et al., 2011)           1,011 985      68.5  80.5    273   162     28.8  39.8
SME (bilinear) (Bordes et al., 2012) 526 509      54.7  61.3    284   158     31.3  41.3
LFM (Jenatton et al., 2012)        469   456      71.4  81.6    283   164     26.0  33.1
TransE (Bordes et al., 2013)       263   251      75.4  89.2    243   125     34.9  47.1
TransH (Wang et al., 2014)         401   388      73.0  82.3    212   87      45.7  64.4
TransR (Lin et al., 2015b)         238   225      79.8  92.0    198   77      48.2  68.7
CTransR (Lin et al., 2015b)        231   218      79.4  92.3    199   75      48.4  70.2
PTransE (Lin et al., 2015a)        N/A   N/A      N/A   N/A     207   58      51.4  84.6
KG2E (He et al., 2015)             362   348      80.5  93.2    183   69      47.5  71.5
TransG (this paper)                357   345      84.5  94.9    152   50      55.9  88.2
TransG have their own contributions, but the primary one contributes the most.

3.3 Training Algorithm

The maximum data likelihood principle is applied for training. For the non-parametric part, π_{r,m} is generated from the CRP with Gibbs sampling, similar to (He et al., 2015) and (Griffiths and Ghahramani, 2011). A new component is sampled for a triple (h, r, t) with the probability

P(m_{r,new}) = β e^(−||h − t||₂² / (σ_h² + σ_t² + 2)) / ( β e^(−||h − t||₂² / (σ_h² + σ_t² + 2)) + P{(h, r, t)} )    (3)

where P{(h, r, t)} is the current posterior probability. As to the other parts, in order to better distinguish the true triples from the false ones, we maximize the likelihood of the true triples and minimize that of the false ones, with L2 regularization on the embeddings:

min L = − Σ_{(h,r,t)∈∆} ln( Σ_{m=1}^{M_r} π_{r,m} e^(−||u_h + u_{r,m} − u_t||₂² / (σ_h² + σ_t²)) )
        + Σ_{(h′,r′,t′)∈∆′} ln( Σ_{m=1}^{M_{r′}} π_{r′,m} e^(−||u_{h′} + u_{r′,m} − u_{t′}||₂² / (σ_{h′}² + σ_{t′}²)) )
        + C ( Σ_{r∈R} Σ_{m=1}^{M_r} ||u_{r,m}||₂² + Σ_{e∈E} ||u_e||₂² )

where ∆ is the set of golden triples and ∆′ is the set of false triples, C controls the scaling degree of the regularization, E is the set of entities and R is the set of relations. Note that the mixing factors π and the variances σ are also learned jointly in the optimization.

SGD is applied to solve this optimization problem. In addition, we apply a trick to control the parameter updating process during training: for highly implausible triples, the update is skipped. Hence, we introduce a condition similar to the one TransE (Bordes et al., 2013) adopts: the training algorithm updates the embedding vectors only if the ratio of the likelihood of the true triple to that of the false one is sufficiently large:

P{(h, r, t)} / P{(h′, r′, t′)} = [ Σ_{m=1}^{M_r} π_{r,m} e^(−||u_h + u_{r,m} − u_t||₂² / (σ_h² + σ_t²)) ] / [ Σ_{m=1}^{M_{r′}} π_{r′,m} e^(−||u_{h′} + u_{r′,m} − u_{t′}||₂² / (σ_{h′}² + σ_{t′}²)) ]
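The CRP step of Eq. (3), and the mixture likelihood it relies on, can be sketched as follows (a schematic, not the paper's released implementation; the component vectors, mixing factors and variances are toy values, and sh2/st2 stand for the variances σ_h², σ_t²):

```python
import numpy as np

def triple_likelihood(h, t, components, pi, sh2, st2):
    # P{(h, r, t)}: mixture over the M_r translation components of relation r,
    # each weighted by its mixing factor pi[m].
    return sum(pi[m] * np.exp(-np.sum((h + u_rm - t) ** 2) / (sh2 + st2))
               for m, u_rm in enumerate(components))

def p_new_component(h, t, p_triple, beta, sh2, st2):
    # Eq. (3): probability that Gibbs sampling spawns a new semantic
    # component for the triple (h, r, t); beta scales the CRP prior.
    new = beta * np.exp(-np.sum((h - t) ** 2) / (sh2 + st2 + 2))
    return new / (new + p_triple)

# Toy example: a triple already explained well by an existing component
# is unlikely to trigger a new one.
h, t = np.array([0.0, 0.0]), np.array([1.0, 0.0])
comps = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
p = triple_likelihood(h, t, comps, pi=[0.5, 0.5], sh2=1.0, st2=1.0)
assert 0.0 < p_new_component(h, t, p, beta=0.1, sh2=1.0, st2=1.0) < 0.5
```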
Table 3: Evaluation results on FB15K by mapping properties of relations (%)

Tasks                              Predicting Head (HITS@10)   Predicting Tail (HITS@10)
Relation Category                  1-1   1-N   N-1   N-N       1-1   1-N   N-1   N-N
Unstructured (Bordes et al., 2011) 34.5  2.5   6.1   6.6       34.3  4.2   1.9   6.6
SE (Bordes et al., 2011)           35.6  62.6  17.2  37.5      34.9  14.6  68.3  41.3
SME (bilinear) (Bordes et al., 2012) 30.9 69.6 19.9  38.6      28.2  13.1  76.0  41.8
TransE (Bordes et al., 2013)       43.7  65.7  18.2  47.2      43.7  19.7  66.7  50.0
TransH (Wang et al., 2014)         66.8  87.6  28.7  64.5      65.5  39.8  83.3  67.2
TransR (Lin et al., 2015b)         78.8  89.2  34.1  69.2      79.2  37.4  90.4  72.1
CTransR (Lin et al., 2015b)        81.5  89.0  34.7  71.2      80.8  38.6  90.1  73.8
PTransE (Lin et al., 2015a)        90.1  92.0  58.7  86.1      90.1  70.7  87.5  88.7
KG2E (He et al., 2015)             92.3  93.7  66.0  69.6      92.6  67.9  94.4  73.4
TransG (this paper)                93.0  96.0  62.5  86.8      92.8  68.1  94.5  88.8
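Tab. 3 groups relations into the 1-1, 1-N, N-1 and N-N categories. The grouping rule is not restated here; the sketch below assumes the common convention of (Bordes et al., 2013), which thresholds the average number of tails per head (tph) and heads per tail (hpt) at 1.5:

```python
from collections import defaultdict

def relation_category(pairs, threshold=1.5):
    # pairs: list of (head, tail) entity pairs observed for one relation.
    tails_per_head = defaultdict(set)
    heads_per_tail = defaultdict(set)
    for h, t in pairs:
        tails_per_head[h].add(t)
        heads_per_tail[t].add(h)
    tph = sum(len(s) for s in tails_per_head.values()) / len(tails_per_head)
    hpt = sum(len(s) for s in heads_per_tail.values()) / len(heads_per_tail)
    head_side = "N" if hpt >= threshold else "1"   # many heads per tail?
    tail_side = "N" if tph >= threshold else "1"   # many tails per head?
    return f"{head_side}-{tail_side}"

# One head linked to three tails is a 1-N relation.
assert relation_category([("a", "x"), ("a", "y"), ("a", "z")]) == "1-N"
```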
costs 1200.0s on the same computer for the same dataset.

4 Experiments

Our experiments are conducted on four public benchmark datasets that are subsets of Wordnet and Freebase. The statistics of these datasets are listed in Tab. 1. Experiments are conducted on two tasks: Link Prediction and Triple Classification. To further demonstrate how the proposed model captures multiple relation semantics, we present a semantic component analysis at the end of this section.

4.1 Link Prediction

Link prediction concerns knowledge graph completion: given an entity and a relation, the embedding models predict the other, missing entity. More specifically, in this task we predict t given (h, r, ∗), or predict h given (∗, r, t). WN18 and FB15K are two benchmark datasets for this task. Note that many AI tasks can be enhanced by link prediction, such as relation extraction (Hoffmann et al., 2011).

Evaluation Protocol. We adopt the same protocol used in previous studies. For each testing triple (h, r, t), we corrupt it by replacing the tail t (or the head h) with every entity e in the knowledge graph and calculate a probabilistic score for the corrupted triple (h, r, e) (or (e, r, t)) with the score function f_r(h, e). After ranking these scores in descending order, we obtain the rank of the original triple. There are two metrics for evaluation: the averaged rank (Mean Rank) and the proportion of testing triples whose rank is not larger than 10 (HITS@10). This is called the "Raw" setting. When we filter out the corrupted triples that already exist in the training, validation, or test datasets, we obtain the "Filter" setting. If a corrupted triple exists in the knowledge graph, ranking it ahead of the original triple is also acceptable; to eliminate this case, the "Filter" setting is preferred. In both settings, a lower Mean Rank and a higher HITS@10 mean better performance.

Implementation. As the datasets are the same, we directly report the experimental results of several baselines from the literature, as in (Bordes et al., 2013), (Wang et al., 2014) and (Lin et al., 2015b). We tried several settings on the validation dataset to find the best configuration; for example, we tried embedding dimensions of 100, 200, 300 and 400. Under the "bern." sampling strategy, the optimal configurations are: learning rate α = 0.001, embedding dimension k = 100, γ = 2.5, β = 0.05 on WN18; and α = 0.0015, k = 400, γ = 3.0, β = 0.1 on FB15K. Note that all the symbols are introduced in "Methods". We train the model until convergence.

Results. Evaluation results on WN18 and FB15K are reported in Tab. 2 and Tab. 3.¹ We observe that:

1. TransG outperforms all the baselines substantially. Compared to TransR, TransG makes improvements of 2.9% on WN18 and 26.0% on FB15K, and the averaged semantic component number is 5.67 on WN18 and 8.77 on FB15K. This result demonstrates that capturing multiple relation semantics benefits embedding.

¹ Note that correctly regularized TransE can produce much better performance than what was reported in the original paper; see (García-Durán et al., 2015).
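The evaluation protocol above can be sketched end-to-end. In this illustrative code (not the evaluation script used in the paper), the score function is a placeholder argument, higher scores are assumed to mean more plausible triples, and only tail corruption is shown; head corruption is symmetric:

```python
def link_prediction_ranks(test_triples, all_entities, known_triples, score):
    # For each test triple, corrupt the tail with every entity, rank the
    # candidates by score in descending order, and record the rank of the
    # true tail. "Filter" drops corrupted triples already in the KG.
    raw_ranks, filt_ranks = [], []
    for h, r, t in test_triples:
        ranked = sorted(all_entities, key=lambda e: -score(h, r, e))
        raw_ranks.append(ranked.index(t) + 1)
        filt = [e for e in ranked if e == t or (h, r, e) not in known_triples]
        filt_ranks.append(filt.index(t) + 1)
    return raw_ranks, filt_ranks

def mean_rank(ranks):
    return sum(ranks) / len(ranks)

def hits_at_10(ranks):
    # Percentage of test triples ranked within the top 10.
    return 100.0 * sum(r <= 10 for r in ranks) / len(ranks)
```

For example, if a corrupted candidate outscores the true tail but already appears in the knowledge graph, the raw rank is 2 while the filtered rank is 1.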
Table 4: Different clusters in WN11 and FB13 relations.

Relation      Cluster       Triples (Head, Tail)
PartOf        Location      (Capital of Utah, Beehive State), (Hindustan, Bharat) ...
              Composition   (Monitor, Television), (Bush, Adult Body), (Cell Organ, Cell) ...
Religion      Catholicism   (Cimabue, Catholicism), (St.Catald, Catholicism) ...
              Others        (Michal Czajkowsk, Islam), (Honinbo Sansa, Buddhism) ...
DomainRegion  Abstract      (Computer Science, Security System), (Computer Science, PL) ...
              Specific      (Computer Science, Router), (Computer Science, Disk File) ...
Profession    Scientist     (Michael Woodruf, Surgeon), (El Lissitzky, Architect) ...
              Businessman   (Enoch Pratt, Entrepreneur), (Charles Tennant, Magnate) ...
              Writer        (Vlad. Gardin, Screen Writer), (John Huston, Screen Writer) ...
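The cluster columns in Tab. 4 correspond to a relation's translation components: once trained, each (head, tail) pair can be assigned to the component with the largest weighted likelihood. A hedged sketch of that assignment (variable names are assumptions, not from the paper's code):

```python
import numpy as np

def dominant_component(h, t, components, pi, sh2, st2):
    # argmax_m pi[m] * exp(-||h + u_{r,m} - t||^2 / (sigma_h^2 + sigma_t^2)):
    # the semantic component that best explains this entity pair.
    weights = [pi[m] * np.exp(-np.sum((h + u_rm - t) ** 2) / (sh2 + st2))
               for m, u_rm in enumerate(components)]
    return int(np.argmax(weights))
```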
Figure 4: Semantic component number on WN18 (left) and FB13 (right).
the best configuration. The optimal configurations of TransG are as follows: "bern" sampling, learning rate α = 0.001, k = 50, γ = 6.0, β = 0.1 on WN11; and "bern" sampling, α = 0.002, k = 400, γ = 3.0, β = 0.1 on FB13.

Results. Accuracies are reported in Tab. 5 and Fig. 3. Our observations are as follows:

1. TransG outperforms all the baselines remarkably. Compared to TransR, TransG improves by 1.7% on WN11 and 5.8% on FB13, and the averaged semantic component number is 2.63 on WN11 and 4.53 on FB13. This result shows the benefit of capturing multiple relation semantics for a relation.

2. Relations such as "Synset Domain" and "Type Of", which hold more semantic components, are improved much more. In comparison, the relation "Similar" holds only one semantic component and is almost not improved. This further demonstrates that capturing multiple relation semantics benefits embedding.

4.3 Semantic Component Analysis

In this subsection, we analyse the number of semantic components for different relations and list the component numbers on the datasets WN18 and FB13 in Fig. 4.

Results. As Fig. 4 and Tab. 4 show, we have the following observations:

1. Multiple semantic components are indeed necessary for most relations. Except for relations such as "Also See", "Synset Usage" and "Gender", all other relations have more than one semantic component.

2. Different components indeed correspond to different semantics, justifying the theoretical analysis and the effectiveness of TransG. For example, "Profession" has at least three semantics: scientist-related, as in (El Lissitzky, Architect); businessman-related, as in (Enoch Pratt, Entrepreneur); and writer-related, as in (Vlad. Gardin, Screen Writer).

3. WN11 and WN18 are different subsets of Wordnet. The semantic component number is determined by the triples in the dataset, so it is reasonable that similar relations, such as "Synset Domain" and "Synset Usage", may hold different semantic component numbers in WN11 and WN18.

5 Conclusion

In this paper, we propose a generative Bayesian non-parametric infinite mixture embedding model, TransG, to address a new issue, multiple relation semantics, which is common in knowledge graphs. TransG can discover the latent semantics of a relation automatically and leverage a mixture of relation components for embedding. Extensive experiments show that our method achieves substantial improvements over the state-of-the-art baselines.

6 Acknowledgements

This work was partly supported by the National Basic Research Program (973 Program) under grant No. 2012CB316301/2013CB329403, the National Science Foundation of China under grant
No. 61272227/61332007, and the Beijing Higher Education Young Elite Teacher Project.

References

Antoine Bordes, Jason Weston, Ronan Collobert, Yoshua Bengio, et al. 2011. Learning structured embeddings of knowledge bases. In Proceedings of the Twenty-fifth AAAI Conference on Artificial Intelligence.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In International Conference on Artificial Intelligence and Statistics, pages 127–135.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259.

Miao Fan, Qiang Zhou, Emily Chang, and Thomas Fang Zheng. 2014. Transition-based knowledge graph embedding with relational mapping properties. In Proceedings of the 28th Pacific Asia Conference on Language, Information, and Computation, pages 328–337.

Alberto García-Durán, Antoine Bordes, Nicolas Usunier, and Yves Grandvalet. 2015. Combining two and three-way embeddings models for link prediction in knowledge bases. CoRR, abs/1506.00999.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256.

Thomas L. Griffiths and Zoubin Ghahramani. 2011. The Indian buffet process: An introduction and review. The Journal of Machine Learning Research, 12:1185–1224.

Shu Guo, Quan Wang, Bin Wang, Lihong Wang, and Li Guo. 2015. Semantically smooth knowledge graph embedding. In Proceedings of ACL.

Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao. 2015. Learning to represent knowledge graphs with Gaussian embedding. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 623–632. ACM.

Rodolphe Jenatton, Nicolas L. Roux, Antoine Bordes, and Guillaume R. Obozinski. 2012. A latent factor model for highly multi-relational data. In Advances in Neural Information Processing Systems, pages 3167–3175.

Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2015a. Modeling relation paths for representation learning of knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015b. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 809–816.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2012. Factorizing YAGO: scalable machine learning for linked data. In Proceedings of the 21st International Conference on World Wide Web, pages 271–280. ACM.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934.

Ilya Sutskever, Joshua B. Tenenbaum, and Ruslan Salakhutdinov. 2009. Modelling relational data using Bayesian clustered tensor factorization. In Advances in Neural Information Processing Systems, pages 1821–1828.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1112–1119.

Quan Wang, Bin Wang, and Li Guo. 2015. Knowledge base completion using embeddings and rules. In Proceedings of the 24th International Joint Conference on Artificial Intelligence.

Han Xiao, Minlie Huang, and Xiaoyan Zhu. 2016. From one point to a manifold: Knowledge graph embedding for precise link prediction. In IJCAI.