Entity Embeddings of Categorical Variables
other methods tend to overfit. We also demonstrate that the embeddings obtained from the trained
neural network boost the performance of all tested machine learning methods considerably when
used as the input features instead. As entity embedding defines a distance measure for categorical variables, it can be used for visualizing categorical data and for data clustering.
Distributed representation of entities has been used in many contexts before [19-21]. Our main contributions are: First, we explored this idea in the general function approximation problem and demonstrated its power in a large machine learning competition. Second, we studied the properties of the learned embeddings and showed how the embeddings can be used to understand and visualize categorical data.

II. RELATED WORK

As far as we know, the first domain where the entity embedding method in the context of neural networks was explored is the representation of relational data [19]. More recently, knowledge bases, which are large collections of complex relational data, have seen many works using entity embedding [22-24]. The basic data structure of relational data is the triplet (h, r, t), where h and t are two entities and r is the relation. The entities are mapped to vectors, and the relations are sometimes mapped to a matrix (e.g. Linear Relation Embedding [25]), to two matrices (e.g. Structured Embeddings [26]), or to a vector in the same embedding space as the entities [27], etc. Various kinds of score functions can be defined (see Table 1 of [28]) to measure the likelihood of such a triplet, and the score function is used as the objective function for learning the embeddings.

In natural language processing, word embeddings have been used to map words and phrases [9] into a continuous distributed vector in a semantic space. In this space similar words are close to each other. What is even more interesting is that not only the distances between words are meaningful but also the directions of the difference vectors. For example, it has been observed [11] that the learned word vectors have relations such as:

King - Man ≈ Queen - Woman    (2)

Paris - France ≈ Rome - Italy    (3)

There are many ways [9, 11, 18, 29, 30] to learn word embeddings. A very fast way [31] is to use the word context with the aim to maximize

p(w_c | w) = \frac{\exp(w \cdot w_c)}{\sum_i \exp(w \cdot w_i)},    (4)

where w and w_c are the vector representations of a word w and its neighbor word w_c inside the context window, while p(w_c | w) is the probability to have w_c in the context of w. The sum runs over the whole vocabulary. Word embeddings can also be learned with supervised methods. For example, in Ref. [30] the embeddings can be learned using text with labeled sentiment. This approach is very close to the approach we use in this paper, but in a different context.
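To make Eq. (4) concrete, the following is a minimal numerical sketch of the context probability; it only illustrates the formula (the fast method of [31] relies on approximations such as negative sampling rather than this full softmax), and the matrix W and the index names are hypothetical:

```python
import numpy as np

def context_probability(W, w_idx, c_idx):
    """Softmax context probability p(w_c | w) of Eq. (4).

    W     : (vocab_size, dim) matrix of word vectors (hypothetical example data)
    w_idx : index of the center word w
    c_idx : index of the candidate context word w_c
    """
    scores = W @ W[w_idx]              # dot product of w with every word vector
    scores -= scores.max()             # numerical stability before exponentiation
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c_idx]

# toy usage with random vectors
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 50))
print(context_probability(W, w_idx=3, c_idx=7))
```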
III. TREE BASED METHODS

As tree based methods are the most widely used method for structured data and they are the main methods we are comparing to, we will briefly review them here. Random Forests and in particular Gradient Boosted Trees have proven their capabilities in numerous recent Kaggle competitions [13]. In the following, we will briefly describe the process of growing a single decision tree used for regression, as well as two popular tree ensemble methods: random forests and gradient tree boosting.

A. Single decision tree

Decision trees partition the feature space X into M different sub-spaces R_1, R_2, ..., R_M. The function f in equation (1) is thus modeled as

f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)    (5)

with I being the indicator function, i.e. I(x \in R_m) = 1 if x \in R_m and 0 otherwise. Using the common sum of squares

L = \sum_i (y_i - f(x_i))^2    (6)

as the loss function, it follows from standard linear regression theory that, for given R_m, the optimal choices for the parameters c_m are just the averages

c_m = \frac{1}{|R_m|} \sum_{x_i \in R_m} y_i    (7)

with |R_m| the number of elements in the set R_m. Ideally, we would try to find the optimal partition {R_m} that minimizes the loss function (6). However, this is not computationally feasible, as the number of possible partitions grows exponentially with the size of the feature space X. Instead, a greedy algorithm is applied that finds subsequent splits of X minimizing (6) locally at each split. To start with, given a splitting variable j and a split point s, we define the pair of half-planes

R_1(j, s) = {X | X_j \le s}    (8)

R_2(j, s) = {X | X_j > s}    (9)

and optimize (6) for j and s:

\min_{j,s} \Big[ \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]    (10)

The optimal choices for the parameters c_1 and c_2 follow directly from (7).
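To illustrate the greedy search of Eqs. (8)-(10), the following is a minimal sketch of a single exhaustive split search; it is a simplified illustration only (one split, no recursion or stopping criteria), and all names are our own rather than part of any library:

```python
import numpy as np

def best_split(X, y):
    """Exhaustively search the split (j, s) minimizing Eq. (10)."""
    best = (None, None, np.inf)          # (feature j, threshold s, loss)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s          # R1(j, s), Eq. (8)
            right = ~left                # R2(j, s), Eq. (9)
            if not left.any() or not right.any():
                continue
            # optimal constants c1, c2 are the region means, Eq. (7)
            loss = ((y[left] - y[left].mean()) ** 2).sum() + \
                   ((y[right] - y[right].mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best

# toy usage: the target depends on feature 0 through a step at 0.5
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] <= 0.5, 1.0, 3.0) + 0.1 * rng.normal(size=200)
print(best_split(X, y))   # should recover feature 0 with a threshold near 0.5
```

Growing a full tree repeats this search recursively inside each resulting half-plane until a stopping criterion, such as a minimum leaf size, is reached.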
[Figure 1: one-hot encoding layers A, B, C feed entity embedding (EE) layers A, B, C, which are merged and followed by dense layer 0, dense layer 1, and the output layer.]

FIG. 1. Illustration that entity embedding layers are equivalent to extra layers on top of each one-hot encoded input.

integer as nominal numbers. For example, the month ordering has nothing to do with the number of days in a month (January is closer to June than to February regarding the number of days it has). Therefore we will treat both types of discrete variables in the same way. The task of entity embedding is to map discrete values to a multi-dimensional space where values with similar function output are close to each other.

merged layer is treated like a normal input layer in neural networks and other layers can be built on top of it. The whole network can be trained with the standard back-propagation method. In this way, the entity embedding layer learns about the intrinsic properties of each category, while the deeper layers form complex combinations of them.

The dimensions of the embedding layers D_i are hyperparameters that need to be pre-defined. The dimension of an entity embedding is bounded between 1 and m_i - 1, where m_i is the number of values of the categorical variable x_i. In practice we chose the dimensions based on experiments. The following empirical guidelines are used during this process: First, the more complex the variable, the more dimensions; we roughly estimated how many features/aspects one might need to describe the entities and used that as the dimension to start with. Second, if we had no clue about the first guideline, then we started with m_i - 1.
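As a concrete illustration of the architecture in Fig. 1 and of choosing the dimensions D_i, the following is a minimal sketch using Keras; the layer sizes, embedding dimensions and variable names are illustrative assumptions, not the exact network used in the competition:

```python
from tensorflow import keras
from tensorflow.keras import layers

# hypothetical cardinalities m_i and embedding dimensions D_i for three categorical inputs
cardinalities = {"store": 1115, "day_of_week": 7, "month": 12}
embedding_dims = {"store": 10, "day_of_week": 6, "month": 6}

inputs, embedded = [], []
for name, m_i in cardinalities.items():
    inp = keras.Input(shape=(1,), name=name)                 # integer-encoded category
    emb = layers.Embedding(input_dim=m_i,
                           output_dim=embedding_dims[name],  # D_i, between 1 and m_i - 1
                           name=f"EE_{name}")(inp)
    embedded.append(layers.Flatten()(emb))
    inputs.append(inp)

merged = layers.Concatenate()(embedded)              # merged layer, used like a normal input layer
x = layers.Dense(1000, activation="relu")(merged)    # dense layer 0
x = layers.Dense(500, activation="relu")(x)          # dense layer 1
output = layers.Dense(1)(x)                          # output layer (regression)

model = keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer="adam", loss="mean_absolute_error")  # trained with back-propagation
```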
It would be good to have more theoretical guidelines on how to choose D_i. We think this probably relates to the problem of embedding a finite metric space, and that is what we want to explore next.
FIG. 2. Distance in the store embedding space versus distance in the metric space for 10000 random pairs of stores.

Ref. [32] proved sufficient and necessary conditions to isometrically embed a generic metric space in a Euclidean metric space. Applied to the metric of Eq. (19), it would require that the matrix

(M_i)_{pq} = e^{-\langle |f(x_i^p, x_{\bar{i}}) - f(x_i^q, x_{\bar{i}})| \rangle_{x_{\bar{i}}}}    (23)

is positive definite. We took the store feature (see Table I) as an example, verified this numerically, and found that it is not true. Therefore the store metric space as we defined it cannot be isometrically embedded in a Euclidean space.
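The numerical check mentioned above can be sketched as follows; this is a minimal illustration of our own, assuming a precomputed symmetric matrix D of pairwise distances given by the metric of Eq. (19), from which the matrix of Eq. (23) is built element-wise as exp(-D):

```python
import numpy as np

def is_positive_definite(D, tol=1e-10):
    """Check whether M with M[p, q] = exp(-D[p, q]) is positive definite.

    D : (n, n) symmetric matrix of pairwise distances between the n values
        of a categorical variable (hypothetical precomputed input).
    """
    M = np.exp(-D)                        # element-wise, as in Eq. (23)
    eigenvalues = np.linalg.eigvalsh(M)   # eigenvalues of the symmetric matrix M
    return eigenvalues.min() > tol

# toy usage with a random symmetric, zero-diagonal distance-like matrix
rng = np.random.default_rng(0)
A = rng.uniform(size=(6, 6))
D = (A + A.T) / 2
np.fill_diagonal(D, 0.0)
print(is_positive_definite(D))
```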
What is the relation of the learned embeddings of a categorical variable to this metric space? To answer this question we plot in Fig. 2 the distance between 10000 random store pairs in the learned store embedding space and in the metric space as defined in Eq. (19). It is obviously not an isometric embedding. We can also see from the figure that there is a linear relation with a well defined upper and lower boundary. Why are there clear boundaries and what does the shape mean? Is this related to some theorems regarding the distorted mapping of metric spaces [33, 34]? How is the distortion related to the embedding dimension D_i? If we apply multidimensional scaling [35] directly on the metric d_i, how is the result different from the learned entity embeddings of the neural network? Due to time limits we will leave these interesting questions for future investigations.

VI. EXPERIMENTS

In this paper we will use the dataset from the Kaggle Rossmann Sale Prediction competition as an example. The goal of the competition is to predict the daily sales of each store of Dirk Rossmann GmbH (abbreviated as Rossmann in the following) as accurately as possible.

TABLE I. Features we used from the Kaggle Rossmann competition dataset. promotion signals whether or not the store was issuing a promotion on the observation date. state corresponds to the German state where the store resides. The last column describes the dimension we used for each entity embedding (EE).

The dataset published by the Rossmann hosts¹ has two parts: the first part is train.csv, which comprises about 2.5 years of daily sales data for 1115 different Rossmann stores, resulting in a total number of 1017210 records; the second part is store.csv, which describes some further details about each of these 1115 stores.

Besides the data published by the host, external data was also allowed as long as it was shared on the competition forum. Many features had been proposed by participants of this competition. For example, the Kaggle user dune dweller smartly figured out the German state each store belongs to by correlating the store open variable with the state holiday and school holiday calendars of the German states (state and school holidays differ in Germany from state to state)². Other popular external data was weather data, Google Trends data and even sport event dates.

In our winning solution we used most of the above data, but in this paper the aim is to compare different machine learning methods and not to obtain the very best result. Therefore, to simplify, we use only a small subset of the features (see Table I) and we do not apply any feature engineering.

The dataset is divided into a 90% portion for training and a 10% portion for testing. We consider both a split leaving the temporal structure of the data intact (i.e., using the first 90% of days for training), as well as a random shuffling of the dataset before the training-test split was applied.
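The two splitting schemes can be sketched as follows; this is only an illustration: the file layout (a Date column in train.csv) and the helper names are our own assumptions, not code from our solution.

```python
import pandas as pd

# hypothetical loading of the competition file described above
train_df = pd.read_csv("train.csv", parse_dates=["Date"])

def time_based_split(df, train_fraction=0.9):
    """Keep the temporal structure: earliest ~90% of the records (by date) for training."""
    df = df.sort_values("Date")
    cutoff = int(len(df) * train_fraction)
    return df.iloc[:cutoff], df.iloc[cutoff:]

def shuffled_split(df, train_fraction=0.9, seed=0):
    """Shuffle first, so the test set follows the training distribution."""
    df = df.sample(frac=1.0, random_state=seed)
    cutoff = int(len(df) * train_fraction)
    return df.iloc[:cutoff], df.iloc[cutoff:]

train_part, test_part = time_based_split(train_df)
```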
For shuffled data, the test data shares the same statistical distribution as the training data. More specifically, as the Rossmann dataset has relatively few features compared to the number of samples, the distribution of the test data in the feature space is well represented by the distribution of the training data. The shuffled data is useful for us to benchmark model performance with respect to the pure statistical prediction accuracy. For the time based splitting (i.e. unshuffled data), the test data

¹ https://www.kaggle.com/c/rossmann-store-sales/data
² https://www.kaggle.com/c/rossmann-store-sales/forums/t/17048/putting-stores-on-the-map
As can be seen in Fig. 1, the neural network is fed with the direct product of all the entity embedding sub-spaces. We also investigated the statistical properties of this concatenated space. We found that there is no strong correlation between the individual subspaces. It is thus sufficient to consider them independently, as we did in this section.
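One way such a check could be carried out is sketched below; the interpretation (largest absolute Pearson correlation between coordinates of two different embedding subspaces, computed over the training records) and the input arrays are our own assumptions, not the exact procedure used here:

```python
import numpy as np

def max_cross_correlation(emb_a, emb_b):
    """Largest absolute Pearson correlation between any coordinate of
    subspace A and any coordinate of subspace B, computed over samples.

    emb_a : (n_samples, D_a) embedded values of one categorical variable
    emb_b : (n_samples, D_b) embedded values of another categorical variable
    (hypothetical inputs, e.g. the per-record store and month embeddings)
    """
    corr = np.corrcoef(emb_a.T, emb_b.T)             # (D_a + D_b) x (D_a + D_b) matrix
    cross = corr[:emb_a.shape[1], emb_a.shape[1]:]   # block between the two subspaces
    return np.abs(cross).max()

# toy usage with random data
rng = np.random.default_rng(3)
print(max_cross_correlation(rng.normal(size=(1000, 10)), rng.normal(size=(1000, 6))))
```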
ACKNOWLEDGMENTS

We thank Dirk Rossmann GmbH for allowing us to use their data for this publication. We thank Kaggle Inc. for hosting such an interesting competition. We thank Gert Jacobusse for helpful discussions regarding xgboost. We thank Neokami Inc. co-founders Ozel Christo and Andrei Ciobotar for their support in joining the competition and writing this paper.
[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature 521, 436-444 (2015).
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (2012) pp. 1097-1105.
[3] Matthew D. Zeiler and Rob Fergus, "Visualizing and understanding convolutional networks," in Computer Vision - ECCV 2014 (Springer, 2014) pp. 818-833.
[4] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556 (2014).
[5] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," arXiv preprint arXiv:1312.6229 (2013).
[6] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) pp. 1-9.
[7] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," Signal Processing Magazine, IEEE 29, 82-97 (2012).
[8] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran, "Deep convolutional neural networks for LVCSR," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (IEEE, 2013) pp. 8614-8618.
[9] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin, "A neural probabilistic language model," The Journal of Machine Learning Research 3, 1137-1155 (2003).
[10] Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Cernocky, "Empirical evaluation and combination of advanced language modeling techniques," in INTERSPEECH (2011) pp. 605-608.
[11] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781 (2013).
[12] Yoon Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882 (2014).
[13] Tianqi Chen and Carlos Guestrin, "XGBoost: A scalable tree boosting system," arXiv:1603.02754 (2016).
[14] George Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems 2, 303-314 (1989).
[15] Michael Nielsen, Neural Networks and Deep Learning (Determination Press, 2015) Chap. 4.
[16] Bernardo Llanas, Sagrario Lantaron, and Francisco J. Sainz, "Constructive approximation of discontinuous functions by neural networks," Neural Processing Letters 27, 209-226 (2008).
[17] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE 86, 2278-2324 (1998).
[18] Jeffrey Pennington, Richard Socher, and Christopher D. Manning, "GloVe: Global vectors for word representation," in Empirical Methods in Natural Language Processing (EMNLP) (2014) pp. 1532-1543.
[19] Geoffrey E. Hinton, "Learning distributed representations of concepts," in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Vol. 1 (Amherst, MA) p. 12.
[20] Yoshua Bengio and Samy Bengio, "Modeling high-dimensional discrete data with multi-layer neural networks," in NIPS, Vol. 99 (1999) pp. 400-406.
[21] Alberto Paccanaro and Geoffrey E. Hinton, "Learning hierarchical structures with linear relational embedding," in Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference, Vol. 2 (MIT Press, 2002) p. 857.
[22] Rodolphe Jenatton, Nicolas L. Roux, Antoine Bordes, and Guillaume R. Obozinski, "A latent factor model for highly multi-relational data," in Advances in Neural Information Processing Systems (2012) pp. 3167-3175.
[23] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng, "Embedding entities and relations for learning and inference in knowledge bases," arXiv preprint arXiv:1412.6575 (2014).
[24] Fei Wu, Jun Song, Yi Yang, Xi Li, Zhongfei Zhang, and Yueting Zhuang, "Structured embedding via pairwise relations and long-range interactions in knowledge base," (2015).
[25] Alberto Paccanaro and Geoffrey E. Hinton, "Extracting distributed representations of concepts and relations from positive and negative propositions," in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, Vol. 2 (IEEE) pp. 259-264.
[26] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio, "Learning structured embeddings of knowledge bases," in Conference on Artificial Intelligence, EPFL-CONF-192344 (2011).
[27] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio, "A semantic matching energy function for learning with multi-relational data," Machine Learning 94, 233-259 (2014).
[28] Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao, "Learning to represent knowledge graphs with Gaussian embedding," in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (ACM, 2015) pp. 623-632.
[29] Omer Levy and Yoav Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems (2014) pp. 2177-2185.
[30] Yoon Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882 (2014).
[31] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems (2013) pp. 3111-3119.
[32] I. J. Schoenberg, "Metric spaces and positive definite functions," Transactions of the American Mathematical Society (1938).
[33] Ittai Abraham, Yair Bartal, and Ofer Neiman, "On embedding of finite metric spaces into Hilbert space," Leibniz Center for Research in Computer Science (2006).
[34] Jiří Matoušek, "On the distortion required for embedding finite metric spaces into normed spaces," Israel Journal of Mathematics, 333-344 (1996).
[35] Joseph B. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika 29, 1-27 (1964).
[36] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," CoRR abs/1412.6980 (2014).
[37] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research 12, 2825-2830 (2011).
[38] Laurens van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research 9, 85 (2008).
[39] K. V. Mardia, "Measures of multivariate skewness and kurtosis with applications," Biometrika 57, 519-530 (1970).