Entity Embeddings of Categorical Variables


Cheng Guo and Felix Berkhahn


Neokami Inc.
(Dated: April 25, 2016)
arXiv:1604.06737v1 [cs.LG] 22 Apr 2016

We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly, by mapping similar values close to each other in the embedding space, it reveals the intrinsic properties of the categorical variables. We applied it successfully in a recent Kaggle competition (footnote a) and were able to reach the third position with relatively simple features. We further demonstrate in this paper that entity embedding helps the neural network to generalize better when the data is sparse and statistics are unknown. Thus it is especially useful for datasets with lots of high cardinality features, where other methods tend to overfit. We also demonstrate that the embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as the input features instead. As entity embedding defines a distance measure for categorical variables, it can be used for visualizing categorical data and for data clustering.

I. INTRODUCTION

Many advances have been achieved in the past 15 years in the field of neural networks due to a combination of faster computers, more data and better methods [1]. Neural networks revolutionized computer vision [2-6], speech recognition [7, 8] and natural language processing [9-12] and have replaced or are replacing the long dominating methods in each field.

Unlike in the above fields where data is unstructured, neural networks are not as prominent when dealing with machine learning problems with structured data. This can be easily seen from the fact that the top teams in many online machine learning competitions like those hosted on Kaggle use tree based methods more often than neural networks [13].

To understand this, we compare the neural network and decision tree approaches to the general machine learning problem, which is to approximate the function

$$ y = f(x_1, x_2, \ldots, x_n). \tag{1} $$

Given a set of input values $(x_1, x_2, \ldots, x_n)$ it generates the target output value $y$.

In principle a neural network can approximate any continuous function [14, 15] and piecewise continuous function [16]. However, it is not suitable for approximating arbitrary non-continuous functions, as it assumes a certain level of continuity in its general form. During the training phase the continuity of the data guarantees the convergence of the optimization, and during the prediction phase it ensures that slightly changing the values of the input keeps the output stable. On the other hand, decision trees do not assume any continuity of the feature variables and can divide the states of a variable as finely as necessary.

Interestingly, the problems we usually face in nature are often continuous if we use the right representation of data. Whenever we find a better way to reveal the continuity of the data we increase the power of neural networks to learn the data. For example, convolutional neural networks [17] group pixels in the same neighborhood together. This increases the continuity of the data compared to simply representing the image as a flattened vector of all the pixel values of the image. The rise of neural networks in natural language processing is based on word embeddings [9, 11, 18], which put words with similar meaning closer to each other in a word space, thus increasing the continuity of the words compared to using one-hot encoding of words.

Unlike unstructured data found in nature, structured data with categorical features may not have continuity at all, and even if it has, it may not be obvious. The continuous nature of neural networks limits their applicability to categorical variables. Therefore, naively applying neural networks to structured data with an integer representation for categorical variables does not work well. A common way to circumvent this problem is to use one-hot encoding, but it has two shortcomings. First, when we have many high cardinality features, one-hot encoding often results in an unrealistic amount of computational resource requirement. Second, it treats different values of categorical variables as completely independent of each other and often ignores the informative relations between them.

In this paper we show how to use the entity embedding method to automatically learn the representation of categorical features in multi-dimensional spaces, which puts values with similar effect in the function approximation problem Eq. (1) close to each other, and thereby reveals the intrinsic continuity of the data and helps neural networks as well as other common machine learning algorithms to solve the problem.

cheng.guo.work@gmail.com
felix.berkhahn@gmail.com
a https://www.kaggle.com/c/rossmann-store-sales

Distributed representation of entities has been used in many contexts before [19-21]. Our main contributions are: First, we explored this idea in the general function approximation problem and demonstrated its power in a large machine learning competition. Second, we studied the properties of the learned embeddings and showed how the embeddings can be used to understand and visualize categorical data.

II. RELATED WORK

As far as we know, the first domain where the entity embedding method in the context of neural networks has been explored is the representation of relational data [19]. More recently, knowledge bases, which are large collections of complex relational data, have seen many works using entity embedding [22-24]. The basic data structure of relational data is triplets $(h, r, t)$, where $h$ and $t$ are two entities and $r$ is the relation. The entities are mapped to vectors, and relations are sometimes mapped to a matrix (e.g. Linear Relation Embedding [25]), or two matrices (e.g. Structured Embeddings [26]), or a vector in the same embedding space as the entities [27], etc. Various kinds of score functions can be defined (see Table 1 of [28]) to measure the likelihood of such a triplet, and the score function is used as the objective function for learning the embeddings.

In natural language processing, word embeddings have been used to map words and phrases [9] into a continuous distributed vector in a semantic space. In this space similar words are closer. What is even more interesting is that not only the distance between words is meaningful but also the direction of the difference vectors. For example, it has been observed [11] that the learned word vectors have relations such as:

$$ \text{King} - \text{Man} \approx \text{Queen} - \text{Woman} \tag{2} $$

$$ \text{Paris} - \text{France} \approx \text{Rome} - \text{Italy} \tag{3} $$

There are many ways [9, 11, 18, 29, 30] to learn word embeddings. A very fast way [31] is to use the word context with the aim to maximize

$$ p(w_c \mid w) = \frac{\exp(\vec{w} \cdot \vec{w}_c)}{\sum_i \exp(\vec{w} \cdot \vec{w}_i)}, \tag{4} $$

where $\vec{w}$ and $\vec{w}_c$ are the vector representations of a word $w$ and its neighbor word $w_c$ inside the context window, while $p(w_c \mid w)$ is the probability to have $w_c$ in the context of $w$. The sum is over the whole vocabulary. Word embeddings can also be learned with supervised methods. For example, in Ref. [30] the embeddings can be learned using text with labeled sentiment. This approach is very close to the approach we use in this paper but in a different context.

III. TREE BASED METHODS

As tree based methods are the most widely used methods for structured data and they are the main methods we are comparing to, we will briefly review them here. Random Forests and in particular Gradient Boosted Trees have proven their capabilities in numerous recent Kaggle competitions [13]. In the following, we will briefly describe the process of growing a single decision tree used for regression, as well as two popular tree ensemble methods: random forests and gradient tree boosting.

A. Single decision tree

Decision trees partition the feature space $X$ into $M$ different sub-spaces $R_1, R_2, \ldots, R_M$. The function $f$ in equation (1) is thus modeled as

$$ f(x) = \sum_{m=1}^{M} c_m I(x \in R_m) \tag{5} $$

with $I$ being the indicator function

$$ I(x \in R_m) = \begin{cases} 1 & \text{if } x \in R_m \\ 0 & \text{else.} \end{cases} $$

Using the common sum of squares

$$ L = \sum_i \left( y_i - f(x_i) \right)^2 \tag{6} $$

as loss function, it follows from standard linear regression theory that, for given $R_m$, the optimal choices for the parameters $c_m$ are just the averages

$$ c_m = \frac{1}{|R_m|} \sum_{x_i \in R_m} y_i \tag{7} $$

with $|R_m|$ the number of elements in the set $R_m$. Ideally, we would try to find the optimal partition $\{R_m\}$ so as to minimize the loss function (6). However, this is not computationally feasible, as the number of possible partitions grows exponentially with the size of the feature space $X$. Instead, a greedy algorithm is applied that finds successive splits of $X$, each chosen to minimize (6) locally. To start with, given a splitting variable $j$ and a split point $s$, we define the pair of half-planes

$$ R_1(j, s) = \{ X \mid X_j \le s \} \tag{8} $$

$$ R_2(j, s) = \{ X \mid X_j > s \} \tag{9} $$

and optimize (6) for $j$ and $s$:

$$ \min_{j, s} \left[ \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right] \tag{10} $$

The optimal choices for the parameters $c_1$ and $c_2$ follow directly from (7).
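The greedy search over Eqs. (8)-(10) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `best_split` and the toy data are hypothetical, and real tree libraries use much faster search strategies.

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the single split (j, s) minimizing Eq. (10).

    X: array of shape (n_samples, n_features); y: array of shape (n_samples,).
    Returns (j, s, loss). Hypothetical helper, for illustration only.
    """
    best = (None, None, np.inf)
    n_features = X.shape[1]
    for j in range(n_features):
        # Candidate split points: every observed value except the largest,
        # so that both half-planes R1 and R2 are non-empty.
        for s in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= s          # R1(j, s), Eq. (8)
            right = ~left                # R2(j, s), Eq. (9)
            c1 = y[left].mean()          # optimal constants from Eq. (7)
            c2 = y[right].mean()
            loss = ((y[left] - c1) ** 2).sum() + ((y[right] - c2) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best

# Toy data: the target depends only on feature 1, with a jump between 2 and 3.
X = np.array([[5.0, 1.0], [1.0, 2.0], [4.0, 3.0], [2.0, 4.0]])
y = np.array([10.0, 10.0, 20.0, 20.0])
j, s, loss = best_split(X, y)
```

On this toy data the search selects feature 1 with split point 2, which separates the two target levels exactly. Growing a full tree would repeat the same search on each resulting half-plane.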

After (10) is solved for $j$ and $s$, the same algorithm is applied recursively on the two half-planes $R_1$ and $R_2$ until the tree is fully grown.

The size up to which the tree is grown governs the complexity of the model and thus implies a bias-variance tradeoff: A very large tree likely overfits the training data, while a very small tree likely is not complex enough to capture the important dependencies in the data. There are several strategies and measures available to control the tree size. A very popular strategy is pruning, where first large trees are grown until they reach a minimal tree size (like minimum number of nodes or minimal height), and then internal nodes are collapsed (i.e. pruned) to minimize a cost-complexity measure $C_\alpha$ such as

$$ C_\alpha = \sum_i \left( y_i - f(x_i) \right)^2 + \alpha |T| \tag{11} $$

where $|T|$ is the number of terminal nodes in the tree $T$ and $\alpha$ is a free parameter to control the complexity of the model.

B. Random forests

A single decision tree is a highly non-linear classifier with typically low bias but high variance. Random forests address the problem of high variance by establishing a committee (i.e. average) of identically distributed single decision trees.

To be precise, random forests contain $N$ single decision trees grown by the following algorithm:

1. Draw a bootstrap sample from the training data, that is, select $n$ random records from the training data with replacement.
2. Grow a single decision tree $T_i$ as described in section III A, with the only difference that at each split-node $m$ features are randomly picked that are considered for the best split at the split-node.
3. Output the ensemble of all decision trees $\{T_i\}_{i=1 \ldots N}$.

For regression, an unseen sample is then predicted as:

$$ f(x) = \frac{1}{N} \sum_{i=1}^{N} T_i(x) \tag{12} $$

As all $T_i$ are identically distributed, the linear average of (12) preserves the presumably low bias of a single decision tree. However, averaging will reduce the variance of the single decision trees.

C. Gradient boosted trees

Gradient tree boosting is another ensemble tree based method, that is, we try to approximate $f(x)$ by a sum of trees $T_k$:

$$ f(x) = \sum_{k=1}^{N} T_k(x) \tag{13} $$

For a generic loss function $L$ (not necessarily quadratic), the $n$-th tree is grown on the quantity $r_{in}$

$$ r_{in} = - \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{n-1}} \tag{14} $$

computed using its $n-1$ predecessor trees. Here, the $y_i$ are the target labels, $x_i$ are the sample features and $f_{n-1}$ is the sum of the first $n-1$ trees

$$ f_{n-1}(x) = \sum_{k=1}^{n-1} T_k(x) \tag{15} $$

In case of a squared error loss $L = \sum_i (y_i - f(x_i))^2$ this amounts (up to a constant factor) to fitting the $n$-th tree on the residuals $y_i - f_{n-1}(x_i)$ of its $n-1$ predecessor trees. Hence, equation (14) generalizes to a generic loss function by minimizing the loss function $L$ iteratively at each step along the gradient descent direction in the space spanned by all possible trees $T_n$. This is where the name gradient boosted trees comes from.

As for every boosting algorithm, the next iterative classifier $T_n$ tries to correct its $T_{n-1}$ predecessors. Hence, in contrast to random forests, gradient tree boosting also aims to minimize the bias of the ensemble and not only the variance.

IV. STRUCTURED DATA

By structured data we mean data collected and organized in a table format with columns representing different features (variables) or target values and rows representing different samples. We focus on this type of data in this paper.

The most common variable types in structured data are continuous variables and discrete variables. Continuous variables such as temperature, price, weight can be represented by real numbers. Discrete variables such as age, color, bus line number can be represented by integers. Often the integers are just used for convenience to label the different states and have no information in themselves. For example, if we use 1, 2, 3 to represent red, blue and yellow, one cannot assume that blue is bigger than red, or that the average of red and yellow is blue, or anything that introduces additional information based on the properties of integers. These integers are called nominal numbers. Other times there is an intrinsic ordering in the integer index, such as age or month of the year. These integers are called cardinal numbers or ordinal numbers. Note that the meaning or order may not be more useful for the problem than only considering the

integer as nominal numbers. For example, the month ordering has nothing to do with the number of days in a month (January is closer to June than February regarding the number of days it has). Therefore we will treat both types of discrete variables in the same way. The task of entity embedding is to map discrete values to a multi-dimensional space where values with similar function output are close to each other.

V. ENTITY EMBEDDING

To learn the approximation of the function Eq. (1) we map each state of a discrete variable to a vector as

$$ e_i : x_i \mapsto \mathbf{x}_i \tag{16} $$

This mapping is equivalent to an extra layer of linear neurons on top of the one-hot encoded input as shown in Fig. 1. To show this we represent one-hot encoding of $x_i$ as

$$ u_i : x_i \mapsto \delta_{x_i \alpha}, \tag{17} $$

where $\delta_{x_i \alpha}$ is the Kronecker delta and the possible values for $\alpha$ are the same as $x_i$. If $m_i$ is the number of values for the categorical variable $x_i$, then $\delta_{x_i \alpha}$ is a vector of length $m_i$, where the element is only non-zero when $\alpha = x_i$.

The output of the extra layer of linear neurons given the input $x_i$ is defined as

$$ \mathbf{x}_i \equiv \sum_\alpha w_{\alpha\beta} \, \delta_{x_i \alpha} = w_{x_i \beta} \tag{18} $$

where $w_{\alpha\beta}$ is the weight connecting the one-hot encoding layer to the embedding layer and $\beta$ is the index of the embedding layer. Now we can see that the mapped embeddings are just the weights of this layer and can be learned in the same way as the parameters of other neural network layers.

[Figure 1: network diagram. From bottom to top: one-hot encoding layers A, B, C; EE layers A, B, C; dense layer 0; dense layer 1; output layer.]

FIG. 1. Illustration that entity embedding layers are equivalent to extra layers on top of each one-hot encoded input.

After we use entity embeddings to represent all categorical variables, all embedding layers and the input of all continuous variables (if any) are concatenated. The merged layer is treated like a normal input layer in neural networks and other layers can be built on top of it. The whole network can be trained with the standard back-propagation method. In this way, the entity embedding layer learns about the intrinsic properties of each category, while the deeper layers form complex combinations of them.

The dimensions of the embedding layers $D_i$ are hyper-parameters that need to be pre-defined. The dimension of an entity embedding is bounded between 1 and $m_i - 1$, where $m_i$ is the number of values for the categorical variable $x_i$. In practice we chose the dimensions based on experiments. The following empirical guidelines were used during this process: First, the more complex the variable, the more dimensions. We roughly estimated how many features/aspects one might need to describe the entities and used that as the dimension to start with. Second, if we had no clue about the first guideline, then we started with $m_i - 1$.

It would be good to have more theoretical guidelines on how to choose $D_i$. We think this probably relates to the problem of embedding of finite metric space, and that is what we want to explore next.

A. Relation with embedding of finite metric space

With entity embedding we want to put similar values of a categorical variable closer to each other in the embedding space. If we use a real number to define similarity of the values then entity embedding is closely related to the embedding of finite metric space problem in topology.

We define a finite metric space $(M_i, d_i)$ associated with each categorical variable $x_i$ in the function approximation problem Eq. (1), where $M_i$ is the set of all possible values of $x_i$ and $d_i$ is the metric on $M_i$, which is the distance function between any pair of values $(x_i^p, x_i^q)$ of $x_i$. We want $d_i$ to represent the similarity of $(x_i^p, x_i^q)$. There are many ways to define it; one simple and natural way is

$$ d_i(x_i^p, x_i^q) = \left\langle \left| f(x_i^p, \bar{x}_i) - f(x_i^q, \bar{x}_i) \right| \right\rangle_{\bar{x}_i} \tag{19} $$

where $\langle \ldots \rangle_{\bar{x}_i}$ is the average over all values of the parameters of $f$ other than $x_i$, and $\bar{x}_i$ is shorter notation for $(x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots)$. It can be verified that the following conditions hold for the metric Eq. (19):

$$ d_i(x_i^p, x_i^q) = 0 \iff x_i^p = x_i^q \tag{20} $$

$$ d_i(x_i^p, x_i^q) = d_i(x_i^q, x_i^p) \tag{21} $$

$$ d_i(x_i^p, x_i^r) \le d_i(x_i^p, x_i^q) + d_i(x_i^q, x_i^r) \tag{22} $$

Eq. (20) may not automatically hold in a real problem when two different values always generate the same output. However, this also means one value is redundant, and it is easy to simply merge these two values into one by redefining the categorical variable to make Eq. (20) hold.

Ref. [32] proved sufficient and necessary conditions to isometrically embed a generic metric space in a Euclidean metric space. Applied to the metric Eq. (19), it would require that the matrix

$$ (M_i)_{pq} = e^{ - \left\langle \left| f(x_i^p, \bar{x}_i) - f(x_i^q, \bar{x}_i) \right| \right\rangle_{\bar{x}_i} } \tag{23} $$

is positive definite. We took the store feature (see Table I) as an example, verified this numerically and found that it is not true. Therefore the store metric space as we defined it cannot be isometrically embedded in a Euclidean space.

What is the relation of the learned embeddings of a categorical variable to this metric space? To answer this question we plot in Fig. 2 the distance between 10000 random store pairs in the learned store embedding space and in the metric space as defined in Eq. (19). It is obviously not an isometric embedding. We can also see from the figure that there is a linear relation with well defined upper and lower boundaries. Why are there clear boundaries and what does the shape mean? Is this related to some theorems regarding the distorted mapping of metric spaces [33, 34]? How is the distortion related to the embedding dimension $D_i$? If we apply multidimensional scaling [35] directly on the metric $d_i$, how does the result differ from the learned entity embeddings of the neural network? Due to time limits we leave these interesting questions for future investigations.

[Figure 2: scatter plot. x-axis: distance in metric space (ticks from 0 to 20000); y-axis: distance in embedding space (ticks from 0.5 to 4.5).]

FIG. 2. Distance in the store embedding space versus distance in the metric space for 10000 random pairs of stores.

VI. EXPERIMENTS

In this paper we will use the dataset from the Kaggle Rossmann Sale Prediction competition as an example. The goal of the competition is to predict the daily sales of each store of Dirk Rossmann GmbH (abbreviated as Rossmann in the following) as accurately as possible.

The dataset published by the Rossmann hosts (https://www.kaggle.com/c/rossmann-store-sales/data) has two parts: the first part is train.csv, which comprises about 2.5 years of daily sales data for 1115 different Rossmann stores, resulting in a total number of 1017210 records; the second part is store.csv, which describes some further details about each of these 1115 stores.

Besides the data published by the host, external data was also allowed as long as it was shared on the competition forum. Many features had been proposed by participants of this competition. For example, the Kaggle user dune dweller smartly figured out the German state each store belongs to by correlating the store open variable with the state holiday and school holiday calendar of the German states (state and school holidays differ in Germany from state to state; see https://www.kaggle.com/c/rossmann-store-sales/forums/t/17048/putting-stores-on-the-map). Other popular external data was weather data, Google Trends data and even sport events dates.

In our winning solution we used most of the above data, but in this paper the aim is to compare different machine learning methods and not to obtain the very best result. Therefore, to simplify, we use only a small subset of the features (see Table I) and we do not apply any feature engineering.

feature      data type  number of values  EE dimension
store        nominal    1115              10
day of week  ordinal    7                 6
day          ordinal    31                10
month        ordinal    12                6
year         ordinal    3 (2013-2015)     2
promotion    binary     2                 1
state        nominal    12                6

TABLE I. Features we used from the Kaggle Rossmann competition dataset. promotion signals whether or not the store was issuing a promotion on the observation date. state corresponds to the German state where the store resides. The last column describes the dimension we used for each entity embedding (EE).

The dataset is divided into a 90% portion for training and a 10% portion for testing. We consider both a split leaving the temporal structure of the data intact (i.e., using the first 90% of days for training), as well as a random shuffling of the dataset before the training-test split is applied. For shuffled data, the test data shares the same statistical distribution as the training data. More specifically, as the Rossmann dataset has relatively few features compared to the number of samples, the distribution of the test data in the feature space is well represented by the distribution of the training data. The shuffled data is useful for us to benchmark model performance with respect to the pure statistical prediction accuracy. For the time based splitting (i.e. unshuffled data), the test data

is of a future time compared to the training data, and the statistical distribution of the test data with respect to time is not exactly sampled by the training data. Therefore, it can measure the model's generalization ability based on what it has learned from the training data.

The code used for this experiment can be found in this github repository: https://github.com/entron/entity-embedding-rossmann

A. Neural networks

In this experiment we use both one-hot encoding and entity embedding to represent input features of neural networks. We use two fully connected layers (1000 and 500 neurons respectively) on top of either the embedding layer or directly on top of the one-hot encoding layer. The fully connected layers use the ReLU activation function. The output layer contains one neuron with sigmoid activation function. No dropout is used as we found that it did not improve the result. We also experimented with a neural network where the entity embedding layer was replaced with an extra fully connected layer (on top of the one-hot encoding layer) of the same size as the sum of all entity embedding components, but the result is worse than without this layer. We use the deep learning framework Keras (https://github.com/fchollet/keras) to implement the neural network.

As Sales in the data set spans 4 orders of magnitude, we used $\log(\text{Sale})$ and rescaled it to the same range as the neural network output with $\log(\text{Sale}) / \log(\text{Sale}_{max})$. The Adam optimization method [36] is used to optimize the networks. Each network is trained for 10 epochs. For prediction we use the average result of 5 neural networks, as an individual neural network showed notable variance.

B. Comparison of different methods

We compared k-nearest neighbors (KNN), random forests and gradient boosted trees with neural networks. KNN and random forests are tested using the scikit-learn library of python [37], while we use the xgboost implementation of gradient boosted trees [13]. The model parameters used can be found in Table II. They were found empirically by optimizing the results on the validation set. For the input variables, KNN is fed with one-hot-encoded features, while random forests and gradient boosted trees use the integer coded categorical variables directly. We use log(Sales) as the target value for all machine learning methods.

xgboost
  max_depth          10
  eta                0.02
  objective          reg:linear
  colsample_bytree   0.7
  subsample          0.7
  num_round          3000
random forest
  n_estimators       200
  max_depth          35
  min_samples_split  2
  min_samples_leaf   1
KNN
  n_neighbors        10
  weights            distance
  p                  1

TABLE II. Parameters of models used to compare with neural networks. If a parameter is not specified, the default choice of scikit-learn (for random forests and KNN) and xgboost was taken.

As we are using a relatively small number of features (7) compared to the available training samples (about 1 million), the dataset is not sparse enough for our purpose. Therefore, we sparsified the training data by randomly sampling 200,000 samples out of the training set for benchmarking the models.

Instead of the root mean square percentage error (RMSPE) used in the competition we use the mean absolute percentage error (MAPE) as the criterion:

$$ \text{MAPE} = \left\langle \left| \frac{\text{Sales} - \text{Sales}_{predict}}{\text{Sales}} \right| \right\rangle \tag{24} $$

The reason is that we find MAPE is more stable with outliers, which may be caused by factors not included as features in the Rossmann dataset.

The results that we obtained can be found in Tables III and IV. We can see that neural networks give the best results for non-shuffled data. For shuffled data, gradient boosted trees with entity embedding (see below for an explanation) and neural networks give comparably good results. Neural networks with one-hot encoding give slightly better results than entity embedding for the shuffled data, while entity embedding is clearly better than one-hot encoding for the non-shuffled data. The explanation is that entity embedding, by restricting the network to a much smaller parameter space in a meaningful way, reduces the chance that the network converges to local minima far from the global minimum. More intuitively, entity embeddings force the network to learn the intrinsic properties of each feature as well as the sales distribution in the feature space. One-hot encoding, on the other hand, only learns about the sales distribution. A better understanding of the intrinsic properties of the components (features) will give the model an advantage when facing a new combination of the components not seen during training. We expect this effect will be stronger when we add more features, for both shuffled and unshuffled data.

We also used the entity embeddings learned from a neural network as the input for other machine learning
methods, that is, we feed the embedded features into other machine learning methods. This significantly improves all the methods tested here, as shown in the right columns of the tables.

method                  MAPE   MAPE (with EE)
KNN                     0.315  0.099
random forest           0.167  0.089
gradient boosted trees  0.122  0.071
neural network          0.070  0.070

TABLE III. Comparison of different methods on the Kaggle Rossmann dataset with 10% shuffled data used for testing and 200,000 random samples from the remaining 90% for training.

method                  MAPE   MAPE (with EE)
KNN                     0.290  0.116
random forest           0.158  0.108
gradient boosted trees  0.152  0.115
neural network          0.101  0.093

TABLE IV. Same as Table III except the data is not shuffled and the test data is the latest 10% of the data. This result shows the models' generalization ability based on what they have learned from the training data.

C. Distribution in the embedding space

The main goal of entity embedding is to map similar categories close to each other in the embedding space. A natural question is thus what the embedding space and the distribution of the data within it look like. For the following analyses, we used a store embedding matrix of dimension 50 and trained the network on the full first 90% of data, i.e. we did not apply data sparsification.

To visualize the high dimensional embeddings we used t-SNE [38] to map the embeddings to a 2D space. Fig. 3 shows the result for the German state embeddings. Though the algorithm does not know anything about German geography and society, the relative positions of the learned embeddings of German states resemble those on the German map surprisingly well! The reason is that the embedding maps states with similar distribution of features, i.e. similar economic and cultural environments, close to each other, while at the same time two geographically neighboring states are likely to share a similar economy and culture. In particular, the three states in the right cluster, namely Sachsen, Thueringen and Sachsen Anhalt, are all from eastern Germany, while states in the left cluster are from western Germany. This shows the effectiveness of entity embedding for abductive reasoning. It also shows that entity embedding can be used to cluster categorical data. This is a consequence of entity embedding putting similar values close to each other in a Euclidean space equipped with a distance measure, on which any known clustering algorithm can be applied.

[Figure 3: t-SNE scatter plot of the German state embeddings. Labeled points: Berlin, Hamburg, Schleswig Holstein, Niedersachsen/Bremen, Sachsen, Nordrhein Westfalen, Thueringen, Hessen, Rheinland Pfalz, Sachsen Anhalt, Bayern, Baden Wuerttemberg.]

FIG. 3. The learned German state embedding is mapped to a 2D space with t-SNE. The relative positions of German states here resemble those on the real German map surprisingly well.

Regarding the sales distribution in entity embeddings, we take the entity embedding of the store as an example. Figure 4 shows the sales distribution in the store embedding along its first two principal components and along two random directions. It is apparent from the plot that the sales follow a continuous functional relationship along the first principal component. This allows the neural network to understand the impact of the store index, as stores with similar sales are mapped close to each other. Although the other directions in the subspace have no direct correlation with sales, they probably encode other properties of the store, and when combined with other features in the deeper layers of the network they could have an impact on the final sales prediction.

[Figure 4: four panels of sales versus position in the store embedding space.]

FIG. 4. Sales distribution along first principal component (upper left) and second principal component (upper right) of embedded store indices and along two random directions (lower left and right). All 1115 stores contributed to the plot.

The density distribution of the store embedding is visualized in Fig. 5, which shows the distribution along the first four principal components. Interestingly, the univariate density along the first principal components is approximately Gaussian distributed. However, their joint distribution is not multivariate Gaussian, as the Mardia test [39] reveals.

[Figure 5: four panels of density distributions along the first four principal components.]

FIG. 5. Density distribution of embedded store indices along the first four principal components (from upper left to lower right). The red line corresponds to a Gaussian fit. The p-values of the D'Agostino's $K^2$ normality test are all statistically significant, i.e. below 0.05.

As can be seen in Fig. 1, the neural network is fed with the direct product of all the entity embedding subspaces. We also investigated the statistical properties of this concatenated space. We found that there is no strong correlation between the individual subspaces. It is thus sufficient to consider them independently, as we did in this section.

VII. FUTURE WORK

Due to the limitation of time we leave the following points for future explorations:

First of all, entity embedding should be tested with more datasets, in particular datasets with many high cardinality features, where the data is getting sparse and entity embedding is supposed to show its full strength compared with other methods. For some datasets and entity embeddings it could also be interesting to explore the meaning of the directions in the embeddings like those in Eq. (2) and Eq. (3).

Second, we only touched the surface of the relation of entity embedding with finite metric spaces. A deeper understanding of this relation might also help to find the optimal dimension of the embedding space and to understand how neural networks work in general.

Third, similar methods may be applied to improve the approximation of continuous (i.e. non-categorical), but non-monotone functions. One way to achieve this is by discretizing the continuous variables and transforming them into categorical variables as discussed in this paper.

Last, it might be interesting to systematically compare different activation functions of the entity embedding layer.

VIII. ACKNOWLEDGMENTS

We thank Dirk Rossmann GmbH for allowing us to use their data for the publication. We thank Kaggle Inc. for hosting such an interesting competition. We thank Gert Jacobusse for helpful discussions regarding xgboost. We thank Neokami Inc. co-founders Ozel Christo and Andrei Ciobotar for their support in joining the competition and writing this paper.

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature 521, 436-444 (2015).
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems (2012) pp. 1097-1105.
[3] Matthew D. Zeiler and Rob Fergus, "Visualizing and understanding convolutional networks," in Computer Vision - ECCV 2014 (Springer, 2014) pp. 818-833.
[4] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556 (2014).
[5] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," arXiv preprint arXiv:1312.6229 (2013).
[6] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) pp. 1-9.
[7] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," Signal Processing Magazine, IEEE 29, 82-97 (2012).
[8] Tara N Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran, "Deep convolutional neural networks for LVCSR," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (IEEE, 2013) pp. 8614-8618.

[9] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin, "A neural probabilistic language model," The Journal of Machine Learning Research 3, 1137-1155 (2003).
[10] Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Cernocky, "Empirical evaluation and combination of advanced language modeling techniques," in INTERSPEECH, s 1 (2011) pp. 605-608.
[11] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space."
[12] Yoon Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882 (2014).
[13] Tianqi Chen and Carlos Guestrin, "Xgboost: A scalable tree boosting system," (2016), arXiv:1603.02754.
[14] George Cybenko, "Approximation by superpositions of a sigmoidal function," 2, 303-314.
[15] Michael Nielsen, "Neural networks and deep learning," (Determination Press, 2015) Chap. 4.
[16] Bernardo Llanas, Sagrario Lantaron, and Francisco J Sainz, "Constructive approximation of discontinuous functions by neural networks," Neural Processing Letters 27, 209-226 (2008).
[17] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE 86, 2278-2324 (1998).
[18] Jeffrey Pennington, Richard Socher, and Christopher D. Manning, "Glove: Global vectors for word representation," in Empirical Methods in Natural Language Processing (EMNLP) (2014) pp. 1532-1543.
[19] Geoffrey E. Hinton, "Learning distributed representations of concepts," in Proceedings of the eighth annual conference of the cognitive science society, Vol. 1 (Amherst, MA) p. 12.
[20] Yoshua Bengio and Samy Bengio, "Modeling high-dimensional discrete data with multi-layer neural networks," in NIPS, Vol. 99 (1999) pp. 400-406.
[21] Alberto Paccanaro and Geoffrey E. Hinton, "Learning hierarchical structures with linear relational embedding," in Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference, Vol. 2 (MIT Press, 2002) p. 857.
[22] Rodolphe Jenatton, Nicolas L Roux, Antoine Bordes, and Guillaume R Obozinski, "A latent factor model for highly multi-relational data," in Advances in Neural Information Processing Systems (2012) pp. 3167-3175.
[23] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng, "Embedding entities and relations for learning and inference in knowledge bases," arXiv preprint arXiv:1412.6575 (2014).
[24] Fei Wu, Jun Song, Yi Yang, Xi Li, Zhongfei Zhang, and Yueting Zhuang, "Structured embedding via pairwise relations and long-range interactions in knowledge base," (2015).
[25] Alberto Paccanaro and Geoffrey E. Hinton, "Extracting distributed representations of concepts and relations from positive and negative propositions," in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, Vol. 2 (IEEE) pp. 259-264.
[26] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio, "Learning structured embeddings of knowledge bases," in Conference on Artificial Intelligence, EPFL-CONF-192344 (2011).
[27] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio, "A semantic matching energy function for learning with multi-relational data," Machine Learning 94, 233-259 (2014).
[28] Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao, "Learning to represent knowledge graphs with gaussian embedding," in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (ACM, 2015) pp. 623-632.
[29] Omer Levy and Yoav Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems (2014) pp. 2177-2185.
[30] Yoon Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882 (2014).
[31] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, "Distributed representations of words and phrases and their compositionality," in Advances in neural information processing systems (2013) pp. 3111-3119.
[32] Schoenberg, "Metric spaces and positive definite functions," American Mathematical Society (1938).
[33] Ittai Abraham, Yair Bartal, and Ofer Neiman, "On embedding of finite metric spaces into hilbert space," Leibniz Center for Research in Computer Science (2006).
[34] Jiří Matoušek, "On the distortion required for embedding finite metric spaces into normed spaces," Israel Journal of Mathematics, 333-344 (1996).
[35] Joseph B Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika 29, 1-27 (1964).
[36] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," CoRR abs/1412.6980 (2014).
[37] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research 12, 2825-2830 (2011).
[38] Laurens Van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research 9, 85 (2008).
[39] K.V. Mardia, "Measures of multivariate skewness and kurtosis with applications," Biometrika 57, 519-530 (1970).
