GNN-Foundations-Frontiers-and-Applications-chapter1
Representation Learning
Abstract In this chapter, we first describe what representation learning is and why
we need representation learning. Among the various ways of learning representa-
tions, this chapter focuses on deep learning methods: those that are formed by the
composition of multiple non-linear transformations, with the goal of resulting in
more abstract and ultimately more useful representations. We summarize the repre-
sentation learning techniques in different domains, focusing on the unique chal-
lenges and models for different data types including images, natural languages,
speech signals and networks. Last, we summarize this chapter.
The effectiveness of machine learning techniques heavily relies on not only the de-
sign of the algorithms themselves, but also a good representation (feature set) of
data. Ineffective data representations that lack important information or contain
incorrect or highly redundant information could lead to poor performance of
the algorithm in dealing with different tasks. The goal of representation learning is
to extract sufficient but minimal information from data. Traditionally, this can be
achieved via human efforts based on the prior knowledge and domain expertise on
the data and tasks, which is also named as feature engineering. In deploying ma-
Liang Zhao
Department of Computer Science, Emory University, e-mail: liang.zhao@emory.edu
Lingfei Wu
JD.COM Silicon Valley Research Center, e-mail: lwu@email.wm.edu
Peng Cui
Department of Computer Science, Tsinghua University, e-mail: cuip@tsinghua.edu.cn
Jian Pei
Department of Computer Science, Simon Fraser University, e-mail: jpei@cs.sfu.ca
chine learning and many other artificial intelligence algorithms, historically a large
portion of the human effort goes into the design of preprocessing pipelines and data
transformations. More specifically, feature engineering is a way to take advantage
of human ingenuity and prior knowledge in the hope to extract and organize the dis-
criminative information from the data for machine learning tasks. For example, po-
litical scientists may be asked to define a keyword list as the features of social-media
text classifiers for detecting those texts on societal events. For speech transcription
recognition, one may choose to extract features from raw sound waves using
operations such as Fourier transforms. Although feature engineering has been widely
adopted over the years, its drawbacks are also salient, including: 1) Intensive labor
from domain experts is usually needed, because feature engineering may
require tight and extensive collaboration between model developers and domain
experts. 2) Feature extraction is incomplete and biased. Specifically, the capacity and
discriminative power of the extracted features are limited by the knowledge of the
domain experts involved. Moreover, in many domains where human knowledge is
limited, such as early cancer prediction, which features to extract is itself an open
question for domain experts. To avoid these drawbacks, making learning
algorithms less dependent on feature engineering has been a highly desired goal
in machine learning and artificial intelligence domains, so that novel applications
could be constructed faster and hopefully addressed more effectively.
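To make the speech example above concrete, the following is a minimal sketch of a hand-crafted feature extractor that maps a raw waveform to Fourier magnitudes. The function name `dft_magnitudes` and the toy signal are illustrative assumptions, and the naive transform is written out only for clarity; a practical pipeline would typically use a fast Fourier transform.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Naive O(n^2) discrete Fourier transform of a real-valued signal.

    Returns the magnitudes of the first n//2 + 1 frequency bins, which is
    the non-redundant half of the spectrum for real input.
    """
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff))
    return mags


# Toy "sound wave" (made-up data): a pure tone with 3 cycles over 16 samples.
n_samples = 16
wave = [math.sin(2 * math.pi * 3 * t / n_samples) for t in range(n_samples)]

# The hand-crafted feature vector: one magnitude per frequency bin.
features = dft_magnitudes(wave)

# The strongest bin identifies the tone's frequency (bin 3 for this signal).
dominant_bin = max(range(len(features)), key=features.__getitem__)
```

The choice of which transform to apply and which bins to keep is exactly the kind of decision that feature engineering leaves to the human designer.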
Representation learning techniques have developed from traditional
methods to more advanced ones. The traditional
methods belong to “shallow” models and aim to learn transformations of data that
make it easier to extract useful information when building classifiers or other pre-
dictors, such as Principal Component Analysis (PCA) (Wold et al, 1987), Gaussian
Markov random field (GMRF) (Rue and Held, 2005), and Locality Preserving Pro-
jections (LPP) (He and Niyogi, 2004). Deep learning-based representation learning
is formed by the composition of multiple non-linear transformations, with the goal
of yielding more abstract and ultimately more useful representations. To cover
more recent advances and stay with the main topic of this book,
we focus mainly on deep learning-based representation learning, which can
be categorized into several types: (1) Supervised learning, where a large amount of
labeled data is needed to train the deep learning models. Given a well-trained
network, the output before the last fully-connected layers is commonly used
as the final representation of the input data; (2) Unsupervised learning (including
self-supervised learning), which facilitates the analysis of input data without corre-
sponding labels and aims to learn the underlying inherent structure or distribution
of data. Pretext tasks are used to derive supervision information from large
amounts of unlabelled data. Based on this constructed supervision information, the
deep neural networks are trained to extract the meaningful representations for the
future downstream tasks; (3) Transfer learning, which involves methods that utilize
any knowledge resource (e.g., data, models, labels) to improve model learning
and generalization for the target task. Transfer learning encompasses different sce-
narios including multi-task learning (MTL), model adaptation, knowledge transfer,
covariate shift, etc. There are also other important representation learning meth-
help of hand-crafted features by human beings based on prior knowledge. For example,
Huang et al (2000) extracted the structural features of the characters from the strokes
and then used them to recognize the handwritten characters. Rui (2005) adopted a
morphological method to improve the local features of the characters and then used PCA
to extract character features. However, all of these methods need to extract features
from images manually, and thus the prediction performance relies strongly on
prior knowledge. In the field of computer vision, manual feature extraction is very
cumbersome and impractical because of the high dimensionality of feature
vectors. Thus, representation learning for images, which can automatically extract
meaningful, hidden, and complex patterns from high-dimensional visual data, is necessary.
Deep learning-based representation learning for images is learned in an end-to-end
fashion, which can perform much better than hand-crafted features in the target ap-
plications, as long as the training data is of sufficient quality and quantity.
Supervised Representation Learning for image processing. In the domain of image
processing, supervised learning algorithms, such as the Convolutional Neural Network
(CNN) and the Deep Belief Network (DBN), are commonly applied to various
tasks. One of the earliest deep-supervised-learning-based works was proposed in
2006 (Hinton et al, 2006); it focused on the MNIST digit image classification
problem and outperformed the state-of-the-art SVMs. Following this, deep
convolutional neural networks (ConvNets) showed strong performance, which depends
greatly on their properties of shift invariance, weight sharing, and local pattern
capturing. Different types of network architectures have been developed to increase the
capacity of network models, and larger and larger datasets have been collected.
Various networks including AlexNet (Krizhevsky et al, 2012), VGG (Simonyan and
Zisserman, 2014b), GoogLeNet (Szegedy et al, 2015), ResNet (He et al, 2016a),
and DenseNet (Huang et al, 2017a) and large scale datasets, such as ImageNet and
OpenImage, have been proposed to train very deep convolutional neural networks.
With sophisticated architectures and large-scale datasets, convolutional
neural networks keep advancing the state of the art in various
computer vision tasks.
Unsupervised Representation Learning for image processing. Collection and an-
notation of large-scale datasets are time-consuming and expensive in both image
datasets and video datasets. For example, ImageNet contains about 1.3 million
labeled images covering 1,000 classes, and each image is labeled by human workers
with one class label. To reduce the extensive human annotation labor, many
unsupervised methods have been proposed to learn visual features from large-scale unlabeled
images or videos without using any human annotations. A popular solution is to
define pretext tasks for the models to solve: the models are trained
by optimizing the objective functions of the pretext tasks, and the features are learned
through this process. Various pretext tasks have been proposed for unsupervised
learning, including colorizing gray-scale images (Zhang et al, 2016d) and image in-
painting (Pathak et al, 2016). During the unsupervised training phase, a predefined
pretext task is designed for the models to solve, and the pseudo labels for the pretext
task are automatically generated based on some attributes of data. Then the models
are trained according to the objective functions of the pretext tasks. When trained
with pretext tasks, the shallower blocks of the deep neural network models focus on
the low-level general features such as corners, edges, and textures, while the deeper
blocks focus on the high-level task-specific features such as objects, scenes, and
object parts. Therefore, the models trained with pretext tasks can learn kernels to
capture low-level features and high-level features that are helpful for other downstream
tasks. After the unsupervised training is finished, the learned visual features
in these pre-trained models can be further transferred to downstream tasks (especially
when only a relatively small amount of data is available) to improve performance and
overcome over-fitting.
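The automatic generation of pseudo labels described above can be sketched for the colorization pretext task as follows. This is a minimal illustration under stated assumptions: the helper names `to_gray` and `make_colorization_pairs`, the nested-list image format, and the pixel values are all hypothetical, and the luminance weights are the common ITU-R BT.601 coefficients rather than anything prescribed by the cited works.

```python
def to_gray(pixel):
    """Luminance of an (R, G, B) pixel using the ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def make_colorization_pairs(images):
    """Turn unlabeled color images into (input, pseudo-label) pairs.

    Each image is a list of rows of (R, G, B) tuples. The gray-scale version
    is the model input, and the original colors serve as the automatically
    generated target, so no human annotation is involved.
    """
    pairs = []
    for image in images:
        gray = [[to_gray(p) for p in row] for row in image]
        pairs.append((gray, image))
    return pairs


# Two tiny 1x2 "images" with made-up pixel values.
unlabeled = [
    [[(255, 0, 0), (0, 255, 0)]],
    [[(0, 0, 255), (255, 255, 255)]],
]
pairs = make_colorization_pairs(unlabeled)
```

A model trained to predict the second element of each pair from the first must learn something about object appearance, which is the intuition behind using such tasks for feature learning.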
Transfer Learning for image processing. In real-world applications, due to the
high cost of manual labeling, sufficient training data that belongs to the same fea-
ture space or distribution as the testing data may not always be accessible. Transfer
learning mimics the human vision system by making use of sufficient amounts of
prior knowledge in other related domains (i.e., source domains) when executing
new tasks in the given domain (i.e., target domain). In transfer learning, both the
training set and the test set can contribute to the target and source domains. In most
cases, there is only one target domain for a transfer learning task, while either single
or multiple source domains can exist. The techniques of transfer learning in image
processing can be categorized into feature representation knowledge transfer
and classifier-based knowledge transfer. Specifically, feature representation trans-
fer methods map the target domain to the source domains by exploiting a set of
extracted features, where the data divergence between the target domain and the
source domains can be significantly reduced so that the performance of the task
in the target domain is improved. In contrast, classifier-based knowledge-transfer
methods usually share the common trait that the learned source domain models are
utilized as prior knowledge, which are used to learn the target model together with
the training samples. Instead of minimizing the cross-domain dissimilarity by up-
dating instances’ representations, classifier-based knowledge-transfer methods aim
to learn a new model that minimizes the generalization error in the target domain
via the provided training set from both domains and the learned model.
Other Representation Learning for Image Processing. Other types of representation
learning are also commonly used in image processing, such
as reinforcement learning and semi-supervised learning. For example, reinforcement
learning is commonly explored in the tasks of image captioning (Liu et al,
2018a; Ren et al, 2017) and image editing (Kosugi and Yamasaki, 2020), where
the learning process is formalized as a sequence of actions based on a policy
network.
Nowadays, speech interfaces or systems have become widely developed and integrated
into various real-life applications and devices. Services like Siri¹, Cortana²,
and Google Voice Search³ have become a part of our daily life and are used by millions
of users. The exploration in speech recognition and analysis has always been
motivated by a desire to enable machines to participate in verbal human-machine
interactions. The research goals of enabling machines to understand human speech,
identify speakers, and detect human emotion have attracted researchers’ attention
for more than sixty years across several distinct research areas, including but not
limited to Automatic Speech Recognition (ASR), Speaker Recognition (SR), and
Speaker Emotion Recognition (SER).
Analyzing and processing speech has been a key application of machine learning
(ML) algorithms. Research on speech recognition has traditionally considered the
task of designing hand-crafted acoustic features as a problem separate from
the task of designing efficient models to accomplish prediction and classification
decisions. There are two main drawbacks of this approach: First, the feature engi-
neering is cumbersome and requires human knowledge as introduced above; and
second, the designed features might not be the best for the specific speech recognition
tasks at hand. This has motivated the speech
community to adopt representation learning techniques, which can
automatically learn an intermediate representation of the input signal that better fits
the task at hand and hence leads to improved performance. Among all these successes,
deep learning-based speech representations play an important role. One of
the major reasons for the utilization of representation learning techniques in speech
technology is that speech data is fundamentally different from two-dimensional im-
age data. Images can be analyzed as a whole or in patches, but speech has to be
formatted sequentially to capture temporal dependency and patterns.
Supervised representation learning for speech recognition. In the domain of
speech recognition and analysis, supervised representation learning methods are
widely employed, where feature representations are learned on datasets by leverag-
ing label information. For example, restricted Boltzmann machines (RBMs) (Jaitly
and Hinton, 2011; Dahl et al, 2010) and deep belief networks (DBNs) (Cairong
et al, 2016; Ali et al, 2018) are commonly utilized in learning features from speech
for different tasks, including ASR, speaker recognition, and SER. For instance,
in 2012, Microsoft released a new version of its MAVIS (Microsoft Audio
Video Indexing Service) speech system based on context-dependent deep neural net-
works (Seide et al, 2011). These authors managed to reduce the word error rate on
four major benchmarks by about 30% (e.g., from 27.4% to 18.5% on RT03S) com-
1 Siri is an artificial intelligence assistant software that is built into Apple’s iOS system.
2 Microsoft Cortana is an intelligent personal assistant developed by Microsoft, known as “the
world’s first cross-platform intelligent personal assistant”.
3 Google Voice Search is a Google product that lets users search by speaking
to a mobile phone or computer: the spoken query is sent to a server for recognition,
and the search is then performed based on the recognition results.
ing MTL with different auxiliary tasks including gender, speaker adaptation, speech
enhancement, it has been shown that the learned shared representations for differ-
ent tasks can act as complementary information about the acoustic environment and
give a lower word error rate (WER) (Parthasarathy and Busso, 2017; Xia and Liu,
2015).
Other Representation Learning for speech recognition. Other than the above-
mentioned three categories of representation learning for speech signals, there are
also some other representation learning techniques commonly explored, such as
semi-supervised learning and reinforcement learning (RL). For example, in
ASR, semi-supervised learning is mainly used to circumvent the lack
of sufficient training data. This can be achieved by creating feature front
ends (Thomas et al, 2013), by using multilingual acoustic representations (Cui
et al, 2015), or by extracting an intermediate representation from large unpaired
datasets (Karita et al, 2018). RL is also gaining interest in the area of speech recognition,
and there have been multiple approaches to model different speech problems,
including dialog modeling and optimization (Levin et al, 2000), speech recogni-
tion (Shen et al, 2019), and emotion recognition (Sangeetha and Jayasankar, 2019).
Besides speech recognition, there are many other Natural Language Processing
(NLP) applications of representation learning, such as text representation learning.
For example, Google’s image search exploits huge quantities of data to map images
and queries into the same space (Weston et al, 2010) based on NLP techniques.
In general, there are two types of applications of representation learning in NLP.
In one type, the semantic representation, such as the word embedding, is trained
in a pre-training task (or directly designed by human experts) and is transferred to
the model for the target task. It is trained using a language modeling objective
and is taken as input for other downstream NLP models. In the other type, the
semantic representation lies within the hidden states of the deep learning model and
directly aims for better performance on the target tasks in an end-to-end fashion. For
example, many NLP tasks, such as sentiment classification, natural language inference,
and relation extraction, require semantically composed sentence or document
representations.
Conventional NLP tasks heavily rely on feature engineering, which requires careful
design and considerable expertise. Recently, representation learning, especially
deep learning-based representation learning, has been emerging as the most important
technique for NLP. First, NLP is typically concerned with multiple levels of language
entries, including but not limited to characters, words, phrases, sentences, paragraphs,
and documents. Representation learning is able to represent the semantics of these
multi-level language entries in a unified semantic space, and model complex se-
mantic dependence among these language entries. Second, there are various NLP
tasks that can be conducted on the same input. For example, given a sentence, we
can perform multiple tasks such as word segmentation, named entity recognition,
relation extraction, co-reference linking, and machine translation. In this case, it
will be more efficient and robust to build a unified representation space of inputs
for multiple tasks. Last, natural language texts may be collected from multiple do-
mains, including but not limited to news articles, scientific articles, literary works,
advertisement and online user-generated content such as product reviews and so-
cial media. Moreover, texts can also be collected from different languages, such as
English, Chinese, Spanish, Japanese, etc. Compared to conventional NLP systems
which have to design specific feature extraction algorithms for each domain accord-
ing to its characteristics, representation learning enables us to build representations
automatically from large-scale domain data and even add bridges among these lan-
guages from different domains. Given these advantages of representation learning
for NLP in the feature engineering reduction and performance improvement, many
researchers have developed efficient algorithms on representation learning, espe-
cially deep learning-based approaches, for NLP.
Supervised Representation Learning for NLP. Deep neural networks in the supervised
learning setting for NLP emerged from distributed representation learning,
then moved to CNN models, and finally to RNN models in recent years. At an early stage,
distributed representations were first developed in the context of statistical language
modeling by Bengio (2008) in so-called neural net language models. The model
is about learning a distributed representation for each word (i.e., word embedding).
Following this, the need arose for an effective feature function that extracts higher-level
features from constituent words or n-grams. CNNs turned out to be the natural
choice given their excellent performance in computer vision and
speech processing tasks. CNNs have the ability to extract salient n-gram features
from the input sentence to create an informative latent semantic representation of
the sentence for downstream tasks. This domain was pioneered by Collobert et al
(2011) and Kalchbrenner et al (2014), which led to a huge proliferation of CNN-
based networks in the succeeding literature. The neural net language model was also
improved by adding recurrence to the hidden layers (Mikolov et al, 2011a) (i.e.,
RNN), allowing it to beat the state-of-the-art (smoothed n-gram models) not only in
terms of perplexity (exponential of the average negative log-likelihood of predicting
the right next word) but also in terms of WER in speech recognition. RNNs use
the idea of processing sequential information. The term “recurrent” applies as they
perform the same computation over each token of the sequence and each step is de-
pendent on the previous computations and results. Generally, a fixed-size vector is
produced to represent a sequence by feeding tokens one by one to a recurrent unit. In
a way, RNNs have “memory” over previous computations and use this information
in current processing. This template is naturally suited for many NLP tasks such
as language modeling (Mikolov et al, 2010, 2011b), machine translation (Liu et al,
2014; Sutskever et al, 2014), and image captioning (Karpathy and Fei-Fei, 2015).
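The recurrence described above, where the same computation is applied to every token and a fixed-size vector summarizes the sequence, can be sketched as follows. This is a minimal Elman-style illustration and not the specific architecture of any cited work; the function name `rnn_encode`, the weight names `W_x`, `W_h`, `b`, and all numeric values are assumptions made for the example.

```python
import math


def rnn_encode(tokens, W_x, W_h, b):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) over a token sequence.

    tokens: list of input vectors; W_x: hidden-by-input weight matrix;
    W_h: hidden-by-hidden weight matrix; b: hidden bias vector.
    Returns the final hidden state, whose size does not depend on the
    sequence length.
    """
    hidden_size = len(b)
    h = [0.0] * hidden_size  # the "memory" starts empty
    for x in tokens:
        # The same weights are reused at every step; the new state depends
        # on the current token and on the previous state.
        h = [
            math.tanh(
                sum(W_x[i][j] * x[j] for j in range(len(x)))
                + sum(W_h[i][j] * h[j] for j in range(hidden_size))
                + b[i]
            )
            for i in range(hidden_size)
        ]
    return h


# Made-up weights: 3-dimensional token vectors, 2-dimensional hidden state.
W_x = [[0.5, -0.2, 0.1],
       [0.3, 0.8, -0.5]]
W_h = [[0.1, 0.4],
       [-0.3, 0.2]]
b = [0.0, 0.1]

short_seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_seq = short_seq + [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

short_vec = rnn_encode(short_seq, W_x, W_h, b)
long_vec = rnn_encode(long_seq, W_x, W_h, b)
# Both vectors have length 2, regardless of the sequence length.
```

In a trained model the weights would be learned from data; here they are fixed only to show the flow of computation.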
Unsupervised Representation Learning for NLP. Unsupervised learning (including
self-supervised learning) has achieved great success in NLP, because plain text itself
contains abundant knowledge and patterns about languages. For example, in most
deep learning based NLP models, words in sentences are first mapped to their corre-
recently. For example, researchers have explored few-shot relation extraction (Han
et al, 2018) where each relation has a few labeled instances, and low-resource ma-
chine translation (Zoph et al, 2016) where the size of the parallel corpus is limited.
Beyond popular data like images, texts, and sounds, network data is another important
data type that is becoming ubiquitous across a wide range of real-world applications,
ranging from cyber-networks (e.g., social networks, citation networks,
telecommunication networks, etc.) to physical networks (e.g., transportation networks,
biological networks, etc.). Network data can be formulated as graphs mathematically,
where vertices and their relationships jointly characterize the network
information. Networks and graphs are a very powerful and flexible data formulation,
so that sometimes we can even consider other data types, such as images and texts,
as special cases of them. For example, images can be considered as grids of nodes with
RGB attributes which are special types of graphs, while texts can also be organized
into sequential-, tree-, or graph-structured information. So in general, representation
learning for networks is widely considered a promising yet more challenging
task that requires the advancement and generalization of many techniques
developed for images, texts, and so forth. In addition to the intrinsic high complexity of
network data, the efficiency of representation learning on networks is also an important
issue considering the large scale of many real-world networks, ranging from
hundreds to millions or even billions of vertices. Analyzing information networks
plays a crucial role in a variety of emerging applications across many disciplines.
For example, in social networks, classifying users into meaningful social groups is
useful for many important tasks, such as user search, targeted advertising and recom-
mendations; in communication networks, detecting community structures can help
better understand the rumor spreading process; in biological networks, inferring in-
teractions between proteins can facilitate new treatments for diseases. Nevertheless,
efficient and effective analysis of these networks heavily relies on good representa-
tions of the networks.
Traditional feature engineering on network data usually focuses on obtaining a
number of predefined straightforward features at the graph level (e.g., the diameter,
average path length, and clustering coefficient), node level (e.g., node degree and
centrality), or subgraph level (e.g., frequent subgraphs and graph motifs). These
few hand-crafted, well-defined features, though they describe several
fundamental aspects of the graphs, discard the patterns that they cannot cover.
Moreover, real-world network phenomena are usually highly complicated: they may require
sophisticated, unknown combinations of those predefined features, or they cannot be
characterized by any of the existing features at all. In addition, traditional graph feature
engineering usually involves expensive computations with super-linear or exponential
complexity, which often makes many network analytic tasks computationally
expensive or intractable over large-scale networks. For example, in dealing with
the task of community detection, classical methods involve calculating the spectral
decomposition of a matrix with at least quadratic time complexity with respect to
the number of vertices. This computational overhead makes algorithms hard to scale
to large-scale networks with millions of vertices.
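Some of the predefined features named above can be computed with a few lines of code, which also makes their cost visible. The sketch below is illustrative only: the adjacency-dictionary format, the function names, and the toy graph are assumptions, and the all-pairs breadth-first search assumes a connected, unweighted, undirected graph.

```python
from collections import deque


def degrees(adj):
    """Node-level feature: number of neighbors of each vertex."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter_and_avg_path_length(adj):
    """Graph-level features via one BFS per vertex: O(V * (V + E)) time."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return diameter, total / pairs


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

node_degrees = degrees(adj)
cc_of_2 = clustering_coefficient(adj, 2)
diameter, avg_path = diameter_and_avg_path_length(adj)
```

Even in this small example, the graph-level features require a traversal from every vertex, which hints at why such computations become costly on networks with millions of vertices.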
More recently, network representation learning (NRL) has attracted considerable
research interest. NRL aims to learn latent, low-dimensional representations of network
vertices, while preserving network topology structure, vertex content, and
other side information. After new vertex representations are learned, network ana-
lytic tasks can be easily and efficiently carried out by applying conventional vector-
based machine learning algorithms to the new representation space. Earlier work
related to network representation learning dates back to the early 2000s, when re-
searchers proposed graph embedding algorithms as part of dimensionality reduction
techniques. Given a set of independent and identically distributed (i.i.d.) data points
as input, graph embedding algorithms first calculate the similarity between pairwise
data points to construct an affinity graph, e.g., the k-nearest neighbor graph, and
then embed the affinity graph into a new space having much lower dimensionality.
However, graph embedding algorithms are designed for i.i.d. data, mainly for
dimensionality reduction, and usually have at least quadratic time complexity
with respect to the number of vertices.
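The first step of such graph embedding algorithms, building a k-nearest-neighbor affinity graph from data points, can be sketched as follows. The function name `knn_affinity_graph`, the use of Euclidean distance, and the toy points are assumptions made for illustration; the subsequent embedding step is not shown.

```python
import math


def knn_affinity_graph(points, k):
    """Build a symmetric k-nearest-neighbor graph from a list of points.

    Every pairwise distance is computed, so the running time is at least
    quadratic in the number of points.
    """
    n = len(points)
    edges = set()
    for i in range(n):
        # Distances from point i to all other points, nearest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # store as an undirected edge
    return edges


# Made-up 2-D points forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
edges = knn_affinity_graph(points, k=1)
```

With `k=1`, each point is linked only to its closest neighbor, so the two pairs form separate components of the affinity graph.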
Since 2008, significant research efforts have shifted to the development of ef-
fective and scalable representation learning techniques that are directly designed
for complex information networks. Many network representation learning algo-
rithms (Perozzi et al, 2014; Yang et al, 2015b; Zhang et al, 2016b; Manessi et al,
2020) have been proposed to embed existing networks, showing promising per-
formance for various applications. These methods embed a network into a latent,
low-dimensional space that preserves structure proximity and attribute affinity. The
resulting compact, low-dimensional vector representations can then be used as features
for any vector-based machine learning algorithm. This paves the way for a
wide range of network analytic tasks to be easily and efficiently tackled in the new
vector space, such as node classification (Zhu et al, 2007), link prediction (Lü and
Zhou, 2011), clustering (Malliaros and Vazirgiannis, 2013), and network synthesis (You
et al, 2018b). The following chapters of this book provide a systematic and
comprehensive introduction to network representation learning.
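As a small illustration of how learned vertex vectors can feed a vector-based method, the sketch below ranks candidate links by the cosine similarity of their endpoint embeddings. This is only one simple scoring choice assumed for the example, not the method of the cited works; the embedding values, vertex names, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidate_links(embeddings, candidates):
    """Sort candidate vertex pairs by embedding similarity, highest first."""
    scored = [(cosine(embeddings[a], embeddings[b]), a, b)
              for a, b in candidates]
    return sorted(scored, reverse=True)


# Hypothetical 2-dimensional embeddings, standing in for the output of an
# NRL algorithm.
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
ranked = rank_candidate_links(embeddings, [("a", "c"), ("a", "b")])
```

Here the pair whose vectors point in nearly the same direction is ranked first, which is the sense in which proximity in the embedding space stands in for proximity in the network.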
1.3 Summary
Representation learning is a very active and important field currently, which heavily
influences the effectiveness of machine learning techniques. Representation learning
is about learning representations of the data that make it easier to extract
useful and discriminative information when building classifiers or other predictors.
Among the various ways of learning representations, deep learning algorithms have
increasingly been employed in many areas, where good representations
can be learned efficiently and automatically from large amounts of complex
and high-dimensional data. The evaluation of a representation is closely related to its
performance on the downstream tasks. There are also some general properties
that good representations may hold, such as smoothness, linearity,
disentanglement, and the capture of multiple explanatory and causal factors.
We have summarized the representation learning techniques in different domains,
focusing on the unique challenges and models for different areas including the
processing of images, natural language, and speech signals. For each area, many
deep learning-based representation techniques have emerged from different categories,
including supervised learning, unsupervised learning, transfer learning, disentangled
representation learning, reinforcement learning, etc. We have also briefly
discussed representation learning on networks and its relation to that on
images, texts, and speech, which the following chapters elaborate.