Semantic Hierarchy Emerges in Deep Generative Representations For Scene Synthesis
https://doi.org/10.1007/s11263-020-01429-5
Received: 31 January 2020 / Accepted: 31 December 2020 / Published online: 10 February 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC part of Springer Nature 2021
Abstract
Despite the great success of Generative Adversarial Networks (GANs) in synthesizing images, how photo-realistic images are generated from the layer-wise stochastic latent codes introduced in recent GANs remains insufficiently understood. In this work, we show that a highly-structured semantic hierarchy emerges in the deep generative representations of state-of-the-art GANs, such as StyleGAN and BigGAN, trained for scene synthesis. By probing the per-layer representation with a broad set of semantics at different abstraction levels, we manage to quantify the causality between the layer-wise activations and the semantics occurring in the output image. Such a quantification identifies the human-understandable variation factors that can be further used to steer the generation process, such as changing the lighting condition and varying the viewpoint of the scene. Extensive qualitative and quantitative results suggest that the generative representations learned by GANs with layer-wise latent codes are specialized to synthesize various concepts in a hierarchical manner: the early layers tend to determine the spatial layout, the middle layers control the categorical objects, and the later layers render the scene attributes as well as the color scheme. Identifying such a set of steerable variation factors facilitates high-fidelity scene editing based on well-learned GAN models without any retraining (code and demo video are available at https://genforce.github.io/higan).
Keywords Generative model · Scene understanding · Image manipulation · Representation learning · Feature visualization
Communicated by Jifeng Dai.

Ceyuan Yang and Yujun Shen contributed equally to this work.

✉ Ceyuan Yang
yc019@ie.cuhk.edu.hk

✉ Bolei Zhou
bzhou@ie.cuhk.edu.hk

Yujun Shen
sy116@ie.cuhk.edu.hk

1 Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China

1 Introduction

Success of deep neural networks stems from representation learning, which identifies the explanatory factors underlying the high-dimensional observed data (Bengio et al. 2013). Prior work has shown that many concept detectors spontaneously emerge in the deep representations trained for classification tasks (Zhou et al. 2015; Zeiler and Fergus 2014; Bau et al. 2017; Gonzalez-Garcia et al. 2018). For example, Gonzalez-Garcia et al. (2018) observes that networks for object recognition are able to detect semantic object parts, and Bau et al. (2017) confirms that representations from classifying images learn to detect different categorical concepts at different layers.

Analyzing the deep representations and their emergent structures gives insight into the generalization ability of deep features (Morcos et al. 2018) as well as the feature transferability across different tasks (Yosinski et al. 2014), but current efforts mainly focus on discriminative models (Zhou et al. 2015; Gonzalez-Garcia et al. 2018; Zeiler and Fergus 2014; Agrawal et al. 2014; Bau et al. 2017). Generative Adversarial Networks (GANs) (Goodfellow et al. 2014; Karras et al. 2017, 2019; Brock et al. 2018) are capable of mapping random noises to high-quality images; however, the nature of the learned generative representations and how a synthesized image is composed over different layers of the GAN generator remain much less explored.

It has been known that some internal units of deep models emerge as object detectors when trained to categorize scenes (Zhou et al. 2015).
Fig. 1 Scene manipulation results at four different abstraction levels, including spatial layout, categorical objects, scene attributes, and color scheme. For each tuple of images, the first is the raw synthesis, whilst the following ones present the editing process (Color figure online)

Representing and detecting objects that are most informative to a specific category provides an ideal solution for classifying scenes: sofa and TV are representative of the living room, while bed and lamp are of the bedroom. However, synthesizing a scene requires far more complex knowledge. In particular, in order to produce realistic yet diverse scene images, a good generative representation is required to not only generate every individual object, but also decide the underlying room layout and render various scene attributes (e.g., the lighting condition). Bau et al. (2018) has found that some filters in the GAN generator correspond to the generation of certain objects; however, this analysis is only at the object level. Fully understanding how a scene image is synthesized requires examining the variation factors of scenes at multiple levels, i.e., from the layout level, the category level, to the attribute level. Recent GAN variants introduce layer-wise stochasticity to control the synthesis from coarse to fine (Karras et al. 2019; Brock et al. 2018; Shaham et al. 2019; Nguyen-Phuoc et al. 2019); however, how the variation factors originate from the generative representations layer by layer and how to quantify such semantic information still remain unknown.

In this paper, instead of designing new architectures for better synthesis, we examine the nature of the internal representations learned by state-of-the-art GAN models. Starting with StyleGAN (Karras et al. 2019) as an example, we reveal that a highly-structured semantic hierarchy emerges from the deep generative representations, which can well match the human-understandable scene variations at multiple abstraction levels, including layout, category, attribute, and color scheme. We first probe the per-layer representations of the generator with a broad set of visual concepts as candidates and then identify the most relevant variation factors for each layer. For this purpose, we propose a simple yet effective re-scoring technique to quantify the causality between the layer-wise activations and the semantics occurring in the output image. In particular, we find that the early layers determine the spatial layout, the middle layers compose the categorical objects, and the later layers render the attributes and color scheme of the entire scene. We also show that identifying such a set of steerable variation factors facilitates versatile semantic image editing, as shown in Fig. 1. The proposed manipulation technique is applicable to other GAN variants, such as BigGAN (Brock et al. 2018) and PGGAN (Karras et al. 2017). More importantly, discovering the emergent hierarchy in scene generation has implications for the research of scene understanding, which is one of the milestone tasks in computer vision and visual perception. Our work shows that deep generative models 'draw' a scene much as humans do, i.e., drawing the layout first, then the representative objects, and finally the fine-grained attributes and color scheme. This leads to many applications in scene understanding tasks such as scene editing, categorization, and parsing.

2 Related Work

2.1 Deep Representations from Classifying Images

Many attempts have been made to study the internal representations of deep models trained for classification tasks. Zhou et al. (2015) analyzed hidden units by simplifying the input image to see which context region gives the highest response, Simonyan et al. (2014) applied the back-propagation technique to compute the image-specific class saliency map, Bau et al. (2017) interpreted the hidden representations with the aid of segmentation masks, and Alain and Bengio (2016) trained independent linear probes to analyze the information separability among different layers. There are also some studies transferring the discriminative features to verify how learned representations fit with different datasets or tasks (Yosinski et al. 2014; Agrawal et al. 2014).
In addition, reversing the feature extraction process by mapping a given representation back to the image space (Zeiler and Fergus 2014; Nguyen et al. 2016; Mahendran and Vedaldi 2015) also gives insight into how neural networks learn to distinguish different categories. However, these interpretation techniques developed for classification networks cannot be directly applied to generative models.

2.2 Deep Representations from Synthesizing Images

Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) have advanced image synthesis significantly. Some recent models (Karras et al. 2017, 2019; Brock et al. 2018) are able to generate photo-realistic faces, objects, and scenes, making GANs applicable to real-world image editing tasks, such as image manipulation (Shen et al. 2018; Xiao et al. 2018a; Wang et al. 2018; Yao et al. 2018), image painting (Bau et al. 2018; Park et al. 2019), and image style transfer (Zhu et al. 2017; Choi et al. 2018). Despite such great success, it remains uncertain what GANs have learned in order to produce diverse and realistic images. Radford et al. (2015) pointed out the vector arithmetic phenomenon in the underlying latent space of GANs; however, discovering what kinds of semantics exist inside a well-trained model and how these semantics are structured to compose high-quality images still remains unsolved. Bau et al. (2018) analyzed the individual units of the GAN generator and found that they spontaneously learn to synthesize informative visual contents such as objects and textures. Besides, Jahanian et al. (2019) explored the steerability of GANs via distributional shift, and Goetschalckx et al. (2019) boosted the memorability of GANs by modulating the latent codes. Unlike them, our work quantitatively explores the emergence of hierarchical semantics inside the layer-wise generative representations. A closely relevant work, InterFaceGAN (Shen et al. 2020a), interpreted the latent space of GANs for diverse face editing. We differ from InterFaceGAN in the following three aspects. First, instead of examining the initial latent space, we study the layer-wise generative representations and reveal the semantic hierarchy learned for scene generation, which highly aligns with human perception. Second, scene images are far more complex than faces due to the large variety of scene categories as well as the objects inside, increasing the difficulty of interpreting scene synthesis models. Accordingly, unlike InterFaceGAN, which clearly knows the target semantics in advance, we employ a broad set of 105 semantics to serve as candidates for further analysis. Third, we propose a re-scoring technique to quantify how a particular variation factor is relevant to different layers of the generator. This also enables layer-wise manipulation, resulting in more precise control of scene editing.

2.3 Scene Manipulation

Editing scene images has been a long-standing task in the computer vision field. Laffont et al. (2014) defined 40 transient attributes and managed to transfer the appearance of a similar scene to the image for editing. Cheng et al. (2014) proposed verbal-guided image parsing to recognize and manipulate the objects in indoor scenes. Karacan et al. (2016) learned a conditional GAN to synthesize outdoor scenes based on pre-defined layouts and attributes. Bau et al. (2019) developed a technique to locally edit generated images based on the internal interpretation of GANs. Some other work (Liao et al. 2017; Zhu et al. 2017; Isola et al. 2017; Luan et al. 2017; Park et al. 2020) studied image-to-image translation and can be used to transfer the style of one scene to another. Besides, recent work (Abdal et al. 2019, 2020; Zhu et al. 2020) projected real images onto the latent space of a well-trained GAN generator and leveraged the GAN knowledge for image editing. Different from prior work, we achieve scene manipulation at multiple abstraction levels by reusing the knowledge from well-learned GAN models without any retraining.

2.4 Scene Understanding at Multiple Abstraction Levels

The abstraction levels of scene representations are inspired by prior literature on cognition studies of scene understanding. Oliva and Torralba (2001) proposed a computational model for a holistic representation (i.e., the shape of the scene) instead of individual objects or regions. Oliva and Torralba (2006) found that scene images are initially processed as a single entity and that local information about objects and parts comes into play at a later stage of visual processing. Torralba and Oliva (2003) demonstrated how scene categories could provide contextual information in the visual processing chain. Considering that scenes would have a multivariate attribute representation instead of simply a binary category membership, Patterson et al. (2014) advanced scene understanding into more fine-grained representations, i.e., scene attributes. In this work, we discover the semantic hierarchy learned by deep generative networks and manage to align the aforementioned various concepts with different layers in a hierarchy.

3 Variation Factors in Generative Representations

3.1 Multi-Level Variation Factors for Scene Synthesis

Imagine an artist drawing a picture of the living room. The very first step is to choose a perspective and set up the room
Fig. 2 Multi-level semantics extracted from two synthesized scenes (Color figure online)

3.2 Layer-Wise Generative Representations

In general, existing GANs take a randomly sampled latent code as the input and output an image synthesis. Such a mapping from the latent codes to the synthesized images is very similar to the feature extraction process in discriminative models. Accordingly, in this work, we treat the input latent code as the generative representation, which uniquely determines the appearance and properties of the output scene. On the other hand, recent state-of-the-art GAN models [e.g., StyleGAN (Karras et al. 2019) and BigGAN (Brock et al. 2018)] introduce layer-wise stochasticity, as shown in Fig. 3. We therefore treat them as per-layer generative representations.

To explore how GANs are able to produce high-quality scene synthesis by learning multi-level variation factors, as well as what role the generative representation of each layer plays in the generation process, this work aims at establishing the relationship between the variation factors and the generative representations. Karras et al. (2019) has already pointed out that the design of layer-wise stochasticity actually controls the synthesis from coarse to fine; however, what "coarse" and "fine" actually refer to still remains uncertain. To better align the variation factors with human perception, we separate them into four abstraction levels, including layout, categorical objects, scene attributes, and color scheme. We further propose a framework in Sect. 4 to quantify the causality between the input generative representations and the output variation factors. We surprisingly find that GANs synthesize a scene in a manner that is highly consistent with humans. Over all convolutional layers, GANs manage to organize these multi-level abstractions as a hierarchy. In particular, the GAN constructs the spatial layout at the early stage, synthesizes category-specific objects at the middle stage, and renders the scene attributes and color scheme at the later stage.

4 Identifying the Emergent Variation Factors

As described in Sect. 3, we target interpreting the latent semantics learned by scene synthesis models from four abstraction levels. Previous efforts on several scene understanding databases (Zhou et al. 2017; Xiao et al. 2010; Laffont et al. 2014; Patterson et al. 2014) enable a series of classifiers to predict scene attributes and categories. Besides, we also employ several classifiers focusing on layout detection (Zhang et al. 2019) and semantic segmentation (Xiao et al. 2018b).
Fig. 4 Pipeline of identifying the emergent variation factors in generative representations. By deploying a broad set of off-the-shelf image classifiers as scoring functions, F(·), we are able to assign a synthesized image with semantic scores corresponding to each candidate variation factor. For a particular concept, we learn a decision boundary in the latent space by considering it as a binary classification task. Then we move the sampled latent code towards the boundary to see how the semantic varies in the synthesis, and use a re-scoring technique to quantitatively verify the emergence of the target concept (Color figure online)

Specifically, given an image, we are able to use these classifiers to get the response scores with respect to various semantics. However, only predicting the semantic labels is far from identifying the variation factors that GANs have captured from the training data. More concretely, among all the candidate concepts, not all of them are meaningful to a particular model. For instance, "indoor lighting" will never happen in outdoor scenes such as bridge and tower, while "enclosed area" is always true for indoor scenes such as bedroom and kitchen. Accordingly, we come up with a method to quantitatively identify the most relevant and manipulable variation factors that emerge inside the learned generative representation. Figure 4 illustrates the identification process, which consists of two steps, i.e., probing (Sect. 4.1) and verification (Sect. 4.2). Such identification enables diverse scene manipulation (Sect. 4.3). Note that we use the same approach as InterFaceGAN (Shen et al. 2020b) to get the latent boundary for each candidate in the probing process in Sect. 4.1.

4.1 Probing Latent Space

The generator of a GAN, G(·), typically learns the mapping from the latent space Z to the image space X. Latent vectors z ∈ Z can be considered as the generative representations learned by the GAN. To study the emergence of variation factors inside Z, we need to first extract semantic information from z. For this purpose, we utilize the synthesized image, x = G(z), as an intermediate step and employ a broad set of image classifiers to help assign semantic scores to each sampled latent code z. Taking "indoor lighting" as an example, the scene attribute classifier is able to output the probability of how likely an input image has indoor lighting, which we use as the semantic score. Recall that we divide the scene representation into layout, object (category), and attribute levels; we therefore introduce a layout estimator, a scene category recognizer, and an attribute classifier to predict semantic scores at these abstraction levels respectively, forming a hierarchical semantic space S. After establishing the mapping from the latent space Z to the semantic space S, we search the decision boundary for each concept by treating it as a bi-classification problem, as shown in Fig. 4. Here, taking "indoor lighting" as an instance, the boundary separates the latent space Z into two sets, i.e., presence or absence of indoor lighting.

4.2 Verifying Manipulable Variation Factors

After probing the latent space with a broad set of candidate concepts, we still need to figure out which ones are most relevant to the generative model by acting as the variation factors. The key issue is how to define "relevance". We argue that if the target concept is manipulable from the latent space perspective (e.g., changing the indoor lighting status of the synthesized image by simply varying the latent code), the GAN model is considered as having captured such a variation factor during training.

As mentioned above, we have already obtained a separation boundary for each candidate. Let {n_i}_{i=1}^{C} denote the normal vectors of these boundaries, where C is the total number of candidates. For a certain boundary, if we move a latent code z along its normal direction (positive), the semantic score should also increase correspondingly. Therefore, we propose to re-score the varied latent code to quantify how relevant a variation factor is to the target model under analysis. As shown in Fig. 4, this process can be formulated as

\Delta s_i = \frac{1}{K} \sum_{k=1}^{K} \max\big( F_i(G(z_k + \lambda n_i)) - F_i(G(z_k)),\ 0 \big),   (1)

where the term \frac{1}{K} \sum_{k=1}^{K} stands for the average over K samples to make the metric more accurate, and λ is a fixed moving step. To make this metric comparable among all candidates, all normal vectors {n_i}_{i=1}^{C} are normalized to the fixed norm 1 and λ is set
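To make the probing and verification steps of Sects. 4.1 and 4.2 concrete, the sketch below shows one way to fit a semantic boundary and compute the perturbation score of Eq. (1). It is a minimal illustration under stated assumptions rather than the authors' released implementation: the helpers `sample_latents`, `generate` (standing in for G), and `score_semantic` (standing in for F_i), as well as the value of the moving step λ, are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Assumed helpers (not from the paper's code release):
#   sample_latents(n)      -> (n, latent_dim) array of latent codes z
#   generate(z)            -> images G(z) synthesized from latent codes
#   score_semantic(images) -> (n,) array of classifier responses F_i(G(z))

def search_boundary(latents, scores, num_per_side=2000):
    """Probing (Sect. 4.1): fit a linear SVM separating the latent codes with
    the highest semantic scores from those with the lowest, and return the
    unit normal vector n_i of the resulting decision boundary."""
    order = np.argsort(scores)
    positives = latents[order[-num_per_side:]]
    negatives = latents[order[:num_per_side]]
    data = np.concatenate([positives, negatives], axis=0)
    labels = np.concatenate([np.ones(num_per_side), np.zeros(num_per_side)])
    normal = LinearSVC(C=1.0).fit(data, labels).coef_.ravel()
    return normal / np.linalg.norm(normal)          # normalize to unit norm

def rescore(normal, step=2.0, num_samples=1000):
    """Verification (Sect. 4.2, Eq. 1): push K latent codes along the boundary
    normal and average the non-negative change of the semantic score.
    `step` plays the role of the fixed moving step lambda (value assumed)."""
    z = sample_latents(num_samples)
    before = score_semantic(generate(z))
    after = score_semantic(generate(z + step * normal))
    return np.maximum(after - before, 0.0).mean()
```

Candidates with a large re-scored gain are the ones the model has actually captured as manipulable variation factors; candidates whose score barely moves are filtered out.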
5.1.2 Scene Categories

Among the mentioned generator models, PGGAN and StyleGAN are trained on the LSUN dataset (Yu et al. 2015), while BigGAN is trained on the Places dataset (Zhou et al. 2017). To be specific, the LSUN dataset consists of 7 indoor scene categories and 3 outdoor scene categories, and the Places dataset contains 10 million images across 434 categories. For the PGGAN model, we use the officially released models, each of which is trained to synthesize scenes within an individual category of the LSUN dataset. For StyleGAN, only one model related to scene synthesis (i.e., bedroom) is released at this link. For a more thorough analysis, we use the official implementation to train multiple models on other scene categories, including both indoor scenes (living room, kitchen, restaurant) and outdoor scenes (bridge, church, tower). We also train a mixed model on the combination of images from bedroom, living room, and dining room with the same implementation. This model is specifically used for categorical analysis. For each StyleGAN model, Table 1 shows the category, the number of training samples, as well as the corresponding Fréchet inception distance (FID) (Heusel et al. 2017), which can reflect the synthesis quality to some extent. For BigGAN, we use the author's officially unofficial PyTorch BigGAN implementation to train a conditional generative model by taking the category label as the constraint on the Places dataset (Zhou et al. 2017). The resolution of the scene images synthesized by all of the above models is 256 × 256.

5.1.3 Semantic Classifiers

To extract semantics from the synthesized images, we employ various off-the-shelf image classifiers to assign these images with semantic scores at multiple abstraction levels, including layout, category, scene attribute, and color scheme. Specifically, we use (1) a layout estimator (Zhang et al. 2019), which predicts the spatial structure of an indoor place, (2) a scene category classifier (Zhou et al. 2017), which classifies a scene image into 365 categories, and (3) an attribute predictor (Zhou et al. 2017), which predicts the 102 pre-defined scene attributes in the SUN attribute database (Patterson et al. 2014). We also extract the color scheme of a scene image through its hue histogram in HSV space. Among them, the category classifier and the attribute predictor can directly output the probability of how likely an image belongs to a certain category or how likely an image has a particular attribute. As for the layout estimator, it only detects the outline structure of an indoor place, as shown in Fig. 6.

Fig. 6 The definition of layout for indoor scenes. Green lines represent the outline predicted by the layout estimator. The dashed line indicates the horizontal center, and the red point is the center point of the intersection line between two walls. The relative position between the vertical line and the center point is used to split the dataset (Color figure online)

5.1.4 Semantic Probing and Verification

Given a well-trained GAN model for analysis, we first generate a collection of synthesized scene images by randomly sampling N latent codes (500,000 in practice). Then, the aforementioned image classifiers are used to assign semantic scores for each visual concept. It is worth noting that we use the relative position between the image horizontal center and the intersection line of two walls to quantify layout, as shown in Fig. 6. After that, for each candidate, we select 2000 images with the highest response as positive samples, and another 2000 with the lowest response as negative ones. In particular, living room and bedroom are treated as positive and negative for the scene category of the mixed model, respectively. A linear SVM is trained by treating it as a bi-classification problem (i.e., the data are the sampled latent codes while the label is binary, indicating whether the target semantic appears in the corresponding synthesis or not) to get a linear decision boundary. Finally, we re-generate K = 1000 samples for semantic verification as described in Sect. 4.2.

5.2 Emerging Semantic Hierarchy

Humans typically interpret a scene in a hierarchy of semantics, from its layout, underlying objects, to the detailed attributes and the color scheme. Here the underlying objects refer to the set of objects most relevant to a specific category. This section shows that a GAN composes a scene over the layers in a similar way to human perception. To enable analysis on layout and object, we take the mixed StyleGAN model trained on indoor scenes as the target model. StyleGAN (Karras et al. 2019) learns a more disentangled latent space W on top of the conventional latent space Z. Specifically, for the ℓ-th layer, w ∈ W is linearly transformed to the layer-wise transformed latent code y(ℓ) with y(ℓ) = A(ℓ)w + b(ℓ), where A(ℓ) and b(ℓ) are the weight and bias for the style transformation respectively. We thus perform layer-wise analysis by studying y(ℓ) instead of z in Eq. (1).

To quantify the importance of each layer with respect to each variation factor, we use the re-scoring technique to identify the causality between the layer-wise generative representation y(ℓ) and the semantic emergence. The normalized
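The layer-wise analysis above replaces z in Eq. (1) with the per-layer codes y(ℓ). The sketch below illustrates the corresponding layer-wise editing, assuming the generator exposes a style-mixing-style interface that accepts one code per layer; the layer count, the layer groupings, and the `synthesize` helper are assumptions, not the paper's released API.

```python
import numpy as np

# Assumed interface (not the authors' released code):
#   synthesize(w_per_layer) renders an image from per-layer codes of shape
#   (NUM_LAYERS, latent_dim), as a style-mixing interface allows.

NUM_LAYERS = 14                          # e.g. a 256x256 StyleGAN generator (assumed)
LAYOUT_LAYERS = range(0, 4)              # "bottom" layers (illustrative grouping)
OBJECT_LAYERS = range(4, 8)              # "middle" layers
ATTRIBUTE_LAYERS = range(8, NUM_LAYERS)  # "upper" layers

def layerwise_edit(w, normal, step, layers):
    """Move the code along a semantic boundary only at the selected layers;
    all other layers keep the original w, so semantics tied to those layers
    are left untouched."""
    w_per_layer = np.tile(w, (NUM_LAYERS, 1))      # replicate w for every layer
    for layer in layers:
        w_per_layer[layer] = w + step * normal     # shift only the chosen layers
    return w_per_layer

# Example: strengthen "indoor lighting" while leaving layout and objects alone.
# image = synthesize(layerwise_edit(w, lighting_normal, step=2.0,
#                                   layers=ATTRIBUTE_LAYERS))
```

Restricting the shift to a layer group is what makes the layer-wise comparisons (e.g., editing only the attribute-relevant layers versus all layers) possible without retraining the generator.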
Fig. 8 User study on how different layers correspond to variation factors from different abstraction levels (Color figure online)
generative representation, i.e., from layout, object (category),
Fig. 10 a Independent attribute manipulation results on upper layers. The middle row shows the source images. We are able to both decrease (top row) and increase (bottom row) the variation factors in the images. b Joint manipulation results, where the layout, objects, and attribute are manipulated at the proper layers. The first column indicates the source images and the middle three columns are the independently manipulated images (Color figure online)
Fig. 11 Comparison of the top scene attributes identified in the generative representations learned by StyleGAN models for synthesizing different
scenes. Vertical axis shows the perturbation score Δsi (Color figure online)
kitchen). Figure 11 shows the top-10 relevant semantics for each model. It can be seen that "sunny" has high scores on all outdoor categories, while "lighting" has high scores on all indoor categories. Furthermore, "boating" is identified for the bridge model, "touring" for church and tower, "reading" for living room, "eating" for kitchen, and "socializing" for restaurant. These results are highly consistent with human understanding and perception, suggesting the effectiveness of the proposed quantification method.

5.4.2 Attribute Manipulation

Recall the three types of manipulation in Sect. 4.3: independent manipulation, joint manipulation, and jittering manipulation. We first conduct independent manipulation on 3 indoor and 3 outdoor scenes with the most relevant scene attributes identified with our approach. Figure 12 shows the results, where the original synthesis (left image in each pair) is manipulated along the positive (right) direction. We can tell that the edited images are still of high quality and the target attributes indeed change as desired. We then jointly manipulate two attributes with the bridge synthesis model, as shown in Fig. 13. The central image of the 3 × 3 image grid is the original synthesis; the second row and the second column show the independent manipulation results with respect to the "vegetation" and "cloud" attributes respectively, while the other images at the four corners are the joint manipulation results. It turns out that we achieve good control over these two semantics and they seem to barely affect each other. However, not all variation factors show such strong disentanglement. From this point of view, our approach also provides a new metric to help measure the entanglement between two variation factors, which will be discussed in Sect. 6. Finally, we evaluate the proposed jittering manipulation by introducing noise into the "cloud" manipulation. From Fig. 14, we observe that the newly introduced noise indeed increases the manipulation diversity.
Fig. 12 Independent manipulation results on StyleGAN models trained for synthesizing indoor and outdoor scenes. In each pair of images, the
first is the original synthesized sample and the second is the one after manipulating a certain semantic (Color figure online)
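As a hedged sketch of the joint and jittering manipulations described in Sect. 5.4.2: joint editing moves a latent code along several boundary normals at once (as in the 3 × 3 grid of Fig. 13), while jittering adds random noise to a single edit to diversify its outcome (as in Fig. 14). The helper `generate`, the noise scale, and the step sizes are assumptions for illustration.

```python
import numpy as np

# Assumed helper: generate(z) renders an image from latent code z.
# n_cloud and n_vegetation denote unit boundary normals found as in Sect. 4.1.

def joint_edit(z, normals, steps):
    """Joint manipulation: shift a latent code along several semantic
    boundaries simultaneously (e.g. 'cloud' and 'vegetation')."""
    z_edit = np.array(z, dtype=float, copy=True)
    for normal, step in zip(normals, steps):
        z_edit += step * normal
    return z_edit

def jitter_edit(z, normal, step, noise_std=0.5, num_variants=5, seed=0):
    """Jittering manipulation: perturb the edited code with random noise so
    that a single edit (e.g. adding cloud) yields diverse results."""
    rng = np.random.default_rng(seed)
    return [z + step * normal + rng.normal(scale=noise_std, size=z.shape)
            for _ in range(num_variants)]

# A 3x3 grid like Fig. 13: vary the two attributes independently and jointly.
# grid = [generate(joint_edit(z, [n_cloud, n_vegetation], [a, b]))
#         for a in (-2, 0, 2) for b in (-2, 0, 2)]
```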
Fig. 13 Joint manipulation results along both the cloud and vegetation boundaries with the bridge synthesis model. Along the vertical and horizontal axes, the original synthesis (the central image) is manipulated with respect to the vegetation and cloud attributes respectively (Color figure online)
It is interesting that the introduced randomness may not only affect the shape of the added cloud, but also change the appearance of the synthesized tower. In both cases, however, the primary goal, which is to edit the cloudiness, is preserved.

5.5 Ablation Studies

5.5.1 Re-scoring Technique

Before performing the proposed re-scoring technique, there are two preceding steps, namely (1) assigning semantic scores to synthesized samples, and (2) training SVM classifiers to search for the semantic boundary. We would like to verify the necessity of the re-scoring technique in identifying manipulable semantics. The ablation study is conducted on the StyleGAN model trained for synthesizing bedrooms. As shown in Fig. 15, the left figure sorts the scene attributes by how many samples are labelled as positive, the middle figure sorts by the accuracy of the trained SVM classifiers, and the right figure sorts by our proposed quantification metric.

In the left figure, "no horizon", "man-made", and "enclosed area" are the attributes with the highest percentage. However, all three are default properties of the bedroom and thus not manipulable. On the contrary, with the re-scoring technique for verification, our method successfully filters out these invariable candidates and reveals more meaningful semantics, like "wood" and "indoor lighting". In addition, our method also manages to identify some less frequent but actually manipulable scene attributes, such as "cluttered space".

In the middle figure, almost all attributes get similar scores, making them indistinguishable. In fact, even the worst SVM classifier (i.e., "railroad") achieves 72.3% accuracy. That is because, even when some variation factors are not encoded in the latent representation (in other words, not manipulable), the corresponding attribute classifier still assigns synthesized images different scores. Training an SVM on such inaccurate data can still yield a separation boundary, even though it does not correspond to the target concept. Therefore, relying only on the SVM classifier is not enough to detect the relevant variation factors. By contrast, our method pays more attention to the score modulation after varying the latent code, which is not biased by the initial response of the attribute classifier or the performance of the SVM. As a result, we are able to thoroughly yet precisely detect the variation factors in the latent space from a broad candidate set.

5.5.2 Layer-Wise Manipulation

To further validate the emergence of the semantic hierarchy, we conduct an ablation study on layer-wise manipulation with the StyleGAN model. First, we select "indoor lighting" as the target semantic, and vary the latent code only on the upper (attribute-relevant) layers versus on all layers. We can easily tell from Fig. 16 that when manipulating "indoor lighting" at all layers, the objects inside the room are also changed. By contrast, manipulating latent codes only at the attribute-relevant layers can satisfyingly increase the indoor lighting without affecting other factors.
Fig. 14 Jittering manipulation results with the tower synthesis model for the cloud attribute. Specifically, the movement of the synthesized image in the latent space is disturbed. Thus, when the cloud appears, both the shape of the added cloud and the appearance of the generated tower change. The top-left image of each of the two samples is the original output, while the rest are the results under jittering manipulation (Color figure online)
Fig. 15 Ablation study on the proposed re-scoring technique with the StyleGAN model for bedroom synthesis. The left figure sorts the scene attributes by the percentage of samples with positive scores, the middle figure sorts by the accuracy of the SVM classifiers, and the right figure sorts by our proposed metric (Color figure online)
Second, we select the bottom layers as the target layers, and select boundaries from all four abstraction levels for manipulation. As shown in Fig. 17, no matter what level of semantics we choose, as long as the latent code is modified at the bottom (layout-relevant) layers, only the layout varies, rather than semantics at other abstraction levels. These two experiments further verify our discovery about the emergence of the semantic hierarchy, i.e., that the early layers tend to determine the spatial layout and configuration rather than semantics at other abstraction levels.

6 Discussions

6.1 Disentanglement of Semantics

Some variation factors we detect in the generative representation are more disentangled from each other than others. Compared to the perceptual path length and linear separability described in Karras et al. (2019) and the cosine similarity proposed in Shen et al. (2020a), our work offers a new metric for disentanglement analysis. In particular, we move the latent code along one semantic direction and then check how the semantic scores of other factors change accordingly. As shown in Fig. 18a, when the spatial layout is modified, all attributes are barely affected, suggesting that the GAN learns to disentangle layout-level semantics from attribute-level ones. However, there are also some scene attributes (from the same abstraction level) entangled with each other. Taking Fig. 18c as an example, when modulating "indoor lighting", "natural lighting" also varies. This is also aligned with human perception, further demonstrating the effectiveness of our proposed quantification metric. Qualitative results are also included in Fig. 18d–f.

6.2 Application to Other GANs

We further apply our method to two other GAN structures, i.e., PGGAN (Karras et al. 2017) and BigGAN (Brock et al. 2018). These two models are trained on the LSUN dataset (Yu et al. 2015) and the Places dataset (Zhou et al. 2017) respectively. Compared to StyleGAN, PGGAN feeds the latent vector only to the very first convolutional layer and hence does not support layer-wise analysis. But the proposed re-scoring method can still be applied to help identify manipulatable semantics, as shown in Fig. 19a. BigGAN is the state-of-the-art conditional GAN model that concatenates the latent vector with a class-guided embedding code before feeding it to the generator, and it also allows layer-wise analysis like StyleGAN. Figure 19b gives analysis results on BigGAN at the attribute level, where we can tell that scene attributes are best modified at the upper layers compared to the lower layers or all layers.
Fig. 16 Comparison results between manipulating latent codes at only upper (attribute-relevant) layers and manipulating latent codes at all layers
with respect to indoor lighting on StyleGAN (Color figure online)
Fig. 17 Manipulation at the bottom layers in 4 different directions, along the directions of layout, objects (category), indoor lighting, and color
scheme on StyleGAN (Color figure online)
Fig. 18 a–c Quantitative effects on scene attributes (already sorted). Vertical axis shows the perturbation score Δsi in log scale. d–f Qualitative
results also show the effect when varying the most relevant factor (Color figure online)
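The entanglement measurement of Sect. 6.1 (plotted in Fig. 18a–c) can be sketched as follows: move latent codes along the boundary direction of one factor and record how the scores of every other factor respond. This is a hedged illustration only; the helpers `sample_latents` and `generate`, and the per-factor scoring functions standing in for the classifiers F_j, are assumptions.

```python
import numpy as np

# Assumed helpers: sample_latents(n) draws latent codes, generate(z) renders
# images, and score_fns maps each factor name to a classifier response F_j.

def entanglement_profile(normal_i, score_fns, step=2.0, num_samples=1000):
    """Move codes along the direction of factor i and report the average
    non-negative score change of every factor j (same form as Eq. 1).
    Large values for j != i indicate that j is entangled with factor i."""
    z = sample_latents(num_samples)
    images_before = generate(z)
    images_after = generate(z + step * normal_i)
    deltas = {}
    for name, score_fn in score_fns.items():
        gain = score_fn(images_after) - score_fn(images_before)
        deltas[name] = float(np.maximum(gain, 0.0).mean())
    return deltas
```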
As for the BigGAN model with 256 × 256 resolution, there are in total 12 convolutional layers. As the category information is already encoded in the "class" code, we only separate the layers into two groups, namely the lower (bottom 6) and upper (top 6) layers. Meanwhile, the quantitative curve shows a result consistent with the discovery on StyleGAN in Fig. 7a. These results demonstrate the generalization ability of our approach as well as the emergence of manipulatable factors in other GANs.

6.3 Limitation

There are several limitations for future improvement. (1) More thorough and precise off-the-shelf classifiers: although we collect as many visual concepts as possible and summarize them into various levels following prior work, such as layout in Oliva and Torralba (2001), category in Torralba and Oliva (2003), and attribute in Patterson et al. (2014), such classifiers remain to be improved together with the development of scene understanding. In case the defined broad set of semantics is not enough, we could further enlarge the dictionary following the standard annotation pipeline in Zhou et al. (2017) and Patterson et al. (2014). In addition, such classifiers trained on large-scale scene understanding benchmarks could be replaced by more powerful discriminative models to improve the accuracy. (2) Boundary search: for simplicity, we only use a linear SVM for semantic boundary search. This limits our framework from interpreting latent semantic subspaces with more complex and nonlinear structure. (3) Generalization beyond scene understanding: the main purpose of this work is to interpret scene-related GANs, which is a challenging task considering the large diversity of scene images as well as the difficulty of scene understanding.
Fig. 19 a Some variation factors identified from PGGAN (bedroom). b Layer-wise analysis on BigGAN from the attribute level (Color figure
online)
However, these abstraction levels can be hard to generalize to other datasets beyond scenes. Even so, we believe that this work is still able to provide some insights on analyzing GAN models trained on other datasets. For example, for scene synthesis, we found that the early layers control the scene layout, which can be viewed as structural information, such as rotation. Accordingly, we can fairly generalize that the early layers of face synthesis models control the face pose and the early layers of car synthesis models control the car orientation.

7 Conclusion

In this paper, we show the emergence of highly-structured variation factors inside the deep generative representations learned by GANs with layer-wise stochasticity. In particular, when trained to synthesize scenes, the GAN model spontaneously learns to set up the layout at early layers, generate categorical objects at middle layers, and render the scene attributes and color scheme at later layers. A re-scoring method is proposed to quantitatively identify the manipulatable semantic concepts within a well-trained model, enabling photo-realistic scene manipulation. We will explore extending this manipulation capability of GANs to real image editing in future work.

Acknowledgements This work is supported by Early Career Scheme (ECS) through the Research Grants Council (RGC) of Hong Kong under Grant No. 24206219 and CUHK FoE RSFS Grant (No. 3133233).

References

Abdal, R., Qin, Y., & Wonka, P. (2019). Image2stylegan: How to embed images into the stylegan latent space? In: International conference on computer vision (pp. 4432–4441).
Abdal, R., Qin, Y., & Wonka, P. (2020). Image2stylegan++: How to edit the embedded images? In: IEEE conference on computer vision and pattern recognition (pp. 8296–8305).
Agrawal, P., Girshick, R., & Malik, J. (2014). Analyzing the performance of multilayer neural networks for object recognition. In: European conference on computer vision (pp. 329–344). Springer.
Alain, G., & Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. In: International conference on learning representations workshop.
Bau, D., Strobelt, H., Peebles, W., Wulff, J., Zhou, B., Zhu, J.-Y., & Torralba, A. (2019). Semantic photo manipulation with a generative image prior. ACM Transactions on Graphics, 38(4), 59.
Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In: IEEE conference on computer vision and pattern recognition (pp. 6541–6549).
Bau, D., Zhu, J. Y., Strobelt, H., Zhou, B., Tenenbaum, J. B., Freeman, W. T., & Torralba, A. (2018). Gan dissection: Visualizing and understanding generative adversarial networks. In: International conference on learning representations.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.
Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale gan training for high fidelity natural image synthesis. In: International conference on learning representations.
Cheng, M. M., Zheng, S., Lin, W. Y., Vineet, V., Sturgess, P., Crook, N., et al. (2014). Imagespirit: Verbal guided image parsing. ACM Transactions on Graphics, 34(1), 1–11.
Choi, Y., Choi, M., Kim, M., Ha, J. W., Kim, S., & Choo, J. (2018). Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: IEEE conference on computer vision and pattern recognition (pp. 8789–8797).
Goetschalckx, L., Andonian, A., Oliva, A., & Isola, P. (2019). Ganalyze: Toward visual definitions of cognitive image properties. In: Proceedings of the IEEE international conference on computer vision (pp. 5744–5753).
Gonzalez-Garcia, A., Modolo, D., & Ferrari, V. (2018). Do semantic parts emerge in convolutional neural networks? International Journal of Computer Vision, 126(5), 476–494.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In: Advances in neural information processing systems (pp. 2672–2680).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in neural information processing systems (pp. 6626–6637).
Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In: IEEE conference on computer vision and pattern recognition (pp. 1125–1134).
Jahanian, A., Chai, L., & Isola, P. (2019). On the "steerability" of generative adversarial networks. In: International conference on learning representations.
Karacan, L., Akata, Z., Erdem, A., & Erdem, E. (2016). Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215.
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. In: International conference on learning representations.
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In: IEEE conference on computer vision and pattern recognition (pp. 4401–4410).
Laffont, P. Y., Ren, Z., Tao, X., Qian, C., & Hays, J. (2014). Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics, 33(4), 1–11.
Liao, J., Yao, Y., Yuan, L., Hua, G., & Kang, S. B. (2017). Visual attribute transfer through deep image analogy. ACM Transactions on Graphics, 36(4), 120.
Luan, F., Paris, S., Shechtman, E., & Bala, K. (2017). Deep photo style transfer. In: IEEE conference on computer vision and pattern recognition (pp. 4990–4998).
Mahendran, A., & Vedaldi, A. (2015). Understanding deep image representations by inverting them. In: IEEE conference on computer vision and pattern recognition (pp. 5188–5196).
Morcos, A. S., Barrett, D. G., Rabinowitz, N. C., & Botvinick, M. (2018). On the importance of single directions for generalization. In: International conference on learning representations.
Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., & Clune, J. (2016). Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In: Advances in neural information processing systems (pp. 3387–3395).
Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., & Yang, Y. L. (2019). Hologan: Unsupervised learning of 3D representations from natural images. In: International conference on computer vision (pp. 7588–7597).
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Oliva, A., & Torralba, A. (2006). Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155, 23–36.
Park, T., Liu, M. Y., Wang, T. C., & Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In: IEEE conference on computer vision and pattern recognition (pp. 2337–2346).
Park, T., Zhu, J.-Y., Wang, O., Lu, J., Shechtman, E., Efros, A. A., & Zhang, R. (2020). Swapping autoencoder for deep image manipulation. In: Advances in neural information processing systems.
Patterson, G., Xu, C., Su, H., & Hays, J. (2014). The sun attribute database: Beyond categories for deeper scene understanding. International Journal of Computer Vision, 108(1–2), 59–81.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. In: International conference on learning representations.
Shaham, T. R., Dekel, T., & Michaeli, T. (2019). Singan: Learning a generative model from a single natural image. In: International conference on computer vision (pp. 4570–4580).
Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020a). Interpreting the latent space of gans for semantic face editing. In: IEEE conference on computer vision and pattern recognition (pp. 9243–9252).
Shen, Y., Luo, P., Yan, J., Wang, X., & Tang, X. (2018). Faceid-gan: Learning a symmetry three-player gan for identity-preserving face synthesis. In: IEEE conference on computer vision and pattern recognition (pp. 821–830).
Shen, Y., Yang, C., Tang, X., & Zhou, B. (2020b). InterFaceGAN: Interpreting the disentangled face representation learned by GANs. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2020.3034267.
Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In: Workshop at international conference on learning representations.
Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14(3), 391–412.
Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In: IEEE conference on computer vision and pattern recognition (pp. 8798–8807).
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE computer society conference on computer vision and pattern recognition (pp. 3485–3492). IEEE.
Xiao, T., Hong, J., & Ma, J. (2018a). Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. In: European conference on computer vision (pp. 168–184).
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018b). Unified perceptual parsing for scene understanding. In: Proceedings of the European conference on computer vision (ECCV) (pp. 418–434).
Yao, S., Hsu, T. M., Zhu, J. Y., Wu, J., Torralba, A., Freeman, B., & Tenenbaum, J. (2018). 3D-aware scene manipulation via inverse graphics. In: Advances in neural information processing systems (pp. 1887–1898).
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In: Advances in neural information processing systems (pp. 3320–3328).
Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., & Xiao, J. (2015). Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In: European conference on computer vision (pp. 818–833). Springer.
Zhang, W., Zhang, W., & Gu, J. (2019). Edge-semantic learning strategy for layout estimation in indoor environment. IEEE Transactions on Cybernetics, 50(6), 2730–2739.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene cnns. In: International conference on learning representations.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1452–1464.
Zhu, J., Shen, Y., Zhao, D., & Zhou, B. (2020). In-domain gan inversion for real image editing. In: European conference on computer vision.
Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: International conference on computer vision (pp. 2223–2232).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.