Semantic Hierarchy Emerges in Deep Generative Representations For Scene Synthesis
https://doi.org/10.1007/s11263-020-01429-5
Received: 31 January 2020 / Accepted: 31 December 2020 / Published online: 10 February 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC part of Springer Nature 2021
Abstract
Despite the great success of Generative Adversarial Networks (GANs) in synthesizing images, how photo-realistic images are generated from the layer-wise stochastic latent codes introduced in recent GANs remains insufficiently understood. In this work, we show that a highly-structured semantic hierarchy emerges in the deep generative representations of state-of-the-art GANs, such as StyleGAN and BigGAN, trained for scene synthesis. By probing the per-layer representation with a broad set of semantics at different abstraction levels, we manage to quantify the causality between the layer-wise activations and the semantics occurring in the output image. Such a quantification identifies the human-understandable variation factors that can be further used to steer the generation process, such as changing the lighting condition and varying the viewpoint of the scene. Extensive qualitative and quantitative results suggest that the generative representations learned by GANs with layer-wise latent codes are specialized to synthesize various concepts in a hierarchical manner: the early layers tend to determine the spatial layout, the middle layers control the categorical objects, and the later layers render the scene attributes as well as the color scheme. Identifying such a set of steerable variation factors facilitates high-fidelity scene editing based on well-learned GAN models without any retraining (code and demo video are available at https://genforce.github.io/higan).
Keywords Generative model · Scene understanding · Image manipulation · Representation learning · Feature visualization
Communicated by Jifeng Dai.

Ceyuan Yang and Yujun Shen contributed equally to this work.

✉ Ceyuan Yang
yc019@ie.cuhk.edu.hk

✉ Bolei Zhou
bzhou@ie.cuhk.edu.hk

Yujun Shen
sy116@ie.cuhk.edu.hk

1 Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China

1 Introduction

Success of deep neural networks stems from representation learning, which identifies the explanatory factors underlying the high-dimensional observed data (Bengio et al. 2013). Prior work has shown that many concept detectors spontaneously emerge in the deep representations trained for classification tasks (Zhou et al. 2015; Zeiler and Fergus 2014; Bau et al. 2017; Gonzalez-Garcia et al. 2018). For example, Gonzalez-Garcia et al. (2018) observes that networks for object recognition are able to detect semantic object parts, and Bau et al. (2017) confirms that representations from classifying images learn to detect different categorical concepts at different layers.

Analyzing the deep representations and their emergent structures gives insight into the generalization ability of deep features (Morcos et al. 2018) as well as the feature transferability across different tasks (Yosinski et al. 2014), but current efforts mainly focus on discriminative models (Zhou et al. 2015; Gonzalez-Garcia et al. 2018; Zeiler and Fergus 2014; Agrawal et al. 2014; Bau et al. 2017). Generative Adversarial Networks (GANs) (Goodfellow et al. 2014; Karras et al. 2017, 2019; Brock et al. 2018) are capable of mapping random noises to high-quality images; however, the nature of the learned generative representations and how a synthesized image is composed over different layers of the GAN generator remain much less explored.

It has been known that some internal units of deep models emerge as object detectors when trained to categorize scenes (Zhou et al. 2015).
Fig. 1 Scene manipulation results at four different abstraction levels, including spatial layout, categorical objects, scene attributes, and color scheme. For each tuple of images, the first is the raw synthesis, whilst the following ones present the editing process (Color figure online)

Representing and detecting objects that are most informative to a specific category provides an ideal solution for classifying scenes: sofa and TV are representative of the living room, while bed and lamp are of the bedroom. However, synthesizing a scene requires far more complex knowledge. In particular, in order to produce realistic yet diverse scene images, a good generative representation is required to not only generate every individual object, but also decide the underlying room layout and render various scene attributes (e.g., the lighting condition). Bau et al. (2018) has found that some filters in the GAN generator correspond to the generation of certain objects; however, this analysis is only at the object level. Fully understanding how a scene image is synthesized requires examining the variation factors of scenes at multiple levels, i.e., from the layout level, the category level, to the attribute level. Recent GAN variants introduce layer-wise stochasticity to control the synthesis from coarse to fine (Karras et al. 2019; Brock et al. 2018; Shaham et al. 2019; Nguyen-Phuoc et al. 2019); however, how the variation factors originate from the generative representations layer by layer and how to quantify such semantic information still remain unknown.

In this paper, instead of designing new architectures for better synthesis, we examine the nature of the internal representations learned by state-of-the-art GAN models. Starting with StyleGAN (Karras et al. 2019) as an example, we reveal that a highly-structured semantic hierarchy emerges from the deep generative representations, which can well match the human-understandable scene variations at multiple abstraction levels, including layout, category, attribute, and color scheme. We first probe the per-layer representations of the generator with a broad set of visual concepts as candidates and then identify the most relevant variation factors for each layer. For this purpose, we propose a simple yet effective re-scoring technique to quantify the causality between the layer-wise activations and the semantics occurring in the output image. In particular, we find that the early layers determine the spatial layout, the middle layers compose the categorical objects, and the later layers render the attributes and color scheme of the entire scene. We also show that identifying such a set of steerable variation factors facilitates versatile semantic image editing, as shown in Fig. 1. The proposed manipulation technique is applicable to other GAN variants, such as BigGAN (Brock et al. 2018) and PGGAN (Karras et al. 2017). More importantly, discovering the emergent hierarchy in scene generation has implications for the research of scene understanding, which is one of the milestone tasks in computer vision and visual perception. Our work shows that deep generative models 'draw' a scene much as humans do, i.e., drawing the layout first, then the representative objects, and finally the fine-grained attributes and color scheme. This leads to many applications in scene understanding tasks such as scene editing, categorization, and parsing.

2 Related Work

2.1 Deep Representations from Classifying Images

Many attempts have been made to study the internal representations of deep models trained for classification tasks. Zhou et al. (2015) analyzed hidden units by simplifying the input image to see which context region gives the highest response, Simonyan et al. (2014) applied the back-propagation technique to compute the image-specific class saliency map, Bau et al. (2017) interpreted the hidden representations with the aid of segmentation masks, and Alain and Bengio (2016) trained independent linear probes to analyze the information separability among different layers. There are also some studies transferring the discriminative features to verify how learned representations fit with different datasets or tasks (Yosinski et al. 2014; Agrawal et al. 2014).
In addition, reversing the feature extraction process by mapping a given representation back to the image space (Zeiler and Fergus 2014; Nguyen et al. 2016; Mahendran and Vedaldi 2015) also gives insight into how neural networks learn to distinguish different categories. However, these interpretation techniques developed for classification networks cannot be directly applied to generative models.

2.2 Deep Representations from Synthesizing Images

Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) have advanced image synthesis significantly. Some recent models (Karras et al. 2017, 2019; Brock et al. 2018) are able to generate photo-realistic faces, objects, and scenes, making GANs applicable to real-world image editing tasks, such as image manipulation (Shen et al. 2018; Xiao et al. 2018a; Wang et al. 2018; Yao et al. 2018), image painting (Bau et al. 2018; Park et al. 2019), and image style transfer (Zhu et al. 2017; Choi et al. 2018). Despite such great success, it remains uncertain what GANs have learned in order to produce diverse and realistic images. Radford et al. (2015) pointed out the vector arithmetic phenomenon in the underlying latent space of GANs; however, discovering what kinds of semantics exist inside a well-trained model and how these semantics are structured to compose high-quality images still remains unsolved. Bau et al. (2018) analyzed the individual units of the GAN generator and found that they spontaneously learn to synthesize informative visual contents such as objects and textures. Besides, Jahanian et al. (2019) explored the steerability of GANs via distributional shift, and Goetschalckx et al. (2019) boosted the memorability of GANs by modulating the latent codes. Unlike them, our work quantitatively explores the emergence of hierarchical semantics inside the layer-wise generative representations. A closely relevant work, InterFaceGAN (Shen et al. 2020a), interpreted the latent space of GANs for diverse face editing. We differ from InterFaceGAN in the following three aspects. First, instead of examining the initial latent space, we study the layer-wise generative representations and reveal the semantic hierarchy learned for scene generation, which highly aligns with human perception. Second, scene images are far more complex than faces due to the large variety of scene categories as well as the objects inside, increasing the difficulty of interpreting scene synthesis models. Accordingly, unlike InterFaceGAN, which clearly knows the target semantics in advance, we employ a broad set of 105 semantics to serve as candidates for further analysis. Third, we propose a re-scoring technique to quantify how a particular variation factor is relevant to different layers of the generator. This also enables layer-wise manipulation, resulting in more precise control of scene editing.

2.3 Scene Manipulation

Editing scene images has been a long-standing task in the computer vision field. Laffont et al. (2014) defined 40 transient attributes and managed to transfer the appearance of a similar scene to the image for editing. Cheng et al. (2014) proposed verbal-guided image parsing to recognize and manipulate the objects in indoor scenes. Karacan et al. (2016) learned a conditional GAN to synthesize outdoor scenes based on pre-defined layouts and attributes. Bau et al. (2019) developed a technique to locally edit generated images based on the internal interpretation of GANs. Some other work (Liao et al. 2017; Zhu et al. 2017; Isola et al. 2017; Luan et al. 2017; Park et al. 2020) studied image-to-image translation and can be used to transfer the style of one scene to another. Besides, recent work (Abdal et al. 2019, 2020; Zhu et al. 2020) projected real images onto the latent space of a well-trained GAN generator and leveraged the GAN knowledge for image editing. Different from prior work, we achieve scene manipulation at multiple abstraction levels by reusing the knowledge from well-learned GAN models without any retraining.

2.4 Scene Understanding at Multiple Abstraction Levels

The abstraction levels of scene representations are inspired by prior literature on cognition studies of scene understanding. Oliva and Torralba (2001) proposed a computational model for a holistic representation (i.e., the shape of the scene) instead of individual objects or regions. Oliva and Torralba (2006) found that scene images are initially processed as a single entity and that local information about objects and parts comes into play at a later stage of visual processing. Torralba and Oliva (2003) demonstrated how scene categories could provide contextual information in the visual processing chain. Considering that scenes would have a multivariate attribute representation instead of simply a binary category membership, Patterson et al. (2014) advanced scene understanding into more fine-grained representations, i.e., scene attributes. In this work, we discover the semantic hierarchy learned by deep generative networks and manage to align the aforementioned various concepts with different layers in a hierarchy.

3 Variation Factors in Generative Representations

3.1 Multi-Level Variation Factors for Scene Synthesis

Imagine an artist drawing a picture of the living room. The very first step is to choose a perspective and set up the room
Fig. 2 Multi-level semantics extracted from two synthesized scenes (Color figure online)

3.2 Layer-Wise Generative Representations

In general, existing GANs take a randomly sampled latent code as the input and output an image synthesis. Such a mapping from the latent codes to the synthesized images is very similar to the feature extraction process in discriminative models. Accordingly, in this work, we treat the input latent code as the generative representation, which uniquely determines the appearance and properties of the output scene. On the other hand, recent state-of-the-art GAN models [e.g., StyleGAN (Karras et al. 2019) and BigGAN (Brock et al. 2018)] introduce layer-wise stochasticity, as shown in Fig. 3. We therefore treat them as per-layer generative representations.

To explore how GANs are able to produce high-quality scene synthesis by learning multi-level variation factors, as well as what role the generative representation of each layer plays in the generation process, this work aims at establishing the relationship between the variation factors and the generative representations. Karras et al. (2019) has already pointed out that the design of layer-wise stochasticity actually controls the synthesis from coarse to fine; however, what "coarse" and "fine" actually refer to still remains uncertain. To better align the variation factors with human perception, we separate them into four abstraction levels, including layout, categorical objects, scene attributes, and color scheme. We further propose a framework in Sect. 4 to quantify the causality between the input generative representations and the output variation factors. We surprisingly find that GANs synthesize a scene in a manner that is highly consistent with humans. Over all convolutional layers, GANs manage to organize these multi-level abstractions as a hierarchy. In particular, the GAN constructs the spatial layout at the early stage, synthesizes category-specific objects at the middle stage, and renders the scene attributes and color scheme at the later stage.

4 Identifying the Emergent Variation Factors

As described in Sect. 3, we target interpreting the latent semantics learned by scene synthesis models from four abstraction levels. Previous efforts on several scene understanding databases (Zhou et al. 2017; Xiao et al. 2010; Laffont et al. 2014; Patterson et al. 2014) enable a series of classifiers to predict scene attributes and categories. Besides, we also employ several classifiers focusing on layout detection (Zhang et al. 2019) and semantic segmentation (Xiao et al. 2018b).
Fig. 4 Pipeline of identifying the emergent variation factors in generative representations. By deploying a broad set of off-the-shelf image classifiers as scoring functions, F(·), we are able to assign a synthesized image with semantic scores corresponding to each candidate variation factor. For a particular concept, we learn a decision boundary in the latent space by considering it as a binary classification task. Then we move the sampled latent code towards the boundary to see how the semantic varies in the synthesis, and use a re-scoring technique to quantitatively verify the emergence of the target concept (Color figure online)

Specifically, given an image, we are able to use these classifiers to get the response scores with respect to various semantics. However, only predicting the semantic labels is far from identifying the variation factors that GANs have captured from the training data. More concretely, among all the candidate concepts, not all of them are meaningful to a particular model. For instance, "indoor lighting" will never happen in outdoor scenes such as bridge and tower, while "enclosed area" is always true for indoor scenes such as bedroom and kitchen. Accordingly, we come up with a method to quantitatively identify the most relevant and manipulable variation factors that emerge inside the learned generative representation. Figure 4 illustrates the identification process, which consists of two steps, i.e., probing (Sect. 4.1) and verification (Sect. 4.2). Such identification enables diverse scene manipulation (Sect. 4.3). Note that we use the same approach as InterFaceGAN (Shen et al. 2020b) to get the latent boundary for each candidate in the probing process in Sect. 4.1.

4.1 Probing Latent Space

The generator of a GAN, G(·), typically learns the mapping from the latent space Z to the image space X. Latent vectors z ∈ Z can be considered as the generative representations learned by the GAN. To study the emergence of variation factors inside Z, we need to first extract semantic information from z. For this purpose, we utilize the synthesized image, x = G(z), as an intermediate step and employ a broad set of image classifiers to help assign semantic scores to each sampled latent code z. Taking "indoor lighting" as an example, the scene attribute classifier is able to output the probability of how likely an input image has indoor lighting, which we use as the semantic score. Recall that we divide the scene representation into layout, object (category), and attribute levels; we therefore introduce a layout estimator, a scene category recognizer, and an attribute classifier to predict semantic scores at these abstraction levels respectively, forming a hierarchical semantic space S. After establishing the mapping from the latent space Z to the semantic space S, we search the decision boundary for each concept by treating it as a bi-classification problem, as shown in Fig. 4. Here, taking "indoor lighting" as an instance, the boundary separates the latent space Z into two sets, i.e., presence or absence of indoor lighting.

4.2 Verifying Manipulable Variation Factors

After probing the latent space with a broad set of candidate concepts, we still need to figure out which ones are most relevant to the generative model by acting as the variation factors. The key issue is how to define "relevance". We argue that if the target concept is manipulable from the latent space perspective (e.g., changing the indoor lighting status of the synthesized image by simply varying the latent code), the GAN model is considered as having captured such a variation factor during training.

As mentioned above, we have already obtained a separation boundary for each candidate. Let {n_i}_{i=1}^{C} denote the normal vectors of these boundaries, where C is the total number of candidates. For a certain boundary, if we move a latent code z along its normal direction (positive), the semantic score should also increase correspondingly. Therefore, we propose to re-score the varied latent code to quantify how relevant a variation factor is to the target model under analysis. As shown in Fig. 4, this process can be formulated as

\Delta s_i = \frac{1}{K} \sum_{k=1}^{K} \max\big( F_i(G(z_k + \lambda n_i)) - F_i(G(z_k)),\ 0 \big),   (1)

where the term \frac{1}{K} \sum_{k=1}^{K} stands for the average over K samples to make the metric more accurate, and λ is a fixed moving step. To make this metric comparable among all candidates, all normal vectors {n_i}_{i=1}^{C} are normalized to the fixed norm 1 and λ is set
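To make the probing and verification steps of Sects. 4.1 and 4.2 concrete, the sketch below shows one way to fit a semantic boundary and compute the perturbation score of Eq. (1). It is a minimal illustration under stated assumptions rather than the authors' released implementation: the helpers `sample_latents`, `generate` (standing in for G), and `score_semantic` (standing in for F_i), as well as the value of the moving step λ, are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Assumed helpers (not from the paper's code release):
#   sample_latents(n)      -> (n, latent_dim) array of latent codes z
#   generate(z)            -> images G(z) synthesized from latent codes
#   score_semantic(images) -> (n,) array of classifier responses F_i(G(z))

def search_boundary(latents, scores, num_per_side=2000):
    """Probing (Sect. 4.1): fit a linear SVM separating the latent codes with
    the highest semantic scores from those with the lowest, and return the
    unit normal vector n_i of the resulting decision boundary."""
    order = np.argsort(scores)
    positives = latents[order[-num_per_side:]]
    negatives = latents[order[:num_per_side]]
    data = np.concatenate([positives, negatives], axis=0)
    labels = np.concatenate([np.ones(num_per_side), np.zeros(num_per_side)])
    normal = LinearSVC(C=1.0).fit(data, labels).coef_.ravel()
    return normal / np.linalg.norm(normal)          # normalize to unit norm

def rescore(normal, step=2.0, num_samples=1000):
    """Verification (Sect. 4.2, Eq. 1): push K latent codes along the boundary
    normal and average the non-negative change of the semantic score.
    `step` plays the role of the fixed moving step lambda (value assumed)."""
    z = sample_latents(num_samples)
    before = score_semantic(generate(z))
    after = score_semantic(generate(z + step * normal))
    return np.maximum(after - before, 0.0).mean()
```

Candidates with a large re-scored gain are the ones the model has actually captured as manipulable variation factors; candidates whose score barely moves are filtered out.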
5.1.2 Scene Categories

Among the mentioned generator models, PGGAN and StyleGAN are trained on the LSUN dataset (Yu et al. 2015), while BigGAN is trained on the Places dataset (Zhou et al. 2017). To be specific, the LSUN dataset consists of 7 indoor scene categories and 3 outdoor scene categories, and the Places dataset contains 10 million images across 434 categories. For the PGGAN model, we use the officially released models, each of which is trained to synthesize scenes within an individual category of the LSUN dataset. For StyleGAN, only one model related to scene synthesis (i.e., bedroom) is released at this link. For a more thorough analysis, we use the official implementation to train multiple models on other scene categories, including both indoor scenes (living room, kitchen, restaurant) and outdoor scenes (bridge, church, tower). We also train a mixed model on the combination of images from bedroom, living room, and dining room with the same implementation. This model is specifically used for categorical analysis. For each StyleGAN model, Table 1 shows the category, the number of training samples, as well as the corresponding Fréchet inception distance (FID) (Heusel et al. 2017), which can reflect the synthesis quality to some extent. For BigGAN, we use the author's officially unofficial PyTorch BigGAN implementation to train a conditional generative model by taking the category label as the constraint on the Places dataset (Zhou et al. 2017). The resolution of the scene images synthesized by all of the above models is 256 × 256.

5.1.3 Semantic Classifiers

To extract semantics from the synthesized images, we employ various off-the-shelf image classifiers to assign these images with semantic scores at multiple abstraction levels, including layout, category, scene attribute, and color scheme. Specifically, we use (1) a layout estimator (Zhang et al. 2019), which predicts the spatial structure of an indoor place, (2) a scene category classifier (Zhou et al. 2017), which classifies a scene image into 365 categories, and (3) an attribute predictor (Zhou et al. 2017), which predicts the 102 pre-defined scene attributes in the SUN attribute database (Patterson et al. 2014). We also extract the color scheme of a scene image through its hue histogram in HSV space. Among them, the category classifier and the attribute predictor can directly output the probability of how likely an image belongs to a certain category or how likely an image has a particular attribute. As for the layout estimator, it only detects the outline structure of an indoor place, as shown in Fig. 6.

Fig. 6 The definition of layout for indoor scenes. Green lines represent the outline predicted by the layout estimator. The dashed line indicates the horizontal center, and the red point is the center point of the intersection line between two walls. The relative position between the vertical line and the center point is used to split the dataset (Color figure online)

5.1.4 Semantic Probing and Verification

Given a well-trained GAN model for analysis, we first generate a collection of synthesized scene images by randomly sampling N latent codes (500,000 in practice). Then, the aforementioned image classifiers are used to assign semantic scores for each visual concept. It is worth noting that we use the relative position between the image horizontal center and the intersection line of two walls to quantify layout, as shown in Fig. 6. After that, for each candidate, we select 2000 images with the highest response as positive samples, and another 2000 with the lowest response as negative ones. In particular, living room and bedroom are treated as positive and negative for the scene category of the mixed model, respectively. A linear SVM is trained by treating it as a bi-classification problem (i.e., the data are the sampled latent codes while the label is binary, indicating whether the target semantic appears in the corresponding synthesis or not) to get a linear decision boundary. Finally, we re-generate K = 1000 samples for semantic verification as described in Sect. 4.2.

5.2 Emerging Semantic Hierarchy

Humans typically interpret a scene in a hierarchy of semantics, from its layout, underlying objects, to the detailed attributes and the color scheme. Here the underlying objects refer to the set of objects most relevant to a specific category. This section shows that a GAN composes a scene over the layers in a similar way to human perception. To enable analysis on layout and object, we take the mixed StyleGAN model trained on indoor scenes as the target model. StyleGAN (Karras et al. 2019) learns a more disentangled latent space W on top of the conventional latent space Z. Specifically, for the ℓ-th layer, w ∈ W is linearly transformed to the layer-wise transformed latent code y(ℓ) with y(ℓ) = A(ℓ)w + b(ℓ), where A(ℓ) and b(ℓ) are the weight and bias for the style transformation respectively. We thus perform layer-wise analysis by studying y(ℓ) instead of z in Eq. (1).

To quantify the importance of each layer with respect to each variation factor, we use the re-scoring technique to identify the causality between the layer-wise generative representation y(ℓ) and the semantic emergence. The normalized
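The layer-wise analysis above replaces z in Eq. (1) with the per-layer codes y(ℓ). The sketch below illustrates the corresponding layer-wise editing, assuming the generator exposes a style-mixing-style interface that accepts one code per layer; the layer count, the layer groupings, and the `synthesize` helper are assumptions, not the paper's released API.

```python
import numpy as np

# Assumed interface (not the authors' released code):
#   synthesize(w_per_layer) renders an image from per-layer codes of shape
#   (NUM_LAYERS, latent_dim), as a style-mixing interface allows.

NUM_LAYERS = 14                          # e.g. a 256x256 StyleGAN generator (assumed)
LAYOUT_LAYERS = range(0, 4)              # "bottom" layers (illustrative grouping)
OBJECT_LAYERS = range(4, 8)              # "middle" layers
ATTRIBUTE_LAYERS = range(8, NUM_LAYERS)  # "upper" layers

def layerwise_edit(w, normal, step, layers):
    """Move the code along a semantic boundary only at the selected layers;
    all other layers keep the original w, so semantics tied to those layers
    are left untouched."""
    w_per_layer = np.tile(w, (NUM_LAYERS, 1))      # replicate w for every layer
    for layer in layers:
        w_per_layer[layer] = w + step * normal     # shift only the chosen layers
    return w_per_layer

# Example: strengthen "indoor lighting" while leaving layout and objects alone.
# image = synthesize(layerwise_edit(w, lighting_normal, step=2.0,
#                                   layers=ATTRIBUTE_LAYERS))
```

Restricting the shift to a layer group is what makes the layer-wise comparisons (e.g., editing only the attribute-relevant layers versus all layers) possible without retraining the generator.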
Fig. 8 User study on how different layers correspond to variation factors from different abstraction levels (Color figure online)
generative representation, i.e., from layout, object (category),
Fig. 10 a Independent attribute manipulation results on upper layers. The middle row shows the source images. We are able to both decrease (top row) and increase (bottom row) the variation factors in the images. b Joint manipulation results, where the layout, objects, and attribute are manipulated at the proper layers. The first column indicates the source images and the middle three columns are the independently manipulated images (Color figure online)
Fig. 11 Comparison of the top scene attributes identified in the generative representations learned by StyleGAN models for synthesizing different
scenes. Vertical axis shows the perturbation score Δsi (Color figure online)
kitchen). Figure 11 shows the top-10 relevant semantics for each model. It can be seen that "sunny" has high scores on all outdoor categories, while "lighting" has high scores on all indoor categories. Furthermore, "boating" is identified for the bridge model, "touring" for church and tower, "reading" for living room, "eating" for kitchen, and "socializing" for restaurant. These results are highly consistent with human understanding and perception, suggesting the effectiveness of the proposed quantification method.

5.4.2 Attribute Manipulation

Recall the three types of manipulation in Sect. 4.3: independent manipulation, joint manipulation, and jittering manipulation. We first conduct independent manipulation on 3 indoor and 3 outdoor scenes with the most relevant scene attributes identified with our approach. Figure 12 shows the results, where the original synthesis (left image in each pair) is manipulated along the positive (right) direction. We can tell that the edited images are still of high quality and the target attributes indeed change as desired. We then jointly manipulate two attributes with the bridge synthesis model, as shown in Fig. 13. The central image of the 3 × 3 image grid is the original synthesis; the second row and the second column show the independent manipulation results with respect to the "vegetation" and "cloud" attributes respectively, while the other images at the four corners are the joint manipulation results. It turns out that we achieve good control over these two semantics and they seem to barely affect each other. However, not all variation factors show such strong disentanglement. From this point of view, our approach also provides a new metric to help measure the entanglement between two variation factors, which will be discussed in Sect. 6. Finally, we evaluate the proposed jittering manipulation by introducing noise into the "cloud" manipulation. From Fig. 14, we observe that the newly introduced noise indeed increases the manipulation diversity.
Fig. 12 Independent manipulation results on StyleGAN models trained for synthesizing indoor and outdoor scenes. In each pair of images, the
first is the original synthesized sample and the second is the one after manipulating a certain semantic (Color figure online)
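As a hedged sketch of the joint and jittering manipulations described in Sect. 5.4.2: joint editing moves a latent code along several boundary normals at once (as in the 3 × 3 grid of Fig. 13), while jittering adds random noise to a single edit to diversify its outcome (as in Fig. 14). The helper `generate`, the noise scale, and the step sizes are assumptions for illustration.

```python
import numpy as np

# Assumed helper: generate(z) renders an image from latent code z.
# n_cloud and n_vegetation denote unit boundary normals found as in Sect. 4.1.

def joint_edit(z, normals, steps):
    """Joint manipulation: shift a latent code along several semantic
    boundaries simultaneously (e.g. 'cloud' and 'vegetation')."""
    z_edit = np.array(z, dtype=float, copy=True)
    for normal, step in zip(normals, steps):
        z_edit += step * normal
    return z_edit

def jitter_edit(z, normal, step, noise_std=0.5, num_variants=5, seed=0):
    """Jittering manipulation: perturb the edited code with random noise so
    that a single edit (e.g. adding cloud) yields diverse results."""
    rng = np.random.default_rng(seed)
    return [z + step * normal + rng.normal(scale=noise_std, size=z.shape)
            for _ in range(num_variants)]

# A 3x3 grid like Fig. 13: vary the two attributes independently and jointly.
# grid = [generate(joint_edit(z, [n_cloud, n_vegetation], [a, b]))
#         for a in (-2, 0, 2) for b in (-2, 0, 2)]
```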
Fig. 13 Joint manipulation results along both the cloud and vegetation boundaries with the bridge synthesis model. Along the vertical and horizontal axes, the original synthesis (the central image) is manipulated with respect to the vegetation and cloud attributes respectively (Color figure online)
It is interesting that the introduced randomness may not only affect the shape of the added cloud, but also change the appearance of the synthesized tower. In both cases, however, the primary goal, which is to edit the cloudiness, is preserved.

5.5 Ablation Studies

5.5.1 Re-scoring Technique

Before performing the proposed re-scoring technique, there are two preceding steps, namely (1) assigning semantic scores to synthesized samples, and (2) training SVM classifiers to search for the semantic boundary. We would like to verify the necessity of the re-scoring technique in identifying manipulable semantics. The ablation study is conducted on the StyleGAN model trained for synthesizing bedrooms. As shown in Fig. 15, the left figure sorts the scene attributes by how many samples are labelled as positive, the middle figure sorts by the accuracy of the trained SVM classifiers, and the right figure sorts by our proposed quantification metric.

In the left figure, "no horizon", "man-made", and "enclosed area" are the attributes with the highest percentage. However, all three are default properties of the bedroom and thus not manipulable. On the contrary, with the re-scoring technique for verification, our method successfully filters out these invariable candidates and reveals more meaningful semantics, like "wood" and "indoor lighting". In addition, our method also manages to identify some less frequent but actually manipulable scene attributes, such as "cluttered space".

In the middle figure, almost all attributes get similar scores, making them indistinguishable. In fact, even the worst SVM classifier (i.e., "railroad") achieves 72.3% accuracy. That is because, even when some variation factors are not encoded in the latent representation (in other words, not manipulable), the corresponding attribute classifier still assigns synthesized images different scores. Training an SVM on such inaccurate data can still yield a separation boundary, even though it does not correspond to the target concept. Therefore, relying only on the SVM classifier is not enough to detect the relevant variation factors. By contrast, our method pays more attention to the score modulation after varying the latent code, which is not biased by the initial response of the attribute classifier or the performance of the SVM. As a result, we are able to thoroughly yet precisely detect the variation factors in the latent space from a broad candidate set.

5.5.2 Layer-Wise Manipulation

To further validate the emergence of the semantic hierarchy, we conduct an ablation study on layer-wise manipulation with the StyleGAN model. First, we select "indoor lighting" as the target semantic, and vary the latent code only on the upper (attribute-relevant) layers versus on all layers. We can easily tell from Fig. 16 that when manipulating "indoor lighting" at all layers, the objects inside the room are also changed. By contrast, manipulating latent codes only at the attribute-relevant layers can satisfyingly increase the indoor lighting without affecting other factors.
Fig. 14 Jittering manipulation results with the tower synthesis model for the cloud attribute. Specifically, the movement of the synthesized image in the latent space is disturbed. Thus, when the cloud appears, both the shape of the added cloud and the appearance of the generated tower change. The top-left image of each of the two samples is the original output, while the rest are the results under jittering manipulation (Color figure online)
Fig. 15 Ablation study on the proposed re-scoring technique with the StyleGAN model for bedroom synthesis. The left figure sorts the scene attributes by the percentage of samples with positive scores, the middle figure sorts by the accuracy of the SVM classifiers, and the right figure sorts by our proposed metric (Color figure online)
Second, we select the bottom layers as the target layers, and select boundaries from all four abstraction levels for manipulation. As shown in Fig. 17, no matter what level of semantics we choose, as long as the latent code is modified at the bottom (layout-relevant) layers, only the layout varies, rather than semantics at other abstraction levels. These two experiments further verify our discovery about the emergence of the semantic hierarchy, i.e., that the early layers tend to determine the spatial layout and configuration rather than semantics at other abstraction levels.

6 Discussions

6.1 Disentanglement of Semantics

Some variation factors we detect in the generative representation are more disentangled from each other than others. Compared to the perceptual path length and linear separability described in Karras et al. (2019) and the cosine similarity proposed in Shen et al. (2020a), our work offers a new metric for disentanglement analysis. In particular, we move the latent code along one semantic direction and then check how the semantic scores of other factors change accordingly. As shown in Fig. 18a, when the spatial layout is modified, all attributes are barely affected, suggesting that the GAN learns to disentangle layout-level semantics from attribute-level ones. However, there are also some scene attributes (from the same abstraction level) entangled with each other. Taking Fig. 18c as an example, when modulating "indoor lighting", "natural lighting" also varies. This is also aligned with human perception, further demonstrating the effectiveness of our proposed quantification metric. Qualitative results are also included in Fig. 18d–f.

6.2 Application to Other GANs

We further apply our method to two other GAN structures, i.e., PGGAN (Karras et al. 2017) and BigGAN (Brock et al. 2018). These two models are trained on the LSUN dataset (Yu et al. 2015) and the Places dataset (Zhou et al. 2017) respectively. Compared to StyleGAN, PGGAN feeds the latent vector only to the very first convolutional layer and hence does not support layer-wise analysis. But the proposed re-scoring method can still be applied to help identify manipulatable semantics, as shown in Fig. 19a. BigGAN is the state-of-the-art conditional GAN model that concatenates the latent vector with a class-guided embedding code before feeding it to the generator, and it also allows layer-wise analysis like StyleGAN. Figure 19b gives analysis results on BigGAN at the attribute level, where we can tell that scene attributes are best modified at the upper layers compared to the lower layers or all layers.
Fig. 16 Comparison results between manipulating latent codes at only upper (attribute-relevant) layers and manipulating latent codes at all layers
with respect to indoor lighting on StyleGAN (Color figure online)
Fig. 17 Manipulation at the bottom layers in 4 different directions, along the directions of layout, objects (category), indoor lighting, and color
scheme on StyleGAN (Color figure online)
Fig. 18 a–c Quantitative effects on scene attributes (already sorted). Vertical axis shows the perturbation score Δsi in log scale. d–f Qualitative
results also show the effect when varying the most relevant factor (Color figure online)
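The entanglement measurement of Sect. 6.1 (plotted in Fig. 18a–c) can be sketched as follows: move latent codes along the boundary direction of one factor and record how the scores of every other factor respond. This is a hedged illustration only; the helpers `sample_latents` and `generate`, and the per-factor scoring functions standing in for the classifiers F_j, are assumptions.

```python
import numpy as np

# Assumed helpers: sample_latents(n) draws latent codes, generate(z) renders
# images, and score_fns maps each factor name to a classifier response F_j.

def entanglement_profile(normal_i, score_fns, step=2.0, num_samples=1000):
    """Move codes along the direction of factor i and report the average
    non-negative score change of every factor j (same form as Eq. 1).
    Large values for j != i indicate that j is entangled with factor i."""
    z = sample_latents(num_samples)
    images_before = generate(z)
    images_after = generate(z + step * normal_i)
    deltas = {}
    for name, score_fn in score_fns.items():
        gain = score_fn(images_after) - score_fn(images_before)
        deltas[name] = float(np.maximum(gain, 0.0).mean())
    return deltas
```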
As for the BigGAN model with 256 × 256 resolution, there are in total 12 convolutional layers. As the category information is already encoded in the "class" code, we only separate the layers into two groups, namely the lower (bottom 6) and upper (top 6) layers. Meanwhile, the quantitative curve shows a result consistent with the discovery on StyleGAN in Fig. 7a. These results demonstrate the generalization ability of our approach as well as the emergence of manipulatable factors in other GANs.

6.3 Limitation

There are several limitations for future improvement. (1) More thorough and precise off-the-shelf classifiers: although we collect as many visual concepts as possible and summarize them into various levels following prior work, such as layout in Oliva and Torralba (2001), category in Torralba and Oliva (2003), and attribute in Patterson et al. (2014), such classifiers remain to be improved together with the development of scene understanding. In case the defined broad set of semantics is not enough, we could further enlarge the dictionary following the standard annotation pipeline in Zhou et al. (2017) and Patterson et al. (2014). In addition, such classifiers trained on large-scale scene understanding benchmarks could be replaced by more powerful discriminative models to improve the accuracy. (2) Boundary search: for simplicity, we only use a linear SVM for semantic boundary search. This limits our framework from interpreting latent semantic subspaces with more complex and nonlinear structure. (3) Generalization beyond scene understanding: the main purpose of this work is to interpret scene-related GANs, which is a challenging task considering the large diversity of scene images as well as the difficulty of scene understanding.
Fig. 19 a Some variation factors identified from PGGAN (bedroom). b Layer-wise analysis on BigGAN from the attribute level (Color figure
online)
However, these abstraction levels can be hard to generalize to other datasets beyond scenes. Even so, we believe that this work is still able to provide some insights on analyzing GAN models trained on other datasets. For example, for scene synthesis, we found that the early layers control the scene layout, which can be viewed as structural information, such as rotation. Accordingly, we can fairly generalize that the early layers of face synthesis models control the face pose and the early layers of car synthesis models control the car orientation.

7 Conclusion

In this paper, we show the emergence of highly-structured variation factors inside the deep generative representations learned by GANs with layer-wise stochasticity. In particular, when trained to synthesize scenes, the GAN model spontaneously learns to set up the layout at early layers, generate categorical objects at middle layers, and render the scene attributes and color scheme at later layers. A re-scoring method is proposed to quantitatively identify the manipulatable semantic concepts within a well-trained model, enabling photo-realistic scene manipulation. We will explore extending this manipulation capability of GANs to real image editing in future work.

Acknowledgements This work is supported by Early Career Scheme (ECS) through the Research Grants Council (RGC) of Hong Kong under Grant No. 24206219 and CUHK FoE RSFS Grant (No. 3133233).

References

Abdal, R., Qin, Y., & Wonka, P. (2019). Image2stylegan: How to embed images into the stylegan latent space? In: International conference on computer vision (pp. 4432–4441).
Abdal, R., Qin, Y., & Wonka, P. (2020). Image2stylegan++: How to edit the embedded images? In: IEEE conference on computer vision and pattern recognition (pp. 8296–8305).
Agrawal, P., Girshick, R., & Malik, J. (2014). Analyzing the performance of multilayer neural networks for object recognition. In: European conference on computer vision (pp. 329–344). Springer.
Alain, G., & Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. In: International conference on learning representations workshop.
Bau, D., Strobelt, H., Peebles, W., Wulff, J., Zhou, B., Zhu, J.-Y., & Torralba, A. (2019). Semantic photo manipulation with a generative image prior. ACM Transactions on Graphics, 38(4), 59.
Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In: IEEE conference on computer vision and pattern recognition (pp. 6541–6549).
Bau, D., Zhu, J. Y., Strobelt, H., Zhou, B., Tenenbaum, J. B., Freeman, W. T., & Torralba, A. (2018). Gan dissection: Visualizing and understanding generative adversarial networks. In: International conference on learning representations.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.
Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale gan training for high fidelity natural image synthesis. In: International conference on learning representations.
Cheng, M. M., Zheng, S., Lin, W. Y., Vineet, V., Sturgess, P., Crook, N., et al. (2014). Imagespirit: Verbal guided image parsing. ACM Transactions on Graphics, 34(1), 1–11.
Choi, Y., Choi, M., Kim, M., Ha, J. W., Kim, S., & Choo, J. (2018). Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: IEEE conference on computer vision and pattern recognition (pp. 8789–8797).
Goetschalckx, L., Andonian, A., Oliva, A., & Isola, P. (2019). Ganalyze: Toward visual definitions of cognitive image properties. In: Proceedings of the IEEE international conference on computer vision (pp. 5744–5753).
Gonzalez-Garcia, A., Modolo, D., & Ferrari, V. (2018). Do semantic parts emerge in convolutional neural networks? International Journal of Computer Vision, 126(5), 476–494.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In: Advances in neural information processing systems (pp. 2672–2680).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in neural information processing systems (pp. 6626–6637).
Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In: IEEE conference on computer vision and pattern recognition (pp. 1125–1134).
Jahanian, A., Chai, L., & Isola, P. (2019). On the "steerability" of generative adversarial networks. In: International conference on learning representations.
Karacan, L., Akata, Z., Erdem, A., & Erdem, E. (2016). Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215.
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. In: International conference on learning representations.
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In: IEEE conference on computer vision and pattern recognition (pp. 4401–4410).
Laffont, P. Y., Ren, Z., Tao, X., Qian, C., & Hays, J. (2014). Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics, 33(4), 1–11.
Liao, J., Yao, Y., Yuan, L., Hua, G., & Kang, S. B. (2017). Visual attribute transfer through deep image analogy. ACM Transactions on Graphics, 36(4), 120.
Luan, F., Paris, S., Shechtman, E., & Bala, K. (2017). Deep photo style transfer. In: IEEE conference on computer vision and pattern recognition (pp. 4990–4998).
Mahendran, A., & Vedaldi, A. (2015). Understanding deep image representations by inverting them. In: IEEE conference on computer vision and pattern recognition (pp. 5188–5196).
Morcos, A. S., Barrett, D. G., Rabinowitz, N. C., & Botvinick, M. (2018). On the importance of single directions for generalization. In: International conference on learning representations.
Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., & Clune, J. (2016). Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In: Advances in neural information processing systems (pp. 3387–3395).
Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., & Yang, Y. L. (2019). Hologan: Unsupervised learning of 3D representations from natural images. In: International conference on computer vision (pp. 7588–7597).
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Oliva, A., & Torralba, A. (2006). Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155, 23–36.
Park, T., Liu, M. Y., Wang, T. C., & Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In: IEEE conference on computer vision and pattern recognition (pp. 2337–2346).
Park, T., Zhu, J.-Y., Wang, O., Lu, J., Shechtman, E., Efros, A. A., & Zhang, R. (2020). Swapping autoencoder for deep image manipulation. In: Advances in neural information processing systems.
Patterson, G., Xu, C., Su, H., & Hays, J. (2014). The sun attribute database: Beyond categories for deeper scene understanding. International Journal of Computer Vision, 108(1–2), 59–81.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. In: International conference on learning representations.
Shaham, T. R., Dekel, T., & Michaeli, T. (2019). Singan: Learning a generative model from a single natural image. In: International conference on computer vision (pp. 4570–4580).
Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020a). Interpreting the latent space of gans for semantic face editing. In: IEEE conference on computer vision and pattern recognition (pp. 9243–9252).
Shen, Y., Luo, P., Yan, J., Wang, X., & Tang, X. (2018). Faceid-gan: Learning a symmetry three-player gan for identity-preserving face synthesis. In: IEEE conference on computer vision and pattern recognition (pp. 821–830).
Shen, Y., Yang, C., Tang, X., & Zhou, B. (2020b). InterFaceGAN: Interpreting the disentangled face representation learned by GANs. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2020.3034267.
Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In: Workshop at international conference on learning representations.
Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14(3), 391–412.
Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional gans. In: IEEE conference on computer vision and pattern recognition (pp. 8798–8807).
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE computer society conference on computer vision and pattern recognition (pp. 3485–3492). IEEE.
Xiao, T., Hong, J., & Ma, J. (2018a). Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. In: European conference on computer vision (pp. 168–184).
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018b). Unified perceptual parsing for scene understanding. In: Proceedings of the European conference on computer vision (ECCV) (pp. 418–434).
Yao, S., Hsu, T. M., Zhu, J. Y., Wu, J., Torralba, A., Freeman, B., & Tenenbaum, J. (2018). 3D-aware scene manipulation via inverse graphics. In: Advances in neural information processing systems (pp. 1887–1898).
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In: Advances in neural information processing systems (pp. 3320–3328).
Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., & Xiao, J. (2015). Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In: European conference on computer vision (pp. 818–833). Springer.
Zhang, W., Zhang, W., & Gu, J. (2019). Edge-semantic learning strategy for layout estimation in indoor environment. IEEE Transactions on Cybernetics, 50(6), 2730–2739.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene cnns. In: International conference on learning representations.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1452–1464.
Zhu, J., Shen, Y., Zhao, D., & Zhou, B. (2020). In-domain gan inversion for real image editing. In: European conference on computer vision.
Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: International conference on computer vision (pp. 2223–2232).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.