Unsupervised Domain Adaptation by Backpropagation

Yaroslav Ganin (ganin@skoltech.ru)
Victor Lempitsky (lempitsky@skoltech.ru)
Skolkovo Institute of Science and Technology (Skoltech)
arXiv:1409.7495v2 [stat.ML] 27 Feb 2015

Abstract

Top-performing deep architectures are trained on massive amounts of labeled data. In the absence of labeled data for a certain task, domain adaptation often provides an attractive option given that labeled data of similar nature but from a different domain (e.g. synthetic images) are available. Here, we propose a new approach to domain adaptation in deep architectures that can be trained on a large amount of labeled data from the source domain and a large amount of unlabeled data from the target domain (no labeled target-domain data is necessary).

As the training progresses, the approach promotes the emergence of “deep” features that are (i) discriminative for the main learning task on the source domain and (ii) invariant with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a simple new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation.

Overall, the approach can be implemented with little effort using any of the deep-learning packages. The method performs very well in a series of image classification experiments, achieving an adaptation effect in the presence of big domain shifts and outperforming the previous state of the art on the Office datasets.

1. Introduction

Deep feed-forward architectures have brought impressive advances to the state of the art across a wide variety of machine-learning tasks and applications. At the moment, however, these leaps in performance come only when a large amount of labeled training data is available. At the same time, for problems lacking labeled data, it may still be possible to obtain training sets that are big enough for training large-scale deep models, but that suffer from the shift in data distribution from the actual data encountered at “test time”. One particularly important example is synthetic or semi-synthetic training data, which may come in abundance and be fully labeled, but which inevitably have a distribution that is different from real data (Liebelt & Schmid, 2010; Stark et al., 2010; Vázquez et al., 2014; Sun & Saenko, 2014).

Learning a discriminative classifier or other predictor in the presence of a shift between training and test distributions is known as domain adaptation (DA). A number of approaches to domain adaptation have been suggested in the context of shallow learning, e.g. in the situation when data representation/features are given and fixed. The proposed approaches then build the mappings between the source (training-time) and the target (test-time) domains, so that the classifier learned for the source domain can also be applied to the target domain, when composed with the learned mapping between domains. The appeal of the domain adaptation approaches is the ability to learn a mapping between domains in the situation when the target domain data are either fully unlabeled (unsupervised domain adaptation) or have few labeled samples (semi-supervised domain adaptation). Below, we focus on the harder unsupervised case, although the proposed approach can be generalized to the semi-supervised case rather straightforwardly.

Unlike most previous papers on domain adaptation that worked with fixed feature representations, we focus on combining domain adaptation and deep feature learning within one training process (deep domain adaptation). Our goal is to embed domain adaptation into the process of learning representation, so that the final classification decisions are made based on features that are both discriminative and invariant to the change of domains, i.e. have the same or very similar distributions in the source and the target domains. In this way, the obtained feed-forward network can be applicable to the target domain without being hindered by the shift between the two domains.

We thus focus on learning features that combine (i) discriminativeness and (ii) domain-invariance. This is achieved by jointly optimizing the underlying features as well as two discriminative classifiers operating on these features: (i) the label predictor that predicts class labels and is used both during training and at test time and (ii) the
domain classifier that discriminates between the source and the target domains during training. While the parameters of the classifiers are optimized in order to minimize their error on the training set, the parameters of the underlying deep feature mapping are optimized in order to minimize the loss of the label classifier and to maximize the loss of the domain classifier. The latter encourages domain-invariant features to emerge in the course of the optimization.

Crucially, we show that all three training processes can be embedded into an appropriately composed deep feed-forward network (Figure 1) that uses standard layers and loss functions, and can be trained using standard backpropagation algorithms based on stochastic gradient descent or its modifications (e.g. SGD with momentum). Our approach is generic as it can be used to add domain adaptation to any existing feed-forward architecture that is trainable by backpropagation. In practice, the only non-standard component of the proposed architecture is a rather trivial gradient reversal layer that leaves the input unchanged during forward propagation and reverses the gradient by multiplying it by a negative scalar during the backpropagation.

Below, we detail the proposed approach to domain adaptation in deep architectures, and present results on traditional deep learning image datasets (such as MNIST (LeCun et al., 1998) and SVHN (Netzer et al., 2011)) as well as on OFFICE benchmarks (Saenko et al., 2010), where the proposed method considerably improves over previous state-of-the-art accuracy.

2. Related work

A large number of domain adaptation methods have been proposed over the recent years, and here we focus on the most related ones. Multiple methods perform unsupervised domain adaptation by matching the feature distributions in the source and the target domains. Some approaches perform this by reweighing or selecting samples from the source domain (Borgwardt et al., 2006; Huang et al., 2006; Gong et al., 2013), while others seek an explicit feature space transformation that would map the source distribution into the target one (Pan et al., 2011; Gopalan et al., 2011; Baktashmotlagh et al., 2013). An important aspect of the distribution matching approach is the way the (dis)similarity between distributions is measured. Here, one popular choice is matching the distribution means in the reproducing kernel Hilbert space (Borgwardt et al., 2006; Huang et al., 2006), whereas (Gong et al., 2012; Fernando et al., 2013) map the principal axes associated with each of the distributions. Our approach also attempts to match feature space distributions; however, this is accomplished by modifying the feature representation itself rather than by reweighing or geometric transformation. Also, our method uses (implicitly) a rather different way to measure the disparity between distributions based on their separability by a deep discriminatively-trained classifier.

Several approaches perform gradual transition from the source to the target domain (Gopalan et al., 2011; Gong et al., 2012) by a gradual change of the training distribution. Among these methods, (S. Chopra & Gopalan, 2013) does this in a “deep” way by the layerwise training of a sequence of deep autoencoders, while gradually replacing source-domain samples with target-domain samples. This improves over a similar approach of (Glorot et al., 2011) that simply trains a single deep autoencoder for both domains. In both approaches, the actual classifier/predictor is learned in a separate step using the feature representation learned by autoencoder(s). In contrast to (Glorot et al., 2011; S. Chopra & Gopalan, 2013), our approach performs feature learning, domain adaptation and classifier learning jointly, in a unified architecture, and using a single learning algorithm (backpropagation). We therefore argue that our approach is simpler (both conceptually and in terms of its implementation). Our method also achieves considerably better results on the popular OFFICE benchmark.

While the above approaches perform unsupervised domain adaptation, there are approaches that perform supervised domain adaptation by exploiting labeled data from the target domain. In the context of deep feed-forward architectures, such data can be used to “fine-tune” the network trained on the source domain (Zeiler & Fergus, 2013; Oquab et al., 2014; Babenko et al., 2014). Our approach does not require labeled target-domain data. At the same time, it can easily incorporate such data when it is available.

An idea related to ours is described in (Goodfellow et al., 2014). While their goal is quite different (building generative deep networks that can synthesize samples), the way they measure and minimize the discrepancy between the distribution of the training data and the distribution of the synthesized data is very similar to the way our architecture measures and minimizes the discrepancy between feature distributions for the two domains.

Finally, a recent and concurrent report by (Tzeng et al., 2014) also focuses on domain adaptation in feed-forward networks. Their set of techniques measures and minimizes the distance of the data means across domains. This approach may be regarded as a “first-order” approximation to our approach, which seeks a tighter alignment between distributions.

3. Deep Domain Adaptation

3.1. The model

We now detail the proposed model for the domain adaptation. We assume that the model works with input samples x ∈ X, where X is some input space, and certain labels (output) y from the label space Y. Below, we assume classification problems where Y is a finite set (Y = {1, 2, ..., L}); however, our approach is generic and can handle any output label space that other deep feed-forward models can handle. We further assume that there exist two distributions S(x, y) and T(x, y) on X ⊗ Y, which will be referred to as the source distribution and the target distribution (or the source domain and the target domain). Both distributions are assumed complex and unknown, and furthermore similar but different (in other words, S is “shifted” from T by some domain shift).

Our ultimate goal is to be able to predict labels y given the input x for the target distribution. At training time, we have access to a large set of training samples {x1, x2, ..., xN} from both the source and the target domains distributed according to the marginal distributions S(x) and T(x). We denote with di the binary variable (domain label) for the i-th example, which indicates whether xi comes from the source distribution (xi ∼ S(x) if di = 0) or from the target distribution (xi ∼ T(x) if di = 1). For the examples from the source distribution (di = 0) the corresponding labels yi ∈ Y are known at training time. For the examples from the target domain, we do not know the labels at training time, and we want to predict such labels at test time.

We now define a deep feed-forward architecture that for each input x predicts its label y ∈ Y and its domain label d ∈ {0, 1}. We decompose such mapping into three parts. We assume that the input x is first mapped by a mapping Gf (a feature extractor) to a D-dimensional feature vector f ∈ R^D. The feature mapping may also include several feed-forward layers and we denote the vector of parameters of all layers in this mapping as θf, i.e. f = Gf(x; θf). Then, the feature vector f is mapped by a mapping Gy (label predictor) to the label y, and we denote the parameters of this mapping with θy. Finally, the same feature vector f is mapped to the domain label d by a mapping Gd (domain classifier) with the parameters θd (Figure 1).

Figure 1. The proposed architecture includes a deep feature extractor (green) and a deep label predictor (blue), which together form a standard feed-forward architecture. Unsupervised domain adaptation is achieved by adding a domain classifier (red) connected to the feature extractor via a gradient reversal layer that multiplies the gradient by a certain negative constant during the backpropagation-based training. Otherwise, the training proceeds in a standard way and minimizes the label prediction loss (for source examples) and the domain classification loss (for all samples). Gradient reversal ensures that the feature distributions over the two domains are made similar (as indistinguishable as possible for the domain classifier), thus resulting in the domain-invariant features.

During the learning stage, we aim to minimize the label prediction loss on the annotated part (i.e. the source part) of the training set, and the parameters of both the feature extractor and the label predictor are thus optimized in order to minimize the empirical loss for the source domain samples. This ensures the discriminativeness of the features f and the overall good prediction performance of the combination of the feature extractor and the label predictor on the source domain.

At the same time, we want to make the features f domain-invariant. That is, we want to make the distributions S(f) = {Gf(x; θf) | x ∼ S(x)} and T(f) = {Gf(x; θf) | x ∼ T(x)} similar. Under the covariate shift assumption, this would make the label prediction accuracy on the target domain the same as on the source domain (Shimodaira, 2000). Measuring the dissimilarity of the distributions S(f) and T(f) is however non-trivial, given that f is high-dimensional, and that the distributions themselves are constantly changing as learning progresses. One way to estimate the dissimilarity is to look at the loss of the domain classifier Gd, provided that the parameters θd of the domain classifier have been trained to discriminate between the two feature distributions in an optimal way.

This observation leads to our idea. At training time, in order to obtain domain-invariant features, we seek the parameters θf of the feature mapping that maximize the loss of the domain classifier (by making the two feature distributions as similar as possible), while simultaneously seeking the parameters θd of the domain classifier that minimize the loss of the domain classifier. In addition, we seek to minimize the loss of the label predictor.
More formally, we consider the functional:

$$E(\theta_f, \theta_y, \theta_d) = \sum_{\substack{i=1..N \\ d_i=0}} L_y\big(G_y(G_f(x_i; \theta_f); \theta_y),\, y_i\big) - \lambda \sum_{i=1..N} L_d\big(G_d(G_f(x_i; \theta_f); \theta_d),\, d_i\big) = \sum_{\substack{i=1..N \\ d_i=0}} L_y^i(\theta_f, \theta_y) - \lambda \sum_{i=1..N} L_d^i(\theta_f, \theta_d) \tag{1}$$

Here, Ly(·, ·) is the loss for label prediction (e.g. multinomial), Ld(·, ·) is the loss for the domain classification (e.g. logistic), while L^i_y and L^i_d denote the corresponding loss functions evaluated at the i-th training example. Note that the domain loss compares the domain classifier output with the domain label di.

Based on our idea, we are seeking the parameters θ̂f, θ̂y, θ̂d that deliver a saddle point of the functional (1):

$$(\hat\theta_f, \hat\theta_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat\theta_d) \tag{2}$$

$$\hat\theta_d = \arg\max_{\theta_d} E(\hat\theta_f, \hat\theta_y, \theta_d) \tag{3}$$

At the saddle point, the parameters θd of the domain classifier minimize the domain classification loss (since it enters into (1) with the minus sign) while the parameters θy of the label predictor minimize the label prediction loss. The feature mapping parameters θf minimize the label prediction loss (i.e. the features are discriminative), while maximizing the domain classification loss (i.e. the features are domain-invariant). The parameter λ controls the trade-off between the two objectives that shape the features during learning.

Below, we demonstrate that standard stochastic gradient solvers (SGD) can be adapted for the search of the saddle point (2)-(3).

3.2. Optimization with backpropagation

A saddle point (2)-(3) can be found as a stationary point of the following stochastic updates:

$$\theta_f \longleftarrow \theta_f - \mu \left( \frac{\partial L_y^i}{\partial \theta_f} - \lambda \frac{\partial L_d^i}{\partial \theta_f} \right) \tag{4}$$

$$\theta_y \longleftarrow \theta_y - \mu \frac{\partial L_y^i}{\partial \theta_y} \tag{5}$$

$$\theta_d \longleftarrow \theta_d - \mu \frac{\partial L_d^i}{\partial \theta_d} \tag{6}$$

where µ is the learning rate (which can vary over time).

The updates (4)-(6) are very similar to stochastic gradient descent (SGD) updates for a feed-forward deep model that comprises a feature extractor fed into the label predictor and into the domain classifier. The difference is the −λ factor in (4) (the difference is important, as without such factor, stochastic gradient descent would try to make features dissimilar across domains in order to minimize the domain classification loss). Although direct implementation of (4)-(6) as SGD is not possible, it is highly desirable to reduce the updates (4)-(6) to some form of SGD, since SGD (and its variants) is the main learning algorithm implemented in most packages for deep learning.

Fortunately, such reduction can be accomplished by introducing a special gradient reversal layer (GRL) defined as follows. The gradient reversal layer has no parameters associated with it (apart from the meta-parameter λ, which is not updated by backpropagation). During the forward propagation, GRL acts as an identity transform. During the backpropagation though, GRL takes the gradient from the subsequent level, multiplies it by −λ and passes it to the preceding layer. Implementing such a layer using existing object-oriented packages for deep learning is simple, as defining procedures for forwardprop (identity transform), backprop (multiplying by a constant), and parameter update (nothing) is trivial.

The GRL as defined above is inserted between the feature extractor and the domain classifier, resulting in the architecture depicted in Figure 1. As the backpropagation process passes through the GRL, the partial derivatives of the loss that is downstream of the GRL (i.e. Ld) w.r.t. the layer parameters that are upstream of the GRL (i.e. θf) get multiplied by −λ, i.e. ∂Ld/∂θf is effectively replaced with −λ ∂Ld/∂θf. Therefore, running SGD in the resulting model implements the updates (4)-(6) and converges to a saddle point of (1).

Mathematically, we can formally treat the gradient reversal layer as a “pseudo-function” Rλ(x) defined by two (incompatible) equations describing its forward- and backpropagation behaviour:

$$R_\lambda(x) = x \tag{7}$$

$$\frac{dR_\lambda}{dx} = -\lambda I \tag{8}$$

where I is an identity matrix. We can then define the objective “pseudo-function” of (θf, θy, θd) that is being optimized by the stochastic gradient descent within our method:

$$\tilde{E}(\theta_f, \theta_y, \theta_d) = \sum_{\substack{i=1..N \\ d_i=0}} L_y\big(G_y(G_f(x_i; \theta_f); \theta_y),\, y_i\big) + \sum_{i=1..N} L_d\big(G_d(R_\lambda(G_f(x_i; \theta_f)); \theta_d),\, d_i\big) \tag{9}$$

Running updates (4)-(6) can then be implemented as doing SGD for (9) and leads to the emergence of features that are domain-invariant and discriminative at the same time. After the learning, the label predictor y(x) = Gy(Gf(x; θf); θy) can be used to predict labels for samples from the target domain (as well as from the source domain).
The simple learning procedure outlined above can be rederived/generalized along the lines suggested in (Goodfellow et al., 2014) (see Appendix A).

3.3. Relation to H∆H-distance

In this section we give a brief analysis of our method in terms of H∆H-distance (Ben-David et al., 2010; Cortes & Mohri, 2011) which is widely used in the theory of non-conservative domain adaptation. Formally,

$$d_{H \Delta H}(S, T) = 2 \sup_{h_1, h_2 \in H} \left| P_{f \sim S}[h_1(f) \neq h_2(f)] - P_{f \sim T}[h_1(f) \neq h_2(f)] \right| \tag{10}$$

defines a discrepancy distance between two distributions S and T w.r.t. a hypothesis set H. Using this notion one can obtain a probabilistic bound (Ben-David et al., 2010) on the performance εT(h) of some classifier h from H evaluated on the target domain given its performance εS(h) on the source domain:

$$\varepsilon_T(h) \le \varepsilon_S(h) + \frac{1}{2} d_{H \Delta H}(S, T) + C, \tag{11}$$

where S and T are source and target distributions respectively, and C does not depend on particular h.

Consider fixed S and T over the representation space produced by the feature extractor Gf and a family of label predictors Hp. We assume that the family of domain classifiers Hd is rich enough to contain the symmetric difference hypothesis set of Hp:

$$H_p \Delta H_p = \{ h \mid h = h_1 \oplus h_2,\; h_1, h_2 \in H_p \}. \tag{12}$$

It is not an unrealistic assumption as we have the freedom to pick Hd however we want. For example, we can set the architecture of the domain discriminator to be the layer-by-layer concatenation of two replicas of the label predictor followed by a two-layer non-linear perceptron aimed to learn the XOR-function. Given the assumption holds, one can easily show that training the Gd is closely related to the estimation of dHp∆Hp(S, T). Indeed,

$$\begin{aligned} d_{H_p \Delta H_p}(S, T) &= 2 \sup_{h \in H_p \Delta H_p} \left| P_{f \sim S}[h(f) = 1] - P_{f \sim T}[h(f) = 1] \right| \\ &\le 2 \sup_{h \in H_d} \left| P_{f \sim S}[h(f) = 1] - P_{f \sim T}[h(f) = 1] \right| \\ &= 2 \sup_{h \in H_d} \left| 1 - \alpha(h) \right| = 2 \sup_{h \in H_d} \left[ \alpha(h) - 1 \right] \end{aligned} \tag{13}$$

where α(h) = P_{f∼S}[h(f) = 0] + P_{f∼T}[h(f) = 1] is maximized by the optimal Gd.

Thus, the optimal discriminator gives the upper bound for dHp∆Hp(S, T). At the same time, backpropagation of the reversed gradient changes the representation space so that α(Gd) becomes smaller, effectively reducing dHp∆Hp(S, T) and leading to the better approximation of εT(Gy) by εS(Gy).

4. Experiments

We perform extensive evaluation of the proposed approach on a number of popular image datasets and their modifications. These include large-scale datasets of small images popular with deep learning methods, and the OFFICE datasets (Saenko et al., 2010), which are a de facto standard for domain adaptation in computer vision, but have far fewer images.

Baselines. For the bulk of experiments the following baselines are evaluated. The source-only model is trained without consideration for target-domain data (no domain classifier branch included into the network). The train-on-target model is trained on the target domain with class labels revealed. This model serves as an upper bound on DA methods, assuming that target data are abundant and the shift between the domains is considerable.

In addition, we compare our approach against the recently proposed unsupervised DA method based on subspace alignment (SA) (Fernando et al., 2013), which is simple to set up and test on new datasets, but has also been shown to perform very well in experimental comparisons with other “shallow” DA methods. To boost the performance of this baseline, we pick its most important free parameter (the number of principal components) from the range {2, ..., 60}, so that the test performance on the target domain is maximized. To apply SA in our setting, we train a source-only model and then consider the activations of the last hidden layer in the label predictor (before the final linear classifier) as descriptors/features, and learn the mapping between the source and the target domains (Fernando et al., 2013).

Since the SA baseline requires training a new classifier after adapting the features, and in order to put all the compared settings on an equal footing, we retrain the last layer of the label predictor using a standard linear SVM (Fan et al., 2008) for all four considered methods (including ours; the performance on the target domain remains approximately the same after the retraining).

For the OFFICE dataset (Saenko et al., 2010), we directly compare the performance of our full network (feature extractor and label predictor) against recent DA approaches using previously published results.

CNN architectures. In general, we compose the feature extractor from two or three convolutional layers, picking their exact configurations from previous works. We give the exact architectures in Appendix B.

For the domain adaptator we stick to the three fully connected layers (x → 1024 → 1024 → 2), except for MNIST where we used a simpler (x → 100 → 2) ar-
[Figure 2: image grid with a SOURCE row and a TARGET row for four domain pairs: MNIST → MNIST-M, SYN NUMBERS → SVHN, SVHN → MNIST, SYN SIGNS → GTSRB.]
Figure 2. Examples of domain pairs used in the experiments. See Section 4.1 for details.
METHOD                        SOURCE: MNIST      SYN NUMBERS      SVHN             SYN SIGNS
                              TARGET: MNIST-M    SVHN             MNIST            GTSRB
SOURCE ONLY                   .5749              .8665            .5919            .7400
SA (FERNANDO ET AL., 2013)    .6078 (7.9%)       .8672 (1.3%)     .6157 (5.9%)     .7635 (9.1%)
PROPOSED APPROACH             .8149 (57.9%)      .9048 (66.1%)    .7107 (29.3%)    .8866 (56.7%)
TRAIN ON TARGET               .9891              .9244            .9951            .9987

Table 1. Classification accuracies for digit image classifications for different source and target domains. MNIST-M corresponds to difference-blended digits over non-uniform background. The first row corresponds to the lower performance bound (i.e. if no adaptation is performed). The last row corresponds to training on the target domain data with known class labels (upper bound on the DA performance). For each of the two DA methods (ours and (Fernando et al., 2013)) we show how much of the gap between the lower and the upper bounds was covered (in brackets). For all five cases, our approach outperforms (Fernando et al., 2013) considerably, and covers a big portion of the gap.
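The bracketed percentages in Table 1 can be read as the share of the gap between the source-only and train-on-target accuracies that a method closes. The following sketch assumes that reading (the formula itself is not spelled out in the caption) and uses the hypothetical helper name `gap_covered`; it reproduces the MNIST → MNIST-M column. Because the accuracies are printed to four digits, recomputing other columns this way may differ slightly from the printed percentages.

```python
def gap_covered(source_only, adapted, train_on_target):
    """Fraction of the lower-to-upper-bound accuracy gap closed by adaptation."""
    return (adapted - source_only) / (train_on_target - source_only)


# Accuracies from the MNIST -> MNIST-M column of Table 1.
lower, upper = 0.5749, 0.9891
proposed = gap_covered(lower, 0.8149, upper)
subspace_alignment = gap_covered(lower, 0.6078, upper)

print(f"proposed approach: {100 * proposed:.1f}%")      # 57.9%
print(f"SA baseline: {100 * subspace_alignment:.1f}%")  # 7.9%
```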
chitecture to speed up the experiments.

For loss functions, we set Ly and Ld to be the logistic regression loss and the binomial cross-entropy respectively.

CNN training procedure. The model is trained on 128-sized batches. Images are preprocessed by the mean subtraction. Half of each batch is populated by the samples from the source domain (with known labels), and the rest is composed of samples from the target domain (with unknown labels).

In order to suppress the noisy signal from the domain classifier at the early stages of the training procedure, instead of fixing the adaptation factor λ, we gradually change it from 0 to 1 using the following schedule:

$$\lambda_p = \frac{2}{1 + \exp(-\gamma \cdot p)} - 1, \tag{14}$$

where γ was set to 10 in all experiments (the schedule was not optimized/tweaked). Further details on the CNN training can be found in Appendix C.

Visualizations. We use t-SNE (van der Maaten, 2013) projection to visualize feature distributions at different points of the network, while color-coding the domains (Figure 3). We observe strong correspondence between the success of the adaptation in terms of the classification accuracy for the target domain, and the overlap between the domain distributions in such visualizations.

Choosing meta-parameters. In general, good unsupervised DA methods should provide ways to set meta-parameters (such as λ, the learning rate, the momentum rate, the network architecture for our method) in an unsupervised way, i.e. without referring to labeled data in the target domain. In our method, one can assess the performance of the whole system (and the effect of changing hyper-parameters) by observing the test error on the source domain and the domain classifier error. In general, we observed a good correspondence between the success of adaptation and these errors (adaptation is more successful when the source domain test error is low, while the domain classifier error is high). In addition, the layer where the domain adaptator is attached can be picked by computing the difference between means as suggested in (Tzeng et al., 2014).

4.1. Results

We now discuss the experimental settings and the results. In each case, we train on the source dataset and test on a different target domain dataset, with considerable shifts between domains (see Figure 2). The results are summarized in Table 1 and Table 2.

MNIST → MNIST-M. Our first experiment deals with the MNIST dataset (LeCun et al., 1998) (source). In order to obtain the target domain (MNIST-M) we blend digits from the original set over patches randomly extracted from color photos from BSDS500 (Arbelaez et al., 2011). This operation is formally defined for two images I^1, I^2 as I^out_ijk = |I^1_ijk − I^2_ijk|, where i, j are the coordinates of a pixel and k is a channel index. In other words, an output sample is produced by taking a patch from a photo and in-
METHOD                                       SOURCE: AMAZON    DSLR           WEBCAM
                                             TARGET: WEBCAM    WEBCAM         DSLR
GFK(PLS, PCA) (GONG ET AL., 2012)            .464 ± .005       .613 ± .004    .663 ± .004
SA (FERNANDO ET AL., 2013)                   .450              .648           .699
DA-NBNN (TOMMASI & CAPUTO, 2013)             .528 ± .037       .766 ± .017    .762 ± .025
DLID (S. CHOPRA & GOPALAN, 2013)             .519              .782           .899
DeCAF6 SOURCE ONLY (DONAHUE ET AL., 2014)    .522 ± .017       .915 ± .015    –
DaNN (GHIFARY ET AL., 2014)                  .536 ± .002       .712 ± .000    .835 ± .000
DDC (TZENG ET AL., 2014)                     .594 ± .008       .925 ± .003    .917 ± .008
PROPOSED APPROACH                            .673 ± .017       .940 ± .008    .937 ± .010

Table 2. Accuracy evaluation of different DA approaches on the standard OFFICE (Saenko et al., 2010) dataset. Our method (last row) outperforms competitors, setting the new state of the art.
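The adaptation-factor schedule of Eq. (14) from the CNN training procedure can be sketched in a few lines of Python. The function name `adaptation_factor` is hypothetical, and the sketch assumes that p is the training progress running from 0 (start) to 1 (end), which this section does not state explicitly.

```python
import math


def adaptation_factor(p, gamma=10.0):
    """Eq. (14): lambda_p = 2 / (1 + exp(-gamma * p)) - 1.

    p is assumed to be the training progress in [0, 1]; gamma = 10 is the
    value reported for all experiments.
    """
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.25, 0.5, 1.0):
        print(f"p = {p:.2f}  lambda = {adaptation_factor(p):.4f}")
```

With γ = 10 the factor starts at exactly 0, so the domain classifier's gradient does not reach the feature extractor at first, and it rises smoothly towards 1. The expression is algebraically equal to tanh(γp/2).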
[Figure 3: two pairs of t-SNE plots. MNIST → MNIST-M, top feature extractor layer: (a) non-adapted, (b) adapted. SYN NUMBERS → SVHN, last hidden layer of the label predictor: (a) non-adapted, (b) adapted.]
Figure 3. The effect of adaptation on the distribution of the extracted features (best viewed in color). The figure shows t-SNE (van der Maaten, 2013) visualizations of the CNN’s activations (a) in the case when no adaptation was performed and (b) in the case when our adaptation procedure was incorporated into training. Blue points correspond to the source domain examples, while red ones correspond to the target domain. In all cases, the adaptation in our method makes the two distributions of features much closer.
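The difference-blending used to build MNIST-M, I^out_ijk = |I^1_ijk − I^2_ijk|, can be illustrated on a tiny example. This is a sketch under assumptions, not the authors' data pipeline: images are nested Python lists indexed as [row][column][channel], pixel values are integers in 0..255, the grayscale digit is replicated across the three channels, and the name `difference_blend` is hypothetical.

```python
def difference_blend(digit, patch):
    """Per-pixel, per-channel absolute difference of two equally sized images."""
    return [
        [
            [abs(a - b) for a, b in zip(digit_px, patch_px)]
            for digit_px, patch_px in zip(digit_row, patch_row)
        ]
        for digit_row, patch_row in zip(digit, patch)
    ]


# A 1x2 "digit": one stroke pixel (white) and one background pixel (black).
digit = [[[255, 255, 255], [0, 0, 0]]]
# A 1x2 photo patch with the same color in both pixels.
patch = [[[10, 200, 90], [10, 200, 90]]]

blended = difference_blend(digit, patch)
print(blended)  # [[[245, 55, 165], [10, 200, 90]]]
```

Where the digit is black the photo patch passes through unchanged, and where the digit is white the patch colors are inverted, so neither the background nor the stroke keeps a constant color.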
verting its pixels at positions corresponding to the pixels of a digit. For a human the classification task becomes only slightly harder compared to the original dataset (the digits are still clearly distinguishable) whereas for a CNN trained on MNIST this domain is quite distinct, as the background and the strokes are no longer constant. Consequently, the source-only model performs poorly. Our approach succeeded at aligning feature distributions (Figure 3), which led to successful adaptation results (considering that the adaptation is unsupervised). At the same time, the improvement over the source-only model achieved by subspace alignment (SA) (Fernando et al., 2013) is quite modest, thus highlighting the difficulty of the adaptation task.

Synthetic numbers → SVHN. To address a common scenario of training on synthetic data and testing on real data, we use the Street-View House Number dataset SVHN (Netzer et al., 2011) as the target domain and synthetic digits as the source. The latter (SYN NUMBERS) consists of 500,000 images generated by ourselves from Windows fonts by varying the text (that includes different one-, two-, and three-digit numbers), positioning, orientation, background and stroke colors, and the amount of blur. The degrees of variation were chosen manually to simulate SVHN; however, the two datasets are still rather distinct, the biggest difference being the structured clutter in the background of SVHN images.

The proposed backpropagation-based technique works well, covering two thirds of the gap between training with source data only and training on target domain data with known target labels. In contrast, SA (Fernando et al., 2013) does not result in any significant improvement in the classification accuracy, thus highlighting that the adaptation task is even more challenging than in the case of the MNIST experiment.

MNIST ↔ SVHN. In this experiment, we further increase the gap between distributions, and test on MNIST and SVHN, which are significantly different in appearance. Training on SVHN even without adaptation is challenging: the classification error stays high during the first 150 epochs. In order to avoid ending up in a poor local minimum we, therefore, do not use learning rate annealing here. Obviously, the two directions (MNIST → SVHN and SVHN → MNIST) are not equally difficult. As SVHN is more diverse, a model trained on SVHN is expected to be more generic and to perform reasonably on the MNIST dataset. This, indeed, turns out to be the case and is supported by the appearance of the feature distributions. We observe a quite strong separation between the domains when we feed them into the CNN trained solely on MNIST, whereas for the SVHN-trained network the features are much more intermixed. This difference probably explains why our method succeeded in improving the

[Figure 4: plot of validation error (vertical axis, 0 to 1) against batches seen (horizontal axis, 1 to 5, ×10^4), with three curves: real data only, synthetic data only, both.]
Figure 4. Semi-supervised domain adaptation for the traffic signs. As labeled target domain data are shown to the method, it achieves significantly lower error than the model trained on target domain data only or on source domain data only.

mains: AMAZON, DSLR, and WEBCAM. Unlike previously discussed datasets, OFFICE is rather small-scale with only 2817 labeled images spread across 31 different categories in the largest domain. The amount of available data is crucial for a successful training of a deep model, hence we opted for the fine-tuning of the CNN pre-trained on the ImageNet (Jia et al., 2014) as it is done in some recent DA works (Donahue et al., 2014; Tzeng et al., 2014; Hoffman et al., 2013). We make our approach more comparable with (Tzeng et al., 2014) by using exactly the same network architecture, replacing domain mean-based regularization with the domain classifier.

Following most previous works, we evaluate our method using 5 random splits for each of the 3 transfer tasks commonly used for evaluation. Our training protocol is close to (Tzeng et al., 2014; Saenko et al., 2010; Gong et al., 2012) as we use the same number of labeled source-domain images per category. Unlike those works and similarly to e.g. DLID (S. Chopra & Gopalan, 2013) we use the whole unlabeled target domain (as the premise of our method is the abundance of unlabeled data in the target domain). Under this transductive setting, our method is able to improve
previously-reported state-of-the-art accuracy for unsuper-
performance by adaptation in the SVHN → MNIST sce-
vised adaptation very considerably (Table 2), especially in
nario (see Table 1) but not in the opposite direction (SA is
the most challenging A MAZON → W EBCAM scenario (the
not able to perform adaptation in this case either). Unsu-
two domains with the largest domain shift).
pervised adaptation from MNIST to SVHN gives a failure
example for our approach (we are unaware of any unsuper-
5. Discussion
vised DA methods capable of performing such adaptation).
We have proposed a new approach to unsupervised do-
Synthetic Signs → GTSRB. Overall, this setting is sim- main adaptation of deep feed-forward architectures, which
ilar to the S YN N UMBERS → SVHN experiment, except allows large-scale training based on large amount of an-
the distribution of the features is more complex due to the notated data in the source domain and large amount of
significantly larger number of classes (43 instead of 10). unannotated data in the target domain. Similarly to many
For the source domain we obtained 100,000 synthetic im- previous shallow and deep DA techniques, the adaptation
ages (which we call S YN S IGNS) simulating various pho- is achieved through aligning the distributions of features
toshooting conditions. Once again, our method achieves across the two domains. However, unlike previous ap-
a sensible increase in performance once again proving its proaches, the alignment is accomplished through standard
suitability for the synthetic-to-real data adaptation. backpropagation training. The approach is therefore rather
As an additional experiment, we also evaluate the pro- scalable, and can be implemented using any deep learning
posed algorithm for semi-supervised domain adaptation, package. To this end we plan to release the source code for
i.e. when one is additionally provided with a small amount the Gradient Reversal layer along with the usage examples
of labeled target data. For that purpose we split GTSRB as an extension to Caffe (Jia et al., 2014).
into the train set (1280 random samples with labels) and Further evaluation on larger-scale tasks and in semi-
the validation set (the rest of the dataset). The validation supervised settings constitutes future work. It is also in-
part is used solely for the evaluation and does not partic- teresting whether the approach can benefit from a good ini-
ipate in the adaptation. The training procedure changes tialization of the feature extractor. For this, a natural choice
slightly as the label predictor is now exposed to the tar- would be to use deep autoencoder/deconvolution network
get data. Figure 4 shows the change of the validation error trained on both domains (or on the target domain) in the
throughout the training. While the graph clearly suggests same vein as (Glorot et al., 2011; S. Chopra & Gopalan,
that our method can be used in the semi-supervised setting, 2013), effectively using (Glorot et al., 2011; S. Chopra &
thorough verification of semi-supervised setting is left for Gopalan, 2013) as an initialization to our method.
future work.
Office dataset. We finally evaluate our method on O F -
FICE dataset, which is a collection of three distinct do-
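To illustrate why the approach fits into ordinary backpropagation, the behaviour of a gradient reversal layer can be sketched in a few lines of plain Python. This is a minimal sketch, not the Caffe extension mentioned in the Discussion: it assumes that the layer is the identity on the forward pass and multiplies the incoming gradient by −λ on the backward pass, which is how the loss pair (Ld, −λLd) of Appendix A reads. The names `GradientReversal` and `feature_gradient` are hypothetical and chosen for this example only.

```python
class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off factor (lambda); its value here is an assumption

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The domain-loss gradient is negated and scaled before it reaches
        # the feature extractor.
        return [-self.lam * g for g in grad_from_domain_classifier]


def feature_gradient(grad_label_loss, grad_domain_loss, grl):
    """Total gradient arriving at the feature extractor output.

    The label-predictor gradient passes through as is; the domain-classifier
    gradient passes through the gradient reversal layer first.
    """
    reversed_grad = grl.backward(grad_domain_loss)
    return [gy + gd for gy, gd in zip(grad_label_loss, reversed_grad)]


if __name__ == "__main__":
    grl = GradientReversal(lam=0.5)
    print(feature_gradient([0.2, -0.1], [1.0, 2.0], grl))
```

Because only the backward pass differs from an identity layer, the rest of the network can be trained with an unmodified SGD loop.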
Appendix A. An alternative optimization approach

There exists an alternative construction (inspired by (Goodfellow et al., 2014)) that leads to the same updates (4)-(6). Rather than using the gradient reversal layer, the construction introduces two different loss functions for the domain classifier. Minimization of the first domain loss ($L_{d+}$) should lead to a better domain discrimination, while the second domain loss ($L_{d-}$) is minimized when the domains are distinct. Stochastic updates for $\theta_f$ and $\theta_d$ are then defined as:

$$\theta_f \longleftarrow \theta_f - \mu \left( \frac{\partial L_y^i}{\partial \theta_f} + \frac{\partial L_{d-}^i}{\partial \theta_f} \right)$$

$$\theta_d \longleftarrow \theta_d - \mu \frac{\partial L_{d+}^i}{\partial \theta_d},$$

Thus, different parameters participate in the optimization of different losses.

In this framework, the gradient reversal layer constitutes a special case, corresponding to the pair of domain losses $(L_d, -\lambda L_d)$. However, other pairs of loss functions can be used. One example would be the binomial cross-entropy (Goodfellow et al., 2014):

$$L_{d+}(q, d) = \sum_{i=1..N} d_i \log(q_i) + (1 - d_i) \log(1 - q_i),$$

where $d$ indicates domain indices and $q$ is an output of the predictor. In that case the “adversarial” loss is easily obtained by swapping domain labels, i.e. $L_{d-}(q, d) = L_{d+}(q, 1 - d)$. This particular pair has a potential advantage of producing stronger gradients at early learning stages if the domains are quite dissimilar. In our experiments, however, we did not observe any significant improvement resulting from this choice of losses.

Appendix B. CNN architectures

Four different architectures were used in our experiments (first three are shown in Figure 5):
• A smaller one (a) if the source domain is MNIST. This architecture was inspired by the classical LeNet-5 (LeCun et al., 1998).
• (b) for the experiments involving SVHN dataset. This one is adopted from (Srivastava et al., 2014).
• (c) in the SYN SIGNS → GTSRB setting. We used the single-CNN baseline from (Cireşan et al., 2012) as our starting point.
• Finally, we use pre-trained AlexNet from the Caffe-package (Jia et al., 2014) for the OFFICE domains. Adaptation architecture is identical to (Tzeng et al., 2014): 2-layer domain classifier (x → 1024 → 1024 → 2) is attached to the 256-dimensional bottleneck of fc7.

The domain classifier branch in all cases is somewhat arbitrary (better adaptation performance might be attained if this part of the architecture is tuned).

Appendix C. Training procedure

We use stochastic gradient descent with 0.9 momentum and the learning rate annealing described by the following formula:

$$\mu_p = \frac{\mu_0}{(1 + \alpha \cdot p)^{\beta}},$$

where $p$ is the training progress linearly changing from 0 to 1, $\mu_0 = 0.01$, $\alpha = 10$ and $\beta = 0.75$ (the schedule was optimized to promote convergence and low error on the source domain).

Following (Srivastava et al., 2014) we also use dropout and $\ell_2$-norm restriction when we train the SVHN architecture.
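The learning rate annealing of Appendix C is simple to reproduce. The sketch below evaluates µp = µ0 / (1 + α·p)^β with the constants stated in the text (µ0 = 0.01, α = 10, β = 0.75). The function names and the conversion from step count to progress p are assumptions made for illustration; the paper only states that p changes linearly from 0 to 1.

```python
def annealed_learning_rate(p, mu0=0.01, alpha=10.0, beta=0.75):
    """Return mu_p = mu0 / (1 + alpha * p) ** beta for training progress p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("training progress p must lie in [0, 1]")
    return mu0 / (1.0 + alpha * p) ** beta


def training_progress(step, total_steps):
    """Hypothetical helper: map a step index linearly onto p in [0, 1]."""
    return step / total_steps


if __name__ == "__main__":
    total_steps = 10000  # illustrative value, not taken from the paper
    for step in (0, 2500, 5000, 7500, 10000):
        p = training_progress(step, total_steps)
        print(f"p = {p:.2f}  learning rate = {annealed_learning_rate(p):.6f}")
```

At p = 0 the schedule returns µ0 itself, and the rate then decays monotonically as training progresses.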
[Figure 5 diagrams. Layers as labeled in each panel:]

(a) MNIST architecture. Convolutions: 5x5 (32 maps, ReLU), 5x5 (48 maps, ReLU). Two max-pool 2x2 layers with 2x2 stride. Fully-connected: 100 units (ReLU), 100 units (ReLU), 10 units (Soft-max). Domain classifier branch: GRL, fully-conn 100 units (ReLU), fully-conn 1 unit (Logistic).

(b) SVHN architecture. Convolutions: 5x5 (64 maps, ReLU), 5x5 (64 maps, ReLU), 5x5 (128 maps, ReLU). Two max-pool 3x3 layers with 2x2 stride. Fully-connected: 3072 units (ReLU), 2048 units (ReLU), 10 units (Soft-max). Domain classifier branch: GRL, fully-conn 1024 units (ReLU), fully-conn 1024 units (ReLU), fully-conn 1 unit (Logistic).

(c) GTSRB architecture. Convolutions: 5x5 (96 maps, ReLU), 3x3 (144 maps, ReLU), 5x5 (256 maps, ReLU). Three max-pool 2x2 layers with 2x2 stride. Fully-connected: 512 units (ReLU), 10 units (Soft-max). Domain classifier branch: GRL, fully-conn 1024 units (ReLU), fully-conn 1024 units (ReLU), fully-conn 1 unit (Logistic).

Figure 5. CNN architectures used in the experiments. Boxes correspond to transformations applied to the data. Color-coding is the same as in Figure 1.
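As a rough check on the layer sizes listed for panel (a) of Figure 5, the following sketch traces the feature-map size through the convolutional part and counts fully-connected parameters in the two branches. Several details are assumptions not stated in the text: a 28x28 input, unpadded convolutions with stride 1, each max-pool placed directly after a convolution, and the domain classifier branch attached to the flattened output of the last pooling layer. All function and constant names are hypothetical.

```python
def window_output(size, kernel, stride=1):
    """Spatial size after an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


# (kind, kernel, stride, maps); sizes from panel (a), ordering is an assumption.
FEATURE_LAYERS = [
    ("conv", 5, 1, 32),
    ("pool", 2, 2, None),
    ("conv", 5, 1, 48),
    ("pool", 2, 2, None),
]
LABEL_PREDICTOR = [100, 100, 10]  # fully-conn widths, Soft-max output
DOMAIN_CLASSIFIER = [100, 1]      # fully-conn widths after the GRL, Logistic output


def feature_shape(side=28):
    """Return (maps, height, width) after the feature layers."""
    maps = None
    for kind, kernel, stride, n_maps in FEATURE_LAYERS:
        side = window_output(side, kernel, stride)
        if kind == "conv":
            maps = n_maps
    return maps, side, side


def branch_params(n_in, widths):
    """Total fully-connected parameters of a branch fed with n_in features."""
    total = 0
    for n_out in widths:
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total


if __name__ == "__main__":
    maps, height, width = feature_shape()
    flat = maps * height * width
    print("feature maps:", (maps, height, width), "flattened:", flat)
    print("label predictor parameters:", branch_params(flat, LABEL_PREDICTOR))
    print("domain classifier parameters:", branch_params(flat, DOMAIN_CLASSIFIER))
```

Under these assumptions the domain classifier is comparable in size to the label predictor, which is in line with the remark in Appendix B that this branch was not tuned.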
References

Arbelaez, Pablo, Maire, Michael, Fowlkes, Charless, and Malik, Jitendra. Contour detection and hierarchical image segmentation. PAMI, 33, 2011.

Babenko, Artem, Slesarev, Anton, Chigorin, Alexander, and Lempitsky, Victor S. Neural codes for image retrieval. In ECCV, pp. 584–599, 2014.

Baktashmotlagh, Mahsa, Harandi, Mehrtash Tafazzoli, Lovell, Brian C., and Salzmann, Mathieu. Unsupervised domain adaptation by domain invariant projection. In ICCV, pp. 769–776, 2013.

Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, and Vaughan, Jennifer Wortman. A theory of learning from different domains. JMLR, 79, 2010.

Borgwardt, Karsten M., Gretton, Arthur, Rasch, Malte J., Kriegel, Hans-Peter, Schölkopf, Bernhard, and Smola, Alexander J. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, pp. 49–57, 2006.

Cireşan, Dan, Meier, Ueli, Masci, Jonathan, and Schmidhuber, Jürgen. Multi-column deep neural network for traffic sign classification. Neural Networks, (32):333–338, 2012.

Cortes, Corinna and Mohri, Mehryar. Domain adaptation in regression. In Algorithmic Learning Theory, 2011.

Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. Decaf: A deep convolutional activation feature for generic visual recognition, 2014.

Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

Fernando, Basura, Habrard, Amaury, Sebban, Marc, and Tuytelaars, Tinne. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.

Ghifary, Muhammad, Kleijn, W Bastiaan, and Zhang, Mengjie. Domain adaptive neural networks for object recognition. In PRICAI 2014: Trends in Artificial Intelligence. 2014.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pp. 513–520, 2011.

Gong, Boqing, Shi, Yuan, Sha, Fei, and Grauman, Kristen. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066–2073, 2012.

Gong, Boqing, Grauman, Kristen, and Sha, Fei. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, pp. 222–230, 2013.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.

Gopalan, Raghuraman, Li, Ruonan, and Chellappa, Rama. Domain adaptation for object recognition: An unsupervised approach. In ICCV, pp. 999–1006, 2011.

Hoffman, Judy, Tzeng, Eric, Donahue, Jeff, Jia, Yangqing, Saenko, Kate, and Darrell, Trevor. One-shot adaptation of supervised deep convolutional models. CoRR, abs/1312.6204, 2013.

Huang, Jiayuan, Smola, Alexander J., Gretton, Arthur, Borgwardt, Karsten M., and Schölkopf, Bernhard. Correcting sample selection bias by unlabeled data. In NIPS, pp. 601–608, 2006.

Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

Liebelt, Joerg and Schmid, Cordelia. Multi-view object class detection with a 3d geometric model. In CVPR, 2010.

Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.

Pan, Sinno Jialin, Tsang, Ivor W., Kwok, James T., and Yang, Qiang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.

S. Chopra, S. Balakrishnan and Gopalan, R. Dlid: Deep learning for domain adaptation by interpolating between domains. In ICML Workshop on Challenges in Representation Learning, 2013.

Saenko, Kate, Kulis, Brian, Fritz, Mario, and Darrell, Trevor. Adapting visual category models to new domains. In ECCV, pp. 213–226. 2010.

Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, October 2000.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Stark, Michael, Goesele, Michael, and Schiele, Bernt. Back to the future: Learning shape models from 3d CAD data. In BMVC, pp. 1–11, 2010.

Sun, Baochen and Saenko, Kate. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In BMVC, 2014.

Tommasi, Tatiana and Caputo, Barbara. Frustratingly easy nbnn domain adaptation. In ICCV, 2013.

Tzeng, Eric, Hoffman, Judy, Zhang, Ning, Saenko, Kate, and Darrell, Trevor. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.

van der Maaten, Laurens. Barnes-hut-sne. CoRR, abs/1301.3342, 2013.

Vázquez, David, López, Antonio Manuel, Marín, Javier, Ponsa, Daniel, and Gomez, David Gerónimo. Virtual and real world adaptation for pedestrian detection. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):797–809, 2014.

Zeiler, Matthew D. and Fergus, Rob. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013.
