
Unsupervised Image Super-Resolution using Cycle-in-Cycle Generative Adversarial Networks

Yuan Yuan¹,²* Siyuan Liu¹,³,⁴ Jiawei Zhang¹ Yongbing Zhang³ Chao Dong¹ Liang Lin¹

¹SenseTime Research
²Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University
³Graduate School at Shenzhen, Tsinghua University, Shenzhen
⁴Department of Automation, Tsinghua University, Beijing

Abstract

We consider the single image super-resolution problem in a more general case where the low-/high-resolution pairs and the down-sampling process are unavailable. Different from the traditional super-resolution formulation, the low-resolution input is further degraded by noise and blurring. This complicated setting makes supervised learning and accurate kernel estimation impossible. To solve this problem, we resort to unsupervised learning without paired data, inspired by recent successful image-to-image translation applications. With generative adversarial networks (GAN) as the basic component, we propose a Cycle-in-Cycle network structure to tackle the problem in three steps. First, the noisy and blurry input is mapped to a noise-free low-resolution space. Then the intermediate image is up-sampled with a pre-trained deep model. Finally, we fine-tune the two modules in an end-to-end manner to get the high-resolution output. Experiments on NTIRE2018 datasets demonstrate that the proposed unsupervised method achieves results comparable to state-of-the-art supervised models.

[Figure 1: sub-figures Ground Truth, Bicubic, EDSR [17], BM3D+EDSR, CinCGAN; PSNR/SSIM for the last four: 29.42/0.82, 28.95/0.76, 30.94/0.91, 31.01/0.92]
Figure 1. ×4 super-resolution results of the proposed CinCGAN method for "0896" (DIV2K). For comparison, the sub-figures are cropped from the results of existing algorithms. When the input is noisy, the results of bicubic interpolation and the EDSR [17] model are both of low quality, while CinCGAN learns to reconstruct a clean result with fine details. BM3D+EDSR means using BM3D for denoising first and then EDSR for super-resolution.

1. Introduction

Recent deep learning based super-resolution (SR) methods have achieved significant improvement either on PSNR values [8, 12, 13, 16, 17, 25, 28, 30] or on visual quality [16, 20]. These methods require supervised learning on high-resolution (HR) and low-resolution (LR) image pairs. However, their common assumption that the downscaling factor is known and the input image is noise-free hinders them from practical usage. In real-world scenarios, the SR problem often has the following properties: 1) HR datasets are unavailable, 2) the downscaling method is unknown, 3) input LR images are noisy and blurry. This problem is extremely difficult if the input images suffer from different kinds of degradation. For an easier case, in this study, we assume that input images are degraded with the same processing, which is complex and unavailable.

*Yuan Yuan and Siyuan Liu are co-first authors. This work was done when they were interns at SenseTime. Contact email: yuanyuan@szu.edu.cn
Under the above circumstances, models learned from synthetic data tend to generate results similar to traditional methods [13, 30] or even simple interpolation. In Fig. 1, we show the results of bicubic interpolation and the state-of-the-art deep learning model EDSR [17] with a noisy input. This is mainly due to the data bias between training and testing images. A detailed survey and analysis of deep learning based methods on real data can be found in [15]. As an alternative, blind SR methods [7, 19, 29] deal with real-world data by estimating the down-sampling kernel from internal or external similar patches. However, when the input is noisy, the down-sampling kernel cannot be accurately estimated, and the inverse mapping results are accompanied by amplified noise. There are also works attempting to restore LR images with additive Gaussian noise [34]. But real-world noise may neither be additive nor follow the standard Gaussian distribution, making noise estimation infeasible. More generally, LR images may suffer from complex noise, blurring and non-uniform down-sampling kernels, which fail almost all existing blind SR methods.

Inspired by the development of unsupervised learning in image-to-image translation, such as CycleGAN [35] or WESPE [9], we investigate unsupervised strategies to overcome this obstacle. In CycleGAN, images are translated between different domains with unpaired training data. It assumes that the input image is of the same size as the output image, differing only in style. However, in SR, output images are several times larger than the inputs, making the direct application of CycleGAN impossible. Further, using a bicubic-upsampled image as the input also could not obtain satisfactory results. The SR problem is specific in that it requires a high-quality output, not just a different style.

After exploring several training strategies, we find an effective Cycle-in-Cycle structure, named CinCGAN, which achieves superior results. The whole pipeline consists of two CycleGANs, where the second GAN covers the first one (see Fig. 2). The first CycleGAN maps the LR image to the clean and bicubic-downsampled LR space. This module ensures that the LR input is fairly denoised/deblurred. We then stack another well-trained deep model with the bicubic-downsampling assumption to up-sample the intermediate result to the desired size. Finally, we fine-tune the whole network using adversarial learning in an end-to-end manner. We conduct experiments on the NTIRE2018 Super-Resolution Challenge¹ dataset, and show that the proposed Cycle-in-Cycle structure is much more stable at training and achieves performance competitive with supervised deep learning methods.

The contributions of this work are threefold: 1) We study a more general super-resolution problem, where the high-resolution ground truth, down-sampling kernel and degradation function are unavailable. 2) We explore several unsupervised training strategies under the above assumption, and show that the super-resolution task is different from conventional image-to-image translation. 3) We propose a Cycle-in-Cycle structure that achieves results comparable to supervised CNN networks.

¹https://competitions.codalab.org/competitions/18024

2. Related work

2.1. Image Super-Resolution

Single image super-resolution (SISR) has been widely studied for decades. Early approaches either rely on natural image statistics [33, 13] or pre-defined models [10, 5, 26]. Later, mapping functions between LR images and HR images were investigated, such as sparse coding based SR methods [30, 32].

Recently, deep convolutional neural networks (CNN) have shown explosive popularity and a powerful capability to improve the quality of SR results. Ever since Dong et al. [3] first proposed using a CNN for SR and achieved state-of-the-art performance, plenty of CNN architectures have been studied for SISR. Inspired by the VGG [24] networks used for ImageNet classification, Kim et al. [12] present a very deep network (VDSR) that learns a residual image. To accelerate SR, FSRCNN [4] and ESPCN [23] extract feature maps in the low-resolution space and up-sample the image at the last layer by transposed convolution and sub-pixel convolution, respectively. All the above-mentioned CNN based SR methods aim at minimizing the mean-square error (MSE) between the reconstructed HR image and the ground truth. Based on the observation that minimizing MSE makes the SR results overly smooth, SRGAN [16] combines an adversarial loss [6] and a perceptual loss [24, 11] as the final objective function, and generates visually pleasing images that contain more high-frequency details than the MSE-loss based methods. The champion of the NTIRE2017 Super-Resolution Challenge [27], EDSR [17], employs deeper and wider networks to achieve state-of-the-art performance by removing unnecessary modules in SRResNet [16].

2.2. Blind Image Super-Resolution

Although many works focus on SR problems with known degradation/downsampling kernels, few works try to solve blind SR, where the degradation operation from HR images to LR images is unavailable. Estimating the degradation/blur kernel is an essential step for blind SR. Wang et al. [29] propose a probabilistic framework combined with an image co-occurrence prior to estimate the unknown point spread function (PSF) parameters.
Based on the property that small image patches re-appear in natural images, Michaeli and Irani [19] present a method that is able to estimate the optimal blur kernel. Another relevant work [21] introduces a convolution consistency constraint and bi-l0-l2-norm regularization [22] to guide the blur kernel estimation process, achieving state-of-the-art blind SR performance.

In this work, we investigate how deep learning can be beneficial for addressing blind SR problems.

2.3. Unsupervised Learning

Existing supervised deep learning methods cannot handle blind SR without LR-HR image pairs. In real-world scenarios, where paired data is unavailable, it is essential to find a way to realize unsupervised learning. Recent work on GAN [6] provides a feasible solution, which includes a generator and a discriminator. The generator tries to generate fake images to fool the discriminator, while the discriminator aims at distinguishing the generated results from real data. GAN is widely used to solve unsupervised learning problems. DualGAN [31] and CycleGAN [35] are two works on image-to-image translation using unsupervised learning, and both present an interesting network structure that contains a pair of forward and inverse generators. The forward generator maps domain X to domain Y, while the inverse generator maps the output back to domain X to maintain cycle consistency. Ignatov et al. [9] use a similar architecture to design a weakly supervised photo enhancer (WESPE) that translates ordinary photos to DSLR-quality images.

Different from the proposed method, both DualGAN [31] and CycleGAN [35] deal with input and output images of the same size, while SR requires output images several times larger than the inputs. Utilizing the property of cycle consistency, we present a Cycle-in-Cycle GAN (CinCGAN) to super-resolve LR images whose degradation operators are unknown. Our method achieves performance comparable to state-of-the-art supervised CNN based algorithms [4, 16, 17].

3. Proposed Method

Problem formulation. The conventional formulation of SISR [30] is $x = SHz + n$, where $x$ and $z$ denote the LR and HR image respectively, $SH$ represents the down-sampling and blurring matrix, and $n$ is the additive noise. Blind SR [19, 29] follows the same assumption, only with unknown $SH$. In this work, we study a more general formulation, $x = f_n(f_d(z)) + n$, where $f_d$ is the down-sampling process and $f_n$ is a degradation function that may introduce complex noise, shift and blur. Here, we assume that $f_d$, $f_n$ and the paired HR-LR training data are unavailable. Nevertheless, we can obtain a set of LR images that can be used for analysis and unsupervised training.
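To make the formulation concrete, the following PyTorch sketch instantiates one possible degradation of the form $x = f_n(f_d(z)) + n$. It is purely illustrative: the actual $f_d$ and $f_n$ in our setting are unknown, and the bicubic kernel, blur width and noise level here are hypothetical assumptions.

```python
import torch
import torch.nn.functional as F

def degrade(z, scale=4, blur_sigma=1.2, noise_sigma=0.02):
    """Illustrative x = f_n(f_d(z)) + n for an HR tensor z of shape (1, 3, H, W) in [0, 1]."""
    # f_d: bicubic down-sampling by the scale factor
    lr = F.interpolate(z, scale_factor=1 / scale, mode="bicubic", align_corners=False)
    # f_n: a Gaussian blur (hypothetical width; the real degradation is unavailable)
    k = torch.arange(5, dtype=torch.float32) - 2
    g = torch.exp(-k ** 2 / (2 * blur_sigma ** 2))
    kernel = (g[:, None] * g[None, :]) / (g.sum() ** 2)     # normalized 5x5 kernel
    kernel = kernel.view(1, 1, 5, 5).repeat(3, 1, 1, 1)     # one kernel per RGB channel
    lr = F.conv2d(lr, kernel, padding=2, groups=3)
    # n: additive noise (real-world noise need not be Gaussian, as noted above)
    return (lr + noise_sigma * torch.randn_like(lr)).clamp(0, 1)
```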
Motivation. 1) Why apply unsupervised training? As the down-sampling and degradation functions are complex and coupled, it is hard to perform accurate estimation as in traditional blind SR methods [19, 29]. The unavailability of HR images in practice also makes supervised training with simulated paired data impractical. This drives us to explore unsupervised learning strategies. 2) What is the difference between SR and image-to-image translation? SR accepts an LR image and outputs an HR image with much larger resolution. Further, SR requires the output to be of high quality, not just of a different style. If we directly apply image-to-image translation methods, we need to up-sample the LR image first by interpolation, which also enlarges the noise patterns. Directly applying existing methods like CycleGAN cannot remove such amplified noise, and training becomes very unstable. Experiments (in Sec. 4.4) also show that when the degradation function varies from image to image, it is difficult to deal with all kinds of images in a single forward pass.

Solution pipeline. Our solution pipeline consists of three steps. First, we learn a mapping from an LR image set X to a "clean" LR image set Y, where images are noise-free and down-sampled from HR images Z with the bicubic kernel. In other words, we deblur and denoise the input images at low resolution. Second, we adopt an existing SR model to super-resolve the intermediate results to the desired resolution. In the end, we combine and fine-tune these two models simultaneously to get the final HR images.

Under the guidance of the above pipeline, we propose a Cycle-in-Cycle structure named CinCGAN, as shown in Fig. 2. To be specific, we adopt two coupled CycleGANs to learn the mappings from X to Y and from Y to Z, respectively. Unpaired images x_i ∈ X, y_j ∈ Y and z_j ∈ Z are used for training², where y_j is down-sampled from z_j with the bicubic kernel. Details are given in the following.

3.1. LR Image Restoration

The framework of the first CycleGAN, which maps an LR image x to a clean LR image y, is shown as LR→clean LR in Fig. 2. Given an input image x, the generator G1 learns to generate an image ỹ that looks similar to the clean LR y, so as to fool the discriminator D1. Meanwhile, D1 learns to distinguish the generated sample G1(x) from the real sample y. To stabilize the training procedure, we use the least-squares loss [18] instead of the negative log-likelihood used in [6]. The generator adversarial loss is:

$$L_{GAN}^{LR} = \frac{1}{N}\sum_{i}^{N} \left\| D_1(G_1(x_i)) - 1 \right\|_2, \qquad (1)$$

where N is the number of training samples.

²For simplicity, we omit the subscripts i and j in the following.
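As a concrete illustration, a minimal PyTorch sketch of this least-squares adversarial objective (Eq. (1)), together with the standard LSGAN counterpart for the discriminator, might look as follows. The module names g1 and d1 are illustrative placeholders, not the paper's released code.

```python
import torch

def lsgan_generator_loss(d1, g1, x):
    """Eq. (1): push D1's score on generated samples toward 1."""
    fake = g1(x)                        # G1(x): generated clean-LR candidate
    score = d1(fake)                    # D1 outputs a patch of real/fake scores
    return ((score - 1.0) ** 2).mean()  # least-squares loss instead of log-likelihood

def lsgan_discriminator_loss(d1, g1, x, y):
    """Standard LSGAN counterpart: real y scored toward 1, fake G1(x) toward 0."""
    real_score = d1(y)
    fake_score = d1(g1(x).detach())     # detach so only D1 receives gradients
    return ((real_score - 1.0) ** 2).mean() + (fake_score ** 2).mean()
```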
[Figure 2: schematic of CinCGAN with generators G1, G2, G3, the SR network, and discriminators D1, D2]
Figure 2. The framework of the proposed CinCGAN, where G1, G2 and G3 are generators and SR is a super-resolution network. D1 and D2 are discriminators. G1, G2 and D1 compose the first LR→clean LR CycleGAN model, mapping the degraded LR images to clean LR images. G1, SR, G3 and D2 compose the second LR→HR CycleGAN model, mapping the LR images to HR images.

To maintain consistency between the input x and output y, we add a network G2 and let x′ = G2(G1(x)) be identical to the input x. Hence, we also use a cycle consistency loss:

$$L_{cyc}^{LR} = \frac{1}{N}\sum_{i}^{N} \left\| G_2(G_1(x_i)) - x_i \right\|_2. \qquad (2)$$

In previous work [35], the authors introduce an identity loss to preserve the color composition between input and output images for painting generation. They claim that the identity loss helps preserve the color of the input images. In image SR, we also need to avoid color variation across iterations, thus we add an identity loss:

$$L_{idt}^{LR} = \frac{1}{N}\sum_{i}^{N} \left\| G_1(y_i) - y_i \right\|_1. \qquad (3)$$

In addition, we add a total variation (TV) loss to impose spatial smoothness:

$$L_{TV}^{LR} = \frac{1}{N}\sum_{i}^{N} \left( \left\| \nabla_h G_1(x_i) \right\|_2 + \left\| \nabla_w G_1(x_i) \right\|_2 \right), \qquad (4)$$

where ∇h and ∇w compute the horizontal and vertical gradients of G1(xi).

In summary, the final objective loss for the LR→clean LR model is a weighted sum of the four losses:

$$L_{total}^{LR} = L_{GAN}^{LR} + w_1 L_{cyc}^{LR} + w_2 L_{idt}^{LR} + w_3 L_{TV}^{LR}, \qquad (5)$$

where w1, w2 and w3 are the weights of the different losses.
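For concreteness, a sketch of how Eqs. (2)-(5) could be combined in PyTorch is shown below. The helper names are placeholders; the default weight values are the ones reported for pre-training in Sec. 4.2.

```python
import torch

def tv_loss(img):
    """Eq. (4): total variation over horizontal and vertical gradients."""
    dh = img[:, :, 1:, :] - img[:, :, :-1, :]   # vertical-neighbor differences
    dw = img[:, :, :, 1:] - img[:, :, :, :-1]   # horizontal-neighbor differences
    return dh.pow(2).mean() + dw.pow(2).mean()

def lr_total_loss(g1, g2, d1, x, y, w1=10.0, w2=5.0, w3=0.5):
    """Eq. (5): weighted sum of GAN, cycle, identity and TV terms (D1 held fixed here)."""
    fake_y = g1(x)
    loss_gan = ((d1(fake_y) - 1.0) ** 2).mean()   # Eq. (1), least-squares form
    loss_cyc = (g2(fake_y) - x).pow(2).mean()     # Eq. (2), cycle consistency
    loss_idt = (g1(y) - y).abs().mean()           # Eq. (3), L1 identity term
    loss_tv = tv_loss(fake_y)                     # Eq. (4), spatial smoothness
    return loss_gan + w1 * loss_cyc + w2 * loss_idt + w3 * loss_tv
```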
3.2. Joint Restoration and Super-Resolution

We then investigate how to super-resolve the intermediate image ỹ to the desired size. Recently, the enhanced deep residual network EDSR [17] won first prize in the NTIRE 2017 challenge on single image super-resolution [1]. For simplicity, we directly adopt EDSR as the SR network stacked after G1. Similarly, we use a discriminator D2 for adversarially training both the G1 and SR networks. We also utilize another generator G3 to ensure cycle consistency between x and the reconstructed x′′. The GAN loss, cycle loss and TV loss for the LR→HR network are formulated as follows:

$$L_{GAN}^{HR} = \frac{1}{N}\sum_{i}^{N} \left\| D_2(SR(G_1(x_i))) - 1 \right\|_2, \qquad (6)$$

$$L_{cyc}^{HR} = \frac{1}{N}\sum_{i}^{N} \left\| G_3(SR(G_1(x_i))) - x_i \right\|_2, \qquad (7)$$

$$L_{TV}^{HR} = \frac{1}{N}\sum_{i}^{N} \left( \left\| \nabla_h SR(G_1(x_i)) \right\|_2 + \left\| \nabla_w SR(G_1(x_i)) \right\|_2 \right). \qquad (8)$$

For the identity loss, instead of maintaining the tint consistency between input and output, we consider ensuring that the SR network generates super-resolved images of adequate quality. We define a new identity loss:

$$L_{idt}^{HR} = \sum_{i} \left\| SR(z') - z \right\|_2, \qquad (9)$$
[Figure 3: (a) generator: Conv k7n64s1 → Conv k3n64s1/k4n64s2 → Conv k3n64s1/k4n64s2 → 6 residual blocks (block1 ... block6) → Conv k3n64s1 → Conv k3n64s1 → Conv k7n3s1; (b) discriminator: Conv k4n64s1/s2 → Conv k4n128s1/s2 (BN) → Conv k4n256s1/s2 (BN) → Conv k4n512s1 (BN) → Conv k4n1s1, producing real/fake scores]
Figure 3. The generators G1, G2 and G3 share the same framework (a), and the discriminators D1 and D2 share the same framework (b). For the 2nd and 3rd convolution layers in generator (a), k3n64s1 is used for G1 and G2, while k4n64s2 is used for G3. For the first three convolution layers in discriminator (b), k4n64s1, k4n128s1 and k4n256s1 are used for D1, and k4n64s2, k4n128s2 and k4n256s2 are used for D2. Please see the text for details.

where z′ is down-sampled from z with the bicubic kernel. This L_idt^HR ensures that the SR network does not betray its original purpose, so that the produced z̃ remains a reasonable SR result.

To sum up, the total loss for fine-tuning the LR→HR network is

$$L_{total}^{HR} = L_{GAN}^{HR} + \lambda_1 L_{cyc}^{HR} + \lambda_2 L_{idt}^{HR} + \lambda_3 L_{TV}^{HR}, \qquad (10)$$

where λ1, λ2 and λ3 are the weights of each loss.

3.3. Network Architecture

The architectures of the generators G1, G2, G3 and discriminators D1, D2 are shown in Fig. 3. We adapt a similar architecture to the work of Zhu et al. [35], which has shown impressive results for unpaired image-to-image translation. Here, "conv" means a convolution layer, where a Leaky ReLU layer with negative slope 0.2 is added right after, except for the last convolution layer (we omit it for simplicity). "BN" means a batch normalization layer. The numbers after the symbols k, n and s represent kernel size, number of filters and stride, respectively. For example, k3n64s1 refers to a convolution layer that contains 64 filters of spatial size 3 and stride 1.

For the generators G1 and G2, we use 3 convolution layers at the head and tail, and 6 residual blocks in the middle. The generator G3 shares the same architecture as G1 and G2, except for the 2nd and 3rd convolution layers, where the stride is set to 2 to perform down-sampling. As for the discriminators, we use a 70 × 70 PatchGAN for D2. Since we up-sample LR images with a scale of ×4, the size of input images is usually less than 70 (we use 32 × 32 LR images and 128 × 128 HR images for training). Hence, we modify the stride of the first three convolution layers to 1 for discriminator D1, such that the receptive field of D1 is reduced to 16 × 16.
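As an illustration of this layout, a minimal PyTorch sketch of the G1/G2 generator is given below (head convolutions, 6 residual blocks, tail convolutions, LeakyReLU 0.2 after every conv except the last). The padding choices and the internal layout of the residual block are assumptions, since the paper does not spell them out.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block; the two-3x3-conv internal layout is an assumption."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """G1/G2 sketch: k7n64s1, k3n64s1, k3n64s1, 6 res-blocks, k3n64s1, k3n64s1, k7n3s1.
    For G3, the 2nd and 3rd convs would use kernel 4, stride 2 instead (down-sampling)."""
    def __init__(self):
        super().__init__()
        lrelu = nn.LeakyReLU(0.2, inplace=True)
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=1, padding=3), lrelu,   # k7n64s1
            nn.Conv2d(64, 64, 3, stride=1, padding=1), lrelu,  # k3n64s1
            nn.Conv2d(64, 64, 3, stride=1, padding=1), lrelu,  # k3n64s1
            *[ResBlock(64) for _ in range(6)],                 # 6 residual blocks
            nn.Conv2d(64, 64, 3, stride=1, padding=1), lrelu,  # k3n64s1
            nn.Conv2d(64, 64, 3, stride=1, padding=1), lrelu,  # k3n64s1
            nn.Conv2d(64, 3, 7, stride=1, padding=3),          # k7n3s1, no activation
        )

    def forward(self, x):
        return self.net(x)
```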
4. Experiments

In this section, we first introduce the dataset and the details used for training. We then evaluate the performance of the proposed CinCGAN model by comparing it with several state-of-the-art SISR methods. Finally, we perform an ablation study to validate the advantages of CinCGAN.

4.1. Training data

We take the track 2 dataset from the NTIRE2018 Super-Resolution Challenge for training. The challenge aims to restore an HR image given a degraded LR image. It provides a high-quality image dataset, DIV2K [1], which contains 800 training images and 100 validation images. The DIV2K dataset covers almost all kinds of natural scenarios: buildings (indoor and outdoor), forests, lakes, animals, people, etc. The track 2 dataset is degraded from the DIV2K dataset with down-sampling, blurring, pixel shifting and noise. Although the parameters of the degradation operators are fixed for all images, the blur kernels are randomly generated and their resulting pixel shifts vary from image to image. Hence, the degradation kernels of images in the track 2 dataset are unknown and diverse.

Since our purpose is to train a network in an unsupervised manner without paired LR-HR data, we take the first 400 images (numbered from 1 to 400) from the training LR set as the input images X, and the other 400 images (numbered from 401 to 800) from the HR set as the demanded HR images Z. The intermediate clean LR images Y are directly bicubic-downsampled from Z. Similar to [4, 24], we augment the data with 90-degree rotation and flipping. Our experiments are performed with a scaling factor of ×4. We randomly crop X and Y to size 32 × 32 and crop Z to size 128 × 128. We conduct testing on the provided 100 validation images. Note that, although DIV2K contains a paired training set, we do not use paired data for supervised training.
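A sketch of this patch-sampling step is shown below. The crop and augmentation policy follows the description above, but the exact sampling order and interfaces are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def random_crop(img, size):
    """Randomly crop a (C, H, W) tensor to (C, size, size)."""
    _, h, w = img.shape
    top, left = random.randint(0, h - size), random.randint(0, w - size)
    return img[:, top:top + size, left:left + size]

def augment(img):
    """90-degree rotation and flipping, as described above."""
    if random.random() < 0.5:
        img = torch.flip(img, dims=[2])            # horizontal flip
    if random.random() < 0.5:
        img = torch.rot90(img, k=1, dims=[1, 2])   # 90-degree rotation
    return img

def sample_training_patches(x_img, z_img):
    """Unpaired patches: 32x32 from X, 128x128 from Z, and Y bicubic-downsampled from Z."""
    x_patch = augment(random_crop(x_img, 32))
    z_patch = augment(random_crop(z_img, 128))
    y_patch = F.interpolate(z_patch.unsqueeze(0), scale_factor=0.25,
                            mode="bicubic", align_corners=False).squeeze(0).clamp(0, 1)
    return x_patch, y_patch, z_patch
```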
4.2. Training details

We divide our training process into two steps. We first train the models G1, G2 and D1 for mapping LR images to clean LR images (shown as LR→clean LR in Fig. 2). The three parameters in (5) are set to w1 = 10, w2 = 5 and w3 = 0.5, respectively. We train our model with the Adam optimizer [14], setting β1 = 0.5, β2 = 0.999 and ε = 10⁻⁸, without weight decay. The learning rate is initialized as 2 × 10⁻⁴ and then decreased by a factor of 2 every 40000 iterations. The weights of the filters in each layer are initialized using a normal distribution, and the batch size is set to 16. We train the model for over 400000 iterations, until it converges.

We then jointly fine-tune the LR→HR model (shown as LR→HR in Fig. 2). We initialize our SR network with the publicly available EDSR model³. We set the parameters in (10) as λ1 = 10, λ2 = 5 and λ3 = 2. The optimizer is set almost the same as for training the LR→clean LR model, except that we initialize the learning rate as 10⁻⁴. As for the weight of the identity loss L_idt^LR in (5), we set w2 = 1. At each iteration, we update (5) and (10) in turn. We first train G1 and G2 to update the LR→clean LR network. We then train G1, SR and G3 simultaneously to update the LR→HR network.

We implement the proposed networks with PyTorch and train them on an Nvidia Tesla K80 GPU. It takes about 1 day to pre-train the LR→clean LR model and about 2 days to jointly fine-tune the LR→HR model.

³https://github.com/thstkdgus35/EDSR-PyTorch
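A sketch of this alternating two-objective schedule under the stated Adam settings follows. The names g1, g2, g3, sr, d1, d2 and loader are placeholders carried over from the earlier sketches, hr_total_loss is an assumed analogue of Eq. (10), and the discriminator updates for D1 and D2 are omitted for brevity.

```python
import itertools
import torch

# Adam with beta1=0.5, beta2=0.999, eps=1e-8, no weight decay, as stated above
opt_lr = torch.optim.Adam(itertools.chain(g1.parameters(), g2.parameters()),
                          lr=2e-4, betas=(0.5, 0.999), eps=1e-8)
opt_hr = torch.optim.Adam(itertools.chain(g1.parameters(), sr.parameters(), g3.parameters()),
                          lr=1e-4, betas=(0.5, 0.999), eps=1e-8)

# halve the learning rate every 40000 iterations (the pre-training schedule above)
sched_lr = torch.optim.lr_scheduler.StepLR(opt_lr, step_size=40000, gamma=0.5)

for step, (x, y, z) in enumerate(loader):   # batches of unpaired patches
    # update the LR->clean LR objective, Eq. (5)
    opt_lr.zero_grad()
    lr_total_loss(g1, g2, d1, x, y).backward()
    opt_lr.step()
    sched_lr.step()
    # then the LR->HR objective, Eq. (10), trained in turn
    opt_hr.zero_grad()
    hr_total_loss(g1, sr, g3, d2, x, z).backward()
    opt_hr.step()
```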
4.3. Results

We compare the performance of the proposed CinCGAN model with several state-of-the-art SISR methods: FSRCNN [4], EDSR [17] and SRGAN [16]. We use the publicly available FSRCNN and EDSR models, which are trained with paired LR and HR images where the inputs are clean LR images down-sampled from HR images. To make the results more comparable, we also fine-tune EDSR and SRGAN (labelled EDSR+ and SRGAN+, respectively) on the paired track 2 dataset. To emphasize the effectiveness of the CinCGAN structure, we also try first denoising the input LR images and then super-resolving the denoised images for comparison. BM3D [2] is one of the state-of-the-art image denoising approaches, an efficient and powerful denoiser. Hence, we pre-process the test LR images with BM3D first, and then super-resolve them using EDSR (labelled BM3D+EDSR).

Table 1 shows the average PSNR and SSIM values of the restored test images. It shows that FSRCNN and EDSR cannot work well if the blur and noise are unknown during training. After fine-tuning on the paired track 2 dataset, EDSR+ and SRGAN+ improve their results, and our method performs comparably to SRGAN+ in terms of PSNR and SSIM without paired training data. Although BM3D can remove noise, it also over-smooths the input images. The PSNR and SSIM values of BM3D+EDSR are lower than those of the proposed method. Several subjective results are illustrated in Fig. 4.
[Figure 4: three result rows, each with panels (a) ground truth, (b) bicubic, (c) EDSR+ [17], (d) SRGAN+ [16], (e) BM3D+EDSR, (f) CinCGAN (ours); PSNR/SSIM per panel (b)-(f): "0801" 23.22/0.64, 26.23/0.68, 24.06/0.58, 23.06/0.65, 24.83/0.65; "0816" 22.25/0.68, 29.06/0.75, 27.36/0.68, 22.18/0.72, 27.95/0.72; "0853" 26.81/0.83, 30.28/0.88, 29.05/0.85, 26.84/0.86, 28.26/0.84]
Figure 4. Super-resolution results of "0801", "0816" and "0853" (DIV2K) with scale factor ×4. EDSR+ and SRGAN+ are trained on the paired NTIRE2018 track 2 dataset. BM3D+EDSR means using BM3D for denoising first and then EDSR for super-resolution. The proposed CinCGAN model shows results comparable to SRGAN+ and better than the BM3D+EDSR method.
Table 1. Quantitative evaluation on the NTIRE 2018 track 2 dataset of the proposed CinCGAN model, in terms of PSNR and SSIM.

method   bicubic   FSRCNN [4]   EDSR [17]   EDSR+   SRGAN+ [16]   BM3D+EDSR   CinCGAN (ours)
PSNR     22.85     22.79        22.67       25.77   24.33         22.88       24.33
SSIM     0.65      0.61         0.62        0.71    0.67          0.68        0.69

4.4. Ablation Study

To validate the advantages of the proposed CinCGAN model for the unsupervised SISR problem, we design some other network structures for comparison.

Structure 1. The first structure restores LR images X to HR images Z using only one CycleGAN, i.e., it denoises, deblurs and super-resolves the LR images at the same time. The structure of the model is shown in Fig. 5(a), where we feed an LR image x to the SR network directly. Correspondingly, we only minimize the total loss L_total^HR (with SR(G1(·)) replaced by SR(·) in Eqs. (6)-(8)). However, during the training procedure, we found that the results z̃ are always unstable and contain a lot of undesired artifacts, as shown in Fig. 6(a). It is hard for a single network to simultaneously denoise, deblur and up-sample the degraded images, especially when the degradation kernels differ from image to image and the learning is unsupervised.

Structure 2. We remove D2 and G3 from the proposed CinCGAN model for our second experiment. We map the input LR images to a set of clean LR images using the same LR→clean LR networks shown in Fig. 2; we then super-resolve the converted LR images directly using the SR network. The whole structure is shown in Fig. 5(b), and the corresponding result is illustrated in Fig. 6(b). As we can see, some negligible noise in the resulting clean LR images is magnified and becomes visible in the super-resolved images, which affects the visual quality.
[Figure 5: schematics of the three ablation structures built from G1, G2, G3, SR, D1 and D2: (a) Structure 1, (b) Structure 2, (c) Structure 3]
Figure 5. Experiments for validating the advantages of the proposed structure. (a) Structure 1: transform the LR images x to HR images z directly with one CycleGAN model; (b) Structure 2: remove D2 and G3 from the proposed CinCGAN model; (c) Structure 3: remove D1 and G2 from the proposed CinCGAN model.

[Figure 6: qualitative comparison with panels (a) Structure 1, (b) Structure 2, (c) Structure 3, (d) CinCGAN (ours), (e) ground truth]
Figure 6. Super-resolution results of "0829" (DIV2K) with scale factor ×4, for each structure as described in Fig. 5.
Structure 3. Our third experiment is performed by removing D1 and G2 from the proposed CinCGAN model, as shown in Fig. 5(c). We use one CycleGAN for the LR→HR model, where we take G1 + SR as the forward network and G3 as the inverse network. D2 is used for distinguishing z̃ from z. We load the pre-trained G1 (from the LR→clean LR networks) and the downloaded EDSR model for initialization. Experimental results in Fig. 6(c) show that the resulting z̃ are still noisy. Without the L_cyc^LR and L_GAN^LR constraints on the G1 network (L_idt^LR and L_TV^LR are still used for this model), G1 is unable to denoise and deblur. The whole model becomes similar to Structure 1.

Proposed Method. We then propose our final solution as shown in Fig. 2: jointly fine-tuning the LR→HR networks with CinCGAN. We sequentially update the LR→clean LR and the LR→HR models. With the two constraints L_total^LR and L_total^HR, the G1 network can denoise and deblur the degraded input image x, while the SR network can up-sample as well as further restore the resulting intermediate image ỹ. The final SR image is shown in Fig. 6(d), which shows the best visual result compared with the other three structures.

5. Conclusions

We investigate the single image super-resolution problem under a more general assumption: the low-/high-resolution image pairs and the down-sampling process are unavailable. Inspired by the recent successful image-to-image translation applications, we resort to unsupervised learning methods to solve this problem. Using generative adversarial networks (GAN), the proposed method contains two CycleGANs, where the second GAN covers the first one. The solution pipeline consists of three steps. First, we map the input LR images to the clean and bicubic-downsampled LR space with the first CycleGAN. We then stack another well-trained deep model with the bicubic-downsampling assumption to up-sample the intermediate result to the desired size. Finally, we fine-tune the two modules in an end-to-end manner to get the high-resolution output. Experimental results demonstrate that the proposed unsupervised method achieves results comparable to the state-of-the-art supervised models.

Acknowledgement. This work is supported by SenseTime Group Limited and in part by the Projects of National Science Foundations of China (61571254), Guangdong Special Support plan (2015TQ01X16), and Shenzhen Fundamental Research fund (JCYJ20160513103916577).
References

[1] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1122–1131. IEEE, 2017.
[2] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. BM3D image denoising with shape-adaptive principal component analysis. In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations, 2009.
[3] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
[4] C. Dong, C. C. Loy, and X. Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407. Springer, 2016.
[5] R. Fattal. Image upsampling via imposed edge statistics. In ACM Transactions on Graphics (TOG), volume 26, page 95. ACM, 2007.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[7] Y. He, K.-H. Yap, L. Chen, and L.-P. Chau. A soft MAP framework for blind super-resolution image reconstruction. Image and Vision Computing, 27(4):364–373, 2009.
[8] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
[9] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool. WESPE: Weakly supervised photo enhancer for digital cameras. arXiv preprint arXiv:1709.01118, 2017.
[10] M. Irani and S. Peleg. Improving resolution by image registration. CVGIP: Graphical Models and Image Processing, 53(3):231–239, 1991.
[11] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[12] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[13] K. I. Kim and Y. Kwon. Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(6):1127–1133, 2010.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] T. Köhler, M. Bätz, F. Naderi, A. Kaup, A. K. Maier, and C. Riess. Benchmarking super-resolution algorithms on real data. arXiv preprint arXiv:1709.04881, 2017.
[16] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.
[17] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, volume 1, page 3, 2017.
[18] X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang. Multi-class generative adversarial networks with the L2 loss function. CoRR, abs/1611.04076, 2016.
[19] T. Michaeli and M. Irani. Nonparametric blind super-resolution. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 945–952. IEEE, 2013.
[20] M. S. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 4501–4510. IEEE, 2017.
[21] W.-Z. Shao and M. Elad. Simple, accurate, and robust nonparametric blind super-resolution. In International Conference on Image and Graphics, pages 333–348. Springer, 2015.
[22] W.-Z. Shao, H.-B. Li, and M. Elad. Bi-l0-l2-norm regularization for blind motion deblurring. Journal of Visual Communication and Image Representation, 33:42–59, 2015.
[23] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[25] A. Singh and N. Ahuja. Super-resolution using sub-band self-similarity. In Asian Conference on Computer Vision, pages 552–568. Springer, 2014.
[26] J. Sun, Z. Xu, and H.-Y. Shum. Image super-resolution using gradient profile prior. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[27] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1110–1121. IEEE, 2017.
[28] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pages 111–126. Springer, 2014.
[29] Q. Wang, X. Tang, and H. Shum. Patch based blind image super resolution. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 709–716. IEEE, 2005.
[30] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[31] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.
[32] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
[33] H. Zhang, J. Yang, Y. Zhang, and T. S. Huang. Non-local kernel regression for image and video restoration. In European Conference on Computer Vision, pages 566–579. Springer, 2010.
[34] K. Zhang, W. Zuo, and L. Zhang. Learning a single convolutional super-resolution network for multiple degradations. arXiv preprint arXiv:1712.06116, 2017.
[35] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
