PRELIMINARY PREPRINT VERSION: DO NOT CITE

The AAAI Digital Library will contain the published version some time after the conference.
DarkVisionNet: Low-Light Imaging via RGB-NIR Fusion with Deep Inconsistency Prior
Bingbing Yu,1,2* Shuangping Jin,1,3* Minhao Jing,1 Yi Zhou,1 Jiajun Liang,1 Renhe Ji1†
1 Megvii Technology
2 Dalian University of Technology
3 Southeast University
220191583@seu.edu.cn, 21901037@mail.dlut.edu.cn, jingminhao@megvii.com, zhouyi@megvii.com, liangjiajun@megvii.com, jirenhe@megvii.com

Abstract
RGB-NIR fusion is a promising method for low-light imaging. However, the high-intensity noise in low-light images amplifies the effect of structure inconsistency between RGB-NIR images, which causes existing algorithms to fail. To handle this, we propose a new RGB-NIR fusion algorithm called Dark Vision Net (DVN) with two technical novelties: Deep Structure and Deep Inconsistency Prior (DIP). The Deep Structure extracts clear structure details in a deep multiscale feature space rather than the raw input space, which is more robust to noisy inputs. Based on the deep structures from both the RGB and NIR domains, we introduce the DIP to leverage the structure inconsistency to guide the fusion of RGB-NIR. Benefiting from this, the proposed DVN obtains high-quality low-light images without visual artifacts. We also propose a new dataset called the Dark Vision Dataset (DVD), consisting of aligned RGB-NIR image pairs, as the first public RGB-NIR fusion benchmark. Quantitative and qualitative results on the proposed benchmark show that DVN significantly outperforms other comparison algorithms in PSNR and SSIM, especially in extremely low-light conditions.

Introduction

High-quality low-light imaging is a challenging but significant task. On the one hand, it is the cornerstone of many important applications such as 24-hour surveillance, smartphone photography, etc. On the other hand, the massive noise in images captured in extremely dark environments hinders algorithms from satisfactorily restoring low-light images. RGB-NIR fusion techniques provide a new perspective on this challenge: they enhance the low-light noisy color (RGB) image with the rich detail of the corresponding near-infrared (NIR) image (the high quality of NIR images in dark environments comes from an invisible near-infrared flash), which greatly improves the signal-to-noise ratio (SNR) of the restored RGB image. Under the constraints of cost, size and other factors, RGB-NIR fusion is the most promising technique for restoring the vanished textural and structural details of noisy RGB images taken in extremely low-light environments, as shown in Figure 1(a).

Figure 1: (a) and (b) are fusion examples from DVD. Compared to CU-Net (Deng and Dragotti 2020), DKN (Kim, Ponce, and Ham 2021) and Scale Map (Yan et al. 2013), our method, Dark Vision Net (DVN), effectively handles the structure inconsistency between RGB-NIR images. Regions with inconsistent structures are framed in red.

* Authors contributed equally.
† Corresponding author.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 2: Overview of the proposed DVN. In the first stage, the network predicts deep structure maps from the multi-scale feature maps of the restoration network R using the proposed Deep Structure Extraction Module (DSEM), for the noisy RGB and the NIR input respectively. In the second stage, taking advantage of the predicted deep structures, the DIP is calculated by the inconsistency function F. In the third stage, the DIP-weighted NIR structures are fused with the RGB features to obtain the final fusion result without obvious structure inconsistency.

However, existing RGB-NIR fusion algorithms suffer from structure inconsistency between RGB and NIR images, resulting in unnatural appearance and loss of key information, which limits the application of RGB-NIR fusion in low-light imaging. Figure 1 illustrates two typical examples of structure inconsistency between RGB and NIR images: Figure 1(b) shows the absence of NIR shadows in the RGB image (grass shadows only appear on the book edge in the NIR image), and the nonexistence of RGB color structure in the NIR image (the text 'complete' almost disappears on the book cover in the NIR image). Fusion algorithms need to tackle this structure inconsistency to avoid visual artifacts in output images. There are currently two categories of RGB-NIR fusion methods, i.e. traditional methods and neural-network-based methods, and modeling the structure of the paired RGB-NIR images plays an important role in both of them. Traditional methods, such as Scale Map (Yan et al. 2013), tackle the structure inconsistency problem with manually designed functions. Some neural-network-based methods (Kim, Ponce, and Ham 2021; Li et al. 2019), on the other hand, utilize deep learning techniques to automatically learn the structure inconsistency from a large amount of data. Both of them perform well under certain circumstances.

However, when confronted with extreme low-light environments, existing methods fail to maintain satisfactory performance, since the structure inconsistency is dramatically exacerbated by the massive noise in the RGB image. As shown in Figure 1(b), the dense noise in the RGB image makes it difficult for Scale Map to extract structural information, causing it to fail to distinguish which structures in the NIR image should be eliminated, and resulting in unnatural ghost images on the book edge. Deformable Kernel Networks (DKN) (Kim, Ponce, and Ham 2021) falsely weakens gradients of the input RGB image that do not exist in the corresponding NIR image, which leads to blurred letters on the book cover. Even though these structural inconsistencies between corresponding RGB and NIR images can be captured by human eyes effortlessly, they still confuse most of the existing fusion algorithms.

In this paper, we focus on improving the RGB-NIR fusion algorithm for extremely low SNR images by tackling the structure inconsistency problem. Based on the above analysis, we argue that the structure inconsistency under extremely low light can be handled well by introducing prior knowledge into deep features. To achieve this, we propose a deep RGB-NIR fusion algorithm called Dark Vision Net (DVN), which explicitly leverages the prior knowledge of structure inconsistency to guide the fusion of RGB-NIR deep features, as shown in Figure 2. With DVN, two technical novelties are introduced: (1) We find a new way, referred to as deep structures, to represent clear structure information encoded in deep features extracted by the proposed Deep Structure Extraction Module (DSEM). Even facing images with low SNR, the deep structures can still be effectively extracted and represent reliable structural information, which is critical to the introduction of prior knowledge. (2) We propose the Deep Inconsistency Prior (DIP), which indicates the differences between RGB-NIR deep structures. Integrated into the fusion of RGB-NIR deep features, the DIP empowers the network to handle the structure inconsistency. Benefiting from this, the proposed DVN can obtain high-quality low-light images.

In addition, to the best of our knowledge, there is no available benchmark dedicated to the RGB-NIR fusion task so far. The lack of benchmarks to evaluate and train fusion algorithms greatly limits the development of this field. To fill this gap, we propose a dataset named Dark Vision Dataset (DVD) as the first RGB-NIR fusion benchmark. Based on this dataset, we give qualitative and quantitative evaluations to prove the effectiveness of our method. In summary, the main contributions of this paper are as follows:
• We propose a novel RGB-NIR fusion algorithm called Dark Vision Net (DVN) with a Deep Inconsistency Prior (DIP). The DIP explicitly integrates the prior of structure inconsistency into the deep features, avoiding over-reliance on NIR features in the feature fusion. Benefiting from this, DVN can obtain high-quality low-light images without visual artifacts.
• We propose a new dataset, the Dark Vision Dataset (DVD), as the first public dataset for training and evaluating RGB-NIR fusion algorithms.
• The quantitative and qualitative results indicate that DVN is significantly better than other compared methods.
Related Work

Image Denoising. In recent years, denoising algorithms based on deep neural networks have continually emerged and overcome the drawbacks of analytical methods (Lucas et al. 2018). The image noise model has gradually been improved as well (Wei et al. 2020; Wang et al. 2020). (Mao, Shen, and Yang 2016) applied an encoder-decoder network to suppress the noise and recover high-quality images. (Zhang, Zuo, and Zhang 2018) presented a denoising network for blind noise removal. (Guo et al. 2019; Cao et al. 2021) attempted to remove the noise from real noisy images. There are also deep denoising algorithms trained without clean data supervision (Lehtinen et al. 2018; Krull, Buchholz, and Jug 2019; Huang et al. 2021). However, in extremely dark environments, fine texture details damaged by the high-intensity noise are very difficult to restore. In that case, denoising algorithms tend to generate over-smoothed outputs, which are unsatisfactory. Besides, low-light image enhancement algorithms (Chen et al. 2018; Lamba and Mitra 2021; Gu et al. 2019) try to directly restore high-quality images in terms of brightness, color, etc. However, these algorithms cannot deal with such high-intensity noise either.

RGB-NIR Fusion. To obtain high-quality low-light images, researchers (Krishnan and Fergus 2009) try to fuse NIR images with RGB images. Traditional RGB-NIR fusion algorithms include weighted least squares (Zhuo et al. 2010), Guided Image Filtering (GIF) (He, Sun, and Tang 2012), gradient preserving (Connah, Drew, and Finlayson 2015), and multi-scale decomposition (Son and Zhang 2016). Recently, (Yan et al. 2013) pointed out the gradient inconsistency between RGB-NIR image pairs and proposed Scale Map to try to solve it. Among the methods based on deep neural networks, Joint Image Filtering with Deep Convolutional Networks (DJFR) (Li et al. 2019) constructs a unified two-stream network model for image fusion, CU-Net (Deng and Dragotti 2020) combines sparse coding with Convolutional Neural Networks (CNNs), and DKN (Kim, Ponce, and Ham 2021) explicitly learns sparse and spatially-variant kernels for image filtering. (Lv et al. 2020) innovatively constructs a network that directly decouples RGB and NIR signals for 24-hour imaging. In general, the above-mentioned RGB-NIR fusion algorithms have two main problems. One is the insufficient ability to deal with RGB-NIR texture inconsistency, leading to heavy artifacts in the final fusion images. The other is the inadequate noise suppression capability, especially when dealing with high-intensity noise in extremely low-light environments. To handle the above problems, this paper proposes DarkVisionNet with a novel DIP mechanism to effectively deal with the inconsistency between RGB-NIR images.

Datasets. There is only a small amount of data that can be used for RGB-NIR fusion studies because of the difficulty of obtaining aligned RGB-NIR image pairs. Some studies (Foster et al. 2006) focus on obtaining hyperspectral datasets, from which strictly aligned RGB-NIR image pairs can be obtained by integrating hyperspectral images over the corresponding bands. (Krishnan and Fergus 2009) present a prototype camera to collect RGB-NIR image pairs. However, these datasets are too small to be used to comprehensively measure the performance of fusion algorithms. More importantly, due to the lack of data from actual scenarios, they cannot encourage follow-up researchers to focus on the valuable problems that RGB-NIR fusion will encounter in applications. To fill this gap, we collect a dataset named Dark Vision Dataset (DVD) as the first publicly available RGB-NIR fusion benchmark, which contains noise-free reference image pairs and real noisy low-light image pairs. With the noise-free reference image pairs, the proposed DVD can be used to quantitatively and qualitatively evaluate fusion algorithms. In addition, the DVD also contains real noisy low-light image pairs, which can be used to qualitatively evaluate the performance of fusion algorithms in real scenes.

Figure 3: By applying F to the edge maps of clean RGB and NIR images, the calculated inconsistency map clearly shows the structure inconsistency between RGB and NIR.

Approach

Prior Knowledge of Structure Inconsistency

As previously described, the network needs to be aware of the inconsistent regions in the two inputs. We design an intuitive function to measure the inconsistency from image features. First, binary edge maps are extracted from each feature channel. Then the inconsistency is defined as

F(edge_{C_c}, edge_N) = λ(1 − edge_{C_c})(1 − edge_N) + edge_{C_c} · edge_N,    (1)

where C_c ∈ ℝ^{H×W} and N ∈ ℝ^{H×W} denote an R/G/B channel of the clean RGB image and the NIR image, and edge_{C_c} and edge_N respectively represent the binarized edge maps of C_c and N, obtained after Sobel filtering by binarizing each map with its mean value as the threshold.

As shown in Figure 3, F(·, ·) equals 0 in regions where edge_{C_c} and edge_N show severe inconsistency. On the contrary, F(·, ·) equals 1 in regions where the structures of RGB and NIR are consistent. In other regions, F(·, ·) is set to a hyperparameter λ (0 < λ < 1), indicating that there is no significant inconsistency. Utilising the output inconsistency map of F, the inconsistent NIR structures can be easily suppressed by a direct multiplication.
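To make the computation concrete, the following is a minimal PyTorch sketch of Eq. (1): Sobel edge maps are binarized with their own mean as the threshold and then combined into the inconsistency map. Function names and the tensor layout are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn.functional as nnf

# Fixed 3x3 Sobel kernels used to estimate the gradient magnitude.
SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def binary_edge_map(x):
    """Binarized edge map of a single-channel map x of shape (B, 1, H, W):
    Sobel gradient magnitude thresholded at its own per-image mean."""
    gx = nnf.conv2d(x, SOBEL_X, padding=1)
    gy = nnf.conv2d(x, SOBEL_Y, padding=1)
    grad = torch.sqrt(gx ** 2 + gy ** 2)
    thresh = grad.mean(dim=(2, 3), keepdim=True)   # mean value as the threshold
    return (grad > thresh).float()

def inconsistency_map(edge_c, edge_n, lam=0.5):
    """Eq. (1): 1 where both maps contain an edge (consistent structure),
    0 where exactly one does (inconsistent), lam where neither has structure."""
    return lam * (1.0 - edge_c) * (1.0 - edge_n) + edge_c * edge_n
```

In this form the map is computed per R/G/B channel against the single NIR channel; multiplying the NIR edge map by it suppresses the inconsistent NIR structures, as described above.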
Extraction of Deep Structures

Even though function F subtly describes the inconsistency between RGB and NIR images, it cannot be applied directly in extremely low-light cases. As shown in Figure 4, the calculated inconsistency map contains nothing but non-informative noise when facing an extremely noisy RGB image. To avoid the influence of noise on the structure inconsistency extraction, we propose the Deep Structure Extraction Module (DSEM) and the Deep Inconsistency Prior (DIP), with which we compute the structure inconsistency in feature space. Considering that the processing flows of RGB and NIR are basically the same, we give a unified description here to keep the symbols concise.

Figure 4: Applying F to the edge maps of noisy RGB and NIR images can only produce a meaningless inconsistency map due to the heavy noise in the input RGB (as the first row shows). However, the deep structure map predicted by DSEM is very clear and the calculated DIP effectively describes the inconsistent structures (as the second row shows). See more examples in the supplementary material.

The detailed architecture of DSEM is shown in Figure 5(a). DSEM takes multi-scale features feat_i (i denotes the scale) from the restoration network R and outputs multi-scale deep structure maps struct_i. In order for DSEM to predict high-quality deep structure maps, we introduce a clear supervision signal struct^gt_i (addressed later) for DSEM, and the training loss is calculated as

L_stru = Σ_{i=1}^{3} Σ_{c=1}^{Ch_i} Dist(struct_{i,c}, struct^gt_{i,c}),    (2)

where Ch_i is the number of channels of the deep structures at the ith scale, Dist is the Dice loss (Deng et al. 2018), struct_{i,c} is the cth channel of the predicted deep structures at the ith scale, and struct^gt_{i,c} is the corresponding ground truth. The Dice loss is given by Dist(P, G) = (Σ_j p_j² + Σ_j g_j²) / (2 Σ_j p_j g_j), where p_j and g_j are the values of the jth pixel on the predicted structure map P and the ground truth G.

Supervision of DSEM. Considering that it is almost impossible for DSEM to naturally output feature maps that contain only structural information, we have to introduce a clear supervision for the output of DSEM to predict high-quality deep structure maps. The supervision signal is set up following the idea of Deep Image Prior (Ulyanov, Vedaldi, and Lempitsky 2018), and struct^gt_{i,c} is acquired from a pretrained AutoEncoder (Hinton and Salakhutdinov 2006) network AE (see the training details in the supplementary material). The base architecture of AE is exactly the same as R with the skip connections removed, as Figure 5(b) shows. Multi-scale decoder features dec_{i,c} are extracted from the pretrained AutoEncoder network AE, and the supervision signal is calculated by

struct^gt_{i,c,j} = { 0 if (∇dec_{i,c,j} − m_{∇dec_{i,c}}) ≤ 0;  1 if (∇dec_{i,c,j} − m_{∇dec_{i,c}}) > 0 },    (3)

where struct^gt_{i,c,j} is the jth pixel of struct^gt_{i,c}, ∇ represents the Sobel operator, ∇dec_{i,c,j} is the jth pixel of ∇dec_{i,c}, and m_{∇dec_{i,c}} is the global average pooling result of ∇dec_{i,c}. The supervision signal obtained by this design effectively trains the DSEM, and clear deep structure maps are predicted, as shown in Figure 4.
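The sketch below illustrates Eqs. (2) and (3) under the same assumptions as the previous snippet: the ground-truth structure maps are obtained by thresholding the Sobel gradient of the pretrained AutoEncoder's decoder features at its global average, and the Dice-style distance is accumulated over scales and channels. Names and shapes are ours, not the released implementation.

```python
import torch
import torch.nn.functional as nnf

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_magnitude(x):
    """Per-channel Sobel gradient magnitude for x of shape (B, C, H, W)."""
    b, c, h, w = x.shape
    flat = x.reshape(b * c, 1, h, w)
    g = torch.sqrt(nnf.conv2d(flat, SOBEL_X, padding=1) ** 2 +
                   nnf.conv2d(flat, SOBEL_Y, padding=1) ** 2)
    return g.reshape(b, c, h, w)

def structure_gt(dec_feat):
    """Eq. (3): binarize |∇dec| with its global average (per channel) as threshold."""
    grad = sobel_magnitude(dec_feat)
    mean = grad.mean(dim=(2, 3), keepdim=True)      # global average pooling of ∇dec
    return (grad > mean).float()

def dice_dist(pred, gt, eps=1e-6):
    """Dist(P, G) = (Σ p² + Σ g²) / (2 Σ p·g), summed over all maps in the batch."""
    num = (pred ** 2).sum(dim=(2, 3)) + (gt ** 2).sum(dim=(2, 3))
    den = 2.0 * (pred * gt).sum(dim=(2, 3)) + eps
    return (num / den).sum()

def structure_loss(pred_structs, gt_structs):
    """Eq. (2): accumulate the distance over the three scales (lists of tensors)."""
    return sum(dice_dist(p, g) for p, g in zip(pred_structs, gt_structs))
```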
Calculation of DIP and Image Fusion

The extracted deep structures contain rich structural information and are robust to noise. With struct_i of the noisy RGB and NIR images, we can apply the inconsistency function F to obtain high-quality knowledge of the structure inconsistency:

M^DIP_{i,c} = F(struct^C_{i,c}, struct^N_{i,c}),    (4)

where C ∈ ℝ^{H×W×3} and N ∈ ℝ^{H×W} denote the noisy RGB image and the NIR image, struct_{i,c} is the cth channel of the features at the ith scale, and M^DIP_{i,c} is the corresponding inconsistency measurement. Since M^DIP_{i,c} represents the structure inconsistency rather than the intensity inconsistency, we apply M^DIP_{i,c} directly to struct^N_{i,c} instead of feat^N_{i,c}:

struct̂^N_{i,c} = M^DIP_{i,c} · struct^N_{i,c}.    (5)

Under the guidance of the DIP, struct̂^N_{i,c} discards the structures that are inconsistent with RGB, thus empowering the deep features with prior knowledge to tackle structure inconsistency. As we will show in the experiments later, inconsistent structures in the NIR structure maps can be significantly suppressed after multiplication with M^DIP_{i,c}.

To further fuse the rich details contained in struct̂^N_{i,c} into the RGB features, we design a multi-scale fusion module as shown in Figure 5(c). As pointed out in (Jung, Zhou, and Feng 2020), denoising first and fusing later can improve the fusion quality. So we follow (Jung, Zhou, and Feng 2020) and reuse the denoised output of the restoration network R set up for the noisy RGB as the input of the multi-scale fusion module.
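Given the predicted deep structure maps, Eqs. (4) and (5) reduce to reusing the inconsistency function channel-wise and multiplying. A minimal sketch with our own function names:

```python
import torch

def deep_inconsistency_prior(struct_rgb, struct_nir, lam=0.5):
    """Eq. (4): apply the inconsistency function F to the deep structure maps
    of the noisy RGB and NIR inputs, per scale and per channel."""
    return lam * (1.0 - struct_rgb) * (1.0 - struct_nir) + struct_rgb * struct_nir

def dip_weighted_nir_structure(struct_rgb, struct_nir, lam=0.5):
    """Eq. (5): suppress NIR structures that are inconsistent with the RGB ones
    before they are fused with the RGB features (Figure 5(c))."""
    m_dip = deep_inconsistency_prior(struct_rgb, struct_nir, lam)
    return m_dip * struct_nir
```

The RCAB-based fusion module of Figure 5(c) itself is not sketched here.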
Loss Function

The total loss function we use is formulated as

L = L^C_rec + L^Ĉ_rec + L^N_rec + λ1 · L^C_stru + λ2 · L^N_stru,    (6)

where L^C_stru and L^N_stru are the loss functions for the RGB/NIR deep structure predictions described above. λ1 and λ2 are the corresponding coefficients, set to 1/1000 and 1/3000. L^C_rec, L^Ĉ_rec and L^N_rec represent the reconstruction losses of the fused RGB, coarse RGB and NIR images respectively. All of them are the Charbonnier loss (Charbonnier et al. 1994) of the form

L_rec = sqrt(‖X − X_gt‖² + ε²),    (7)

where X and X_gt represent the network output and the corresponding supervision. The constant ε is set to 10⁻³ empirically.
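A compact sketch of the training objective in Eqs. (6) and (7) could look as follows; we use the common per-pixel form of the Charbonnier loss and treat the two structure losses of Eq. (2) as already computed. The argument names are illustrative, not from the released code.

```python
import torch

def charbonnier(x, x_gt, eps=1e-3):
    """Eq. (7), per-pixel variant: sqrt((x - x_gt)^2 + eps^2), averaged."""
    return torch.sqrt((x - x_gt) ** 2 + eps ** 2).mean()

def total_loss(fused_rgb, coarse_rgb, nir_out, rgb_gt, nir_gt,
               rgb_struct_loss, nir_struct_loss,
               lam1=1.0 / 1000.0, lam2=1.0 / 3000.0):
    """Eq. (6): three Charbonnier reconstruction terms (fused RGB, coarse RGB, NIR)
    plus the RGB/NIR deep-structure losses of Eq. (2) weighted by λ1 and λ2."""
    rec = (charbonnier(fused_rgb, rgb_gt) +
           charbonnier(coarse_rgb, rgb_gt) +
           charbonnier(nir_out, nir_gt))
    return rec + lam1 * rgb_struct_loss + lam2 * nir_struct_loss
```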
Figure 5: (a) Illustration of the Deep Structure Extraction Module (DSEM) details. (b) The architecture of the AutoEncoder network employed to provide supervision signals for DSEM. (c) The detailed fusion process of RGB and DIP-weighted NIR features, i.e. the third stage of DVN. Residual channel attention blocks (RCABs) (Zhang et al. 2018) are used to extract features.

Figure 6: Fusion examples from DVD. The proposed DVN shows clear superiority over the other algorithms. Images are brightened for viewing convenience. See the supplementary material for more examples.

Experiment

Datasets

Data Collection. To obtain aligned RGB-NIR image pairs in the easiest and most direct way, we collect all RGB-NIR image pairs by switching an optical filter placed directly in front of a camera without an IR-cut filter, and we divide them into two types of image pairs for different usages. We collect reference image pairs in normal-light environments. In order to obtain high-quality references, multiple still captures are averaged to remove noise, and a matching algorithm (DeTone, Malisiewicz, and Rabinovich 2017) with a manual double-check is applied to ensure the alignment of image pairs. In the following experiments, we add synthetic noise to these references to quantitatively evaluate the performance of fusion algorithms. To facilitate training, the collected images are cropped into 256*256 image patches. We collect real noisy image pairs of 1920*1080 pixels in low-light environments. The post-processing steps are the same as those used in collecting reference image pairs, except that multi-frame averaging is not performed on the noisy RGB images. In the following experiments, we use these noisy image pairs to qualitatively evaluate the performance of fusion algorithms on real low-light images.
Dataset for Experiment. To make the synthetic data closer to real images, we follow (Wang et al. 2020) to add synthetic noise to the reference image pairs for training and testing. Considering that all the comparison methods use sRGB (standard Red Green Blue) images as input, we convert the collected raw data into sRGB through a simple ISP pipeline (Gray World white balance, gamma correction, demosaicing) (Karaimer and Brown 2016) to make a fair comparison. In the following experiments, we use 5k reference image pairs (256*256) as the training set. Another 1k reference image pairs (256*256) along with 10 additional real noisy image pairs (1920*1080) are used for testing.
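For reference, a minimal version of such an ISP conversion (after demosaicing) could be written as below; the gamma value and implementation details are our assumptions, since the paper only names the steps.

```python
import numpy as np

def simple_isp_to_srgb(rgb_linear, gamma=2.2):
    """Gray-World white balance followed by gamma correction on a demosaiced,
    linear RGB image in [0, 1] (shape (H, W, 3)). Demosaicing itself is omitted;
    gamma=2.2 is an assumed value."""
    channel_means = rgb_linear.reshape(-1, 3).mean(axis=0)
    gains = channel_means.mean() / np.maximum(channel_means, 1e-8)  # Gray-World gains
    balanced = np.clip(rgb_linear * gains, 0.0, 1.0)
    return balanced ** (1.0 / gamma)                                # gamma correction
```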
Implementation Details

All experiments are conducted on a device equipped with two 2080-Ti GPUs. We train the proposed DVN from scratch in an end-to-end fashion. The batch size is set to 16. Training images are randomly cropped to a size of 128*128, and the value range is [0, 1]. We augment the training data following MPRNet (Zamir et al. 2021), including random flipping and rotation. The Adam optimizer with momentum terms (0.9, 0.999) is applied for optimization. The whole network is trained for 80 epochs, and the learning rate is gradually reduced from 2e-4 to 1e-6. λ in function F is set to 0.5 for all configurations. The AutoEncoder network used to provide supervision signals for DSEM is pretrained in the same way, except that it is only trained for 5 epochs and its inputs are clean RGB and NIR images separately. See the supplementary material for more training details.
The synthesis of low-light data for training includes two steps. The first step is to reduce the average value of raw images taken under normal light to 5 (10-bit raw data). The second step is to add noise to the pseudo-dark raw images, including Gaussian noise with variance equal to σ and Poisson noise with a level proportional to σ.
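A sketch of this two-step synthesis is given below. The exact parameterization of the signal-dependent (Poisson) component is our assumption; the paper follows the noise model of (Wang et al. 2020).

```python
import numpy as np

def synthesize_low_light(raw, sigma, target_mean=5.0, white_level=1023.0, rng=None):
    """Step 1: scale a normal-light 10-bit raw image so its mean becomes target_mean.
    Step 2: add Gaussian noise with variance sigma and Poisson noise whose strength
    is proportional to sigma (assumed parameterization)."""
    rng = np.random.default_rng() if rng is None else rng
    dark = raw * (target_mean / max(raw.mean(), 1e-8))                 # pseudo-dark raw
    shot = rng.poisson(np.clip(dark, 0.0, None) / sigma) * sigma       # signal-dependent
    read = rng.normal(0.0, np.sqrt(sigma), size=raw.shape)             # variance = sigma
    return np.clip(shot + read, 0.0, white_level)
```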
Table 1: The PSNR (dB) and SSIM results of different algorithms on the DVD dataset. The best and second-best results are highlighted in bold and italic respectively.

          GIF     DJFR    DKN     Scale Map  CUNet   NBNet   MPRNet  DVN (Ours)
σ=2 PSNR  22.32   26.28   27.22   21.98      28.62   31.38   31.79   31.50
    SSIM  0.6410  0.8263  0.8902  0.6616     0.9138  0.9477  0.9504  0.9551
σ=4 PSNR  19.15   23.91   24.34   21.02      26.81   29.14   29.37   29.62
    SSIM  0.5033  0.7464  0.8427  0.6225     0.8832  0.9259  0.9276  0.9400
σ=6 PSNR  17.30   22.40   22.78   20.02      25.43   27.27   27.68   28.26
    SSIM  0.4240  0.6802  0.8067  0.5959     0.8510  0.9060  0.9083  0.9273
σ=8 PSNR  15.98   20.72   22.50   19.07      23.75   24.81   26.20   26.98
    SSIM  0.3701  0.6177  0.7799  0.5742     0.8154  0.8822  0.8908  0.9155

Table 2: Performance comparison (PSNR). The conclusions are the same if SSIM is applied as the metric.

PSNR comparison on the public IVRG dataset (σ=50, input PSNR = 13.44):
DJFR 23.35; CUNet 24.96; Scale Map 25.59; DVN (Ours) 30.43

PSNR comparison with other methods on DVD (σ=4):
SID (Chen et al. 2018) 25.26; SGN (Gu et al. 2019) 28.40; SSN (Dong et al. 2018) 13.72; DVN (Ours) 29.62

Performance Comparison

Results on DVD Benchmark. We evaluate and compare DVN with representative methods in related fields, including the single-frame noise reduction algorithms NBNet (Cheng et al. 2021) and MPRNet (Zamir et al. 2021), the joint image filtering algorithms GIF (He, Sun, and Tang 2012), DJFR (Li et al. 2019), DKN (Kim, Ponce, and Ham 2021) and CUNet (Deng and Dragotti 2020), as well as Scale Map (Yan et al. 2013), which is specially designed for RGB-NIR fusion. All methods are trained or finetuned on DVD from scratch. We use PSNR and SSIM (Wang et al. 2004) for quantitative measurement. A qualitative comparison is shown in Figure 6, and a quantitative comparison under different noise intensity settings (σ = 2, 4, 6, 8; the larger the σ, the heavier the noise) on the DVD benchmark is shown in Table 1.

The qualitative comparison in Figure 6 clearly illustrates the superiority of the proposed DVN in noise removal, detail restoration and visual artifact suppression. In contrast, the image denoising algorithms (i.e. NBNet and MPRNet) cannot restore texture details when the noise intensity becomes high, and the output turns into smeared patches even though the noise is effectively suppressed. GIF and DJFR output images with heavy noise, as the 3rd and 4th columns in Figure 6 show, which greatly affects the fusion quality. The fusion results of DKN and CUNet (5th and 6th columns in Figure 6) under mild noise (e.g. σ = 2) are acceptable. But under heavy noise, obvious color deviation appears in the DKN output, and neither of them can deal with structure inconsistency (see the 4th row in Figure 6), resulting in severe artifacts in the fusion images. Scale Map outputs images with rich details. However, it cannot reduce the noise in areas where texture is lacking in the NIR image. In addition, it is hard to achieve a balance between noise suppression and texture migration when applying Scale Map.

Generalization on Real Noisy RGB-NIR. To evaluate the performance of the algorithms when facing real low-light images, we conduct a qualitative experiment on several pairs of RGB-NIR images captured in real low-light environments. As shown in Figure 7, the outputs of DVN have obviously low noise and rich details, and are visually more natural when handling RGB-NIR pairs with real noise, even though the network is trained on a synthetic noisy dataset.

Comparison on Public Dataset. So far, there is no high-quality public RGB-NIR dataset like DVD. For example, RGB-NIR pairs in IVRG (Brown and Süsstrunk 2011) are not well aligned. Even so, we retrained DVN and other methods on IVRG and give a quantitative comparison in Table 2. It is clear that DVN still performs well.

Comparison with Low-Light Enhancement Methods. We also compare our method with low-light enhancement methods. We retrained SID (Chen et al. 2018) and SGN (Gu et al. 2019); the comparison can be seen in Table 2. It is clear that our proposed DVN still shows great superiority.

Effectiveness of DIP

In this section, we verify that the proposed DIP is effective in handling the mentioned structure inconsistency. For comparison, we retrain a baseline, which is the same as the proposed DVN but without the DIP module. As Figure 8(a) shows, the NIR shadow of the grass still remains in the fusion result without DIP, but not in the fusion result with DIP. This directly proves that DIP can handle the structure inconsistency. Figure 8(b) shows that DIP can also deal with serious structure inconsistency caused by misalignment between RGB-NIR images to a certain extent (this example pair cannot be aligned even after image registration). This has practical value because the problem of misalignment frequently occurs in applications. Taking into account the nature of DIP, the remaining artifacts are in line with expectations, since they are concentrated near the pixels with gradients in the RGB image.

In addition, Figure 8 also visualizes the deep structures of RGB, NIR and consistent NIR (DIP-weighted), as well as the DIP maps. It is obvious that even facing noisy input, the RGB deep structure still contains clear structures. The visual comparison between the NIR deep structure and the consistent NIR deep structure proves that the introduction of DIP can handle structure inconsistency in deep feature space.
Figure 7: Fusion results on RGB-NIR image pairs with real noise. DVN obviously obtains better results than other algorithms.

Figure 8: Illustration of the effectiveness of DIP. (a) shows a typical case of structure inconsistency caused by NIR shadows and (b) shows a case of misaligned RGB-NIR images. Fusion results and visualizations of deep structures verify the effectiveness of the DIP. Both examples are gathered from real noisy image pairs.

Table 3: Ablation experiments conducted on DVD to study the effectiveness of each component. σ is set to 4.

row   L^Ĉ_rec + L^N_rec   DSEM   DIP   PSNR    SSIM
1.          –               –      –    28.87   0.9356
2.          X               –      –    29.30   0.9375
3.          –               X      –    29.06   0.9376
4.          X               X      –    29.36   0.9358
5.          X               X      X    29.62   0.9400

Ablation Study

We quantitatively evaluate the effectiveness of each component of the proposed algorithm on the DVD benchmark in this section. PSNR and SSIM are reported in Table 3. The baseline network directly fuses NIR features with RGB features (row 1 in Table 3).

The intermediate supervision L^Ĉ_rec and L^N_rec effectively improves performance, as Table 3 (rows 1 and 2) shows. This indicates the necessity of enhancing the noise suppression capability of the network for clean structure extraction.

Applying DSEM to learn deep structures without DIP can improve performance as well, as Table 3 (rows 1 and 3) shows. However, since the inconsistent structures are not removed, the benefits are not obvious, even if we use intermediate supervision and DSEM simultaneously, as row 4 shows.

As Table 3 (row 5) shows, after introducing DIP to deal with the structure inconsistency, the network performance can be further improved by a large margin. This demonstrates the effectiveness of our proposed algorithm and the necessity of focusing on the structure inconsistency problem in RGB-NIR fusion.

Conclusion

In this paper, we propose a novel RGB-NIR fusion algorithm called Dark Vision Net (DVN). DVN introduces the Deep Inconsistency Prior (DIP) to integrate the structure inconsistency into the deep convolutional features, so that DVN can obtain a high-quality fusion result without visual artifacts. In addition, we also propose the first available benchmark, called the Dark Vision Dataset (DVD), for training and evaluating RGB-NIR fusion algorithms. Quantitative and qualitative results prove that DVN is significantly better than other algorithms.
References

Brown, M.; and Süsstrunk, S. 2011. Multi-spectral SIFT for scene category recognition. In CVPR 2011, 177–184. IEEE.
Cao, Y.; Wu, X.; Qi, S.; Liu, X.; Wu, Z.; and Zuo, W. 2021. Pseudo-ISP: Learning Pseudo In-camera Signal Processing Pipeline from A Color Image Denoiser. arXiv preprint arXiv:2103.10234.
Charbonnier, P.; Blanc-Feraud, L.; Aubert, G.; and Barlaud, M. 1994. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of 1st International Conference on Image Processing, volume 2, 168–172. IEEE.
Chen, C.; Chen, Q.; Xu, J.; and Koltun, V. 2018. Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3291–3300.
Cheng, S.; Wang, Y.; Huang, H.; Liu, D.; Fan, H.; and Liu, S. 2021. NBNet: Noise Basis Learning for Image Denoising with Subspace Projection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4896–4906.
Connah, D.; Drew, M. S.; and Finlayson, G. D. 2015. Spectral edge: gradient-preserving spectral mapping for image fusion. JOSA A, 32(12): 2384–2396.
Deng, R.; Shen, C.; Liu, S.; Wang, H.; and Liu, X. 2018. Learning to predict crisp boundaries. In Proceedings of the European Conference on Computer Vision (ECCV), 562–578.
Deng, X.; and Dragotti, P. L. 2020. Deep convolutional neural network for multi-modal image restoration and fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence.
DeTone, D.; Malisiewicz, T.; and Rabinovich, A. 2017. SuperPoint: Self-Supervised Interest Point Detection and Description. CoRR, abs/1712.07629.
Foster, D. H.; Amano, K.; Nascimento, S. M.; and Foster, M. J. 2006. Frequency of metamerism in natural scenes. JOSA A, 23(10): 2359–2372.
Gu, S.; Li, Y.; Gool, L. V.; and Timofte, R. 2019. Self-guided network for fast image denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2511–2520.
Guo, S.; Yan, Z.; Zhang, K.; Zuo, W.; and Zhang, L. 2019. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1712–1722.
He, K.; Sun, J.; and Tang, X. 2012. Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6): 1397–1409.
Hinton, G. E.; and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504–507.
Huang, T.; Li, S.; Jia, X.; Lu, H.; and Liu, J. 2021. Neighbor2Neighbor: Self-Supervised Denoising from Single Noisy Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14781–14790.
Jung, C.; Zhou, K.; and Feng, J. 2020. FusionNet: Multispectral fusion of RGB and NIR images using two stage convolutional neural networks. IEEE Access, 8: 23912–23919.
Karaimer, H. C.; and Brown, M. S. 2016. A software platform for manipulating the camera imaging pipeline. In European Conference on Computer Vision, 429–444. Springer.
Kim, B.; Ponce, J.; and Ham, B. 2021. Deformable kernel networks for joint image filtering. International Journal of Computer Vision, 129(2): 579–600.
Krishnan, D.; and Fergus, R. 2009. Dark flash photography. ACM Trans. Graph., 28(3): 96.
Krull, A.; Buchholz, T.-O.; and Jug, F. 2019. Noise2Void - learning denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2129–2137.
Lamba, M.; and Mitra, K. 2021. Restoring Extremely Dark Images in Real Time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3487–3497.
Lehtinen, J.; Munkberg, J.; Hasselgren, J.; Laine, S.; Karras, T.; Aittala, M.; and Aila, T. 2018. Noise2Noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189.
Li, Y.; Huang, J.-B.; Ahuja, N.; and Yang, M.-H. 2019. Joint image filtering with deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8): 1909–1923.
Lucas, A.; Iliadis, M.; Molina, R.; and Katsaggelos, A. K. 2018. Using deep neural networks for inverse problems in imaging: beyond analytical methods. IEEE Signal Processing Magazine, 35(1): 20–36.
Lv, F.; Zheng, Y.; Li, Y.; and Lu, F. 2020. An integrated enhancement solution for 24-hour colorful imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 11725–11732.
Mao, X.; Shen, C.; and Yang, Y.-B. 2016. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. Advances in Neural Information Processing Systems, 29: 2802–2810.
Son, C.-H.; and Zhang, X.-P. 2016. Layer-based approach for image pair fusion. IEEE Transactions on Image Processing, 25(6): 2866–2881.
Ulyanov, D.; Vedaldi, A.; and Lempitsky, V. 2018. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9446–9454.
Wang, Y.; Huang, H.; Xu, Q.; Liu, J.; Liu, Y.; and Wang, J. 2020. Practical deep raw image denoising on mobile devices. In European Conference on Computer Vision, 1–16. Springer.
Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4): 600–612.
Wei, K.; Fu, Y.; Yang, J.; and Huang, H. 2020. A physics-based noise formation model for extreme low-light raw denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2758–2767.
Yan, Q.; Shen, X.; Xu, L.; Zhuo, S.; Zhang, X.; Shen, L.; and Jia, J. 2013. Cross-field joint image restoration via scale map. In Proceedings of the IEEE International Conference on Computer Vision, 1537–1544.
Zamir, S. W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F. S.; Yang, M.-H.; and Shao, L. 2021. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14821–14831.
Zhang, K.; Zuo, W.; and Zhang, L. 2018. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE Transactions on Image Processing, 27(9): 4608–4622.
Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; and Fu, Y. 2018. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), 286–301.
Zhuo, S.; Zhang, X.; Miao, X.; and Sim, T. 2010. Enhancing low light images using near infrared flash images. In 2010 IEEE International Conference on Image Processing, 2537–2540. IEEE.
