1. Introduction
In modern society, image editing is becoming increasingly popular. As image editing software has become simpler, images can be edited even on a mobile phone. Social networks such as Facebook and Weibo carry many kinds of edited images, some of which have been maliciously distorted or tampered with, obscuring the truth and sometimes causing harm. Accurate detection of tampered images is therefore particularly important.
Broadly, there are two types of image manipulation: image splicing and Copy–Move Forgery (CMF). Image splicing pastes a part of one image into another image. CMF tampers with an image by pasting parts of it into other areas of the same image. Two examples of image manipulation are shown in Figure 1.
After image manipulation, post-processing operations such as Gaussian smoothing, additive Gaussian white noise, and JPEG compression are sometimes applied. These operations make the tampering look more realistic, which makes identifying tampered images more challenging.
Tampered images can be identified by image forensics algorithms, which are mainly divided into active and passive methods. Active forensics methods include watermarking [1,2] and digital signatures [3]. In active methods, images are pre-processed and imperceptibly marked in advance, so later changes to the image can be detected easily. Passive forensics algorithms deal with completely unknown images with no such prerequisites. Passive forensics is therefore more challenging and has been an active research field for more than a decade.
Passive forensics algorithms, also known as image manipulation detection methods, are studied in this paper. Recent work on image manipulation detection focuses on specific features of tampered images, such as image compression features, local noise features, and Color Filter Array (CFA) patterns. Image compression features include block artifacts [4] and Block Artifact Grids (BAG) [5] generated by JPEG compression, which are used to check image inconsistency. The features of aligned double JPEG compression [6] and Non-Aligned double JPEG (NA-JPEG) compression [7] were later introduced to detect image tampering. For local noise features, tamper traces are revealed by local noise variance models based on wavelet filtering [8] and by the kurtosis characteristics of the frequency sub-band coefficients [9] in forged images. CFA demosaicing artifacts [10] were studied as a digital fingerprint, and the presence of demosaicing artifacts can be measured even at the level of 2 × 2 blocks. These algorithms have limited applicability and low detection accuracy.
Recently, several new algorithms using deep learning were proposed to improve manipulation detection performance. A Convolutional Neural Network (CNN) model [11] was developed to detect image splicing and CMF. A blind deep learning method based on CNN [12] was used to learn invisible discriminative artifacts from manipulated images. A constrained convolution layer [13] was proposed to adaptively learn manipulation detection features. A CNN-LSTM architecture [14] was developed to detect image forgery by learning the edges between tampered and non-tampered areas. Building on the Faster Region-based Convolutional Neural Network (Faster R-CNN) [15], a two-stream image manipulation detection model [16] was developed to detect different types of tampered images. However, it has modest detection performance on CMF images. In addition, it mainly aims to detect target tampering and ignores background tampering.
In this study, building on the CNN-LSTM architecture [14] and the two-stream Faster R-CNN image manipulation detection model [16], we develop an image manipulation detection algorithm based on the Faster R-CNN model. The Laplacian of Gaussian (LoG) operator [17] applies Gaussian convolution filtering to the input image, which suppresses noise to some extent, but it may generate false edges [18]. The Prewitt operator [19] can be used to remove some of these false edges. Our algorithm therefore applies the LoG and Prewitt edge detection operators to the input images to obtain edge features, trains end-to-end, fuses the features in a bilinear pooling layer, and detects tampered images. Image tampering detection differs from object detection because either the background or a target object may be tampered with, so tampering may occur in any area of an image. Compared with the Faster R-CNN model, our algorithm adds an edge detection layer to find inconsistencies between tampered and non-tampered areas, and it replaces RoI max pooling with bilinear interpolation to fix the size of each region of interest. This avoids extracting only high-frequency information and improves image tampering detection performance. Our algorithm performs well in both Copy–Move Forgery Detection (CMFD) and image splicing detection. Specifically, the main contributions of this paper are as follows:
The edge detection layer is added onto the Faster R-CNN model to extract edge features of original input images. Original input images and the images with edge features are put into the Faster R-CNN network in parallel for end-to-end training. We observe that the performance of image manipulation detection is improved on three benchmark datasets.
We propose to remove the max pooling in the Region of Interest (RoI) pooling layer and fix the size of the RoI by bilinear interpolation, thus retaining more feature information. Experimental results show that our method is more accurate than the original max pooling approach in image manipulation detection.
2. Related Work
In the past ten years, many methods have been developed to detect low-level tamper artifacts in tampered images. Ye et al. [4] proposed a fast image manipulation detection method based on the Discrete Cosine Transform (DCT) coefficient histogram, which detects traces of forgery by checking the inconsistency of block artifacts. Li et al. [5] developed a passive detection algorithm that detects tampered JPEG images using anomalous block artifact grids. Zhu et al. [6] proposed an improved double JPEG compression detection algorithm based on a noise-free DCT coefficient mixture histogram model. Bianchi et al. [7] derived a single feature from DCT coefficient statistics to detect NA-JPEG compression, which can effectively detect tampered JPEG images. Mahdian et al. [8] modeled the local noise variance by wavelet filtering, found local noise inconsistency, and identified forged images. Lyu et al. [9] proposed a method to detect regional splicing by revealing inconsistencies in local noise levels. It exploits a regular property of the kurtosis of natural images in band-pass domains and the relationship between noise characteristics and kurtosis. Considering the traces left by CFA interpolation, Ferrara et al. [10] proposed a forensic tool to distinguish between authentic and tampered areas. These image manipulation detection algorithms only consider a limited set of features of tampered images, which hampers their detection performance.
Recently, a CNN model [11] was used to detect tampered images, which is the first example of using deep learning in image manipulation detection. Rota et al. [12] proposed a weak labeling method based on CNN, which can extract the tampered parts from a given forged image and generate a segmentation region. Bayar et al. [13] developed a constrained convolution layer, which can suppress the content of the image and learn manipulation features adaptively. Bappy et al. [14] proposed a network model based on CNN-LSTM to detect different types of image manipulation, but the model only learns the boundary difference between manipulated and non-manipulated regions and ignores other features of tampered images, which limits its detection performance. Salloum et al. [20] used the Multi-task Fully Convolutional Network (MFCN) framework to learn a ground-truth mask and the spliced region boundary in order to predict the tampered region mask for a given image, which makes tampering localization more accurate. However, the detection accuracy is modest. Zhou et al. [16] proposed a two-stream Faster R-CNN model and trained it end-to-end to detect tampered regions in given images. One of the two streams is the RGB stream, which extracts features from the RGB image to find tampering artifacts such as strong contrast differences and unnatural tampering boundaries. The other is the noise stream, which uses the noise features extracted by the Steganalysis Rich Model (SRM) filter [21] to find the noise inconsistency between the authentic and tampered areas in manipulated images. However, this model is not ideal for detecting CMF images. Nevertheless, it provides an important basis for the construction of our algorithm. In this paper, the Faster R-CNN model is combined with edge detection operators to learn the features of tampered images and to perform image manipulation detection and forgery area localization.
3. The Faster R-CNN Model Combined with Edge Detection
The Faster R-CNN model [15] mainly comprises three parts: a feature extractor, a Region Proposal Network (RPN), and RoI pooling [22]. Building on the Faster R-CNN model, our model adds an edge detection layer that extracts edge feature images, which are fed into the feature extractor in parallel with the original image to obtain feature maps. We choose ResNet101, a residual network model [23] that is deeper than the Visual Geometry Group (VGG) network [24], to learn the input image features. Unlike in object detection [15,23], the forged area of a manipulated image may be either a target region or a background region [13]. In the RoI pooling layer, the max pooling operation extracts high-frequency information in the image, which favors the detection of target tampering [22]. To overcome this limitation, our algorithm removes the max pooling and uses bilinear interpolation to adjust the size of the region of interest, which retains more tampering information.
In this paper, we develop a Faster R-CNN-based image manipulation detection model (Figure 2). The steps of our algorithm are as follows. First, in the edge detection layer, the original input images are convolved with the Prewitt and LoG operators to obtain images with edge features. These edge feature images are then sent into the Faster R-CNN model in parallel with the original tampered images. The ResNet101 network [23] extracts features from both inputs, generating feature maps of the original input images and edge feature maps. Both are sent to the RoI pooling layer to obtain the RoI features of the original input images and the RoI features of the edge feature images. After the two kinds of features are fused by the bilinear pooling layer [25], tampering classification is carried out in the fully connected layer. Finally, the features of the original input images are mapped to the RPN to locate the tampered areas [16]. The loss function of the RPN is defined as follows [15]:
$$L_{RPN}(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $i$ is the index of an anchor in a mini-batch, $p_i$ is the predicted probability that anchor $i$ lies in a tampered area, and $p_i^*$ is the ground-truth label indicating whether anchor $i$ is positive. $t_i$ and $t_i^*$ are the vectors describing the four parameterized coordinates of the bounding boxes and the ground-truth mask boxes, respectively. The classification loss $L_{cls}$ denotes the cross-entropy loss of the RPN, and the regression loss $L_{reg}$ denotes the smooth $L_1$ loss of the proposed bounding boxes. $L_{cls}$ and $L_{reg}$ are normalized by $N_{cls}$ and $N_{reg}$, respectively, and balanced by the parameter $\lambda$ ($\lambda$ is set to 10 [15]). $N_{cls}$ denotes the mini-batch size of the RPN, and $N_{reg}$ denotes the number of anchors.
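The RPN loss described above can be illustrated with a small sketch in plain Python. This is a minimal illustration of the standard Faster R-CNN formulation cited as [15], not the authors' implementation. The function names `smooth_l1` and `rpn_loss`, the argument names, and the small constant `eps` are assumptions introduced here for illustration.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty for a single coordinate difference."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def rpn_loss(probs, labels, boxes, gt_boxes, n_cls, n_reg, lam=10.0):
    """Normalized classification term plus weighted regression term.

    probs    -- predicted probability that each anchor is tampered
    labels   -- ground-truth labels (1 for positive anchors, else 0)
    boxes    -- predicted parameterized coordinates (4 values per anchor)
    gt_boxes -- ground-truth parameterized coordinates (4 values per anchor)
    lam      -- balance parameter (10 in the text, following [15])
    """
    eps = 1e-12  # assumption: guards against log(0)
    cls_sum = 0.0
    reg_sum = 0.0
    for p, p_star, t, t_star in zip(probs, labels, boxes, gt_boxes):
        # Binary cross-entropy between prediction and ground-truth label.
        cls_sum += -(p_star * math.log(p + eps)
                     + (1 - p_star) * math.log(1 - p + eps))
        # The regression term only counts for positive anchors.
        reg_sum += p_star * sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return cls_sum / n_cls + lam * reg_sum / n_reg
```

Multiplying the regression term by the ground-truth label means that negative anchors contribute only to the classification term.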
We use the cross-entropy loss for tampering classification and the smooth $L_1$ loss for bounding box regression [16]. The total loss function is described as follows:

$$L_{total} = L_{RPN} + L_{tamper}(f_{RGB}, f_{edge}) + L_{bbox}(f_{RGB})$$

where $L_{total}$ is the total loss function, $L_{RPN}$ represents the RPN loss function in the RPN, and $L_{tamper}$ represents the final cross-entropy classification loss, which is computed on the bilinear pooling feature of the original image input and the edge feature image input. $L_{bbox}$ represents the final bounding box regression loss. $f_{RGB}$ is the RoI features of the original input image, and $f_{edge}$ is the RoI features of the edge feature image.
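The fusion of the two RoI feature sets by bilinear pooling can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' layer: the function name `bilinear_pool` is hypothetical, and the signed square root and L2 normalization steps follow common practice for bilinear pooling rather than anything stated in the text.

```python
import math


def bilinear_pool(feat_a, feat_b):
    """Fuse two feature sets defined over the same spatial locations.

    feat_a -- list of locations, each a list of C_a channel values
    feat_b -- list of locations, each a list of C_b channel values
    Returns a flat vector of length C_a * C_b.
    """
    c_a, c_b = len(feat_a[0]), len(feat_b[0])
    pooled = [0.0] * (c_a * c_b)
    # Sum the outer product of the two feature vectors over all locations.
    for vec_a, vec_b in zip(feat_a, feat_b):
        for i, a in enumerate(vec_a):
            for j, b in enumerate(vec_b):
                pooled[i * c_b + j] += a * b
    # Assumption: signed square root followed by L2 normalization.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled
```

The outer product lets every channel of one input interact with every channel of the other, which is why the fused vector has C_a × C_b entries.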
3.1. Adding Edge Detection Layer
Whether the manipulation is image splicing or CMF, a tampered image shows visual inconsistency and an obvious difference at the boundary between the tampered and non-tampered areas. To exploit this, edge features of the tampered image are used to detect forgeries. The classical LoG and Prewitt edge detection operators are used to extract edge features from the original tampered images. The resulting edge feature images are then input into the Faster R-CNN model in parallel with the original tampered images for end-to-end training. Examples of edge feature images are shown in Figure 3.
The LoG edge detection operator is expressed as:

$$\mathrm{LoG}(x, y) = -\frac{1}{\pi \sigma^4} \left[ 1 - \frac{x^2 + y^2}{2\sigma^2} \right] e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

where $\sigma$ is the standard deviation. The LoG edge detection operator first applies Gaussian convolution filtering to reduce noise in the input image and then applies the Laplace operator for edge detection [17]. This procedure overcomes the influence of noise to some extent, but it may produce false edges [18]. If only the LoG operator is used to extract the edge features of tampered images, the tampering detection performance is not ideal.
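As an illustration of how a discrete LoG kernel can be obtained, the sketch below samples the standard continuous LoG function on an integer grid. The function name `log_kernel`, the kernel size, and the value of sigma are hypothetical choices, and the mean subtraction is an added assumption that forces flat image regions to give a zero response. The kernel weights actually used in the paper are not reproduced here.

```python
import math


def log_kernel(size, sigma):
    """Sample the Laplacian of Gaussian on a size x size grid (size odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = (x * x + y * y) / (2.0 * sigma * sigma)
            value = -1.0 / (math.pi * sigma ** 4) * (1.0 - r2) * math.exp(-r2)
            row.append(value)
        kernel.append(row)
    # Assumption: subtract the mean so the weights sum to zero.
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]


kernel = log_kernel(5, 1.0)  # hypothetical size and sigma
```

A larger sigma smooths more strongly before the second-derivative step, which trades noise suppression against edge localization.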
The Prewitt edge detection operator is a first-order differential operator, which can be expressed as:

$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}, \quad G_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$$

where $G_x$ and $G_y$ represent the Prewitt operator in the horizontal and vertical directions, respectively. They detect edges from the gray-level difference between the pixels on the two sides of each point, and they can also be used to remove some of the false edges [19]. There is a large difference in pixel gray level at the boundary between the authentic and tampered regions of forged images. The LoG operator and the Prewitt operator in both horizontal and vertical directions can effectively extract the edge information of tampered images. The mean average precision (mAP) values detected on the Common Objects in Context (COCO) synthetic tampering dataset are shown in Table 1.
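The behavior of the Prewitt operator on a sharp boundary can be checked with a small example. The sketch below uses the common textbook Prewitt masks (the sign convention is an assumption) and a hypothetical helper `filter2d` that performs valid-mode cross-correlation on nested lists. The test image is a synthetic vertical step edge, not data from the paper.

```python
PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]    # horizontal gradient
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]    # vertical gradient


def filter2d(image, kernel):
    """Valid-mode cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Synthetic vertical step edge: dark on the left, bright on the right.
image = [[0, 0, 0, 10, 10, 10] for _ in range(5)]
gx = filter2d(image, PREWITT_X)  # responds at the step
gy = filter2d(image, PREWITT_Y)  # stays zero: no change along rows
```

Only the horizontal mask responds here, which shows why both directions are needed to capture boundaries of arbitrary orientation.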
Three kernels are selected in the edge detection layer (increasing the number of kernels does not improve performance). Their weights are shown in Figure 4. We feed these three kernels directly into a pre-trained network with three input channels. The kernel size in the edge detection layer is defined as
. The output channel of the edge detection layer is 3.
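A three-kernel edge detection layer that maps a three-channel image to a three-channel edge image might look like the sketch below. Several things here are assumptions made only for illustration: the helper names `sampled_log` and `edge_layer`, the 3 × 3 size and sigma of the sampled LoG kernel, the use of zero padding, and the choice to apply each kernel with equal weight to all three input channels. The actual weights are those in Figure 4.

```python
import math

PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]


def sampled_log(size=3, sigma=0.6):
    """Zero-mean LoG kernel sampled on a grid (hypothetical size and sigma)."""
    half = size // 2
    k = [[-(1 - (x * x + y * y) / (2 * sigma ** 2))
          * math.exp(-(x * x + y * y) / (2 * sigma ** 2))
          / (math.pi * sigma ** 4)
          for x in range(-half, half + 1)]
         for y in range(-half, half + 1)]
    mean = sum(map(sum, k)) / size ** 2
    return [[v - mean for v in row] for row in k]


def edge_layer(image, kernels):
    """image: H x W x 3 nested lists. Returns H x W x len(kernels)."""
    h, w = len(image), len(image[0])
    out = [[[0.0] * len(kernels) for _ in range(w)] for _ in range(h)]
    for n, kernel in enumerate(kernels):
        half = len(kernel) // 2
        for r in range(h):
            for c in range(w):
                acc = 0.0
                for i, row in enumerate(kernel):
                    for j, weight in enumerate(row):
                        rr, cc = r + i - half, c + j - half
                        if 0 <= rr < h and 0 <= cc < w:  # zero padding
                            # Assumption: same weight on all 3 channels.
                            acc += weight * sum(image[rr][cc])
                out[r][c][n] = acc
    return out


flat = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(4)]
edges = edge_layer(flat, [PREWITT_X, PREWITT_Y, sampled_log()])
```

Because the output keeps the input's height, width, and channel count, it can be passed to a backbone that expects a three-channel image without changing the first layer.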
3.2. Improving RoI Pooling
The RoI pooling layer can use max pooling to convert every region of interest into a rectangular window of size $H \times W$
(
is set to be
and 16 is feat_stride in this paper), where
$H$ and $W$ are hyper-parameters. Each RoI window is represented as a four-tuple $(r, c, h, w)$ that specifies its upper-left coordinate $(r, c)$, its height $h$, and its width $w$. RoI max pooling divides the $h \times w$ RoI window into an $H \times W$ grid of sub-windows and then max-pools the values in each sub-window into the corresponding output grid cell. The size of each sub-window is approximately $h/H \times w/W$ [22].
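The RoI max pooling step described above can be sketched on a single-channel feature map as follows. The function name `roi_max_pool` and the floor/ceil rule for sub-window boundaries are assumptions chosen for illustration, and the RoI is given as a four-tuple of upper-left row, upper-left column, height, and width.

```python
import math


def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the RoI of a 2D feature map into an out_h x out_w grid."""
    r, c, h, w = roi
    pooled = []
    for i in range(out_h):
        # Row range of the i-th sub-window (never empty when h >= 1).
        r0 = r + (i * h) // out_h
        r1 = r + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = c + (j * w) // out_w
            c1 = c + math.ceil((j + 1) * w / out_w)
            # Keep only the largest value inside the sub-window.
            row.append(max(feature[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        pooled.append(row)
    return pooled


feature = [[4 * y + x for x in range(4)] for y in range(4)]  # values 0..15
pooled = roi_max_pool(feature, (0, 0, 4, 4), 2, 2)
```

Each output cell keeps one value and discards the rest of its sub-window, which is the information loss that the proposed change aims to avoid.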
The max pooling operation focuses only on the features of the target object and ignores features from the background. However, in image manipulation detection, either background tampering or target tampering can occur. Therefore, to improve detection accuracy and simplify the model, we propose to remove the max pooling and instead resize the $h \times w$ RoI area to a window of size $H \times W$ by bilinear interpolation, which is then sent to the fully connected layer for manipulation detection. The workflow is shown in Figure 5.
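Resizing an RoI to a fixed size by bilinear interpolation can be sketched as below. This is a minimal illustration rather than the authors' layer: the function name `roi_bilinear_resize` is hypothetical, and the sampling grid that maps output corners onto RoI corners is an assumed convention. The RoI is again a four-tuple of upper-left row, upper-left column, height, and width.

```python
def roi_bilinear_resize(feature, roi, out_h, out_w):
    """Resize the RoI of a 2D feature map to out_h x out_w."""
    r, c, h, w = roi
    out = []
    for i in range(out_h):
        # Assumption: output corners map onto RoI corners.
        y = r + (i * (h - 1) / (out_h - 1) if out_h > 1 else (h - 1) / 2)
        y0 = int(y)
        y1 = min(y0 + 1, r + h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = c + (j * (w - 1) / (out_w - 1) if out_w > 1 else (w - 1) / 2)
            x0 = int(x)
            x1 = min(x0 + 1, c + w - 1)
            dx = x - x0
            # Interpolate along columns, then along rows.
            top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
            bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


small = [[0.0, 10.0], [20.0, 30.0]]
resized = roi_bilinear_resize(small, (0, 0, 2, 2), 3, 3)
```

Unlike max pooling, every output value is a weighted average of its four nearest neighbors, so low-response background features still contribute to the fixed-size window.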
5. Discussion and Conclusions
In this study, we combine traditional edge detection algorithms with the Faster R-CNN model to develop a new image manipulation detection algorithm that learns rich features of tampered images for manipulation detection and localization. By using edge detection operators to extract features in the added edge detection layer, our algorithm can capture the edge inconsistency between the tampered and non-tampered areas. In the RoI pooling layer, our algorithm uses bilinear interpolation instead of max pooling to adjust the size of the region of interest, which retains more tampering information.
In this paper, we evaluate the performance of our proposed method through a series of experiments. Experiments on three benchmark datasets show that our proposed method achieves higher detection accuracy than current image manipulation detection algorithms. Under Gaussian white noise, Gaussian blurring, and JPEG compression attacks, our method still has satisfactory detection performance on the CASIA dataset. Our method is also more accurate than the RGB-N algorithm at locating tampered regions, especially for CMF.
However, because the CASIA dataset has undergone different degrees of post-processing, such as smoothing of the boundary traces between the unforged and forged regions of the tampered image, the results of our algorithm on the CASIA dataset are still not ideal. Manipulation detection for post-processed images is therefore a key direction for future research.
Our study demonstrates that traditional image manipulation detection algorithms have limited applicability and low detection accuracy. Our proposed algorithm, which combines the Faster R-CNN model with edge detection, exhibits higher detection performance than traditional image manipulation detection algorithms. Multiple features of tampered images could be fused to find more tampering clues and further improve image manipulation detection performance.