Research Article
Adaptive Feature Analysis in Target Detection and Image Forensics
Based on the Dual-Flow Layer CNN Model
Nannan Liang,1,2 Haifeng Xu,1 WanLi Zhang,1 and Lin Cui1
1 School of Informatics and Engineering, Suzhou University, Suzhou 234000, China
2 Key Laboratory of Mine Water Resource Utilization of Anhui Higher Education Institutes, Suzhou 234000, China
Received 31 May 2022; Revised 26 July 2022; Accepted 9 August 2022; Published 28 August 2022
Copyright © 2022 Nannan Liang et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
With the rapid development of artificial intelligence technology, image editing has evolved from manual modification with software such as Photoshop and GIMP to intelligent, automated tampering of images. Editing, falsifying, and disseminating digital images have become simple and easy, leading to a crisis of confidence in digital images and reducing their reliability as judicial evidence. Therefore, how to identify falsified images, improve the trustworthiness of digital images, and avoid judicial injustice has become a problem that must be overcome in the information age. In this paper, we propose a method for target detection and adaptive feature analysis in image forensics based on a dual-flow layer CNN model, which can effectively perform image forensics. The results show that our algorithm has a clear theoretical basis, low computational complexity, and high detection accuracy.
minds, causing an uproar in society and subverting the basic common sense that seeing is believing.
Digital photo image tampering forensics is a kind of digital image forensic technology that relies on computer technology to judge whether the original content of photo images taken by digital cameras is still preserved after transmission and dissemination. To forensically examine the authenticity of photo images, we should first understand the means of photo image tampering. The more common tampering methods include copy-paste within the same image, splicing of different images, image retouching, and image enhancement.
1.1. Copy-and-Paste Tampering with the Same Image. A part of the image is pasted onto other parts of the same image, as in Figure 1. The original photo image is on the left, and the tampered image on the right is created by copying the lawn in the figure and covering the person.
Heterogeneous image stitching tampering: a part of one image is stitched into another image, so that two or more images are combined. Stitching tampering between different images has the following characteristics: (1) the tampering trace is not visually noticeable; (2) some statistical characteristics of the image are changed by the tampering behavior. As shown in Figure 2, the yellow flowers in the original image 2(a) are stitched into the original image 2(b) to obtain the tampered image 2(c).
1.2. Image Retouching. This is an image restoration operation commonly used in artistic photos to make the people in the photos more beautiful; it is also commonly used after copy-paste or splicing tampering to eliminate the edge traces of tampering. In the original image of Figure 3(a), the human face has obvious spots, while the face in the tampered Figure 3(b) becomes smooth and more beautiful after retouching.
1.3. Image Enhancement. This is an operation that blurs or highlights information somewhere in an image. This type of tampering is usually achieved by changing the hue or contrast of a certain part of the image. In Figure 4, adjusting the contrast and hue of the original image 4(a) blurs a large amount of detailed information and produces the tampered image 4(b).
Incidents of photo tampering such as those mentioned above have emerged, seriously affecting the public's correct judgment of things. At present, the negative impact of photo tampering and the crisis of confidence caused by it are worrying. If doctored photos are used in news reports, they may distort the facts and mislead the public, which may intensify social conflicts; if doctored photos are used as evidence in court, they may lead to false cases, obstruct justice, and allow criminals who should be punished to escape from the net of justice; if doctored photos are used in insurance claims, they may cause unnecessary economic disputes; in international relations, the use of doctored photo images may lead to political turmoil, diplomatic discord, and even military conflict, among other extremely serious consequences.
2. Related Work
The traditional target detection methods are Viola-Jones, HOG + SVM, and DPM. Among them, Viola-Jones uses integral image features and the AdaBoost method. The HOG + SVM method detects pedestrians as targets. It first extracts HOG features from the candidate regions of the target and then uses an SVM classifier for the classification decision. DPM is a variant of HOG feature detection that adds additional strategies. The DPM method is the most effective and best performing of all traditional target detection methods. Its advantages are an intuitive and simple method, fast computing speed, and adaptation to deformation (for example, of animals). It has been verified by a large number of scholars that its detection accuracy, generalization ability, and detection speed are better than those of the other traditional methods.
The two-stage target detection model first extracts features using a convolutional neural network (CNN), then proposes candidate regions, and finally marks the target box locations and classifies the marked target boxes. The most typical representative is the RCNN series of networks. The one-stage target detection model is a regression model that directly regresses the position of the target frame without generating candidate frames in the middle of the network, converting the target frame location problem into a regression problem. The most representative one is the YOLO series of networks. In detectors that rely on preset prior frames, the large number of prior frames increases the computation and memory usage. For targets with extremely large aspect ratios in the scene, presetting a priori frames is not only time-consuming but also prone to false detections. Different data sets require different target detection models, so different a priori frames need to be set, resulting in reduced model generalization capability.
Photo image tampering forensics has emerged in the last decade. Despite the short development time, photo image tampering forensic technology has made great progress with the continuous development and improvement of image processing, pattern recognition, artificial intelligence, and other related theories, and with the continuous research of relevant experts and scholars. According to current research results, the photo image tampering forensic process is briefly summarized in Figure 5.
Since the tampered part of a tampered photo image differs from the untampered real part in certain types of features, such features can be extracted from each part of the photo image to be tested during forensics, and then the extracted features can be classified to arrive at a verdict on the authenticity of the photo image. According to the different features extracted, this paper divides photo image tampering forensics into two categories: tampering forensics based on image content features and tampering
Mobile Information Systems 3
Figure 1: Copy-paste forgery within the same image. (a) Original image. (b) Tampered image.
Figure 2: Splice forgery between different images. (a) Original image 1. (b) Original image 2. (c) Tampered image.
Figure 3: Image blur forgery. (a) Original image. (b) Tampered image.
forensics based on imaging features, which are briefly introduced below.
will be destroyed, and the researcher can make a decision on the authenticity of the image by detecting the changes of these content features.
Figure 4: Image enhancement tampering. (a) Original image. (b) Tampered image.
the spatial and transform domains are some basic features of images and an important means to study the essential properties of images.
The literature [1] proposes a copy-paste forensic algorithm based on DCT coefficients. It adopts a sliding-window blocking strategy for the image to be tested, calculates the DCT coefficients of each image block, quantizes the obtained DCT coefficients to construct feature vectors, and then sorts all feature vectors lexicographically. If there are similar or identical image blocks in the image to be tested, their corresponding feature vectors will be close to each other after sorting, and the similar blocks can be identified by calculating the displacement vector, which achieves tampering detection. On this basis, the dimensionality of the quantized DCT coefficients is reduced in the literature [2]. [3] improved the blocking strategy by using circular blocks, followed by the construction of DCT coefficient feature vectors. The literature [4] proposed to construct feature vectors by calculating the difference matrix of the DCT coefficient matrix of an image and then to classify them using SVM; the improved method achieved average recognition rates of 97.92% and 91.2% on the stitched image libraries CASIAv1.0 and CASIAv2.0, respectively. However, such methods did not achieve tampering localization. The literature [5] argues that tampered images are compressed and saved again, causing changes in the DCT coefficients, whereby the histogram difference of the DCT coefficients and the double quantization mapping relationship are each used to detect stitched tampered images and to localize the tampered regions. To further improve the forensic effect, [6] suggested extracting LBP features in the DCT domain for detection. Considering that the tampered images may be contaminated by Gaussian blur filtering and Gaussian white noise, the DCT algorithm is improved in the literature [7].
In [8], wavelet transform-based image forensic algorithms are proposed, which extract features from the subband information of the wavelet transform of the image to be detected and match them. For example, the wavelet decomposition has low-frequency and high-frequency subbands, and the copy-paste block is detected by comparing, block by block, the correlation of the Zernike moments at the corresponding positions of the two subbands. Detecting the tampered region by comparing the similarity on the high-frequency subband after wavelet transform is proposed in the literature [9]. [10] proposed to extract LBP features in the low-frequency subband to identify tampering.
2.1.2. Key Point-Based Forensic Techniques. For same-image tampering, there are two or more identical or similar regions in an image, and key points are extracted for the whole image. Since the key point characteristics of the identical or similar regions are close, the tampered region can be located by correlation matching of all key points. Based on this principle, a detection method based on Harris point detection is proposed in the literature [11], which has better
robustness to posttampering compression. The literature [12] uses Harris points combined with the mean value of the circular neighborhood as feature points, which can handle copy-and-paste operations in visually flat structural regions. The literature [13] extracts image feature points with the Harris operator and uses a new forensic feature matching method to improve detection accuracy and efficiency.
In the literature [14], the SIFT feature points of the image to be tested are extracted, the feature points are matched using the G2NN matching criterion, and then the matched key points are clustered to determine the copied and pasted regions, but the detection effect depends heavily on the clustering results. The literature [15] suggests J-linkage clustering, but the algorithm is not accurate in locating pasted blocks after rotation and scaling operations, and the detection efficiency is not ideal. To further improve the robustness of the algorithm, it has been proposed to extract SIFT features from the image under test after wavelet transform to reduce noise interference.
SURF key points have also been proposed for detection; however, with SURF key points alone the localization of the pasted blocks is not ideal. To overcome this drawback, a combination of the SURF algorithm and the SIFT algorithm is used to extract key points, achieving precise localization while improving the efficiency of the algorithm. The literature [16] combines both SURF and HOG features, and the experimental results are significant.
2.1.3. Forensic Techniques Based on Light Consistency Features. In photographs, there is generally a relatively fixed illumination environment (e.g., sun, interior lighting), which makes the illumination intensity and direction consistent in the photograph. For stitched blocks from other photographs, the illumination will not be consistent with the illumination of the real region in the tampered image. Farid's team proposes an image recognition model under 3D light sources, which uses a spherical harmonic model to estimate the direction of the light sources of objects in the photographs and then detects tampering based on the consistency of the direction. Another approach checks whether the direction of the shadow region caused by the light is consistent with the direction of the light; it is robust to multiple tampered targets. Since the above methods are subject to relatively strict assumptions that limit their practical applicability, [17] has been experimentally shown to reduce the light direction estimation error and to be more applicable.
2.1.4. Forensic Techniques Based on Texture Features. Texture is an important feature to describe and distinguish different objects. It is difficult to keep the texture features of the tampered block consistent with the original image, so tampering inevitably destroys the periodicity, directionality, and randomness of the original image texture, which provides another possible method for stitching tampering forensics. By lexicographically sorting the Tamura texture features of each image block and then calculating the feature similarity based on the Euclidean distance, the forged image regions can be detected and located. The literature [18] proposes an LBP-based texture feature description method with some robustness.
2.2. Tampering Forensic Techniques Based on Imaging Features. The general imaging model of a digital camera is shown in Figure 6. First, an optical filter removes colored light other than red, blue, and green, after which the color information of each position is recorded by the color filter array, and the light signal is converted into an electrical signal by the sensor. The CFA (Color Filter Array) interpolation algorithm is then applied to each pixel. The signal is then processed by a series of digital image processing techniques such as white balance and gamma correction and finally compressed according to certain rules to obtain the final digital photo image.
The analysis of the camera imaging model shows that in the process of digital photo image generation, after a series of hardware processing and software operations, some imaging features such as CFA interpolation noise, pattern noise, and compression noise are inevitably introduced. Because different brands and models of cameras use different hardware and software processing methods, the photo images they take carry only the imaging features of that camera, so such features can be used for forensics to detect tampered and forged photo images.
2.2.1. Tampering Forensic Techniques Based on CFA Interpolation Noise. Current photo images usually use a single sensor in the generation process, and each pixel can record only one color; the other two colors are obtained by interpolating the surrounding pixels, which introduces correlation between neighboring pixels. Since different cameras may not use the same interpolation method, the correlation between pixels will differ and can be used to detect tampering. The literature [19] locates splicing tampering by re-applying CFA interpolation to images to reconstruct their pixel neighborhood consistency, detects whether there is a splicing operation by analyzing the distribution of color difference images at high frequencies, extracts CFA features using Gaussian filtering based on posterior probability estimation of the CFA interpolation noise, and achieves forensic purposes by classifying the features. The literature [20] considers the spectral correlation introduced by CFA interpolation and identifies the image authenticity based on this property.
2.2.2. Tampering Forensics Based on Camera Response Function. The process of generating photos of natural scenes through a series of hardware and software operations inside the camera can be described by the Camera Response Function (CRF). Each camera is an independent individual and its corresponding function is not the same, so the authenticity of the image can be identified by comparing the consistency of the CRFs in each region of the image. The literature [21] estimated the CRF of each region by the geometric invariance of the pixels in each region of the image and then used crossover for statistical classification, achieving a
Figure 6: Digital camera imaging model. Panel labels: scene, footage, optical filters, color filter array (CFA), sensor, CFA interpolation, white balance, gamma correction and other post-processing, data compression, digital photo images.
detection rate of 87% for stitched images. Based on this, the differential invariants of the images were calculated to estimate the CRF. The literature [22] used a maximum posterior probability model to estimate the normality of the CRF to discriminate the authenticity of the photographs.
2.2.3. Tampering Forensic Techniques Based on Compression Characteristics. Photo images are usually saved in a certain compression format during the generation process, and images are usually compressed one or more times again after tampering, so the differences between individual image blocks after compression can be detected to identify tampering. In the literature [23], an iterative method is proposed to estimate the original quantization table of an image to determine the approximate tampered region, and then the estimated original quantization table is used to perform another JPEG compression on the tampered region to precisely locate the tampering according to the difference in pixel values before and after compression. In the literature [24], the similarity of the synthetic images before and after compression is obtained by estimating the quantization factor to identify the location of tampering. Compressing the image again produces a double quantization effect on the DCT coefficients of the real region, so it has been suggested that tampering be identified from the change in the DCT coefficients of different regions before and after compression. The literature [25] argues that images produce uniform quantization noise after JPEG compression and that tampered blocks corrupt this property, and it proposes a quantization noise estimation model to detect the differences between image blocks. Considering that JPEG compression produces a grid effect, the presence of unaligned grids in the image can be detected to identify the tampered locations.
2.2.4. Pattern Noise-Based Tampering Forensic Technique. Pattern noise is caused by the imperfection of the camera sensor and the inconsistency of the materials used, which results in the imperfect conversion of light signals into electrical signals, and it is stable in every picture taken by the camera. Since the sensor of each camera is unique, its pattern noise is also unique; in addition, each pixel on the sensor is different, so the pattern noise behaves differently at each pixel. Based on these two characteristics, the pattern noise can be regarded as a camera fingerprint and applied to photo image tampering forensics, which can be generalized to a variety of tampering operations such as copy-paste of the same image and stitching of different images.
In addition to the above features, Markov features, Fourier-Mellin transform features, image quality features, and color features are often used for photo image tampering detection.
3. Methods
Figure 7 presents a generalized framework for digital image source forensics under the CNN model theory. In the image preprocessing, the image to be detected is first cut into image blocks (Pk in Figure 7(a) indicates the kth image block). The image fingerprint characterizing the capture source is then extracted using a CNN, and the detection result Yk of each image block is output (Yk in Figure 7(c) indicates the label that the feature extractor predicts for the kth image block). Finally, the majority voting algorithm is used to fuse the detection results of the image blocks and output the image-level prediction result, i.e., device model multiclass identification.
It was found that the FPN feature fusion algorithm improved the detection of small targets but did not improve the detection of large targets, and there was information redundancy after feature fusion. Since then, researchers have proposed some variants of FPN, such as PANet and Libra-RCNN, which are built on the assumption that the weights are the same when the features of two layers are fused, ignoring the fact that features in different layers contribute differently. Therefore, this section proposes a new feature pyramid named dual-stream feature CNN (DS-CNN), which uses learnable weights and skip (jump) connections.
As shown in Figure 8, FPN is a simple top-down, one-way information flow, while PANet adds a bottom-up information flow to FPN to enhance higher-level semantic information for semantic segmentation, which is better than FPN but more computationally intensive. Libra-RCNN collects feature information from each layer and then refines the output to the feature layers, following the idea of fusion before segmentation. NAS-FPN adopts the idea of AutoML and uses search for feature fusion. ASFF learns the weight contribution value of each layer, but it is a fully connected method with high computation that requires higher-performance computing equipment, which is not convenient for practical applications.
As shown in Figure 9, BiFPN is a feature fusion algorithm proposed in the EfficientDet network, in which a jump connection approach and a weighted fusion approach are used for feature fusion, taking into account both efficiency and accuracy. Its calculation equations are shown in (1) and (2).
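Equations (1) and (2) are not reproduced in this text. As a minimal sketch, assuming they follow the fast normalized fusion used by BiFPN in EfficientDet, O = sum_i w_i * I_i / (eps + sum_j w_j) with each w_i kept non-negative, the weighted fusion of same-sized feature maps could look as follows. The function name, the nested-list feature maps, and the value of eps are illustrative assumptions; resizing and the convolution applied after fusion are omitted.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped 2D feature maps with non-negative, normalized weights.

    Assumed form: O = sum_i(w_i * I_i) / (eps + sum_j w_j), with w_i >= 0.
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per feature map")
    # ReLU keeps each learnable weight non-negative before normalization.
    w = [max(0.0, float(wi)) for wi in weights]
    total = sum(w) + eps  # eps guards against division by zero
    rows, cols = len(features[0]), len(features[0][0])
    return [
        [sum(wi * f[r][c] for wi, f in zip(w, features)) / total
         for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    p_in = [[1.0, 1.0], [1.0, 1.0]]   # hypothetical input feature map
    p_up = [[3.0, 3.0], [3.0, 3.0]]   # hypothetical upsampled feature map
    print(fast_normalized_fusion([p_in, p_up], [1.0, 1.0]))
```

Compared with a softmax over the weights, this normalization avoids exponentials, which is the efficiency argument usually given for it.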
Figure 7: Digital image source framework based on CNN; (a) image preprocessing; (b) image feature extraction; (c) classification result
voting.
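The block-then-vote pipeline of Figure 7 can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names are hypothetical, blocks are taken as non-overlapping squares, the per-block labels are assumed to come from some CNN classifier that is not shown, and ties are broken in favor of the label that appears first.

```python
from collections import Counter


def block_origins(height, width, size):
    """Top-left corners of non-overlapping size x size blocks (step (a))."""
    return [(r, c)
            for r in range(0, height - size + 1, size)
            for c in range(0, width - size + 1, size)]


def majority_vote(block_labels):
    """Fuse per-block predictions Y_k into one image-level label (step (c))."""
    if not block_labels:
        raise ValueError("no block predictions to fuse")
    counts = Counter(block_labels)
    top = max(counts.values())
    # Tie-break: return the first label in input order that reaches the top count.
    for label in block_labels:
        if counts[label] == top:
            return label


if __name__ == "__main__":
    print(block_origins(4, 6, 2))
    # Hypothetical per-block camera-model predictions.
    print(majority_vote(["camera_A", "camera_B", "camera_A", "camera_C"]))
```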
Figure 8: Feature network design diagram; (a) FPN; (b) PANet; (c) LibraRCNN; (d) NAS-FPN; (e) ASFF.
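The adaptive fusion of C3, C4, and C5 into P5 that this section introduces in Eqs. (6) and (7) amounts to a per-location softmax over three learned values followed by a weighted sum. The sketch below illustrates that idea under several assumptions: the three maps are already resized to the P5 resolution, a single channel is shown as a nested list, the lambda maps are given rather than learned, and the convolution applied to the sum in Eq. (6) is omitted. All function and argument names are hypothetical.

```python
import math


def softmax3(a, b, c):
    """Softmax over three scalars, shifted by the maximum for numerical stability."""
    m = max(a, b, c)
    ea, eb, ec = math.exp(a - m), math.exp(b - m), math.exp(c - m)
    s = ea + eb + ec
    return ea / s, eb / s, ec / s


def adaptive_fuse(x3, x4, x5, lam3, lam4, lam5):
    """Per-location weighted sum alpha*x3 + beta*x4 + gamma*x5 (cf. Eq. (6)).

    alpha, beta, gamma are the softmax of lam3, lam4, lam5 at each (i, j)
    (cf. Eq. (7)), so they are positive and sum to 1 at every location.
    """
    rows, cols = len(x3), len(x3[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            alpha, beta, gamma = softmax3(lam3[i][j], lam4[i][j], lam5[i][j])
            out[i][j] = alpha * x3[i][j] + beta * x4[i][j] + gamma * x5[i][j]
    return out


if __name__ == "__main__":
    # Equal lambdas give equal weights, i.e. a plain average of the three maps.
    print(adaptive_fuse([[3.0]], [[6.0]], [[9.0]], [[0.0]], [[0.0]], [[0.0]]))
```

Because the weights are computed per location, the network can favor one source layer in some regions of the map and another layer elsewhere.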
We propose a new dual-stream feature CNN (DS-CNN) for the FCOS structure with fused area-ness. It is mainly an improvement on the BiFPN algorithm, following its jump connection and weighted fusion, while improving the FCOS model structure based on the fused area-ness in Chapter 3. The general structure is shown in Figure 10.
As shown in Figure 10, the C3, C4, C5 layers are obtained from ResNeXt101 and then convolved by 1 ∗ 1 for dimensionality reduction to obtain the 256-dimensional P3, P4, P5 feature maps, which is convenient for feature fusion. P6 and P7 are the feature maps obtained by downsampling P5 and P6, respectively.
The dual-stream feature CNN (DS-CNN) can enhance the semantic information of each prediction layer. However, the analysis of the information inflow at each node shows that the information inflow and outflow at P5 are unbalanced. From Figure 10, it can be seen that layer P5 has only one information input, from layer C5, but three information outflows (the arrows indicate the information inflow and outflow of P5). Whether sufficient information can be obtained has an important impact on the subsequent feature fusion of nodes P6 and P7. Therefore, in this thesis, the information of node P5 is enhanced based on the dual-stream pyramid, and the information of layers C3 and C4 is also fused directly into layer P5. However, layers C3, C4, C5 contribute differently to the P5 feature layer, so it is necessary to learn different weights for layers C3, C4, C5 before feature fusion. In this paper, we name this feature fusion method Adaptive Dual Streaming Feature CNN (A-DS-CNN), and the specific structure is shown in Figure 11.
As in Figure 11, layers C3 and C4 increase the information inflow in layer P7 by means of adaptive feature fusion. The adaptive weights are calculated as in (6) and (7):
$$y_{ij} = \mathrm{conv}\left(\alpha^{3}_{ij} \ast x^{3}_{ij} + \beta^{4}_{ij} \ast x^{4}_{ij} + \gamma^{5}_{ij} \ast x^{5}_{ij}\right), \qquad (6)$$
$$\begin{aligned}
\alpha^{3}_{ij} &= \frac{e^{\lambda^{3}_{\alpha_{ij}}}}{e^{\lambda^{3}_{\alpha_{ij}}} + e^{\lambda^{4}_{\beta_{ij}}} + e^{\lambda^{5}_{\gamma_{ij}}}}, \\
\beta^{4}_{ij} &= \frac{e^{\lambda^{4}_{\beta_{ij}}}}{e^{\lambda^{3}_{\alpha_{ij}}} + e^{\lambda^{4}_{\beta_{ij}}} + e^{\lambda^{5}_{\gamma_{ij}}}}, \\
\gamma^{5}_{ij} &= \frac{e^{\lambda^{5}_{\gamma_{ij}}}}{e^{\lambda^{3}_{\alpha_{ij}}} + e^{\lambda^{4}_{\beta_{ij}}} + e^{\lambda^{5}_{\gamma_{ij}}}},
\end{aligned} \qquad (7)$$
where $\alpha^{3}_{ij}, \beta^{4}_{ij}, \gamma^{5}_{ij}$ denote the weights of C3, C4, C5, respectively, $i, j$ denote the location coordinates on the feature map, $x^{3}_{ij}, x^{4}_{ij}, x^{5}_{ij}$ denote the values at location $(i, j)$ in layers C3, C4, C5, and $y_{ij}$ denotes the final output value at location $(i, j)$ in P5.
The network structure of the final target detection algorithm in this thesis is shown in Figure 12.
To improve the basic convolutional network, the Backbone layer adopts the ResNeXt101 structure and uses 64 paths, each with a width of 4, to reduce the computational effort. For better feature extraction, the channel dimension of the C3, C4, C5 layers is first converted to 256 and then input to A-DFN for feature fusion. The adaptive feature fusion into P5 is performed for layers C3, C4, C5, where the weights are normalized by Softmax, in order to provide more information for the subsequent feature fusion without sacrificing accuracy. The subsequent weight normalization used for weighted feature fusion across layers P3, P4, P5, P6, P7 is fast normalization, which aims to make the network run faster by slightly sacrificing accuracy while ensuring performance. The Head layer is divided into a shared part and a branch part, and since area-ness is more closely related to location, it is placed in the same branch as regression to help the model perform better target box position regression.
4. Experiments
The COCO dataset, known as Microsoft Common Objects in Context, is a dataset collected by the Microsoft team for target recognition, target segmentation, and target detection competitions. The schematic diagram of the COCO dataset is shown in Figure 13. The COCO dataset is available in a 2014 version and a 2017 version. The version used in this thesis is the 2017 version, which contains 80 target categories, 118,287 training images totaling 19.3 GB, and 5,000 validation images totaling 1,814.7 MB, so these two sets of the 2017 version contain 123,287 images.
From Figure 14, it can be seen that the COCO dataset has more categories than the PASCAL VOC dataset, and the number of instances corresponding to each category is also higher. Therefore, target detection on the COCO dataset is more difficult, and the dataset can better represent the performance of a target detection model.
As shown in Figure 15, the area of most targets in the COCO dataset is only about 6% of the image size; 41.43% of all targets appearing in the COCO training dataset are small targets, 34.4% are medium targets, and 24.2% are large targets. The analysis of the COCO data shows that small and medium targets account for a larger proportion, so this dataset is more concerned with the detection of small and medium targets.
We mainly use the COCO dataset and its standard division. During training, the image data are first preprocessed and resized to match the input size of the target detection model, then input to the target detection model, which is trained for an appropriate number of iterations to obtain the final detection results. Finally, the test-dev results are submitted to the COCO Detection Challenge competition to obtain AP values and AR values.
In order to better compare the anchor-based and anchor-free methods, this thesis uses a unified benchmark, i.e., the COCO dataset, and a unified evaluation criterion, the MAP value (MAP values are equivalent to AP values in COCO data), and compares the MAP values for large, medium, and small targets and for different IOU thresholds.
According to Table 1, CenterNet511 and CornerNet511 take longer to test one image under the same conditions,
indicating that CenterNet511 predicts slower because CenterNet511 predicts one more center point than CornerNet511, which brings more computation. Both CenterNet511 and CornerNet511 use the hourglass network as the backbone layer, which is computationally intensive and has a slow computing speed. In contrast, the ResNeXt-
Figure 14: Comparison of the number of instances between the COCO dataset and the PASCAL VOC dataset.
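The small, medium, and large target shares reported for COCO in this section rest on an area-based split. The sketch below assumes the thresholds of the COCO evaluation convention (areas of 32^2 and 96^2 pixels); the function names, the sample areas, and the handling of values that fall exactly on a threshold are illustrative choices, not taken from the paper.

```python
SMALL_MAX_AREA = 32 ** 2    # assumed COCO convention: small if area < 32^2
MEDIUM_MAX_AREA = 96 ** 2   # assumed COCO convention: medium if area < 96^2


def size_bucket(area):
    """Classify an object by its pixel area into small, medium, or large."""
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def bucket_shares(areas):
    """Fraction of objects falling in each size bucket."""
    if not areas:
        raise ValueError("no object areas given")
    counts = {"small": 0, "medium": 0, "large": 0}
    for area in areas:
        counts[size_bucket(area)] += 1
    return {name: n / len(areas) for name, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical object areas in pixels.
    print(bucket_shares([100, 100, 5000, 10000]))
```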
Figure 15: Example image size and percentage of the COCO, PASCAL VOC, ImageNet, and SUN datasets (horizontal axis: percent of image size).
Table 1: Comparison of FCOS and CenterNet511, CornerNet511 prediction speed.
Method          Backbone        Testing time/image
CenterNet511    Hourglass52     270 ms
CornerNet511    Hourglass104    300 ms
CenterNet511    Hourglass104    340 ms
FCOS            ResNeXt-101     112 ms
(Note: 511 means the input image size is 511 ∗ 511).
101 prediction used by FCOS is fast and has a better performance in terms of mean average precision, so this thesis uses the FCOS algorithm as the base algorithm for improvement.
Table 2
Method         Backbone
FCOS w/ FPN    ResNet-101           41.6  60.6  45.1  24.3  44.9  51.5
FCOS w/ FPN    ResNeXt-32x8d-101    42.6  62.3  46.2  26.1  45.5  52.5
FCOS w/ FPN    ResNeXt-64x4d-101    44.8  64.1  48.5  27.5  47.4  55.7
(Note: 32 ∗ 8d means 32 paths, the width of each path is 8).
According to Table 2, the MAP value can reach 44.8 when FCOS adopts ResNeXt-64x4d-101-FPN (i.e., 64 paths, each with a width of 4) as the backbone, which is about 3 and 2 points higher than that of ResNet-101 and ResNeXt-32x8d-101, respectively. Therefore, ResNeXt-64x4d-101 is used as the backbone of the target detection model in this thesis.
According to Table 3, the mean average precision of target detection using center-ness is 0.2 points lower than that of the area-ness designed in this paper when the feature fusion method is FPN. With the DS-CNN fusion method, the MAP value of area-ness is 0.4 points higher than that of center-ness. It means
that the area-ness designed in this paper is better than the center-ness of the original FCOS.
As shown in Figure 16, the algorithm of this paper accurately detects even a compact orange placed in the fruit tray or an empty water bottle placed by the bed, indicating that the prediction layer acquires a sufficient number of features. When the target frames of people and Frisbees overlap, the algorithm still predicts each of them.

Figure 16: Visualization results of partial object detection by the algorithm proposed in this paper.

The authenticity detection results for each test image can be divided into two categories: tampered and true. To evaluate the performance, TPR and FAR are computed as

$$\mathrm{TPR} = \frac{TN}{FP + TN}, \qquad \mathrm{FAR} = \frac{FN}{FN + TN}. \tag{8}$$

Authenticity detection results: tampering detection experiments are performed on 500 real images and 500 tampered images from the image library given in Table 4, using the traditional fixed-threshold sliding window method based on correlation coefficients and the proposed SPCE-based adaptive threshold nonoverlapping chunk matching + ZNCC algorithm, respectively.
The methods based on fixed thresholds give different detection results at different thresholds, so four of the more desirable thresholds, 0.006, 0.01, 0.014, and 0.03, were selected through experiments for comparison. To evaluate the detection results objectively, the pattern noise in both types of algorithms is obtained by wavelet noise reduction and then processed with ZM + WF. In the calculation of TPR and FAR, if the tampering localization result for an image contains fewer than 20 pixels, the image is judged to be real; otherwise, it is considered tampered. The detection results of the two algorithms are shown in Table 4.
The proposed adaptive thresholding algorithm has a TPR of 98.9% and an FAR of 1.896% on the 1000 test images, while the fixed thresholding algorithm gives different detection results at different threshold values. At 0.01, 0.014, and 0.03, its TPR is similar or equal to that of the proposed algorithm, but its FAR is much higher. At 0.006, its FAR is similar to that of the proposed algorithm, but its TPR is much lower. Meanwhile, the average detection time of the two algorithms on 1000 images is given in Table 4, and the comparison shows that the proposed algorithm effectively reduces false alarms while maintaining a high detection rate and detection efficiency.
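The image-level decision rule and equation (8) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `is_tampered` and `rates` are hypothetical, and because this section does not define which class counts as "negative", the sketch takes the counts TN, FP, and FN as inputs and applies the formulas exactly as printed.

```python
from typing import Tuple

# From the text: a localization result with fewer than 20 pixels means "real".
PIXEL_LIMIT = 20


def is_tampered(flagged_pixels: int, pixel_limit: int = PIXEL_LIMIT) -> bool:
    """Image-level decision from the size of the tampering localization result."""
    return flagged_pixels >= pixel_limit


def rates(tn: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (TPR, FAR) computed exactly as printed in equation (8)."""
    tpr = tn / (fp + tn)
    far = fn / (fn + tn)
    return tpr, far


if __name__ == "__main__":
    # Hypothetical pixel counts for three localization maps.
    for pixels in (0, 19, 350):
        print(pixels, "tampered" if is_tampered(pixels) else "real")
```

Keeping the decision rule separate from the rate computation makes it easy to rerun the comparison with a different pixel limit without touching the metric code.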
Table 6: The comparison between the proposed method and G.R. Sheng’s method.
                                                  Tampering ratio selection
P-value                    Untampered image (%)   0.9 (%)   0.8 (%)   0.7 (%)   0.6 (%)
G.R. Sheng                 67.86                  64.27     85.72     83.92     87.49
Algorithm of this paper    69.25                  77.98     90.488    86.89     89.31
(Note: The detection result of this algorithm is the average correct rate using three thresholds of 0.7, 0.8, and 0.9).
Compared with the traditional fixed-threshold judgment method, the adaptive threshold judgment method selects a suitable threshold value based on the texture complexity of the image block to be tested, thus realizing “specific analysis of specific problems.”
In Figure 17, the first to fourth columns show the original image, the tampered image, and the tampered location, respectively. The second column gives five tampered images ordered from simple to complex texture, where the texture complexity of the first local block (blue sky image) is k ∈ [0.1857, 0.2886]; that of the second local block (wall image) is k ∈ [0.3288, 0.4372]; that of the third local block (floor image) is k ∈ [0.3511, 0.5296]; that of the fourth local block (green grass image) is k ∈ [0.6601, 0.8442]; and that of the fifth local block (dead grass image) is k ∈ [0.6927, 0.9463].
The localization results of the proposed adaptive thresholding algorithm for the five tampered inspection images show that the algorithm localizes the tampering whether the texture of the tampered image is simple or complex, which effectively eliminates the influence of texture on forensics (see Tables 5, 6).
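The idea of choosing a threshold from the texture complexity k of a block can be sketched as a simple bin lookup. This is only an illustration under stated assumptions: the section does not give the actual mapping from k to a threshold, so the function name `pick_threshold`, the bin edges, and the direction of the mapping below are placeholders rather than values from the paper. The candidate thresholds reuse three of the fixed thresholds quoted earlier purely as example numbers.

```python
from bisect import bisect_right
from typing import Sequence


def pick_threshold(k: float, edges: Sequence[float],
                   thresholds: Sequence[float]) -> float:
    """Return the threshold of the texture-complexity bin that contains k.

    edges must be sorted ascending; thresholds needs len(edges) + 1 entries,
    one per bin.
    """
    if len(thresholds) != len(edges) + 1:
        raise ValueError("need exactly one more threshold than bin edges")
    return thresholds[bisect_right(edges, k)]


# Placeholder bins for illustration only; they are NOT taken from the paper.
EDGES = [0.3, 0.6]
THRESHOLDS = [0.03, 0.014, 0.006]

if __name__ == "__main__":
    # Midpoints of the texture ranges reported for the sky, wall, and grass blocks.
    for k in (0.24, 0.38, 0.75):
        print(k, pick_threshold(k, EDGES, THRESHOLDS))
```

A lookup table keeps the rule easy to inspect; a continuous function of k would serve equally well if the true mapping is smooth.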