Deep Residual Network For Steganalysis of Digital Images
Abstract— Steganography detectors built as deep convolutional neural networks have firmly established themselves as superior to the previous detection paradigm – classifiers based on rich media models. Existing network architectures, however, still contain elements designed by hand, such as fixed or constrained convolutional kernels, heuristic initialization of kernels, the thresholded linear unit that mimics truncation in rich models, quantization of feature maps, and awareness of JPEG phase. In this work, we describe a deep residual architecture designed to minimize the use of heuristics and externally enforced elements that is universal in the sense that it provides state-of-the-art detection accuracy for both spatial-domain and JPEG steganography. The key part of the proposed architecture is a significantly expanded front part of the detector that “computes noise residuals” in which pooling has been disabled to prevent suppression of the stego signal. Extensive experiments show the superior performance of this network with a significant improvement, especially in the JPEG domain. Further performance boost is observed by supplying the selection channel as a second channel.
Index Terms— Steganography, steganalysis, convolutional neural network, deep residual network, selection channel, SRNet.
I. INTRODUCTION
steganography in singular cover sources that permit powerful compatibility attacks [6], [22], [37], [42], the most accurate detectors have been built using the tools of machine learning. This trend has been started by Avcibas et al. [1], [2] and Farid and Siwei [18] in early 2000’s and was later greatly improved by representing images with higher-order statistics of noise residuals or DCT coefficients [43], [49], [68]. It culminated in what is recognized today as steganalysis with rich models [9], [15], [17], [19], [30], [38], [50], [53] and scalable machine learning [13], [40], [44].
Recently, deep learning [23] has been proposed for steganalysis in an attempt to improve detection accuracy by jointly optimizing the image representation (features) as well as the classifier. Beginning with detectors that used stacked auto-encoders [52], in an early influential work by Qian et al. [45] the authors described a neural network steganalyzer with a Gaussian activation function equipped with a fixed preprocessing high-pass KV filter [39, eq. (9)] whose role was to suppress the image content and thus improve the signal-to-noise ratio between the stego signal and the host image. The authors observed that without the fixed high-pass
Authorized licensed use limited to: Xi'an University of Technology. Downloaded on September 05,2023 at 03:20:09 UTC from IEEE Xplore. Restrictions apply.
1182 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 14, NO. 5, MAY 2019
mentioned on their web site that they were working on a modified architecture.2
Detectors constructed using deep learning have also advanced the state of the art in the JPEG domain [10], [60], [64], [67]. Chen et al. [10] modified the XuNet for steganalysis of JPEG images by splitting the feature maps into 64 parallel channels to make the architecture aware of JPEG phase – the underlying grid of 8 × 8 pixels. The design mimicked the construction of the so-called JPEG-phase-aware noise residuals discovered by Holub et al. [29], [30] and later improved by using Gabor filters for noise residual extraction [50], [59] and making them aware of the selection channel [15].
A 20-layer deep network with shortcut connections [26], [27] for steganalysis of J-UNIWARD [31] has been proposed by Xu et al. [60]. This architecture, too, relied on fixed preprocessing DCT kernels in the first convolutional layer and thresholding its feature maps.
When designing the architecture proposed in this paper, our goal was a clean end-to-end design that could be used for a wider range of applications and work well for steganalysis in both spatial and JPEG domains. We let ourselves be guided by the latest advancements in deep learning and rather general principles and insights to minimize the use of externally enforced constraints or heuristics. Fixed or constrained preprocessing kernels or kernels initialized to SRM filters or DCT bases can in fact be detrimental for the overall network performance depending on the characteristics of the stego signal. High-pass filters, such as the popular KV filter, suppress a major portion of the stego signal introduced by JPEG steganography because the embedding modifications are applied to quantized DCT coefficients. This has already been observed and analyzed in [10] where the authors introduced additional fixed filters into the first convolutional layer to improve detection of JPEG steganography. Ideally, however, the best filters should also be learned rather than enforced as it is unlikely that hand-designed filters or non-random kernel initializations will be optimal for the chosen architecture.
The overall design consists of four different types of layers, two of which involve the so-called residual shortcuts that have been shown in the literature [26], [27] to improve convergence and help learn the parameters in upper layers of deep networks, which are typically the hardest to learn. Functionally, the network consists of three serially connected segments – the front segment whose role is to learn effective “noise residuals,” the middle segment that compactifies the feature maps, and the last segment, which is a simple linear classifier. The front segment consists of seven layers in which pooling [23, Ch. 9.3, pp. 330–334] has been disabled to prevent suppression of the stego signal due to averaging neighboring samples in feature maps during average pooling.
We would like to emphasize that, in its original form, we do not supply the network with the knowledge of the selection channel as we firmly believe that, for the best results, the network should become aware of the selection channel via end-to-end training. Having said this, we acknowledge that introducing the selection channel via a parallel branch in the first layer did improve the performance, which indicates a space for future improvement in the quest for a completely data-driven steganography detector.
At this point, the authors would like to point out a terminology clash between steganalysis and deep learning as the term “residual” has been firmly established in both fields but is used for two completely different entities. To prevent potential confusion, the phrase “noise residual” will be strictly used for a pixel prediction error in steganalysis while “residual layer/module/connection” will always relate to the popular residual network architecture in deep learning [26], [27].
Section II contains the description of the proposed network architecture and a discussion of our design choices. The training, which is unified in both the spatial and JPEG domains, is detailed in Section III, where we also describe the setup of all our experiments, the performance evaluation metric, as well as the list of prior art with which the proposed detector is compared. The results of experiments in the spatial and JPEG domains appear in Section IV. The performance is evaluated in terms of the minimal detection error under equal priors. We also report the detection performance on selected cases using the receiver operating characteristic curves with the false-alarm rate for true positive rates of 0.5 and 0.3. In Section V, we show that a further boost of detection accuracy can be achieved in both domains by introducing the selection channel into the network. The paper is closed in Section VI with a discussion of potential further improvements and our anticipated future effort.
II. SRNET FOR IMAGE STEGANALYSIS
The proposed network architecture is called SRNet – Steganalysis Residual Network. The word “residual” refers to both the central term used in steganalysis and residual layers with shortcut connections from deep learning [26]. The shortcut connections help propagate gradients to upper layers, which are the hardest to train because of the vanishing gradient phenomenon [21] that often negatively affects the convergence and performance of deep architectures [26], [27]. They also encourage feature reuse in the training process. We first describe the architecture of SRNet and then explain and justify each component separately, thus motivating the design.
A. Architecture
Although it is not generally possible to claim that a certain part of a network detector executes a specific task, we found it useful to view the proposed detector schematically depicted in Figure 1 as a concatenation of three segments: the front segment responsible for extracting the noise residuals, outlined in the figure by the first two shaded segments (Layers 1–7), the middle segment whose goal is to reduce the dimensionality of the feature maps, the third shaded segment and Layer 12, and the last segment, which is a standard fully connected layer followed by a softmax node [23], the linear classifier.
The input is assumed to be a grayscale 256 × 256 image.3 All convolutional layers employ 3 × 3 kernels and all non-
2 https://github.com/Steganalysis-CNN/residual-steganalysis
3 Reference [20] explains how to steganalyze images of arbitrary size with network detectors.
BOROUMAND et al.: DEEP RESIDUAL NETWORK FOR STEGANALYSIS OF DIGITAL IMAGES 1183
Fig. 1. Architecture of the proposed SRNet for steganalysis. The first two shaded boxes correspond to the segment extracting noise residuals, the dark shaded
segment and Layer 12 compactify the feature maps, while the last fully connected layer is a linear classifier. The number in the brackets is the number of
3 × 3 kernels in convolutional layers in each layer. BN stands for batch normalization.
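As a rough companion to Fig. 1 and the layer description in Section II-A, the following Python sketch traces how the feature-map dimensions evolve through the twelve layers for a 256 × 256 input. It is an illustration under stated assumptions, not the authors' implementation: the text gives the kernel counts only for Layers 1–2 (64, then 16) and Layer 12 (512), so the counts used for Layers 3–11 are assumed, as is the "same"-style padding that makes each stride-2 pooling halve the spatial size. All names (`LAYERS`, `trace_shapes`, `feature_vector_length`) are hypothetical.

```python
import math

# (layer index, layer type, number of output feature maps, pooled?)
# Layers 1-2 (64 -> 16) and Layer 12 (512) follow the text; the counts for
# Layers 3-11 are ASSUMED here for illustration only.
LAYERS = [
    (1, "Type 1", 64, False),
    (2, "Type 1", 16, False),
    *[(i, "Type 2", 16, False) for i in range(3, 8)],  # unpooled, with shortcuts
    (8, "Type 3", 16, True),     # Layers 8-11: shortcuts + 3x3 avg pooling, stride 2
    (9, "Type 3", 64, True),
    (10, "Type 3", 128, True),
    (11, "Type 3", 256, True),
    (12, "Type 4", 512, False),  # followed by global averaging of each feature map
]


def pooled_size(n, stride=2):
    """Spatial size after stride-2 pooling, ASSUMING 'same'-style padding."""
    return math.ceil(n / stride)


def trace_shapes(input_size=256):
    """Return (index, type, channels, height, width) at the output of each layer."""
    size = input_size
    shapes = []
    for index, kind, channels, pooled in LAYERS:
        if pooled:
            size = pooled_size(size)
        shapes.append((index, kind, channels, size, size))
    return shapes


def feature_vector_length(shapes):
    """Global averaging keeps one number per feature map of the last layer."""
    return shapes[-1][2]


for row in trace_shapes():
    print("Layer %2d (%s): %3d maps of %d x %d" % row)
```

Under these assumptions the spatial size stays at 256 × 256 through Layer 7 and is halved four times in Layers 8–11, which reproduces the 512 feature maps of dimension 16 × 16 that Layer 12 reduces to a 512-dimensional vector.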
linear activation functions are ReLU. Note that Layers 1–7 use unpooled feature maps on their input. Pooling in the form of 3 × 3 averaging with stride 2 is applied on the output of Layers 8–11. In Layer 12, 512 feature maps of dimension 16 × 16 are reduced to a 512-dimensional feature vector by computing statistical moments (averages) of each 16 × 16 feature map. This 512-dimensional output enters the classifier part of the network. The first two layers do not contain any residual shortcuts or pooling. Layers 3–7 have residual shortcuts and no pooling. Layers 8–11 contain both pooling and residual shortcuts.
SRNet contains two types of layers with shortcuts because unpooled layers (Type 2) require different shortcut connections than pooled layers (Type 3). The first two layers of Type 1 with 3 × 3 filters worked better for us than one layer with 5 × 5 filters. Their purpose is to begin with a larger number of kernels (64) and then decrease the number of feature maps to 16 before the unpooled layers to save on memory. The Type 4 layer is different from the last layer of Type 3 because of the global pooling applied before the fully connected classifier part.
B. Motivating the Architecture
The key part of the SRNet is the noise residual extraction segment consisting of the first seven layers. Because average pooling is a low-pass filter, it reinforces content and suppresses noise-like stego signals by averaging adjacent embedding changes. While this is desirable in typical computer vision applications for classifying content, it is detrimental for steganalysis where the signal of interest is the stego noise while the “noise” is the image content. Guided by this insight, SRNet does not use pooling until Layer 8 to avoid decreasing the energy of the stego signal and allow it to optimize the noise residual extraction process for various types of selection channels and steganographic embedding changes.
All filters in SRNet are randomly initialized and learned via an end-to-end training process. This allows the network to adapt to a greater variety of stego signals because the polarity of and dependencies among embedding changes vary significantly across different steganographic methods and especially domains. Embedding modifications introduced by spatial-domain embedding methods that minimize an additive distortion, such as WOW [28], HILL [41], S-UNIWARD [31], and MiPOD [47] are largely uncorrelated, while changes to quantized DCT coefficients in JPEG image steganography lead to a stego signal with significant energy in low and medium spatial frequencies.
The proposed architecture was formed based on results of many experiments in which we tested different allocations of resources to the three above mentioned segments so that the network can be trained with a reasonable minibatch size on
a single GPU with 12 GB of memory, examples of which are the popular Titan X and Xp, Tesla K40 and K80, and GTX 1080 Ti (11 GB). Most of the exploration focused on determining the number of layers in each segment, the number of filters in each layer, and the optimizer.
The remainder of this section is divided into subsections, each devoted to a specific design element of SRNet. The experimental results quoted here were all obtained with the setup explained in Section III on the standardized dataset BOSSbase + BOWS2 (Section III-A) with the detector accuracy reported using the minimal total detection error PE under equal priors based on the training, validation, and testing (Section III-D).
1) Activations: Besides the ReLU, we have also experimented with TanH activation, the leaky ReLU, ELU [11], and SELU [36], but they did not bring any performance gain. To avoid additional complexity and guided by simplicity, we selected ReLU for all activation functions in our network.
Note that layers of Type 2 and 3 do not use ReLU after the shortcut connections. While the original residual networks [26], [27] do include ReLU after the addition of the shortcut connections, with these activations removed, we observed a small gain of up to 1% in detection accuracy.
2) Residual Shortcuts: To assess the importance of shortcut connections in SRNet, we removed them from layers of Type 2 and 3 and observed the change in detection accuracy. For example, for HILL at 0.1 and 0.2 bpp the loss of classification accuracy was about 0.5% and for J-UNIWARD at 0.4 bpnzac, quality factor 95, the loss was 1.5%. While the performance in these cases was still competitive, the loss of detection power increased with decreased class separability, e.g., for small payloads and larger JPEG quality.
3) Dense Connections and Inception: Dense connections in deep learning were introduced with a similar goal as residual layers – to help with gradient propagation and convergence, feature reuse, and to reduce the number of parameters to learn [32]. We investigated the effect of dense connections introduced in the second segment of the SRNet – unpooled Layers 3–7. In experiments with the embedding algorithms HILL and S-UNIWARD at 0.4 bpp, the SRNet with dense connections did not provide statistically significantly better results than SRNet with residual connections (the statistical significance was assessed based on the statistical spread of detection accuracy w.r.t. the snapshot selected for the final detector). Dense connections, however, may have more impact on deeper architectures than the SRNet.
The main idea behind “inception” is that each layer concatenates the outputs of filters of different sizes, which is reminiscent of fusing multiple-resolution representations in image processing [51]. Type 3 layers in SRNet (see Figure 1) sum the outputs of what is an effective 5 × 5 filter in the main branch (in terms of the receptive field) and a 1 × 1 filter (the shortcut branch). We added an additional branch to this layer type with 3 × 3 filters followed by batch normalization. This required other changes in the architecture to fit the modified SRNet in GPU memory – we decreased the number of feature maps in Type 3 layers to one half. SRNet modified in this manner gave a slightly worse (0.5–1%) detection accuracy on both HILL and S-UNIWARD tested at 0.4 bpp. Due to limited GPU memory, a proper study of inception modules within the SRNet would require a comprehensive study that is beyond the scope of this paper.
4) Unpooled Layers: We now comment on the number of unpooled layers and their effect on detection. Decreasing their number from seven to six or five while keeping the rest of the architecture unchanged led to a small and gradual loss of accuracy. For example, for J-UNIWARD at 0.4 bpnzac (bits per non-zero AC DCT coefficient) and JPEG quality 75, the detection error PE increased from 0.0670 to 0.0701 and 0.0748 when the number of unpooled layers was changed from 7 to 5 and 4, respectively. This loss increases with decreasing payload. Also, we observed that this loss is typically smaller in the spatial domain and larger in the JPEG domain. Across the tested algorithms in both domains, the detection accuracy tends to level out at 5–6 unpooled layers. We opted for seven in our proposed design to avoid potential loss of detection for more diverse cover and stego sources.
To assess the significance of disabling pooling in Layers 1–7, we carried out additional experiments in which pooling has been progressively enabled in Layers 7, 6, 5, and 4. Note that enabling pooling in more than four layers would require removing layers from group 3 because the size of the feature maps before the output layer decreases from 16 × 16 to 8 × 8, and eventually 1 × 1 when pooling is enabled in four layers.
The experiments were executed for HILL at 0.4 bpp and J-UNIWARD at 0.4 bpnzac to cover both embedding domains. With average pooling enabled in Layers 7–4, starting with Layer 7, the detection error for HILL rapidly increased from 0.1414 (with the original SRNet) to 0.1528, 0.1823, 0.2202, and 0.3697. For J-UNIWARD, the detection error grew from 0.0670 to 0.0755, 0.0886, 0.1263, and 0.1710.
5) Number of Filters: The number of filters in the first layer has a larger impact in the JPEG domain than in the spatial domain. While the detection error, PE, for HILL at 0.4 bpp increased negligibly when using only 32 and 16 filters instead of 64 in the first layer (0.1414, 0.1432, and 0.1438 for 64, 32, and 16 filters), for J-UNIWARD at 0.1 bpnzac at JPEG quality 75, decreasing the number of filters from 64 to 32 led to an increase of PE of about 1%. Increasing the number of filters beyond 64 did not seem to lead to any improvement in detection.
6) Optimizer: Finally, we experimented with several optimizers, including the AdaDelta [66], Adam [35], Adamax [35], and a simple stochastic gradient descent [23, Ch. 8.3.1, pp. 286–288]. In the end, we settled on Adamax since it provided the most reliable and fastest convergence.
III. SETUP OF EXPERIMENTS
This section describes the common core of all experiments that appear in Sections IV and V, including the datasets and SRNet training, the list of prior art to which SRNet is to be compared, and the evaluation metric.
A. Datasets
SRNet was primarily evaluated and contrasted with prior art on the union of BOSSbase 1.01 [3] and BOWS2 [4],
each containing 10,000 grayscale images resized from their original size 512 × 512 to 256 × 256 using imresize with default setting in Matlab. For JPEG experiments, this source was additionally compressed with quality factors 75 and 95.
Randomly chosen 4,000 images from BOSSbase and the entire BOWS2 dataset were used for training with 1,000 BOSSbase images set aside for validation. The remaining 5,000 BOSSbase images were used for testing. This setup permitted a direct comparison with the current state-of-the-art spatial-domain detector, the YeNet [65]. In summary, 2 × 14,000 cover and stego images were used for training, 2 × 1,000 for validation, and 2 × 5,000 for testing. This applies to both the spatial and JPEG domain and all network detectors. JPEG images were decompressed without rounding to integers.
To test the network on a significantly larger and more realistic dataset, we performed additional experiments on ImageNet, namely its CLS-LOC version [46] containing 1,281,167 JPEG images meant to be used for training sorted into 1,000 categories (the dataset used in [60]). We selected 250 images from each category at random, subjecting each image that was larger than 256 × 256 pixels and whose JPEG quality was above 75 to the following chain of processing in Matlab: decompression to the spatial domain (imread), cropping the upper left tile of size 256 × 256, conversion to grayscale using rgb2gray, and recompression with JPEG quality factor 75. This mimics the preprocessing that was executed in [60] and [67]. In particular, the requirement to work only with JPEG images with quality larger than 75 was imposed to avoid working with images exhibiting traces of double compression (lower quality followed by larger quality) as this would introduce peaks and valleys in histograms of quantized DCT coefficients, which could be exploited for targeted attacks. The total size of this dataset was thus 2 × 250,000 cover-stego images out of which 2 × 10,000 pairs were selected for validation and 2 × 40,000 for testing.
B. SRNet Training
The SRNet has been trained in both domains with the same hyperparameters and in the same fashion. The stochastic gradient descent optimizer Adamax4 [35] was used with minibatches of 16 cover-stego pairs. The training database was shuffled after each epoch. Images in each batch were subjected to data augmentation with random mirroring and rotation of images by 90 degrees. The batch normalization parameters were learned via an exponential moving average with decay rate 0.9. The filter weights were initialized with the He initializer5 and 2 × 10^-4 L2 regularization. The filter biases were set to 0.2 with no regularization. For the fully connected classifier layer, we initialized the weights with a zero mean Gaussian with standard deviation 0.01 and no bias.
On our dataset, the training was run for 400k iterations (457 epochs) with an initial learning rate of r1 = 0.001 after which the learning rate was decreased to r2 = 0.0001 for an additional 100k iterations (114 epochs). The snapshot achieving the best validation accuracy in the last 100k iterations was taken as the result of training. This training strategy was applied for all embedding algorithms for payload 0.4 bpp/bpnzac (bits per pixel / bits per non-zero AC DCT coefficient) with the exception of J-UNIWARD at JPEG quality 95 (see the next paragraph). The detectors for all remaining payloads were built via curriculum training [5] with 50–100k iterations (57–114 epochs) with learning rate r1 and an additional 50k iterations (57 epochs) with the smaller learning rate r2. Again, the best validation snapshot in the last 50k iterations was taken as the detector. While this was applied in both the spatial and JPEG domains, we observed that in the spatial domain the same results could be obtained by curriculum training only with the smaller learning rate.
For J-UNIWARD and JPEG quality factor 95 at 0.4 bpnzac, we experienced convergence problems when training from a randomly initialized network. This was resolved by seeding the network with the detector trained for J-UNIWARD for quality factor 75 at 0.4 bpnzac, after which we trained for 400k iterations with learning rate r1 followed by 100k iterations with r2.
We tested two types of curriculum training – by seeding with the network trained for payload 0.4 bpp/bpnzac and by training in a progressive manner that is perhaps best described symbolically as 0.1←0.2←0.3←0.4→0.5. In other words, first the detectors for payload 0.3 and 0.5 were trained by seeding with the network trained for 0.4. Then, the detector for payload 0.2 was seeded with the network trained for 0.3, etc. While both types of curriculum training gave similar results in the spatial domain, the progressive training gave slightly better results in the JPEG domain.
C. Tested Prior Art
For comparison with the current state of the art on the union of BOSSbase and BOWS2, in the spatial domain SRNet was compared with YeNet [65] and on JPEG algorithms with the PNet/VNet [10] and the network recently proposed by Xu et al. [60], which we call in this paper J-XuNet to distinguish it from the network introduced in [62]. We note that when we attempted to train the YeNet on decompressed JPEGs with quality factor 75 embedded with J-UNIWARD at 0.4 bpnzac the network did not appear to converge.
To show the gain in detection accuracy w.r.t. the old detection paradigm based on the ensemble classifier and rich models, we steganalyzed all spatial-domain embedding algorithms with the maxSRMd2 [17] features non-linearly normalized using random conditioning (RC) [8]. JPEG steganography was steganalyzed with the Selection-Channel-Aware Gabor Filter Residuals [15] (SCA-GFR). The SCA-GFR features were not normalized or transformed [7], [8] because this type of features does not benefit from such preprocessing.
All prior art CNN detectors were trained as described in the corresponding papers. We observed that for the J-XuNet on 256 × 256 images, it was beneficial to decrease the learning rate by 10% every 8 epochs instead of 16 to avoid a loss of performance for small payloads. For J-UNIWARD
4 Code available from https://github.com/openai/iaf/blob/master/tf_utils/adamax.py
5 https://arxiv.org/pdf/1502.01852v1.pdf
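The epoch counts quoted in Section III-B follow from the training-set size (14,000 cover-stego pairs) and the minibatch size (16 pairs). The short Python sketch below reproduces this arithmetic and the two-stage learning-rate schedule for the 0.4 bpp/bpnzac detectors. It is only a bookkeeping illustration: the function names are hypothetical and the hard switch at iteration 400k is our reading of the text, not code from the authors.

```python
TRAIN_PAIRS = 14_000     # cover-stego pairs used for training (Section III-A)
PAIRS_PER_BATCH = 16     # minibatch size in cover-stego pairs (Section III-B)


def iterations_per_epoch(pairs=TRAIN_PAIRS, batch=PAIRS_PER_BATCH):
    """Number of minibatches needed to pass over the training set once."""
    return pairs // batch


def epochs(iterations):
    """Convert a number of training iterations into (fractional) epochs."""
    return iterations / iterations_per_epoch()


def learning_rate(iteration, r1=1e-3, r2=1e-4, switch_at=400_000):
    """Two-stage schedule: r1 for the first 400k iterations, then r2.

    The step at `switch_at` is an ASSUMED reading of the description.
    """
    return r1 if iteration < switch_at else r2


for n in (400_000, 100_000, 50_000):
    print(n, "iterations =", round(epochs(n)), "epochs")
```

With 875 iterations per epoch, 400k, 100k, and 50k iterations round to the 457, 114, and 57 epochs stated in the text.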
quality factor 95, we had to train the J-XuNet for payloads 0.1 and 0.2 bpnzac via curriculum training from 0.3 bpnzac.
Due to the size of ImageNet, we limited our experiments on this dataset to J-UNIWARD at quality factor 75 and only compared to J-XuNet and the recently proposed hybrid deep network incorporating J-XuNet as a “subnet” as described in Sec. III.E of [67] (Fig. 13a), which we abbreviate in this paper as H-Net.
All detectors were trained on exactly the same data sets as the SRNet, implemented in TensorFlow, and run on a single GPU. It takes approximately two and a half days to train the SRNet on a Titan Xp GPU. Note that we did not form ensembles of CNN detectors in this paper. Quite likely, further small improvement in detection accuracy could be obtained across all investigated network detectors by forming an ensemble either over different snapshots obtained from a single training or over independently trained networks.
D. Evaluation Metric
The detection performance was measured with the total classification error probability on the testing set under equal priors, $P_E = \min_{P_{FA}} \frac{1}{2}(P_{FA} + P_{MD})$, where PFA and PMD are the false-alarm and missed-detection probabilities. For selected cases, we show the ROC curves and an alternative measure of performance, the false-alarm rates for stego-image detection probability PD = 1 − PMD = 0.5 and 0.3.
The results reported in the next section are for one random 50/50 split of BOSSbase because it would not be computationally feasible to train all networks on multiple different splits to obtain a more statistically robust result. To assess the statistical spread across different BOSSbase splits and thus interpret the statistical significance of the improvement of SRNet w.r.t. the state of the art, we trained the SRNet on five different 50/50 BOSSbase splits (BOWS2 was always a part of the training set) for HILL at 0.3 bpp and J-UNIWARD at 0.4 bpnzac and JPEG quality 75. The standard deviation of PE across these five splits was 0.0035 and 0.0016, respectively. The statistical spread appears coincidentally comparable to what has typically been reported for detectors implemented with rich models and the ensemble classifier (see, e.g., [15], [17]).
IV. EXPERIMENTS
This section contains the results of all experiments and their interpretation divided into two subsections based on the type of the embedding domain.
A. Spatial Domain
For spatial domain steganalysis, we report the results for five payloads: 0.1–0.5 bpp (bits per pixel) for WOW [28], HILL [41], and S-UNIWARD [31]. The detection error PE is shown in Table I. Depending on the algorithm and payload, SRNet improves upon SCA-YeNet by up to 3% in PE. The biggest improvement is typically observed for larger payloads. The only exception is for WOW for the smallest tested payload 0.1 bpp when SRNet performs 1.5% worse than the SCA-YeNet. This loss of performance is due to the fact that SRNet does not make explicit use of the selection channel while YeNet benefits quite significantly by employing the selection channel for WOW (cf. columns 3 and 5 in Table VIII in [65]). In Section V, we show that this loss can be compensated by introducing the selection channel to SRNet in a similar manner as in YeNet. Finally, both network detectors clearly outperform the old steganalysis paradigm.
TABLE I
Detection Error PE for maxSRMd2 With Random Conditioning and Ensemble, SRNet, and Selection-Channel-Aware YeNet for Five Payloads in bpp and Three Spatial Domain Embedding Algorithms
ROC curves for rich-model based detectors are well known to be mean-shifted Gauss-Gauss (see, e.g., [12]) and as such do not perform well for low false alarms. In contrast, the detection statistic outputted by network detectors exhibits non-Gaussian characteristics and, as we found out, achieves significantly better performance for low rates of false alarm, a goal identified as one of the most relevant problems for practitioners in [34]. Figure 2 shows four ROC curves of SRNet for S-UNIWARD and HILL for two payloads and the false alarm rates PFA for two test powers: PD ∈ {0.3, 0.5}. For the larger payload 0.4 bpp, PD = 0.5 can be achieved with PFA = 6 × 10^-4 for S-UNIWARD and 4.2 × 10^-3 for HILL. In contrast, the low-complexity linear classifier [12] with maxSRMd2 features [17] (the last two columns in the table underneath Figure 2) exhibits 4–30 times larger (!) false alarms for the two test powers.6
B. Transfer Learning
To assess the ability of the SRNet to detect mismatched stego sources, which is a situation likely to be encountered in practice, we include the result of an investigation in which the SRNet was trained on one embedding algorithm and tested on a different one at the same payload. Table II shows that the SRNet trained on the least detectable algorithm (MiPOD) transfers the best while, when trained on the most detectable algorithm (WOW), it transfers the least. This is consistent with the results reported in [10] for the JPEG-phase-aware network in JPEG domain.
C. JPEG Domain
For the JPEG domain, J-UNIWARD [31] and UED-JC [25] for payloads 0.1–0.5 bpnzac (bits per non-zero AC DCT coef-
6 The low-complexity linear classifier was used instead of the ensemble to be able to obtain the performance measures reported in Figure 2.
TABLE III
DETECTION ERROR PE FOR SRNET AND PRIOR ART FOR FIVE PAYLOADS IN BPNZAC FOR J-UNIWARD AND UED-JC FOR QUALITY FACTORS 75 (LEFT) AND 95 (RIGHT)
Fig. 4. Detection error PE for VNet, PNet, J-XuNet, and SRNet for J-UNIWARD QF 75 (left) and 95 (right).
Fig. 5. Detection error PE for VNet, PNet, J-XuNet, and SRNet for UED-JC QF 75 (left) and 95 (right).
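As a reading aid for the PE values plotted in Figs. 4 and 5, the following minimal sketch computes the detection error under the definition that is standard in this literature, PE = min over P_FA of (P_FA + P_MD)/2 with equal priors. That this is the definition behind the figures is assumed here. The function name `detection_error_pe` and the convention that a larger score means "stego" are illustrative choices, not part of the paper's code.

```python
import numpy as np

def detection_error_pe(cover_scores, stego_scores):
    """Minimum average error P_E = min over thresholds of (P_FA + P_MD) / 2.

    Assumption: an image is declared stego when its score is >= threshold.
    """
    cover = np.asarray(cover_scores, dtype=float)
    stego = np.asarray(stego_scores, dtype=float)
    # Every observed score is a candidate threshold; +inf means "never say stego".
    thresholds = np.concatenate([np.unique(np.concatenate([cover, stego])), [np.inf]])
    best = 1.0
    for thr in thresholds:
        p_fa = np.mean(cover >= thr)  # fraction of covers flagged as stego
        p_md = np.mean(stego < thr)   # fraction of stego images missed
        best = min(best, 0.5 * (p_fa + p_md))
    return float(best)
```

A detector that separates the two classes perfectly gives PE = 0, while scores that carry no information give PE = 0.5, which is why all reported errors fall in that interval.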
that, at least for classifiers trained with rich media models, it is still beneficial to use an imprecise selection channel (e.g., because the payload size is not known) than not use it at all.
While the incorporation of the selection channel helps detection, it has always been achieved in some heuristic manner. In the so-called t-SRM proposed by Tang et al. [53], the four-dimensional SRM co-occurrence matrices of noise residuals were computed from a subset corresponding to pixels with the largest embedding change probabilities. In maxSRM [17] (and in [54]), the co-occurrences contain the accumulated maximum change rate (the sum of change rates) over all adjacent four-tuples of noise residuals. This idea was further refined in [15] and [16] where the authors showed that further improvement can be obtained by replacing the change rate as the quantity being accumulated with an
Fig. 8. The first layer in SCA-SRNet. The left branch is the main branch
while the branch on the right brings in the information about the selection
channel. Top: spatial domain, Bottom: JPEG domain.
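The two-branch first layer of Fig. 8, i.e., Eqs. (1) and (4), can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names `conv2d_same` and `sca_first_layer`, the zero padding, the absence of a bias term, and the use of cross-correlation (the usual CNN meaning of "convolution") are all choices made here. Only the 3 × 3 kernel size, the shared kernels, and the absolute value in the side branch come from the text.

```python
import numpy as np

def conv2d_same(image, kernel):
    """'Same'-size 2D cross-correlation with zero padding (assumed convention)."""
    image = np.asarray(image, dtype=float)
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    rows, cols = image.shape
    for di in range(kh):
        for dj in range(kw):
            out += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out

def sca_first_layer(x, side_info, kernels):
    """Return ReLU(W * x) + |W| * side_info for every kernel W.

    side_info is the change-rate matrix beta in the spatial domain (Eq. (1))
    or sqrt(t) in the JPEG domain (Eq. (4)).
    """
    maps = []
    for w in kernels:
        main = np.maximum(conv2d_same(x, w), 0.0)   # main branch: ReLU(W * x)
        side = conv2d_same(side_info, np.abs(w))    # side branch: |W| applied to side info
        maps.append(main + side)
    return np.stack(maps)
```

Because the side branch reuses the absolute values of the main-branch kernels, it adds no trainable parameters, and for non-negative side information its contribution is always non-negative, so it can only raise the first-layer output where embedding changes are likely.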
TABLE IV
EFFECT OF INTRODUCING THE SELECTION CHANNEL INTO SRNET (SPATIAL DOMAIN)
Fig. 6. ROC curves of SRNet for J-UNIWARD at 0.2 and 0.4 bpp for quality factors 75 and 95 together with two detection performance measures: PFA for PD = 0.5 and 0.3 for the low-complexity linear classifier with the SCA-GFR feature set.
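The second kind of performance measure quoted with the ROC curves in Fig. 6, the false-alarm rate PFA at a fixed detection rate PD (0.5 or 0.3), can be read off empirical detector scores as in the sketch below. The function name `pfa_at_pd`, the "score >= threshold means stego" rule, and the choice of the largest threshold that reaches the target PD are assumptions made for illustration.

```python
import numpy as np

def pfa_at_pd(cover_scores, stego_scores, target_pd):
    """Empirical false-alarm rate at the largest threshold reaching target_pd."""
    cover = np.asarray(cover_scores, dtype=float)
    stego = np.sort(np.asarray(stego_scores, dtype=float))[::-1]  # descending order
    # Number of stego images that must be detected to reach the target P_D.
    k = int(np.ceil(target_pd * stego.size))
    if k == 0:
        return 0.0
    thr = stego[k - 1]  # the k highest-scoring stego images satisfy score >= thr
    return float(np.mean(cover >= thr))
```

Measures of this kind focus on the low false-alarm end of the ROC curve, which PE alone does not describe.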
TABLE V
EFFECT OF INTRODUCING THE SELECTION CHANNEL INTO SRNET (JPEG DOMAIN)
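The JPEG-domain selection channel relies on the pixel-domain bound t of Eqs. (2) and (3), which the text describes as a preprocessing of the change rates that can be done outside of the network. A minimal NumPy sketch of that preprocessing follows. The function names `abs_dct_basis` and `pixel_bound_t` are hypothetical, and the sketch assumes that M and N are multiples of 8 and that the change rate of DCT mode (k, l) in block (a, b) is stored at `beta[8*a + k, 8*b + l]`.

```python
import numpy as np

def abs_dct_basis():
    """|f_ij^(k,l)| from Eq. (3), indexed as [k, l, i, j] with shape (8, 8, 8, 8)."""
    w = np.ones(8)
    w[0] = 1.0 / np.sqrt(2.0)
    idx = np.arange(8)
    # c[k, i] = w_k * cos(pi * k * (2i + 1) / 16)
    c = w[:, None] * np.cos(np.pi * idx[:, None] * (2 * idx[None, :] + 1) / 16.0)
    f = c[:, None, :, None] * c[None, :, None, :] / 4.0
    return np.abs(f)

def pixel_bound_t(beta, q):
    """Eq. (2): map DCT-domain change rates beta (M x N) to the pixel bound t (M x N).

    q is the 8 x 8 luminance quantization table.
    """
    beta = np.asarray(beta, dtype=float)
    q = np.asarray(q, dtype=float)
    m, n = beta.shape
    # blocks[a, b, k, l] = beta_kl^(a,b)
    blocks = beta.reshape(m // 8, 8, n // 8, 8).transpose(0, 2, 1, 3)
    weighted = blocks * q  # q_kl * beta_kl^(a,b), broadcast over all blocks
    # t_ij^(a,b) = sum over k, l of |f_ij^(k,l)| * q_kl * beta_kl^(a,b)
    t_blocks = np.einsum('abkl,klij->abij', weighted, abs_dct_basis())
    return t_blocks.transpose(0, 2, 1, 3).reshape(m, n)
```

As a sanity check, a single DC coefficient with change rate 0.5 and quantization step 2 spreads uniformly over its block: |f_ij^(0,0)| = 1/8 for all i, j, so every pixel of that block receives t = 0.125 and all other blocks stay at zero.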
computed as $|W^{(l)}| \star \beta$, where $\beta$ is the matrix of change rates, the selection channel.7 This bound is added to the feature maps outputted by the first layer to reinforce the output of neurons that are most affected by embedding. The rest of the SCA-SRNet architecture is identical to SRNet (see Figure 1) with the first layer shown in Figure 8. Note that the batch normalization was removed from the first layer to make sure both signals that are added are of similar scale. The kernels applied to the image in the first layer and those applied to the change rates are forced to be the same, i.e., the absolute values of the kernels are merely copied from the main network.
Formally, for the spatial domain, with the $M \times N$ matrices of pixel values $x = (x_{ij})$ and embedding change probabilities $\beta = (\beta_{ij})$, the $l$th feature map, $l = 1, \ldots, 64$, that enters the second convolutional layer in the forward pass is
$$\mathrm{ReLU}(W^{(l)} \star x) + |W^{(l)}| \star \beta, \tag{1}$$
where $W^{(l)} \in \mathbb{R}^{3 \times 3}$ is the $l$th convolutional kernel from the first layer of SRNet and $\star$ denotes the convolution. During learning, the weight vectors in the main branch of the network are copied to the selection-channel branch where the absolute value operation is applied and the network is trained as before.
For JPEG domain algorithms, the selection channel is incorporated in a similar fashion. The embedding change probabilities, however, relate to the quantized DCT coefficients rather than pixels. Thus, as the first step, we compute the impact of embedding on pixels as an upper bound $t$ on the $L_1$ embedding distortion as in Eqs. (18–19) in [15]. This bound in the $(a, b)$-th JPEG $8 \times 8$ block, $0 \le a \le M/8 - 1$, $0 \le b \le N/8 - 1$, is computed as:
$$t_{ij}^{(a,b)} = \sum_{k,l=0}^{7} |f_{ij}^{(k,l)}| \, q_{kl} \, \beta_{kl}^{(a,b)}, \quad 0 \le i, j \le 7, \tag{2}$$
where $\beta_{kl}^{(a,b)}$, $0 \le k, l \le 7$, is the change rate corresponding to DCT mode $k, l$ in the $(a, b)$-th DCT $8 \times 8$ block, $q_{kl}$ is the JPEG luminance quantization step, and
$$f_{ij}^{(k,l)} = \frac{w_k w_l}{4} \cos\frac{\pi k (2i + 1)}{16} \cos\frac{\pi l (2j + 1)}{16}, \tag{3}$$
$w_0 = 1/\sqrt{2}$ and $w_k = 1$ for $k > 0$, are the coefficients of the DCT. The computation of the matrix $t$ is a mere preprocessing of the change rates and can be done outside of the network. The bound $t$ enters the selection-channel branch in the first layer as shown in Figure 8. Finally, the $l$th feature map, $l = 1, \ldots, 64$, outputted by the first layer is thus
$$\mathrm{ReLU}(W^{(l)} \star x) + |W^{(l)}| \star \sqrt{t}, \tag{4}$$
where $x$ is the decompressed JPEG image without rounding to integers. The square root non-linearity is there to obtain the same quantity as $\delta^{1/2}_{uSA}$ from [15] (see [15, eq. (20)] and the discussion following this equation).
The SCA-SRNet was trained in the exact same fashion as the original network. The results for spatial-domain steganography are shown in Table IV. The gain is the largest for WOW as has always been observed in all prior art on SCA steganalysis because WOW is “overly content adaptive”. The gain w.r.t. the original SRNet is around 1% for payloads 0.4 and 0.5 bpp and then steadily increases to 4% for the smallest tested payload 0.1 bpp. For HILL and S-UNIWARD, the gain ranges between 1–2%.
The JPEG results appear in Table V. The absolute gain is small for UED-JC for quality factor 75 also because the detection error is already rather small even with the original SRNet. For more difficult cases, such as higher quality factors or smaller payloads, however, the SCA-SRNet gains up to 5%, which is rather significant.
7 $\beta_{ij}$ is the probability of modifying cover element $x_{ij}$. Thus, for embedding schemes that modify cover values by 1 or −1, $\beta_{ij}$ is the sum of the two probabilities of changing by 1 and −1.
VI. CONCLUSIONS
A novel convolutional neural network architecture called SRNet is proposed for steganalysis of digital images. SRNet is the first steganalysis network that is free of many externally introduced design elements previously proposed specifically for steganalysis and forensics, such as constrained kernels, initialization with heuristic kernels, thresholding, quantization, and awareness of JPEG phase. Consequently, SRNet can be trained in an end-to-end fashion from randomly initialized convolutional kernels and in the same fashion independently of the embedding domain. The front part of SRNet contains seven residual layers in which pooling has been disabled to allow the network to learn relevant “noise residuals” for different types of embedding changes in both the spatial and JPEG domains. The design of SRNet is validated experimentally on standard datasets and six steganographic algorithms. State-of-the-art detection is observed in both domains with rather significant improvements in the JPEG domain. Receiver operating characteristics for selected combinations of embedding algorithms and payloads reveal especially favorable detection performance for very low false-alarm rates, which is expected to be significant for practitioners.
While SRNet was intentionally designed to minimize the use of heuristic design elements specific to steganalysis, it still benefits from being informed about the probabilistic impact of embedding in the form of the selection channel, which points out a space for future improvements. SRNet is the first steganalysis network that makes use of the selection channel for JPEG domain steganalysis, a task that was achieved by adding a bound on L1 embedding distortion to the feature maps outputted by the first layer to reinforce the output of neurons that are most affected by embedding.
This paper opens up a direction in steganalysis that we plan to further pursue in the future. Since steganalysis detectors by definition detect inconsistencies in the noise patterns of images, they often find applications in forensics, such as for establishing the processing history of images or detecting inconsistencies within a single image to identify locally manipulated regions.
Large advancements in steganalysis need to be followed by revisiting the inner workings of steganographic methods because they are often designed from feedback provided by detectors. A lucrative possibility that has already received attention from researchers [55] is to let two competing networks design the embedding algorithm within the generative-adversarial network (GAN) [24] setup that essentially mimics the game played by the steganographer and the steganalyst. Novel steganalysis architectures, such as the SRNet, will undoubtedly find their place to further advance this direction.
All code used to produce the results in this paper, including the network configuration files and other supporting code is available from http://dde.binghamton.edu/download/.
ACKNOWLEDGMENT
The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation there on. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied of AFOSR or the U.S. Government. The authors would like to thank Prof. Jiangqun Ni for sharing code and for discussions. Special thanks go to Clement Fuji-Tsang for valuable insight and guidance during his stay at Binghamton University.
REFERENCES
[1] I. Avcibas, N. D. Memon, and B. Sankur, “Steganalysis of watermarking techniques using image quality metrics,” Proc. SPIE, vol. 4314, pp. 523–531, Jan. 2001.
[2] I. Avcibas, N. D. Memon, and B. Sankur, “Image steganalysis with binary similarity measures,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Rochester, NY, USA, vol. 3, Sep. 2002, pp. 645–648.
[3] P. Bas, T. Filler, and T. Pevný, “‘Break our steganographic system’: The ins and outs of organizing BOSS,” Proc. 13th Int. Conf. Inf. Hiding in Lecture Notes in Computer Science, vol. 6958, T. Filler, T. Pevný, A. Ker, and S. Craver, Eds. Berlin, Germany: Springer, 2011, pp. 59–70.
[4] P. Bas and T. Furon. (Jul. 2007). BOWS-2. [Online]. Available: http://bows2.ec-lille.fr
[5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proc. ICML, 2009, pp. 41–48.
[6] R. Böhme, Advanced Statistical Steganalysis. Berlin, Germany: Springer-Verlag, 2010.
[7] M. Boroumand and J. Fridrich, “Boosting steganalysis with explicit feature maps,” in Proc. 4th ACM Workshop Inf. Hiding Multimedia Secur., F. Perez-Gonzales, F. Cayre, and P. Bas, Eds. New York, NY, USA: ACM, Jun. 2016, pp. 149–157.
[8] M. Boroumand and J. Fridrich, “Non-linear feature normalization in steganalysis,” in Proc. 5th ACM Workshop Inf. Hiding Multimedia Secur., M. Stamm and M. Kirchner, Eds. New York, NY, USA: ACM, Jun. 2017.
[9] L. Chen, Y. Q. Shi, P. Sutthiwan, and X. Niu, “A novel mapping scheme for steganalysis,” in Proc. Int. Workshop Digit. Forensics Watermaking in Lecture Notes in Computer Science, vol. 7809, Y. Q. Shi, H.-J. Kim, and F. Perez-Gonzalez, Eds. Berlin, Germany: Springer, 2013, pp. 19–33.
[10] M. Chen, V. Sedighi, M. Boroumand, and J. Fridrich, “JPEG-phase-aware convolutional neural network for steganalysis of JPEG images,” in Proc. 5th ACM Workshop Inf. Hiding Multimedia Secur. (IH&MMSec), M. Stamm and M. Kirchner, Eds. New York, NY, USA: ACM, Jun. 2017.
[11] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” CoRR, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289
[12] R. Cogranne and J. Fridrich, “Modeling and extending the ensemble classifier for steganalysis of digital images using hypothesis testing theory,” IEEE Trans. Inf. Forensics Security, vol. 10, no. 12, pp. 2627–2642, Dec. 2015.
[13] R. Cogranne, V. Sedighi, T. Pevný, and J. Fridrich, “Is ensemble classifier needed for steganalysis in high-dimensional feature spaces?” in Proc. IEEE Int. Workshop Inf. Forensics Secur., Rome, Italy, Nov. 2015, pp. 1–6.
[14] R. Cogranne, C. Zitzmann, L. Fillatre, F. Retraint, I. Nikiforov, and P. Cornu, “A cover image model for reliable steganalysis,” in Proc. 13th Int. Conf. Inf. Hiding in Lecture Notes in Computer Science, T. Filler, T. Pevný, A. Ker, and S. Craver, Eds. Prague, Czech Republic, May 2011, pp. 178–192.
[15] T. D. Denemark, M. Boroumand, and J. Fridrich, “Steganalysis features for content-adaptive JPEG steganography,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 8, pp. 1736–1746, Aug. 2016.
[16] T. Denemark, J. Fridrich, and P. Comesaña-Alfaro, “Improving selection-channel-aware steganalysis features,” in Proc. IS&T Electron. Imag., Media Watermarking, Secur., Forensics, A. Alattar and N. D. Memon, Eds. San Francisco, CA, USA: Soc. Imag. Sci. Technol., Feb. 2016, pp. 1–8.
[17] T. Denemark, V. Sedighi, V. Holub, R. Cogranne, and J. Fridrich, “Selection-channel-aware rich model for steganalysis of digital images,” in Proc. IEEE Int. Workshop Inf. Forensics Secur., Atlanta, GA, USA, Dec. 2014, pp. 48–53.
[18] S. Lyu and H. Farid, “Detecting hidden messages using higher-order statistics and support vector machines,” in Proc. 5th Int. Workshop Inf. Hiding, vol. 2578. 2002, pp. 340–354.
[19] J. Fridrich and J. Kodovský, “Rich models for steganalysis of digital images,” IEEE Trans. Inf. Forensics Security, vol. 7, no. 3, pp. 868–882, Jun. 2011.
[20] C. Fuji-Tsang and J. Fridrich, “Steganalyzing images of arbitrary size with CNNs,” in Proc. IS&T Electron. Imag., Media Watermarking, Secur., Forensics, A. Alattar and N. D. Memon, Eds. Burlingame, CA, USA: Soc. Imag. Sci. Technol., Jan./Feb. 2018, pp. 121-1–121-8.
[21] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.
[22] M. Goljan and J. Fridrich, “CFA-aware features for steganalysis of color images,” Proc. SPIE, vol. 9409, p. 94090V, Feb. 2015.
[23] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, vol. 1. Cambridge, MA, USA: MIT Press, 2016.
[24] I. Goodfellow et al., “Generative adversarial nets,” in Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger, Eds. Red Hook, NY, USA: Curran Associates, Inc., 2014, pp. 2672–2680.
[25] L. Guo, J. Ni, and Y. Q. Shi, “Uniform embedding for efficient JPEG steganography,” IEEE Trans. Inf. Forensics Security, vol. 9, no. 5, pp. 814–825, May 2014.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Computer Vision – ECCV, vol. 9908, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland: Springer, 2016.
[28] V. Holub and J. Fridrich, “Designing steganographic distortion using directional filters,” in Proc. 4th IEEE Int. Workshop Inf. Forensics Secur., Tenerife, Spain, Dec. 2012, pp. 234–239.
[29] V. Holub and J. Fridrich, “Low-complexity features for JPEG steganalysis using undecimated DCT,” IEEE Trans. Inf. Forensics Security, vol. 10, no. 2, pp. 219–228, Feb. 2015.
[30] V. Holub and J. Fridrich, “Phase-aware projection model for steganalysis of JPEG images,” Proc. SPIE, vol. 9409, p. 94090T, Feb. 2015.
[31] V. Holub, J. Fridrich, and T. Denemark, “Universal distortion function for steganography in an arbitrary domain,” EURASIP J. Inf. Secur., vol. 2014, p. 1, Dec. 2014.
[32] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Honolulu, HI, USA, Jul. 2017, pp. 2261–2269.
[33] S. Ioffe and C. Szegedy. (2015). “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” [Online]. Available: https://arxiv.org/abs/1502.03167
[34] A. D. Ker et al., “Moving steganography and steganalysis from the laboratory into the real world,” in Proc. 1st ACM IH&MMSec Workshop, W. Puech, M. Chaumont, J. Dittmann, and P. Campisi, Eds. New York, NY, USA: ACM, Jun. 2013.
[35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, Dec. 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[36] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” CoRR, Jun. 2017. [Online]. Available: http://arxiv.org/abs/1706.02515
[37] J. Kodovský and J. Fridrich, “JPEG-compatibility steganalysis using block-histogram of recompression artifacts,” in Information Hiding (Lecture Notes in Computer Science), vol. 7692, M. Kirchner and D. Ghosal, Eds. Berlin, Germany: Springer, 2013.
[38] J. Kodovský and J. Fridrich, “Steganalysis of JPEG images using rich models,” Proc. SPIE, vol. 8303, p. 83030A, Jan. 2012.
[39] J. Kodovský, J. Fridrich, and V. Holub, “On dangers of overtraining steganography to incomplete cover model,” in Proc. 13th ACM Multimedia Secur. Workshop, J. Dittmann, S. Craver, and C. Heitzenrater, Eds. New York, NY, USA: ACM, Sep. 2011, pp. 69–76.
[40] J. Kodovský, J. Fridrich, and V. Holub, “Ensemble classifiers for steganalysis of digital media,” IEEE Trans. Inf. Forensics Security, vol. 7, no. 2, pp. 432–444, Apr. 2012.
[41] B. Li, M. Wang, and J. Huang, “A new cost function for spatial image steganography,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Paris, France, Oct. 2014, pp. 4206–4210.
[42] W. Luo, Y. Wang, and J. Huang, “Security analysis on spatial ±1 steganography for JPEG decompressed images,” IEEE Signal Process. Lett., vol. 18, no. 1, pp. 39–42, Jan. 2011.
[43] T. Pevný, P. Bas, and J. Fridrich, “Steganalysis by subtractive pixel adjacency matrix,” in Proc. 11th ACM Multimedia Secur. Workshop, J. Dittmann, S. Craver, and J. Fridrich, Eds. New York, NY, USA: ACM, Sep. 2009, pp. 75–84.
[44] T. Pevný and A. D. Ker, “Towards dependable steganalysis,” Proc. SPIE, vol. 9409, pp. 1501–1514, Feb. 2015.
[45] Y. Qian, J. Dong, W. Wang, and T. Tan, “Deep learning for steganalysis via convolutional neural networks,” Proc. SPIE, vol. 9409, p. 94090J, Feb. 2015.
[46] O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[47] V. Sedighi, R. Cogranne, and J. Fridrich, “Content-adaptive steganography by minimizing statistical detectability,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 2, pp. 221–234, Feb. 2016.
[48] V. Sedighi and J. Fridrich, “Effect of imprecise knowledge of the selection channel on steganalysis,” in Proc. 3rd ACM IH&MMSec Workshop, Portland, OR, USA, Jun. 2015, pp. 33–42.
[49] Y. Q. Shi, C. Chen, and W. Chen, “A Markov process based approach to effective attacking JPEG steganography,” in Proc. 8th Int. Workshop Inf. Hiding, vol. 4437. 2006, pp. 249–264.
[50] X. Song, F. Liu, C. Yang, X. Luo, and Y. Zhang, “Steganalysis of adaptive JPEG steganography using 2D Gabor filters,” in Proc. 3rd ACM Workshop Inf. Hiding Multimedia Secur. (IH&MMSec), Portland, OR, USA, Jun. 2015, pp. 15–23.
[51] C. Szegedy et al., “Going deeper with convolutions,” CoRR, abs/1409.4842, Sep. 2014.
[52] S. Tan and B. Li, “Stacked convolutional auto-encoders for steganalysis of digital images,” in Proc. Asia–Pacific Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA), Dec. 2014, pp. 1–4.
[53] W. Tang, H. Li, W. Luo, and J. Huang, “Adaptive steganalysis against WOW embedding algorithm,” in Proc. 2nd ACM Workshop Inf. Hiding Multimedia Secur., S. Katzenbeisser, R. Kwitt, and A. Piva, Eds. New York, NY, USA: ACM, Jun. 2014, pp. 91–96.
[54] W. Tang, H. Li, W. Luo, and J. Huang, “Adaptive steganalysis based on embedding probabilities of pixels,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 4, pp. 734–745, Apr. 2016.
[55] W. Tang, S. Tan, B. Li, and J. Huang, “Automatic steganographic distortion learning using a generative adversarial network,” IEEE Signal Process. Lett., vol. 24, no. 10, pp. 1547–1551, Oct. 2017.
[56] T. Thai, R. Cogranne, and F. Retraint, “Statistical model of quantized DCT coefficients: Application in the steganalysis of Jsteg algorithm,” IEEE Trans. Image Process., vol. 23, no. 5, pp. 1–14, May 2014.
[57] T. H. Thai, R. Cogranne, and F. Retraint, “Optimal detection of OutGuess using an accurate model of DCT coefficients,” in Proc. 6th IEEE Int. Workshop Inf. Forensics Secur., Atlanta, GA, USA, Dec. 2014, pp. 179–184.
[58] S. Wu, S.-H. Zhong, and Y. Liu, “Steganalysis via deep residual network,” in Proc. IEEE 22nd Int. Conf. Parallel Distrib. Syst. (ICPADS), Wuhan, China, Dec. 2016, pp. 1233–1236.
[59] C. Xia, Q. Guan, X. Zhao, Z. Xu, and Y. Ma, “Improving GFR steganalysis features by using Gabor symmetry and weighted histograms,” in Proc. 5th ACM Workshop Inf. Hiding Multimedia Secur., Philadelphia, PA, USA, M. Stamm and M. Kirchner, Eds. Jun. 2017, pp. 55–66.
[60] G. Xu, “Deep convolutional neural network to detect J-UNIWARD,” in Proc. 5th ACM Workshop Inf. Hiding Multimedia Secur., Philadelphia, PA, USA, M. Stamm and M. Kirchner, Eds. Jun. 2017, pp. 67–73.
[61] G. Xu, H.-Z. Wu, and Y. Q. Shi, “Ensemble of CNNs for steganalysis: An empirical study,” in Proc. 4th ACM Workshop Inf. Hiding Multimedia Secur. (IH&MMSec), F. Perez-Gonzales, F. Cayre, and P. Bas, Eds. New York, NY, USA: ACM, Jun. 2016.
[62] G. Xu, H.-Z. Wu, and Y.-Q. Shi, “Structural design of convolutional neural networks for steganalysis,” IEEE Signal Process. Lett., vol. 23, no. 5, pp. 708–712, May 2016.
[63] J. Yang, K. Liu, X. Kang, E. Wong, and Y. Shi, “Steganalysis based on awareness of selection-channel and deep learning,” in Proc. Int. Workshop Digit. Forensics Watermarking, vol. 10431, 2017, pp. 263–272.
[64] J. Yang, Y.-Q. Shi, E. K. Wong, and X. Kang, “JPEG steganalysis based on densenet,” CoRR, abs/1711.09335, Apr. 2017.
[65] J. Ye, J. Ni, and Y. Yi, “Deep learning hierarchical representations for image steganalysis,” IEEE Trans. Inf. Forensics Security, vol. 12, no. 11, pp. 2545–2557, Nov. 2017.
[66] M. D. Zeiler. (Dec. 2012). “ADADELTA: An adaptive learning rate method.” [Online]. Available: https://arxiv.org/abs/1212.5701
[67] J. Zeng, S. Tan, B. Li, and J. Huang, “Large-scale JPEG image steganalysis using hybrid deep-learning framework,” IEEE Trans. Inf. Forensics Security, vol. 13, no. 5, pp. 1200–1214, May 2018.
[68] D. Zou, Y. Q. Shi, W. Su, and G. Xuan, “Steganalysis based on Markov model of thresholded prediction-error image,” in Proc. IEEE Int. Conf. Multimedia Expo, Toronto, ON, Canada, Jul. 2006, pp. 1365–1368.
Mehdi Boroumand received the B.S. degree in electrical engineering from the K. N. Toosi University of Technology, Iran, in 2004, and the M.S. degree in electrical engineering from the Sahand University of Technology, Iran, in 2007. He is currently pursuing the Ph.D. degree in electrical engineering with Binghamton University, The State University of New York. His areas of research interest include digital image steganalysis and steganography, digital image forensics, image processing and computer vision, and machine learning.
Mo Chen received the B.S. and M.S. degrees in electrical engineering from Shandong University, China, in 1998 and 2001, respectively, and the Ph.D. degree in electrical engineering from Binghamton University, The State University of New York, in 2006. From 2006 to 2007, he was a Post-Doctoral Research Associate at the Department of Electrical and Computer Engineering, Binghamton University. Since 2009, he has been an Adjunct Research Scientist at SUNY Research Foundation. Since 2007, he has been a Principle Processing Engineer at JADAK Technologies, Inc., a Novanta Company, responsible for designing the inspecting and tracking machine vision OEM systems for the healthcare automation and vitro diagnostic applications. His research interests include machine vision and machine learning, digital image and video processing, and digital forensics.
Jessica Fridrich (F’16) received the Ph.D. degree in systems science from Binghamton University in 1995, and the M.S. degree in applied mathematics from Czech Technical University in Prague in 1987. She is currently a Distinguished Professor of electrical and computer engineering at Binghamton University. Her main interests are in steganography, steganalysis, digital watermarking, and digital image forensics. Her research work has been generously supported by the U.S. Air Force, NSF, and AFOSR. Since 1995, she has been a recipient of 23 research grants, totaling over $11 mil for projects on data embedding, digital forensics, and steganalysis that lead to 200 papers and seven U.S. patents. She is a member of ACM.