TIFS2012-SRM
Rich Models for Steganalysis of Digital Images


Jessica Fridrich, Member, IEEE and Jan Kodovský

Abstract—We describe a novel general strategy for building steganography detectors for digital images. The process starts with assembling a rich model of the noise component as a union of many diverse submodels formed by joint distributions of neighboring samples from quantized image noise residuals obtained using linear and non-linear high-pass filters. In contrast to previous approaches, we make the model assembly a part of the training process driven by samples drawn from the corresponding cover- and stego-sources. Ensemble classifiers are used to assemble the model as well as the final steganalyzer due to their low computational complexity and ability to efficiently work with high-dimensional feature spaces and large training sets. We demonstrate the proposed framework on three steganographic algorithms designed to hide messages in images represented in the spatial domain: HUGO, the edge-adaptive algorithm by Luo et al. [32], and optimally-coded ternary ±1 embedding. For each algorithm, we apply a simple submodel-selection technique to increase the detection accuracy per model dimensionality and show how the detection saturates with increasing complexity of the rich model. By observing the differences between how different submodels engage in detection, an interesting interplay between the embedding and detection is revealed. Steganalysis built around rich image models combined with ensemble classifiers is a promising direction towards automatizing steganalysis for a wide spectrum of steganographic schemes.

The work on this paper was supported by the Air Force Office of Scientific Research under the research grant FA9550-09-1-0147. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of AFOSR or the U.S. Government.

The authors would like to thank Vojtěch Holub for useful discussions.

The authors are with the Department of Electrical and Computer Engineering, Binghamton University, NY, 13902, USA. Email: [email protected], [email protected].

Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-[email protected].

I. Introduction

Modern feature-based steganalysis starts with adopting an image model (a low-dimensional representation) within which steganalyzers are built using machine learning tools. The model is usually determined not only by the characteristics of the cover source but also by the effects of embedding [4], [6], [14], [18], [21], [33], [35], [38]. For example, the SPAM feature vector [33], which was seemingly proposed from a pure cover model, is in reality also driven by a specific case of steganography: the choice of the order of the Markov process as well as the threshold T and the local predictor were "tuned" by observing the detection performance on ±1 embedding. Had the authors used HUGO [34] instead of ±1 embedding, the SPAM model might have looked quite different. In particular, in light of the recent work [17], [16], [19], the predictor would probably have employed higher-order pixel differences.

In this paper, we propose a general methodology for steganalysis of digital images based on the concept of a rich model consisting of a large number of diverse submodels. The submodels consider various types of relationships among neighboring samples of noise residuals obtained by linear and non-linear filters with compact supports. The rich model is assembled as part of the training process and is driven by the available examples of cover and stego images. Our design was inspired by the recent methods developed for attacking HUGO [17], [16], [19]. The key element of these attacks is a complex model consisting of multiple submodels, each capturing slightly different embedding artifacts. Here, we bring this philosophy to the next level by designing the submodels in a more systematic and exhaustive manner, and we let the training data select a combination of submodels that achieves a good trade-off between model dimensionality and detection accuracy.

Since our approach requires fast machine learning, we use the ensemble classifier as described in [30], [28] due to its low computational complexity and ability to efficiently work with high-dimensional features and large training data sets. The rich model is assembled by selecting each submodel based on its detection error estimate in the form of the out-of-bag estimate calculated from the training set. The final steganalyzer for each stego method is constructed again as an ensemble classifier.

Besides the obvious goal to improve upon the state of the art in steganalysis, the proposed approach can be viewed as a step towards automatizing steganalysis to facilitate fast development of accurate detectors for new steganographic schemes. We demonstrate the proposed framework on three steganographic algorithms operating on a fixed cover source. The edge-adaptive algorithm by Luo et al. [32] was included intentionally as an example of a stegosystem that, to the best knowledge of the authors, has not yet been successfully attacked.

Another promising aspect of rich models is their potential to provide a good general-purpose model for various applications in forensics and in universal blind steganalysis. While the latter may rightfully seem rather out of reach because steganography can be designed to minimize the disturbance in a fixed model space using the Feature-Correction Method (FCM) [8], [26] or the framework described in [12], [10], model preservation will likely become increasingly more difficult for rich (high-dimensional and diverse) models.
Indeed, as shown in Section V, when sufficiently many diverse submodels built from differences between neighboring pixels are combined in the rich model, HUGO becomes quite detectable despite the fact that it was designed to minimize distortion to high-dimensional multivariate statistics computed from the same pixel differences.

The paper is structured as follows. In Section II, we describe the individual submodels built as symmetrized joint probability distributions of adjacent residual samples. The steganalyzer and the experimental setup used in this paper are detailed in Section III. The methodology for assembling the rich model from a sample of cover and stego images while considering the performance–dimensionality trade-off appears in Section IV. The three tested stego methods are described in Section V together with two investigative experiments aimed at analyzing the performance of individual submodels and how it is affected by quantization and steganographic payload. Section VI contains the results of the main experiment in which the full proposed framework is applied to three steganographic methods. Finally, the paper is concluded in Section VII, where we elaborate on how the proposed strategy affects future development of steganography and discuss potential applications of rich models outside the field of steganalysis.

Everywhere in this article, lower-case boldface symbols are used for vectors and capital-case boldface symbols for matrices and higher-dimensional arrays. The symbols X = (Xij) ∈ {0, . . . , 255}^{n1×n2} and X̄ = (X̄ij) always represent pixel values of an 8-bit grayscale cover image with n = n1 × n2 pixels and its corresponding stego image. By slightly abusing the language, for compactness we will sometimes say "pixel Xij," meaning the pixel located at (i, j) whose grayscale value is Xij. A model representation of an image using a feature will always be denoted with the same lower-case letter. For example, image X is represented with x and X̄ with x̄. For any vector x with index set I, xJ, J ⊂ I, stands for the vector x from which all xi, i ∉ J, were removed. We use the symbols R and N to represent the sets of real numbers and integers, respectively. For any x ∈ R, the largest integer smaller than or equal to x is ⌊x⌋, while the operation of rounding to an integer is denoted round(x). The truncation function with threshold T > 0 is defined for any x ∈ R as truncT(x) = x for x ∈ [−T, T] and truncT(x) = T sign(x) otherwise. For a finite set X, |X| denotes the number of its elements.

II. Rich model of noise residual

A good source model is crucial not only for steganography but also, e.g., for source coding and forensic analysis. Indeed, image representations originally designed for steganalysis have found applications in digital forensics to solve difficult problems of a very different nature [24], [7]. However, digital images acquired using a sensor constitute a quite complex source. This is not only due to the richness of natural scenes but also due to the intricate network of dependencies among pixels introduced at acquisition and during in-camera processing. Model building is further complicated by the enormous diversity among cameras as manufacturers implement more sophisticated processing algorithms as well as specialized hardware components.

Since the focus of this paper is on spatial-domain steganography, our rich model will be constructed in the spatial domain, because the best detection is usually achieved by building the model directly in the domain where the embedding changes are localized and thus most pronounced. Since steganography by cover modification makes only small changes to the pixels, we model only the noise component (noise residual) of images rather than their content. This philosophy was adopted by steganalysts early on (Chapter 2.4 in [22]) and then perfected through a long series of papers, examples of which are [2], [1], [9], [18], [33], [38].

A. Submodels

Our overall goal is to capture a large number of different types of dependencies among neighboring pixels to give the model the ability to detect a wide spectrum of embedding algorithms. However, enlarging a single model is unlikely to produce good results, as the enlarged model will have too many underpopulated bins (e.g., think of the second-order SPAM model with a large truncation threshold T employed by HUGO [34]). Instead, we form the model by merging many smaller submodels, thus avoiding the problem of underpopulated bins.

1) Computing residuals: The submodels are formed from noise residuals, R = (Rij) ∈ R^{n1×n2}, computed using high-pass filters of the following form:

Rij = X̂ij(Nij) − cXij,    (1)

where c ∈ N is the residual order, Nij is a local neighborhood of pixel Xij, Xij ∉ Nij, and X̂ij(·) is a predictor of cXij defined on Nij. The set Nij ∪ {Xij} is called the support of the residual. The advantage of modeling the residual instead of the pixel values is that the image content is largely suppressed in R, which has a much narrower dynamic range, allowing a more compact and robust statistical description. Many steganalysis features were formed in this manner in the past, e.g., [9], [18], [33], [38].

2) Truncation and quantization: Each submodel is formed from a quantized and truncated version of the residual:

Rij ← truncT(round(Rij / q)),    (2)

where q > 0 is a quantization step. The purpose of truncation is to curb the residual's dynamic range to allow its description using co-occurrence matrices with a small T. The quantization makes the residual more sensitive to embedding changes at spatial discontinuities in the image (at edges and textures).
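As a concrete illustration of (1) and (2), the following numpy sketch (our own illustrative code, not the authors' implementation; the function names and the toy image are ours) computes the first-order horizontal residual of type 1a and then quantizes and truncates it:

```python
import numpy as np

def first_order_residual(X):
    # Type 1a: R_ij = X_{i,j+1} - X_ij, i.e., the predictor is the right neighbor.
    X = X.astype(np.int64)  # avoid uint8 wrap-around in the difference
    return X[:, 1:] - X[:, :-1]

def quantize_truncate(R, q, T):
    # Eq. (2): R_ij <- trunc_T(round(R_ij / q))
    return np.clip(np.round(R / q), -T, T).astype(np.int64)

X = np.array([[120, 121, 124, 130],
              [120, 120, 119, 118]], dtype=np.uint8)
R = first_order_residual(X)          # [[1, 3, 6], [0, -1, -1]]
Rq = quantize_truncate(R, q=1, T=2)  # values curbed to {-2, ..., 2}
```

Note how the large difference 6 at the intensity jump is clipped to T = 2, which is exactly the information loss that motivates including several quantization steps q per residual (Section II-D).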
3) Co-occurrences: The construction of each submodel continues with computing one or more co-occurrence matrices of neighboring samples from the truncated and quantized residual (2). Forming models in this manner is well established in the steganalysis literature. The truncation of noise residuals and the formation of joint or Markov transition probability matrices as features appeared for the first time in [38] and in [36]. The key question is how to choose the model parameters – the threshold T, the co-occurrence order, and the spatial positions of the neighboring residual samples. To this end, we analyzed our cover source, the BOSSbase ver. 0.92 database [13], and computed the average correlation between neighboring pixels in the horizontal/vertical and diagonal/minor-diagonal directions (see Fig. 1). The correlations fall off gradually with increasing distance between pixels, and they do so faster for diagonally neighboring pixels. Thus, we form co-occurrences of pixels only along the horizontal and vertical directions and avoid using groups with diagonally neighboring pixels.¹ We chose four-dimensional co-occurrences because co-occurrences of larger dimensions had numerous underpopulated bins, which compromised their statistical significance. For a fixed dimension, better results are generally obtained by using a lower value of T and including other types of residuals to increase the model diversity. To compensate for the loss of information due to truncating all residual values larger than T, for each residual type we consider several submodels with different values of q, thus allowing our model to "see" dependencies among residual samples whose values lie beyond the threshold.

[Figure 1: Correlation between pixels as a function of their distance, for horizontal/vertical and diagonal/minor-diagonal neighbors. The distance of diagonally neighboring pixels is in multiples of the diagonal of two neighboring pixels. The results were averaged over 100 randomly selected images from BOSSbase ver. 0.92.]

In summary, our submodels will be constructed from horizontal and vertical co-occurrences of four consecutive residual samples processed using (2) with T = 2. Formally, each co-occurrence matrix C is a four-dimensional array indexed with d = (d1, d2, d3, d4) ∈ T4 ≜ {−T, . . . , T}^4, which gives the array (2T + 1)^4 = 625 elements. The dth element of the horizontal co-occurrence for residual R = (Rij) is formally defined as the (normalized) number of groups of four neighboring residual samples with values equal to d1, d2, d3, d4:

C(h)_d = (1/Z) |{(Ri,j, Ri,j+1, Ri,j+2, Ri,j+3) : Ri,j+k−1 = dk, k = 1, . . . , 4}|,    (3)

where Z is the normalization factor ensuring that Σ_{d∈T4} C(h)_d = 1. The vertical co-occurrence, C(v), is defined analogically.

Having fixed T and the co-occurrence order, determining the rest of the rich model involves selecting the local predictors X̂ij for the residuals and the quantization step(s) q, all explained in the following sections.

¹A few sample tests confirmed that co-occurrences formed from groups in which pixels do not lie on a straight line have a substantially weaker detection performance across various stego methods and payloads.
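A minimal sketch of the horizontal co-occurrence (3) (our own illustrative numpy code; the function name and the toy residual are ours, not the authors' implementation):

```python
import numpy as np

def cooc_horizontal(R, T=2):
    # Eq. (3): normalized counts of quadruples of horizontally adjacent
    # residual samples; C has (2T+1)^4 = 625 bins, indexed by d + T.
    C = np.zeros((2 * T + 1,) * 4)
    for i in range(R.shape[0]):
        for j in range(R.shape[1] - 3):
            C[tuple(R[i, j + k] + T for k in range(4))] += 1
    return C / C.sum()  # Z normalizes the bins so they sum to 1

R = np.array([[0, 1, -2, 2, 0]])  # residual already processed by (2), T = 2
C = cooc_horizontal(R, T=2)       # two quadruples, each with weight 1/2
```

The vertical co-occurrence C(v) is obtained the same way after transposing R.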
B. Description of all residuals

All residuals used in this paper are graphically shown in Fig. 2. They are built as locally supported linear filters whose outputs are possibly combined using minimum and maximum operators to increase their diversity. For better insight, think of each filter in terms of its predictor. For example, in the first-order residual Rij = Xi,j+1 − Xij the central pixel Xij is predicted by its immediate neighbor, X̂ij = Xi,j+1, while the predictor in the second-order residual Rij = Xi,j−1 + Xi,j+1 − 2Xij assumes that the image is locally linear in the horizontal direction, 2X̂ij = Xi,j+1 + Xi,j−1. Higher-order differences as well as differences involving a larger neighborhood correspond to more complicated assumptions made by the predictor, such as locally quadratic behavior or linearity in both dimensions. Additional motivation for the choice of our filters appears in Section II-B1.

The central pixel Xij at which the residual (1) is evaluated is always marked with a black dot and accompanied by an integer – the value c from (1). If the chart contains only one type of symbol (besides the black dot), we say that the residual is of type 'spam' (1a, 2a, 3a, S3a, E3a, S5a, E5a) due to their similarity to the SPAM feature vector [33]. If there are two or more different symbols other than the black dot, we call it type 'minmax'. In type 'spam', the residual is computed as a linear high-pass filter of neighboring pixels with the corresponding coefficients. For example, 2a stands for the second-order residual Rij = Xi,j−1 + Xi,j+1 − 2Xij and 1a for the first-order residual Rij = Xi,j+1 − Xij. In contrast, 'minmax' residuals use two or more linear filters, each filter corresponding to one symbol type, and the final residual is obtained by taking the minimum (or maximum) of the filters' outputs. Thus, there will be two minmax residuals – one for the operation 'min' and one for 'max'. For example, 2b is obtained as Rij = min{Xi,j−1 + Xi,j+1 − 2Xij, Xi−1,j + Xi+1,j − 2Xij}, while 1g is Rij = min{Xi−1,j−1 − Xij, Xi−1,j − Xij, Xi−1,j+1 − Xij, Xi,j+1 − Xij}, etc. The 'min' and 'max' operators introduce non-linearity into the residuals and desirably increase the model diversity. Both operations also make the distribution of the residual samples non-symmetrical, thickening one tail of the distribution of Rij and thinning out the other.

The number of filters, f, is the first digit attached to the end of the residual name. The third-order residuals are computed just like the first-order residuals by replacing, e.g., Xi,j+1 − Xij with −Xi,j+2 + 3Xi,j+1 − 3Xij + Xi,j−1. The differences along other directions are obtained analogically.

1) Residual classes: As the figure shows, the residuals are divided into six classes depending on the central pixel predictor they are built from. The classes are given the following descriptive names: 1st, 2nd, 3rd, SQUARE, EDGE3x3, and EDGE5x5. The predictors in class '1st' estimate the pixel as the value of its neighbor, while those from class '2nd' ('3rd') incorporate a locally linear (quadratic) model. Such predictors are more accurate in regions with a strong gradient/curvature (e.g., around edges and in complex textures). The class 'SQUARE' makes use of more pixels for the prediction. The 3 × 3 square kernel S3a has been used in steganalysis before [23], and it also coincides with the best (in the least-squares sense) shift-invariant linear pixel predictor on the 3 × 3 neighborhood for cover images from BOSSbase. The class 'EDGE3x3' predictors, derived from this kernel, were included to provide better estimates at spatial discontinuities (edges). The larger 5 × 5 predictor in S5a was obtained by optimizing the coefficients of a circularly-symmetrical 5 × 5 kernel using the Nelder–Mead algorithm to minimize the detection error for the embedding algorithm HUGO [29]. While this (only) predictor was inspired by a specific embedding algorithm, it works very well against the other algorithms tested in this paper. The 'EDGE5x5' residuals E5a–E5d (not shown in Fig. 2) are built from S5a in an analogical manner as E3a–E3d are built from S3a.

2) Residual symmetries: Each residual exhibits symmetries that will later allow us to reduce the number of submodels and make them better populated. If the residual does not change after computing it from the image rotated by 90 degrees, we say that it is non-directional; otherwise it is directional. For instance, 1a, 1b, 2a, 2e, E3c are directional, while 1e, 2b, 2c, S3a, E3d are non-directional. Two co-occurrence matrices (3) are computed for each residual – one for the horizontal and one for the vertical scan. We call a residual hv-symmetrical if its horizontal and vertical co-occurrences can be added to form a single matrix (submodel) based on the argument that the statistics of natural images do not change after rotating the image by 90 degrees. Obviously, all non-directional residuals are hv-symmetrical, but many directional residuals are hv-symmetrical as well (e.g., 1c, 1h, 2e, E3b, E3d). In contrast, 1a, 1g, 2a, 2d, E3c are not hv-symmetrical. In general, an hv-symmetrical residual will thus produce a single co-occurrence matrix (the sum of both the horizontal and vertical matrices), while hv-nonsymmetrical ones will produce two matrices – one for the horizontal and one for the vertical direction. We include this fact in the residual name by appending either 'h' or 'v' to the end. No symbol is appended to hv-symmetrical residuals.

We also define a symmetry index σ for each residual as the number of different residuals that can be obtained by rotating and possibly mirroring the image prior to computing it. To give an example, 2c, 1b, 1c, and 1g have symmetry indices equal to 1, 2, 4, and 8, respectively. The symmetry index is part of the residual name, and it always follows the number of filters, f.

To make the co-occurrence bins more populated, and thus increase their statistical robustness, and to lower their dimensionality, for hv-nonsymmetrical residuals we add all σ co-occurrences. For hv-symmetrical residuals, since we add both the horizontal and vertical co-occurrences, we end up adding 2σ matrices. For example, 1f has symmetry index 4, and because it is hv-symmetrical, we can form one horizontal and one vertical co-occurrence for each of the four rotations of the filter, adding together 8 matrices. As another example, 1g has symmetry index 8 and is hv-nonsymmetrical, which means we end up adding 8 matrices.

3) Syntax: The syntax of names used in Fig. 2 follows this convention:

name = {type}{f}{σ}{scan},    (4)

where type ∈ {spam, minmax}, f is the number of filters, σ is the symmetry index, and the last symbol scan ∈ {∅, h, v} may be missing (for hv-symmetrical residuals) or is either h or v, depending on the co-occurrence scan that should be used with the residual.

In summary, the class '1st' contains 22 different co-occurrence matrices – two for 1a, 1c, 1e, 1f, 1h, and four for 1b, 1d, 1g. The same number is obtained for class '3rd', while '2nd' contains 12 matrices – two for 2a, 2b, 2c, 2e, and four for 2d. There are two matrices in 'SQUARE' (S3a, S5a) and ten in 'EDGE3x3' and in 'EDGE5x5' (two for E3a, E3b, and E3d, and four for E3c), giving the total of 22 + 12 + 22 + 2 + 10 + 10 = 78 matrices, each with 625 elements. These matrices are used to form the final submodels by the symmetrization explained next.

C. Co-occurrence symmetrization

The individual submodels of the rich image model will be obtained from the 78 co-occurrence matrices computed above by leveraging symmetries of natural images. The symmetries are in fact quite important, as they allow us to increase the statistical robustness of the model while decreasing its dimensionality, making it more compact and improving the performance-to-dimensionality ratio. We use the sign-symmetry² as well as the directional symmetry of images.

²Sign-symmetry means that taking a negative of an image does not change its statistical properties.
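To illustrate the 'minmax' construction, the following sketch computes residual 2b as the min/max of the horizontal and vertical second-order filter outputs (our own illustrative numpy code, not the authors' implementation; the toy image is arbitrary):

```python
import numpy as np

def residual_2b(X):
    # Two second-order linear filters evaluated at each interior pixel:
    X = X.astype(np.int64)
    Dh = X[1:-1, :-2] + X[1:-1, 2:] - 2 * X[1:-1, 1:-1]  # X_{i,j-1} + X_{i,j+1} - 2X_ij
    Dv = X[:-2, 1:-1] + X[2:, 1:-1] - 2 * X[1:-1, 1:-1]  # X_{i-1,j} + X_{i+1,j} - 2X_ij
    return np.minimum(Dh, Dv), np.maximum(Dh, Dv)        # the 'min' and 'max' residuals

X = np.array([[10, 10, 10, 10],
              [10, 12, 10, 10],
              [10, 10, 10, 10],
              [10, 10, 10, 10]], dtype=np.uint8)
Rmin, Rmax = residual_2b(X)  # Rmin = [[-4, 0], [0, 0]], Rmax = [[-4, 2], [2, 0]]
```

Since both filters are linear in X, the identity min(X) = −max(−X) mentioned in Section II-C holds for these residuals: the 'min' residual of an image equals minus the 'max' residual of its negative.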
[Figure 2: Definitions of all residuals – the filter kernels for the classes 1st/3rd ORDER (1a spam14h,v; 1b minmax22h,v; 1c minmax24; 1d minmax34h,v; 1e minmax41; 1f minmax34; 1g minmax48h,v; 1h minmax54), SQUARE (S3a spam11; S5a spam11), 2nd ORDER (2a spam12h,v; 2b minmax21; 2c minmax41; 2d minmax24h,v; 2e minmax32), and EDGE3x3 (E3a spam14h,v; E3b minmax24; E3c minmax22h,v; E3d minmax41). The residuals 3a–3h are defined similarly to the first-order residuals, while E5a–E5d are similar to E3a–E3d, defined using the corresponding part of the 5 × 5 kernel displayed in S5a. See the text for more details.]
The symmetrization depends on the residual type. All 'spam' residuals are symmetrized sequentially by applying the following two rules for all d = (d1, d2, d3, d4) ∈ T4:

C̄_d ← C_d + C_{−d},    (5)
C̿_d ← C̄_d + C̄_{←d},    (6)

where ←d = (d4, d3, d2, d1) and −d = (−d1, −d2, −d3, −d4). After eliminating duplicates from C̿ (which originally had 625 elements), only 169 unique elements remain.

The 'minmax' residuals of natural images also possess the directional symmetry but not the sign symmetry. On the other hand, since min(X) = −max(−X) for any finite set X ⊂ R, we use the following two rules for their symmetrization:

C̄_d ← C(min)_d + C(max)_{−d},    (7)
C̿_d ← C̄_d + C̄_{←d},    (8)

where C(min) and C(max) are the 'min' and 'max' co-occurrence matrices computed from the same residual. The dimensionality is thus reduced from 2 × 625 to 325.

After symmetrization, the total number of submodels decreases from 78 to only 45, as the symmetrization merges two co-occurrences, one for 'min' and one for 'max', into a single matrix. The number of co-occurrences for the 'spam' type stays the same (only their dimensionality changes). For example, for the class '1st', we will have 12 submodels – one symmetrized spam14h and one spam14v, one minmax22h, one minmax22v, one minmax24, minmax34h, minmax34v, minmax41, minmax34, minmax48h, minmax48v, and one minmax54. There will be 12 submodels from '3rd', seven from '2nd', two from 'SQUARE', and six from each edge class. In total, there are 12 submodels of dimension 169 from the 12 'spam'-type residuals and 33 of dimension 325 from type 'minmax'. Thus, when all submodels are put together, their combined dimensionality is 12 × 169 + 33 × 325 = 12,753.

We remark that the symmetrization might prevent us from detecting steganographic methods that disturb the above symmetries (think of symmetrizing the histogram for Jsteg [37]). Such embedding methods are, however, fundamentally flawed (and easy to detect), as one can likely build accurate quantitative targeted attacks leveraging the symmetry violations.
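The bin counts quoted above can be checked mechanically by grouping the 625 bins into orbits under the symmetries used in (5)–(8) (a consistency check we added, not code from the paper):

```python
from itertools import product

T = 2
bins = list(product(range(-T, T + 1), repeat=4))  # the 5^4 = 625 indices d

def orbit_spam(d):
    # (5) merges d with -d (sign-symmetry); (6) merges d with its reversal.
    return frozenset({d, d[::-1], tuple(-x for x in d), tuple(-x for x in d[::-1])})

def orbit_minmax(d):
    # After the min/max merge (7), only the directional symmetry (8) remains.
    return frozenset({d, d[::-1]})

n_spam = len({orbit_spam(d) for d in bins})      # 169 unique elements
n_minmax = len({orbit_minmax(d) for d in bins})  # 325 unique elements
total = 12 * n_spam + 33 * n_minmax              # combined dimensionality 12,753
```

The orbit counts reproduce the dimensions 169 and 325 stated in the text, and 12 × 169 + 33 × 325 = 12,753.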
D. Quantization

Finally, we specify how to select the quantization step q. As mentioned already at the end of Section II-A, it is beneficial to include several versions of submodels with different values of q, because residuals obtained with a larger q can better detect embedding changes in textured areas and around edges. Based on sample experiments with different algorithms and submodels, we determined that the best performance of each submodel is always achieved when q ∈ [c, 2c], where c is the residual order. Thus, we included in the rich cover model all submodels with residuals quantized with

q ∈ {c, 1.5c, 2c} for c > 1,  q ∈ {1, 2} for c = 1.    (9)

The case with c = 1 in (9) is different from the rest because quantizing a residual with c = 1 and q = 1.5 with T = 2 leads to exactly the same result as quantizing with q = 2. Thus, each submodel will be built in two versions for residuals in class '1st' and in three versions for the remaining residuals.

The authors acknowledge that the individual performance of each submodel can likely be improved by replacing the simple scalar quantizer with an optimized design. Possibilities worth investigating are non-uniform scalar quantizers and vector quantizers built directly in the four-dimensional space. Due to the lack of space for such an investigation in this paper, the authors postpone researching these possibilities to their future work.

E. Discussion

The residuals shown in Fig. 2 were selected using the principle of simplicity and are by no means meant as the ultimate result, as there certainly exist numerous other possibilities. We view the model building as an open-ended process because, quite likely, there exist other predictors that would further improve the detection after being added to the proposed model. Having said this, we observed a "saturation" of performance in the sense that further enrichment of the model with other types of predictors led to an insignificant improvement in detection accuracy for all tested algorithms (see Section V).

In the future, the authors contemplate learning the best predictors from the database of cover and stego images, thus replacing the hand design described above. We also note that submodels obtained from residuals computed using denoising filters almost always lead to poor steganalysis results, because denoising filters typically put substantial weight on the central pixel being denoised, which leads to a biased predictor X̂ij; when one computes the residual using (1), the stego signal becomes undesirably suppressed.

III. Ensemble classifier

Our strategy for constructing steganography detectors involves a training phase in which we not only build the steganalyzer but also form the rich image model from available cover and stego images for a given steganographic algorithm. Thus, most experiments will be built from the following experimental unit that starts by randomly splitting the available image database into a training set X^trn and a testing set X^tst, containing N^trn and N^tst cover images, respectively, and the same number of their corresponding stego images. Using only X^trn, we assemble the rich model, construct the steganalyzer, and then test it on X^tst. All images will be represented using d-dimensional features, x ∈ R^d, where d could stand for the dimension of a given submodel or an arbitrary union of submodels.

Everywhere in this paper, we use the ensemble classifier as described in [27], [28]. Since these two references contain a detailed description of the tool, and since this paper is not about ensemble classification, we repeat only the most essential details here, referring the reader to the original publications for more details.

The ensemble classifier is essentially a random forest consisting of L binary classifiers called base learners, B_l, l = 1, . . . , L, each trained on a different d_sub-dimensional subspace of the feature space selected uniformly at random. Each random subspace will be described using an index set D_l ⊂ {1, . . . , d}, |D_l| = d_sub. The ensemble reaches its decision by fusing all L decisions of the individual base learners using majority voting.

Following the investigation reported in the original publications, we use Fisher Linear Discriminants (FLDs) as base learners due to their simple and fast training, and due to the fact that in steganalysis we are unlikely to encounter a small feature set responsible for the majority of detection accuracy. Denoting the cover and stego features from the training set as x(m) and x̄(m), m = 1, . . . , N^trn, respectively, each base learner B_l is trained as the FLD on the set X_l = {x(m)_{D_l}, x̄(m)_{D_l}}, m ∈ N^b_l, where N^b_l is a bootstrap sample of {1, . . . , N^trn} with roughly 63% of unique training examples.³ The remaining 37% are used for estimating the classifier's testing error, as each B_l provides a single vote B_l(x_{D_l}) ∈ {0, 1} (cover = 0, stego = 1) for each x ∉ X_l. After all L base learners are trained, each training sample x ∈ X^trn will thus collect on average 0.37 × L predictions, which are fused using the majority voting strategy into the final prediction, denoted B(L)(x) ∈ {0, 1}.

The optimal number of base learners, L, and the dimensionality of each feature subspace, d_sub, are determined automatically during ensemble training. The training makes use of the so-called "out-of-bag" (OOB) error estimate:

E(L)_OOB = (1/(2N^trn)) Σ_{m=1}^{N^trn} [B(L)(x(m)) + 1 − B(L)(x̄(m))],    (10)

which is an unbiased estimate of the testing error [3].

³The threshold of all base learners is set to minimize the total detection error with equal priors, determined again from examples from the training set only.
7

compass search [31] that minimizes (10). During the symmetry index, σ, and the co-occurrence scan direction.
search, for each tested dsub the number of base learners, Since for submodels of type ’spam’ both scan directions
L, is gradually increased one-by-one while monitoring the were merged, we use the scan string ’hv’ in their acronym.
gradual decrease of the OOB error estimate (10). This These acronyms are used in Fig. 3.
process is stopped, and L is fixed, when (10) starts showing To formally and unambiguously explain the feature
signs of saturation. selection strategies, we introduce additional notation. Let
The OOB error estimate is a very convenient by-product I1 , . . . , I6 ⊂ {1, . . . , 39} be the index sets correspond-
of training. Since it is an unbiased estimate of the testing ing to submodels from ’1st’, ’2nd’, ’3rd’, ’EDGE3x3’,
error, it is ideally suited for evaluating the detection ’EDGE5x5’, and ’SQUARE’ classes, respectively. The car-
performance of the submodel on unseen data without dinalities of these six index sets are 11, 6, 11, 5, 5, and
having to reserve a portion of the training set to assess 1, respectively. The reader is advised to follow Fig. 3 for
(q)
the testing error as is commonly done in cross-validation.4 better clarity. We denote by Mi the ith submodel, i ∈
In the next section, we will assemble the rich model using {1, . . . , 39}, quantized with q (q ∈ {1c, 2c} for i ∈ I1 and
the OOB error estimates of all submodels. q ∈ {1c, 1.5c, 2c} otherwise). The OOB error estimate (10)
(q)
of the ith submodel quantized with q will be denoted Ei .
IV. Assembling the rich model The following six submodel selection strategies will be
explored in the next section:
Fundamentally, the model assembly is a feature selection
problem as the steganalyst strives to achieve the best • ALL. This strategy is a simple forward feature se-
performance for a given feature dimensionality. As already lection applied to submodels. The idea is to merge
pointed out above, our approach will combine a feature the M best individually performing submodels (based
subset selection heuristic with a classification feedback. on their OOB errors) into one model and omit the
The union of all submodels (co-occurrence matrices), in- rest. Formally, we form the model by merging M
cluding their differently quantized versions, has a total submodels with the M smallest OOB error estimate
(q)
dimension of 2 × (2 × 169 + 10 × 325)+3 × (10 × 169 + 23 × Ei out of all 106 submodels.
325)= 34, 671.5 We apply our feature selection strategies • BEST-q. Since two differently quantized versions of
to the entire submodels based on the estimate of their one submodel provide less diversity than two submod-
individual performance (in terms of the OOB error (10)) els built from different residuals, in this strategy we
as this allows us to interpret the results, relate the selected force diversity among the selected submodels while
submodels to the steganographic algorithms, and provide hoping to improve the performance-to-dimensionality
interesting feedback to steganographers (Sections V and ratio. First, we determine the best quantization step
(q)
VI), which would not be possible if we were selecting indi- q for each submodel i: qi = arg minq Ei . We remind
vidual features. The learning process is applied for a given that the argument minimum is taken over two values
stegochannel as defined in [15], which entails a specific for q in class ’1st’ and over three values for all other
(q )
steganographic algorithm, message source (payload size), classes. As a result, we obtain 39 submodels, Mi i ,
(q )
and a sample of cover and stego images. with the corresponding OOB error estimates Ei i .
Since the dimensionality of submodels that originated Now, we apply the forward feature selection to these
from the ’spam’ type residual is 169 (’minmax’ resid- 39 submodels as in ALL. In particular, we merge M
(q ) (q )
uals have dimensionality 325), to make the ranking by submodels Mi i with M lowest OOB errors Ei i .
OOB fair, we merge the vertical and horizontal ’spam’ • BEST-q-CLASS. In this strategy, we force diversity
submodels into a single 2 × 169 = 338-dimensional sub- even more than in BEST-q by first selecting one
model (merging spam14h with spam14v from classes ’1st’, (the best) submodel from each class before select-
’3rd’, ’EDGE3x3’, and ’EDGE5x5’, and also spam12h with ing a different submodel from the same class again.
spam12v from class ’2nd’). Additionally, we merge the two As in BEST-q, we begin with only 39 submodels
spam11 submodels from class ’SQUARE’ into one 338- and proceed in rounds. In the first round, we find
(q ) (q )
dimensional submodel. Now, all submodels have approxi- i1 = arg mini∈I1 Ei i , . . ., i6 = arg mini∈I6 Ei i
(qi )
mately the same dimension (338 or 325) and can thus be and merge M6 = ∪6k=1 Mik k . Then, we remove the
fairly ranked by their individual detection performance. submodels from their classes, Ik ← Ik − {ik } and
Note that after merging the horizontal and vertical ’spam’ proceed to the second round. The second round is
submodels, we end up with 11 submodels in class ’1st’ and the same as the first one but with each class being
’3rd’, six in ’2nd’, one in ’SQUARE’, and five in each edge by one submodel smaller, etc. If, at some point, the
class (total of 39 submodels or 106 if counting quantized class becomes empty, we remove it from consideration.
versions as different). A short acronym will be used for The idea of this strategy is to force diversity even
each submodel consisting of the number of filters, f , the more than in BEST-q as the first six submodels are
guaranteed to be selected from six different classes,
4 Realize that we do not (and cannot!) use any feedback from the
so are the next six, etc.
actual testing set X tst for this purpose. (1c)
5 We provide the extractor of all 34,671 features at • Q1. Merge Mi , i = 1, . . . , 39. We want to compare
http://dde.binghamton.edu/download/feature_extractors. a fixed quantization q = 1c with the optimized quan-
8

tization of the BEST-q (or BEST-q-CLASS) strategy mechanisms:


after merging all 39 submodels. Formally, this strat-
(1c) 1) Non-adaptive ±1 embedding (also called LSB
egy is a simple union ∪39 j=1 Mi .
Matching) implemented with ternary matrix em-
• CLASS-q. The goal here is to see how successful each
bedding that is optimally coded to minimize the
residual type is in detecting a given stego algorithm.
number of embedding modifications. In particular,
To this end, we form the model by merging all sub-
the relative payload α bpp (bits per pixel) is em-
models with the best quantization step q from a fixed
bedded with change rate H3−1 (α), where H3−1 (x) is
class. There will be total of six models here, one for
(q ) the inverse of the ternary entropy function H3 (x) =
each class (k = 1, . . . , 6): ∪i∈Ik Mi i .
−x log2 x − (1 − x) log2 (1 − x) + x. (For more details,
• ITERATIVE-BEST-q. This strategy is the only one
see, e.g., Chapter 8 in [15].)
that considers mutual dependencies among submod-
2) HUGO [34], which was designed to minimize embed-
els. The submodels are selected sequentially one by
ding distortion in a high-dimensional feature space
one based on how much they improve the detection
computed from differences of four neighboring pixels.
w.r.t. the union of those already selected. We start
We used the embedding simulator available from
with 39 submodels just like in strategies BEST-q and
the BOSS website [13] with σ = 1 and γ = 1,
BEST-q-CLASS. The first submodel selected is the
(q ) for the parameters of the distortion function, and
one with the lowest OOB error. We denote it Mi1 i1 , the switch –T 255, which means that the distortion
(qi )
i1 = arg mini Ei . Having selected k ≥ 1 submodels, function was computed with threshold T = 255
(qi )
k+1
the k + 1st submodel Mik+1 is selected as instead of the default value T = 90 used in the
 (qi )
 BOSS challenge [13]. We did it to remove a weakness
(q )
ik+1 = arg min EOOB ∪kj=1 Mij j ∪ Mi i , of HUGO with T = 90 that makes the algorithm
i∈{i
/ 1 ,...,ik }
(11) vulnerable to first-order attacks due to an artifact
where EOOB (M) is the OOB error estimate when present in the histogram of pixel differences [29].
training model M. In other words, we add the one 3) Edge-Adaptive (EA) algorithm, due to Luo et al. [32]
submodel among the 39−k remaining submodels that confines the embedding changes to pixel pairs whose
leads to the biggest drop in the OOB estimate when difference in absolute value is as large as possible
the union of all k + 1 submodels is used as a model. (e.g., around edges). Both HUGO and the EA algo-
Determining the first k submodels requires 39 + 38 rithm place the embedding changes to those parts
+· · · + 39 − k + 1 trainings, which makes this method of the image that are hard to model and are thus
rather expensive for increasing k. In this paper, we expected to be more secure than the non-adaptive
applied this strategy for k ≤ 10. ±1 embedding.
From the machine-learning point of view, the first three
strategies, ALL, BEST-q, and BEST-q-CLASS, could be
classified as filters [20]. They are based solely on the A. Image source
initial OOB ranking of every submodel and thus ignore
dependencies among the submodels. They differ mainly The image source used for all experiments is the BOSS-
in how they enforce diversity. The ITERATIVE-BEST-q base ver. 0.92 consisting of 9, 074 cover images taken with
strategy, on the other hand, continuously utilizes classifi- seven digital cameras in their RAW format, converted to
cation feedback of the ensemble as it greedily minimizes grayscale, and resized/cropped to 512 × 512 using the
the OOB error in every iteration, taking thus the mutual script provided by the BOSS organizers. The reason for
dependencies among individual submodels into account. constraining our investigation to a single cover source was
This is an example of a wrapper [20], which is a feature our desire to apply the proposed framework to several
selection method using a machine-learning tool as a black- different algorithms for a wide range of relative payloads,
box and is thus classifier-dependent. Filters and wrappers which by itself is extremely computationally demanding
are both examples of forward feature selection methods, and required use of a high-performance computer cluster
here applied to the whole submodels rather than to indi- for an extensive period of time. Moreover, the focus of this
vidual features. paper is on the methodology rather than benchmarking
The CLASS-q strategy corresponds to merging all sub- steganography in different cover sources.
models with the best q from one chosen class, while the We structure our investigation into three experiments,
Q1 strategy corresponds to merging all 39 submodels with two of which are in this section, while the third one
a fixed quantization q = 1c. The purpose of these two appears in the next section. The goal of Experiment 1
simple heuristic merging strategies is rather investigative, is to obtain insight about the detection performance for
see Section V-C. each submodel, stego algorithm, and quantization step. We
also assess the statistical spread of the results w.r.t. the
V. Investigative experiments randomness in the ensemble. Experiment 2 was designed
All experiments in this paper are carried out on three to evaluate the efficiency of each model-assembly strategy
steganographic algorithms with contrasting embedding listed in Section IV.
9

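Before moving to the experiments, the ensemble training loop of Section III can be made concrete with a minimal sketch. This is our own simplified illustration, not the authors' released implementation: base learners are FLDs on random subspaces trained on bootstrap samples, and the midpoint decision threshold below is a simplification of the equal-prior threshold described in footnote 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_fld(C, S, eps=1e-6):
    """Fisher Linear Discriminant on cover features C and stego features S."""
    mu_c, mu_s = C.mean(axis=0), S.mean(axis=0)
    Sw = np.cov(C, rowvar=False) + np.cov(S, rowvar=False)  # within-class scatter
    w = np.linalg.solve(Sw + eps * np.eye(Sw.shape[0]), mu_s - mu_c)
    b = -w @ (mu_c + mu_s) / 2.0  # midpoint threshold (simplified vs. footnote 3)
    return w, b

def ensemble_oob(Xc, Xs, L=51, dsub=8):
    """Random-subspace FLD ensemble; returns the OOB error estimate of Eq. (10)."""
    N, d = Xc.shape
    stego_votes_c = np.zeros(N); total_c = np.zeros(N)
    stego_votes_s = np.zeros(N); total_s = np.zeros(N)
    for _ in range(L):
        D = rng.choice(d, size=dsub, replace=False)   # random subspace D_l
        boot = rng.choice(N, size=N, replace=True)    # bootstrap of image indices
        oob = np.setdiff1d(np.arange(N), boot)        # ~37% out-of-bag examples
        w, b = train_fld(Xc[boot][:, D], Xs[boot][:, D])
        stego_votes_c[oob] += (Xc[oob][:, D] @ w + b > 0)
        stego_votes_s[oob] += (Xs[oob][:, D] @ w + b > 0)
        total_c[oob] += 1; total_s[oob] += 1
    B_c = stego_votes_c > total_c / 2                 # majority vote on covers
    B_s = stego_votes_s > total_s / 2                 # majority vote on stegos
    return float(B_c.sum() + (~B_s).sum()) / (2.0 * N)  # Eq. (10)

# toy features: "stego" is a small shift of "cover"
Xc = rng.normal(0.0, 1.0, size=(200, 20))
Xs = rng.normal(0.8, 1.0, size=(200, 20))
err = ensemble_oob(Xc, Xs)
```

Note that, as in the paper, each bootstrap is drawn over image indices m so that the cover/stego pair x^(m), x̄^(m) is always kept on the same side of the in-bag/out-of-bag split.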
Table I. Mean Absolute Deviation (MAD) of OOB estimates (×10^-3) over five database splits. The table reports the average and maximal values over all 106 submodels, M_i^(q), for all three tested stego algorithms and two payloads.

Algorithm  | ±1 embedding  | HUGO          | EA
Payload    | 0.10   0.40   | 0.10   0.40   | 0.10   0.40
avg. MAD   | 0.649  0.679  | 0.656  0.526  | 0.601  0.484
max. MAD   | 1.640  1.500  | 2.840  1.220  | 1.200  0.940

B. Experiment 1

We start by computing the OOB estimates (10) for each submodel, including its differently quantized versions, for each stego method and for one small and one large payload (0.1 and 0.4 bpp).6 The intention is to investigate how the submodel ranking is affected by the stego algorithm, quantization factor, and payload. Note that the ensemble classifier is built using random structures (randomness enters the selection of subspaces for base learners and the bootstrap sample formation), which is why we repeated each run five times and report the average values of OOB estimates. Table I shows that the variations are in general rather negligible. They can also be made arbitrarily small by the user by increasing the number of base learners L. All results are summarized in Fig. 3, showing the average OOB error estimates E_i^(q) for all i = 1, ..., 39 and for all values of q. The dashed lines separate the 'spam' submodels from submodels of type 'minmax'. The dots were connected by a line to enable a faster visual interpretation of the results.

6 Experiment 1 was carried out on the training set for one fixed split of BOSSbase into 8,074 training and 1,000 testing images.

1) Evaluating individual algorithms: By comparing the patterns for a fixed algorithm, we see that there is a great deal of similarity between the performance of submodels across payloads even though the actual rankings may be different. Remarkably, ±1 embedding with small payload shows by far the largest sensitivity to the quantization factor of any combination of algorithm and payload. This effect is caused by the non-adaptive character of ±1 embedding. For small payloads, the amount of changes in edges and textures is so small that detection essentially relies on smooth parts, where the finest quantization discerns the embedding far better in comparison to other quantizations. On the contrary, both adaptive algorithms are much less sensitive to the quantization step because small payloads are concentrated more in textures and edges.

Notice that, for HUGO, submodels built from first-order differences have worse performance than submodels obtained from third-order differences, which is due to the fact that HUGO approximately preserves statistics among first-order differences; the higher-order differences thus reach "beyond" the model. Also, features of type 'spam' seem to be consistently better than 'minmax' for this algorithm.

The OOB estimates for the EA algorithm exhibit a remarkable similarity for both payloads. This property can probably be attributed to the much more selective character of embedding. While HUGO makes embedding changes even in less textured areas, albeit with smaller probability, the EA algorithm limits the embedding only to those pairs of adjacent pixels whose difference is above a certain threshold, eliminating a large portion of the image from the embedding process.

2) Universality of submodels: The best individual submodels for the larger payload and ±1 embedding, HUGO, and EA achieve OOB error estimates around 0.1, 0.21, and 0.12, indicating that HUGO is by far the best algorithm among the three. While there exist clear differences among the performance of each submodel across algorithms, it is worth noting that certain submodels rank the same w.r.t. each other for all three algorithms, both payloads, and all quantization steps. For example, 'minmax22h' is always worse than 'minmax22v' for class '1st' as well as '3rd'. In other words, it is better to form co-occurrences in the direction that is perpendicular to the direction in which the pixel differences are computed. This is most likely because the perpendicular scan prevents overlaps of filter supports and thus utilizes more information among neighboring pixels. The universality of submodels is further supported by the fact that pair-wise relationships between submodels are largely invariant to stego method and payload – for 40% of all pairs (i, j), i > j, i, j ∈ {1, ..., 39}, the numerical relationship between the errors of models M_i^(q_i) and M_j^(q_j) does not depend on the algorithm or payload.

To further investigate the universality of submodels, in Fig. 4 we plot for each submodel its OOB error estimate averaged over all three stego algorithms and five payloads, 0.05, 0.1, 0.2, 0.3, and 0.4 bpp. The fact that submodels from '3rd' consistently provide lower OOBs than the corresponding submodels from '1st' allowed us to overlap the results and thus compare both classes visually. The best overall submodel is minmax24 in class 'EDGE3x3'. Note that the 'h' versions of submodels built from residuals that are not hv-symmetrical are almost always worse than the 'v' versions, as most residuals are defined in Fig. 2 in their horizontal orientation. This supports the rule that forming co-occurrence matrices in the direction perpendicular to the orientation of the kernel support generally leads to better detection as the co-occurrence bin utilizes more pixels. Fig. 3 also nicely demonstrates that submodels built from first-order differences are in general worse than their equivalents constructed from third-order differences. Finally, observe that the best submodels are in general from hv-symmetrical non-directional residuals.

C. Experiment 2

The purpose of this experiment is to investigate the efficiency of the submodel-selection strategies explained in Section IV. It is done for a fixed payload of 0.4 bpp for all three algorithms on the training set for one fixed split of BOSSbase into 8,074 training and 1,000 testing images.

1) Evaluation by submodel selection strategies: Fig. 5 shows the OOB error estimate as a function of model

[Figure 3 appears here: three panels (±1 embedding, HUGO, Edge Adaptive), each plotting the OOB error against the 39 submodels, grouped into classes '1st', '2nd', '3rd', 'EDGE3x3', 'EDGE5x5', and 'SQUARE', for payloads 0.10 and 0.40 bpp and quantizations q = 1.0c, 1.5c, 2.0c.]

Figure 3. OOB error estimates (10) for all 106 submodels, M_i^(q), for three stego algorithms and two payloads. The values were averaged over five runs of the ensemble for a fixed split of the BOSSbase.

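For reference, the MAD statistic reported in Table I can be computed as below. This is a sketch under our assumption that MAD denotes the mean absolute deviation about the mean; the five OOB values are made up for illustration.

```python
import numpy as np

def mean_abs_dev(x):
    """Mean Absolute Deviation about the mean, as assumed for Table I."""
    x = np.asarray(x, dtype=float)
    return float(np.abs(x - x.mean()).mean())

# five hypothetical OOB estimates from repeated runs of the ensemble
oobs = [0.1012, 0.1005, 0.0998, 0.1010, 0.1001]
mad = mean_abs_dev(oobs)  # ~4.6e-4, i.e., ~0.46 in the 1e-3 units of Table I
```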
[Figure 4 appears here: OOB error (roughly 0.26–0.34) plotted per submodel, with the classes '1st', '2nd', '3rd', 'EDGE3x3', 'EDGE5x5', and 'SQUARE' overlaid for visual comparison.]

Figure 4. OOB error estimates averaged over all three stego methods and five payloads (0.05, 0.1, ..., 0.4 bpp). Individual classes are shown in different shades of gray.

dimensionality for all assembly strategies. Diversity-boosting strategies (BEST-q and BEST-q-CLASS) clearly achieve better results than the simple ALL.

As expected, ITERATIVE-BEST-q outperforms all other strategies, but its complexity limited us to merging only ten submodels. A little over 3,000 features are in general sufficient to obtain detection accuracy within 0.5–1% of the result when the entire 34,671-dimensional rich model is used. When all ten submodels are selected using this strategy, the best "dependency-unaware" strategy, BEST-q-CLASS, needs roughly double the dimension for comparable performance. This seems to suggest that further and probably substantial improvement of the performance-dimensionality trade-off is likely possible using more sophisticated feature-selection methods.

Overall, the lowest OOB error estimate is indeed obtained when all 106 submodels (dimension 34,671) are used. The gain between using the M = 39 submodels of BEST-q-CLASS (dimensionality 12,753) and all 106 quantized submodels is however rather negligible, indicating a saturation of performance.

Models assembled from a specific class (CLASS-q) also provide interesting insight. We obtain another confirmation that third-order residuals have better detection accuracy than first-order residuals across all stego algorithms. Remarkably, despite its lower dimension, the model assembled from class '2nd' for HUGO is better than class '1st'. This is not true for the other two algorithms and is due to the fact that HUGO preserves complex statistics computed from first-order differences among neighboring pixels. Curiously, while 'EDGE5x5' is better than 'EDGE3x3' for ±1 embedding and HUGO, the opposite is true for EA. The 'EDGE5x5' class appears to be particularly effective against ±1 embedding.

Strategy Q1 (the single black cross at dimensionality 12,753 in Fig. 5) does not optimize w.r.t. the quantization factor q, and thus it is not surprising that its performance is generally inferior to the performance of the equally-dimensional BEST-q-CLASS strategy with M = 39 merged submodels. The loss is however rather small (and there is almost no loss for ±1 embedding). Additionally, Q1 allows the steganalyst to reduce the feature extraction time roughly to 1/3 as only 39 submodels with q = 1c (out of 106) need to be calculated.

Finally, it is rather interesting that at this payload (0.4 bpp) the EA algorithm is less secure than the simple non-adaptive ±1 embedding.

VI. Testing the full framework

The purpose of this last experiment is to test the proposed framework in the way that is customary in research works on steganalysis. In particular, for each split of BOSSbase into 8,074 training images and 1,000 testing images, and for each payload (0.05, 0.1, 0.2, 0.3, and 0.4 bpp) and stego method, we use the BEST-q-CLASS strategy and assemble the rich model as well as the final steganalyzer using only the training set.7 Here, the training set serves the role of an available sample from the cover source in which secret messages are to be detected for a known stego method.

7 It is entirely possible that the submodel ranking is slightly different on each split.

We evaluate the performance using the detection error on the testing set:

P_E = min_{P_FA} (1/2) ( P_FA + P_MD(P_FA) ),   (12)

as a function of the payload expressed in bpp, where P_FA and P_MD are the probabilities of false alarm and missed detection. Fig. 6 shows the median values, P̄_E, together with the detection errors for the CDF set [30] implemented with a Gaussian SVM – the state-of-the-art approach before the introduction of rich models and ensemble classifiers.8 The figure contains the results for several models depending on how many submodels in the BEST-q-CLASS strategy are used; TOP M means that the first M submodels as reported in Fig. 5 were used.

8 To save on processing time, we report the results for the CDF set with a G-SVM for only a single split as the variations over different splits are rather small and similar to those of the ensemble.

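The error measure (12) can be evaluated directly from classifier outputs. The following is our own sketch, assuming higher (hypothetical) soft scores indicate stego, minimizing the average of false-alarm and missed-detection rates over all data-driven thresholds:

```python
import numpy as np

def pe_min_avg_error(cover_scores, stego_scores):
    """P_E of Eq. (12): minimum over decision thresholds of (P_FA + P_MD)/2."""
    best = 0.5  # degenerate detector: always say 'cover' (P_FA = 0, P_MD = 1)
    for t in np.unique(np.concatenate([cover_scores, stego_scores])):
        p_fa = np.mean(cover_scores > t)   # false-alarm rate on covers
        p_md = np.mean(stego_scores <= t)  # missed-detection rate on stegos
        best = min(best, 0.5 * (p_fa + p_md))
    return float(best)

cover = np.array([0.1, 0.2, 0.3, 0.4])
stego = np.array([0.35, 0.5, 0.6, 0.7])
pe = pe_min_avg_error(cover, stego)  # threshold 0.4: P_FA = 0, P_MD = 1/4 -> 0.125
```

Since P_E of a random guesser is 0.5, values are reported on [0, 0.5]; the scan over sample scores suffices because (P_FA + P_MD) only changes at observed score values.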
[Figure 5 appears here: OOB error versus model dimensionality (up to roughly 3.5 × 10^4) for the strategies ALL, BEST-q, BEST-q-CLASS, Q1, CLASS-q, and ITERATIVE-BEST-q, one panel each for ±1 embedding, HUGO, and Edge Adaptive; the labeled points '1st', '2nd', '3rd', 'EDGE3x3', 'EDGE5x5', and 'SQUARE' mark the CLASS-q models.]

Figure 5. Performance-to-model-dimensionality trade-off for six different submodel selection strategies for three algorithms and a fixed relative payload of 0.4 bpp. The performance is reported in terms of OOB error estimates. The last three tics on the x axis for strategy ALL are not drawn to scale. The last point corresponds to a model in which all quantized versions of all 106 submodels are merged.

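The BEST-q-CLASS round-robin compared in Fig. 5 can be sketched as follows. This is a toy version on made-up errors: in the actual procedure, errors[i] is the OOB estimate E_i^(q_i) of submodel i with its best quantization, and classes[i] is its residual class; tie handling is arbitrary here.

```python
def best_q_class(errors, classes):
    """BEST-q-CLASS round-robin of Section IV (toy version).
    Returns submodel indices in the order they would be merged."""
    remaining = set(range(len(errors)))
    order = []
    while remaining:
        # one round: take the best not-yet-selected submodel from every
        # class that still has members
        for c in sorted({classes[i] for i in remaining}):
            candidates = [i for i in remaining if classes[i] == c]
            best = min(candidates, key=lambda i: errors[i])
            order.append(best)
            remaining.remove(best)
    return order

# two classes, five submodels with made-up OOB errors
order = best_q_class([0.30, 0.25, 0.40, 0.20, 0.35], ["A", "A", "A", "B", "B"])
# -> [1, 3, 0, 4, 2]: each round visits every non-empty class, best error first
```

The round structure is what guarantees that the first six merged submodels (in the full 39-submodel setting) come from six different classes.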
[Figure 6 appears here: for each algorithm, P̄_E versus payload (0–0.4 bpp) for the models TOP 1 (d ≈ 330), TOP 3 (d ≈ 985), TOP 10 (d ≈ 3286), TOP 39 (d = 12,753), and the CDF set (d = 1234). The accompanying tables read:]

±1 embedding
payload (bpp) | CDF (G-SVM, single split) | TOP 39 MED | TOP 39 MAD
0.05          | 0.3615                    | 0.2740     | 0.0065
0.10          | 0.2705                    | 0.1985     | 0.0057
0.20          | 0.1890                    | 0.1345     | 0.0035
0.30          | 0.1490                    | 0.0968     | 0.0038
0.40          | 0.1215                    | 0.0785     | 0.0035

HUGO
payload (bpp) | CDF (G-SVM, single split) | TOP 39 MED | TOP 39 MAD
0.05          | 0.4775                    | 0.4240     | 0.0045
0.10          | 0.4540                    | 0.3640     | 0.0023
0.20          | 0.3975                    | 0.2658     | 0.0053
0.30          | 0.3435                    | 0.1915     | 0.0033
0.40          | 0.2750                    | 0.1355     | 0.0035

Edge Adaptive
payload (bpp) | CDF (G-SVM, single split) | TOP 39 MED | TOP 39 MAD
0.05          | 0.4240                    | 0.3255     | 0.0028
0.10          | 0.3555                    | 0.2335     | 0.0067
0.20          | 0.2650                    | 0.1445     | 0.0037
0.30          | 0.1875                    | 0.0958     | 0.0038
0.40          | 0.1390                    | 0.0695     | 0.0020

Figure 6. Detection error for three stego algorithms as a function of payload for several rich models. P̄_E is the median detection error P_E over ten database splits into 8,074/1,000 training/testing images. The models as well as the classifiers were constructed for each split. The model assembly strategy was BEST-q-CLASS. The tables on the left contain the numerical values and a comparison with a classifier implemented using a Gaussian SVM with the CDF set.

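The payloads on the horizontal axes translate into change rates for ±1 embedding through the inverse ternary entropy H_3^{-1}(α) introduced in Section V. A numeric sketch (bisection is our choice here; any root finder works, since H_3 is increasing on (0, 2/3)):

```python
import math

def H3(x):
    """Ternary entropy H3(x) = -x*log2(x) - (1-x)*log2(1-x) + x."""
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x) + x

def H3_inv(alpha, lo=1e-9, hi=2.0 / 3.0, iters=80):
    """Invert H3 by bisection, valid for 0 < alpha < log2(3)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if H3(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

beta = H3_inv(0.4)  # change rate (changes per pixel) for payload 0.4 bpp, ~0.063
```

With optimal ternary coding, a 0.4 bpp payload thus requires modifying only about 6.3% of the pixels, which is why detectability grows much more slowly than the payload itself.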
A. Evaluation by steganographic methods good performance that can be achieved with very low
The results confirm that HUGO is by far the best complexity. Symmetries of natural images are heavily uti-
algorithm of all three capable of hiding the payload 0.05 lized to compactify the model and increase the statistical
bpp with PE ≈ 0.42. Surprisingly, the security of the EA significance of individual co-occurrence bins forming the
algorithm is comparable with that of ±1 embedding for model. Several simple submodel-selection strategies are
payloads larger than 0.3 bpp. We observed that at higher tested to improve the trade-off between detection accuracy
payloads the EA algorithm loses much of its adaptivity and and model dimensionality.
embeds with higher change rate than ±1 embedding due The framework is demonstrated on three stego algo-
to its less sophisticated syndrome coding. For smaller pay- rithms operating in the spatial domain: ±1 embedding
loads, the EA algorithm is only slightly more secure than and two content-adaptive methods – HUGO and an edge-
±1 embedding. Overall, the detection of both adaptive adaptive method by Luo et al. [32]. Ensemble classifiers
stego methods benefits more from the rich model than ±1 with the rich model significantly outperform previously-
embedding, which is to be expected and was commented proposed detectors especially for the two adaptive meth-
upon already in Section V-B. ods as they place embedding changes in hard-to-model
regions of images where the rich model better discerns the
embedding changes. Remarkably, the rich model is capable
B. Evaluation w.r.t. previous models
of achieving the same level of statistical detectability with
The proposed detectors provide a substantial improve- dimensions 30 to 100 times smaller than for the early
ment in detection accuracy over the 1234-dimensional versions of rich models [17], [16], [27].
CDF set with a Gaussian SVM even when the smallest The rich models are built using the philosophy of max-
model (TOP1 with dimensionality slightly above 300) is imizing the diversity of submodels while keeping all their
used. This improvement is again much higher for the elements (co-occurrence bins) well populated and thus
two adaptive stego algorithms. With regards to the more statistically significant. This is quite different from the
recent publications on detection of HUGO, we note that model used in HUGO, where the authors simply increased
the results of [19] were reported for HUGO implemented the truncation threshold to obtain a high-dimensional
with T = 90, which introduces artifacts that make the model. Besides steganalysis, the rich model could be used
steganalysis significantly more accurate and thus incompa- for steganography as well by endowing the model space
rable with the HUGO algorithm run with T = 255 tested with an appropriate distortion function using, e.g., the
here (see Section V and [29] for more details). Even though method described in [11]. The authors, however, hypoth-
the attacks on HUGO reported in [17], [16], [27], [19] did esize that steganographic methods based on minimizing
not explicitly utilize the above-mentioned weakness, they, distortion in a rich model space, such as [10], may no longer
too are likely affected by the weakness and are thus not be able to embed large payloads undetectably as it will
directly comparable. Having said this, the best results become increasingly harder to preserve a large number of
of [17] on HUGO (with T = 90) achieved with model dimensionality of 33,930 can now be matched with our rich model with dimensionality 30–100 times smaller. The decrease in the detection error P_E ranges from roughly 6% (for payload 0.1 bpp) to about 3% for payload 0.4 bpp. For ±1 embedding, the improvement is smaller and ranges from 1–2%.

VII. Conclusion

Recent developments in digital media steganalysis clearly indicate the immense importance of accurate models that are relevant for steganalysis. The accuracy of steganalyzers and their ability to detect a wide spectrum of embedding methods in various cover sources strongly depend on the quality and generality of the cover model. It appears that any substantial progress is only possible when steganalysts incorporate more complex models that capture a large number of dependencies among pixels. This paper introduces a novel methodology for constructing rich models of the noise component of digital images, rich in the sense that they consider numerous qualitatively different relationships among pixels. The model is assembled for a given sample of the cover source and stego method. Both the model-building and the construction of the final steganalyzer use ensemble classifiers because of their ability to efficiently handle high-dimensional feature spaces.

A noteworthy implication of this work for steganography is that a scheme preserving a complex cover model may still be detectable by a much simpler model formed from statistically significant quantities. This statement stems from an observation made in this paper, namely that submodels built from first-order differences among pixels are able to detect HUGO relatively reliably despite the fact that its distortion function minimizes perturbations to joint statistics built from such differences.

We expect that the proposed rich models of the noise component might find applications beyond steganography and steganalysis in related fields, such as digital forensics, for problems dealing with imaging hardware identification, media integrity, processing history recovery, and authentication. A similar framework based on rich models can likely be adopted for other media types, including audio and video signals.

The steganalyzers and models proposed in this paper consist of several procedures and modules whose design certainly deserves further investigation and optimization that might bring further performance improvement. This concerns, for example, the quantizer of the multidimensional co-occurrences. Replacing the uniform scalar quantizers with non-uniform or vector quantizers may help us further improve the performance for a fixed model dimensionality. An additional boost can likely be obtained by applying more sophisticated feature-selection algorithms for choosing the submodels and/or their individual bins.
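The observation about first-order differences can be made concrete with a small sketch: a horizontal first-order residual is quantized, truncated to {−T, …, T}, and summarized by a co-occurrence histogram of neighboring residual samples. This is an illustrative reimplementation, not the code used in the paper; the settings q = 1, T = 2, and pairwise (second-order) horizontal co-occurrences are example choices rather than the configurations evaluated above.

```python
import numpy as np

def cooc_submodel(X, q=1, T=2):
    """Co-occurrence of quantized/truncated first-order differences.

    X : 2-D uint8 array (grayscale image).
    q : quantization step; T : truncation threshold, so each residual
    sample is mapped into {-T, ..., T} (2T+1 values).
    Returns a normalized histogram of horizontally adjacent residual
    pairs -- a (2T+1)**2-dimensional feature vector.
    """
    X = X.astype(np.int64)
    # first-order horizontal difference (noise residual)
    R = X[:, 1:] - X[:, :-1]
    # quantize and truncate to {-T, ..., T}
    Rq = np.clip(np.round(R / q), -T, T).astype(np.int64)
    # encode each horizontally neighboring residual pair as one bin index
    pairs = (Rq[:, :-1] + T) * (2 * T + 1) + (Rq[:, 1:] + T)
    h = np.bincount(pairs.ravel(), minlength=(2 * T + 1) ** 2)
    return h / h.sum()

# toy usage on a random "image"
rng = np.random.default_rng(0)
f = cooc_submodel(rng.integers(0, 256, size=(64, 64), dtype=np.uint8))
print(f.shape)  # (25,) for T = 2
```

A rich model is then a union of many such histograms computed from qualitatively different residuals (different directions, orders, filters) and different quantization steps q.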
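The ensemble classifier referred to throughout can be pictured as a collection of linear base learners, each trained on a low-dimensional random subspace of the feature space, with decisions fused by majority voting. The following is a simplified sketch only (Fisher linear discriminant base learners, synthetic data, and no bootstrap/out-of-bag machinery), not the implementation used in the paper; the names and parameter values are illustrative.

```python
import numpy as np

def train_ensemble(Xc, Xs, L=51, d_sub=8, seed=0):
    """Random-subspace ensemble with Fisher linear discriminant (FLD)
    base learners -- a simplified sketch of an ensemble classifier.

    Xc, Xs : (n, d) feature matrices for cover and stego classes.
    Returns a list of (subspace indices, weight vector, bias) triples.
    """
    rng = np.random.default_rng(seed)
    d = Xc.shape[1]
    learners = []
    for _ in range(L):
        idx = rng.choice(d, size=d_sub, replace=False)  # random subspace
        c, s = Xc[:, idx], Xs[:, idx]
        mu_c, mu_s = c.mean(axis=0), s.mean(axis=0)
        # within-class scatter, regularized for numerical stability
        Sw = np.cov(c, rowvar=False) + np.cov(s, rowvar=False)
        Sw += 1e-6 * np.eye(d_sub)
        w = np.linalg.solve(Sw, mu_s - mu_c)   # FLD direction
        b = -0.5 * w @ (mu_c + mu_s)           # threshold at midpoint
        learners.append((idx, w, b))
    return learners

def predict(learners, X):
    """Majority vote over base learners: 1 = stego, 0 = cover."""
    votes = np.array([(X[:, idx] @ w + b > 0) for idx, w, b in learners])
    return (votes.mean(axis=0) > 0.5).astype(int)

# toy usage: two Gaussian classes separated by a small mean shift
rng = np.random.default_rng(1)
Xc = rng.normal(0.0, 1.0, size=(200, 30))
Xs = rng.normal(0.4, 1.0, size=(200, 30))
model = train_ensemble(Xc, Xs)
# accuracy on the training data (for illustration only)
acc = (predict(model, Xs).mean() + (1 - predict(model, Xc)).mean()) / 2
```

In the full scheme, each base learner would additionally be trained on a bootstrap sample of the training images, so that the left-out samples yield the out-of-bag (OOB) error estimate used for model assembly and for setting d_sub.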
Finally, the local pixel predictors could be parametrized and optimized w.r.t. a specific stego method and cover source.

As our final remark, we note that one could view the model-building process independently of the final classifier design. It is certainly possible to use the speed and convenience of the ensemble to assemble the model and then use it to build a classifier using a different machine-learning tool that may provide a better separation between classes when a highly non-linear boundary exists that may not be well captured by the ensemble equipped with linear base learners. In fact, we observed that features built as co-occurrences of neighboring noise residuals often lead to non-linear boundaries that are better captured by Gaussian SVMs than by the ensemble as implemented in this paper. This has already been observed in our previous work [17] and is confirmed for our rich model as well.⁹

To demonstrate the potential of this approach, we included one more final experiment. We used our best rich model (best in terms of the OOB error estimate vs. dimensionality) assembled using the strategy ITERATIVE-BEST-q with ten submodels merged (dimensionality approximately 3300) and trained a G-SVM for all three algorithms using the same experimental setup as described in Section VI. This was the largest model we could afford to use with a G-SVM given our computing resources. Calculating the median detection error over ten splits, in Table II we compare the results with the detection error of classifiers implemented as ensembles using the 12,753-dimensional TOP39 rich model. We show only the results for the 0.4 bpp payload, as carrying out these types of experiments under our experimental setting (feature dimensionality and training set size) is rather expensive with a G-SVM. Interestingly, the smaller model with a G-SVM as the final classifier provided our best detection results. The improvement is roughly 0.5–1% over all three steganographic methods in terms of the median testing error, with a similar level of statistical variability over the splits. The running time of a G-SVM classifier with 3,300-dimensional features, however, was on average 30–90 times higher than the running time of the ensemble classifier with 12,753-dimensional features, as reported in Table III. The measured running times correspond to the full training and testing, including the parameter-search procedures of both types of classifiers. In the case of the ensemble, this is the search for the optimal value of d_sub; in the case of a G-SVM, it is a five-fold cross-validation search for the optimal hyper-parameters, the cost parameter C and the kernel width γ, carried out on the multiplicative grid G_C × G_γ, with G_C = {10^a}, a ∈ {0, ..., 4}, and G_γ = {(1/d) · 2^b}, b ∈ {−4, ..., 3}, where d is the feature-space dimensionality. We used our Matlab implementation of the ensemble classifier¹⁰ and the publicly available package LIBSVM [5] (with manually implemented cross-validation [25]) to conduct the G-SVM experiments.

⁹ In contrast, in the JPEG domain co-occurrences between quantized DCT coefficients appear to react in a more linear fashion to embedding, causing the ensemble to perform equally well as G-SVMs [27], [28].
¹⁰ Available at http://dde.binghamton.edu/download/ensemble

Table II
Detection error P_E for three algorithms for payload 0.4 bpp when the ensemble is used with the rich 12,753-dimensional TOP39 model and when a G-SVM is combined with the ∼3300-dimensional best ITERATIVE-BEST-q model. The reported numbers are achieved over ten splits of BOSSbase (MED = median, MAD = median absolute deviation over the splits).

              Ensemble              G-SVM
Algorithm     MED       MAD         MED       MAD
±1 Emb.       0.0785    0.0035      0.0683    0.0042
HUGO          0.1355    0.0035      0.1310    0.0065
EA            0.0695    0.0020      0.0643    0.0030

Table III
The average running time (for the training and testing together) of the experiments in Table II if executed on a single computer with the AMD Opteron 275 processor running at 2.2 GHz.

Algorithm     Ensemble        G-SVM
±1 Emb.       1 hr 20 min     4 days 22 hr 37 min
HUGO          4 hr 35 min     8 days 15 hr 31 min
EA            3 hr 09 min     3 days 23 hr 50 min

References

[1] I. Avcibas, M. Kharrazi, N. D. Memon, and B. Sankur. Image steganalysis with binary similarity measures. EURASIP Journal on Applied Signal Processing, 17:2749–2757, 2005.
[2] I. Avcibas, N. D. Memon, and B. Sankur. Steganalysis using image quality metrics. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security and Watermarking of Multimedia Contents III, volume 4314, pages 523–531, San Jose, CA, January 22–25, 2001.
[3] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, August 1996.
[4] G. Cancelli, G. Doërr, I. J. Cox, and M. Barni. Detection of ±1 LSB steganography based on the amplitude of histogram local extrema. In Proceedings IEEE, International Conference on Image Processing, ICIP 2008, pages 1288–1291, San Diego, CA, October 12–15, 2008.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] C. Chen and Y. Q. Shi. JPEG image steganalysis utilizing both intrablock and interblock correlations. In Circuits and Systems, ISCAS 2008. IEEE International Symposium on, pages 3029–3032, May 2008.
[7] C. Chen, Y. Q. Shi, and W. Su. A machine learning based scheme for double JPEG compression detection. In 19th International Conference on Pattern Recognition (ICPR 2008), pages 1–4, Tampa, FL, 2009.
[8] V. Chonev and A. D. Ker. Feature restoration and distortion metrics. In N. D. Memon, E. J. Delp, P. W. Wong, and J. Dittmann, editors, Proceedings SPIE, Electronic Imaging, Security and Forensics of Multimedia XIII, volume 7880, pages 0G01–0G14, San Francisco, CA, January 23–26, 2011.
[9] H. Farid and L. Siwei. Detecting hidden messages using higher-order statistics and support vector machines. In F. A. P. Petitcolas, editor, Information Hiding, 5th International Workshop, volume 2578 of Lecture Notes in Computer Science, pages 340–354, Noordwijkerhout, The Netherlands, October 7–9, 2002. Springer-Verlag, New York.
[10] T. Filler and J. Fridrich. Gibbs construction in steganography. IEEE Transactions on Information Forensics and Security, 5(4):705–720, 2010.
Table IV
List of symbols.

X, X̄ — cover/stego image
I, J — index sets
T — threshold
q — quantization step
trunc_T — truncation
P̄_E, P_E — (median) testing error
H_3(x) — ternary entropy
C_d^(h), C_d^(v) — horizontal/vertical co-occurrence
R = (R_ij) — noise residual
X̂_ij — pixel predictor
c — residual order
f — number of filters
σ — symmetry index
N_ij — pixel neighborhood
N^trn, N^tst — number of training/testing points
B_l — lth base learner
B^(l) — decision of B_l
N_l^b — bootstrap set of B_l
D_l — lth random subspace
d_sub — random-subspace dimensionality
L — number of base learners
E_i^(q) — OOB of M_i^(q)
E_OOB^(L) — OOB with L learners
E_OOB(M) — OOB with model M
M_i^(q) — ith submodel with quantization q
q_i — quantization step giving the smallest OOB for M_i
OOB — out-of-bag estimate
C̄_d — symmetrized C_d
[11] T. Filler and J. Fridrich. Design of adaptive steganographic schemes for digital images. In N. D. Memon, E. J. Delp, P. W. Wong, and J. Dittmann, editors, Proceedings SPIE, Electronic Imaging, Security and Forensics of Multimedia XIII, volume 7880, pages OF 1–14, San Francisco, CA, January 23–26, 2011.
[12] T. Filler, J. Judas, and J. Fridrich. Minimizing additive distortion in steganography using syndrome-trellis codes. IEEE Transactions on Information Forensics and Security, 6(3):920–935, 2011.
[13] T. Filler, T. Pevný, and P. Bas. BOSS (Break Our Steganography System). http://boss.gipsa-lab.grenoble-inp.fr, July 2010.
[14] J. Fridrich. Feature-based steganalysis for JPEG images and its implications for future design of steganographic schemes. In J. Fridrich, editor, Information Hiding, 6th International Workshop, volume 3200 of Lecture Notes in Computer Science, pages 67–81, Toronto, Canada, May 23–25, 2004. Springer-Verlag, New York.
[15] J. Fridrich. Steganography in Digital Media: Principles, Algorithms, and Applications. Cambridge University Press, 2009.
[16] J. Fridrich, J. Kodovský, M. Goljan, and V. Holub. Breaking HUGO – the process discovery. In T. Filler, T. Pevný, A. Ker, and S. Craver, editors, Information Hiding, 13th International Workshop, volume 6958 of Lecture Notes in Computer Science, pages 85–101, Prague, Czech Republic, May 18–20, 2011.
[17] J. Fridrich, J. Kodovský, M. Goljan, and V. Holub. Steganalysis of content-adaptive steganography in spatial domain. In T. Filler, T. Pevný, A. Ker, and S. Craver, editors, Information Hiding, 13th International Workshop, volume 6958 of Lecture Notes in Computer Science, pages 102–117, Prague, Czech Republic, May 18–20, 2011.
[18] M. Goljan, J. Fridrich, and T. Holotyak. New blind steganalysis and its implications. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Steganography, and Watermarking of Multimedia Contents VIII, volume 6072, pages 1–13, San Jose, CA, January 16–19, 2006.
[19] G. Gül and F. Kurugollu. A new methodology in steganalysis: Breaking highly undetectable steganography (HUGO). In T. Filler, T. Pevný, A. Ker, and S. Craver, editors, Information Hiding, 13th International Workshop, volume 6958 of Lecture Notes in Computer Science, pages 71–84, Prague, Czech Republic, May 18–20, 2011.
[20] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors. Feature Extraction, Foundations and Applications. Physica-Verlag, Springer, 2006.
[21] J. J. Harmsen and W. A. Pearlman. Steganalysis of additive noise modelable information hiding. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security and Watermarking of Multimedia Contents V, volume 5020, pages 131–142, Santa Clara, CA, January 21–24, 2003.
[22] S. Katzenbeisser and F. A. P. Petitcolas, editors. Information Hiding Techniques for Steganography and Digital Watermarking. New York: Artech House, 2000.
[23] A. D. Ker and R. Böhme. Revisiting weighted stego-image steganalysis. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Forensics, Steganography, and Watermarking of Multimedia Contents X, volume 6819, pages 5 1–5 17, San Jose, CA, January 27–31, 2008.
[24] M. Kirchner and J. Fridrich. On detection of median filtering in images. In Proceedings SPIE, Electronic Imaging, Media Forensics and Security XII, volume 7542, pages 10 1–12, San Jose, CA, January 17–21, 2010.
[25] J. Kodovský. On dangers of cross-validation in steganalysis. Technical report, Binghamton University, August 2011.
[26] J. Kodovský and J. Fridrich. On completeness of feature spaces in blind steganalysis. In A. D. Ker, J. Dittmann, and J. Fridrich, editors, Proceedings of the 10th ACM Multimedia & Security Workshop, pages 123–132, Oxford, UK, September 22–23, 2008.
[27] J. Kodovský and J. Fridrich. Steganalysis in high dimensions: Fusing classifiers built on random subspaces. In N. D. Memon, E. J. Delp, P. W. Wong, and J. Dittmann, editors, Proceedings SPIE, Electronic Imaging, Security and Forensics of Multimedia XIII, volume 7880, pages OL 1–13, San Francisco, CA, January 23–26, 2011.
[28] J. Kodovský, J. Fridrich, and V. Holub. Ensemble classifiers for steganalysis of digital media. IEEE Transactions on Information Forensics and Security, 2011. Under review.
[29] J. Kodovský, J. Fridrich, and V. Holub. On dangers of overtraining steganography to incomplete cover model. In J. Dittmann, S. Craver, and C. Heitzenrater, editors, Proceedings of the 13th ACM Multimedia & Security Workshop, Niagara Falls, NY, September 29–30, 2011.
[30] J. Kodovský, T. Pevný, and J. Fridrich. Modern steganalysis can detect YASS. In N. D. Memon, E. J. Delp, P. W. Wong, and J. Dittmann, editors, Proceedings SPIE, Electronic Imaging, Security and Forensics of Multimedia XII, volume 7541, pages 02 01–02 11, San Jose, CA, January 17–21, 2010.
[31] T. G. Kolda, R. M. Lewis, and V. Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45:385–482, 2003.
[32] W. Luo, F. Huang, and J. Huang. Edge adaptive image steganography based on LSB matching revisited. IEEE Transactions on Information Forensics and Security, 5(2):201–214, June 2010.
[33] T. Pevný, P. Bas, and J. Fridrich. Steganalysis by subtractive pixel adjacency matrix. IEEE Transactions on Information Forensics and Security, 5(2):215–224, June 2010.
[34] T. Pevný, T. Filler, and P. Bas. Using high-dimensional image models to perform highly undetectable steganography. In R. Böhme and R. Safavi-Naini, editors, Information Hiding, 12th International Workshop, volume 6387 of Lecture Notes in Computer Science, pages 161–177, Calgary, Canada, June 28–30, 2010. Springer-Verlag, New York.
[35] Y. Q. Shi, C. Chen, and W. Chen. A Markov process based approach to effective attacking JPEG steganography. In J. L. Camenisch, C. S. Collberg, N. F. Johnson, and P. Sallee, editors, Information Hiding, 8th International Workshop, volume 4437 of Lecture Notes in Computer Science, pages 249–264, Alexandria, VA, July 10–12, 2006. Springer-Verlag, New York.
[36] K. Sullivan, U. Madhow, S. Chandrasekaran, and B. S. Manjunath. Steganalysis of spread spectrum data hiding exploiting cover memory. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Steganography, and Watermarking of Multimedia Contents VII, volume 5681, pages 38–46, January 16–20.
[37] D. Upham. Steganographic algorithm JSteg. Software available at http://zooid.org/~paul/crypto/jsteg.
[38] D. Zou, Y. Q. Shi, W. Su, and G. Xuan. Steganalysis based on Markov model of thresholded prediction-error image. In Proceedings IEEE, International Conference on Multimedia and Expo, pages 1365–1368, Toronto, Canada, July 9–12, 2006.