
Steganography in Digital Media

Steganography, the art of hiding information in apparently innocuous objects or
images, is a field with a rich heritage, and an area of rapid current development. This
clear, self-contained guide shows you how to understand the building blocks of covert
communication in digital media files and how to apply the techniques in practice,
including those of steganalysis, the detection of steganography. Assuming only a basic
knowledge in calculus and statistics, the book blends the various strands of steganogra-
phy, including information theory, coding, signal estimation and detection, and statistical
signal processing. Experiments on real media files demonstrate the performance of the
techniques in real life, and most techniques are supplied with pseudo-code, making
it easy to implement the algorithms. The book is ideal for students taking courses on
steganography and information hiding, and is also a useful reference for engineers and
practitioners working in media security and information assurance.

Jessica Fridrich is Professor of Electrical and Computer Engineering at Binghamton
University, State University of New York (SUNY), where she has worked since receiving
her Ph.D. from that institution in 1995. Since then, her research on data embedding and
steganalysis has led to more than 85 papers and 7 US patents. She also received the SUNY
Chancellor’s Award for Excellence in Research in 2007 and the Award for Outstanding
Inventor in 2002. Her main research interests are in steganography and steganalysis of
digital media, digital watermarking, and digital image forensics.
Steganography in Digital Media
Principles, Algorithms, and Applications

JESSICA FRIDRICH
Binghamton University, State University of New York (SUNY)
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9780521190190


© Cambridge University Press 2010

This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.

First published 2010

Printed in the United Kingdom at the University Press, Cambridge

A catalogue record for this publication is available from the British Library

ISBN 978 0 521 19019 0 Hardback

Additional resources for this publication at www.cambridge.org/9780521190190

Cambridge University Press has no responsibility for the persistence or
accuracy of URLs for external or third-party internet websites referred to
in this publication, and does not guarantee that any content on such
websites is, or will remain, accurate or appropriate.

To Nicole and Kathy

Time will bring to light whatever is hidden; it will cover up and conceal what is
now shining in splendor.

Quintus Horatius Flaccus (65–8 BC)


Contents

Preface page xv
Acknowledgments xxiii

1 Introduction 1
1.1 Steganography throughout history 3
1.2 Modern steganography 7
1.2.1 The prisoners’ problem 9
1.2.2 Steganalysis is the warden’s job 10
1.2.3 Steganographic security 11
1.2.4 Steganography and watermarking 12
Summary 13

2 Digital image formats 15


2.1 Color representation 15
2.1.1 Color sampling 17
2.2 Spatial-domain formats 18
2.2.1 Raster formats 18
2.2.2 Palette formats 19
2.3 Transform-domain formats (JPEG) 22
2.3.1 Color subsampling and padding 23
2.3.2 Discrete cosine transform 24
2.3.3 Quantization 25
2.3.4 Decompression 27
2.3.5 Typical DCT block 28
2.3.6 Modeling DCT coefficients 29
2.3.7 Working with JPEG images in Matlab 30
Summary 30
Exercises 31

3 Digital image acquisition 33


3.1 CCD and CMOS sensors 34
3.2 Charge transfer and readout 35
3.3 Color filter array 36
3.4 In-camera processing 38


3.5 Noise 39
Summary 44
Exercises 45

4 Steganographic channel 47
4.1 Steganography by cover selection 50
4.2 Steganography by cover synthesis 51
4.3 Steganography by cover modification 53
Summary 56
Exercises 57

5 Naive steganography 59
5.1 LSB embedding 60
5.1.1 Histogram attack 64
5.1.2 Quantitative attack on Jsteg 66
5.2 Steganography in palette images 68
5.2.1 Embedding in palette 68
5.2.2 Embedding by preprocessing palette 69
5.2.3 Parity embedding in sorted palette 70
5.2.4 Optimal-parity embedding 72
5.2.5 Adaptive methods 73
5.2.6 Embedding while dithering 75
Summary 76
Exercises 76

6 Steganographic security 81
6.1 Information-theoretic definition 82
6.1.1 KL divergence as a measure of security 83
6.1.2 KL divergence for benchmarking 85
6.2 Perfectly secure steganography 88
6.2.1 Perfect security and compression 89
6.2.2 Perfect security with respect to model 91
6.3 Secure stegosystems with limited embedding distortion 92
6.3.1 Spread-spectrum steganography 93
6.3.2 Stochastic quantization index modulation 95
6.3.3 Further reading 97
6.4 Complexity-theoretic approach 98
6.4.1 Steganographic security by Hopper et al. 100
6.4.2 Steganographic security by Katzenbeisser and Petitcolas 101
6.4.3 Further reading 102
Summary 103
Exercises 103

7 Practical steganographic methods 107


7.1 Model-preserving steganography 108
7.1.1 Statistical restoration 108
7.1.2 Model-based steganography 110
7.2 Steganography by mimicking natural processing 114
7.2.1 Stochastic modulation 114
7.2.2 The question of optimal stego noise 117
7.3 Steganalysis-aware steganography 119
7.3.1 ±1 embedding 119
7.3.2 F5 embedding algorithm 119
7.4 Minimal-impact steganography 122
7.4.1 Performance bound on minimal-impact embedding 124
7.4.2 Optimality of F5 embedding operation 128
Summary 130
Exercises 131

8 Matrix embedding 135


8.1 Matrix embedding using binary Hamming codes 137
8.2 Binary linear codes 139
8.3 Matrix embedding theorem 142
8.3.1 Revisiting binary Hamming codes 144
8.4 Theoretical bounds 144
8.4.1 Bound on embedding efficiency for codes of fixed length 144
8.4.2 Bound on embedding efficiency for codes of increasing length 145
8.5 Matrix embedding for large relative payloads 149
8.6 Steganography using q-ary symbols 151
8.6.1 q-ary Hamming codes 152
8.6.2 Performance bounds for q-ary codes 154
8.6.3 The question of optimal q 156
8.7 Minimizing embedding impact using sum and difference covering set 158
Summary 162
Exercises 163

9 Non-shared selection channel 167


9.1 Wet paper codes with syndrome coding 169
9.2 Matrix LT process 171
9.2.1 Implementation 173
9.3 Wet paper codes with improved embedding efficiency 174
9.3.1 Implementation 177
9.3.2 Embedding efficiency 179
9.4 Sample applications 179
9.4.1 Minimal-embedding-impact steganography 179
9.4.2 Perturbed quantization 180
9.4.3 MMx embedding algorithm 183


9.4.4 Public-key steganography 184
9.4.5 e + 1 matrix embedding 185
9.4.6 Extending matrix embedding using Hamming codes 186
9.4.7 Removing shrinkage from F5 algorithm (nsF5) 188
Summary 189
Exercises 190

10 Steganalysis 193
10.1 Typical scenarios 194
10.2 Statistical steganalysis 195
10.2.1 Steganalysis as detection problem 196
10.2.2 Modeling images using features 196
10.2.3 Optimal detectors 197
10.2.4 Receiver operating characteristic (ROC) 198
10.3 Targeted steganalysis 201
10.3.1 Features 201
10.3.2 Quantitative steganalysis 205
10.4 Blind steganalysis 207
10.4.1 Features 208
10.4.2 Classification 209
10.5 Alternative use of blind steganalyzers 211
10.5.1 Targeted steganalysis 211
10.5.2 Multi-classification 211
10.5.3 Steganography design 212
10.5.4 Benchmarking 212
10.6 Influence of cover source on steganalysis 212
10.7 System attacks 215
10.8 Forensic steganalysis 217
Summary 218
Exercises 219

11 Selected targeted attacks 221


11.1 Sample Pairs Analysis 221
11.1.1 Experimental verification of SPA 226
11.1.2 Constructing a detector of LSB embedding using SPA 227
11.1.3 SPA from the point of view of structural steganalysis 230
11.2 Pairs Analysis 234
11.2.1 Experimental verification of Pairs Analysis 237
11.3 Targeted attack on F5 using calibration 237
11.4 Targeted attacks on ±1 embedding 240
Summary 247
Exercises 247

12 Blind steganalysis 251


12.1 Features for steganalysis of JPEG images 253
12.1.1 First-order statistics 254
12.1.2 Inter-block features 255
12.1.3 Intra-block features 256
12.2 Blind steganalysis of JPEG images (cover-versus-all-stego) 258
12.2.1 Image database 258
12.2.2 Algorithms 259
12.2.3 Training database of stego images 259
12.2.4 Training 260
12.2.5 Testing on known algorithms 261
12.2.6 Testing on unknown algorithms 262
12.3 Blind steganalysis of JPEG images (one-class neighbor machine) 263
12.3.1 Training and testing 264
12.4 Blind steganalysis for targeted attacks 265
12.4.1 Quantitative blind attacks 267
12.5 Blind steganalysis in the spatial domain 270
12.5.1 Noise features 271
12.5.2 Experimental evaluation 273
Summary 274

13 Steganographic capacity 277


13.1 Steganographic capacity of perfectly secure stegosystems 278
13.1.1 Capacity for some simple models of covers 280
13.2 Secure payload of imperfect stegosystems 281
13.2.1 The SRL of imperfect steganography 282
13.2.2 Experimental verification of the SRL 287
Summary 290
Exercises 291

A Statistics 293
A.1 Descriptive statistics 293
A.1.1 Measures of central tendency and spread 294
A.1.2 Construction of PRNGs using compounding 296
A.2 Moment-generating function 297
A.3 Jointly distributed random variables 299
A.4 Gaussian random variable 302
A.5 Multivariate Gaussian distribution 303
A.6 Asymptotic laws 305
A.7 Bernoulli and binomial distributions 306
A.8 Generalized Gaussian, generalized Cauchy, Student’s t-distributions 307
A.9 Chi-square distribution 310
A.10 Log–log empirical cdf plot 310

B Information theory 313


B.1 Entropy, conditional entropy, mutual information 313
B.2 Kullback–Leibler divergence 316
B.3 Lossless compression 321
B.3.1 Prefix-free compression scheme 322

C Linear codes 325


C.1 Finite fields 325
C.2 Linear codes 326
C.2.1 Isomorphism of codes 328
C.2.2 Orthogonality and dual codes 329
C.2.3 Perfect codes 331
C.2.4 Cosets of linear codes 332

D Signal detection and estimation 335


D.1 Simple hypothesis testing 335
D.1.1 Receiver operating characteristic 339
D.1.2 Detection of signals corrupted by white Gaussian noise 339
D.2 Hypothesis testing and Fisher information 341
D.3 Composite hypothesis testing 343
D.4 Chi-square test 345
D.5 Estimation theory 347
D.6 Cramer–Rao lower bound 349
D.7 Maximum-likelihood and maximum a posteriori estimation 354
D.8 Least-square estimation 355
D.9 Wiener filter 357
D.9.1 Practical implementation for images 358
D.10 Vector spaces with inner product 359
D.10.1 Cauchy–Schwarz inequality 361

E Support vector machines 363


E.1 Binary classification 363
E.2 Linear support vector machines 364
E.2.1 Linearly separable training set 364
E.2.2 Non-separable training set 366
E.3 Kernelized support vector machines 369
E.4 Weighted support vector machines 371
E.5 Implementation of support vector machines 373
E.5.1 Scaling 373
E.5.2 Kernel selection 373
E.5.3 Determining parameters 373
E.5.4 Final training 374
E.5.5 Evaluating classification performance 375

Notation and symbols 377


Glossary 387
References 409
Index 427
Online ISBN: 9781139192903
Preface

Steganography is another term for covert communication. It works by hiding
messages in inconspicuous objects that are then sent to the intended recipient.
The most important requirement of any steganographic system is that it should
be impossible for an eavesdropper to distinguish between ordinary objects and
objects that contain secret data.
Steganography in its modern form is relatively young. Until the early 1990s,
this unusual mode of secret communication was used only by spies. At that time,
it was hardly a research discipline because the methods were a mere collection
of clever tricks with little or no theoretical basis that would allow steganography
to evolve in the manner we see today. With the subsequent spontaneous transi-
tion of communication from analog to digital, this ancient field experienced an
explosive rejuvenation. Hiding messages in electronic documents for the purpose
of covert communication seemed easy enough to those with some background
in computer programming. Soon, steganographic applications appeared on the
Internet, giving the masses the ability to hide files in digital images, audio, or
text. At the same time, steganography caught the attention of researchers and
quickly developed into a rigorous discipline. With it, steganography came to the
forefront of discussions at professional meetings, such as the Electronic Imaging
meetings annually organized by the SPIE in San Jose, the IEEE International
Conference on Image Processing (ICIP), and the ACM Multimedia and Secu-
rity Workshop. In 1996, the first Information Hiding Workshop took place in
Cambridge and this series of workshops has since become the premier annual
meeting place to present the latest advancements in theory and applications of
data hiding.
Steganography shares many common features with the related but fundamentally
quite different field of digital watermarking. In the late 1990s, digital
watermarking dominated the research in data hiding due to its numerous lucrative
applications, such as digital rights management, secure media distribution, and
authentication. As watermarking matured, the interest in steganography and
steganalysis gradually intensified, especially after concerns had been raised that
steganography might be used by criminals.
Even though this is not the first book dealing with the subject of steganog-
raphy [22, 47, 51, 123, 142, 211, 239, 250], as far as the author is aware this is
the first self-contained text with in-depth exposition of both steganography and
steganalysis for digital media files. Even though this field is still developing at a
fast pace and many fundamental questions remain unresolved, the foundations
have been laid and basic principles established. This book was written to provide
the reader with the basic philosophy and building blocks from which many prac-
tical steganographic and steganalytic schemes are constructed. The selection of
the material presented in this book represents the author’s view of the field and
is by no means an exhaustive survey of steganography in general. The selected
examples from the literature were included to illustrate the basic concepts and
provide the reader with specific technical solutions. Thus, any omissions in the
references should not be interpreted as indications regarding the quality of the
omitted work.
This book was written as a primary text for a graduate or senior undergradu-
ate course on steganography. It can also serve as a supporting text for virtually
any course dealing with aspects of media security, privacy, and secure commu-
nication. The research problems presented here may be used as motivational
examples or projects to illustrate concepts taught in signal detection and esti-
mation, image processing, and communication. The author hopes that the book
will also be useful to researchers and engineers actively working in multimedia
security and assist those who wish to enter this beautiful and rapidly evolving
multidisciplinary field in their search for open and relevant research topics.
The text naturally evolved from lecture notes for a graduate course on
steganography that the author has taught at Binghamton University, New York
for several years. This pedigree influenced the presentation style of this book as
well as its layout and content. The author tried to make the material as self-
contained as possible within reasonable limits. Steganography is built upon the
pillars of information theory, estimation and detection theory, coding theory,
and machine learning. The book contains five appendices that cover all topics
in these areas that the reader needs to become familiar with to obtain a firm
grasp of the material. The prerequisites for this book are truly minimalistic and
consist of college-level calculus and probability and statistics.
Each chapter starts with simple reasoning aimed to provoke the reader to think
on his/her own and thus better see the need for the content that follows. The
introduction of every chapter and section is written in a narrative style aimed
to provide the big picture before presenting detailed technical arguments. The
overall structure of the book and numerous cross-references help those who wish
to read just selected chapters. To aid the reader in implementing the techniques,
most algorithms described in this book are accompanied by pseudo-code.
Furthermore, practitioners will likely appreciate experiments on real media files
that demonstrate the performance of the techniques in real life. The lessons
learned serve as motivation for subsequent sections and chapters. In order to
make the book accessible to a wide spectrum of readers, most technical arguments
are presented in their simplest core form rather than the most general fashion,
while referring the interested reader to literature for more details. Each chapter is
closed with a brief summary that highlights the most important facts.

Number of steganographic software applications that can hide data in electronic
media as of June 2008 (adapted from [122] and reprinted with permission of
John Wiley & Sons, Inc.):

Cover type    Count   Share
Images         1689   56.1%
Audio           445   14.8%
Disk space      416   13.8%
Text            255    8.5%
Video            86    2.8%
Other files      81    2.7%
Network          39    1.3%

Readers
can test their newly acquired knowledge on carefully chosen exercises placed
at the end of the chapters. More involved exercises are supplied with hints or
even a brief sketch of the solution. Instructors are encouraged to choose selected
exercises as homework assignments.
All concepts and methods presented in this book are illustrated on the ex-
ample of digital images. There are several valid reasons for this choice. First
and foremost, digital images are by far the most common type of media for
which steganographic applications are currently available. Furthermore, many
basic principles and methodologies can be readily extended from images to other
digital media, such as video and audio. It is also considerably easier to explain
the perceptual impact of modifying an image rather than an audio clip sim-
ply because images can be printed on paper. Lastly, when compared with other
digital objects, the field of image steganography and steganalysis is by far the
most advanced today, with numerous techniques available for most typical image
formats.
The first chapter contains a brief historical narrative that starts with the rather
amusing ancient methods, continues with more advanced ideas for data hiding in
written documents as well as techniques used by spies during times of war, and
ends with modern steganography in digital files. By introducing three fictitious
characters, prisoners Alice and Bob and warden Eve, we informally describe
secure steganographic communication as the famous prisoners’ problem in which
Alice and Bob try to secretly communicate without arousing the suspicion of Eve,
who is eagerly eavesdropping. These three characters will be used in the book
to make the language more accessible and a little less formal when explaining
technical aspects of data-hiding methods. The chapter is closed with a section
that highlights the differences between digital watermarking and steganography.
Knowing how visual data is represented in a computer is a necessary prereq-
uisite to understand the technical material in this book. Chapter 2 first explains
basic color models used for representing color in a computer. Then, we describe
the structure of the most common raster, palette, and transform image formats,
including the JPEG. The description of each format is supplied with instruc-
tions on how to work with such images in Matlab to give the reader the ability
to conveniently implement most of the methods described in this book.
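As a taste of the material in Sections 2.3.2 and 2.3.3, the DCT-and-quantize step of the JPEG pipeline can be condensed into a few lines. The book works with Matlab; the sketch below is an illustrative Python/NumPy rendition (not the book's code), using the example luminance quantization table from Annex K of the JPEG standard.

```python
import numpy as np

def dct_matrix(N=8):
    """Orthonormal DCT-II basis matrix; JPEG applies it to 8x8 pixel blocks."""
    C = np.array([[np.cos((2 * x + 1) * u * np.pi / (2 * N)) for x in range(N)]
                  for u in range(N)])
    C[0] /= np.sqrt(2)
    return C * np.sqrt(2.0 / N)

# Example luminance quantization table from the JPEG standard (Annex K).
Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
])

C = dct_matrix()
block = np.full((8, 8), 130.0)        # a flat 8x8 block of pixel values
D = C @ (block - 128) @ C.T           # forward DCT of the level-shifted block
coeffs = np.round(D / Q)              # quantized DCT coefficients
recon = C.T @ (coeffs * Q) @ C + 128  # decompression: dequantize and invert
```

For this flat block only the DC coefficient survives quantization; steganographic methods for the JPEG format operate on the integer array `coeffs`.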
Since the majority of digital images are obtained using a digital camera, cam-
corder, or scanner, Chapter 3 deals with the process of digital image acquisition
through an imaging sensor. Throughout the chapter, emphasis is given to those
aspects of this process that are relevant to steganography. This includes the
processing pipeline inside typical digital cameras and sources of noise and im-
perfections. Noise is especially relevant to steganography because the seemingly
useless stochastic components of digital images could conceivably convey secret
messages.
In Chapter 4, we delve deeper into the subject of steganography. Three basic
principles for constructing steganographic methods are introduced: steganogra-
phy by cover selection, cover synthesis, and cover modification. Even though the
focus of this book is on data-hiding methods that embed secret messages by
slightly modifying the original (cover) image, all three principles can be used
to build steganographic methods in practice. This chapter also introduces ba-
sic terminology and key building blocks that form the steganographic channel
– the source of cover objects, source of secret messages and secret keys, the
data-hiding and data-extraction algorithms, and the physical channel itself. The
physical properties of the channel are determined by the actions of the warden
Eve, who can position herself to be a passive observant or someone who is actively
involved with the flow of data through the channel. Discussions throughout the
chapter pave the way towards the information-theoretic definition of stegano-
graphic security given in Chapter 6.
The content of Chapter 5 was chosen to motivate the reader to ask basic ques-
tions about what it means to undetectably embed secret data in an image and
to illustrate various (and sometimes unexpected) difficulties one might run into
when attempting to realize some intuitive hiding methods. The chapter contains
examples of some early naive steganographic methods for the raster, palette,
and JPEG formats, most of which use some version of the least-significant-bit
(LSB) embedding method. The presentation of each method continues with crit-
ical analysis of how the steganographic method can be broken and why. The
author hopes that this early exposure of specific embedding methods will make
the reader better understand the need for a rather precise technical approach in
the remaining chapters.
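A taste of what Chapter 5 dissects can be given in a minimal sketch of LSB embedding (illustrative Python with a made-up 2x3 "image," not the book's pseudo-code). It also hints at why the method is detectable: embedding only ever moves a pixel within its pair of values (2k, 2k+1), which is exactly the artifact the histogram attack exploits.

```python
import numpy as np

def lsb_embed(cover, bits):
    """Overwrite the least-significant bits of the first len(bits) pixels."""
    stego = cover.ravel().copy()
    for i, b in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | b  # clear the LSB, then set it to the message bit
    return stego.reshape(cover.shape)

def lsb_extract(stego, n):
    """The message is simply the LSBs of the first n pixels."""
    return [int(p) & 1 for p in stego.ravel()[:n]]

cover = np.array([[154, 153, 200], [201, 77, 76]], dtype=np.uint8)
message = [1, 0, 1, 1, 0, 0]
stego = lsb_embed(cover, message)
```

Note that every pixel either stays put or moves to the other member of its LSB pair, never across a pair boundary.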
Chapter 6 introduces the central concept, which is a formal information-
theoretic definition of security in steganography based on the Kullback–Leibler
divergence between the distributions of cover and stego objects. This definition
puts steganography on a firm mathematical ground that allows methodological
development by studying security with respect to a cover model. The concept of
security is further explained by showing connections between security and detec-
tion theory and by providing examples of undetectable steganographic schemes
built using the principles outlined in Chapter 4. We also introduce the concept
of a distortion-limited embedder (when Alice is limited in how much she can
modify the cover image) and show that some well-known watermarking meth-
ods, such as spread-spectrum watermarking and quantization index modulation,
can be used to construct secure steganographic schemes. Finally, the reader is
presented with an alternative complexity-theoretic definition of steganographic
security even though this direction is not further pursued in this book.
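The information-theoretic definition is easy to state computationally: a stegosystem is epsilon-secure when the KL divergence between the cover and stego distributions is at most epsilon, and perfectly secure when it is zero. A minimal sketch (the two histograms are illustrative, not data from the book):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); zero if and only if p == q."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_cover = [0.25, 0.25, 0.25, 0.25]  # toy distribution of a cover-image statistic
p_stego = [0.30, 0.20, 0.25, 0.25]  # embedding perturbs it slightly
```

The performance of the warden's best detector is bounded by this divergence; driving it to zero with respect to the true cover distribution is what perfect security means.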
Using the definition of security as a guiding philosophy, Chapter 7 introduces
several design principles and intuitive strategies for building practical stegano-
graphic schemes for digital media files: (1) model-preserving steganography us-
ing statistical restoration and model-based steganography, (2) steganography by
mimicking natural phenomena or processing, (3) steganalysis-aware steganog-
raphy, and (4) minimal-impact steganography. The first three approaches are
illustrated by describing in detail specific examples of steganographic algo-
rithms from the literature (OutGuess, Model-Based Steganography for JPEG
images, stochastic modulation, and the F5 algorithm). Minimal embedding im-
pact steganography is discussed in Chapters 8 and 9.
Chapter 8 is devoted to matrix embedding, which is a general method for
increasing security of steganographic schemes by minimizing the number of
embedding changes needed to embed the secret message. The reader is first
motivated by what appears to be a simple clever trick, which is later generalized and then
reinterpreted within the language of coding theory. The introductory sections
naturally lead to the highlight of this chapter – the matrix embedding theorem,
which is essentially a recipe for how to turn a linear code into a steganographic
embedding method using the principle of syndrome coding. Ample space is de-
voted to various bounds that impose fundamental limits on the performance one
can achieve using matrix embedding.
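The flavor of syndrome coding can be shown with the binary [7,4] Hamming code: Alice embeds 3 message bits into 7 cover LSBs while changing at most one of them. The sketch below is an illustration in Python, not the book's pseudo-code.

```python
import numpy as np

# Parity-check matrix of the [7,4] binary Hamming code: column j is j in binary.
H = np.array([[(j >> k) & 1 for j in range(1, 8)] for k in range(3)])

def embed(x, m):
    """Change at most one of the 7 cover bits x so that H @ x' = m (mod 2)."""
    s = (H @ x - np.asarray(m)) % 2  # syndrome mismatch
    y = x.copy()
    if s.any():
        j = int(s @ [1, 2, 4])       # read the mismatch as a column index 1..7
        y[j - 1] ^= 1                # flip that single bit
    return y

def extract(y):
    """Bob recovers the message as the syndrome of the stego bits."""
    return (H @ y) % 2

x = np.array([1, 0, 1, 1, 0, 0, 1])  # cover LSBs
m = [0, 1, 1]                        # message bits
y = embed(x, m)
```

The expected number of changes is 7/8 (no change when the syndrome already matches), giving an embedding efficiency of 24/7, roughly 3.43 bits per change, versus 2 bits per change for plain LSB embedding.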
The second chapter that relates to minimal-impact steganography is Chap-
ter 9. It introduces the important topic of communication with a non-shared
selection channel as well as several practical methods for communication using
such channels (wet paper codes). A non-shared selection channel refers to the
situation when Alice embeds her message into a selected subset of the image but
does not (or cannot) share her selection with Bob. This chapter also discusses
several diverse problems in steganography that lead to non-shared selection chan-
nels and can be elegantly solved using wet paper codes: adaptive steganography,
perturbed quantization steganography, a new class of improved matrix embed-
ding methods, public-key steganography, the no-shrinkage F5 algorithm, and the
MMx algorithm.
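The core of the wet paper trick can be condensed into a toy solver: Alice needs H y = m (mod 2) while flipping only her "dry" pixels, and Bob computes H y without ever learning which pixels those were. The sketch below uses naive Gauss–Jordan elimination over GF(2); the book's matrix LT process is a far more efficient construction, and the matrix and indices here are made up for illustration.

```python
import numpy as np

def wet_paper_solve(H, x, m, dry):
    """Return y with H @ y = m (mod 2), y differing from x only at 'dry' indices."""
    s = (H @ x - np.asarray(m)) % 2              # syndrome Alice must correct
    A = np.concatenate([H[:, dry] % 2,
                        s.reshape(-1, 1)], axis=1).astype(int)
    k, n = H.shape[0], len(dry)
    piv, r = [], 0
    for c in range(n):                           # GF(2) elimination on dry columns
        hits = [i for i in range(r, k) if A[i, c]]
        if not hits:
            continue
        A[[r, hits[0]]] = A[[hits[0], r]]        # bring a pivot into row r
        for i in range(k):
            if i != r and A[i, c]:
                A[i] ^= A[r]                     # eliminate the column elsewhere
        piv.append(c)
        r += 1
        if r == k:
            break
    assert not A[r:, -1].any(), "not solvable with these dry pixels"
    y = x.copy()
    for row, c in enumerate(piv):
        if A[row, -1]:
            y[dry[c]] ^= 1                       # flip only dry pixels
    return y

H = np.array([[(j >> k) & 1 for j in range(1, 8)] for k in range(3)])
x = np.array([1, 0, 1, 1, 0, 0, 1])              # cover LSBs
dry = [0, 2, 5, 6]                               # indices Alice is allowed to change
y = wet_paper_solve(H, x, [0, 1, 1], dry)
```

Bob's extraction is just `(H @ y) % 2`; he never needs the dry set, which is what makes the selection channel non-shared.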
While the first part of this book deals solely with design and development
of steganographic methods, the next three chapters are devoted to steganalysis,
which is understood as an inherent part of steganography. After all, steganogra-
phy is advanced through analysis.
In Chapter 10, steganalysis is introduced as the task of discovering the pres-
ence of secret data. The discussion in this chapter is directed towards explaining
general principles common to many steganalysis techniques. The focus is on statistical
attacks in which the warden reaches her decision by inspecting statistical
properties of pixels. This approach to steganalysis provides connections with the
abstract problem of signal detection and hypothesis testing, which in turn allows
importing standard signal-detection tools and terminology, such as the receiver
operating characteristic. The chapter continues with separate sections on tar-
geted and blind steganalysis. The author lists several general strategies that one
can follow to construct targeted attacks and highlights the important class of
quantitative attacks, which can estimate the number of embedding changes. The
section on blind steganalysis contains a list of general principles for constructing
steganalysis features as well as a description of several diverse applications of blind
steganalyzers, including construction of targeted attacks, steganography design,
multi-class steganalysis, and benchmarking. The chapter is closed with discus-
sion of forensic steganalysis and system attacks on steganography in which the
attacker relies on protocol weaknesses of a specific implementation rather than
on statistical artifacts computed from the pixel values.
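In the simplest Gaussian setting, the receiver-operating-characteristic language of this chapter reduces to a one-line formula: the likelihood-ratio detector for a mean shift d in unit-variance noise achieves detection probability Phi(Phi^{-1}(P_FA) + d). The sketch below (illustrative, not from the book) evaluates that ROC using only the standard library.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def normal_ppf(p, lo=-10.0, hi=10.0):
    """Inverse normal CDF by bisection (plenty accurate for a sketch)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def roc_point(d, p_fa):
    """Detection probability of the likelihood-ratio test for a mean shift d
    in unit-variance Gaussian noise, at false-alarm rate p_fa."""
    return normal_cdf(normal_ppf(p_fa) + d)
```

With no embedding (d = 0) the detector does no better than random guessing, `roc_point(0, p_fa) = p_fa`; as the embedding distortion grows, the ROC curve bends toward perfect detection.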
Chapter 11 contains examples of targeted steganalysis attacks and their ex-
perimental verifications. Experiments on real images are used to explain various
issues when constructing a practical steganography detector and to give the
reader a sense of how sensitive the attacks are. The chapter starts with the Sam-
ple Pairs Analysis, which is a targeted quantitative attack on LSB embedding
in the spatial domain. The derivation of the method is presented in a way that
makes the algorithm appear as a rather natural approach that logically follows
from the strategies outlined in Chapter 10. Next, the approach is generalized by
formulating it within the structural steganalysis framework. This enables several
important generalizations that further improve the method’s accuracy. The third
attack, the Pairs Analysis, is a quantitative attack on steganographic methods
that embed messages into LSBs of palette images, such as EzStego. The concept
of calibration is used to construct a quantitative attack on the F5 embedding
algorithm. The chapter is closed with a description of targeted attacks on ±1
embedding in the spatial domain based on the histogram characteristic function.
Chapter 12 is devoted to the topic of blind attacks, which is an approach to
steganalysis based on modeling images using features and classifying cover and
stego features using machine-learning tools. Starting with the JPEG domain, the
features are introduced in a natural manner as statistical descriptors of DCT co-
efficients by modeling them using several different statistical models. The JPEG
domain is also used as an example to demonstrate two options for constructing
blind steganalyzers: (1) the cover-versus-all-stego approach in which a binary
classifier is trained to recognize cover images and a mixture of stego images
produced by a multitude of steganographic algorithms, and (2) a one-class ste-
ganalyzer trained only on cover images that classifies all images incompatible
with covers as stego. The advantages and disadvantages of both approaches are
discussed with reference to practical experiments. Blind steganalysis in the spa-
tial domain is illustrated on the example of a steganalyzer whose features are
computed from image noise residuals. This steganalyzer is also used to demon-
strate how much statistical detectability in practice depends on the source of
cover images.
Chapter 13 discusses the most fundamental problem of steganography, which
is the issue of computing the largest payload that can be securely embedded in
an image. Two very different concepts are introduced – the steganographic ca-
pacity and secure payload. Steganographic capacity is the largest rate at which
perfectly secure communication is possible. It is not a property of one specific
steganographic scheme but rather a maximum taken over all perfectly secure
schemes. In contrast, secure payload is defined as the number of bits that can be
communicated at a given security level using a specific imperfect steganographic
scheme. The secure payload grows only with the square root of the number of pix-
els in the image. This so-called square-root law is experimentally demonstrated
on a specific steganographic scheme that embeds bits in the JPEG domain. The
secure payload is more relevant to practitioners because all practical stegano-
graphic schemes that hide messages in real digital media are not likely to be
perfectly secure and thus fall under the square-root law.
To make this text self-contained, five appendices accompany the book. Their
style and content are fully compatible with the rest of the book in the sense
that the student does not need any more prerequisites than a basic knowledge
of calculus and statistics. The author anticipates that students not familiar with
certain topics will find it convenient to browse through the appendices and either
refresh their knowledge or learn about certain topics in an elementary fashion
accessible to a wide audience.
Appendix A contains the basics of descriptive statistics, including statistical
moments, the moment-generating function, robust measures of central tendency
and spread, asymptotic laws, and description of some key statistical distributions,
such as the Bernoulli, binomial, Gaussian, multivariate Gaussian, generalized
Gaussian, and generalized Cauchy distributions, Student’s t-distribution, and
the chi-square distribution.
As some of the chapters rely on basic knowledge of information theory, Ap-
pendix B covers selected key concepts of entropy, conditional entropy, joint en-
tropy, mutual information, lossless compression, and KL divergence and some
of its key properties, such as its relationship to hypothesis testing and Fisher
information.
The theory of linear codes over finite fields is the subject of Appendix C. The
reader is introduced to the basic concepts of a generator and parity-check matrix,
covering radius, average distance to code, sphere-covering bound, orthogonality,
dual code, systematic form of a code, cosets, and coset leaders.
Appendix D contains elements of signal detection and estimation. The author
explains the Neyman–Pearson and Bayesian approach to hypothesis testing, the
concepts of a receiver-operating-characteristic (ROC) curve, the deflection coef-
ficient, and the connection between hypothesis testing and Fisher information.
The appendix continues with composite hypothesis testing, the chi-square test,
and the locally most powerful detector. The topics of estimation theory covered
in the appendix include the Cramer–Rao lower bound, least-square estimation,
maximum-likelihood and maximum a posteriori estimation, and the Wiener filter.
The appendix is closed with the Cauchy–Schwarz inequality in Hilbert spaces
with inner product, which is needed for proofs of some of the propositions in this
book.
Readers not familiar with support vector machines (SVMs) will find Ap-
pendix E especially useful. It starts with the formulation of a binary classification
problem and introduces linear support vector machines as a classification tool.
Linear SVMs are then progressively generalized to non-separable problems and
then put into kernelized form as typically used in practice. The weighted form
of SVMs is described as well because it is useful to achieve a trade-off between
false alarms and missed detections and for drawing an ROC curve. The appendix
also explains practical issues with data preprocessing and training SVMs that
one needs to be aware of when using SVMs in applications, such as in blind
steganalysis.
Because the focus of this book is strictly on steganography in digital sig-
nals, methods for covert communication in other objects are not covered. In-
stead, the author refers the reader to other publications. In particular, lin-
guistic steganography and data-hiding aspects of some cryptographic applica-
tions are covered in [238, 239]. The topic of covert channels in natural lan-
guage is also covered in [18, 25, 41, 161, 182, 227]. A comprehensive bibli-
ography of all articles published on covert communication in linguistic struc-
tures, including watermarking applications, is maintained by Bergmair at http://semantilog.ucam.org/biblingsteg/. Topics dealing with steganography in
Internet protocols are studied in [106, 162, 163, 165, 177, 216]. Covert timing
channels and their security are covered in [26, 34, 100, 101]. The intriguing
topic of steganography in Voice over IP applications, such as Skype, appears
in [6, 7, 58, 147, 150, 169, 251]. Steganographic file systems [4, 170] are useful
tools to thwart “rubber-hose attacks” on cryptosystems when a person is coerced
to reveal encryption keys after encrypted files have been found on a computer
system. A steganographic file system allows the user to plausibly deny that en-
crypted files reside on the disk. In-depth analysis of current steganographic soft-
ware and the topics of data hiding in elements of operating systems are provided
in [142]. Finally, the topics of audio steganography and steganalysis appeared
in [9, 24, 118, 149, 187, 202].
Acknowledgments

I would like to acknowledge the role of several individuals who helped me com-
mit to writing this book. First of all and foremost, I am indebted to Richard
Simard for encouraging me to enter the field of steganography and for support-
ing research on steganography. This book would not have materialized without
the constant encouragement of George Klir and Monika Fridrich. Finally, the
privilege of co-authoring a book with Ingemar Cox [51] provided me with energy
and motivation I would not have been able to find otherwise.
Furthermore, I am happy to acknowledge the help of my PhD students for
their kind assistance that made the process of preparing the manuscript in TEX
a rather pleasant experience instead of the nightmare that would for sure have
followed if I had been left alone with a TEX compiler. In particular, I am im-
mensely thankful to TEX guru Tomáš Filler for his truly significant help with
formatting the manuscript, preparing the figures, and proof-reading the text,
to Tomáš Pevný for contributing material for the appendix on support vector
machines, and to Jan Kodovský for help with combing the citations and proof-
reading. I would also like to thank Ellen Tilden and my students from the ECE
562 course on Fundamentals of Steganography, Tony Nocito, Dae Kim, Zhao Liu,
Zhengqing Chen, and Ran Ren, for help with sanitizing this text to make it as
free of typos as possible.
Discussions with my colleagues, Andrew D. Ker, Miroslav Goljan, Andreas
Westfeld, Rainer Böhme, Pierre Moulin, Neil F. Johnson, Scott Craver, Patrick
Bas, Teddy Furon, and Xiaolong Li were very useful and helped me clarify some
key technical issues. The encouragement I received from Mauro Barni, Deepa
Kundur, Slava Voloshynovskiy, Jana Dittmann, Gaurav Sharma, and Chet Hos-
mer also helped with shaping the final content of the manuscript. Special thanks
are due to George Normandin and Jim Moronski for their feedback and many
useful discussions about imaging sensors and to Josef Sofka for providing a pic-
ture of a CCD sensor. A special acknowledgement goes to Binghamton University
Art Director David Skyrca for the beautiful cover design.
Finally, I would like to thank Nicole and Kathy Fridrich for their patience and
for helping me to get into the mood of sharing.
Steganography in Digital Media: Principles, Algorithms, and Applications
Jessica Fridrich
Cambridge University Press
Hardback ISBN: 9780521190190 · Online ISBN: 9781139192903
Cambridge Books Online: http://ebooks.cambridge.org/

Chapter 1: Introduction, pp. 1-14
1 Introduction

A woman named Alice sends the following e-mail to her friend Bob, with whom
she shares an interest in astronomy:
My friend Bob,
until yesterday I was using binoculars for stargazing. Today, I decided to try my new
telescope. The galaxies in Leo and Ursa Major were unbelievable! Next, I plan to check
out some nebulas and then prepare to take a few snapshots of the new comet. Although
I am satisfied with the telescope, I think I need to purchase light pollution filters to
block the xenon lights from a nearby highway to improve the quality of my pictures.
Cheers,
Alice.

At first glance, this letter appears to be a conversation between two avid ama-
teur astronomers. Alice seems to be excited about her new telescope and eagerly
shares her experience with Bob. In reality, however, Alice is a spy and Bob is her
superior awaiting critical news from his secret agent. To avoid drawing unwanted
attention, they decided not to use cryptography to communicate in secrecy. In-
stead, they agreed on another form of secret communication – steganography.
Upon receiving the letter from Alice, Bob suspects that Alice might be using
steganography and decides to follow a prearranged protocol. Bob starts by listing
the first letters of all words from Alice’s letter and obtains the following sequence:

mfbuyiwubfstidttmnttgilaumwuniptcosnatpttafsotncaiaswttitintplpftbtxlfanhtitqompca.

Then, he writes down the decimal expansion of π

π = 3.141592653589793 . . .

and reads the message from the extracted sequence of letters by putting down
the third letter in the sequence, then the next first letter, the next fourth letter,
etc. The resulting message is

buubdlupnpsspx.

Finally, Bob replaces each letter with the letter that precedes it in the alphabet
and deciphers the secret message

attack tomorrow.
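Bob's three extraction steps can be sketched in a few lines of Python. The code below is only illustrative (the book gives no implementation); the function name is made up and the first fourteen digits of π are hard-coded:

```python
# A sketch of Bob's extraction procedure. Step 1: take the first letter of
# every word. Step 2: advance through that sequence by successive digits of
# pi. Step 3: shift each selected letter back by one in the alphabet.

def extract(cover_text, pi_digits):
    letters = [w[0].lower() for w in cover_text.split()]
    pos = 0
    cipher = []
    for d in pi_digits:
        pos += d                         # skip ahead by the next digit of pi
        cipher.append(letters[pos - 1])  # 1-based indexing, as in the text
    # Replace each letter with the one preceding it (b -> a, ..., a -> z).
    return "".join(chr((ord(c) - ord("a") - 1) % 26 + ord("a")) for c in cipher)

letter = ("My friend Bob, until yesterday I was using binoculars for stargazing. "
          "Today, I decided to try my new telescope. The galaxies in Leo and Ursa "
          "Major were unbelievable! Next, I plan to check out some nebulas and "
          "then prepare to take a few snapshots of the new comet. Although I am "
          "satisfied with the telescope, I think I need to purchase light "
          "pollution filters to block the xenon lights from a nearby highway to "
          "improve the quality of my pictures. Cheers, Alice.")

print(extract(letter, [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7]))  # attacktomorrow
```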
Let us take a look at the tasks that Alice needs to carry out to communicate
secretly with Bob. She first encrypts her message by substituting each letter with
the one that follows it in the English alphabet (e.g., a is substituted with b, b
with c, . . . , and z with a). Note that this simple substitution cipher could be
replaced by a more secure encryption algorithm if desired. Then, Alice needs
to write an almost arbitrary but meaningful(!) letter while making sure that
the words whose location is determined by the digits of π start with the letters
of the encrypted message. Of course, instead of the decimal expansion of π,
Alice and Bob could have agreed on a different integer sequence, such as one
generated from a pseudo-random number generator seeded with a shared key.
The shared information that determines the location of the message letters is
called the steganographic key or stego key. Without knowing this key, it is not
only difficult to read the message but also difficult for an eavesdropper to prove
that the text contains a secret message.
Note that the hidden message is unrelated to the content of the letter, which
only serves as a decoy or “cover” to hide the very fact that a secret message is
being sent. In fact, this is the defining property of steganography:

Steganography can be informally defined as the practice of undetectably communicating a message in a cover object.

We now elaborate on the above motivational example a little more. If Alice planned to send a very long message, the above steganographic method would
not be very practical. Instead of hiding the message in the body of the e-mail us-
ing her creative writing skills, Alice could hide her message by slightly modifying
pixels in a digital picture, such as an image of a galaxy taken through her tele-
scope, and attach the modified image to her e-mail. Of course, that would require
a different hiding procedure shared with Bob. A simple method to hide a binary
message would be to encode the message bits into the colors of individual pixels
in the image so that even values represent a binary 0 and odd values a binary
1. Alice could achieve this by modifying each color by at most one. Here, Alice
relies on the fact that such small modifications will likely be imperceptible. This
method of covert communication allows Alice to send as many bits as there are
pixels in the image without the need to painstakingly form a plausible-looking
cover letter. She could even program a computer to insert the message, which
could be an arbitrary electronic file, into the image for her.
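The parity convention just described can be sketched as follows. This is a minimal illustration with made-up pixel values, not the method of any particular tool; practical schemes also spread the message over pseudo-randomly selected pixels determined by the stego key:

```python
# Embed bits by forcing pixel parity: even value = 0, odd value = 1.
# Each embedding change modifies a pixel value by at most one.

def embed(pixels, bits):
    stego = list(pixels)
    for i, bit in enumerate(bits):
        if stego[i] % 2 != bit:
            # Parity mismatch: flip the least significant bit by adding or
            # subtracting one (even values go up, odd values go down).
            stego[i] += 1 if stego[i] % 2 == 0 else -1
    return stego

def read(pixels, n):
    """Read n bits back as the parities of the first n pixels."""
    return [p % 2 for p in pixels[:n]]

cover = [142, 67, 255, 88, 13, 200]   # illustrative 8-bit pixel values
message = [1, 0, 1, 1, 0, 0]
stego = embed(cover, message)
print(stego)           # [143, 66, 255, 89, 12, 200]
print(read(stego, 6))  # [1, 0, 1, 1, 0, 0]
```

Note that with this sign convention (even values incremented, odd values decremented) the result always stays within the 0-255 range.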
Digital images acquired using a digital camera or scanner provide a friendly
environment to the steganographer because they contain a slight amount of noise
that helps mask the modifications that need to be carried out to embed a secret
message. Moreover, attaching an image to an e-mail message is commonly done
and thus should not be suspicious.
This book deals with steganography of signals represented in digital form, such
as digital images, audio, or video. Although the book focuses solely on images,
many principles and methods can be adapted to other multimedia objects.
1.1 Steganography throughout history

The word steganography is a composite of the Greek words steganos, which means
“covered,” and graphia, which means “writing.” In other words, steganography is
the art of concealed communication where the very existence of a message is se-
cret. The term steganography was used for the first time by Johannes Trithemius
(1462–1516) in his trilogy Polygraphia and in Steganographia (see Figure 1.1).
While the first two volumes described ancient methods for encoding messages
(cryptography), the third volume (1499) appeared to deal with occult powers,
black magic, and methods for communication with spirits. The volume was pub-
lished in Frankfurt in 1606 and in 1609 the Catholic Church put it on the list of
“libri prohibiti” (forbidden books). Soon, scholars began suspecting that the book
was a code and attempted to decipher the mystery. Efforts to decode the book’s
secret message came to a successful end in 1996 and 1998 when two researchers in-
dependently [65, 201] revealed the hidden messages encoded in numbers through
several look-up tables included in the book [145]. The messages turned out to
be quite mundane. The first one was the Latin equivalent of “The quick brown
fox jumps over the lazy dog,” which is a sentence that contains every letter of
the alphabet. The second message was: “The bearer of this letter is a rogue and
a thief. Guard yourself against him. He wants to do something to you.” Finally,
the third was the start of the 21st Psalm.
The first written evidence about steganography being used to send messages
is due to Herodotus [109], who tells of a slave sent by his master, Histiæus, to
the Ionian city of Miletus with a secret message tattooed on his scalp. After
the tattooing of the message, the slave grew his hair back in order to conceal
the message. He then traveled to Miletus and, upon arriving, shaved his head
to reveal the message to the city’s regent, Aristagoras. The message encouraged
Aristagoras to start a revolt against the Persian king.
Herodotus also documented the story of Demeratus, who used steganography
to alert Sparta about the planned invasion of Greece by the Persian Great King
Xerxes. To conceal his message, Demeratus scraped the wax off the surface of
a wooden writing tablet, scratched the message into the wood, and then coated
the tablet with a fresh layer of wax to make it appear to be a regular blank
writing tablet that could be safely carried to Sparta without arousing suspicion.
Aeneas the Tactician [226] is credited with inventing many ingenious stegano-
graphic techniques, such as hiding messages in women’s earrings or using pigeons
to deliver secret messages. Additionally, he described some simple methods for
hiding messages in text by modifying the height of letter strokes or by marking
letters in a text using small holes.
Hiding messages in text is called linguistic steganography or acrostics. Acros-
tics was a very popular ancient steganographic method. To embed a unique
“signature” in their work, some poets encoded secret messages as initial letters
of sentences or successive tercets in a poem. One of the best-known examples
Figure 1.1 The title page of Steganographia by Johannes Trithemius, the inventor of the
word “steganography.” Reproduced by kind permission of the Syndics of Cambridge
University Library.

is Amorosa visione by Giovanni Boccaccio [247]. Boccaccio encoded three sonnets (more than 1500 letters) into the initial letters of the first verse of each
tercet from other poems. The linguistic steganographic scheme described at the
beginning of this chapter is an example of Cardan’s Grille, which was originally
conceived in China and reinvented by Cardan (1501–1576). The letters of the
secret message form a random pattern that can be accessed simply by placing a
mask over the text. The mask plays the role of a secret stego key that has to be
shared between the communicating parties.
Francis Bacon [15] described a precursor of modern steganographic schemes.
Bacon realized that by using italic or normal fonts, one could encode binary
representation of letters in his works. Five letters of the cover object could hold
five bits and thus one letter of the alphabet. The inconsistency of sixteenth-
century typography made this method relatively inconspicuous.
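Bacon's idea can be sketched in Python. Two simplifications in this sketch are not Bacon's: uppercase stands in for italics, and a 26-letter alphabet indexed 0-25 replaces his original 24-letter biliteral scheme:

```python
# Each plaintext letter is expanded into five bits, which are then carried
# by the typeface (here: case) of five consecutive cover letters.

def to_bits(letter):
    n = ord(letter.lower()) - ord("a")
    return [(n >> i) & 1 for i in (4, 3, 2, 1, 0)]  # 5-bit binary index

def typeset(cover, secret):
    letters = cover.replace(" ", "")
    bits = [b for ch in secret for b in to_bits(ch)]
    assert len(bits) <= len(letters), "cover text too short"
    # Bit 1 -> "italic" (uppercase), bit 0 -> normal (lowercase).
    out = [c.upper() if b else c.lower() for c, b in zip(letters, bits)]
    return "".join(out) + letters[len(bits):]

# "hi" = 00111 01000; the capitalized letters carry the 1 bits.
print(typeset("knowledge is power", "hi"))  # knOWLeDgeispower
```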
A modern version of this steganographic principle was described by Brassil [29].
He described a method for data hiding in text documents by slightly shifting the
lines of text up or down by 1/300 of an inch. It turns out that such subtle changes
are not visually perceptible, yet they are robust enough to survive photocopying.
This way, the message could be extracted even from printed or photocopied
documents.
In 1857, Brewster [31] proposed a very ingenious technique that was actually
used in several wars in the nineteenth and twentieth centuries. The idea is to
shrink the message so much that it starts resembling specks of dirt but can still
be read under high magnification. The technological obstacles to use of this idea
in practice were overcome by the French photographer Dragon, who developed
technology for shrinking text to microscopic dimensions. Such small objects could
be easily hidden in nostrils, ears, or under fingernails [224]. In World War I, the
Germans used such “microdots” hidden in corners of postcards slit open with a
knife and resealed with starch. The modern twentieth-century microdots could
hold up to one page of text and even contain photographs. The Allies discovered
the usage of microdots in 1941. A modern version of the concept of the microdot
was recently proposed for hiding information in DNA for the purpose of tagging
important genetic material [45, 212]. Microdots in the form of dust were also
recently proposed to identify car parts [1].
Perhaps the best-known form of steganography is writing with invisible ink.
The first invisible inks were organic liquids, such as milk, urine, vinegar, diluted
honey, or sugar solution. Messages written with such ink were invisible once
the paper had dried. To make them perceptible, the letter was simply heated
up above a candle. Later, more sophisticated versions were invented by replacing
the message-extraction algorithm with safer alternatives, such as using ultraviolet
light.
In 1966, an inventive and impromptu steganographic method enabled a pris-
oner of war, Commander Jeremiah Denton, to secretly communicate one word
when he was forced by his Vietnamese captors to give an interview on TV. Know-
ing that he could not say anything critical of his captors, as he spoke, he blinked
his eyes in Morse code, spelling out T-O-R-T-U-R-E.
Steganography became the subject of a dispute during the match between
Viktor Korchnoi and Anatoly Karpov for the World Championship in chess in
1978 [117]. During one of the games, Karpov’s assistants handed him a tray with
yogurt. This was technically against the rules, which prohibited contact between
the player and his team during play. The head of Korchnoi’s delegation, Petra
Leeuwerik, immediately protested, arguing that Karpov’s team could be passing
him secret messages. For example, a violet yogurt could mean that Karpov should
offer a draw, while a sliced mango could inform the player that he should decline a
draw. The time of serving the food could also be used to send additional messages
(steganography in timing channels). This protest, which was a consequence of
the extreme paranoia that dominated chess matches during the Cold War, was
taken quite seriously. The officials limited Karpov to consumption of only one
type of yogurt (violet) at a fixed time during the game. Using the terminology of
this book, we can interpret this protective measure as an act of an active warden
to prevent usage of steganography.
Figure 1.2 Symbols on an American patchwork quilt on display at the National Cryptologic Museum near Washington, D.C.

In the 1990s, the story of a “quilt code” allegedly used in the Underground
Railroad surfaced in the media. The Underground Railroad appeared sponta-
neously as a clandestine network of secret pathways and safe houses that helped
black slaves in the USA escape from slavery during the first part of the nine-
teenth century. According to the story told by a South Carolina woman named
Ozella Williams [230], people sympathetic to the cause displayed quilts on their
fences to non-verbally inform the escapees about the direction of their journey
or which action they should take next. The messages were supposedly hidden
in the geometrical patterns commonly found in American patchwork quilts (see
Figure 1.2). Since it was common to air quilts on fences, the master or mistress
would not be suspicious about the quilts being on display.
The recent explosion of interest in steganography is due to a rather sudden
and widespread use of digital media as well as the rapid expansion of the Internet
(Figure 1.3 shows the annual count of research articles on the subject of steganog-
raphy published by the IEEE). It is now a common practice to share pictures,
video, and sound with our friends and family. Such objects provide a very favor-
able environment for concealing secret messages for one good reason: typical dig-
ital media files consist of a large number of individual samples (e.g., pixels) that
can be imperceptibly modified to encode a secret message. And there is no need to
develop technical expertise for those who wish to use steganography because the
hiding process itself can be carried out by a computer program that anyone can
download from the Internet for free. As of writing this book in late 2008, one can
[Figure 1.3: bar chart of the annual publication count, 1996-2008; y axis "Number of IEEE publications," ranging from 0 to 200.]
Figure 1.3 The growth of the field is witnessed by the number of articles annually published by IEEE that contain the keywords “steganography” or “steganalysis.”

select from several hundred steganographic products available on the Internet. Figure 1.4 shows the number of newly released applications or new versions of existing programs capable of hiding data in digital media and text. Some software
applications that focus on security, privacy, and anonymity offer the possibility
to hide encrypted messages in pictures and music as an additional layer of pro-
tection. Examples of such programs are Steganos (http://www.steganos.com/)
and Stealthencrypt (http://www.stealthencrypt.com/). An updated list of se-
lected currently available steganographic programs for various platforms can be
obtained from http://www.stegoarchive.com/.
In the next section, the reader is informally introduced to some key concepts
and principles on which modern steganography is built. The author also feels that
it is important at this point to explain the differences between steganography
and other related privacy and security applications, such as cryptography and
digital watermarking. No attempt is made at this point to be rigorous. The goal
is to entice the reader and gently introduce some of the challenges elaborated
upon in this book.

1.2 Modern steganography

Because electronic communication is very susceptible to eavesdropping and malicious interventions, the issues of security and privacy are more relevant today
than ever. Traditional solutions are based on cryptography [207], which is a
mature, well-developed field with rigorous mathematical foundations. The cryp-
tographic approach to privacy is to make the exchanged information unreadable
to those who do not have the right decryption key. When an encrypted message
[Figure 1.4: bar chart of newly released tools per year, 1992-2007; y axis "Number of new software stego tools," ranging from 0 to 400.]
Figure 1.4 The number of newly released steganographic software applications or new versions per year. Adapted from [122] and reprinted with permission of John Wiley & Sons, Inc.

is intercepted, even though the content of the message is protected, the fact that
the subjects are communicating secretly is obvious. In some situations, it may be
important to avoid drawing attention and instead embed sensitive data in other
objects so that the fact that secret information is being sent is not obvious in
the first place. This is the approach taken by steganography.
Every steganographic system discussed in this book consists of two basic com-
ponents – the embedding and extraction algorithms. The embedding algorithm
accepts three inputs – the secret message to be communicated, the secret shared
key that controls the embedding and extraction algorithms, and the cover ob-
ject , which will be modified to convey the message. The output of the embedding
algorithm is called the stego object. When the stego object is presented as an
input to the message-extraction algorithm, it produces the secret message.
Steganography offers a feasible alternative to encryption in oppressive regimes
where using cryptography might attract unwanted attention or in countries where
the use of cryptography is legally prohibited. An interesting documented use of
steganography was presented at the 4th International Workshop on Information
Hiding [209]. Two subjects developed a steganographic scheme of their own to
hide messages in uncompressed digital images and then used it successfully for
several years when one of them was residing in a hostile country that explicitly
prohibited use of encryption. The reason for their paranoia was a story told by
their friend who already resided in the area, who had tried to send an encrypted
e-mail only to have it returned to him by the local Internet service provider with
the message appended, “Please, don’t send encrypted emails – we can’t read
them.”
In the early 1980s, Simmons [214] described intriguing political implications of the possibility to send data through a covert communication channel. According to the disarmament treaty SALT, the USA and Soviet Union mutually
agreed to equip their nuclear facilities with sensors that would inform the other
country about the number of missiles but not some other information, such as
their location. All communications were required to be protected using standard
digital signatures to prevent unauthorized modification of the sensors’ readings.
However, both sides quickly became concerned about the possibility to hide ad-
ditional information through so-called subliminal channels that existed in most
digital signature schemes at that time. This triggered research into developing
digital signatures free of subliminal channels [57].

1.2.1 The prisoners’ problem


The most important property of a steganographic system is undetectability,
which means that it should be impossible for an eavesdropper to tell whether Al-
ice and Bob are engaging in regular communication or are using steganography.
Simmons provided a popular formulation of the steganography problem through
his famous prisoners’ problem [214]. Alice and Bob are imprisoned in separate
cells and want to hatch an escape plan. They are allowed to communicate but
their communication is monitored by warden Eve. If Eve finds out that the pris-
oners are secretly exchanging messages, she will cut the communication channel
and throw them into solitary confinement. The prisoners resort to steganography
as a means to exchange the details of their escape. Note that in the prisoners’
problem, all that Eve needs to achieve is to detect the presence of secret messages
rather than know their content. In other words, when Eve discovers that Alice
and Bob communicate secretly, the steganographic system is considered broken.
This is in contrast to encryption, where a successful attack means that the at-
tacker gains access to the decrypted content or partially recovers the encryption
key.
In the prisoners’ problem, it is usually assumed that Eve has a complete knowl-
edge of the steganographic algorithm that Alice and Bob might use, with the
exception of the secret stego key, which Alice and Bob agreed upon before impris-
onment. The requirement that the steganographic algorithm be known to Eve
is Kerckhoffs’ principle imported from cryptography. This seemingly strong and
paranoid principle states that the security of the communication should not lie
in the secrecy of the system but only in the secret key. The principle stems from
many years of experience that taught us that through espionage the encryption
(steganographic) algorithm or device may fall into the hands of the enemy and,
if this happens, the security of the secret channel should not be compromised.
1.2.2 Steganalysis is the warden’s job


Steganography is a privacy tool and as such it naturally provokes the human
mind to attack it. The effort concerned with developing methods for detecting the
presence of secret messages and eventually extracting them is called steganalysis.
Positioning herself into the role of a passive warden, Eve passively monitors the
communication between Alice and Bob. She is not only allowed to visually inspect
the exchanged text or images, but also can apply some statistical tests to find
out whether the distribution of colors in the image follows the expected statistics
of natural images.
This field started developing more rapidly after the terrorist attacks of Septem-
ber 11, 2001, when speculations spread through the Internet that terrorists might
use steganography for planning attacks [146, 48]. The only publicly documented
use of a rather primitive form of steganography for planning terrorist activi-
ties was described by The New York Times in an article from November 11,
2006. Dhiren Barot, an Al Qaeda operative, filmed reconnaissance video between
Broadway and South Street and concealed it before distribution by splicing it
into a copy of the Bruce Willis movie Die Hard: With a Vengeance. In a differ-
ent criminal case in 2000, the commercial steganographic tool S-Tools had been
used for distribution of child porn.1 The suspect was successfully prosecuted
using steganalysis methods published in [119].
As we already know, steganography is considered broken even when the mere
presence of the secret message is detected. This is because the primary goal of
steganography is to conceal the communication itself. Often, identifying the sub-
jects who are communicating using steganography can be of vital importance
despite the fact that the content of the secret message may still be unknown.
When the warden discovers the use of steganography, she may choose to cut
the communication channel, if such an act is in her power, as in the prisoners’
problem, or she may exercise other options. Eve may assume the role of an active
warden and slightly modify the communicated objects to prevent the prisoners
from using steganography. For example, if the prisoners are embedding messages
in images, the warden may process the images by slightly resizing them, crop-
ping, and recompressing in hope of preventing the recipient from reading any
secret messages. She can also rephrase the content of a letter using synonyms
or by changing the order of words in a sentence. During World War I, US Post
Office censors used to rephrase telegrams to prevent people from sending hidden
messages. In one case, the censor replaced the text “father is dead” with “father is deceased,” which prompted the recipient to reply with “is father dead or deceased?” A more recent example of an active warden was given by Gina Fisk
and her coworkers from Los Alamos National Laboratory [72], who described
an active warden system integrated with a firewall designed to eliminate covert
channels in network protocols.

1 Personal communication by Neil F. Johnson, 2007.



The actions of an active warden will likely inform Alice and Bob that they
are under surveillance. Instead of actively blocking the covert communication
channel, Eve may decide not to intervene at all and instead try to extract the
messages to learn about the prisoners’ escape plans. Effort directed towards
extracting the secret message belongs to the field of forensic steganalysis. If Eve
is successful and gains access to the stego (encryption) key, a host of other options
opens up for her. She can now be devious and impersonate the prisoners to trick them into revealing more secrets. Such a warden is called malicious.

1.2.3 Steganographic security


The defining property of steganography is the requirement that Eve should not
be able to decide whether or not a given object conveys a secret message. For-
malizing this requirement mathematically, however, is far from easy.
According to Kerckhoffs’ principle, the warden has complete knowledge of the steganographic algorithm, which means that she also has all details about the
source of cover objects used by both prisoners. At an abstract level, the properties
of the cover source could be described by a statistical distribution on the space
of all cover objects that the prisoners can possibly exchange. For example, our
amateur astronomer Alice, who by the way really dislikes winter, is much more
likely to send a picture of the Moon or Saturn than a snowy landscape. This
could be formulated by stating that the probability distribution of cover images
used by Alice will have higher values on images with astronomical themes and
much lower values on images with winter scenery. Given the fact that Eve knows
this distribution, she can run statistical tests to see whether the images sent by
Alice are compliant with the expected distribution. Interpreting the results of
her test, she can decide that, at a certain confidence level, Alice does or does not
communicate using steganography.
Because digital images are quite complex high-dimensional objects consisting
of millions of pixels, it is not feasible to obtain even a rough approximation to
this hypothetical distribution. Modern steganalysis works with simplified models
consisting of a set of statistical quantities derived from images, such as sample
histograms of pixel values or various types of higher-order statistics computed
from adjacent pairs of pixels. Eve calculates these quantities and compares them
with their expected values estimated from cover images that would be sent during legitimate use of the channel. Statistically significant deviations from the expected values are then interpreted as evidence that the image has been modified by a steganographic algorithm.
While this quantitative view of steganographic security permits precise mathematical formulation and rigorous study, it is only an approximation of the concept of undetectability. For example, statistical quantities do not describe well
the semantic meaning of the communicated images. Imagine the situation when
Alice writes her message on paper, takes a picture of it, and attaches the image
to her e-mail. Because this image was never modified, its statistical properties
should be compliant with those of other images produced by her camera. Thus,
any automatic steganalysis system only inspecting statistical properties of images would label the image as compliant with the legitimate use of the channel.
A human warden will, of course, have full access to the communicated message.
It is intuitively clear that Alice and Bob can increase their sense of security
and decrease the chance of being caught by Eve if they communicate only very
short messages. This would, however, make their communication less efficient
and quite likely impractical. Therefore, Alice and Bob need to know how large a
message they can hide in a given object without introducing artifacts that would
trigger Eve’s detector. The size of the critical message is called the steganographic
capacity. The research in steganography focuses on design of algorithms that
permit sending messages that are as long as possible without making the stego
objects statistically distinguishable from cover objects.

1.2.4 Steganography and watermarking


At this point, we wish to stress one important difference between steganography
and a related data-hiding field called watermarking [16, 51, 186]. Even though
watermarking and steganography share some fundamental similarities in that
they both secretly hide information, they address very different applications. In
steganography, the cover image is a mere decoy and has no relationship to the
secret message. In contrast, a watermark usually carries supplemental information about the cover image or some other data related to the cover, such as
labels identifying the sender or the receiver. For example, when purchasing an
MP3 song over the Internet, information about the seller and/or buyer can be
inserted into the song in the form of an inaudible but robust watermark. The
watermark may be used later to trace illegally distributed copies of the song.
Watermarks can also convey information about the song itself (e.g., its robust
hash or digest) so that the song’s integrity can later be verified by comparing
the song with the watermark payload.
The second and perhaps even more important difference between steganography and watermarking is the issue of the existence of a secret message in an
image. While in steganography it is of utmost importance to make sure the image
does not exhibit any traces of hidden data, the presence of a watermark is often
advertised to deter illegal activity, such as unauthorized copying or redistribution. Additionally, steganography is a mode of communication and as such needs to allow sending large amounts of data. On the contrary, even a very short digital watermark can be quite useful. For example, the presence of a watermark (a
one-bit payload) may testify about the image’s ownership. These very different
requirements imposed on these two applications make their design and analysis
quite different.

Summary

• Steganography is the practice of communicating a secret message by hiding it in a cover object.
• Steganography is usually described as the prisoners’ problem, in which two prisoners, Alice and Bob, want to hatch an escape plan but their communication is monitored by the warden (Eve), who will cut the communication once she suspects covert exchange of data.
• The most important property of steganography is statistical undetectability, which means that it should be impossible for Eve to prove the existence of a secret message in a cover. Statistically undetectable steganographic schemes are called secure.
• A warden who merely observes the traffic between Alice and Bob is called passive. An active or malicious warden tampers with the communication in order to prevent the prisoners from using steganography or to trick them into revealing their communication.
• Digital watermarking is a data-hiding application that is related to steganography but is fundamentally quite different. While in steganography the secret message usually has no relationship to the cover object, which plays the role of a mere decoy, watermarks typically supply additional information about the cover. Moreover, and most importantly, watermarks do not have to be embedded undetectably.
Steganography in Digital Media: Principles, Algorithms, and Applications
Jessica Fridrich
Cambridge University Press
Online ISBN: 9781139192903; Hardback ISBN: 9780521190190

Chapter 2, Digital image formats, pp. 15-32
2 Digital image formats

Digital images are commonly represented in four basic formats – raster, palette,
transform, and vector. Each representation has its advantages and is suitable
for certain types of visual information. Likewise, when Alice and Bob design
their steganographic method, they need to consider the unique properties of
each individual format. This chapter explains how visual data is represented and
stored in several common image formats, including raster and palette formats,
and the most popular format in use today, the JPEG. The material included in
this chapter was chosen for its relevance to applications in steganography and
is thus necessarily somewhat limited. The topics covered here form the minimal
knowledge base the reader needs to become familiar with. Those with sufficient
background may skip this chapter entirely and return to it later on an as-needed
basis. An excellent and detailed exposition of the theory of color models and their
properties can be found in [74]. A comprehensive description of image formats
appears in [32].
In Section 2.1, the reader is first introduced to the basic concept of color
as perceived by humans and then learns how to represent color quantitatively
using several different color models. Section 2.2 provides details of the processing
needed to represent a natural image in the raster (BMP, TIFF) and palette
formats (GIF, PNG). Section 2.3 is devoted to the popular transform-domain
format JPEG, which is the most common representation of natural images today.
For all three formats, the reader is instructed how to work with such images in
Matlab.

2.1 Color representation

Visible light is a superposition of electromagnetic waves with wavelengths spanning the interval between approximately 380 nm and 750 nm. Each color can be
associated with the spectral density function P (λ), which describes the amount
of energy present at wavelength λ. Thus, even though one could say that there are
infinitely (or even uncountably) many different colors, each color corresponding
to a different density function P(λ), the human eyes are capable of distinguishing only a relatively small subset of all possible colors. There are three different receptors in the eye retina called cones, with peak sensitivity to red, green, and
blue light. The cones that register the blue light have the smallest sensitivity
to light intensity, while the cones that respond to green light have the highest
sensitivity. Electrical signals produced by the cones are fed to the brain, allowing
us to perceive color. This is the tristimulus theory of color perception.
This theory leads to the so-called additive color model. According to this
model, any color is obtained as a linear combination of three basic colors (or
color channels) – red, green, and blue. Denoting the amount of each color as
R, G, and B, where each number is from the interval [0, 1] (zero intensity to
full intensity), each color can be represented as a three-dimensional vector in the
RGB color cube (R, G, B) ∈ [0, 1]3 . Hardware systems that emit light are usually
modeled as additive. For example, old computer monitors with the Cathode-Ray Tube (CRT) screens create colors by combining three RGB phosphors on
the screen. Liquid-Crystal Display (LCD) panels combine the light from three
adjacent pixels. Full intensity of all three colors is perceived as white, while low
intensity in all is perceived as dark or black.
The subtractive color model is used for hardware devices that create colors
by absorption of certain wavelengths rather than emission of light. A good ex-
ample of a subtractive color device is a printer. The standard basic colors for
subtractive systems are, by convention, cyan, magenta, and yellow, leading to
color representation using the vector CMY. These three colors are obtained by
removing from white the colors red, green, and blue, respectively. The CMY sys-
tem is augmented with a fourth color, black (abbreviated as K) to improve the
printing contrast and save on color toners.
The following relationship holds between the additive RGB and subtractive
CMY systems:

C = 1 − R, (2.1)
M = 1 − G, (2.2)
Y = 1 − B. (2.3)
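With the channels normalized to [0, 1], the relationship (2.1)-(2.3) is its own inverse. A minimal sketch in Python (our illustration; the function names are ours, not from the book):

```python
def rgb_to_cmy(r, g, b):
    """Convert additive RGB to subtractive CMY; all channels in [0, 1]."""
    return 1 - r, 1 - g, 1 - b

def cmy_to_rgb(c, m, y):
    """Inverse mapping; the transformation is an involution."""
    return 1 - c, 1 - m, 1 - y

print(rgb_to_cmy(1, 1, 1))  # (0, 0, 0): white has no cyan, magenta, or yellow
```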

Although the additive RGB color model describes the colors perceivable by
humans quite well, it is redundant because the three signals are highly correlated among themselves and thus is not the most economical for transmission. A very
popular color system is the YUV model originally developed for transmission of
color TV signals. The requirement of backward compatibility with old black-and-
white TVs led the designers to form the color TV signal as luminance augmented
with chrominance signals. The reader is forewarned that from now on the letter Y
will always stand for luminance and not yellow as in the CMY(K) color system.
The luminance Y is defined as a weighted linear combination of the RGB
channels with weights determined by the sensitivity of the human eye to the
three RGB colors,

Y = 0.299R + 0.587G + 0.114B. (2.4)



The chrominance components are the differences

U = R − Y,   (2.5)
V = B − Y,   (2.6)

conveying the color information. The transformation between the RGB and YUV systems is linear,

\begin{pmatrix} Y \\ U \\ V \end{pmatrix} =
\begin{pmatrix} 0.299 & 0.587 & 0.114 \\ 0.701 & -0.587 & -0.114 \\ -0.299 & -0.587 & 0.886 \end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix},   (2.7)

\begin{pmatrix} R \\ G \\ B \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 0 \\ 1 & -0.509 & -0.194 \\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} Y \\ U \\ V \end{pmatrix}.   (2.8)
Note that if the RGB colors are represented using 8-bit integers in the range {0, . . . , 255}, the luminance Y shares the same range, while U falls into the range {−179, . . . , 179} and V into {−226, . . . , 226}. To adjust all three components to a range representable using 8 bits, the chrominance components are further linearly transformed to Cr and Cb, obtaining thus the Y Cr Cb color model

\begin{pmatrix} Y \\ C_r \\ C_b \end{pmatrix} =
\begin{pmatrix} 0 \\ 128 \\ 128 \end{pmatrix} +
\begin{pmatrix} 0.299 & 0.587 & 0.114 \\ 0.5 & -0.419 & -0.081 \\ -0.169 & -0.331 & 0.5 \end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}.   (2.9)
Because human eyes are much less sensitive to changes in chrominance than in luminance, the chrominance signals are often represented with fewer bits without introducing visible distortion into the image. This fact is utilized in the JPEG compression format and it is also used for TV signals, where a smaller bandwidth is allocated to the chrominance signals and a wider bandwidth is used for luminance. Digital image formats that use the Y Cr Cb model include IIF, TIFF, JFIF, JPEG, and MPEG.
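The mapping (2.9) is easy to check numerically. The following sketch (our Python illustration; the book's own examples use Matlab) converts an 8-bit RGB triple to Y, Cr, Cb:

```python
def rgb_to_ycrcb(r, g, b):
    """RGB -> (Y, Cr, Cb) per Eq. (2.9); inputs are 8-bit values 0..255."""
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 128 + 0.5 * r - 0.419 * g - 0.081 * b
    cb = 128 - 0.169 * r - 0.331 * g + 0.5 * b
    return y, cr, cb

# Pure grays carry no chrominance: Cr and Cb sit at the midpoint 128.
print(rgb_to_ycrcb(255, 255, 255))  # approximately (255.0, 128.0, 128.0)
```

Note that the rows for Cr and Cb each sum to zero, which is why any gray (r = g = b) maps to chrominance 128.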

2.1.1 Color sampling


Even though some image formats allow arbitrarily accurate representation of the color intensity values (IIF, PostScript), most formats represent the intensities in a quantized form using a fixed number of nc bits, which allows capturing 2^nc different shades. The most appropriate color sampling is heavily dependent on the application. At one extreme, most fax machines work with only two colors (black and white), while high-resolution satellite imagery or photo-realistic synthetic images may use up to 16 bits per color channel (48 bits for the color). Examples of typical color sampling values for selected applications are listed in Table 2.1.
In this book, we will mostly deal with raster images in BMP (Bitmap), TIFF
(Tagged Image File Format), and PNG (Portable Network Graphics) formats,
palette (indexed) images in GIF (Graphics Interchange Format) or PNG, and
Table 2.1. Typical color bit depth for various applications. For nc ≥ 8, the sampling of
color images is in bits per color channel.

nc Colors Application

1 2 Fax, black-and-white drawings


2 4
4 16 Line drawings, charts, cartoons
8 256 Grayscale images, true-color natural images
12 4096 Grayscale medical images, digital sensor output, scans
14 16384 Digital sensor output, scans, film-quality digital images
16 65536 Photo-realistic synthetic images, scans, satellite imagery

Table 2.2. Color sampling for selected popular image formats.

Raster (color) Raster (grayscale) Palette


BMP 24 8 1, 4, 8
TIFF 24, 30, 36, 42, 48 4–8 1–8
PNG 24, 48 1, 2, 4, 8, 16 1, 2, 4, 8
GIF – – 1–8
JPEG 24 8, 12 –

the JPEG format. Table 2.2 shows the color bit depth allowed by each format.
For palette formats, the bit depth represents the range of palette indices.

2.2 Spatial-domain formats

The majority of readers would probably agree that the most intuitive way to
represent natural images in a computer is to sample the colors on a sufficiently
dense rectangular grid. This approach also nicely plays into how digital images
are usually acquired through an imaging sensor (Chapter 3). Images stored in
such spatial-domain formats form very large files that allow the steganographer
to hide relatively large messages. In this section, we describe the details of the
raster and palette representations.

2.2.1 Raster formats


In a raster format, the image data is typically stored in a row-by-row manner
with one or more bytes (or bits) per pixel depending on the format and the
number of bits allocated per pixel. While three bytes are necessary for each pixel
of a 24-bit true-color image, grayscales typically require 8 bits (one byte) per
pixel. Monochrome (black-and-white) images need only one bit per pixel. The
most common formats that allow raster image representation are BMP, TIFF,
and PNG. Their color sampling is shown in Table 2.2. These formats may use lossless compression [206] to provide a smaller file size. For example, BMP may optionally use run-length encoding, PNG uses DEFLATE, while TIFF allows multiple different compression schemes.
In this book, a grayscale image in raster format will be represented using an M × N matrix of integers from the range {0, . . . , 2^nc − 1}, where typically nc = 8. A true-color BMP image will be represented with three such matrices. The Matlab command1 for importing a BMP image is X = imread('my_image.bmp'). Alternatively, saving a uint8 matrix of integers X to a BMP image is obtained using the command imwrite(X, 'my_stego_image.bmp', 'bmp'). The reader is urged to check the Matlab help facility for the list of formats supported by his/her version of Matlab.

2.2.2 Palette formats


Palette formats are typically used for images with low color depth, such as
computer-generated graphics, line drawings, and cartoons. For this type of visual
information, the format is lossless (there is no loss of fidelity). The image consists
of the header, image palette, and image data. The palette can have up to 256
colors stored as 8-bit RGB triples. The image data is a rectangular M × N array
of 8-bit pointers to the palette.
When converting an image with more than 256 colors to a palette image, two
separate procedures are employed: creation of the color palette and converting
each pixel color to a palette color. The palette could be a fixed array of colors
independent of the image content or it could be derived from the image using
color-quantization algorithms. The latter option always gives much more visually
pleasing results. Note that color quantization is a lossy process.
There exist many different algorithms for color quantization [111]. The popularity algorithm calculates the histogram and takes the most frequently occurring 256 colors as the palette. The median-cut algorithm recursively fits a box around the colors (in the RGB color cube {0, . . . , 255}^3), splitting it along its longest dimension at the median in that dimension. The recursive process ends when 2^8 = 256 boxes are obtained. The color palette is then formed by the center of gravity of colors from each box. This algorithm produces better results than the popularity algorithm. It is possible to split the box according to other criteria, such as minimizing the spread in each box, etc.
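The popularity algorithm admits a very short sketch (our Python illustration; pixels are RGB tuples, and the function name is ours):

```python
from collections import Counter

def popularity_palette(pixels, palette_size=256):
    """'Popularity' color quantization: return the palette_size most
    frequently occurring colors among the given (R, G, B) tuples."""
    counts = Counter(pixels)
    return [color for color, _ in counts.most_common(palette_size)]

pixels = [(0, 0, 0)] * 5 + [(255, 255, 255)] * 3 + [(10, 20, 30)]
print(popularity_palette(pixels, 2))  # [(0, 0, 0), (255, 255, 255)]
```

Median-cut requires more bookkeeping (recursive box splitting), but the interface is the same: a list of pixels in, a palette of at most 256 colors out.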
Once the image palette has been obtained, the color of every pixel needs to be
mapped onto the newly created palette. Again, a number of approaches exist,
ranging from simple truncation to the nearest neighbor to stochastic dithering or
dithering along a space-filling curve. In this book, we describe only the simplest
dithering methods to explain the main concepts.

1 Matlab Image Processing Toolbox is required.



Figure 2.1 Simple dithering mechanism. The approximation error e[i, j] = x[i, j] − y[i, j] at the current pixel, where y[i, j] is the closest palette color to x[i, j], is added to the next pixel x[i, j + 1] of the original 24-bit image before that pixel is quantized, producing the quantized and dithered image y.

Let us denote the original pixel colors as x[i], i = 1, . . . , M × N, assuming here that the sequence x[i] has been obtained by scanning the image in some continuous manner (for example, by rows). Let us further denote the palette colors in the RGB model as c[k] = (r[k], g[k], b[k]), k = 0, . . . , 255, r[k], g[k], b[k] ∈ {0, . . . , 255}. During dithering, depicted in Figure 2.1, the pixel values are processed one-by-one using the following formulas (y[i] denotes the dithered image with colors from the palette):

y[i] = c[k], where c[k] is the closest color to x[i],   (2.10)
e[i] = x[i] − y[i] is the color approximation error at pixel i,   (2.11)
x[i + 1] = x[i + 1] + e[i] is the next pixel value corrected by the approximation error made at x[i].   (2.12)

In this algorithm, the pixel colors are truncated to the closest palette color,
and, at the same time, the next pixel to be visited is modified by the truncation
error at the current pixel. If a pixel is truncated, say, to a color that is less red,
a small amount of red is added to the next pixel to locally preserve the overall
color balance in the image.
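For a grayscale signal and a scalar palette, the one-dimensional recursion (2.10)-(2.12) reduces to a few lines. A sketch in Python (ours; the book's examples use Matlab):

```python
def dither_1d(x, palette):
    """One-dimensional error-diffusion dithering per Eqs. (2.10)-(2.12).

    x: sequence of gray values; palette: available output levels.
    The full truncation error at pixel i is pushed onto pixel i + 1.
    """
    y, err = [], 0.0
    for v in x:
        v = v + err                                 # error-corrected value
        q = min(palette, key=lambda c: abs(c - v))  # closest palette color
        y.append(q)
        err = v - q                                 # carry the error forward
    return y

# Mid-gray rendered with a black-and-white palette alternates,
# locally preserving the average intensity.
print(dither_1d([128, 128, 128, 128], [0, 255]))  # [255, 0, 255, 0]
```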
This process can be further refined by spreading the truncation error among
more pixels. We just need to make sure that the pixels are spatially close to
the current pixel and that no pixels are modified that have already been visited.
Weights can be assigned as fixed numbers or random variables with sum equal
to 1 across all pixels receiving a portion of the truncation error.
One of the most popular dithering algorithms is Floyd–Steinberg dithering.
To explain this algorithm, we now represent the pixels in the image as a two-
dimensional array and assume that the dithering algorithm starts in the upper
Figure 2.2 Floyd–Steinberg dithering. The error e[i, j] is spread among the neighboring unvisited pixels with weights satisfying α0,1 + α1,1 + α1,0 + α1,−1 = 1.

left corner with pixel indices i = 0, j = 0 (follow Figure 2.2):

y[i, j] = c[k] where c[k] is the closest color to x[i, j], (2.13)
e[i, j] = x[i, j] − y[i, j], (2.14)
x[i, j + 1] = x[i, j + 1] + α0,1 e[i, j], (2.15)
x[i + 1, j + 1] = x[i + 1, j + 1] + α1,1 e[i, j], (2.16)
x[i + 1, j] = x[i + 1, j] + α1,0 e[i, j], (2.17)
x[i + 1, j − 1] = x[i + 1, j − 1] + α1,−1 e[i, j]. (2.18)

Typical values of the coefficients are α0,1 = 7/16, α1,1 = 1/16, α1,0 = 5/16, and α1,−1 = 3/16. The dithering process basically arranges for a trade-off between limited color resolution and spatial resolution. Because human eyes have the ability to integrate colors in a small patch when looking from a distance, dithering allows us to perceive new shades of color not present in the palette. The dithering process introduces characteristic structures or patterns (noisiness) that may become visible in areas of small color gradient. The stochastic error spread helps by breaking any regular dithering patterns that may otherwise arise, thereby creating a more visually pleasing image. As an example, in Figure 2.3 we show a magnified portion of a true-color image after storing it as a GIF image with 256 colors in the palette obtained using the median-cut algorithm (the color image is displayed in Plate 1). The colors were dithered using Floyd–Steinberg dithering.
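The recursion (2.13)-(2.18) with the standard weights 7/16, 1/16, 5/16, 3/16 can be sketched for a grayscale image as follows (our Python illustration; a full-color implementation would apply the same update per channel):

```python
def floyd_steinberg(img, levels=(0, 255)):
    """Floyd-Steinberg dithering of a grayscale image (list of rows).

    Each pixel is quantized to the closest level and its error is
    diffused to the four unvisited neighbors, Eqs. (2.13)-(2.18).
    """
    h, w = len(img), len(img[0])
    x = [[float(v) for v in row] for row in img]  # working copy
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            q = min(levels, key=lambda c: abs(c - x[i][j]))
            e = x[i][j] - q
            out[i][j] = q
            if j + 1 < w:
                x[i][j + 1] += e * 7 / 16
            if i + 1 < h and j + 1 < w:
                x[i + 1][j + 1] += e * 1 / 16
            if i + 1 < h:
                x[i + 1][j] += e * 5 / 16
            if i + 1 < h and j - 1 >= 0:
                x[i + 1][j - 1] += e * 3 / 16
    return out

flat = [[128] * 8 for _ in range(8)]
dithered = floyd_steinberg(flat)
# Roughly half of the output pixels are white, approximating mid-gray.
```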
The Image Processing Toolbox of Matlab offers a number of routines that make
working with GIF images in Matlab very easy. A GIF image can be read using
the command [Ind, Map] = imread(’my_image.gif’). The variable Map is an
n × 3 double array of palette colors consisting of n ≤ 256 colors. Each row of Map
is one palette color in the RGB format scaled so that R, G, B ∈ [0, 1]. The variable
Ind is the array of indices to the palette of the same dimensions as the image.
Modified versions of both arrays can be used to write the modified image to disk
as a GIF file using the command imwrite(Ind, Map, ’my_stego_image.gif’,
’gif’).
Figure 2.3 Magnified portion of a true-color image (left) and the same portion after
storing the image as GIF (right). A color version of this figure appears in Plate 1.

2.3 Transform-domain formats (JPEG)

People perceive natural images as a collection of segments filled with texture


rather than as matrices of pixels. In particular, tests on human subjects showed
that our visual system is fairly insensitive to small changes in color or high-
spatial-frequency noise. Thus, it is highly inefficient to store natural images
as rectangular matrices of colors. Engineers working in data compression have
long realized this fact and proposed several much more efficient image formats
that work by transforming the image into a different domain where it can be
represented in an easily compressible “sparse” form. Such formats are typically
lossy, meaning that the format conversion introduces some loss of information that is imperceptible under regular viewing conditions. Substantial savings in storage
space justify the slight loss of fidelity. The two most commonly used transforms today are the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT). The DCT is at the heart of the JPEG format, while the DWT is used in JPEG2000 [228]. In this section, we introduce only the JPEG format as JPEG2000 steganography is currently not well developed.
JPEG stands for the Joint Photographic Experts Group that finalized the
standard in 1992. In this section, we review basic properties of the format relevant
to steganography. A detailed description of the format can be found in [185].
JPEG compression consists of the following five basic steps.

1. Color transformation. The color is transformed from the RGB model to the Y Cr Cb model (2.9). Although this step is not necessary (JPEG can work directly with the RGB representation), it is typically used because it enables higher compression ratios at the same fidelity.
2. Division into blocks and subsampling. The luminance signal Y is divided
into 8 × 8 blocks. The chrominance signals Cr and Cb may be subsampled
before dividing into blocks (more details are provided below).
3. DCT transform. The Y Cr Cb signals from each block are transformed from the spatial domain to the frequency domain with the DCT. The DCT can be thought of as a change of basis in the space of all 8 × 8 matrices.
4. Quantization. The resulting transform coefficients are quantized by dividing
them by an integer value (quantization step) and rounded to the nearest in-
teger. The luminance and chrominance signals may use different quantization
tables. Larger values of the quantization steps produce a higher compression
ratio but introduce more perceptual distortion.
5. Encoding and lossless compression. The quantized DCT coefficients are
arranged in a zig-zag order, encoded using bits, and then losslessly compressed
using Huffman or arithmetic coding. The resulting bit stream is prepended
with a header and stored with the extension ’.jpg’ or ’.jpeg.’ For applications
in steganography, it will not be necessary to understand the details of this
last step.

To view a JPEG image, we first need to obtain the spatial-domain representation from the JPEG file, which is achieved essentially by reversing the five steps
above. The JPEG bit stream is first parsed, then decompressed, and then the
two-dimensional array of quantized DCT coefficients is formed. The coefficients
in each block are then multiplied by the quantization steps and the inverse DCT
is applied to produce the raw pixel values. Finally, the values are rounded to in-
tegers from a certain dynamic range (usually the set {0, . . . , 255}). While lossless
compression and the DCT are reversible processes, the quantization in Step 4 is
irreversible and, in general, the decompressed image will not be identical to the
original image before compression.

2.3.1 Color subsampling and padding


Because human eyes are less sensitive to changes in color than in luminance, the
chrominance signals, Cr , Cb , are typically downsampled before applying the DCT
to achieve a higher compression ratio. This is executed by initially dividing the
image into macroblocks of 16 × 16 pixels. Each macroblock produces four 8 × 8
luminance blocks and 1, 2, or 4 blocks for each chrominance, depending on how
the chrominance in the macroblock is subsampled. If it is subsampled by a factor
of 2 in each direction, each macroblock will have only one 8 × 8 chrominance Cr
block and one chrominance Cb block (and, of course, four luminance blocks).
This is usually written in an abbreviated form 4 : 1 : 1. If neither chrominance
signal is subsampled, we have the 4 : 4 : 4 representation. Both Cr and Cb can
be subsampled only along one direction, which would lead to two chrominance
8 × 8 blocks in every macroblock, 4 : 2 : 2. Other possibilities are allowed by the
format, such as subsampling Cr only in one direction and Cb in both directions
(4 : 2 : 1).
If the image dimensions, M × N, are not multiples of 8, the image is padded to the nearest larger multiples, 8⌈M/8⌉ × 8⌈N/8⌉. During decompression, the padded parts are not displayed. Also, before applying the DCT, all pixel values are shifted by subtracting 128 from them.
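The padded dimensions and the level shift can be sketched as follows (our Python illustration; the function names are ours):

```python
import math

def jpeg_padded_dims(M, N):
    """Image dimensions after padding to the nearest larger multiples of 8."""
    return 8 * math.ceil(M / 8), 8 * math.ceil(N / 8)

def level_shift(block):
    """Subtract 128 from every pixel value before applying the DCT."""
    return [[v - 128 for v in row] for row in block]

print(jpeg_padded_dims(1000, 750))  # (1000, 752)
```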

2.3.2 Discrete cosine transform


We now provide a mathematical description of selected steps in JPEG compression. We start with the DCT and explain its properties. For an 8 × 8 block of luminance (or chrominance) values B[i, j], i, j = 0, . . . , 7, the 8 × 8 block of DCT coefficients d[k, l], k, l = 0, . . . , 7 is computed as a linear combination of luminance values,

d[k, l] = \sum_{i,j=0}^{7} f[i, j; k, l] B[i, j]   (2.19)

        = \frac{w[k]w[l]}{4} \sum_{i,j=0}^{7} \cos\frac{\pi}{16}k(2i + 1) \cos\frac{\pi}{16}l(2j + 1) B[i, j],   (2.20)

where f[i, j; k, l] = (w[k]w[l]/4) \cos\frac{\pi}{16}k(2i + 1) \cos\frac{\pi}{16}l(2j + 1), w[0] = 1/\sqrt{2}, and w[k] = 1 for k > 0. The coefficient d[0, 0] is called the DC coefficient (or the DC term), while the remaining coefficients with k + l > 0 are called AC coefficients.
The DCT is invertible and the inverse transform (IDCT) is

B[i, j] = \sum_{k,l=0}^{7} \frac{w[k]w[l]}{4} \cos\frac{\pi}{16}k(2i + 1) \cos\frac{\pi}{16}l(2j + 1) d[k, l].   (2.21)
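Equations (2.20) and (2.21) translate directly into code. The following sketch (our plain-Python illustration, optimized for clarity rather than speed) computes the forward and inverse DCT of an 8 × 8 block and can be used to check that they are mutually inverse:

```python
import math

def w(k):
    """Normalization weight: w[0] = 1/sqrt(2), w[k] = 1 for k > 0."""
    return 1 / math.sqrt(2) if k == 0 else 1.0

def dct2(B):
    """Forward 8x8 DCT per Eq. (2.20)."""
    return [[w(k) * w(l) / 4 * sum(
                math.cos(math.pi / 16 * k * (2 * i + 1)) *
                math.cos(math.pi / 16 * l * (2 * j + 1)) * B[i][j]
                for i in range(8) for j in range(8))
             for l in range(8)] for k in range(8)]

def idct2(d):
    """Inverse 8x8 DCT per Eq. (2.21)."""
    return [[sum(w(k) * w(l) / 4 *
                 math.cos(math.pi / 16 * k * (2 * i + 1)) *
                 math.cos(math.pi / 16 * l * (2 * j + 1)) * d[k][l]
                 for k in range(8) for l in range(8))
             for j in range(8)] for i in range(8)]

B = [[(i * j + i + j) % 256 for j in range(8)] for i in range(8)]
R = idct2(dct2(B))
err = max(abs(B[i][j] - R[i][j]) for i in range(8) for j in range(8))
# err is at the level of floating-point round-off.
```

Because the basis patterns f[i, j; k, l] are orthonormal, the round-trip error is zero up to floating-point precision.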

The fact that the transform involves real numbers rather than integers increases
its complexity and memory requirements, which could be an issue for mobile
electronic devices, such as digital cameras or cell phones. Fortunately, the JPEG
format allows various implementations of the transform that work only with
integers and are thus much faster and more easily implemented in hardware.
The fact that there exist various implementations of the transform means that
one image could be compressed to several slightly different JPEG files. In fact,
this difference may not be that small for some images and may influence the
statistical distribution of DCT coefficients [95].
The DCT can be interpreted as a change of basis in the vector space of all 8 × 8 matrices, where the sum of matrices and multiplication by a scalar are defined in the usual elementwise manner and the dot product between matrices X and Y is X · Y = \sum_{i,j=0}^{7} X[i, j]Y[i, j]. For a fixed pair (k, l), we call the 8 × 8 matrix f[i, j; k, l] the (k, l)th basis pattern. All 64 such patterns, depicted in Figure 2.4, form an orthonormal system because

\sum_{i,j=0}^{7} f[i, j; k, l] f[i, j; k', l'] = \delta(k − k')\delta(l − l') for all k, k', l, l' ∈ {0, . . . , 7},   (2.22)
Digital image formats 25

Figure 2.4 All 64 orthonormal basis patterns used in JPEG compression. Below the patterns, an example of an expansion of a pattern into a linear combination of basis patterns, B = d[0, 0] · f[·, ·; 0, 0] + d[0, 1] · f[·, ·; 0, 1] + · · · + d[7, 7] · f[·, ·; 7, 7]. Image provided courtesy of Andreas Westfeld.

where δ is the Kronecker delta, δ(x) = 1 when x = 0 and δ(x) = 0 when x ≠ 0.   (2.23)

Equation (2.20) can then be understood as a decomposition of the pixel block B into the basis. The DCT coefficients d[k, l] are coefficients in a linear combination
of the patterns that produces the pixel block B. Figure 2.4 demonstrates this
fact graphically.
We also wish to remark that the two-dimensional DCT can be built from the
one-dimensional DCT by taking tensor products of one-dimensional basis vectors.
For a fixed pair (k, l), f[i, j; k, l] = (u ⊗ v)[i, j], where u[i] = (w[k]/2) cos(π k(2i + 1)/16), v[j] = (w[l]/2) cos(π l(2j + 1)/16), and (a ⊗ b)[i, j] = a[i]b[j]. In fact, most two-
and higher-dimensional transforms are constructed in this manner.
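The outer-product construction can be verified numerically. This Python sketch (illustrative) builds a basis pattern as u ⊗ v and checks the orthonormality relation (2.22) for a few pairs of spatial frequencies.

```python
import math

def basis_pattern(k, l):
    """(k, l)th 8x8 DCT basis pattern f[i, j; k, l] as an outer product u ⊗ v."""
    w = [1 / math.sqrt(2)] + [1.0] * 7
    u = [w[k] / 2 * math.cos(math.pi / 16 * k * (2 * i + 1)) for i in range(8)]
    v = [w[l] / 2 * math.cos(math.pi / 16 * l * (2 * j + 1)) for j in range(8)]
    return [[u[i] * v[j] for j in range(8)] for i in range(8)]

def dot(X, Y):
    """Dot product of two 8x8 matrices, as defined in the text."""
    return sum(X[i][j] * Y[i][j] for i in range(8) for j in range(8))
```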

2.3.3 Quantization
The purpose of quantization is to enable representation of DCT coefficients using
fewer bits, which necessarily results in loss of information. During quantization,
the DCT coefficients d[k, l] are divided by quantization steps from the quantiza-
tion matrix Q[k, l] and rounded to integers

D[k, l] = round(d[k, l]/Q[k, l]),   k, l ∈ {0, . . . , 7},   (2.24)

Figure 2.5 A magnified portion of a true-color image in BMP format (left) and the same
portion after compressing with JPEG quality factor qf = 20 (right). A color version of this
figure appears in Plate 2.

where we denoted the operation of rounding x to the closest integer as round(x).


The larger the quantization step, the fewer bits can be allocated to each DCT
coefficient and the larger the loss and with it the perceptual distortion. JPEG
introduces very characteristic compression artifacts that manifest themselves as
“blockiness” or spatial discontinuities at the boundary of the 8 × 8 blocks (see
Figure 2.5 or Plate 2 for the color version). The blockiness becomes more pro-
nounced with coarser quantization. At fine quantization (low compression ratio
or high quality factor), the blockiness artifacts are not perceptible even when
inspected under magnification.
The JPEG standard recommends a set of quantization matrices indexed by a
quality factor qf ∈ {1, 2, . . . , 100}. These matrices became known as the “stan-
dard” quantization matrices. Denoting the 8 × 8 matrix of ones with boldface 1,
the standard quantization matrices are obtained using the following formula:


Qqf = max{1, round(2 Q50 (1 − qf /100))} when qf > 50, and
Qqf = min{255 · 1, round(Q50 · 50/qf )} when qf ≤ 50,   (2.25)

with round, max, and min applied elementwise,

where the 50% quality standard JPEG quantization matrix (for the luminance
component Y ) is

Q50(lum) =

   16  11  10  16  24  40  51  61
   12  12  14  19  26  58  60  55
   14  13  16  24  40  57  69  56
   14  17  22  29  51  87  80  62
   18  22  37  56  68 109 103  77
   24  35  55  64  81 104 113  92
   49  64  78  87 103 121 120 101
   72  92  95  98 112 100 103  99 .   (2.26)
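The scaling rule (2.25) is simple to reproduce. The Python sketch below (illustrative, not the reference IJG implementation) scales the 50% luminance matrix (2.26), clamps the entries to {1, . . . , 255}, and rounds halves upward, one common convention for round().

```python
import math

# The 50%-quality standard luminance quantization matrix of Eq. (2.26).
Q50_LUM = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def standard_quant_matrix(qf, Q50=Q50_LUM):
    """Scale the 50%-quality matrix to quality factor qf per Eq. (2.25)."""
    assert 1 <= qf <= 100
    scale = 2 * (1 - qf / 100) if qf > 50 else 50 / qf
    rnd = lambda x: math.floor(x + 0.5)   # round half up (one common convention)
    return [[min(255, max(1, rnd(q * scale))) for q in row] for row in Q50]
```

For example, qf = 100 yields a matrix of all ones (lossless up to rounding in the DCT domain), while qf = 25 doubles every entry of Q50.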
Digital image formats 27

The chrominance quantization matrices are obtained using the same mechanism
with the 50% quality chrominance quantization matrix,
Q50(chr) =

   17  18  24  47  99  99  99  99
   18  21  26  66  99  99  99  99
   24  26  56  99  99  99  99  99
   47  66  99  99  99  99  99  99
   99  99  99  99  99  99  99  99
   99  99  99  99  99  99  99  99
   99  99  99  99  99  99  99  99
   99  99  99  99  99  99  99  99 .   (2.27)

The JPEG format allows arbitrary (non-standard) quantization matrices that can be stored in the header of the JPEG file. Many digital cameras use such
non-standard matrices. An example of a non-standard luminance quantization
matrix from a Kodak DC 290 camera is
    5   5   5   5   5   6   6   8
    5   5   5   5   5   6   7   8
    5   5   5   5   6   7   8   9
    5   5   5   6   7   8   9  10
    5   5   6   7   8   9  11  12
    6   6   7   8   9  11  13  14
    6   7   8   9  11  13  15  16
    8   8   9  10  12  14  16  19 .   (2.28)

We will denote the (k, l)th DCT coefficient in the bth block as D[k, l, b], b ∈
{1, . . . , NB }, where NB is the number of all 8 × 8 blocks in the image. Note
that for a color image, there will be three such three-dimensional arrays, one for
luminance and two for the chrominance signals. The pair (k, l) ∈ {0, . . . , 7} ×
{0, . . . , 7} is called the spatial frequency (or mode) of the DCT coefficient.
Before the DCT coefficients are encoded using bits and losslessly compressed,
the blocks are ordered from the upper left corner to the bottom right corner and
the individual coefficients from each block are arranged by scanning the block
using a zig-zag scan that starts at the spatial frequency (0, 0) and proceeds
towards (7, 7).
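The zig-zag order itself can be generated compactly: coefficients on one anti-diagonal share the sum k + l, and the traversal direction alternates between diagonals. The Python sketch below follows the common JPEG convention that visits (0, 0), (0, 1), (1, 0), (2, 0), . . .

```python
def zigzag_order():
    """Spatial frequencies (k, l) in zig-zag scan order, from (0, 0) to (7, 7).
    Coefficients on one anti-diagonal share s = k + l; odd diagonals are
    traversed with k increasing, even diagonals with l increasing."""
    return sorted(((k, l) for k in range(8) for l in range(8)),
                  key=lambda kl: (kl[0] + kl[1],
                                  kl[0] if (kl[0] + kl[1]) % 2 else kl[1]))
```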

2.3.4 Decompression
The decompression works in the opposite order. After reading the quantized DCT
blocks from the JPEG file, each block of quantized DCT coefficients D is multi-
plied by the quantization matrix Q, d̃[k, l] = Q[k, l]D[k, l], k, l ∈ {0, . . . , 7}, and
the inverse DCT is applied to the 8 × 8 matrix d̃. The values are finally rounded
to integers and truncated to a finite dynamic range (usually {0, . . . , 255}). The
block of decompressed pixel values B̃ is thus

qf = 20:                             qf = 90:
 -1   6   3  -1   0   0   0   0      -16  90  37 -17  -1  -2  -2  -1
  4   1  -4  -1   0   0   0   0       63  10 -46 -14  12   0   0   2
  0  -1   0   0   0   0   0   0       -2  -9  -5  12   4  -5  -2   1
  0   0   0   0   0   0   0   0        1  -3  -2   0  -3  -1   1   1
  0   0   0   0   0   0   0   0        0  -2  -1  -1   0   1   1  -1
  0   0   0   0   0   0   0   0        0   0   0   0  -1   0   0   0
  0   0   0   0   0   0   0   0        0  -1   0   0   0   0   0   0
  0   0   0   0   0   0   0   0        0  -1   0   1   0   0   0   0

Figure 2.6 An 8 × 8 luminance block of pixels and its quantized DCT coefficients for JPEG quality factor qf = 20 (left) and qf = 90 (right).


B̃ = trunc(round(IDCT(d̃))),   (2.29)

where IDCT(·) is the inverse DCT (2.21) and trunc(x) is the operation of truncat-
ing integers to a finite dynamic range (trunc(x) = x for x ∈ [0, 255], trunc(x) = 0
for x < 0, and trunc(x) = 255 for x > 255). Due to quantization, rounding, and
truncation, B̃ will in general differ from the original block B.
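The complete per-block decompression can be sketched as follows (Python, for illustration); the inner sum is the inverse DCT (2.21) applied to the dequantized coefficients.

```python
import math

def decompress_block(D, Q):
    """Dequantize an 8x8 block D with table Q, apply the inverse DCT,
    then round and truncate to the dynamic range {0, ..., 255}."""
    w = [1 / math.sqrt(2)] + [1.0] * 7
    dq = [[Q[k][l] * D[k][l] for l in range(8)] for k in range(8)]  # dequantize
    B = [[0] * 8 for _ in range(8)]
    for i in range(8):
        for j in range(8):
            s = sum(w[k] * w[l] / 4
                    * math.cos(math.pi / 16 * k * (2 * i + 1))
                    * math.cos(math.pi / 16 * l * (2 * j + 1))
                    * dq[k][l]
                    for k in range(8) for l in range(8))
            B[i][j] = min(255, max(0, round(s)))    # round, then truncate
    return B
```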

2.3.5 Typical DCT block


The quantized DCT coefficients in a JPEG file are represented using integers
in the range [−1023, 1024]. If we were to compress an image resembling white
noise, the 8 × 8 matrix of quantized DCT coefficients in each block, b, D[k, l, b],
0 ≤ k, l ≤ 7, would be filled with many non-zero integers because the spectrum
of white noise is flat (all frequencies contribute the same amount of energy).
Because natural images consist of objects rather than random textures, most of
their energy is concentrated in low spatial frequencies. Consequently, the non-
zero elements in each 8 × 8 block of DCT coefficients will be concentrated in
the upper left corner with (k, l) = (0, 0) (the low-spatial-frequency corner). To
illustrate this claim, in Figure 2.6 we show an 8 × 8 luminance block and the
block of quantized DCT coefficients for the same block.

Figure 2.7 Histogram of luminance DCT coefficients for the image shown in Figure 5.1 stored as 95% quality JPEG.

Figure 2.8 Histogram of luminance DCT coefficients from the image shown in Figure 5.1 compressed as 95% quality JPEG and the generalized Gaussian fit.

2.3.6 Modeling DCT coefficients


DCT coefficients of natural images follow a distribution with a spike around
zero (see Figure 2.7), which is often modeled using the generalized Gaussian or
Cauchy distribution (see Appendix A for description of these distributions).
The method of moments in Section A.8 is a simple approach that can be
used to determine the parameters of the generalized Gaussian fit from data. In
Figure 2.8, we show the histogram of DCT coefficients from the image shown
in Figure 5.1 and the generalized Gaussian fit obtained using the method of
moments.
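As a rough illustration of moment matching for the generalized Gaussian: for shape parameter ν, the ratio (E|X|)²/E[X²] equals Γ(2/ν)²/(Γ(1/ν)Γ(3/ν)), which is monotone in ν, so the sample ratio can be inverted numerically. The Python sketch below uses bisection; the book's Appendix A gives the exact procedure.

```python
import math

def ggd_shape(samples):
    """Moment-matching estimate of the generalized Gaussian shape parameter:
    solve m1^2/m2 = Gamma(2/v)^2 / (Gamma(1/v) Gamma(3/v)) for v by bisection
    (the ratio is increasing in v on the bracket used below)."""
    n = len(samples)
    m1 = sum(abs(x) for x in samples) / n
    m2 = sum(x * x for x in samples) / n
    r = m1 * m1 / m2
    rho = lambda v: math.gamma(2 / v) ** 2 / (math.gamma(1 / v) * math.gamma(3 / v))
    lo, hi = 0.05, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if rho(mid) < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Gaussian data should give a shape near 2 (r = 2/π) and Laplacian data a shape near 1 (r = 1/2), matching the text's observation that spikier histograms have smaller shape parameters.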

2.3.7 Working with JPEG images in Matlab


Many steganographic methods for JPEG images work by directly manipulating
quantized DCT coefficients extracted from the JPEG file and saving the modified
array again as JPEG. Steganalysis methods for JPEG images also need access
to the quantized DCT coefficients rather than the JPEG image decompressed to
the spatial domain. However, most image-editing programs, including Matlab,
do not provide direct access to the coefficients. The imread command in Matlab
returns the spatial-domain representation of the JPEG file (the decompressed
file) rather than its DCT coefficients.
The Independent JPEG Group (http://www.ijg.org) provides access to its
libjpeg library with C routines for parsing a JPEG file, extracting header data,
DCT coefficients, and quantization tables, and writing data back to a JPEG file.
Most readers of this book are probably familiar with programming in Matlab
and would thus prefer to carry out the above-mentioned tasks within Matlab
itself. This is, indeed, possible using the Matlab JPEG Toolbox written by Phil
Sallee (http://www.philsallee.com). It essentially gives the user the ability to
use the libjpeg library through Matlab functions. Among others, two routines
included in the toolbox are jpeg_read and jpeg_write. They are Matlab MEX
wrappers for the libjpeg library. The toolbox includes precompiled MEX binaries
for Windows. For other operating systems, the user can download the libjpeg
library and compile the MEX routines.
Below, we give an example of how to use the Matlab JPEG Toolbox to extract
the array of luminance and chrominance DCT coefficients and the quantization
matrices from a JPEG file, and how to write the data back to a JPEG file.

im=jpeg_read(’color_image.jpg’);
Lum=im.coef_arrays{im.comp_info(1).component_id};
ChromCr=im.coef_arrays{im.comp_info(2).component_id};
ChromCb=im.coef_arrays{im.comp_info(3).component_id};
Lum_quant_table=im.quant_tables{im.comp_info(1).quant_tbl_no};
Chrom_quant_table=im.quant_tables{im.comp_info(2).quant_tbl_no};
...
jpeg_write(im_stego,’my_stego_image.jpg’);

The array of luminance DCT coefficients Lum is obtained by replacing every 8 × 8 pixel block with the corresponding block of quantized DCT coefficients.

Summary
• Human eyes are sensitive to electromagnetic radiation in the range of approximately 380 nm to 750 nm and each color is uniquely captured by the spectral power within this range.
• According to the tristimulus theory of human perception, each color that humans can perceive can be obtained as a superposition of three basic colors – red, green, and blue (RGB).
• There exist other color models, such as CMYK (cyan, magenta, yellow, black), YUV, and YCrCb (luminance and two chrominance signals), that are suitable for various applications.
• There are four main types of image formats: raster, palette (indexed), transform, and vector formats.
• Raster formats represent a digital image as a rectangular array of integers sampled using a fixed number of bits. Typical raster image formats are BMP, TIFF, and PNG. Images in raster formats are often large despite the fact that the formats use lossless compression to decrease the amount of data that needs to be stored.
• Palette images consist of two parts – a color palette and an array of indices to the palette. Typical palette formats are GIF and PNG. Palette images are convenient for representing charts, computer art, and other images with low color depth.
• Images in transform formats are represented through transform coefficients quantized using a fixed number of bits rather than using pixels directly. The most popular transform format is JPEG, which uses the discrete cosine transform (DCT). The coefficient quantization is controlled through a quantization table, which in turn can be controlled using a quality factor. Transform formats enable more efficient storage of visual data through lossy compression.
• Vector formats, such as WMF, EPS, and PS, can represent objects in the image using parametric description.
• Many formats enable several different image representations. For example, BMP and PNG can be either raster or palette, while EPS and TIFF are very general formats that allow multiple image representations.
• The raster and transform formats are most suitable for steganography.
• DCT coefficients in a JPEG file follow a symmetrical distribution with a spike around zero that is well modeled using generalized Gaussian or Cauchy distributions.

Exercises

2.1 [Palette ordering] Write a Matlab routine that orders a palette (repre-
sented as an n × 3 array) by luminance. Use the conversion formula (2.4) for the
ordering.

2.2 [Draw palette] Write a routine that displays the palette colors of a
GIF image as 16 × 16 little squares arranged by rows (from left to right, top to
bottom) into a square pattern. Inspect Figure 5.5 or Plate 6 for an example of
the output.

2.3 [JPEG quantization table] Take a picture of a highly textured image with your digital camera set to take images in the highest-quality JPEG
format. A closeup of grass or foliage would be good. Load the JPEG image in
Matlab using the routine jpeg_read from the Matlab JPEG Toolbox. Extract
the luminance quantization matrix. Is the matrix a standard matrix? Repeat
the experiment by taking another JPEG image at the same JPEG quality, but
this time choose a smoother content, such as a picture of blue sky, sea, or a wall
painted with one color. Compare the quantization tables. Since most cameras use
different quantization matrices for different images depending on their content,
the two matrices will probably be different.

2.4 [JPEG histogram] Using the Matlab JPEG Toolbox, write a routine
that displays a histogram of DCT coefficients from a chosen DCT mode (k, l). Use
your routine and analyze a picture of a regular scene. Plot the histograms of DCT
modes (1, 2), (1, 4), and (1, 6). Notice that the histograms become progressively
“spikier.” This is because modes with higher spatial frequencies are more often
quantized to zero.

2.5 [Generalized Gaussian Fit] Write a routine that fits the generalized
Gaussian distribution to a histogram of DCT coefficients using the method of
moments (Section A.8). For debugging purposes, apply it to artificially generated
Gaussian data to see whether you are getting the expected result. Then apply it
to the three histograms obtained in the previous project. The shape parameter
should decrease with increasing spatial frequency of the DCT mode, thus quantifying your observation that the histogram becomes spikier.
Steganography in Digital Media: Principles, Algorithms, and Applications
Jessica Fridrich
Cambridge University Press
Online ISBN: 9781139192903; Hardback ISBN: 9780521190190
Chapter 3: Digital image acquisition, pp. 33-46

3 Digital image acquisition

This book focuses on steganographic methods that embed messages in digital images by slightly modifying them. In this chapter, we explain the process by
which digital images are created. This knowledge will help us design more se-
cure steganography methods as well as build more sensitive detection schemes
(steganalysis).
Fundamentally, there exist two mechanisms through which digital images can
be created. They can be synthesized on a computer or acquired through a sen-
sor. Computer-generated images, such as charts, line drawings, diagrams, and
other simple graphics generated using drawing tools, could, in principle, be
made to hold a small amount of secret data by the selection of colors, object
types (line type, fonts), their positions or dimensions, etc. Realistic-looking com-
puter graphics generated from three-dimensional models (or measurements) us-
ing specialized methods, such as ray-tracing or radiosity, are typically not very
friendly for steganography as they are generated by deterministic algorithms us-
ing well-defined rules. In this book, we will mostly deal with images acquired
with cameras or scanners because they are far more ubiquitous than computer-
generated images and provide a friendlier environment for steganography. As
with any categorization, the boundary between the two image types (real versus
computer-generated) is blurry. For example, it is not immediately clear how one
should classify a digital-camera image processed in Photoshop to make it look
like Claude Monet’s style of painting or a collage of computer-generated and real
images.
This chapter is devoted to digital imaging sensors, which form the heart of
most common digital image-acquisition devices today, such as scanners, digital
cameras, and digital videocameras. We emphasize that from the point of view
of the steganographers (Alice and Bob), we are interested in any imperfections,
noise sources, or variations that enter the image-acquisition process because these
uncertainties could be utilized for steganography. The steganalyst, Eve, on the
other hand, would like to know what patterns, periodicities, or dependences exist
among pixels of a digital image because the steganographer may disrupt these
structures by embedding, allowing Eve to construct a steganography detector.
The reader should read this chapter with both views in mind.
Section 3.1 explains the physical process of registering light by an imaging
sensor. After registering the light, the signal created at each pixel needs to be transferred from the sensor for further processing. This topic is described in
Section 3.2. The processing of the acquired signal in the camera is the subject
of Sections 3.3 and 3.4 that explain how sensors register color through a color
filter array and how they process the signal for it to be viewable on a computer
monitor. Finally, in Section 3.5 we describe a topic of special interest to steganog-
raphers – imperfections and stochastic processes involved in image acquisition.

3.1 CCD and CMOS sensors

The imaging sensor is a silicon semiconductor device that forms an image by capturing photons, converting them into electrons, transferring them, and even-
tually converting to voltage, which is turned into digital output through quanti-
zation in an A/D converter. There exist two competing imaging sensor technolo-
gies: the Charge-Coupled Device (CCD) and the Complementary Metal–Oxide–
Semiconductor (CMOS). Both were invented in the late 1960s and early 1970s
from research originally focused on solid-state memory devices. The CCD, which
was proposed and tested at Bell Labs by Boyle and Smith [2, 28], was originally
preferred due to its superior image quality when compared with CMOS images.
Today, these two technologies coexist because the CMOS technology offers lower
cost, faster readout, and greater flexibility that allows on-chip processing. The
two technologies use the same process for generating electrons from photons but
differ in how the electrons are transferred.
The reason why imaging sensors are made of silicon is its responsiveness to
light in the visible spectrum (380 nm to 750 nm). The physics behind the sensors
is the photoelectric effect. When photons collide with silicon, electron–hole pairs
are created. In theory, one photon of visible light would release exactly one
electron. In practice, sensors are not 100% efficient and thus one photon will
release less than one electron. And even when an electron is released, it may not
be captured and processed by the sensor. Thus, image sensors always have less
than 100% quantum efficiency. One important factor that influences quantum
efficiency is the quality and purity of the silicon wafer used for manufacturing
the sensor. Homogeneous crystal lattices aligned in the same direction will allow
the silicon to conduct electrons more efficiently.
Image sensors typically consist of a rectangular array of pixels also called
photosites. Each pixel has a photosensitive area (photodetector) that receives
photons and converts them to electrons. The electrons accumulate in a potential
or electric-charge well (pixel well). The charge depends on the light intensity
and the exposure time (also called integration time). The photodetectors are
usually equipped with a miniature lens (microlens) whose purpose is to increase
the photodetector’s sensitivity to light coming from different incident angles.
The number of pixels on the sensor determines the resolution of the image
the sensor produces. Smaller pixel size enables higher resolution but may lead
to higher noise levels in the resulting image. This is mainly due to decreased

Figure 3.1 Kodak KAI 1100 CM-E CCD sensor with 11 megapixels.

well capacity and lower quantum efficiency and due to the fact that the various
electronic components are packed closer to each other and thus influence each
other more. Inhomogeneities of the silicon also become more influential with
decreased pixel size. Sensors with larger pixels (10 microns or larger) produce
images with a higher signal-to-noise ratio (SNR) but are more expensive.

3.2 Charge transfer and readout

To capture an image, both CCD and CMOS sensors perform a sequence of in-
dividual tasks. They first absorb photons, generate a charge using the photo-
electric phenomenon, then collect the charge, transfer it, and finally convert the
charge to a voltage. The CCD and CMOS sensors differ in how they transfer the
charge and convert it to voltage.
CCDs transfer charges between pixel wells by shifting them from one row of
pixels of the array to the next row, from top to bottom (see Figure 3.2). The
transfer happens in a parallel fashion through a vertical shift-register architec-
ture. The charge transfer is “coupled” (hence the term charge-coupled device)
in the sense that as one row of charge is moved down, the next row of charge
(which is coupled to it) shifts into the vacated pixels. The last row is a horizontal
shift register that serially transfers the charge out of the sensor for further pro-
cessing. Each charge is converted to voltage, amplified, and then sent to an A/D
converter, which converts it into a bit string. Even though the charge-transfer
process itself is not completely lossless, the Charge-Transfer Efficiency (CTE)
in today’s CCD sensors is very high (0.99999 or higher) and thus influences the
resulting image quality in a negligible manner at least as far as steganographic
applications are concerned. The charge conversion and amplification, however,
do introduce noise into the signal in the form of readout and amplifier noise.
Both noise signals can be well modeled as a sequence of iid Gaussian variables.

Figure 3.2 Charge transfer in a CCD: rows are shifted down by charge-coupled transfer, the last row is a horizontal shift register, and its output is fed to an amplifier and an A/D converter.

The CCD array can be either a one-dimensional strip of photodetectors or a two-dimensional array. To obtain an image using a linear CCD, it needs to be
mechanically moved across the image (as in a flat-bed scanner). Spy satellites
use the same mode of acquisition known as Time Delay and Integration (TDI)
imaging. The sensor movement introduces additional sources of imperfections
and variations that could potentially be used for steganography. Array CCDs
use pixels that are usually rectangular. One exception is Fuji’s Super CCD that
uses a “honeycomb” array of octagonal pixels that maximizes the use of silicon
in the sensor.
The charge transfer and readout in a CMOS sensor is very different from a
CCD. In a CMOS sensor, it is possible to read out the charge directly at each
individual pixel. This offers direct rather than sequential access to image data,
which gives CMOS sensors greater flexibility. The individual pixel readout (sim-
ilar in principle to random-access memory) was made possible using technology
called an Active Pixel Sensor (APS). An APS has a readout amplifier transis-
tor at each pixel. Besides signal amplification at each pixel, the sensor can also
perform other functions, such as adjustment of gain under low-light conditions,
white balance, noise reduction (as already mentioned above), or even A/D con-
version. The random access offers the possibility to read out only a targeted
area of the sensor (windowing readout) or subsample the image at acquisition.
CMOS sensors can easily accommodate processing right on the imaging chip
(which CCDs cannot due to their process limitations). The down side of having
all this extra on-chip circuitry is an increased level of noise produced by elec-
tronic components, such as transistor and diode leakage, cross-talk, and charge
injection.

3.3 Color filter array

Because a photodetector registers all incident photons in the visible spectrum, it registers all colors and thus the sensor produces a grayscale image. To produce a

G B

R G

Figure 3.3 The Bayer color filter array.

color image, each photodetector has a filter layer bonded to the silicon that allows
only light of a certain color to pass through, absorbing all other wavelengths. The
filters are assigned to pixels in a two-dimensional pattern called the Color Filter
Array (CFA). Most sensors use the Bayer pattern developed by Kodak in the
1970s. It is obtained by tiling the sensor periodically using 2 × 2 squares as
depicted in Figure 3.3. The Bayer pattern has twice as many green pixels as red
or blue, reflecting the fact that human eyes are more sensitive to green than red
or blue.
To form a complete digital image, the other two missing colors at each pixel
must be obtained by interpolation (also called demosaicking) from neighboring
pixels. A very simple (but not particularly good) color-interpolation algorithm
computes the missing colors as in Table 3.1. There exist numerous very sophisti-
cated content-adaptive color-interpolation algorithms that perform much better
than this simple algorithm.
Color-interpolation algorithms may introduce artifacts into the resulting im-
age, such as aliasing (moiré patterns) or misaligned colors in the neighborhood
of edges. These artifacts have a very characteristic structure and thus cannot be
used for steganography.
A very important consequence of color interpolation for steganography is that
no matter how sophisticated the demosaicking algorithm is, it is a type of fil-
tering, and it will inevitably introduce dependences among neighboring pixels.
After all, the red color at a green pixel is some function of the neighboring col-
ors, etc. Small modifications of the colors due to data embedding may disrupt
these dependences and thus become statistically detectable.
We note that not all cameras use CFAs. Some high-end digital videocameras
use a prism that splits the light into three beams and sends them to three separate
sensors, each sensor registering one color at every pixel. This approach is not
usually taken in compact cameras because it makes the camera more bulky and
expensive. There also exist special sensors that can register all three colors at
every pixel, capitalizing on the fact that red, green, and blue light penetrates to
different depths of the silicon layer. By reading out the charge from each layer
separately, rather than from the whole photodetector as in conventional sensors,
all three colors are obtained at every pixel. This design is incorporated in the
Foveon X3 sensor, for example, in the Sigma SD9 camera.
To capture color, scanners typically use trilinear CCDs consisting of three
adjacent linear CCDs, each equipped with a different color filter.

Table 3.1. A simple example of a color-interpolation algorithm for the Bayer pattern.

Color-interpolation algorithm

At red pixel:   G = (GN + GE + GS + GW)/4,  B = (BNW + BNE + BSE + BSW)/4
At green pixel: R = (RE + RW)/2,  B = (BN + BS)/2
At blue pixel:  R = (RNW + RNE + RSE + RSW)/4,  G = (GN + GE + GS + GW)/4
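The averaging rules of Table 3.1 can be sketched in code. The Python fragment below (illustrative) averages, for each missing color, the recorded neighbors of that color; filtering neighbors by their CFA color automatically handles both kinds of green site as well as image borders.

```python
def demosaic_bilinear(raw):
    """Bilinear demosaicking for the Bayer tile  G B / R G  of Figure 3.3:
    every missing color at a pixel is the average of the nearest recorded
    samples of that color, in the spirit of Table 3.1."""
    h, w = len(raw), len(raw[0])
    cfa = lambda i, j: "GBRG"[2 * (i % 2) + (j % 2)]   # color recorded at (i, j)
    def avg(i, j, color, offsets):
        vals = [raw[i + di][j + dj] for di, dj in offsets
                if 0 <= i + di < h and 0 <= j + dj < w
                and cfa(i + di, j + dj) == color]
        return sum(vals) / len(vals)
    cross = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # N, S, W, E neighbors
    diag = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # NW, NE, SW, SE neighbors
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            c = cfa(i, j)
            if c == "R":
                out[i][j] = (raw[i][j], avg(i, j, "G", cross), avg(i, j, "B", diag))
            elif c == "B":
                out[i][j] = (avg(i, j, "R", diag), avg(i, j, "G", cross), raw[i][j])
            else:  # green: red and blue neighbors lie on the row/column axes
                out[i][j] = (avg(i, j, "R", cross), raw[i][j], avg(i, j, "B", cross))
    return out
```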

3.4 In-camera processing

The signal registered at each photodetector due to incoming photons goes through a complicated chain of processing before the actual digital image in
some viewable image format is written to the camera memory device. We al-
ready know that the photons create a charge, which is subsequently converted to
voltage, which is amplified and converted to a digital form in an A/D converter.
Some cameras are able to export this raw signal onto the memory card to give
the user more control over the final processing stage, which is done off-line on
a computer, usually using manufacturer-supplied software. For example, Canon
and Nikon use the CRW and NEF raw formats, respectively. The raw output is
always a grayscale image, usually sampled at higher bit rates, such as 10–14 bits
per pixel, or higher for professional cameras, such as those used in astronomy.
One can view this raw sensor output as a digital equivalent of a negative. In fact,
Adobe has been trying to standardize the format for the raw sensor output by
including a Photoshop plug-in that can handle Adobe’s DNG (Digital NeGative)
format.
Most consumer-end digital cameras do not output the raw sensor signal and
instead perform a host of various processing operations to obtain a visually
pleasing image that is stored in some common format, such as TIFF or JPEG.
Even though the type of processing may greatly vary among cameras, most
cameras apply white balance, demosaicking, color correction, gamma correction,
denoising, and filtering (e.g., sharpening). The white balance is a multiplicative
adjustment designed to correct for differences in the spectrum of the ambient
light,
R ← gR R, (3.1)
G ← gG G, (3.2)
B ← gB B, (3.3)
where gR , gG , gB are the gains for each color channel.
After demosaicking the signal using color-interpolation algorithms designed
for the CFA, the signal is further processed using so-called color correction,
whose purpose is to adjust the amounts of red, green, and blue so that the

image is correctly displayed on a computer monitor. Color correction is a linear transformation

   ( R )     ( c11 c12 c13 ) ( R )
   ( G )  ←  ( c21 c22 c23 ) ( G ) .   (3.4)
   ( B )     ( c31 c32 c33 ) ( B )
Because the CCD response is linear in the sense that the charge is proportional
to the light intensity1 (number of photons registered), it is incompatible with the
human visual system, which has a logarithmic response to light. Thus, the signal
must be corrected in a non-linear fashion, usually using gamma correction,
R ← Rγ , (3.5)
G ← Gγ , (3.6)
B ← Bγ , (3.7)
where typically γ = 2.2.
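A toy per-pixel version of the chain (3.1)-(3.7) is sketched below in Python. The gains and the identity color-correction matrix are placeholders rather than values from any real camera, and the code applies the exponent 1/γ with γ = 2.2, i.e., the display-gamma convention in which the monitor's approximate response R^2.2 undoes the correction.

```python
def process_pixel(rgb, gains=(1.8, 1.0, 1.4),
                  ccm=((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)),
                  gamma=2.2):
    """White balance (3.1)-(3.3), color correction (3.4), and gamma correction
    on one linear RGB pixel with components in [0, 1]. The default gains and
    identity matrix are made-up placeholders, not values from a real camera."""
    wb = [c * g for c, g in zip(rgb, gains)]                   # white balance
    cc = [sum(m * c for m, c in zip(row, wb)) for row in ccm]  # color correction
    clip = lambda x: min(1.0, max(0.0, x))
    return tuple(clip(c) ** (1 / gamma) for c in cc)           # gamma correction
```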
These basic processing steps are usually supplied with other actions, such as
denoising, defects removal, sharpening, etc. Finally, the signal is quantized to
either a true-color 24-bit image in TIFF or JPEG format. For applications in
steganography, one should realize that denoising, filtering, and JPEG compres-
sion introduce additional local dependences whose presence is quite fundamental
for steganography and steganalysis. If the data-embedding algorithm does not
preserve the nature of these dependences, the embedding changes will become
statistically detectable.

3.5 Noise

There are numerous noise sources that influence the resulting image produced
by the sensor. Some are truly random processes, such as the shot noise caused by
quantum properties of light, while other sources are systematic in the sense that
they would be the same if we were to take the same image twice. It should be clear
that while systematic imperfections cannot be used for steganography, random
components can and are thus very fundamental for our considerations. If we
knew the random noise component exactly, in principle we could subtract it from
the image and replace it with an artificially created noise signal with the same
statistical properties that would carry a secret message (Chapter 7). We now
review each noise source individually, pointing out their role in steganography.
Dark current is the image one would obtain when taking a picture in com-
plete darkness. The main factors contributing to dark current are impurities in
the silicon wafer or imperfections in the silicon crystal lattice. Heat also leads to

1 Some sensors use Anti-Blooming Gates (ABGs) when the charge exceeds 50% of well capacity
to prevent blooming. This causes the photodetector response to become non-linear at high
intensity.

Figure 3.4 A magnified portion of a dark frame.

dark noise (with the noise energy doubling with an increase in temperature of
6–8 °C). The thermal noise can be suppressed by cooling the sensor, which is
typically done in astronomy. The number of thermal electrons is also proportional
to the exposure time. Some consumer cameras take a dark frame with a closed
shutter when the camera is powered up and subtract it from every image the
camera takes. One interesting example worth mentioning here is the KAI 2020
chip developed by Kodak. This chip calculates the dark current (sensor output
when not exposed to light) and subtracts it from the illuminated image. This
method is frequently used in CMOS sensors to suppress noise as well as other
artifacts.
Figure 3.4 shows an example of dark current on the raw sensor output obtained
using a 60-second exposure with the SBIG STL-1301E camera equipped with a
1280 × 1024 CCD sensor cooled to −15 °C.
Photo-Response Non-Uniformity (PRNU). Due to imperfections in the
manufacturing process and the silicon wafer, the dimensions as well as the quan-
tum efficiency of each pixel may slightly vary. Additional imperfections may be
introduced by anomalies in the CFA and microlenses. Therefore, even when tak-
ing a picture of an absolutely uniformly illuminated scene (e.g., the blue sky), the
sensor output will be slightly non-uniform even if we eliminated all other sources
of noise. This non-uniformity may have components of low spatial frequencies,
such as a gradient or darkening at image corners, circular or irregular blobs due
to dirty optics or dust on the protective glass of the sensor (Figure 3.5), and a
stochastic component resembling white noise due to anomalies in the CFA and
varying quantum efficiency among pixels. The PRNU is a systematic artifact in
the sense that two images of exactly the same scene would exhibit approximately
the same PRNU artifacts. (This is why it is sometimes called the fixed pattern
noise.) Thus, the PRNU does not increase the amount of information we can
embed in an image because it is systematic and not random.

Figure 3.5 Dust particles on the sensor protective glass show up as fuzzy dark spots
(circled).

Figure 3.6 Magnified portion of the stochastic component of PRNU in the red channel for
a Canon G2 camera. For display purposes, the numerical values were scaled to the range
[0, 255] and rounded to integers to form a viewable grayscale image.

It is worth mentioning that the PRNU can be used as a sensor fingerprint for
matching an image to the camera that took it [43] in the same way as bullet
scratches can be used to match a bullet to the gun barrel that fired it. Figure 3.6
shows a magnified portion of the PRNU in the red channel from a Canon G2
camera. To isolate the PRNU and suppress random noise sources, the pattern
was obtained by averaging the noise residuals of 300 images acquired in the TIFF
format. The noise residual for each image was obtained by taking the difference
between the image and its denoised version after applying a denoising filter.
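To make the idea concrete, the following sketch simulates this procedure on synthetic data (hedged: the text's 300-image experiment uses real TIFF images and a denoising filter; here synthetic flat frames and a 3 × 3 box filter stand in). A fixed multiplicative pattern K plays the role of the PRNU, and averaging 300 noise residuals recovers it far more faithfully than a single residual does:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 64                                    # toy 64 x 64 sensor
K = rng.normal(0.0, 0.02, (m, m))         # fixed multiplicative PRNU pattern

def acquire(scene):
    """Simulated sensor output: scene modulated by PRNU plus random noise."""
    return scene * (1.0 + K) + rng.normal(0.0, 5.0, (m, m))

def residual(img):
    """Noise residual: image minus a crudely denoised version (3 x 3 box
    filter standing in for the denoising filter used in the text)."""
    pad = np.pad(img, 1, mode='edge')
    den = sum(pad[i:i + m, j:j + m] for i in range(3) for j in range(3)) / 9.0
    return img - den

scene = np.full((m, m), 100.0)            # flat scene, e.g., blue sky
est = np.mean([residual(acquire(scene)) for _ in range(300)], axis=0)

# Averaging suppresses the random noise; the estimate correlates strongly
# with the true PRNU pattern, unlike a single noise residual.
corr_avg = np.corrcoef(est.ravel(), K.ravel())[0, 1]
corr_one = np.corrcoef(residual(acquire(scene)).ravel(), K.ravel())[0, 1]
print(round(corr_avg, 2), round(corr_one, 2))
```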
Shot noise is the result of the quantum nature of light and makes the number
of electrons released at each photodetector essentially a random variable due to
the random variations of photon arrivals. Shot noise is a fundamental limitation
that cannot be circumvented. The presence of random components during image
acquisition has some fundamental consequences for steganography that we
elaborate upon in Section 6.2.1.
The number of photons captured by the photodetector during the exposure of
Δt seconds is a random variable ξ that follows the Poisson distribution (law of
rare events)

$$\Pr\{\xi = k\} = \frac{e^{-\lambda\Delta t}(\lambda\Delta t)^k}{k!} = p[k] \qquad (3.8)$$

with mean value and variance (see Exercise 3.2)

$$\mathrm{E}[\xi] = \sum_{k\ge 0} p[k]\,k = \lambda\Delta t, \qquad (3.9)$$

$$\mathrm{Var}[\xi] = \sum_{k\ge 0} p[k]\,k^2 - (\lambda\Delta t)^2 = \lambda\Delta t, \qquad (3.10)$$

where λ > 0 is the expected number of photons captured in a unit time interval.
With increased number of photons, λΔt, the relative (percentage) variations of
ξ decrease because the ratio

$$\frac{\sqrt{\mathrm{Var}[\xi]}}{\mathrm{E}[\xi]} = \frac{1}{\sqrt{\lambda\Delta t}} \to 0. \qquad (3.11)$$

This means that the relative shot noise decreases with increased pixel size and
with longer exposure times. Also, we note that for large λΔt the Poisson distribution
is well approximated with a Gaussian distribution N (λΔt, λΔt).
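Both observations are easy to verify numerically; in this hedged sketch, numpy's Poisson sampler stands in for the photon-arrival process:

```python
import numpy as np

rng = np.random.default_rng(0)
rels = {}
for lam_dt in (10, 100, 10_000):             # expected photon count, λΔt
    xi = rng.poisson(lam_dt, size=200_000)   # simulated counts at one pixel
    rels[lam_dt] = xi.std() / xi.mean()      # relative fluctuation (3.11)
    print(lam_dt, round(rels[lam_dt], 4), round(1 / lam_dt ** 0.5, 4))
```

The measured relative fluctuation tracks 1/√(λΔt), so brighter exposures and larger pixels yield relatively cleaner signals.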
Charge-transfer efficiency. The transfer of charge to the output amplifier
in a CCD sensor is not a completely lossless phenomenon. This results in an
additional source of variations in the final collected charge. The charge-transfer
efficiency in the latest CCD designs is very close to 1 (e.g., 0.99999), which
means that it is entirely reasonable to simply neglect this effect for applications
in steganography and assume that the charge-transfer efficiency is 1.
Amplifier noise. The charge collected at each pixel is amplified using an
on-chip amplifier. This can be done in the last row of photodetectors for a CCD
sensor or at each pixel in a CMOS sensor. The amplifier noise is well modeled
as a Gaussian random variable with zero mean. It dominates the shot noise at
low light conditions.
Quantization noise. The amplified signal on the sensor output is further
transformed through a complicated chain of processing, such as demosaicking,
gamma correction, low-pass filtering to prevent aliasing during subsequent re-
sampling, etc. Finally, the image data can be converted to one of the image
formats, such as TIFF or JPEG, which introduces quantization errors.
The most important fact we need to realize for steganography is that the in-
camera processing will introduce local dependences among neighboring pixels.
This is what makes steganography in digital images very challenging because the
exact nature of these dependences is quite complex and different for each camera.

Consequently, it is not clear how to perform embedding so that the embedding
modifications stay compatible with the processing (and are thus undetectable).
Defective pixels. Some pixels or their associated circuitry may be faulty
and always generate the same signal. Pixels that constantly output the highest
signal values are called hot, while pixels that always output a zero signal are
called dead pixels. Defective pixels can be quite visible in images under a close
inspection (see Figure 3.7). Some cameras may attempt to identify such pixels
and eliminate their effect from the final image (the color at a defective pixel is
replaced with an interpolated value). It is also possible to upgrade the firmware
with the information about defective pixels and thus eliminate the defects from
images at acquisition.

Figure 3.7 Hot pixel in an image (top) and its closeup (bottom). Because the hot pixel
had a red filter in front of it, the hot pixel appears red. Note the spread of the red color
due to demosaicking and other in-camera processing. For a color version of this figure, see
Plate 3.

Blooming, cross-talk. There are other types of imperfections whose nature
is not stochastic (noise-like) and are thus not interesting for steganography. We
include them here for completeness. When a photodetector is illuminated with
sufficient intensity, the charge generated at that site may overflow from the po-
tential well and spread into the neighboring sites (see Figure 3.8). This digital
equivalent of overexposure is called blooming. Cross-talk pertains to the situa-
tion when a photon striking a photodetector at an angle passes through the CFA
and eventually hits a neighboring photodetector. This undesirable interference
can be suppressed by optically shielding the photodetectors.

Figure 3.8 Blooming artifacts due to overexposure. Also, notice the green artifact in the
lower right corner caused by multiple reflections of light in camera optics. A color version
of this figure is displayed in Plate 4.

Summary

- The photoelectric effect in silicon is the main physical principle on which imaging sensors are based.
- There exist two competing sensor technologies – CCD and CMOS.
- Each imaging sensor consists of millions of individual photosensitive detectors (photodiodes, photodetectors, or pixels).
- The light creates a charge at each pixel that is transferred, converted to voltage, amplified, and quantized. CCD and CMOS differ mainly in how the charge is transferred. In a CCD, the charge is transferred out of the sensor in a sequential manner, while CMOS sensors are capable of transferring the charge in a parallel fashion.
- The quantized signal generated by the sensor is further processed through a complex chain of processing that involves white-balance (gain) adjustment, demosaicking, color correction, denoising, filtering, gamma correction, and finally conversion to some common image format (JPEG, TIFF).
- The processing introduces local dependences into the image.
- The image-acquisition process is influenced by many sources of imprecision and noise due to physical properties of light (shot noise), slight differences in pixel dimensions and silicon inhomogeneities (pixel-to-pixel non-uniformity), optics (vignetting, chromatic aberration), charge transfer and readout (readout noise, amplifier noise, reset noise), quantization noise, pixel defects (hot and dead pixels), and defects due to charge overflow (blooming).
- Some sources of imprecision and noise are random in nature (e.g., shot noise, readout noise), while others are systematic components that repeat from image to image.
- The presence of truly random noise components in images acquired using imaging sensors is quite fundamental and has direct implications for steganography (Section 6.2.1).

Exercises

3.1 [Law of rare events] Assume that photons arrive sequentially in time
in an independent fashion with a constant average rate of arrival. Let λ be
the probability that one photon arrives in a unit time interval. Show that the
probability that k events occur in a unit time interval is

$$\Pr\{k \text{ events}\} = \frac{e^{-\lambda}\lambda^k}{k!}. \qquad (3.12)$$

Hint: Divide the unit time interval into n subintervals of length 1/n. Assuming
that the probability that two events occur in one subinterval is negligible, the
probability that exactly k photons arrive can be obtained from the binomial
distribution Bi(n, λ/n),

$$\Pr\{k \text{ events}\} = \binom{n}{k}\left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}. \qquad (3.13)$$
The result is then obtained by taking the limit n → ∞ for a fixed k.
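The limit is also easy to check numerically (a verification aid, not part of the exercise): the binomial probability approaches the Poisson probability as n grows.

```python
from math import comb, exp, factorial

lam, k = 3.0, 5                       # arrival rate and count to compare at
poisson = exp(-lam) * lam ** k / factorial(k)

for n in (10, 100, 10_000):           # ever finer subdivision of the interval
    p = lam / n
    binom = comb(n, k) * p ** k * (1 - p) ** (n - k)
    print(n, round(binom, 6), round(poisson, 6))
```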

3.2 [Poisson random variable] Show that the mean and variance of a Poisson
random variable ξ,

$$\Pr\{\xi = k\} = \frac{e^{-\lambda\Delta t}(\lambda\Delta t)^k}{k!} = p[k], \qquad (3.14)$$

are

$$\mathrm{E}[\xi] = \sum_{k\ge 0} p[k]\,k = \lambda\Delta t, \qquad (3.15)$$

$$\mathrm{Var}[\xi] = \sum_{k\ge 0} p[k]\,k^2 - (\lambda\Delta t)^2 = \lambda\Delta t. \qquad (3.16)$$

3.3 [Analysis of sensor imperfections] Take your digital camera and ad-
just its settings to the highest resolution and highest JPEG quality. Then take
N ≥ 10 images of blue sky, I[i, j; k], i = 1, . . . , m, j = 1, . . . , n, k = 1, . . . , N ,
where indices i and j determine the pixel at the position (i, j) in the kth image.
You might want to zoom in but not to the range of a digital zoom if your camera
has this capability. Make sure that the images do not contain stray objects, such
as airplanes, birds, or stars/planets. Compute the sample variance of all pixels
across all N images

$$\hat{\sigma}[i,j] = \frac{1}{N-1}\sum_{k=1}^{N} \left(I[i,j;k] - \bar{I}[i,j]\right)^2, \qquad (3.17)$$

where
$$\bar{I}[i,j] = \frac{1}{N}\sum_{k=1}^{N} I[i,j;k] \qquad (3.18)$$

is the sample mean at pixel (i, j). Plot the histogram of the sample variance.
The variations at each pixel are due to combined random noise sources, such as
the shot noise or readout noise.
Extract the noise residual W from all images using the Wiener filter
$$W[\cdot,\cdot;k] = I[\cdot,\cdot;k] - W(I[\cdot,\cdot;k]). \qquad (3.19)$$
In Matlab, the Wiener filter W is accessed using the routine wiener2.m. Then,
average all N noise residuals

$$K[i,j] = \frac{1}{N}\sum_{k=1}^{N} W[i,j;k]. \qquad (3.20)$$

The averaging suppresses random noise components in K, while the systematic
components become more pronounced (pattern noise, PRNU, defective pixels,
dark current). First, find the location of outlier values in K. They will likely
correspond to defective pixels. Use the routine mat2gray.m to convert K to the
range [0, 1] and view the result using imshow.m (after casting to uint8). Zoom
in so that you can discern individual pixels and pan over the pattern. Do you see
any regular patterns? Is the pattern K random or does it exhibit local stochastic
structures? Do you see any artifacts around the border of K? Focus on the
neighborhood of outlier pixels and describe their appearance.
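For readers without a camera at hand, the variance computation (3.17)–(3.18) can be tried on synthetic data first; here N = 12 flat "sky" frames carry Gaussian noise of known variance 4, a hedged stand-in for the real mixture of shot and readout noise, which the per-pixel sample variance should recover on average:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, N = 32, 32, 12                  # toy image size and number of sky shots
# Simulated stack: constant sky intensity plus per-shot noise of variance 4.
I = 120.0 + rng.normal(0.0, 2.0, (N, m, n))

I_bar = I.mean(axis=0)                              # sample mean (3.18)
var = ((I - I_bar) ** 2).sum(axis=0) / (N - 1)      # sample variance (3.17)

print(round(var.mean(), 2))           # close to the true noise variance, 4
```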
Steganography in Digital Media: Principles, Algorithms, and Applications, Jessica Fridrich.
Cambridge University Press. Online ISBN: 9781139192903; Hardback ISBN: 9780521190190.
Chapter 4 - Steganographic channel, pp. 47-58.
4 Steganographic channel

The main goal of steganography is to communicate secret messages without
making it apparent that a secret is being communicated. This can be achieved
by hiding messages in ordinary-looking objects, which are then sent in an overt
manner through some communication channel. In this chapter, we look at the
individual elements that define steganographic communication.
Before Alice and Bob can start communicating secretly, they must agree on
some basic communication protocol they will follow in the future. In particular,
they need to select the type of cover objects they will use for sending secrets. Sec-
ond, they need to design the message-hiding and message-extraction algorithms.
For increased security, the prisoners should make both algorithms dependent on
a secret key so that no one else besides them will be able to read their messages.
Besides the type of covers and the inner workings of the steganographic algo-
rithm, Eve’s ability to detect that the prisoners are communicating secretly will
also depend on the size of the messages that Alice and Bob will communicate. Fi-
nally, the prisoners will send their messages through a channel that is under the
control of the warden, who may or may not interfere with the communication.
We recognize the following five basic elements of every steganographic channel1
(see Figure 4.1):
- Source of covers,
- Data-embedding and -extraction algorithms,
- Source of stego keys driving the embedding/extraction algorithms,
- Source of messages,
- Channel used to exchange data between Alice and Bob.

Because in this book the covers are digital images, the cover-source attributes
include the image format, origin, resolution, typical content type, etc. The cover-
source properties are determined by objects that Alice and Bob would be ex-
changing if they were not secretly communicating. Modern steganography typi-
cally makes some fundamental assumption about the cover source that enables
formal analysis. For example, considering the cover source as a random variable
allows analysis of steganography using information theory (this topic is elabo-

1 In this book, we sometimes also use the shorter term stegosystem.



Figure 4.1 Elements of the steganographic channel: the message source, cover source, and stego key source feed the embedding algorithm, whose output travels through the channel to the extraction algorithm, which uses the shared stego key to recover the message.

rated upon in Section 6.1). Alternatively, the cover source can be interpreted as
an oracle that can be queried. This leads to the complexity-theoretic study of
steganography explained in Section 6.4.
The embedding algorithm is a procedure through which the sender determines
an image that communicates the required secret message. The procedure may
depend on a secret shared between Alice and Bob called the stego key. This
key is needed to correctly extract a secret message from the stego image. For
example, Alice can embed her secret bit stream as the least significant bits of
pixels chosen along a non-intersecting pseudo-randomly generated path through
the image (determined by the stego key).
The protocol that Alice and Bob use to select the stego keys is usually modeled
with a random variable on the space of all keys. For example, a reasonable
strategy is to select the stego key randomly (with uniform distribution) from the
set of all possible stego keys.
The message source has a major influence on the security of the steganographic
channel. Imagine two extreme situations. On the one hand, Alice and Bob al-
ways communicate only a short message, say 16 bits, in every stego image. On
the other hand, Alice and Bob have a need to communicate large messages and
frequently embed as many bits into the image as the embedding algorithm al-
lows. Intuitively, the prisoners are at much greater risk in the latter case. The
distribution of messages can be modeled using a random variable on the space
of all possible messages.
The actual communication channel used to send the images is assumed to be
monitored by a warden (Eve). Eve can assume three different roles. Position-
ing herself into the role of a passive observer, she merely inspects the traffic
and does not interfere with the communication itself. This is called the passive-
warden scenario. Alternatively, Eve may suspect that Alice and Bob might use
steganography and she can preventively attempt to disrupt the steganographic
channel by intentionally distorting the images exchanged by Alice and Bob. For
example, she may compress the image using JPEG, resize or crop the image,
apply a slight amount of filtering, etc. Unless Alice and Bob use steganography
that is resistant (robust) to such processing, the steganographic channel would
be broken by Eve’s actions. This is called the active-warden scenario. Finally, Eve
can be even more devious and she may try to guess the steganographic method
that Alice and Bob use and attempt to impersonate Alice or Bob or otherwise
intervene to confuse the communicating parties. There is a difference between
this so-called malicious warden and the active warden. The actions of an active
warden are aimed at making steganography impossible for Alice and Bob, while a
malicious warden does not necessarily intend to entirely disrupt the stego channel
but rather use it to her advantage to determine whether or not steganography
is taking place. In this book, we focus mainly on the passive-warden scenario
in which the communication channel is assumed error-free.2 This is the case
of communication via standard Internet protocols, where error-correction and
authentication tools guarantee error-free data transmission.
The discussion above underlines the need to view steganographic communica-
tion in a wider context. When designing a steganographic scheme, the prisoners
have a choice in selecting the basic elements to form their communication scheme.
As will be seen in the next chapter, it is not very difficult to write a computer
program that hides a large amount of data in an image. On the other hand,
writing a program that does so without introducing any detectable artifacts is
quite hard.

The problem of steganography can thus be formulated as finding embedding
and extraction algorithms for a given cover source that enable communication
of reasonably large messages without introducing any embedding
artifacts that could be detected by the warden. In other words, the goal is
to embed secret messages undetectably.

The concept of undetectability will be made rather precise in Chapter 6 that
deals with a formal definition of steganographic security. Until then, we will
assume some intuitive meaning of these concepts.
The embedding and extraction algorithms are obviously the most important
parts of any stegosystem. Steganographic algorithms can utilize three different
fundamental architectures that determine the internal mechanism of the embed-
ding and extraction algorithms. Alice can “perform embedding” by choosing a
cover image that already has the desired message hidden inside. This is called
steganography by cover selection and is the subject of Section 4.1. Alternatively,
Alice can construct an object that conveys her message. This strategy is called
steganography by cover synthesis and is elaborated upon in Section 4.2. The
third option, which is the most practical for communicating large amounts of
data, is steganography by cover modification, explained in Section 4.3. This is
the mainstream approach to steganography that is also the central focus of this
book.

2 The active- and malicious-warden scenarios are discussed in [14, 53, 66, 72, 116, 178] and
the references therein.

4.1 Steganography by cover selection

In steganography by cover selection, Alice has available a fixed database of images
from which she selects one that communicates the desired message. For example,
one bit of information could be sent by choosing a picture in a landscape or
portrait orientation. Alternatively, the presence of an animal in the picture could
have hidden meaning, such as “attack tomorrow,” etc. The embedding algorithm
can work simply by randomly drawing images from the database till an image is
found that communicates the desired message. The stego key here is essentially
the set of rules that tell Alice and Bob how to interpret the images.
An important case of steganography by cover selection involves message-digest
(hash) functions. Alice selects an image from the database and applies to it a
message-digest function (this function may depend on a key and must be shared
with Bob). If the digest matches the desired message bit stream, the image is for-
warded to Bob, otherwise Alice selects a different image till she obtains a match.
The expected number of tries needed to obtain a match depends exponentially
on the length of the digest and thus quickly becomes impractically large. Upon
receiving an image, Bob simply extracts the digest to read the message. The
advantage of this approach is that the cover is always “100% natural” because it
is a real image that was not modified in any way. An obvious disadvantage is an
impractically low payload.
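A hedged sketch of the procedure: SHA-256 plays the role of the shared digest function and random byte strings stand in for images drawn from Alice's database. Because each draw matches an ℓ-bit message with probability 2^(−ℓ), the expected number of draws is 2^ℓ, which is exactly what makes long digests impractical.

```python
import hashlib
import secrets

def digest(image_bytes, bits=8):
    """Read the last `bits` bits of SHA-256 of the file as the hidden message."""
    return int.from_bytes(hashlib.sha256(image_bytes).digest(), 'big') % (1 << bits)

message = 0b10110001                      # 8 secret bits to communicate
tries = 0
while True:                               # Alice keeps drawing covers...
    tries += 1
    candidate = secrets.token_bytes(64)   # stand-in for the next database image
    if digest(candidate) == message:      # ...until the digest matches
        break

print(tries)   # about 2**8 = 256 draws are needed on average
```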
Under close inspection, one can realize one potential problem with steganogra-
phy by cover selection that prevents us from proclaiming it “truly undetectable.”
To make the issue more apparent, we use a very simple digest function formed
by the least significant bits of the first three pixels in the image,

h(x) = {x[1] mod 2, x[2] mod 2, x[3] mod 2}. (4.1)

Note that the digest consists of three bits. The problem may arise when Alice
decides to use this technique to communicate not just once but repetitively. If
Alice is equally likely to send each triple of bits out of all eight possible triples
of bits (which is a reasonable assumption if she is sending parts of an encrypted
document, for example), the stego images sent by her will produce each of the
eight possible bit triples as the digest with equal probability. How do we know, however, that in natural
images the distribution of LSBs of the first three pixels in the upper left corner
follows this distribution? Most likely, it does not because these pixels are likely
to belong to a piece of sky and thus their values are far from being independent.
Note that this problem arises only because we consider multiple uses of the
steganographic channel and allow Eve to inspect all transmissions from Alice
rather than considering one image at a time. This observation is, in fact, quite
fundamental and will lead us to a formal definition of steganographic security
using information theory in Section 6.1.
The reader might suspect that had Alice used a better digest, such as the
first three bits of a cryptographic hash MD5 or SHA applied to the whole image
rather than just three pixels, intuitively, the above issue would be largely elim-
inated. We revisit this simple thought experiment in Section 6.4.1 dealing with
the complexity-theoretic definition of steganography.

4.2 Steganography by cover synthesis

In steganography by cover synthesis, Alice creates the cover so that it conveys
the desired message. There were some speculations in the press that Bin Laden’s
videos may contain hidden messages communicated, for example, by his clothes,
the position of his rifle, or the choice of words in his speech [203]. This would be
an example of steganography by cover synthesis.
Another quite interesting example that involves text as the cover rather than
digital images is encoding messages so that they resemble spam. Here, the
steganographer uses the fact that spam often contains incoherent or unusual
wording, which can be used to encode a message. The program SpamMimic
(www.spammimic.com) uses mimic functions [238, 239] to hide messages in arti-
ficially created spam-looking stego text.
Steganography by cover synthesis could be combined with steganography by
cover selection to alleviate the exponential complexity of embedding by hash-
ing [188]. Let us assume that we can take a large number of images of exactly
the same scene with the same digital camera. For example, we could fix the cam-
era on a tripod and take multiple images under constant light conditions. Let us
assume that the images are 8-bit grayscale with xj [i] standing for the intensity
of the ith pixel in the jth image, i = 1, . . . , n, j = 1, . . . , K. We now make the
key observation that the light intensity values at a fixed pixel, i, when viewed
across all images, j, will slightly vary due to random noise sources present in
images, such as shot noise, readout noise, and noise due to on-chip electronics
(see Figure 4.2 and the discussions in Section 3.5).
Alice will use a cryptographic hash function modified to return 4 bits when
applied to 16 pixels. For example, she could use the last 4 bits from the MD5
hash [207]. Here, the values 4 and 16 are chosen rather arbitrarily and other
values can certainly be used. In order to embed her message, Alice divides ev-
ery image into disjoint blocks of 4 × 4 pixels and assembles a new image in
a block-by-block fashion so that each 4 × 4 block conveys 4 message bits. To
embed the first 4 bits in the first 4 × 4 block of pixels, Alice searches through
the hashes h(xj [1], . . . , xj [16]), j ∈ {1, 2, . . . , K} till she finds a match between
the hash of the first 16 pixels and the message, which will happen for im-
age number j1 . Then, she moves to the next block and finds j2 ∈ {1, . . . , K}
so that h(xj [17], . . . , xj [32]) matches the next 4 message bits, etc. The final
stego image y will be a mosaic assembled from blocks from different images
y = (x_{j1}[1], . . . , x_{j1}[16], x_{j2}[17], . . . , x_{j2}[32], x_{j3}[33], . . .). The probability of
finding a match in one particular block among all K images is 1 − (1 − 1/16)^K.
The probability of being able to embed the whole message, which consists of

Figure 4.2 The same 16 × 16 pixel block from four (uncompressed) TIFF images of blue
sky taken with the same camera within a short time interval. First note that the blocks
are not uniform even though the scene is perfectly uniform. Second, the blocks appear
different in every picture due to random noise sources.

n/16 groups of 4 bits, is thus (1 − (1 − 1/16)^K)^{n/16}. For increasing number of
images, K, this probability can be made arbitrarily close to 1. For example,
for a square n = 512 × 512 image and only K = 400 images, this probability is
0.9999998993 . . ..
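The arithmetic behind these numbers can be reproduced directly:

```python
n = 512 * 512                # pixels, i.e., n // 16 = 16384 blocks of 4 x 4
K = 400                      # images of the identical scene
p_block = 1 - (1 - 1 / 16) ** K       # some image matches a given 4-bit group
p_total = p_block ** (n // 16)        # every one of the n/16 blocks matches
print(p_total)
```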
This method is more a theoretical construct rather than a practical stegano-
graphic technique. This is because it is in fact difficult to obtain multiple sam-
plings of one signal (identical photographs of one scene accurate to within a
fraction of a pixel width). Small variations in the exposure time due to the me-
chanics of the shutter and other mechanical vibrations will necessarily limit the
alignment between different pictures.
The scheme makes an implicit assumption that pixels at the block bound-
aries are independent, or, more accurately, that the random components of the
images are independent. As we already know from Chapter 3, various types of
in-camera processing, such as color interpolation or filtering, create dependences
among neighboring pixels and thus among their noise components. Consequently,
Eve may start constructing an attack by inspecting the dependences among pix-
els spanning the block boundaries and compare them with dependences among
neighboring pixels from the interior of the blocks.
A qualitatively different realization of steganography by cover synthesis is
called data masking. The idea is that all steganalysis tools are in the end auto-
mated and work by extracting a set of numerical features from the image that
is later analyzed for compatibility with features typically obtained from cover
images (see Chapter 10). Thus, all that Alice needs to achieve to evade detec-
tion is to make the stego image look like a cover in the feature space. The stego
“image” does not have to look like a natural image because, as long as its fea-
tures are compatible with features of natural images, it should pass through the
detector. A practical data-masking method was constructed in [200] by applying
a time-varying inverse Wiener filter shaping every 1024 message bits to match a
reference audio frame.

4.3 Steganography by cover modification

In this book, we deal mainly with steganography by cover modification because
it is by far the most studied steganography paradigm today. Postponing specific
examples of steganographic schemes to Chapters 5–7, we now introduce basic
definitions and concepts.
Alice starts with a cover image and makes modifications to it in order to embed
secret data. Alice and Bob work with the set of all possible covers and the sets
of keys and messages that may, in the most general case, depend on each cover:

C . . . set of cover objects x ∈ C, (4.2)


K(x) . . . set of all stego keys for x, (4.3)
M(x) . . . set of all messages that can be communicated in x. (4.4)

Following the diagram displayed in Figure 4.3, a steganographic scheme is a pair
of embedding and extraction functions Emb and Ext,

Emb : C × K × M → C, (4.5)
Ext : C × K → M, (4.6)

such that for all x ∈ C, and all k ∈ K(x), m ∈ M(x),

Ext (Emb(x, k, m), k) = m. (4.7)

In other words, Alice can take any cover x ∈ C and embed in it any message
m ∈ M(x) using any key k ∈ K(x), obtaining the stego image y = Emb(x, k, m).
The number of messages that can be communicated in a specific cover x depends
on the steganographic scheme and it may also depend on the cover itself. For
example, if C is the set of all 512 × 512 grayscale images and Alice embeds
one message bit per pixel, then M = {0, 1}^{512×512} and |M(x)| = 2^{512×512} for
all x ∈ C. On the other hand, if C is the set of all 512 × 512 grayscale JPEG
images with quality factor qf = 75 and Alice embeds one bit per each non-zero
quantized DCT coefficient, the number of messages that can be embedded in a
specific cover depends on the cover itself because the number of non-zero DCT
coefficients in a JPEG file depends on the image content.
Figure 4.3 Steganography by cover modification (passive-warden case): cover x, message m, and key k enter Emb(·), producing the stego image y, from which Ext(·) recovers m using the shared key k.

Thus, we define the embedding capacity (payload) of cover x in bits as

$$\log_2 |M(x)|, \qquad (4.8)$$

and the relative embedding capacity is


log2 |M(x)| / n, (4.9)
where n is the number of elements in x, such as the number of pixels or non-zero
DCT coefficients. For raster image formats, relative capacity is often expressed
in bpp (bits per pixel). For JPEG images, we use the unit bpnc (bits per non-zero
DCT coefficient). By taking expectations in (4.8) and (4.9) with respect to some
distribution of covers x, we can speak of the average embedding and relative
embedding capacity.
Perhaps the most fundamental concept in steganography is the steganographic
capacity defined as the maximal number of bits that can be embedded without
introducing statistically detectable artifacts. For now, we stay with this rather
vague definition, leaving the precise definition of this advanced concept to Chap-
ter 13. Typically, the steganographic capacity is much smaller than the embed-
ding capacity.
Embedding algorithms of many steganographic schemes require a representa-
tion of cover and stego images using bits or, more generally, symbols from some
alphabet A using a symbol-assignment function π,

π : X → A, (4.10)

where X is the range of individual cover elements, such as pixels or DCT coef-
ficients. One frequently used bit-assignment (parity) function is the least signif-
icant bit

LSB(x) = x mod 2. (4.11)

Throughout this book, the reader will learn about other symbol-assignment func-
tions.
If the embedding algorithm is designed to avoid making embedding changes
to certain areas of the cover image, we speak of adaptive steganography. For
example, we may wish to skip flat areas in the image and concentrate embedding
changes to textured regions. The subset of the image where embedding changes
are allowed is called the selection channel. Another example of a selection channel

is when the message bits are embedded along a pseudo-random path through
the image generated from the stego key. In general, it is in the interest of both
Alice and Bob not to reveal any information or as little as possible about the
selection channel as this knowledge can help an attacker. If the selection channel
is available to Alice but not to Bob, it is called a non-shared selection channel.
Steganography by cover modification introduces distortion into the cover. The
distortion is typically measured with a mapping d(x, y), d : C × C → [0, ∞). One
commonly used family of distortion measures is parametrized by γ ≥ 1,

dγ(x, y) = Σ_{i=1}^{n} |x[i] − y[i]|^γ. (4.12)

For γ = 1,

d1(x, y) = Σ_{i=1}^{n} |x[i] − y[i]| (4.13)

is the L1 norm between the vectors x and y, while

d2(x, y) = Σ_{i=1}^{n} |x[i] − y[i]|^2 (4.14)

is the energy of embedding changes.


The measure

ϑ(x, y) = Σ_{i=1}^{n} (1 − δ(x[i] − y[i])) (4.15)

is the number of embedding changes, where δ is the Kronecker delta (2.23). Note that if the amplitude of embedding changes is |x[i] − y[i]| = 1, dγ and ϑ coincide for all γ.
The distortion measures above are absolute in the sense that they measure the
total distortion. Often, it is useful to express distortion per cover element (e.g.,
per pixel), in which case we speak of relative distortion
d(x, y) / n. (4.16)

The quantity

β = ϑ(x, y) / n (4.17)

is called the change rate and will typically be denoted β. Two popular relative measures are the Mean-Square Error (MSE)

MSE = d2(x, y) / n = (1/n) Σ_{i=1}^{n} |x[i] − y[i]|^2, (4.18)

and the Peak Signal-to-Noise Ratio (PSNR)

PSNR = 10 log10 (xmax^2 / MSE), (4.19)

where xmax is the maximum value that x[i] can attain. For example, for 8-bit
grayscale images, xmax = 255.
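To make the measures (4.12)–(4.19) concrete, the following short Python sketch (ours, not from the book; the pixel values are invented) computes d1, d2, the number of embedding changes ϑ, the change rate, and PSNR for a toy cover/stego pair:

```python
import math

def d_gamma(x, y, gamma):
    # d_gamma(x, y) = sum_i |x[i] - y[i]|^gamma, Eq. (4.12)
    return sum(abs(a - b) ** gamma for a, b in zip(x, y))

def num_changes(x, y):
    # theta(x, y): number of embedding changes, Eq. (4.15)
    return sum(1 for a, b in zip(x, y) if a != b)

def psnr(x, y, x_max=255):
    # PSNR = 10 log10(x_max^2 / MSE), Eqs. (4.18)-(4.19)
    mse = d_gamma(x, y, 2) / len(x)
    return 10 * math.log10(x_max ** 2 / mse)

x = [128, 64, 200, 17, 99, 250]     # hypothetical cover pixels
y = [129, 64, 201, 16, 99, 250]     # stego: three changes of amplitude 1

print(d_gamma(x, y, 1))             # d1 = 3
print(d_gamma(x, y, 2))             # d2 = 3 (amplitude-1 changes: d_gamma = theta)
print(num_changes(x, y))            # theta = 3
print(num_changes(x, y) / len(x))   # change rate beta = 0.5
print(round(psnr(x, y), 2))
```

For amplitude-1 embedding changes, d1, d2, and ϑ coincide, as noted above.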
The average embedding distortion is the expected value E [d(x, y)] taken over
all x ∈ C, k ∈ K, m ∈ M selected according to some fixed probability distribu-
tions from their corresponding sets.
A very important characteristic of a stegosystem that has a major influence
on its security is the embedding efficiency. We define it here rather informally as
the average number of bits embedded per average unit distortion

e = Ex [log2 |M(x)|] / Ex,m [d(x, y)]. (4.20)

The complexity of computing the embedding efficiency depends on the steganographic scheme, on its inner mechanism, and on the cover source. For the simplest
embedding schemes, such as LSB embedding or ±1 embedding in the spatial
domain (see Sections 5.1 and 7.3.1), the embedding efficiency can be easily eval-
uated analytically. On the other hand, for some steganographic methods the
maximal payload as well as the distortion may be a rather complex function of
the cover content or even the stego key, which makes computing the embedding
efficiency analytically virtually impossible due to our imprecise knowledge of the
cover source. In such cases, the embedding efficiency is determined experimen-
tally. Finally, we note that the embedding efficiency depends on the distortion
measure d.
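For plain LSB embedding of a random message (with respect to the distortion d1), the embedding efficiency works out to 2 bits per unit distortion, since a random message bit already matches the cover LSB half of the time. A short seeded simulation (ours, not from the book) confirms this:

```python
import random

def lsb_embed(x, bits):
    # replace the LSB of each element with the next message bit
    return [xi - (xi % 2) + b for xi, b in zip(x, bits)]

rng = random.Random(42)
n = 100_000
x = [rng.randrange(256) for _ in range(n)]   # model cover: uniform 8-bit values
m = [rng.randrange(2) for _ in range(n)]     # random message, alpha = 1

y = lsb_embed(x, m)
changes = sum(1 for a, b in zip(x, y) if a != b)

# expected d1 distortion is n/2, so e approaches 2 bits per unit distortion
e = n / changes
print(round(e, 2))
```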
In the next chapter, we give examples of a few simple steganographic schemes
that embed messages by slightly modifying the pixels, pointers, or DCT coeffi-
cients in the image file. This will illustrate the concepts defined above as well as
provide motivation for the reader.

Summary
• A steganographic channel consists of the source of covers, the message source, embedding and extraction algorithms, the source of stego keys, and the communication channel.
• If the channel is error-free, we speak of a passive warden. An active warden intentionally distorts the communication with the hope of preventing usage of steganography. A malicious warden tries to trick the communicating parties, e.g., by impersonation.
• There exist three types of embedding algorithms: steganography by cover selection, cover synthesis, and cover modification.
• The embedding capacity for a given cover is the maximal number of bits that can be embedded in it.
• The relative embedding capacity is the ratio between the embedding capacity and the number of elements in the cover where a message can be embedded.
• The steganographic capacity is the maximal payload that can be embedded in the cover without introducing detectable modifications.
• The symbol-assignment function is used to convert individual cover elements to bits or alphabet symbols.
• The embedding distortion is a measure of the strength of embedding changes.
• PSNR and MSE are relative distortion measures (distortion per cover element).
• The embedding efficiency is the average number of bits embedded per unit distortion.

Exercises

4.1 [Distortion due to noise adding] Consider a steganographic scheme for which the impact of embedding is equivalent to adding iid realizations of a random variable η with pdf f(x) to individual cover elements and rounding the result to the nearest integer,

y[i] = round(x[i] + η[i]), (4.21)

where round(x) is x rounded to its nearest integer. Neglecting the boundary issue that y[i] may get out of the dynamic range of allowed values, show that the expected value of the embedding distortion is

E [d1(x, y)] = Σ_{k=−∞}^{∞} |k| p[k], (4.22)
E [d2(x, y)] = Σ_{k=−∞}^{∞} k^2 p[k], (4.23)

where p[k] = ∫_{k−1/2}^{k+1/2} f(x) dx.

4.2 [Exponentially decaying distortion] Show that when η in Exercise 4.1 is exponential with pdf f(x) = λe^{−λx} for x > 0 and f(x) = 0 for x ≤ 0, λ > 0,

E [d1(x, y)] = (2e^λ / (e^λ − 1)^2) sinh(λ/2), (4.24)
E [d2(x, y)] = (2e^λ (e^λ + 1) / (e^λ − 1)^3) sinh(λ/2), (4.25)

where sinh x = (e^x − e^{−x})/2 is the hyperbolic sine function.
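The closed forms (4.24) and (4.25) can be checked numerically against the definition of p[k] from Exercise 4.1. This verification sketch is ours, not part of the book; the value of λ is arbitrary:

```python
import math

lam = 1.3

def p(k):
    # p[k] = integral of the exponential pdf over [k - 1/2, k + 1/2], for k >= 1
    return math.exp(-lam * (k - 0.5)) - math.exp(-lam * (k + 0.5))

# for this eta, k < 0 has zero mass and the k = 0 term contributes nothing,
# so both sums start at k = 1 (the tail beyond k = 200 is negligible)
s1 = sum(k * p(k) for k in range(1, 200))        # sum over k of |k| p[k]
s2 = sum(k * k * p(k) for k in range(1, 200))    # sum over k of k^2 p[k]

sh = math.sinh(lam / 2)
el = math.exp(lam)
e1 = 2 * el / (el - 1) ** 2 * sh                 # closed form (4.24)
e2 = 2 * el * (el + 1) / (el - 1) ** 3 * sh      # closed form (4.25)

print(round(s1, 6), round(e1, 6))                # the two columns agree
print(round(s2, 6), round(e2, 6))
```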

4.3 [Rule for adding PSNR] Let x[i] be an n-dimensional vector of real numbers and let η1[i] ∼ N(0, σ1^2) and η2[i] ∼ N(0, σ2^2) be two iid Gaussian sequences with zero mean. Show that for n → ∞, the PSNR between x and y = x + η1 + η2 satisfies

10^{−PSNR/10} = 10^{−PSNR1/10} + 10^{−PSNR2/10}, (4.26)

where PSNR1 is between x and x + η1 and PSNR2 is between x + η1 and x + η1 + η2. Use the fact that the variance of the sum of two independent Gaussian variables is equal to the sum of their variances. Also, notice that with n → ∞, MSE = σ̂^2 → σ^2, where σ̂^2 is the sample variance.
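A quick numeric check of the addition rule (4.26) with seeded Gaussian noise (our sketch, not from the book; the tolerance reflects the finite n):

```python
import math, random

def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def psnr(u, v, x_max=255.0):
    return 10 * math.log10(x_max ** 2 / mse(u, v))

rng = random.Random(7)
n = 200_000
x = [rng.uniform(0, 255) for _ in range(n)]
e1 = [rng.gauss(0, 3.0) for _ in range(n)]    # eta_1 with sigma_1 = 3
e2 = [rng.gauss(0, 4.0) for _ in range(n)]    # eta_2 with sigma_2 = 4
y1 = [a + b for a, b in zip(x, e1)]           # x + eta_1
y = [a + b for a, b in zip(y1, e2)]           # x + eta_1 + eta_2

lhs = 10 ** (-psnr(x, y) / 10)
rhs = 10 ** (-psnr(x, y1) / 10) + 10 ** (-psnr(y1, y) / 10)
print(abs(lhs - rhs) / lhs)                   # relative error shrinks as n grows
```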
5 Naive steganography

The first steganographic techniques for digital media were constructed in the
mid 1990s using intuition and heuristics rather than from specific fundamental
principles. The designers focused on making the embedding imperceptible rather
than undetectable. This objective was undoubtedly caused by the lack of stegana-
lytic methods that used statistical properties of images. Consequently, virtually
all early naive data-hiding schemes were successfully attacked later. With the
advancement of steganalytic techniques, steganographic methods became more
sophisticated, which in turn initiated another wave of research in steganalysis,
etc. This characteristic spiral development can be expressed through the follow-
ing quotation:

Steganography is advanced through analysis.

In this chapter, we describe some very simple data-hiding methods to illustrate


the concepts and definitions introduced in Chapter 4 and especially Section 4.3.
At the same time, we point out problems with these simple schemes to em-
phasize the need for a more exact fundamental approach to steganography and
steganalysis.
In Section 5.1, we start with the simplest and most common steganographic
algorithm – Least-Significant-Bit (LSB) embedding. The fact that LSB embed-
ding is not a very secure method is demonstrated in Section 5.1.1, where we
present the histogram attack. Section 5.1.2 describes a different attack on LSB
embedding in JPEG images that can not only detect the presence of a secret
message but also estimate its size.
Some of the first steganographic methods were designed for palette images,
which is the topic of Section 5.2. We discuss six different ideas for hiding in-
formation in palette images and point out their weaknesses as well as other
problematic issues pertaining to their design. The palette format here serves
as a useful educational platform to stimulate the reader to think about certain
fundamental problems that are common to steganography in general and not
necessarily specific to any image format.

Algorithm 5.1 Embedding message m ∈ {0, 1}^m in cover image x ∈ X^n.
// Initialize a PRNG using stego key
// Input: message m ∈ {0, 1}^m, cover image x ∈ X^n
Path = Perm(n);
// Perm(n) is a pseudo-random permutation of {1, 2, . . . , n}
y = x;
m = min(m, n);
// If message longer than available capacity, truncate it
for i = 1 to m {
    y[Path[i]] = x[Path[i]] + m[i] − (x[Path[i]] mod 2);
}
// y is the stego image with m embedded message bits

5.1 LSB embedding

Arguably, LSB embedding is the simplest steganographic algorithm. It can be


applied to any collection of numerical data represented in digital form. Let us
assume that x[i] ∈ X = {0, . . . , 2^{nc} − 1} is a sequence of integers. For example,
x[i] could be the light intensity at the ith pixel in an 8-bit grayscale image
(nc = 8), an index to a palette in a GIF file (nc = 8), or a quantized DCT
coefficient in a JPEG file (nc = 11). Depending on the image format and the bit
depth chosen for representing the individual values, each x[i] can be represented
using nc bits b[i, 1], . . . , b[i, nc ],

x[i] = Σ_{k=1}^{nc} b[i, k] 2^{nc−k}. (5.1)

Thus, one can think of the sequence (b[i, 1], . . . , b[i, nc ]) as the binary represen-
tation of x[i] in big-endian form (the most significant bit b[i, 1] is first). The LSB
is the last bit b[i, nc ].
LSB embedding, as its name suggests, works by replacing the LSBs of x[i] with
the message bits m[i], obtaining in the process the stego image y[i]. Algorithm 5.1
shows a pseudo-code for embedding a bit stream in an image along a pseudo-
random path generated from a secret key shared between Alice and Bob.
Note that in a color image the number of elements in the cover, n, is three
times larger than for a grayscale image. Thus, the pseudo-random path is chosen
across all pixels and color channels. The message embedded using Algorithm 5.1
can be extracted with the pseudo-code in Algorithm 5.2.
The amplitude of changes in LSB embedding is 1, maxi |x[i] − y[i]| = 1, which
is the smallest possible change for any embedding operation. Under typical view-
ing conditions, the embedding changes in an 8-bit grayscale or true-color image
are not visually perceptible. Moreover, because natural images contain a small
amount of noise due to various noise sources present during image acquisition

Algorithm 5.2 Extracting message m from stego image y.
// Initialize a PRNG using stego key
// Input: stego image y ∈ X^n
Path = Perm(n);
// Perm(n) is a pseudo-random permutation of {1, 2, . . . , n}
for i = 1 to m {
    m[i] = y[Path[i]] mod 2;
}
// m is the extracted secret message
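One way Algorithms 5.1 and 5.2 could be realized in Python (our sketch; `random.Random(key).shuffle` stands in for the key-seeded permutation Perm(n), and the message length is assumed known to the recipient, as in Algorithm 5.2):

```python
import random

def embed(x, message, key):
    # Algorithm 5.1: LSB embedding along a key-dependent pseudo-random path
    path = list(range(len(x)))
    random.Random(key).shuffle(path)          # Perm(n) seeded by the stego key
    y = list(x)
    for i, bit in enumerate(message[:len(x)]):
        j = path[i]
        y[j] = x[j] - (x[j] % 2) + bit        # replace the LSB with the message bit
    return y

def extract(y, msg_len, key):
    # Algorithm 5.2: read LSBs along the same key-dependent path
    path = list(range(len(y)))
    random.Random(key).shuffle(path)
    return [y[path[i]] % 2 for i in range(msg_len)]

rng = random.Random(1)
cover = [rng.randrange(256) for _ in range(1000)]
msg = [rng.randrange(2) for _ in range(600)]
stego = embed(cover, msg, key="shared-secret")

assert extract(stego, len(msg), key="shared-secret") == msg
assert max(abs(a - b) for a, b in zip(cover, stego)) <= 1   # amplitude-1 changes
```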

Table 5.1. Relative counts of bits, neighboring bit pairs, and triples from an LSB plane
of the image shown in Figure 5.1.

Frequency of occurrence
Bits 0.49942, 0.50058
Pairs 0.24958, 0.24984, 0.24984, 0.25073
Triples 0.1246, 0.1250, 0.1247, 0.1251, 0.1250, 0.1249, 0.1251, 0.1256

(see Chapter 3), the LSB plane of raw, never-compressed natural images already
looks random. Figure 5.1 (and color Plate 5) shows the original cover image and
its LSB plane b[i, nc ] for the red channel.
Table 5.1 contains the frequencies of occurrence of single bits in b[i, nc ], i =
1, . . . , n, pairs of neighboring bits (b[i, nc ], b[i + 1, nc]), and triples of neighboring
bits (b[i, nc ], b[i + 1, nc ], b[i + 2, nc ]). The data are consistent with the claim that
the LSB plane is random. Even though this is not a proof of randomness,1 the
argument is convincing enough to make us intuitively believe that any attempts
to detect the act of randomly flipping a subset of bits from the LSB plane are
doomed to fail. This seemingly intuitive claim is far from truth because LSB
embedding in images can be very reliably detected (see Chapter 11 on targeted
steganalysis). For now, we provide only a small hint.
Even if the LSB plane of covers was truly random, it may still be possible
to detect embedding changes due to flipping LSBs if, for example, the second
LSB plane b[i, nc − 1] and the LSB plane were somehow dependent! In the most
extreme case of dependence, if b[i, nc − 1] = b[i, nc ] for each i, detecting LSB
changes would be trivial. All we would have to do is to compare the LSB plane
with the second LSB plane.
LSB embedding belongs to the class of steganographic algorithms that embed
each message bit at one cover element. In other words, each bit is located at a

1 Pearson’s chi-square test from Section D.4 could replace intuition by verifying that the
distributions are uniform at a certain confidence level.

Figure 5.1 A true-color 800 × 548 image and the LSB plane of its red channel. Plate 5
shows the color version of the image.

certain element. The embedding proceeds by visiting individual cover elements


and applying the embedding operation (flipping the LSB) if necessary, to match
the LSB with the message bit. Not all steganographic schemes must follow this
simple embedding paradigm. Methods that use syndrome coding utilize multiple
cover elements to embed each bit in the sense that the extraction algorithm needs
to see more than one element to extract one bit (see Chapter 8).
The embedding operation of flipping the LSB can be written mathematically
in many different ways:


LSBflip(x) = x + 1 when x even, x − 1 when x odd (5.2)
           = x + 1 − 2(x mod 2) (5.3)
           = x + (−1)^x. (5.4)

LSBflip is an idempotent operation satisfying LSBflip(LSBflip(x)) = x. This


means that repetitive LSB embedding will partially cancel itself out and thus
there is a limit on the maximal expected distortion that can be introduced
by repetitive LSB embedding. Many other embedding operations do not have
this property. Partial cancellation of embedding changes can be used to attack
schemes that use LSB embedding [87]. Exercise 5.6 quantifies the effect of partial
cancellation.
LSB embedding also induces 2^{nc−1} disjoint LSB pairs on the set of all possible element values {0, 1, . . . , 2^{nc} − 1},

{0, 1}, {2, 3}, . . . , {2^{nc} − 2, 2^{nc} − 1}. (5.5)
Note that if x[i] is in LSB pair {2k, 2k + 1}, it must stay there after embedding
because the pair elements differ only in their LSBs (2k ↔ 2k + 1). This simple
observation is the starting point of many powerful attacks on LSB embedding
(Chapter 11).
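The equivalence of (5.2)–(5.4), the self-cancellation property, and the LSB-pair invariance can all be verified exhaustively for 8-bit values (a quick check of ours, not from the book):

```python
def lsb_flip(x):
    return x + 1 - 2 * (x % 2)                              # Eq. (5.3)

for v in range(256):
    assert lsb_flip(v) == (v + 1 if v % 2 == 0 else v - 1)  # Eq. (5.2)
    assert lsb_flip(v) == v + (-1) ** v                     # Eq. (5.4)
    assert lsb_flip(lsb_flip(v)) == v                       # flipping twice cancels
    assert lsb_flip(v) // 2 == v // 2                       # stays in its LSB pair
print("all 256 values pass")
```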
For any steganographic method, it is often valuable to mathematically express
the impact of embedding on the image histogram. Many steganographic tech-
niques introduce characteristic artifacts into the histogram and these artifacts
can be used to detect the presence of secret messages (construct attacks). Let
h[j], j = 0, . . . , 2^{nc} − 1, be the histogram of elements from the cover image

h[j] = Σ_{i=1}^{n} δ(x[i] − j), (5.6)

where δ is the Kronecker delta (2.23). We will assume that Alice is embedding a
stream of m random bits. The assumption of randomness is reasonable because
Alice naturally wants to minimize the impact of embedding and thus compresses
the message and probably also encrypts to further improve the security of com-
munication. We denote by α = m/n the relative payload Alice communicates.
Assuming she embeds the bits along a pseudo-random path through the image,
the probability that a pixel is not changed is equal to the probability that it
is not selected for embedding, 1 − α, plus the probability that it is selected, α,
multiplied by the probability that no change will be necessary, which happens
with probability 1/2 because we are embedding a random bit stream. Thus, for any j,

Pr{y[i] = j | x[i] = j} = 1 − α + α/2 = 1 − α/2, (5.7)
Pr{y[i] = LSBflip(j) | x[i] = j} = α/2. (5.8)
Because during LSB embedding the pixel values within one LSB pair {2k, 2k + 1}, k = 0, . . . , 2^{nc−1} − 1, are changed into each other but never to any other
value, the sum hα [2k] + hα [2k + 1] stays unchanged for any α and thus forms
an invariant under LSB embedding. Here, we denoted the histogram of the stego
image as hα . Thus, the expected value of hα [2k] is equal to the number of pixels
with values 2k that stay unchanged plus the number of pixels with values 2k + 1

Figure 5.2 Effect of LSB embedding on histogram. Top: Magnified portion of the histogram of the image shown in Figure 5.1 after converting it to an 8-bit grayscale before and after LSB embedding. (Shown are histogram values for grayscales between 28 and 39.) Bottom: Histogram of quantized DCT coefficients of the same image before and after embedding using Jsteg (see Section 5.1.2). Left figures correspond to cover images, right figures to stego.

that were flipped to 2k:

E [hα[2k]] = (1 − α/2) h[2k] + (α/2) h[2k + 1], (5.9)
E [hα[2k + 1]] = (α/2) h[2k] + (1 − α/2) h[2k + 1]. (5.10)

Note that if Alice fully embeds her cover image with n bits (α = 1), we have

E [h1[2k]] = E [h1[2k + 1]] = (h[2k] + h[2k + 1])/2, k = 0, . . . , 2^{nc−1} − 1. (5.11)

We say that LSB embedding has a tendency to even out the histogram within
each bin. This leads to a characteristic staircase artifact in the histogram of
the stego image (Figure 5.2), which can be used as an identifying feature for
images fully embedded with LSB embedding. This observation is quantified in
the so-called histogram attack [246], which we now describe.
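The evening-out effect of (5.9)–(5.11) can be checked directly on a toy histogram (our sketch; the counts are invented):

```python
def expected_stego_histogram(h, alpha):
    # apply Eqs. (5.9)-(5.10) to each LSB pair {2k, 2k+1} of a cover histogram h
    h_a = list(h)
    for k in range(len(h) // 2):
        e, o = h[2 * k], h[2 * k + 1]
        h_a[2 * k] = (1 - alpha / 2) * e + (alpha / 2) * o
        h_a[2 * k + 1] = (alpha / 2) * e + (1 - alpha / 2) * o
    return h_a

h = [120, 80, 95, 40, 10, 70, 33, 57]          # invented 3-bit cover histogram
h1 = expected_stego_histogram(h, 1.0)          # fully embedded, alpha = 1

for k in range(4):
    # the pair sum is invariant, and at alpha = 1 each pair evens out, Eq. (5.11)
    assert h1[2 * k] + h1[2 * k + 1] == h[2 * k] + h[2 * k + 1]
    assert h1[2 * k] == h1[2 * k + 1] == (h[2 * k] + h[2 * k + 1]) / 2
print(h1)
```

The resulting staircase shape of h1 is exactly the artifact visible in Figure 5.2.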

5.1.1 Histogram attack


In a fully embedded stego image (α = 1), we expect

hα[2k] ≈ h̄[2k] = (hα[2k] + hα[2k + 1])/2, k = 0, . . . , 2^{nc−1} − 1. (5.12)

Formally, the histogram attack amounts to the following composite hypothesis-testing problem:

H0 : hα ∼ h̄, (5.13)
H1 : hα ≁ h̄, (5.14)

which we approach using Pearson's chi-square test [221] (also, see Appendix D). This test determines whether the even grayscale values in the stego image follow the known distribution h̄[2k], k = 0, . . . , 2^{nc−1} − 1. The chi-square test first computes the test statistic S,

S = Σ_{k=0}^{d−1} (hα[2k] − h̄[2k])^2 / h̄[2k], (5.15)

where d = 2^{nc−1}. Under the null hypothesis, the even grayscale values follow the probability mass function h̄[2k], and the test statistic (5.15) approximately follows the chi-square distribution, S ∼ χ²_{d−1}, with d − 1 degrees of freedom (see Section A.9 on the chi-square distribution). That is as long as all d bins, h̄[2k], k = 0, . . . , 2^{nc−1} − 1, are sufficiently populated. Any unpopulated bins must be merged so that h̄[2k] > 4 for all k, to make S approximately chi-square distributed.
One can intuitively see that if the even grayscales follow the expected distri-
bution, the value of S will be small, indicating the fact that the stego image is
fully embedded with LSB embedding. Large values of S mean that the match is
poor and notify us that the image under inspection is not fully embedded. Thus,
we can construct a detector of images fully embedded using LSB embedding by
setting a threshold γ on S and decide “cover” when S > γ and “stego” otherwise.
The probability of failing to detect a fully embedded stego image (probability of
missed detection) is the conditional probability that S > γ for a stego image,

PMD (γ) = Pr{S > γ|x is stego}. (5.16)

We set the threshold γ so that the probability of a miss is at most PMD. Denoting the probability density function of χ²_{d−1} as f_{χ²_{d−1}}(x), the threshold is determined from (5.16),

PMD(γ) = ∫_γ^∞ f_{χ²_{d−1}}(x) dx = ∫_γ^∞ e^{−x/2} x^{(d−1)/2−1} / (2^{(d−1)/2} Γ((d−1)/2)) dx. (5.17)

The value PMD (γ) is called the p-value and it measures the statistical significance
of γ. It is the probability that a chi-square-distributed random variable with d − 1
degrees of freedom would attain a value larger than or equal to γ.
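A sketch of the test in Python (ours, not the book's code): the statistic (5.15) is computed from pair averages, and the p-value (5.17) is approximated by Simpson integration of the chi-square density rather than a statistics library:

```python
import math

def chi2_sf(g, dof, steps=20000):
    # survival function of chi-square with `dof` degrees of freedom:
    # integrate the pdf over [g, upper] by Simpson's rule (far tail negligible)
    upper = g + 40.0 * max(dof, 4)
    c = 1.0 / (2 ** (dof / 2) * math.gamma(dof / 2))
    f = lambda x: c * math.exp(-x / 2) * x ** (dof / 2 - 1) if x > 0 else 0.0
    w = (upper - g) / steps
    s = f(g) + f(upper)
    for i in range(1, steps):
        s += f(g + i * w) * (4 if i % 2 else 2)
    return s * w / 3

def histogram_attack(h_stego):
    # Pearson chi-square statistic: compare even bins against the pair averages
    S, d = 0.0, 0
    for k in range(len(h_stego) // 2):
        expected = (h_stego[2 * k] + h_stego[2 * k + 1]) / 2
        if expected > 4:                       # skip sparsely populated bins
            S += (h_stego[2 * k] - expected) ** 2 / expected
            d += 1
    return S, chi2_sf(S, d - 1)

# a perfectly evened-out histogram, as after full LSB embedding: S = 0, p near 1
S, p = histogram_attack([50, 50, 80, 80, 30, 30, 60, 60])
print(S, round(p, 3))
```

A high p-value flags the image as fully embedded; for a typical cover, S is large and the p-value collapses to zero.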
The histogram attack can identify images fully embedded with random mes-
sages (α = 1) and it can also be used to detect messages with α < 1 if the order of
embedding is known (e.g., sequential). In this case, the m = αn message bits are

Figure 5.3 The p-value for the histogram attack on a sequentially embedded 8-bit grayscale image with relative message length α = 0.4.

embedded along a known path in the image represented using the vector of in-
dices Path[i], i = 1, . . . , n. Evaluating the ith p-value pv [i] from the histogram of
{x[Path[1]], . . . , x[Path[i]]} from the stego image, after a short transient phase,
pv [i] will reach a value close to 1. It will suddenly fall to zero when we arrive at
the end of the message (at approximately i ≈ αn) and it will stay at zero until
we exhaust all pixels. This is because the test statistic S will cease to follow
the chi-square distribution. Figure 5.3 shows pv [i] for a sequentially embedded
message with α = 0.4 (the cover image is shown in Figure 5.1). Thus, for sequen-
tial embedding this test not only determines with very high probability that a
random message has been embedded but also estimates the message length.
If the embedding path is not known, the histogram attack is ineffective unless
the majority of pixels have been used for embedding. Attempts to generalize this
attack to randomly spread messages include [199, 242]. The most accurate ste-
ganalysis methods for LSB embedding are the detectors discussed in Chapter 11.

5.1.2 Quantitative attack on Jsteg


LSB embedding can be applied to any collection of numerical data. If the cover
elements follow a distribution about which we have some a priori knowledge, we
can use it to construct an attack in the following manner [255]. Let us describe
our a priori knowledge about the cover image using some function of the cover
histogram h,

F (h) = 0. (5.18)

From equations (5.9) and (5.10), h can be expressed using E [hα ] by solving the
system of two linear equations for h[2k] and h[2k + 1],

h[2k] = a E[hα[2k]] − b E[hα[2k + 1]], (5.19)
h[2k + 1] = −b E[hα[2k]] + a E[hα[2k + 1]], (5.20)

with a = (1 − α/2)/(1 − α) and b = (α/2)/(1 − α). Using the approximation hα ≈ E[hα], we can substitute (5.19) and (5.20) into (5.18) and thus obtain
equation for the unknown relative message length α. This equation will contain
only the histogram of the stego image, hα , which is known.
Note that this attack provides an estimate of the unknown message length
independently of whether or not the message placement is known. Steganalysis
methods that estimate the message length, or, more accurately, the number of
embedding changes, are called quantitative. We illustrate this general method
by attacking the steganographic algorithm Jsteg (http://zooid.org/~paul/crypto/jsteg/), which embeds data in JPEG images.
Jsteg uses the LSB embedding principle applied to quantized DCT coefficients
with the exception that the LSB pair {0, 1} is skipped because allowing embed-
ding into 0s would lead to quite disturbing artifacts. From Chapter 2, we know
that the histogram of quantized DCT coefficients in a JPEG file is approximately
symmetrical. This a priori knowledge can be expressed as

h[j] − h[−j] = 0, j = 1, 2, . . . . (5.21)

Moreover, the histogram is monotonically increasing for j < 0 and decreasing for
j > 0. Because the LSB pairs are . . . , {−4, −3}, {−2, −1}, {0, 1}, {2, 3}, . . . and
because LSB embedding evens out the differences in counts in each LSB pair,
h[2k] decrease and h[2k + 1] increase with embedding for k > 0 and the effect is
the opposite for k < 0 (h[2k] increase and h[2k + 1] decrease). Thus, we use the
following function of the histogram to attack Jsteg:

F(h) = Σ_{k>0} h[2k] + Σ_{k<0} h[2k + 1] − Σ_{k≥0} h[2k + 1] − Σ_{k<0} h[2k]. (5.22)

Note that for the cover image F (h) = 0 as required. Also, with embedding, F
increases due to the impact of LSB embedding on positive and negative even
and odd DCT values. By substituting (5.19) and (5.20) into (5.22) and using the
approximation hα ≈ E[hα], we obtain²

Σ_{k>0} (a hα[2k] − b hα[2k + 1]) + Σ_{k<0} (−b hα[2k] + a hα[2k + 1])
− Σ_{k>0} (−b hα[2k] + a hα[2k + 1]) − Σ_{k<0} (a hα[2k] − b hα[2k + 1]) = hα[1]. (5.23)

2 Note that because the DCT coefficients equal to 1 are skipped, h[1] = hα [1].

Rearranging the terms, an equation for the unknown relative message length α is obtained,

(a + b) Σ_{k>0} (hα[2k] − hα[2k + 1]) + (a + b) Σ_{k<0} (hα[2k + 1] − hα[2k]) = hα[1], (5.24)

where a + b = 1/(1 − α) (recall that hα is known as it is the histogram of the stego image). Solving for α, we finally get for its estimate

α̂ = 1 − (Σ_{k≠0} Δhα[k]) / hα[1], (5.25)
where

Δhα[k] = hα[2k] − hα[2k + 1] for k > 0, (5.26)
Δhα[k] = hα[2k + 1] − hα[2k] for k < 0. (5.27)

This estimate can be used to formulate the following hypothesis-testing problem:

H0 : α̂ = 0, (5.28)
H1 : α̂ > 0, (5.29)

where H0 and H1 correspond to the hypotheses that hα is a histogram of a cover and a stego image, respectively.
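The estimator (5.25)–(5.27) needs only the stego histogram. The sketch below (ours) represents the histogram as a dict over signed DCT values and validates the estimator on an idealized symmetric cover transformed by the expected-value formulas (5.9)–(5.10); on such expected histograms the payload is recovered exactly:

```python
def jsteg_estimate(h):
    # alpha-hat from Eq. (5.25); h maps quantized DCT value -> count (stego image)
    K = max(abs(j) for j in h) // 2 + 2
    num = 0.0
    for k in range(1, K):
        num += h.get(2 * k, 0) - h.get(2 * k + 1, 0)       # Delta h_alpha[k], k > 0
        num += h.get(-2 * k + 1, 0) - h.get(-2 * k, 0)     # Delta h_alpha[-k], k < 0
    return 1 - num / h[1]

def embed_expected(h, alpha):
    # expected stego histogram: Eqs. (5.9)-(5.10) applied to every Jsteg LSB pair
    # {2k, 2k+1}, k != 0; the pair {0, 1} is skipped by Jsteg
    K = max(abs(j) for j in h) // 2 + 2
    h_a = dict(h)
    for k in [k for k in range(-K, K) if k != 0]:
        e, o = h.get(2 * k, 0), h.get(2 * k + 1, 0)
        h_a[2 * k] = (1 - alpha / 2) * e + (alpha / 2) * o
        h_a[2 * k + 1] = (alpha / 2) * e + (1 - alpha / 2) * o
    return h_a

# idealized symmetric cover histogram, so F(h) = 0 as required by Eq. (5.21)
cover = {j: 1000.0 * 2 ** -abs(j) for j in range(-8, 9) if j != 0}
stego = embed_expected(cover, 0.3)
print(jsteg_estimate(stego))     # recovers alpha = 0.3 up to float rounding
```

On real images, hα fluctuates around E[hα], which is exactly the source of the estimator error discussed around Figure 5.4.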
Figure 5.4 shows the histograms of the detector response α̂ for α = 0, 0.1,
0.2, 0.3, and 0.5 on a database of 954 images, 315, 320, and 319 of which came
from Canon G2, Canon S40, and Kodak DC 290 digital cameras with dimensions
2272 × 1704, for both Canon images, and 1792 × 1200 for images taken by Kodak
DC 290. All images were acquired in the raw format, then off-line converted to
grayscale and saved as 75% quality JPEGs before the experiment. The estimated
message length α̂ appears to be unbiased and is overall a good detector capable
of distinguishing cover images from images embedded with Jsteg. Note that
the distribution of the estimator error is far from Gaussian and exhibits quite
thick tails. The reader is referred to Section 10.3.2, which describes statistical
properties of error of quantitative steganalyzers.

5.2 Steganography in palette images

Palette images, such as GIF or the indexed form of PNG, represent the image
data using pointers to a palette of colors stored in the header. It is possible to
hide messages both in the palette and in the image data.

5.2.1 Embedding in palette


Reordering the image palette and reindexing the image data correspondingly is
a modification that does not change the visual appearance of the image. Thus, it

Figure 5.4 Histogram of estimated message length α̂ for cover images and stego images embedded with Jsteg and relative payload 0.1, 0.2, 0.3, 0.5. The data was computed from 954 grayscale JPEG images with quality factor qf = 75.

is possible to hide short messages as permutations of the palette. This method is implemented in the steganographic program Gifshuffle (http://www.darkside.com.au/gifshuffle/). Gifshuffle uses the palette order to hide up to log2 256! ≈ 1684 bits ≈ 210 bytes in the palette by permuting its entries.
While this steganographic method does not change the appearance of the
image, its security is low because many image-editing programs order the palette
according to luminance, frequency of occurrence, or some other scalar factor. A
randomly ordered palette will thus immediately raise suspicion. Also, displaying
the image and resaving it may erase the information because the palette may be
reordered. Another disadvantage of Gifshuffle is that its capacity is quite small
and independent of the image size.
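The quoted capacity is simply log2 256!, which can be checked with a short computation (ours, not from the book):

```python
import math

# capacity of hiding in the palette order: log2(256!) bits, summed for stability
bits = sum(math.log2(k) for k in range(2, 257))
print(round(bits), round(bits / 8))   # about 1684 bits, about 210 bytes
```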

5.2.2 Embedding by preprocessing palette


For most palette images, it is not possible to apply LSB embedding directly to the
image colors because new colors that are not in the palette would be obtained
and, once the total number of unique colors exceeds 256, the image could no
longer be stored as a GIF. One simple solution to this problem is to preprocess the
palette before embedding by decreasing the color depth to 128, 64, or 32 colors.
This way, when the LSBs of one, two, or three color channels, respectively, are
perturbed, the total number of newly created colors will be at most 256. Thus,
it will be possible to embed one, two, or three bits per pixel without introducing
artifacts that are too disturbing. This method was implemented in the earlier versions of S-Tools (ftp://ftp.ntua.gr/pub/crypt/mirrors/idea.sec.dsi.unimi.it/code/s-tools4.zip).

Figure 5.5 Palette colors of image shown in Figure 5.1 saved as GIF. A color version of
this figure is shown in Plate 6.

Although this method provides high capacity, its steganographic security is low
because the palette of the stego image will have very unusual structure [121] that
is unlikely to occur naturally during color quantization (see Section 2.2.2). It will
contain suspicious groups of 2, 4, or 8 close colors depending on the technique. It
is thus relatively easy to identify stego images simply by analyzing the palette.
What is even worse is that the detection will be equally reliable even for very
short messages.

5.2.3 Parity embedding in sorted palette


Many problems associated with applying LSB-like embedding to palette images
can be somewhat alleviated by presorting the palette colors so that neighboring
colors are close and applying simple LSB embedding to the pointers. For example,
the EzStego program (http://www.fqa.com/stego_com/) orders the palette by
luminance. However, since luminance is a linear combination of three basic colors,
occasionally colors with similar luminance values may be relatively far from each
other (e.g., the RGB colors [6, 98, 233] and [233, 6, 98] have the same luminance
but represent two visually very different colors). When this happens, visible
suspicious artifacts result in the stego image.
Figures 5.5 and 5.6 (see Plates 6 and 7 for color versions of both figures) show
the original palette colors and palette colors sorted by luminance arranged in a
row-by-row fashion into a square 16 × 16 array. Notice that the luminance-sorted
palette contains many very different neighboring colors. Not surprisingly, LSB
embedding into indices to the sorted palette creates quite disturbing artifacts
(see Figure 5.7(b) or Plate 8 for its color version).
There were attempts to modify the EzStego algorithm to eliminate this prob-
lem by ordering the palette using more sophisticated algorithms that would min-
imize the differences between neighboring palette colors in the palette. The task

Figure 5.6 Palette colors sorted by luminance (by rows). A color version of this figure is
shown in Plate 7.

Figure 5.7 A small magnified portion of the image shown in Figure 5.1 saved as 256-color
GIF (a), the same portion after embedding a maximum-length message using EzStego
(b), optimal-parity embedding (c), and embedding while dithering (d). A color version of
this figure is shown in Plate 8.

of finding the optimal ordering leads to the traveling-salesman problem, which
is known to be NP-complete.
After a little thought, it should be clear that we do not really need to order the
palette to obtain the minimal embedding distortion. In fact, all that is needed is

to assign parities (bits 0 and 1) to each palette color so that the closest color to
each palette color has the opposite parity. If this could be achieved, we can simply
embed messages as color parities of indices instead of their LSBs by swapping a
color for its closest neighbor from the palette. This is the idea behind optimal-
parity embedding [80], explained next.

5.2.4 Optimal-parity embedding


Let the image palette contain Np colors represented as RGB triples, c[j] =
(r[j], g[j], b[j]), with parities P[j] ∈ {0, 1}, j = 1, . . . , Np. We define the isolation,
s[j], for each color c[j] as the distance from c[j] to its closest neighbor from
the palette, s[j] = min_{j'≠j} d(c[j], c[j']). For example, one can use the Euclidean
distance in the RGB cube as the measure of distance between two colors,

dRGB(c[j], c[j']) = ||c[j] − c[j']||_2   (5.30)
    = sqrt((r[j] − r[j'])^2 + (g[j] − g[j'])^2 + (b[j] − b[j'])^2).   (5.31)

We define parity assignment in the following manner.


1. Calculate the matrix of distances between all pairs of colors dRGB[k, l] =
   dRGB(c[k], c[l]). Set P = ∅.
2. Order all distances dRGB[k, l], k > l, to a non-decreasing sequence D =
   dRGB[k1, l1] ≤ dRGB[k2, l2] ≤ · · ·. For a unique order, resolve ties, for example,
   using alphabetical order.
3. Iteratively repeat Step 4 until P contains all Np palette colors.
4. Choose the next distance dRGB[k, l] in D such that either c[k] ∉ P or c[l] ∉ P.
   If no such dRGB[k, l] can be found, this means that P already contains all Np
   colors and we are done.
   a. If both c[k] ∉ P and c[l] ∉ P, assign two opposite parities to c[k] and
      c[l] and update P = P ∪ {c[k]} ∪ {c[l]}.
   b. If c[k] ∉ P and c[l] ∈ P, set P[k] = 1 − P[l] and update P = P ∪ {c[k]}.
   c. If c[k] ∈ P and c[l] ∉ P, set P[l] = 1 − P[k] and update P = P ∪ {c[l]}.
Note that once a parity of a color has been defined, it cannot be changed later
by the algorithm. It is also clear that at the end all colors will have assigned
parities. We now show that for any color c[j], s[j] = dRGB (c[j], c[j  ]) for some
palette color c[j  ] with the opposite parity P[j  ] = P[j]. Let ka , la be the index
pair of the first occurrence of index j in the sequence k1 , l1 , k2 , l2 , . . .. Without
loss of generality, let us assume that ka = j. Because it is the first occurrence
of j, P[j] has not been assigned and thus P[j] = 1 − P[la ]. Also, there is no
color in the palette that is strictly closer to c[j] than c[la ] because, if there were
such a color, we would have encountered color c[j] earlier in the process because
D is non-decreasing and thus ka would not be the first occurrence of index j.
This means that the parity assignment constructed using the above algorithm
guarantees that for any palette color there is another color in the palette that

is closest and has the opposite parity. Thus, we can construct a steganographic
algorithm that embeds one message bit at each palette index (each pixel) as the
color parity and this algorithm induces the smallest possible distortion. This is
because if the message bit does not match the color parity, we can swap the color
for another palette color that is the closest to it.
Note that there may be more than one parity assignment with the above
optimal property. The algorithm above, however, is deterministic and will always
produce one specific assignment. Also note that the optimal parity assignment is
only a function of the palette and not of the frequency with which the colors
appear in the image. This means that the recipient can construct the same
assignment from the stego image as the sender and thus read the message.
Assuming that color c[j] occurs in the cover image with frequency³ p[j],
Σ_{j=1}^{Np} p[j] = 1, if the message-carrying pixels are selected pseudo-randomly, the
expected embedding distortion per visited pixel is

(1/m) E[d2(x, y)] = (1/m) E[ Σ_{i=1}^{n} d²RGB(x[i], y[i]) ] = (1/2) Σ_{j=1}^{Np} p[j] s²[j],   (5.32)

because for a message of length m, there will be on average mp[j] pixels of color
c[j] containing message bits and one half of them will have to be modified to
c[j'], inducing distortion s²[j].
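The greedy construction can be sketched in Python as follows (the function name `optimal_parity` and the toy palette are ours; color pairs are processed in non-decreasing distance order, as in the four-step construction above, and the script then checks the nearest-neighbor property that the text proves):

```python
import math
from itertools import combinations

def optimal_parity(palette):
    """Greedy parity assignment for a list of RGB triples.

    Inter-color distances are processed in non-decreasing order; a newly
    covered color receives the parity opposite to the already covered
    color it is paired with.
    """
    n = len(palette)
    pairs = sorted(combinations(range(n), 2),
                   key=lambda kl: (math.dist(palette[kl[0]], palette[kl[1]]), kl))
    parity = [None] * n
    for k, l in pairs:
        if parity[k] is None and parity[l] is None:
            parity[k], parity[l] = 0, 1   # two fresh colors get opposite parities
        elif parity[k] is None:
            parity[k] = 1 - parity[l]
        elif parity[l] is None:
            parity[l] = 1 - parity[k]
    return parity

# Toy palette; the guaranteed property is that the *nearest* palette
# neighbor of every color has the opposite parity, so parity embedding
# can always swap a color for its closest neighbor.
palette = [(0, 0, 0), (2, 0, 0), (100, 100, 100), (103, 100, 100)]
P = optimal_parity(palette)
for j, c in enumerate(palette):
    nearest = min((i for i in range(len(palette)) if i != j),
                  key=lambda i: math.dist(c, palette[i]))
    assert P[j] != P[nearest]
print(P)  # e.g. [0, 1, 0, 1]
```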

5.2.5 Adaptive methods


Quite often, palette images contain large areas of uniform color, where no em-
bedding should take place to avoid introducing easily detectable artifacts. Thus,
it seems that steganography for palette images would benefit from methods that
use adaptive selection channels and limit the embedding changes to textured ar-
eas while avoiding simple structures, such as segments of uniform color. Adaptive
selection channels determined by the content of the cover image, however, create
a potential problem with message recovery because the recipient does not have
the cover image. We illustrate this issue on a simple adaptive method for palette
images.
The image is divided into disjoint blocks, B, formed, for example, by 3 × 3
blocks completely covering the image. Using the optimal parity assignment from
the previous section, we define block parity as the eXclusive OR (XOR) of parities
of all pixels in the block, P(B) = ⊕_{x[i]∈B} P(x[i]). We also define a measure of
texture, which is a map, t(B) → [0, ∞), that assigns a scalar value to each block.
For example, we could define t(B) as the number of unique colors in B. If the
texture measure exceeds a certain threshold, t(B) > γ, we check whether the
block parity matches the message bit. If it does not, one of the colors is changed
to adjust its parity and thus the parity of the whole block. For example, we can

3 One can also say that p is the normalized color histogram or sample pmf of colors.

Algorithm 5.3 Optimal parity assignment, P, for a palette consisting of Np
colors c[j] = (r[j], g[j], b[j]), j = 1, . . . , Np.

// Input: image palette c[j] = (r[j], g[j], b[j]), j = 1, . . . , Np
for k = 1 to Np {
  for l = 1 to Np {
    dRGB[k, l] = sqrt((r[k] − r[l])^2 + (g[k] − g[l])^2 + (b[k] − b[l])^2);
  }
}
for k = 1 to Np dRGB[k, k] = Inf;
inc = (0, . . . , 0); // vector of zeros of length Np
while sum_{k=1..Np} inc[k] < Np {
  [kmin, lmin] = arg min_{k,l} dRGB[k, l];
  c = inc[kmin] + inc[lmin];
  if (c = 1) {
    if (inc[kmin] = 0) {
      inc[kmin] = 1; P[kmin] = 1 − P[lmin];
    } else {
      inc[lmin] = 1; P[lmin] = 1 − P[kmin];
    }
  }
  if (c = 0) {
    inc[kmin] = 1; P[kmin] = 0;
    inc[lmin] = 1; P[lmin] = 1;
  }
  dRGB[kmin, lmin] = Inf;
}
// P is the optimal parity assignment

choose the color with the smallest isolation to minimize the overall embedding
distortion. The recipient simply reads the message by following the same steps,
extracting message bits from the parity of all blocks whose texture measure is
above the threshold.
The problem with this scheme is that the act of embedding may change the
texture measure to fall below the threshold and thus the recipient will not read
the bit embedded in that block. This problem is common to many steganographic
methods that use adaptive selection rules. For the scheme above, the solution is
simple. After embedding each bit, we need to check whether the block texture
is still above the threshold. If it is, the embedding can continue. If the texture
falls below the threshold, we need to embed the same bit in the next block
because the block where we just embedded will be skipped by the recipient. This
modification decreases the embedding efficiency because sometimes changes are
made that do not embed any bits. However, the decrease in embedding efficiency

is usually very small because the chances that the block’s texture will fall below
the threshold after embedding are small.
The problem above is a specific example of a selection channel that is not
completely shared between the sender and the recipient. Chapter 9 is devoted
to the problem of communicating with non-shared selection channels, where a
general solution is presented using so-called wet paper codes.
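The embed/re-embed logic above can be sketched in a pure-Python toy (all names are ours: "blocks" are short lists of values, the texture measure is the number of unique values, the threshold `GAMMA` is illustrative, and flipping the LSB of one value stands in for swapping a color to its closest opposite-parity palette neighbor):

```python
GAMMA = 2  # illustrative texture threshold: embed only if t(B) > GAMMA

def texture(block):
    """Toy texture measure t(B): number of unique values in the block."""
    return len(set(block))

def block_parity(block):
    """XOR of the parities (LSBs) of all values in the block."""
    p = 0
    for v in block:
        p ^= v & 1
    return p

def embed(blocks, bits):
    """Hide one bit in the parity of every sufficiently textured block.

    If the embedding change drops a block below the texture threshold,
    the recipient would skip it, so the same bit is re-embedded in the
    next usable block -- the fix described in the text.
    """
    i = 0
    for block in blocks:
        if i >= len(bits):
            break
        if texture(block) <= GAMMA:
            continue  # the recipient will skip this block too
        if block_parity(block) != bits[i]:
            block[0] ^= 1  # stand-in for swapping to the closest opposite-parity color
        if texture(block) > GAMMA:
            i += 1  # block still readable: the bit counts as embedded
    return blocks

def extract(blocks, nbits):
    """Recipient side: read parities of all blocks above the threshold."""
    return [block_parity(b) for b in blocks if texture(b) > GAMMA][:nbits]

blocks = [[10, 11, 12, 13], [5, 5, 5, 5], [7, 8, 9, 7], [1, 2, 3, 4]]
msg = [1, 0, 1]
stego = embed([b[:] for b in blocks], msg)
print(extract(stego, len(msg)) == msg)  # True
```

Note that the uniform block `[5, 5, 5, 5]` is silently skipped by both sides, which is what makes the selection channel adaptive.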

5.2.6 Embedding while dithering


We use the palette format to demonstrate one more important type of stegano-
graphic schemes that embed messages while applying an information-reducing
operation to the cover image, such as quantization. The embedding minimizes
at the same time the combined distortion due to quantization and embedding.
Let us assume that the cover image is a true-color 24-bit image, for example in
the BMP format, and we wish to save it as a GIF. As explained in Chapter 2, this
conversion involves color quantization, to “round” the pixel colors to palette col-
ors, and dithering, which spreads the quantization error. The embedding scheme
starts with computing the optimal parity assignment for the palette. To embed
m bits, m pixels are pseudo-randomly selected to carry the message bits. Finally,
color quantization and dithering are performed as usual by scanning the image
by rows with one exception. At each message-carrying pixel, its color is quantized
to the closest palette color with the parity that matches the message bit. The
combined quantization and embedding error is then diffused to neighboring pixels.
This way, both the quantization error and the error due to message embedding
will be diffused through the whole image.
at 1 bpp with this method is shown in Figure 5.7(d) (see Plate 8 for its color
version).
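The mechanism can be sketched in a few lines of Python. This is a deliberately simplified one-dimensional grayscale analogue (an eight-level "palette" with illustrative parity labels, and the full error diffused to the next pixel rather than through a 2-D Floyd-Steinberg kernel); all names are ours:

```python
import random

palette = [0, 36, 73, 109, 146, 182, 219, 255]  # toy grayscale "palette"
parity = [j % 2 for j in range(len(palette))]   # illustrative parity labels

def nearest(value, allowed):
    """Index of the closest palette entry among the allowed indices."""
    return min(allowed, key=lambda j: abs(palette[j] - value))

def quantize_with_embedding(row, message_positions, bits):
    """1-D error-diffusion quantization with parity-coded embedding.

    At message-carrying pixels the value is rounded to the nearest
    palette entry whose parity equals the message bit; the combined
    quantization-plus-embedding error is diffused to the next pixel.
    """
    row = list(row)
    out = []
    bit_iter = iter(bits)
    for i, v in enumerate(row):
        if i in message_positions:
            b = next(bit_iter)
            j = nearest(v, [k for k in range(len(palette)) if parity[k] == b])
        else:
            j = nearest(v, range(len(palette)))
        out.append(j)
        if i + 1 < len(row):
            row[i + 1] += v - palette[j]  # diffuse the full error forward
    return out

random.seed(7)
cover = [random.randrange(256) for _ in range(16)]
positions = {2, 5, 11}  # pseudo-randomly selected message-carrying pixels
bits = [1, 0, 1]
indices = quantize_with_embedding(cover, positions, bits)
# The recipient reads the message back as parities of the selected indices:
print([parity[indices[i]] for i in sorted(positions)])  # [1, 0, 1]
```

Extraction is exact by construction because message pixels are only ever quantized to palette entries of the correct parity.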
The concept of embedding a message while the cover image is being processed
is a very powerful one and can be greatly generalized. When applied to other
image-processing operations, however, a more advanced embedding mechanism
is required. This topic is further elaborated upon in Chapter 9 on non-shared
selection channels.
Finally, we would like to state that palette images with a small number of
colors are not very suitable for steganography. It is difficult (if not impossible)
to design secure schemes with reasonable capacity. Moreover, palette images are
typically used for storing computer art and line drawings, where embedding
changes are usually easily detectable due to the semantic meaning of the objects
in the image. Thus, the emphasis in this book will be given especially to the
ubiquitous JPEG format and raster formats. We included steganography of GIF
images in this chapter because the GIF format allowed us to demonstrate many
important embedding principles and issues that will be revisited throughout this
book.

Summary
• The simplest steganographic method is Least-Significant-Bit (LSB) embedding.
• The effect of LSB embedding on an image histogram can be quantified. The
  embedding evens out the populations of both values from the same LSB pair
  {2k, 2k + 1}.
• The histogram attack is an attack on LSB embedding when the message placement
  is known.
• Jsteg is a steganographic algorithm for JPEG images that uses LSB embedding.
  It can be attacked by utilizing a priori knowledge about the cover-image
  histogram, such as its symmetry.
• Messages can be embedded in palette images either in the palette or in the
  indices (image data). Hiding in the palette provides limited capacity independent
  of the image size.
• Embedding in indices to the palette offers larger capacity but often creates easily
  discernible artifacts.
• Optimal-parity embedding is an assignment of bits to palette colors that can
  be used to minimize the embedding distortion.
• Embedding while dithering is an embedding method for palette images that
  minimizes the total distortion due to color quantization and embedding.
• A possible way to improve security is to confine the embedding changes to
  more textured or noisy areas of the cover (adaptive steganography).

Exercises

5.1 [Embedding in two LSBs (LSB2)] Consider the following steganographic
scheme that embeds bits into the two least significant bits of pixels

C = {0, . . . , 255}^n,   (5.33)
M = {0, 1}^{2n},   (5.34)
K = {∅}.   (5.35)

Emb : Two bits are embedded sequentially at each pixel by replacing two least
significant bits with two message bits. For example, if x[i] = 14 = (00001110)2
and we want to embed bit pairs 00, 01, 10, 11, x[i] is changed to 12, 13, 14, 15,
respectively. If x[i] = 32 = (00100000)2, we embed the same bit pairs by chang-
ing x[i] to 32, 33, 34, and 35, etc. Ext : Two bits are extracted sequentially as
the least two LSBs from the pixels to form the message.

Calculate the embedding efficiency for both d1 and d2 distortion under the as-
sumption that the message is a random bit stream and the pixel values are
uniformly distributed in {0, . . . , 255}.

Hint: Write the expected distortion for x[i] = 4k, 4k + 1, 4k + 2, 4k + 3, and



then use the assumption that the pixel values are uniformly distributed to obtain
the average distortion per pixel.
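For exercises of this kind, a quick Monte Carlo simulation is a convenient sanity check on a derived answer (a sketch under the exercise's assumptions: cover pixels uniform on {0, . . . , 255} and random message bits; all names are ours):

```python
import random

random.seed(42)
N = 200_000
d1 = d2 = 0
for _ in range(N):
    x = random.randrange(256)   # cover pixel, uniform on {0, ..., 255}
    m = random.randrange(4)     # two random message bits
    y = (x & ~3) | m            # replace the two LSBs with the message
    d1 += abs(x - y)
    d2 += (x - y) ** 2

# Embedding efficiency = embedded bits per unit expected distortion:
print("e1 =", 2 / (d1 / N), " e2 =", 2 / (d2 / N))
```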

5.2 [Alternative embedding in 2 LSBs (LSB2–)] Consider the following


steganographic scheme with the same C, M, K as in Exercise 5.1 with embedding
function Emb : Two message bits are sequentially embedded at each pixel by
always modifying the pixel value to the closest value with the required least
two LSBs. For example, if x[i] = 14 = (00001110)2 and we want to embed bit
pairs 00, 01, 10, 11, x[i] is changed to 12, 13, 14, 15, respectively. If x[i] = 32 =
(00100000)2, we embed the same bit pairs by changing x[i] to 32, 33, 34, and
31. Note that the last change from 32 to 31 will lead to modification of all six
LSBs! The extraction function is the same as in the previous exercise. Calculate
the embedding efficiency for both d1 and d2 distortion under the assumption
that the message is a random bit stream. For simplicity, ignore the boundary
issues. Compare the embedding efficiency of LSB2 and LSB2– for each distortion
measure. Prove that the LSB2– method has higher embedding efficiency than the

LSB2 method for distortion measure dγ(x, y) = Σ_{i=1}^{n} |x[i] − y[i]|^γ for any γ > 0.

5.3 [LSB embedding of biased bit stream] Assume that the cover image is
an 8-bit grayscale image. Suppose that the secret message is a random biased bit
stream, i.e., the bits are iid realizations of a Bernoulli random variable ν ∼ B (p0 )
with probability mass function

Pr{ν = 0} = p0 , (5.36)
Pr{ν = 1} = p1 , (5.37)

where p0 + p1 = 1. Let hα be the image histogram after LSB embedding a secret


message of relative length α in the cover image. Show that

E [hα [2k]] = (1 − αp1 )h[2k] + αp0 h[2k + 1], (5.38)


E [hα [2k + 1]] = αp1 h[2k] + (1 − αp0 )h[2k + 1]. (5.39)

5.4 [Histogram attack for biased message] Assume that an 8-bit


grayscale stego image is fully embedded (α = 1) with biased bit stream from
the previous exercise. First show

E[hα[2k]] / E[hα[2k + 1]] = p0/p1 for all k = 0, . . . , 127,   (5.40)

E[hα[2k]] = p0 (hα[2k] + hα[2k + 1]),   (5.41)
E[hα[2k + 1]] = (1 − p0) (hα[2k] + hα[2k + 1]).   (5.42)

The distribution now depends on the unknown message bias p0. Estimate p0 as

p̂0 = arg min_{λ∈R} Σ_{k=0}^{d−1} [(hα[2k] − 2λh[2k])² + (hα[2k + 1] − 2(1 − λ)h[2k])²],   (5.43)

where h[2k] is defined in (5.12), and compute the chi-square test statistic

S = Σ_{k=0}^{d−1} (hα[2k] − 2p̂0 h[2k])² / (2p̂0 h[2k]),   (5.44)

which now follows the chi-square distribution with d − 2 = 126 degrees of freedom
because we had to estimate the unknown parameter p0 from the data.

5.5 [Power of parity] Assume that the cover image has a biased distribution
of LSBs: the fraction r of pixels has LSBs equal to 1 and the fraction 1 − r
has LSBs equal to 0, 0 < r < 1. Let x[i] be the LSBs of pixels ordered along
a pseudo-random path. Consider the following sequence of bits b[i] obtained
as the XOR of LSBs of disjoint groups of m consecutive pixels b[1] = x[1] ⊕
x[2] ⊕ · · · ⊕ x[m], b[2] = x[m + 1] ⊕ x[m + 2] ⊕ · · · ⊕ x[2m], . . .. Show that the
bit stream b[i] becomes unbiased exponentially fast with m by proving that

|Pr{b[i] = 0} − Pr{b[i] = 1}| = |1 − 2r|^m.   (5.45)

Hint: For m even, Pr{b[i] = 0} = Pr{x[1] + x[2] + · · · + x[m] is even}; express
the probabilities using r and 1 − r.
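Relation (5.45) is easy to verify numerically by exact enumeration (pure Python; `parity_bias` is our helper name):

```python
from itertools import product

def parity_bias(r, m):
    """Exact |Pr{b=0} - Pr{b=1}| for b = XOR of m iid bits with Pr{1} = r."""
    p0 = p1 = 0.0
    for bits in product((0, 1), repeat=m):
        pr = 1.0
        for x in bits:
            pr *= r if x else 1 - r
        if sum(bits) % 2 == 0:
            p0 += pr
        else:
            p1 += pr
    return abs(p0 - p1)

r = 0.3
for m in (1, 2, 3, 5, 8):
    assert abs(parity_bias(r, m) - abs(1 - 2 * r) ** m) < 1e-12
print("bias after XOR-ing m bits decays as |1 - 2r|^m")
```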

5.6 [Repetitive LSB embedding] Let x be an 8-bit grayscale cover image


and y be the stego image after LSB embedding in x a random unbiased message
of relative length α1 . Continue by embedding in y using LSB embedding another
random unbiased message of relative length α2 , obtaining the doubly-embedded
stego image z. Assume that the paths for both embeddings are pseudo-randomly
chosen and independent of each other. The stego image z will appear to have
been embedded with a random unbiased message of relative length L(α1 , α2 ).
Show that
L(α1 , α2 ) = α1 + α2 − α1 α2 . (5.46)
Furthermore, consider the case when you embed messages of relative length α
repetitively into the same image k times. Show that the relative length Lk (α) of
the message that appears to have been embedded after k repetitive embeddings
is
Lk(α) = 1 − (1 − α)^k.   (5.47)
Note that Lk (α) → 1 as k → ∞ exponentially fast, for any α positive.
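The union-of-paths argument behind (5.46) can be checked numerically (a sketch: the message-carrying pixel positions of the two passes are sampled independently, and L is the fraction of pixels touched by at least one pass; all names are ours):

```python
import random

random.seed(3)
n = 100_000
alpha1, alpha2 = 0.4, 0.5

# Message-carrying pixel positions of the two independent embedding passes:
path1 = set(random.sample(range(n), int(alpha1 * n)))
path2 = set(random.sample(range(n), int(alpha2 * n)))

# A pixel looks embedded if at least one pass randomized its LSB:
L = len(path1 | path2) / n
print(L, alpha1 + alpha2 - alpha1 * alpha2)  # empirically close
```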

5.7 [Upper bound on message length] Let hα be the histogram of an


8-bit grayscale image embedded using LSB embedding with a random unbiased
message of relative length α. Prove the following upper bound:

α ≤ 2 min{hα[2k], hα[2k + 1]} / (hα[2k] + hα[2k + 1]) for all k = 0, . . . , 127.   (5.48)

Hint: hα[2k] ≈ (1 − α/2)h[2k] + (α/2)h[2k + 1] ≥ (α/2)h[2k] + (α/2)h[2k + 1]
because 1 − α/2 ≥ α/2 for 0 ≤ α ≤ 1.

5.8 [LSB embedding as noise adding] Explain why the impact of LSB
embedding in cover x cannot be written as adding to x an iid noise, y = x + ξ.
Hint: Are x and ξ independent?

5.9 [View bit planes] Write a routine in Matlab that displays a selected
bit plane (a two-color black-and-white image) of an image represented using a
uint8 array. For a color image, display the bit planes of red, green, and blue
channels separately as three two-color images. You may wish to use green–white
combination for the bit plane of the green channel, red–white combination for
the red channel, etc.
6 Steganographic security

In the previous chapter, we saw a few examples of simple steganographic schemes


and successful attacks on them. We learned that the steganographic scheme
called LSB embedding leaves a characteristic imprint on the image histogram
that does not occur in natural images. This observation led to an algorithm (a
detector) that could decide whether or not an image contains a secret message.
The existence of such a detector means that LSB embedding is not secure. We
expect that for a truly secure steganography it should be impossible to construct
a detector that could distinguish between cover and stego images. Even though
this statement appears reasonable at first sight, it is vague and allows subjective
interpretations. For example, it is not clear what is meant by “could distinguish
between cover and stego images.” We cannot construct a detector that will always
be 100% correct because it is hardly possible to detect the effects of flipping one
LSB, at least not reliably in every cover. Just how reliable must a detector be to
pronounce a steganographic method insecure?
Even though there are no simple practical solutions to the questions raised in
the previous paragraph, they can in principle be studied within the framework
of information theory. Imagine that Alice and Bob are engaging in a legitimate
communication and do not use steganography. Let us suppose that they ex-
change grayscale 512 × 512 images in raster format that were never compressed.
If we observed their communication for a sufficiently long time, the images would
sample out a probability distribution Pc in the space of all covers C = X^{512×512},
X = {0, . . . , 255}. This distribution captures legitimate communication between
Alice and Bob. On the other hand, if Alice and Bob embed secrets in images,
again over a long time period, the images will appear to follow a different dis-
tribution Ps over C. Intuitively, Alice wants to design the stego method while
making sure that Ps is as close to Pc as possible to prevent Eve from discovering
the fact that she communicates secretly with Bob. There exists a fundamental
relationship between the “distance” between Pc and Ps and Eve’s ability to detect
images with steganographic content. This distance can be taken as a measure of
steganographic security and it will impose constraints on the reliability of the
best detector Eve can ever build.
At this point, we ignore the fundamental question of whether it is feasible to
assume that the distributions Pc and Ps can be estimated in practice or even
whether they are appropriate descriptions of the cover-image source. We simply

assume that Eve knows the steganographic channel (Kerckhoffs’ principle) and
thus knows both Pc and Ps . This rather strong assumption is justified because
in real life the prisoners can never be sure how much Eve knows (she may be a
government agency with significant resources, for example) and thus it is pru-
dent to grant her omnipotence. Under this assumption, in Section 6.1 we define
steganographic security as the KL divergence between the distributions of cover
and stego images, Pc and Ps . The importance of this information-theoretic quan-
tity will become apparent later when we show how the KL divergence imposes
fundamental limits on Eve’s detector and how it can be used for comparing
steganographic schemes. In Section 6.2, we discuss several specific examples of
perfectly secure stegosystems and point out an interesting relationship between
perfect security and perfect compression. Section 6.3 investigates secure stegosys-
tems under the condition that the embedding distortion introduced by Alice is
limited (the case of a so-called distortion-limited embedder). In the same section,
we also show how certain algorithms originally proposed for robust watermarking
can be modified into secure stegosystems.
Even though the information-theoretic definition of security is well developed
and widely accepted in the steganographic community, there exist important
alternative approaches, which we also mention in this paragraph. Inspired by
the concept of security of public-key cryptosystems, in Section 6.4 we explain a
complexity-theoretic approach to steganographic security, which makes an im-
portant connection between security in steganography and properties of some
common cryptographic primitives, such as one-way functions. This research di-
rection arose due to critique of the information-theoretic definition of security,
which ignores the important issue of computational complexity and Eve’s ability
to actually implement an attack.

6.1 Information-theoretic definition

We now give a mathematically precise form to the thoughts presented above
[35]. Recall from Chapter 4 that every steganographic scheme uses a pair of
embedding and extraction mappings Emb : C × K × M → C, Ext : C × K → M
defined on the sets of all possible covers C, stego keys K, and the set of messages
M. The mappings are assumed to satisfy Ext(Emb(x, k, m), k) = m for any
message m ∈ M, cover x ∈ C, and key k ∈ K. In other words, we will work
under the simplifying assumption that the set of keys and messages is the same
for every cover x. The object obtained as a result of embedding is called the
stego object y = Emb(x, k, m).
Assuming the cover is drawn from C with probability distribution Pc and the
stego key, as well as the message, are drawn according to distributions Pk , Pm
over their corresponding spaces K, M, the distribution of stego objects will be
denoted as Ps . Note that, even though for steganography by cover selection or
cover synthesis the stego object y does not have to be obtained as a modification

of some x ∈ C, the steganographic method will still generate some distribution


Ps over y ∈ C.

6.1.1 KL divergence as a measure of security


We intuitively expect that if Pc is “close” to Ps , then Eve must make erroneous
decisions increasingly more often. Given an object x, Eve must decide between
two hypotheses: H0 , which represents the hypothesis that x does not contain a
hidden message, and H1 , which stands for the hypothesis that x does contain
a hidden message. Under hypothesis H0 , the observation x is drawn from the
distribution Pc , x ∼ Pc . Conversely, under H1 , x ∼ Ps .
The two distributions can be compared using their Kullback–Leibler divergence,
also called KL distance or relative entropy (see Appendix B and [50]),

DKL(Pc||Ps) = Σ_{x∈C} Pc(x) log (Pc(x)/Ps(x)),   (6.1)

which is a fundamental concept from information theory measuring how different


the two distributions are. Here, the log can be to the base 2, in which case the
KL divergence is measured in bits, or it could be the natural logarithm and the
unit is a “nat.”
When DKL(Pc||Ps) = 0, we call Alice's stegosystem perfectly secure (undetectable)
because in this case the distribution of the stego objects Ps created by
Alice is identical to the distribution Pc assumed by Eve. Thus, it is impossible
for Eve to distinguish between covers and stego objects. If DKL(Pc||Ps) ≤ ε,
then we say the steganographic system is ε-secure.

To better understand what is meant by ε-security, consider that Eve has a stego
detector, which is a mapping F : C → {0, 1}. The response of Eve’s detector is
binary – it answers either 0 for cover or 1 for stego. The detector can make two
types of error. The first type of error is a false alarm (false positive), which occurs
when Eve decides that a hidden message is present when in fact it is absent. The
second type of error is a missed detection (false negative) which occurs when Eve
decides that a hidden message is absent, when in fact it is present. Let PFA and
PMD denote the probabilities of false alarm and missed detection, respectively.
Assuming the detector is fed only covers distributed according to Pc , Eve will
decide 0 or 1 with probabilities pc (0) = 1 − PFA and pc (1) = PFA , respectively.
On stego objects distributed according to Ps , Eve’s detector assigns 0 and 1 with
probabilities ps (0) = PMD and ps (1) = 1 − PMD , respectively. The KL divergence
between the two distributions of Eve’s detector

pc = (pc (0), pc (1)) = (1 − pc (1), pc (1)) , (6.2)


ps = (ps (0), ps (1)) = (ps (0), 1 − ps (0)) (6.3)

is given by (6.1) with C = {0, 1}, pc(1) = PFA, and ps(0) = PMD,

dKL(PFA, PMD) ≜ DKL(pc||ps)   (6.4)
    = (1 − PFA) log ((1 − PFA)/PMD) + PFA log (PFA/(1 − PMD)).   (6.5)
Because Eve’s detector is a type of processing and processing cannot increase
the KL divergence (Proposition B.6 in Appendix B), for an ε-secure stegosystem
we must have

dKL(PFA, PMD) ≤ DKL(Pc||Ps) ≤ ε.   (6.6)
This inequality imposes a fundamental limit on the performance of any detec-
tor Eve can build. Requiring that the probability of false alarm for Eve’s detector
cannot be larger than some fixed value PFA , 0 < PFA < 1, the smallest possible
probability of missed detection she can achieve is
PMD(PFA) = arg min_{PMD∈[0,1]} {PMD | dKL(PFA, PMD) ≤ ε},   (6.7)

where the minimum is taken over all detectors whose probability of false alarm
is PFA . Figure 6.1 shows the probability of detection of a stego object
PD (PFA ) = 1 − PMD (PFA ) (6.8)
as a function of PFA for various values of ε. The curves are called Receiver-
Operating-Characteristic (ROC) curves (see Appendix D for the definition). The
region under each curve is the range within which the performance of Eve’s
detector must fall for a given value of ε. Note that Eve cannot minimize both
errors at the same time as there appears to be a trade-off between the two types
of error. In particular, if Eve is not allowed to falsely accuse Alice or Bob of using
steganography, PFA = 0, Eve’s detector will fail to detect ε-secure steganographic
communication with probability at least

PMD ≥ e^{−ε}.   (6.9)

This can be easily seen by setting PFA = 0 in (6.6). Thus, the smaller ε is, or
the closer the two distributions, Pc and Ps , are, the greater the likelihood that
a covert communication will not be detected. This motivates choosing the KL
divergence as a measure of steganographic security.
It is also instructive to inspect what kind of detector Eve obtains for secure
stegosystems with ε = 0. Because the KL divergence is always non-negative,
from (6.6) we have that dKL(PFA, PMD) = 0. As shown in Appendix B, this can
happen only if distributions pc and ps are the same, or PMD = 1 − PFA . A detec-
tor whose false alarms and missed detections satisfy this relationship amounts
to a detector that is just randomly guessing. To see this, imagine the following
family of detectors parametrized by p ∈ [0, 1]:

Fp(x) = 1 with probability p,
        0 with probability 1 − p.   (6.10)

Figure 6.1 Probability of detection PD of a stego image as a function of false alarms,
PFA. The ROC curves correspond to ε = 0.1, 0.2, . . . , 1, with ε = 1 corresponding to
the top curve and ε = 0.1 to the bottom curve.

When a cover image is sent to this detector, Fp flips a biased coin and decides
“stego” with probability p (the false-alarm probability is PFA = p). On the other
hand, presenting the detector with a stego object, it is detected as cover with
probability 1 − p (the missed detection rate is PMD = 1 − p). Thus, this randomly
guessing detector satisfies PMD = 1 − PFA .

6.1.2 KL divergence for benchmarking


It is of great interest to Alice and Bob to minimize the chance that their secret
communication will be detected by Eve. The prisoners would thus prefer to use
the most secure stegosystem currently available. For this, they need some means
of comparing security of stegosystems.
In this chapter, we defined the security of a stegosystem as the KL diver-
gence between the distributions of cover and stego objects. It is thus tempting
to use it for comparing (benchmarking) steganographic algorithms in the following
manner. Given two stegosystems, S^(1) and S^(2), that share the same
set of covers, we would say that S^(1) is more secure than S^(2) whenever
DKL(Pc||Ps^(1)) ≤ DKL(Pc||Ps^(2)), where Ps^(1) and Ps^(2) are the distributions of
their stego objects. This approach to benchmarking would be justified if a larger
KL divergence implied the existence of a better steganography detector. This is
asymptotically correct due to the Chernoff–Stein lemma (Appendix B) in the
limit when n, the number of objects exchanged by Alice and Bob, approaches
infinity. The lemma says that if Eve imposes a bound on the probability of false
alarms, PFA , the best detector she can build will misclassify n observed stego
objects as covers with probability PMD satisfying

lim_{n→∞} (1/n) log PMD(PFA) = −DKL(Pc||Ps). (6.11)

Alternatively, one could also say that the probability of a miss decays exponentially with n, PMD(PFA) ≈ e^{−n DKL(Pc||Ps)}.
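The exponential decay predicted by the Chernoff–Stein lemma can be illustrated numerically. The sketch below (a toy setup of our own, not from the text) models cover and stego objects as iid bits with biases p0 = 0.5 and p1 = 0.7, builds the optimal count-based Neyman–Pearson test with PFA ≤ 0.1, and evaluates the exact miss exponent −(1/n) log PMD; as n grows, the exponent creeps up toward DKL(Pc||Ps):

```python
import math

def log_pmf(n, k, p):
    """Log of the Binomial(n, p) pmf at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def miss_exponent(n, p0, p1, alpha):
    """-(1/n) log P_MD of the optimal count-based test with P_FA <= alpha."""
    pmf0 = [math.exp(log_pmf(n, k, p0)) for k in range(n + 1)]
    # Decide "stego" when the number of ones is >= t; pick the smallest t
    # whose false-alarm probability sum(pmf0[t:]) still stays below alpha.
    t, tail = n + 1, 0.0
    while t > 0 and tail + pmf0[t - 1] <= alpha:
        tail += pmf0[t - 1]
        t -= 1
    # P_MD = P(count < t) under p1, summed in the log domain to avoid underflow.
    logs = [log_pmf(n, k, p1) for k in range(t)]
    m = max(logs)
    log_pmd = m + math.log(sum(math.exp(l - m) for l in logs))
    return -log_pmd / n

p0, p1 = 0.5, 0.7    # cover vs. stego bit bias (toy iid model, our choice)
d_kl = p0 * math.log(p0 / p1) + (1 - p0) * math.log((1 - p0) / (1 - p1))
exponents = [miss_exponent(n, p0, p1, alpha=0.1) for n in (500, 2000, 8000)]
# The exponents increase toward d_kl ≈ 0.0872 nats as n grows -- slowly,
# reflecting that the Chernoff-Stein lemma is an asymptotic statement.
```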
We stress that this result holds only in the limit of a large number of obser-
vations, n. For a finite number of observations, larger KL divergence does not
necessarily imply the existence of a better detector. Exercise 6.3 shows an ex-
ample of two families of distributions, gN and hN on C = {1, . . . , N }, for which
DKL (gN ||hN ) = N and for which any detector based on one observation will al-
ways have PD − PFA ≤ δN , where δN → 0. In other words, despite the fact that
the KL divergence between distributions gN and hN grows to infinity with N ,
our ability to decide between them on the basis of a single observation diminishes
to random guessing.
Benchmarking steganographic systems using their KL divergence as hinted
above is, however, still not without problems. Ignoring for now the issue of nu-
merically evaluating the KL divergence for a real stegosystem, the KL divergence
is a function of the relative payload (or, more accurately, the distribution of the
change rate β). Thus, it is conceivable that two stegosystems compare differently
for different payloads.
On the other hand, if the prisoners wish to stay undetected, with time they
must start embedding smaller and smaller payloads, otherwise Eve would detect
the covert communication with certainty. To see this, imagine that the prisoners
communicate using change rate bounded from below, β ≥ β0 > 0. Then, the KL
divergence between cover and stego objects would be bounded from below by
DKL (Pc ||Pβ0 ).1 Invoking the Chernoff–Stein lemma again, Eve’s detector would
thus achieve an arbitrarily small probability of missed detection, PMD , at any
bound on false alarms PFA . In other words, the prisoners would be caught with
probability approaching 1.
Thus, it makes sense to define the steganography benchmark by properties of
Eve’s best detector in the limit of β → 0. Let us assume for simplicity that covers
are represented by n iid realizations x[i], i = 1, . . . , n, of a scalar random variable
x with pmf Pc ≡ P0 . If the embedding modifications are also independent of each
other, the stego object is a sequence of iid realizations that follow distribution
Pβ , where β is the change rate. In the simplest case, Eve knows the payload and

1 Pβ0 stands for the distribution of stego objects modified with change rate β0 .

solves the following simple binary hypothesis-testing problem:


H0 : β = 0, (6.12)
H1 : β > 0 known. (6.13)
The optimal detector for this problem is the likelihood-ratio test (see Ap-
pendix D, equation (D.10))

Lβ(x) = Σ_{i=1}^{n} log [Pβ(x[i]) / P0(x[i])]. (6.14)

It is shown in Section D.2 that for small β and large n, Lβ (x)/n approaches the
Gaussian distribution
  
(1/n) Lβ(x) ∼ N(−(1/2)β²I(0), (1/n)β²I(0)) under H0, and N((1/2)β²I(0), (1/n)β²I(0)) under H1, (6.15)
where I(β) is the Fisher information for one observation (see Section D.2 for
more details about Fisher information),
I(β) = Σ_x (1/Pβ(x)) (∂Pβ(x)/∂β)². (6.16)

The connection between Fisher information and KL divergence is seen through the following equation, which holds up to second order in β (see Proposition B.7):
DKL(P0||Pβ) = DKL(Pβ||P0) = (β²/2) I(0). (6.17)
Because of the Gaussian character of the detection statistic, the performance of
Eve’s optimal steganography detector is completely described using the deflection
coefficient (see Section D.1.2)
d² = [(1/2)β²I(0) + (1/2)β²I(0)]² / (β²I(0)/n) = nβ²I(0). (6.18)
We note that a larger deflection coefficient means a more accurate detector and
thus a less secure stegosystem. Therefore, we can conclude that for small β
stegosystems with larger Fisher information I(0) are more detectable than those
with smaller I(0), which makes this quantity useful for benchmarking. Moreover,
through the Cramer–Rao lower bound (Section D.6), I(0) imposes a lower bound
on the variance of any unbiased estimator of the change rate β ≈ 0 from an n-
element stego object,
Var[β̂] ≥ 1/(nI(0)). (6.19)
In Exercises 6.4 and 6.5, the reader is guided to derive the Fisher information,
I(0), for LSB embedding and ±1 embedding in iid cover sources.
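For a concrete feel for these quantities, the following sketch evaluates I(0) for LSB flipping with change rate β on a small toy pmf of our own choosing (it is not one of the book's exercises) and checks numerically that DKL(P0||Pβ) ≈ (β²/2)I(0), per (6.17):

```python
import math

P0 = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}   # toy cover pmf (our choice)

def flip(x):
    return x ^ 1                          # LSB flip pairs 0<->1 and 2<->3

def P_beta(beta):
    """Stego pmf when each element's LSB is flipped with probability beta."""
    return {x: (1 - beta) * P0[x] + beta * P0[flip(x)] for x in P0}

def kl(P, Q):
    return sum(P[x] * math.log(P[x] / Q[x]) for x in P)

# I(0) = sum_x (dP_beta(x)/dbeta at beta = 0)^2 / P0(x); here the
# derivative with respect to beta is simply P0(flip(x)) - P0(x).
I0 = sum((P0[flip(x)] - P0[x]) ** 2 / P0[x] for x in P0)

beta = 0.01
ratio = kl(P0, P_beta(beta)) / (beta ** 2 * I0 / 2)
# ratio ≈ 1, confirming (6.17) up to second order in beta.
```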
In this section, we worked under the assumption that the cover was an iid
sequence. Fisher information can also be used for benchmarking stegosystems

with covers exhibiting dependences in the form of a Markov chain [69, 70]. For
such stegosystems, similar limiting behavior for small change rates can be estab-
lished for the KL divergence between the Markov chain of covers and the hidden
Markov chain of stego objects.2
Fisher information could potentially be used for benchmarking stegosystems
with real-life cover sources, such as digital images. Here, there is little hope,
however, that we could compute it analytically due to the difficulty of modeling
images. A plausible practical option is to model images using numerical features,
such as features used in blind steganalysis (see Chapter 12), and compute the
Fisher information experimentally from a database of cover and stego images
embedded with varying change rate. This is currently an active area of research
and the interested reader is referred to [136, 191] and the references therein.

6.2 Perfectly secure steganography

In the previous section, we introduced the concept of a perfectly secure stegosystem by demanding that such systems preserve the distribution of covers. At this
point, it is not clear whether non-trivial secure systems indeed exist. Thus, in
this section we describe several examples of secure and ε-secure stegosystems.
Other examples can be found in Section 6.3. We also point out the relationship
between perfect security and perfect compression.
Consider the following simple one-time-pad steganographic system with C = X^n, X = {0, 1}, and Pc the uniform distribution on X^n [35]. Given a secret message m ∈ X^n, Alice selects k ∈ K = X^n at random and synthesizes the stego object as the XOR (eXclusive OR) y = k ⊕ m. The message is extracted by Bob
as m = k ⊕ y. Obviously, Alice and Bob need to preagree on the set of secret
keys and their selection prior to starting the communication. The stego objects
will again follow a uniform distribution on C. This system is, however, not very
useful because if random bit strings were commonly exchanged there would be no
need for steganography as secrecy could be simply achieved using cryptographic
means.
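The one-time-pad stegosystem is a few lines of code. The sketch below (names our own) shows embedding and extraction; it relies on the fact that XOR with an independent uniform key is an involution that preserves the uniform distribution on X^n:

```python
import secrets

def embed(key, message):
    """Stego object: bitwise XOR of the n-bit key and the n-bit message."""
    return [k ^ m for k, m in zip(key, message)]

def extract(key, stego):
    return [k ^ y for k, y in zip(key, stego)]

n = 64
key = [secrets.randbelow(2) for _ in range(n)]
message = [secrets.randbelow(2) for _ in range(n)]
stego = embed(key, message)
assert extract(key, stego) == message
# Since key is uniform on {0,1}^n and independent of message, stego is also
# uniform on {0,1}^n -- exactly the cover distribution, hence Ps = Pc.
```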
For general cover sources, an ε-secure stegosystem can be obtained using the
principle of cover selection in the following manner [35]. We first split the set of
all covers, C, into two subsets of approximately the same probability, C = C0 ∪ C1 ,
Pc (C0 ) ≈ Pc (C1 ). For example, two such sets can be found as

 
C0 = arg min_{C′⊂C} |Pc(C′) − Pc(C − C′)|, C1 = C − C0. (6.20)

2 Steganographic embedding into a Markov chain produces stego objects that are no longer
Markov and instead form a hidden Markov chain (see, e.g., [213]).
Steganographic security 89

Let Pc,0 (x) be the distribution Pc constrained on C0 ,



Pc,0(x) = Pc(x)/Pc(C0) for x ∈ C0, and Pc,0(x) = 0 otherwise. (6.21)

Similarly, let Pc,1 be Pc constrained on C1 . If Alice wants to send bit m, she


simply generates a cover using Pc,m and sends it to Bob. Bob shares with Alice
the breakup of C into C0 ∪ C1 and extracts the bit by noting in which subset the
stego object lies. Unless Pc (C0 ) = Pc (C1 ), this stegosystem will not be perfectly
secure but only ε-secure. We now compute the KL divergence between Pc and
the stego distribution Ps .
Let δ = Pc (C0 ) − Pc (C1 ). Then, because Pc (C0 ) + Pc (C1 ) = 1, we have Pc (C0 ) =
(1 + δ)/2 and Pc (C1 ) = (1 − δ)/2. Thus, the probability that x ∈ C0 is sent as
a stego object is Ps(x) = (1/2)Pc,0(x) = (1/2)Pc(x)/Pc(C0) = Pc(x)/(1 + δ). Similarly,
Ps (x) = Pc (x)/(1 − δ) for x ∈ C1 . Thus, we can write for the KL divergence
 
DKL(Pc||Ps) = Σ_{x∈C0} Pc(x) log(1 + δ) + Σ_{x∈C1} Pc(x) log(1 − δ) (6.22)
= Pc(C0) log(1 + δ) + Pc(C1) log(1 − δ) (6.23)
= ((1 + δ)/2) log(1 + δ) + ((1 − δ)/2) log(1 − δ) (6.24)
≤ ((1 + δ)/2) δ − ((1 − δ)/2) δ = δ². (6.25)
Here, we used the log inequality log(1 + x) ≤ x, valid for all x > −1. Thus, the stegosystem is δ²-secure.
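The δ²-security bound (6.25) is easy to verify numerically. The sketch below uses a six-element toy cover source of our own choosing, splits it into C0 and C1, builds the stego distribution Ps, and evaluates the KL divergence:

```python
import math

# Toy cover source on six covers (pmf of our choosing).
Pc = {'a': 0.30, 'b': 0.25, 'c': 0.20, 'd': 0.15, 'e': 0.06, 'f': 0.04}

C0 = {'a', 'c', 'e'}                 # one possible split; C1 is the rest
p0 = sum(Pc[x] for x in C0)          # Pc(C0) = 0.56
delta = 2 * p0 - 1                   # delta = Pc(C0) - Pc(C1) = 0.12

# Message bits 0 and 1 are sent with probability 1/2 each, the cover being
# drawn from Pc restricted to C0 or C1, respectively.
Ps = {x: 0.5 * Pc[x] / (p0 if x in C0 else 1 - p0) for x in Pc}

kl = sum(Pc[x] * math.log(Pc[x] / Ps[x]) for x in Pc)
# kl ≈ 0.0072 <= delta**2 = 0.0144, in line with (6.22)-(6.25).
```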
This construction can be generalized by splitting C into more than two subsets,
which would allow Alice to communicate log2 k bits if k subsets were used. A vari-
ant of this method in which C0 and C1 form two interleaved lattices can be used
to design ε-secure stegosystems with a distortion-limited embedder, as investigated
in Section 6.3.

6.2.1 Perfect security and compression


If Alice could somehow arrange that her stego objects follow the cover dis-
tribution Pc , her stego scheme would be perfectly secure. In this case, Al-
ice would be on average sending H(Pc ) bits in every cover, where H(Pc ) =

−Σ_{x∈C} Pc(x) log Pc(x) is the entropy of the cover source. Alice could theoretically construct such a steganographic method using the principle of embedding
by cover synthesis [3] (Chapter 4). The reader is encouraged to read Section B.3
to better appreciate the following arguments.
Let us assume that Alice has a perfect prefix-free compression scheme. Here,
by perfect we understand that the covers can be on average encoded using H(Pc )
bits. Alice feeds her encrypted (and thus “random”) message bits into the decom-
pressor one-by-one and, as soon as she obtains a codeword, she simply sends to

Bob the object from C encoded by that codeword. Then, she continues with
the remainder of the message bits till the complete message has been sent. Bob
reads the secret message bits by compressing Alice’s objects using the same com-
pression scheme and concatenating the bit strings. Because Alice uses a perfect
prefix-free compressor, she will be sending objects that exactly follow the re-
quired distribution Pc and thus Ps = Pc , which means that the steganographic
method is perfectly secure. Note that the average steganographic capacity over
all covers is the entropy H(Pc ).
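The decompression trick can be illustrated with a toy prefix-free code. The sketch below (our own example, not from the text) uses the perfect code {A: 0, B: 10, C: 11} for Pc(A) = 1/2, Pc(B) = Pc(C) = 1/4; uniform message bits fed to the decoder produce covers distributed exactly according to Pc, and Bob recovers the bits by re-encoding:

```python
import random

# A perfect prefix-free code for a toy cover source with Pc(A) = 1/2 and
# Pc(B) = Pc(C) = 1/4 (codeword lengths match -log2 Pc exactly).
code = {'A': '0', 'B': '10', 'C': '11'}
decode = {v: k for k, v in code.items()}

def embed(message_bits):
    """Run the decompressor on the (encrypted, hence uniform) message bits."""
    covers, buf = [], ''
    for b in message_bits:
        buf += b
        if buf in decode:
            covers.append(decode[buf])
            buf = ''
    return covers   # at most one trailing message bit may be left unsent

def extract(covers):
    """Bob compresses the received objects to recover the message bits."""
    return ''.join(code[c] for c in covers)

rng = random.Random(7)
bits = ''.join(rng.choice('01') for _ in range(10_000))
covers = embed(bits)
freq_A = covers.count('A') / len(covers)
# freq_A ≈ 1/2 = Pc(A): uniform bits drive the decoder to sample from Pc.
```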
This stegosystem is rather academic because the complexity and dimensional-
ity of covers formed by digital media objects, such as images, will prevent us from
determining even a rough approximation to the distribution Pc . It is, however,
possible to realize this idea within a sufficiently simple model of covers (see Sec-
tion 7.1.2 on model-based steganography). The scheme also tells us something
quite fundamental. In Chapter 3, we learned that the process of image acqui-
sition using imaging sensors is influenced by multiple sources of imperfections
and noise. Some of the noise sources, such as the shot noise, are truly random
phenomena caused by the quantum properties of light.3 Thus, in the absence of
other imperfections, even if we took multiple images of exactly the same scene
with identical camera settings, we would obtain slightly different images. The
images could be described as a superposition of the true scene S[i] and a two-
dimensional field of some (not necessarily independent or identically distributed)
random variables η[i],

x[i] = S[i] + η[i], i = 1, . . . , n, (6.26)

where n is the number of pixels in the image. The variables η[i] will be locally
dependent and also dependent on S due to demosaicking, in-camera processing,
and artifacts, such as blooming, and possibly due to JPEG compression. Because
the dependence is only local, η[i] and η[j] will be independent as long as the
distance between the pixels i and j is larger than some fixed threshold deter-
mined by the physical phenomena inside the sensor and the character of the
dependences. If we model the dependences as a Markov Random Field (MRF),
it is known that the entropy of a MRF increases linearly with the number of
random variables [50]. Thus, with increasing number of pixels, n, the entropy of
η will be proportional to the number of pixels,

H(η) = O(n). (6.27)

We can thus conclude that fundamentally the steganographic capacity of images obtained using an imaging sensor increases linearly with the number of
pixels. The reader is encouraged to compare this finding with the results from
Chapter 13, which is dedicated to the issue of steganographic capacity.

3 Experiments convincingly violating Bell’s inequalities [8] showed that quantum mechanics is
free of “hidden” parameters and is thus indeterministic.

This thought experiment also tells us that with every digital image, x, acquired
by a sensor, there is a large cluster of natural images that slightly differ from
x in their noise components. And the size of this cluster increases exponentially
with the number of pixels in the image. Essentially all steganographic schemes
try, in one way or another, to reach into this image cluster by following certain
design principles and elements as discussed in Chapters 7–9.

6.2.2 Perfect security with respect to model


Constructing secure steganographic schemes for digital-media objects, such as
digital images, is not an easy task. The problem is that very little is typically
known about the distribution Pc , which describes the cover source. In the clas-
sical prisoners’ scenario, if Alice hides messages in images taken with her own
digital camera, the only legitimate images she can send are those taken from her
prison cell with her camera, potentially processed in the computer (e.g., cropped,
compressed, filtered, etc.). In this case, the cover source is rather narrow but still
too complex to model it analytically. Due to the very high dimensionality of im-
ages, it is equally impossible to even obtain an accurate sampled version of Pc .
To make this rather complex problem more manageable, one usually adopts
a model for the cover source and proves statistical undetectability within the
model. (This would force Eve to use a better model to detect the hidden data.)
A common simplification is to model the statistics of individual pixels (or DCT
coefficients) in the image rather than the statistics of images as a whole. For
example, images represented in the transform domain using wavelet, Fourier,
or DCT coefficients can be reasonably well modeled as a sequence of iid ran-
dom variables because the transform coefficients are largely decorrelated.4 Thus,
an often-made assumption is to model the cover image x as an iid sequence
x[i] ∼ f (x; θ), where f is the probability mass function that depends on a vector
parameter θ. For example, the distribution of wavelet coefficients is often mod-
eled using the generalized Gaussian variable or the generalized Cauchy variable
(Section A.8). If the value of θ is the same across all cover images, we speak
of a homogeneous cover source. A more realistic cover-source model that allows
dependences among neighboring pixels is obtained by modeling x[i] as a Markov
chain [70, 71, 225]. In this case, θ would be the transition-probability matrix.
An even more realistic cover model is the heterogeneous [22] model, in which θ
depends on the image. Imposing a probabilistic model on θ leads to a mixture
model for the entire cover source. One of the few theoretical steganalysis studies
that use this mixture model is [148].
The power of the models described in the previous paragraph is that one
can obtain their sample approximations from a single image (for homogeneous

4 For block transforms, such as the DCT in JPEG compression, one can use a more general
model and consider the image as 64 parallel channels, each modeled as an iid sequence with
a different distribution.

models) or a large image database to also estimate the distribution of θ for het-
erogeneous models. Adopting this approach, Alice can then guarantee that her
steganography will be undetectable within the model if her embedding preserves
the essential statistics. For an iid source, she needs to preserve the histogram,
while for the Markov model, she needs to preserve the transition-probability ma-
trix and the Markov property. This approach to steganography enables exact
mathematical proofs of security as well as progressive, methodological improve-
ment. There is some hope that with advances in statistical modeling of images,
this approach will lead to secure stegosystems.
So far, however, all steganographic schemes for digital images that follow this
paradigm (see the methods in Chapter 7) have been broken. This is because all
that Eve needs to do to attack a stegosystem that is provably secure within a
given model is to use a better model of covers that is not preserved by embed-
ding. Often, it is sufficient to merely identify a single statistical quantity that is
predictably perturbed by embedding, which is usually not very hard given the
complexity of typical digital media. We discuss these issues from the point of
view of steganalysis in Chapters 10–12.

6.3 Secure stegosystems with limited embedding distortion

Note that the information-theoretic definition of steganographic security described in Section 6.1 is not concerned with fidelity. The embedding distortion
can be arbitrary as long as the source of stego objects is statistically indis-
tinguishable from the cover source. Indeed, the perfectly secure systems from
Section 6.2 are based on the paradigms of steganography by cover synthesis and
cover selection rather than by cover modification and thus lack the concept of
embedding distortion. However, in steganography based on cover modification,
which is the paradigm for most practical steganographic schemes, Alice starts
with a specific cover object and then modifies it to embed the secret message.
Here, care is usually being taken to keep the embedding distortion low to pre-
vent introducing atypical statistical characteristics into the stego object. It thus
makes sense to consider steganographic security when Alice has to comply with
a bound on embedding distortion. (In this case, she is called a distortion-limited
embedder.) The idea to study the security of such stegosystems appeared for the
first time in [179] by drawing an analogy to the methodology employed in robust
watermarking.
In this section, we formulate secure steganography with a distortion-limited
embedder and we do so for the more general case of an active warden (the
passive-warden scenario is a special case of this formulation) [180, 235]. We also
give examples of provably secure stegosystems for some simple models of the
cover source. Although the constructions in this chapter are more important from
the theoretical point of view and so far have not produced secure steganographic

Figure 6.2 A diagram of steganographic communication with an embedder whose
distortion is limited by D1 and noisy channel A(·|·) with distortion limited by D2 .

schemes that would work for real digital media, the results provide quite valuable
fundamental insight.
Covers will be represented as sequences of n elements from some alphabet X ,
x ∈ C = X n . The steganographic channel, captured with the embedding mapping
Emb, extraction mapping Ext, source of covers, messages, and secret keys, is now
augmented with two more requirements that bound the expected embedding
distortion and the expected channel distortion that the stego object experiences
on its way from Alice to Bob,
  
E[d(x, y)] = Σ_x Pc(x) Σ_k Pk(k) Σ_m Pm(m) d(x, Emb(x, k, m)) ≤ D1, (6.28)
Σ_{y,ỹ} Ps(y) A(ỹ|y) d(ỹ, y) ≤ D2, (6.29)

where d(x, y) is some measure of distortion per cover element, such as the energy (1/n) d2(x, y)² = (1/n) Σ_{i=1}^{n} (x[i] − y[i])². The matrix A(ỹ|y) captures the probabilistic
active warden (noisy channel) and stands for the conditional probability that the
noisy stego object, ỹ, is received by Bob when stego object y was sent by Alice.
The scalar values, D1 and D2 , are the bounds on the per-element embedding
and channel distortion. Again, the stegosystem is considered secure if Pc = Ps .
A diagram showing the main elements of the communication channel is displayed
in Figure 6.2.
Secure stegosystems with distortion-limited embedder exist and two examples,
for the passive-warden case, are given below.5

6.3.1 Spread-spectrum steganography


Let us assume that covers are sequences of n iid Gaussian random variables,
x[i] ∼ N(0, σc²), i = 1, . . . , n. Alice seeds a cryptographically secure PRNG with her secret stego key k and generates n iid realizations w[i],
i = 1, . . . , n, of a Gaussian random variable N (0, Dw ). Then, Alice embeds one
secret message bit m ∈ {0, 1} by either adding w or −w to the cover,

Emb(x, k, m) = γx + (2m − 1)w = y (6.30)

5 These examples appeared in [235].



for some scalar γ. We now need to determine γ and Dw so that the cover model
is preserved and the embedding distortion is below D1 .
Because w is independent of x, the stego object y is a sequence of iid Gaussian
random variables N (0, γ 2 σc2 + Dw ). To obtain a perfectly secure stegosystem, we
need to preserve the Gaussian model of the cover source. Because y is a zero-mean
Gaussian signal, we must preserve the variance γ 2 σc2 + Dw = σc2 . The expected
value of the embedding distortion is
E[(1/n) d2(x, y)²] = (1/n) E[Σ_{i=1}^{n} (x[i] − y[i])²] (6.31)
= (1/n) E[Σ_{i=1}^{n} ((1 − γ)x[i] − (2m − 1)w[i])²] (6.32)
= (1 − γ)²σc² + Dw, (6.33)


which, after substituting Dw = σc²(1 − γ²) from the variance constraint, will be bounded by D1 when
σc² − 2γσc² + γ²σc² + Dw = 2σc²(1 − γ) ≤ D1, (6.34)
or
γ ≥ 1 − D1/(2σc²). (6.35)
Thus, the parameters γ and Dw that provide perfect security are
γ = 1 − D1/(2σc²), (6.36)
Dw = σc²(1 − γ²) = D1 (1 − D1/(4σc²)). (6.37)
To finish the description of the stegosystem, we need to describe the extraction
mapping Ext. Bob first generates w using the stego key, computes the correlation

ρ = (1/n) Σ_{i=1}^{n} y[i]w[i], and extracts the message bit using the following rule:
ρ > 0 ⇒ m = 1, (6.38)
ρ ≤ 0 ⇒ m = 0. (6.39)
As shown in Section D.1.2, correlation is the likelihood-ratio test for Bob’s simple
binary hypothesis test
H0 : m = 0 (6.40)
H1 : m = 1 (6.41)
given the observed y[i], i = 1, . . . , n. The test statistic ρ follows the Gaussian

distribution with mean (2m − 1)D1 and variance proportional to 1/n. Thus,
with increasing n, the message bit will be correctly extracted with probability
approaching 1.
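A compact simulation of this spread-spectrum scheme follows (toy parameters of our own choosing). It checks that with γ and Dw chosen per (6.36)-(6.37), the stego sequence has variance σc² (the cover model is preserved), the embedding distortion meets the budget D1, and the message bit is recovered from the correlation ρ:

```python
import math, random

def embed(x, w, m, gamma):
    """y = gamma*x + (2m - 1)*w, as in (6.30)."""
    return [gamma * xi + (2 * m - 1) * wi for xi, wi in zip(x, w)]

def extract(y, w):
    rho = sum(yi * wi for yi, wi in zip(y, w)) / len(y)
    return 1 if rho > 0 else 0

rng = random.Random(1)
n, sigma_c2, D1 = 10_000, 4.0, 0.5           # toy parameters (our choice)
gamma = 1 - D1 / (2 * sigma_c2)               # per (6.36)
Dw = D1 * (1 - D1 / (4 * sigma_c2))           # per (6.37)

x = [rng.gauss(0, math.sqrt(sigma_c2)) for _ in range(n)]
w = [rng.gauss(0, math.sqrt(Dw)) for _ in range(n)]
y = embed(x, w, m=1, gamma=gamma)

var_y = sum(v * v for v in y) / n                      # ≈ sigma_c2: model kept
dist = sum((a - b) ** 2 for a, b in zip(x, y)) / n     # ≈ D1: budget met
```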
Note that this low-capacity steganographic method is also robust with respect
to channel noise. Let us assume that the stego object is subject to noise and

Bob receives ỹ = y + z, where each z[i] ∼ N (0, D2 ). The distribution of the test
statistic ρ has the same mean but a higher variance. Nevertheless, with increasing

n, its variance is again proportional to 1/n.
Finally, we note that, even though the spread-spectrum embedding method
is perfectly secure with a distortion-limited embedder, it has a low embedding
capacity far below the theoretical steganographic capacity for such a channel [49,
235] (also, see Chapter 13).

6.3.2 Stochastic quantization index modulation


Quantization Index Modulation (QIM) is a data-embedding method originally
proposed for robust watermarking [42]. It can also be used for design of secure
steganographic schemes [49, 235]. The central concept is the notion of an N -
dimensional lattice Λ, which is a set of discrete points in a Euclidean space RN ,
usually spaced in some regular pattern. The lattice defines a quantizer, QΛ , that
maps any point in the space to the closest point from the lattice, QΛ(x) = y, y = arg min_{x′∈Λ} ||x − x′||. The set of points that quantize to the origin is called the Voronoi cell and is defined as
V = {x ∈ RN | ||x|| ≤ ||x − x′|| for all x′ ∈ Λ}, (6.42)
where ||x|| is some norm in RN. In Figure 6.3, left, the Voronoi cell for lattice
Λ◦ is highlighted in gray (in L1 norm).
The QIM works with a family of interleaved lattices indexed by the message
that Alice wants to send to Bob. For each message m, we define a dither vector
dm ∈ RN and a shifted lattice Λm = Λ + dm called the mth coset of Λ. The
union of cosets ∪Λm = Λf is called the fine lattice and it forms the backbone of
the QIM data-embedding method. We will denote the Voronoi cell of the fine
lattice as Vf .
In the simplest, classical version of the QIM, the cover is quantized to the
lattice determined by the message Alice wants to send. Alice divides the cover
into blocks of N elements and quantizes each block x ∈ RN to Λm in order to
embed m in x. Formally, the stego block is y = QΛm (x). The lattice Λ and the
set of dither vectors are shared with Bob, which enables him to extract the
message as the index of the coset to which the stego block belongs. Notice that
this embedding method is robust with respect to small amounts of noise. As long
as the noisy block ỹ does not get out of the shifted Voronoi cell y + Vf that
surrounds y, Bob can quantize the distorted block ỹ to Λf before extracting the
message. By changing the spacing between the points in Λ, one can obtain a
trade-off between the embedding distortion and resistance to channel noise.

Figure 6.3 Left: Lattice Λ◦ (circles) with its Voronoi cell V◦ . Right: The fine lattice
Λ◦ ∪ Λ+ and its Voronoi cell Vf . Note that Λ+ = Λ◦ + (1, 1).

Example 6.1: [Stochastic QIM with two lattices] Figure 6.3 shows an ex-
ample of a regular lattice in R2 ,
Λ = Λ◦ = {(2i, 2j)| i, j ∈ Z} (6.43)
and two dither vectors d◦ = (0, 0), d+ = (1, 1). Note that Λ+ = {(2i + 1, 2j +
1)| i, j ∈ Z}, V◦ = [−1, 1] × [−1, 1], and Vf = {x| |x[1]| + |x[2]| < 1} if we use the
L1 norm. The message that can be sent in each block consisting of two cover
elements is one bit, m ∈ {0, 1}.
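Basic QIM with the two lattices of this example can be sketched as follows (Python, helper names our own): the embedder quantizes a pair of cover elements to the coset selected by the message bit, and the extractor reads off the coset of the nearest fine-lattice point.

```python
def quantize_even(u):
    """Nearest point of the coarse lattice {(2i, 2j)}: round each coordinate
    to the nearest even integer."""
    return tuple(2 * round(c / 2) for c in u)

def embed(x, m):
    """Quantize the cover pair x to the coset Lambda_m = Lambda_o + m*(1, 1)."""
    z = quantize_even((x[0] - m, x[1] - m))
    return (z[0] + m, z[1] + m)

def extract(y):
    """Quantize y to the fine lattice (points (i, j) with i + j even) and
    read the coset: even coordinates -> bit 0, odd coordinates -> bit 1."""
    best = min(((i, j) for i in range(-8, 9) for j in range(-8, 9)
                if (i + j) % 2 == 0),
               key=lambda p: abs(p[0] - y[0]) + abs(p[1] - y[1]))
    return best[0] % 2

y = embed((3.2, -0.7), 1)     # -> (3, -1), a point of Lambda_+
# Channel noise that keeps y inside its fine Voronoi cell leaves the bit intact:
bit = extract((y[0] + 0.3, y[1] - 0.2))
```

The brute-force search over a small grid in extract stands in for proper lattice decoding and works only for inputs within that grid; it keeps the sketch self-contained.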

This simple version of QIM is not suitable for steganography as the stego
images confined to the fine lattice would certainly not follow the distribution of
covers and could easily be identified by Eve as suspicious. Alice needs to spread
the stego objects around the lattice points to preserve the distribution. This
is the idea behind the stochastic QIM [235]. For now, we will assume that the
covers are x ∈ RN rather than considering them broken up into disjoint blocks
of N elements each.
The Euclidean space is divided into M regions that are translates of each
other,
Rm = Λm + Vf , 1 ≤ m ≤ M. (6.44)
Note that Rm is the region of all noisy stego objects that carry the same mes-
sage, m. Let p[m] = Pc (Rm ) be the probability of a cover being in Rm . Let
us further assume that the probability that message m is sent is exactly p[m].
This could be arranged, for example, by prebiasing the stream of originally uni-
formly distributed message symbols using an entropy decoder (see more details
in Section 7.1.2 on model-based steganography).
We now define the embedding function Emb(x, m) for any cover x ∈ RN and
message m. Alice first identifies z ∈ Λm that is closest to x. If x already belongs
to Rm , no embedding change is required and Alice simply sends y = x. In the
Steganographic security 97

opposite case, Alice generates y randomly from the shifted Voronoi cell z + Vf
according to the distribution Pc constrained to the cell z + Vf , Pc (·)/Pc (z + Vf ).
Note that the bound on embedding distortion, D1 , determines how close to each
other the lattice points must be.
It should be clear that the stego object y ends up in Rm with probability p[m].
Also, within each cell, the probability distribution has the correct shape. The
only question is with what probability y appears in a given cell, or Pr{y ∈ z +
Vf }. It can be shown that for a well-behaved Pc , Pr{y ∈ z + Vf } ≈ Pc (z + Vf ).
More accurately, for any ε > 0, |Pr{y ∈ z + Vf} − Pc(z + Vf)| < ε for sufficiently
small D1 . Rather than proving this for the general case, we provide a qualitative
argument for the simple case of the fine lattice shown in Figure 6.3, right. It
shall be clear that the argumentation applies to more general cases.
Let us compute the probability of y ∈ z + Vf . In Figure 6.4, z + Vf is the
diamond. Since p[0] = p[1] = 1/2, Pr{y ∈ z + Vf} is 1/2 times the probability that x falls into the cell directly plus 1/2 times the probability that we end up in one of
the four triangles and move to the cell. Denoting the union of the four triangular
regions as T , we have
Ps(z + Vf) = Pr{y ∈ z + Vf} = (1/2) Pc(z + Vf) + (1/2) Pc(T). (6.45)
If the distribution Pc were locally linear around z, we would have the equality
Ps (z + Vf ) = Pr{y ∈ z + Vf } = Pc (z + Vf ). For general but well-behaved Pc ,6 the
difference between the probabilities Ps (z + Vf ) and Pc (z + Vf ) can be made ar-
bitrarily small for sufficiently small D1 . (Recall that D1 determines the spacing
between the points in the lattice.) In summary, we will have |Pc(x) − Ps(x)| < ε for all x for sufficiently small D1. Assuming Pc(x) > κ > 0 whenever Pc(x) ≠ 0 (which will be satisfied for any finite cover set C), we can use the result of Exercise 6.2 to claim that the KL divergence DKL(Pc||Ps) < ε/κ.
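A minimal sketch of stochastic QIM for the two-lattice example is given below, under the simplifying assumption that Pc is uniform (so that drawing y from Pc restricted to a cell reduces to rejection sampling from the diamond); all names are our own:

```python
import random

def nearest_coset_point(x, m):
    """Nearest point (in the L1 norm) of Lambda_m = {(2i + m, 2j + m)}."""
    return tuple(2 * round((c - m) / 2) + m for c in x)

def in_cell(x, z):
    """Is x inside the L1 Voronoi cell of the fine lattice centered at z?"""
    return abs(x[0] - z[0]) + abs(x[1] - z[1]) < 1

def embed(x, m, rng):
    z = nearest_coset_point(x, m)
    if in_cell(x, z):
        return x          # x already lies in R_m: send it unchanged
    # Otherwise draw y from Pc restricted to z + V_f; for our uniform Pc
    # this is rejection sampling from the diamond around z.
    while True:
        y = (z[0] + rng.uniform(-1, 1), z[1] + rng.uniform(-1, 1))
        if in_cell(y, z):
            return y

def extract(y):
    d = [abs(y[0] - z[0]) + abs(y[1] - z[1])
         for z in (nearest_coset_point(y, 0), nearest_coset_point(y, 1))]
    return 0 if d[0] <= d[1] else 1

rng = random.Random(3)
ok = all(extract(embed((rng.uniform(-4, 4), rng.uniform(-4, 4)), m, rng)) == m
         for m in (0, 1) for _ in range(500))
# ok is True: every embedded pair lands in R_m, so the bit is recovered.
```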

6.3.3 Further reading


Steganography with distortion-limited embedder in the active-warden scenario
has been investigated in [49, 180, 235, 237]. Here, we provide only a brief sum-
mary of the achievements while referring the reader to technical papers for more
information.
The authors of [180] define the concept of steganographic capacity and study
it for two alternative definitions of a distortion-limited embedder and two def-
initions of a distortion-limited warden. The rather surprising result is that the
capacity can be computed in the same manner no matter what combination of
the definitions is taken. A portion of this result appears in Chapter 13 dealing
with the subject of steganographic capacity. The authors also give a specific
construction of secure steganographic schemes for the cover source formed by

6 E.g., for smooth Pc with bounded partial derivatives.




Figure 6.4 Example of a cell Vf surrounding a point z from the fine lattice formed by
circles and crosses. The four triangular regions shaded light gray are the regions from
where the cover x can be mapped to the cell during embedding.

 
Bernoulli sequences B(1/2) (binary sequences with probability of 0 equal to 1/2)
both under an active and under a passive warden. Covers with a multivariate
Gaussian distribution and their security and steganographic capacity are studied
in [235]. This work also contains a rather fundamental result that block embed-
ding in such Gaussian covers is insecure in the sense that the KL divergence grows
linearly with the number of blocks. Perfectly secure and ε-secure steganographic
methods for Gaussian covers are studied in [49]. The authors compute a lower
bound on the capacity of perfectly secure stegosystems and study the increase in
capacity for ε-secure steganography. They also describe a practical lattice-based
construction for high-capacity secure stegosystems. The work is contrasted with
results obtained for digital watermarking where the steganographic constraint
Pc = Ps is absent. Codes for construction of high-capacity stegosystems are de-
scribed in [237]. The authors also show how their approach can be generalized
from iid sources to Markov sources and sources over continuous alphabets.

6.4 Complexity-theoretic approach

The information-theoretic definition of steganographic security is based on two idealizations that are rather strong. The first one is the assumption that covers
can be described by a probability distribution, which is completely known to
Eve. This assumption is accepted as the worst-case scenario because one may
argue that Eve can observe the channel traffic for sufficiently long to learn this
distribution with arbitrary accuracy. Eve may also obtain additional informa-

tion through espionage. Thus, our paranoia dictates that we should assume that
Eve knows all details of the distribution. Realistically, however, this can only
be reasonable within a sufficiently simple model of covers in some artificially
conceived communication channel rather than in reality, where the covers are
very complex objects, such as digital-media files. Additionally, one may attack
the very assumption that covers can be described by a random variable at all.
We choose not to delve into this rather philosophical issue and instead refer the
reader to [22] for an intriguing discussion of this issue from the perspective of
epistemology.
Second, the information-theoretic approach by definition ignores complexity
issues. It is concerned only with the possibility of constructing an attack rather
than its practical realization. It is quite feasible that, even though an attack on
a stegosystem can be mounted in principle, its practical realization may require
resources so excessive that no warden can implement the attack. For example, if
the computational complexity of the attack grows exponentially with the security
parameters of the stegosystem (stego key length and number of samples in the
cover), one could choose the key length and covers large enough to make sure
that any warden with polynomially bound resources will be destined to random
guessing. This idea is the basis of definitions of steganographic security based on
complexity-theoretic principles [113, 124].
To help explain the point further, we take a look at the related field of cryp-
tography. According to Shannon’s groundbreaking work on cryptosystems, the
only unconditionally secure cryptosystems are those whose key has the same
length as the text to be encrypted. This is what security defined using informa-
tion theory leads to. However, there exists an important and very useful class
of asymmetric cryptosystems (so-called public-key systems [207]) whose security
stems from the excessive computational complexity of constructing the attack
rather than the fundamental impossibility to construct one. In this section, we
explain a similar approach to defining steganographic security that is based on
complexity-theoretic considerations.
In 2002, Hopper et al. [113] and Katzenbeisser and Petitcolas [124] inde-
pendently proposed a complexity-theoretic definition of steganographic security.
These two proposals share several important novel ideas. First, the requirement of
knowing the probability distribution of covers is replaced with a weaker assump-
tion of availability of an oracle O that samples from the set of covers according
to their distribution over the channel. Second, the security of a stegosystem is
established by means of a probabilistic game between a judge and the warden.
The warden is allowed to sample the oracle O and the embedding oracle is im-
plemented as a black box seeded with an unknown stego key. The warden is
then essentially asked to distinguish between the outputs of the two oracles. The
advantage of the warden is defined as the probability of correct decision minus
1/2. The stegosystem is secure if the warden’s advantage is negligible (it falls to
zero faster than any power of 1/k, where k is the security parameter, such as

the number of bits in the stego key or the number of elements in the cover). We
first describe the approach that appeared in [113].

6.4.1 Steganographic security by Hopper et al.


The communication channel is considered in a rather general framework as a
distribution over sequences of timestamped bits7 and thus extends to stegano-
graphic methods that potentially use timing between individual bits for message
hiding. The oracle O is allowed to make partial draws from the channel – it can
sample the channel distribution conditioned on channel history and provide the
next b timestamped bits on the basis of the previous h bits. Such an oracle will
be denoted Oh,b and is assumed to be available to the prisoners as well as the
warden.
A stegosystem is defined as a pair of probabilistic algorithms8 PEmb(h, k, m)
and PExt(h, k, {0, 1}∗).9 The embedding algorithm PEmb takes on its input
a k-bit stego key k ∈ {0, 1}k , the secret message m ∈ {0, 1}∗ , a history of h
timestamped bits, and the oracle Oh,b . When applied to lb timestamped cover
bits, PEmb returns a sequence of l blocks of b timestamped stego bits. The
extraction algorithm takes a key k, a history of h bits, and the sequence of lb
stego bits, and returns the hidden message m.
Security of the stegosystem with respect to a passive warden is established
by the following game played between a judge and the warden. The warden
is given access to oracle Oh,b and is allowed to make as many queries as she
likes. Then, the judge seeds the embedding oracle PEmb(·, k, ·) with a fixed
stego key and implements it as a black box with two inputs marked with a
dot. The first dot is an integer, h, while the second dot can be any bit string
m ∈ {0, 1}∗ . The judge then prepares a second oracle, O(·, ·), also with two inputs
defined10 as O(h, m) = Oh,|PEmb(h,k,m)| . In other words, O(h, m) samples the
next |PEmb(h, k, m)| bits from the channel conditioned on the channel history
of h bits. The judge then uniformly randomly selects either PEmb(·, k, ·) or O(·, ·)
and gives it to the warden. The warden, who does not know whether she was
given the first or the second oracle, is allowed to make at most q queries and
draw at most l bits in total. She then uses the results of the queries to make
a decision regarding whether she is observing an output from the stegosystem,
PEmb, or the cover oracle O. Hopper defines advantage against a stegosystem

7 A timestamped bit is a pair {m[i], t[i]}, where t[i] is the time of sending the ith bit m[i].
8 A probabilistic algorithm uses randomness as part of its reasoning, usually realized using a
PRNG.
9 We use the prefix P in the embedding and extraction algorithms to stress the difference
that the PEmb and PExt are now probabilistic algorithms rather than mappings as in the
previous four sections or in Chapter 4. The symbol ∗ in PExt means that it can operate on
bit strings of any length.
10 |s| denotes the length of bit string s.

and ε-insecurity on the basis of the warden’s ability to correctly identify the
oracle using polynomial-complexity calculations.
One of the most intriguing implications of this complexity-theoretic view of
steganographic security is the fact that secure stegosystems exist if and only if
secure one-way (hash) functions exist under the assumption that the minimal
entropy11 of the cover source is larger than 2. This ties the security in steganog-
raphy to the security of common cryptographic primitives. The proof of this
statement is constructive and the embedding method is essentially a combina-
tion of steganography by cover selection and cover synthesis (see Chapter 4) that
proceeds by embedding one bit in a block of b cover bits.
Alice uses a key-dependent hash function12 h(k, ·) that maps b timestamped
bits to {0, 1}. The hash function enters the embedding process in the follow-
ing manner. Having synthesized (embedded) h timestamped stego bits, Alice
generates the next b bits, y, by querying the oracle Oh,b until h(k, y) returns
the required message. The number of calls to the oracle is limited by a fixed
upper bound and thus there is a small probability that a message bit will not
be embedded in the block. This embedding algorithm, which is called rejection
sampling, is an example of steganography by cover selection. The extraction al-
gorithm simply applies the same hash function to each block of b timestamped
stego bits to recover one message bit from each block. Under some rather tech-
nical assumptions about the hash function, which are satisfied if secure one-way
functions exist, this embedding algorithm is secure in the sense of Section 6.4.1.
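The rejection-sampling embedder is easy to sketch in code. The following toy is not the construction from [113]: the one-bit keyed hash is played by the parity of an HMAC-SHA256 output byte, and the channel oracle is stood in by a plain random-bit source; both choices are illustrative assumptions.

```python
import hashlib
import hmac
import random

def bit_hash(key: bytes, block: bytes) -> int:
    """One-bit keyed hash h(k, .); here the parity of the first HMAC-SHA256 byte."""
    return hmac.new(key, block, hashlib.sha256).digest()[0] & 1

def sample_oracle(rng: random.Random, b: int = 8) -> bytes:
    """Stand-in for the channel oracle O_{h,b}: draws b cover bits."""
    return bytes(rng.getrandbits(1) for _ in range(b))

def embed(key: bytes, message_bits, rng: random.Random, max_tries: int = 64):
    """Rejection sampling: redraw each b-bit block until its hash equals the bit."""
    stego = []
    for m in message_bits:
        for _ in range(max_tries):  # bounded tries: small chance a bit is not embedded
            y = sample_oracle(rng)
            if bit_hash(key, y) == m:
                break
        stego.append(y)
    return stego

def extract(key: bytes, stego):
    """Extraction hashes each block; no access to the oracle is needed."""
    return [bit_hash(key, y) for y in stego]

key = b"stego-key"
msg = [1, 0, 1, 1, 0]
blocks = embed(key, msg, random.Random(42))
assert extract(key, blocks) == msg
```

Each draw matches the desired bit with probability about 1/2, so with 64 tries per bit a failure to embed is negligibly likely; note that every stego block is a genuine oracle output, which is what drives the security argument.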

6.4.2 Steganographic security by Katzenbeisser and Petitcolas


A similar complexity-theoretic definition of steganographic security has been
given by Katzenbeisser and Petitcolas [124]. Here, we only summarize their ap-
proach. The judge first prepares the embedding oracle PEmb(·, k, ·) implemented
as a black box with a randomly chosen stego key k. Here, the first argument is
an input for the cover, x, and the third argument for the message, m. This oracle
returns the stego object containing the message m embedded in x using key k.
The judge then gives the warden both the embedding oracle and the cover oracle
O. The warden can query O an arbitrary (but finite) number of times, obtaining
a sequence of covers that follow the distribution of covers. She can also query the
embedding oracle with any messages and covers, including pathological covers or
messages, obtaining thus a sequence of stego objects. Again, she is not limited
in number of queries but performs only polynomial-complexity calculations. The
queries to both oracles can be arbitrarily interwoven and based on the previous
queries. After the warden has finished her reasoning, the judge queries O twice,
obtaining two covers, x1 and x2 , selects a random message, m, and computes

11 The minimal entropy, Hmin , of a source with pmf p[i] is defined as Hmin (p) =
mini log2 1/p[i].
12 In [113], the authors use the term pseudo-random function.

PEmb(x2 , k, m). Then, the judge flips a fair coin and gives either the cover x1 or
the stego object PEmb(x2 , k, m) to the warden. The warden performs a polyno-
mial probabilistic test to decide whether she is observing a cover or stego object.
The advantage of the warden is the probability of a correct guess minus 1/2. The
stegosystem is secure for oracle O if the warden’s advantage is negligible.
It is possible to construct secure steganographic systems S by reducing an
intractable problem Pint to the steganographic decision problem for S. The proof
of security can then be realized by contradiction in the following manner. Under
the assumption that S is not secure in the above sense, there exists a probabilistic
game Z between the warden and the judge that allows the warden to detect
stego objects with non-negligible probability. If the stegosystem S is constructed
in such a way that the existence of Z implies that instances of Pint can be solved
with non-negligible probability, we obtain a contradiction with the intractability
of Pint . The interested reader is referred to [124], where the authors use this
approach to construct a stegosystem whose insecurity would lead to an attack
on the RSA cryptosystem [207].

6.4.3 Further reading


A number of authors expanded the initial work of Hopper on the complexity-
theoretic approach to steganography. Most of the results are quite technical,
requiring the reader to be closely familiar with advanced concepts used in
study of public-key cryptosystems. A complexity-theoretic model for public-key
steganography [153, 154, 234] with an active warden was proposed by Backes and
Cachin [14] and further studied in [112]. Dedic et al. [56] showed that for secure
stegosystems the number of queries the sender must carry out is exponential in
the relative payload. Furthermore, they provide constructions of stegosystems
that nearly match the achievable payloads. The assumption of availability of an
oracle that samples conditionally on the channel history was criticized in [166].
The authors study how steganographic security is impacted by imperfect sam-
plers that can only sample with limited history.
Having presented several alternative definitions of security in steganography,
we note that the information-theoretic definition is the most widely accepted
approach in investigation of the security of multimedia objects, such as digital
images. It is possible that the concept of security in secret-key steganography
will follow the same path as in cryptography. Practical symmetric cryptographic
schemes, such as DES (Data Encryption Standard) or AES (Advanced Encryp-
tion Standard), cannot be proved secure but are nevertheless widely used because
they offer properties that make them very useful in applications. Their security
lies in the fact that nobody has so far been able to produce an attack substantially
faster than a brute-force search for the key. Their design is based on principles
and elements developed through joint research in cryptography and cryptanaly-
sis. Steganography seems to be taking the same course. Theoretical models and
existing attacks give the designers of stegosystems guidance and specific means

for construction of the next generation of steganographic schemes. In practice,


security is thus often understood in a much less rigorous sense as the inability
to practically construct a reliable steganographic detector using existing attacks
or their modifications.

Summary
• The information-theoretic definition of steganographic security assumes that
the cover source is a random variable with known distribution Pc .
• Feeding covers into a stegosystem according to their distribution Pc , the
distribution of stego objects follows distribution Ps .
• A steganographic system is perfectly secure (or ε-secure) if the Kullback–
Leibler divergence DKL (Pc ||Ps ) = 0 (or DKL (Pc ||Ps ) ≤ ε).
• The KL divergence is a measure of how different the two distributions are.
• The existence of perfect compression of the cover source implies the existence
of perfectly secure steganographic schemes.
• Describing the covers using a simplified model, we obtain a concept of security
with respect to a model. Stegosystems preserving this model are undetectable
within the model.
• Spread-spectrum methods and lattice-based methods (quantization index
modulation) can be used to construct stegosystems with distortion-limited
embedder and distortion-limited active warden.
• Steganographic security defined in the information-theoretic sense is concerned
only with the possibility to mount an attack and not its feasibility
(computational complexity).
• An alternative definition of security is possible in which access to the distribution
of covers is replaced with availability of an oracle that generates covers.
Security is then defined as the inability of a polynomially bounded warden to
construct a reliable attack.
• The complexity-theoretic security of stegosystems can be proved by reducing
the steganalysis problem to a known intractable problem or by using secure
one-way functions.

Exercises

6.1 [KL divergence between two Gaussians] Let f1 (x) and f2 (x) be the
pdfs of two Gaussian random variables N(μ1 , σ1²) and N(μ2 , σ2²). Prove for their
KL divergence

$$D_{KL}(f_1\|f_2) = \int f_1(x)\,\log\frac{f_1(x)}{f_2(x)}\,\mathrm{d}x = \frac{1}{2}\left[\log\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1 + \frac{(\mu_2-\mu_1)^2}{\sigma_2^2}\right]. \quad (6.46)$$
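The closed form (6.46) can be sanity-checked numerically. The sketch below (KL in nats, i.e., natural logarithm) compares it with a Monte Carlo estimate obtained by averaging the log-likelihood ratio under f1:

```python
import math
import random

def kl_gauss(mu1, s1, mu2, s2):
    """Closed form (6.46) for the KL divergence of two Gaussians, in nats."""
    return 0.5 * (math.log(s2**2 / s1**2) + s1**2 / s2**2 - 1
                  + (mu2 - mu1)**2 / s2**2)

def kl_monte_carlo(mu1, s1, mu2, s2, n=200_000, seed=1):
    """Estimate E_{f1}[log f1(x) - log f2(x)] by sampling x ~ f1."""
    rng = random.Random(seed)
    def logpdf(x, mu, s):
        return -0.5 * math.log(2 * math.pi * s * s) - (x - mu)**2 / (2 * s * s)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu1, s1)
        total += logpdf(x, mu1, s1) - logpdf(x, mu2, s2)
    return total / n

exact = kl_gauss(0.0, 1.0, 1.0, 2.0)
approx = kl_monte_carlo(0.0, 1.0, 1.0, 2.0)
assert abs(kl_gauss(0.0, 1.0, 0.0, 1.0)) < 1e-12  # identical pdfs: zero divergence
assert abs(exact - approx) < 0.02                  # MC estimate matches (6.46)
```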

6.2 [KL divergence is continuous in max norm] Let g and h be two
distributions on C with h(x) > κ > 0 whenever g(x) > 0. Let h′ be a small
perturbation of h such that maxx |h(x) − h′(x)| < ε. Then, for sufficiently small
ε,

$$\left|D_{KL}(g\|h) - D_{KL}(g\|h')\right| < \frac{\epsilon}{\kappa}. \quad (6.47)$$

Hint: Use the log inequality log(1 + x) ≤ x, which holds for all x > −1.

6.3 [KL divergence and detectability (counterexample)] Let C =
{1, . . . , N } and define two pmfs on C,

$$g_N[i] = \frac{1}{N} \quad \text{for } i \in \mathcal{C}, \quad (6.48)$$

$$h_N[i] = \begin{cases} 1/N + \delta_N & \text{for } i = 1,\\ 1/N - \delta_N & \text{for } i = 2,\\ 1/N & \text{for } i > 2. \end{cases} \quad (6.49)$$

First, show that

$$D_{KL}(g_N\|h_N) = -\frac{1}{N}\log\left(1 - N^2\delta_N^2\right). \quad (6.50)$$

Then, use the likelihood-ratio test

$$L(x) = \frac{h_N(x)}{g_N(x)} \quad (6.51)$$

to decide between two hypotheses H0 : x ∼ gN and H1 : x ∼ hN for one observation
x. Show that PFA + PMD ≥ 1 − δN no matter what threshold is used in
the likelihood-ratio test. Finally, choose

$$\delta_N = \frac{1}{N}\sqrt{1 - e^{-N^2}} \quad (6.52)$$

to show that DKL (gN ||hN ) = N .
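A quick numerical illustration of why this is a counterexample (natural-log KL): with the choice (6.52) the divergence equals N exactly, yet δN < 1/N, so the bound PFA + PMD ≥ 1 − δN forces any single-observation detector toward random guessing as N grows.

```python
import math

def kl(g, h):
    """KL divergence (natural log) between two pmfs."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)

def pmfs(N):
    """The pair (g_N, h_N) from (6.48)-(6.49) with delta_N chosen as in (6.52)."""
    delta = math.sqrt(1 - math.exp(-N * N)) / N
    g = [1.0 / N] * N
    h = [1.0 / N] * N
    h[0] += delta
    h[1] -= delta
    return g, h, delta

for N in (2, 3, 4):
    g, h, delta = pmfs(N)
    # the divergence grows without bound ...
    assert abs(kl(g, h) - N) < 1e-5
    # ... yet P_FA + P_MD >= 1 - delta_N stays close to 1 (random guessing)
    assert delta < 1.0 / N
```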

6.4 [KL divergence for LSB embedding in iid source] Assume that the
cover image is a sequence of independent and identically distributed random
variables with pmf p0 [i]. After embedding relative payload α = 2β using LSB
embedding, the stego image is a sequence of iid random variables with pmf

pβ [2i] = (1 − β)p0 [2i] + βp0 [2i + 1], (6.53)


pβ [2i + 1] = βp0 [2i] + (1 − β)p0 [2i + 1]. (6.54)

Show that

$$D_{KL}(p_0\|p_\beta) = \frac{\beta^2}{2}\sum_{i=0}^{127}(p_0[2i] - p_0[2i+1])^2\left(\frac{1}{p_0[2i]} + \frac{1}{p_0[2i+1]}\right) + O(\beta^3). \quad (6.55)$$
Hint: Use Proposition B.7.
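The quadratic approximation (6.55) is easy to check numerically. The sketch below uses a made-up 4-bin cover pmf (the formula generalizes from 256 bins to any even-length pmf) and natural-log KL, for which the O(β³) statement holds as well:

```python
import math

def lsb_pmf(p0, beta):
    """Stego pmf after LSB embedding with change rate beta, eqs (6.53)-(6.54)."""
    p = list(p0)
    for i in range(0, len(p0), 2):
        p[i] = (1 - beta) * p0[i] + beta * p0[i + 1]
        p[i + 1] = beta * p0[i] + (1 - beta) * p0[i + 1]
    return p

def kl(p, q):
    """KL divergence (natural log) between two pmfs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_quadratic(p0, beta):
    """Leading beta^2 term of (6.55), written for a pmf of any even length."""
    s = sum((p0[i] - p0[i + 1])**2 * (1 / p0[i] + 1 / p0[i + 1])
            for i in range(0, len(p0), 2))
    return beta**2 / 2 * s

p0 = [0.4, 0.3, 0.2, 0.1]          # toy 4-bin cover pmf
beta = 0.01
exact = kl(p0, lsb_pmf(p0, beta))
approx = kl_quadratic(p0, beta)
assert abs(exact - approx) < 1e-6  # the two agree up to O(beta^3)
```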

6.5 [KL divergence for ±1 embedding in iid source] Assume that the
cover image is a sequence of independent and identically distributed random
variables with range {0, . . . , 255} and pmf p0 [i]. After embedding relative pay-
load α = 2β using ±1 embedding (see Section 7.3.1), the stego image becomes a
sequence of iid random variables with pmf computed in Exercise 7.1. Show that
the KL divergence satisfies

$$D_{KL}(p_0\|p_\beta) = \beta^2\sum_{i=2}^{253}\frac{(p_0[i-1] - 2p_0[i] + p_0[i+1])^2}{8p_0[i]} + \frac{(p_0[1] - 2p_0[0])^2}{8p_0[0]}\beta^2 + \frac{(2p_0[0] - 2p_0[1] + p_0[2])^2}{8p_0[1]}\beta^2 + \frac{(p_0[253] - 2p_0[254] + p_0[255])^2}{8p_0[254]}\beta^2 + \frac{(p_0[254] - 2p_0[255])^2}{8p_0[255]}\beta^2 + O(\beta^3). \quad (6.56)$$
Hint: Use Proposition B.7.
7 Practical steganographic methods

The definition of steganographic security given in the previous chapter should be


a guiding design principle for constructing steganographic schemes. The goal is
clear – to preserve the statistical distribution of cover images. Unfortunately, dig-
ital images are quite complicated objects that do not allow accurate description
using simple statistical models. The biggest problem is their non-stationarity and
heterogeneity. While it is possible to obtain simple models of individual small flat
segments in the image, more complicated textures often present an insurmount-
able challenge for modeling because of a lack of data to fit an accurate local
model. Moreover, and most importantly, as already hinted in Chapter 3, digital
images acquired using sensors exhibit many complicated local dependences that
the embedding changes may disturb and leave statistically detectable artifacts.
Consequently, the lack of good image models gives space to heuristic methods.
In this chapter, we discuss four major guidelines for construction of practical
steganographic schemes:

• Preserve a model of the cover source (Section 7.1);
• Make the embedding resemble some natural process (Section 7.2);
• Design the steganography to resist known steganalysis attacks (Section 7.3);
• Minimize the impact of embedding (Section 7.4).

Steganographic schemes from the first class are based on a simplified model of
the cover source. The schemes are designed to preserve the model and are thus
undetectable within this model. The remaining three design principles are heuris-
tic. The goal of the second principle is to masquerade the embedding as some
natural process, such as noise superposition during image acquisition. The third
principle uses known steganalysis attacks as guidance for the design. Finally,
the fourth principle first assigns a cost of making an embedding change at each
element of the cover and then embeds the secret message while minimizing the
total cost (impact) of embedding. It is also possible and, in fact, advisable, to
take into consideration all four principles.
We now describe each design philosophy in more detail and give examples of
specific embedding schemes.

7.1 Model-preserving steganography

This principle follows directly from the definition of steganographic security.


The designer first chooses a model of cover images and then makes the stegano-
graphic scheme preserve this model. This will guarantee that the stego scheme
will be undetectable as long as the chosen model completely describes the cov-
ers. Arguably, the simplest model is formed by a sequence of independent and
identically distributed (iid) random variables. In this model, the cover is com-
pletely described by the probability distribution function. This means that for
a given cover image, we need to preserve its first-order statistics or histogram.
Note that this will lead to undetectable stegosystems for both homogeneous and
heterogeneous cover sources (Section 6.2.2).
There exist many approaches that one could take to design a histogram-
preserving steganographic scheme [64, 75, 110, 183, 218, 232]. The approach
that we explain next is based on the general idea of statistical restoration in
which a portion of the image is reserved and not used during embedding so that
it can be utilized later to guarantee preservation of the first-order statistics.

7.1.1 Statistical restoration


We choose to illustrate the principle of statistical restoration on the example of
the steganographic algorithm OutGuess originally introduced by Provos [198].
OutGuess embeds messages into a JPEG image by slightly modifying the quan-
tized DCT coefficients. It also preserves the histogram of all DCT coefficients.
The iid model can be heuristically justified by the argument that DCT coef-
ficients in an individual 8 × 8 block are largely decorrelated and the fact that
inter-block dependences among DCT coefficients are much weaker than depen-
dences among neighboring pixels in the spatial domain.
Steganographic methods based on statistical restoration, such as OutGuess,
are two-pass procedures. In the first (embedding) pass, a stego key is used to
select a pseudo-random subset, De , of all DCT coefficients (both luminance and
chrominance coefficients) that will be used for embedding. Similar to the first
JPEG steganographic algorithm Jsteg (Section 5.1.2), OutGuess embeds the
message bits using simple LSB embedding (Section 5.1) into the coefficients
from De while skipping over all coefficients equal to 0 or 1 to avoid introducing
disturbing artifacts. In the second pass, corrections are made to the DCT coeffi-
cients outside of the set De to match the histogram of the stego image with the
cover-image histogram.
Before embedding starts, OutGuess calculates the maximum length of a ran-
domly spread message (the maximal correctable payload) that can be embedded
in the cover image during the first pass, while making sure that there will be
enough coefficients for the correction phase to adjust the histogram to its original
values. The reader is encouraged to verify that the maximal correctable payload

is determined by the most imbalanced LSB pair, which is the pair {2k, 2k + 1}
with the largest ratio
$$\frac{\max\{h[2k], h[2k+1]\}}{\min\{h[2k], h[2k+1]\}}. \quad (7.1)$$
Because the histogram of DCT coefficients in a single-compressed JPEG image
has a spike at zero (see the discussions in Chapter 2) and because the LSB pair
{0, 1} is skipped during embedding, the most imbalanced LSB pair is {−2, −1}.
Because h[−2] < h[−1] in typical cover images, after embedding the maximum
correctable payload all remaining coefficients with value −2 will have to be
modified to −1 in the correction phase.
Writing n01 for the number of all DCT coefficients not equal to 0 or 1, the
maximal correctable payload will be αmax n01 , 0 ≤ αmax ≤ 1. Because all αmax n01
coefficients are selected pseudo-randomly in the embedding phase, the number
of unused DCT coefficients with value −2 after embedding is (1 − αmax )h[−2].
In order to restore the number of −1s in the stego image, this value must be
larger than or equal to the expected decrease in the number of coefficients with
value −1, which is (αmax /2)h[−1] − (αmax /2)h[−2]. This is because, assuming a
random message is embedded, the probability that a message bit will match the
LSB of −1 is 12 and thus on average (αmax /2)h[−1] coefficients with value −1 will
be unchanged by embedding and the same number of them will be modified to
−2. The second term, (αmax /2)h[−2] is the expected number of coefficients with
value −2 that will be modified to −1 during embedding. Because h[−2] < h[−1],
the expected drop in the number of coefficients with value −1 is the difference
(αmax /2)h[−1] − (αmax /2)h[−2]. Thus, we obtain the following inequality and
eventually an upper bound on the maximum correctable payload αmax :
$$(1 - \alpha_{\max})h[-2] \ge \frac{\alpha_{\max}}{2}h[-1] - \frac{\alpha_{\max}}{2}h[-2], \quad (7.2)$$

$$\alpha_{\max} \le \frac{2h[-2]}{h[-1] + h[-2]}. \quad (7.3)$$
This condition guarantees that at the end of the embedding phase on average
there will be enough unused coefficients with value −2 that can be flipped
back to −1 to make sure that the occurrences of the LSB pair {−2, −1} are pre-
served after embedding. As this is the most imbalanced LSB pair, the occurrences
of virtually all other LSB pairs can be preserved using the same correction step,
as well. We note that some very sparsely populated histogram bins in the tails
of the DCT histogram may not be restored correctly during the second phase,
but, since their numbers are statistically insignificant, the impact on statistical
detectability is negligible. The average capacity αmax of OutGuess for typical
natural images is around 0.2 bpnc (bits per non-zero DCT coefficient).
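The derivation behind (7.3) extends to every LSB pair {2k, 2k+1}: each pair imposes the bound α ≤ 2 min{h[2k], h[2k+1]}/(h[2k] + h[2k+1]), and the most imbalanced pair binds. A minimal sketch, using a made-up coefficient histogram (the counts are illustrative, not from a real image):

```python
def max_correctable_payload(hist):
    """Maximal correctable payload: minimum of the per-pair bound (7.3)
    over all LSB pairs {2k, 2k+1}; `hist` maps coefficient value -> count."""
    alpha = 1.0
    bases = sorted({v - (v % 2) for v in hist})  # even base 2k of each LSB pair
    for v in bases:
        if v == 0:                 # the pair {0, 1} is skipped by OutGuess
            continue
        a, b = hist.get(v, 0), hist.get(v + 1, 0)
        alpha = min(alpha, 2.0 * min(a, b) / (a + b))
    return alpha

# hypothetical DCT histogram; the pair {-2, -1} is the most imbalanced one
hist = {-4: 30, -3: 40, -2: 60, -1: 140, 0: 500, 1: 120, 2: 80, 3: 50}
alpha_max = max_correctable_payload(hist)   # 2*60/(140+60) = 0.6
```

For these toy counts the binding pair is {−2, −1}, as the text predicts for typical JPEG histograms.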
Steganographic schemes that embed messages in the spatial domain require
more complex models because neighboring pixels are more correlated than DCT
coefficients in a JPEG file. These correlations cannot be captured using first-

order statistics. Instead, one can use the joint statistics of neighboring pixel
pairs [75], statistics of differences between neighboring pixels (see Section 11.1.3
on structural steganalysis), or Markov chains.

7.1.2 Model-based steganography


Steganography based on statistical restoration chooses the sample statistics (e.g.,
the histogram) as the model to preserve. In contrast, model-based steganogra-
phy [204] fits a parametric model through the sample data and preserves this
data-driven model and does so without the need for a correction step. The cover
image is modeled as a random variable that can be divided into two components,
x = (xinv , xemb ), where xinv is invariant with respect to embedding and xemb may
be modified during embedding. We denote the range of each random variable as
Xinv and Xemb . For example, we can think of LSB embedding in 8-bit grayscale
images where xinv ∈ {0, 1}7 = Xinv are the 7 most significant bits (or the index
of the LSB pair) and xemb ∈ {0, 1} = Xemb , the LSB of x.
The cover model is formed by the conditional probabilities Pr{xemb |xinv }.
These probabilities will be needed to extract the message and thus must be
known to the recipient. This can be arranged by making Pr{xemb |xinv } depend
only on the invariant component xinv .
First, the set of all cover elements (pixels or DCT coefficients) is written as a
union of disjoint subsets

$$\bigcup_{x_{inv} \in \mathcal{X}_{inv}} \mathcal{C}(x_{inv}), \quad (7.4)$$

where C(xinv ) is the set of cover elements whose invariant part is xinv . The
embedding algorithm embeds a portion of the message in each C(xinv ) in the fol-
lowing manner. First, the message bits are encoded using symbols from Xemb
and then the symbols are prebiased so that they appear with probabilities
Pr{xemb |xinv = xinv }. This is achieved by running the message symbols through
an entropy decompressor, for a compression scheme designed to compress symbols
from Xemb distributed according to the same conditional probabilities.1 When
the decompressor is fed with xemb distributed uniformly in Xemb , it will output
symbols with probabilities Pr{xemb |xinv = xinv }. Thus, when xemb of all cover el-
ements from C(xinv ) are replaced with the transformed symbols, xemb , the stego
image elements will follow the cover-image model as desired.
The fraction of bits that can be embedded in each element of C(xinv ) is the
entropy of Pr{xemb |xinv = xinv } or

$$H\left(\Pr\{x_{emb}|x_{inv} = x_{inv}\}\right) = -\sum_{x_{emb}} P_{x_{emb}|x_{inv}}(x_{emb}|x_{inv}) \log_2 P_{x_{emb}|x_{inv}}(x_{emb}|x_{inv}), \quad (7.5)$$

1 In practice, one can use, for example, arithmetic compression.



Figure 7.1 Model-based steganography (embedding). [Diagram: the cover x is split into (xinv , xemb ); the model Pxemb |xinv =xinv drives an entropy decoder that transforms the message m into xemb , forming the stego object y.]

Figure 7.2 Model-based steganography (extraction). [Diagram: the stego object y is split into (xinv , xemb ); the same model drives the entropy codec to recover the message m.]

where we denoted for brevity

Pxemb |xinv (xemb |xinv ) = Pr{xemb = xemb |xinv = xinv }. (7.6)

Thus, the total embedding capacity is



|C(xinv )|H (Pr{xemb |xinv = xinv }) , (7.7)
xinv

where |C(xinv )| is the cardinality of C(xinv ).


To illustrate the model-based approach, we describe a specific realization for
steganography in JPEG images [204]. Similar to OutGuess or Jsteg, DCT coef-
ficients equal to 0 or 1 are not used for embedding and the embedding mecha-
nism is LSB flipping. Also, Xinv = {0, 1}7 and Xemb = {0, 1}. The model-based-
steganography paradigm is applied to each DCT mode separately. Thus, at the
beginning the cover JPEG file is decomposed into 64 subsets corresponding to 64
DCT modes (spatial frequencies). Let h[i] be the histogram of DCT coefficients
for one fixed mode. Because the sums h[2i] + h[2i + 1] are invariant under LSB

embedding for all i, we can use this invariant and fit a parametric model, h(x),
through the points ((2i + 2i + 1)/2, (h[2i] + h[2i + 1])/2). For example, we can
model the DCT coefficients using the generalized Cauchy model with pdf (see
Appendix A)
$$h(x) = \frac{p-1}{2s}\left(1 + \frac{|x|}{s}\right)^{-p} \quad (7.8)$$
and determine the parameters using maximum-likelihood estimation (see Exam-
ple D.8). Denoting the 7 most significant bits of an integer a as MSB7 (a), we
define the model using the conditional probabilities
$$\Pr\{x_{emb} = 0|x_{inv} = \mathrm{MSB}_7(2i)\} = \frac{h(2i)}{h(2i) + h(2i+1)}, \quad (7.9)$$

$$\Pr\{x_{emb} = 1|x_{inv} = \mathrm{MSB}_7(2i)\} = \frac{h(2i+1)}{h(2i) + h(2i+1)}. \quad (7.10)$$
The sender continues with the embedding process by selecting all h[2i] +
h[2i + 1] DCT coefficients at the chosen DCT mode that are equal to 2i or
2i + 1. Their LSBs are replaced with a segment of the message that was de-
compressed using an arithmetic decompressor to the length h[2i] + h[2i + 1].
The decompressor is designed to transform a sequence of uniformly distributed
message bits to a biased sequence with 0s and 1s occurring with probabilities
(7.9)–(7.10). Because we are replacing a bit sequence with another bit sequence
with the same distribution, the model (7.9)–(7.10) will be preserved.
The recipient first constructs the model and computes the probabilities given
by (7.9)–(7.10). This can be achieved because h[2i] + h[2i + 1] is invariant with
respect to embedding changes! Individual message segments are extracted by
feeding the LSBs for each DCT mode and each LSB pair {2i, 2i + 1} into the
arithmetic compressor and concatenating them.
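The prebiasing step can be sketched with an exact arithmetic decoder over rationals: the uniform message bits are read as the binary expansion of a number x in [0, 1), and x is decoded into symbols that follow the model probabilities. This is a minimal illustration of the idea, not the codec used in [204]; the bias p0 = 7/10 is a made-up model probability.

```python
from fractions import Fraction
import random

def debias(bits, p0, n_out):
    """Arithmetic decoder: map uniform bits (binary expansion of x in [0, 1))
    to n_out bits distributed as Bernoulli with Pr{0} = p0."""
    x = Fraction(0)
    w = Fraction(1, 2)
    for b in bits:                 # message -> dyadic number x in [0, 1)
        x += b * w
        w /= 2
    lo, hi = Fraction(0), Fraction(1)
    out = []
    for _ in range(n_out):         # [lo, mid) codes symbol 0, [mid, hi) codes 1
        mid = lo + p0 * (hi - lo)
        if x < mid:
            out.append(0)
            hi = mid
        else:
            out.append(1)
            lo = mid
    return out

# uniform message bits come out with the model's bias, Pr{0} = 0.7
p0 = Fraction(7, 10)
rng = random.Random(7)
zeros = total = 0
for _ in range(200):
    message = [rng.getrandbits(1) for _ in range(32)]
    out = debias(message, p0, 16)
    zeros += out.count(0)
    total += len(out)
frequency_of_zero = zeros / total  # empirically close to 0.7
```

In the full scheme the recipient runs the matching arithmetic *encoder* on the stego LSBs, which reproduces the message bits; here we only check that the decoder output has the desired bias.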
This specific example of model-based steganography is designed to preserve the
model of the histograms of all individual DCT modes. This is in contrast with
OutGuess that preserves the sample histogram of all DCT coefficients and not
necessarily the histograms of the DCT modes. For 80% quality JPEG images, the
average embedding capacity of this algorithm is approximately 0.8 bpnc, which
is remarkably large considering the scope of the model and four times larger than
for OutGuess.
We now calculate the embedding efficiency of this algorithm defined as the
average number of bits embedded per unit distortion. In order to simplify the
notation, we set

p0 = Pr{xemb = 0 | xinv = MSB7(2i)}.   (7.11)

The average number of bits embedded in one DCT coefficient equal to 2i or
2i + 1 for one fixed DCT mode is

H(p0) = −p0 log2 p0 − (1 − p0) log2(1 − p0).   (7.12)
Practical steganographic methods 113

An embedding change is performed when the coefficient’s LSB does not match the
prebiased message bit. Because the probability that the prebiased message bit is 0
is the same as the probability that the LSB of the coefficient is 0, which is p0, the
embedding needs to change the LSB with probability p0(1 − p0) + (1 − p0)p0 =
2p0(1 − p0). Therefore, the embedding efficiency is

e(p0) = [−p0 log2 p0 − (1 − p0) log2(1 − p0)] / [2p0(1 − p0)].   (7.13)
Note that e(p0) is always greater than or equal to 2 (see Figure 7.3). Surprisingly,
this is never lower than the embedding efficiency of simple LSB embedding,
which is 2 because in LSB embedding every other LSB is modified, on average.
To summarize, this specific example of model-based steganography preserves
the models of histograms of all 64 DCT modes and does so while providing
embedding efficiency larger than 2. This is quite an improvement over the naive
Jsteg described in Chapter 5.
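The efficiency (7.13) is easy to evaluate numerically; a minimal sketch:

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def embedding_efficiency(p0):
    """Embedding efficiency (7.13): bits embedded per expected embedding change."""
    return binary_entropy(p0) / (2 * p0 * (1 - p0))
```

The minimum value 2 is attained at p0 = 1/2; for any biased model (p0 ≠ 1/2) the efficiency exceeds 2.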

[Figure 7.3: Embedding efficiency e(p0) of Model-Based Steganography for JPEG
images, plotted for p0 ∈ [0, 1].]

There exists a more advanced version of this algorithm [205] that attempts
to preserve one higher-order statistic called “blockiness” defined as the sum of
discontinuities along the boundaries of 8 × 8 pixel blocks in the spatial domain
(see (12.9) in Chapter 12). This is achieved using the idea of statistical restoration
by making additional modifications to unused DCT coefficients in an iterative
manner to adjust the blockiness to its original value in the cover image.2 In
Chapter 12, the original version of Model-Based Steganography is abbreviated
MBS1, while the more advanced version with deblocking is denoted as MBS2.

2 These additional changes, however, make this version of Model-Based Steganography more
detectable (see Chapter 12 and [190, 210]).
7.2 Steganography by mimicking natural processing

Model-preserving steganographic schemes are undetectable within the chosen
model. However, unless the model comprehensively captures the cover source,
all that is needed to construct a steganalysis algorithm is to identify a statis-
tical quantity that is disturbed by the embedding. It turns out that finding
such a statistic is usually not very difficult. Often, in steganography by statisti-
cal restoration the additional changes in the correction phase only make things
worse and in fact make steganalysis easier [190, 210, 245]. Preserving a more
complex cover source is not really a practical answer to this problem because
schemes based on statistical restoration do not scale well with the complexity
of the model. For example, it is not immediately clear how to simultaneously
preserve the histogram of DCT coefficients and the statistics of DCT coefficient
pairs from neighboring 8 × 8 blocks without sacrificing the embedding capacity.
Even though a practical embedding method capable of approximately preserving
multiple statistics was described in [144] (the Feature Correction Method), the
approach has not led to secure steganographic algorithms so far.
The lack of accurate models justifies heuristic approaches to steganography,
such as those that attempt to mask the embedding as a natural process. Pre-
sumably, if the effect of embedding were indistinguishable from some natural
processing, stego images should stay compatible with the distribution of cover
images. In this section, we describe a practical realization of this idea by masking
embedding as superposition of noise with given statistical properties.

7.2.1 Stochastic modulation


In Chapter 3, we learned that the process of digital image acquisition is affected
by multiple noise sources. Even when taking two images of exactly the same
scene under the same conditions and identical camera settings, the images will
differ in their noise component due to the presence of random phenomena caused
by quantum properties of light (shot noise) and noise present in electronic com-
ponents of the sensor. This suggests the idea of constructing a steganographic
method so that the impact of embedding resembles superposition of sensor noise
during image acquisition [77, 82]. The steganalyst would then be required to
distinguish whether the image noise component is solely due to image acquisi-
tion or contains a component due to message embedding. This is essentially a
heuristic plan to produce stego images compatible with the distribution of the
cover source.
Our goal is to construct an embedding scheme so that the impact of embedding
is equivalent to adding to the cover image a signal obtained by independent
realizations of a random variable η with a given probability density function fη .
Because digital images are represented with a finite bit depth, the noise will have
to be adequately quantized as well. Assuming for simplicity and without loss of
generality that we work with grayscale images with pixel values in {0, . . . , 255},
we denote by r[i] = round(η[i]) the realizations of η after rounding them to
integers. The rounded noise sequence r[i] is called the stego noise. The probability
mass function of the stego noise is

p[k] = Pr{r[i] = k} = ∫ from k−1/2 to k+1/2 of fη(x) dx   for any integers i, k.   (7.14)

In theory, by adding realizations of the stego noise to the cover image we could
embed H(p) bits (the entropy of round(η)) at every pixel. Instead of trying to
develop a practical method that achieves this capacity, which is not an easy task,
we provide a simple suboptimal method.
It will be advantageous to work with bits represented using the pair {−1, 1}
rather than {0, 1}. Thus, from now till the end of this section, −1 and 1 represent
binary 0 and 1, respectively. Next, we define a parity function π with the following
antisymmetry property,
π(u + v, v) = −π(u, v) (7.15)
for all integers u and v. The following function, for example, satisfies this
requirement:

π(u, v) = (−1)^u    for 2k|v| ≤ u ≤ 2k|v| + |v| − 1,
π(u, v) = −(−1)^u   for (2k − 1)|v| ≤ u ≤ 2k|v| − 1,   (7.16)
π(u, v) = 0         for v = 0,

where k is an integer.
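A convenient way to realize such a parity in code is the interval parity (−1)^⌊u/|v|⌋, which flips sign whenever u changes by |v| and therefore satisfies the antisymmetry (7.15) for every v ≠ 0. (This is a sketch of one concrete parity assignment with the required property, a variant of (7.16) that is constant on each interval of length |v|.)

```python
def parity(u, v):
    """Interval parity: +1 on intervals [2k|v|, (2k+1)|v| - 1], -1 on the
    odd-indexed intervals, so parity(u + v, v) == -parity(u, v) for v != 0."""
    if v == 0:
        return 0
    return -1 if (u // abs(v)) % 2 else 1
```

Adding v to u moves the floor quotient u // |v| by exactly ±1, so the parity always flips, which is precisely property (7.15).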
The embedding algorithm starts by generating a pseudo-random path through
the image using the stego key and two independent stego noise sequences r[i] and
s[i] with the probability mass function (7.14). The sender follows the embedding
path and embeds one message bit, m, at the ith pixel, x[i], if and only if r[i] ≠ s[i].
In this case, the sender can always embed one bit by adding either r[i] or s[i] to
x[i] because due to the antisymmetry property of the parity function π (7.15)
π(x[i] + s[i], r[i] − s[i]) = −π(x[i] + s[i] + r[i] − s[i], r[i] − s[i]) (7.17)
= −π(x[i] + r[i], r[i] − s[i]). (7.18)
In the case when r[i] = s[i], the sender does not embed any bit, replaces x[i]
with y[i] = x[i] + r[i], and continues with embedding at the next pixel along the
pseudo-random walk.
To complete the embedding algorithm, however, we need to resolve one more
technicality of the embedding process. If, during embedding, y[i] = x[i] + r[i]
gets out of its dynamic range, the sender has little choice but to slightly deviate
from the stego noise model and instead add r′[i] so that π(x[i] + r′[i], r[i] − s[i]) =
m with |r′[i] − r[i]| as small as possible.
The recipient first uses the stego key to generate the same stego noise se-
quences, r[i] and s[i], and the pseudo-random walk through the pixels. The mes-
sage bits are read from the stego image pixels, y[i], as parities m = π(y[i], r[i] −
s[i]). No bit is extracted from pixel y[i] if r[i] = s[i].
In summary, the sender starts by generating two independent stego noise se-
quences with the required probability mass function and then at each pixel at-
tempts to embed one message bit by adding one of the two samples of the stego
noise. This works due to the antisymmetry property of the parity function. If the
two stego noise samples are the same, the embedding process does not embed
any bits. The sender, however, still adds the noise to the image so that the stego
image is, indeed, obtained by adding noise of a given pmf to the cover image as
required.
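The whole scheme fits in a few lines. The sketch below uses the interval parity (−1)^⌊u/|v|⌋, which satisfies (7.15); all names are illustrative, the stego key is reduced to a PRNG seed, pixels are visited sequentially instead of along a pseudo-random walk, and out-of-range clipping is omitted:

```python
import random

def parity(u, v):
    # interval parity: parity(u + v, v) == -parity(u, v) for every v != 0
    if v == 0:
        return 0
    return -1 if (u // abs(v)) % 2 else 1

def sm_embed(cover, bits, key):
    """Stochastic-modulation embedding sketch; bits are in {-1, +1}."""
    rng = random.Random(key)
    stego, i = [], 0
    for x in cover:
        r, s = round(rng.gauss(0, 1)), round(rng.gauss(0, 1))
        if r == s or i >= len(bits):
            stego.append(x + r)          # no bit can be embedded here
        else:
            # by antisymmetry (7.15), x+r and x+s carry opposite parities
            y = x + r if parity(x + r, r - s) == bits[i] else x + s
            stego.append(y)
            i += 1
    return stego

def sm_extract(stego, nbits, key):
    """Regenerate the same noise sequences and read parities."""
    rng = random.Random(key)
    out = []
    for y in stego:
        r, s = round(rng.gauss(0, 1)), round(rng.gauss(0, 1))
        if r != s and len(out) < nbits:
            out.append(parity(y, r - s))
    return out
```

Note that the stego image is always the cover plus a sample of the stego noise, whether or not a bit was embedded at a given pixel.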
Stochastic modulation embeds one bit at every pixel as long as r[i] ≠ s[i]. Thus,
the relative embedding capacity is 1 − Pr{r = s} = 1 − Σk p[k]^2 bits per pixel,
where the probabilities can be computed using (7.14). The random noise sources
during image acquisition, such as the shot noise and the readout noise, are well
modeled as Gaussian random variables with zero mean and variance σ 2 . In this
case, the relative embedding capacity per pixel is as shown in Figure 7.4. Because
the stego noise sequences are more likely to be equal for small σ, the capacity
increases with the noise variance. Also, note that the sender can communicate
about 0.7 bits per pixel using quantized standard Gaussian noise N (0, 1).
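Both the pmf (7.14) and this capacity are straightforward to evaluate for Gaussian stego noise using the error function (a sketch; function names are illustrative):

```python
import math

def stego_noise_pmf(sigma, kmax=None):
    """Pmf (7.14) of N(0, sigma^2) noise rounded to integers:
    p[k] = Phi((k + 1/2)/sigma) - Phi((k - 1/2)/sigma)."""
    if kmax is None:
        kmax = int(math.ceil(6 * sigma)) + 1   # tails beyond ~6 sigma are negligible
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return {k: Phi((k + 0.5) / sigma) - Phi((k - 0.5) / sigma)
            for k in range(-kmax, kmax + 1)}

def sm_capacity(sigma):
    """Relative capacity 1 - sum_k p[k]^2 of stochastic modulation."""
    p = stego_noise_pmf(sigma)
    return 1.0 - sum(v * v for v in p.values())
```

For σ = 1 this reproduces the roughly 0.7 bits per pixel quoted above.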
Note that stochastic modulation is suboptimal in general because only
1 − Σk p[k]^2 bits are embedded at every pixel rather than H(p) bits, the
entropy of the stego noise (Exercise 7.4 shows that, indeed, 1 − Σk p[k]^2 ≤ H(p)).
Although it is not known how to construct steganographic systems reaching the
theoretic embedding capacity for general stego noise, there exist capacity-reaching
oretic embedding capacity for general stego noise, there exist capacity-reaching
constructions for special instances of p, such as when the stego noise amplitude
is at most 1 (see Section 9.4.5).
No matter how plausible the heuristic behind stochastic modulation sounds,
the fact is that it can be reliably detected using modern steganalysis methods
based on feature extraction and machine learning. The main reason is that dur-
ing image acquisition, the noise is injected before the signal is even quantized
in the A/D converter and further processed. Adding the noise to the final TIFF
or BMP image is not the same because this image already contains a multitude
of complex dependences among neighboring pixels due to in-camera process-
ing, such as demosaicking, color correction, and filtering. Thus, the stego noise
should be superimposed on the raw sensor output rather than the final image.
It is, however, not clear at this point how to embed bits in the pixel domain by
modifying the raw sensor output. A possible solution is to apply coding meth-
ods for communication with non-shared selection channels, which is the topic of
Chapter 9.
[Figure 7.4: Relative embedding capacity of stochastic modulation (in bits per
pixel) realized by adding white Gaussian noise of variance σ^2; the capacity
increases with σ^2.]

7.2.2 The question of optimal stego noise


For the ideal case of stochastic modulation capable of communicating H(p) bits
per pixel, it is natural to ask about the best stego noise that would embed the
highest payload with the smallest embedding distortion. Stochastic modulation
with such stego noise would minimize the embedding distortion, which could
intuitively decrease the statistical detectability.
The properties of the optimal stego noise will obviously depend on how we
measure the distortion between cover and stego images, d(x, y). The expected
value of the embedding distortion per pixel, E[d(x, y)], will be some function of
the stego noise pmf p, and will be denoted as d(p) = E[d(x, y)].
Imposing a bound on the embedding distortion, D, we wish to determine popt
that maximizes the relative embedding payload

popt = arg max over all p with d(p) ≤ D of H(p).   (7.19)

This expression is recognized in information theory as the rate–distortion bound,


which is a relationship that connects the communication rate and distortion. The
relative payload for the optimal distribution, H(popt ), depends on the distortion
bound D and it is also a function of the distortion measure d. This is why we
denote it as Hd (D) = H(popt ). We can say that, given a bound on embedding
distortion, D, the relative payload α that can be embedded with any instance of
stochastic modulation must satisfy

α ≤ Hd(D).   (7.20)

Alternatively, the relative payload α can be embedded with distortion no smaller
than Hd^−1(α), where Hd^−1 is the inverse function to Hd. This translates into the
following bound on the embedding efficiency (ratio of payload and distortion):

e ≤ α / Hd^−1(α).   (7.21)
We now find the stego noise distribution popt for a specific choice of the
distortion measure, dγ(x, y) = |x − y|^γ, as defined in (4.12). The expected
distortion per pixel is

dγ(p) = Σk p[k]|k|^γ ≤ D.   (7.22)

We now show that the optimal stego noise distribution is the discrete
generalized Gaussian

popt[k] = e^(−λ|k|^γ) / Z(λ),   (7.23)

where

Z(λ) = Σk e^(−λ|k|^γ)   (7.24)

is the normalization factor. To see this, we write for the entropy H(p) of an
arbitrary distribution p satisfying the bound (7.22)

H(p) = Σk p[k] log(1/p[k]) = Σk p[k] log(1/popt[k]) + Σk p[k] log(popt[k]/p[k])   (7.25)
     ≤ log Z(λ) + λ Σk p[k]|k|^γ ≤ log Z(λ) + λD,   (7.26)

where we used the non-negativity of the KL divergence (Proposition B.5)

0 ≤ DKL(p||popt) = −Σk p[k] log(popt[k]/p[k]).   (7.27)

From the property of the KL divergence, the equality is reached when p = popt,
which proves the optimality of popt (7.23). The parameter λ is determined from
the requirement Σk popt[k]|k|^γ = D.
Note that when the distortion is measured as energy, γ = 2, the bound (7.22)
limits the variance of p. In this case, the stego noise with maximal entropy is
the discrete Gaussian, in compliance with the classical result from information
theory that the highest-entropy noise among all variance-bounded distributions
is the Gaussian distribution.
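Numerically, the optimal pmf (7.23) for a given distortion bound D can be obtained by solving for λ with bisection, using the fact that the expected distortion decreases as λ grows (a sketch with illustrative names; γ = 2 recovers the discrete Gaussian):

```python
import math

def gen_gaussian_pmf(lmbda, gamma, kmax=50):
    """Discrete generalized Gaussian (7.23): p[k] proportional to
    exp(-lambda * |k|^gamma), normalized over k in [-kmax, kmax]."""
    w = {k: math.exp(-lmbda * abs(k) ** gamma) for k in range(-kmax, kmax + 1)}
    Z = sum(w.values())                       # normalization factor (7.24)
    return {k: v / Z for k, v in w.items()}

def solve_lambda(D, gamma, lo=1e-4, hi=50.0):
    """Bisect for lambda so that the expected distortion
    sum_k p[k]|k|^gamma matches the bound D."""
    def dist(l):
        return sum(p * abs(k) ** gamma
                   for k, p in gen_gaussian_pmf(l, gamma).items())
    for _ in range(80):
        mid = (lo + hi) / 2
        if dist(mid) > D:     # distortion too large -> increase lambda
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For γ = 2 and D = 1 the resulting pmf is the discrete Gaussian with unit variance.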
For steganography whose embedding changes are restricted to ±1, the stego
noise distribution satisfies p[k] = 0 for k ∉ {−1, 0, 1} and dγ(p) becomes the
change rate (4.17) for any γ > 0. In this case, the function Hdγ(x) can be
determined analytically [233] (also see Exercise 7.5) to be

Hdγ(x) = −x log2 x − (1 − x) log2(1 − x) + x,   (7.28)

the ternary entropy function (Section 8.6.2).
7.3 Steganalysis-aware steganography

Steganography is advanced through steganalysis. Thus, it is only natural to take
into account existing attacks on steganographic techniques when designing a new
one. In fact, OutGuess was originally designed as an advanced version of Jsteg
resistant to the histogram attack (Chapter 5), rather than from the definition
of steganographic security as presented in Chapter 6. Security with respect to
known steganalysis is an obvious necessary condition that any new steganography
aspiring to be secure must satisfy. Historically, steganalysis was often used as a
guiding principle to avoid known pitfalls when designing the next generation of
steganographic methods.

7.3.1 ±1 embedding
It was recognized early on that the embedding operation of flipping the LSB cre-
ates many problems due to its asymmetry (even values are never decreased and
odd values never increased during embedding). Flipping LSBs is an unnatural
operation that introduces characteristic artifacts into the histogram. An obvious
remedy is to use an embedding operation that is symmetrical. A trivial modifi-
cation of LSB embedding is the so-called ±1 embedding, also sometimes called
LSB matching. This embedding algorithm embeds message bits as LSBs of cover
elements; however, when an LSB needs to be changed, instead of flipping the
LSB, the value is randomly increased or decreased, with the obvious exception
that the values 0 and 255 are only increased or decreased, respectively. This has
the effect of modifying the LSB but, at the same time, other bits may be modi-
fied as well. In fact, in the most extreme case, all bits may be modified, such as
when the value 127 = (01111111)2 is changed to 128 = (10000000)2.
Note that the extraction algorithm of ±1 embedding is the same as for LSB
embedding – the message is read by extracting the LSBs of cover elements. The
±1 embedding is much more difficult to attack than LSB embedding. While
there exist astonishingly accurate attacks on LSB embedding (see Sample Pairs
Analysis in Chapter 11), no attacks on ±1 embedding with comparable accuracy
currently exist (Sections 11.4 and 12.5 contain examples of a targeted and a blind
attack).
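A sketch of ±1 embedding (sequential pixel order for brevity; a real implementation visits pixels along a key-derived pseudo-random path, and the function names are illustrative):

```python
import random

def pm1_embed(cover, bits, key, lo=0, hi=255):
    """±1 embedding (LSB matching): when the LSB must change, add or
    subtract 1 at random, except at the boundary values lo and hi."""
    rng = random.Random(key)
    stego = list(cover)
    for i, m in enumerate(bits):
        x = stego[i]
        if x % 2 != m:
            if x == lo:
                step = 1                   # 0 can only be increased
            elif x == hi:
                step = -1                  # 255 can only be decreased
            else:
                step = rng.choice((-1, 1))
            stego[i] = x + step
    return stego

def pm1_extract(stego, nbits):
    # extraction is identical to LSB embedding: just read the LSBs
    return [y % 2 for y in stego[:nbits]]
```

Every modification changes the pixel by exactly ±1, yet the change symmetrically affects all bit planes rather than only the LSB plane.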

7.3.2 F5 embedding algorithm


The effect of the embedding operation on the security of steganographic algo-
rithms for the JPEG format is larger than on schemes that embed in the spatial
domain. This is because there exist good models for the histogram of DCT co-
efficients (see Chapter 2). The F5 algorithm [241] was originally designed to
overcome the histogram attack while still offering a large embedding capacity.
The F5 contains two important ingredients – its embedding operation and
matrix embedding. The operation of LSB flipping was replaced with decrementing
the absolute value of the DCT coefficient by one. This preserves the natural
shape of the DCT histogram, which looks after embedding as if the cover image
was originally compressed using a lower quality factor. The second novel design
element, the matrix embedding, is a coding scheme that decreases the number
of embedding changes. For now, we do not consider matrix embedding in the
algorithm description and instead postpone it to Chapter 8.
The F5 algorithm embeds message bits along a pseudo-random path deter-
mined from a user passphrase. The message bits are again encoded as LSBs of
DCT coefficients along the path. If the coefficient’s LSB needs to be changed, in-
stead of flipping the LSB, the absolute value of the DCT coefficient is decreased
by one. To avoid introducing easily detectable artifacts, the F5 skips over the
DC terms and all coefficients equal to 0. In contrast to Jsteg, OutGuess, or
Model-Based Steganography, F5 does embed into coefficients equal to 1.
Because the embedding operation decreases the absolute value of the coefficient
by 1, it can happen that a coefficient originally equal to 1 or −1 is modified to
zero (a phenomenon called “shrinkage”). Because the recipient will be reading the
message bits from LSBs of non-zero AC DCT coefficients along the same path,
the bit embedded during shrinkage would get lost.3 Thus, if shrinkage occurs,
the sender has to re-embed the same message bit, which, by the way, will always
be a 0, at the next coefficient. However, by re-embedding these 0-bits, the sender
will end up embedding a biased bit stream containing more zeros than ones.
This would mean that odd coefficient values will be more likely to be changed
than even-valued coefficients. Yet again, we run into an embedding asymmetry
that introduces “staircase” artifacts into the histogram. There are at least two
solutions to this problem. One is to embed the message bit m as the XOR of the
coefficient LSB and some random sequence of bits also generated from the stego
key. This way, shrinkage will be equally likely to occur when embedding a 1 or
a 0. The implementation of F5 solves the problem differently by redefining the
LSB for negative numbers:

LSBF5(x) = 1 − (x mod 2)   for x < 0,
LSBF5(x) = x mod 2         otherwise.   (7.29)

Because in natural images the numbers of coefficients equal to 1 and −1 are ap-
proximately the same (h[1] ≈ h[−1]), this simple measure will also cause shrink-
age to occur when embedding both 0s and 1s with approximately equal proba-
bility. The pseudo-code for the embedding and extraction algorithm for F5 that
does not employ matrix embedding is shown in Algorithms 7.1 and 7.2.
To calculate the embedding capacity of F5, realize that F5 does not embed into
DCT coefficients equal to 0 and the DC term. Also, no bit is embedded during

3 Realize that the recipient has no means of determining whether the DCT coefficient was
originally zero or became zero due to embedding.
Practical steganographic methods 121

Algorithm 7.1 Embedding message m ∈ {0, 1}^m in JPEG cover image x ∈ X^n
using the F5 algorithm (no matrix embedding employed).
// Initialize a PRNG using stego key (or passphrase)
// Input: message m ∈ {0, 1}^m, quantized JPEG DCT coefficients x ∈ X^n
Path = Perm(n);
// Perm(n) is a pseudo-random permutation of {1, 2, . . . , n}
y = x;
i = 1; j = 1; // i message index, j coefficient index
while (i ≤ m) & (j ≤ n) {
  if (x[Path[j]] ≠ 0) & (x[Path[j]] is not DC term) {
    if LSBF5(x[Path[j]]) = m[i] {i = i + 1;}
    else {
      y[Path[j]] = x[Path[j]] − sign(x[Path[j]]);
      if y[Path[j]] ≠ 0 {i = i + 1;} // if y[Path[j]] = 0, shrinkage: re-embed m[i]
    }
  }
  j = j + 1;
}
// y are stego DCT coefficients conveying i − 1 message bits

Algorithm 7.2 Extracting message m from a JPEG stego image y ∈ X^n
embedded using the F5 algorithm (no matrix embedding employed).
// Initialize a PRNG using stego key (or passphrase)
// Input: quantized JPEG DCT coefficients y ∈ X^n
Path = Perm(n);
// Perm(n) is a pseudo-random permutation of {1, 2, . . . , n}
i = 1; j = 1; // i message index, j coefficient index
while (j ≤ n) {
  if (y[Path[j]] ≠ 0) & (y[Path[j]] is not DC term) {
    m[i] = LSBF5(y[Path[j]]); i = i + 1;
  }
  j = j + 1;
}
// read the header of extracted bits to find the message length and truncate m
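The same logic in runnable form (a sketch: `path` stands for the key-derived permutation, DC terms are assumed to be excluded from the coefficient list, and matrix embedding is again omitted):

```python
def lsb_f5(x):
    """Redefined LSB (7.29): LSBs of negative coefficients are flipped."""
    return 1 - (x % 2) if x < 0 else x % 2

def f5_embed(x, message, path):
    """F5 embedding (Algorithm 7.1): decrement |coefficient| on mismatch,
    skip zeros, and re-embed the current bit after shrinkage."""
    y = list(x)
    i = 0
    for j in path:
        if i >= len(message):
            break
        if y[j] == 0:
            continue                        # zeros never carry payload
        if lsb_f5(y[j]) == message[i]:
            i += 1                          # LSB already matches, no change
        else:
            y[j] -= 1 if y[j] > 0 else -1   # decrement the absolute value
            if y[j] != 0:
                i += 1                      # y[j] == 0 means shrinkage: retry
    return y

def f5_extract(y, path, nbits):
    """Algorithm 7.2: read LSBs of non-zero coefficients along the path."""
    return [lsb_f5(y[j]) for j in path if y[j] != 0][:nbits]
```

Note how `lsb_f5` makes shrinkage occur for coefficients equal to 1 when embedding a 0 and for coefficients equal to −1 when embedding a 1, which balances the re-embedded bits.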

shrinkage, which will lead to loss of (h[1] + h[−1])/2 bits. Thus, given a JPEG
file with in total nAC AC DCT coefficients, the embedding capacity is nAC −
h[0] − (h[−1] + h[1])/2, where h is the histogram of all AC DCT coefficients.
For example, for a cover source formed by JPEG images with 80% quality, the
average embedding capacity is about 0.75 bits per non-zero DCT coefficient,
which is quite large.
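The capacity formula above in code (a minimal sketch; h maps coefficient values to counts over all AC DCT coefficients):

```python
def f5_capacity(h, n_ac):
    """F5 embedding capacity in bits: nAC - h[0] - (h[1] + h[-1])/2,
    accounting for skipped zeros and bits lost to shrinkage."""
    return n_ac - h.get(0, 0) - (h.get(1, 0) + h.get(-1, 0)) / 2
```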
We stress that the F5 algorithm does not preserve the histogram but pre-
serves its crucial characteristics, such as its monotonicity and monotonicity of
increments [241].

7.4 Minimal-impact steganography

Steganography by cover modification will inevitably introduce some embedding
changes into the cover image. It is intuitively clear that the modifications may
have different impact on statistical detectability depending on the local con-
text, properties of the cover element being modified, and other factors. If we
could quantify the contribution, ρ[i], of modifying every cover element, x[i], to
the overall statistical detectability, we would essentially convert the problem of
maximizing the security into an optimization problem: “How to embed a given
payload while minimizing the overall expected embedding impact?” This ap-
proach to steganography is rather general and it is also appealing because of
its modular architecture. For example, progress in our understanding of the im-
pact of embedding on security will only lead to updated values ρ[i], while the
optimization algorithm may stay the same.
Steganographic methods designed from the principle of minimal embedding
impact may not minimize the KL divergence between cover and stego images
for any chosen model of covers (see Exercise 7.6). As explained earlier in this
chapter, our approach is heuristic, justified by the lack of accurate cover models.
Indeed, schemes that painstakingly preserve some simple cover model may in
fact be quite detectable because of the model misfit. The strategy proposed in
this section is to abstract from a model and, instead of trying to preserve the
cover model, accept in advance the fact that the steganography you will build
will not be perfect and minimize the impact of embedding.
Let us assume that the impact of making an embedding change at pixel i can
be captured using a scalar value ρ[i] ≥ 0. Denoting by x and y the cover and
stego image, the total embedding impact is defined as


n
dρ (x, y) = ρ[i] (1 − δ(x[i] − y[i])) , (7.30)
i=1

where δ(x) is the Kronecker delta (2.23). We point out that (7.30) implicitly
assumes that the embedding impact is additive because it is defined as a sum
of impacts at individual pixels. In general, however, the embedding modifica-
tions could be interacting among themselves, reflecting the fact that making two
changes to adjacent pixels might be more or less detectable than making the
same changes to two pixels far apart from each other. A detectability measure
that takes interaction among pixels into account would not be additive. If the
density of embedding changes is low, however, the additivity assumption is
plausible because the distances between modified pixels will generally be large
and the embedding changes will not interfere much.
The reader should realize that ρ[i] does not necessarily have to correspond to
embedding distortion as defined in Chapter 4. For example, embedding changes
in textured or noisy areas of the cover image are less likely to introduce detectable
artifacts than changes in smooth areas. This could be captured by introducing
weights, ω[i] ≥ 0, for each cover element and defining

ρ[i] = ω[i]|x[i] − y[i]|γ , (7.31)

where γ is a non-negative parameter. If the embedding change is probabilistic
and more than one value y[i] is possible, (7.31) is understood as the expected
value. For example, in ±1 embedding, y[i] = x[i] + 1 and y[i] = x[i] − 1 with
equal probability and thus E[|x[i] − y[i]|γ ] = 1.
We now give a few examples of typical assignments ρ used in steganography.
If ω[i] = 1 for all i and |x[i] − y[i]| = 1, dρ is the total number of embedding
changes and (7.30) coincides with the distortion measure ϑ defined in Chapter 4.
It is a reasonable measure of embedding impact when the magnitude of all em-
bedding changes is 1. For ω[i] = 1 and γ = 2, dρ is the energy of modifications,
while the choice ω[i] = 1 and γ = 1 gives the L1 norm between x and y (see
Chapter 4). Optimal-parity embedding as introduced in Section 5.2.4 could be
interpreted as minimal-embedding-impact steganography with ρ[i] equal to the
isolation of the palette color at pixel i. As another example, if for some reason
some cover elements are not to be modified under any circumstances, we can set
ω[i] = ∞ for them. This measure of embedding impact will be used in Chapter 9.
The impact ω[i] may also be determined from some side-information about
the ith element. For example, let us assume that the cover is a color TIFF image
sampled at 16 bits per channel (48 bits per pixel). The sender wishes to embed
a message while decreasing the color depth to a true-color 8-bit per channel
image while minimizing the combined quantization and embedding distortion.
Let z[i] be the 16-bit color value and let Q = 2^8 be the quantization step for the
color-depth reduction. The quantization error at the ith pixel is

e[i] = Q |round(z[i]/Q) − z[i]/Q|,   (7.32)

0 ≤ e[i] ≤ Q/2. To embed a message bit as the LSB of the rounded value, the
sender will need to round z[i] in the opposite direction, which would result in an
increased quantization error of Q − e[i]. Thus, the embedding distortion would
be the difference between the two rounding errors ρ[i] = Q − 2e[i]. Note that it
is therefore advantageous for the sender to select for embedding those pixels with
the smallest ρ[i], which are exactly those 16-bit values z[i] that are close to the
middle of quantization intervals, e[i] ≈ Q/2. In this case, the recipient cannot
read the message because the selection channel is not available to him (he sees
only the final quantized values). This problem can be solved using special coding
techniques called wet paper codes explained in Chapter 9.
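The side-informed impact above is simple to compute (a sketch with illustrative names; `z16` holds the original 16-bit channel values):

```python
def quantization_impact(z16, Q=2 ** 8):
    """For each 16-bit value return (8-bit quantized value, rho), where
    e = Q*|round(z/Q) - z/Q| is the rounding error (0 <= e <= Q/2) and
    rho = Q - 2e = (Q - e) - e is the extra distortion incurred by
    rounding the other way to match a message bit."""
    out = []
    for z in z16:
        q = (z + Q // 2) // Q        # nearest multiple index (ties round up)
        e = abs(Q * q - z)           # rounding error of the natural choice
        out.append((q, Q - 2 * e))
    return out
```

Values sitting near the middle of a quantization interval get ρ ≈ 0 and are therefore the cheapest to use for embedding, exactly as argued above.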
In general, it is a highly non-trivial problem to design steganographic schemes
that embed a given payload while minimizing the total embedding impact. There
exist many suboptimal schemes for special choices of the embedding impact. The
whole of Chapter 8 is devoted to design of steganographic schemes when ρ[i] = 1
and |x[i] − y[i]| = 1, or in other words when the embedding impact is the number
of embedding changes.
We have already encountered special cases of minimal-embedding-impact
steganography in Chapter 5. The optimal parity assignment was designed to
produce the smallest expected distortion when embedding in a palette image
along a pseudo-random path. The suboptimal scheme for palette images called
the embedding-while-dithering method also belongs to this category.
Since the design of steganographic schemes that attempt to minimize the em-
bedding impact requires knowledge of certain coding techniques, we do not de-
scribe any specific instances of embedding schemes in this chapter. Instead, we
postpone this topic to the next two chapters.
In the next section, we establish a fundamental performance bound for
minimal-embedding-impact steganography. In particular, we derive a quantita-
tive relationship between the maximal payload one can embed for a given bound
on the embedding impact. Knowledge of this theoretical bound will give us an
opportunity to evaluate the performance of suboptimal steganographic schemes
and compare them.

7.4.1 Performance bound on minimal-impact embedding


This section is devoted to theoretical analysis of minimal-embedding-impact
steganographic schemes. Our goal is to derive, for a given assignment ρ[i], a
relationship between the relative payload α and the minimal embedding impact
needed to embed this payload using any steganographic method.
Let us assume that the sender wants to communicate m = αn bits in n pix-
els with assigned embedding-impact measure ρ[i], i = 1, . . . , n. For n-element
objects, x, y, we define the modification pattern s ∈ {0, 1}n as s[i] = 1 when
x[i] ≠ y[i] and s[i] = 0 otherwise. Furthermore, let d(s) = dρ (x, y) be the im-
pact of making embedding changes at pixels with s[i] = 1. Let us assume that
the recipient also knows the cover x. The sender then basically communicates the
modification pattern s. Assuming the sender selects each pattern s with proba-
bility p(s), the amount of information that can be communicated is the entropy
of p(s),

H(p) = −Σs p(s) log2 p(s).   (7.33)

Our problem is now reduced to finding the probability distribution p(s) on the
space of all possible flipping patterns s that minimizes the expected value of the
Practical steganographic methods 125

embedding impact

Σs d(s)p(s)   (7.34)

subject to the constraints

H(p) = −Σs p(s) log2 p(s) = m,   (7.35)
Σs p(s) = 1.   (7.36)

This problem can be solved using Lagrange multipliers. Let

F(p(s)) = Σs p(s)d(s) + a (m + Σs p(s) log2 p(s)) + b (Σs p(s) − 1).   (7.37)

Then,

∂F/∂p(s) = d(s) + a log2 p(s) + a/log 2 + b = 0   (7.38)

if and only if p(s) = A e^(−λd(s)), where A^−1 = Σs e^(−λd(s)) and λ is determined from

−Σs p(s) log2 p(s) = m.   (7.39)

Thus, the probabilities p(s) follow an exponential distribution with respect to
the embedding impact d(s). Note that this result does not depend on the specific
form of d and thus holds, for example, for non-additive measures as well.
If the embedding impact of the pattern s is an additive function of “singleton”
patterns (patterns for which only one pixel is modified), then d(s) = s[1]ρ[1] +
· · · + s[n]ρ[n], and p(s) takes the form

p(s) = A e^(−λ Σ_{i=1}^{n} s[i]ρ[i]) = A Π_{i=1}^{n} e^(−λs[i]ρ[i]),   (7.40)
A^−1 = Σs Π_{i=1}^{n} e^(−λs[i]ρ[i]) = Π_{i=1}^{n} (1 + e^(−λρ[i])),   (7.41)

which further implies

p(s) = Π_{i=1}^{n} p(i, s[i]),   (7.42)

where p(i, 0) and p(i, 1) are the probabilities that the ith pixel is not (is) modified
during embedding,

p(i, 0) = 1/(1 + e^(−λρ[i])),   p(i, 1) = e^(−λρ[i])/(1 + e^(−λρ[i])).   (7.43)
This means that the joint probability distribution p(s) can be factorized and
thus we need to know only the marginal probabilities p(i, 1) that the ith pixel is

modified. It also enables us to write for the entropy



n
H(p) = H (p(i, 1)) , (7.44)
i=1

where the function H applied to a scalar is the binary entropy function H(x) =
−x log2 x − (1 − x) log2 (1 − x).
Note that in the special case when ρ[i] = 1 for all i, the embedding impact
per pixel is the change rate d/n = β, and we obtain

m = ∑_{i=1}^{n} H (p(i, 1)) = nH (e^{−λ}/(1 + e^{−λ})),    (7.45)

d = E[∑_{i=1}^{n} p(i, 1)ρ[i]] = ne^{−λ}/(1 + e^{−λ}),    (7.46)

which gives the following relationship between the change rate and the relative
message length α = m/n:

β = H^{−1}(α).    (7.47)

In Section 8.4.2 on matrix embedding, we rederive this bound in a different
manner using purely combinatorial considerations.
We now derive the relationship between embedding capacity and impact in
the limit of a large number of pixels, n → ∞. Let us sort ρ[i] from the smallest
to the largest and normalize so that ∑_i ρ[i] = 1. Let ρ be a Riemann-integrable
non-decreasing function on [0, 1] such that ρ(i/n) = ρ[i]. Then for n → ∞, the
average distortion per element

d/n = (1/n) ∑_{i=1}^{n} p(i, 1)ρ[i] → ∫_0^1 p(x)ρ(x) dx,    (7.48)

where p(x) = e^{−λρ(x)}/(1 + e^{−λρ(x)}). By the same token,

α = m/n = (1/n) ∑_{i=1}^{n} H (p(i, 1)) → ∫_0^1 H (p(x)) dx.    (7.49)

Finally, by direct calculation

log 2 ∫_0^1 H (p(x)) dx = λ ∫_0^1 ρ(x)e^{−λρ(x)}/(1 + e^{−λρ(x)}) dx + ∫_0^1 log(1 + e^{−λρ(x)}) dx    (7.50)

= λ ∫_0^1 (ρ(x) + xρ′(x)) e^{−λρ(x)}/(1 + e^{−λρ(x)}) dx + log(1 + e^{−λρ(1)}).    (7.51)
The second equality is obtained by integrating the second integral by parts. We
derived the embedding-capacity–embedding-impact relationship in a parametric
form

d(λ)/n = G_ρ(λ),    (7.52)

α(λ) = (1/log 2) [λF_ρ(λ) + log(1 + e^{−λρ(1)})],    (7.53)

where λ is a non-negative parameter and

G_ρ(λ) = ∫_0^1 ρ(x)e^{−λρ(x)}/(1 + e^{−λρ(x)}) dx,    (7.54)

F_ρ(λ) = ∫_0^1 (ρ(x) + xρ′(x)) e^{−λρ(x)}/(1 + e^{−λρ(x)}) dx.    (7.55)

Figure 7.5 (plot) Minimal relative embedding impact d/n versus relative message
length α for four embedding-impact profiles on [0, 1]. Constant: ρ(x) = 1,
square-root profile: ρ(x) = √x, linear: ρ(x) = x, square: ρ(x) = x².

Figure 7.5 shows the embedding impact per pixel, d/n, as a function of relative
payload α for four profiles ρ(x). The square profile ρ(x) = x² has the smallest
impact for a fixed payload among the four, and the biggest difference occurs for
small payloads. This is understandable because the square profile is the smallest
among the four for small x.
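The curves of Figure 7.5 can be reproduced numerically. In this sketch (ours; midpoint Riemann sums stand in for the integrals G_ρ and F_ρ), we bisect on λ to reach a target payload α and report the corresponding impact d/n for the constant and square profiles:

```python
import math

def alpha_and_impact(rho, drho, lam, N=2000):
    # One point (alpha(lam), d(lam)/n) of the parametric form (7.52)-(7.53);
    # the integrals G_rho (7.54) and F_rho (7.55) are midpoint Riemann sums
    G = F = 0.0
    for k in range(N):
        x = (k + 0.5) / N
        w = 1.0 / (1.0 + math.exp(lam * rho(x)))   # = e^{-lam rho}/(1 + e^{-lam rho})
        G += rho(x) * w / N
        F += (rho(x) + x * drho(x)) * w / N
    alpha = (lam * F + math.log(1.0 + math.exp(-lam * rho(1.0)))) / math.log(2.0)
    return alpha, G

def impact_at(rho, drho, target_alpha):
    # alpha(lam) falls from 1 at lam = 0; bisect lam to reach target_alpha
    lo, hi = 0.0, 200.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if alpha_and_impact(rho, drho, mid)[0] > target_alpha else (lo, mid)
    return alpha_and_impact(rho, drho, 0.5 * (lo + hi))[1]

d_const  = impact_at(lambda x: 1.0, lambda x: 0.0, 0.5)     # equals H^{-1}(1/2)
d_square = impact_at(lambda x: x * x, lambda x: 2 * x, 0.5)
print(d_const, d_square)   # the square profile gives the smaller impact at alpha = 1/2
```

For the constant profile the parametric form collapses to (7.45)–(7.46), so the first printed value reproduces H^{-1}(1/2) ≈ 0.110.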
Before closing this section, we provide one interesting result that essentially
states that among a certain class of embedding operations for JPEG images,
the embedding operation of F5 is optimal because, in some well-defined sense, it
minimizes the embedding impact.
7.4.2 Optimality of F5 embedding operation


In this section, we show that the embedding operation of F5 (decreasing the
absolute value of DCT coefficients) introduces the minimal embedding distortion
among all operations from a certain class. Thus, one could also think of F5 as
an instance of minimal-impact steganography.
As in Chapter 2, we denote by d[i] the DCT coefficients from the cover image
after dividing by quantization steps but before rounding them to integers. Here,
we use only a one-dimensional index assuming that the DCT coefficients are
sorted in some way. While the range of d[i] depends on the implementation of
the DCT transform, here we assume d[i] are real numbers. The DCT coefficients
after rounding are denoted D[i]. After applying a steganographic algorithm to
the JPEG file, the quantized DCT coefficients D[i] are changed to y[i]. The
coefficients D[i] and y[i] are thus integers. We denote by h[k] the histogram of
quantized AC DCT coefficients of the cover image. For γ > 0, the distortion due
to rounding is

d_{γ,round} = ∑_{i=1}^{n} |d[i] − D[i]|^γ.    (7.56)

We take the total distortion due to quantization and embedding as the measure
of embedding impact,

d_{γ,emb} = ∑_{i=1}^{n} |d[i] − y[i]|^γ.    (7.57)

Additionally, we define the probabilistic ε-embedding operation that changes a
DCT coefficient towards zero with probability 1 − ε and away from zero with
probability ε. In other words, D[i] is changed to D[i] − sign(D[i]) with probability
1 − ε and to D[i] + sign(D[i]) with probability ε. The F5 embedding operation is
obtained for ε = 0, while the ±1 embedding corresponds to ε = 1/2.

Proposition 7.1. [Optimality of F5 embedding operation] In the absence
of any information about the unquantized DCT coefficients, the expected value of
d_{γ,emb} is minimized for ε = 0, which is the embedding operation of F5, for any
γ > 0.

Proof. Viewing d[i] as instances of a random variable, let f(x) be its probability
distribution function. In Chapter 2, we learned that the histogram of quantized
DCT coefficients in a JPEG file follows a distribution with a sharp peak at
zero that is monotonically decreasing for positive coefficient values and increasing
for negative coefficients. Thus, we assume that f is increasing on (−∞, 0)
and decreasing on (0, ∞) and therefore Lebesgue integrable. Let us inspect the
embedding distortion for one value of the quantized coefficient d ≠ 0, where,
say, d < 0 (this means that f is increasing at d) under the assumption that a
random fraction β of DCT coefficients are modified. The embedding operation
will thus randomly change a fraction ε of the βh[d] modified coefficients away
from zero (increase their absolute value) and the fraction 1 − ε towards zero
(decrease their absolute value). Let us now express the increase of distortion
due to embedding with respect to the distortion solely due to rounding,
d_{γ,emb} − d_{γ,round} (follow Figure 7.6).

Figure 7.6 (plot of the pdf f(x) over the range from d − 1 to d + 1) Illustrative
example for derivations in the text.

The unquantized DCT coefficients that are quantized to d lie between d − 1/2
and d + 1/2 (the dashed lines in Figure 7.6). If an unquantized coefficient x ∈
(d − 1/2, d] is later changed to d − 1 during embedding (with probability ε),
the increase in distortion is Δ1(x) = (x − (d − 1))^γ − (d − x)^γ. If the coefficient
is changed to d + 1 (with probability 1 − ε), the increase in distortion is
Δ2(x) = (d + 1 − x)^γ − (d − x)^γ. For coefficients x ∈ (d, d + 1/2], the increase in
distortion is Δ3(x) = (x − (d − 1))^γ − (x − d)^γ, when d is changed to d − 1 (with
probability ε), and Δ4(x) = (d + 1 − x)^γ − (x − d)^γ, when d is changed to d + 1
(with probability 1 − ε). Thus, the expected value of the distortion increase,
d(ε) = d_{γ,emb} − d_{γ,round}, is

d(ε) = ∫_{d−1/2}^{d} (εβf(x)Δ1(x) + (1 − ε)βf(x)Δ2(x)) dx
       + ∫_{d}^{d+1/2} (εβf(x)Δ3(x) + (1 − ε)βf(x)Δ4(x)) dx    (7.58)

     = εβ ∫_{d−1/2}^{d} f(x) (Δ1(x) − Δ2(x)) dx
       + εβ ∫_{d}^{d+1/2} f(x) (Δ3(x) − Δ4(x)) dx + C,    (7.59)

where C does not depend on ε. We clearly have Δ1(x) − Δ2(x) = Δ3(x) −
Δ4(x) = (x − d + 1)^γ − (d + 1 − x)^γ = g(x). Moreover, g(d + y) = −g(d − y) for
y ∈ [−1/2, 1/2] and g(x) ≥ 0 on [d, d + 1/2]. Thus, we can write for d(ε)

d(ε) = εβ ∫_0^{1/2} f(d − y)g(d − y) dy + εβ ∫_0^{1/2} f(d + y)g(d + y) dy + C    (7.60)

     = εβ ∫_0^{1/2} (f(d + y) − f(d − y)) g(d + y) dy + C.    (7.61)

Because f is increasing and g(d + y) ≥ 0 on [0, 1/2], d(ε) is minimized when ε = 0,
which corresponds to the F5 embedding operation.
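Proposition 7.1 can also be checked empirically. The following Monte Carlo sketch (ours, not the book's; it assumes a unit-scale Laplacian model for the unquantized coefficients) applies the ε-embedding operation to every non-zero rounded coefficient and compares the total distortion d_{γ,emb} for several values of ε:

```python
import random

random.seed(7)

def eps_embed(D, eps):
    # Probabilistic eps-embedding: with probability 1 - eps move the non-zero
    # coefficient D towards zero, with probability eps away from zero
    s = 1 if D > 0 else -1
    return D + s if random.random() < eps else D - s

# Unquantized coefficients: zero-mean Laplacian (sharp peak at zero)
d_unq = [random.expovariate(1.0) * (1 if random.random() < 0.5 else -1)
         for _ in range(200000)]
gamma = 2.0   # distortion exponent; here every non-zero coefficient is changed

def total_distortion(eps):
    tot = 0.0
    for d in d_unq:
        D = round(d)
        y = eps_embed(D, eps) if D != 0 else D
        tot += abs(d - y) ** gamma
    return tot

d0, d25, d50 = (total_distortion(e) for e in (0.0, 0.25, 0.5))
print(d0 < d25 < d50)   # True: eps = 0 (the F5 operation) is cheapest
```

The ordering holds because, within each quantization bin, the Laplacian model puts more mass on the side closer to zero, exactly the asymmetry the proof exploits.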

Summary
• In this chapter, we study methods for construction of practical steganographic
  schemes for digital images.
• There exist four heuristic principles that can be applied to decrease the KL
  divergence between cover and stego images and thus improve the steganographic
  security:
  – Model-preserving steganography
  – Making embedding mimic natural processing
  – Steganalysis-aware steganography
  – Minimum-embedding-impact steganography
• In model-preserving steganography, a simplified model of cover images is
  formulated and the embedding is forced to preserve that model.
• The most frequently adopted model is to consider the cover as a sequence of
  iid random variables. The model itself is either the sample distribution of cover
  elements or a parametric fit through the sample distribution. Steganography
  preserving the cover histogram is undetectable within the iid model.
• Model-preserving steganography can be attacked by identifying a quantity
  that is not preserved under embedding.
• Stochastic modulation is an example of embedding designed to mimic the
  image-acquisition process. The message is embedded by superimposing
  quantized iid noise with a given distribution.
• In steganalysis-aware steganography, the designer of the stego system focuses
  on making the impact of embedding undetectable using existing steganalysis
  schemes.
• Minimum-embedding-impact steganography starts by assigning to each cover
  element a scalar value expressing the impact of making an embedding change
  at that element. The designer then attempts to embed messages by minimizing
  the total embedding impact.
• Among all embedding operations that change a quantized DCT coefficient
  towards zero and away from zero with fixed probability, the operation of
  the F5 algorithm minimizes the combined distortion due to quantization and
  embedding.
Exercises

7.1 [Histogram after ±1 embedding] Let h[i], i = 0, 1, . . . , 255, denote
the histogram of an 8-bit cover grayscale image. Show that after embedding a
message of relative length α using ±1 embedding, the stego image histogram is
hα = Ah, where A is tri-diagonal with 1 − α/2 on the main diagonal and α/4
on the two second main diagonals, with the exception of the boundary entries
A[1, 0] = A[254, 255] = α/2 (indexing rows and columns 0, . . . , 255 to match the
histogram):

        ⎛ 1 − α/2   α/4       0        0      ···      0        0     ⎞
        ⎜  α/2     1 − α/2   α/4       0      ···      0        0     ⎟
        ⎜   0       α/4     1 − α/2   α/4     ···      0        0     ⎟
A =     ⎜  ···      ···      α/4     1 − α/2  α/4     ···      ···    ⎟ .    (7.62)
        ⎜   0        0       ···      α/4    1 − α/2  α/4       0     ⎟
        ⎜   0        0       ···       0      α/4    1 − α/2   α/2    ⎟
        ⎝   0        0       ···       0       0      α/4    1 − α/2  ⎠
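The claimed relation is easy to sanity-check by simulation before proving it. In this sketch (ours; the change rate α/2 is applied directly to each pixel), the empirical stego histogram is compared with the prediction hα = Ah:

```python
import random

random.seed(1)
alpha, n = 0.4, 400000          # relative message length; change rate alpha/2
cover = [random.randint(0, 255) for _ in range(n)]

def pm1(g):
    # +-1 embedding: change the pixel with probability alpha/2; at the
    # boundary values 0 and 255 only one direction is feasible
    if random.random() < alpha / 2:
        return 1 if g == 0 else 254 if g == 255 else g + random.choice((-1, 1))
    return g

stego = [pm1(g) for g in cover]
h, ha = [0] * 256, [0] * 256
for g in cover: h[g] += 1
for g in stego: ha[g] += 1

# Prediction h_alpha = A h with the tri-diagonal matrix of (7.62)
pred = [(1 - alpha / 2) * h[i] for i in range(256)]
for i in range(1, 255):
    pred[i - 1] += alpha / 4 * h[i]
    pred[i + 1] += alpha / 4 * h[i]
pred[1] += alpha / 2 * h[0]     # boundary exceptions of A
pred[254] += alpha / 2 * h[255]
err = max(abs(pred[i] - ha[i]) for i in range(256)) / n
print(err)                      # a small relative deviation
```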

7.2 [Impact of F5 on histogram] In a JPEG image, let h^{(kl)}[i] be the
number of AC DCT coefficients corresponding to spatial frequency (k, l) that
are equal in absolute value to i. Suppose that F5 changes a total of n DCT
coefficients. Thus, the probability that a randomly selected non-zero AC DCT
coefficient is changed is β = n/n0, where n0 is the number of all non-zero AC
DCT coefficients. Show that the expected value of the stego image histogram is

E[h_β^{(kl)}[i]] = (1 − β)h^{(kl)}[i] + βh^{(kl)}[i + 1],   i > 0,    (7.63)

E[h_β^{(kl)}[0]] = h^{(kl)}[0] + βh^{(kl)}[1].    (7.64)

7.3 [One-stego-sequence stochastic modulation] Consider the following
version of stochastic modulation. The sender generates one stego noise sequence
r[i] = round(η[i]), where η[i] follows an arbitrary distribution symmetrical about
zero. Assume that the parity function satisfies

π(x + r, r) = −π(x − r, r) ∈ {−1, 1} for any r ≠ 0.    (7.65)

(Can you find one such π?) Ignoring the boundary effects for now, at each pixel
x[i] where r[i] ≠ 0, the sender embeds one message bit m ∈ {−1, 1} as

y[i] = x[i] + m × π(x[i] + r[i], r[i])r[i].    (7.66)

First, show that the effect of embedding is the same as adding to the cover image
a random variable with the same distribution as r[i]. Also show that the recipient
can read one message bit as

m = π(y[i], r[i])    (7.67)

whenever r[i] ≠ 0. Finally, if the same stego sequence is used in this simpler
version of stochastic modulation and in the version described in Section 7.2.1,
which version embeds at a lower distortion and why?
7.4 [Suboptimality of stochastic modulation] Prove that

1 − ∑_k p[k]² ≤ H(p)    (7.68)

for any pmf p. Hint: log p[k] = log (1 + (p[k] − 1)) ≤ p[k] − 1 by the log-
inequality log(1 + x) ≤ x, ∀x.

7.5 [Stego noise with amplitude 1] Show that for stego noise sequences
with amplitude 1 (p[k] = 0 for k ∉ {−1, 0, 1}), the entropy of the
optimal distribution H(p_opt) = H_{d_γ}(D) = −D log2 D − (1 − D) log2(1 − D) +
D for any γ > 0. Hint: First show that for p[−1] + p[1] = 2p, a constant,
−p[−1] log2 p[−1] − p[1] log2 p[1] is maximal when p[−1] = p[1] = p. Thus, we
need only to search for optimal distributions in the family of distributions
p = (p, 1 − 2p, p) parametrized by a scalar parameter p. For such distributions,
the distortion bound is 2p ≤ D and the result is obtained by setting p = D/2.

7.6 [ε-embedding minimizing the KL divergence] The principle of
minimal embedding impact does not have to produce steganographic schemes that
minimize the KL divergence. Consider the following example. Let us model the
quantized DCT coefficients in a JPEG file as iid instances of a random variable
with distribution given by h, the model of the histogram of quantized DCT
coefficients from the cover. Let us assume that h is non-increasing for positive values,
h[i] ≥ h[i + 1], and non-decreasing for negative values. Let us also assume that
h[i] > 0 for all i. A steganographic scheme that changes a fraction β of non-
zero DCT coefficients using the ε-embedding operation defined in Section 7.4.2
has the following impact on the histogram:

hβ[i] = (1 − β)h[i] + (1 − ε)βh[i + 1] + εβh[i − 1] for i > 1,    (7.69)
hβ[i] = (1 − β)h[i] + εβh[i + 1] + (1 − ε)βh[i − 1] for i < −1,    (7.70)
hβ[1] = (1 − β)h[1] + (1 − ε)βh[2],    (7.71)
hβ[−1] = (1 − β)h[−1] + (1 − ε)βh[−2],    (7.72)
hβ[0] = h[0] + (1 − ε)βh[1] + (1 − ε)βh[−1].    (7.73)

Using the fact that KL divergence is locally quadratic for small β (see Proposi-
tion B.7), show that

2D_KL(h||hβ)/β² = (1 − ε)² (h[1] + h[−1])²/h[0] + ((1 − ε)h[2] − h[1])²/h[1]
    + ((1 − ε)h[−2] − h[−1])²/h[−1]
    + ∑_{i>1} ((1 − ε)h[i + 1] + εh[i − 1] − h[i])²/h[i]
    + ∑_{i<−1} (εh[i + 1] + (1 − ε)h[i − 1] − h[i])²/h[i] + O(β).    (7.74)

By differentiating with respect to ε, show that the KL divergence is minimal
when ε = ε1/ε2, where

ε1 = ∑_{i<−1} (h[i + 1] − h[i − 1])(h[i] − h[i − 1])/h[i] − (h[−2]/h[−1])(h[−1] − h[−2])
    + (h[−1] + h[1])²/h[0] − (h[2]/h[1])(h[1] − h[2])
    + ∑_{i>1} (h[i + 1] − h[i − 1])(h[i + 1] − h[i])/h[i],    (7.75)

ε2 = ∑_{i<−1} (h[i + 1] − h[i − 1])²/h[i] + h[−2]²/h[−1] + (h[−1] + h[1])²/h[0]
    + h[2]²/h[1] + ∑_{i>1} (h[i + 1] − h[i − 1])²/h[i].    (7.76)

Use the monotonicity of h to show that ε1 ≤ ε2 and thus ε ≤ 1. Contrast this
result to Proposition 7.1.

7.7 [OutGuess does not preserve individual histograms] Even though
OutGuess preserves the global histogram of all DCT coefficients in a JPEG file,
it does not have to preserve histograms of individual DCT modes. Prove this
statement by considering one LSB pair {2j, 2j + 1} for the global histogram, h,
and the same bin for the histogram, h^{(kl)}, of a selected DCT mode (k, l). Assume
that

h[2j + 1]/h[2j] ≠ h^{(kl)}[2j + 1]/h^{(kl)}[2j].    (7.77)
8 Matrix embedding

In the previous chapter, we learned that one of the general guiding principles for
design of steganographic schemes is the principle of minimizing the embedding
impact. The plausible assumption here is that it should be more difficult for Eve
to detect Alice and Bob’s clandestine activity if they leave behind smaller embed-
ding distortion or “impact.” This chapter introduces a very general methodology
called matrix embedding using which the prisoners can minimize the total num-
ber of changes they need to carry out to embed their message and thus increase
the embedding efficiency. Even though special cases of matrix embedding can be
explained in an elementary fashion on an intuitive level, it is extremely empow-
ering to formulate it within the framework of coding theory. This will require the
reader to become familiar with some basic elements of the theory of linear codes.
The effort is worth the results because the reader will be able to design more
secure stegosystems, acquire a deeper understanding of the subject, and realize
connections to an already well-developed research field. Moreover, according to
the studies that appeared in [143, 95], matrix embedding is one of the most
important design elements of practical stegosystems.
As discussed in Chapter 5, in LSB embedding or ±1 embedding one pixel
communicates exactly one message bit. This was the case of OutGuess as well
as Jsteg. Assuming the message bits are random, each pixel is thus modified
with probability 1/2 because this is the probability that the LSB will not match
the message bit. Thus, on average two bits are embedded using one change or,
equivalently, the embedding efficiency is 2. If the message length is smaller than
the embedding capacity of the cover image, it is possible to substantially increase
the embedding efficiency and thus embed the same payload with fewer embedding
changes.
To explain why this is possible at all, consider the following simple example.
Let us assume that we wish to embed a message of relative length α = 2/3 using
LSB embedding. Thus, given a cover image with n pixels, the message contains
nα = 2n/3 bits. This means that we need to embed 2 bits, m[1], m[2], in a group
of three pixels with grayscale values g[1], g[2], g[3]. Classical LSB embedding
would embed bit m[1] at g[1] and bit m[2] at g[2] while skipping g[3] and thus
embed with embedding efficiency 2. We can, however, do better by embedding
the bits as

m[1] = LSB(g[1]) ⊕ LSB(g[2]), (8.1)


m[2] = LSB(g[2]) ⊕ LSB(g[3]), (8.2)

where ⊕ is the exclusive or. If the cover values already satisfy both equations,
no embedding changes are necessary. If the first equation is satisfied but not the
second one, the sender can flip the LSB of g[3]. If the second equation is satisfied
but not the first one, the sender should flip g[1]. If neither is satisfied, the sender
will flip g[2]. Because the probability of each case is 14 , the expected number of
changes is 0 × 14 + 1 × 14 +1 × 14 + 1 × 14 = 34 and the embedding efficiency (in
2
bits per change) is e = 3/4 = 83 > 2.
Note that in this scheme, we can no longer say which pixels convey the message
bits. Both bits are communicated by the group as a whole. One can say that in
matrix embedding the message is communicated by the embedding changes as
well as their position. In fact, the message-extraction rule that the receiver uses
is multiplication¹ of the vector of LSBs, x = LSB(g), by a matrix

m = ⎛1 1 0⎞ x,    (8.3)
    ⎝0 1 1⎠
which gave this method its name – matrix embedding [19, 52, 98, 233].
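As a quick check of the arithmetic above, the 2-bits-in-3-pixels scheme can be sketched in a few lines (our code, not the book's):

```python
def embed(g, m):
    # Embed bits m[0], m[1] into the LSBs of three pixel values g using
    # (8.1)-(8.2), changing at most one pixel
    x = [v & 1 for v in g]
    s1 = x[0] ^ x[1] ^ m[0]      # 1 iff (8.1) is violated
    s2 = x[1] ^ x[2] ^ m[1]      # 1 iff (8.2) is violated
    y = list(g)
    if s1 and s2:
        y[1] ^= 1                # flip the middle pixel (g[2] in the text's
                                 # 1-based indexing); this repairs both equations
    elif s1:
        y[0] ^= 1
    elif s2:
        y[2] ^= 1
    return y

def extract(y):
    x = [v & 1 for v in y]
    return [x[0] ^ x[1], x[1] ^ x[2]]   # the multiplication by (8.3)

# Any 2-bit message is embedded in (11, 10, 15) with at most one change
for m in ([0, 0], [0, 1], [1, 0], [1, 1]):
    y = embed([11, 10, 15], m)
    assert extract(y) == m
    assert sum(a != b for a, b in zip([11, 10, 15], y)) <= 1
print("ok")
```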
Before presenting a general approach to matrix embedding based on coding
theory, in Section 8.1 we generalize the simple trick of this introduction. Up
until now, the material can be comfortably grasped by all readers without any
background in coding theory. To prepare the reader for the material presented
in the rest of this chapter, Section 8.2 contains a brief overview of the theory
of binary linear codes. Readers not familiar with this subject are additionally
urged to read Appendix C, which contains a more detailed tutorial. Section 8.3
contains the main result of this chapter, the matrix embedding theorem. The
theorem gives us a general methodology to improve the embedding efficiency
of steganographic methods using linear codes. We also revisit Section 8.1 and
interpret the method from a new viewpoint. The theoretical limits of matrix
embedding methods are the subject of Section 8.4. A matrix embedding approach
suitable for hiding large payloads appears in Section 8.5. It also illustrates the
usefulness of random codes. Section 8.6 deals with embedding methods realized
using codes defined over larger alphabets (q-ary codes). Such codes can further
increase the embedding efficiency at the cost of larger distortion. Finally, in
Section 8.7 we explain an alternative approach to minimizing the number of
embedding changes that is based on sum and difference covering sets of finite
cyclic groups.

1 All arithmetic operations are performed in the usual binary arithmetic (8.14) (also, see
Appendix C for more details).
8.1 Matrix embedding using binary Hamming codes

In this section, we explain the matrix embedding method used in the F5 al-
gorithm (Chapter 7), which was the first practical steganographic scheme to
incorporate this embedding mechanism. The reader will recognize later that the
method is based on binary Hamming codes.
Assume the sender wants to communicate a message with relative length
αp = p/(2^p − 1), p ≥ 1, which means that p message bits, m[1], . . . , m[p], need
to be embedded in 2^p − 1 pixels. We denote by x the vector of LSBs of 2^p − 1
pixels from the cover image (e.g., collected along a pseudo-random path if the
embedding uses a stego key for random spread of the message bits). The sender
and recipient share a p × (2^p − 1) binary matrix H that contains all non-zero
binary vectors of length p as its columns. An example of such a matrix for p = 3
is

    ⎛0 0 0 1 1 1 1⎞
H = ⎜0 1 1 0 0 1 1⎟ .    (8.4)
    ⎝1 0 1 0 1 0 1⎠
As in the previous simple example, the sender modifies the pixel values so that
the column vector of their LSBs, y, satisfies
m = Hy. (8.5)
We call the vector Hy the “syndrome” of y. If by chance the syndrome of the
cover pixels already communicates the correct message, Hx = m, which happens
with probability 1/2^p, the sender does not need to modify any of the 2^p − 1 cover
pixels, sets y = x, and proceeds to the next block of 2^p − 1 pixels and embeds
the next segment of p message bits.
When Hx ≠ m, the sender looks up the difference Hx − m as a column in H
(there must be such a column because H contains all non-zero binary p-tuples).
Let us say that it is the jth column and we write it as H[., j]. By flipping the
LSB of the jth pixel and keeping the remaining pixels unchanged,
y[j] = 1 − x[j], (8.6)
y[k] = x[k], k ≠ j, (8.7)
the syndrome of y now matches the message bits,
Hy = m. (8.8)
This is because Hy = Hx + H(y − x) = Hx − m + H[., j] + m = m, because
Hx − m = H[., j] and in binary arithmetic z + z = 0 for any z.
To complete the description of the steganographic algorithm, the recipient
follows the same path through the image as the sender and reads p message bits
from the LSBs of each block of 2p − 1 pixels as the syndrome m = Hy.
Let us now compute the embedding efficiency of this embedding method. With
probability 1/2^p, the sender does not modify any of the 2^p − 1 pixels. Furthermore,
she makes exactly one change with probability 1 − 1/2^p. The average number
of embedding changes is thus 0 × 1/2^p + 1 × (1 − 1/2^p) = 1 − 1/2^p and the
embedding efficiency is

e_p = p/(1 − 2^{−p}).    (8.9)

The embedding efficiency and the relative payload α_p = p/(2^p − 1) for different
values of p are shown in Table 8.1.

Table 8.1. Relative payload, α_p, and embedding efficiency, e_p, in bits per change for
matrix embedding using binary Hamming codes [2^p − 1, 2^p − 1 − p].

p αp ep
1 1.000 2.000
2 0.667 2.667
3 0.429 3.429
4 0.267 4.267
5 0.161 5.161
6 0.095 6.095
7 0.055 7.055
8 0.031 8.031
9 0.018 9.018
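The table's entries follow directly from α_p = p/(2^p − 1) and (8.9); a few lines (ours) regenerate them:

```python
# Recompute Table 8.1: alpha_p = p / (2^p - 1), e_p = p / (1 - 2^{-p})
for p in range(1, 10):
    print(p, round(p / (2**p - 1), 3), round(p / (1 - 2**-p), 3))
```

Note that e_p = p + α_p, so the fractional parts of the two columns agree in every row.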

Note that with increasing p, the embedding efficiency, ep , increases while the
relative payload, αp , decreases. The improvement over embedding each message
bit at exactly one pixel is quite substantial. For example, a message of relative
length 0.095 can be embedded with embedding efficiency 6.095, which means
that a little over six bits are embedded with a single embedding change.
The motivational example from the beginning of this section is obtained for
p = 2. Also, notice that the classical LSB embedding corresponds to p = 1, in
which case the matrix H is 1 × 1 and the embedding efficiency is 2.
If the sender plans to communicate a message whose relative length α is not
equal to any α_p, one needs to choose the largest p for which α_p ≥ α to make
sure that the complete message can be embedded. This means that for α > 2/3,
this matrix embedding method does not bring any improvement over classical
embedding methods that embed one bit at each pixel.
The parameter p also needs to be communicated to the receiver, which can
be arranged in many different ways. One possibility is to reserve a few pixels
from the image using the stego key and embed the binary representation of p
in LSBs of those pixels and then use the rest of the image to communicate the
main payload.

Example 8.1: [Matrix embedding using binary Hamming codes] We now
work out a simple example of how Hamming codes can be used in practice
to embed bits. Let us assume that we are embedding relative payload α = 3/7,
which means we are embedding 3 bits in 2³ − 1 = 7 pixels. Given a block of
pixel values g = (11, 10, 15, 17, 13, 21, 19) and message m = (0, 0, 1)′, Alice first
converts the pixel values to bits, x = g mod 2 = (1, 0, 1, 1, 1, 1, 1)′, and computes
the syndrome Hx = (0, 1, 0)′, where H appears in (8.4). Then, she finds the triple
Hx − m = (0, 1, 1)′ as a column in H. It is the third column. Thus, to embed m
in the block, she needs to flip the LSB of the third pixel. For example, she may
change g[3] = 15 to 14. Note that Alice could have achieved the same effect by
changing the value to 16 as well.

Figure 8.1 Example of embedding using binary Hamming codes: the pixel values
g = (11, 10, 15, 17, 13, 21, 19) have LSBs x = π(g) = (1, 0, 1, 1, 1, 1, 1)′ with
syndrome Hx = (0, 1, 0)′; since Hx − m = (0, 1, 1)′ is the third column of H,
changing 15 to 14 yields y with Hy = m = (0, 0, 1)′.
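Example 8.1 generalizes to any p. The sketch below (our own implementation, not the book's) builds the parity-check matrix whose jth column is the binary expansion of j, as in (8.4), and embeds p bits into 2^p − 1 LSBs with at most one flip:

```python
def hamming_H(p):
    # Parity-check matrix of the [2^p - 1, 2^p - 1 - p] Hamming code:
    # its jth column (1-based j) is the p-bit binary expansion of j, cf. (8.4)
    n = 2**p - 1
    return [[(j >> (p - 1 - i)) & 1 for j in range(1, n + 1)] for i in range(p)]

def syndrome(H, x):
    return [sum(r * v for r, v in zip(row, x)) % 2 for row in H]

def embed(H, x, m):
    # Flip at most one bit of x so that the stego syndrome equals m, cf. (8.5)
    s = [(a + b) % 2 for a, b in zip(syndrome(H, x), m)]   # Hx - m over F_2
    y = list(x)
    if any(s):
        j = int("".join(map(str, s)), 2) - 1   # index of s among the columns of H
        y[j] ^= 1
    return y

H = hamming_H(3)
x = [1, 0, 1, 1, 1, 1, 1]     # LSBs of g = (11, 10, 15, 17, 13, 21, 19)
y = embed(H, x, [0, 0, 1])
print([i for i in range(7) if x[i] != y[i]], syndrome(H, y))
# -> [2] [0, 0, 1]: flip the third pixel, and the new syndrome is the message
```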

There is a strong connection between matrix embedding methods and covering


codes, which we now explore. The reader will realize that what appeared as a
clever trick in this section will now fit into a much more general framework. For-
mulating the approach within the realm of coding theory will enable us to design
more general and efficient matrix embedding methods as well as derive bounds
on how much it is theoretically possible to minimize the number of embedding
changes. To this end, the reader needs to become more familiar with the basic
concepts from the theory of linear codes. Readers with a basic background in
coding theory can skip the next section.

8.2 Binary linear codes

This section briefly introduces selected basic concepts and notation from the
theory of binary linear codes that will be needed to understand the material in
the rest of this chapter. The reader is encouraged to read Appendix C, which
contains a brief tutorial on linear codes over finite fields. A good introduction to
coding theory is [248].
We denote by F_2^n the vector space of all binary vectors, x = (x[1], . . . , x[n]), of
length n where addition and multiplication by a scalar is performed elementwise,

x + y = (x[1] + y[1], . . . , x[n] + y[n]),    (8.10)
bx = (bx[1], . . . , bx[n]),    (8.11)

for all x, y ∈ F_2^n, b ∈ {0, 1} = F_2. All operations in F_2 are in the usual binary
arithmetic
0 + 0 = 1 + 1 = 0, (8.12)
0 + 1 = 1 + 0 = 1, (8.13)
0 · 1 = 1 · 0 = 0, (8.14)
1 · 1 = 1, (8.15)
0 · 0 = 0. (8.16)
A binary linear code C of length n is a vector subspace of F_2^n. Its elements are
called codewords. As a subspace, the code C is closed under linear combination of
its elements, which means that ∀x, y ∈ C and ∀b ∈ {0, 1}, x + y ∈ C and bx ∈ C.
The code has dimension k (and codimension n − k) if it has a basis consisting of
k ≤ n linearly independent vectors. (We say that the code C is an [n, k] code.) A
code is completely described by its basis because all codewords can be obtained
as linear combinations of the basis vectors. Writing the basis vectors as rows of
a k × n matrix G, we obtain its generator matrix.
A linear code can be alternatively described using its (n − k) × n parity-check
matrix, H, whose n − k rows are linearly independent vectors that are orthogonal
to C, or

HG′ = 0,    (8.17)

where the prime denotes transposition. The code can thus also be defined as

C = {c ∈ F_2^n | Hc = 0}.    (8.18)
The generator and parity-check matrices are not unique for the same reason
that a vector subspace does not have a unique basis.
The space F_2^n can be endowed with a measure of distance called the Hamming
distance defined as the number of places where x and y differ,

d_H(x, y) = |{i ∈ {1, . . . , n} | x[i] ≠ y[i]}|.    (8.19)

In the context of steganography, d_H(x, y) is the number of embedding changes.
The Hamming weight of x ∈ F_2^n is w(x) = d_H(x, 0), which is the number of non-
zero elements in x.
The ball of radius r centered at x is the set

B(x, r) = {y ∈ F_2^n | d_H(x, y) ≤ r},    (8.20)

and

V_2(r, n) = 1 + \binom{n}{1} + \binom{n}{2} + · · · + \binom{n}{r} = ∑_{i=0}^{r} \binom{n}{i}    (8.21)

is its volume (the number of vectors in B(x, r)).
is its volume (the number of vectors in B(x, r)).
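Equation (8.21) in code (a small sketch of ours):

```python
from math import comb

def V2(r, n):
    # Volume (8.21) of a Hamming ball of radius r in F_2^n
    return sum(comb(n, i) for i in range(r + 1))

print(V2(1, 7), V2(3, 7))   # 8 and 64
```

Note that V_2(1, 7) = 8 = 2³, so the 16 radius-1 balls centered on the codewords of the [7, 4] Hamming code of Section 8.1 exactly fill the 128-point space, which is why that code covers F_2^7 perfectly.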


One important characteristic of a code that will be relevant for applications
in steganography is the covering radius

R = max_{x∈F_2^n} d_H(x, C),    (8.22)

where d_H(x, C) = min_{c∈C} d_H(x, c) is the distance between x and the code. In
other words, the covering radius is determined by the most distant point x from
the code.
Recalling the matrix embedding method from Section 8.1, it should not be
surprising that we define for each x ∈ F_2^n its syndrome as s = Hx ∈ F_2^{n−k}. For
any syndrome s ∈ F_2^{n−k}, its coset C(s) is

C(s) = {x ∈ F_2^n | Hx = s}.    (8.23)

Note that C(0) = C and C(s1) ∩ C(s2) = ∅ for s1 ≠ s2. The whole space F_2^n can
thus be decomposed into 2^{n−k} cosets, each coset containing exactly 2^k elements,

∪_{s∈F_2^{n−k}} C(s) = F_2^n.    (8.24)

Because C(s) is the set of all solutions to the equation Hx = s, from linear algebra
the coset can be written as C(s) = {x ∈ F_2^n | x = x̃ + c, c ∈ C} = x̃ + C, where x̃
is an arbitrary member of C(s).
A coset leader e(s) is a member of the coset C(s) with the smallest Hamming
weight. It is easy to see that the Hamming weight of any coset leader is at most
R, the covering radius of C. Take x ∈ F_2^n arbitrary and calculate its syndrome
s = Hx. Then,

R = max_{z∈F_2^n} d_H(z, C) ≥ d_H(x, C) = min_{c∈C} w(x − c) = w(e(s))    (8.25)

because when c goes through all codewords, x − c goes through all members
of the coset C(s). Note that the inequality is tight as there exists z such that
R = d_H(z, C).
This result also implies that any syndrome, s ∈ F_2^{n−k}, can be obtained by
adding at most R columns of H. This is because C(s) = {x | Hx = s} and the
weight of a coset leader is the smallest number of columns that need to be
summed to obtain the coset syndrome s. Thus, one method to determine the
covering radius of a linear code is to first form its parity-check matrix and then
find the smallest number of columns of H that can generate any syndrome.
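That procedure is easy to carry out by brute force for short codes. This sketch (ours) finds, for every syndrome, the minimum number of columns of H summing to it, i.e. the weight of a coset leader; the maximum over all syndromes is R:

```python
from itertools import product

def covering_radius(H, n):
    # Weight of the heaviest coset leader: grow the set of reachable
    # syndromes by the weight w of x until every syndrome is reached
    def syn(x):
        return tuple(sum(r * v for r, v in zip(row, x)) % 2 for row in H)
    leader_w = {syn([0] * n): 0}
    w = 0
    while len(leader_w) < 2 ** len(H):
        w += 1
        for x in product((0, 1), repeat=n):
            if sum(x) == w and syn(x) not in leader_w:
                leader_w[syn(x)] = w
    return max(leader_w.values())

H = [[0, 0, 0, 1, 1, 1, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [1, 0, 1, 0, 1, 0, 1]]
print(covering_radius(H, 7))   # 1: every non-zero syndrome is itself a column
```

(The loop assumes H has full row rank, so that every syndrome is reachable.)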

8.3 Matrix embedding theorem

The matrix embedding theorem provides a recipe for how to turn any linear code
into a matrix embedding method. The parameters of the code will determine the
properties of the stegosystem, namely its payload and embedding efficiency. By
making the connection between steganography and coding, we will be able to
view the trick explained in Section 8.1 from a different angle and construct new
useful embedding methods.
We will assume that Alice and Bob use a bit-assignment function π : X →
{0, 1} that assigns a bit to each possible value of the cover element from X . For
example, X = {0, . . . , 255} for 8-bit grayscale images and π could be the LSB
of pixels (DCT coefficients), π(x) = x mod 2. Thus, a group of n pixels can be
represented using a binary vector x ∈ F_2^n. The sole purpose of the embedding
operation is to modify the bit assigned to the cover element, which could be
achieved by flipping the LSB or adding ±1 to the pixel value, etc. At this point,
it is immaterial how the pixels are changed because we measure the embedding
impact only as the number of embedding changes.
Recalling the definition from Section 4.3, the steganographic scheme is a pair
of mappings Emb and Ext,

Emb : F_2^n × M → F_2^n,    (8.26)
Ext : F_2^n → M,    (8.27)

with the property

Ext (Emb(x, m)) = m, ∀x ∈ F_2^n and m ∈ M.    (8.28)

Here, $\mathbf{y} = \mathrm{Emb}(\mathbf{x}, \mathbf{m})$ is the vector of bits extracted from the same block of $n$ pixels in the stego image and $\mathcal{M}$ is the set of all messages that can be communicated. (The embedding capacity is $\log_2|\mathcal{M}|$ bits.) Note that here we ignore the role of the stego key as it is, again, not important for our reasoning.
Let us suppose that it is possible to embed every message in $\mathcal{M}$ using at most $R$ changes,
$$d_H(\mathbf{x}, \mathrm{Emb}(\mathbf{x}, \mathbf{m})) \le R, \quad \forall \mathbf{x} \in \mathbb{F}_2^n, \mathbf{m} \in \mathcal{M}. \qquad (8.29)$$

Because the embedding distortion is measured as the number of embedding changes, the embedding efficiency (in bits per embedding change) is
$$e = \frac{\log_2 |\mathcal{M}|}{R_a}, \qquad (8.30)$$
where $R_a$ is the average number of embedding changes over all messages and covers,
$$R_a = E\left[d_H(\mathbf{x}, \mathrm{Emb}(\mathbf{x}, \mathbf{m}))\right]. \qquad (8.31)$$

We also define the lower embedding efficiency
$$\underline{e} = \frac{\log_2 |\mathcal{M}|}{R}. \qquad (8.32)$$
Obviously, since $R_a \le R$, we have $\underline{e} \le e$.
Having established the terminology and notation, we are now ready to formulate and prove the matrix embedding theorem.

Theorem 8.2. [Matrix embedding theorem] An $[n, k]$ code with covering radius $R$ can be used to construct a steganographic scheme that can embed $n-k$ bits in $n$ pixels by making at most $R$ embedding changes. The embedding efficiency is $e = (n-k)/R_a$, where $R_a = (1/2^n)\sum_{\mathbf{x}\in\mathbb{F}_2^n} d_H(\mathbf{x}, \mathcal{C})$ is the average distance to code.

Proof. Let $\mathbf{H}$ be a parity-check matrix of the code, $\mathbf{x} \in \mathbb{F}_2^n$ the vector of LSBs of $n$ pixels, and $\mathbf{m} \in \mathbb{F}_2^{n-k}$ the vector of message bits. Define the embedding mapping as
$$\mathbf{y} = \mathrm{Emb}(\mathbf{x}, \mathbf{m}) \triangleq \mathbf{x} + \mathbf{e}(\mathbf{m} - \mathbf{Hx}), \qquad (8.33)$$
where we remind the reader that $\mathbf{e}(\mathbf{m} - \mathbf{Hx})$ is a coset leader of the coset corresponding to the syndrome $\mathbf{m} - \mathbf{Hx}$. The corresponding extraction algorithm is
$$\mathbf{m} = \mathrm{Ext}(\mathbf{y}) = \mathbf{Hy}. \qquad (8.34)$$

In other words, the recipient reads the message by extracting the LSBs from the stego image, $\mathbf{y}$, and multiplying them by the parity-check matrix,
$$\mathbf{Hy} = \mathbf{Hx} + \mathbf{He}(\mathbf{m} - \mathbf{Hx}) = \mathbf{Hx} + \mathbf{m} - \mathbf{Hx} = \mathbf{m}. \qquad (8.35)$$

Because the message is being communicated as a syndrome of some linear code, steganography using matrix embedding is sometimes called syndrome coding. From (8.33), $d_H(\mathbf{x}, \mathbf{y}) = w(\mathbf{e}(\mathbf{m} - \mathbf{Hx}))$ and thus the number of embedding changes is at most $R$ because the weight of every coset leader is at most $R$ (8.25). If the message is a random bit stream, then all coset leaders are equally likely to participate in the embedding. Thus, the expected number of embedding changes is equal to the expected weight of a coset leader,
$$\frac{1}{2^{n-k}} \sum_{\mathbf{s}\in\mathbb{F}_2^{n-k}} w(\mathbf{e}(\mathbf{s})) = \frac{2^k}{2^n} \sum_{\mathbf{s}\in\mathbb{F}_2^{n-k}} w(\mathbf{e}(\mathbf{s})) \qquad (8.36)$$
$$= \frac{1}{2^n} \sum_{\mathbf{s}\in\mathbb{F}_2^{n-k}} \sum_{\mathbf{x}\in\mathcal{C}(\mathbf{s})} d_H(\mathbf{x}, \mathcal{C}) = \frac{1}{2^n} \sum_{\mathbf{x}\in\mathbb{F}_2^n} d_H(\mathbf{x}, \mathcal{C}), \qquad (8.37)$$
which is the average distance to code, $R_a$. In other words, we have just proved that $e = (n-k)/R_a$. The second equality follows from (8.25) expressing the fact that all coset members $\mathbf{x} \in \mathcal{C}(\mathbf{s})$ have the same distance to $\mathcal{C}$: $d_H(\mathbf{x}, \mathcal{C}) = w(\mathbf{e}(\mathbf{s}))$,

equal to the weight of a coset leader of C(s). The fact that cosets partition the
whole space implies the third and final equality.

8.3.1 Revisiting binary Hamming codes


We now revisit the example from Section 8.1 through the matrix embedding theorem. The embedding method used a parity-check matrix $\mathbf{H}$ whose columns were all non-zero binary $p$-tuples. (The matrix had $2^p - 1$ columns.) The length of the linear code was $n = 2^p - 1$ and its codimension $p$. These are binary Hamming codes, $\mathcal{H}_p$. Because the parity-check matrix contains all non-zero binary vectors of length $p$, any syndrome can be written as a trivial "sum" of one column from $\mathbf{H}$. Thus, the covering radius of the Hamming codes is $R = 1$.
The average distance to code, $R_a$, can be easily obtained by realizing that Hamming codes are perfect (see Appendix C) and the whole space $\mathbb{F}_2^n$ can be written as a union of balls with unit radius centered at the codewords,
$$\bigcup_{\mathbf{c}\in\mathcal{H}_p} B(\mathbf{c}, 1) = \mathbb{F}_2^n. \qquad (8.38)$$
Because the volume of each ball is exactly $1 + n = 2^p$ and all words with the exception of the center codeword have distance 1 from the code, the average distance to code is $R_a = n/(n+1) = 1 - 2^{-p}$. Thus, the embedding efficiency of matrix embedding based on binary Hamming code $\mathcal{H}_p$ is $e_p = p/(1 - 2^{-p})$, in agreement with the result obtained in Section 8.1.

8.4 Theoretical bounds

The matrix embedding theorem tells us that linear codes can be used to construct
steganographic schemes that impose fewer embedding changes than the simple
paradigm of embedding one bit at every pixel. This is quite significant because
it gives us the ability to communicate longer messages for a fixed distortion
budget. It would be valuable to know the limits of this approach and determine
the performance (embedding efficiency) of the best possible matrix embedding
method and then attempt to reach this limit.

8.4.1 Bound on embedding efficiency for codes of fixed length


We now determine the largest possible embedding efficiency achievable using linear codes of fixed length $n$. To this end, we consider the embedding efficiency for codes capable of embedding relative payload $\alpha$, or the class of $[n, n(1-\alpha)]$ codes. An upper bound on embedding efficiency requires a lower bound on $R$ and $R_a$. Because there are $\binom{n}{i}$ possible sums of $i$ columns of the parity-check matrix $\mathbf{H}$, the number of cosets whose coset leaders have weight $i$ is at most $\binom{n}{i}$. Thus,

the covering radius $R$ must be at least equal to $R_n$ for which
$$\binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{R_n - 1} + \xi\binom{n}{R_n} = 2^{\alpha n}, \qquad (8.39)$$
where $0 < \xi \le 1$ is a real number. Note here that every $[n, n(1-\alpha)]$ code has $2^{\alpha n}$ cosets.
Besides the lower bound on the covering radius, $R \ge R_n$, we obtain a lower bound for the average distance to code $R_a$,
$$R_a \ge \frac{\sum_{i=1}^{R_n - 1} i\binom{n}{i} + R_n\xi\binom{n}{R_n}}{2^{\alpha n}}, \qquad (8.40)$$
and an upper bound on embedding efficiency
$$e = \frac{\alpha n}{R_a} \le \frac{\alpha n\, 2^{\alpha n}}{\sum_{i=1}^{R_n - 1} i\binom{n}{i} + R_n\xi\binom{n}{R_n}}. \qquad (8.41)$$

We remark that codes for which the number of coset leaders with Hamming weight $i$ is exactly $\binom{n}{i}$ and $\xi = 1$ have the largest possible embedding efficiency within the class of codes of length $n$. Perfect codes have this property (see Appendix C). There are only four perfect codes: the trivial repetition code (see Exercises 8.1 and 8.6), Hamming codes, the binary Golay code, and the ternary Golay code [248].
The bounds derived above can be summarized in a proposition.

Proposition 8.3. [Bound on embedding efficiency, fixed code length] Any matrix embedding scheme realized using a linear code of length $n$ capable of embedding relative payload $\alpha$ must occasionally make at least $R_n$ (8.39) embedding changes. The embedding efficiency is at most (8.41).
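The quantities $R_n$, $\xi$, and the bound (8.41) are straightforward to evaluate numerically. The sketch below (our helper; the payload is passed as an integer number of message bits $m = \alpha n$ so that $2^{\alpha n}$ is exact) solves (8.39) and returns the bound. For $n = 7$ and a 3-bit payload it reproduces $e_3 = 24/7$ of the perfect Hamming code $\mathcal{H}_3$, which meets the bound:

```python
from math import comb

def efficiency_bound_fixed_n(n, m_bits):
    # Solve (8.39): sum_{i < R_n} C(n, i) + xi * C(n, R_n) = 2^{m_bits}
    # for R_n and xi, then evaluate the upper bound (8.41).
    cosets = 2 ** m_bits          # an [n, n - m_bits] code has 2^{m_bits} cosets
    total, R = 0, 0
    while total + comb(n, R) < cosets:
        total += comb(n, R)
        R += 1
    xi = (cosets - total) / comb(n, R)
    denom = sum(i * comb(n, i) for i in range(1, R)) + R * xi * comb(n, R)
    return m_bits * cosets / denom, R
```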

8.4.2 Bound on embedding efficiency for codes of increasing length


The second bound we derive will tell us the maximal embedding efficiency one can achieve using codes of arbitrary length for a fixed relative payload. We start by fixing the change rate $\beta = R/n$ and determine the largest relative payload $\alpha$ that one can embed by making at most $\beta n$ changes as $n \to \infty$. We can limit ourselves to $\beta < \frac{1}{2}$ because we can always add the all-ones codeword to the code and thus make the code contain the repetition code, whose covering radius is at most $n/2 \ge R$ (Exercise 8.1).
There are $\binom{n}{i}$ possible ways one can make $i$ changes in $n$ pixels. Thus, using at most $R = \beta n$ changes, we can embed at most

$$|\mathcal{M}| \le \sum_{i=0}^{\beta n}\binom{n}{i} = V_2(\beta n, n) \qquad (8.42)$$

Figure 8.2 The binary entropy function, $H(x) = -x\log_2 x - (1-x)\log_2(1-x)$.

messages if it is possible to arrange that every change communicates a different message. We now determine the asymptotic behavior of this sum. To obtain some insight, we will first use Stirling's formula
$$n! = \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\left(1 + O\left(\frac{1}{n}\right)\right) \qquad (8.43)$$
for the last (and the largest) term in the sum
$$\binom{n}{\beta n} = \frac{n!}{(n - \beta n)!(\beta n)!} \qquad (8.44)$$
$$= \frac{\sqrt{2\pi n}(n/e)^n}{\sqrt{2\pi(n - \beta n)}\left((n - \beta n)/e\right)^{n - \beta n}\sqrt{2\pi\beta n}(\beta n/e)^{\beta n}} \times \frac{1 + O(1/n)}{\left(1 + O(1/\beta n)\right)\left(1 + O(1/(n - \beta n))\right)} \qquad (8.45)$$
$$= \frac{(1-\beta)^{-n(1-\beta)}\beta^{-n\beta}}{\sqrt{2\pi n\beta(1-\beta)}} \cdot \frac{1 + O(1/n)}{\left(1 + O(1/\beta n)\right)\left(1 + O(1/(n - \beta n))\right)} \qquad (8.46)$$
$$= c(n)\,2^{-n[(1-\beta)\log_2(1-\beta) + \beta\log_2\beta]}, \qquad (8.47)$$
where we denoted
$$c(n) = \frac{1}{\sqrt{2\pi n\beta(1-\beta)}} \cdot \frac{1 + O(1/n)}{\left(1 + O(1/\beta n)\right)\left(1 + O(1/(n - \beta n))\right)}. \qquad (8.48)$$
Using the binary entropy function, $H(x) = -x\log_2 x - (1-x)\log_2(1-x)$, this result can be rewritten as
$$\binom{n}{\beta n} = c(n)\,2^{nH(\beta)}. \qquad (8.49)$$

By taking the logarithm,
$$\log_2\binom{n}{\beta n} = \log_2 c(n) + nH(\beta), \qquad (8.50)$$
$$\frac{\log_2\binom{n}{\beta n}}{nH(\beta)} = 1 + \frac{\log_2 c(n)}{nH(\beta)}. \qquad (8.51)$$
Because
$$\frac{\log_2 c(n)}{nH(\beta)} = O\left(\frac{\log n}{n}\right), \qquad (8.52)$$
we finally obtain
$$\lim_{n\to\infty}\frac{\log_2\binom{n}{\beta n}}{nH(\beta)} = 1. \qquad (8.53)$$
We have now proved that the largest term in $V_2(\beta n, n)$ behaves asymptotically as $2^{nH(\beta)}$, or $\log_2 V_2(\beta n, n) \gtrsim nH(\beta)$. We now derive an asymptotic upper bound, which will enable us to state that $\log_2 V_2(\beta n, n) \approx nH(\beta)$, where $f(n) \approx g(n)$ if $\lim_{n\to\infty} f(n)/g(n) = 1$. For this, we will need the tail inequality.

Lemma 8.4. [Tail inequality] For any $\beta \le \frac{1}{2}$ and $n > 0$,
$$nH(\beta) \ge \log_2 V_2(\beta n, n). \qquad (8.54)$$

Proof.
$$1 = (\beta + 1 - \beta)^n = \sum_{i=0}^{n}\binom{n}{i}\beta^i(1-\beta)^{n-i} \qquad (8.55)$$
$$= (1-\beta)^n\sum_{i=0}^{n}\binom{n}{i}\left(\frac{\beta}{1-\beta}\right)^i \ge (1-\beta)^n\sum_{i=0}^{\beta n}\binom{n}{i}\left(\frac{\beta}{1-\beta}\right)^i \qquad (8.56)$$
$$\ge (1-\beta)^n\left(\frac{\beta}{1-\beta}\right)^{\beta n}\sum_{i=0}^{\beta n}\binom{n}{i} = (1-\beta)^{n(1-\beta)}\beta^{\beta n}\sum_{i=0}^{\beta n}\binom{n}{i} \qquad (8.57)$$
$$= 2^{n[(1-\beta)\log_2(1-\beta) + \beta\log_2\beta]}\, V_2(\beta n, n), \qquad (8.58)$$
which is the tail inequality. The second inequality holds because $\beta/(1-\beta) \le 1$ for $\beta \le \frac{1}{2}$.

Thus, using the tail inequality,
$$\log_2\binom{n}{\beta n} \le \log_2 V_2(\beta n, n) \le nH(\beta) \qquad (8.59)$$
and, because of (8.53), we have
$$\lim_{n\to\infty}\frac{\log_2 V_2(\beta n, n)}{nH(\beta)} = 1. \qquad (8.60)$$

Therefore, the maximal number of bits one can embed using at most $R$ changes is
$$m_{\max} \le nH\left(\frac{R}{n}\right). \qquad (8.61)$$
We can rewrite this to obtain a bound on the lower embedding efficiency, $\underline{e}$, for any relative payload $\alpha$ that can be embedded:
$$\alpha \le \frac{m_{\max}}{n} \le H\left(\frac{R}{n}\right), \qquad (8.62)$$
$$H^{-1}(\alpha) \le \frac{R}{n}, \qquad (8.63)$$
$$\frac{n}{R} \le \frac{1}{H^{-1}(\alpha)}, \qquad (8.64)$$
$$\underline{e} = \frac{\alpha n}{R} \le \frac{\alpha}{H^{-1}(\alpha)}. \qquad (8.65)$$
The bound (8.65) on the lower embedding efficiency is also an asymptotic bound on the embedding efficiency $e$,
$$e \le \frac{\alpha}{H^{-1}(\alpha)}. \qquad (8.66)$$
This is because the relative average distance to code, $R_a/n$, which determines the embedding efficiency, and the relative covering radius, $R/n$, for which we have a bound, are asymptotically identical as $n \to \infty$ [81].
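Since $H^{-1}$ has no closed form, the bound (8.66) is typically evaluated numerically. A minimal sketch (our helper names; the inverse is found by bisection on $[0, \frac{1}{2}]$, where $H$ is strictly increasing):

```python
from math import log2

def H(x):
    # Binary entropy function in bits; H(0) = H(1) = 0 by continuity.
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def H_inv(alpha):
    # Inverse of H on [0, 1/2] by bisection.
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if H(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def efficiency_bound(alpha):
    # Asymptotic upper bound (8.66): e <= alpha / H^{-1}(alpha).
    return alpha / H_inv(alpha)
```

For the maximal binary payload $\alpha = 1$ the bound gives $e \le 1/H^{-1}(1) = 2$, matching the figure's statement that 2 is the maximal efficiency of binary codes at this relative message length.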
Knowing the performance bounds, a practical problem for the steganographer
is finding codes that could reach the theoretically optimal embedding efficiency
with low computational complexity. Figure 8.3 shows the bound on embedding
efficiency (8.66) and the performance of selected codes as a function of 1/α, where
α is the relative payload. We can see that, although the Hamming codes have
substantially higher embedding efficiency than the trivial paradigm of embedding
one message bit at one pixel with e = 2, they are still far from the theoretical
upper bound, indicating a space for improvement. A slightly better embedding
efficiency can be obtained using BCH codes [208] and certain classes of non-linear
codes [20].
An important result, which is beyond the scope of this book, states that the bound (8.66) is tight and asymptotically achievable using linear codes. In particular, the relative codimension $(n-k)/n$ of almost all random $[n,k]$ codes asymptotically achieves $H(R/n)$ for a fixed change rate $R/n < \frac{1}{2}$ and $n \to \infty$ (see, e.g., Theorem 12.3.5 in [46]). Thus, there exist embedding schemes based on linear codes whose embedding efficiency is asymptotically optimal.
Therefore, at least theoretically it should be possible to construct good matrix
embedding schemes from random codes. However, while for structured codes,
such as the Hamming codes, finding a coset leader was particularly simple, for

Figure 8.3 Embedding efficiency of binary Hamming codes (crosses) and the bound on embedding efficiency (8.66), plotted against $\alpha^{-1}$. The stars mark the embedding efficiency of sparse linear codes of length $n = 1{,}000$ and $n = 10{,}000$ as reported in [68].

general random codes it is an NP-complete problem whose complexity increases


exponentially with n. In the special case of embedding large relative payloads,
we can afford to keep the number of codewords small and find coset leaders
using brute force (see Section 8.5). Another possibility is to use sparse linear
codes where efficient algorithms based on message-passing exist [68]. In fact, such
codes were shown to achieve embedding efficiency very close to the theoretical
bound (8.66) (see Figure 8.3).

8.5 Matrix embedding for large relative payloads

The positive impact of matrix embedding on security is bigger when embedding large payloads because small payloads are less statistically detectable already. Unfortunately, matrix embedding using binary Hamming codes as introduced in Section 8.1 does not bring any improvement for payloads $\alpha > \frac{2}{3}$. In this section, we describe matrix embedding methods suitable when the relative payload is large, or $\alpha \approx 1$. The emphasis here is on the adjective "relative" because such payloads can in fact be quite small in absolute terms when compared with the embedding capacity. Take, for example, adaptive steganography, where the sender constrains herself to a certain subset of the image right from the beginning to avoid introducing detectable artifacts, or the perturbed-quantization steganography described in Chapter 9.
Recall that in matrix embedding, the stego vector of bits y = x + e(m − Hx)
is obtained by finding a coset leader for the syndrome m − Hx. And finding the
coset leader is the most time-consuming part of the embedding algorithm. For

large relative payloads $\alpha = (n-k)/n \approx 1$, the codes in matrix embedding $[n, n - \alpha n]$ have small dimension $k = n(1-\alpha)$, and thus a small number of codewords. This observation can be used to our advantage.
The method in this section will be based on random codes to show that even randomly generated codes can provide good performance, which is, after all, not that surprising because in the previous section we learned that random codes asymptotically reach the bound on embedding efficiency (8.66). Moreover, due to the small dimension, $k$, coset leaders can be found efficiently using look-up tables.
We assume that the sender wants to embed $n-k$ bits, $\mathbf{m} \in \{0,1\}^{n-k}$, in $n$ pixels with bits $\mathbf{x} \in \{0,1\}^n$. The code parity-check matrix $\mathbf{H}$ is generated randomly (from a shared stego key) in its systematic form $\mathbf{H} = [\mathbf{I}_{n-k}, \mathbf{A}]$, where $\mathbf{I}_{n-k}$ is an $(n-k)\times(n-k)$ identity matrix and $\mathbf{A}$ is a random $(n-k)\times k$ matrix. The generator matrix is thus $\mathbf{G} = [\mathbf{A}^\top, \mathbf{I}_k]$ (see Appendix C).
The algorithm has two steps:

1. Find any coset member $\tilde{\mathbf{y}} \in \mathcal{C}(\mathbf{m} - \mathbf{Hx})$ as a solution of $\mathbf{H}\tilde{\mathbf{y}} = \mathbf{m} - \mathbf{Hx}$. Because $\mathbf{H}$ is in its systematic form, we can take, for example, $\tilde{\mathbf{y}} = (\mathbf{m} - \mathbf{Hx}, \mathbf{0})$, where $\mathbf{0} \in \{0,1\}^k$.
2. Find the coset leader $\mathbf{e}(\mathbf{m} - \mathbf{Hx})$. To do so, we need to find the codeword $\tilde{\mathbf{c}}$ closest to $\tilde{\mathbf{y}}$,
$$d_H(\tilde{\mathbf{y}}, \tilde{\mathbf{c}}) = \min_{\mathbf{c}\in\mathcal{C}} d_H(\tilde{\mathbf{y}}, \mathbf{c}) = \min_{\mathbf{c}\in\mathcal{C}} w(\tilde{\mathbf{y}} - \mathbf{c}) = w(\mathbf{e}(\mathbf{m} - \mathbf{Hx})), \qquad (8.67)$$
because the vector $\tilde{\mathbf{y}} - \mathbf{c}$ goes through all members of the coset. Thus, $\mathbf{e}(\mathbf{m} - \mathbf{Hx}) = \tilde{\mathbf{y}} - \tilde{\mathbf{c}}$, which completes the description of the embedding mapping Emb via the matrix embedding theorem. If the dimension of the code is small, the closest codeword can be quickly found by brute force or using precalculated look-up tables.

The relative embedding payload of these random $[n,k]$ codes is
$$\alpha_{k,n} = \frac{n-k}{n} = 1 - \frac{k}{n}. \qquad (8.68)$$
For a fixed code dimension $k$, $\alpha_{k,n} \to 1$ as $n \to \infty$. Figure 8.4 shows the embedding efficiency of random codes of small dimension as a function of relative payload $\alpha$. The points were obtained by averaging the embedding efficiency over 200 randomly generated codes. Note that the embedding efficiency is quite close to the bound (8.41) for codes of the same length $n$, indicating that, within the class of linear codes of fixed length, random codes offer embedding efficiency close to the optimal value.
The reader is referred to [96] for examples of structured codes suitable for embedding large payloads. A simple ternary code that achieves very good performance for embedding large payloads is in Exercise 8.6 (and Figure 8.8). An alternative approach for efficient embedding of large payloads, which is not based

Figure 8.4 Embedding efficiency versus relative payload $\alpha$ for random codes of dimension $k = 10$ and $14$. Also shown are the theoretical upper bound for codes of arbitrary length (8.66) and the bounds for codes of fixed length (8.41).

on syndrome coding, uses so-called Sum and Difference Covering Sets (SDCSs)
of finite cyclic groups [158, 160, 174] (also see Section 8.7).

8.6 Steganography using q-ary symbols

All matrix embedding methods introduced in the previous sections used binary
codes. The reason for this was that each cover element was represented with a
bit through some bit-assignment function, π, such as its LSB. Fundamentally,
there is no reason why we could not use q-ary representation for q > 2. Consider,
for example, the ±1 embedding where it is allowed to change a pixel value by
±1. Each pixel can thus accept one of three possible values. By using a function
that assigns a ternary symbol at each pixel, for example, π(x) = x mod 3, one
ternary symbol can be embedded at each pixel rather than just one bit. Because
one ternary symbol conveys log2 3 bits of information, the number of embedding
changes would be decreased. Let us calculate the embedding efficiency of this
ternary ±1 embedding to see the improvement.
We already know that the embedding efficiency of classical (binary) ±1 embedding is 2 because the probability that a pixel needs to be modified is $\frac{1}{2}$. Alternatively, two bits are embedded using one embedding change. When embedding a random stream of ternary symbols, the probability that a pixel will not have

to be modified is $\frac{1}{3}$. In $\frac{2}{3}$ of the cases, the pixel value is modified by ±1. Thus, the average number of embedding changes per pixel is $\frac{2}{3}$ and the payload per pixel is $\log_2 3$ bits, which leads to embedding efficiency $\log_2 3/(2/3) = 2.3774\ldots$ bits per change. Thus, ternary ±1 embedding has higher embedding efficiency than binary ±1 embedding. Because the magnitude of embedding changes is the same in both cases, there is no reason not to use ternary ±1 embedding over the binary embedding.
If ternary is better than binary, it is tempting to increase the magnitude of
embedding changes even further and embed more bits per embedding change.
For example, by allowing changes by up to ±2, we would embed log2 5 bits
per pixel and the embedding efficiency log2 5/(4/5) would be even higher than
for the ternary case. However, embedding changes of higher magnitude are also
statistically more detectable than embedding changes of lower magnitude. Thus,
it is not immediately clear that using pentary embedding would indeed lead
to less detectable steganographic schemes. We revisit this intriguing topic in
Section 8.6.3 after deriving the bound on embedding efficiency of q-ary codes.
As in the binary case, it is possible to construct matrix embedding schemes
for q-ary linear codes and further increase the embedding efficiency. The only
change is that we will now work with vectors of n q-ary symbols extracted from
the cover/stego image rather than binary vectors. This translates to q-ary linear
codes where the binary arithmetic is replaced with arithmetic in a finite field
Fq (see Appendix C). Because the matrix embedding theorem can be proved in
exactly the same manner, we provide only the formulation without the proof.

Theorem 8.5. [Matrix embedding theorem for q-ary codes] An $[n,k]$ $q$-ary code with covering radius $R$ can be used to construct a steganographic scheme that can embed $n-k$ $q$-ary symbols in $n$ pixels by making at most $R$ embedding changes. The embedding efficiency is $e = [(n-k)\log_2 q]/R_a$ bits per change, where $R_a = (1/q^n)\sum_{\mathbf{x}\in\mathbb{F}_q^n} d(\mathbf{x}, \mathcal{C})$ is the average distance to code.

8.6.1 q-ary Hamming codes


The binary Hamming codes from Sections 8.1 and 8.3.1 are an attractive choice for matrix embedding due to their simple implementation. In this section, we describe their generalized version that works over a finite field $\mathbb{F}_q$.
A $q$-ary Hamming code is defined using its parity-check matrix $\mathbf{H}$, which will now contain all different non-zero vectors $\mathbf{x} \in \mathbb{F}_q^p$ of length $p$ (different up to multiplication by a non-zero element of the field). For example, the parity-check matrix for the ternary Hamming code with $p = 3$ is
$$\mathbf{H} = \begin{pmatrix} 1&0&0&0&1&1&1&0&2&1&2&1&1\\ 0&1&0&1&0&1&1&1&0&2&1&2&1\\ 0&0&1&1&1&0&1&2&1&0&1&1&2 \end{pmatrix}. \qquad (8.69)$$

Thus, this is a $[13, 10]$ code with covering radius $R = 1$. The radius is 1 because we need only a linear combination of one column to obtain any syndrome. For example, the syndrome $(2,2,2)^\top$, which does not appear directly as a column in $\mathbf{H}$, can be obtained as a multiple of just one column, the seventh column.
In general, a $q$-ary Hamming code is a $\left[\frac{q^p-1}{q-1}, \frac{q^p-1}{q-1} - p\right]$ code with covering radius $R = 1$. The length is $n = (q^p - 1)/(q - 1)$ because there are $q^p - 1$ non-zero vectors of $q$-ary symbols of length $p$ and each non-zero column appears in $q - 1$ versions (there are $q - 1$ non-zero multiples). Of course, the codimension is $n - k = p$.
Hamming codes can be used to construct steganographic schemes capable of embedding $p$ $q$-ary symbols, or $p\log_2 q$ bits, in $(q^p - 1)/(q - 1)$ pixels by making on average $1 - q^{-p}$ changes (because the probability that the syndrome of the cover, $\mathbf{Hx} \in \mathbb{F}_q^p$, already matches the $p$ message symbols is $1/q^p$). Thus, we have the corresponding relative payload $\alpha_p$ (in bits per pixel) and embedding efficiency $e_p$ (in bits per change)
$$\alpha_p = \frac{p\log_2 q}{(q^p - 1)/(q - 1)}, \qquad (8.70)$$
$$e_p = \frac{p\log_2 q}{1 - q^{-p}}. \qquad (8.71)$$

The embedding efficiency of ternary Hamming codes is shown in Figure 8.5.


Note that ternary Hamming codes offer larger embedding efficiency than binary
Hamming codes.

Example 8.6: [±1 embedding using ternary Hamming codes with codimension p] Assuming the parity-check matrix (8.69), let $\mathbf{g} = (14, 13, 13, 12, 12, 16, 18, 20, 19, 21, 23, 22, 24)$ be the vector of grayscale values from a cover-image block of 13 pixels. The corresponding vector of ternary symbols is $\mathbf{x} = \mathbf{g} \bmod 3 = (2,1,1,0,0,1,0,2,1,0,2,1,0)$, which has the following syndrome:
$$\mathbf{Hx} = \begin{pmatrix} 2+1+2+1+1 \\ 1+1+2+2+2 \\ 1+1+1+2+1 \end{pmatrix} \bmod 3 = \begin{pmatrix} 1 \\ 2 \\ 0 \end{pmatrix}. \qquad (8.72)$$

Let us say that we wish to embed a ternary message $\mathbf{m} = (0, 1, 2)^\top$. We now need to find the element of the vector $\mathbf{x}$ that needs to be changed so that its syndrome is the desired message. Proceeding as in the matrix embedding theorem, we

Figure 8.5 Embedding efficiency $e$ as a function of $\alpha^{-1}$, where $\alpha$ is the relative message length. The graph shows the bounds on embedding efficiency for binary (8.66) ($q = 2$, solid line), ternary ($q = 3$, dashed line), and pentary ($q = 5$, dotted line) codes (8.84), as well as the efficiency for binary (+) and ternary (*) Hamming codes.

calculate (all arithmetic operations are modulo 3)
$$\mathbf{m} - \mathbf{Hx} = \begin{pmatrix} 0\\1\\2 \end{pmatrix} - \begin{pmatrix} 1\\2\\0 \end{pmatrix} = \begin{pmatrix} 2\\2\\2 \end{pmatrix} = 2\,\mathbf{H}[.,7], \qquad (8.73)$$

where $\mathbf{H}[.,7]$ is the seventh column of $\mathbf{H}$. Thus, $\mathbf{m} = \mathbf{Hx} + 2\mathbf{H}[.,7] = \mathbf{Hy}$, where $\mathbf{y}[i] = \mathbf{x}[i]$ for $i \ne 7$, and $\mathbf{y}[7] = \mathbf{x}[7] + 2 = 2$. Because $\mathbf{g}[7] = 18$, which is a ternary 0, we need to change $\mathbf{g}[7]$ from 18 to 17 because 17 is ternary 2 ($17 \bmod 3 = 2$). Note that changing $\mathbf{g}[7]$ to 20 would work as well, however at the expense of making an embedding change of magnitude 2, and thus this change would not be compatible with ±1 embedding. To summarize, in order to embed the ternary message $(0, 1, 2)^\top$, the stego image pixel values will thus be $(14, 13, 13, 12, 12, 16, 17, 20, 19, 21, 23, 22, 24)$.
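The example can be verified programmatically. The sketch below (our code, using the parity-check matrix (8.69)) searches for the column of $\mathbf{H}$ and the multiplier $t \in \{1, 2\}$ whose product equals the required syndrome, and then makes the corresponding ±1 change to the pixel:

```python
import numpy as np

# Parity-check matrix (8.69) of the [13, 10] ternary Hamming code.
H = np.array([[1, 0, 0, 0, 1, 1, 1, 0, 2, 1, 2, 1, 1],
              [0, 1, 0, 1, 0, 1, 1, 1, 0, 2, 1, 2, 1],
              [0, 0, 1, 1, 1, 0, 1, 2, 1, 0, 1, 1, 2]])

def embed_ternary(g, m):
    # Embed a ternary message m (length 3) into 13 grayscale values g
    # by modifying at most one pixel by +-1.
    x = np.array(g) % 3
    s = (m - H @ x) % 3                  # syndrome that must be produced
    y = np.array(g)
    if s.any():
        # Find column j and multiplier t with t * H[:, j] = s (mod 3).
        for j in range(13):
            for t in (1, 2):
                if np.array_equal((t * H[:, j]) % 3, s):
                    # Change pixel j by +-1 so its ternary symbol is x[j] + t.
                    target = (x[j] + t) % 3
                    y[j] += 1 if (y[j] + 1) % 3 == target else -1
                    return y
    return y

g = [14, 13, 13, 12, 12, 16, 18, 20, 19, 21, 23, 22, 24]
m = np.array([0, 1, 2])
y = embed_ternary(g, m)
# y differs from g only in the seventh pixel: 18 -> 17
```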

8.6.2 Performance bounds for q-ary codes


Mimicking the flow of Section 8.4.1, we first determine the largest embedding efficiency of $q$-ary $[n,k]$ codes for fixed $n$ and $k$. Because there are $\binom{n}{i}(q-1)^i$ possible linear combinations of $i$ columns of the parity-check matrix $\mathbf{H}$, the number of cosets whose coset leaders have weight $i$ is at most $\binom{n}{i}(q-1)^i$. Thus,




the covering radius $R$ must be at least equal to $R_n$ for which
$$\binom{n}{0} + \binom{n}{1}(q-1) + \cdots + \binom{n}{R_n - 1}(q-1)^{R_n - 1} + \xi\binom{n}{R_n}(q-1)^{R_n} = q^{n-k}, \qquad (8.74)$$
where $0 < \xi \le 1$ is a real number. Note that an $[n,k]$ $q$-ary code has $q^{n-k}$ cosets.
This gives us a lower bound on the covering radius, $R \ge R_n$, as well as a lower bound for the average distance to code $R_a$ (which is the average weight of a coset leader),
$$R_a \ge \frac{\sum_{i=1}^{R_n - 1} i\binom{n}{i}(q-1)^i + R_n\xi\binom{n}{R_n}(q-1)^{R_n}}{q^{n-k}}. \qquad (8.75)$$
A lower bound on $R_a$ again provides an upper bound on embedding efficiency,
$$e = \frac{(n-k)\log_2 q}{R_a} \le \frac{(n-k)\,q^{n-k}\log_2 q}{\sum_{i=1}^{R_n - 1} i\binom{n}{i}(q-1)^i + R_n\xi\binom{n}{R_n}(q-1)^{R_n}}. \qquad (8.76)$$

This bound is reached by codes for which the number of coset leaders with Hamming weight $i$ is exactly $\binom{n}{i}(q-1)^i$ and $\xi = 1$. The only such codes are perfect codes and they enjoy the largest possible embedding efficiency within the class of linear codes of length $n$.
Next, we derive a bound for codes of arbitrary length capable of communicating a fixed relative payload. The maximal number of messages $|\mathcal{M}|$ that can be communicated with change rate $\beta = R/n$ is bounded from above by the number of ways one can make up to $\beta n$ changes in $n$ pixels,
$$|\mathcal{M}| \le V_q(\beta n, n) = \sum_{i=0}^{\beta n}\binom{n}{i}(q-1)^i. \qquad (8.77)$$

The asymptotic behavior of the sum is again determined by the largest, last term:
$$\log_2 V_q(\beta n, n) \ge \log_2\binom{n}{\beta n} + \beta n\log_2(q-1) \qquad (8.78)$$
$$\approx n[H(\beta) + \beta\log_2(q-1)] = nH_q(\beta), \qquad (8.79)$$
where we used the asymptotic expression (8.53) for $\log_2\binom{n}{\beta n}$ and
$$H_q(x) = -x\log_2 x - (1-x)\log_2(1-x) + x\log_2(q-1) \qquad (8.80)$$
is the $q$-ary entropy function, shown in Figure 8.6 for $q = 3$.
The inequality (8.79), together with the equivalent of the tail inequality (see Exercise 8.3)
$$nH_q(\beta) \ge \log_2 V_q(\beta n, n), \qquad (8.81)$$
proves that the relative payload that can be embedded using change rate $\beta$ is bounded by
$$\alpha \le \frac{\log_2|\mathcal{M}|}{n} \approx H_q(\beta). \qquad (8.82)$$

Figure 8.6 The ternary entropy function, $H_3(x) = -x\log_2 x - (1-x)\log_2(1-x) + x$.

From here, we obtain the bound on the lower embedding efficiency
$$\underline{e} = \frac{\alpha}{\beta} \le \frac{\alpha}{H_q^{-1}(\alpha)}, \qquad (8.83)$$
which is also a bound on the embedding efficiency,
$$e \le \frac{\alpha}{H_q^{-1}(\alpha)}. \qquad (8.84)$$

A matrix embedding scheme with the maximal embedding efficiency can thus embed relative payload $\alpha$ bpp by making on average $H_q^{-1}(\alpha)$ changes per pixel. Figure 8.5 shows the upper bound (8.84) on embedding efficiency for $q = 2, 3, 5$ as a function of $\alpha^{-1}$, where $\alpha$ is the relative payload in bpp. Note that the bounds start at the point $\alpha = \log_2 q$, $e = [q/(q-1)]\log_2 q$, which corresponds to embedding at the largest relative payload of $\log_2 q$ bpp. The same figure shows the benefit of using $q$-ary codes for a fixed relative payload $\alpha$. For example, for $\alpha = 1$, ternary ±1 embedding can theoretically achieve embedding efficiency $e \approx 4.4$, which is significantly higher than 2, the maximal efficiency of binary codes at this relative message length. The embedding efficiency of binary and ternary Hamming codes for different values of $p$ is shown with "+" and "*" symbols, respectively.

8.6.3 The question of optimal q


In the previous two sections, we learned that by encoding individual elements of
the cover image into q-ary symbols rather than into bits, one can significantly
increase the embedding efficiency of matrix embedding. However, in order to be
able to modify the symbol assigned to each pixel to all q values from Fq , we need

to allow the following values of embedding changes:
$$\mathcal{D}_{\mathrm{odd}} = \left\{-\frac{q-1}{2}, -\frac{q-3}{2}, \ldots, -1, 0, 1, \ldots, \frac{q-1}{2}\right\} \qquad (8.85)$$
for $q$ odd, and
$$\mathcal{D}_{\mathrm{even}} = \left\{-\frac{q-2}{2}, -\frac{q-4}{2}, \ldots, -1, 0, 1, \ldots, \frac{q}{2}\right\} \qquad (8.86)$$
for $q$ even. Thus, for $q > 3$, the magnitude of modifications sometimes has to
be larger than 1. The question is whether it is better to have fewer changes with larger magnitude or more changes with lower magnitude. In contrast to the case of binary or ternary codes, when the magnitude of embedding changes was always 1, it is now no longer appropriate to measure embedding impact using the number of embedding changes because the changes have unequal magnitude. Instead, we measure embedding impact using the distortion measure as defined in Chapter 4,
$$d_\gamma(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n}\left|\mathbf{x}[i] - \mathbf{y}[i]\right|^\gamma \qquad (8.87)$$
for $\gamma \ge 1$.
Let us assume that we use optimal $q$-ary matrix embedding schemes with embedding efficiency reaching the bound (8.84). This means that one can embed relative payload $\alpha$ (bpp) by making on average $H_q^{-1}(\alpha)$ changes per pixel. Assuming that the message is encoded using symbols from $\mathbb{F}_q$ and forms a random stream, the magnitude of these changes is equally likely to reach any of the $q-1$ non-zero values in $\mathcal{D}$ and the expected impact per changed pixel is
$$\bar{d}_\gamma = \frac{1}{q-1}\sum_{d\in\mathcal{D}}|d|^\gamma, \qquad (8.88)$$
where $\mathcal{D}$ stands for $\mathcal{D}_{\mathrm{odd}}$ for $q$ odd and $\mathcal{D}_{\mathrm{even}}$ for $q$ even. Thus, the expected embedding impact per pixel when embedding relative payload $\alpha$ using optimal $q$-ary matrix embedding is
$$\Delta(\alpha, q, \gamma) = \beta(\alpha)\bar{d}_\gamma = H_q^{-1}(\alpha)\,\frac{1}{q-1}\sum_{d\in\mathcal{D}}|d|^\gamma, \qquad (8.89)$$
because $\beta(\alpha) = H_q^{-1}(\alpha)$ is the minimal change rate (8.82) for payload $\alpha$ embeddable using $q$-ary codes.
We can obtain some insight into the trade-off between the number of embedding changes and their magnitude by determining the value of $q$ that minimizes $\Delta(\alpha, q, \gamma)$. Figure 8.7 shows $\Delta(\alpha, q, \gamma)$ for different values of $q$ for $\alpha \in \{0.1, 0.2, \ldots, 0.9\}$ and $\gamma = 1$. Note that $q = 3$ leads to the smallest embedding impact for all relative payloads. This statement holds true for any $\gamma \ge 1$ because $\Delta(\alpha, 2, \gamma)$ and $\Delta(\alpha, 3, \gamma)$ do not depend on $\gamma$ (the magnitude of embedding changes is 1 in these cases) and $\Delta(\alpha, q, \gamma) \ge \Delta(\alpha, q, 1)$ for all $q > 3$. Thus, as long as the

Figure 8.7 Expected embedding impact $\Delta(\alpha, q, 1)$ for various values of $\alpha$ and $q$. The minimal embedding impact is always obtained for $q = 3$.

embedding impact can be captured using $d_\gamma$, we can conclude that the most secure steganography is obtained for ternary codes and it does not pay off to make fewer embedding changes with larger magnitude (or use $q$-ary codes with $q > 3$). Of course, this conclusion hinges on the assumption that statistical detectability can be captured with a distortion measure. This is not, however, entirely clear. It is possible that for some combination of the embedding scheme and the cover-source model, the KL divergence between cover and stego objects will be lower for codes with $q > 3$. This issue is currently an open research problem.
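The minimization of (8.89) over $q$ is easy to reproduce numerically. In the sketch below (helper names ours), $H_q^{-1}$ is computed by bisection on $[0, (q-1)/q]$, where $H_q$ is increasing; for $\gamma = 1$ the minimum falls at $q = 3$, as stated above:

```python
from math import log2

def Hq(x, q):
    # q-ary entropy function (8.80), in bits.
    h = 0.0
    for t in (x, 1 - x):
        if t > 0:
            h -= t * log2(t)
    return h + x * log2(q - 1)

def Hq_inv(alpha, q):
    # Inverse of Hq on [0, (q-1)/q] by bisection.
    lo, hi = 0.0, (q - 1) / q
    for _ in range(60):
        mid = (lo + hi) / 2
        if Hq(mid, q) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def delta(alpha, q, gamma=1):
    # Expected embedding impact per pixel, Eq. (8.89).
    if q % 2:    # D_odd \ {0}: magnitudes 1..(q-1)/2, each appearing twice
        mags = list(range(1, (q - 1) // 2 + 1)) * 2
    else:        # D_even \ {0}: 1..(q-2)/2 twice, plus the single value q/2
        mags = list(range(1, (q - 2) // 2 + 1)) * 2 + [q // 2]
    dbar = sum(m ** gamma for m in mags) / (q - 1)
    return Hq_inv(alpha, q) * dbar
```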

8.7 Minimizing embedding impact using sum and difference covering set

The main theme of this chapter is improving the embedding efficiency of steganographic schemes by decreasing the number of embedding changes. So far, we have explored methods based on syndrome coding with linear codes (so-called matrix embedding). Although this approach is the one most developed today, there exists an alternative and quite elegant approach that originated from the theory of covering sets of cyclic groups. Moreover, it can be thought of as a generalization of ±1 embedding to groups of multiple pixels. Additionally, by connecting a known problem in steganography with another branch of mathematics, steganography may benefit from future breakthroughs in this direction. In this section, we explain the main ideas and include appropriate references where the reader may find more detailed information. Following the spirit of this book, we first introduce a simple example that will later be generalized.

Table 8.2. Required modification of the cover pair (x[1], x[2]) depending on the value of
Ext(x[1], x[2]) (the first column) to embed a quaternary symbol b ∈ Z4 = {0, 1, 2, 3}
(the first row).

x\b 0 1 2 3
0 (0,0) (1,0) (0,1) (–1,0)
1 (–1,0) (0,0) (1,0) (0,1)
2 (0,–1) (–1,0) (0,0) (1,0)
3 (1,0) (0,–1) (–1,0) (0,0)

Consider a steganographic scheme whose embedding mechanism is limited to making ±1 changes to each cover element. Let $\mathbf{x}[1], \mathbf{x}[2]$ be two elements of the cover. We now show how one can embed a quaternary symbol (an element of the set $\{0,1,2,3\} = \mathbb{Z}_4$, or two message bits) into the pair $(\mathbf{x}[1], \mathbf{x}[2])$ by modifying at most one element by 1. The embedding will have the property that the symbol $b$ can be extracted from the pair of stego elements $(\mathbf{y}[1], \mathbf{y}[2])$ using the following extraction mapping:
$$b = \mathrm{Ext}(\mathbf{y}[1], \mathbf{y}[2]) = (\mathbf{y}[1] + 2\mathbf{y}[2]) \bmod 4. \qquad (8.90)$$

Table 8.2 shows that it is always possible to embed a quaternary symbol in each
pair (x[1], x[2]) by modifying at most one element of the pair by ±1. For example,
when Ext(x[1], x[2]) = 2 and we wish to embed symbol 0, we modify x[2] →
x[2] − 1. (In fact, in this case, we can achieve this same effect by modifying x[2] →
x[2] + 1, etc.) Here, we ignore the boundary issues when the modified element
may get out of its dynamic range. If a random symbol stream is embedded using
this method, we embed 2 bits by making a modification to one of the pixels with
probability 12/16 = 3/4. Thus, the embedding efficiency of this method is
e = 2/(3/4) = 8/3 = 2.66 . . . This is a higher embedding efficiency than simply
embedding each bit as the LSB of each pixel, which has efficiency 2. It is also
higher than embedding a ternary symbol (log2 3 bits) using ±1 embedding,
which has embedding efficiency log2 3/(2/3) = (3/2) log2 3 = 2.377 . . .
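This pair-wise ±1 scheme is easy to verify programmatically. The following sketch (function names are illustrative, not from the text) brute-forces the at-most-one ±1 modification that reaches a desired quaternary symbol, reproducing Table 8.2; as above, boundary issues at the ends of the dynamic range are ignored.

```python
def ext(y1, y2):
    # Extraction mapping (8.90): b = (y[1] + 2*y[2]) mod 4.
    return (y1 + 2 * y2) % 4

def embed_pair(x1, x2, b):
    # Try all modifications of at most one element by +-1 and return a
    # stego pair (y1, y2) with Ext(y1, y2) = b.
    for d1, d2 in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
        if ext(x1 + d1, x2 + d2) == b:
            return x1 + d1, x2 + d2
    raise RuntimeError("unreachable: Table 8.2 shows a solution always exists")

# Embed symbol b = 0 into a pair with Ext = 2 (row 2 of Table 8.2).
y = embed_pair(10, 12, 0)   # Ext(10, 12) = (10 + 24) mod 4 = 2
assert ext(*y) == 0
```

Note that the search may return either of the two equally good modifications mentioned in the text (here x[2] + 1 or x[2] − 1); both decode correctly.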
This simple idea, which originally appeared in [174], can be greatly gen-
eralized [158, 160]. For any positive integer M , we will denote by ZM =
{0, 1, . . . , M − 1} the finite cyclic group of order M where the group operation
is addition modulo M . The group (ZM , +) indeed satisfies the axioms of a
group: it is closed with respect to the group operation (for any a, b ∈ ZM ,
the addition is defined as a + b mod M ∈ ZM ), the operation is associative,
there exists an identity element, which is 0, and each element, a, has an inverse
element, M − a.
Let A = {a[1], . . . , a[n]} be a sequence of elements from ZM . The set A is
called a Sum and Difference Covering Set (SDCS) with parameters (n, k, M ) if

3 A quaternary symbol is an element of the set {0, 1, 2, 3} = Z4 .



for each b ∈ ZM there exist s[i] ∈ {0, 1, −1}, i = 1, . . . , n, such that

    ∑_{i=1}^{n} |s[i]| ≤ k,                                    (8.91)

    ∑_{i=1}^{n} s[i]a[i] = b in ZM.                            (8.92)

In other words, it is possible to write every element of ZM through addition or
subtraction operations of at most k elements from A.
The significance of such sets for steganography will become apparent on showing
that an (n, k, M ) SDCS can be used to construct a steganographic scheme
that modifies each cover element by at most 1 and embeds log2 M bits in n
pixels by making at most k modifications. Consider an n-element cover (or a
cover block), represented using the vector x = (x[1], . . . , x[n]) ∈ ZM^n, and a mes-
sage symbol b ∈ ZM. Here, the cover representation is again obtained via some
symbol-assignment function that maps cover elements to ZM. We define the
embedding and extraction operations in the following manner:

    Emb(x, b) = (x[1] + s[1], . . . , x[n] + s[n]) = y,        (8.93)

    Ext(y) = ∑_{i=1}^{n} a[i]y[i],                             (8.94)

where s[i] are such that ∑_i |s[i]| ≤ k and b = ∑_i a[i]y[i]. All operations in the
extraction function are understood as modulo M. To see the existence of such a
vector s, we first write the difference w = b − ∑_{i=1}^{n} a[i]x[i] as an element of ZM,

    w = ∑_{i=1}^{n} a[i]s[i],                                  (8.95)

with ∑_{i=1}^{n} |s[i]| ≤ k because A is an (n, k, M ) SDCS. To minimize the number
of embedding operations, in the embedding function we should choose such a
vector s with the minimal sum ∑_{i=1}^{n} |s[i]|. If there exists more than one such
vector, we choose randomly among them with uniform distribution. Then,

    ∑_{i=1}^{n} a[i]y[i] = ∑_{i=1}^{n} a[i](x[i] + s[i]) = (b − w) + w = b,    (8.96)

which proves that the extraction function indeed obtains the correct message
symbol b. For practical implementations, the embedding function can be easily
implemented using a look-up table tying each message symbol to the vector s.

For further considerations, it will be convenient to denote by dA(b) = min ∑_{i=1}^{n} |s[i]|
the minimal number of embedding modifications needed to embed symbol b. The embed-
ding efficiency of the resulting steganographic scheme with SDCS A on covers
with elements in ZM is

    eA = log2 M / ((1/M ) ∑_{b=0}^{M−1} dA(b)).                (8.97)


Example 8.7: Consider A = {1, 2, 6} and convince yourself that for ∑_{i=1}^{3} |s[i]| ≤ 2
the sum ∑_{i=1}^{3} s[i]a[i] can attain all values between −8 and 8. Out of these
M = 17 values, ∑_{i=1}^{3} |s[i]| = 0 for the value 0, ∑_{i=1}^{3} |s[i]| = 1 for values from the
set {−6, −2, −1, 1, 2, 6}, which is {11, 15, 16, 1, 2, 6} in Z17, and ∑_{i=1}^{3} |s[i]| = 2 for
values in {−8, −7, −5, −4, −3, 3, 4, 5, 7, 8}. Thus, A is an SDCS with parameters
(3, 2, 17) and the embedding efficiency of the associated ±1 embedding scheme
is

    eA = log2 17 / ((1/17)(1 × 0 + 6 × 1 + 10 × 2)) ≈ 2.673.   (8.98)

This embedding scheme can embed log2 17 bits in n = 3 cover elements by mak-
ing at most k = 2 modifications by ±1.
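The claims of this example can be checked by exhaustive search. The sketch below (illustrative code, not from the text) enumerates all signed selections s ∈ {−1, 0, 1}³ to recover dA(b) for every b ∈ Z17 and then evaluates the embedding efficiency via (8.97).

```python
from itertools import product
from math import log2

def sdcs_distances(A, M):
    # d_A(b): minimal number of +-1 changes needed to reach each b in Z_M
    # as a signed sum over A with s[i] in {-1, 0, 1}.
    d = {b: None for b in range(M)}
    for s in product((-1, 0, 1), repeat=len(A)):
        b = sum(si * ai for si, ai in zip(s, A)) % M
        cost = sum(abs(si) for si in s)
        if d[b] is None or cost < d[b]:
            d[b] = cost
    return d

A, M = [1, 2, 6], 17
d = sdcs_distances(A, M)
assert all(cost is not None and cost <= 2 for cost in d.values())  # (3, 2, 17) SDCS
e = log2(M) / (sum(d.values()) / M)   # embedding efficiency (8.97)
print(round(e, 3))                    # ~2.673, matching (8.98)
```

The same brute-force check can be reused to test candidate sets from Table 8.3 with small n and k.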

If, for a given (n, k, M ) SDCS, there does not exist an (n, k, M′) SDCS with
M′ > M , we call the SDCS maximal (its associated steganographic method com-
municates the largest possible payload for a given choice of n and k). Because
there are V3 (n, k) possible ways one can make k or fewer changes by ±1 to n
cover elements, we obtain the following bound:

    M ≤ ∑_{i=0}^{k} 2^i (n choose i),                          (8.99)

which is essentially the same bound as (8.77). For k = 1, the bound states M ≤
1 + 2n. In this case, it is possible to find the maximal SDCS with M = 1 + 2n,
which is A = {1, 2, . . . , n}. To see that A is an (n, 1, 1 + 2n) SDCS, the reader is
encouraged to inspect Exercise 8.5, where the embedding scheme is formulated
from a different perspective using rainbow coloring of lattices.
Finding maximal SDCSs is a rather difficult task. In fact, it is not easy to
find SDCSs in general. Several parametric constructions of SDCSs are described
in [160]. Table 8.3 contains examples of SDCSs useful for embedding large pay-
loads, all found by a computer search. The embedding efficiency and relative
embedding capacity of the associated embedding schemes are also displayed
in Figure 8.8, where we compare the embedding efficiency of steganographic
schemes constructed using SDCSs, Hamming codes, and the ternary repetition
code from Exercise 8.6.
Curiously, the existence of an (n, k, M ) SDCS does not imply the existence
of SDCSs with parameters (n, k, M′), M′ < M . It is known, for example, that although
there exists a (9, 2, 132) SDCS, there are no SDCSs with parameters (9, 2, x) for
x ∈ {131, 129, 128, 127, 125}. The subject of finding SDCSs is related to other
problems from discrete geometry, graph theory, and sum cover sets [73, 103, 105,
115]. Progress in these areas is likely to find applications in steganography via
the considerations explained in this section.

Table 8.3. Examples of SDCSs and the relative payload, α, and embedding efficiency, e,
of their associated steganographic schemes. Note that SDCSs with an even M cannot be
enlarged to an odd M because we would lose the covering property. For example, for the
SDCS {1, 3, 9, 14}, 7 = −9 − 14 mod 30 and 7 would not be covered in Z31 .

(n, k, M )     SDCS                           α        e

(3, 2, 17)     {1,2,6}                        1.3625   2.6726
(4, 2, 30)     {1,3,9,14}                     1.2267   2.9441
(5, 2, 42)     {1,2,7,14,18}                  1.0785   3.1445
(6, 2, 61)     {1,2,5,11,19,27}               0.9885   3.3498
(7, 2, 80)     {1,22,26,30,34,36,39}          0.9031   3.5122
(8, 2, 104)    {2,4,6,13,16,34,39,40}         0.8376   3.6676
(9, 2, 132)    {2,11,33,34,44,50,55,58,62}    0.7827   3.8109
(4, 3, 53)     {1,2,6,18}                     1.4320   2.5727
(5, 3, 105)    {1,3,14,36,42}                 1.3428   2.7756
(6, 3, 174)    {1,3,9,21,51,86}               1.2405   2.9300
(6, 2, 64)     {1,2,4,12,21,28}               1        0.3021

Figure 8.8 Embedding efficiency e(α) of ±1 steganographic schemes based on SDCS
constructions from Table 8.3, plotted against 1/α. For comparison, the plot includes the
embedding efficiency of ternary (codimension 1 and 2) and binary Hamming codes
(codimension 1–3), the ternary Golay code, and the ternary repetition code from
Exercise 8.6 for n = 4, 7, 10, 13, 16 (from the right), together with the bounds on binary
and ternary codes.

Summary

• Matrix embedding (or syndrome coding) is a coding method that can increase
  the embedding efficiency of steganographic schemes.
• It can be applied only when the message length is smaller than the embedding
  capacity.
• The shorter the message, the larger the improvement due to matrix embedding.
• Matrix embedding is part of the theory of covering codes.
• Good matrix embedding schemes should have a small average distance to code
  because it determines the embedding efficiency.
• Some of the simplest matrix embedding methods are based on Hamming codes.
• By assigning a q-ary symbol from a finite field to each cover element, it is
  possible to further increase embedding efficiency using q-ary codes.
• When the embedding impact is measured using a distortion that takes into
  account the magnitude of modifications, ternary codes provide the minimal
  embedding impact. In particular, it does not pay off to make fewer embedding
  changes of magnitude larger than 1.
• It is possible to design steganographic schemes using sum and difference cov-
  ering sets of finite cyclic groups. Current schemes find applications for em-
  bedding large payloads.

Table 8.4. Properties of optimal q-ary matrix embedding schemes when embedding into a
cover containing n pixels. The function Hq (x) is the q-ary entropy function,
Hq (x) = −x log2 x − (1 − x) log2 (1 − x) + x log2 (q − 1).

Maximal payload embeddable using up to R changes        nHq (R/n)
Average number of embedding changes to embed m bits     nHq^{−1}(m/n)
Maximal embedding efficiency to embed m bits            (m/n)/Hq^{−1}(m/n)

Exercises

8.1 [Binary repetition code] The binary repetition code of length n consists
of two codewords C = {(0, . . . , 0), (1, . . . , 1)}. Show that the covering radius of
this code and its average distance to code are
    R = ⌈(n − 1)/2⌉,                                           (8.100)

    Ra = (n/2) [1 − 2^{−n+1} (n−1 choose R)].                  (8.101)
8.2 [Binary–ternary conversion] Write a computer program that converts
a stream of q-ary symbols represented using integers {0, 1, . . . , q − 1} to a bi-
nary stream and vice versa. Make sure that the length of the binary stream is
approximately n log2 q for good encoding.
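One simple (if not streaming-friendly) way to approach this exercise is big-integer base conversion; the sketch below (illustrative code, function names are not from the text) treats the whole q-ary stream as the digits of one integer, so the binary stream has exactly ⌈n log2 q⌉ bits.

```python
from math import ceil, log2

def qary_to_binary(symbols, q):
    # Interpret the q-ary stream (most significant digit first) as one
    # large integer and re-expand it in base 2.
    n = len(symbols)
    value = 0
    for s in symbols:
        value = value * q + s
    # n symbols carry ~ n*log2(q) bits; fixing the output length makes
    # the inverse mapping well defined.
    nbits = ceil(n * log2(q))
    return [(value >> i) & 1 for i in range(nbits)], n

def binary_to_qary(bits, n, q):
    value = sum(b << i for i, b in enumerate(bits))
    symbols = []
    for _ in range(n):
        symbols.append(value % q)
        value //= q
    return symbols[::-1]
```

In practice the length n must be conveyed separately (or blocks of fixed length used); arithmetic-coding style converters achieve the same rate while processing the stream incrementally.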

8.3 [Tail inequality for q > 2] Prove the tail inequality (8.81). Hint: First,
you can assume that

    0 ≤ β ≤ (q − 1)/q                                          (8.102)

because if we restrict our attention to codes containing the all-ones vector, and
thus all its multiples (0, . . . , 0), (1, . . . , 1), . . . , (q − 1, . . . , q − 1), no vector x ∈ Fq^n
can be further from these q codewords than n(1 − 1/q) because the furthest
vector contains n/q zeros, n/q ones, . . ., and n/q symbols q − 1. Then write

    1 = (β + 1 − β)^n ≥ ∑_{i=0}^{βn} (n choose i) β^i (1 − β)^{n−i}
      = (1 − β)^n ∑_{i=0}^{βn} (n choose i) (q − 1)^i [β/((1 − β)(q − 1))]^i,    (8.103)

use the inequality β/[(1 − β)(q − 1)] ≤ 1, which follows from (8.102), and follow
the same steps as in the proof of the tail inequality for q = 2.

8.4 [Rainbow coloring] Let Ld = {(i1 , . . . , id ) | i1 ∈ Z, . . . , id ∈ Z} be a d-
dimensional integer lattice. For each lattice point, (i1 , . . . , id ), we define its neigh-
borhood as

    N (i1 , . . . , id ) = {(j1 , . . . , jd ) | ∑_{k=1}^{d} |ik − jk | ≤ 1}.    (8.104)

In other words, the neighborhood is formed by the lattice point itself and the 2d
other points that differ from the center point in exactly one coordinate by 1.
Show that the following assignment of 2d + 1 colors, c ∈ {0, 1, . . . , 2d}, to the
lattice,

    c(i1 , . . . , id ) = ∑_{k=1}^{d} k·ik mod (2d + 1),                          (8.105)

has the property that the neighborhood of every lattice point contains exactly
2d + 1 different colors.
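The coloring property is easy to test numerically; the sketch below (illustrative code) checks it over a patch of the d = 3 lattice. The key observation is that changing coordinate k by ±1 shifts the color by ±k mod (2d + 1), and {0, ±1, . . . , ±d} covers all residues modulo 2d + 1.

```python
from itertools import product

def color(point, d):
    # Rainbow coloring (8.105): c(i1,...,id) = sum_k k*i_k mod (2d+1).
    return sum(k * i for k, i in enumerate(point, start=1)) % (2 * d + 1)

def neighborhood_colors(point, d):
    # Colors of the point itself and its 2d neighbors differing by +-1
    # in exactly one coordinate, as in (8.104).
    colors = {color(point, d)}
    for k in range(d):
        for delta in (1, -1):
            q = list(point)
            q[k] += delta
            colors.add(color(tuple(q), d))
    return colors

# Every neighborhood in the d = 3 lattice sees all 2d + 1 = 7 colors.
d = 3
for p in product(range(-2, 3), repeat=d):
    assert neighborhood_colors(p, d) == set(range(2 * d + 1))
```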

8.5 [±1 embedding in groups] One possibility to avoid modifying the pixels
by more than 1, yet use the power of q-ary embedding, is to group pixels into
disjoint subsets of d pixels. By modifying each pixel by ±1, we obtain 2d possible
modifications plus one case when no modifications are carried out. The rainbow
coloring from the previous example will enable us to assign colors ((2d + 1)-ary
symbols) to each pixel group and embed one (2d + 1)-ary symbol by modifying
at most one pixel by 1 (±1 embedding). Show that the relative payload and
embedding efficiency of this steganographic method are
    αd = log2 (2d + 1)/d,    ed = log2 (2d + 1)/[1 − (2d + 1)^{−1}].    (8.106)
8.6 [Large relative payload using ternary ±1 embedding] Let C be the
ternary [n, 1] repetition code (a one-dimensional subspace of F3^n) with a ternary
parity-check matrix H = [In−1 , u], where u is the column vector of 2s. This code
can be used to embed (n − 1) log2 3 bits per n pixels, which gives relative payload
αn = ((n − 1)/n) log2 3 → αmax = log2 3 bpp with increasing n. Thus this code is
suitable for embedding large payloads close to the maximal relative payload
αmax . If we denote the number of 0s, 1s, and 2s in an arbitrary vector of F3^n by
a, b, and c, respectively, then the average distance to C can be computed as

    Ra = (1/3^n) ∑ (n choose a)(n−a choose b)(n − max{a, b, c}),          (8.107)

where the sum extends over all triples {a, b, c} of non-negative integers such
that a + b + c = n. Thus, the embedding efficiency of this code is e = [(n −
1) log2 3]/Ra . For example, it is possible to embed α = 1.188 bpp with embed-
ding efficiency 2.918 bits per change for n = 4. The performance of this family
of codes is shown in Figure 8.8. In this exercise, prove the expression for the
average distance to code, Ra .
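A quick numerical check of (8.107) (illustrative code, not a solution to the exercise): compare the formula against a brute-force distance computation and reproduce the n = 4 figures quoted above. The brute force uses the fact that the distance of x to the repetition code is n minus the count of its most frequent symbol.

```python
from itertools import product
from math import comb, log2

def ra_formula(n):
    # Average distance to code, eq. (8.107): sum over triples a + b + c = n.
    total = 0
    for a in range(n + 1):
        for b in range(n - a + 1):
            c = n - a - b
            total += comb(n, a) * comb(n - a, b) * (n - max(a, b, c))
    return total / 3 ** n

def ra_bruteforce(n):
    # Distance of x to {c*(1,...,1) : c in F_3} is n minus the count of
    # the most frequent symbol in x.
    total = 0
    for x in product(range(3), repeat=n):
        total += n - max(x.count(s) for s in range(3))
    return total / 3 ** n

n = 4
assert abs(ra_formula(n) - ra_bruteforce(n)) < 1e-12
e = (n - 1) * log2(3) / ra_formula(n)
print(round(e, 3))   # ~2.918 bits per change, as stated above
```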
Cambridge Books Online (http://ebooks.cambridge.org/)
Steganography in Digital Media: Principles, Algorithms, and Applications
Jessica Fridrich
Online ISBN: 9781139192903. Hardback ISBN: 9780521190190
Chapter 9, Non-shared selection channel, pp. 167-192
Cambridge University Press
9 Non-shared selection channel

In Chapter 6, we learned that steganographic security can be measured with
the Kullback–Leibler divergence between the distributions of cover and stego
images. Four heuristic principles for minimizing the divergence were discussed
in Chapter 7. One of them was the principle of minimal embedding impact,
which starts with the assumption that each cover element, i, can be assigned
a numerical value, ρ[i], that expresses the contribution to the overall statistical
detectability if that cover element was to be changed during embedding. If the
values ρ[i] are approximately the same across all cover elements, minimizing the
embedding impact is equivalent to minimizing the number of embedding changes.
The matrix embedding methods introduced in the previous chapter can be used
to achieve this goal.
If ρ[i] is highly non-uniform, Alice may attempt to restrict the embedding
changes to a selection channel formed by those cover elements with small ρ[i].
Constraining the embedding process in this manner, however, brings a funda-
mental problem. Often, the values ρ[i] are computed from the cover image or
some side-information that is not available to Bob. Thus, Bob is generally un-
able to determine the same selection channel from the stego image and thus read
the message. Channels that are not shared between the sender and the recipi-
ent are called non-shared selection channels. The main focus of this chapter is
construction of methods that enable communication with non-shared selection
channels.
We now give a few typical examples when non-shared selection channels arise.
Imagine that Alice has a raw, never-compressed image and wants to embed in-
formation in its JPEG compressed form. Intuitively, the side-information in the
form of the raw image should help her better conceal the embedding changes.
When compressing the image, Alice can inspect the DCT coefficients after they
have been divided by quantization steps but before they are rounded to integers
and select for embedding those coefficients whose fractional part is close to 0.5.
Such coefficients experience the largest quantization error during JPEG compres-
sion and the smallest combined error (rounding + embedding) if rounded to the
“other value.” For example, when rounding the coefficient −3.54, we can embed
a bit by rounding it to −3 or to −4. The rounding distortion (rounding to −4) is
0.46. If embedding requires rounding to −3 instead, the combined rounding and
embedding distortion is only slightly larger, 0.54. Selecting such coefficients for

embedding, however, creates an obvious and seemingly insurmountable problem.


Bob will not be able to tell which DCT coefficients in the stego JPEG file were
used for embedding because the cover is not available to him and he cannot
completely undo the loss due to rounding in JPEG compression.
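As a concrete illustration of this selection rule, the following sketch (the function name and the tolerance parameter tol are hypothetical choices, not from the text) flags coefficients whose fractional part lies within tol of 0.5, i.e., those for which the combined rounding-plus-embedding distortion exceeds the plain rounding distortion only slightly.

```python
def embedding_candidates(unquantized, tol=0.1):
    # For each DCT coefficient already divided by its quantization step,
    # compare the rounding distortion with the distortion of rounding to
    # the "other" neighboring integer; keep indices where the difference
    # is small, i.e., the fractional part is within tol of 0.5.
    candidates = []
    for i, v in enumerate(unquantized):
        frac = v - int(v // 1)          # fractional part in [0, 1)
        round_err = min(frac, 1 - frac)
        other_err = 1 - round_err       # cost of rounding the "other" way
        if other_err - round_err <= 2 * tol:
            candidates.append(i)
    return candidates

# -3.54 from the text: rounding to -4 costs 0.46, to -3 costs 0.54,
# so this coefficient qualifies as changeable ("dry").
print(embedding_candidates([-3.54, 2.1, 0.97, 5.5]))   # -> [0, 3]
```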
As another simple example, consider adaptive steganography where the cover
elements are chosen for embedding based on their neighborhood. Alice calculates
for each pixel, i, in the cover image the variance, σ 2 [i], from all pixels in its local
3 × 3 neighborhood. Then, she sorts the variances σ 2 [i] from the largest to the
smallest and embeds the payload of m bits using LSB embedding into the m
pixels with the largest local variance. When Bob attempts to read the message,
it may well happen that the m pixels with the largest local variance in the stego
image will not be completely the same (or their order may not be the same) as
those selected by Alice. Again, Bob is unable to read the message. The adaptive
method for palette images presented in Section 5.2.5 is yet another example of
this problem.
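Alice's variance-based selection rule can be sketched as follows (illustrative code; the restriction to interior pixels is a simplifying assumption). The fragility is visible in the code itself: LSB flips perturb the variances, so Bob's recomputed ranking need not match Alice's.

```python
def local_variances(img):
    # Variance of the 3x3 neighborhood of each interior pixel;
    # img is a list of rows of pixel values.
    h, w = len(img), len(img[0])
    out = {}
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            block = [img[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            mean = sum(block) / 9
            out[(i, j)] = sum((p - mean) ** 2 for p in block) / 9
    return out

def select_pixels(img, m):
    # Alice's (non-shared) selection channel: the m interior pixels
    # with the largest local variance.
    v = local_variances(img)
    return sorted(v, key=lambda p: -v[p])[:m]
```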
Non-shared selection channels in steganography are sometimes explained using
the metaphor “writing on wet paper” [91]. Imagine that the cover image x was
exposed to rain and some of its pixels got wet. Alice is allowed only to slightly
modify the dry pixels (the selection channel) but not the wet pixels. During
transmission, the stego image y dries out and thus Bob has no information about
which pixels were dry. The question is how many bits can be communicated to
Bob and how? This problem is recognized in information theory as writing in
memory with defective cells [60, 108, 152, 231, 254]. A computer memory contains
n cells out of which n − k cells are permanently stuck at either 0 or 1. The device
that writes data into the memory knows the locations and status of the stuck
cells. The task is to write as many bits as possible into the memory (up to k) so
that the reading device, which does not have any information about the stuck cells,
can correctly read the data. Clearly, writing on wet paper is formally equivalent
to writing in memory with stuck cells. (The stuck cells correspond to wet pixels.)
We now provide a simple argument [152] based on random binning that shows
that asymptotically, as n → ∞ for a fixed ratio k/n, it is possible to write all k
bits into the memory. In other words, we can write as many bits in the memory
as if the reading device knew the location of the stuck cells! This surprising
fact also follows from the Gel’fand–Pinsker theorem for channels with random
parameters known to the sender [99].
Select an arbitrary ε > 0 and randomly assign all n-bit vectors to 2^{k−εn} disjoint
bins. This assignment must be shared with the reading device. The index of
the bin will be the message communicated. Because there are 2^{k−εn} bins, the
message that can be communicated as the bin index has k − εn bits. Given a
specific message of k − εn bits, find the bin B with index equal to the message.
Then, find in B a word with its n − k cells stuck exactly as in the memory and write
this word into the memory. The reading device will simply read an n-bit word
from the memory, find the bin to which the word belongs, and from the shared
codebook extract the bin index, which is the message. If we are unable to find
a word in B that would be compatible with the defects, we declare a failure to
write the message into the memory. We now show that the probability of failure
is asymptotically negligible. Let us calculate the probability that we will not be
able to find a defect-compatible word in bin B. Because there are in total 2^k
words compatible with the memory and we have 2^n/2^{k−εn} = 2^{n−k+εn} words in
each bin, the probability that none of the 2^{n−k+εn} words in B is compatible is

    (1 − 2^k/2^n)^{2^{n−k+εn}} = [(1 − 1/2^{n−k})^{2^{n−k}}]^{2^{εn}} → 0  as n → ∞ for k/n = const.,    (9.1)

because (1 − 1/2^{n−k})^{2^{n−k}} → 1/e < 1. Thus, for any ε > 0 we can write k − εn
bits into the memory with probability that approaches 1 exponentially fast.
Although asymptotically optimal, the random-binning argument above is not
a practical way to construct steganographic schemes with non-shared selection
channels due to the enormous size of the codebook that needs to be shared. In
Section 9.1, we describe an approach based on syndrome coding using random
linear codes, which is quite suitable for applications in steganography. A prac-
tical and fast wet paper code can be obtained using random linear codes called
LT codes (Section 9.2). To improve the embedding efficiency of wet paper codes,
in Section 9.3 we present another random construction. The usefulness of having
a practical solution for the non-shared selection channel is demonstrated in Sec-
tion 9.4, where we list several fascinating and very diverse applications of wet
paper codes.
Alternative approaches to writing on wet paper that are not discussed in this
book are based on maximum distance separable codes, such as the Reed–Solomon
codes [97], and on BCH codes [208].

9.1 Wet paper codes with syndrome coding

Steganographic schemes with non-shared selection channels can be realized using
syndrome coding quite similar to matrix embedding in the sense that the message
is communicated as the syndrome of stego image bits for some linear code. There
is communicated as the syndrome of stego image bits for some linear code. There
are, however, some fundamental differences between coding for matrix embedding
and for non-shared selection channels. These differences as well as similarities are
commented upon throughout the text.
We assume that n cover elements are represented using a bit-assignment func-
tion, π, as a vector of n bits x ∈ {0, 1}n . For example, one can think of x as the
vector of LSBs of pixels. The sender forms a selection channel of k changeable
elements x[j], j ∈ S ⊂ {1, . . . , n}, |S| = k, that can be modified during embed-
ding. The remaining n − k elements x[j], j ∉ S, are not to be changed. Using
the writing-on-wet-paper metaphor, S contains indices of “dry” elements (func-
tioning memory cells) while the rest of the elements are “wet” (stuck cells). The

sender’s goal is to communicate m < k message bits m ∈ {0, 1}m to the recipient
who has no information about the selection channel S.
Approaching this problem using the paradigm of syndrome coding, the message
is communicated to the recipient as a syndrome. To this end, the sender modifies
some changeable elements in the cover image so that the bits assigned to the stego
image, y, satisfy
Dy = m, (9.2)
where D is an m × n binary matrix shared by the sender and the recipient.
The recipient reads the message by multiplying the vector of stego image bits
y by the matrix D. While this appears identical to matrix embedding, there is
one important difference because the sender is now allowed to modify only the
changeable elements of x.
Using the variable v = y − x, (9.2) can be rewritten as
Dv = m − Dx, (9.3)
where v[i] = 0 for i ∉ S, and v[j], j ∈ S, are to be determined. Because v[i] = 0
for i ∉ S, the product Dv on the left-hand side can be simplified. The sender
can remove from D all n − k columns corresponding to indices i ∉ S and also
remove from v all n − k elements v[i], i ∉ S. To avoid introducing too many new
symbols, we will keep the same symbol for the pruned vector v and write (9.3)
as
Hv = z, (9.4)
where H is an m × k submatrix of D consisting of those columns of D with indices
from S. Note that v ∈ {0, 1}k is an unknown vector holding the embedding
changes and z = m − Dx ∈ {0, 1}m is a known right-hand side. Equation (9.4)
is a system of m linear equations for k unknowns v. We now discuss several
options for solving this system.
If the solution exists, it can be found using standard linear-algebra methods,
such as Gaussian elimination. The solution will exist for any right-hand side z
if the rank of H is m (or the rows of H are linearly independent), which means
that we must have m ≤ k as the necessary condition. However, the complexity
of Gaussian elimination will be prohibitively large, O(km^2), because we cannot
directly impose structure on H that would allow us to solve the system more
efficiently. This is because H was obtained as a submatrix of a larger, user-
selected matrix D through the selection channel over which the sender has no
control because it is determined by the cover image or side-information as in the
examples from the introduction to this chapter. In the next section, we introduce
a class of sparse matrices for which fast algorithms for solving (9.4) are available.
Before we do so, we make two more remarks.
If D is chosen randomly (e.g., generated from the stego key), the probability
that H will be of full rank, rank(H) = m, is 1 − O(2^{m−k}) (see, for example, [31])
and thus approaches 1 exponentially fast with increasing k − m. This means that

syndrome coding with random matrices is asymptotically capable of communi-


cating the same number of bits as there are dry cover elements as if the recipient
knew the selection channel. It is also another way to prove that the capacity of
defective memory is asymptotically equal to the number of correctly functioning
cells.
Another way of looking at the system (9.4) is to think of H as a parity-check
matrix of some linear code of length k and codimension m, in which case solving
for v requires finding a member of the coset C(z) (see Appendix C on coding or
Section 8.2). Again, because it is D and not H over which the sender has com-
plete control, we cannot easily impose that H be a parity-check matrix of some
structured code that would enable us to efficiently find coset members. Note
that by choosing v as a coset leader, rather than an arbitrary coset member, the
sender would not only communicate through a non-shared selection channel but
additionally minimize the number of embedding changes! However, the complex-
ity of finding coset leaders for general codes is exponential in k and constitutes a
much harder task (see Section 9.3 on improving the embedding efficiency of wet
paper codes). Our focus for now will be on communicating through a non-shared
selection channel, which means solving the system (9.4).
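The syndrome-coding approach of this section can be prototyped directly with Gaussian elimination over GF(2) (illustrative code, not the book's pseudo-code; for clarity it uses dense bit lists rather than an efficient bit-packed representation).

```python
def wet_paper_embed(x, dry, m_bits, D):
    # Solve D y = m with y[i] = x[i] on wet positions (i not in dry):
    # this reduces to H v = z, eq. (9.4), with H the dry columns of D
    # and z = m - D x over GF(2).
    m, n = len(D), len(x)
    z = [b ^ (sum(D[r][i] & x[i] for i in range(n)) & 1) for r, b in enumerate(m_bits)]
    H = [[D[r][j] for j in dry] for r in range(m)]
    v = solve_gf2(H, z)              # one solution of H v = z
    y = list(x)
    for j, vj in zip(dry, v):
        y[j] ^= vj                   # flip the assigned bit where v[j] = 1
    return y

def solve_gf2(H, z):
    # Plain Gaussian elimination over GF(2); raises if the system has
    # no solution (rank(H) < m happens with probability O(2^(m-k))).
    m, k = len(H), len(H[0])
    A = [row[:] + [z[r]] for r, row in enumerate(H)]
    pivots, row = [], 0
    for col in range(k):
        piv = next((r for r in range(row, m) if A[r][col]), None)
        if piv is None:
            continue
        A[row], A[piv] = A[piv], A[row]
        for r in range(m):
            if r != row and A[r][col]:
                A[r] = [a ^ b for a, b in zip(A[r], A[row])]
        pivots.append(col)
        row += 1
        if row == m:
            break
    if any(A[r][-1] for r in range(row, m)):
        raise ValueError("rank(H) < m: message too long for this channel")
    v = [0] * k                      # free variables set to 0
    for r, col in enumerate(pivots):
        v[col] = A[r][-1]
    return v
```

The recipient needs none of this machinery: the message is recovered as Dy mod 2.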

9.2 Matrix LT process

In the previous section, we showed that syndrome coding with matrix D shared
between the sender and the recipient can be used to communicate secret messages
using non-shared selection channels. While the recipient reads the message by
performing a simple matrix multiplication Dy, the sender needs to solve the
linear system (9.4). In this section, we show how to perform this task with low
complexity by choosing D from a special class of random sparse matrices.
The basic idea is to make D (and thus H) sparse so that it can be put with
high probability into upper-diagonal form simply by permuting its rows and
columns. Imagine that matrix H has a column with exactly one 1 in, say, the
j1 th row. The sender swaps this column with the first column and then swaps the
first and j1 th rows, which brings the 1 to the upper left corner of H. Note that
at this stage, for the permuted matrix, H[1, 1] = 1 and H[j, 1] = 0, for j > 1.
We now apply the same step again while ignoring the first column and the first
row of the permuted matrix. Let us assume that we can find again a column
with only one 1, say in the j2 th row,1 and swap the column with the second
column of H followed by swapping the second and j2 th rows. As a result, we
will obtain a matrix with 1s on the first two elements of its main diagonal and
0s below them, H[1, 1] = 1, H[2, 2] = 1, H[j, 1] = 0 for j > 1, and H[j, 2] = 0 for
j > 2. We continue this process, this time ignoring the first two columns and

1 Since we are ignoring the first row, this column may have another 1 as its first element.

rows, and eventually stop after m steps. At the end of this process, the row and
column permutations will produce a permuted matrix in an upper-diagonal form,
H[i, i] = 1 for i = 1, . . . , m, H[j, i] = 0 for j > i. Such a linear system can be
efficiently solved using the standard back-substitution as in Gaussian elimination.
(Note that the permutations preserve the low density of the matrix.) We call this
permutation procedure a matrix LT process because it was originally invented
for erasure-correcting codes called LT codes [164]. If, at some step during the
permutation process, we cannot find a column with exactly one 1, we say that the
matrix LT process has failed. The trick is to give H properties that will guarantee
that the matrix LT process will successfully finish with a high probability.
If the Hamming weights of columns of H follow a probability distribution
called the Robust Soliton Distribution (RSD) [164], the matrix LT process will
not fail with high probability. Imposing this distribution on the columns of D,
the columns of H will inherit it, too, because H is a submatrix of D obtained
by removing some of its columns. The RSD requires that the probability that a
column in D has Hamming weight i, 1 ≤ i ≤ m, be (1/η)(ν[i] + τ [i]), where

    ν[i] = 1/m              for i = 1,
         = 1/[i(i − 1)]     for i = 2, . . . , m,                       (9.5)

    τ[i] = T /(im)             for i = 1, . . . , ⌊m/T⌋ − 1,
         = [T log(T /δ)]/m     for i = ⌊m/T⌋,                           (9.6)
         = 0                   for i = ⌊m/T⌋ + 1, . . . , m,

η = ∑_{i=1}^{m} (ν[i] + τ [i]), T = c log(m/δ)√m, and δ and c are suitably chosen constants
whose choice will be discussed later. An example of the RSD for m = 100 is
shown in Figure 9.1. To generate a matrix with the number of ones in its columns
following the RSD, we can first generate a sequence of integers w[1], w[2], . . . that
follows the RSD. Then, the ith column of the matrix is generated by applying a
random permutation to a column containing w[i] ones and m − w[i] zeros.
To obtain some insight into why this distribution looks the way it does, note
that the columns with few ones are more frequent than denser columns to guar-
antee that the LT process will always find a column with just one 1. Without
the rather mysterious spike at Hamming weight 18, however, the matrix would
become too sparse to be of full rank. Thus, the spike ensures that the rank of the
matrix is m. More rigorous analysis of the RSD appears in Chapter 50 of [171]
and the exercises therein.
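A minimal sampler for the RSD might look as follows (illustrative code, not from the text; it uses the natural logarithm and the floor ⌊m/T⌋, which places the spike near weight 18 for m = 100, c = 0.1, δ = 0.5, matching Figure 9.1).

```python
import math
import random

def rsd_probabilities(m, c=0.1, delta=0.5):
    # Robust soliton distribution (9.5)-(9.6) over Hamming weights 1..m.
    T = c * math.log(m / delta) * math.sqrt(m)
    nu = [0.0] * (m + 1)
    tau = [0.0] * (m + 1)
    nu[1] = 1.0 / m
    for i in range(2, m + 1):
        nu[i] = 1.0 / (i * (i - 1))
    spike = int(m / T)                      # the spike position floor(m/T)
    for i in range(1, spike):
        tau[i] = T / (i * m)
    if 1 <= spike <= m:
        tau[spike] = T * math.log(T / delta) / m
    eta = sum(nu[i] + tau[i] for i in range(1, m + 1))
    return [(nu[i] + tau[i]) / eta for i in range(1, m + 1)]

def random_rsd_column(m, probs, rng):
    # A length-m binary column whose Hamming weight follows the RSD.
    w = rng.choices(range(1, m + 1), weights=probs, k=1)[0]
    col = [1] * w + [0] * (m - w)
    rng.shuffle(col)
    return col
```

Columns of D drawn this way are sparse with high probability, which is what keeps the LT process and the syndrome computation fast.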
The analysis of LT codes [164] implies that when the Hamming weights of
columns of D (and thus of H) follow the RSD (9.5)–(9.6) the matrix LT process
finishes successfully with probability Ppass > 1 − δ if the message length m and
the number of changeable elements k satisfy

    k ≥ θm = (1 + O(log^2(m/δ)/√m)) m.                          (9.7)

Figure 9.1 Robust soliton distribution for Hamming weights of columns of D for δ = 0.5,
c = 0.1, and m = 100.

Yet again, we see that asymptotically the sender can communicate k bits because
θ → 1 as m → ∞. The capacity loss due to the finite value of m is about 6%
when m = 10, 000, c = 0.1, and δ = 5 (θ = 1.062), while the probability that the
LT process succeeds is about Ppass ≈ 0.75. This probability increases and the
capacity loss decreases with increasing message length (see Table 9.1).
We note at this point that in practice one can achieve low overhead and high
probability of a successful pass through the LT process with δ > 1, which is in
contradiction with its probabilistic meaning. This is possible because the inequal-
ity Ppass > 1 − δ guaranteed by (9.7) is not tight.
Assuming the maximal-length message is sent (m ≈ k), the average number
of operations required to complete the LT process is [91]

    O(n log(m/δ)) + O(m log(m/δ)) = O(n log(m/δ)),                 (9.8)

which is significantly faster than Gaussian elimination. The first term arises
from evaluating the product Dx, while the second term is the complexity of the
LT process. The gain in implementation efficiency over using simple Gaussian
elimination is shown in Table 9.1.

9.2.1 Implementation
We now describe how the matrix LT process can be incorporated in a stegano-
graphic method. The sender starts by forming the matrix D with columns fol-
lowing the RSD. A stego key can be used to initialize the pseudo-random number
generator. Applying the bit-assignment function to the cover image, the sender
obtains the vector of bits x and computes the right-hand side z = m − Dx. The
matrix LT process is used to find the solution v to the linear system (9.4). Cover
elements i with v[i] = 1 should be modified to change their assigned bit.
174 Chapter 9. Non-shared selection channel

Table 9.1. Running time (in seconds) for solving m × m and m × θm linear systems
using Gaussian elimination and the matrix LT process, respectively (c = 0.1, δ = 5);
P_pass is the probability of a successful pass through the LT process. The experiments
were performed on a single-processor Pentium PC with a 3.4 GHz processor.

      m       Gauss      LT        θ    P_pass
  1,000       0.023   0.008    1.098      43%
 10,000        17.4   0.177    1.062      75%
 30,000         302   0.705    1.047      82%
100,000        9320    3.10    1.033      90%

The receiver forms the matrix D, applies the bit-assignment function to image
elements (obtaining vector y), and finally extracts the message as the syndrome
m = Dy. Note, however, that in order to do so the recipient needs to know the
message length m because the RSD (and thus D) depends on m as a parameter.
(The remaining parameters c and δ can be public knowledge.) The message
length m thus needs to be communicated to the receiver. For example, the sender
can reserve a small portion of the cover image (e.g., determined from the stego
key) where the parameter m will be communicated using a small matrix D0 with
uniform distribution of 0s and 1s, instead of the RSD, and solve the system using
Gaussian elimination. Because in typical applications m could be encoded using
no more than 20 bits, the Gaussian elimination does not present a significant
increase in complexity because solving a system of 20 equations should be fast.
The payload of m bits is then communicated in the rest of the image using the
matrix LT process whose matrix D follows the RSD.
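The whole sender/receiver loop described above can be condensed into a toy example. The sketch below is illustrative only: the matrix D, the changeable set S, and the cover bits are hand-picked so that the system is solvable, and plain Gaussian elimination over GF(2) stands in for the matrix LT process:

```python
def gf2_solve(H, z):
    """Solve H v = z over GF(2) by Gaussian elimination (H: m rows of k
    bits as 0/1 lists). Returns a solution v or None if inconsistent."""
    m, k = len(H), len(H[0])
    H = [row[:] for row in H]
    z = z[:]
    pivots, row = [], 0
    for col in range(k):
        piv = next((r for r in range(row, m) if H[r][col]), None)
        if piv is None:
            continue
        H[row], H[piv] = H[piv], H[row]
        z[row], z[piv] = z[piv], z[row]
        for r in range(m):
            if r != row and H[r][col]:
                H[r] = [a ^ b for a, b in zip(H[r], H[row])]
                z[r] ^= z[row]
        pivots.append(col)
        row += 1
    if any(z[r] for r in range(row, m)):
        return None
    v = [0] * k
    for r, col in enumerate(pivots):
        v[col] = z[r]
    return v

# toy instance: n = 12 cover bits, 3 message bits, hand-picked public D
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]         # cover bits
S = [0, 1, 8, 10]                                # changeable positions (sender-only)
D = [[1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
     [0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]]
msg = [1, 0, 1]
# sender: right-hand side z = msg XOR Dx, system restricted to columns in S
Dx = [sum(D[r][j] * x[j] for j in range(12)) % 2 for r in range(3)]
z = [a ^ b for a, b in zip(msg, Dx)]
H = [[D[r][j] for j in S] for r in range(3)]
v = gf2_solve(H, z)
y = x[:]
for idx, j in enumerate(S):
    y[j] ^= v[idx]                               # flip the selected dry bits
# receiver: knows only D, reads the message as the syndrome D y
extracted = [sum(D[r][j] * y[j] for j in range(12)) % 2 for r in range(3)]
assert extracted == msg
```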
To complete the algorithm description, we need to explain how the sender
solves the problem with occasional failures of the matrix LT process. Again,
among several different approaches that one can take, probably the simplest one
is to make D dependent on the message length, m, for example by making the
seed for the PRNG that generates the sequence of integers w[1], . . . dependent on
a combination of the stego key and message length. If a failure occurs, a dummy
bit is appended to the message, and the matrix D is generated again followed
by another run of the matrix LT process till a successful pass is obtained.
Algorithm 9.1 contains a pseudo-code for the matrix LT process to ease prac-
tical implementation. The input is a binary m × k matrix H and the right-hand
side z ∈ {0, 1}m. The output is the solution v to the system Hv = z.
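Algorithm 9.1 translates to Python roughly as follows; tracking the column permutation in an array (instead of replaying the transpositions τ) is an implementation choice, not part of the algorithm's statement:

```python
def matrix_lt_process(H, z):
    """Matrix LT process for H v = z over GF(2): repeatedly pivot on a
    column with exactly one 1 among the not-yet-processed rows, then
    back-substitute. H is an m x k list of 0/1 rows; returns v or None."""
    m, k = len(H), len(H[0])
    H = [row[:] for row in H]          # work on copies
    z = z[:]
    perm = list(range(k))              # tracks column transpositions
    for i in range(m):
        # find a column i2 >= i with exactly one 1 in rows i..m-1
        found = None
        for i2 in range(i, k):
            ones = [j for j in range(i, m) if H[j][i2]]
            if len(ones) == 1:
                found = (i2, ones[0])
                break
        if found is None:
            return None                # the LT process fails
        i2, j = found
        H[i], H[j] = H[j], H[i]        # swap rows i and j
        z[i], z[j] = z[j], z[i]
        for row in H:                  # swap columns i and i2
            row[i], row[i2] = row[i2], row[i]
        perm[i], perm[i2] = perm[i2], perm[i]
    # H is now upper-diagonal; back-substitute (free variables set to 0)
    vp = [0] * k
    for i in range(m - 1, -1, -1):
        s = sum(H[i][j] & vp[j] for j in range(i + 1, k)) % 2
        vp[i] = z[i] ^ s
    # undo the column permutation
    v = [0] * k
    for i in range(k):
        v[perm[i]] = vp[i]
    return v
```

The function returns None exactly when the LT process declares failure, i.e., when no remaining column contains a single 1.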

Algorithm 9.1 Matrix LT process for solving the linear system Hv = z.

i = 1; t = 0;
while (i ≤ m) & (∃ i′ ≥ i with Σ_{j≥i} H[j, i′] = 1) {
    swap rows i and j_{i′};     // j_{i′} is the row with H[j_{i′}, i′] = 1
    swap z[i] and z[j_{i′}];
    swap columns i and i′;
    swap v[i] and v[i′];
    t = t + 1;
    τ[t] = (i ↔ i′);            // record the column transposition
    i = i + 1;
}
if i ≤ m, declare failure and STOP;
// H is now in upper-diagonal form
v[i] = 0 for m < i ≤ k;
Use back-substitution to determine v[i], i ≤ m;
// Apply the transpositions τ to v in the reverse order
while t > 0 {
    v ← τ[t](v);
    t = t − 1;
}
// The resulting v is the solution to the system Hv = z

9.3 Wet paper codes with improved embedding efficiency

We already know that communication with non-shared selection channels using
syndrome coding requires solving a linear system Hv = m − Dx. For a random
message, m, the solution, v, obtained using the matrix LT process will have
on average 50% of ones and 50% of zeros. Since ones correspond to embedding
changes, the message will be embedded with embedding efficiency 2. If the mes-
sage is shorter than the maximal communicable message, m < k, there will be
more than one solution. In the language of coding theory, the solutions will form
a coset C(z) = {x ∈ {0, 1}^n | Hx = z}. If the sender selects the solution with the
smallest number of ones (a coset leader) the embedding impact will be mini-
mized or, equivalently, the embedding efficiency maximized. Unfortunately, the
problem of finding a coset leader for general codes is NP-complete.
In this section, we describe a version of wet paper codes with improved embed-
ding efficiency. It is a block-based scheme that embeds small message segments
of p bits in each block using random codes of codimension p. For such codes, the
problem of finding the solution v with the smallest number of ones can be solved
simply using brute force.
Keeping the same notation, we assume there are k changeable pixels in a
cover image consisting of n pixels and we wish to communicate m < k message
bits. The sender and receiver agree on a small integer p (e.g., p ≈ 20) and, using
the stego key, divide the cover image into n_B = m/p disjoint pseudo-random
blocks, where each block will convey p message bits. Each block will thus contain
n/n_B = pn/m cover elements (for simplicity we assume all quantities are
integers). Since the blocks are formed pseudo-randomly, there will be on average
(k/n) × (pn/m) = pk/m = p/α changeable pixels, where α = m/k, 0 ≤ α ≤ 1, is
the relative payload.²
Using the stego key, the sender will generate a pseudo-random binary p × pn/m
matrix D that will be used to embed p message bits in every block. Since the
embedding efficiency is determined by the average distance to code (which should
be as small as possible), the matrix D should not have any duplicate or zero
columns. This can be guaranteed if the number of pixels in each block, n/n_B,
satisfies n/n_B < 2^p or, equivalently, m/n = p·n_B/n > p/2^p, a condition that
fails for p ≈ 20 only for extremely short payloads, where detectability is
not an issue anyway. Thus, the columns of D can be generated by drawing n/n_B
integers from the set {1, 2, . . . , 2^p − 1} without replacement and writing them as
binary vectors of length p.
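The column generation for one block can be sketched as follows (the parameter names are illustrative):

```python
import random

def block_matrix_columns(p, n_cols, rng):
    """Columns of D for one block: n_cols distinct non-zero integers from
    {1, ..., 2^p - 1}, each written as a binary column vector of length p.
    Drawing without replacement rules out zero and duplicate columns."""
    values = rng.sample(range(1, 2 ** p), n_cols)
    return [[(v >> bit) & 1 for bit in range(p)] for v in values]
```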
As described in Section 9.1, in each block B the sender forms a binary
submatrix H of D and computes the syndrome z = m − Dx, where m ∈ {0, 1}^p is a
segment of p message bits to be embedded at B and x ∈ {0, 1}^{pn/m} is the vector
of bits of cover image pixels from B. The submatrix H will in general be different
in every block and will have exactly p rows and, on average, p/α columns. The
sender now needs to find a coset leader, v, of the coset C(z). To explain the
method, we introduce the following concepts.
Let U_1 ⊂ F_2^p be the set of all columns of H and U_{i+1} = (U_1 + U_i) − (U_1 ∪ · · · ∪
U_i) − {0}, i = 1, . . . , p, be a sequence of sets, where the sum of two sets is defined
as A + B = {a + b | a ∈ A, b ∈ B}. Note that U_i = ∅ for i > R, where R is the
covering radius of H. Also note that U_i is the set of syndromes that can be
obtained by adding i columns of H but no fewer. Equivalently, U_i is the
set of all coset leaders of weight i. For a given right-hand side z, we could find a
coset leader by generating the sets U_1, U_2, . . . and stopping once z ∈ U_r. The problem
is that the cardinality of these sets increases exponentially. We now describe a
simple algorithm that enables us to find the coset leader with Hamming weight
r while only generating the sets U_i for i ≤ ⌈r/2⌉.
Let z = H[., j_1] + · · · + H[., j_r], where r ≤ R is the minimal number of
columns of H adding up to z. Note that the vector v with zeros everywhere except
for indices j_1, . . . , j_r is a coset leader. Since we work in binary arithmetic,
z + H[., j_1] + · · · + H[., j_⌈r/2⌉] = H[., j_⌈r/2⌉+1] + · · · + H[., j_r], which implies (z +
U_⌈r/2⌉) ∩ U_{r−⌈r/2⌉} ≠ ∅. This observation leads to Algorithm 9.2.
After the solution v has been found, the sender modifies those cover elements
in the block for which v[i] = 1. The modified block of pixels from the stego image
is denoted y, which completes the description of the embedding algorithm.
The recipient knows n from the stego image and knows p because it is public.
Since the message length m is used in dividing the image into blocks, it needs
to be communicated in the stego image as well, for example using the method
described in Section 9.2.1. Knowing m, the recipient uses the secret stego key
and partitions the rest of the stego image into the same disjoint blocks as the
sender and extracts p message bits m from each block of pixels y as the syndrome
m = Dy.

2 Note that we measure the relative payload with respect to the number of changeable
pixels, k, rather than the number of all pixels, n.

Algorithm 9.2 Meet-in-the-middle algorithm for finding coset leaders.

if (z ∈ U_1) {
    v[j_1] = 1; // because z = H[., j_1] for some j_1
    set v[j] = 0 for all other j;
    return;
} else {
    l = r = 1;
}
while ((z + U_l) ∩ U_r = ∅) {
    if (l = r) {
        r = r + 1;
        if (U_r not yet generated) generate U_r;
    } else {
        l = l + 1;
        if (U_l not yet generated) generate U_l;
    }
}
// any w ∈ (z + U_l) ∩ U_r yields a coset leader of weight l + r
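The meet-in-the-middle search can be sketched in Python as follows. Packing columns and syndromes into integers, and raising an error for unreachable syndromes, are choices made here for brevity and are not part of Algorithm 9.2:

```python
def coset_leader_support(cols, z):
    """Meet-in-the-middle search for a minimum-weight solution of H v = z
    over GF(2). Columns of H and the syndrome z are packed as integers
    (bit j = row j). Returns a set of column indices whose XOR equals z
    (the support of a coset leader)."""
    if z == 0:
        return set()
    # U[i]: syndromes reachable by XOR-ing exactly i columns (and no
    # fewer), each mapped to one representative set of column indices
    seen = {0}
    U = {1: {}}
    for j, c in enumerate(cols):
        if c not in seen:
            U[1][c] = {j}
            seen.add(c)
    if z in U[1]:
        return U[1][z]

    def grow():
        # build U_{i+1} = (U_1 + U_i) minus everything seen so far
        i = max(U)
        nxt = {}
        for s, idx in U[i].items():
            for j, c in enumerate(cols):
                t = s ^ c
                if t not in seen and t not in nxt:
                    nxt[t] = idx | {j}
        seen.update(nxt)
        U[i + 1] = nxt

    l = r = 1
    while True:
        for s, idx_r in U[r].items():       # test (z + U_l) with U_r
            if (z ^ s) in U[l]:
                return U[l][z ^ s] | idx_r  # coset leader of weight l + r
        if l == r:
            r += 1
            if r not in U:
                grow()
            if not U[r]:
                raise ValueError("z is not reachable from the columns")
        else:
            l += 1
```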

9.3.1 Implementation
To assess the memory requirements and complexity of Algorithm 9.2, we
consider the worst case, when the sender needs to generate all sets U_1, . . . , U_⌈R/2⌉.
The cardinalities of U_i increase exponentially with i, reach a maximum at around
i ≈ R_a, the average distance to code, and then quickly fall off to zero for i > R_a.
We already know from Chapter 8 that with increasing length of the code (or
increasing p), R_a → R. This means that the above algorithm avoids computing
the largest of the sets U_i. Nevertheless, we will still need to keep in memory all
U_i, i = 1, . . . , ⌈R/2⌉, and the indices j_1, . . . , j_i for each element of U_i. Because on
average |U_1| = p/α, we have on average |U_i| ≤ C(p/α, i), where C(n, k) denotes
the binomial coefficient. Thus, the total memory requirements are bounded by

    O(⌈R/2⌉ × C(p/α, ⌈R/2⌉)) ≈ O(p 2^{(p/α)H(Rα/(2p))}) ≈ O(p 2^{κp}),

where κ = H(H^{−1}(α)/2)/α < 1 and H(x) is the binary entropy function. (For
example, for α = 1/2, κ = 0.61.) Here, we used the asymptotic form (8.49) from
Chapter 8 for the binomial coefficient, C(n, k) ≈ 2^{nH(k/n)}, and the fact that
R ≈ (p/α)H^{−1}(α) is the expected number of embedding changes for large p (see
Table 8.4).

Figure 9.2 Embedding efficiency e of wet paper codes realized using random linear codes
of codimension p = 6, 10, 14, 18 displayed as a function of 1/α. The solid curve is the
asymptotic upper bound on embedding efficiency.

To obtain a bound on the computational complexity, note that we need to
compute U_1 + U_i for i = 1, . . . , ⌈R/2⌉. Thus, the computational complexity is
bounded by O((p/α) × ⌈R/2⌉ × C(p/α, ⌈R/2⌉)) ≈ O(p² 2^{κp}), where C(n, k)
denotes the binomial coefficient. Because the complexity is exponential in p, the
largest p for which this algorithm can be used with running times of the order
of seconds on a single-processor PC with a 3.4 GHz processor is about p ≈ 18.
We make one comment on the solvability of (9.4) in each block. The equation
Hv = z will have a solution for all z ∈ F_2^p if and only if rank(H) = p. The
probability of this is 1 − O(2^{p(1−1/α)}), as this is the probability that a random
binary matrix of dimension p × p/α, α = m/k, will have full rank [30]. This
probability quickly approaches 1 with decreasing message length m or with
increasing p (for fixed m and k) because k > m.
For k/m ≈ 1, the probability that rank(H) < p may become large enough to
encounter a failure to embed all p bits in some blocks. For example, for p = 18
and k/m = 2 (or relative payload α = 1/2), n = 10⁶, k = 50,000, the probability
of failure is about 0.0043. The fact that the number of columns in H varies from
block to block also contributes to failures. We note that the probability of failure
quickly decreases with increasing k/m and is not an issue as long as k/m > 3 or
α < 1/3. The failures can be dealt with by communicating to the receiver which
blocks failed to hold all p bits. For details of this procedure, the reader is referred
to [91].

9.3.2 Embedding efficiency


With increasing code length and fixed relative payload, matrix embedding
schemes based on random linear codes asymptotically achieve the theoretical
upper bound on embedding efficiency (see Chapter 8). Even though with p = 18
the codes’ performance is still far from the limit, they lead to a significant im-
provement in embedding efficiency. Figure 9.2 shows the embedding efficiency
as a function of the ratio 1/α = k/m for a cover image with n = 10⁶ pixels and
k = 50, 000 changeable pixels for p = 6, 10, 14, 18. The values were obtained by
averaging over 100 embeddings of random messages in the same cover image with
the same parameters k, n, and m.
Note that for a fixed p, the efficiency increases with shorter messages. Once the
number of changeable pixels in each block exceeds 2^p, the embedding efficiency
starts saturating at p/(1 − 2^{−p}), which is the value approached with decreasing
payload α. This is because the p/α columns of H eventually cover the whole
space F_2^p and thus we embed every syndrome s ≠ 0 using at most one
embedding change.
We close this section with one more interesting observation. An observant
reader will notice in Figure 9.2 that, for fixed p, the embedding efficiency in-
creases with decreasing relative payload α in a curious non-uniform manner.
In particular, while the codes with codimension p = 14 and p = 18 have almost
identical embedding efficiency for α^{−1} = 5, the difference becomes quite
substantial for α^{−1} = 9. In fact, it is even possible that for a fixed α, a higher embedding
efficiency may be obtained for codes with lower p. In such a situation, it would
be pointless to use codes with higher p as we would obtain worse performance
and, at the same time, increase the computational complexity and memory re-
quirements. A detailed analysis of this rather peculiar phenomenon, which is of
importance to practitioners, appears in [93].

9.4 Sample applications

The methods for communication using non-shared selection channels introduced
above are quite important tools for the steganographer because there exist
numerous situations in steganography when non-shared channels arise. In this
section, we demonstrate this for a rather diverse spectrum of applications ranging
from minimum-impact steganography and public-key steganography to improved
matrix embedding methods and an improved F5 algorithm (nsF5).

9.4.1 Minimal-embedding-impact steganography


According to the principle of minimal embedding impact, the steganographer
first assigns to each cover element, i, a scalar value, ρ[i], that expresses the
increase in statistical detectability should the ith cover element be modified
during embedding. If ρ[i] are approximately uniform, minimizing the embedding
impact is equivalent to minimizing the number of embedding changes, which was
the topic of Chapter 8 on matrix embedding. If ρ[i] vary greatly from element
to element, the sender should embed the payload so that the sum of ρ[i] over all
modified cover elements is as small as possible.
One of the simplest strategies³ the sender can adopt to communicate m
message bits is to embed into the m cover elements x[i_1], . . . , x[i_m] with the
smallest ρ[i], that is, ρ[i_k] ≤ ρ[j] whenever j ∉ S, S = {i_1, . . . , i_m}. However,
ρ[i] are often determined by some side-information unavailable to the recipient,
which means
that we are facing a non-shared selection channel. Thus, the sender pronounces
the elements from S changeable (or dry) and applies some of the methods ex-
plained in this chapter to communicate the payload to the recipient. We now
provide details of a specific embedding method based on this strategy called
perturbed-quantization steganography [89].

9.4.2 Perturbed quantization


Let us assume that the sender obtains the cover image through some process
that ends with quantization, for example lossy compression, resampling, filter-
ing, or the image-acquisition process itself. The sender’s goal is to minimize the
combined distortion due to processing and embedding. Let us assume that the
input cover image (also called the precover) is represented with a vector X ∈ Z^N,
where Z is the range of its elements. For example, for a 16-bit grayscale image,
Z = {0, . . . , 2^16 − 1}. Here, we intentionally used the term “input cover image”
and a capital letter to denote it because the cover against which security of
the embedding changes should be evaluated will be obtained from X using a
transformation F of the following form:

    F = Q ◦ T : Z^N → X^n,                                         (9.9)

where X is the dynamic range of the transformed signal x = F(X) and the
real-valued map T : Z^N → R^n is some form of processing. The circle stands for
composition of mappings, Q ◦ T(X) = Q(T(X)). We denote by T(X) = u ∈ R^n
the intermediate image. The map Q is an integer scalar quantizer with range X
extended to work on vectors by coordinates: Q(u) = (Q(u[1]), . . . , Q(u[n])). We
stress that the cover image is the signal x.
Following the principle of minimum embedding impact, we define ρ[i] using
the uniquely determined integer a, a ≤ u[i] < a + 1, as

    ρ[i] = |(u[i] − a) − (a + 1 − u[i])| = 2|u[i] − (a + 1/2)|.    (9.10)

3 Exercises 9.3–9.6 investigate a more general strategy under the assumption that wet paper
codes with the largest theoretical embedding efficiency are used.

In other words, ρ[i] is the increase in the quantization error when quantizing u[i]
to the second closest value to u[i] instead of the closest one. If the sender uses a
bit-assignment function that always assigns two different bits to integers a and
a + 1, the sender has the power to flip the bit assigned to x[i] by rounding to
the second closest value.
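The impact (9.10) is straightforward to evaluate from the unquantized values; a minimal sketch:

```python
import math

def pq_costs(u):
    """Embedding impact (9.10) for each unquantized value u[i]: the
    increase in quantization error when u[i] is rounded to the
    second-closest integer instead of the closest one; a = floor(u[i])
    is the unique integer with a <= u[i] < a + 1."""
    out = []
    for ui in u:
        a = math.floor(ui)
        rho = abs((ui - a) - (a + 1 - ui))   # = 2*|ui - (a + 1/2)|
        out.append(rho)
    return out
```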
We now give a few examples of mappings F that could be used for perturbed-
quantization steganography.

Example 9.1: [Downsampling] For grayscale images in raster format, the trans-
formation T maps an M1 × N1 matrix of integers X[i, j] into an m1 × n1 matrix
of real numbers u = u[r, s] using a resampling algorithm.

Example 9.2: [Decreasing the color depth by d bits] The transformation
T maps an M1 × N1 matrix of integers X[i, j] in the range Z = {0, . . . , 2^{n_c} − 1}
into a matrix of real numbers u[i, j] = X[i, j]/2^d of the same dimensions, x[i, j] ∈
X = {0, . . . , 2^{n_c − d} − 1}.

Example 9.3: [JPEG compression] For a grayscale image, the transformation
T maps an M1 × N1 matrix of integers X[i, j] into a matrix of DCT coefficients
u[i, j] in a block-by-block manner (here we assume for simplicity that M1 and
N1 are multiples of 8). In each 8 × 8 pixel block B, the (k, l)th element of the
transformed block in the DCT domain is DCT(B)[k, l]/Q[k, l], where DCT is
the two-dimensional DCT (2.20) and Q[k, l] is the (k, l)th element of the JPEG
quantization matrix (see Section 2.3 for more details about the JPEG format).

Example 9.4: [Double JPEG compression] Normally, the quantization error
has uniform distribution, which limits the number of changeable cover elements
with small ρ[i]. Their number can be artificially increased by repeated
quantization. Imagine compressing a raw image using primary and secondary
quantization matrices Q^(1) and Q^(2). When the quantization steps Q^(1)[k, l] and Q^(2)[k, l]
satisfy aQ^(1)[k, l] = bQ^(2)[k, l] + (1/2)Q^(2)[k, l] for some integers a and b, it means
that DCT coefficients D[k, l] that are equal to a after the first compression end
up in the middle of quantization intervals during the second compression. Such
coefficients are called contributing coefficients. Some combinations of the quality
factors, such as q_f^(1) = 85 and q_f^(2) = 70, produce a large number of contributing
coefficients and thus a large embedding capacity. To embed n bits, the selection
channel is formed by n contributing coefficients with the smallest impact (9.10).


This algorithm is described in detail in the original publication [92].

We now provide some heuristic thoughts about the steganographic security of
perturbed quantization. In order to mount an attack, the warden would have
to find statistical evidence that some of the values u[j] were not quantized to
their correct values. This, however, may not be easy in general for the follow-
ing reasons. The sender is using side-information that is largely removed during
quantization and is thus unavailable to the warden. Moreover, the rounding pro-
cess at changeable elements is more influenced by noise naturally present in
images than for the remaining elements.
The warden, however, might be able to model some regions in the image well
enough (e.g., regions with a smooth gradient) and attempt to detect embedding
changes in those regions only. Thus, the sender can (and should) exclude from the
selection channel S those elements whose unquantized values can be predicted
with better accuracy. This reasoning leads to two modifications of the perturbed-
quantization (PQ) method from Example 9.4.

Example 9.5: [PQt] Texture-adaptive perturbed quantization narrows the
selection channel only to contributing coefficients coming from blocks with the
highest texture. The block texture t(B) is computed from the singly compressed
cover JPEG image with quality factor q_f^(1) decompressed to the spatial domain.
The pixel block is divided into disjoint 2 × 2 blocks. For each 2 × 2 block, the
difference between the highest and the lowest pixel value is calculated. The
texture measure t(B) is the sum of these differences over the whole block. To embed
n bits, the selection channel is formed by n contributing coefficients from blocks
with the highest measure of texture.
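The texture measure t(B) described above can be computed, for example, as follows (taking the block as a plain list of lists):

```python
def block_texture(block):
    """Texture measure t(B) of an 8x8 pixel block (list of 8 rows of 8
    values): the sum over the 16 disjoint 2x2 sub-blocks of the difference
    between the highest and lowest pixel value in the sub-block."""
    t = 0
    for r in range(0, 8, 2):
        for c in range(0, 8, 2):
            vals = [block[r + dr][c + dc] for dr in (0, 1) for dc in (0, 1)]
            t += max(vals) - min(vals)
    return t
```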

Example 9.6: [PQe] Energy-adaptive perturbed quantization narrows down the
selection channel to contributing coefficients from blocks with the highest energy
e(B), which is calculated as the sum of squares of all quantized DCT coefficients
in the block. To embed n bits, the selection channel is formed by n contributing
coefficients selected from blocks with the highest energy.
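The energy measure e(B) is a plain sum of squares; a minimal sketch:

```python
def block_energy(dct_block):
    """Energy e(B) of a block: the sum of squares of all quantized DCT
    coefficients in the block (an 8x8 list of lists of integers)."""
    return sum(c * c for row in dct_block for c in row)
```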

In contrast to the distortion-based PQ method, where the selection channel is
formed by contributing DCT coefficients with the smallest rounding error, in
PQt and PQe the contributing coefficients are selected on the basis of the block
texture (or energy) rather than the rounding distortion. These two adaptive
versions of perturbed quantization provide better security than the version from
Example 9.4 (see [95] and the steganalysis results in Table 12.8).

9.4.3 MMx embedding algorithm


The MMx algorithm [141] is a steganographic algorithm for JPEG images with an
embedding mechanism that is a combination of matrix embedding using binary
[2^p − 1, 2^p − 1 − p] Hamming codes and perturbed quantization. MMx minimizes
the embedding impact by utilizing as side-information the unquantized version
of the cover on its input (raw image before JPEG compression). It uses a clever
trick to avoid having to use wet paper codes. Here, we only briefly explain the
principle on a simple example of a scheme that uses [7, 4] Hamming codes – it
embeds 3 bits into 7 DCT coefficients D[i], i = 1, . . . , 7, by making at most x
embedding changes. We speak of an (x, 3, 7) MMx method.
The sender first uses the non-rounded value of the ith DCT coefficient to derive
the embedding impact, ρ[i], at the ith DCT coefficient according to (9.10) with
the only exception when either a = 0 or a + 1 = 0, or, in other words, when
the value of the non-rounded DCT coefficient is in the interval [−1, 1]. Since
the decoder will read message bits only from non-zero coefficients (as in F5),
the sender never rounds such a coefficient to zero. Instead, she rounds it to 2
(if the coefficient is positive) or −2 (if the coefficient is negative). In this case,
the embedding distortion will be increased to 1 + ρ[i]. This choice removes the
problem with shrinking a non-zero coefficient to zero at the expense of a larger
embedding distortion.
During the actual embedding, the sender first tries to embed 3 bits in 7 DCT
coefficients using at most one change as in matrix embedding using Hamming
codes (Chapter 8). If the jth coefficient D[j] had to be rounded to the “other”
side, the embedding impact is ρ[j]. The Hamming code uniquely determines the
coefficient that needs to be modified. Chances are that the coefficient happens to
have a large ρ[j]. The sender tries the embedding again, this time allowing two
embedding changes (two coefficients to be rounded to the other side) and lists
all pairs of columns H[., j′], H[., j″] from the parity-check matrix H for which
H[., j′] + H[., j″] = H[., j]. For the [7, 4] Hamming code there will always be exactly
three such pairs (in general, the number of pairs is equal to the code codimension
p). For each pair, the sender calculates the combined embedding impact ρ[j′] + ρ[j″]. If
one of these combined impacts is smaller than ρ[j], the sender makes embedding
changes at that coefficient pair instead to decrease the embedding impact. This
embedding method can be extended by allowing up to x = 3 embedding changes
to see whether the message can be embedded with an even smaller impact by
modifying three pixels.
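The search over the single column and the three candidate pairs can be sketched as follows for the [7, 4] code; the function name and interface are illustrative, not taken from [141]:

```python
def mm2_choice(rho, s):
    """MM2-style search for the [7,4] Hamming code with parity-check
    columns indexed 1..7 (column j is the binary representation of j).
    Given per-coefficient impacts rho[j-1] and the column j = s that
    matrix embedding asks to change, return the cheaper option: the
    single column s, or one of the three pairs (a, b) with a XOR b = s."""
    assert 1 <= s <= 7
    best, cost = [s], rho[s - 1]
    for a in range(1, 8):
        b = a ^ s
        if a < b <= 7:                       # each unordered pair once
            pair_cost = rho[a - 1] + rho[b - 1]
            if pair_cost < cost:
                best, cost = [a, b], pair_cost
    return best, cost
```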
Depending on the number of allowed changes, we speak of an MMx algorithm,
where x = 1, 2, 3, . . . is the number of allowed embedding changes. In practice,
only negligible improvement is typically obtained by using x > 3. For small pay-
loads, the MMx method appears to resist blind steganalysis attacks the best out
of all known steganographic methods for JPEG images [95] (as of late 2008).
The reader is encouraged to inspect the results of blind steganalysis presented
in Tables 12.7 and 12.8.

9.4.4 Public-key steganography


In this section, we explain how public-key cryptography can be used to construct
a version of public-key steganography. Non-shared selection channels will again
play an important role.
In public-key cryptography, there exist two keys – an encryption key E and a
decryption key D. The encryption key is made public, while the decryption key
is kept private. Public-key encryption schemes have the important property that,
knowing the encryption key, it is computationally hard to derive the decryption
key. Thus, giving everyone the ability to encrypt messages does not automatically
give the ability to decrypt. When Alice wants to send an encrypted message, m,
to Bob, she can send him E_B(D_A(m)), where E_B is Bob's public (encryption)
key and D_A is Alice's private (decryption) key. Bob reads the message as
m = E_A(D_B(E_B(D_A(m)))) because D_B E_B = E_A D_A = Identity.
Note that only Bob can read the message because only he has the private
decryption key D_B. At the same time, he will know that it was Alice who sent the
message as only she possesses the decryption key D_A.
The public-key scheme enables Alice and Bob to exchange secrets without
previously agreeing on a secret key, which makes such schemes very useful in
practice, such as in financial transactions over computer networks. The reader is
referred to [207] to learn more about construction of such encryption schemes.
Public-key cryptography can be used to construct an equivalent paradigm for
steganography [5]. Imagine that Alice uses a steganographic scheme with a public
selection channel but encrypts her payload using a public-key encryption scheme.
Upon receiving an image from Alice, Bob suspects steganography, extracts the
payload, and decrypts it to see whether there is a secret message from Alice.
Note that in this setup the encrypted message is publicly available to Eve (the
warden) but, as long as the cryptosystem is strong, Eve will not be able to tell
whether the extracted bit stream is a random sequence or ciphertext. The fact
that the selection channel is public, however, may give Eve a starting point to
mount a steganalytic attack. For example, she can compare statistical properties
of pixels from the public selection channel with the remaining pixels and look
for statistically significant deviations [90]. Without any doubt, public selection
channels are a security weakness. This problem can be eliminated by using
selection channels that are completely random, implemented using wet paper
codes with a public, randomly generated matrix D. This gives everyone the ability to
read the message, as required, without giving any information about the selection
channel.

9.4.5 e + 1 matrix embedding


In this section, we describe a clever method for using a binary code as if it
were a ternary code with a correspondingly higher embedding efficiency. Let us
assume that we have a matrix embedding scheme with bit-assignment function
π(x) = LSB(x) based on a binary code C with embedding efficiency e bits per
change. This means that the sender can embed m bits in n pixels on average by
making m/e embedding changes. Let us denote the set of modified pixels as S,
E[|S|] = m/e. If the sender modifies pixels in S by ±1, rather than by flipping
their LSBs, when making the embedding changes she has a choice to adjust the
second LSB of every pixel in S to either 0 or 1 and thus embed an additional
m/e bits in the second LSB plane. Since the recipient will not know the set S,
the sender has to use wet paper codes, this time with the second LSB as the
bit-assignment function⁴ and S as the set of changeable pixels.
The recipient first extracts m message bits from LSBs of the image using the
parity-check matrix of the code C and then extracts an additional m/e bits from
the second LSB plane using wet paper codes. Because a total of m + m/e bits is
embedded using m/e changes, the embedding efficiency of this scheme is

    (m + m/e)/(m/e) = e + 1.                                       (9.11)

To summarize, if we allow the sender to modify the pixels by ±1 rather than
flip their LSBs, the embedding efficiency of the original binary matrix embedding
scheme can be increased by 1. On the other hand, ±1 embedding changes allow
application of ternary codes that enjoy a higher bound on embedding efficiency.
Thus, are we gaining anything using this scheme over ternary schemes? Quite
surprisingly, it can be shown that the embedding efficiency of this e + 1 scheme
is as high as what can be achieved using ternary codes with the additional
advantage that there is no need to convert binary streams to ternary and vice
versa.
We now prove that if the binary code C reaches its upper bound on embedding
efficiency, e = α/H^{−1}(α), the corresponding e + 1 matrix embedding scheme
reaches the bound for ternary codes [258]. This interesting result tells us that if
we have near-optimal binary matrix embedding, we can automatically construct
near-optimal ternary codes! To see this, realize that for an optimal binary matrix
embedding scheme, the relative payload embeddable using change rate β is H(β)
(see Table 8.4). The relative payload embeddable using the same change rate for
the e + 1 scheme is H(β) + β = H(β) + β log₂(3 − 1) = H₃(β), which is exactly
the ternary bound.
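The closing identity can also be checked numerically: the sketch below evaluates H(β) + β and H₃(β) for one arbitrary change rate and compares the resulting embedding efficiencies:

```python
import math

def H2(x):
    """Binary entropy function H(x)."""
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def H3(x):
    """Ternary entropy function H_3(x) = x log2(3 - 1) + H(x)."""
    return x * math.log2(3 - 1) + H2(x)

beta = 0.08                            # an arbitrary change rate
e_binary = H2(beta) / beta             # optimal binary matrix embedding
e_plus_1 = (H2(beta) + beta) / beta    # e + 1 scheme: extra beta bits/pixel
e_ternary = H3(beta) / beta            # ternary bound at payload H_3(beta)
assert abs(e_plus_1 - (e_binary + 1)) < 1e-12
assert abs(e_plus_1 - e_ternary) < 1e-12
```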

4 π₂(x) = LSB(⌊x/2⌋).

Figure 9.3 A block of n2^p cover elements used in the ZZW code construction. (The
diagram shows the block divided into n subsets of 2^p elements x[i, s]; the code C_0 is
applied to the subset XORs v[1], . . . , v[n], while the Hamming code H_p acts within each
subset.)

9.4.6 Extending matrix embedding using Hamming codes


Wet paper codes find numerous and sometimes quite unexpected applications in
steganography. A nice example is the surprising ZZW construction [257], using
which one can build new families of very efficient matrix embedding codes from
existing codes. This construction is important because the new codes follow the
upper bound on embedding efficiency.
Let C_0 be a code (not necessarily linear) of length n that can embed m bits in n pixels using on average R_a changes. We will say that C_0 is (R_a, n, m). The ZZW construction leads to a family of codes C_p that are (R_a, n2^p, m + pR_a), p ≥ 1.
Consider a single cover block with n2^p pixels. The following procedure, schematically depicted in Figure 9.3, would be repeated in each block if there is more than one block in the image. Divide the block into n disjoint subsets, each consisting of 2^p pixels. Denote by x[i, s], i = 1, . . . , 2^p, s = 1, . . . , n, the ith pixel in the sth subset. First, form the XOR of all bits from each subset s,

\[
v[s] = \bigoplus_{i=1}^{2^p} x[i, s], \quad s = 1, \dots, n. \tag{9.12}
\]

Then, using C_0, embed m bits in v, considering v as some fictitious cover image. This embedding will require changing r bits of v or, equivalently, changing the XOR of the corresponding subsets s_1, s_2, . . . , s_r ∈ {1, . . . , n}. On average, there will be R_a such subsets, or E[r] = R_a. To change the XOR in (9.12), one pixel must be changed in each subset s_i. We will let this one change communicate an additional p bits through binary Hamming codes, which will give us the expected payload m + pR_a per n2^p pixels for the code C_p. However, because the receiver will not know which subsets communicate this additional payload (the receiver will not know the indices s_1, . . . , s_r), the sender must use wet paper codes, which is the step described next.
Let H be the p × (2^p − 1) parity-check matrix of a binary Hamming code H_p (Section 8.3.1). Compute the syndrome of each subset as s^(s) = Hx[·, s] ∈ {0, 1}^p, where x[·, s] = (x[1, s], . . . , x[2^p − 1, s]) written as a column vector.5 Concatenate all these syndromes into one column vector of np bits

\[
\left(s^{(1)}, s^{(2)}, \dots, s^{(n)}\right). \tag{9.13}
\]

Now realize that, due to the properties of Hamming codes, each syndrome s^(s_i), i = 1, . . . , r, can be changed to an arbitrary syndrome by making at most one change to x[1, s_i], . . . , x[2^p − 1, s_i].
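This one-change property can be checked directly: the columns of the Hamming parity-check matrix run through all non-zero syndromes, so flipping the single bit whose column equals the required syndrome difference does the job. A small numeric check for p = 3 (illustrative, not from the book):

```python
import numpy as np

p = 3
n = 2**p - 1
# Parity-check matrix of the [7,4] Hamming code: column j holds the
# binary representation of j + 1.
H = np.array([[(j >> i) & 1 for j in range(1, n + 1)] for i in range(p)])

def syndrome(x):
    return tuple(H.dot(x) % 2)

rng = np.random.default_rng(3)
x = rng.integers(0, 2, n)
s = np.array(syndrome(x))
for target in range(2**p):
    t = np.array([(target >> i) & 1 for i in range(p)])
    d = (s + t) % 2                    # required syndrome difference
    y = x.copy()
    if d.any():
        j = int(sum(int(b) << i for i, b in enumerate(d)))  # column value of d
        y[j - 1] ^= 1                  # flip the single matching bit
    assert syndrome(y) == tuple(t)     # any syndrome, at most one change
```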
Label all p bits of syndromes coming from subsets s_1, . . . , s_r as dry (which makes in total pr dry bits) and all remaining bits in (9.13) as wet. If there is more than one block in the image, concatenate the vectors (9.13) from all blocks to form one long vector of Lnp bits, where L is the number of blocks. This vector will have p(r_1 + · · · + r_L) dry bits or on average E[p(r_1 + · · · + r_L)] = LR_a p dry bits. Now form the random sparse matrix D with Lnp columns and p(r_1 + · · · + r_L) rows so that its columns follow the RSD as described in Section 9.2. Thus, using wet paper codes, we can communicate on average LpR_a message bits in the whole image (plus Lm bits embedded using C_0 in each block). If the wet paper code dictates that a syndrome s^(s) be changed to s′ ≠ s^(s), we can arrange for this by modifying exactly one bit in the corresponding vector of bits x[1, s], . . . , x[2^p − 1, s]. If no change in the syndrome s^(s) is needed, all bits x[1, s], . . . , x[2^p − 1, s] must stay unchanged. But, because we still need to change the XOR of all bits x[1, s], . . . , x[2^p, s] in (9.12), we simply flip the 2^p-th bit, x[2^p, s], because this bit was put aside and does not participate in the syndrome calculation.
To summarize, we embed in each block of n2^p cover elements m + pR_a bits using on average R_a changes. We can also say that the relative payload

\[
\alpha_p = \frac{m + pR_a}{n2^p} \tag{9.14}
\]

can be embedded with embedding efficiency

\[
e_p = \frac{m + pR_a}{R_a} = p + \frac{m}{R_a}. \tag{9.15}
\]
This newly constructed family of codes C_p has one important property. With increasing p, the embedding efficiency e_p follows the upper bound on embedding efficiency in the sense that the limit is finite [79],

\[
\lim_{p\to\infty}\left(\frac{\alpha_p}{H^{-1}(\alpha_p)} - e_p\right) = \frac{1}{\ln 2} - \frac{m}{R_a} + \log_2\frac{n}{R_a} = \Delta(R_a, n, m). \tag{9.16}
\]

5 Note that we are reserving the last element from each subset, x[2^p, s], to be used later.
[Figure 9.4: Embedding efficiency e(α) of codes from the family (1/2, 2^p, 1 + p/2) for p = 0, . . . , 6, plotted against α^{-1} together with the upper bound on embedding efficiency.]

By inspecting this construction for the trivial embedding method that embeds 1 bit in 1 pixel using on average 1/2 change, or the (1/2, 1, 1) code, we discover something truly remarkable. The family of codes is (1/2, 2^p, 1 + p/2) and its embedding efficiency for various values of p is shown in Figure 9.4. Surprisingly, this family outperforms all known matrix embedding schemes constructed from structured codes (both linear and non-linear) [257]. Extensions of codes that use random constructions [68] (also see Section 8.3 and Section 8.5) lead to even better code families.
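The claim that the family stays close to the bound can be probed numerically. The sketch below (illustrative, not from the book) computes the binary bound α/H^{-1}(α) by bisection and compares it with e_p = p + 2 for the family (1/2, 2^p, 1 + p/2) built from the trivial (1/2, 1, 1) code:

```python
from math import log2

def H(x):
    # binary entropy function
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def H_inv(a):
    # inverse of H on [0, 1/2] by bisection
    lo, hi = 0.0, 0.5
    for _ in range(80):
        mid = (lo + hi) / 2
        if H(mid) < a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# ZZW family built from the trivial (1/2, 1, 1) base code:
# relative payload alpha_p = (1 + p/2)/2^p, efficiency e_p = p + 2
for p in range(2, 11):
    alpha = (1 + p / 2) / 2**p
    e_p = p + 2
    bound = alpha / H_inv(alpha)    # upper bound on embedding efficiency
    assert e_p <= bound < e_p + 1   # within 1 of the bound, never above it
```

For this family the gap to the bound hovers around 0.5 for the values of p checked, consistent with the finite limit in (9.16).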

9.4.7 Removing shrinkage from F5 algorithm (nsF5)


The F5 algorithm is a steganographic method for JPEG images. Its bit-
assignment function is π(x) = LSB(x) for x ≥ 0 and π(x) = 1 − LSB(x) for x < 0
while the embedding operation always decreases the absolute value of DCT co-
efficients.6 The F5 algorithm also incorporates matrix embedding using binary
Hamming codes.
When a DCT coefficient is changed to 0, which can happen only when the
original coefficient value was 1 or −1, so-called shrinkage occurs. Because the
recipient extracts the message only from non-zero coefficients, when shrinkage
occurs, the sender keeps the embedding change and embeds the same payload
again to prevent the recipient from losing a portion of the payload. This, however,
decreases the embedding efficiency and the decrease is far from negligible because
1 and −1 are the most frequently occurring non-zero DCT coefficients.

6 As explained in Chapter 7, this operation minimizes the total embedding impact in some
well-defined sense.

The shrinkage presents a problem to the recipient only because he is unable to distinguish whether a DCT coefficient was changed to zero during embedding
or was already equal to zero in the cover image. This is yet another example of a
non-shared selection channel and, as such, it can be solved using wet paper codes.
Because, coincidentally, the embedding efficiency of wet paper codes implemented
using random linear codes with codimension p = 18 (Figure 9.2) is close to the
embedding efficiency of binary Hamming codes (Figure 8.3), wet paper codes
allow us to increase embedding efficiency by eliminating the adverse effect of
shrinkage. This version of the F5 algorithm is called nsF5 (no-shrinkage F5).
We now provide a brief sketch of the implementation of both the embedding
and the extraction algorithm. Let us assume that we want to embed a payload of
m bits in a JPEG image consisting of n DCT coefficients out of which n01 ≥ m
are non-zero and thus changeable. First, all coefficients (including zeros) are
divided using the stego key into m/p randomly generated blocks. Each block will
have exactly n/(m/p) = np/m DCT coefficients out of which on average n01 p/m
will be non-zero and thus changeable. The relative payload is α = m/n01 . The
sender forms a random binary matrix D of dimensions p × np/m as described
in Section 9.3. The sender now applies the method of Section 9.3 with random
codes of codimension p = 18.
The recipient uses the stego key and divides all DCT coefficients in the stego
image into the same blocks as the sender. He also generates the same random
binary matrix D as the sender. A segment of p message bits is extracted as the
syndrome Dy from the LSBs, y, of all DCT coefficients in the block.
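The sender-side step, finding LSB changes confined to the non-zero (changeable) coefficients so that the syndrome of the block equals the message segment, is ordinary linear algebra over GF(2). The toy sketch below is illustrative only: the block size, the 60% non-zero rate, and the plain Gaussian-elimination solver are assumptions for the example (the method of Section 9.3 uses structured random codes of codimension p = 18 with a faster solver):

```python
import numpy as np

def solve_gf2(A, b):
    """Solve A x = b over GF(2) by Gaussian elimination; return one
    solution (free variables set to 0) or None if inconsistent."""
    A, b = A.copy() % 2, b.copy() % 2
    rows, cols = A.shape
    pivots, r = [], 0
    for c in range(cols):
        piv = next((i for i in range(r, rows) if A[i, c]), None)
        if piv is None:
            continue
        A[[r, piv]], b[[r, piv]] = A[[piv, r]], b[[piv, r]]
        for i in range(rows):
            if i != r and A[i, c]:
                A[i] ^= A[r]
                b[i] ^= b[r]
        pivots.append(c)
        r += 1
    if b[r:].any():
        return None                      # message does not fit this block
    x = np.zeros(cols, dtype=np.int64)
    x[pivots] = b[:len(pivots)]
    return x

rng = np.random.default_rng(7)
n, p = 80, 18                  # coefficients per block, bits per block (toy sizes)
y = rng.integers(0, 2, n)      # LSBs of all DCT coefficients in a block
dry = rng.random(n) < 0.6      # True where the coefficient is non-zero
D = rng.integers(0, 2, (p, n)) # shared pseudo-random binary matrix
msg = rng.integers(0, 2, p)    # message segment for this block

s = (msg - D.dot(y)) % 2                # required syndrome correction
e_dry = solve_gf2(D[:, dry], s)         # changes on changeable positions only
assert e_dry is not None                # full rank w.h.p. for random D
e = np.zeros(n, dtype=np.int64)
e[np.flatnonzero(dry)] = e_dry
y_stego = (y + e) % 2                   # wet (zero) coefficients untouched

# Recipient: the message is simply the syndrome of the stego LSBs.
assert np.array_equal(D.dot(y_stego) % 2, msg)
```

Note that the recipient needs neither the set of non-zero cover coefficients nor the placement of changes, only the stego key that generates D and the blocks.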
The improvement in embedding efficiency has a dramatic impact on the secu-
rity of the F5 algorithm (see Table 12.8).

Summary
• When the placement of embedding changes is not shared with the recipient, we speak of a non-shared selection channel. Other synonyms are writing on wet paper or writing to memory with defective cells.
• There exist numerous situations in steganography when non-shared channels arise, such as in minimum-impact steganography, adaptive steganography, and public-key steganography.
• Communication using non-shared channels can be realized using syndrome codes, also called wet paper codes.
• Syndrome codes with random matrices are capable of asymptotically communicating the maximum possible payload but have complexity cubic in the message length m.
• Sparse codes with robust soliton distribution of ones in their columns are also asymptotically optimal and can be implemented with complexity O(n log m), where n is the number of cover elements.
• Wet paper codes with improved embedding efficiency can be implemented using random linear codes with small codimension.

Exercises

9.1 [Writing in memory with one stuck cell] When the number of stuck
cells is 1 or n − 1, it is easy to develop algorithms for writing in memory with
defective cells. For n − 1 stuck cells, or k = 1, we can write one bit into the
memory simply by adjusting it so that the message bit m[1] is equal to XOR of
all bits in the memory.
The complementary case, when there is one stuck cell, is a little more compli-
cated. Let m[i], i = 1, . . . , n − 1, be the message to be written in the memory.
If the stuck cell is the jth cell, j > 1, we write m into the memory in the following manner. If m[j − 1] = x[j], we can write the message because the defect is compatible with the message. We write (m[1], m[2], . . . , m[n − 1]) into cells (2, 3, . . . , n) and we write 1 into x[1]. If m[j − 1] ≠ x[j], we write the negation of m, 1 − m, into cells (2, 3, . . . , n) and we write 0 into x[1].
If the stuck cell is the first cell, j = 1, we write the message into cells (2, 3, . . . , n)
as is (if the stuck bit x[1] = 1) and we write its negation into the same cells if
the stuck bit x[1] = 0. Convince yourself that the reading device can always read
the message correctly using the following rule:
if x[1] = 1, read the message as (x[2], x[3], . . . , x[n]), (9.17)
if x[1] = 0, read the message as (1 − x[2], 1 − x[3], . . . , 1 − x[n]). (9.18)
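The scheme above can be implemented and checked exhaustively. An illustrative sketch (0-based indexing is an implementation choice here, so the stuck cell is j ∈ {0, . . . , n−1} and cell 0 plays the role of x[1]):

```python
def write_memory(n, j, stuck, msg):
    """Write the (n-1)-bit message msg into n cells, where cell j
    (0-based; cell 0 plays the role of x[1]) is stuck at value stuck."""
    assert len(msg) == n - 1
    if j == 0:
        flip = (stuck == 0)
    else:
        # cell j stores msg[j-1]; negate everything if they disagree
        flip = (msg[j - 1] != stuck)
    cells = [0 if flip else 1] + [(1 - b) if flip else b for b in msg]
    cells[j] = stuck   # consistent by construction: this is a no-op
    return cells

def read_memory(cells):
    if cells[0] == 1:
        return cells[1:]
    return [1 - b for b in cells[1:]]

# Exhaustive check: every stuck position, stuck value, and message.
for n in (4, 6):
    for j in range(n):
        for stuck in (0, 1):
            for k in range(2 ** (n - 1)):
                msg = [(k >> i) & 1 for i in range(n - 1)]
                cells = write_memory(n, j, stuck, msg)
                assert cells[j] == stuck          # defect respected
                assert read_memory(cells) == msg  # message recovered
```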

9.2 [Rank of a random matrix] Show that the probability, R[k], that a randomly generated k × k binary matrix is of full rank is

\[
R[k] = \prod_{i=1}^{k}\left(1 - \frac{1}{2^i}\right) \to 0.2888\ldots, \quad \text{as } k \to \infty. \tag{9.19}
\]

Note that here the rank should be computed in binary arithmetic. The rank of a binary matrix computed in binary arithmetic and the rank computed in real arithmetic may be different.
Hint: Use induction with respect to i for an i × k matrix.
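Both the product formula and its limit can be checked numerically. An illustrative sketch (not part of the exercise statement) that evaluates the partial products and estimates the full-rank probability by Monte Carlo for k = 8:

```python
import random

def rank_gf2(M):
    """Rank of a binary matrix over GF(2) (Gaussian elimination)."""
    M = [row[:] for row in M]
    rank, rows = 0, len(M)
    for c in range(len(M[0])):
        piv = next((i for i in range(rank, rows) if M[i][c]), None)
        if piv is None:
            continue
        M[rank], M[piv] = M[piv], M[rank]
        for i in range(rows):
            if i != rank and M[i][c]:
                M[i] = [a ^ b for a, b in zip(M[i], M[rank])]
        rank += 1
    return rank

def R(k):
    """Right-hand side of (9.19): prod_{i=1}^{k} (1 - 2^{-i})."""
    p = 1.0
    for i in range(1, k + 1):
        p *= 1 - 2.0 ** (-i)
    return p

assert abs(R(50) - 0.288788) < 1e-5   # the product converges to 0.2888...

# Monte Carlo estimate of the full-rank probability for k = 8
random.seed(1)
k, trials = 8, 20000
hits = sum(rank_gf2([[random.getrandbits(1) for _ in range(k)]
                     for _ in range(k)]) == k
           for _ in range(trials))
assert abs(hits / trials - R(8)) < 0.02
```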

9.3 [Minimum embedding impact I] Let the cover image have n pixels with embedding impact ρ[i] and assume that the elements are already sorted so that ρ[i] is non-decreasing, ρ[i] ≤ ρ[i + 1]. In order to embed m bits, the sender marks pixels 1, . . . , k, m ≤ k, as changeable and embeds the message into changeable pixels using binary wet paper codes with the best theoretically possible embedding efficiency e = α/H^{-1}(α), where α = m/k. Show that the value of k that minimizes the total embedding impact is

\[
k_{\mathrm{opt}} = \arg\min_{m \le k \le n} H^{-1}\!\left(\frac{m}{k}\right)\sum_{i=1}^{k} \rho[i]. \tag{9.20}
\]
Hint: According to Table 8.4, the embedding will introduce kH^{-1}(m/k) changes and the expected impact per changed pixel is (1/k)∑_{i=1}^{k} ρ[i].

9.4 [Minimum embedding impact II] Fix the ratio β = m/n and consider (9.20) in the limit for n → ∞. Assume that ρ[i] = ρ(i/n) for some non-decreasing function (profile) ρ. Define x = k/n and show that the optimization problem (9.20) becomes

\[
x_{\mathrm{opt}} = \frac{k_{\mathrm{opt}}}{n} = \arg\min_{\beta \le x \le 1} H^{-1}\!\left(\frac{\beta}{x}\right) R(x), \tag{9.21}
\]

where

\[
R(x) = \int_0^x \rho(t)\,dt. \tag{9.22}
\]

Moreover, by differentiating show that x_opt is a solution of the following algebraic equation:

\[
\frac{\rho(x)}{R(x)} = \frac{\beta}{\beta x + x^2 \log_2\left(1 - H^{-1}(\beta/x)\right)}. \tag{9.23}
\]

Finally, show that x = β is never a solution for any profile ρ. Put another way, it is always better to reserve more pixels as changeable than the m pixels with the smallest impact.
Hint:

\[
\frac{\mathrm{d}H^{-1}(x)}{\mathrm{d}x} = \left(\log_2 \frac{1 - H^{-1}(x)}{H^{-1}(x)}\right)^{-1}. \tag{9.24}
\]
9.5 [Minimal embedding impact III] Show that x_opt = 1 when β ≥ β_0, where β_0 is the unique solution to

\[
\frac{1}{R(1)} = \frac{\beta}{\beta + \log_2\left(1 - H^{-1}(\beta)\right)}. \tag{9.25}
\]

In other words, for any profile, for a sufficiently large message it is always better to use all pixels (k = n).

9.6 [Minimal embedding impact IV] Investigate the profile ρ(x) = x^γ, γ > 0. Show that

\[
x_{\mathrm{opt}} = \begin{cases} c(\gamma)\beta & \text{for } \beta \le \beta_0 \\ 1 & \text{for } \beta > \beta_0, \end{cases} \tag{9.26}
\]

where c(γ) is the solution to

\[
c \log_2\left(1 - H^{-1}\!\left(\frac{1}{c}\right)\right) = -\frac{\gamma}{\gamma + 1}. \tag{9.27}
\]
In particular, the profile ρ(x) for perturbed quantization is ρ(x) = x, because the quantization error is approximately uniformly distributed. Show that for this profile c = 1.254, thus obtaining the following rule of thumb: In perturbed-quantization steganography implemented using wet paper codes with optimal embedding efficiency, it is better to use k ≈ m + m/4 rather than k = m.
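The constant c = 1.254 for γ = 1 can be reproduced by solving (9.27) numerically. An illustrative sketch (the bracketing interval and tolerances are arbitrary choices):

```python
from math import log2

def H(x):
    # binary entropy function
    return -x * log2(x) - (1 - x) * log2(1 - x)

def H_inv(a):
    # inverse binary entropy on (0, 1/2] by bisection
    lo, hi = 1e-12, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        if H(mid) < a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def lhs(c):
    # left-hand side of (9.27)
    return c * log2(1 - H_inv(1 / c))

# Solve lhs(c) = -gamma/(gamma+1) = -1/2 for gamma = 1 by bisection;
# lhs(1.01) < -1/2 < lhs(3), so the root is bracketed.
lo, hi = 1.01, 3.0
for _ in range(100):
    c = (lo + hi) / 2
    if lhs(c) < -0.5:
        lo = c
    else:
        hi = c

assert abs(lhs(c) + 0.5) < 1e-9
assert abs(c - 1.254) < 0.01   # the rule of thumb k ~ 1.254 m ~ m + m/4
```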
Steganography in Digital Media: Principles, Algorithms, and Applications. Jessica Fridrich. Cambridge University Press. Online ISBN: 9781139192903; Hardback ISBN: 9780521190190. Chapter 10, Steganalysis, pp. 193–220.

10 Steganalysis

In the prisoners’ problem, Alice and Bob are allowed to communicate but all
messages they exchange are closely monitored by warden Eve looking for traces of
secret data that may be hidden in the objects that Alice and Bob exchange. Eve’s
activity is called steganalysis and it is a complementary task to steganography.
In theory, the steganalyst is successful in attacking the steganographic channel
(i.e., the steganography has been broken) if she can distinguish between cover
and stego objects with probability better than random guessing. Note that, in
contrast to cryptanalysis, it is not necessary to be able to read the secret message
to break a steganographic system. The important task of extracting the secret
message from an image once it is known to contain secretly embedded data
belongs to forensic steganalysis.
In Section 10.1, we take a look at various aspects of Eve’s job depending on
her knowledge about the steganographic channel. Then, in Section 10.2 we for-
mulate steganalysis as a problem in statistical signal detection. If Eve knows the
steganographic algorithm, she can accordingly target her activity to the specific
stegosystem, in which case we speak of targeted steganalysis (Section 10.3). On
the other hand, if Eve has no knowledge about the stegosystem the prisoners
may be using, she is facing the significantly more difficult problem of blind ste-
ganalysis detailed in Section 10.4 and Chapter 12. She now has to be ready to
discover traces of an arbitrary stegosystem. Both targeted and blind steganalysis
work with one or more numerical features extracted from images and then clas-
sify them into two categories – cover and stego. For targeted steganalysis, these
features are usually designed by analyzing specific traces of embedding, while
in blind steganalysis the features’ role is much more ambitious as their goal is
to completely characterize cover images in some low-dimensional feature space.
Blind steganalysis has numerous alternative applications, which are discussed in
Section 10.5.
The reliability of steganalysis is strongly influenced by the source of covers.
The prisoners may choose a source that will better mask their embedding and, on
the other hand, they may also make fatal errors and choose covers so improperly
that Eve will be able to detect even single-bit messages. The influence of covers
on steganalysis is discussed in Section 10.6.
Although this chapter and this book focus on steganalysis based on analyzing
statistical anomalies of pixel values that most steganographic algorithms leave

behind, Eve may utilize other auxiliary information available to her. For example,
she can inspect file headers or run brute-force attacks on the stego/encryption
keys, hoping to reveal a meaningful message when she runs across a correct key.
Such attacks are called system attacks (Section 10.7) and belong to forensic
steganalysis (Section 10.8).
The main purpose of this chapter is to prepare the reader for Chapters 11
and 12, where specific steganalysis methods are described. Readers not familiar
with the subject of statistical hypothesis testing would benefit from reading
Appendix D before continuing.

10.1 Typical scenarios

Any information about the steganographic channel that is a priori available to


Eve can aid her in mounting an attack. Recall from Chapter 4 that the stegano-
graphic channel consists of five basic elements:
• channel used to exchange data between Alice and Bob,
• cover source,
• message source,
• data-embedding and -extraction algorithms,
• source of stego keys driving the embedding/extraction algorithms.
In this book, with a few exceptions, we make an assumption that the warden is
passive, which means that she passively observes the communication and does
not interfere with it in any way. Thus, the physical channel used to exchange
information is lossless and has no impact on steganalysis or steganography. The
other four elements, the embedding algorithm and the source of covers, messages,
and stego keys, however, are crucial for Eve. In general, the more information
she has, the more successful she will be in detecting the presence of secretly
embedded messages. Although there certainly exist many different situations
with various levels of detail available to Eve about the individual elements of the
steganographic channel, we highlight two typical and very different scenarios –
the case of traffic monitoring and analysis of a seized computer.
An automatic traffic-monitoring device is an algorithm that analyzes every im-
age passing through a certain network node. An example would be a program that
inspects every image posted on a binary discussion group or a server monitoring
all traffic through a specific Internet node. In this case, the warden has lim-
ited information about the cover source, the message source, or the stegosystem.
This is the most difficult situation for Eve and one where Kerckhoffs’ principle
does not fully apply. Consequently, Eve needs steganalysis algorithms capable of
detecting as wide a spectrum of steganographic schemes as possible.
If the set of parties communicating using steganography is narrowed down us-
ing some side-information, the diversity of the cover source may become lower,
giving Eve a better chance to detect any steganographic activities. For example,

when Eve already suspects that somebody may be using steganography, addi-
tional intelligence may be gathered that may provide some information about the
cover source or the steganographic algorithm. Say, if Alice downloads a stegano-
graphic tool while being eavesdropped, Eve suddenly obtains prior information
about the embedding algorithm, as well as the stego key space. Or, if Eve knows
that Alice sends to Bob images obtained using her camera, Eve could purchase
a camera of the same model and tailor her attack to this cover source.
Steganalysis may become significantly easier when a suspect’s computer is
seized and Eve’s task is to steganalyze images on the hard disk. In this case,
the stego tool may still reside on the computer or its traces may be recover-
able even after it has been uninstalled (e.g., using Wetstone’s Gargoyle, http://www.wetstonetech.com). This gives her valuable prior information about the
potential stego channel. In some situations, Eve may be able to find multiple
versions of one image that are nearly identical. She can first investigate whether
these slightly different versions are the result of steganography or some other
natural process, such as compressing one image using two different JPEG com-
pressors [95]. By comparing the images, she can learn about the nature of em-
bedding changes and their placement (selection channel) and possibly conclude
that the changes are due to embedding a secret message. Once Eve determines
that steganography is taking place, the steganalysis is complete and she may
continue with forensic steganalysis aimed at extracting the secret message itself
or she may decide to interrupt the communication channel if it is within her
competence.

10.2 Statistical steganalysis

In this section, you will learn that statistical steganalysis is a detection prob-
lem. In practice, it is typically achieved through some simplified model of the
cover source obtained by representing images using a set of numerical features.
Depending on the scope of the features, we recognize two major types of statis-
tical steganalysis – targeted and blind. The following sections describe several
basic strategies for selecting appropriate features for both targeted and blind
steganalysis. Examples of specific targeted and blind methods are postponed to
Chapters 11 and 12.
Before proceeding with the formulation of statistical steganalysis, we briefly
review some basic facts from Chapter 6. The information-theoretic definition of
steganographic security starts with the basic assumption that the cover source
can be described by a probability distribution, Pc, on the space of all possible cover images, C. The value Pc(B) = ∫_B Pc(x)dx is the probability of selecting cover x ∈ B ⊂ C for hiding a message. Assuming that a given stegosystem accepts on its input covers x ∈ C, x ∼ Pc, stego keys, and messages (both attaining values on their sets according to some distributions), the distribution of stego images is Ps.

10.2.1 Steganalysis as detection problem


Let us assume that Eve can collect sufficiently many cover images and estimate
the pdf Pc . If her knowledge of the steganographic channel allows her to estimate
the pdf Ps , or learn to distinguish between Pc and Ps using machine-learning
tools, we speak of steganalysis of a known stegosystem, which is a detection
problem that leads to simple hypothesis testing

H0 : x ∼ Pc , (10.1)
H1 : x ∼ Ps . (10.2)

Optimal Neyman–Pearson and Bayesian detectors can be derived for this prob-
lem using the likelihood-ratio test.
If Eve has no information about the stego system, the steganalysis problem becomes

H0 : x ∼ Pc, (10.3)
H1 : x ≁ Pc, (10.4)

which is a composite hypothesis-testing problem and is in general much more


complex.
Obviously, there are many other possibilities that fall in between these two
formulations depending on the information about the steganographic channel
available to Eve. For example, Eve may know the embedding algorithm but not
the message source. Then, Ps will depend on an unknown parameter, the change
rate β, and Eve faces the composite one-sided hypothesis-testing problem

H0 : β = 0, (10.5)
H1 : β > 0. (10.6)

Eve may employ appropriate tools, such as the generalized likelihood-ratio test,
or convert this problem to simple hypothesis testing by considering β as a ran-
dom variable and making an assumption about its distribution (the Bayesian
approach).

10.2.2 Modeling images using features


Practical steganalysis of multimedia objects, such as digital images, cannot work
with the full representation of images due to their large complexity and dimen-
sionality. Instead, a simplified model whereby the detection problem becomes
more tractable must be accepted. The models used in steganalysis are usually
obtained by representing images using a set of numerical features. Each image,
x ∈ C, is mapped to a d-dimensional feature vector f = (f1 (x), . . . , fd (x)) ∈ Rd ,
where each fi : C → R. The random variables representing the cover source, x ∼ Pc, and the stego images, y ∼ Ps, are thus transformed into the corresponding random variables f(x) ∼ pc and f(y) ∼ ps on R^d. For accurate detection, the

features need to be chosen so that the clusters of f (x) and f (y) have as little
overlap as possible.

10.2.3 Optimal detectors


Any steganalysis algorithm is a detector, which can be described by a map F :
Rd → {0, 1}, where F (x) = 0 means that x is detected as cover, while F (x) = 1
means that x is detected as stego. The set R1 = {x ∈ Rd |F (x) = 1} is called the
critical region because the detector decides “stego” if and only if x ∈ R1 . The
critical region fully describes the detector.
The detector may make two types of error – false alarms and missed detections.
The probability of a false alarm, PFA , is the probability that a random variable
distributed according to pc is detected as stego, while the probability of missed
detection, PMD , is the probability that a random variable distributed according
to ps is incorrectly detected as cover:
\[
P_{\mathrm{FA}} = \Pr\{F(x) = 1 \mid x \sim p_c\} = \int_{R_1} p_c(x)\,dx, \tag{10.7}
\]
\[
P_{\mathrm{MD}} = \Pr\{F(x) = 0 \mid x \sim p_s\} = 1 - \int_{R_1} p_s(x)\,dx. \tag{10.8}
\]

As derived in Section 6.1.1, the error probabilities of any detector must satisfy the following inequality:

\[
(1 - P_{\mathrm{FA}}) \log \frac{1 - P_{\mathrm{FA}}}{P_{\mathrm{MD}}} + P_{\mathrm{FA}} \log \frac{P_{\mathrm{FA}}}{1 - P_{\mathrm{MD}}} \le D_{\mathrm{KL}}(p_c \| p_s) \le D_{\mathrm{KL}}(P_c \| P_s), \tag{10.9}
\]
where D_KL is the Kullback–Leibler divergence between the two probability distributions. For secure steganography, D_KL(Pc||Ps) = 0, and the only detector that Eve can build is one that randomly guesses between cover and stego, which means that Eve cannot construct an attack. For ε-secure stegosystems, D_KL(Pc||Ps) ≤ ε, and the reliability of detection decreases with decreasing ε (see Figure 6.1).
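Inequality (10.9) can be probed numerically for a simple threshold detector on Gaussian features (an illustration, not from the book; both sides are evaluated in nats so the logarithm bases match):

```python
from math import erf, log, sqrt

def Phi(x):
    # standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

# Threshold detector on a scalar feature: cover ~ N(0,1), stego ~ N(mu,1);
# decide "stego" when the feature exceeds t.
mu, t = 1.0, 0.5
P_FA = 1 - Phi(t)       # cover feature above the threshold
P_MD = Phi(t - mu)      # stego feature below the threshold

# left-hand side of (10.9), evaluated with natural logarithms
lhs = (1 - P_FA) * log((1 - P_FA) / P_MD) + P_FA * log(P_FA / (1 - P_MD))
# D_KL(N(0,1) || N(mu,1)) = mu^2 / 2 nats
d_kl = mu**2 / 2
assert lhs <= d_kl
```

Here the left-hand side evaluates to about 0.31 nats against the divergence of 0.5 nats, so no detector on this feature can do better than the inequality allows.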
In steganography, useful detectors must have a low probability of a false alarm.
This is because images detected as potentially containing secret messages are
likely to be subjected to further forensic analysis (see Section 10.8) to determine
the steganographic program, the stego key, and eventually extract the secret
message. This may require brute-force dictionary attacks that can be quite ex-
pensive and time-consuming. Thus, it is more valuable to have a detector with
very low PFA even though its probability of missed detection may be quite high
(e.g., PMD > 0.5 or higher). Because steganographic communication is typically
repetitive, even a detector with PMD = 0.5 is still quite useful as long as its
false-alarm probability is small.
The hypothesis-testing problem in steganalysis is thus almost exclusively for-
mulated using the Neyman–Pearson setting, where the goal is to construct a
detector with the highest detection probability PD = 1 − PMD while imposing a

bound on the probability of false alarms, PFA ≤ ε_FA. Even though it is possible
to associate cost with both types of error, Bayesian detectors are typically not
used in steganalysis, because the prior probabilities of encountering a cover or
stego image can rarely be accurately estimated.
Given the bound on false alarms, ε_FA, the optimal Neyman–Pearson detector is the likelihood-ratio test

\[
\text{Decide } H_1 \text{ when } L(x) = \frac{p_s(x)}{p_c(x)} > \gamma, \tag{10.10}
\]

where γ > 0 is a threshold determined from the condition

\[
P_{\mathrm{FA}} = \int_{R_1} p_c(x)\,dx = \varepsilon_{\mathrm{FA}}, \tag{10.11}
\]

where

\[
R_1 = \{x \in \mathbb{R}^d \mid L(x) > \gamma\} \tag{10.12}
\]


is the critical region of the detector. The ratio L(x) is called the likelihood ratio.
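As a toy illustration (not from the book): for scalar Gaussian features with cover ∼ N(0,1) and stego ∼ N(μ,1), the likelihood ratio L(x) = exp(μx − μ²/2) is monotone in x, so the Neyman–Pearson test reduces to comparing the feature itself with a threshold chosen from the false-alarm bound:

```python
from math import erf, sqrt

def Phi(x):
    # standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

def Q(x):
    return 1 - Phi(x)

mu = 1.0          # shift of the stego feature distribution
eps_fa = 0.05     # bound on the false-alarm probability
# "L(x) > gamma" is equivalent to "x > t"; the threshold comes from
# the condition P_FA = Q(t) = eps_fa.
t = 1.6449        # Q^{-1}(0.05)
assert abs(Q(t) - eps_fa) < 1e-3
P_D = Q(t - mu)   # power of the NP test at this false-alarm bound
assert 0.25 < P_D < 0.27
```

Even the optimal detector here misses most stego images (P_D ≈ 0.26), illustrating why low-P_FA operation with modest detection power is the typical regime in steganalysis.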
Even though in theory Eve could construct optimal steganalysis algorithms
by estimating the pdfs of cover/stego images and use the likelihood-ratio test,
this can be done only for low-dimensional models of covers that may not describe
covers well. Because the dimensionality of the feature spaces is often large, which
is especially true for blind steganalysis, one can rarely estimate the underlying
probability distributions accurately. Instead, the detection problem is viewed as
a classification (pattern-recognition) problem and solved by training a classifier
on a large database of cover and stego images. The classifier parameters are
typically adjusted to obtain a low probability of false alarms.

10.2.4 Receiver operating characteristic (ROC)


For a given detector, the function PD (PFA ) is called the Receiver-Operating-
Characteristic (ROC) curve and it describes the performance of the detector. A
few examples of ROC curves are given in Figure 10.1. The ROC curve of a poor
steganalysis method is close to the diagonal line (Figure 10.1(b)), while good
steganalysis methods have ROC curves that are close to the curve depicted in
Figure 10.1(c).
Comparing detectors really means comparing their ROC curves. While
some detectors may be unambiguously compared because their ROCs satisfy
(1) (2)
PD (PFA ) > PD (PFA ) for all PFA ∈ [0, 1] (e.g., Detector 1 is more reliable than
Detector 2 for all PFA ), in general ROCs can intersect and then it is not clear
which detector is better. An example of two curves that are hard to compare
is in Figure 10.1(d). This problem can be avoided if we could extract a scalar
measure of performance from each ROC because scalars can be unambiguously
ordered. Among the different measures of performance that were proposed by
various researchers, we name the following three:

• The area, ρ, between the ROC curve and the diagonal line, normalized so that ρ = 0 when the ROC coincides with the diagonal line and ρ ≈ 1 for ROC curves corresponding to nearly perfect detectors (Figure 10.1(c)). Mathematically,

\[
\rho = 2\int_0^1 P_D(x)\,dx - 1. \tag{10.13}
\]

This quantity is sometimes called accuracy. A different but closely related measure is Area Under Curve (AUC) or Area Under ROC (AUR), which is simply the area under the ROC curve

\[
\mathrm{AUC} = \int_0^1 P_D(x)\,dx = \frac{1+\rho}{2}. \tag{10.14}
\]
• The minimal total average decision error under equal prior probabilities (Pr{x is stego} = Pr{x is cover})

\[
P_E = \min_{P_{\mathrm{FA}} \in [0,1]} \tfrac{1}{2}\left(P_{\mathrm{FA}} + P_{\mathrm{MD}}(P_{\mathrm{FA}})\right). \tag{10.15}
\]

The minimum is reached at a point where the tangent to the ROC curve has slope 1. In Figure 10.2, this point is marked with a circle.
• The false-alarm rate at probability of detection equal to P_D = 1/2,

\[
P_D^{-1}\left(\tfrac{1}{2}\right). \tag{10.16}
\]

This point is marked with a square in Figure 10.2.

None of these quantities is completely satisfactory because of the lack of fundamental reasoning behind them. The value P_D^{-1}(1/2) is probably the most useful for steganalysis because, as explained above, what matters the most is the detector’s false-alarm probability. The work of Ker [136] is an attempt to compare steganalysis detectors in a more fundamental manner in the limit as the change rate goes to zero.
In practice, the underlying distributions pc and ps are usually obtained ex-
perimentally in a sampled form. For one-dimensional distributions over the set
of real numbers, ps is usually a shifted version of pc and both distributions are
unimodal. In this case, the critical region is determined by a scalar threshold
γ (see (10.10) and (10.12)). Thus, the ROC can be drawn either by fitting a
parametric model through the sample distribution or directly by moving the
threshold γ from −∞ to +∞ and computing the relative fraction of cover and
stego features above γ (see Algorithm 10.1).
In the next section, we discuss methods for constructing the feature spaces.
The specific form of the feature map f and the type of the hypothesis-testing
problem depend on the steganalysis scenario and the steganographic channel.

[Figure 10.1: Examples of ROC curves. The x and y axes in all four graphs are the probability of false alarms, P_FA, and the probability of detection, P_D, respectively. (a) Example of an ROC curve; (b) ROC of a poor detector; (c) ROC of a very good detector; (d) two hard-to-compare ROCs.]

Algorithm 10.1 Drawing an ROC curve for one-dimensional features f_s[i] (stego) and f_c[i] (cover), i = 1, . . . , k, computed from k cover and k stego images.

f = sort(f_s ∪ f_c);
// f is the set of all features sorted to form a non-decreasing sequence
P_FA[0] = 1; P_D[0] = 1;
for i = 1 to 2k {
  if f[i] ∈ f_c {P_FA[i] = P_FA[i − 1] − 1/k; P_D[i] = P_D[i − 1];}
  else {P_D[i] = P_D[i − 1] − 1/k; P_FA[i] = P_FA[i − 1];}
  DrawLine((P_FA[i − 1], P_D[i − 1]), (P_FA[i], P_D[i]));
}
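Algorithm 10.1 translates directly into a short program. The sketch below (a hypothetical `roc_points` helper, not code from this book) follows the same threshold sweep but returns the ROC vertices instead of drawing line segments.

```python
def roc_points(f_cover, f_stego):
    """Empirical ROC for scalar features, following Algorithm 10.1.

    Sweeps a threshold through the pooled, sorted features.  Each cover
    feature passed by the threshold lowers P_FA by 1/k; each stego feature
    passed lowers P_D by 1/k.  Returns the ROC as (P_FA, P_D) vertices.
    """
    k = len(f_cover)
    assert len(f_stego) == k, "equal numbers of cover and stego images assumed"
    # Pool the features, remembering the class of each one.
    pooled = sorted([(v, "c") for v in f_cover] + [(v, "s") for v in f_stego])
    p_fa, p_d = 1.0, 1.0  # threshold at -infinity: everything is flagged stego
    points = [(p_fa, p_d)]
    for _value, label in pooled:
        if label == "c":
            p_fa -= 1.0 / k  # one more cover feature below the threshold
        else:
            p_d -= 1.0 / k   # one more stego image missed
        points.append((p_fa, p_d))
    return points

# Toy example: stego features are shifted upwards relative to cover features.
roc = roc_points([0.1, 0.2, 0.3, 0.4], [0.35, 0.5, 0.6, 0.7])
print(roc[0], roc[-1])  # (1.0, 1.0) (0.0, 0.0)
```

The curve always starts at (1, 1) and ends at (0, 0); plotting the returned vertices reproduces the staircase ROCs of Figure 10.1.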

We distinguish two general cases: attacking a known steganographic method (or embedding operation) and attacking an unknown steganographic method. The corresponding approaches in steganalysis are called targeted and blind steganalysis.

0.8

0.6
PD (PFA )

0.4

0.2

0
0 P −1 (1/2) 0.4 0.6 0.8 1
D
PFA
Figure 10.2 Scalar measures typically used to compare detectors. The square marks the
point on the ROC curve corresponding to the criterion PD−1 (1/2). The point marked with
a circle corresponds to the minimal total decision error PE .

10.3 Targeted steganalysis

The features in targeted steganalysis are constructed from knowledge of the embedding algorithm and are thus targeted to a specific embedding method or embedding operation (e.g., LSB embedding). On the other hand, features in blind steganalysis must be constructed in such a manner as to be able to detect every possible steganographic scheme, including future schemes. While a targeted steganalysis method can work very well with a single scalar feature, blind steganalysis methods often require larger sets of features and are usually implemented using machine learning. In principle, however, both targeted and blind steganalysis can use multiple features and tools from machine learning and pattern recognition. The main difference between them is the scope of their feature sets.

10.3.1 Features
We now describe several strategies for constructing features for targeted steganalysis and illustrate them with specific examples.

Because in targeted steganalysis the embedding mechanism of the stego system is known, it makes sense to choose as features quantities that change predictably with embedding. While it is usually relatively easy to identify many such features, especially useful are features that attain known values on either stego or cover images.

In the expressions below, β is the change rate (the ratio between the number of embedding changes and the number of all elements in the cover image) and f_β is the feature computed from a stego image obtained by changing a fraction β of the corresponding cover elements.

T1. [Testing for stego artifacts] Identify a feature that attains a specific known value, f_β, on stego images and attains other, different values on cover images. Then, formulate a composite hypothesis-testing problem as

  H0 : f = f_β, (10.17)
  H1 : f ≠ f_β. (10.18)

Note that here H0 is the hypothesis that the image under investigation is stego, while H1 stands for the hypothesis that it is cover.
T2. [Known cover property] Identify a feature f that predictably changes with embedding, f_β = Φ(f_0; β), so that Φ can be inverted, f_0 = Φ^{−1}(f_β; β). Assuming F(f_0) = 0 for some known function F : R^d → R^k, estimate β̂ from F(Φ^{−1}(f_β; β̂)) = 0 and test

  H0 : β̂ = 0, (10.19)
  H1 : β̂ > 0. (10.20)

Note that now we have the more typical case when the hypothesis H0 stands for cover and H1 for stego. Also, notice that a by-product of this approach is an estimate of the change rate, which can usually be easily related to the relative payload, α. For example, for methods that embed each message bit at one cover element, β = α/2.
T3. [Calibration] In some cases, it is possible to estimate from the stego image what the value of a feature would be if it were computed from the cover image. This process is called calibration. Let f_β be the feature computed from the stego image and f̂_0 be the estimate of the cover feature. If the embedding allows expressing f_β as a function of f_0 and β, f_β = Φ(f_0; β), we can again estimate β from f_β = Φ(f̂_0; β̂) and again test

  H0 : β̂ = 0, (10.21)
  H1 : β̂ > 0. (10.22)

This method also provides an estimate of the change rate β (or the message payload α).

We now give examples of these strategies. Strategy T1 was pursued when deriving the histogram attack in Section 5.1.1. The attack was based on the observation that the histogram h of an 8-bit grayscale image fully embedded with LSB embedding must satisfy

  h[2k] ≈ h[2k + 1], k = 0, . . . , 127. (10.23)
Because the sum h[2k] + h[2k + 1] is invariant with respect to LSB embedding, we can take the histogram h as a feature vector and formulate the following composite hypothesis-testing problem:

  H0 : h[2k] = (h[2k] + h[2k + 1])/2, k = 0, . . . , 127, (10.24)
  H1 : h[2k] ≠ (h[2k] + h[2k + 1])/2, k = 0, . . . , 127, (10.25)

which leads to the histogram attack from Section 5.1.1, where this problem was approached using Pearson's chi-square test.
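The pair-equalization hypothesis (10.24) can be tested with the chi-square statistic computed from the histogram alone. The following is a simplified sketch of this idea, not the exact implementation from Section 5.1.1; the function name and the toy histograms are illustrative.

```python
def chi_square_lsb(hist):
    """Chi-square statistic for hypothesis (10.24): after full LSB embedding,
    each even bin should equal the mean of its LSB pair.

    Returns (statistic, degrees_of_freedom).  A small statistic is consistent
    with a fully embedded stego image (H0); a large one points to a cover.
    """
    stat, pairs = 0.0, 0
    for k in range(len(hist) // 2):
        expected = (hist[2 * k] + hist[2 * k + 1]) / 2.0
        if expected > 0:  # skip empty pairs
            stat += (hist[2 * k] - expected) ** 2 / expected
            pairs += 1
    return stat, pairs - 1

# Toy 4-bin histograms: equalized pairs (stego-like) vs. imbalanced (cover-like).
print(chi_square_lsb([500, 500, 300, 300]))  # statistic 0.0
print(chi_square_lsb([800, 200, 550, 50]))   # large statistic
```

In the full attack, the statistic would be converted to a p-value using the chi-square distribution with the returned number of degrees of freedom.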
As an example of Strategy T2, we revisit the attack on Jsteg from Section 5.1.2. There, we used some a priori knowledge about the cover image. Using the observation that histograms of cover JPEG images are approximately symmetrical, the following quantity was studied for steganalysis of Jsteg:

  F(h) = Σ_{k>0} h[2k] + Σ_{k<0} h[2k + 1] − Σ_{k≥0} h[2k + 1] − Σ_{k<0} h[2k], (10.26)

where h is the histogram of quantized DCT coefficients. Because the embedding


mechanism is known, we were able to express the stego image histogram as a
function of the relative message length α = 2β and the cover-image histogram
h0 , hα = Φ(h0 ; α), and invert the relationship h0 = Φ−1 (hα ; α). The symmetry
of the cover-image histogram gave us the equation F (h0 ) = 0, which was solved
for the estimate α̂.
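The symmetry statistic (10.26) is straightforward to evaluate. The sketch below assumes the histogram is given as a dictionary mapping coefficient values to counts (an illustrative convention, not the book's notation) and confirms that F(h) vanishes on a perfectly symmetric histogram.

```python
def jsteg_f(h, kmax=1024):
    """Evaluate F(h) from (10.26); h maps coefficient values to counts."""
    g = lambda v: h.get(v, 0)
    t1 = sum(g(2 * k) for k in range(1, kmax))       # k > 0:  h[2k]
    t2 = sum(g(2 * k + 1) for k in range(-kmax, 0))  # k < 0:  h[2k+1]
    t3 = sum(g(2 * k + 1) for k in range(0, kmax))   # k >= 0: h[2k+1]
    t4 = sum(g(2 * k) for k in range(-kmax, 0))      # k < 0:  h[2k]
    return t1 + t2 - t3 - t4

# A perfectly symmetric toy histogram (h[v] = h[-v]) gives F(h) = 0.
sym = {v: 100 - 10 * abs(v) for v in range(-5, 6)}
print(jsteg_f(sym))  # 0
```

Any asymmetry introduced into the odd bins shifts F(h) away from zero, which is what the quantitative Jsteg estimator exploits.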
A specific example of Strategy T3 that uses calibration will be given in Chap-
ter 11, where we describe a method for attacking the F5 algorithm. In this
section, we explain only the basic principle of calibration.
Calibration attempts to estimate selected macroscopic quantities of the cover
image from the stego image. This is, indeed, possible for JPEG images as the
quantized DCT coefficients are robust with respect to steganographic embedding
because the distortion is usually small. Calibration begins by decompressing
the stego image to the spatial domain, then cropping the image by 4 columns
and 4 rows, and finally recompressing using the same quantization matrix
as that of the stego image. The resulting JPEG image is visually similar to
the cover image and its quantized DCT coefficients are no longer influenced by
steganographic embedding because the JPEG compression was performed on an
8 × 8 grid shifted by 4 pixels with respect to the grid in the cover image. Thus,
the process of recompression on a shifted grid essentially erased the effect of
embedding changes. Note that we cannot claim that the recompressed image
is an approximation to the cover image because it is compressed on a shifted
grid. Nevertheless, it is intuitively clear that macroscopic quantities, such as the
histogram, should be approximately equal to those of the cover image. Note that
geometrical transformations other than cropping can also be used, for example,

Figure 10.3 Calibration is used to estimate some macroscopic quantities of the cover image from the stego image. (Diagram: the stego image J1 is decompressed, cropped, and recompressed to obtain J2; the feature F is computed from both images, yielding the difference F(J1) − F(J2).)

Figure 10.4 Histogram of the DCT coefficient for the spatial frequency (1, 0) for the cover image (+), the F5 fully embedded stego image, and the calibrated image (◦). (Axes: value of the DCT coefficient (1, 0) vs. frequency of occurrence.)

a slight rotation, resizing, or random warping as performed in the attack on watermarking schemes called Stirmark [151].

In Figure 10.4, we illustrate how accurately calibration works. The figure shows the histogram of DCT coefficients for the spatial frequency (1, 0) for the cover image (+), a stego image fully embedded with the F5 algorithm, and the estimate of the cover-image histogram (◦) obtained using calibration. Clearly, calibration has provided a very close estimate of the original histogram.
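The geometric part of calibration (crop by 4 rows and columns, recompute quantized DCT coefficients) can be sketched in pure Python. The 8 × 8 DCT is implemented directly, and a single scalar quantization step stands in for a full quantization matrix; both simplifications are illustrative only, not a real JPEG codec.

```python
import math

def dct2_8x8(block):
    """Orthonormal 2D DCT-II of one 8x8 block, JPEG-style (levels shifted by -128)."""
    c = lambda u: math.sqrt(0.5) if u == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += ((block[x][y] - 128)
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

def quantized_dct(pixels, q):
    """Quantized DCT coefficients of all complete 8x8 blocks of a grayscale
    image (list of pixel rows); q is one quantization step for every mode."""
    coeffs = []
    for by in range(0, len(pixels) - 7, 8):
        for bx in range(0, len(pixels[0]) - 7, 8):
            block = [row[bx:bx + 8] for row in pixels[by:by + 8]]
            d = dct2_8x8(block)
            coeffs.append([[round(d[u][v] / q) for v in range(8)] for u in range(8)])
    return coeffs

def calibrate(pixels, q):
    """The calibration step: crop 4 rows and 4 columns of the decompressed
    stego image, then requantize the DCT coefficients with the same step q."""
    cropped = [row[4:] for row in pixels[4:]]
    return quantized_dct(cropped, q)

flat = [[128] * 16 for _ in range(16)]
print(len(quantized_dct(flat, 8)), len(calibrate(flat, 8)))  # 4 1
```

The cropping shifts the 8 × 8 grid by 4 pixels, so the recomputed coefficients are no longer aligned with the embedding changes; macroscopic statistics (e.g., histograms) of the two coefficient sets are then compared as in Figure 10.3.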
In summary, the defining characteristic of targeted steganalysis is that the features are designed by analyzing the embedding mechanism of a specific stegosystem or a specific embedding operation (e.g., LSB embedding in the two cases above). In particular, in targeted schemes the features are designed without any ambition to obtain an exhaustive representation of cover images in some lower-dimensional space.

10.3.2 Quantitative steganalysis

Many targeted steganalysis methods use as a detection statistic an estimate of the number of embedding changes (or the change rate). Steganalysis designed to estimate the change rate is called quantitative. Quantitative techniques are important in forensic steganalysis (see Section 10.8) because they give Eve additional, quite valuable forensic information. For example, when she detects messages in multiple images and the message-length estimates are clustered around multiples of some typical cipher block lengths, Eve can infer that the message is encrypted and even narrow down the possibilities for the encryption algorithm.

The change-rate estimate provided by quantitative steganalysis is subject to errors because the assumptions under which the estimator was derived are not satisfied exactly. Let us take a closer look at Strategy T2 for targeted steganalysis.
There are two different sources of error. The relationship f_β = Φ(f_0; β) is really an equality between expected values and should have been written more precisely as

  E[f_β] = Φ(f_0; β), (10.27)

where the expected value is taken over embeddings inducing change rate β and over pseudo-random walks. The second source of error is the cover assumption F(f_0) = 0. This is, again, an equality in expected value, this time over the covers,

  E[F(f_0)] = 0. (10.28)

The error of the output of many quantitative steganalyzers can be modeled as a realization of a random variable that is the sum of two components: the within-image error and the between-image error [22, 23],

  β̂ − β = w + b. (10.29)

The within-image error, w, is caused by random correlations between the image and the message and by the fact that the pseudo-random walk visits a different part of the image each time we embed. In Strategy T2, we can say more specifically that this error is due to the fact that the equality f_β = Φ(f_0; β) holds only in expectation. Imagine embedding with the same change rate, β, into the same image but each time along a different pseudo-random walk and with a different message. Repeating this experiment N_w times, the variations in the change-rate estimate are due to the within-image error. The within-image error distribution depends on the image content and on the relative payload itself. In a blue-sky picture, the variance of this error will be smaller than in an image containing a variety of different textures and objects. In general, the larger the payload, the smaller is the contribution of the image content to this type of error. The within-image error is well modeled by a Gaussian distribution.
The between-image error is a random variable whose distribution is tied to properties of natural images. In Strategy T2, this error is due to the failure of the covers to satisfy F(f_0) = 0. Because the within-image error is Gaussian, we can isolate the realization of the between-image error for a given image by simply averaging all N_w estimates. The distribution of the between-image error is then sampled by repeating this experiment over N_b images. Student's t-distribution is often a good fit for the between-image error.
As an example, we now look more closely at the quantitative steganalyzer for Jsteg (Section 5.1.2). There, Figure 5.4 shows the histograms of estimated payload for five payloads across N_b = 954 grayscale images. The histograms show the mixture of both types of error. To separate them, we perform the following experiment with the same database of images. For a fixed change rate β = 0.2, each image was embedded with Jsteg N_w = 200 times with different messages and stego keys (which determine the pseudo-random walk). By running the quantitative steganalyzer (5.25), we obtain a matrix of change-rate estimates β̂[i, j], i = 1, . . . , N_w, j = 1, . . . , N_b. The distribution of the between-image error is obtained by taking the average change-rate estimate over the N_w embeddings for each image,

  b[j] = (1/N_w) Σ_{i=1}^{N_w} β̂[i, j] − β. (10.30)

The sample pdf of the between-image error is shown in Figure 10.5. The figure also includes the log-log empirical cdf plot showing the tail probability Pr{b > x} as a function of x and the corresponding Gaussian fit using a thin line (see Appendix A for more details about the plot). It is apparent that this error exhibits thick tails and the Gaussian model is not a good fit. The plot, however, seems to indicate that Student's t-distribution might be a reasonably good fit, because the tail of the experimental data becomes linear for large x, in agreement with the model (for a random variable following Student's t-distribution with ν degrees of freedom, the tail probability decays as x^{−ν}). The inter-quartile range (IQR) of the average change-rate estimates is [0.198, 0.218], thus suggesting that the right tail is thicker than the left tail.
Figure 10.6 shows the within-image error and the log-log empirical cdf plot for the right tail obtained for one randomly chosen image from the database, based on N_w = 10,000 embeddings. The within-image error appears to be well modeled by a Gaussian distribution. Again, this property seems to be generic among other quantitative steganalyzers [23].
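The decomposition (10.29)-(10.30) is easy to reproduce on a matrix of change-rate estimates. The sketch below uses synthetic numbers in place of the output of a real quantitative steganalyzer; the variable names are made up for illustration.

```python
import random

def decompose_errors(beta_hat, beta):
    """Split change-rate estimates beta_hat[i][j] (i indexes embeddings,
    j indexes images) into between-image errors b[j], per (10.30), and
    within-image errors w[i][j], per (10.29): beta_hat - beta = w + b."""
    n_w, n_b = len(beta_hat), len(beta_hat[0])
    b = [sum(beta_hat[i][j] for i in range(n_w)) / n_w - beta for j in range(n_b)]
    w = [[beta_hat[i][j] - beta - b[j] for j in range(n_b)] for i in range(n_w)]
    return b, w

# Synthetic stand-in for real steganalyzer output: a per-image bias (the
# between-image part) plus Gaussian within-image noise.
random.seed(1)
beta, n_w, n_b = 0.2, 200, 50
bias = [random.gauss(0, 0.01) for _ in range(n_b)]
est = [[beta + bias[j] + random.gauss(0, 0.005) for j in range(n_b)]
       for _ in range(n_w)]
b, w = decompose_errors(est, beta)
print(len(b), len(w))  # 50 200
```

By construction, the within-image errors in each column average to zero, so histograms of b and of any column of w sample the two error distributions separately, as in Figures 10.5 and 10.6.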
We close this section with a few notes about some interesting recent developments in quantitative steganalysis. If the statistical distributions of the random variables involved in Strategy T2 can be derived, or if at least reasonable assumptions about them can be made, it becomes possible to derive more accurate quantitative estimators using the maximum-likelihood principle,

  β̂ = arg max_β Pr{f_β | β}. (10.31)

An example of this approach is [133], which is also briefly mentioned in Section 11.1.3.

Figure 10.5 Distribution of the between-image error and its log-log empirical cdf plot (tail probability Pr{b > x} versus x) for the quantitative steganalyzer for Jsteg (5.25). The thin line is a Gaussian fit.

Figure 10.6 Distribution of the within-image error and its log-log empirical cdf plot (tail probability Pr{w > x} versus x) for one image for the quantitative steganalyzer for Jsteg (5.25). The thin line is a Gaussian fit.

An interesting possibility to construct quantitative steganalyzers from blind steganalyzers was proposed in [195] (also see Section 12.4.1). It turns out that it is possible to learn the relationship between the feature vector used by blind steganalyzers (see the next section) and the change rate using regression methods, and so obtain a very accurate quantitative steganalyzer for essentially every steganographic method that is detectable using that feature vector. An important advantage of this cookie-cutter approach to quantitative steganalysis is that it can be applied even when the embedding mechanism is completely unknown, as long as one can obtain a large set of stego images embedded with messages of known size.
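As a toy illustration of this regression idea, the sketch below fits a change-rate estimator from a single scalar feature by ordinary least squares; real systems of the kind described in [195] regress on high-dimensional feature vectors with more powerful regressors, and all numbers here are made up.

```python
def fit_affine_regressor(features, change_rates):
    """Ordinary least-squares fit of the change rate as an affine function of
    a single scalar feature; returns (slope, intercept)."""
    n = len(features)
    mx = sum(features) / n
    my = sum(change_rates) / n
    sxx = sum((x - mx) ** 2 for x in features)
    sxy = sum((x - mx) * (y - my) for x, y in zip(features, change_rates))
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up training pairs (feature value, known change rate beta).
train_x = [0.02, 0.11, 0.19, 0.32, 0.41]
train_y = [0.00, 0.05, 0.10, 0.15, 0.20]
slope, intercept = fit_affine_regressor(train_x, train_y)
beta_hat = slope * 0.25 + intercept  # predicted change rate for a new image
```

The training pairs are exactly what the text requires: stego images embedded with messages of known size, with no knowledge of the embedding mechanism itself.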

10.4 Blind steganalysis

The goal of blind steganalysis is to detect any steganographic method irrespective of its embedding mechanism. As in targeted steganalysis, we cannot work with the full representation of images and instead transform them to a lower-dimensional feature space, where the distributions of cover and stego images are p_c and p_s. Ideally, we would want the feature space to be complete [144] in the sense that for any steganographic scheme

  D_KL(P_s || P_c) > ε ⇒ D_KL(p_s || p_c) > 0, (10.32)

so that we do not lose our ability to distinguish between cover and stego images by representing the images with their features. In practice, we may be satisfied with a weaker property, namely the requirement that it be practically hard to construct a stego scheme with D_KL(p_s || p_c) = 0.
Next, we outline several general strategies for selecting good features for blind
steganalysis and then review the options available for constructing detectors.

10.4.1 Features
The impact of embedding can be considered as adding noise with certain specific properties. Thus, many features are designed to be sensitive to adding noise while at the same time being insensitive to the image content.

B1. [Noise moments] Transform the image to some domain, such as the Fourier
or wavelet domain, where it is easier to separate image content and noise.
Compute some statistical characteristics of the noise component (such as sta-
tistical moments of the sample distributions of transform coefficients). By
working with the noise residual instead of the image, we essentially improve
the signal-to-noise ratio (here, signal is the stego noise and noise is the cover
image itself) and thus improve the features’ sensitivity to embedding changes,
while decreasing their undesirable dependence on image content.
B2. [Calibrated features] Identify a feature, f, that is likely to predictably change with embedding and calibrate it. In other words, compute the difference

  f_β − f̂_0. (10.33)

This process is graphically shown in Figure 10.3. The purpose of calibration is two-fold. It makes the feature approximately zero-mean on the set of covers,

  E_{p_c}[f_β(x) − f̂_0(x)] ≈ 0, (10.34)

and it decreases its variance,

  Var_{p_c}[f_β(x) − f̂_0(x)]. (10.35)

For these features to work best, it is advisable to construct them in the same domain as where the embedding occurs. For example, when designing features for detection of steganographic schemes that embed data in quantized DCT coefficients of JPEG images, compute the features directly from the quantized DCT coefficients. This makes sense because the embedding changes are "lumped" in that domain, while in a different domain, such as the Fourier domain, the effect of embedding changes is more spread out.
B3. [Targeted features] Many features for blind steganalysis originated in tar-
geted steganalysis. In fact, it is quite reasonable to include in the feature set
the features that can reliably detect specific steganographic schemes because,
this way, the blind steganalyzer will likely detect these steganographic schemes
well.
B4. [Known properties of covers] If the covers are known to satisfy some a priori statistical properties, such as the symmetry of the DCT histogram, they can and should be taken into consideration in the design of features. As an example, consider the histograms h^{(kl)}[i] of DCT coefficients for a specific spatial frequency (k, l), k, l = 0, . . . , 7. Since these histograms are known to follow the generalized Gaussian distribution, a potentially useful feature is the square error between the histogram and its parametric generalized Gaussian fit,

  Σ_i (h^{(kl)}[i] − g(i; μ̂, α̂, β̂))², (10.36)

where g(x; μ, α, β) = [α/(2βΓ(1/α))] e^{−|(x−μ)/β|^α} is the generalized Gaussian pdf and μ̂, α̂, β̂ are the mean, shape, and width parameters estimated from h^{(kl)}[i]. One might as well take the estimated parameters directly as features or use their calibrated versions. Alternatively, non-parametric models may be used as well (e.g., sample distributions of DCT coefficients or their groups).
Specific examples of features that can be used for blind steganalysis are given in
Chapter 12.
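As a minimal illustration of Strategy B1, the sketch below computes sample moments of a crude horizontal-difference residual; actual blind steganalyzers use wavelet or DCT decompositions and far richer statistics, so this is only a stand-in for the idea of working with a noise residual.

```python
def residual_moments(pixels, max_order=4):
    """Central sample moments (orders 2..max_order) of a crude horizontal
    high-pass residual r = x[i][j+1] - x[i][j].  The residual suppresses
    smooth image content, improving sensitivity to added stego noise."""
    res = [row[j + 1] - row[j] for row in pixels for j in range(len(row) - 1)]
    n = len(res)
    mean = sum(res) / n
    return [sum((r - mean) ** k for r in res) / n for k in range(2, max_order + 1)]

flat = [[7] * 8 for _ in range(8)]
noisy = [[0, 5, 0, 5], [5, 0, 5, 0]]
print(residual_moments(flat))   # [0.0, 0.0, 0.0]: content-free image
print(residual_moments(noisy))  # positive even moments
```

Adding stego noise to an image increases the even-order moments of the residual, which is exactly the sensitivity Strategy B1 is after.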

10.4.2 Classification

After selecting the feature set, Eve has at least two options for constructing her detector. One possibility is to mathematically describe the probability distribution of cover-image features, for example, by fitting a parametric model, p̂_c, to the sample distribution of cover features and test

  H0 : x ∼ p̂_c, (10.37)
  H1 : x ≁ p̂_c. (10.38)

The second option is to use a large database of images and embed them using every known steganographic method¹ with uniformly distributed change rates β (or payloads distributed according to some distribution, if the change-rate distribution is available) and then fit another distribution, p̂_s, through the experimentally obtained data. Eve now faces a simple hypothesis-testing problem,

  H0 : x ∼ p̂_c, (10.39)
  H1 : x ∼ p̂_s. (10.40)

¹ To be more precise, the selection of stego images in Eve's training set should reflect the probability with which they occur in the steganographic channel. These probabilities are, however, unknown in general.

Even with a complete feature set, however, approaching the above detection problems using the likelihood-ratio test (10.10) is typically not feasible. This is because the feature spaces that aspire to be complete in the above practical sense are still relatively high-dimensional (with dimensionality of the order of 10³ or higher), which is too high to obtain accurate parametric or non-parametric models of p_c. Thus, in practice the detection is formulated as classification. After all, Eve is interested only in detecting stego images, which is a simpler problem than estimating the underlying distributions. She trains a classifier on features f(x) for x drawn from a sufficiently large database of cover and stego images to recognize both classes.
Eve can construct her classifier in two different manners. The first option is to
train a cover-versus-all-stego binary classifier on two classes: cover images and
stego images produced by a sufficiently large number of stego algorithms and an
appropriate distribution of message payloads. The hope is that if the classifier is
trained on all possible archetypes of embedding operations, it should be able to
generalize to previously unseen schemes. With this approach, there is always the
possibility that in the future some new steganographic algorithm may produce
stego images whose features will be incompatible with the distribution p̂s , in
which case such images may be misclassified as cover.
Alternatively, Eve can train a classifier that can recognize cover images in
the feature space and marks everything that does not resemble a cover image
as potentially stego. Mathematically, Eve needs to specify the null-hypothesis
region R0 containing features of cover images (R0 is the complement of the
critical region R1 ). This can be done, for example, by covering the support
of p̂c with hyperspheres [167, 168]. An alternative approach using a one-class
neighbor machine is explained in Section 12.3. This one-class approach to blind
steganalysis has several important advantages. First, the classifier training is
simplified because only cover images are used. Also, the classifier does not need
to be retrained when new embedding methods appear. The potential problem is
that the database has to be very large and diverse. We emphasize the adjective
diverse because we certainly do not wish to misidentify processed covers (e.g.,
sharpened images) as containing stego just because the classifier has not been
trained on them.
In general, the modular structure of blind detectors makes them very flexi-
ble and gives them the ability to evolve with progress in steganography and in
machine-learning. One can obviously easily exchange the machine-learning en-
gine, add more features, or expand the training database. Note that all these
actions require retraining the classifier.
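A toy version of the one-class approach models the cover class by per-feature means and standard deviations and flags anything too far away. This is a deliberately crude stand-in for the hypersphere covering of [167, 168] or the one-class neighbor machine of Section 12.3; all names and thresholds here are illustrative.

```python
class OneClassCoverModel:
    """Crude one-class detector: model covers by per-feature mean and standard
    deviation; flag any vector deviating by more than t sigmas in some
    coordinate (everything outside this box plays the role of R1)."""

    def __init__(self, cover_features, t=3.0):
        n, d = len(cover_features), len(cover_features[0])
        self.t = t
        self.mean = [sum(f[j] for f in cover_features) / n for j in range(d)]
        self.std = [
            max(1e-12, (sum((f[j] - self.mean[j]) ** 2
                            for f in cover_features) / n) ** 0.5)
            for j in range(d)
        ]

    def is_stego(self, f):
        return any(abs(f[j] - self.mean[j]) / self.std[j] > self.t
                   for j in range(len(f)))

covers = [[0.1 * i, 1.0 - 0.1 * i] for i in range(10)]
model = OneClassCoverModel(covers)
print(model.is_stego([0.4, 0.6]), model.is_stego([5.0, -3.0]))  # False True
```

Note that only cover features are used in training, which is exactly the advantage (and the database-diversity risk) discussed above.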

10.5 Alternative use of blind steganalyzers

Even though the main purpose of blind steganalysis is to detect steganography, there are many other important applications of this approach to steganalysis. In this section, we review these applications, postponing specific examples to Chapter 12.

10.5.1 Targeted steganalysis

A blind steganalyzer can be used to construct a targeted attack on any steganographic scheme simply by training the blind steganalyzer on a narrower training set consisting of cover images and the corresponding stego images embedded by a specific steganographic scheme. If there already exist targeted attacks on the scheme, it is advisable to augment the feature set with the features used in the targeted attacks to further improve the steganalyzer's accuracy. This approach to targeted steganalysis often produces the most reliable steganalysis. For example, the blind steganalyzer for JPEG images described in Chapter 12 is more accurate in detecting F5 and OutGuess than the first targeted attacks on both schemes (see the targeted attack on F5 in Chapter 11 and the attack on OutGuess in [87]). Also, in this application dimensionality-reduction methods, such as the method of [173], could be applied to decrease the dimensionality of the feature space and provide a simpler and perhaps more accurate targeted detector. Moreover, it is possible to use the quantitative response of the detector, such as the distance between the stego image feature and the cluster of cover features, to derive an estimate of the number of embedding changes (see [195] and Section 12.4.1).

10.5.2 Multi-classification

An important advantage of blind steganalysis is that the position of the stego image feature in the feature space provides additional information about the embedding algorithm. In fact, the features of images embedded by different steganographic algorithms form clusters, and thus it is possible to classify stego images into known steganographic schemes instead of the binary classification between cover and stego images. A binary classifier can be extended to classify into k > 2 classes by building k(k − 1)/2 binary classifiers distinguishing between every pair of classes and then fusing the results using a classical voting system. This method for multi-classification is known as the Max–Wins principle [114].

Multi-classification into known steganographic methods is the first step towards forensic steganalysis, whose goal is to identify the embedding algorithm [189] and the secret stego key [94], and eventually extract the embedded message.
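The Max–Wins principle can be sketched in a few lines; the pairwise classifier here is a made-up nearest-prototype rule standing in for k(k − 1)/2 trained binary classifiers, and the class names are illustrative.

```python
from itertools import combinations

def max_wins(classes, pairwise, feature):
    """Max-Wins multi-classification: query one binary classifier per pair of
    classes and return the class that collects the most votes.
    `pairwise(a, b, feature)` must return the winner, a or b."""
    wins = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        wins[pairwise(a, b, feature)] += 1
    return max(classes, key=lambda c: wins[c])

# Toy stand-in for the trained binary classifiers: each class has a prototype
# feature value and the pairwise rule picks the closer prototype.
prototypes = {"cover": 0.0, "F5": 1.0, "OutGuess": 2.0, "Jsteg": 3.0}
vote = lambda a, b, x: a if abs(x - prototypes[a]) <= abs(x - prototypes[b]) else b
print(max_wins(list(prototypes), vote, 1.2))  # F5
```

With k classes, each pairwise classifier contributes one vote, so the true class can win even when some individual binary decisions are wrong.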

10.5.3 Steganography design

Blind steganalyzers can also be used as an oracle for the design of steganographic algorithms. The security of a new algorithm can be readily tested by constructing a targeted attack using the approach outlined in Section 10.5.1. The steganalysis results thus provide immediate feedback and guidance to Alice. Depending on how the features are constructed, Alice may be able to identify the features that contribute the most to successful detection and then modify the steganographic algorithm to decrease the impact of embedding on those features. In fact, this approach to steganography has become standard today, and the majority of research articles proposing new embedding algorithms report steganalysis results using blind steganalyzers (e.g., [217]).

10.5.4 Benchmarking

The feature set in a blind steganalyzer capable of detecting known steganographic schemes is likely to be a good low-dimensional model of covers. This suggests using the feature space as a simplified model of covers for benchmarking steganographic schemes by evaluating the KL divergence between the features of covers and stego objects calculated from some fixed large database. Since the KL divergence is generally hard to estimate accurately in high-dimensional spaces, alternative statistics, such as the two-sample statistic called maximum mean discrepancy, could be used instead [104, 191]. A third possibility is to benchmark for small payloads only and use for benchmarking the Fisher information evaluated in the feature space, as explained in Section 6.1.2.

It is important to realize that benchmarking steganography in this manner is relative to the feature model and the image database. It is possible that two steganographic techniques might rank differently using a different feature set (model) or image database [37] (also see the discussion in the next section). The problem of fair benchmarking is an active area of research.
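Maximum mean discrepancy itself is simple to estimate; the sketch below computes a biased estimate of the squared MMD between two samples of scalar features with a Gaussian kernel (the kernel and bandwidth are illustrative choices, not those of [104, 191]).

```python
import math

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between two
    samples of scalar features, using a Gaussian kernel of bandwidth sigma:
    MMD^2 = mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    k = lambda a, b: math.exp(-((a - b) ** 2) / (2 * sigma ** 2))
    m, n = len(xs), len(ys)
    kxx = sum(k(a, b) for a in xs for b in xs) / (m * m)
    kyy = sum(k(a, b) for a in ys for b in ys) / (n * n)
    kxy = sum(k(a, b) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2 * kxy

print(mmd2([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]))  # 0.0: identical samples
print(mmd2([0.1, 0.2, 0.3], [2.1, 2.2, 2.3]))  # large: well-separated samples
```

In a benchmarking application, the two samples would be cover and stego feature vectors drawn from the fixed database, and a larger MMD would indicate a less secure embedding scheme with respect to that feature model.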

10.6 Influence of cover source on steganalysis

The properties of the cover-image source have a major influence on the accuracy of steganalysis. In general, the more "spread out" the cover-image pdf P_c is, the easier it is for Alice to hide messages without increasing the KL divergence D_KL(P_c || P_s) and the more difficult it is for Eve to detect them. For this reason, one should avoid using covers with little redundancy, such as images with a low number of colors represented in palette image formats, because there the spatial distribution of colors is more predictable. Among the most important attributes of the cover source that influence steganalysis accuracy are the color depth, image content, image size, and previous processing. Experimental evaluation of the influence of the cover source on the reliability of blind and targeted steganalysis has been the subject of [22, 23, 102, 129, 130, 139].
Images with a higher level of noise or complex texture have a more spread-
out distribution than images that were compressed using lossy compression or
denoised. Scans of films or analog photographs are especially difficult for ste-
ganalysis because high-resolution scans of photographs resolve the individual
grains in the photographic material, and this graininess manifests itself as high-
frequency noise. It is also generally easier to detect steganographic changes in
color images than in grayscale images, because color images provide more data
for statistical analysis and because Eve can utilize strong correlations between
color channels.
The image size has an important influence on steganalysis as well. Intuitively,
it should be more difficult to detect a fixed relative payload in smaller images
than in larger images because features computed from a shorter statistical sample
are inherently more noisy. This intuitive observation is analyzed in more detail
in Chapter 13 on steganographic capacity. The effect of image size on reliability
of steganalysis also means that JPEG covers with a low quality factor are harder
to steganalyze reliably because the size of the cover is determined by the number
of non-zero coefficients, which decreases with decreasing quality factor.
Image processing may play a decisive role in steganalysis. Processing that is
of low-pass character (denoising, blurring, and even lossy JPEG compression to
some degree) generally suppresses the noise naturally present in the image, which
makes the stego noise more detectable. This is especially true for spatial-domain
steganalysis. In fact, it is possible that a certain steganalysis technique can have
very good performance on one image database and, at the same time, be almost
useless on a different source of images. Thus, it is absolutely vital to test new
steganalysis techniques on as large and as diverse a source of covers as possible.
Sometimes, it may not be apparent at first sight that a certain processing may
introduce artifacts that may heavily influence steganalysis. For example, a trans-
formation of grayscales, such as contrast adjustment, histogram equalization, or
gamma correction, generally does not influence the image noise component in
any significant manner. However, it may introduce characteristic spikes and ze-
ros into the histogram [222]. This unusual artifact is caused by discretization of
the grayscale transformation to force it to map integers to integers. The spikes
and valleys in the histogram that would otherwise not be present in the image
can aid steganalysis methods that use the histogram for their reasoning. A good
example is the superior performance for detection of ±1 embedding of ALE (it
uses Amplitudes of Local Extrema of the histogram) steganalysis [37, 38] on the
database of images supplied with the image-editing software Corel Draw. The
images happen to have been processed using a grayscale transformation, which
makes ALE perform very well.
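This histogram artifact is easy to reproduce. The sketch below is a hypothetical example (gamma correction with γ = 0.8; the specific map and value are illustrative assumptions, not taken from [222]) that counts the grayscales that become unreachable (zeros) and those hit by several inputs (spikes) when the transformation is discretized to map integers to integers:

```python
# A discretized grayscale transformation forced to map integers to integers:
# gamma correction with gamma = 0.8 (a hypothetical choice for illustration).
gamma = 0.8
mapped = [round(255 * (v / 255) ** gamma) for v in range(256)]

# Histogram of the mapped grayscales for a perfectly flat input histogram.
hist = [0] * 256
for m in mapped:
    hist[m] += 1

zeros = sum(c == 0 for c in hist)   # grayscales no input maps to (valleys)
spikes = sum(c >= 2 for c in hist)  # grayscales hit by two or more inputs
```

Even though the input histogram is flat, the output histogram contains empty bins where the discretized map skips a grayscale and spikes where it merges two, which is exactly the kind of signature that histogram-based steganalysis can latch onto.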
JPEG images are less sensitive to processing that occurred prior to compres-
sion. This is because the compression has a tendency to suppress many artifacts
that are otherwise strikingly present in the uncompressed images. JPEG images,
214 Chapter 10. Steganalysis
however, present challenges of their own. Because the statistical distribution of
DCT coefficients in a JPEG file is significantly influenced by the quality fac-
tor, for some steganographic schemes separate steganalyzers may need to be
built for each quality factor to improve their accuracy. This is a complication as
the JPEG format allows customized quantization tables and it would not be
practical to construct steganalyzers for each possible matrix. Another complication for
steganalysis of JPEG images is repeated JPEG compression with different qual-
ity factors because it leads to a phenomenon called “double compression” that
may drastically change the statistical distribution of DCT coefficients. When a
singly compressed JPEG file is decompressed to the spatial domain and then re-
compressed with the same 8 × 8 grid of pixel blocks, the coefficients experience
double quantization, which may lead to an unusual-looking histogram. Take, for
example, the case when DCT coefficients from a fixed DCT mode were origi-
nally quantized with the primary quantization step 7 and then quantized with
the secondary step 4. Note that no coefficients quantize to 4 after the second
compression and the histogram will have a zero at the first multiple of 4. Coef-
ficients with value 7 after the first compression now quantize to 8. Interestingly,
some portion of the coefficients equal to 14 after the first compression now quan-
tize to 12 and a portion of them to 16. The histogram of the DCT mode in the
doubly compressed image will thus exhibit unusual patterns not typically found
in JPEG images (see Figure 10.7). In other words, the resonance between the
primary and secondary quantization steps may leave a characteristic imprint in
the histogram of individual DCT modes. Double-compression artifacts may be
mistaken by some steganalysis methods as an impact of steganography.
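The requantization effect can be reproduced in a few lines. The sketch below uses a toy model (Laplacian-like coefficients and exact requantization with steps 7 and 4; a real decompress-recompress cycle adds spatial-domain rounding noise, which is what splits boundary values such as 14 between 12 and 16):

```python
import random
from collections import Counter

rng = random.Random(0)
# Laplacian-like "DCT coefficients" (an assumption, for illustration only)
coefs = [rng.expovariate(0.1) * rng.choice((-1, 1)) for _ in range(100_000)]

q1, q2 = 7, 4
first = [round(c / q1) * q1 for c in coefs]    # after the first compression: multiples of 7
second = [round(f / q2) * q2 for f in first]   # after the second: multiples of 4

hist = Counter(second)
# 0 -> 0 and 7 -> 8, so no coefficient can land on 4: the bin at 4 stays empty,
# while the bin at 8 inherits all coefficients that were 7 after the first step.
```

Plotting `hist` reproduces the pattern of Figure 10.7: zeros at ±4 and inflated bins at 0 and ±8.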
The process of calibration introduced in Section 10.3 is especially vulnerable to
double compression because, during calibration, the effects of both compressions
are suppressed and the steganalyst does not estimate the doubly compressed
cover but instead the cover singly compressed by the second quality factor. One
possible approach here is to estimate the primary quantization matrix, such as
by using methods explained in [192, 196], and then calibrate by mimicking both
compression processes. The need to do this significantly complicates steganalysis
as one now needs to build steganalyzers for all possible combinations of primary
and secondary quality factors [193].
There are some pathological situations when a certain combination of covers
and steganographic system is so unfortunate that it becomes possible to detect
even a single modification! Consider images obtained by decompressing JPEG
images of a fixed quality to the spatial domain. This image source is significantly
different from (and much smaller than) the set of all uncompressed images because
JPEG is a many-to-one mapping. This is because, during JPEG compression,
many slightly different 8 × 8 blocks of pixels are mapped to the same 8 × 8 block
of quantized DCT coefficients. Thus, after embedding a message, e.g., using LSB
or ±1 embedding in the spatial domain, it is very likely that no 8 × 8 block of
quantized DCT coefficients can, when decompressed, produce the pixel values in
the modified spatial block. At the same time, because the steganographic changes
Figure 10.7 Histogram of luminance DCT coefficients for spatial frequency (1, 1) for the
image shown in Figure 5.1 compressed with quality factor 70 (left), and the same for the
image first compressed with quality factor qf(1) = 85 and then with qf(2) = 70 (right). The
quantization steps for the primary and secondary compression for this DCT mode were
Q(1)[1, 1] = 7, Q(2)[1, 1] = 4.

are small, after transforming the block back to the DCT domain, the coefficients
will still exhibit traces of quantization due to the previous JPEG compression.
In fact, if the number of embedding changes is sufficiently small (say one or two
embedding changes per block), it is possible to recover the original block of cover-
image pixels simply by a brute-force search for the modified pixels. This way,
Eve can not only detect the presence of a secret message with high probability
but also identify which pixels have been modified! Steganalysis based on this
idea is called JPEG compatibility steganalysis [85].
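The brute-force recovery can be sketched for a single 8 × 8 block. The toy implementation below assumes a flat quantization table with step 8 (real JPEG uses mode-dependent steps, zigzag ordering, and entropy coding, all omitted here); a block is "compatible" if it is exactly the decompression of its own quantized coefficients:

```python
import math

N = 8
# Orthonormal 8x8 DCT-II matrix
C = [[math.sqrt((1 if u == 0 else 2) / N) * math.cos((2 * i + 1) * u * math.pi / (2 * N))
      for i in range(N)] for u in range(N)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)] for i in range(N)]

def transpose(A):
    return [list(r) for r in zip(*A)]

Q = 8  # hypothetical flat quantization step for all DCT modes

def compress(block):
    """Pixels -> quantized DCT coefficients."""
    d = matmul(matmul(C, [[p - 128 for p in row] for row in block]), transpose(C))
    return [[round(v / Q) for v in row] for row in d]

def decompress(coefs):
    """Quantized coefficients -> rounded, clipped pixels."""
    b = matmul(matmul(transpose(C), [[v * Q for v in row] for row in coefs]), C)
    return [[min(255, max(0, round(v) + 128)) for v in row] for row in b]

def is_compatible(block):
    """True if the block decompresses exactly from its own coefficients."""
    return decompress(compress(block)) == block

def recover_single_flip(stego_block):
    """Brute-force search for the single LSB flip that restores compatibility."""
    for i in range(N):
        for j in range(N):
            cand = [row[:] for row in stego_block]
            cand[i][j] ^= 1
            if is_compatible(cand):
                return (i, j), cand
    return None, None
```

Starting from a compatible cover block, a single ±1 or LSB change almost surely breaks compatibility, and flipping the modified pixel back is (with overwhelming probability) the only single flip that restores it, so the search both detects and localizes the embedding change.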

10.7 System attacks

When the steganographic algorithm F5 for JPEG images was introduced, most
researchers focused on attacking the impact of the embedding changes using
statistical steganalysis. It was only later that Niels Provos pointed out2 that
the JPEG compressor in F5 implementation always inserted the following JPEG
comment into the header of the stego image “JPEG Encoder Copyright 1998,
James R. Weeks and BioElectroMech,” which is rarely present in JPEG images
produced by common image-editing software. This comment thus serves as a
relatively reliable detector of JPEG images produced by F5. This is an example
of a system attack. Johnson [119, 121, 122] and [142] give examples of other
unintentional fingerprints left in stego images by various stego products.
Although these weaknesses are not as interesting from the mathematical point
of view, it is very important to know about them because they can markedly sim-
plify steganalysis or provide valuable side-information about the steganographic
channel.
The size of the stego key space can also be used to attack a steganographic
scheme even though the embedding changes it introduces are otherwise statis-
tically undetectable. The stego key usually determines a pseudo-random path

2 Personal communication by Andreas Westfeld.
Algorithm 10.2 System attack on stego image y by trying all possible keys.
The stego key is found once a meaningful message is extracted.
// Input: stego image y
while (Keys left) {
k = NextKey();
m = Ext(y, k);
if (m meaningful) {
output(’Image is stego.’);
output(’Message = ’, m);
output(’Key = ’, k);
STOP;
}
}
output(’Image is cover.’);

through the image where the message bits are embedded. A weak stego key or
a small stego key space creates a security weakness that can be used by Eve to
mount the system attack shown in Algorithm 10.2.
Depending on the size of the stego key space, Eve can go through all stego
keys or use a dictionary attack. For each key tried, Eve extracts an alleged mes-
sage. Once she obtains a legible message, she will know that Alice and Bob use
steganography and she will have the correct stego key. At this point, a malicious
warden can choose to impersonate either party or simply block the communica-
tion.
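Algorithm 10.2 can be made concrete with a toy LSB scheme. In the sketch below, the path generator, the meaningfulness test, and the candidate key list are all hypothetical stand-ins (a real attack would use the actual Ext() of the suspected stego software and a full dictionary):

```python
import random
import string

def path_for_key(n, key, length):
    """Key-dependent pseudo-random embedding path (a toy stand-in)."""
    return random.Random(key).sample(range(n), length)

def embed(pixels, key, bits):
    y = list(pixels)
    for i, b in zip(path_for_key(len(y), key, len(bits)), bits):
        y[i] = (y[i] & ~1) | b          # LSB embedding along the key's path
    return y

def extract(pixels, key, n_bits):
    return [pixels[i] & 1 for i in path_for_key(len(pixels), key, n_bits)]

def looks_meaningful(bits):
    """Crude plausibility test: all extracted bytes are letters/spaces/punctuation."""
    chars = [chr(int("".join(map(str, bits[i:i + 8])), 2))
             for i in range(0, len(bits) - 7, 8)]
    return len(chars) > 0 and all(c in string.ascii_letters + " .,!" for c in chars)

def key_search(stego, candidate_keys, n_bits):
    """Algorithm 10.2: try each key, stop when the extract looks meaningful."""
    for k in candidate_keys:
        bits = extract(stego, k, n_bits)
        if looks_meaningful(bits):
            return k, bits
    return None, None
```

A wrong key reads essentially random LSBs, which almost never decode to printable text, so the search stops at the correct key.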
The attack above will not work if the message is encrypted prior to embedding,
because the warden cannot reliably distinguish between a random bit stream and
an encrypted message. However, even if the message is encrypted using a strong
encryption scheme with a secure key, the stego key still needs to be strong,
because Eve can determine the stego key by means other than inspecting the
message. She can still run through all stego keys as above;
however, this time she will be checking the statistical properties of the pixels
along the pseudo-random path rather than the extracted message bits. The
statistical properties of pixels along the true embedding path should be different
than statistical properties of a randomly chosen path, as long as the image is not
fully embedded.3 To see this, assume that the steganographic scheme embeds the
payload by changing n0 pixels while visiting k < n pixels along some embedding
path in an n-pixel cover image. The change rate for the first k pixels along the
embedding path will be n0 /k, while the change rate when following a random
path through the image is only n0 /n < n0 /k. Thus, a sudden increase of this
ratio is indicative of the fact that a correct stego key was used.

3 Fully embedded images can most likely be reliably detected using other methods.
For example, for simple LSB embedding, n0 /k ≈ 1/2 while n0 /n = α/2 < 1/2,
where α is the relative payload. In this case, the warden can apply the histogram
attack from Chapter 5 as her detector. More examples of this type of system
attack can be found in [94].
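The counting argument behind this attack can be illustrated with a small simulation. (The simulation cheats by tracking which pixels were changed, which Eve of course cannot observe directly; it only demonstrates the gap between n0/k and n0/n that her along-path statistic exploits.)

```python
import random

rng = random.Random(7)
n, k = 100_000, 20_000              # image size and number of visited pixels
true_path = rng.sample(range(n), n)

# LSB embedding changes each visited pixel with probability 1/2
changed = {true_path[i] for i in range(k) if rng.random() < 0.5}
n0 = len(changed)

# Change rate along the first k pixels of the true embedding path: n0/k
r_true = sum(i in changed for i in true_path[:k]) / k

# Change rate along the first k pixels of a wrong-key path: about n0/n
wrong_path = random.Random(8).sample(range(n), n)
r_wrong = sum(i in changed for i in wrong_path[:k]) / k
```

Here r_true is close to 1/2 while r_wrong stays near n0/n ≈ 0.1, so the sudden jump of this ratio identifies the correct stego key.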

10.8 Forensic steganalysis

The goal of steganalysis is to detect the presence of secret messages. In the clas-
sical prisoners’ problem, once Eve finds out that Alice and Bob communicate
using steganography, she decides to block the communication channel. In prac-
tice, however, Eve may not have the resources or authority to do so or may not
even want to because blocking the channel would only alert Alice and Bob that
they are being subjected to eavesdropping. Instead, Eve may try to determine
the steganographic algorithm and the stego key, and eventually extract the mes-
sage. Such activities belong to forensic steganalysis, which can be loosely defined
as a collection of tasks needed to identify individuals who are communicating in
secrecy, the stegosystem they are using, its parameters (the stego key), and the
message itself. We list these tasks below, with the caveat that in any given situ-
ation some of these tasks do not have to be carried out because the information
may already be a priori available.

1. Identification of web sites, Internet nodes, or computers that should be ana-
lyzed for steganography.
2. Development of algorithms that can distinguish stego images from cover im-
ages.
3. Identification of the embedding mechanism, e.g., LSB embedding, ±1 embed-
ding, embedding in the frequency domain, embedding in the image palette,
sequential, random, or content-adaptive embedding, etc.
4. Determining the steganographic software.
5. Searching for the stego key and extracting the embedded data.
6. Deciphering the extracted data and obtaining the secret message (cryptanal-
ysis).

The power of steganography is that it not only provides privacy in the sense
that no one can read the exchanged messages, but also hides the very presence
of secret communication. Thus, the primary problem in steganography detection
is to decide what communication to monitor in the first place. Since steganalysis
algorithms may be expensive and slow to run, focusing on the right channel is
of paramount importance.
Second, the communication through the monitored channel will be inspected
for the presence of secret messages using steganalysis methods. Once an im-
age has been detected as containing a secret message, it is further analyzed to
determine the steganographic method.
The warden can continue by trying to recover some attributes of the embedded
message and properties of the stego algorithm. For example, some detection algo-
rithms may provide Eve with an estimate of the number of embedding changes.
In this case, she can approximately infer the length of the secret message. If the
approximate location of the embedding changes can be determined, this may
point to a class of stego algorithms. For example, Eve can use the histogram
attack, described in Chapter 5, to determine whether the message has been se-
quentially embedded. If the prisoners reuse the stego key, then Eve may have
multiple stego images embedded with the same key, which can help her deter-
mine the embedding path [137, 139] and narrow down the class of possible stego
methods.
The character of the embedding changes also leaks information about the em-
bedding mechanism. If Eve can determine that the LSBs of pixels were modified,
she can then focus on methods that embed into LSBs. Eventually, Eve may guess
which stego method has been used and attempt to determine the stego key and
extract the embedded message. If the message is encrypted, Eve then needs to
perform cryptanalysis on the extracted bit stream.
In her endeavor, Eve can mount different types of attacks depending on the
information available to her [120]. The most common case, which we have already
considered, is the stego-image-only-attack in which Eve has only the stego image.
However, in some situations, Eve may have additional information available that
may aid in her effort. For example, in a criminal case the suspect’s computer
may be available, with several versions of the same image on the hard disk. This
will allow Eve to directly infer the location, number, and character of embedding
modifications. This scenario is known as a known-cover attack.
If the steganographic algorithm is known to Eve, her options further increase
as she can now mount two more attacks – the known-stego-method attack and
the known-message attack. Eve can, for example, embed the message with vari-
ous keys and identify the correct key by comparing the locations of embedding
changes in the resulting stego image with those in the stego image under inves-
tigation.

Summary
• Steganalysis is the complementary task to steganography. Its goal is to detect
  the presence of secret messages.
• Steganography is broken when the mere presence of a secret message can be
  proved. In particular, it is not necessary to read the message to break a
  stegosystem.
• Activities directed towards extracting the message belong to forensic
  steganalysis.
• Steganalysis can be formulated as the detection problem using a variety of
  hypothesis-testing scenarios.
• There are two major types of steganalysis attacks – statistical and system
  attacks.
• System attacks use some weakness in the implementation or protocol.
  Statistical attacks try to distinguish cover and stego images by computing
  statistical quantities from images.
• All statistical steganalysis methods work by representing images in some
  feature space where a detector is constructed.
• If the features are designed to detect a specific stegosystem, we speak of
  targeted steganalysis.
• Features designed to attack an arbitrary stegosystem lead to blind
  steganalysis algorithms.
• The features for targeted schemes are designed by
  – identifying quantities that predictably change with embedding,
  – estimating these quantities for the cover image from the stego image
    (calibration), or
  – finding a function of such quantities that attains a known value on covers.
• Features for blind steganalysis are usually constructed in a heuristic manner
  to be sensitive to typical steganographic changes and insensitive to image
  content. Calibration can be used to achieve this goal. The features can be
  computed in the spatial domain, frequency domain, or wavelet domain.
• The goal of quantitative steganalysis is to estimate the embedded payload
  (or, more accurately, the number of embedding changes).
• The error of the estimate from quantitative steganalyzers has two components
  – a within-image error and a between-image error. The within-image error
  depends on the image content and the payload. It is well modeled using a
  Gaussian distribution. The between-image error is the estimator bias for each
  image caused by the properties of natural images. It is well modeled using
  Student’s t-distribution.

Exercises

10.1 [Gauss–Gauss ROC] Consider a scalar feature with pc = N (0, σ²) and
ps = N (μ, σ²), μ > 0. Prove the following expressions for the ROC curve and the
three measures of detector performance from the text:

PD (PFA ) = Q(Q⁻¹(PFA ) − μ/σ),    (10.41)
PD⁻¹(1/2) = Q(μ/σ),    (10.42)
ρ = 1 − 2Q(μ/(σ√2)),    (10.43)
PE = Q(μ/(2σ)),    (10.44)

where Q(x) is the complementary cumulative distribution function of a standard
normal variable N (0, 1),

Q(x) = (1/√(2π)) ∫_x^∞ e^(−t²/2) dt.    (10.45)

Hint: When computing ρ,

(ρ + 1)/2 = ∫_0^1 Q(Q⁻¹(x) − μ/σ) dx = −∫_{−∞}^{∞} Q(y − μ/σ) Q′(y) dy    (10.46)
          = ∫_{−∞}^{∞} (1/(2π)) ∫_{y−μ/σ}^{∞} e^(−(x²+y²)/2) dx dy
          = (1/(2π)) ∬_{x−y ≥ −μ/σ} e^(−(x²+y²)/2) dx dy    (10.47)

and make a substitution

r = (x + y)/√2,    (10.48)
s = (x − y)/√2.    (10.49)
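Before proving these identities, it can be reassuring to check them numerically. The sketch below verifies (10.41) and (10.44) by Monte Carlo for one choice of μ and σ, computing Q via the complementary error function:

```python
import math
import random

def Q(x):
    """Complementary CDF of a standard normal variable N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

mu, sigma = 2.0, 1.0
rng = random.Random(0)
n = 200_000
cover = [rng.gauss(0, sigma) for _ in range(n)]
stego = [rng.gauss(mu, sigma) for _ in range(n)]

# Any threshold T gives the ROC point (PFA, PD) = (Q(T/sigma), Q((T - mu)/sigma)),
# which is exactly (10.41) with Q^{-1}(PFA) = T/sigma.
T = 1.0
pfa = sum(c > T for c in cover) / n
pd = sum(s > T for s in stego) / n

# The minimal average error (10.44) is attained at the midpoint T = mu/2.
Tstar = mu / 2
pe = 0.5 * (sum(c > Tstar for c in cover) / n + sum(s <= Tstar for s in stego) / n)
```

The empirical pfa, pd, and pe agree with Q(T/σ), Q((T − μ)/σ), and Q(μ/(2σ)) up to Monte Carlo noise.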
Steganography in Digital Media: Principles, Algorithms, and Applications
Jessica Fridrich, Cambridge University Press
Online ISBN: 9781139192903, Hardback ISBN: 9780521190190
Chapter 11: Selected targeted attacks, pp. 221-250

11 Selected targeted attacks

Steganalysis is the activity directed towards detecting the presence of secret mes-
sages. Due to their complexity and dimensionality, digital images are typically
analyzed in a low-dimensional feature space. If the features are selected wisely,
cover images and stego images will form clusters in the feature space with mini-
mal overlap. If the warden knows the details of the embedding mechanism, she
can use this side-information and design the features accordingly. This strategy
is known as targeted steganalysis. The histogram attack and the attack on
Jsteg from Chapter 5 are two examples of targeted attacks.
Three general strategies for constructing features for targeted steganalysis were
described in the previous chapter. This chapter presents specific examples of four
targeted attacks on steganography in images stored in raster, palette, and JPEG
formats. The first attack, called Sample Pairs Analysis, detects LSB embedding
in the spatial domain by considering pairs of neighboring pixels. It is one of the
most accurate methods for steganalysis of LSB embedding known today. Sec-
tion 11.1 contains a detailed derivation of this attack as well as several of its
variants formulated within the framework of structural steganalysis. The Pairs
Analysis attack is the subject of Section 11.2. It was designed to detect stegano-
graphic schemes that embed messages in LSBs of color indices to a preordered
palette. The EzStego algorithm from Chapter 5 is an example of this embedding
method. Pairs Analysis is based on an entirely different principle than Sample
Pairs Analysis because it uses information from pixels that can be very distant.
The third attack, presented in Section 11.3, is targeted steganalysis of the F5
algorithm that demonstrates the use of calibration as introduced in Section 10.3.
The last attack concerns detection of ±1 embedding in the spatial domain and
is detailed in Section 11.4. These steganalysis methods were chosen to illustrate
the basic principles on which many targeted attacks are based.

11.1 Sample Pairs Analysis

A large number of steganographic applications today use LSB embedding for
hiding messages. The likely reasons for its popularity are ease of implementation,
speed, and large embedding capacity. Thus, reliable methods for detection of this
embedding paradigm are of great interest.
Figure 11.1 Left: Logarithm of the normalized adjacency histogram of horizontally
neighboring pixel pairs (r, s) from 6000 never-compressed raw digital-camera images (see
the description of Database RAW in Section 11.1.1). Right: A cross-section along the
minor diagonal of the adjacency histogram.

In Chapter 5, we introduced an attack on LSB embedding called the histogram
attack. It was based on an observation that flipping LSBs has a tendency to
even out the histogram counts in LSB pairs (pairs of intensities differing by their
LSBs {2k, 2k + 1}, k = 0, . . . , 127). The histogram attack worked reasonably well
when the stego image was fully embedded or when the image was only partially
embedded but along a known embedding path, such as sequentially by rows
or columns. When the message is scattered along a pseudo-random path, the
histogram attack will not work unless the majority (i.e., 99%) of all pixels were
used for embedding.
Steganalysis methods that use only a first-order statistic, such as his-
tograms of pixels, cannot capture the relationship among neighboring pixels.
By utilizing the fact that neighboring pixels in images exhibit strong corre-
lations, it is possible to construct much more reliable and accurate steganal-
ysis methods. The first method that used local correlations among neighbor-
ing pixels to accurately detect LSB embedding was the heuristic RS Analy-
sis [84]. The simplest case of RS Analysis, Sample Pairs Analysis (SPA), was
later rederived [62] in a way that enabled multiple extensions and improve-
ments [61, 63, 127, 128, 131, 133, 134, 135, 163]. Together, these contributions
revolutionized steganalysis of LSB embedding and provided extremely accurate
methods capable of reliably detecting payloads as small as 0.03 bpp in some
cover sources. Next, we describe in detail the simplest version of SPA as it ap-
peared in [62]. Later in this chapter, SPA is reformulated within a more general
framework that provides deeper insight into its inner workings and gives birth
to several important generalizations and improvements.
From Section 5.1.1, we already know that LSB embedding predictably changes
the image histogram. The problem is that the variety of cover-image histograms is
so high that no reasonable assumption can be made about them. The situation,
however, is quite different when one looks at pairs of neighboring pixels. Let
P be the set of all pairs of, say horizontally, neighboring pixels. Due to local
correlations among pixels of natural images, we are more likely to see a pair of
neighboring pixels (r, s) than a pair (r′, s′) whenever |r − s| < |r′ − s′|. In other
words, the larger the difference |r − s|, the less likely such a pair is to occur
in a natural image.1 Thus, unlike the histogram of pixels, the histogram of pixel
pairs has a predictable shape (see Figure 11.1). Let us now take a look at a pixel
pair (r, s) with r < s. The pair can accept four different forms,

(r, s) ∈ {(2i, 2j), (2i, 2j + 1), (2i + 1, 2j), (2i + 1, 2j + 1)}. (11.1)

Let h[r, s] be the number of horizontally adjacent pixel pairs (r, s) in the image.
Because r < s, we expect the counts for the pair (2i, 2j + 1) to be the lowest
among the four pairs, and the counts for (2i + 1, 2j) to be the highest. LSB
embedding changes any pair into another with probability that depends on the
change rate β. Because the set of four pairs (11.1) is obviously closed under
LSB embedding, the count of pixel pairs in the stego image, hβ , is a convex
combination of the cover counts for all four pairs. For example, if the payload is
embedded pseudo-randomly in the image,

hβ[2i + 1, 2j] = β(1 − β)h[2i, 2j] + β²h[2i, 2j + 1]
               + (1 − β)²h[2i + 1, 2j] + β(1 − β)h[2i + 1, 2j + 1],    (11.2)
hβ[2i, 2j + 1] = β(1 − β)h[2i, 2j] + (1 − β)²h[2i, 2j + 1]
               + β(1 − β)h[2i + 1, 2j + 1] + β²h[2i + 1, 2j].    (11.3)

Similar expressions can be obtained for the other two counts. The important
observation here is that hβ [2i + 1, 2j] will decrease with β because three out
of four terms in (11.2) are smaller than h[2i + 1, 2j]. By a similar argument,
hβ [2i, 2j + 1] will increase with β. Eventually, when β = 0.5, the expected values
of all four counts will be the same. We will put the pair (r, s) into set X whenever
r < s and s is even, and we will include the pair in set Y whenever r < s and s
is odd. With LSB embedding, the counts of pixel pairs in X will decrease, while
the counts of pairs in Y will increase.
A similar analysis can be carried out for pairs (r, s) for which r > s. There,
the situation is complementary in the sense that the counts of pairs (2i + 1, 2j)
will increase, while the counts of (2i, 2j + 1) will decrease. Thus, we include in
X pairs with r > s, s odd, while (r, s) ∈ Y when r > s and s even. This way, the
cardinality of X will decrease with LSB embedding, while the cardinality of Y
will increase, for all pairs (r, s), r = s.

1 Because the order of pixels in the pair matters, we will denote the pair in round brackets
(r, s), rather than curly brackets {r, s}.
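The convex-combination relation (11.2) is easy to verify numerically. The sketch below starts from hypothetical cover counts for the four pair forms with 2i = 4, 2j = 6, flips each value's LSB independently with probability β, and compares the resulting count of (5, 6) pairs with the formula:

```python
import random
from collections import Counter

rng = random.Random(3)
beta = 0.2
# Hypothetical cover counts h for the four pair forms with 2i = 4, 2j = 6,
# ordered so that h[(5, 6)] is largest and h[(4, 7)] smallest, as in the text.
h = {(4, 6): 5000, (4, 7): 1000, (5, 6): 9000, (5, 7): 5000}

stego = Counter()
for (r, s), count in h.items():
    for _ in range(count):
        rr = r ^ 1 if rng.random() < beta else r   # flip LSB with probability beta
        ss = s ^ 1 if rng.random() < beta else s
        stego[(rr, ss)] += 1

# Right-hand side of (11.2) for h_beta[2i + 1, 2j] = h_beta[5, 6]
predicted = (beta * (1 - beta) * h[(4, 6)] + beta ** 2 * h[(4, 7)]
             + (1 - beta) ** 2 * h[(5, 6)] + beta * (1 - beta) * h[(5, 7)])
```

The simulated count stego[(5, 6)] matches the predicted convex combination up to sampling noise, and since three of the four terms weight counts smaller than h[(5, 6)], the count indeed decreases with β.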
Figure 11.2 Transitions between primary sets, X , V, W, Z, under LSB flipping. Note that
Y = V ∪ W.

By now, we have partitioned the set of all pairs of horizontally neighboring
pixels P into three sets:

X = {(r, s) ∈ P|(s is even and r < s) or (s is odd and r > s)}, (11.4)
Y = {(r, s) ∈ P|(s is even and r > s) or (s is odd and r < s)}, (11.5)
Z = {(r, s) ∈ P|r = s}. (11.6)

The symmetry of the definitions of the sets X and Y indicates that for cover
images the cardinalities of X and Y should be the same,

|X| = |Y|,    (11.7)

because in natural images it should be equally likely to have r > s or r < s
independently of whether s is even or odd. However, we now know that the
equality will be broken by flipping LSBs. Thus, with embedding the difference
|X | − |Y| will no longer be zero. Clearly, we identified a quantity that predictably
changes with embedding and whose value is known on cover images. All we now
need to do is to quantify this important observation.
To this end, we further partition Y into two subsets W and V, V = Y − W,
where

W = {(r, s) ∈ P|r = 2k, s = 2k + 1 or r = 2k + 1, s = 2k}. (11.8)

In other words, W ∪ Z is the set of all pairs from P that belong to one LSB pair
{2k, 2k + 1}.
The sets X, W, V, and Z are called primary sets. Note that P = X ∪ W ∪ V ∪ Z.
We now analyze what happens to a given pixel pair (r, s) under LSB embed-
ding. There are four possibilities:

1. both r and s stay unmodified (modification pattern 00)
2. only r is modified (modification pattern 10)
3. only s is modified (modification pattern 01)
4. both r and s are modified (modification pattern 11).
Selected targeted attacks 225

LSB embedding may cause a given pixel pair to move from its primary set
to another primary set. The transitions of pairs between the primary sets are
depicted in Figure 11.2. When an arrow points from set A to set B, it means that
a pixel pair originally in A moves to B if modified by the modification pattern
associated with the arrow.
For each modification pattern Ω ∈ {00, 10, 01, 11} and any subset A ⊂ P, we
denote by φ(Ω, A) the expected fraction of pixel pairs in A modified with pattern
Ω. Under the assumption that the message bits are embedded along a random
path through the image, each pixel in the image is equally likely to be modi-
fied. Thus, the expected fraction of pixels modified with a specific modification
pattern Ω ∈ {00, 10, 01, 11} is the same for every primary set A ∈ {X , V, W, Z},
φ(Ω, X ) = · · · = φ(Ω, Z) = φ(Ω). With change rate β, we obtain the following
transition probabilities:

φ(00) = (1 − β)²,    (11.9)
φ(01) = φ(10) = β(1 − β),    (11.10)
φ(11) = β².    (11.11)

Together with the transition diagram of Figure 11.2, we can now express the
expected cardinalities of the primary sets for the stego image as functions of the
change rate β and the cardinalities of the cover image. Denoting the primary
sets after embedding with a prime, we obtain

E[|X′|] = (1 − β)|X| + β|V|,    (11.12)
E[|V′|] = (1 − β)|V| + β|X|,    (11.13)
E[|W′|] = (1 − 2β + 2β²)|W| + 2β(1 − β)|Z|.    (11.14)

From now on, we drop the expectations and assume that the cardinalities of the
primary sets from the stego image will be close to their expectations.
Following Strategy T2 from Chapter 10, our goal is to derive an equation for
the unknown quantity β using only the cardinalities of primed sets because they
can be calculated from the stego image. Equations (11.12) and (11.13) imply
that

|X′| − |V′| = (|X| − |V|)(1 − 2β).    (11.15)

Because |X| = |Y|, we have |X| = |V| + |W|, and from (11.15)

|X′| − |V′| = |W|(1 − 2β).    (11.16)

The transition diagram shows that the embedding process does not modify the
union W ∪ Z. Denoting κ = |W| + |Z| = |W′| + |Z′|, on replacing |Z| with κ −
|W|, equation (11.14) becomes

|W′| = |W|(1 − 2β)² + 2β(1 − β)κ.    (11.17)

Algorithm 11.1 Sample Pairs Analysis for estimating the change rate from a
stego image. The constant γ is a threshold on the test statistic β̂ set to achieve
PFA < εFA , where εFA is a bound on the false-alarm rate.
// Input M × N image x
// Form pixel pairs
P = {(x[i, j], x[i, j + 1])|i = 1, . . . , M, j = 1, . . . , N − 1}
x = y = 0;κ = 0;
for k = 1 to M (N − 1) {
(r, s) ← kth pair from P
if (s even & r < s) or (s odd & r > s){x = x + 1;}
if (s even & r > s) or (s odd & r < s){y = y + 1;}
if (⌊s/2⌋ = ⌊r/2⌋){κ = κ + 1;}
}
if κ = 0 {output(’SPA failed because κ = 0’); STOP}
a = 2κ; b = 2(2x − M (N − 1)); c = y − x;
β± = Re((−b ± √(b² − 4ac))/(2a));
β̂ = min(β+ , β− );
if β̂ > γ {
output(’Image is stego’);
output(’Estimated change rate = ’, β̂);
}

Eliminating |W| from (11.16) and (11.17) leads to

|W′| = (|X′| − |V′|)(1 − 2β) + 2β(1 − β)κ.    (11.18)

Since |X′| + |Y′| + |Z′| = |X′| + |V′| + |W′| + |Z′| = |P|, (11.18) is equivalent to

2κβ² + 2(2|X′| − |P|)β + |Y′| − |X′| = 0,    (11.19)

which is a quadratic equation for the unknown change rate β. This equation can
be directly solved for β because all coefficients in this equation can be evaluated
from the stego image (recall that κ = |W′| + |Z′|). The final estimate of the
change rate β̂ is obtained as the smaller root of this quadratic equation. Note
that when κ = 0, W ∪ Z = ∅ and thus |X| = |X′| = |Y| = |Y′| = |P|/2. In this
case (11.19) becomes a useless identity and we cannot estimate β. However, since
κ is the number of pixel pairs where both values belong to the same LSB pair,
this will happen with very small probability for natural images.
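Algorithm 11.1 translates almost line for line into code. The sketch below assumes a grayscale image given as a list of integer rows and omits the threshold-γ decision step, returning only the change-rate estimate β̂:

```python
import math

def spa_change_rate(img):
    """Sample Pairs Analysis (Algorithm 11.1): estimate the LSB change rate
    from horizontally adjacent pixel pairs of a grayscale image (list of rows)."""
    x = y = kappa = 0
    n_pairs = 0
    for row in img:
        for r, s in zip(row, row[1:]):
            n_pairs += 1
            if (s % 2 == 0 and r < s) or (s % 2 == 1 and r > s):
                x += 1
            elif (s % 2 == 0 and r > s) or (s % 2 == 1 and r < s):
                y += 1
            if s // 2 == r // 2:          # both values in the same LSB pair
                kappa += 1
    if kappa == 0:
        raise ValueError("SPA failed because kappa = 0")
    # Quadratic (11.19): 2*kappa*beta^2 + 2*(2x - |P|)*beta + (y - x) = 0
    a = 2 * kappa
    b = 2 * (2 * x - n_pairs)
    c = y - x
    disc = b * b - 4 * a * c
    sq = math.sqrt(disc) if disc >= 0 else 0.0   # real part only, as in Algorithm 11.1
    return min((-b + sq) / (2 * a), (-b - sq) / (2 * a))
```

On a smooth synthetic cover (e.g., rows generated as a clipped random walk) the estimate should be near zero, and after flipping each pixel's LSB independently with probability β it should be near β.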

11.1.1 Experimental verification of SPA
In this section, we test SPA on images from two databases to demonstrate its
accuracy and illustrate how much the accuracy may depend on the cover-image
source.
Database RAW consists of 2567 raw, never-compressed, 24-bit color images
of different dimensions (all larger than 1 megapixel). The images were acquired
using 22 different digital cameras ranging from low-cost point-and-shoot cameras
to semi-professional SLR cameras. For the tests, the images were converted to
8-bit grayscale.
Database SCAN consists of 3000 high-resolution never-compressed 1500 ×
2100 scans of films in the 32-bit CMYK TIFF format downloaded from the
NRCS Photo Gallery (http://photogallery.nrcs.usda.gov). For the tests,
the images were converted to 8-bit grayscale and downsampled to 640 × 480
using bicubic resizing.
For each database, the cover images were embedded using LSB embedding
with random bit streams of three different relative payloads α = 0.05, 0.25, and
0.5 bpp, which correspond to expected change rates β = 0.025, 0.125, and 0.25.
Then, SPA was applied to all cover as well as stego images. The median es-
timated change rate and its sample median absolute error (MAE2 ) are shown
in Table 11.1. Notice that the estimator produced markedly better results on
digital-camera images than on scans. For scans, the estimator exhibits a positive
bias and a significantly larger MAE compared with the result from the database
of digital-camera images. This is due to the fact that the scans were much more
noisy than the digital-camera images because when scanning at high dpi the scan
captures the film grain, which casts a characteristic stochastic texture resembling
noise.
To further render the difference between the results on these two databases,
Figures 11.3 and 11.4 show the histogram of the estimated change rate for raw
digital-camera images and film scans. Besides the obvious fact that the results
for scans are much more scattered, note that the distribution is non-Gaussian,
asymmetric, and has thick tails. These properties seem to be characteristic for
change-rate estimators in general. The estimation error is a mixture of two er-
rors – the within-image error and the between-image error (see Section 10.3.2).
In this particular experiment, since the embeddings were done by fixing the pay-
load rather than the change rate, there is a third source of error caused by the
fact that, when embedding a fixed payload α, the actual change rate will not
be exactly β = α/2 but will fluctuate around this expected value, following a
binomial distribution [22].
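This third error source is easy to simulate. The following Python sketch (the image size and number of trials are hypothetical, not tied to the databases above) shows that embedding a fixed payload α produces a change rate that fluctuates binomially around α/2:

```python
import random

def realized_change_rate(n_pixels, alpha, rng):
    """Embed a fixed relative payload alpha (bpp) by LSB embedding along a
    pseudo-random path; each message bit changes its pixel with probability
    1/2, so the number of changes is Binomial(alpha * n_pixels, 1/2)."""
    n_bits = int(alpha * n_pixels)
    changes = sum(1 for _ in range(n_bits) if rng.random() < 0.5)
    return changes / n_pixels

rng = random.Random(42)
rates = [realized_change_rate(100_000, 0.05, rng) for _ in range(200)]
mean_rate = sum(rates) / len(rates)
# mean_rate fluctuates tightly around the expected change rate alpha/2 = 0.025
```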

11.1.2 Constructing a detector of LSB embedding using SPA


We now use the data obtained in the experiments to illustrate how one may
proceed with constructing an LSB detector in practice. The reader will see how
important for steganalysis the type of side-information about the steganographic

2 MAE is a more robust statistic than variance. Moreover, since we know from Section 10.3.2
that the error of quantitative steganalyzers often has thick tails matching Student’s t-
distribution, the variance may not even exist. Also, see discussion in Section A.1.

Table 11.1. Median estimated change rate and its median absolute error (MAE)
obtained using Sample Pairs Analysis for raw digital-camera images and film scans.

Median/MAE of β̂
Cover (β = 0) β = 0.025 β = 0.125 β = 0.25
RAW 0.0015/0.0076 0.0264/0.0074 0.1261/0.0062 0.2507/0.0050
SCAN 0.0130/0.0331 0.0373/0.0316 0.1347/0.0254 0.2575/0.0177


Figure 11.3 Histogram of estimated message length for images from Database RAW
embedded with α = 0.05 bpp.

channel available to Eve is. Let us assume that Eve knows that Alice and Bob
always embed relative payload α = 0.05 (change rate β ≈ 0.025) using LSB em-
bedding in randomly selected pixels. If, additionally, Eve knows that Alice and
Bob use digital-camera images, she uses the data from experiments on Database
RAW and draws an ROC using Algorithm 10.1. To avoid introducing any sys-
tematic errors due to a low number of images in the database, she divides the
database into two disjoint subsets D = D1 ∪ D2 and takes the cover features (the
features are the estimates β̂) only from images in D1 and stego features only from
images in D2 .
The ROC describes a class of detectors that differ by their threshold γ,

Decide “stego” if β̂ > γ. (11.20)

Different choices of γ correspond to different points on the ROC curve (see
Figure 11.5). From the graph, we have PD−1(1/2) ≈ 0.009 (see equation (10.16)).
Actually, Eve can select a point on the ROC curve (circle in Figure 11.5) that
will give her a higher PD ≈ 0.68 at the same PFA = PD−1(1/2). Also note that
the ROC curve is not concave. Since we know that ROCs for optimal detectors


Figure 11.4 Histogram of estimated message length for images from Database SCAN
embedded with α = 0.05 bpp.

must be concave (see Appendix D), we can further improve the detector’s ROC
by connecting the points (0, 0) and (u1 , PD (u1 )) (the origin and the square in
Figure 11.5), which corresponds to the following family of detectors Fu in the
interval u ∈ [0, u1 ]:


Fu (x) = { Fu1 (x)  with probability u/u1 ,
         { F0 (x)   with probability 1 − u/u1 .        (11.21)

Note that this detector would decrease the false-alarm rate at PD = 0.5 from
roughly 0.009 to 0.006. By imposing a bound on the false-alarm rate, Eve can
now select the threshold and use the appropriate detector in her eavesdropping.
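The randomized detector (11.21) is straightforward to implement. The Python sketch below uses synthetic Gaussian cover estimates purely for illustration (real estimation errors have thicker tails, as discussed above); the threshold gamma1 and the sample size are hypothetical:

```python
import random

def F_u(beta_hat, u, u1, gamma1, rng):
    """Randomized detector (11.21): with probability u/u1 apply F_u1
    (the threshold detector 'stego if beta_hat > gamma1'); otherwise
    apply F_0, which never decides 'stego'."""
    if rng.random() < u / u1:
        return beta_hat > gamma1
    return False

rng = random.Random(1)
# Hypothetical cover change-rate estimates (Gaussian only for illustration).
covers = [rng.gauss(0.0, 0.01) for _ in range(20000)]
gamma1, u1 = 0.015, 1.0
p1 = sum(b > gamma1 for b in covers) / len(covers)   # false-alarm rate of F_u1
pfa = sum(F_u(b, 0.5 * u1, u1, gamma1, rng) for b in covers) / len(covers)
# pfa is close to (u/u1) * p1 = 0.5 * p1: the false-alarm rate interpolates
# linearly, tracing the chord that makes the ROC concave.
```

Because F0 never raises an alarm, the randomization scales both PFA and PD by u/u1, which is exactly the straight line connecting the origin to the point (u1, PD(u1)).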
For comparison, we show in Figure 11.6 the ROC curve for detecting LSB
embedding at relative payload α = 0.05 in raw scans of film. Note that the per-
formance of the detector is markedly worse than for digital-camera images. The
ROC can be made concave using the same procedure as above.
We note that if Eve were facing a different steganographic channel where the
change rate follows a known distribution fβ , she could construct a detector in the
same way as above with one difference: the change rate in stego images, D2 , would
have to follow fβ instead of being fixed at 0.025. If Eve has no prior information
about the payload, she can still obtain the distribution of β̂ on cover images,
fit a parametric model, and fix the threshold on the basis of the bound on false
alarms. This time, however, Eve will not be able to determine the probability of
correctly detecting a stego image as stego as this depends on the distribution of
payloads.

Figure 11.5 ROC for the SPA detector distinguishing cover digital-camera images
(Database RAW) from the same images embedded using LSB embedding in randomly
selected pixels with α = 0.05. Because the ROC very quickly reaches PD = 1, we display
the curve only in the range PFA ≤ 0.05. The circle corresponds to the point with
PD = 0.68 with PFA < 0.01. The line connecting the origin and the square symbol is a
concave hull of the ROC that can be obtained using the detector at u = u1 and the
detector at u = 0.

11.1.3 SPA from the point of view of structural steganalysis


Sample Pairs Analysis can be interpreted within a more general framework called
structural steganalysis [63, 128, 133, 134]. This reformulation is appealing be-
cause it provides deeper insight into the inner workings of SPA and allows several
important generalizations that lead to more accurate steganalysis. In this sec-
tion, we describe the framework as it appeared in [128] and then discuss possible
avenues that can be taken while referring the reader to the literature.
Similar to SPA, we start by dividing the image into pairs of pixels and then
divide the pairs into three types of trace sets, C, E, O, parametrized by integer
index i,
Ci = {(r, s) ∈ P | ⌊r/2⌋ − ⌊s/2⌋ = i} , (11.22)
Ei = {(r, s) ∈ P | r − s = i, s even} , (11.23)
Oi = {(r, s) ∈ P | r − s = i, s odd} . (11.24)

The set Ci contains all pairs whose values differ by i after right-shifting their
binary representation (dividing by 2 and rounding down). There are four pos-
sibilities for (r, s) ∈ Ci : r − s = 2i, which covers two cases when either both r
and s are even or both are odd, or r − s = 2i − 1 (if r is even and s odd), or
r − s = 2i + 1 (if r is odd and s even). Thus, each trace set Ci can be written as

Figure 11.6 ROC curve for the SPA detector of relative payload α = 0.05 in raw scans of
film (Database SCAN). When comparing this figure with Figure 11.5, note the range of
the x axis.

a disjoint union of four trace subsets

Ci = E2i ∪ O2i−1 ∪ E2i+1 ∪ O2i . (11.25)
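The trace sets (11.22)–(11.24) and the decomposition (11.25) can be computed directly from the pixel pairs. A minimal Python sketch, using a small hypothetical list of pairs:

```python
def trace_sets(pairs):
    """Classify pixel pairs (r, s) into the trace sets of (11.22)-(11.24)."""
    C, E, O = {}, {}, {}
    for r, s in pairs:
        C.setdefault(r // 2 - s // 2, []).append((r, s))
        target = E if s % 2 == 0 else O
        target.setdefault(r - s, []).append((r, s))
    return C, E, O

# A hypothetical handful of pixel pairs:
pairs = [(10, 10), (11, 10), (10, 11), (13, 10), (12, 13), (15, 12)]
C, E, O = trace_sets(pairs)

# Disjoint union (11.25): |C_i| = |E_2i| + |O_2i-1| + |E_2i+1| + |O_2i|
for i, members in C.items():
    assert len(members) == (len(E.get(2 * i, [])) + len(O.get(2 * i - 1, []))
                            + len(E.get(2 * i + 1, [])) + len(O.get(2 * i, [])))
```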

Note that trace sets Ci are invariant with respect to LSB embedding because
the value of r/2 does not depend on the LSB of r. The four trace subsets
of Ci , however, are in general not invariant with respect to LSB embedding.
The transition diagram between the trace subsets, including the probabilities of
transition, is shown in Figure 11.7. As an example, we explain the transitions
from E2i . All pairs (r, s) from E2i have both r and s even. The probability that
a pair (r, s) ∈ E2i ends up again in E2i is the probability that neither r nor s
gets flipped, which is (1 − β)2 . The probability of the transition E2i → O2i is the
probability that both get flipped, which is β 2 . The probabilities of the remaining
two transitions, E2i → E2i+1 and E2i → O2i−1 , are equal to the probability that
one pixel gets flipped but not the other, which is β(1 − β).
The cardinality of each trace set after changing a random portion of β of
LSBs is a random variable with the following expected values derived from the
transition diagram:

⎛ E[|E′2i |]   ⎞   ⎛ b²  ab  ab  a² ⎞ ⎛ |E2i |   ⎞
⎜ E[|O′2i−1 |] ⎟   ⎜ ab  b²  a²  ab ⎟ ⎜ |O2i−1 | ⎟
⎜ E[|E′2i+1 |] ⎟ = ⎜ ab  a²  b²  ab ⎟ ⎜ |E2i+1 | ⎟ ,   (11.26)
⎝ E[|O′2i |]   ⎠   ⎝ a²  ab  ab  b² ⎠ ⎝ |O2i |   ⎠

where a = β, b = 1 − β. Here, we again use the prime to denote the sets obtained
from the stego image. For any 0 ≤ β < 1/2, the matrix is invertible and we can

[Diagram: states E2i , O2i−1 , E2i+1 , O2i ; each subset maps to itself with
probability (1 − β)², E2i ↔ O2i and E2i+1 ↔ O2i−1 with probability β², and
all remaining transitions occur with probability β(1 − β).]
Figure 11.7 Diagram of transitions between trace sets from Ci .

express the cardinalities of cover-image trace sets as


⎛ |E2i |   ⎞             ⎛  b²  −ab  −ab   a² ⎞ ⎛ E[|E′2i |]   ⎞
⎜ |O2i−1 | ⎟      1      ⎜ −ab   b²   a²  −ab ⎟ ⎜ E[|O′2i−1 |] ⎟
⎜ |E2i+1 | ⎟ = ───────── ⎜ −ab   a²   b²  −ab ⎟ ⎜ E[|E′2i+1 |] ⎟ .   (11.27)
⎝ |O2i |   ⎠   (b − a)²  ⎝  a²  −ab  −ab   b² ⎠ ⎝ E[|O′2i |]   ⎠

Thus, in theory, if we knew the change rate, we should be able to recover the
original cardinalities of cover trace sets by substituting the trace-set cardinalities
of the stego image, arguing that they should be close to their expected values
as long as the number of pixels in each trace set is large. Alternatively, if we
succeed in finding a condition that the trace-set cardinalities of covers must
satisfy, we obtain equation(s) for the unknown change rate β. In other words, we
are again following Strategy T2 from Chapter 10. By the same reasoning as in
the derivation of SPA, we expect to see in covers approximately the same number
of pairs with r − s = j independently of whether s is even or odd, or |Ej | ≈ |Oj |.
LSB embedding violates this condition only for odd values of j. Indeed, the
condition |E2i | = |O2i | implies E[|E′2i |] = E[|O′2i |], as can be easily verified from
the first and fourth equations of (11.26). The condition

|E2i+1 | = |O2i+1 | (11.28)

leads to the following quadratic equation for β obtained from the third equation
of (11.27) and the second equation of (11.27) written for O2i+1 rather than
O2i−1 :

−β(1 − β)|E′2i | + β² |O′2i−1 | + (1 − β)² |E′2i+1 | − β(1 − β)|O′2i |
= −β(1 − β)|E′2i+2 | + (1 − β)² |O′2i+1 | + β² |E′2i+3 | − β(1 − β)|O′2i+2 |,   (11.29)

which simplifies to

β² (|Ci | − |Ci+1 |)
  + β (|E′2i+2 | + |O′2i+2 | − 2|E′2i+1 | + 2|O′2i+1 | − |E′2i | − |O′2i |)
  + (|E′2i+1 | − |O′2i+1 |) = 0.   (11.30)

Here, we used the fact that |C′i | = |Ci | and |C′i | = |E′2i | + |O′2i−1 | + |E′2i+1 | + |O′2i |,
which follows from (11.25). The smaller root of this quadratic equation is the
change-rate estimate obtained from the ith trace set Ci .
At this point, we have multiple choices regarding how to aggregate the available
equations to estimate the change rate:
1. Sum all equations (11.30) for all indices i (or for some limited range, such
as |i| ≤ 50) and solve the resulting single quadratic equation. This option is
essentially the step taken in the generalized version of SPA as it appeared
in [63].
2. Solve (11.30) for some small values of |i|, e.g., |i| ≤ 2, obtaining individual
estimates β̂i , and estimate the final change rate as β̂ = mini β̂i . This choice
has been shown to provide more stable results compared with SPA [127].
3. Solve (11.28) in the least-squares sense [128, 163],

   β̂ = arg minβ Σi (|E2i+1 | − |O2i+1 |)² ,   (11.31)

where we substitute from (11.27) for the cardinalities of cover trace sets,
replacing the expected values with the observed cardinalities of stego image
trace sets.
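To illustrate option 2, the following Python sketch applies the expected transition (11.26) to hypothetical cover trace-subset counts satisfying the cover condition |E3 | = |O3 |, and then recovers the change rate as the smaller root of (11.30):

```python
import math

def expected_stego_counts(cover, beta):
    """Apply the transition matrix of (11.26) to the cover counts
    (|E_2i|, |O_2i-1|, |E_2i+1|, |O_2i|) of one trace set."""
    a, b = beta, 1 - beta
    M = [[b * b, a * b, a * b, a * a],
         [a * b, b * b, a * a, a * b],
         [a * b, a * a, b * b, a * b],
         [a * a, a * b, a * b, b * b]]
    return [sum(m * c for m, c in zip(row, cover)) for row in M]

def beta_from_trace_set(ci, ci1):
    """Smaller root of (11.30), given the stego trace-subset counts of
    C_i (ci) and C_{i+1} (ci1)."""
    E2i, O2im1, E2ip1, O2i = ci
    E2ip2, O2ip1, E2ip3, O2ip2 = ci1
    A = sum(ci) - sum(ci1)                       # |C_i| - |C_{i+1}|
    B = E2ip2 + O2ip2 - 2 * E2ip1 + 2 * O2ip1 - E2i - O2i
    C = E2ip1 - O2ip1
    d = math.sqrt(B * B - 4 * A * C)
    return min((-B - d) / (2 * A), (-B + d) / (2 * A))

# Hypothetical cover counts with |E_3| = |O_3| = 300, embedded at beta = 0.1:
c1 = expected_stego_counts([1000, 300, 300, 800], 0.1)   # trace set C_1
c2 = expected_stego_counts([600, 300, 200, 500], 0.1)    # trace set C_2
beta_hat = beta_from_trace_set(c1, c2)
```

With exact expected counts the quadratic vanishes at the true change rate, so beta_hat recovers 0.1; on real images the counts deviate from their expectations and the roots scatter accordingly.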
The above formulation of SPA allows direct extensions to groups of more than
two pixels (see the triples analysis and quadruples analysis by Ker [128, 131],
which was reported to provide more accurate change-rate estimates, especially
for short messages). An alternative extension of SPA to groups of multiple pixels
appeared in [61].
The least-square estimator (11.31) can be interpreted as a maximum-likelihood
estimator under the assumption that the differences between the cardinalities of
cover trace sets, |E2i+1 | − |O2i+1 |, are independent realizations of a Gaussian
random variable. In this sense, the least-square estimator is a step in the right
direction because it views the cover image as an entity unknown to the stegan-
alyzer and postulates a statistical model for it. The iid Gaussian assumption is,
however, clearly false because the cardinalities of trace sets decrease with increas-
ing value of |i| (trace sets with large values of |i| are sparsely populated) and thus
exhibit larger relative variations. It is not immediately clear what assumptions
can be imposed on the cover trace sets to derive a “more proper” maximum-
likelihood variant of SPA. An interesting solution was proposed by Ker [133],
who introduced the concept of a precover consisting of pixel pairs with precisely
|E2i+1 | + |O2i+1 | pairs of pixels differing by 2i + 1. A specific cover is then ob-
tained by randomly (uniformly) associating each pair with either E2i+1 or O2i+1 .

This assumption allows one to model the differences |E2i+1 | − |O2i+1 | as random
variables with a binomial distribution (or their Gaussian approximation) and de-
rive an appropriate maximum-likelihood estimator of the change rate. This ML
version of SPA has been shown to provide improved estimates for low embedding
rates.
We close this section with a brief mention of other related approaches to de-
tection of LSB embedding in the spatial domain. Methods based on statistics of
differences between pairs of neighboring pixels include [88, 155, 256]. Approaches
that use a classical signal-detection framework appeared in [40] and [55]. A qual-
itatively different method called the WS method [83] (Weighted Stego image)
that uses local image estimators is presented in Exercises 11.3 and 11.4. An
improved version of this method has been shown to produce some of the most
accurate results for detection of LSB embedding in the spatial domain [138].
The WS method for the JPEG domain appeared in [21]. Another advantage of
the WS method is that it is less sensitive than SPA to the assumption that the
message is embedded along a pseudo-random path and thus gives better results
when the embedding path is non-random or adaptive. Structural steganalysis
has also been extended to detection of embedding in two LSBs in [135, 253].

11.2 Pairs Analysis

Although Pairs Analysis can detect LSB embedding in grayscale and color
images in general, it was originally developed for quantitative steganalysis of
methods that embed messages in indices to a sorted palette using LSB em-
bedding. The EzStego algorithm of Chapter 5 is a typical example of such
schemes. Prior to embedding, EzStego first sorts the palette colors accord-
ing to their luminance and then reindexes the image data accordingly so that
the visual appearance of the image does not change. Then, the usual LSB
embedding in indices is applied. Besides EzStego, early versions of Steganos
(http://steganos.com) and Hide&Seek (ftp://ftp.funet.fi/pub/crypt/
steganography/hdsk41.zip) also employed a similar method.
Before describing Pairs Analysis, we introduce notation and analyze the impact
of EzStego embedding on cover images. Let c[i] = (r[i], g[i], b[i]), i = 0, . . . , 255
be the 256 colors from the sorted palette. During LSB embedding, each color can
be changed only into the other color from the same color pair {c[2k], c[2k + 1]},
k = 0, . . . , 127. For a fixed k, we extract the colors c[2k] and c[2k + 1] from the
whole image, for example by scanning it by rows (Figure 11.8). This sequence
of colors can be converted to a binary vector of the same length by associating
a “0” with c[2k] and a “1” with c[2k + 1]. This binary vector will be called a
color cut for the pair {c[2k], c[2k + 1]} and will be denoted Z(c[2k], c[2k + 1]).
Because palette images have a small number of colors and natural images contain
macroscopic structure, Z is more likely to exhibit long runs of zeros or ones rather
than some random pattern. The embedding process will disturb this structure

8 4 5 0
5 8 9 5
0 4 4 5          Z( 4 , 5 ) = ( 0 1 1 1 0 0 1 0 )
7 3 6 4

Figure 11.8 An example of a color cut. The pixels shaded in gray represent two colors
from the same LSB pair. Crossed-out pixels in white represent the remaining colors.
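A color cut is easy to extract; the sketch below reproduces the example of Figure 11.8:

```python
def color_cut(indices, c_even, c_odd):
    """Z(c_even, c_odd): scan the (row-flattened) palette-index array,
    keep only pixels with one of the two indices, and map
    c_even -> 0, c_odd -> 1."""
    return [0 if v == c_even else 1
            for v in indices if v in (c_even, c_odd)]

# The 4x4 image of Figure 11.8, scanned by rows:
image = [8, 4, 5, 0,
         5, 8, 9, 5,
         0, 4, 4, 5,
         7, 3, 6, 4]
Z = color_cut(image, 4, 5)
# Z == [0, 1, 1, 1, 0, 0, 1, 0], as in the figure
```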

and increase the entropy of Z. Finally, when the maximal-length message has
been embedded in the cover image (1 bit per pixel), Z will be a random binary
sequence and the entropy of Z will be maximal.
Let us now take a look at what happens during embedding to color cuts for
the “shifted” color pairs {c[2k − 1], c[2k]}, k = 1, . . . , 127. During embedding,
the colors c[2k − 2] and c[2k − 1] are exchanged for each other and so are the
colors c[2k] and c[2k + 1]. Even after embedding the maximal message (each pixel
modified with probability 12 ), the color cut Z(c[2k − 1], c[2k]) will still show some
residual structure. To see this, imagine a binary sequence W that was formed
from the cover image by scanning it by rows and associating a “0” with the colors
c[2k − 2] and c[2k − 1] and a “1” with the colors c[2k] and c[2k + 1]. Convince
yourself that, after embedding a maximal pseudo-random message in the image,
the color cut Z(c[2k − 1], c[2k]) is the same as starting with the sequence W and
skipping each element of W with probability 1/2. Because W showed structure in
the cover image, most likely long runs of 0s and 1s, we see that randomly chosen
subsequences of W will show some residual structure as well.
We are now ready to describe the steganalysis method. Denoting the concate-
nation of bit strings with the symbol “&,” we first concatenate all color cuts
Z(c[2k], c[2k + 1]) into one vector
Z = Z(c[0], c[1])& . . . &Z(c[254], c[255]), (11.32)
and all color cuts for shifted pairs Z(c[2k − 1], c[2k]) into
Z′ = Z(c[1], c[2])& . . . &Z(c[253], c[254])&Z(c[255], c[0]). (11.33)
Next, we define a simple measure of structure in a binary vector as the number of
homogeneous bit pairs 00 and 11 in the vector. For example, a vector of n 1s will
have n − 1 homogeneous bit pairs. We denote by R(β) the expected relative3
number of homogeneous bit pairs in Z after flipping the LSBs of indices of a
fraction β of randomly chosen pixels, 0 ≤ β ≤ 1. We recognize β as the change
rate. Similarly, let R′(β) be the expected relative number of homogeneous bit
pairs in Z′. For β < 1/2, this change rate corresponds to relative payload α = 2β.
Exercise 11.1 shows that R(x) is a quadratic polynomial with its vertex at
x = 1/2 and R(1/2) = (n − 1)/(2n) ≈ 1/2 (see Figure 11.9). The value R(β) is known

3 By relative, we mean the number of pairs normalized by n.



Figure 11.9 Expected number of homogeneous pairs R(x) and R′(x) in color cuts Z and
Z′ as a function of the change rate x. The circles correspond to y values that can be
obtained from the stego image with unknown change rate β. The values R(0) and R′(0)
are not known but satisfy R(0) = R′(0).

as it can be calculated from the stego image. Note that R(β) = R(1 − β) is also
known.
The value of R′(1/2) can be derived from Z′ (see Exercise 11.2), while R′(β) and
R′(1 − β) can be calculated from the stego image and the stego image with all
colors flipped, respectively. Modeling R′(x) as a second-degree polynomial, the
difference D(x) = R(x) − R′(x) = Ax² + Bx + C is also a second-degree polyno-
mial.
Finally, we accept one additional assumption,

R(0) = R′(0), (11.34)

which says that the number of homogeneous pairs in Z and Z′ must be the same
if no message has been embedded. This is, indeed, intuitive because there is no
reason why the color cuts of pairs and shifted pairs in the cover image should
have different structures.
In summary, we know the four values

D(0) = R(0) − R′(0) = 0, (11.35)

D(1/2) = R(1/2) − R′(1/2), (11.36)

D(β) = R(β) − R′(β), (11.37)

D(1 − β) = R(1 − β) − R′(1 − β), (11.38)

which gives us four equations for four unknowns A, B, C, β:

C = 0, (11.39)
4D(1/2) = A + 2B, (11.40)
D(β) = Aβ 2 + Bβ, (11.41)
D(1 − β) = A(1 − β)2 + B(1 − β). (11.42)

It can easily be verified that

β (D(β) − D(1 − β) + 4D(1/2)) = 2Aβ² + 2Bβ² + Bβ (11.43)
                              = D(β) + 4D(1/2)β², (11.44)

which is a quadratic equation for β,

4D(1/2)β 2 − (D(β) − D(1 − β) + 4D(1/2)) β + D(β) = 0. (11.45)

The smaller of the two roots is our approximation to the unknown change rate
β. The pseudo-code for Pairs Analysis is shown in Algorithm 11.2.
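The core of the method, solving the quadratic equation (11.45), takes only a few lines of Python; the parabola coefficients and the change rate below are hypothetical, chosen only to check that the solver returns the root β:

```python
import math

def pairs_analysis_root(D_b, D_1mb, D_half):
    """Smaller root of (11.45):
    4 D(1/2) x^2 - (D(beta) - D(1-beta) + 4 D(1/2)) x + D(beta) = 0."""
    a = 4 * D_half
    b = -(D_b - D_1mb + a)
    c = D_b
    d = math.sqrt(max(b * b - 4 * a * c, 0.0))
    return min((-b - d) / (2 * a), (-b + d) / (2 * a))

# Consistency check on an exact parabola D(x) = A x^2 + B x (C = 0)
# with hypothetical coefficients and true change rate beta = 0.12:
A, B, beta = -0.4, 0.5, 0.12
D = lambda x: A * x * x + B * x
beta_hat = pairs_analysis_root(D(beta), D(1 - beta), D(0.5))
```

By the identity (11.43)–(11.45), β is always a root of the quadratic when D is exactly of this form, and the second root lies above 1/2, so taking the smaller root recovers the change rate.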

11.2.1 Experimental verification of Pairs Analysis


The performance of Pairs Analysis is demonstrated using experiments on a
database of 180 color GIF images. The images were originally stored in a high-
quality JPEG format and came from four different digital cameras. For the test,
the images were resampled to 800 × 600 pixels using Corel Photo-Paint 9 (with
the anti-alias option) and converted to palette images with the following op-
tions: optimized palette, ordered dithering. All images were embedded using the
EzStego algorithm with pseudo-random message spread with relative payload
α = 0, 0.2, 0.4, 0.6, 0.8, 1 and then processed using Pairs Analysis. The results
are shown in Figure 11.11.
Pairs Analysis can be further improved by scanning the image along a space-
filling curve rather than in a row-by-row manner [243]. The Hilbert scan (Fig-
ure 11.10) is an example of a scanning order that becomes in a limit a space-filling
curve [184]. Because this scanning order is more likely to capture uniform seg-
ments in the image than the simple raster order, the corresponding color cuts
have a higher number of homogeneous pairs, which translates into slightly higher
detection accuracy.

11.3 Targeted attack on F5 using calibration

In this section, we illustrate Strategy T3 for design of targeted steganalysis (from


Section 10.3) by describing an attack on the F5 algorithm (see the description of
the algorithm in Section 7.3.2). From Exercise 7.2, we know that the histogram
of absolute values of DCT coefficients hβ(kl) corresponding to DCT mode (k, l) in

Algorithm 11.2 Pairs Analysis. γ is a threshold on the test statistic β̂ set to
achieve PFA < εFA , where εFA is the bound on the false-alarm rate.
// Arrange all pixels by scanning the image along
// a continuous path into a 1d vector v[i], i = 1, . . . , M × N
vflip = LSBflip(v); // flip LSBs of all stego elements
Z = ∅; Z′ = ∅; Z″ = ∅;
for k = 0 to 127 Z = Z&ColorCut(v, 2k, 2k + 1);
for k = 0 to 126 {
    Z′ = Z′&ColorCut(v, 2k + 1, 2k + 2);
    Z″ = Z″&ColorCut(vflip, 2k + 1, 2k + 2);
}
Z′ = Z′&ColorCut(v, 255, 0); Z″ = Z″&ColorCut(vflip, 255, 0);
D(β) = CountHomog(Z) − CountHomog(Z′); // = R(β) − R′(β)
D(1 − β) = CountHomog(Z) − CountHomog(Z″); // uses R(1 − β) = R(β)
D(1/2) = 1/2 − R′(1/2); // compute R′(1/2) from Exercise 11.2
a = 4D(1/2); b = D(1 − β) − D(β) − a; c = D(β);
β± = Re((−b ± √(b² − 4ac))/(2a)); β̂ = min(β+ , β− );
if β̂ > γ {
    output(’Image is stego’);
    output(’Estimated change rate = ’, β̂);
}
function: y = CountHomog(b)
y = 0; nb = length(b);
for i = 2 to nb {
    y = y + b[i]b[i − 1] + (1 − b[i])(1 − b[i − 1]);
}
y = y/nb ;
function: Z = ColorCut(x, c1 , c2 ) // map each kept index to its LSB
j = 1; nx = length(x);
for i = 1 to nx {
    if (x[i] = c1 or x[i] = c2 ) {
        Z[j] = x[i] mod 2; j = j + 1;
    }
}

an F5-embedded stego image satisfies

E[hβ(kl)[i]] = (1 − β)h(kl)[i] + βh(kl)[i + 1], i > 0, (11.46)

E[hβ(kl)[0]] = h(kl)[0] + βh(kl)[1], (11.47)

where β is the embedding change rate. Because F5 preserves the symmetry of


the histograms, we cannot use the same approach as in attacking Jsteg. Instead,

Figure 11.10 Pixel-scanning order along a Hilbert curve for a 32 × 32 image.

Figure 11.11 Estimated payload (α = 2β) from 180 GIF images embedded using EzStego
with relative payloads α = 0, 0.2, 0.4, 0.6, 0.8, 1. The straight lines mark the true expected
change rate.

the histogram of absolute values of DCT coefficients for the cover image, h0(kl),
is estimated from the stego image using calibration as described in Section 10.3.
Denoting the estimated histograms of absolute values of DCT coefficients cor-
responding to spatial frequency (k, l) as ĥ0(kl), the change rate can be estimated
from equations (11.46)–(11.47), where we substitute ĥ0(kl) for the cover-image

histograms and replace the expected values with the sample values

hβ(kl)[i] = (1 − β)ĥ0(kl)[i] + β ĥ0(kl)[i + 1], i > 0, (11.48)

hβ(kl)[0] = ĥ0(kl)[0] + β ĥ0(kl)[1]. (11.49)
This is a system of linear equations for various values of (k, l) and i for just one
unknown – the change rate β. Not all values (k, l) and i, however, should be used.
The histograms are in general less populated for higher spatial frequencies and
higher values of i. Again, we have several possibilities for how to aggregate the
equations to obtain the change-rate estimate (see Section 11.1.3). For brevity,
here we present only the approach proposed in [86].
The steganalyst is advised to obtain three least-squares estimates β̂01 , β̂10 , β̂11
from histograms with (k, l) ∈ {(0, 1), (1, 0), (1, 1)} and i = 0, 1,

β̂kl = arg minβ (hβ(kl)[0] − ĥ0(kl)[0] − β ĥ0(kl)[1])²
             + (hβ(kl)[1] − (1 − β)ĥ0(kl)[1] − β ĥ0(kl)[2])², (11.50)

which can be solved analytically because the function that is minimized is a
quadratic polynomial in β,

β̂kl = [ĥ0(kl)[1](hβ(kl)[0] − ĥ0(kl)[0]) + (hβ(kl)[1] − ĥ0(kl)[1])(ĥ0(kl)[2] − ĥ0(kl)[1])]
      / [(ĥ0(kl)[1])² + (ĥ0(kl)[2] − ĥ0(kl)[1])²]. (11.51)
The final estimate β̂ is obtained as

β̂ = (β̂01 + β̂10 + β̂11)/3. (11.52)
The performance of this estimator is evaluated in [87].
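The closed form (11.51) is a one-line computation. In the Python sketch below, a hypothetical histogram h0 plays the role of the calibrated histogram ĥ0(kl), and the stego histogram is generated exactly according to the model (11.48)–(11.49) with β = 0.08, so the estimate is recovered exactly:

```python
def f5_beta_estimate(h_stego, h_cal):
    """Closed-form least-squares estimate (11.51) for one DCT mode;
    h_stego and h_cal are the stego and calibrated (estimated cover)
    histograms of absolute DCT coefficient values."""
    num = (h_cal[1] * (h_stego[0] - h_cal[0])
           + (h_stego[1] - h_cal[1]) * (h_cal[2] - h_cal[1]))
    den = h_cal[1] ** 2 + (h_cal[2] - h_cal[1]) ** 2
    return num / den

h0 = [5000.0, 3000.0, 1500.0]   # hypothetical calibrated histogram
beta = 0.08
# Stego histogram built exactly from (11.48)-(11.49); the estimator
# only needs bins 0 and 1:
h_stego = [h0[0] + beta * h0[1], (1 - beta) * h0[1] + beta * h0[2]]
beta_hat = f5_beta_estimate(h_stego, h0)
```

On real images the calibrated histogram only approximates the cover histogram, so β̂kl scatters around β, which is why (11.52) averages over three DCT modes.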

11.4 Targeted attacks on ±1 embedding

The main reason why LSB embedding in the spatial domain can be detected
very reliably is the non-symmetrical character of this embedding operation. By
symmetrizing the embedding operation to “add or subtract 1 at random” (±1
embedding) instead of flipping the LSB, a majority of accurate attacks on LSB
embedding is thwarted. In this section, we show some targeted attacks on ±1
embedding, which are, in fact, applicable to more general embedding paradigms
in the spatial domain as long as the impact of embedding the message can be
described as adding iid noise to the image (e.g., stochastic modulation from
Section 7.2.1).
The first attempts to construct steganalytic methods for detection of embed-
ding by noise adding appeared in [39, 40, 242]. We now describe the approach
proposed in [107] and its extension [129, 130].


Figure 11.12 Adding stego noise to an image smoothens its histogram.

Any steganographic scheme that embeds messages by adding independent
noise to the cover image will smooth the image histogram (see Figure 11.12).
This is because the stego image can be viewed as a sum of two independent
random variables – the cover image and the stego noise. Since the probability
mass function of the sum of two independent random variables is the convolution
of their probability mass functions, the stego image histogram, hs , is a low-pass
version of the cover-image histogram, h,

hs = h  f , (11.53)

or

hs [i] = h[j]f [i − j], (11.54)
j

where the indices i, j run over the index set determined by the number of colors
in the histogram (e.g., for 8-bit grayscale images, i, j ∈ {0, . . . , 255}). In (11.53),
f is the probability mass function of the stego noise. The specific form of this
convolution for ±1 embedding has been derived in Exercise 7.1.
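For ±1 embedding at change rate β, the stego-noise pmf is f[−1] = f[1] = β/2 and f[0] = 1 − β, and the convolution (11.54) can be sketched as follows (toy 5-bin histogram; mass falling outside the range is simply dropped):

```python
def stego_histogram(h, beta):
    """Expected histogram after ±1 embedding at change rate beta:
    convolution (11.54) of h with the stego-noise pmf
    f = {-1: beta/2, 0: 1 - beta, +1: beta/2}."""
    f = {-1: beta / 2, 0: 1 - beta, 1: beta / 2}
    hs = [0.0] * len(h)
    for i in range(len(h)):
        for d, p in f.items():
            j = i - d
            if 0 <= j < len(h):
                hs[i] += h[j] * p
    return hs

h = [0, 100, 400, 100, 0]       # a peaky toy 'cover' histogram
hs = stego_histogram(h, 0.5)
# hs == [25.0, 150.0, 250.0, 150.0, 25.0]: a smoothed version of h
```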
This observation gives us an idea for deriving useful features for steganalysis
by analyzing the histogram smoothness. Due to the low-pass character of the
convolution, hs will be smoother than h and thus its energy will be concentrated
in lower frequencies. This can be captured by switching to the Fourier repre-
sentation of the histograms and the noise pmf. For clarity, in this section we
will denote Fourier-transformed quantities of all variables with the correspond-
ing capital letters. The Discrete Fourier Transform (DFT) of an N -dimensional

vector x is defined as

X[k] = Σ_{j=0}^{N−1} x[j] e^{−i 2πjk/N} , (11.55)

where i in (11.55) stands for the imaginary unit (i² = −1). The Fourier transform
of the stego-image histogram is obtained as an elementwise multiplication of the
cover-image histogram and the noise pmf,
Hs [k] = H[k]F[k] for each k. (11.56)
The function Hs is called the Histogram Characteristic Function (HCF) of the
stego image. At this point, a numerical quantity is needed that can be computed
from the HCF and that would evaluate the location of the energy in the spec-
trum. Because the absolute value of the DFT is symmetrical about the midpoint
value k = N/2, a reasonable measure of the energy distribution is the Center Of
Gravity (COG) of |H| computed for indices k = 0, . . . , N/2 − 1,
COG(H) = Σ_{k=0}^{N/2−1} k|H[k]| / Σ_{k=0}^{N/2−1} |H[k]| . (11.57)
It can be shown using the Chebyshev sum-inequality (see Exercise 11.5) that as
long as |F[k]| is non-increasing,4
COG(Hs ) ≤ COG(H), (11.58)
which can be intuitively expected because the stego image histogram is smoother
and thus the energy of the HCF will shift towards lower frequencies (see Fig-
ure 11.13).
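The chain (11.55)–(11.58) can be checked numerically. The sketch below uses a naive DFT and a toy 16-bin histogram; the smoothing kernel is the ±1-embedding pmf at β = 1/2, applied circularly so that (11.56) holds exactly:

```python
import cmath

def dft(x):
    """Naive DFT (11.55); adequate for short histograms."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

def cog(h):
    """Center of gravity (11.57) of |H[k]| over k = 0, ..., N/2 - 1."""
    mags = [abs(v) for v in dft(h)[:len(h) // 2]]
    return sum(k * m for k, m in enumerate(mags)) / sum(mags)

def pm1_smooth(h):
    """Circular convolution with the ±1-embedding pmf at beta = 1/2,
    so that Hs[k] = H[k] F[k] exactly as in (11.56)."""
    n = len(h)
    return [0.25 * h[(i - 1) % n] + 0.5 * h[i] + 0.25 * h[(i + 1) % n]
            for i in range(n)]

h = [0, 0, 0, 0, 0, 10, 80, 10, 0, 0, 0, 0, 0, 0, 0, 0]  # toy histogram
cog_cover, cog_stego = cog(h), cog(pm1_smooth(h))
# cog_stego < cog_cover, in line with (11.58)
```

Here F[k] = 1/2 + (1/2)cos(2πk/N) is non-negative and non-increasing on the first half of the spectrum, so the hypothesis of (11.58) is satisfied.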
For steganalysis of color images, we will use the three-dimensional color his-
togram, h[j1 , j2 , j3 ], which denotes the number of pixels with their RGB color
(j1 , j2 , j3 ). Furthermore, the one-dimensional DFT (11.55) is replaced with its
three-dimensional version

H[k1 , k2 , k3 ] = Σ_{j1,j2,j3=0}^{N−1} h[j1 , j2 , j3 ] e^{−i 2π(j1 k1 + j2 k2 + j3 k3)/N} . (11.59)

Due to the symmetry of the three-dimensional DFT, we constrain ourselves to
the first octant, k1 , k2 , k3 ≥ 0, and compute the three-dimensional COG with its
mth coordinate, m = 1, 2, 3,

COG(H)[m] = Σ_{k1,k2,k3=0}^{N/2−1} km |H[k1 , k2 , k3 ]| / Σ_{k1,k2,k3=0}^{N/2−1} |H[k1 , k2 , k3 ]| . (11.60)
The center of gravity (11.57) and (11.60) can be used directly as a feature for
steganalysis. For color images with a low level of noise, such as decompressed

4 Many known noise distributions, such as the Gaussian or Laplacian distributions, have mono-
tonically decreasing |F[k]|.


Figure 11.13 Stego image HCF falls off to zero faster because the stego image histogram
is smoother.

JPEGs or professionally enhanced images, it is possible to identify a fixed
threshold separating the COG of cover and fully embedded stego images relatively
well [107]. Figure 11.14 (top) shows the COG of 186 cover and stego images
embedded with change rate β = 0.25 with ±1 embedding (the change rate is
computed with respect to 3n, where n is the number of pixels in the image).
The images for this test came from the test database of 954 images described in
Section 5.1.2. They were taken by three different cameras in an uncompressed
format at their native resolution and then converted to true-color BMP images
and JPEG compressed with quality factor 75. All images were decompressed
back to the spatial domain and cropped to their central 800 × 600 region. To
obtain a scalar value that can be easily compared, the figure shows the average

(COG[1] + COG[2] + COG[3])/3. (11.61)
Since the three-dimensional COGs lie very close to the axis of the first octant,
the averaging does not affect the separability of the statistic between cover and
stego images in any significant manner.
Figure 11.14 (bottom) shows the result of the exact same experiment per-
formed directly on the raw images rather than their JPEG compressed form.
Observe that, even though the embedding does decrease the value of the COG,
it is no longer possible to separate cover and stego images because the COG
of covers varies too much. Surprisingly, the steganalysis of images from a Kodak DC290 camera still works. This is because this camera seems to suppress
the noise in images as part of its in-camera processing chain. Thus, subsequent
steganographic embedding is more detectable.
[Figure 11.14: COG vs. camera (Canon G2, Canon PS S40, Kodak DC290) for cover and stego images, two panels. The COG (11.61) for color cover and stego images embedded using ±1 embedding in the spatial domain with change rate β = 0.25. The bottom graph was generated from raw, never-compressed color images from three different digital cameras, while the top graph corresponds to their JPEG compressed versions (quality factor 75). All images were cropped to their central 800 × 600 portion.]
In general, the performance of steganalysis based on COG of the HCF varies greatly with the cover source and it is not possible to identify a universal threshold that would separate cover and stego images well. The detection becomes even
worse for grayscale images. We obviously need a way to calibrate the COG to
remove its dependence on cover source and content.
It has been proposed in [130] to estimate the COG of the cover image by
resizing the stego image to half its size by averaging values in groups of 2 × 2
pixels. It is hoped that the resizing will produce just the right amount of blurring
to enable approximate estimation of the COG of the cover image HCF. Given
a stego image y[i, j], i = 1, . . . , M, j = 1, . . . , N, the downsampled stego image ŷ[k, l], k = 1, . . . , M/2, l = 1, . . . , N/2, is obtained as

$$\hat{y}[k,l] = \frac{1}{4}\bigl(y[2k,2l] + y[2k+1,2l] + y[2k,2l+1] + y[2k+1,2l+1]\bigr). \qquad (11.62)$$
Denoting by Ĥ and COG(Ĥ) the HCF and its COG for the resized image, ŷ,
the ratio

$$\frac{\mathrm{COG}(\mathbf{H})}{\mathrm{COG}(\hat{\mathbf{H}})} \qquad (11.63)$$
is taken as the calibrated COG feature for steganalysis. For certain cover sources,
the calibrated COGs for cover and stego images are better separated than
COG(Hs ) and COG(H) because dividing by the estimated COG decreases
image-to-image variations [130].
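The calibration by resizing can be sketched as follows (illustrative code under the assumption of a 0-indexed grayscale image stored as a list of rows; the function names are ours):

```python
import cmath

def cog_of_hist(hist):
    # COG of the HCF magnitude over frequencies 0 .. N/2 - 1 (direct DFT).
    N = len(hist)
    mags = [abs(sum(h * cmath.exp(-2j * cmath.pi * k * x / N)
                    for x, h in enumerate(hist)))
            for k in range(N // 2)]
    return sum(k * m for k, m in enumerate(mags)) / sum(mags)

def hist256(img):
    # Sample histogram of an 8-bit grayscale image.
    h = [0] * 256
    for row in img:
        for p in row:
            h[p] += 1
    return h

def downsample(img):
    # 2x2 block averaging, the 0-indexed analog of (11.62).
    M, N = len(img), len(img[0])
    return [[(img[2*k][2*l] + img[2*k+1][2*l] +
              img[2*k][2*l+1] + img[2*k+1][2*l+1]) // 4
             for l in range(N // 2)] for k in range(M // 2)]

def calibrated_cog(img):
    # Ratio (11.63): COG of the stego image over COG of its resized version.
    return cog_of_hist(hist256(img)) / cog_of_hist(hist256(downsample(img)))
```

Dividing by the COG of the blurred, resized image removes much of the image-to-image variation, which is what makes the ratio usable as a feature.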
For grayscale images, the performance can be improved also by working with
a better model than the pixels’ sample pmf. It is particularly appealing to work
with a higher-order statistical model. Denoting the set of all pairs of horizontally
adjacent pixels in the image as P, we represent the image with an adjacency
histogram, also called a co-occurrence matrix,
$$t[i,j] = \bigl|\{(x,y) \in \mathcal{P} \mid x = i,\ y = j\}\bigr|. \qquad (11.64)$$
In other words, t[i, j] is defined as the number of horizontally neighboring pixel
pairs with grayscales i and j. This matrix is sparser than the histogram (because there are 256² possible pairs) and thus reacts more sensitively to embedding
changes. Because of local correlations present in natural images, the adjacency
histogram has the largest values on the diagonal and then it quickly falls off
(see Figure 11.1). The HCF of the adjacency histogram is now two-dimensional,
T[k, l], and can be used for steganalysis in the same way as the HCF above. The
author in [130] used an alternative scalar quantity defined as
$$\mathrm{COG}(\mathbf{T}) = \frac{\sum_{k,l=0}^{N/2-1} (k+l)\,|\mathbf{T}[k,l]|}{\sum_{k,l=0}^{N/2-1} |\mathbf{T}[k,l]|}. \qquad (11.65)$$
Figure 11.15 shows the improvement in detection of ±1 embedding in grayscale
decompressed JPEG images by computing the COG from the adjacency histogram (11.65) rather than from the histogram itself (11.57). The experiment was performed on the same set of images coming from three digital cameras and cropped to their central 800 × 600 portion. Note that while the COG computed from the HCF does not have any significant distinguishing power (top), the
COG of the adjacency histogram seems to work reasonably well on this image set
(bottom). The reader is referred to the original publication [129, 130] for a more
detailed discussion of experimental results of the above methods when tested
on various cover sources. Further extension of this approach appears in [159],
where the authors proposed to work with differences of neighboring pixels rather
than the pixel values directly. Overall, the performance of steganalysis based on
[Figure 11.15: COG vs. camera (Canon G2, Canon PS S40, Kodak DC290) for cover and stego images, two panels. Top: The COG (11.57) for the same images as in Figure 11.14 converted to grayscale and JPEG compressed using quality factor 75. The embedding change rate was β = 0.25 (w.r.t. the number of pixels). Bottom: The COG (11.65) computed from the adjacency histogram for the same grayscale JPEG images.]
HCF appears to vary considerably across different image sources. This is true in
general for most spatial-domain steganalyzers, including the blind constructions
explained in Section 12.5 [37, 139].
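To illustrate the adjacency histogram (11.64) and its scalar COG (11.65), the following sketch (our own code) quantizes pixels to 16 gray levels so that the direct two-dimensional DFT stays cheap; a real detector would work with all 256 levels and an FFT:

```python
import cmath

def adjacency_hist(img, levels=16):
    # Co-occurrence matrix t[i, j]: counts of horizontally adjacent pairs.
    t = [[0] * levels for _ in range(levels)]
    for row in img:
        for a, b in zip(row, row[1:]):
            t[a][b] += 1
    return t

def cog2d(t):
    # Scalar COG (11.65) of the 2-D HCF T[k, l], frequencies 0 <= k, l < n/2.
    n = len(t)
    num = den = 0.0
    for k in range(n // 2):
        for l in range(n // 2):
            T = sum(t[i][j] * cmath.exp(-2j * cmath.pi * (k * i + l * j) / n)
                    for i in range(n) for j in range(n))
            num += (k + l) * abs(T)
            den += abs(T)
    return num / den
```

As with the one-dimensional HCF, embedding changes spread the adjacency histogram away from its diagonal, which suppresses high frequencies of T[k, l] and lowers the COG.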
Other methods for detection of noise adding and ±1 embedding include methods based on signal estimation [55, 220, 249], histogram artifacts [37, 38], blind
steganalyzers [10, 252], and blind steganalysis methods described in Section 12.5.

Summary

• LSB embedding in the spatial domain can be reliably detected even at low relative payloads due to the asymmetry of the embedding operation.
• Sample Pairs Analysis is an example of an attack on LSB embedding in pseudo-randomly spread pixels. It works by analyzing the embedding impact on subsets of pairs of neighboring pixels.
• Structural steganalysis is a reformulation of SPA using the concept of trace sets. This formulation makes it possible to derive detectors that can incorporate statistical assumptions about the cover image and provide a convenient framework for generalizing SPA to work with groups of more than two pixels.
• Pairs Analysis is another method for detection of LSB embedding that is based on a different principle by considering the spatial distribution of colors from each LSB pair in the entire image. It is especially suitable for attacking LSB embedding in palette image formats.
• The steganographic algorithm F5 can be attacked using calibration by first quantifying the relationship between the histograms of stego and cover images and then estimating the cover-image histogram using calibration.
• ±1 embedding in the spatial domain can be detected using the center of gravity of the histogram characteristic function (absolute value of the Fourier transform of the histogram) as the feature. This is because adding an independent stego noise to the cover image smooths the histogram. The accuracy of this method can be improved by considering the adjacency histogram and by applying calibration (resampling the stego image).

Exercises

11.1 [Pairs analysis I] Prove that the expected value of the relative number of homogeneous pairs R(β) for color cut Z for change rate β ∈ [0, 1] is a parabola with its minimum at β = 1/2. In particular, R(1/2) = (n − 1)/(2n) ≈ 1/2 for large n, where n is the length of the color cut (number of pixels in the image).
Hint: Write the color cut Z of the cover image as a concatenation of r segments consisting of consecutive runs of 0s or 1s of lengths k_1, . . . , k_r, k_1 + · · · + k_r = n. Thus, R(0) = Σ_{i=1}^{r} (k_i − 1) = n − r. After changing the LSB of a random portion of β pixels, the probability that a homogeneous pair of consecutive bits will stay homogeneous is β² + (1 − β)². The expected number of homogeneous pairs in the ith segment is thus (β² + (1 − β)²)(k_i − 1) + 2β(1 − β), where the last term comes from the right end of the segment (an additional pair will be formed at the boundary if the last bit in the segment flips and the first bit of the next segment does not flip, or vice versa). This last term is missing from the last segment.

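The claim can also be checked numerically; the sketch below (our own code) flips each bit independently with probability β, which approximates flipping a fixed fraction β of the pixels:

```python
import random

def homogeneous_fraction(bits):
    # Relative number R of homogeneous (equal) adjacent pairs in a color cut.
    return sum(a == b for a, b in zip(bits, bits[1:])) / len(bits)

def expected_R(bits, beta, trials=200, seed=7):
    # Monte Carlo estimate of E[R(beta)] under independent LSB flips.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        flipped = [b ^ (rng.random() < beta) for b in bits]
        total += homogeneous_fraction(flipped)
    return total / trials
```

For a color cut made of long runs, E[R] decreases from R(0) toward its minimum (n − 1)/(2n) ≈ 1/2 at β = 1/2, in agreement with the parabola.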
11.2 [Pairs analysis II] Let Z′ = {b[i]}_{i=1}^{n} be the color cut for the shifted color pairs and R′(β) be the number of homogeneous pairs after flipping a portion
β of pixels. Prove that the expected value of R′(1/2) is

$$\mathrm{E}\!\left[R'\!\left(\tfrac{1}{2}\right)\right] = \sum_{k=1}^{n-1} 2^{-k} h_k, \qquad (11.66)$$

where h_k is the number of homogeneous pairs in the sequence of pairs

(b[1], b[1 + k]), (b[2], b[2 + k]), (b[3], b[3 + k]), . . . , (b[n − k], b[n]). (11.67)
Hint: Let W = W_1 & · · · & W_128 be the concatenation of binary sequences W_j formed from the stego image by scanning it by rows and associating a 0 with the colors c[2j − 2] and c[2j − 1] and a 1 with the colors c[2j] and c[2j + 1]. The expected form of the color cut Z′(c[2j − 1], c[2j]) after embedding a maximal message in the cover image is the same as starting with the sequence W_j and skipping each element of W_j with probability 1/2. Imagine you are going through Z′ while skipping each element with probability 1/2. Then, the probability of skipping exactly k − 1 elements in a row is 2^{−k}, k = 1, 2, . . .. Because there are h_k homogeneous pairs in the sequence of pairs b[1]b[1 + k], b[2]b[2 + k], b[3]b[3 + k], . . . , b[n − k]b[n], the expected number of homogeneous pairs separated by k − 1 elements is 2^{−k} h_k. The formula for the expected value is obtained by summing these contributions from k = 1 to the maximal separation satisfying k − 1 = n − 2.

11.3 [Weighted stego image] In this (and the next) exercise, you will derive
another attack on LSB embedding in the spatial domain called a Weighted Stego
(WS) attack. Let x[i], i = 1, . . . , n be the pixel values from an 8-bit grayscale
cover image containing n pixels. The value of x[i] after flipping its LSB will be
denoted with a bar,
x̄[i] ≜ LSBflip(x[i]) = x[i] + 1 − 2(x[i] mod 2). (11.68)
Let y[i] denote the stego image after flipping the fraction β of pixels along a
pseudo-random path (this corresponds to embedding an unbiased binary message
of relative message length 2β). Let wθ [i] be the “weighted” stego image,
wθ [i] = y[i] + θ(ȳ[i] − y[i]), 0 ≤ θ ≤ 1, (11.69)
with weight θ. Let
$$D(\theta) = \sum_{i=1}^{n} \bigl(w_\theta[i] - x[i]\bigr)^2 \qquad (11.70)$$
be the sum of squares of the differences between the pixels of the weighted stego
image and the cover image. Show that D(θ) is minimal for θ = β. In other words,

n
β = arg min D(θ) = arg min (wθ [i] − x[i])2 . (11.71)
θ θ
i=1

Hint: Substitute (11.69) into (11.71) and divide the sum over pixels for which x[i] = y[i] (unmodified pixels) and y[i] = x̄[i] (flipped pixels). Then simplify and differentiate with respect to θ to find the minimum of the polynomial quadratic in θ.
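The computation in the hint works out as follows (a brief derivation consistent with (11.68)–(11.71)):

```latex
% For unmodified pixels, y[i]=x[i] and \bar{y}[i]=\bar{x}[i], so
%   w_\theta[i]-x[i] = \theta(\bar{x}[i]-x[i]);
% for flipped pixels, y[i]=\bar{x}[i] and \bar{y}[i]=x[i], so
%   w_\theta[i]-x[i] = (1-\theta)(\bar{x}[i]-x[i]).
% Since (\bar{x}[i]-x[i])^2 = 1 and a fraction \beta of the n pixels is flipped,
D(\theta) = n(1-\beta)\,\theta^2 + n\beta\,(1-\theta)^2,
\qquad
D'(\theta) = 2n(1-\beta)\theta - 2n\beta(1-\theta) = 2n(\theta-\beta),
% so D'(\theta)=0 at \theta=\beta, which is a minimum because D''(\theta)=2n>0.
```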

11.4 [WS attack on LSB embedding] Equation (11.71) cannot be used directly for estimating the change rate β from the stego image because the cover-image pixels x[i] are not known to the steganalyst. To turn the result of the previous exercise into an attack, replace x[i] with its estimate x̂[i] obtained from
previous exercise into an attack, replace x[i] with its estimate x̂[i] obtained from
its neighboring values and find the minimum value of (11.70) by differentiating
it with respect to θ and solving D (θ) = 0 for θ. You should obtain the following
estimator of the change rate:

$$\hat{\beta} = \frac{1}{n} \sum_{i=1}^{n} (y[i] - \bar{y}[i])(y[i] - \hat{x}[i]). \qquad (11.72)$$
This estimator can be further improved by introducing non-negative local weights w[i],

$$\hat{\beta} = \sum_{i=1}^{n} w[i]\,(y[i] - \bar{y}[i])(y[i] - \hat{x}[i]), \qquad (11.73)$$
where Σ_i w[i] = 1. The purpose of the weights is to give more emphasis to those
pixels where we expect x̂[i] to be a more accurate estimate of the cover pixel
and less emphasis to those pixels that are likely to be poorly estimated. Because
our ability to estimate the cover image is better in smooth regions, the weights
should be inversely proportional to some local measure of texture. A good, albeit
empirical, choice is w[i] = 1/(5 + σ̂ 2 [i]), where σ̂ 2 [i] is the sample pixel variance
estimated from a 3 × 3 neighborhood of pixel i.
Implement this estimator for

$$\hat{x}[i,j] = \frac{1}{4}\bigl(y[i-1,j] + y[i+1,j] + y[i,j-1] + y[i,j+1]\bigr) \qquad (11.74)$$
and test its performance on images. More details about this estimator can be
found in [83, 138]. This attack can be adapted to work for JPEG images as
well [21, 244].
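A possible implementation of the weighted estimator (11.73) with the predictor (11.74) and the empirical weights w[i] = 1/(5 + σ̂²[i]) is sketched below (our own code; border pixels are skipped so that the predictor and the 3 × 3 variance are always defined, and the weights are normalized inside the function):

```python
def ws_change_rate(y):
    # Weighted Stego estimate of the change rate beta, (11.73)-(11.74).
    # y: 8-bit grayscale image as a list of rows (0-based indexing).
    M, N = len(y), len(y[0])
    num = wsum = 0.0
    for i in range(1, M - 1):
        for j in range(1, N - 1):
            p = y[i][j]
            pbar = p + 1 - 2 * (p % 2)          # LSBflip, as in (11.68)
            xhat = (y[i-1][j] + y[i+1][j] + y[i][j-1] + y[i][j+1]) / 4.0
            nb = [y[a][b] for a in (i-1, i, i+1) for b in (j-1, j, j+1)]
            mu = sum(nb) / 9.0
            var = sum((v - mu) ** 2 for v in nb) / 9.0
            w = 1.0 / (5.0 + var)               # empirical local weight
            num += w * (p - pbar) * (p - xhat)
            wsum += w
    return num / wsum
```

On smooth covers with mixed pixel parities the estimate is close to the true change rate; on textured covers the weights downplay poorly predicted pixels.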
11.5 [Monotonicity of COG] Prove (11.58) using the Chebyshev sum inequality [176], valid for any non-decreasing sequence a[i], non-increasing sequence b[i], and non-negative sequence p[i],

$$\sum_{i=0}^{N-1} p[i] \sum_{i=0}^{N-1} p[i]\,a[i]\,b[i] \le \sum_{i=0}^{N-1} p[i]\,a[i] \sum_{i=0}^{N-1} p[i]\,b[i]. \qquad (11.75)$$
Hint: Substitute into the Chebyshev inequality

p[i] = |H[i]|, (11.76)
a[i] = i, (11.77)
b[i] = |F[i]|. (11.78)
11.6 [Attack on MBS] Consider Model-Based Steganography for JPEG images as explained in Chapter 7 and mount the following targeted attack that
follows Strategy T1 from Section 10.3. The histogram of stego images follows
the model obtainable from both cover and stego images. The model is a gener-
alized Cauchy distribution and thus symmetrical. Real images, however, rarely
have perfectly symmetrical histograms. Thus, it might be possible to construct a targeted attack by using the chi-square test to check whether the image histogram follows the model.
[Steganography in Digital Media: Principles, Algorithms, and Applications, by Jessica Fridrich. Cambridge University Press. Online ISBN 9781139192903, Hardback ISBN 9780521190190. Chapter 12, Blind steganalysis, pp. 251–276.]

12 Blind steganalysis
The goal of steganalysis is to detect the presence of secretly embedded messages.
Depending on how much information the warden has about the steganographic
channel she is trying to attack, the detection problem can accept many different
forms. In the previous chapter, we dealt with the situation when the warden
knows the steganographic method that Alice and Bob might be using. With
this knowledge, Eve can tailor her steganalysis to the particular steganographic
channel using several strategies outlined in Section 10.3. If Eve has no information about the steganographic method, she needs blind steganalysis capable of
detecting as wide a spectrum of steganographic methods as possible. Design and
implementation of practical blind steganalysis detectors is the subject of this
chapter.
The first and most fundamental step for Eve is to accept a model of cover images and represent each image using a vector of features. In contrast to targeted
steganalysis, where a single feature (e.g., an estimate of message length) was
often enough to construct an accurate detector, blind steganalysis by definition
requires many features. This is because the role of features in blind steganalysis
is significantly more fundamental – in theory they need to capture all possible
patterns natural images follow so that every embedding method the prisoners
can devise disturbs at least some of the features. In Section 10.4, we loosely
formulated this requirement as completeness of the feature space and outlined
possible strategies for constructing good features.
The second step in building a blind steganalyzer is selecting a classification
tool. There are many choices, including neural networks, clustering algorithms,
support vector machines, and other tools of soft computing, pattern recognition,
and data mining. It seems that the method of choice today is Support Vector
Machines (SVMs) due to their ease of implementation and tuning as well as
superior performance. In Appendix E, we provide a brief introduction to the
theory of SVMs and their implementation and training. Readers not familiar
with this approach to machine-learning should read this appendix and become
familiar with the approach at least on a conceptual level.
The third and final step is the actual training of the classifier and setting its
parameters to satisfy desired performance criteria, such as making sure that the
false-alarm rate for known steganography algorithms is below a certain bound.
An essential part of training is the database of images. Ideally, it should reflect
252 Chapter 12. Blind steganalysis

the properties of the cover source. For example, if Eve knows that Alice likes to
send to Bob images taken with her camera, the warden may purchase a camera
of the same model and use it to create the training database. If the warden has
little or no information about the cover source, which would be the case of an
automatic traffic-monitoring device, constructing a blind steganalyzer is much
harder. There are significant differences in how difficult it is to detect stegano-
graphic embedding in various cover sources. For example, as already mentioned
in Chapter 11, scans of film or analog photographs are typically very noisy and
contain characteristic microscopic structure due to the grains present in film.
Because this structure is stochastic in nature, it complicates detection of embedding changes. Thus, tuning a blind steganalyzer to produce a low false-positive
rate across all cover sources, without being overly restrictive for “well-behaved”
cover sources, such as low-noise digital-camera images, may be quite a difficult
task. This is a serious problem that is not easily resolved [37]. One possible avenue Eve can take is to first classify the digital image into several categories,
such as scan, digital-camera image, raw, JPEG, computer graphics, etc., and
then send the image to a classifier that was trained separately for each cover
category. Creating such a system requires large computational and storage resources because each database should contain images that were also processed
using commonly used image-processing operations, such as denoising, filtering,
recoloring, resizing, rotation, cropping, etc. In general, the larger the database,
the more accurate and reliable the steganalyzer will be.
Many different blind steganalysis methods have been proposed in the literature [10, 11, 12, 67, 78, 102, 167, 168, 189, 190, 210, 225, 236, 252]. Even though,
in principle, a blind detector can be used to detect steganography in any image
format, one can expect that features computed in the same domain as where
the embedding is realized would be the most sensitive to embedding because
in this domain the changes to individual cover elements are lumped and independent. Therefore, we divide the description of blind steganalysis according
to the embedding domain. In the next section, we give a specific example of a
feature set for blind steganalysis of stegosystems that embed messages by manipulating quantized DCT coefficients. Using this feature set, in Sections 12.2 and
12.3 we present the details of a specific implementation of a blind steganalyzer
using SVMs and a one-class neighbor machine. Both steganalyzers are tested
regarding how well they can detect stego images and how they generalize to previously unseen steganographic methods. An example of using blind steganalysis
for construction of targeted attacks is included in Section 12.4. Blind steganalysis of stegosystems that embed messages in the spatial domain is included in
Section 12.5.
12.1 Features for steganalysis of JPEG images

In this section, we give a specific example of features [190] for construction of blind steganalyzers for steganography that embeds messages by modifying quantized DCT coefficients of a JPEG file.
For simplicity, we will consider only grayscale JPEG images represented using
an array of quantized DCT coefficients D[k, l, b] and the JPEG quantization
matrix Q[k, l]. Here, D[k, l, b], k, l = 0, . . . , 7, b ∈ {1, . . . , NB }, is the (k, l)th DCT
coefficient in the bth 8 × 8 block and NB is the number of all 8 × 8 blocks in the
image. We will assume that the blocks are ordered in a row-wise manner, meaning that b = 1 is the block obtained by transforming the block of 8 × 8 pixels in the upper left corner, b = ⌈N/8⌉ corresponds to the last block in the first row of blocks, and b = ⌈M/8⌉ × ⌈N/8⌉ = NB points to the block in the lower right corner. Here, we assumed that the image has M × N pixels.
Some features can be described more easily with an alternative data structure obtained by rearranging D into an 8⌈M/8⌉ × 8⌈N/8⌉ matrix by replacing each 8 × 8 pixel block with the corresponding block of DCT coefficients. We denote
the rearranged coefficients again with the same letter, hoping that this will not
cause confusion because each representation can be easily recognized from the
number of indices of D. To remove any potential source of ambiguity here, we
provide a formal description:

D[8iB + k + 1, 8jB + l + 1] = D [k, l, N/8 iB + jB + 1] , (12.1)


iB = 0, . . . , M/8 − 1, jB = 0, . . . , N/8 − 1. (12.2)

The reader is referred to Chapter 2 for more details about the JPEG format.
The features should capture all relationships that exist among DCT coefficients. A good approach is to consider the coefficients as realizations of a random
variable that follows a certain statistical model and choose as features the model
parameters estimated from the data. Unfortunately, the great diversity of natural
images prevents us from finding one well-fitting model and it is thus necessary
to build the features from several models in order to obtain good steganalysis
results.
Additionally, the features are required to be sensitive to typical steganographic
embedding changes and not depend on the image content so that one can easily
separate the cluster of cover and stego image features. To satisfy this requirement, the features are calibrated as explained in Chapter 10 (see Figure 10.3).
In calibration, the stego JPEG image J1 is decompressed to the spatial domain,
cropped by 4 pixels in both directions, and recompressed with the same quantization table as J1 to obtain J2. The calibrated form of feature f is thus the
difference

$$f(J_1) - f(J_2). \qquad (12.3)$$
12.1.1 First-order statistics
The first set of features is derived from the assumption that DCT coefficients are
realizations of an iid random variable. This means that their complete statistical
description can be captured using their probability mass function. The features
will thus be formed by the sample pmf computed from the DCT coefficients.
The sample pmf, which we will also call the normalized histogram, of all 64 × N_B luminance DCT coefficients is a D-dimensional vector

$$\mathbf{H}[r] = \frac{1}{64 \times N_B} \sum_{k,l=0}^{7} \sum_{b=1}^{N_B} \delta\bigl(r - D[k,l,b]\bigr), \qquad (12.4)$$
where r = L, . . . , R, L = min_{k,l,b} D[k,l,b], R = max_{k,l,b} D[k,l,b], and D = R − L + 1. Here, δ(x) is the Kronecker delta (2.23).
Because the distribution of DCT coefficients varies for different modes, it is
possible to consider the coefficients as 64 parallel iid channels, each corresponding to one DCT mode. Thus, further useful features will be provided by the
normalized histogram of individual DCT modes. For a fixed DCT mode (k, l),
let h^{(kl)}[r], r = L, . . . , R, denote the D-dimensional vector representing the histogram of values D[k, l, b], b = 1, . . . , N_B,

$$h^{(kl)}[r] = \frac{1}{N_B} \sum_{b=1}^{N_B} \delta\bigl(r - D[k,l,b]\bigr). \qquad (12.5)$$
Additionally, for a fixed integer r ∈ {L, . . . , R} we define the so-called dual histogram as an 8 × 8 matrix

$$g^{(r)}[k,l] = \frac{1}{N_B(r)} \sum_{b=1}^{N_B} \delta\bigl(r - D[k,l,b]\bigr). \qquad (12.6)$$
In words, g^{(r)}[k,l] captures how many times the value r occurs as the (k,l)th DCT coefficient in all N_B blocks, and $N_B(r) = \sum_{k,l}\sum_{b=1}^{N_B} \delta(r - D[k,l,b])$ is the normalization constant. The dual histogram captures the distribution of a given
coefficient value r among different DCT modes. Note that if a steganographic
method preserves all individual histograms, it also preserves all dual histograms
and vice versa.
If we were to take the complete vectors (12.4), (12.5), and matrices (12.6) as
features, the dimensionality of the feature space would be too large. Because
DCT coefficients typically follow a distribution with a sharp spike at r = 0 (see
Chapter 2), the sample pmf can be accurately estimated only around zero while
its values for larger values of r exhibit fluctuations that are of little value. The
same holds true of the individual DCT modes. The most populated are low-
frequency modes with small k + l. Thus, as the first set of features, we select
the first-order statistics shown in Table 12.1. We remind the reader that the
features for blind steganalysis are not used directly in this form but are calibrated
using (12.3).
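As an illustration, the global histogram (12.4) and the dual histograms (12.6), restricted to −5 ≤ r ≤ 5 as in Table 12.1, might be computed as follows (our own sketch; D is assumed to be a 3-D list indexed as D[k][l][b]):

```python
def jpeg_first_order(D):
    # Global histogram (12.4) and dual histograms (12.6) for -5 <= r <= 5.
    # D[k][l][b]: quantized DCT coefficient of mode (k, l) in block b.
    NB = len(D[0][0])
    H = {r: 0 for r in range(-5, 6)}
    g = {r: [[0] * 8 for _ in range(8)] for r in range(-5, 6)}
    for k in range(8):
        for l in range(8):
            for b in range(NB):
                r = D[k][l][b]
                if -5 <= r <= 5:
                    H[r] += 1
                    g[r][k][l] += 1
    for r in range(-5, 6):
        NBr = sum(sum(row) for row in g[r])     # normalization N_B(r)
        g[r] = [[c / NBr if NBr else 0.0 for c in row] for row in g[r]]
        H[r] /= 64 * NB
    return H, g
```

The individual AC-mode histograms (12.5) follow the same counting pattern with the normalization 1/N_B per mode.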
Table 12.1. Non-calibrated features formed by 165 first-order statistics of DCT coefficients from a JPEG image.

Feature name       Feature       Index range                     Features
Global histogram   H[r]          −5 ≤ r ≤ 5                      11
AC histograms      h^(kl)[r]     −5 ≤ r ≤ 5, 0 < k + l ≤ 2       11 × 5
Dual histograms    g^(r)[k,l]    −5 ≤ r ≤ 5, 0 < k + l ≤ 3       11 × 9

12.1.2 Inter-block features
The statistical models for DCT coefficients that were used in the previous section
do not capture the fact that natural images exhibit dependences over distances
larger than the block size. Consequently, coefficients from neighboring blocks are
not independent. The relationship among coefficients D[k, l, b] and D[k, l, b + 1]
from neighboring blocks can be captured using the joint probability distribution.
The features in this section are more easily described by rearranging the
DCT coefficients into the two-dimensional array D[i, j], i = 1, . . . , 8⌈M/8⌉, j = 1, . . . , 8⌈N/8⌉, defined in (12.1), obtained simply by replacing 8 × 8 blocks of pixels with blocks of corresponding DCT coefficients.
The first feature set is defined as the sum of the sample joint probability matrices in the horizontal and vertical directions,

$$\mathbf{C}[s,t] = \frac{\sum_{i=1}^{8\lceil M/8\rceil-8} \sum_{j=1}^{8\lceil N/8\rceil} \delta\bigl(s - D[i,j]\bigr)\,\delta\bigl(t - D[i+8,j]\bigr)}{64\,\bigl(\lceil M/8\rceil - 1\bigr)\,\lceil N/8\rceil} + \frac{\sum_{i=1}^{8\lceil M/8\rceil} \sum_{j=1}^{8\lceil N/8\rceil-8} \delta\bigl(s - D[i,j]\bigr)\,\delta\bigl(t - D[i,j+8]\bigr)}{64\,\lceil M/8\rceil\,\bigl(\lceil N/8\rceil - 1\bigr)}. \qquad (12.7)$$
This matrix is called the co-occurrence matrix as it describes the distribution of pairs of neighboring DCT coefficients. The matrix C usually has a sharp
maximum at (s, t) = (0, 0) and then quickly falls off. This is why we select as
features only the values C[s, t] for −2 ≤ s, t ≤ 2.
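The central part of (12.7) can be computed with one pass over the rearranged coefficient array; this is an illustrative sketch (our own code), assuming both dimensions of D are multiples of 8:

```python
def interblock_cooccurrence(D, S=2):
    # Central part -S <= s, t <= S of the co-occurrence matrix (12.7).
    # D: 2-D list of quantized DCT coefficients with the 8x8 blocks in
    # place; both dimensions are assumed to be multiples of 8.
    M8, N8 = len(D), len(D[0])
    nh = 64 * (M8 // 8 - 1) * (N8 // 8)   # number of horizontal block pairs
    nv = 64 * (M8 // 8) * (N8 // 8 - 1)   # number of vertical block pairs
    C = [[0.0] * (2 * S + 1) for _ in range(2 * S + 1)]
    for i in range(M8):
        for j in range(N8):
            a = D[i][j]
            if not -S <= a <= S:
                continue
            if i + 8 < M8 and -S <= D[i + 8][j] <= S:
                C[a + S][D[i + 8][j] + S] += 1.0 / nh
            if j + 8 < N8 and -S <= D[i][j + 8] <= S:
                C[a + S][D[i][j + 8] + S] += 1.0 / nv
    return C
```

Pairs with a coefficient outside [−S, S] simply do not contribute to the retained 5 × 5 central portion.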
The next (scalar) feature captures the fact that most steganographic techniques in some sense add entropy to the array of quantized DCT coefficients and
thus increase the differences between dependent coefficients across blocks. The
Table 12.2. Non-calibrated features formed by 28 higher-order inter-block statistics of DCT coefficients from a JPEG image.

Feature name           Feature     Index range       Features
Co-occurrence matrix   C[s,t]      −2 ≤ s, t ≤ 2     25
Variation              V                             1
Blockiness             B1, B2                        2

dependences are measured using a quantity known in mathematics as variation,

$$V = \frac{\sum_{i=1}^{8\lceil M/8\rceil-8} \sum_{j=1}^{8\lceil N/8\rceil} \bigl|D[i,j] - D[i+8,j]\bigr|}{64\,\bigl(\lceil M/8\rceil - 1\bigr)\,\lceil N/8\rceil} + \frac{\sum_{i=1}^{8\lceil M/8\rceil} \sum_{j=1}^{8\lceil N/8\rceil-8} \bigl|D[i,j] - D[i,j+8]\bigr|}{64\,\lceil M/8\rceil\,\bigl(\lceil N/8\rceil - 1\bigr)}. \qquad (12.8)$$
An integral measure of dependences among coefficients from neighboring
blocks is the blockiness defined as the sum of discontinuities along the 8 × 8
block boundaries in the spatial domain. Embedding changes are likely to increase the blockiness rather than decrease it. We define two blockiness measures
for γ = 1 and γ = 2:
$$B_\gamma = \frac{\sum_{i=1}^{\lfloor (M-1)/8\rfloor} \sum_{j=1}^{N} \bigl|x[8i,j] - x[8i+1,j]\bigr|^{\gamma} + \sum_{i=1}^{M} \sum_{j=1}^{\lfloor (N-1)/8\rfloor} \bigl|x[i,8j] - x[i,8j+1]\bigr|^{\gamma}}{N\lfloor (M-1)/8\rfloor + M\lfloor (N-1)/8\rfloor}. \qquad (12.9)$$
Here, M and N are the image height and width in pixels and x[i, j], i =
1, . . . , M, j = 1, . . . , N , are grayscale values of the decompressed JPEG image.
These two features are the only features computed from the spatial representation of the JPEG image.
The higher-order functionals measuring inter-block dependences among DCT
coefficients are summarized in Table 12.2.
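For concreteness, the blockiness (12.9) translates into code as follows (our own sketch, with the 1-based indices of (12.9) shifted to 0-based):

```python
def blockiness(x, gamma=1):
    # B_gamma (12.9): average discontinuity across 8x8 block boundaries.
    # x: grayscale image as a list of rows (0-based indexing).
    M, N = len(x), len(x[0])
    s = 0.0
    for i in range(8, M, 8):            # rows just below a block boundary
        for j in range(N):
            s += abs(x[i - 1][j] - x[i][j]) ** gamma
    for j in range(8, N, 8):            # columns just right of a boundary
        for i in range(M):
            s += abs(x[i][j - 1] - x[i][j]) ** gamma
    return s / (N * ((M - 1) // 8) + M * ((N - 1) // 8))
```

An image whose content jumps at block boundaries scores much higher than a smooth gradient, which is the behavior the feature is designed to capture.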

12.1.3 Intra-block features
DCT coefficients within one 8 × 8 block exhibit weak (intra-block) dependences
that cannot be captured with the features introduced so far [210]. In particular,
for a fixed index b, we need to describe the relationship among coefficients that
are adjacent in the horizontal direction, D[k, l − 1, b], D[k, l, b], and D[k, l + 1, b],
as well as for the vertical and both diagonal directions. To this end, it will be
useful to represent DCT coefficients again using the matrix D[i, j].
Let A[i, j] = |D[i, j]| be the matrix of absolute values of DCT coefficients in
the image. Instead of modeling directly the intra-block dependences among DCT
coefficients, which are quite weak, we will model the differences among them
because the differences will be more sensitive to embedding changes. Thus, we
form four difference arrays along four directions: horizontal, vertical, diagonal,
and minor diagonal (further denoted as Ah [i, j], Av [i, j], Ad [i, j], and Am [i, j]
respectively)

Ah [i, j] = A[i, j] − A[i, j + 1], (12.10)
Av [i, j] = A[i, j] − A[i + 1, j], (12.11)
Ad [i, j] = A[i, j] − A[i + 1, j + 1], (12.12)
Am [i, j] = A[i + 1, j] − A[i, j + 1]. (12.13)

Note that Ah is an M × (N − 1) matrix, Av is (M − 1) × N , and Ad and Am
are (M − 1) × (N − 1). Viewing the individual elements of these matrices as
realizations of Markov variables, we compute four sample transition probability
matrices Mh , Mv , Md , Mm :
$$\mathbf{M}_h[s,t] = \frac{\sum_{u=1}^{M} \sum_{v=1}^{N-2} \delta(A_h[u,v]-s)\,\delta(A_h[u,v+1]-t)}{\sum_{u=1}^{M} \sum_{v=1}^{N-2} \delta(A_h[u,v]-s)}, \qquad (12.14)$$
$$\mathbf{M}_v[s,t] = \frac{\sum_{u=1}^{M-2} \sum_{v=1}^{N} \delta(A_v[u,v]-s)\,\delta(A_v[u+1,v]-t)}{\sum_{u=1}^{M-2} \sum_{v=1}^{N} \delta(A_v[u,v]-s)}, \qquad (12.15)$$
$$\mathbf{M}_d[s,t] = \frac{\sum_{u=1}^{M-2} \sum_{v=1}^{N-2} \delta(A_d[u,v]-s)\,\delta(A_d[u+1,v+1]-t)}{\sum_{u=1}^{M-2} \sum_{v=1}^{N-2} \delta(A_d[u,v]-s)}, \qquad (12.16)$$
$$\mathbf{M}_m[s,t] = \frac{\sum_{u=1}^{M-2} \sum_{v=1}^{N-2} \delta(A_m[u+1,v]-s)\,\delta(A_m[u,v+1]-t)}{\sum_{u=1}^{M-2} \sum_{v=1}^{N-2} \delta(A_m[u+1,v]-s)}. \qquad (12.17)$$

Since the range of differences between absolute values of neighboring DCT coefficients could be quite large, if the matrices Mh, Mv, Md, Mm were taken directly
as features, the dimensionality of the feature space would be impractically large.
Thus, we use only the central portion of the matrices, −4 ≤ s, t ≤ 4 with the note
that the values in the difference arrays Ah [i, j], Av [i, j], Ad [i, j], and Am [i, j]
larger than 4 are set to 4 and values smaller than −4 are set to −4 prior to
calculating Mh , Mv , Md , Mm . To further reduce the features’ dimensionality, all
four matrices are averaged,
$$\mathbf{M} = \frac{1}{4}\bigl(\mathbf{M}_h + \mathbf{M}_v + \mathbf{M}_d + \mathbf{M}_m\bigr), \qquad (12.18)$$
which gives a total of 9 × 9 = 81 features (Table 12.3).
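A sketch of the transition matrices (12.14)–(12.16), parameterized by a direction step, could look like this (our own code; the minor-diagonal matrix (12.17) pairs A_m[u+1,v] with A_m[u,v+1] and needs a slightly different loop). Differences are clipped to [−4, 4] before counting, as described above:

```python
def transition_matrix(diff, di, dj, T=4):
    # Sample transition probability matrix of a clipped difference array,
    # as in (12.14)-(12.16): step (di, dj) = (0, 1) gives Mh, (1, 0) gives
    # Mv, and (1, 1) gives Md.
    M, N = len(diff), len(diff[0])
    clip = lambda v: max(-T, min(T, v))
    counts = [[0] * (2 * T + 1) for _ in range(2 * T + 1)]
    totals = [0] * (2 * T + 1)
    for u in range(M - di):
        for v in range(N - dj):
            s = clip(diff[u][v])
            t = clip(diff[u + di][v + dj])
            counts[s + T][t + T] += 1
            totals[s + T] += 1
    return [[c / totals[i] if totals[i] else 0.0 for c in row]
            for i, row in enumerate(counts)]
```

The averaged feature (12.18) is then simply the element-wise mean of the four matrices.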
The final feature set summarized in Tables 12.1–12.3 contains 165 + 28 + 81 =
274 calibrated features: 165 first-order statistics, 28 inter-block statistics, and 81
Table 12.3. Non-calibrated features formed by 81 higher-order intra-block statistics of DCT coefficients from a JPEG image.

Feature name            Feature    Index range      Features
Average Markov matrix   M[s,t]     −4 ≤ s, t ≤ 4    81

intra-block features. The following sections demonstrate several applications that use this feature set.

12.2 Blind steganalysis of JPEG images (cover-versus-all-stego)

In blind steganalysis, images are first mapped to some low-dimensional feature
space where they can be classified into two categories (cover and stego) using
standard machine-learning tools. In general, there are countless ways to represent an image using a low-dimensional feature. The previous section showed one particular 274-dimensional representation suitable for JPEG images. This feature set is now used to construct a blind steganalyzer for JPEG images. The
idea is to train a binary classifier to distinguish between two classes of images –
the class of cover images and the class of stego images embedded using multiple
steganographic methods with a mixture of payloads. We call such a steganalyzer
a “cover-versus-all-stego classifier.” The author hopes that by considering details
of a specific construction, the reader will be exposed to some typical issues that
arise when designing a blind steganalyzer using this approach and will be able
to apply the acquired knowledge for other feature spaces and machine-learning
tools. We also note that the cover-versus-all-stego approach is not the only possible way to construct blind steganalyzers. In Section 12.3, the reader will learn
about an alternative approach in which a classifier is trained to recognize only
the class of cover images.
First, we describe the database of images used for training and testing as
well as the set of steganographic techniques used to produce stego images for
training. Then, we show how to implement the steganalyzer using support vector
machines. Finally, the steganalyzer is subjected to tests to give the reader a sense
of the level of performance that can be achieved in practice.

12.2.1 Image database


Because the database used to construct and test the blind steganalyzer may
substantially influence the resulting performance, it is always necessary to de-
scribe it in detail. In particular, one should include the number of images, their
size, their origin, and details of processing they have been subjected to. In our
example, the images were all created from 6004 source raw images taken by 22
different digital cameras with sizes ranging from 1.4 to 6 megapixels with an
average size of 3.2 megapixels. All images were originally acquired in the raw
format and then converted to grayscale and saved as 75% quality JPEG. The
database was divided into two disjoint sets D1 and D2 . The first set with 3500
images was used only for training the SVM classifier while the second set with
2504 images was used to evaluate the steganalyzer performance. To make the
evaluation of the steganalyzer more realistic, no camera contributed images to
both sets.

12.2.2 Algorithms
When training a blind steganalyzer, we need to present it with stego images
from as many (diverse) steganographic methods as possible to give it the abil-
ity to generalize to previously unseen stego images. The tacit assumption we
are making here is that the steganalyzer will be able to recognize stego images
embedded with an unknown steganographic scheme because the features will oc-
cupy a location in the feature space that is more compatible with stego images
rather than cover images. As will be seen in Section 12.2.6, this assumption is
not always satisfied and alternative approaches to blind steganalysis need to be
explored (Section 12.3).
All stego images for training were prepared from the training database D1 us-
ing six steganographic techniques – JP Hide&Seek, F5, Model-Based Steganogra-
phy without (MBS1) and with (MBS2) deblocking, Steghide, and OutGuess. The
algorithms for F5, Model-Based Steganography, and OutGuess were described
in Chapter 7. JP Hide&Seek is a more sophisticated version of Jsteg and its em-
bedding mechanism mostly modifies LSBs of DCT coefficients. The source code
is available from http://linux01.gwdg.de/~alatham/stego.html. Steghide is
another algorithm that preserves the global first-order statistics of DCT coeffi-
cients but using a different mechanism than OutGuess. The embedding is always
done by swapping coefficients rather than modifying their LSBs, which means
that no correction phase is needed. Steghide is described in [110]. These se-
lected six algorithms form the warden’s knowledge base about steganography
from which she constructs her detector.

12.2.3 Training database of stego images


In practice, the warden will rarely have any information about the distribution of
payloads. Thus, a reasonable strategy is to select the least informative distribu-
tion, such as uniform distribution of payload. In our example, with the exception
of MBS2, for the remaining five algorithms one third of the training database
D1 was embedded with random messages of relative length 1 (full embedding
capacity), another third with payload 0.5, and the last third with 0.25.1 Thus,
each image was embedded with five algorithms with one payload out of three.
The stego images for MBS2 were embedded with an even mixture of relative
payloads 0.3 and 0.15 of the embedding capacity of MBS1. This measure was
necessary because MBS2 often fails to embed longer messages. Thus, the train-
ing database contained a total of 6 × 3500 stego images embedded with an even
mixture of relative payloads 1, 0.5, and 0.25 (with the exception of MBS2).
For practical applications, the number of training stego images produced by
each embedding algorithm should reflect the a priori probabilities of encountering
stego images generated by each steganographic algorithm. In the absence of any
prior information, again one can use uniform distribution and assume that we
are equally likely to encounter a stego image from any of the stego algorithms.
Thus, the training database of stego images was formed by randomly selecting
3500 stego images from all 6 × 3500 stego images.

12.2.4 Training
The training consists of computing the feature vectors for each cover image from
D1 and for each stego image from the training database created in the previous
section. There are many tools one can use for classification purposes. Here, we
describe an approach based on soft-margin weighted support vector machines
(C-SVMs) with Gaussian kernel.2 The kernel width γ and the penalization pa-
rameter C are typically determined by a grid-search on a multiplicative grid,
such as

(C, γ) ∈ {(2^i, 2^j) | i ∈ {−3, . . . , 9}, j ∈ {−5, . . . , 3}}, (12.19)

to determine the values of the parameters leading to a false-alarm rate below
the threshold required by each particular application. To obtain a more robust
estimate of these two parameters, one typically uses multiple cross-validation. For
example, in five-fold cross-validation, the training database is randomly divided
into five mutually disjoint parts, four of which are used for training and the
remaining fifth part for testing the classifier performance to see whether the
false-alarm rate is below the required threshold, such as PFA ≤ 0.01 on the fifth
validation part.
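Using a modern machine-learning library, the grid search (12.19) with five-fold cross-validation might be sketched as follows. This is an assumption-laden illustration: scikit-learn is our choice of tool, the function name is ours, and `GridSearchCV` selects the (C, γ) pair by cross-validated accuracy, whereas the procedure above selects parameters subject to a false-alarm constraint such as PFA ≤ 0.01.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def train_steganalyzer(X, y):
    """Hypothetical C-SVM training with the (C, gamma) grid of Eq. (12.19).

    X: array of feature vectors, y: labels (0 = cover, 1 = stego).
    Selection here is by cross-validated accuracy, a simplification of
    the false-alarm-constrained selection described in the text."""
    grid = {
        "C": [2.0 ** i for i in range(-3, 10)],     # 2^-3, ..., 2^9
        "gamma": [2.0 ** j for j in range(-5, 4)],  # 2^-5, ..., 2^3
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=cv)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```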

1 This means that a larger number of bits was embedded with stego algorithms with
higher embedding capacity, which might correspond to how users would use the
algorithms in practice. Consequently, one cannot use such results to fairly compare
different stego algorithms.
2 The reader is now encouraged to browse through Appendix E to become more familiar with
SVMs.
Table 12.4. Probability of detection PD when presenting the blind steganalyzer with
stego images embedded by “known” algorithms on which the steganalyzer was trained.
The false-alarm rate on the testing set of 2504 cover images was 1.04%.

Algorithm [bpnc] PD
F5 [1.0] 99.96%
F5 [0.5] 99.60%
F5 [0.25] 90.73%
JP HS [1.0] 99.84%
JP HS [0.5] 98.28%
JP HS [0.25] 73.52%
MBS1 [1.0] 99.96%
MBS1 [0.5] 99.80%
MBS1 [0.3] 98.88%
MBS1 [0.15] 71.19%
MBS2 [0.3] 99.12%
MBS2 [0.15] 77.92%
OutGuess [1.0] 99.96%
OutGuess [0.5] 99.96%
OutGuess [0.25] 98.12%
Steghide [1.0] 99.96%
Steghide [0.5] 99.84%
Steghide [0.25] 96.37%
Cover 98.96%

12.2.5 Testing on known algorithms


The blind steganalyzer is first tested on stego images embedded with the same
six algorithms on which it was trained. The stego images were prepared from
database D2 in the same manner as for training. One third was embedded with
relative payload 1, 0.5, and 0.25, again with the same exception for MBS2. From
this set of 6 × 2504 stego images, 2504 images were randomly selected for testing.
The results of the test are shown in Table 12.4. The steganalyzer can detect
fully embedded images with all algorithms with probability better than 99%. To
determine the steganalyzer’s false-alarm rate, it was also presented with 2504
cover images from D2 , of which 1.04% were detected as stego. This is in good
agreement with the design false-alarm rate of PFA = 1% used to determine the
SVM parameters γ and C during training. The table also confirms our intuition
that with decreasing payload, the missed-detection rate increases. Overall, we can
state that the steganalyzer is very successful in detecting known steganographic
algorithms embedded with payloads larger than 25% of each algorithm’s capacity.
In the next section, the same blind steganalyzer is tested on stego images
produced by previously unseen steganographic methods.
Table 12.5. Probability of detection of four “unknown” steganographic algorithms for
various payloads, expressed in bits per non-zero AC DCT coefficient (bpnc) for –F5 and
MMx, and in percentages of embedding capacity for Jsteg.

Algorithm [bpnc] PD
–F5 [1.0] 99.08%
–F5 [0.5] 99.60%
–F5 [0.25] 98.48%
MM2 [0.66] 99.64%
MM2 [0.42] 99.20%
MM2 [0.26] 53.67%
MM3 [0.66] 99.72%
MM3 [0.42] 99.32%
MM3 [0.26] 58.51%
Jsteg [1.00] 42.41%
Jsteg [0.50] 42.43%
Jsteg [0.25] 42.05%
Cover 98.96%

12.2.6 Testing on unknown algorithms


As explained in the introduction of Section 12.2, the cover-versus-all-stego ste-
ganalyzer is supposed to detect all steganographic methods. The hope is that
when the detector is presented with a new steganographic algorithm, it will
be able to generalize and correctly classify the image as containing stego. We
remind the reader that our blind detector was trained on six steganographic al-
gorithms that represent the knowledge base of the warden. In order to test how
well the detector can recognize steganographic algorithms on which it was not
trained, we intentionally did not use in the training phase the following algo-
rithms: Jsteg (Chapter 5), –F5, and two MMx algorithms (Chapter 9). The –F5
algorithm works in exactly the same manner as F5 but the embedding operation
is reversed (the absolute value of the DCT coefficient is always increased rather
than decreased). Reversing the embedding operation has the benefit of remov-
ing shrinkage from F5, which enables easier implementation and increases the
algorithm embedding efficiency.
The payloads embedded using MM2 and MM3 (Chapter 9) were 0.66, 0.42,
and 0.26 bpnc. These values correspond to relative payloads α = 2/3, 3/7, and 4/15
bpnc for binary Hamming codes [2^p − 1, 2^p − p − 1] for p = 2, 3, 4 (see Table 8.1).
Because the embedding capacity of both –F5 and MMx is equal to the number of
non-zero DCT coefficients, payloads expressed in bpnc also express the relative
payload size with respect to the maximal embedding capacity of both algorithms.
Finally, for Jsteg the images were embedded with relative payloads 1.0, 0.5, and
0.25. The quality factor for all stego images was again set to 75, the quality
factor of all images from the training set.
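The α values quoted above follow directly from the Hamming-code parameters, since the binary Hamming code embeds p message bits in 2^p − 1 cover elements; a quick check (the function name is ours):

```python
from fractions import Fraction

def hamming_relative_payload(p):
    """Relative payload of the binary Hamming code [2^p - 1, 2^p - p - 1]:
    p message bits are carried by 2^p - 1 cover elements."""
    return Fraction(p, 2 ** p - 1)

# p = 2, 3, 4 give 2/3, 3/7, 4/15, i.e., approximately 0.667, 0.429, 0.267
alphas = [hamming_relative_payload(p) for p in (2, 3, 4)]
```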
The results shown in Table 12.5 demonstrate that the blind detector can gen-
eralize to –F5 and MMx. Even though the embedding mechanism of –F5 is very
different from those of the six algorithms on which the blind steganalyzer was
trained, images produced by –F5 are reliably detected.
Quite surprisingly, however, the images embedded by Jsteg are detected the
least reliably despite the fact that Jsteg is a relatively poor steganographic
algorithm that is easily detectable using a variety of targeted attacks, such as the
one explained in Section 5.1.2 or [21, 156, 157]. Jsteg introduces severe artifacts
into the histogram of DCT coefficients and the steganalyzer has not seen such
artifacts before. Because it was tuned to a low probability of false alarm,
it conservatively assigns such images to the cover class. This analysis underscores
the need to train the classifier on as diverse a set of steganographic algorithms as
possible to give it the ability to generalize.
An alternative approach to blind steganalysis that is less prone to such catas-
trophic failures but gives an overall smaller accuracy on known algorithms is the
construction based on a one-class detector. In the next section, we describe one
simple approach to such one-class steganalyzers implemented using a one-class
neighbor machine.

12.3 Blind steganalysis of JPEG images (one-class neighbor machine)

The cover-versus-all-stego blind steganalyzer described in the previous section
failed to correctly classify stego images embedded by Jsteg (this algorithm was
kept “unknown” when training the classifier to test its ability to generalize to
previously unseen stego methods). It appears that blind steganalyzers trained
on examples of both cover and stego images will always be prone to such failures
simply because the SVM (or some other machine-learning method) will by defi-
nition learn the boundary between the classes and this boundary may be a poor
choice for stego images that occupy a completely different portion of the feature
space (images that do not look like covers or like known stego images).
A reasonable alternative to resolve this problem is to train a classifier to rec-
ognize only cover images so that anything that fails to appear as cover will be
classified as stego. This problem is recognized in the machine-learning literature
as novelty detection. In this section, we describe a very simple one-class stegana-
lyzer and demonstrate its performance on experiments. Despite its simplicity, its
accuracy is commanding, which makes this approach to blind steganalysis quite
promising. The reader is referred to [194] for a more detailed treatment as well
as detailed comparative study of various one-class steganalyzers.
For the description of a One-Class Neighbor Machine (OC-NM) [181], we need
the notion of a sparsity measure. Let us assume that we have l cover images from
which we compute l d-dimensional features. We will denote the jth component
of the ith feature as f [i, j], i = 1, . . . , l, j = 1, . . . , d. The array f is our training
set. The function Sf : R^d → R, which depends on f, is a sparsity measure if and
only if

∀x, y ∈ R^d: pc(x) > pc(y) ⇒ Sf(x) < Sf(y), (12.20)

where pc is the distribution of cover features. In other words, the sparsity mea-
sure characterizes the closeness of x to the training set. The OC-NM works by
identifying a threshold γ so that all features x with Sf (x) > γ are classified as
stego.
The training of an OC-NM is simple because we need only to find the threshold
γ. It begins with calculating the sparsity of all training samples m[i] = Sf (f [i, .]),
1 ≤ i ≤ l, and ordering them so that m[1] ≥ m[2] ≥ . . . ≥ m[l]. By setting γ =
m[PFA l], we ensure that a fraction of exactly PFA training features are classified
as stego. Assuming the features f [i, .] are iid samples drawn according to the
pdf pc , with l → ∞ the OC-NMs were shown to converge to a detector with the
required probability of false alarm [181].
Note that there is a key difference between utilizing the training features in
OC-NM and in classifiers based on SVMs. While SVMs use only a fraction of
them during classification (support vectors defining the hyperplane), OC-NMs
use all training features, which shows the relation to classifiers of the nearest-
neighbor type.
The original publication on OC-NMs [181] presents several types of sparsity
measures. In this book, we adopted the one based on the so-called Hilbert kernel
density estimator
Sf(x) = log [ 1 / Σ_{i=1}^{l} 1/(‖x − f[i, .]‖_2^d h^d) ], (12.21)

where ‖x‖_2 is the Euclidean norm of x. The parameter h in (12.21) controls the
smoothness of the sparsity measure. Intuitively, when x “is surrounded” by the
training features, the sparsity measure will be small (negative). Points that are
farther away from the training features will lead to larger values of Sf .
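Putting the pieces together, training an OC-NM reduces to computing sparsities and picking a quantile. The sketch below is ours; in particular, the log-domain evaluation of (12.21) and scoring each training sample against the remaining ones (leave-one-out, to avoid the zero self-distance) are implementation choices not spelled out in the text.

```python
import numpy as np

def sparsity(x, f, h=0.01):
    """Hilbert-kernel sparsity measure, Eq. (12.21), evaluated in the
    log domain for numerical stability; f is the l x d training array."""
    l, d = f.shape
    dist = np.maximum(np.linalg.norm(f - x, axis=1), 1e-12)
    t = -d * (np.log(dist) + np.log(h))        # log of each summand
    m = t.max()
    return -(m + np.log(np.exp(t - m).sum()))  # log(1 / sum_i exp(t_i))

def train_ocnm(f, p_fa=0.01, h=0.01):
    """Return the threshold gamma such that roughly a fraction p_fa of
    the training features satisfies S_f > gamma."""
    l = len(f)
    m = np.array([sparsity(f[i], np.delete(f, i, axis=0), h)
                  for i in range(l)])           # leave-one-out scores
    m_sorted = np.sort(m)[::-1]                 # m[0] >= m[1] >= ...
    k = max(int(round(p_fa * l)), 1)
    return m_sorted[k - 1]                      # threshold gamma
```

A feature vector x is then classified as stego whenever `sparsity(x, f) > gamma`.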

12.3.1 Training and testing


An important advantage of OC-NMs is the simplicity of their training. In our
case, the training of an OC-NM classifier simply consists of computing the set of
features f from 3500 cover JPEG images from D1 . The ability of the detector to
recognize stego images will be estimated on 2504 JPEG images from D2 not used
during training. The description of the databases of cover images D1 and D2 , as
well as the process of creating the stego images, can be found in Section 12.2.
The detection accuracy of the OC-NM seems to vary very little with the spar-
sity parameter. All experimental results reported in Tables 12.6 and 12.7 were
obtained with the sparsity parameter h set to 0.01.
Table 12.6. Detection accuracy of OC-NM on F5, MBS1, MBS2, JP HS, OutGuess, and
Steghide. For comparison, the detection accuracy of the cover-versus-all-stego classifier
described in Section 12.2.5 is repeated in this table. We emphasize that the
cover-versus-all-stego classifier was trained on a mixture of stego images embedded by
the same six algorithms.

Algorithm [bpnc] OC-NM Cover-versus-all-stego

F5 [1.0] 98.96% 99.96%
F5 [0.5] 20.10% 99.60%
F5 [0.25] 2.40% 90.73%
JP HS [1.0] 99.52% 99.84%
JP HS [0.5] 41.73% 98.28%
JP HS [0.25] 19.04% 73.52%
MBS1 [1.0] 99.92% 99.96%
MBS1 [0.5] 29.50% 99.80%
MBS1 [0.3] 4.27% 98.88%
MBS1 [0.15] 1.76% 71.19%
MBS2 [0.3] 32.47% 99.12%
MBS2 [0.15] 2.88% 77.92%
OutGuess [1.0] 100.00% 99.96%
OutGuess [0.5] 57.51% 99.96%
OutGuess [0.25] 5.19% 98.12%
Steghide [1.0] 99.44% 99.96%
Steghide [0.5] 16.61% 99.84%
Steghide [0.25] 2.84% 96.37%
Cover 98.64% 98.96%

Table 12.6 shows the percentage of correctly classified stego images embed-
ded with six different steganographic algorithms and various payloads. We also
reprint the detection accuracy of the cover-versus-all-stego classifier from Sec-
tion 12.2 for comparison. As one could expect, the binary cover-versus-all-stego
classifier has a better performance because it was trained on a mixture of stego
images embedded using the same algorithms. The advantage of the OC-NM
becomes apparent when testing on algorithms unseen by the binary classifier
(Table 12.7). Here, the difference in performance is only marginal for –F5 and
MMx. The biggest difference occurs for Jsteg images, which are reliably detected
as stego by the OC-NM and almost completely missed by the cover-versus-all-
stego classifier. The last row of the table shows the false-alarm rates for both
methods, which are fairly similar.

12.4 Blind steganalysis for targeted attacks

By training a blind steganalyzer on the set of stego images embedded using a
specific steganographic algorithm, we essentially obtain a cookie-cutter approach
for constructing targeted attacks. The targeted steganalyzer built in this way will
Table 12.7. Detection accuracy of OC-NM on –F5, MMx, and Jsteg. For comparison,
the detection accuracy of the cover-versus-all-stego classifier described in Section 12.2.6
is repeated here. We note that the cover-versus-all-stego classifier was not trained on
stego images produced by these three algorithms.

Algorithm [bpnc] OC-NM Cover-versus-all-stego

–F5 [1.0] 100.00% 99.08%
–F5 [0.5] 100.00% 99.60%
–F5 [0.25] 93.93% 98.48%
MM2 [0.66] 100.00% 99.64%
MM2 [0.42] 99.92% 99.20%
MM2 [0.26] 17.69% 53.67%
MM3 [0.66] 100.00% 99.72%
MM3 [0.42] 99.92% 99.32%
MM3 [0.26] 15.14% 58.51%
Jsteg [1.0] 100% 42.41%
Jsteg [0.75] 100% 42.33%
Jsteg [0.5] 100% 42.37%
Jsteg [0.2] 98.09% 42.09%
Jsteg [0.1] 65.36% 32.98%
Jsteg [0.05] 40.27% 5.99%
Cover 98.64% 98.96%

naturally have a better performance than the blind classifiers from Sections 12.2
and 12.3 because the distribution of stego images will be less spread out. This
section provides the reader with an example of how accurate the targeted attacks
are for the following steganographic algorithms: F5, –F5, nsF5, Model-Based
Steganography without deblocking (MBS1), JP Hide&Seek, Steghide, MMx, and
three versions of perturbed quantization in double-compressed JPEG images as
described in Chapter 9 (PQ, PQe, PQt).
As in the previous section, all classifiers were implemented as soft-margin
support vector machines with Gaussian kernel trained on 3500 cover images and
3500 stego images3 with an even mixture of short payloads 0.05, 0.1, 0.15, and
0.2 bpnc. The detection performance was evaluated on 2504 cover images from
database D2 and their stego versions embedded with the same payload mixture
as in the training set. For this experiment, the quality factor of all JPEG images
was set to 70. For the three perturbed quantization methods, the primary and
secondary quality factors were 85 and 70, respectively. The cover images for these
three methods were JPEG images doubly compressed with quality factors 85 and
70.
Table 12.8 shows the detection results obtained using the performance criteria
PE and PD^{−1}(1/2) (see Section 10.2.4).

3 The same database D1 was used as in Section 12.2.1.


Table 12.8. Performance of targeted steganalyzers obtained by training a binary classifier
for JPEG images on covers and a mixture of stego images embedded with relative
payloads 0.05, 0.1, 0.15, and 0.2 bpnc. The results are from the testing set. Two
performance criteria are shown: the minimal average error probability PE and the
false-alarm probability at 50% detection, PD^{−1}(1/2). The abbreviations of stego
algorithms are explained in the text.

Algorithm        PE       PD^{−1}(1/2)


F5 7.5% 0.00%
–F5 4.4% 0.00%
nsF5 14.9% 1.04%
JP Hide&Seek 3.2% 0.04%
MBS1 2.8% 0.00%
MM2 18.4% 0.96%
MM3 19.5% 1.24%
Steghide 1.4% 0.00%
PQ 6.4% 0.12%
PQe 27.4% 10.08%
PQt 26.9% 10.84%

Because the payloads embedded using each algorithm were the same, this table
tells us how detectable different steganographic methods are. Note that even
though the tested payloads are relatively short, very reliable targeted attacks
are possible on OutGuess, F5, –F5, Model-Based Steganography, Steghide, and
JP Hide&Seek. The improved version of F5 without shrinkage (nsF5) is the
best tested method that does not use any side-information at the sender (the
uncompressed JPEG image). As expected, the methods that do use this side-
information (versions of PQ and MMx) are among the least detectable.
To summarize, this section demonstrates a general approach to targeted ste-
ganalysis that does not require knowledge of the embedding mechanism. The
basic idea is to train a binary classifier to recognize the class of cover images and
stego images embedded with the particular steganographic method.

12.4.1 Quantitative blind attacks


In the previous section, we learned that targeted attacks can be constructed by
training a blind steganalyzer on a set of cover and stego images embedded by
a specific stegosystem. Such detectors are binary classifiers returning a “yes” or
“no” answer when presented with an image. In this section, we explain how to use
the feature set described in Section 12.1 for construction of quantitative attacks
capable of estimating the change rate (payload). Intuitively, because the features
sensitively react to embedding changes, the difference between the feature vectors
of the cover and stego image increases with increasing change rate. Thus, one
could potentially extract more information from the position of the feature vector
and build a change-rate estimator using regression.
Let us assume that we have l stego images embedded with a mixture of change
rates β[i]. Denoting their feature vectors f [i, j], i = 1, . . . , l, j = 1, . . . , d, we will
seek the estimate of the change rate as a linear combination of the features

β̂[i] = f [i, .]θ, (12.22)

where θ ∈ R^d is a column vector. The unknown vector parameter θ is determined
by minimizing the total square error

Σ_{i=1}^{l} (f[i, .]θ − β[i])^2. (12.23)

This least-square problem has a standard solution in the form (see Section D.8)

θ̂ = (fᵀf)^{−1} fᵀβ. (12.24)
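A sketch of the resulting estimator (ours; `np.linalg.lstsq` solves the same least-squares problem as (12.24) but in a numerically stabler way than forming fᵀf explicitly):

```python
import numpy as np

def fit_change_rate_regressor(f, beta):
    """Least-squares solution of Eq. (12.24): f is the l x d feature
    matrix, beta the vector of known change rates of the training
    stego images."""
    theta, *_ = np.linalg.lstsq(f, beta, rcond=None)
    return theta

def estimate_change_rate(f, theta):
    """Eq. (12.22): beta_hat[i] = f[i, .] theta."""
    return f @ theta
```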

Change-rate estimators constructed in this fashion typically exhibit very good
accuracy, which generally depends on how well the features react to embedding.
To give the reader a sense of the performance, we include a sample of the results
published in the paper that pioneered this approach to quantitative steganaly-
sis [195]. The results were obtained for the same database of JPEG images as
in the previous sections enlarged by an additional 3159 images, bringing the to-
tal number of images to 9163. All images were originally acquired in raw format
and then, for the purpose of this test, compressed with 80% quality JPEG. Seven
steganographic algorithms were used to embed all images from the database with
a uniform mixture of relative payloads. To be more precise, the length of the mes-
sage was chosen randomly between zero and the maximum embedding capacity
for each algorithm and each image. Matrix embedding was turned off for F5 to
better control the change rate. The database was then divided into two halves,
each containing l ≈ 4600 images. The first half was used to construct the linear
regressor (find θ), while the other half was used for testing the performance.
The feature set f was formed by 274 calibrated DCT features described in
Section 12.1 augmented with the number of non-zero DCT coefficients as an
additional 275th feature (thus, d = 275). All 275 features were normalized to
have zero mean and unit variance on the training set.
Figure 12.1 shows the scatter plot of the estimated change rate versus true
change rate for the F5 algorithm and Model-Based Steganography. Table 12.9
displays the sample Median Absolute Error (MAE) of the estimator and its bias
(the median of the error computed from all estimates) for seven steganographic
algorithms.
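Both criteria of Table 12.9 are simple sample medians; for completeness (the function name is ours):

```python
import numpy as np

def mae_and_bias(beta_true, beta_hat):
    """Sample median absolute error and bias (the median of the error),
    the two criteria reported in Table 12.9."""
    err = np.asarray(beta_hat, float) - np.asarray(beta_true, float)
    return float(np.median(np.abs(err))), float(np.median(err))
```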
This approach to quantitative steganalysis has one overwhelming advantage
over the strategies presented in Section 10.3 and methods described in Chap-
ter 11. Here, all that is needed to construct the quantitative attack is a set of
stego images embedded with known change rates. In particular, the stegana-
lyst does not need any information about the embedding algorithm. As long as
Figure 12.1 Scatter plot showing the estimated change rate for F5 (Section 7.3.2) and
Model-Based Steganography (Section 7.1.2) versus the true change rate. All estimates
were made on images from the testing set.
Table 12.9. Median absolute error (MAE) and bias for the quantitative change-rate
estimator built from a 275-dimensional feature set implemented using a linear
least-square fit (results for the testing set).

Algorithm MAE Bias


F5 8.39 × 10−3 −5.29 × 10−4
Jsteg 8.38 × 10−3 −5.29 × 10−4
MB1 9.07 × 10−3 3.86 × 10−5
MMX 3.25 × 10−3 1.58 × 10−4
PQ 5.69 × 10−2 −2.89 × 10−3
OutGuess 2.53 × 10−3 1.51 × 10−4
Steghide 3.23 × 10−3 2.60 × 10−4

the blind steganalyzer successfully detects the steganographic method, it can be
turned into a quantitative steganalyzer using regression!
The reader is referred to [195] for more details about this approach to quan-
titative steganalysis. This reference also contains an alternative, slightly more
accurate implementation using methods of support vector regression [215].

12.5 Blind steganalysis in the spatial domain

The blind steganalyzer for JPEG images explained in the previous sections used
features that were computed directly from DCT coefficients. This brought two
important advantages – the features were sensitive to embedding and, in combi-
nation with calibration, they were less sensitive to image content. The calibration
worked because quantized DCT coefficients are robust with respect to small (em-
bedding) changes when the quantization is performed on a desynchronized 8 × 8
grid. Unfortunately, this principle is unavailable in the spatial domain. However,
it is possible to calibrate and simultaneously increase the features’ sensitivity
to embedding by calculating the features from the stego image noise residual
r = y − F (y) obtained using a denoising filter F . Working with the noise resid-
ual instead of the whole image has two important advantages:

1. The cover image content is suppressed and thus the features exhibit less vari-
ation from image to image.
2. The SNR between the stego noise and the cover image is increased, which
leads to increased sensitivity of the features to embedding.
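The residual computation itself is a one-liner once a denoiser is chosen; a minimal sketch with a 3 × 3 moving-average filter deliberately standing in for the more sophisticated wavelet-based filter:

```python
import numpy as np

def noise_residual(y):
    """Noise residual r = y - F(y), with F a simple 3x3 moving-average
    denoiser (an illustrative simplification, not the filter of [175])."""
    y = np.asarray(y, float)
    yp = np.pad(y, 1, mode="edge")
    h, w = y.shape
    smoothed = sum(yp[i:i + h, j:j + w]
                   for i in range(3) for j in range(3)) / 9.0
    return y - smoothed
```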

As an example of this approach to blind steganalysis in the spatial domain,
we now describe the Wavelet Absolute Moments (WAM) feature set originally
published in [102] and then show the results when applying this feature set to
steganalysis of ±1 embedding. It is worth mentioning that many other features
proposed for blind steganalysis [11, 67, 252] were intuitively constructed using
the same principle. The role of the denoising filter is replaced with local content
predictors [67, 252] or image-quality metrics [11]. All methods implicitly make
use of the difference between the stego image and its low-pass-filtered version.
The goal is, however, the same – to remove the image content and make the
features more sensitive to stego noise.

12.5.1 Noise features


The features’ definition stems from the inner mechanism of the wavelet-based de-
noising filter described in [175]. For the sake of simplicity, we will assume that the
image under investigation, y, is a grayscale 512 × 512 image. The image is first
transformed using a single-level 8-tap (decimated) Daubechies wavelet transform
W . The wavelet transform is a linear transformation that provides decomposi-
tion of the image localized both in frequency and in space. The transformed
image consists of four 256 × 256 arrays of real numbers, L, H, V, D, called low-
frequency, horizontal, vertical, and diagonal subbands (see Figure 12.2). This
transformation can be calculated, for example, using the Matlab Wavelet Tool-
box with the command [L,H,V,D]=dwt2(y,’db8’); where y is either an array
of integers in the range {0, . . . , 255}, for an 8-bit grayscale image, or one color
channel in a true-color image. Since the steganographic modifications of most
methods constitute a high-frequency signal, most of the energy of the stego noise
is in the high spatial frequencies. Thus, we will work only with the following three
subbands: H[i, j], D[i, j], V[i, j], i, j = 1, . . . , 256.4
Each subband is viewed as a sum of two random variables – the signal due
to high-pass filtering of the noise-free image (mostly the edges) and the noise.
The image is modeled as a non-stationary Gaussian signal N(0, σ^2[i, j]) and
the noise as a stationary Gaussian signal N(0, σ_n^2) with a known variance σ_n^2
(typically σ_n^2 ≈ 2 gives good results for an 8-bit grayscale image). The variance
σ^2[i, j] is estimated from a local neighborhood of the wavelet coefficient H[i, j],

σ̂^2[i, j] = min(σ̂^2_3[i, j], σ̂^2_5[i, j], σ̂^2_7[i, j], σ̂^2_9[i, j]), (12.25)

where σ̂^2_w[i, j] is the sample variance estimated from a w × w square neighborhood,

σ̂^2_w[i, j] = max( 0, (1/w^2) Σ_{k,l=−⌊w/2⌋}^{⌊w/2⌋} (H[i + k, j + l])^2 − σ_n^2 ). (12.26)

4 This choice is justifiable for detection of steganography that imposes independent
modifications to cover-image pixels, such as ±1 embedding. Steganalysis of stego noise
with different spectral characteristics (e.g., with its energy shifted towards medium
spatial frequencies as in [36, 76]) may benefit from analyzing subbands from
higher-level wavelet decomposition.

Figure 12.2 Example of a grayscale 512 × 512 image and its wavelet transform using an
8-tap Daubechies wavelet. The H, V, and D subbands contain the high-frequency noise
components of the image along the horizontal, vertical, and diagonal directions.

The noise residual in the horizontal subband, rH, is obtained by subtracting
from it a Wiener-filtered version of the subband,

rH[i, j] = H[i, j] − (σ̂^2[i, j] / (σ̂^2[i, j] + σ_n^2)) H[i, j]. (12.27)

The noise residuals rD and rV can be obtained using similar formulas.
The steganographic features are calculated for each subband separately as the
first nine central absolute moments μc [k], k = 1, . . . , 9, of the noise residual


256
μc [k] = C |rH [i, j] − r̄H |k , (12.28)
i,j=1
 1
where r̄H = C i,j rH [i, j] is the sample mean of rH and C = 256 2 is a nor-

malization constant. Since there are three subbands, there will be a total of 27
features for a grayscale image and 3 × 27 features for a color image.
Note that the number of data samples (256²) in each subband does not allow
estimating all moments accurately. In particular, only the sample moments
Table 12.10. Total error probability PE and PD⁻¹(1/2) for detectors constructed to detect
±1 embedding in raw images (RAW), their JPEG versions (JPEG80), and scans of film
(SCAN). The whole ROCs are displayed in Figure 12.3.

          PE      PD⁻¹(1/2)
RAW       16.6%   1.4%
JPEG80    1.7%    0%
SCAN      22.8%   6.1%

μc [k] for k ≤ 4 are relatively accurate estimates of the true moments. However,
even though the higher-order sample moments are not accurate estimates of the
moments, they can still be valuable as features in steganalysis as long as they
sensitively react to embedding changes.
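The feature-extraction pipeline above is straightforward to prototype. The sketch below is a minimal Python illustration that operates on a single high-frequency subband assumed to have been computed already (e.g., with an 8-tap Daubechies transform); the helper names and the integral-image trick for the box sums are our implementation choices, not part of the original description.

```python
import numpy as np

def box_mean(a, w):
    """Mean of a over a w-by-w window centered at each element (reflected borders)."""
    pad = w // 2
    p = np.pad(a, pad, mode="reflect").astype(float)
    s = p.cumsum(0).cumsum(1)
    s = np.pad(s, ((1, 0), (1, 0)))          # integral image with a leading zero row/col
    n, m = a.shape
    return (s[w:w + n, w:w + m] - s[:n, w:w + m]
            - s[w:w + n, :m] + s[:n, :m]) / (w * w)

def noise_moments(subband, sigma_n2=2.0, k_max=9):
    """First k_max central absolute moments of the Wiener-filter noise residual."""
    # Local signal variance: minimum over 3x3, ..., 9x9 windows, each estimate
    # clipped at zero, as in (12.25)-(12.26).
    var = np.minimum.reduce([np.maximum(box_mean(subband ** 2, w) - sigma_n2, 0.0)
                             for w in (3, 5, 7, 9)])
    # Noise residual = subband minus its Wiener-filtered version, as in (12.27).
    r = subband - var / (var + sigma_n2) * subband
    # Central absolute moments (12.28); the 1/N normalization is np.mean.
    d = np.abs(r - r.mean())
    return [float((d ** k).mean()) for k in range(1, k_max + 1)]
```

Calling `noise_moments` on the H, V, and D subbands and concatenating the outputs yields the 27-dimensional feature vector for a grayscale image.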

12.5.2 Experimental evaluation


We demonstrate the performance of the blind steganalysis method based on 27
noise moments by steganalyzing ±1 embedding in the spatial domain of grayscale
images. The results will be reported separately on three different image sources,
which will enable us to demonstrate the influence of the cover source on statistical
detectability as already discussed in Section 12.2. Two of the databases are the
RAW and SCAN sets described in Section 11.1.1. The third database, JPEG80,
contains images from RAW compressed with 80% quality JPEG (the images were
decompressed to the spatial domain before embedding).
Each database was randomly split in half, obtaining two image sets D1 and
D2 . Images from D1 were used as covers for training and were embedded with
relative payload randomly chosen from the set {0.1, 0.25, 0.5} bpp (for RAW and
JPEG80) and from {0.25, 0.5, 0.75, 1} bpp for SCAN. The set D2 was used to
build the testing set in the same manner.
The detectors were soft-margin weighted support vector machines with Gaus-
sian kernels as in Section 12.2. A separate detector was trained for each database.
The detection performance of the blind steganalyzer on ±1 embedding for each
database is shown in Table 12.10 using the total probability of error, PE , and
false-alarm rate at 50% detection, PD−1 (1/2). Figure 12.3 shows the whole ROC
curves drawn by changing the threshold b∗ as described in Section E.5.5.
This experiment demonstrates that the features computed from the noise resid-
ual in the wavelet domain enable reliable detection of ±1 embedding in the spatial
domain. Note that the detection is significantly more reliable for digital-camera
images than for scans. This is attributed to the much higher noise level of scans
as discussed in Section 11.1.1. Note the difference between detection performance
on uncompressed camera images and decompressed JPEGs. This is because the
lossy compression acts as a low-pass filter and thus it is easier to filter out the
steganographic changes.
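The two scalar characteristics reported in Table 12.10 can be computed from the classifier's soft outputs. The sketch below is a minimal illustration of the total error probability under equal priors, PE = min_T (PFA(T) + PMD(T))/2, obtained by sweeping the threshold over the observed scores; the function name and the score convention (larger score means more likely stego) are assumptions on our part.

```python
import numpy as np

def total_error(cover_scores, stego_scores):
    """Minimal total probability of error under equal priors, swept over thresholds."""
    best = 1.0
    for t in np.concatenate([cover_scores, stego_scores]):
        p_fa = float(np.mean(cover_scores > t))   # cover misclassified as stego
        p_md = float(np.mean(stego_scores <= t))  # stego misclassified as cover
        best = min(best, (p_fa + p_md) / 2)
    return best
```

Perfectly separated score distributions give PE = 0, while identical distributions give PE = 1/2 (random guessing).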
Figure 12.3 ROC for detection of ±1 embedding in images from the training databases
RAW, JPEG80, and SCAN. The stego images contained an equal mixture of images
embedded with relative payloads 0.1, 0.25, and 0.50 bpp for RAW and JPEG80, and 0.25,
0.5, 0.75, and 1.0 bpp for SCAN.
Summary

• The goal of blind steganalysis is to detect an arbitrary steganographic method.
• It works by first selecting a set of numerical features that can be computed
from an image. The feature set plays the role of a low-dimensional model of
cover images. Then, a classifier is trained on many examples of cover and stego
image features to recognize cover and stego images in the feature space.
• The performance of a blind steganalyzer is typically evaluated on a separate
testing set using the ROC curve or selected numerical characteristics, such as
the total probability of error, PE, or the false-alarm rate at 50% detection rate,
PD⁻¹(1/2).
• A reliable blind steganalysis classifier for JPEG images can be constructed
from various first-order and higher-order statistics of quantized DCT coefficients.
Calibration further improves the classifier performance because it
makes the features more sensitive to embedding while removing the influence
of the cover content.
• A blind steganalyzer can be built either as a binary cover-versus-all-stego classifier
or as a one-class classifier. In the first case, the two classes are formed by
cover images and stego images embedded by as many steganographic methods
as possible with varying payload. The alternative approach is to train a one-class
steganalyzer that recognizes only covers and marks features incompatible
with the training set as stego (or suspicious).
• The equivalent of calibration in the spatial domain is the principle of computing
the features from the image noise residual obtained using a denoising
filter.
• An example of a blind steganalysis method in the spatial domain is shown.
It is based on computing higher-order moments of the image noise residual in
high-frequency wavelet subbands.
• Blind steganalysis can also be used to construct targeted and quantitative
attacks and to classify stego images into known steganographic programs.
13 Steganographic capacity

Intuition tells us that steganographic capacity should perhaps be defined as
the largest payload that Alice can embed in her cover image using a specific
embedding method without introducing artifacts detectable by Eve. After all,
knowledge of this secure payload appears to be fundamental for the prisoners to
maintain the security of communication. Unfortunately, determining the secure
payload for digital images is very difficult even for the simplest steganographic
methods, such as LSB embedding. The reason is the lack of accurate statistical
models for real images. Moreover, it is even a valid question whether capacity can
be meaningfully defined for an individual image and a specific steganographic
method. Indeed, capacity of noisy communication channels depends only on the
channel and not on any specific communication scheme.
This chapter has two sections, each devoted to a different capacity concept. In
Section 13.1, we study the steganographic capacity of perfectly secure stegosys-
tems. Here, we are interested in the maximal relative payload (or rate) that
can be securely embedded in the limit as the number of pixels in the image
approaches infinity. Capacity defined in this way is a function of only the physi-
cal communication channel and the cover source rather than the steganographic
scheme itself. It is the maximal relative payload that Alice can communicate if
she uses the best possible stegosystem. The significant advantage of this defini-
tion is that we can leverage upon powerful tools and constructions previously
developed for study of robust watermarking systems. This apparatus also en-
ables simultaneous study of stegosystems with distortion-limited embedder and
distortion-limited active warden (Section 6.3). For some simple models of covers,
it is also possible to compute the capacity and design capacity-reaching stegosys-
tems. However, the steganographic capacity derived from these models is often
overly optimistic because simple models do not describe real images well. Also,
because of the model mismatch, it is hard to relate specific numerical results to
real cover sources. Despite these objections, the theoretical results provide deeper
insight into the problem as well as the possibility of progressive improvement to
include more realistic cover models.
Section 13.2 deals with a more practical issue, which is the largest payload that
can be embedded in covers of a certain size using a specific embedding method.
To distinguish this concept from the capacity as defined in Section 13.1, we use
the term “secure payload” instead. The section starts with a bold pragmatic
assumption that it is unlikely that a perfectly secure stegosystem can ever be built
for real covers due to their overwhelming complexity. Under the assumption that
the steganographic system is not perfectly secure, the secure payload is defined
in absolute terms as the critical size of payload for which the KL divergence be-
tween covers and stego images reaches a certain fixed value. The secure payload
defined in this way is bounded by the square root of the number of pixels, n,
which means that the safe communication rate approaches zero as 1/√n. The
result is supported in Section 13.2.2 by experiments carried out for a selected
embedding algorithm and a chosen blind steganalyzer.
Before starting with the technical arguments, we note that Alice may attempt
to determine the secure payload experimentally using current best blind stegana-
lyzers on a large database of cover images from the investigated cover source. By
analyzing images embedded with different payloads, she can estimate the secure
payload as the critical size of payload at which her steganalyzer starts making
random guesses (e.g., PE ≥ 0.5 − δ for some selected δ; see (10.15) for the defi-
nition of PE ). Secure payload defined this way informs the prisoners about the
longest message that can be safely embedded given the current state of the art
in steganalysis. The obvious disadvantage is that the estimate is now tied to a
specific steganalyzer. Alice can thus obtain only an upper bound on the true size
of the secure payload, which may dramatically change with further progress in
steganalysis. The reader is referred to [102, 95, 217] for some recent results.
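Alice's experimental procedure can be sketched as a simple search. Assuming she wraps her database experiment in a (hypothetical) callable `pe_of_payload` that returns the steganalyzer's total error PE for a given relative payload, and that PE is non-increasing in the payload, the critical payload can be located by bisection:

```python
def critical_payload(pe_of_payload, delta=0.05, lo=0.0, hi=1.0, iters=40):
    """Largest payload in [lo, hi] for which PE >= 0.5 - delta, found by bisection.

    pe_of_payload is a hypothetical user-supplied callable mapping relative
    payload (bpp) to the steganalyzer's total error PE, assumed non-increasing.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pe_of_payload(mid) >= 0.5 - delta:
            lo = mid          # detector still guesses almost randomly: payload safe
        else:
            hi = mid          # detector works reliably: payload too large
    return (lo + hi) / 2
```

As noted above, the returned value is only an upper bound on the true secure payload, tied to the particular steganalyzer behind `pe_of_payload`.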

13.1 Steganographic capacity of perfectly secure stegosystems

In this section, we define steganographic capacity of perfectly secure stegosystems
with a distortion-limited embedder and an active distortion-limited warden.
The results for the passive-warden scenario and unlimited distortion of the em-
bedder can be obtained as corollaries. The capacity is defined as the largest rate
(relative payload) over all possible steganographic schemes satisfying the dis-
tortion constraints and the steganographic constraint (the distributions of cover
and stego objects should coincide).
The material in this section deviates from the rest of this book in the sense
that it requires advanced knowledge of information theory that goes well beyond
the prerequisites of this book. Proofs of the statements in this section can be
found in the original publications [49, 180, 237]. The reader is referred to [50] for
a comprehensive text on information theory.
We start by reminding the reader of the concept of a steganographic channel
as defined in Section 6.3. A steganographic channel consists of the source of cov-
ers, x ∈ X n , where X is a finite set (e.g., for grayscale images, X = {0, . . . , 255}),
secret keys, k ∈ K , and the source of messages, m ∈ M, with corresponding dis-
tributions on their sets, Pc , Pk , Pm , the embedding and extraction algorithms,
Embn : X n × K × M → X n , Extn : X n × K → M,1 and the physical communi-
cation channel described by a matrix of conditional probabilities A(ỹ|y) that a
stego object y sent by Alice is received in its noisy form as ỹ by Bob. The attack
channel is considered to be memoryless, or

A(ỹ|y) = ∏_{i=1}^{n} A(ỹ[i]|y[i]).   (13.1)

Furthermore, we assume that the expected embedding distortion imposed by
Alice and the channel distortion per cover element are bounded,

E[d(x, y)] = Σ_x Pc(x) Σ_k Pk(k) Σ_m Pm(m) d(x, Embn(x, k, m)) ≤ D1,   (13.2)

Σ_{y,ỹ} Pc(y) A(ỹ|y) d(ỹ, y) ≤ D2,   (13.3)

where d(., .) : X n × X n → [0, ∞) is a measure of distortion per cover element.
The requirement of perfect security is Pc = Ps or, equivalently, DKL (Pc ||Ps ) = 0,
where Ps is the distribution of stego objects y = Embn (x, k, m).
In this section, we allow the embedding mapping to fail to embed a message
in the sense that Extn(Embn(x, k, m), k) ≠ m in some cases, as long as the
probability of this occurring is small with growing length of covers, n. To be more
precise, we say that rate R is achievable if there exists a sequence of embedding
and extraction mappings (Embn)_{n=1}^∞, (Extn)_{n=1}^∞, such that

sup_A Pr{Extn(Embn(x, k, m), k) ≠ m} → 0   (13.4)

as n → ∞ and |M| ≥ 2^{nR}. The random variable Embn(x, k, m) is required to
comply with the distortion constraint (13.2). The supremum is taken over all
noisy channels A(ỹ|y) constrained by (13.3).
The steganographic capacity is defined as the supremum of all achievable rates.
It will be denoted as Csteg (D1 , D2 ) as it clearly depends on the distortion bounds.
Note that Csteg is not a concept defined for one specific embedding scheme!
Instead, it is a function of only the cover source and the distortion bounds.
We also define the concept of a covert channel as the conditional probabilities
Q(y, u|x) = Pr{y = y, u = u|x = x}, where u is an auxiliary random variable on
a finite but arbitrarily large alphabet U. This random variable is used to define
a random codebook shared between Alice and Bob to realize the steganographic
communication.

1 Here, we use the subscript n to stress the fact that Embn and Extn act on n-element covers.
We say that the covert channel Q is feasible if

Σ_{y,u,x} Q(y, u|x) Pc(x) d(x, y) ≤ D1,   (13.5)

Σ_{u,x} Q(y, u|x) Pc(x) = Pc(y).   (13.6)

Thus, a feasible covert channel complies with the embedding-distortion constraint
and perfect steganographic security.
The next theorem gives an expression for the steganographic capacity Csteg
for the most general case of an active warden.

Theorem 13.1. [Steganographic capacity]

Csteg(D1, D2) = sup_Q inf_A {I(u; ỹ) − I(u; x)},   (13.7)

where (u, x) → y → ỹ forms a Markov chain.

In the theorem, the supremum is taken over all feasible channels Q and the
infimum over all attack channels A satisfying the distortion constraint.
For the passive-warden scenario, A(ỹ|y) = 1 when ỹ = y, and the capacity
result simplifies to the following corollaries.

Corollary 13.2. [Passive warden] For the passive-warden case (D2 = 0),

Csteg(D1, 0) = sup_Q H(y|x).   (13.8)

Corollary 13.3. [Passive warden, unlimited embedding distortion] For
unlimited embedding distortion, D1 = ∞, and a passive warden

Csteg (∞, 0) = H(x). (13.9)

The capacity can be calculated for some simple cover sources, such as the
source of uniformly distributed random bits, X = {0, 1}, when covers x are
Bernoulli sequences B(1/2) (see Appendix A for the definition) and the distortion
is the Hamming distance dH [180]. It has also been established for iid Gaussian
sources [49].

13.1.1 Capacity for some simple models of covers
The capacity for the Bernoulli B(1/2) cover source has been derived in [17, 197]
without the steganographic constraint Pc = Ps. For the passive-warden scenario,

Csteg(D1, 0) = H(D1)  if 0 ≤ D1 ≤ 1/2,
               1      if D1 ≥ 1/2,        (13.10)
and for the active warden

Csteg = (D1/d∗) (H(d∗) − H(D2))  if 0 ≤ D1 ≤ d∗,
        H(D1) − H(D2)            if d∗ ≤ D1 ≤ 1/2,      (13.11)
        1 − H(D2)                if D1 ≥ 1/2,

where d∗ = 1 − 2^(−H(D2)) and H(x) is the binary entropy function (Figure 8.2).
In this special case, the steganography constraint Pc = Ps does not influence the
steganographic capacity because the capacity-reaching distribution is Ps = Pc .
It is possible to construct steganographic schemes that reach the capacity using
random linear codes [254].
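Formulas (13.10) and (13.11) are easy to evaluate numerically. The sketch below implements them for the Bernoulli B(1/2) cover source with Hamming distortion; the function names are ours, and the breakpoint symbol d∗ follows the reconstruction above.

```python
import math

def H2(x):
    """Binary entropy function in bits, with H2(0) = H2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def c_steg(D1, D2=0.0):
    """Steganographic capacity of the Bernoulli(1/2) source, (13.10)-(13.11)."""
    if D2 == 0.0:                       # passive warden, equation (13.10)
        return H2(D1) if D1 <= 0.5 else 1.0
    d_star = 1.0 - 2.0 ** (-H2(D2))     # breakpoint d* in equation (13.11)
    if D1 <= d_star:
        return (D1 / d_star) * (H2(d_star) - H2(D2))
    if D1 <= 0.5:
        return H2(D1) - H2(D2)
    return 1.0 - H2(D2)
```

For example, `c_steg(0.25)` returns H(0.25) ≈ 0.811 bits per cover element, and any positive channel distortion D2 strictly reduces the capacity.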
More realistic cover sources formed by iid Gaussian sequences were studied
in [49]. The authors computed a lower bound on the capacity of perfectly se-
cure stegosystems for such sources and studied the increase in capacity for
ε-secure systems. They also described a practical QIM2 lattice-based construction
for high-capacity secure stegosystems and contrasted their work with results ob-
tained for digital watermarking that do not include the steganographic constraint
Pc = Ps . The authors of [237] describe codes for construction of high-capacity
stegosystems and show how their approach can be generalized from iid sources
to Markov sources and sources over continuous alphabets.
Finally, we note that the work [180] also investigated steganographic capacity
for an alternative definition of an almost sure distortion of the attack channel in
which the channel distortion constraint is replaced with Pr{d(y, ỹ) > D2} = 0,3
and showed that for this form of the active warden, the steganographic capacity
is also given by Theorem 13.1.

13.2 Secure payload of imperfect stegosystems

In the previous section, we introduced a communication-theoretic definition of
steganographic capacity as the maximal achievable rate of stegosystems com-
plying with the required distortion constraints in the limit when the number
of elements in the cover goes to infinity. We also learned that for sufficiently
simple models of covers, it is possible to compute the capacity and even con-
struct capacity-reaching perfectly secure stegosystems or ε-secure stegosystems
with positive rate, depending on the cover model. However, the problem of de-
termining the maximal size of the secure payload for a specific steganographic
scheme and for realistic covers remains open. Due to our ignorance about the
cover model, all steganographic techniques designed for real digital multimedia
are likely to be statistically detectable in the sense that their KL divergence will

2 A version of Quantization Index Modulation (QIM) appears in Chapter 6, otherwise, the


reader is referred to [51].
3 This distortion constraint was for the first time used in [219].
be unbounded with increasing number of elements n in the cover. This indicates
that for such imperfect schemes the secure relative payload will converge to zero.
In this section, we establish the so-called Square-Root Law (SRL) for imperfect
stegosystems which says that for a fixed ε-secure stegosystem the absolute secure
payload grows only as the square root of n and thus the communication
rate approaches zero as O(1/√n).
The fact that the secure payload of practical steganographic schemes for digital-media
covers might be sublinear in the number of cover elements was first suspected
by Anderson [3] in 1996:

“Thanks to the Central limit theorem, the more covertext we give the warden, the
better he will be able to estimate its statistics, and so the smaller the rate at which
[the steganographer] will be able to tweak bits safely. The rate might even tend to
zero...”

The first insight into the problem was obtained from analysis of batch steganog-
raphy and pooled steganalysis [132]. In batch steganography, Alice and Bob try
to minimize the chances of being caught by dividing the payload into chunks
(not necessarily uniformly) and embedding each chunk in a different image. The
warden is aware of this fact and thus pools her results from testing all images
sent by both prisoners. One of the main results obtained in this study is the fact
that if there exists a detector for the stegosystem, the secure payload grows only
as the square root of the number of communicated covers. This result could be
interpreted as the square-root law for a single image by dividing it into smaller
blocks. In the next section, we describe a different proof of this law for covers
modeled as sequences of iid variables and for steganographic algorithms that do
not preserve the cover model. In Section 13.2.2, the SRL is experimentally veri-
fied using a blind steganalyzer in the DCT domain for the embedding operation
of F5 (Section 7.3.2).

13.2.1 The SRL of imperfect steganography
Before proceeding with the technical result, we note that we will study only
stegosystems for which the embedding and extraction mappings satisfy

Ext (Emb(x, k, m), k) = m, ∀x ∈ C, k ∈ K, m ∈ M. (13.12)

In other words, it is assumed that every message can be embedded in every cover.
If the steganographic method has this property of homogeneity, one could define
the secure payload as log2 |M| for the largest set M for which the KL divergence
between cover and stego images is below a fixed threshold  > 0.
Let us assume that the cover source produces sequences of n iid realizations
of a discrete random variable x with probability mass function p0 [i] > 0 for all i,
where i is an index whose range is determined by the range of cover elements (for
example, i ∈ {0, . . . , 255} for an 8-bit grayscale image). Let us further assume
that after embedding using change rate β, the stego image can be described as
an iid sequence with pmf

pβ [i] = p0 [i] + βr[i], (13.13)

where r[i] are constants4 independent of β, such that r[k] ≠ 0 for at least one
k. Before discussing the plausibility of assuming that the impact of embedding
on the cover source is in the form of equation (13.13), notice that for imper-
fect steganography there must indeed exist at least one r[k] ≠ 0, otherwise the
steganographic method would be undetectable within our cover-source model.
Assumption (13.13) holds for many practical stegosystems and digital-media
covers. In general, it is true whenever the embedding visits individual cover
elements and applies an independent embedding operation to each visited cover
element. This is true, for example, for steganographic schemes whose embedding
impact can be modeled as adding to x an independent random variable ξ with
pmf u that is linear in β. This is because the distribution of the stego image
elements y = x + ξ is the convolution u ∗ p0, or pβ[i] = Σ_j u[i − j] p0[j].
Examples of practical steganographic schemes that do satisfy (13.13) include
LSB embedding (see equation (5.10)), ±1 embedding (see Section 7.3.1 and Ex-
ercise 7.1), stochastic modulation (Section 7.2.1), and the embedding operation
of F5, –F5, nsF5, and MMx algorithms (see Exercise 7.2 for F5, Section 9.4.7 for
nsF5, and Section 9.4.3 for MMx). Even though the relationship (13.13) may not
be valid for steganography that preserves the first-order statistics (histogram) of
the cover, such as OutGuess, a similar dependence5 may exist for some proper
subsets of the cover or for local pairs of cover elements (since most stego meth-
ods disturb higher-order statistics). For example, OutGuess preserves only the
global histogram but not the histograms of individual DCT modes. Thus, as-
sumption (13.13) applies to virtually all practical steganographic schemes for
some proper model of the cover.
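For instance, for ±1 embedding the independent stego noise ξ takes the values −1, 0, +1 with probabilities β/2, 1 − β, β/2, so its pmf is linear in β and the stego pmf is the convolution above. A small numpy sketch (boundary bins are ignored here for simplicity, and the function name is ours):

```python
import numpy as np

def stego_pmf(p0, beta):
    """Stego pmf p_beta = u * p0 for +-1 embedding with change rate beta."""
    u = np.array([beta / 2, 1 - beta, beta / 2])  # pmf of the stego noise xi
    return np.convolve(p0, u)  # output has two extra (boundary) bins

# p_beta is linear in beta: p_beta = p0 + beta * r with r independent of beta
p0 = np.array([0.2, 0.8])
r = stego_pmf(p0, 1.0) - stego_pmf(p0, 0.0)   # direction of change
assert np.allclose(stego_pmf(p0, 0.3), stego_pmf(p0, 0.0) + 0.3 * r)
```

The linearity check at the end is exactly assumption (13.13) for this toy two-bin cover model.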

Theorem 13.4. [The SRL of imperfect steganography] Under assumption (13.13),

1. If the absolute number of embedding changes β(n)n increases faster than √n
in the sense that lim_{n→∞} β(n)n/√n = ∞, then for sufficiently large n there
exist steganalysis detectors with arbitrarily small probability of false alarms
and missed detection.
2. If the absolute number of embedding changes increases slower than √n,
lim_{n→∞} β(n)n/√n = 0, then the steganographic algorithm can be made
ε-secure for any ε > 0 for sufficiently large n.
3. Finally, when lim_{n→∞} β(n)n/√n = C, the stegosystem is
(C²/2) Σ_i r²[i]/p0[i]-secure for sufficiently large n.

4 Note that since pβ is a pmf, we must have Σ_i r[i] = 0.
5 When pβ is the histogram of pairs (groups) of pixels, the dependence of pβ on β may not
be linear, but the arguments of this section apply as long as ∂pβ[k]/∂β |β=0 ≠ 0 for some
(multi-dimensional) index k.

Proof. Before proving the theorem, we clarify one technical issue. The theorem
is formulated with respect to the number of embedding changes rather than pay-
load. This is quite understandable because statistical detectability is influenced
by the impact of embedding and not the payload. However, for steganographic
schemes that exhibit a linear relationship between the number of changes and
payload, such as schemes that do not employ matrix embedding, the theorem
can be easily seen in terms of payload.
Proof of Part 1: Under Kerckhoffs’ principle, the warden knows the distri-
bution of cover elements p0 . Her task is to distinguish whether she is observing
realizations of a random variable with pmf pβ with β = 0 or with β > 0. To show
the first part of the SRL, the warden needs only to construct a sufficiently good
detector rather than the best possible one. In particular, it is sufficient to constrain
ourselves to the index k for which r[k] ≠ 0, say r[k] > 0. Let (1/n)hβ[i] be
the random variable corresponding to the value of the ith bin of the normalized
histogram of the stego image embedded with change rate β. We will investigate
the following scaled variable:

νβ,n = √n ((1/n) hβ[k] − p0[k]).   (13.14)
From the assumption of independence of stego elements, νβ,n is a binomial ran-
dom variable with expected value and variance

E[νβ,n] = β√n r[k],   (13.15)
Var[νβ,n] = pβ[k] (1 − pβ[k]).   (13.16)

With n → ∞, the distribution of νβ,n converges to a Gaussian distribution with
the same mean and variance. Thus, from now on, we will make the following
simplifying assumption:

νβ,n is Gaussian. (13.17)

This step will simplify further arguments and, it is hoped, highlight the main
idea without cluttering it with technicalities. The interested reader is referred
to Exercise 13.1 to see how the arguments can be carried out without assump-
tion (13.17).
The variance of νβ,n can be bounded as

Var[νβ,n] = σβ,n² ≤ 1/4   (13.18)

for all β ∈ [0, 1] because x(1 − x) ≤ 1/4.
The difference between the mean values,

E[νβ,n] − E[ν0,n] = E[νβ,n] = β√n r[k],   (13.19)

tends to ∞ with n → ∞ when β√n → ∞. Note that for β = 0, ν0,n is Gaussian
N(0, σ0,n²) with σ0,n² ≤ 1/4.
Consider the following detector:

νβ,n > T  decide stego (β > 0),   (13.20)
νβ,n ≤ T  decide cover (β = 0),   (13.21)

where T is a fixed threshold. We will now show that T can be chosen to make
the detector probability of false alarms and missed detections satisfy

PFA ≤ εFA,   (13.22)
PMD ≤ εMD   (13.23)

for arbitrary εFA, εMD > 0 for sufficiently large n. The threshold T(εFA) will be
determined from the requirement that the right tail, x ≥ T(εFA), for the Gaussian
variable N(0, σ0,n²) is less than or equal to εFA. This can be conveniently written
using the tail probability function Q(x) = (1/√(2π)) ∫_x^∞ e^(−t²/2) dt for a standard
normal random variable (see Appendix A) as6

εFA ≥ Q(T(εFA)/(1/2)) ≥ Q(T(εFA)/σ0,n)  ⇒  (1/2) Q⁻¹(εFA) ≤ T(εFA),   (13.24)

or it is sufficient to take T = (1/2) Q⁻¹(εFA).
Because of the growing difference between the means (13.19), we can find n
large enough that the left tail, x ≤ T(εFA), of the Gaussian variable νβ,n is less
than or equal to εMD,

εMD ≥ Q((β√n r[k] − T(εFA))/(1/2)) ≥ Q((β√n r[k] − T(εFA))/σβ,n)   (13.25)

or

(1/2) Q⁻¹(εMD) ≤ β√n r[k] − T(εFA),   (13.26)

which can be satisfied for sufficiently large n ≥ n_εMD because β√n → ∞.
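The detector from the proof of Part 1 is simple to implement. The sketch below computes the statistic (13.14) and the threshold T = (1/2)Q⁻¹(εFA) from (13.24); since Q⁻¹ is not in the Python standard library, it is obtained here by bisection on Q(x) = ½ erfc(x/√2), which is an implementation choice rather than part of the proof.

```python
import math
import numpy as np

def q_inv(p, lo=-10.0, hi=10.0):
    """Inverse of the Gaussian tail function Q(x) = 0.5*erfc(x/sqrt(2)), by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(mid / math.sqrt(2)) > p:
            lo = mid   # Q(mid) still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

def nu_statistic(sample, k, p0_k):
    """Scaled deviation of bin k of the normalized histogram, equation (13.14)."""
    n = sample.size
    return math.sqrt(n) * (np.count_nonzero(sample == k) / n - p0_k)

def decide_stego(sample, k, p0_k, eps_fa=0.01):
    """Decide 'stego' when nu exceeds T = (1/2) * Q^{-1}(eps_fa), as in (13.24)."""
    return nu_statistic(sample, k, p0_k) > 0.5 * q_inv(eps_fa)
```

The statistic grows like β√n r[k] under embedding, which is why detection becomes arbitrarily reliable once β√n → ∞.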
Proof of Part 2: We now show the second part of the SRL. In particular, we
prove that if β√n → 0, the steganography is ε-secure for any ε > 0 for sufficiently
large n.
Let x and y be n-dimensional random variables representing the elements
of cover and stego images. Due to the independence assumption about cover

6 Note that Q−1 (x) is a decreasing function.


elements and the embedding operation,

DKL(x||y) = n DKL(p0||pβ) = n Σ_i p0[i] log(p0[i]/pβ[i])   (13.27)
          = −n Σ_i p0[i] log(1 + βr[i]/p0[i])   (13.28)
          = −nβ Σ_i r[i] + (nβ²/2) Σ_i r²[i]/p0[i] − n Σ_i p0[i] θ³[i]/3,   (13.29)

where θ[i] ∈ (0, βr[i]/p0[i]). We used the Taylor expansion log(1 + x) = x − (1/2)x² +
(1/3)θ³ and the Lagrange form of the remainder. Since Σ_i r[i] = 0, we readily obtain
that the KL divergence is locally quadratic in β because it is assumed that at
least one r[k] > 0 and p0[i] > 0 for all i. Under the assumption that β√n → 0,
the quadratic term converges to zero. Because θ³[i] < β³r³[i]/p0³[i], we have for
the Lagrange remainder

|n Σ_i p0[i] θ³[i]/3| < (nβ²/3) β |Σ_i r³[i]/p0²[i]| → 0.   (13.30)

This establishes the result that DKL (x||y) can be made arbitrarily small for
sufficiently large n. Putting this another way, when the number of embedding
changes is smaller than the square root of the number of cover elements n, the
steganographic scheme becomes asymptotically perfectly secure.

Proof of Part 3: In the third case when β√n → C, the cubic term also goes to
zero because β → 0. The quadratic term can be bounded as

(nβ²/2) Σ_i r²[i]/p0[i] ≤ (C²/2) Σ_i r²[i]/p0[i].   (13.31)
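The locally quadratic behavior of the KL divergence in (13.29) is easy to check numerically for a toy cover model. The sketch below compares the exact per-element divergence with the leading quadratic term (in nats); the two-bin Bernoulli example is ours.

```python
import math

def kl(p, q):
    """KL divergence in nats between two pmfs on the same alphabet."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def quadratic_term(p0, r, beta):
    """Leading term of (13.29) for a single cover element (n = 1)."""
    return (beta ** 2 / 2) * sum(ri ** 2 / pi for ri, pi in zip(r, p0))

# toy model: p_beta = p0 + beta*r with p0 = (1/2, 1/2), r = (1/2, -1/2)
p0, r, beta = [0.5, 0.5], [0.5, -0.5], 0.01
p_beta = [pi + beta * ri for pi, ri in zip(p0, r)]
exact = kl(p0, p_beta)
approx = quadratic_term(p0, r, beta)
# the difference is the cubic Lagrange remainder, O(beta^3)
assert abs(exact - approx) < beta ** 3
```

Halving β reduces the divergence by roughly a factor of four, which is the quadratic dependence the proof exploits.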

Having established the SRL of imperfect steganography, we now discuss its im-
plications. Put in simple words, the law states that the secure payload of a wide
class of practical embedding methods is proportional to the square root of the
number of cover elements that can be used for embedding. For the steganographer,
this means that an r-times larger cover image can hold only √r-times
rapher, this means that an r-times larger cover image can hold only r-times
larger payload at the same level of statistical detectability (the same KL diver-
gence). The SRL also means that it is easier for the warden to detect the same
relative payload in larger images than in smaller ones.
The SRL has additional important implications for steganalysis. For example,
when evaluating the statistical detectability of a steganographic method for a
fixed relative payload, we will necessarily obtain better detection on a database
of large images than on small images. To make comparisons meaningful, we
should fix the “root rate,” defined as the number of bits per square root of the
number of cover elements, rather than the rate. This would also alleviate issues
with interpreting steganalysis results on a database of images with varying size.
For schemes that do use matrix embedding, due to the non-linear relationship
between the change rate and the relative payload, the critical size of the payload
does not necessarily scale with √n. This is because βn changes may communicate
payload up to nH(β) bits, where H(x) is the binary entropy function. Because
for x → 0, H(x) ≈ x log(1/x), we can say that the critical size of the payload
picks up an additional logarithmic factor and scales as nβ log(1/β), which for
β ≈ 1/√n behaves as √n log n.
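The two scaling regimes can be summarized in a few lines. Assuming the SRL permits c√n embedding changes, the sketch below returns the corresponding payload with and without optimal matrix embedding; the function name and the constant c are illustrative choices, not quantities from the text.

```python
import math

def secure_payload_bits(n, c=1.0, matrix_embedding=False):
    """Payload permitted by c*sqrt(n) embedding changes in an n-element cover."""
    changes = c * math.sqrt(n)
    beta = changes / n                     # change rate beta ~ c / sqrt(n)
    if not matrix_embedding:
        return changes                     # one bit per change: ~ sqrt(n)
    # optimal matrix embedding: up to n*H(beta) bits ~ sqrt(n) * log(n)
    return n * (-beta * math.log2(beta) - (1 - beta) * math.log2(1 - beta))
```

Quadrupling n only doubles the plain payload, while matrix embedding adds the logarithmic factor discussed above.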
On the more fundamental level, the SRL essentially tells us that the conven-
tional approach to steganography is highly suboptimal because we already know
from Section 6.2 and Section 13.1 that when the cover source is known, the se-
cure payload is linear in the number of cover elements (the rate is positive). The
reader should also compare this result with the capacity of noisy communication
channels [50], where the number of message bits that can be sent without any
errors increases linearly with the number of communicated symbols.
Finally, we note that the law can be generalized in several directions. It has
been established for a more general class of Markov covers and steganography
realized by applying a mutually independent embedding operation at each cover
element [71]. Also, the linear form of pβ with respect to β in assumption (13.13)
can be relaxed. All that is really necessary to carry out essentially the same steps
in the proof is the assumption that
∂pβ[k]/∂β |β=0 ≠ 0 for some k.   (13.32)
Constraining ourselves to some right neighborhood of 0, β ∈ [0, β0 ), we could
expand pβ [i] using Taylor expansion on this interval and then repeat similar
arguments as in the proof of the SRL under assumption (13.13).

13.2.2 Experimental verification of the SRL
In the previous section, under certain assumptions on the cover source we learned
that the secure payload of imperfect stegosystems grows only as the square root
of the number of usable elements of the cover. In this section, this law is ex-
perimentally verified on the example of the F5 algorithm that embeds messages
in JPEG images (Section 7.3.2). For this experiment, the matrix embedding in
F5 was disabled to obtain a linear relationship between relative payload and the
change rate.
As in the rest of this book, for JPEG images, the relative payload α is mea-
sured with respect to all non-zero DCT coefficients in the cover JPEG file. To
verify the SRL, care needs to be taken when preparing the image sets for the ex-
periment. Basically, what is needed is a sequence of image sets, Si , i = 1, . . . , L,
such that all images from one fixed set contain approximately the same number
of non-zero DCT coefficients, ni , with ni forming a strictly increasing sequence.
Each set needs to contain sufficiently many images for training and testing a
blind steganalyzer to evaluate the statistical detectability. Moreover, the images

should all have the same quality factor to avoid biasing the experimental re-
sults. Ideally, the sets Si should contain images from different sources. When
attempting to comply with this requirement, however, an infeasibly high num-
ber of images would be needed. A more economical approach would be to use one
large database of raw, never-compressed images, D, and create the individual im-
age sets by cropping, scaling, and compressing the images from D. Downscaling
would, however, likely introduce an unwanted bias because downsampled images
contain more high-frequency content, which would produce JPEG images with a
different profile of non-zero DCT coefficients across the DCT modes. While crop-
ping avoids this problem, it may produce images with singular content, such as
portions of sky or grass. To avoid creating such pathological images, the cropping
was carried out so that the ratio of non-zero coefficients to the number of pixels
in the image was kept approximately the same as for the original-size image.
The images used in the experiments below were all created from a database of
6000 never-compressed raw images stored as 8-bit grayscales. From this database,
a total of L = 15 image sets that contained images with n = 20 × 103 , 40 ×
103 , . . . , 300 × 103 non-zero DCT coefficients were prepared by cropping. The
cover images from all image sets were singly compressed JPEGs with quality
factor 80.
The statistical detectability was evaluated using the targeted blind stegana-
lyzer containing 274 features described in Section 12.4 implemented using SVMs
with Gaussian kernel. Each image set, Si , was split into two disjoint subsets, one
of which was used for training and the other for testing the steganalyzer. The
training set contained 3500 images and the corresponding 3500 stego images,
while the testing set consisted of 2500 cover and the corresponding 2500 stego
images. The ability of the steganalyzer to detect the embedding on the testing
set was evaluated using 1 − PE, where PE = ½(PFA + PMD) is the minimal
average classification error under equal prior probabilities of cover and stego
images defined in equation (10.15). For undetectable steganography, 1 − PE ≈ 0.5,
while 1 − PE ≈ 1 corresponds to perfect detection. Finally, the training and testing
of the steganalyzer was repeated 100 times with different splitting of Si into the
training and testing subsets and the average PE reported as the result.
First, the same fixed payload was embedded across all image sets. Intuitively,
we expect the detectability to be the largest for the image set with the smallest
images, S1 , while the smallest detectability, close to 0.5, is expected for the set
with the largest images, S15 . Figure 13.1 confirms this expectation. Next, the
payload embedded in images from set Si was made linearly proportional to the
number of non-zero DCT coefficients, ni . As can be seen from Figure 13.2, the
statistical detectability increases with increasing ni. Finally, the detectability
stays approximately constant when the payload is made proportional to √ni
(Figure 13.3). The spread around the data points in all three figures corresponds
to a 90% confidence interval.
Figure 13.4 shows the result of the fourth experiment, where for each image
set, Si , the largest absolute payload, M (ni ), for which the steganalyzer obtained

Figure 13.1 Detectability 1 − PE of embedding in JPEG images with increasing number of
non-zero DCT coefficients, n (up to 3 × 10⁵). Fixed payload; curves for bpnc = 0.175,
0.075, and 0.025.

Figure 13.2 Detectability 1 − PE of embedding in JPEG images with increasing number of
non-zero DCT coefficients, n (up to 3 × 10⁵). Payload proportional to n; curves for
bpnc = 0.125, 0.075, and 0.025.

Figure 13.3 Detectability 1 − PE of embedding in JPEG images with increasing number of
non-zero DCT coefficients, n (up to 3 × 10⁵). Payload proportional to √n; curves for
bpnc = 0.125, 0.075, and 0.025.

a fixed value of PE was iteratively found. A linear fit through the experimental
data displayed in a log–log plot confirms the SRL.
The SRL of imperfect steganography was verified for other forms of steganog-
raphy and other measures of security in [140].

Summary
r Informally, the steganographic capacity is the size of the maximal payload that
can be undetectably embedded in the sense that the KL divergence between
the distributions of cover and stego images is zero.
r The steganographic capacity can be rigorously defined in the limit when the
number of elements forming the cover object grows to infinity. It is the largest
communication rate taken across all stegosystems satisfying the requirement of
perfect security, possibly with constraints on the embedding and/or channel
distortion.
r The capacity can be computed analytically for sufficiently simple cover
sources. It is rather difficult to estimate the capacity for real digital media
due to its complexity.
r The secure payload is the largest payload that can be embedded using a given
steganographic scheme with a fixed statistical detectability expressed by the
KL divergence between cover and stego objects.

Figure 13.4 The largest payload m(ni) that produces the same value of PE for the ith
image set containing images with ni non-zero DCT coefficients, plotted as log m(n)
versus log n. The straight lines are the corresponding linear fits, m(n) = n^0.532 for
PE = 0.1 and m(n) = n^0.535 for PE = 0.25. The slope of the lines is 0.53 and 0.54,
which is in good agreement with the SRL.

r The secure payload can be estimated experimentally from a large database
of images using blind steganalyzers. The disadvantage is that this estimate
depends on the database of images and the choice of blind steganalyzers. On
the other hand, it provides useful information to the steganographer, given the
current state of the art of steganalysis.
r The secure payload of imperfect steganographic systems increases with the
square root of the number of cover elements. This result is known as the SRL
of imperfect steganography and holds for virtually all known steganographic
techniques for real digital media.
r The SRL has been confirmed experimentally using blind steganalyzers.

Exercises
13.1 [Berry–Esséen theorem] A version of the Berry–Esséen theorem
says that for a sequence of iid random variables x[i], i = 1, . . . , n, whose distri-
bution satisfies E[x] = 0, E[x²] = σ² > 0, and E[|x|³] = ρ < ∞, the cumulative
distribution function Fn(x) of the random variable

√n x̄/σ (13.33)

satisfies

|Fn(x) − Φ(x)| < ρC/(σ³√n) for all x. (13.34)
The symbol x̄ stands for the sample mean computed from n realizations, Φ(x)
is the cdf of a normal random variable, and C is a positive constant. The
theorem essentially says that the cdf of the scaled sample mean uniformly
converges to the cdf of its limiting random variable and the rate of convergence
is 1/√n.
Use this theorem to carry out the arguments in the proof of the SRL of
imperfect steganography without the Gaussian assumption (13.17). Note that
the same bound holds true for the corresponding tail probability functions
1 − Fn (x) and Q(x) = 1 − Φ(x).
A Statistics

A.1 Descriptive statistics

The purpose of this appendix is to provide the reader with some basic concepts
from statistics needed throughout the main text. A good introductory text on
statistics for signal-processing applications is [59]. A discrete-valued random vari-
able x reaching values from a finite alphabet A with probabilities p(x), x ∈ A,
0 ≤ p(x) ≤ 1, Σ_x p(x) = 1, is described by its probability mass function (pmf)
p(x). The probability that x reaches a value in S ⊂ A is Pr{x ∈ S} = Σ_{x∈S} p(x).

Example A.1: [Fair die] x ∈ {1, 2, 3, 4, 5, 6}, p(i) = 1/6. The probability that x is
even is Pr{x ∈ {2, 4, 6}} = 3 × 1/6 = 1/2.

A continuous random variable accepting values in R is characterized by a prob-
ability density function (pdf) f(x), which is a non-negative and Lebesgue-
integrable function satisfying ∫_R f(x)dx = 1. For any Borel set B ⊂ R, Pr{x ∈
B} = ∫_B f(x)dx. Informally, a set that can be written as a countable union or
intersection of open (closed) intervals in R is called a Borel set. In particular,
Pr{x ∈ [a, b]} = ∫_a^b f(x)dx.
For a real-valued random variable, we define the cumulative distribution func-
tion (cdf)

F(x) = Pr{x ≤ x} = ∫_{−∞}^x f(t)dt. (A.1)

The cdf has the following properties.

r F′(x) = dF(x)/dx = f(x) by the fundamental theorem of calculus.
r F(x) is non-decreasing, F(x) ≤ F(y) for x ≤ y.
r F(x) is right-continuous, lim_{x→x0+} F(x) = F(x0).
r lim_{x→−∞} F(x) = 0.
r lim_{x→∞} F(x) = 1.
294 Appendix A. Statistics

The complementary cdf is defined as

1 − F(x) = Pr{x > x} = ∫_x^∞ f(t)dt. (A.2)

Example A.2: [Uniform distribution] A random variable uniformly dis-
tributed on the interval [a, b] is denoted U(a, b). Its pdf and cdf are

f(x) = 1/(b − a) for x ∈ [a, b], and f(x) = 0 otherwise, (A.3)

F(x) = 0 for x < a, F(x) = (x − a)/(b − a) for a ≤ x ≤ b, and F(x) = 1 for x > b. (A.4)

A.1.1 Measures of central tendency and spread


For a real-valued random variable x, we define its mean value (or expected value)
and variance as
ˆ
x = E[x] = xf (x)dx, (A.5)
R

Var[x] = E[(x − x)2 ] (A.6)


= E[x − 2xx + x ] = E[x ] − 2x + x = E[x ] − x
2 2 2 2 2 2 2
(A.7)
´
assuming the integrals R |x|k f (x)dx exist (k = 1 for the mean and k = 2 for the
 
variance). For a discrete random variable, E[x] = x xp(x), Var[x] = x (x −
x̄)2 p(x). The mean value is commonly denoted withsymbol μ, while the standard
deviation defined as the square root of variance, Var[x], is often denoted with
the Greek letter σ.
Note that there are probability distributions for which the mean or the vari-
ance does not exist. An example of a pdf with undefined mean is the Cauchy
distribution, which has “thick tails”
f(x) = 1/[π(1 + x²)]. (A.8)

This distribution arises as the pdf of the ratio of two Gaussian random variables
with zero mean. The integral ∫_{−∞}^∞ |x|/[π(1 + x²)]dx is divergent because

∫_{−∞}^∞ |x|/[π(1 + x²)]dx = (2/π) ∫_0^∞ x/(1 + x²)dx ≥ (2/π) ∫_1^∞ 1/(1 + x)dx (A.9)

and the last integral is divergent.


An example of a distribution for which the variance is undefined is Student’s
t-distribution (Section A.8) with the tail index ν ≤ 2.
The issue of undefined mean or variance is not a purely academic construct but
rather a real phenomenon. Thick-tail distributions are often encountered in
steganalysis and signal processing. For example, in this book the error of quantita-
tive steganalyzers (Section 10.3.2) is well modeled using Student's t-distribution.
For thick-tail distributions, the spread of the random variable may not be well
described using variance because the sample variance may not converge to a
finite number with increasing number of samples or may converge very slowly.
Robust alternatives to the mean and variance are the median and Median
Absolute Deviation (MAD). For a finite statistical sample, the median is obtained
by ordering the samples from the lowest to the largest value and choosing the
middle value. When the number of samples is even, one usually takes the mean of
the two middle values. The median is sometimes denoted as μ1/2 (x). It satisfies

Pr{x ≤ μ1/2(x)} ≥ 1/2 and Pr{x ≥ μ1/2(x)} ≥ 1/2. (A.10)

When more than one value μ1/2(x) satisfies this equation, we again take the mean
of all such values. For example, for the fair die from Example A.1, any value in
the interval (3, 4) satisfies (A.10). The median is thus defined as μ1/2 = (3 + 4)/2 =
3.5, which coincides with the mean. In general, the median of any probability
distribution that is symmetrical about its mean must be equal to the mean.
Moreover, for a random variable with continuous pdf,

∫_{−∞}^{μ1/2(x)} f(x)dx = 1/2. (A.11)

The MAD is defined as

MAD = μ1/2(|x − μ1/2(x)|). (A.12)

The sample MAD for a random variable that represents an error term is called
the Median Absolute Error (MAE).
Another common robust measure of spread is the Inter-Quartile Range (IQR).
For a random variable x, it is defined as the interval [q1 , q3 ], where qi is the ith
quartile determined from Pr{x < qi } = i/4, i = 1, . . . , 4.
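As a small stdlib sketch (an illustrative addition, not from the book), the sample median and MAD can be computed as follows; note how a single thick-tail outlier pulls the mean far away while barely moving the robust statistics:

```python
import statistics

data = [1.2, 0.9, 1.1, 1.0, 25.0]   # one thick-tail outlier

med = statistics.median(data)                        # robust measure of center
mad = statistics.median(abs(x - med) for x in data)  # robust measure of spread
# Mean is pulled toward the outlier; median and MAD are barely affected.
print(statistics.fmean(data), med, mad)
```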
The next proposition gives a formula for the expected value of a random
variable obtained as a transformation of another random variable x.

Proposition A.3. For any piecewise continuous function g,

E[g(x)] = ∫_R g(x)f(x)dx. (A.13)

This statement is more intuitive for a discrete variable, where the trans-
formed random variable g(x) reaches values g(x) with probabilities p(x) and
thus E[g(x)] = Σ_x g(x)p(x). To see this statement for the simpler case of a contin-
uous variable, consider g strictly monotonic and differentiable. Then, assuming
for example that g is increasing, we have for the cdf of y = g(x)

Fy(y) = Pr{y ≤ y} = Pr{g(x) ≤ y} = Pr{x ≤ g⁻¹(y)} = ∫_{−∞}^{g⁻¹(y)} f(x)dx. (A.14)

Thus, the pdf of y is obtained by differentiating its cdf,

fy(y) = F′y(y) = f(g⁻¹(y)) dg⁻¹(y)/dy. (A.15)

We can now calculate its mean value:

E[y] = ∫_{y1}^{y2} y f(g⁻¹(y)) dg⁻¹(y)/dy dy = {x = g⁻¹(y)} = ∫_R g(x)f(x)dx. (A.16)

In the derivation above, we denoted the image of R under g as (y1, y2) =
g(−∞, ∞).
A special case of Proposition A.3 states that, for any a, b ∈ R,

E[ax + b] = aE[x] + b (A.17)

due to linearity of the integral. In this special case, the pdf of y = ax + b for
a > 0 is

fy(y) = (1/a) fx((y − b)/a). (A.18)

A.1.2 Construction of PRNGs using compounding


Proposition A.3 can be conveniently used for constructing pseudo-random num-
ber generators (PRNGs) with an arbitrary pdf from a uniform random number
generator on (0, 1). In order to obtain a PRNG producing numbers according to
a cdf F(x), note that for u ∼ U(0, 1)

Pr{F⁻¹(u) ≤ x} = Pr{u ≤ F(x)} = F(x), (A.19)

which implies that the cdf of F⁻¹(u) is F(x).
As an example, let us imagine that we wish to construct a PRNG that gener-
ates numbers following the Cauchy distribution

f(x) = 1/[π(1 + x²)]. (A.20)

First, we obtain the cdf

F(x) = ∫_{−∞}^x dt/[π(1 + t²)] = 1/2 + (1/π) arctan(x). (A.21)

The inverse function to y = F(x) is x = tan(π(y − 1/2)). Finally, if u ∼ U(0, 1),
the variable tan(π(u − 1/2)) follows the Cauchy distribution.
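This construction is easy to sketch in code (an illustrative stdlib-only example, not from the book, using Python's random module as the U(0, 1) source):

```python
import math
import random

def cauchy_sample(rng=random):
    """Inverse-cdf sampling: if u ~ U(0,1), then F^{-1}(u) = tan(pi*(u - 1/2))
    follows the standard Cauchy distribution, by (A.21)."""
    u = rng.random()
    return math.tan(math.pi * (u - 0.5))

random.seed(1)
samples = sorted(cauchy_sample() for _ in range(100_000))

# The Cauchy mean does not exist, but the median should be near 0
# and the quartiles near tan(+-pi/4) = +-1.
n = len(samples)
print(samples[n // 2], samples[n // 4], samples[3 * n // 4])
```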

A.2 Moment-generating function

Statistical moments are often useful for description of the shape of a probability
distribution. For a positive integer k, the kth moment of x is defined as

μ[k] = E[x^k] (A.22)

and the kth central moment is

μc[k] = E[(x − x̄)^k]. (A.23)

The moments do not exist if the integral E[|x|^k] = ∫_R |x|^k f(x)dx does not con-
verge. The first moment is the mean and the second moment is μ[2] = E[x²] =
x̄² + Var[x] = μ² + σ². Of course, the first and second central moments are
μc[1] = 0 and μc[2] = Var[x].
Central moments normalized by the corresponding power of the standard de-
viation, μc[k]/σ^k, are called normalized moments. The third normalized central
moment, μc[3]/σ³, is called skewness and it measures the asymmetry of the dis-
tribution. The fourth normalized central moment, μc[4]/σ⁴, is called kurtosis
and is often used to evaluate the Gaussianity of a random variable (the kurtosis
of a standard Gaussian variable is 3).
Moments can be conveniently computed using the moment-generating function
defined as

Mx(t) = E[e^{tx}] = ∫_R e^{tx} f(x)dx. (A.24)

This is a well-defined transform because, if Mx(t) = My(t), then x = y in the
sense that they have the same distribution almost everywhere with respect to
the pdf f(x).
The moment-generating function can be used to calculate moments of random
variables in the following manner:

dMx(t)/dt = d/dt ∫_R e^{tx} f(x)dx = ∫_R x e^{tx} f(x)dx. (A.25)

By differentiating k times,

d^k Mx(t)/dt^k = ∫_R x^k e^{tx} f(x)dx. (A.26)

Thus,

d^k Mx(t)/dt^k |_{t=0} = ∫_R x^k f(x)dx = E[x^k] = μ[k]. (A.27)

Example A.4: [Gaussian moment-generating function] Let x be a Gaussian
random variable with mean μ and variance σ², x ∼ N(μ, σ²) (see the definition
in Section A.4),

Mx(t) = ∫_R e^{tx} (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx = (1/(√(2π)σ)) ∫_R e^{−(x²−2μx+μ²−2σ²tx)/(2σ²)} dx. (A.28)

Completing the exponent to the form (x − A)² + B, we obtain

x² − 2μx + μ² − 2σ²tx = x² − 2x(μ + tσ²) + (μ + tσ²)² + μ² − (μ + tσ²)² (A.29)
= (x − (μ + tσ²))² − 2μtσ² − t²σ⁴. (A.30)

Thus, we can write

Mx(t) = (1/(√(2π)σ)) ∫_R e^{−(x−(μ+tσ²))²/(2σ²)} e^{μt + t²σ²/2} dx = e^{μt + t²σ²/2}, (A.31)

which is the moment-generating function of a Gaussian variable N(μ, σ²).

From the moment-generating function, we can obtain the first four moments of
a Gaussian random variable simply by differentiating Mx(t) and evaluating the
derivatives at 0. Writing higher-order derivatives as roman numerals,

M′x(t) = (μ + tσ²)Mx(t) = {at t = 0} = μ, (A.32)
MxII(t) = (σ² + (μ + tσ²)²)Mx(t) = {at t = 0} = μ² + σ², (A.33)
MxIII(t) = (2σ²(μ + tσ²) + (σ² + (μ + tσ²)²)(μ + tσ²))Mx(t) (A.34)
= {at t = 0} = μ³ + 3μσ², (A.35)
MxIV(t) = {at t = 0} = 3σ⁴ + 6μ²σ² + μ⁴. (A.36)

The first four central moments (μ = 0) are 0, σ², 0, 3σ⁴. In general, for an even
pdf, all odd moments are 0.
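These closed forms are easy to sanity-check by Monte Carlo (an illustrative sketch, not from the book; the sample size and seed are arbitrary):

```python
import random

mu, sigma = 2.0, 3.0
random.seed(7)
n = 200_000
xs = [random.gauss(mu, sigma) for _ in range(n)]

# Sample estimates of the first three moments
m1 = sum(xs) / n
m2 = sum(x * x for x in xs) / n
m3 = sum(x ** 3 for x in xs) / n

# Compare with the closed forms derived from the mgf:
print(m1, mu)                          # mu
print(m2, mu**2 + sigma**2)            # mu^2 + sigma^2
print(m3, mu**3 + 3 * mu * sigma**2)   # mu^3 + 3*mu*sigma^2
```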

A.3 Jointly distributed random variables

Let x and y be two real-valued random variables. Their joint probability density
function f(x, y) ≥ 0 defines the probability that a joint event (x, y) will fall into
a Borel subset B ⊂ R²,

Pr{(x, y) ∈ B} = ∫_B f(x, y)dxdy. (A.37)

As with the one-dimensional counterpart, ∫_{R²} f(x, y)dxdy = 1.
The marginal probabilities for each random variable are

fx(x) = ∫_R f(x, y)dy,  fy(y) = ∫_R f(x, y)dx. (A.38)

We say that x and y are independent if

f (x, y) = fx (x)fy (y) for all x, y. (A.39)

These concepts generalize to more than two variables in a straightforward


manner.
An analogy of Proposition A.3 for jointly distributed random variables is the
following useful statement.

Proposition A.5. Consider random variables x1, . . . , xn with joint pdf
f(x1, . . . , xn). Then for any piecewise continuous function g : R^n → R,

E[g(x1, . . . , xn)] = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ g(x1, . . . , xn)f(x1, . . . , xn)dx1 . . . dxn. (A.40)

For any two random variables x and y, their covariance

Cov[x, y] = E[(x − x̄)(y − ȳ)] = E[xy] − x̄ȳ − ȳx̄ + x̄ȳ = E[xy] − x̄ȳ (A.41)

measures the extent of linear dependence between them. If Cov[x, y] = 0, we say
that the variables are uncorrelated. Thus, E[xy] = E[x]E[y] for two uncorrelated
random variables. Note that Cov[x, x] = Var[x].
If x, y are independent, then they are uncorrelated because, by (A.41) and
Proposition A.5,

E[xy] = ∫_{R²} xy f(x, y)dxdy = ∫_{R²} xy fx(x)fy(y)dxdy (A.42)
= ∫_R x fx(x)dx ∫_R y fy(y)dy = E[x]E[y]. (A.43)

Example A.7 shows that the converse is not generally true. Two uncorrelated
variables may be dependent.

Linear dependence between two random variables is evaluated using the cor-
relation coefficient

ρ(x, y) = Cov[x, y]/√(Var[x]Var[y]). (A.44)

The correlation coefficient is always between −1 and 1,

|ρ(x, y)| ≤ 1, (A.45)

which is the Cauchy–Schwarz inequality in the vector space of zero-mean ran-
dom variables with inner product defined as the covariance (see Section D.10).
Using Proposition A.5, we can write, for any random variables x1, . . . , xn and
a constant vector a,

E[Σ_{i=1}^n a[i]xi] = Σ_{i=1}^n a[i]E[xi]. (A.46)

Often, we will also need to know the variance of a sum of random variables
y = Σ_{i=1}^n a[i]xi. The variance of y is

Var[y] = E[y²] − (E[y])² = E[(Σ_{i=1}^n a[i]xi)²] − (Σ_{i=1}^n a[i]E[xi])² (A.47)
= E[Σ_{i=1}^n Σ_{j=1}^n a[i]a[j]xi xj] − Σ_{i=1}^n Σ_{j=1}^n a[i]a[j]E[xi]E[xj] (A.48)
= Σ_{i=1}^n Σ_{j=1}^n a[i]a[j](E[xi xj] − E[xi]E[xj]) (A.49)
= Σ_{i=1}^n Σ_{j=1}^n a[i]a[j]Cov[xi, xj] = aCa′, (A.50)

where C[i, j] = Cov[xi, xj] = E[xi xj] − E[xi]E[xj] is the covariance between ran-
dom variables xi and xj forming the covariance matrix C.
If the xi are pairwise uncorrelated, i.e., Cov[xi, xj] = 0 for all i ≠ j, then

Var[Σ_{i=1}^n a[i]xi] = Σ_{i=1}^n a[i]²Var[xi]. (A.51)

Often, it is necessary to determine the distribution of a sum of two random
variables. Let x and y be two real-valued random variables with joint pdf f(x, y).
Then, the cdf of their sum is

Fx+y(z) = Pr{x + y ≤ z} = ∫∫_{x+y≤z} f(x, y)dxdy (A.52)
= ∫_{−∞}^∞ dx ∫_{−∞}^{z−x} f(x, y)dy = {u = x + y} = ∫_{−∞}^∞ dx ∫_{−∞}^z f(x, u − x)du. (A.53)

By differentiating,

fx+y(z) = d/dz ∫_{−∞}^∞ dx ∫_{−∞}^z f(x, u − x)du = ∫_{−∞}^∞ f(x, z − x)dx. (A.54)

If x and y are independent, then f(x, y) = fx(x)fy(y) and we have

fx+y(z) = ∫_{−∞}^∞ fx(x)fy(z − x)dx = (fx ⋆ fy)(z). (A.55)

In other words, the pdf of the sum of two independent random variables is the
convolution of their pdfs.
We now show that the sum of two independent Gaussian variables is again
Gaussian, with mean equal to the sum of means and variance equal to the sum
of variances. To this end, we use the following lemma.

Lemma A.6. For two independent random variables x, y, the moment-
generating function of their sum is the product of the moment-generating
functions of x and y:

Mx+y(t) = E[e^{t(x+y)}] = ∫_{R²} e^{t(x+y)} f(x, y)dxdy (A.56)
= ∫_R e^{tx} fx(x)dx ∫_R e^{ty} fy(y)dy = Mx(t)My(t). (A.57)

For x ∼ N(μ1, σ1²) and y ∼ N(μ2, σ2²), we have

Mx+y(t) = e^{μ1 t + σ1²t²/2} e^{μ2 t + σ2²t²/2} = e^{(μ1+μ2)t + (σ1²+σ2²)t²/2}, (A.58)

which is the moment-generating function of the Gaussian random variable
N(μ1 + μ2, σ1² + σ2²). In general, a linear combination of independent Gaussian
variables is again a Gaussian random variable, with mean equal to the same
linear combination of means and variance equal to a linear combination of
variances, where the coefficients in the linear combination are squared.
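A quick simulation of this fact (an illustrative stdlib sketch; the particular means, variances, and coefficients are arbitrary):

```python
import random
import statistics

random.seed(3)
n = 100_000
a, b = 2.0, -1.0   # coefficients of the linear combination

# z = a*x + b*y with x ~ N(1, 2^2) and y ~ N(-3, 0.5^2)
z = [a * random.gauss(1.0, 2.0) + b * random.gauss(-3.0, 0.5) for _ in range(n)]

# Expected: mean a*mu1 + b*mu2 = 5, variance a^2*sig1^2 + b^2*sig2^2 = 16.25
print(statistics.fmean(z), statistics.pvariance(z))
```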

A.4 Gaussian random variable

The Gaussian distribution is, without any doubt, the most famous and most
important distribution for a signal-processing engineer. The pdf of a Gaussian
(also called normal) random variable N(μ, σ²) with mean μ and variance σ² is

f(x) = (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)}. (A.59)

By numerical evaluation, we obtain the following probabilities of outliers:

Pr{|x − μ| ≤ σ} ≈ 68.27%, (A.60)
Pr{|x − μ| ≤ 2σ} ≈ 95.45%, (A.61)
Pr{|x − μ| ≤ 3σ} ≈ 99.73%. (A.62)

The cdf for a standard Gaussian random variable x ∼ N(0, 1) will be denoted
Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt. We also define the complementary cumulative dis-
tribution function (the tail probability) as the probability that x reaches a value
larger than x,

Q(x) = Pr{x > x} = 1 − Φ(x) = ∫_x^∞ (1/√(2π)) e^{−t²/2} dt. (A.63)

This tail probability can be expressed using the error function

Erf(x) = (2/√π) ∫_0^x e^{−t²} dt, (A.64)

which is available in Matlab as erf.m. Here is the connection between Erf(x)
and Q(x):

1 − Q(x) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt = {substitution u = t/√2} = (1/√π) ∫_{−∞}^{x/√2} e^{−u²} du (A.65)
= (1/√π) (∫_{−∞}^0 e^{−u²} du + ∫_0^{x/√2} e^{−u²} du) = 1/2 + (1/2) Erf(x/√2). (A.66)

Thus,

Q(x) = (1/2)(1 − Erf(x/√2)). (A.67)

Note that for x ∼ N(μ, σ²),

Pr{x ≥ x} = ∫_x^∞ (1/(√(2π)σ)) e^{−(t−μ)²/(2σ²)} dt = {u = (t − μ)/σ} (A.68)
= ∫_{(x−μ)/σ}^∞ (1/√(2π)) e^{−u²/2} du = Q((x − μ)/σ). (A.69)

The tail probability Q(x) cannot be expressed in a closed form. However, for
large x there exists an approximate asymptotic form

Q(x) ≈ e^{−x²/2}/(√(2π)x), (A.70)

which should be understood in the sense that f(x) ≈ g(x) for x → ∞ if
lim_{x→∞} f(x)/g(x) = 1. This asymptotic expression can be easily obtained by
integrating by parts (∫uv′ = uv − ∫u′v),

Q(x) = ∫_x^∞ (t/(√(2π)t)) e^{−t²/2} dt = {u = 1/(√(2π)t), v′ = te^{−t²/2}} (A.71)
= [−e^{−t²/2}/(√(2π)t)]_x^∞ − ∫_x^∞ e^{−t²/2}/(√(2π)t²) dt = e^{−x²/2}/(√(2π)x) − R(x), (A.72)

where

R(x) = ∫_x^∞ e^{−t²/2}/(√(2π)t²) dt ≤ (1/x²) ∫_x^∞ e^{−t²/2}/√(2π) dt = Q(x)/x². (A.73)

Thus,

(e^{−x²/2}/(√(2π)x))/Q(x) = 1 + R(x)/Q(x) = 1 + O(1/x²), (A.74)

which proves (A.70).
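The quality of the approximation (A.70) is easy to check numerically (an illustrative sketch using math.erfc and, per (A.67), Q(x) = ½ Erfc(x/√2)):

```python
import math

def Q(x):
    """Gaussian tail probability via (A.67): Q(x) = (1/2) erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def Q_asym(x):
    """Asymptotic form (A.70)."""
    return math.exp(-x * x / 2.0) / (math.sqrt(2.0 * math.pi) * x)

# The ratio Q_asym/Q tends to 1 from above as x grows:
for x in (1.0, 3.0, 6.0):
    print(x, Q(x), Q_asym(x) / Q(x))
```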

A.5 Multivariate Gaussian distribution

A multivariate Gaussian distribution, also called jointly Gaussian, is obtained as
a linear transformation of independent standard normal variables. Let y1, . . . , yn
be mutually independent N(0, 1) random variables and let x1, . . . , xn be random
variables obtained as linear transformations of y1, . . . , yn,

xi = T[i, 1]y1 + T[i, 2]y2 + · · · + T[i, n]yn + μ[i] for each i = 1, . . . , n, (A.75)

or, in matrix notation,

x = Ty + μ, (A.76)

where y = (y1, . . . , yn)′ and μ = (μ[1], . . . , μ[n])′ is a vector of constants. We also
assume that the n × n matrix T is regular, so that T⁻¹ exists (we denote T⁻¹ =
D). The vector x = (x1, . . . , xn)′ follows the multivariate Gaussian distribution.
We now derive the pdf of this vector random variable.
Because the yi are independent, the joint pdf of y is the product of the pdfs of
each yi,

f(y) = ∏_{i=1}^n (1/√(2π)) e^{−y[i]²/2} = (1/√((2π)^n)) e^{−(y[1]²+···+y[n]²)/2} = (1/√((2π)^n)) e^{−y′y/2}, (A.77)

because y is a column vector.
We now derive the pdf for the transformed variables xi by substituting

y = T⁻¹(x − μ) = D(x − μ) (A.78)

into (A.77). First, note that y′y = (D(x − μ))′D(x − μ) = (x − μ)′D′D(x −
μ) = (x − μ)′C⁻¹(x − μ), where we denoted

C⁻¹ = D′D. (A.79)

We remark that C⁻¹ is a positive-definite matrix because, for any z ∈ R^n,
z′C⁻¹z = z′D′Dz = ‖Dz‖² ≥ 0 and equality occurs if and only if z = 0 because
D is regular. To obtain the full pdf expressed in the variables x[i], we need to multiply
the pdf by the Jacobian of the linear transform. This is because

Pr{y ∈ B ⊂ R^n} = ∫_B f(y)dy = {substitution y = h(x)} = ∫_B f(h(x)) |∂y/∂x| dx. (A.80)

Thus, the pdf expressed in terms of x is f(h(x)) |∂y/∂x|. From (A.78),
∂y[i]/∂x[j] = D[i, j] and we have for the Jacobian

|∂y/∂x| = |D| = √(|D′| · |D|) = √|C⁻¹| = 1/√|C|. (A.81)

The pdf of x is thus

f(x) = (1/√((2π)^n |C|)) e^{−(x−μ)′C⁻¹(x−μ)/2}. (A.82)
Notice that from (A.75) the mean values are E[xi] = μ[i] (or, in vector form, E[x] =
μ) and the covariances are

E[(xi − μ[i])(xj − μ[j])] = E[Σ_{k=1}^n D⁻¹[i, k]yk × Σ_{k=1}^n D⁻¹[j, k]yk] (A.83)
= E[Σ_{k=1}^n D⁻¹[i, k]D⁻¹[j, k]yk²] (A.84)
= Σ_{k=1}^n D⁻¹[i, k]D⁻¹[j, k] (A.85)
= D⁻¹(D⁻¹)′[i, j] = (D′D)⁻¹[i, j] = C[i, j]. (A.86)

In the derivations above, we used the fact that E[yi yj ] = δ(i − j), because yi are
iid, and (D )−1 = (D−1 ) for any regular matrix D. Thus, the matrix C is the
covariance matrix of variables x1 , . . . , xn .
Conversely, if a vector of random variables x follows the distribution (A.82) for
a symmetric positive-definite matrix C, it is jointly Gaussian because any sym-
metric positive-definite matrix C can be written as C−1 = D D for some regular
matrix D (this is called Choleski decomposition). By transforming x using the
linear transform y = D(x − μ), we can make all yi ∼ N (0, 1) and independent,
simply by reversing the steps above. This will also prove that E[x] = μ and
E[(xi − μ[i])(xj − μ[j])] = C[i, j], the covariance matrix.
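The construction x = Ty + μ with T obtained by Choleski decomposition is exactly how correlated Gaussian vectors are generated in practice. A minimal two-dimensional sketch (illustrative only; the target mean and covariance values are arbitrary):

```python
import math
import random

# Target mean and 2x2 covariance matrix C (symmetric positive-definite)
mu = (1.0, -2.0)
c11, c12, c22 = 4.0, 1.2, 1.0

# Choleski factor T (lower triangular) with T T' = C
t11 = math.sqrt(c11)
t21 = c12 / t11
t22 = math.sqrt(c22 - t21 * t21)

random.seed(5)
n = 200_000
xs = []
for _ in range(n):
    y1, y2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)   # iid N(0,1)
    xs.append((t11 * y1 + mu[0], t21 * y1 + t22 * y2 + mu[1]))  # x = T y + mu

m1 = sum(x for x, _ in xs) / n
m2 = sum(y for _, y in xs) / n
cov = sum((x - m1) * (y - m2) for x, y in xs) / n
print(m1, m2, cov)   # approximately 1, -2, and C[1,2] = 1.2
```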
Note that jointly Gaussian random variables that are uncorrelated must also be
independent. This is a rare case when uncorrelatedness does imply independence.
This is because, if xi and xj are uncorrelated for every i ≠ j, their covariance
matrix C = diag(σ[1]², . . . , σ[n]²) is diagonal, |C| = σ[1]² × · · · × σ[n]², and the
pdf (A.82) can be factorized,

f(x) = (1/√((2π)^n |C|)) e^{−(x−μ)′C⁻¹(x−μ)/2} = ∏_{i=1}^n (1/√(2πσ[i]²)) e^{−(x[i]−μ[i])²/(2σ[i]²)}, (A.87)

which means that the pdf is the product of its marginals, thus establishing their
independence.
Note that uncorrelated Gaussian variables that are not jointly Gaussian do not have
to be independent. Consider this simple example.

Example A.7: [Uncorrelated but dependent] Let ξ ∼ N (0, 1) and s be uni-


formly distributed on {−1, 1}, independent of ξ (s is a random variable that
equally likely attains values −1 and 1). Then, E[ξ] = E[sξ] = 0 and ξ and sξ are
two uncorrelated variables because E[ξ · sξ] = 0, and thus E[ξ · sξ] − E[ξ]E[sξ] =
0, because the pdf of sξ² is symmetrical about 0 (the symmetry of s is responsi-
ble for this).
knowledge of ξ determines sξ up to its sign. In fact, the marginal distributions of
each variable are standard Gaussian N (0, 1) (a Gaussian N (0, 1) with random-
ized sign is again N (0, 1)). Thus, if ξ and sξ were independent, their joint pdf
would have to be a product of two Gaussian pdfs, which is clearly not the case
because the joint pdf f(x, y) = 0 whenever |x| ≠ |y| (because we always have
|ξ| = |sξ|).

A.6 Asymptotic laws

There exist many important asymptotic results in statistics when a sequence of
random variables converges in some sense to another random variable, giving us
the option in practice to replace a variable with a rather complicated distribution
with a much simpler distribution.
The weak law of large numbers states that the sample mean of a sequence of
iid random variables x1, . . . , xn with finite mean μ converges in probability to
the mean,

x̄n = (1/n)(x1 + · · · + xn) →P μ, (A.88)

in the sense that, for any ε > 0,

Pr{|x̄n − μ| > ε} → 0 (A.89)

as n → ∞.
The Central Limit Theorem (CLT) is one of the most fundamental theorems in
probability. It states that, given a sequence of iid random variables x1, . . . , xn with
finite mean value μ and variance σ², the distribution of the average (x1 + · · · +
xn)/n approaches a Gaussian with mean μ and variance σ²/n independently of
the distribution of the xi.
More precisely, denoting the partial sum sn = x1 + · · · + xn, the following vari-
able converges in distribution to N(0, 1):

(sn − nμ)/(σ√n) →D N(0, 1). (A.90)

We say that a sequence zn converges in distribution to y if lim_{n→∞} Fzn(x) = Fy(x)
for all x where the cdf Fy is continuous.
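A small simulation of the CLT (an illustrative sketch with arbitrary sample sizes), using uniform U(0, 1) summands with μ = 1/2 and σ² = 1/12:

```python
import math
import random

random.seed(11)
n, trials = 200, 10_000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)

def standardized_sum():
    """The variable (s_n - n*mu) / (sigma*sqrt(n)) from (A.90)."""
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

zs = [standardized_sum() for _ in range(trials)]

# Empirical Pr{z <= 1} should be close to Phi(1) of the standard normal:
frac = sum(z <= 1.0 for z in zs) / trials
phi1 = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))
print(frac, phi1)
```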

A.7 Bernoulli and binomial distributions

In this section, we introduce two discrete distributions that appear in the book.
A Bernoulli random variable B(p) is a random variable x with range {0, 1}
with

Pr{x = 1} = p, (A.91)
Pr{x = 0} = 1 − p. (A.92)

The mean and variance of a Bernoulli random variable are

E[x] = (1 − p) × 0 + p × 1 = p, (A.93)
Var[x] = (1 − p) × (0 − p)² + p × (1 − p)² = p(1 − p). (A.94)

Assume we have an experiment with two possible outcomes “yes” or “no,” where
“yes” occurs with probability p and “no” with probability 1 − p. If the experiment
is repeated n times, the number of occurrences of "yes" is a random variable y with range y ∈ {0, 1, \ldots, n} that follows the binomial distribution Bi(n, p):

\Pr\{y = k\} = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n, \qquad (A.95)

E[y] = np, \qquad (A.96)

Var[y] = np(1 - p). \qquad (A.97)
Given n independent Bernoulli random variables B(p), x_1, \ldots, x_n, we have \sum_{i=1}^{n} x_i \sim Bi(n, p). Thus, by the CLT, Bi(n, p) converges in distribution to a Gaussian in the sense defined in Section A.6.
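The pmf (A.95) and the moments (A.96)–(A.97) are easy to verify directly; a minimal sketch (the choice n = 20, p = 0.3 is arbitrary):

```python
import math

def binom_pmf(n, p, k):
    # Pr{y = k} = C(n,k) p^k (1-p)^(n-k), equation (A.95)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.3
pmf = [binom_pmf(n, p, k) for k in range(n + 1)]

total = sum(pmf)                                           # sums to 1
mean = sum(k * pmf[k] for k in range(n + 1))               # E[y] = n p = 6
var = sum((k - mean) ** 2 * pmf[k] for k in range(n + 1))  # Var[y] = n p (1-p) = 4.2
```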

A.8 Generalized Gaussian, generalized Cauchy, Student's t-distributions

The generalized Gaussian distribution

f(x; \alpha, \beta, \mu) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}\, e^{-\left|\frac{x-\mu}{\alpha}\right|^{\beta}} \qquad (A.98)

is a reasonably good model of DCT coefficients in a JPEG file and in general a good fit for any high-pass-filtered natural image. The model depends on three parameters – the mean μ, the shape parameter β > 0, and the width parameter α > 0.
The influence of the parameter β on the distribution shape is shown in Fig-
ure A.1. For β > 1, the distribution is continuously differentiable at zero, while for
0 < β ≤ 1, the distribution has a “spike” at zero (it is not differentiable there).
The smaller β, the spikier the distribution looks. For β = 1, the distribution
is called Laplacian and for β = 2, we obtain the Gaussian distribution. Values
of β > 2 lead to distributions with increasingly flatter maximum at zero. For
β → ∞, the distribution converges to a uniform distribution on (−α, α).
A simple method for fitting the generalized Gaussian model through sample
data x[i] is the method of moments [172]. It works by first calculating the sample
mean and the first two absolute central sample moments,

\hat{\mu}_c[1] = \frac{1}{n}\sum_{i=1}^{n} |x[i] - \hat{\mu}|, \qquad (A.99)

\hat{\mu}_c[2] = \frac{1}{n}\sum_{i=1}^{n} |x[i] - \hat{\mu}|^2, \qquad (A.100)

where

\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x[i].

Figure A.1 Examples of the generalized Gaussian distribution for different values of the shape parameter β (β = 0.7, 1, 2, 4) and α = 1, μ = 0.

Estimates of the generalized Gaussian parameters are then obtained as

\hat{\beta} = G^{-1}\!\left(\frac{\hat{\mu}_c[1]^2}{\hat{\mu}_c[2]}\right), \qquad (A.101)

\hat{\alpha} = \hat{\mu}_c[1]\,\frac{\Gamma(1/\hat{\beta})}{\Gamma(2/\hat{\beta})}, \qquad (A.102)

where

G(x) = \frac{\Gamma^2(2/x)}{\Gamma(1/x)\,\Gamma(3/x)}, \qquad (A.103)

and Γ(x) is the Gamma function

\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt. \qquad (A.104)

Note that the method fails to estimate the parameters if \hat{\mu}_c[1]^2/\hat{\mu}_c[2] > 3/4 because the range of G is (0, 3/4). The Gamma function is implemented in Matlab as gamma.m. The inverse function G^{-1}(x) must be implemented numerically (e.g., through a binary search).
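A possible plain-Python sketch of this moment estimator follows; the bisection bracket [0.05, 20] for β and the Laplacian test data (β = 1, α = 1) are arbitrary choices for illustration:

```python
import math
import random

def G(x):
    # G(x) = Gamma(2/x)^2 / (Gamma(1/x) Gamma(3/x)); increasing, range (0, 3/4)
    return math.gamma(2.0 / x) ** 2 / (math.gamma(1.0 / x) * math.gamma(3.0 / x))

def G_inverse(y, lo=0.05, hi=20.0, iters=100):
    # bisection; valid only for y inside (G(lo), G(hi)), a subset of (0, 3/4)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if G(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_ggd(x):
    # method-of-moments estimates (A.99)-(A.102)
    n = len(x)
    mu = sum(x) / n
    m1 = sum(abs(v - mu) for v in x) / n        # first absolute central moment
    m2 = sum((v - mu) ** 2 for v in x) / n      # second central moment
    beta = G_inverse(m1 * m1 / m2)
    alpha = m1 * math.gamma(1.0 / beta) / math.gamma(2.0 / beta)
    return alpha, beta, mu

# sanity check on Laplacian data, which is GGD with beta = 1, alpha = 1
random.seed(0)
lap = [(1 if random.random() < 0.5 else -1) * random.expovariate(1.0)
       for _ in range(50_000)]
alpha_hat, beta_hat, mu_hat = fit_ggd(lap)
```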
Another common model for the distribution of DCT coefficients is the Cauchy
distribution, which arises from the ratio of two normally distributed random
variables. Its generalized version has the following pdf:
f(x; p, s, \mu) = \frac{p-1}{2s}\left(1 + \frac{|x-\mu|}{s}\right)^{-p} \qquad (A.105)

Figure A.2 Generalized Cauchy distribution for different values of the shape parameter p (p = 1.2, 1.5, 2.0) and s = 1, μ = 0.

with three parameters – the mean μ, the width parameter s > 0, and the shape
parameter p > 1. The generalized Cauchy distribution has thick tails. It becomes
spikier with larger p and flatter with smaller p (see Figure A.2).
Student’s t-distribution finds applications in modeling the output from quan-
titative steganalyzers (see Chapter 10). It arises in the following problem. Let
x_1, \ldots, x_n be iid Gaussian variables N(μ, σ²) and let

\bar{x}_n = \frac{x_1 + \cdots + x_n}{n}, \qquad \hat{\sigma}_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2

be the sample mean and sample variance. While the normalized mean is Gaussian distributed if μ and σ² are known,

\frac{\bar{x}_n - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0, 1), \qquad (A.106)

replacing the standard deviation σ with its sample form leads to a more complicated random variable,

\frac{\bar{x}_n - \mu}{\hat{\sigma}_n/\sqrt{n}}, \qquad (A.107)

which follows Student's t-distribution with ν = n − 1 degrees of freedom:

f(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{(x-\mu)^2}{\nu}\right)^{-\frac{\nu+1}{2}}. \qquad (A.108)
The distribution is symmetric about its mean, which is μ for ν > 1 and is un-
defined otherwise. The parameter ν is called the tail index. When ν = 1, the
distribution is Cauchy. As ν → ∞, the distribution approaches the Gaussian
distribution. The variance is ν/(ν − 2) for ν > 2 and undefined otherwise. In
general, only moments strictly smaller than ν are defined. The tail probability
of Student's distribution satisfies 1 − F(x) ≈ x^{−ν}, where F(x) is the cumulative distribution function. Maximum-likelihood estimators (see Section D.7 and [229]) can be used
in practice to fit Student’s distribution to data.

Figure A.3 Chi-square distribution for ν = 2, 3, 4, 5, 10 degrees of freedom.

A.9 Chi-square distribution

The chi-square distribution arises in many engineering applications. In this book, we will encounter it in Section 5.1.1 dealing with the histogram attack. There, it is used in Pearson's chi-square test to determine whether a discrete random variable follows a known distribution. A chi-square-distributed random variable with ν degrees of freedom arises as a sum of squares of ν standard normal Gaussian variables:

x = \sum_{i=1}^{\nu} \xi_i^2, \qquad \xi_i \sim \mathcal{N}(0, 1). \qquad (A.109)

The probability distribution of x is denoted χ²_ν and its pdf is

f(x) = \frac{e^{-x/2}\, x^{\nu/2-1}}{2^{\nu/2}\,\Gamma(\nu/2)} \quad \text{for } x \ge 0, \qquad (A.110)

and f(x) = 0 otherwise. The mean and variance of x are

E[x] = \nu, \qquad (A.111)

Var[x] = 2\nu. \qquad (A.112)
Examples of chi-square probability distributions for several values of ν are
shown in Figure A.3.
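The pdf (A.110) can be checked numerically against the moments (A.111)–(A.112) by crude midpoint integration; the integration range and step count below are arbitrary choices:

```python
import math

def chi2_pdf(x, nu):
    # f(x) = e^{-x/2} x^{nu/2-1} / (2^{nu/2} Gamma(nu/2)) for x >= 0, eq. (A.110)
    if x <= 0:
        return 0.0
    return math.exp(-x / 2) * x ** (nu / 2 - 1) / (2 ** (nu / 2) * math.gamma(nu / 2))

def moments(nu, upper=200.0, steps=400_000):
    # midpoint-rule integration of f, x f, and x^2 f over [0, upper]
    h = upper / steps
    total = mean = second = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        f = chi2_pdf(x, nu) * h
        total += f
        mean += x * f
        second += x * x * f
    return total, mean, second - mean ** 2

total, mean, var = moments(4)   # expect total = 1, mean = nu = 4, var = 2 nu = 8
```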

A.10 Log–log empirical cdf plot

For a given scalar random variable x, the log–log empirical cdf plot shows the tail probability Pr{x > x} (the complementary cdf) as a function of x on log–log axes. A separate plot is usually made for the right and left tails when the

Figure A.4 The log–log empirical cdf plot for a Gaussian, x, and a t-distributed random variable, s, and the corresponding Gaussian fits (thin line).

distribution is non-symmetrical. The purpose of the plot is to analyze the tails


of the distribution, which are important for error estimation in signal detection
(Appendix D) and quantitative steganalysis (Section 10.3.2).
Figure A.4 shows the log–log empirical cdf plot obtained from 10,000 samples
of a Gaussian random variable N (0, 1) and for the same number of samples of a
t-distributed random variable with 3 degrees of freedom. The thin line is the plot
of the corresponding Gaussian fit. Note that the t-distribution has thicker tails
(the curve lies above the Gaussian fit in the log–log plot). Also, observe that the plot for the t-distributed variable approaches a straight line for large x because the tail probability Pr{x > x} = 1 − F(x) ≈ x^{−ν}, which produces a straight line in the log–log plot.
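The empirical tail probability behind such a plot is simple to compute. The sketch below (sample size and evaluation points are arbitrary choices) generates t-distributed samples with ν = 3 as N(0, 1)/√(χ²₃/3) and checks that the tail decays far more slowly than a Gaussian, with a log–log slope approaching −ν:

```python
import math
import random

random.seed(7)

def tail_prob(samples, t):
    # empirical Pr{x > t}
    return sum(1 for s in samples if s > t) / len(samples)

def t3():
    # Student's t with 3 degrees of freedom: N(0,1) / sqrt(chi2_3 / 3)
    z = random.gauss(0.0, 1.0)
    w = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return z / math.sqrt(w / 3.0)

samples = [t3() for _ in range(100_000)]
p2 = tail_prob(samples, 2.0)   # approx 0.070 for t with nu = 3
p8 = tail_prob(samples, 8.0)   # approx 0.002; a Gaussian tail here is ~1e-15

# slope of log Pr{x > t} versus log t; tends to -nu = -3 as t grows
slope = (math.log(p8) - math.log(p2)) / (math.log(8.0) - math.log(2.0))
```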
Cambridge Books Online
http://ebooks.cambridge.org/

Steganography in Digital Media

Principles, Algorithms, and Applications


Jessica Fridrich
Book DOI:

Online ISBN: 9781139192903


Hardback ISBN: 9780521190190

Chapter
B - Information theory pp. 313-324

Chapter DOI:
Cambridge University Press
B Information theory

The purpose of this text is to introduce selected basic concepts of information


theory necessary to understand the material in this book. An excellent textbook
on information theory is [50].

B.1 Entropy, conditional entropy, mutual information

Let x be a discrete random variable attaining values in alphabet A with probability mass function p(x) = Pr{x = x}. The entropy of x is defined as the expected value of − log p(x),

H(x) = -\sum_{x\in\mathcal{A}} p(x)\log p(x). \qquad (B.1)

If the base of the log is 2, we measure H in bits; if the base is the Euler number,
e = 2.71828..., we speak of “nats.” Since 0 ≤ p(x) ≤ 1, − log p(x) ≥ 0, and thus
H(x) ≥ 0. Note that entropy does not depend on the values x, only on their
probabilities p(x).
Entropy measures the uncertainty of the outcome of x. It is the average number
of bits communicated by one realization of x. Also, H(x) is the minimum average
number of bits needed to describe x.
In Chapter 6, we will encounter the less common notion of minimal entropy, defined as

H_{\min}(x) = \min_{x\in\mathcal{A}} \left(-\log p(x)\right). \qquad (B.2)

The equivalent of entropy for real-valued random variables with probability density function f(x) is the differential entropy

h(x) = -\int_{\mathbb{R}} f(x)\log f(x)\, dx. \qquad (B.3)

The lower-case letter is used to stress the fact that, although entropy and dif-
ferential entropy share many properties, they also behave very differently. For
example, h(x) can be negative because we can certainly have f (x) > 1 on its
support (h(u) = − log 2 for u ∼ U(0, 1/2)). Moreover, h may not be preserved under a one-to-one mapping (cf. Proposition B.3).
314 Appendix B. Information theory

Example B.1: [Entropy of Bernoulli random variable B(p)]

H(p) = −p log p − (1 − p) log(1 − p).


This function is called the binary entropy function and is displayed in Figure 8.2.
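Both entropy (B.1) and the minimal entropy (B.2) translate directly into code; a small sketch (the example pmfs are arbitrary):

```python
import math

def entropy(pmf):
    # H(x) = -sum p(x) log2 p(x), with the convention 0 log 0 = 0
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def min_entropy(pmf):
    # H_min(x) = min over x of -log2 p(x) = -log2 of the largest probability
    return -math.log2(max(pmf))

H_fair = entropy([0.5, 0.5])                  # 1 bit: a fair coin
H_det = entropy([1.0, 0.0])                   # 0 bits: no uncertainty
H_skew = entropy([0.5, 0.25, 0.125, 0.125])   # 1.75 bits
```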

Example B.2: [Differential entropy of Gaussian variable] For ξ ∼ N(μ, σ²),

h(\xi) = \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \left[\frac{1}{2}\log 2\pi\sigma^2 + \frac{(x-\mu)^2}{2\sigma^2}\right] dx \qquad (B.4)

= \frac{1}{2}\log 2\pi\sigma^2 \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}\, dy + \int_{\mathbb{R}} \frac{y^2}{2}\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}\, dy \qquad (B.5)

= \frac{1}{2}\left(1 + \log 2\pi\sigma^2\right), \qquad (B.6)

where we substituted y = (x − μ)/σ.
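The closed form (B.6) can be cross-checked by integrating −f log f numerically, in nats (the integration range and step count are arbitrary choices):

```python
import math

def gauss_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def diff_entropy(mu, sigma2, half_width=12.0, steps=200_000):
    # midpoint-rule integration of -f(x) log f(x) over mu +/- half_width * sigma
    sigma = math.sqrt(sigma2)
    a = mu - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        f = gauss_pdf(x, mu, sigma2)
        total -= f * math.log(f) * h
    return total

h_num = diff_entropy(0.0, 2.0)
h_closed = 0.5 * (1 + math.log(2 * math.pi * 2.0))   # equation (B.6)
```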

Proposition B.3. [Processing cannot increase entropy] Let f be a map


from A to B, f : A → B. If f is injective,1 then the entropy of the random vari-
able f (x) on B is H(f (x)) = H(x). In general, H(f (x)) ≤ H(x).

Proof. If f is injective, then f is one-to-one from A to f (A) ⊂ B and the equality


is clear because we are just “renaming” the elements. To prove the inequality,
take any b ∈ f (A) and denote f −1 (b) = {x ∈ A|f (x) = b}, the set of all elements
from A that are mapped to b under f . Note that when f is injective, f −1 (b)
contains only one element.
Let us denote the random variable f(x) by x′ and its pmf on B by p′. The probability of observing b for x′ is the probability of observing any x ∈ f^{-1}(b), or p'(b) = \sum_{x\in f^{-1}(b)} p(x). We will show that

-p'(b)\log p'(b) \le -\sum_{x\in f^{-1}(b)} p(x)\log p(x). \qquad (B.7)

The inequality H(f(x)) ≤ H(x) will then be proved by summing (B.7) over all b ∈ f(A), because

H(f(x)) = -\sum_{b\in f(\mathcal{A})} p'(b)\log p'(b) \le \sum_{b\in f(\mathcal{A})} \sum_{x\in f^{-1}(b)} -p(x)\log p(x) \qquad (B.8)

= -\sum_{x\in\mathcal{A}} p(x)\log p(x) = H(x). \qquad (B.9)

1 f is injective if x ≠ y ⇒ f(x) ≠ f(y).



Because p(x)/p′(b) is itself a pmf on f^{-1}(b) (it sums to 1), from the non-negativity of entropy,

0 \le -\sum_{x\in f^{-1}(b)} \frac{p(x)}{p'(b)}\log\frac{p(x)}{p'(b)} = \frac{1}{p'(b)}\sum_{x\in f^{-1}(b)} -p(x)\log\frac{p(x)}{p'(b)} \qquad (B.10)

= \frac{1}{p'(b)}\left\{\sum_{x\in f^{-1}(b)} -p(x)\log p(x) + \sum_{x\in f^{-1}(b)} p(x)\log p'(b)\right\} \qquad (B.11)

= \frac{1}{p'(b)}\left\{\sum_{x\in f^{-1}(b)} -p(x)\log p(x) + p'(b)\log p'(b)\right\}, \qquad (B.12)

which proves (B.7) and the whole proposition.

The proposition simply states that by transforming a random variable, one can-
not make its output more uncertain. This is intuitively clear because either f
maps elements of A to different elements, which does not change entropy, or
it maps several elements to one (not a one-to-one map), which should decrease
uncertainty.
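A tiny numeric illustration of Proposition B.3 (the pmf and the two maps are arbitrary choices):

```python
import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def push_forward(p, f):
    # pmf of f(x): p'(b) is the sum of p(x) over the preimage f^{-1}(b)
    q = {}
    for x, px in p.items():
        q[f(x)] = q.get(f(x), 0.0) + px
    return q

p = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
H_x = entropy(p)
H_injective = entropy(push_forward(p, lambda x: x + 10))  # relabeling: equal
H_collapsed = entropy(push_forward(p, lambda x: x // 2))  # many-to-one: smaller
```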
Entropy is the measure of uncertainty. The conditional entropy of x is the
uncertainty of x given the outcome of another random variable y and it can be
expressed using the conditional probability p(x|y = y):

H(x|y = y) = -\sum_{x\in\mathcal{A}_x} p(x|y)\log p(x|y). \qquad (B.13)

Since, potentially, the alphabets for x and y may be different, we use the subscript to denote this fact, x ∈ A_x, y ∈ A_y. Using (B.13), we define the conditional entropy

H(x|y) = -\sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(y)\, p(x|y)\log p(x|y) \qquad (B.14)

= -\sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(x,y)\log p(x|y). \qquad (B.15)

Here, p(x, y) = Pr{x = x, y = y} is the joint probability.


The joint entropy of two random variables is defined as the entropy of the
vector random variable (x, y):
 
H(x,y) = -\sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(x,y)\log p(x,y) \qquad (B.16)

= -\sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(x,y)\log p(x|y) - \sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(x,y)\log p(y) \qquad (B.17)

= H(x|y) - \sum_{y\in\mathcal{A}_y} p(y)\log p(y) = H(x|y) + H(y). \qquad (B.18)

Thus, the joint entropy is the sum of the entropy of y and the conditional
entropy H(x|y),
H(x, y) = H(x|y) + H(y). (B.19)
The mutual information I(x; y) measures how much information about x is
conveyed by y,
I(x; y) = H(x) − H(x|y). (B.20)
If H(x|y) = H(x), or I(x; y) = 0, the uncertainty of x is unaffected by y and thus
knowledge of y does not give us any information about x. On the other hand, if
H(x|y) = 0, x is completely determined by y and thus it delivers all information
about x. In other words, the mutual information I(x; y) = H(x).
Mutual information is symmetrical,
I(x; y) = I(y; x), (B.21)
because, expressing H(x|y) using (B.19),
I(x; y) = H(x) − H(x|y) = H(x) + H(y) − H(x, y), (B.22)
which is obviously symmetrical in both random variables.
There exists a fundamental relationship between mutual information and KL
divergence (introduced in the next section):
  
I(x;y) = -\sum_{x\in\mathcal{A}_x} p(x)\log p(x) + \sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(x,y)\log p(x|y) \qquad (B.23)

= -\sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(x,y)\log p(x) + \sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(x,y)\log p(x|y) \qquad (B.24)

= \sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(x,y)\log\frac{p(x|y)}{p(x)} = \sum_{x\in\mathcal{A}_x}\sum_{y\in\mathcal{A}_y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \qquad (B.25)

= D_{\mathrm{KL}}\left(p(x,y)\,\|\,p(x)p(y)\right). \qquad (B.26)


From the non-negativity of KL divergence (Proposition B.5), we have I(x; y) ≥ 0
and I(x; y) = 0 if and only if p(x, y) = p(x)p(y) for all x, y, which means that x
and y are independent.
Moreover, the non-negativity of mutual information in connection with (B.20)
implies
H(x) ≥ H(x|y), (B.27)
which is known as the fact that conditioning reduces entropy.
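Both expressions for mutual information — I(x;y) = H(x) + H(y) − H(x,y) from (B.22) and the KL form (B.26) — can be checked on a small joint pmf (the dependent pair below is an arbitrary choice):

```python
import math

def H(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

# joint pmf of a dependent pair (x, y) on {0,1} x {0,1}
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {a: sum(v for (i, j), v in pxy.items() if i == a) for a in (0, 1)}
py = {b: sum(v for (i, j), v in pxy.items() if j == b) for b in (0, 1)}

# I(x;y) = H(x) + H(y) - H(x,y), equation (B.22)
I1 = H(px.values()) + H(py.values()) - H(pxy.values())
# I(x;y) = KL(p(x,y) || p(x)p(y)), equation (B.26)
I2 = sum(v * math.log2(v / (px[a] * py[b])) for (a, b), v in pxy.items())
```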

B.2 Kullback–Leibler divergence

The Kullback–Leibler divergence is also called the KL distance or relative en-


tropy. This fundamental concept is defined for two probability mass functions p

and q on A as

D_{\mathrm{KL}}(p\|q) = \sum_{x\in\mathcal{A}} p(x)\log\frac{p(x)}{q(x)}. \qquad (B.28)

For real-valued random variables with densities p(x) and q(x),

D_{\mathrm{KL}}(p\|q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx. \qquad (B.29)

The convention is that 0 log 0 = 0 because \lim_{x\to 0} x\log x = 0.
One can view the KL divergence as a measure of difference between two prob-
ability mass functions p, q. Only when p = q is the KL divergence equal to zero.
The more different p and q are, the larger their KL divergence is.

Example B.4: [KL divergence on a set of two elements] For A = {0, 1} the
KL divergence (B.28) becomes

D_{\mathrm{KL}}(p\|q) = p(0)\log\frac{p(0)}{q(0)} + (1 - p(0))\log\frac{1 - p(0)}{1 - q(0)}. \qquad (B.30)

Proposition B.5. [Non-negativity of KL divergence] For any two distri-


butions, p and q on A,

DKL (p||q) ≥ 0 (B.31)

and equality holds if and only if p(x) = q(x) for all x ∈ A.

Proof. For the proof, it will be convenient to use the natural logarithm in the
definition of the KL divergence. We will use the log inequality

log t ≤ t − 1 (B.32)

which holds for all t > 0 with equality if and only if t = 1. This can be seen by
noting that log t is a concave function (because (log t)'' = −t^{−2} < 0 for t > 0)
and y = t − 1 is its tangent at t = 1.
Substituting t = q(x)/p(x) into (B.32), we obtain

\log\frac{q(x)}{p(x)} \le \frac{q(x)}{p(x)} - 1. \qquad (B.33)

After multiplying by −p(x) (do not forget to flip the inequality sign because −p(x) ≤ 0),

p(x)\log\frac{p(x)}{q(x)} \ge p(x) - q(x). \qquad (B.34)

Summing over all x ∈ A, we obtain the required inequality

D_{\mathrm{KL}}(p\|q) = \sum_{x\in\mathcal{A}} p(x)\log\frac{p(x)}{q(x)} \ge \sum_{x\in\mathcal{A}} p(x) - \sum_{x\in\mathcal{A}} q(x) = 1 - 1 = 0. \qquad (B.35)

The equality can hold if and only if we had equality for each x, or q(x)/p(x) = 1
for all x, which means that the distributions are identical.

The KL divergence is not a metric because it is in general not symmetrical, D_{\mathrm{KL}}(p\|q) \ne D_{\mathrm{KL}}(q\|p). It does not satisfy the triangle inequality D_{\mathrm{KL}}(p\|q) + D_{\mathrm{KL}}(q\|r) \ge D_{\mathrm{KL}}(p\|r) either. It is, nevertheless, useful to think of it as a distance. To get a better feeling for it, consider the following special case.
Choose q = u, where u(x) = 1/|A| is the uniform distribution on A. Then

D_{\mathrm{KL}}(p\|u) = \sum_{x\in\mathcal{A}} p(x)\log\frac{p(x)}{u(x)} = -H(x) + \sum_{x\in\mathcal{A}} p(x)\log|\mathcal{A}|, \qquad (B.36)

or

D_{\mathrm{KL}}(p\|u) = \log|\mathcal{A}| - H(x). \qquad (B.37)

From here, we can draw some interesting conclusions. First, because the KL
divergence is always non-negative, we see that we must always have H(x) ≤
log |A|, the entropy is bounded by the logarithm of the number of elements in A.
The difference between this maximal value and the entropy H(x) is just the KL
divergence between the pmf of x and the uniformly distributed random variable
(which always has the maximal entropy, as can be easily verified).
The KL divergence comes up very frequently in information theory and it can
be argued that it is a more fundamental concept than the entropy itself. For
example, let us assume that we wish to compress a random variable x with pmf
p using Huffman code but we know only an estimate of p that we denote p̂. If
we construct a Huffman code based on our imprecise pmf p̂, we will on average
need DKL (p||p̂) more bits for compression of x due to the fact that we know its
pmf only approximately. The KL divergence has also a fundamental relationship
to hypothesis testing (see Appendix D). Moreover, in Section B.1 we showed
that the mutual information I(x; y) between two random variables x and y can
be written as the KL divergence between their joint probability mass function
p(x, y) and the product of their marginals p(x)p(y) (this would be the joint pmf if x and y were independent): I(x; y) = D_{\mathrm{KL}}(p(x,y)\|p(x)p(y)).
In order to understand the information-theoretic definition of steganographic
security, we will need the following proposition, which is an equivalent of Propo-
sition B.3 for KL divergence.

Proposition B.6. [Processing cannot increase KL divergence] Let x and


y be two random variables on A with pmfs p and q. Let f : A → B be a map from
A to some set B. Denoting the pmfs of the transformed random variables f (x)

and f (y) with p and q  , their KL divergence cannot increase,


DKL (p||q) ≥ DKL (p ||q  ). (B.38)

Proof. We will need the following log-sum inequality, which holds for any non-negative r_1, \ldots, r_k and positive s_1, \ldots, s_k:

\sum_{i=1}^{k} r_i\log\frac{r_i}{s_i} \ge \left(\sum_{i=1}^{k} r_i\right)\log\frac{\sum_{j=1}^{k} r_j}{\sum_{j=1}^{k} s_j}, \qquad (B.39)

which can be proved again using the log inequality (B.32), in which we substitute

t = \frac{s_i}{r_i}\cdot\frac{\sum_j r_j}{\sum_j s_j}: \qquad (B.40)

\log\frac{s_i}{r_i} + \log\frac{\sum_j r_j}{\sum_j s_j} \le \frac{s_i}{r_i}\cdot\frac{\sum_j r_j}{\sum_j s_j} - 1. \qquad (B.41)
We now multiply (B.41) by r_i and sum over i,

\sum_{i=1}^{k} r_i\log\frac{s_i}{r_i} + \left(\sum_{i=1}^{k} r_i\right)\log\frac{\sum_j r_j}{\sum_j s_j} \le \left(\sum_{i=1}^{k} s_i\right)\frac{\sum_j r_j}{\sum_j s_j} - \sum_{i=1}^{k} r_i = 0, \qquad (B.42)

which is the log-sum inequality.


To prove the proposition, we again take an arbitrary b ∈ f(A) and note that p'(b) = \sum_{x\in f^{-1}(b)} p(x) and q'(b) = \sum_{x\in f^{-1}(b)} q(x), which correspond to the probabilities of observing b for f(x) and f(y), respectively. Let x_1, \ldots, x_k be all the elements of f^{-1}(b). We now use the log-sum inequality for r_i = p(x_i) and s_i = q(x_i), noting that \sum_i r_i = p'(b) and \sum_i s_i = q'(b), to obtain

\sum_{x\in f^{-1}(b)} p(x)\log\frac{p(x)}{q(x)} = \sum_{i=1}^{k} p(x_i)\log\frac{p(x_i)}{q(x_i)} \ge \sum_{i=1}^{k} p(x_i)\log\frac{\sum_{j=1}^{k} p(x_j)}{\sum_{j=1}^{k} q(x_j)} \qquad (B.43)

= p'(b)\log\frac{p'(b)}{q'(b)}. \qquad (B.44)
Taking similar steps as in the proof of Proposition B.3, we obtain

D_{\mathrm{KL}}(p\|q) = \sum_{x\in\mathcal{A}} p(x)\log\frac{p(x)}{q(x)} = \sum_{b\in f(\mathcal{A})}\sum_{x\in f^{-1}(b)} p(x)\log\frac{p(x)}{q(x)} \qquad (B.45)

\ge \sum_{b\in f(\mathcal{A})} p'(b)\log\frac{p'(b)}{q'(b)} = D_{\mathrm{KL}}(p'\|q'). \qquad (B.46)

Proposition B.7. [KL divergence is locally quadratic] Let p(x; β) be a


family of probability mass functions parametrized by a scalar parameter β. Then

D_{\mathrm{KL}}\left(p(x;0)\,\|\,p(x;\beta)\right) = \frac{\beta^2}{2}\, I(0) + O(\beta^3), \qquad (B.47)

where I(β) is the Fisher information

I(\beta) = \sum_{x} \frac{1}{p(x;\beta)}\left(\frac{\partial p}{\partial\beta}(x;\beta)\right)^2 = \sum_{x} p(x;\beta)\left(\frac{\partial\log p}{\partial\beta}(x;\beta)\right)^2. \qquad (B.48)

Proof. We expand log p(x;β) using a Taylor series in the parameter:

\log p(x;\beta) = \log\left[p(x;0) + \beta\frac{\partial p}{\partial\beta}(x;0) + \frac{\beta^2}{2}\frac{\partial^2 p}{\partial\beta^2}(x;0) + O(\beta^3)\right] \qquad (B.49)

= \log\left\{p(x;0)\left[1 + \frac{\beta}{p(x;0)}\frac{\partial p}{\partial\beta}(x;0) + \frac{\beta^2}{2p(x;0)}\frac{\partial^2 p}{\partial\beta^2}(x;0) + O(\beta^3)\right]\right\} \qquad (B.50)

= \log p(x;0) + \frac{\beta}{p(x;0)}\frac{\partial p}{\partial\beta}(x;0) + \frac{\beta^2}{2p(x;0)}\frac{\partial^2 p}{\partial\beta^2}(x;0) - \frac{\beta^2}{2}\left(\frac{1}{p(x;0)}\frac{\partial p}{\partial\beta}(x;0)\right)^2 + O(\beta^3), \qquad (B.51)

which will hold if p(x;β) is sufficiently smooth in β for all x. Thus,

p(x;0)\left(\log p(x;0) - \log p(x;\beta)\right) = -\beta\frac{\partial p}{\partial\beta}(x;0) - \frac{\beta^2}{2}\frac{\partial^2 p}{\partial\beta^2}(x;0) + \frac{\beta^2}{2p(x;0)}\left(\frac{\partial p}{\partial\beta}(x;0)\right)^2 + O(\beta^3). \qquad (B.52)
We can now write for the KL divergence

D_{\mathrm{KL}}\left(p(x;0)\,\|\,p(x;\beta)\right) = \sum_{x} p(x;0)\left(\log p(x;0) - \log p(x;\beta)\right) \qquad (B.53)

= \frac{\beta^2}{2}\sum_{x}\frac{1}{p(x;0)}\left(\frac{\partial p}{\partial\beta}(x;0)\right)^2 + O(\beta^3) \qquad (B.54)

because

\sum_{x}\frac{\partial p}{\partial\beta}(x;0) = \frac{\partial}{\partial\beta}\sum_{x} p(x;0) = \frac{\partial}{\partial\beta}\, 1 = 0, \qquad (B.55)

\sum_{x}\frac{\partial^2 p}{\partial\beta^2}(x;0) = \frac{\partial^2}{\partial\beta^2}\sum_{x} p(x;0) = \frac{\partial^2}{\partial\beta^2}\, 1 = 0. \qquad (B.56)
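Proposition B.7 can be sanity-checked on the Bernoulli family p(x;β) = B(1/2 + β), for which the Fisher information at β = 0 is I(0) = 1/(p(1−p)) = 4 (working in nats; the test values of β are arbitrary):

```python
import math

def kl_bernoulli(p, q):
    # KL divergence (in nats) between B(p) and B(q), cf. (B.30)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p0 = 0.5
I0 = 1.0 / (p0 * (1 - p0))   # Fisher information of the Bernoulli family at p0

# ratio of the exact KL to the quadratic approximation (beta^2 / 2) I(0);
# it should approach 1 as beta -> 0
ratios = [kl_bernoulli(p0, p0 + b) / (b * b / 2 * I0) for b in (0.01, 0.001)]
```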

Proposition B.8. [Additivity of KL divergence for independent variables] Let x_1, \ldots, x_n and y_1, \ldots, y_n be independent random variables with distributions p_1(x), \ldots, p_n(x) and q_1(x), \ldots, q_n(x), respectively. Considering the vector random variables x = (x_1, \ldots, x_n) and y = (y_1, \ldots, y_n) with distributions p_x and q_y,

D_{\mathrm{KL}}(p_{\mathbf{x}}\|q_{\mathbf{y}}) = \sum_{i=1}^{n} D_{\mathrm{KL}}(p_i\|q_i). \qquad (B.57)

Proof. From the definition of the KL divergence,

D_{\mathrm{KL}}(p_{\mathbf{x}}\|q_{\mathbf{y}}) = \sum_{x_1,\ldots,x_n} p_1(x_1)\cdots p_n(x_n)\,\log\prod_{i=1}^{n}\frac{p_i(x_i)}{q_i(x_i)} \qquad (B.58)

= \sum_{i=1}^{n}\sum_{x_1,\ldots,x_n} p_1(x_1)\cdots p_n(x_n)\,\log\frac{p_i(x_i)}{q_i(x_i)} \qquad (B.59)

= \sum_{i=1}^{n}\sum_{x_i} p_i(x_i)\log\frac{p_i(x_i)}{q_i(x_i)} = \sum_{i=1}^{n} D_{\mathrm{KL}}(p_i\|q_i), \qquad (B.60)

because on the second line we can sum over all x_j, j ≠ i, and use the fact that \sum_{x_j} p_j(x_j) = 1.
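A direct numeric check of (B.57) for n = 2 (the marginal distributions are arbitrary choices):

```python
import math
from itertools import product

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p1, q1 = [0.7, 0.3], [0.5, 0.5]
p2, q2 = [0.2, 0.8], [0.6, 0.4]

# joint pmfs of the independent pairs are products of the marginals
px = [a * b for a, b in product(p1, p2)]
qy = [a * b for a, b in product(q1, q2)]

lhs = kl(px, qy)               # KL between the joint distributions
rhs = kl(p1, q1) + kl(p2, q2)  # sum of coordinate-wise KL divergences
```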

B.3 Lossless compression

A lossless compression scheme, C, is a mapping that assigns a bit string ci con-


sisting of l(ai ) bits (a codeword) to each symbol ai ∈ A. Given a sequence of
independent realizations of x, a_{i1}, a_{i2}, \ldots, the compression maps it into a concatenation of bit strings c_{i1} & c_{i2} & \ldots. Thus, on average each symbol is encoded using l(C) = \sum_{j=1}^{|\mathcal{A}|} p(a_j)\, l(a_j) bits. The goal of lossless compression is to compress a sequence of n independent realizations of x into a bit string that is as
short as possible. The best possible (perfect) lossless compression scheme will
do this task with nH(x) bits. A practical way to achieve this is to divide the
sequence of symbols into groups of k symbols and work with these blocks rather
than individual symbols. The vector random variable y consisting of k independent realizations of x attains values from A × ··· × A (k times), which is an alphabet of |A|^k k-ary symbols, with probabilities \Pr\{y = (a_{i1}, \ldots, a_{ik})\} = \prod_{j=1}^{k} p(a_{ij}). It is
possible to design lossless compression schemes Ck that with increasing k asymp-
totically reach the best possible performance and compress n symbols from A
using approximately nH(x) bits. For example, Huffman codes can be shown to
be asymptotically optimal in this sense.
The codeword lengths in an asymptotically perfect lossless compression scheme
must satisfy

l(aj ) ≈ − log p(aj ). (B.61)

This is because the average length of a codeword is

l(C) = \sum_{j} p(a_j)\, l(a_j) = -\sum_{j} p(a_j)\log p(a_j) = H(x), \qquad (B.62)

and this is the minimal average codeword length possible.



B.3.1 Prefix-free compression scheme


We say that a compression scheme is prefix-free if no codeword is a prefix of any other codeword. A compression scheme with codewords {0, 10, 110, 111} is prefix-free, while {0, 1, 10, 11} is not because '1' is a prefix of both '10' and '11'. There exist asymptotically perfect prefix-free compression schemes (e.g., the Huffman codes).
Every perfect prefix-free scheme C has the following interesting property.
Given any sequence of bits b[i], i = 1, 2, . . ., there must exist i1 ≥ 1 such that
the bit string (b[1], . . . , b[i1 ]) is a codeword. To see this, let k be the length of
the longest codeword. If (b[1], . . . , b[i1 ]) is a codeword for some i1 ≤ k, we are
done. In the other case, we obtain a contradiction in the following manner. We
know that (b[1], . . . , b[k − 1]) is not a codeword and we show that it is not a pre-
fix of any codeword either. The only codeword with this prefix would have to be
(b[1], . . . , b[k − 1], 1 − b[k]), in which case we could remove the last bit from the
codeword and shorten the average code length. Therefore, (b[1], . . . , b[k − 1]) is
not a prefix of any codeword and we can shorten the code by replacing one of
the codewords with length k with (b[1], . . . , b[k − 1]).
Thus, any bit sequence b[i] can be divided into disjoint bit strings of codewords
b = b[1], . . . , b[i1 ], b[i1 + 1], . . . , b[i2 ], . . .. If the bit sequence is random, each
codeword, c_j, will appear with probability p(a_j). This is because the probability that the first codeword occurring in the bit sequence is c_j is the same as the probability of generating c_j randomly (because the compression scheme is prefix-free). Because the length of c_j is l(a_j), for a perfect compression scheme, this probability is 2^{-l(a_j)} = 2^{\log_2 p(a_j)} = p(a_j).

Example B.9: [Biased tetrahedron] The tetrahedron t has four sides, A = {1, 2, 3, 4}. The probabilities that the tetrahedron falls on side 1, 2, 3, or 4 when thrown are p(1) = 1/2, p(2) = 1/4, p(3) = 1/8, p(4) = 1/8. The entropy is H(t) = −(1/2) log₂(1/2) − (1/4) log₂(1/4) − 2 × (1/8) log₂(1/8) = 1 + 3/4. In this special case, there exists a simple perfect lossless encoding scheme that encodes each toss with H(t) bits. Assign the codewords in the following manner:

1 → '0' (B.63)
2 → '10' (B.64)
3 → '110' (B.65)
4 → '111'. (B.66)

It is clear that this compression scheme is prefix-free. Assume we keep tossing the tetrahedron while registering the results. We know that we will need at least n × H(t) bits to describe n tosses. The average number of bits needed to describe the realization of one toss with our encoding is (1/2) × 1 + (1/4) × 2 + (1/8) × 3 + (1/8) × 3 = 1.75. This is also the best encoding that we can have because H(t) = 1.75. Note that in this special case the codeword lengths do satisfy l(a_j) = −log₂ p(a_j) for each j.
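The tetrahedron code can be exercised in a few lines (the sample message is an arbitrary choice):

```python
import math

code = {1: "0", 2: "10", 3: "110", 4: "111"}   # codewords (B.63)-(B.66)
p = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}

# codeword lengths equal -log2 p, so the average length equals H(t) = 1.75
avg_len = sum(p[s] * len(code[s]) for s in code)
H = -sum(q * math.log2(q) for q in p.values())

def decode(bits):
    # greedy left-to-right parsing works because the code is prefix-free
    inv = {c: s for s, c in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return out

msg = [1, 3, 2, 4, 1]
encoded = "".join(code[s] for s in msg)
decoded = decode(encoded)
```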
C Linear codes

This appendix contains the basics of linear covering codes needed to explain the
material in Chapter 8 on matrix embedding and Chapter 9 on non-shared selec-
tion channels. Coding theory is the appropriate mathematical discipline to formu-
late and solve problems associated with minimal-embedding-impact steganogra-
phy introduced in Chapter 5. An excellent text on finite fields and coding theory
is [248].
We first start with the concept of a finite field and then introduce linear codes
while focusing on selected material relevant to the topics covered in this book.

C.1 Finite fields

Many steganographic schemes work by first mapping the numerical values of


pixels/coefficients onto a finite field where embedding tasks can be formulated
and solved in terms of linear codes. This enables us to import powerful tools from
a well-developed discipline and in the end obtain more secure steganographic
schemes.
A field is an alphabet A with two operations, addition '+' and multiplication '·', that are both associative, commutative, and distributive. Also, in a field there must exist a zero element 0, a + 0 = a, ∀a ∈ A, and a unit element 1, a · 1 = a, ∀a ∈ A. Moreover, all elements must have an additive inverse: ∀a ∈ A, ∃b ∈ A, a + b = 0; and all non-zero elements must have a multiplicative inverse: ∀a ∈ A, a ≠ 0, ∃b ∈ A, a · b = 1.

Example C.1: The alphabet A2 = {0, 1} equipped with modulo 2 arithmetic is a


finite field. The operations of addition and multiplication satisfy 0 + 1 = 1 + 0 =
1, 0 + 0 = 1 + 1 = 0, and 1 · 0 = 0 · 1 = 0 · 0 = 0, 1 · 1 = 1. The zero element is 0;
the additive and multiplicative inverse of 1 is 1.

Example C.2: For q a prime number, the alphabet Aq = {0, 1, 2, . . . , q − 1} with


arithmetic modulo q is a finite field. The required properties of addition and

multiplication are obviously satisfied. The existence of a multiplicative inverse follows from Fermat's little theorem (a^{q−1} = 1 mod q for every positive integer a not divisible by q).

When q is not prime, Aq with modulo q arithmetic is not a field because the
factors of q would not have a multiplicative inverse. For example, for q = 4, there
is no multiplicative inverse for 2.
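A brute-force search makes the contrast between prime and composite moduli concrete (illustrating why q = 5 works while q = 4 fails):

```python
def inverses_mod(q):
    # brute-force search for multiplicative inverses in Z_q
    inv = {}
    for a in range(1, q):
        for b in range(1, q):
            if (a * b) % q == 1:
                inv[a] = b
                break
    return inv

inv5 = inverses_mod(5)   # q = 5 prime: every non-zero element is invertible
inv4 = inverses_mod(4)   # q = 4 composite: 2 has no inverse
```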
A finite field with q elements exists if and only if q = p^m for some positive integer m and prime p. The field is formed by polynomials of degree at most m − 1, with arithmetic modulo an irreducible polynomial of degree m.
All finite fields with q elements are isomorphic and are called Galois fields F_q. By isomorphism, we mean a one-to-one mapping, Ψ : F_q ↔ G_q, that preserves all operations, e.g., ∀a, b ∈ F_q, Ψ(ab) = Ψ(a)Ψ(b) and Ψ(a + b) = Ψ(a) + Ψ(b).

C.2 Linear codes

A q-ary code C of length n is any subset of the Cartesian product F_q^n = F_q × ··· × F_q (n times) and its elements are called codewords. The code C is linear if F_q is a finite field and C is a vector subspace of F_q^n. A vector subspace is a subset closed with respect to addition and multiplication by an element from the finite field: C is a vector subspace if ∀x, y ∈ C and ∀a ∈ F_q, x + y ∈ C and ax ∈ C. We say that C is closed under linear combinations of its elements.
The Hamming distance d_H(x, y) between x, y ∈ C is defined as the number of elements in which x and y differ. The Hamming weight w(x) of x is the number of non-zero elements in x, i.e., w(x) = d_H(x, 0).
Because a linear code is a vector subspace of F_q^n, it has a basis consisting of k ≤ n linearly independent vectors. Let us write the basis vectors as rows of a k × n matrix G. All codewords can thus be obtained as linear combinations of rows of G. We call G the generator matrix of the code and k the code dimension. We also say that C is an [n, k] code. Note that there are exactly q^k codewords in such a code.
The ball of radius r with center at x is the set of all y ∈ F_q^n whose distances from x are less than or equal to r,

B(\mathbf{x}, r) = \{\mathbf{y}\in F_q^n \,|\, d_H(\mathbf{x},\mathbf{y}) \le r\}. \qquad (C.1)

The ball volume is the cardinality of B(x, r),

V_q(r, n) \triangleq |B(\mathbf{x}, r)| = 1 + \binom{n}{1}(q-1) + \binom{n}{2}(q-1)^2 + \cdots + \binom{n}{r}(q-1)^r, \qquad (C.2)

because there are \binom{n}{i} possible places for i changes in n symbols and each change can attain q − 1 different values.

The minimal distance of a code is determined by the two closest codewords. For linear codes, it is also the smallest Hamming weight among all non-zero codewords,

d = \min_{\mathbf{x},\mathbf{y}\in\mathcal{C},\,\mathbf{x}\ne\mathbf{y}} d_H(\mathbf{x},\mathbf{y}) = \min_{\mathbf{x},\mathbf{y}\in\mathcal{C},\,\mathbf{x}\ne\mathbf{y}} w(\mathbf{x}-\mathbf{y}) = \min_{\mathbf{c}\in\mathcal{C},\,\mathbf{c}\ne 0} w(\mathbf{c}). \qquad (C.3)

The distance to code is the distance to the closest codeword,

d_H(\mathbf{x},\mathcal{C}) = \min_{\mathbf{c}\in\mathcal{C}} d_H(\mathbf{x},\mathbf{c}). \qquad (C.4)

The average distance to code is the expected value of d_H(x, C) over randomly uniformly distributed x ∈ F_q^n,

R_a = \frac{1}{q^n}\sum_{\mathbf{x}\in F_q^n} d_H(\mathbf{x},\mathcal{C}). \qquad (C.5)

The covering radius of a code is determined by the most distant word from the code,

R = \max_{\mathbf{x}\in F_q^n} d_H(\mathbf{x},\mathcal{C}). \qquad (C.6)

Note that

R \ge R_a, \qquad (C.7)

\bigcup_{\mathbf{x}\in\mathcal{C}} B(\mathbf{x}, R) = F_q^n. \qquad (C.8)

Example C.3: Consider a binary code given by the generator matrix

G = \begin{pmatrix} 1 & 0 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 \end{pmatrix}. \qquad (C.9)

The code length is n = 5 and its dimension is k = 3 (the number of rows in G). The codewords are elements of F_2^5 (they are five-tuples of bits). We know that there must be 2³ = 8 codewords. Three of them already appear as rows of G. The fourth one is the all-zero codeword, (0, 0, 0, 0, 0), which is always an element of every linear code. The remaining four codewords are obtained by adding the rows of G: (1, 1, 0, 1, 0) is the sum of the first two rows, (1, 0, 1, 0, 1) is the sum of the last two rows, and (0, 1, 1, 1, 1) is the sum of the first and third rows. The last codeword is the sum of all three rows, (0, 0, 0, 1, 0). Thus, a complete list of

all codewords is

C = {00000, 10111, 01101, 11000, 11010, 10101, 01111, 00010}. \qquad (C.10)
This code has many other generator matrices formed by any triple of linearly independent codewords. The minimal distance of C is 1 because one of the codewords has Hamming weight 1. The covering radius of C is R = 1 because no word in F_2^5 is farther away from C than 1 (the reader is encouraged to verify this statement by listing the distance to C for all 32 words in F_2^5). In Section C.2.4, we will learn a better and faster method for how to determine the covering radius for such small codes. Because the distance to C is 0 for all 8 codewords and 1 for the remaining 32 − 8 = 24 words, the average distance to code is R_a = 24/32 = 3/4.
The ball of radius 1 centered at codeword c = (1, 0, 1, 1, 1) is the set of six words

B(c, 1) = \{(1,0,1,1,1), (0,0,1,1,1), (1,1,1,1,1), (1,0,0,1,1), (1,0,1,0,1), (1,0,1,1,0)\}. (C.11)
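All quantities in Example C.3 are small enough to verify by brute force. The following sketch (illustrative Python, not part of the book) enumerates the codewords generated by (C.9) and recomputes d, R, R_a, and the ball (C.11):

```python
from itertools import product

# Rows of the generator matrix G from (C.9).
G = [(1, 0, 1, 1, 1), (0, 1, 1, 0, 1), (1, 1, 0, 0, 0)]

def add(x, y):
    # component-wise addition in F_2
    return tuple((a + b) % 2 for a, b in zip(x, y))

# All 2^3 = 8 codewords: linear combinations of the rows of G.
C = set()
for coeffs in product((0, 1), repeat=3):
    c = (0,) * 5
    for a, row in zip(coeffs, G):
        if a:
            c = add(c, row)
    C.add(c)

def dist_to_code(x):
    # Hamming distance to the code, equation (C.4)
    return min(sum(add(x, c)) for c in C)

words = list(product((0, 1), repeat=5))
d = min(sum(c) for c in C if any(c))            # minimal distance (C.3)
R = max(dist_to_code(x) for x in words)         # covering radius (C.6)
Ra = sum(dist_to_code(x) for x in words) / 32   # average distance (C.5)
ball = [x for x in words if sum(add(x, (1, 0, 1, 1, 1))) <= 1]  # ball (C.11)
```

Running this confirms d = 1, R = 1, R_a = 3/4, and the six-word ball.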

C.2.1 Isomorphism of codes


We say that a code C is isomorphic to D if there exists a one-to-one map Ψ : C ↔
D that preserves the distance between codewords, dH (c1 , c2 ) = dH (Ψ(c1 ), Ψ(c2 ))
for all c1 , c2 ∈ C. In other words, two isomorphic codes are geometrically identical
in the sense that they are rotations or mirror images of each other.
Two generator matrices correspond to two isomorphic codes if we can trans-
form the generator matrix of one code to the other using the following operations:
• Swap two rows (linear combinations of rows stay the same)
• Add a multiple of a row to another row (linear combinations of rows stay the same)
• Multiply a row by a non-zero scalar (linear combinations of rows stay the same)
• Multiply a column by a non-zero scalar (the distance between every pair of codewords does not change)
• Swap two columns (symmetry)

Using these operations, every linear code can be mapped to an isomorphic code with generator matrix G = [I_k ; A], where I_k is the k × k identity matrix and A is a k × (n − k) matrix. This form of the generator matrix is called the systematic form.
The reader is encouraged to think about why adding one column to another may not lead to an isomorphic code (come up with an example where adding two columns changes the minimal distance).

C.2.2 Orthogonality and dual codes


We define the dot product between two words as

x · y = x[1]y[1] + · · · + x[n]y[n], (C.12)

with all operations carried out in the corresponding finite field. Note that when x ∈ F_2^n contains an even number of ones, x · x = 0, which is true for the standard dot product in Euclidean space only when x = 0.
The orthogonal complement to C is the set of all vectors orthogonal to every
codeword:

C ⊥ = {x|x · c = 0, ∀c ∈ C}. (C.13)

We can find C ⊥ by solving the system of linear equations Gx = 0. From linear


algebra, the set of all solutions of this equation is a vector subspace of dimension
n − k. Thus, C ⊥ is an [n, n − k] code called the dual code of C. The dimension
of C ⊥ is the codimension of C. Note that the dual of the dual is again C.
To find the generator matrix of C^⊥, we assume that G = [I_k, A] is in the systematic form. The basis of the dual code is formed by solutions to the equations

0 = Gx = [I_k, A]x = \begin{pmatrix} x[1] + A[1, .](x[k+1], . . . , x[n])^T \\ \vdots \\ x[k] + A[k, .](x[k+1], . . . , x[n])^T \end{pmatrix}, (C.14)

where we wrote A[j, .] for the jth row of A. The equation above implies

x[1] = −A[1, .](x[k+1], . . . , x[n])^T, . . . , x[k] = −A[k, .](x[k+1], . . . , x[n])^T. (C.15)

We can choose x[k + 1], . . . , x[n] arbitrarily and always find x[1], . . . , x[k] so that all k equations hold. Choose

(x[k+1], . . . , x[n])^T ∈ \{(1, 0, . . . , 0)^T, (0, 1, . . . , 0)^T, . . . , (0, 0, . . . , 1)^T\}. (C.16)
By writing the solutions as rows of a matrix, we obtain the generator matrix, H, of the dual code

H = \begin{pmatrix} A[1,1] & A[2,1] & \cdots & A[k,1] & 1 & 0 & \cdots & 0 \\ A[1,2] & A[2,2] & \cdots & A[k,2] & 0 & 1 & \cdots & 0 \\ \vdots & & & \vdots & & & \ddots & \\ A[1,n−k] & A[2,n−k] & \cdots & A[k,n−k] & 0 & 0 & \cdots & 1 \end{pmatrix} = [A^T, I_{n−k}], (C.17)
which is called the parity-check matrix of C. The parity-check matrix is an equivalent description of the code; G is k × n, H is (n − k) × n. The codewords are defined implicitly through H because the rows of H are orthogonal to the rows of G: Hc = 0 for all codewords c ∈ C.
The parity-check matrix can be used to find the minimal distance of a code (at least for small codes). The minimal distance d is the smallest number of columns in H whose sum is 0 because Hc = 0 can be written as a linear combination of columns of H: Hc = c[1]H[., 1] + · · · + c[n]H[., n] = 0. (Recall that d = \min_{c \in C, c \neq 0} w(c).)

Example C.4: Consider the code from Example C.3. We will first find the systematic form of the generator matrix and then the parity-check matrix. Using the operations that lead to isomorphic codes, we can write

G = \begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 1&1&0&0&0 \end{pmatrix} ∼ \begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&1&1&1 \end{pmatrix} ∼ \begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&0&0&1&0 \end{pmatrix} (C.18)

∼ \begin{pmatrix} 1&0&1&1&1 \\ 0&1&0&1&1 \\ 0&0&1&0&0 \end{pmatrix} ∼ \begin{pmatrix} 1&0&0&1&1 \\ 0&1&0&1&1 \\ 0&0&1&0&0 \end{pmatrix} = [I_3 ; A]. (C.19)
The first operation involved adding the first row to the third row to manufacture
zeros in the first column (besides the first one). In the second operation, we
added the second row to the third one to obtain zeros in the second column.
Then, we swapped the third and fourth columns to obtain an upper-diagonal
matrix. The last operation turned the matrix to the desired systematic form by
adding the third row to the first row.
Thus, the parity-check matrix is obtained as

H = [A^T ; I_{5−3}] = \begin{pmatrix} 1&1&0&1&0 \\ 1&1&0&0&1 \end{pmatrix}. (C.20)

The reader is encouraged to verify that the rows of H are orthogonal to all rows
of G in systematic form.
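The reduction to systematic form and the construction of H can also be scripted. The sketch below (illustrative; the helper name `systematic_form` is my own) reproduces (C.19) and (C.20) by Gaussian elimination over F_2 with column swaps:

```python
def systematic_form(G):
    """Reduce a binary generator matrix to [I_k | A] using row operations
    and, when necessary, column swaps (which yield an isomorphic code)."""
    G = [row[:] for row in G]
    k, n = len(G), len(G[0])
    for j in range(k):
        # find a pivot for position (j, j), swapping columns if needed
        pr = next((r for r in range(j, k) if G[r][j]), None)
        if pr is None:
            pc = next(c for c in range(j + 1, n)
                      if any(G[r][c] for r in range(j, k)))
            for row in G:                     # swap columns j and pc
                row[j], row[pc] = row[pc], row[j]
            pr = next(r for r in range(j, k) if G[r][j])
        G[j], G[pr] = G[pr], G[j]
        for r in range(k):                    # clear column j elsewhere
            if r != j and G[r][j]:
                G[r] = [(a + b) % 2 for a, b in zip(G[r], G[j])]
    return G

G = [[1, 0, 1, 1, 1], [0, 1, 1, 0, 1], [1, 1, 0, 0, 0]]
Gs = systematic_form(G)
k, n = 3, 5
A = [row[k:] for row in Gs]                   # the k x (n-k) block A
# H = [A^T | I_{n-k}], equation (C.17)
H = [[A[r][c] for r in range(k)] + [1 if i == c else 0 for i in range(n - k)]
     for c in range(n - k)]
# every row of H must be orthogonal to every row of Gs
ok = all(sum(h[i] * g[i] for i in range(n)) % 2 == 0 for h in H for g in Gs)
```

For the code of Example C.3 this yields exactly the matrices (C.19) and (C.20).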

C.2.3 Perfect codes


Proposition C.5. [The sphere-covering bound] For any [n, k] code

V_q(R, n) ≥ q^{n−k}. (C.21)

Proof. Because the union of the balls with radius equal to R, the covering radius, is the whole space, we must have \bigcup_{x \in C} B(x, R) = F_q^n. Now, if all balls were disjoint, we would have |\bigcup_{x \in C} B(x, R)| = |F_q^n| = q^n = V_q(R, n) × q^k, which is equality in the sphere-covering bound, because all balls have the same volume V_q(R, n) and there are q^k of them. If the balls are not disjoint, we count some words more than once and thus obtain the inequality q^n = |\bigcup_{x \in C} B(x, R)| ≤ V_q(R, n) × q^k.

An [n, k] code is called perfect if the minimal distance d is odd and the following equality holds:

V_q((d − 1)/2, n) = q^{n−k}. (C.22)

In other words, the balls with radius ⌊(d − 1)/2⌋ = (d − 1)/2 centered at codewords cover the whole space without overlap.

Proposition C.6. [Perfect codes] The only perfect codes are

• The repetition code [n, 1] with R = (n − 1)/2, for n odd,
• Hamming codes,
• Binary and ternary Golay codes.


The q-ary Hamming code [(q^p − 1)/(q − 1), (q^p − 1)/(q − 1) − p] is a linear code of length n = (q^p − 1)/(q − 1) and codimension p. The code parity-check matrix has dimensions p × (q^p − 1)/(q − 1) and contains as its columns all different non-zero p-tuples of symbols from F_q up to multiplication by a scalar. The length is n = (q^p − 1)/(q − 1) because there are q^p − 1 non-zero p-tuples of q-ary symbols and each such tuple appears in q − 1 versions (there are q − 1 non-zero multiples).
Because we need to add at least three columns of H to obtain a zero vector, e.g., (1, 0, 0, . . . , 0)^T, (0, 1, 0, . . . , 0)^T, and (1, 1, 0, . . . , 0)^T, the minimal distance is d = 3. The Hamming code is perfect because

V_q(1, n) = 1 + n(q − 1) = q^p. (C.23)

This means that balls of radius 1 centered at codewords cover the whole space without overlap. This also implies that the covering radius R = 1. Because there are q^p words in every ball, out of which only one is a codeword, the average distance to code is R_a = (q^p − 1)/q^p = 1 − q^{−p}.
Description of the Golay codes and the proof of Proposition C.6 can be found, e.g., in [248].
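For the binary case (q = 2), the construction is easy to check numerically: the parity-check matrix of the [2^p − 1, 2^p − 1 − p] Hamming code simply collects all non-zero p-bit columns, and the perfect-code identity (C.23) can be verified directly. A sketch (illustrative, p = 3 gives the [7, 4] code):

```python
from itertools import product

p = 3                                    # binary Hamming code [7, 4]
n = 2**p - 1
# Columns of H: all distinct non-zero p-tuples of bits.
cols = [c for c in product((0, 1), repeat=p) if any(c)]

# d = 3: all columns are distinct (so no two sum to zero), but e.g.
# (1,0,0) + (0,1,0) = (1,1,0) is itself a column -- three dependent columns.
three_sum = tuple((a + b) % 2 for a, b in zip((1, 0, 0), (0, 1, 0)))

V = 1 + n * (2 - 1)                      # ball volume V_2(1, n), see (C.23)
perfect = (V == 2**p)
```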

C.2.4 Cosets of linear codes


For any x ∈ F_q^n, we define the concept of a syndrome, Hx = s ∈ F_q^{n−k}. The set of all x ∈ F_q^n with the same syndrome, s ∈ F_q^{n−k}, is a coset C(s):

C(s) = \{x ∈ F_q^n | Hx = s\}. (C.24)

Note that C(0) = C and C(s_1) ∩ C(s_2) = ∅ for s_1 ≠ s_2. The whole space F_q^n can thus be decomposed into q^{n−k} cosets, each coset containing q^k elements,

\bigcup_{s \in F_q^{n−k}} C(s) = F_q^n. (C.25)

Also, from linear algebra, C(s) = x̃ + C, where x̃ is an arbitrary coset member (an arbitrary solution of Hx̃ = s).
A coset leader e(s) is a member of the coset C(s) with the smallest Hamming weight. The Hamming weight of every coset leader satisfies

w(e(s)) ≤ R (C.26)

and this bound is tight. This is because for any x ∈ C(s)

R = \max_{z \in F_q^n} d_H(z, C) ≥ d_H(x, C) = \min_{c \in C} w(x − c) = w(e(s)) (C.27)

because as c goes through C, x − c goes through all members of the coset C(s).
This result also implies that any syndrome, s ∈ F_q^{n−k}, can be obtained by
adding at most R columns of H. This is because C(s) = {x|Hx = s} and the
weight of a coset leader is the smallest number of columns that need to be
summed to obtain the coset syndrome s. Thus, one method to determine the
covering radius of a linear code is to first form its parity-check matrix and then
find the smallest number of columns of H that can generate any syndrome.

Example C.7: Using the code from Example C.3 again, this last result immediately gives us the covering radius of the code from the parity-check matrix

H = \begin{pmatrix} 1&1&0&1&0 \\ 1&1&0&0&1 \end{pmatrix}. (C.28)

Because the columns of H cover all four binary vectors, each syndrome can be written as a linear combination of one column and thus R = 1. Because k = 3 and n = 5, there are 2^{n−k} = 4 cosets, each containing 2^k = 8 words. The coset corresponding to syndrome s_{00} = (0, 0)^T is the code C and the coset leader is the all-zero codeword. The coset leader of the coset C(s_{01}), s_{01} = (0, 1)^T, is e(s_{01}) = (0, 0, 0, 0, 1) because this is the word with the smallest Hamming weight for which He = (0, 1)^T. The coset leader of the coset C(s_{10}), s_{10} = (1, 0)^T, is e(s_{10}) = (0, 0, 0, 1, 0) because this is the word with the smallest Hamming weight for which He = (1, 0)^T. Finally, the coset leader of the coset C(s_{11}), s_{11} = (1, 1)^T, is e(s_{11}) = (0, 1, 0, 0, 0) because this is a word with the smallest Hamming weight for which He = (1, 1)^T. Note that the coset leader for this coset is not unique because (1, 0, 0, 0, 0) is also a coset leader.
The space F_2^5 = C ∪ (e(s_{01}) + C) ∪ (e(s_{10}) + C) ∪ (e(s_{11}) + C) can thus be written as a disjoint union of four cosets.

Example C.8: Consider a linear code with parity-check matrix

H = \begin{pmatrix} 1&0&0&1&1 \\ 0&1&0&1&0 \\ 0&0&1&0&1 \end{pmatrix} (C.29)

already in the systematic form. The covering radius of this code is R = 2 because, for some syndromes, we need to add two columns to obtain the syndrome. For example, the syndrome s = (1, 1, 1)^T is the sum of the second and fifth columns, s = H[., 2] + H[., 5]. This tells us that a coset leader of C(s) is e(s) = (0, 1, 0, 0, 1). Note that because also s = H[., 3] + H[., 4], another coset leader is e(s) = (0, 0, 1, 1, 0). The coset C((1, 0, 1)^T), on the other hand, has a unique coset leader e((1, 0, 1)^T) = (0, 0, 0, 0, 1).
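The coset structure in this example can be confirmed by brute force: group all 32 words of F_2^5 by syndrome and keep the minimum-weight members of each coset. Sketch (illustrative):

```python
from itertools import product

# Parity-check matrix (C.29) from Example C.8.
H = [(1, 0, 0, 1, 1), (0, 1, 0, 1, 0), (0, 0, 1, 0, 1)]

def syndrome(x):
    # Hx over F_2, see equation (C.24)
    return tuple(sum(h[i] * x[i] for i in range(5)) % 2 for h in H)

cosets = {}
for x in product((0, 1), repeat=5):
    cosets.setdefault(syndrome(x), []).append(x)

# Coset leaders: the minimum-weight members of each coset.
leaders = {s: [x for x in xs if sum(x) == min(map(sum, xs))]
           for s, xs in cosets.items()}
# Covering radius = largest coset-leader weight, the tight bound (C.26).
R = max(sum(ls[0]) for ls in leaders.values())
```

This reproduces the 8 cosets of 4 words each, R = 2, the two leaders of C((1,1,1)^T), and the unique leader of C((1,0,1)^T).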

Example C.9: Consider a linear code with parity-check matrix

H = \begin{pmatrix} 1&1&0&0&0&0 \\ 0&0&1&1&0&0 \\ 0&0&0&0&1&1 \end{pmatrix}. (C.30)

The code covering radius is R = 3 because, for example, the syndrome (1, 1, 1)^T is obtained by adding no fewer than three columns of H. And three columns can generate every syndrome because the first, third, and fifth columns of H form a basis of the vector space of all triples of bits.
Steganography in Digital Media: Principles, Algorithms, and Applications, by Jessica Fridrich. Cambridge University Press. Online ISBN: 9781139192903; Hardback ISBN: 9780521190190. Appendix D: Signal detection and estimation, pp. 335–362.
D Signal detection and estimation

In this appendix, we explain some elementary facts from statistical signal detec-
tion and estimation. The material is especially relevant to Chapter 6 on definition
of steganographic security and Chapters 10–12 that deal with steganalysis be-
cause the problem of detection of secret messages can be formulated within the
framework of statistical hypothesis testing. The reader is referred to [126] and
[125] for an in-depth treatment of signal detection and estimation for engineers.
We note that the material in this appendix applies to discrete random variables represented with probability mass functions after simply replacing the integrals with summations.

D.1 Simple hypothesis testing

Assume that we carried out a measurement represented as a vector of scalar values x[i], i = 1, . . . , n, and we desire to know whether repeated measurements follow distribution p_0 or p_1 defined on R^n,

H_0 : x ∼ p_0, (D.1)
H_1 : x ∼ p_1. (D.2)

In (D.1)–(D.2), we resorted to a common notational abuse of denoting the mea-


surements and the random variable using the same letter. In signal detection,
we speak of the null hypothesis H0 as the “signal-absent” or noise-only hypoth-
esis, while H1 is the signal-present hypothesis. In steganalysis, usually (but not
always), H0 is the hypothesis that x is a cover image, while H1 means that a
secret message is present in x.
In hypothesis testing, we want to make the best possible decision that opti-
mizes some fundamental criterion selected by the user. Every decision-making
process is an algorithm that assigns an index of the hypothesis (e.g., 0 or 1)
to every possible vector of measurements x ∈ Rn . We call such an algorithm a
detector. Mathematically, it is a map F : R^n → \{0, 1\}. The sets

R_0 = \{x ∈ R^n | F(x) = 0\}, (D.3)
R_1 = \{x ∈ R^n | F(x) = 1\} (D.4)
336 Appendix D. Signal detection and estimation

form a disjoint partition R^n = R_0 ∪ R_1 and they fully describe the detector F. The set R_1 is called the critical region because the detector decides H_1 if and only if x ∈ R_1.
The detector will make two types of error: false alarms and missed detections.¹ The probability of a false alarm, P_FA, is the probability that a random variable distributed according to p_0 is detected as stego (the detector decides "1"), while the probability of missed detection, P_MD, is the probability that a random variable distributed according to p_1 is incorrectly detected as cover:

P_FA = Pr\{F(x) = 1 | x ∼ p_0\} = \int_{R_1} p_0(x) dx, (D.5)

P_MD = Pr\{F(x) = 0 | x ∼ p_1\} = \int_{R_0} p_1(x) dx = 1 − \int_{R_1} p_1(x) dx. (D.6)

The two most frequently used criteria for constructing optimal detectors are:

• [Neyman–Pearson] Impose a bound on the probability of false alarms, P_FA ≤ ε_FA, and maximize the probability of detection, P_D(ε_FA) = 1 − P_MD(ε_FA), or, equivalently, minimize the probability of missed detection. The optimization process here needs to find, among all possible subsets of R^n, the critical region R_1 that maximizes the detection probability

P_D = Pr\{x ∈ R_1 | x ∼ p_1\} = \int_{R_1} p_1(x) dx (D.7)

subject to the condition

\int_{R_1} p_0(x) dx ≤ ε_FA. (D.8)

• [Bayesian] Assign positive costs to each error, C_{10} > 0 (cost of false alarm), C_{01} > 0 (cost of missed detection), non-positive "costs" or gains to each correct decision, C_{00} ≤ 0 (H_0 correctly detected as H_0), C_{11} ≤ 0 (H_1 correctly detected as H_1), and prior probabilities P(H_0) and P(H_1) that x ∼ p_0 and x ∼ p_1, and minimize the total cost

\sum_{i,j=0}^{1} C_{ij} Pr\{x ∈ R_i | x ∼ p_j\} P(H_j). (D.9)

Theorem D.1. [Likelihood-ratio test] The optimal detector for both scenarios above is the Likelihood-Ratio Test (LRT):

Decide H_1 if and only if L(x) = p_1(x)/p_0(x) > γ, (D.10)

¹ In statistics, these are called Type I and Type II errors.



where L(x) is called the likelihood ratio. For the Neyman–Pearson scenario, the threshold γ is a solution to the equation

\int_{L(x)>γ} p_0(x) dx = ε_FA, (D.11)

while for the Bayesian scenario γ can be computed from the costs and prior probabilities,

γ = \frac{(C_{10} − C_{00}) P(H_0)}{(C_{01} − C_{11}) P(H_1)}. (D.12)

Proof. In the Neyman–Pearson scenario, we need to determine the critical region R_1 so that P_D = \int_{R_1} p_1(x) dx is maximal subject to the constraint P_FA = \int_{R_1} p_0(x) dx ≤ ε_FA. We can assume equality in the constraint because one can always enlarge R_1 to obtain equality P_FA = ε_FA and thus further increase P_D (or, at worst, keep it the same). Thus, using the method of Lagrange multipliers, we maximize the functional

G(R_1) = P_D + λ(P_FA − ε_FA) (D.13)
= \int_{R_1} p_1(x) dx + λ \left( \int_{R_1} p_0(x) dx − ε_FA \right) (D.14)
= \int_{R_1} (p_1(x) + λ p_0(x)) dx − λ ε_FA. (D.15)

The last expression will be maximized if x ∈ R_1 whenever the integrand is positive, or, equivalently,

p_1(x)/p_0(x) > −λ. (D.16)

The Lagrange multiplier λ must be non-positive because otherwise all x ∈ R_1, which would lead to P_FA = 1.
In order to control the false-positive probability, we determine the constant γ = −λ so that

\int_{p_1(x)/p_0(x) > γ} p_0(x) dx = ε_FA, (D.17)

which gives one equation for one unknown scalar γ.


The proof for the Bayesian case is even simpler. Here, we wish to find the critical region R_1 that minimizes the cost \sum_{i,j=0}^{1} C_{ij} Pr\{x ∈ R_i | x ∼ p_j\} P(H_j), which we express as

C_{00} \left(1 − \int_{R_1} p_0(x) dx\right) P(H_0) + C_{10} \int_{R_1} p_0(x) dx P(H_0)
+ C_{11} \int_{R_1} p_1(x) dx P(H_1) + C_{01} \left(1 − \int_{R_1} p_1(x) dx\right) P(H_1)
= \int_{R_1} \left[ P(H_0)(C_{10} − C_{00}) p_0(x) + P(H_1)(C_{11} − C_{01}) p_1(x) \right] dx
+ C_{00} P(H_0) + C_{01} P(H_1). (D.18)

The cost will be minimized if x ∈ R_1 whenever the integrand is negative, or

P(H_0)(C_{10} − C_{00}) p_0(x) + P(H_1)(C_{11} − C_{01}) p_1(x) < 0, (D.19)

which can be rewritten as

\frac{p_1(x)}{p_0(x)} > \frac{(C_{10} − C_{00}) P(H_0)}{(C_{01} − C_{11}) P(H_1)}, (D.20)
which concludes the proof of the LRT.

In steganography, useful detectors must have low probability of a false alarm.


This is because images detected as potentially containing secret messages are
likely to be subjected to further forensic analysis (see Section 10.8) to deter-
mine the steganographic program, the stego key, and eventually to extract the
secret message. This may require running expensive and time-consuming dic-
tionary attacks for the stego key and further cryptanalysis if the message is
encrypted. Thus, it is more valuable to have a detector with very low PFA even
though its probability of missed detection may be quite high (e.g., PMD > 0.5 or
higher). Because typical steganographic communication is repetitive, a detector
with PMD = 0.5 is still quite useful as long as its false-alarm probability is small.
Thus, the hypothesis-testing problem in steganalysis is almost exclusively for-
mulated using the Neyman–Pearson setting. We repeat that the goal here is
to construct a detector with the highest detection probability PD (FA ) = 1 −
PMD (FA ) while imposing a bound on the probability of false alarms, PFA ≤ FA .
Even though it is possible to associate cost with both types of error, Bayesian
detectors are typically not used in steganalysis, because the prior probabilities
of encountering a cover or stego image, P (H0 ) and P (H1 ), are rarely available.
When the measurements x[i], i = 1, . . . , n, are independent and identically dis-
tributed, it is intuitively clear that with increasing n, we should be able to build
increasingly more accurate detectors. The performance of an optimal Neyman–
Pearson detector in the limit n → ∞ is captured by the Chernoff–Stein lemma,
which gives the error exponent for the probability of missed detection given a
bound on the probability of false alarms.

Lemma D.2. [The Chernoff–Stein lemma] Let x[i], i = 1, . . . , n, be a sequence of iid realizations of random variable x. Given the simple binary hypothesis-testing problem H_0 : x[i] ∼ p_0, H_1 : x[i] ∼ p_1 with D_KL(p_0 || p_1) < ∞ and an upper bound on the probability of false alarms, P_FA ≤ ε_FA, the probability of missed detection of the optimal Neyman–Pearson detector approaches 0 exponentially fast,

\lim_{n→∞} \frac{1}{n} \log P_MD(ε_FA) = −D_KL(p_0 || p_1). (D.21)

Proof. See Section 12.8 in [50].
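A quick numerical illustration (my own example): for p_0 = N(0, 1) and p_1 = N(μ, 1) we have D_KL(p_0||p_1) = μ²/2, and the optimal Neyman–Pearson test thresholds the sample mean, so P_MD can be computed exactly. The error exponent −(1/n) log P_MD indeed creeps toward D_KL as n grows:

```python
from math import erfc, log, sqrt
from statistics import NormalDist

def Q(x):
    # Gaussian right-tail probability, numerically safe for large x
    return 0.5 * erfc(x / sqrt(2))

mu, eps_fa = 0.25, 0.05                 # assumed example parameters
dkl = mu**2 / 2                         # DKL(N(0,1) || N(mu,1))
c = NormalDist().inv_cdf(1 - eps_fa)    # Q^{-1}(eps_FA)

rates = []
for n in (400, 1600, 6400):
    # NP test on the sample mean: decide H1 when mean > c / sqrt(n);
    # the exact miss probability is Phi(c - mu*sqrt(n)) = Q(mu*sqrt(n) - c)
    p_md = Q(mu * sqrt(n) - c)
    rates.append(-log(p_md) / n)
```

The computed exponents increase monotonically toward, but stay below, D_KL = μ²/2; convergence is slow because of the O(√n) and O(log n) correction terms in log P_MD.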

D.1.1 Receiver operating characteristic


The performance of every detector is described using its Receiver-Operating-Characteristic (ROC) curve, which expresses the trade-off between the probability of false alarms and missed detections. In this book, it is defined as the probability of detection as a function of P_FA, P_D(P_FA). The ROC of the optimal detector (D.10) must be a concave curve. To see this, let (u, P_D(u)) and (v, P_D(v)) be two points on the ROC curve, u < v, and F_u and F_v the corresponding optimal detectors for P_FA = u and P_FA = v. For each x ∈ (u, v), we can construct a randomized detector

F_x(x) = F_u(x) with probability (v − x)/(v − u), F_v(x) with probability (x − u)/(v − u). (D.22)

The probability of detection for this detector is

P_D^∗(x) = P_D(u) \frac{v − x}{v − u} + P_D(v) \frac{x − u}{v − u}, (D.23)

which defines a line segment connecting the points (u, P_D(u)) and (v, P_D(v)). Because the best detector must have P_D(x) ≥ P_D^∗(x), the ROC curve lies above this line segment and is thus concave.
A perfect detector with P_D(x) = 1 for all x can be obtained only when the supports of p_0 and p_1 are disjoint. The opposite case is when p_0 = p_1, in which case no detector can be built and we have P_D(x) = x, which corresponds to a randomly guessing detector.

D.1.2 Detection of signals corrupted by white Gaussian noise


In this section, we give a simple example to illustrate the apparatus introduced
above. Consider the situation when one needs to decide about the presence or
absence of a known signal w[i], i = 1, . . . , n, in additive white Gaussian noise with
known variance σ 2 . Formally, given an iid sequence of variables ξ[i] ∼ N (0, σ 2 ),
the hypothesis-testing problem is

H_0 : x[i] = ξ[i], (D.24)
H_1 : x[i] = w[i] + ξ[i], (D.25)

where H_0 is the noise-only hypothesis, while the alternative hypothesis states
that the signal of interest is present.
The LRT for this problem is

L(x) = \frac{p_1(x)}{p_0(x)} = \frac{\prod_{i=1}^{n} \frac{1}{\sqrt{2π}σ} e^{−(x[i]−w[i])^2/(2σ^2)}}{\prod_{i=1}^{n} \frac{1}{\sqrt{2π}σ} e^{−x[i]^2/(2σ^2)}} (D.26)
= e^{\frac{1}{σ^2} \left( \sum_{i=1}^{n} x[i]w[i] − \frac{1}{2} \sum_{i=1}^{n} w[i]^2 \right)} > γ. (D.27)

An equivalent LRT is obtained by taking the natural logarithm of this inequality,

\log L(x) = \frac{1}{σ^2} \left( \sum_{i=1}^{n} x[i]w[i] − \frac{1}{2} \sum_{i=1}^{n} w[i]^2 \right) > \log γ, (D.28)

or

T(x) = \sum_{i=1}^{n} x[i]w[i] > σ^2 \log γ + \frac{1}{2} \sum_{i=1}^{n} w[i]^2 = γ'. (D.29)

The result we have just obtained says that the optimal detector of a known signal corrupted by additive white Gaussian noise is the correlator. We now determine the threshold γ' for the Neyman–Pearson test and characterize the detector performance. The test statistic T(x) is Gaussian under both hypotheses because it is a linear combination of iid Gaussians (see Lemma A.6). The reader can easily verify that

T(x) ∼ N(μ_0, ν^2) under H_0, T(x) ∼ N(μ_1, ν^2) under H_1, (D.30)

where μ_0 = 0, μ_1 = E, and ν^2 = σ^2 E with E = \sum_{i=1}^{n} w[i]^2, the energy of the known signal. We now face a situation when the test statistic is a mean-shifted Gauss–Gauss problem (Gaussians with the same variances and different means). Whenever this situation occurs, the detector's performance is completely described by the so-called deflection coefficient

d^2 = \frac{(μ_1 − μ_0)^2}{ν^2}. (D.31)
To see this, let us compute P_FA and P_D,

P_FA = Pr\{T(x) > γ' | T(x) ∼ N(μ_0, ν^2)\} = Q\left(\frac{γ' − μ_0}{ν}\right), (D.32)
P_D = Pr\{T(x) > γ' | T(x) ∼ N(μ_1, ν^2)\} = Q\left(\frac{γ' − μ_1}{ν}\right), (D.33)

where Q(x) = \int_x^∞ \frac{1}{\sqrt{2π}} e^{−t^2/2} dt is the right-tail probability of a Gaussian variable N(0, 1) introduced in Section A.4.
From the first equation, we can determine the decision threshold for the Neyman–Pearson test, γ' = μ_0 + ν Q^{−1}(P_FA), and substitute it into the equation for P_D,

P_D(P_FA) = Q\left(\frac{μ_0 − μ_1 + ν Q^{−1}(P_FA)}{ν}\right) (D.34)
= Q\left(Q^{−1}(P_FA) − \frac{μ_1 − μ_0}{ν}\right) (D.35)
= Q\left(Q^{−1}(P_FA) − \sqrt{d^2}\right), (D.36)

where d^2 is the deflection coefficient defined in (D.31). The equation (D.36) is the mathematical form of the ROC. The reader is encouraged to use the form of Q(x) to see that the ROC curve goes through the origin (0, 0) and the point (1, 1) and, also, P_D'(0) = ∞, P_D'(1) = 0, and P_D''(P_FA) < 0 for 0 ≤ P_FA ≤ 1 (the ROC is concave).
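The ROC formula (D.36) is easy to check by simulation. The sketch below (illustrative parameters of my choosing: n = 64, w[i] = 0.25, σ = 1, so d = 2) runs the correlator (D.29) on synthetic data and compares the empirical rates with the theory:

```python
import random
from math import erfc, sqrt
from statistics import NormalDist

random.seed(7)
n, sigma = 64, 1.0
w = [0.25] * n                     # known signal (assumed example)
E = sum(wi * wi for wi in w)       # signal energy, here E = 4
nu = sigma * sqrt(E)               # standard deviation of T(x)
d = sqrt(E) / sigma                # deflection, d^2 = E / sigma^2 = 4

eps_fa = 0.1
qinv = NormalDist().inv_cdf(1 - eps_fa)
gamma = nu * qinv                  # gamma' = mu0 + nu * Q^{-1}(P_FA), mu0 = 0

def Q(x):
    return 0.5 * erfc(x / sqrt(2))

pd_theory = Q(qinv - d)            # ROC point predicted by (D.36)

trials, fa, det = 4000, 0, 0
for _ in range(trials):
    noise = [random.gauss(0, sigma) for _ in range(n)]
    T0 = sum(z * wi for z, wi in zip(noise, w))           # H0: noise only
    T1 = sum((wi + z) * wi for z, wi in zip(noise, w))    # H1: signal + noise
    fa += T0 > gamma
    det += T1 > gamma

pfa_emp, pd_emp = fa / trials, det / trials
```

With these parameters, pd_theory = Q(Q^{-1}(0.1) − 2) ≈ 0.76, and the Monte Carlo frequencies land close to both the target false-alarm rate and this prediction.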

D.2 Hypothesis testing and Fisher information

The basic goal of steganalysis is construction of a detector distinguishing between


two distributions, p0 and pβ , where β > 0 is a parameter (change rate). In the
limit of β → 0 (the so-called small-payload limit), the performance of the optimal
likelihood-ratio detector is completely described by the leading term in Taylor
expansion of the KL divergence DKL (p0 ||pβ ). The coefficient in the leading term
is proportional to the Fisher information, which makes it useful, for example, for comparing the security of steganographic schemes. We now provide a detailed technical explanation of these claims.
Let x[i], i = 1, . . . , n, be n iid realizations of some scalar random variable x.
We desire to determine whether the individual samples follow p0 or pβ :

H0 : x[i] ∼ p0 , (D.37)
H1 : x[i] ∼ pβ , β > 0, (D.38)

where β is a known parameter and pβ (x) is three times continuously differen-


tiable on some right neighborhood of β = 0 for all x. The likelihood-ratio test in
logarithmic form is (see (D.10))

L_β(x) = \sum_{i=1}^{n} \log \frac{p_β(x[i])}{p_0(x[i])}. (D.39)

Assuming the random variable log(p_β(x)/p_0(x)) has finite mean and variance, by the central limit theorem (1/n)L_β(x) converges to a Gaussian distribution whose mean and variance we now determine.
The expected value of (1/n)L_β(x) under hypothesis H_0 is −D_KL(p_0 || p_β) because

E_{p_0}\left[\frac{1}{n} L_β(x)\right] = E_{p_0}\left[\log \frac{p_β(x)}{p_0(x)}\right] = −D_KL(p_0 || p_β). (D.40)

It can be expanded using Proposition B.7 as

−D_KL(p_0 || p_β) = −\frac{β^2}{2} I(0) + O(β^3), (D.41)
where I(β) is the Fisher information of one observation,

I(β) = E_{p_β}\left[\left(\frac{∂ \log p_β(x)}{∂β}\right)^2\right] = \sum_x \frac{1}{p_β(x)} \left(\frac{∂ p_β(x)}{∂β}\right)^2, (D.42)

where the second equality holds for the discrete case (in the continuous case, the sum is replaced with an integral). See Section D.6 for more details about Fisher information.
The expected value of (1/n)L_β(x) under H_1 is

E_{p_β}\left[\frac{1}{n} L_β(x)\right] = D_KL(p_β || p_0). (D.43)

Because the leading term of p_β(x) in β is p_0(x), the leading terms of D_KL(p_β || p_0) and D_KL(p_0 || p_β) are the same up to the sign,

E_{p_β}\left[\frac{1}{n} L_β(x)\right] = \frac{β^2}{2} I(0) + O(β^3). (D.44)
To compute the variance of (1/n)L_β under either hypothesis, notice that

Var\left[\log \frac{p_β(x)}{p_0(x)}\right] = E\left[\left(\log \frac{p_β(x)}{p_0(x)}\right)^2\right] − \left(E\left[\log \frac{p_β(x)}{p_0(x)}\right]\right)^2. (D.45)

We already know that the second term is O(β^4) under both hypotheses. For the first expectation, we expand

\log \frac{p_β(x)}{p_0(x)} = \log\left(1 + \frac{β}{p_0} \left.\frac{∂ p_β(x)}{∂β}\right|_{β=0} + O(β^2)\right) (D.46)
= \frac{β}{p_0} \left.\frac{∂ p_β(x)}{∂β}\right|_{β=0} + O(β^2), (D.47)

and thus both p_0(x)(\log(p_β(x)/p_0(x)))^2 and p_β(x)(\log(p_β(x)/p_0(x)))^2 have the same leading term,

\frac{β^2}{p_0(x)} \left(\left.\frac{∂ p_β(x)}{∂β}\right|_{β=0}\right)^2. (D.48)
Therefore, under both hypotheses, the leading term of the variance is

Var\left[\frac{1}{n} \sum_{i=1}^{n} \log \frac{p_β(x[i])}{p_0(x[i])}\right] = \frac{1}{n} E\left[\left(\log \frac{p_β(x)}{p_0(x)}\right)^2\right] (D.49)
≐ \frac{1}{n} \sum_x \frac{β^2}{p_0(x)} \left(\left.\frac{∂ p_β(x)}{∂β}\right|_{β=0}\right)^2 (D.50)
= \frac{1}{n} β^2 I(0). (D.51)
Finally, we can conclude that for small β the likelihood-ratio test is the mean-shifted Gauss–Gauss problem

\frac{1}{n} L_β(x) ∼ N(−β^2 I(0)/2, β^2 I(0)/n) under H_0, \frac{1}{n} L_β(x) ∼ N(β^2 I(0)/2, β^2 I(0)/n) under H_1, (D.52)

and its performance is thus completely described by the deflection coefficient (accurate up to the order of β^2), which is in turn proportional to the Fisher information,

d^2 = \frac{(β^2 I(0)/2 + β^2 I(0)/2)^2}{β^2 I(0)/n} = n β^2 I(0). (D.53)
Using the so-called J-divergence (sometimes called the symmetric KL divergence)

J(p_0 || p_β) = D_KL(p_0 || p_β) + D_KL(p_β || p_0), (D.54)

the deflection coefficient can also be written as

d^2 = n \frac{(J(p_0 || p_β))^2}{β^2 I(0)} = \frac{4n (D_KL(p_0 || p_β))^2}{β^2 I(0)}. (D.55)
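A quick numerical check of these leading-order claims on a toy family (my own example, not from the book): take p_β on {0, 1} with p_β(1) = 1/2 + β. Then I(0) = 4 by (D.42), and D_KL(p_0||p_β) ≈ (β²/2) I(0) = 2β²:

```python
from math import log

def p(beta):
    # toy one-parameter family on {0, 1} (assumed example)
    return (0.5 - beta, 0.5 + beta)

def fisher(beta, h=1e-6):
    # I(beta) = sum_x (1/p_beta(x)) (d p_beta(x) / d beta)^2, eq. (D.42),
    # with the derivative approximated by a forward difference
    pb, pf = p(beta), p(beta + h)
    return sum(((b - a) / h) ** 2 / a for a, b in zip(pb, pf))

def dkl(pa, pb):
    # KL divergence between two pmfs on the same support
    return sum(a * log(a / b) for a, b in zip(pa, pb))

I0 = fisher(0.0)                    # exact value is 1/0.5 + 1/0.5 = 4
beta = 0.01
ratio = dkl(p(0.0), p(beta)) / (beta**2 * I0 / 2)   # should be close to 1
```

The ratio of the exact KL divergence to its quadratic approximation comes out within a fraction of a percent of 1 for β = 0.01.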

D.3 Composite hypothesis testing

When one of the probability distributions in the hypothesis test is not known or
depends on unknown parameters, we obtain the so-called composite hypothesis-
testing problem, which is substantially more complicated than the simple test.
In general, no optimal detectors exist in this case and one has to resort to sub-
optimal detectors by accepting simplifying or additional assumptions.
In this book, we are primarily interested in the case when the distribution
under the alternative hypothesis depends on an unknown scalar parameter β,
such as the change rate (relative number of embedding changes due to message
hiding). In this case, we know that β ≥ 0, thus obtaining the one-sided hypothesis test

H_0 : x ∼ p_0, (D.56)
H_1 : x ∼ p_β, β > 0, (D.57)

where x is again a vector of measurements (e.g., the histogram). Examples of
where x is again a vector of measurements (e.g., the histogram). Examples of
specific instances of this problem appear in Chapter 10.
If the threshold in the LRT can be set so that the detector has the highest probability of detection P_D for any value of the unknown parameter β, we speak of a Uniformly Most Powerful (UMP) detector. Because UMP detectors rarely exist, other approaches are often explored. One possibility is to constrain ourselves to small values of the parameter β. In steganalysis, this is the case of main interest because small payloads are harder to detect than large ones. If β is the only unknown parameter, one can derive the Locally Most Powerful (LMP) detector, which has a constant false-alarm rate for small β, so that one can set a decision threshold in the Neyman–Pearson setting. To see this, we
expand log p_β using Taylor expansion around β = 0 and write the log-likelihood ratio test as

L_β(x) = \log p_β(x) − \log p_0(x) = β \left.\frac{∂ \log p_β(x)}{∂β}\right|_{β=0} + O(β^2) > \log γ (D.58)

or, for the leading term,

\left.\frac{∂ \log p_β(x)}{∂β}\right|_{β=0} > \frac{\log γ}{β}. (D.59)

Thus, for small β, the test statistic is

T(x) = \left.\frac{∂ \log p_β(x)}{∂β}\right|_{β=0}. (D.60)

Under H_0, its expected value is

E_{p_0}\left[\left.\frac{∂ \log p_β(x)}{∂β}\right|_{β=0}\right] = 0 (D.61)

by the regularity condition (D.89), and its variance is

Var\left[\left.\frac{∂ \log p_β(x)}{∂β}\right|_{β=0}\right] = E_{p_0}\left[\left(\left.\frac{∂ \log p_β(x)}{∂β}\right|_{β=0}\right)^2\right], (D.62)

which is the Fisher information for the vector of observations x.


Additionally, if x[i] are iid realizations of some scalar variable, we can invoke
the central limit theorem and state that T (x) is Gaussian,

 ∂ log pβ (x[i]) 
n
T (x) =  ∼ N (0, nI(0)) , (D.63)
∂β 
i=1 β=0
where I(0) is the Fisher information of one observation x[i] (D.42). We also denoted the marginal distribution of p_β(x) constrained to the ith component with the same symbol. Notice that the distribution of the test statistic under H_0 does not depend on the unknown parameter β. This allows us to compute the threshold γ for T(x) for a given bound P_FA ≤ ε_FA from

Pr\{T(x) > γ | H_0\} = \int_{T(x)>γ} p_0(x) dx = ε_FA (D.64)

as in simple hypothesis testing. On the other hand, the probability of detection P_D will depend on β and thus will generally be unknown.
We now briefly mention two more alternative approaches to the composite
hypothesis test. A popular choice is to use the Generalized Likelihood-Ratio
Test (GLRT) or impose an a priori probability distribution on the unknown
parameter and use a Bayesian approach.
The GLRT detector has the form

Decide H_1 if and only if L(x) = \frac{p_1(x; \hat{β}_1)}{p_0(x; \hat{β}_0)} > γ, (D.65)

where \hat{β}_0 and \hat{β}_1 are maximum-likelihood estimates of β under the corresponding hypotheses,

\hat{β}_0 = \arg\max_β p_0(x; β), (D.66)
\hat{β}_1 = \arg\max_β p_1(x; β). (D.67)

In a Bayesian detector, a pdf p(β) is imposed on the unknown parameter, at which point the composite hypothesis-testing problem is converted to a simple one and the detector takes the form of (D.10),

Decide H_1 when \frac{\int p_1(x|β) p(β) dβ}{\int p_0(x|β) p(β) dβ} > γ. (D.68)

D.4 Chi-square test

Pearson's chi-square test is a popular test for the composite hypothesis-testing problem of whether or not a given iid signal x[i], i = 1, . . . , n, follows a known distribution p,

H_0 : x[i] ∼ p, (D.69)
H_1 : x[i] ≁ p. (D.70)

The test is directly applicable only to discrete random variables but can be used
for continuous variables after binning. The binning has typically little influence
on the result if the pdf is “reasonable.” For example, for a unimodal pdf, the bin
width can be chosen as σ̂/2, where σ̂ is the sample standard deviation of the
data.
Suppose we have d bins (also called categories). Let o[k] be the observed number of data samples in the kth bin, k = 1, . . . , d, and e[k] the expected occupancy of the kth bin if the data followed the known distribution p. Under the null hypothesis, the test statistic

S = Σ_{k=1}^d (o[k] − e[k])²/e[k]   (D.71)

is a random variable with distribution approaching the chi-square distribution χ²_{d−1} with d − 1 degrees of freedom (Section A.9). This approximation is valid only if the expected occupancy of each bin satisfies e[k] > 4. If this is not the case, the bins should be merged to satisfy the minimum-occupancy condition.
The detector is

Reject H0 (decide H1 ) when S > γ, (D.72)

where γ is determined from the bound on the probability of the Type I error (deciding H1 when H0 is correct), P_FA = Pr{χ²_{d−1} > γ}, given by the complementary cumulative distribution function for the chi-square variable with d − 1 degrees of freedom,

P_FA = Pr{χ²_{d−1} > γ} = 1 − (1/(2^{(d−1)/2} Γ((d−1)/2))) ∫_0^γ e^{−t/2} t^{(d−1)/2 − 1} dt.   (D.73)

In Matlab, this function can be evaluated as 1-chi2cdf(γ,d − 1).

Example D.3: [Fair-die test] Throw a die n = 1000 times and calculate o[k] = the number of throws with outcome k, k = 1, . . . , 6. For a fair die, e[k] = n/6 for all k = 1, . . . , 6 (we have d = 6 bins). Because e[k] = 1000/6 > 4, the statistic

S = Σ_{k=1}^6 (o[k] − n/6)²/(n/6)   (D.74)

is approximately chi-square distributed with d − 1 = 5 degrees of freedom. Let us assume that we threw the die 1000 times and obtained the following observed frequencies o[k] = (183, 162, 162, 155, 168, 170). Then, S = 2.756. For P_FA = 0.05 and 5 degrees of freedom, we have γ = 11.07. In Matlab, γ = chi2inv(1 − P_FA, d − 1). Because S = 2.756 < 11.07, we accept the null hypothesis (the die is fair).
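Example D.3 is easy to check numerically. The sketch below redoes the computation in Python, with SciPy's chi2 playing the role of the Matlab calls chi2cdf/chi2inv.

```python
import numpy as np
from scipy.stats import chi2

o = np.array([183, 162, 162, 155, 168, 170])   # observed counts o[k]
n, d = o.sum(), len(o)
e = np.full(d, n / d)                          # expected counts e[k] for a fair die
S = np.sum((o - e) ** 2 / e)                   # Pearson statistic (D.71)
gamma = chi2.ppf(0.95, d - 1)                  # threshold for P_FA = 0.05
reject = S > gamma                             # decide H1 (loaded die) when S > gamma
```

Here S ≈ 2.76 stays well below γ ≈ 11.07, so the null hypothesis of a fair die is accepted.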

D.5 Estimation theory

Assume we have a data set x[i], i = 1, . . . , n, that depends on some unknown


scalar parameter θ. Our task is to estimate the parameter from the data using
some function θ̂(x), which we will call an estimator of θ. We encounter this
situation in quantitative steganalysis when the unknown parameter is the change
rate due to steganographic embedding and x is the vector of elements from the
stego object.
An estimator is thus a mapping from Rn → R that assigns an estimate, θ̂(x),
of the parameter to each data set x. If we were to repeatedly collect more data
sets, each time the value of the estimate would be different, following some
distribution p(x; θ). Thus, the estimator itself is a random variable and we can
speak of its mean value and variance.
We say that the estimator θ̂ is unbiased if

E[θ̂] = θ for all θ (D.75)

or, using the pdf of our measurements,


∫_{Rⁿ} θ̂(x) p(x; θ) dx = θ.   (D.76)

In other words, an unbiased estimator on average yields the correct answer.

Example D.4: [Unknown DC level in AWGN] Let

x[i] = A + ξ[i], i = 1, . . . , n (D.77)

be a DC signal (constant signal) corrupted by Additive White Gaussian Noise


(AWGN), meaning that ξ[i] ∼ N (0, σ 2 ) is iid. The variance of the noise, σ 2 , is
not necessarily known. Our task is to estimate the parameter A. One possibility
would be to set the estimate equal to the first sample

Â = x[1].   (D.78)

This is an unbiased estimator because E[x[1]] = A. Its variance is Var[Â] =


Var[x[1]] = σ 2 . But we are not using the observations very efficiently because
we are ignoring all the other measurements. Thus, perhaps a better estimator
would be the arithmetic average

Â = (1/n) Σ_{i=1}^n x[i].   (D.79)

Indeed, its variance is

Var[(1/n) Σ_{i=1}^n x[i]] = (1/n²) Σ_{i=1}^n Var[x[i]] = (1/n²)·nσ² = σ²/n,   (D.80)

which is n times smaller than for the previous estimator. In fact, this is the best
unbiased estimator of A that we can hope for, in some well-defined sense.
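A quick Monte Carlo check of Example D.4 (the parameter values below are illustrative choices): both estimators are unbiased, but the sample mean has variance σ²/n, n times smaller than the single-sample estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
A, sigma, n, trials = 5.0, 2.0, 100, 20000

# 'trials' independent data sets x[i] = A + noise, each of length n
x = A + sigma * rng.standard_normal((trials, n))
est_first = x[:, 0]        # A_hat = x[1]: unbiased, variance ~ sigma^2
est_mean = x.mean(axis=1)  # A_hat = sample mean: unbiased, variance ~ sigma^2/n
```

Averaging the estimates over the trials recovers A in both cases, while the empirical variances differ by the factor n.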

We just posed an interesting question of comparing different estimators. Can we


somehow decide which estimator is better and then select the best one among
them? One seemingly natural criterion would be to rank estimators by their
Mean-Square Error (MSE)

MSE(θ̂) = E[(θ̂ − θ)2 ]. (D.81)

However, best estimators according to this criterion may not be constructable


because the best estimator may depend on the unknown parameter that we
are estimating. To see this, imagine a slightly different estimator of A from the
previous example:

Â = (a/n) Σ_{i=1}^n x[i],   (D.82)

where a > 0 is a constant. The MSE of any estimator can be written as the sum
of the estimator variance and the square of its bias, b(θ̂) = E[θ̂] − θ, because

MSE(θ̂) = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²]   (D.83)
= E[(θ̂ − E[θ̂])²] + E[(E[θ̂] − θ)²] + 2E[(θ̂ − E[θ̂])(E[θ̂] − θ)]   (D.84)
= Var[θ̂] + b²(θ̂) + 2 × 0 × b(θ̂) = Var[θ̂] + b²(θ̂).   (D.85)

The last equality follows from the fact that E[θ̂ − E[θ̂]] = 0 and that E[θ̂] − θ is
a number and not a random variable. Thus, the MSE of estimator (D.82) is

MSE(Â) = Var[Â] + b²(Â) = a²σ²/n + (a − 1)²A².   (D.86)

By differentiating MSE(Â) with respect to a, setting to zero, and solving for


a, we obtain the value of a that minimizes the MSE:

(d/da) MSE(Â) = 2aσ²/n + 2(a − 1)A² = 0,   (D.87)
a = A²/(A² + σ²/n).   (D.88)

Because the second derivative is positive, we have indeed found a minimum of


the MSE. We are witnessing here an interesting trade-off. It is better to allow
the estimator to be biased if its variance becomes lower. The optimal value of a,
however, depends on the unknown parameter A. Thus, this best estimator (in
the sense of the smallest MSE) cannot be constructed.
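The trade-off is easy to see numerically. With illustrative values A = 1, σ² = 4, n = 10, the shrinkage estimator (D.82) with the optimal a from (D.88) has a smaller MSE than the unbiased sample mean, but computing that a requires the unknown A.

```python
import numpy as np

rng = np.random.default_rng(2)
A, sigma2, n, trials = 1.0, 4.0, 10, 100000

# sample means of 'trials' independent data sets; xbar ~ N(A, sigma2/n)
xbar = A + np.sqrt(sigma2 / n) * rng.standard_normal(trials)
a_opt = A**2 / (A**2 + sigma2 / n)            # optimal a from (D.88); uses the true A
mse_unbiased = np.mean((xbar - A) ** 2)       # ~ sigma^2/n = 0.4
mse_shrunk = np.mean((a_opt * xbar - A) ** 2) # smaller, at the price of a bias
```

The theoretical minimum of (D.86) works out to (σ²/n)·a_opt, which the simulation reproduces.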

D.6 Cramer–Rao lower bound

Because of the problem with realizability, we will next only consider unbiased
estimators. Among them, we select as the best the one with the smallest variance.
Such estimators are called Minimum-Variance Unbiased (MVU).
Even MVU estimators may not always exist because the minimum variance for
one fixed estimator may not be minimum for all θ. Again, we do not know when
to switch between estimators if there is an MVU for θ ∈ (−∞, θ0 ) and another
MVU for θ ∈ (θ0 , ∞). There are three possible approaches that we can choose
in practice [125]. We can first obtain the Cramer–Rao Lower Bound (CRLB)
on the estimator variance and show that our estimator’s variance is close to the
theoretical bound, or we can apply the Rao–Blackwell–Lehmann–Scheffé theorem,
or we can restrict our estimator to the class of linear estimators. In this appendix,
we explain the CRLB.
Before formulating the CRLB, we derive a few useful facts and introduce some
terminology.
We say that a pdf p(x; θ) satisfies the regularity condition if

 
E[∂ log p(x; θ)/∂θ] = 0 for all θ,   (D.89)

where the expected value is taken with respect to the pdf p(x; θ). As in Sec-
tion D.1, we adopt the same convention and use x to denote a random variable.
The regularity condition means that

∫ p(x; θ) (∂ log p(x; θ)/∂θ) dx = ∫ ∂p(x; θ)/∂θ dx = (∂/∂θ) ∫ p(x; θ) dx = ∂1/∂θ = 0   (D.90)

if we can exchange the partial derivative and the integral. It turns out that for
most real-life cases this exchange will be justified (there will be an integrable
majorant for the pdf for some range of the parameter). Thus, the regularity
condition is, in fact, not a strong condition and is practically always satisfied.
By differentiating the regularity condition partially with respect to θ, we es-
tablish one more useful fact,

   
E[∂² log p(x; θ)/∂θ²] = −E[(∂ log p(x; θ)/∂θ)²],   (D.91)
which follows from

(∂/∂θ) ∫ p(x; θ) (∂ log p(x; θ)/∂θ) dx = 0,   (D.92)
∫ p(x; θ) (∂² log p(x; θ)/∂θ²) dx = − ∫ (∂p(x; θ)/∂θ)(∂ log p(x; θ)/∂θ) dx,   (D.93)
∫ p(x; θ) (∂² log p(x; θ)/∂θ²) dx = − ∫ p(x; θ) (∂ log p(x; θ)/∂θ)² dx,   (D.94)
E[∂² log p(x; θ)/∂θ²] = −E[(∂ log p(x; θ)/∂θ)²].   (D.95)

Theorem D.5. [Cramer–Rao lower bound] Assume the pdf of measure-


ments p(x; θ) satisfies the regularity condition (D.89). Then, the variance of any
unbiased estimator θ̂ must satisfy
Var[θ̂] ≥ 1 / (−E[∂² log p(x; θ)/∂θ²]).   (D.96)

Moreover, an unbiased estimator that attains the bound exists for all θ if and
only if
∂ log p(x; θ)/∂θ = I(θ)(g(x) − θ)   (D.97)
for some functions g and I. The MVU estimator is

θ̂ = g(x) (D.98)

and its variance is


Var[θ̂] = 1/I(θ).   (D.99)

Proof. The proof of this theorem is a direct consequence of the Cauchy–Schwartz


inequality (see Section D.10.1). Let g(x) be the function that realizes an unbiased
estimator. Because it is unbiased, we have
∫ g(x) p(x; θ) dx = θ.   (D.100)

Differentiating this with respect to θ, we obtain


∫ (∂p(x; θ)/∂θ) g(x) dx = 1,   (D.101)
∫ p(x; θ) (∂ log p(x; θ)/∂θ) g(x) dx = 1,   (D.102)
∫ p(x; θ) (∂ log p(x; θ)/∂θ) (g(x) − θ) dx = 1.   (D.103)

The last equality follows from the fact that


∫ p(x; θ) (∂ log p(x; θ)/∂θ) θ dx = θ E[∂ log p(x; θ)/∂θ] = 0   (D.104)

due to the regularity condition. We now apply the Cauchy–Schwartz inequality in the space of quadratically integrable functions on the support of the pdf p(x; θ) with the inner product ⟨f, h⟩ = ∫ w(x)f(x)h(x)dx, where the weight function w is the pdf itself. The inequality says that ⟨f, h⟩² ≤ ‖f‖² ‖h‖², or (∫ w(x)f(x)h(x)dx)² ≤ ∫ w(x)(f(x))² dx × ∫ w(x)(h(x))² dx. In our case, w(x) = p(x; θ), f(x) = ∂ log p(x; θ)/∂θ, h(x) = g(x) − θ. Thus,

1 ≤ ∫ p(x; θ) (∂ log p(x; θ)/∂θ)² dx × ∫ p(x; θ)(g(x) − θ)² dx,   (D.105)
1 ≤ E[(∂ log p(x; θ)/∂θ)²] × Var[θ̂],   (D.106)

which proves the bound (together with equality (D.91)).


The equality in the Cauchy–Schwartz inequality occurs if and only if the func-
tions f and h are collinear or one is a multiple of the other. Since the independent
variable for the functions is x, the multiplicative constant could depend on θ.
The equality thus occurs if and only if

∂ log p(x; θ)/∂θ = I(θ)(g(x) − θ),   (D.107)

which proves half of the statement in the CRLB.


We now need to prove that if (D.107) holds, then g(x) is an MVU estimator
and its variance is 1/I(θ). Taking the expected value of (D.107) with respect to
p(x; θ), the left-hand side is the regularity condition and thus equal to 0. The
right-hand side is I(θ) times the estimator bias, I(θ)E[g(x) − θ]. Thus, θ̂ = g(x)
is an unbiased estimator. To show the variance, we first calculate the analytic
form of I(θ). We differentiate (D.107) with respect to θ and take the expected
value again:

∂² log p(x; θ)/∂θ² = I′(θ)(g(x) − θ) − I(θ),   (D.108)
E[∂² log p(x; θ)/∂θ²] = E[I′(θ)(g(x) − θ)] − I(θ) = −I(θ).   (D.109)

Thus,

I(θ) = −E[∂² log p(x; θ)/∂θ²] = E[(∂ log p(x; θ)/∂θ)²].   (D.110)
To calculate the variance of θ̂ = g(x), we rewrite (D.107), square, and take the expected value of both sides:

(1/I(θ)) ∂ log p(x; θ)/∂θ = g(x) − θ,   (D.111)
E[((1/I(θ)) ∂ log p(x; θ)/∂θ)²] = E[(g(x) − θ)²] = Var[θ̂],   (D.112)
(1/I(θ)²) E[(∂ log p(x; θ)/∂θ)²] = Var[θ̂],   (D.113)
1/I(θ) = Var[θ̂].   (D.114)

The last equality follows from the analytic expression for I(θ).

The quantity I(θ) = E[(∂ log p(x; θ)/∂θ)2 ] is called Fisher information. It is a
measure of how fast the pdf changes with θ at x. Intuitively, the faster the pdf
changes, the more accurately we can estimate θ from the measurements. The
larger it is, the smaller the bound on the unbiased estimator variance, Var[θ̂] ≥
1/I(θ).
The Fisher information is always non-negative and it is additive when the
observations are independent. This is because for independent observations

"
n
p(x; θ) = p(x[i]; θ) (D.115)
i=1

and we have for the Fisher information


I(θ) = −E[(∂²/∂θ²) Σ_{i=1}^n log p(x[i]; θ)] = − Σ_{i=1}^n E[∂² log p(x[i]; θ)/∂θ²] = Σ_{i=1}^n I_i(θ),   (D.116)
∂θ2 i=1
(D.116)
where Ii (θ) is the Fisher information of each observation.

Example D.6: We can use the CRLB to prove that the sample mean is the MVU
estimator for DC level in AWGN (Example D.4). The pdf of each observation is
a Gaussian N (A, σ 2 ). Since all n observations are independent random variables,
their joint pdf is the product of their marginal pdfs and we have

p(x; A) = Π_{i=1}^n (1/√(2πσ²)) e^{−(x[i]−A)²/(2σ²)} = (2πσ²)^{−n/2} e^{−(1/(2σ²)) Σ_{i=1}^n (x[i]−A)²}.   (D.117)
Thus,

∂ log p(x; A)/∂A = −(1/(2σ²)) Σ_{i=1}^n (−2)(x[i] − A) = (1/σ²) (Σ_{i=1}^n x[i] − nA) = (n/σ²)(x̄ − A).

From here, the CRLB tells us that the sample mean g(x) = x̄ is the MVU estimator with variance σ²/n.

Example D.7: Consider the situation in Example D.4,

x[i] = A + ξ[i], i = 1, . . . , n, ξ[i] ∼ N (0, σ 2 ), (D.118)

but now let us say we know the DC level A and we wish to estimate the noise
variance σ 2 instead. We start by writing down the pdf of the n observations,
with σ² being the unknown parameter,

p(x; σ²) = Π_{i=1}^n (1/√(2πσ²)) e^{−(x[i]−A)²/(2σ²)} = (2πσ²)^{−n/2} e^{−(1/(2σ²)) Σ_{i=1}^n (x[i]−A)²}.   (D.119)

We have

∂ log p(x; σ²)/∂σ² = (∂/∂σ²) [−(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (x[i] − A)²]   (D.120)
= −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x[i] − A)²   (D.121)
= (n/(2σ⁴)) [(1/n) Σ_{i=1}^n (x[i] − A)² − σ²].   (D.122)

Thus, the sample variance σ̂² = (1/n) Σ_{i=1}^n (x[i] − A)² is an MVU estimator and its variance is 2σ⁴/n. Note that while the MVU estimator of A does not need knowledge of the noise variance σ², the MVU estimator of variance needs knowledge of A.
If both A and σ² are unknown, it is tempting to write an estimator for the variance as

σ̂² = (1/n) Σ_{i=1}^n (x[i] − Â)² = (1/n) Σ_{i=1}^n (x[i] − x̄)².   (D.123)

However, this plug-in estimator is biased. In fact, plug-in estimators are generally biased unless they are linear in the parameter. The reader is challenged to verify that

σ̂² = (1/(n − 1)) Σ_{i=1}^n (x[i] − x̄)²   (D.124)

is an MVU estimate of variance.
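The claimed bias is easy to verify by simulation (the parameter values below are illustrative): the 1/n plug-in estimator (D.123) underestimates σ² by the factor (n − 1)/n, while the 1/(n − 1) version (D.124) is unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, trials = 9.0, 5, 200000
x = rng.normal(0.0, np.sqrt(sigma2), (trials, n))

# squared deviations from the sample mean, per data set
dev2 = (x - x.mean(axis=1, keepdims=True)) ** 2
var_plugin = dev2.sum(axis=1) / n          # (D.123): biased, E = sigma^2 (n-1)/n
var_unbiased = dev2.sum(axis=1) / (n - 1)  # (D.124): unbiased, E = sigma^2
```

Averaging over the trials shows the plug-in estimator centered at σ²(n − 1)/n rather than σ².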

D.7 Maximum-likelihood and maximum a posteriori estimation

Even though the CRLB can be used for derivation of MVU estimators, it is rarely
used in this way except for some simple cases. In practice, alternative approaches
that typically give good estimators are used. The two most common principles
are maximum-likelihood and maximum a posteriori estimation.
In Maximum-Likelihood Estimation (MLE), the parameters are estimated
from measurements x using the following optimization problem:

θ̂ = arg max_θ p(x|θ).   (D.125)

When the measurements are independent,

p(x|θ) = Π_{i=1}^n p(x[i]|θ).   (D.126)

Because maximizing the likelihood in (D.125) is the same as maximizing its


logarithm, we obtain

θ̂ = arg max_θ Σ_{i=1}^n log p(x[i]|θ)   (D.127)

and the parameter can be obtained by solving the following algebraic equation:

Σ_{i=1}^n ∂ log p(x[i]|θ)/∂θ = 0.   (D.128)

When θ is a vector parameter, (D.128) generalizes to a system of k algebraic equations for the k unknowns θ[1], . . . , θ[k]:

Σ_{i=1}^n ∂ log p(x[i]|θ)/∂θ[k] = 0 for all k.   (D.129)

Maximum-likelihood estimation is often used to fit a parametric distribution to experimental data. We illustrate this particular application on the example of estimating the parameters of the generalized Cauchy distribution.2

2 This estimation procedure is used in the implementation of Model-Based Steganography described in Section 7.1.2.

Example D.8: [MLE for generalized Cauchy distribution] The generalized Cauchy distribution with zero mean has the form (Section A.8)

f(x; p, s, 0) = ((p − 1)/(2s)) (1 + |x|/s)^{−p}   (D.130)

with parameters θ = (s, p). The partial derivatives are

∂ log f(x[i]|θ)/∂s = −1/s + p|x|/(s² + s|x|),   (D.131)
∂ log f(x[i]|θ)/∂p = 1/(p − 1) − log(1 + |x|/s).   (D.132)

Substituting them into (D.129), we obtain two equations for two unknowns,

Σ_{i=1}^n 1/(1 + s/|x[i]|) = n/p,   (D.133)
Σ_{i=1}^n log(1 + |x[i]|/s) = n/(p − 1),   (D.134)

which can be solved for p and s. In this particular case, we can substitute for n/p from the first equation into the second one and thus deal with only one algebraic equation for s.
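Rather than solving (D.133)–(D.134) by hand, one can minimize the negative log-likelihood of (D.130) numerically. The sketch below is illustrative only: the synthetic sampling recipe (inverse-CDF draw of |x| with a random sign) and the Nelder–Mead starting point are assumptions, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, x):
    """Negative log-likelihood of the zero-mean generalized Cauchy pdf (D.130)."""
    s, p = theta
    if s <= 0 or p <= 1:          # outside the valid parameter range
        return np.inf
    return -len(x) * np.log((p - 1) / (2 * s)) + p * np.sum(np.log1p(np.abs(x) / s))

# Synthetic sample: P(|x| > t) = (1 + t/s)^(1-p), so |x| = s (U^(-1/(p-1)) - 1).
rng = np.random.default_rng(4)
s_true, p_true = 2.0, 3.0
u = rng.uniform(size=5000)
x = s_true * (u ** (-1.0 / (p_true - 1)) - 1.0) * rng.choice([-1.0, 1.0], size=5000)

res = minimize(neg_loglik, x0=[1.0, 2.0], args=(x,), method='Nelder-Mead')
s_hat, p_hat = res.x              # estimates of (s, p)
```

With 5000 samples the estimates land close to the generating values (s, p) = (2, 3).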

When we have prior information about the distribution of the parameters, we can
utilize this knowledge to obtain a better estimate. This approach to parameter
estimation is called Maximum A Posteriori (MAP) estimation,

θ̂ = arg max_θ p(θ|x) = arg max_θ p(x|θ) p(θ),   (D.135)

where p(θ) is the a priori distribution of the parameter. Note that when p(θ)
is uniform, ML and MAP estimates coincide. The reader is referred to [125] for
more information about these two important estimation methods.

D.8 Least-square estimation

In practice, we often need to quantitatively describe a relationship between a


scalar quantity y and a vector x ∈ Rd . In steganalysis, y may stand for the
change rate and x for a stego image histogram or some other (vector) feature
derived from the stego image. In Least-Square Estimation (LSE), one selects a
parametric model for this relationship in the form

y = f (x; θ) + η, (D.136)

where θ is an unknown vector parameter, η is the modeling noise, and f maps to R. The unknown parameter will be determined from l tuples (y[i], xi), i = 1, . . . , l, usually obtained experimentally, by minimizing the scalar functional

J(θ) = Σ_{i=1}^l (f(xi; θ) − y[i])²   (D.137)

with respect to θ. The minimization leads to a set of k algebraic equations for the k unknowns θ[1], . . . , θ[k]:

∂J(θ)/∂θ[k] = 2 Σ_{i=1}^l (f(xi; θ) − y[i]) ∂f(xi; θ)/∂θ[k] = 0.   (D.138)

In general, this problem needs to be solved numerically, e.g., using the Newton–
Raphson method.
When the model is linear, a closed-form solution to the optimization can be obtained. Considering the vectors as column vectors, f(x; θ) = xᵀθ, θ ∈ Rᵈ, and (D.136) can be written in a compact matrix form as

y = Hθ + η,   (D.139)

where the ith row of the l × d matrix H is xᵢᵀ. We will further assume that H is of full rank.
Writing the functional J(θ) as

J(θ) = (y − Hθ)ᵀ(y − Hθ) = yᵀy − yᵀHθ − (Hθ)ᵀy + (Hθ)ᵀHθ   (D.140)
= yᵀy − 2yᵀHθ + θᵀHᵀHθ,   (D.141)

we can easily obtain the result of differentiating as

∂J(θ)/∂θ = −2Hᵀy + 2HᵀHθ.   (D.142)

The least-square estimate of the parameter, θ̂, is obtained by setting the gradient to zero and solving for θ,

θ̂ = (HᵀH)⁻¹Hᵀy.   (D.143)

The LSE is usually applied when the properties of the modeling noise are not known. We note that LSE becomes MLE when the modeling noise η is Gaussian N(0, σ²). This is because MLE maximizes

log p(y|θ) = −(l/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^l (y[i] − H[i, .]θ)²,   (D.144)

which is equivalent to minimizing J(θ) = (y − Hθ)ᵀ(y − Hθ), the task of LSE.
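A direct check of (D.143) on synthetic data (dimensions, coefficients, and noise level are arbitrary choices). In practice one would call numpy.linalg.lstsq, which solves the same normal equations more stably.

```python
import numpy as np

rng = np.random.default_rng(5)
l, d = 200, 3
theta_true = np.array([2.0, -1.0, 0.5])
H = rng.standard_normal((l, d))                    # rows play the role of x_i^T
y = H @ theta_true + 0.1 * rng.standard_normal(l)  # linear model plus modeling noise

theta_hat = np.linalg.solve(H.T @ H, H.T @ y)      # (D.143): (H^T H)^(-1) H^T y
```

The estimate agrees with numpy.linalg.lstsq and recovers theta_true up to the noise level.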



D.9 Wiener filter

An important tool in steganalysis is a denoising filter. The task of denoising


belongs to signal estimation (we are estimating the noise-free image). Here, we
describe in detail one of the simplest filters, the Wiener filter. It is an adaptive
linear filter that removes additive white Gaussian noise of a given (known) vari-
ance from a signal so that the filtered signal is closest to the original (non-noisy)
signal in the least-square sense. We assume we have n samples of a noisy signal
x[i]

x[i] = S[i] + ξ[i], i = 1, . . . , n, (D.145)

which is a superposition of the original noise-free signal S[i], the true scene,
and Gaussian noise ξ[i]. The following assumptions are made about the original
signal and the noise:

1. S[i] are zero-mean jointly Gaussian random variables with a known covariance
matrix C[i, j] = Cov [S[i], S[j]] = E [S[i]S[j]].
2. The noise is a sequence of Gaussian random variables with a known covariance
matrix. Here, we will assume that the covariance matrix is diagonal with
variance σ 2 on its diagonal (in other words, the sequence of random variables
ξ[i] ∼ N (0, σ 2 ) is iid). We also assume that S and ξ are independent of each
other.

Our task is to estimate S from x in the least-square sense so that the expected
value of the error
E[(S[i] − Ŝ[i])²]   (D.146)

is minimum for each i. We seek the estimate as a linear function of the (noisy)
observations

Ŝ[i] = Σ_{j=1}^n W[i, j] x[j].   (D.147)

The requirement of minimal mean-square error means that we need to find the
matrix W so that the expected value of the error for the ith sample, e[i], is
minimal,
e[i](W[i, 1], . . . , W[i, n]) = E[(S[i] − Σ_{j=1}^n W[i, j]x[j])²].   (D.148)

We do so by differentiating e[i] with respect to its arguments and solving for the local minimum,

∂e[i]/∂W[i, j] = E[−2(S[i] − Σ_{k=1}^n W[i, k]x[k]) x[j]] = 0 for all i, j.   (D.149)

Rewriting further and substituting for x[j] = S[j] + ξ[j], we obtain for all i and j

E[S[i](S[j] + ξ[j])] = E[Σ_{k=1}^n W[i, k](S[k] + ξ[k])(S[j] + ξ[j])].   (D.150)

The right-hand side can be written as

Σ_{k=1}^n W[i, k] (E[S[k]S[j]] + E[S[k]ξ[j]] + E[S[j]ξ[k]] + E[ξ[k]ξ[j]]).   (D.151)

Using the fact that ξ[j] and ξ[k] are independent for j ≠ k and the fact that S and ξ are also independent signals, we have E[S[k]ξ[j]] = 0 and E[ξ[k]ξ[j]] = δ(k − j)σ², where δ is the Kronecker delta. Thus, we can rewrite (D.150) in matrix notation as

C = W(C + σ²I)   (D.152)

and obtain the general form of the Wiener filter W as

W = C(C + σ²I)⁻¹.   (D.153)
W = C C + σ2 I . (D.153)

It should be clear that we have indeed found a minimum because the second partial derivative with respect to W[i, j] is ∂²e[i]/∂W²[i, j] = E[2x²[j]] > 0.
In the special case when the covariance matrix C is diagonal,3 we have C = diag(σ²[1], . . . , σ²[n]) and the inverse (C + σ²I)⁻¹ as well as the matrix W are also diagonal, W = diag(w[1], . . . , w[n]),

w[i] = σ²[i]/(σ²[i] + σ²).   (D.154)

The denoised signal is

Ŝ[i] = (σ²[i]/(σ²[i] + σ²)) x[i].   (D.155)

D.9.1 Practical implementation for images


Let x[i, j], i = 1, . . . , M , j = 1, . . . , N , be the pixel values of a grayscale M × N
image. The values x[i, j] are a superposition of the noise-free image S[i, j] and
an iid Gaussian component ξ[i, j] ∼ N (0, σ 2 ),

x[i, j] = S[i, j] + ξ[i, j]. (D.156)

3 S[i] is a non-stationary sequence of independent Gaussian variables.

We wish to filter out the noise and obtain an approximation Ŝ[i, j] to S. It would not be reasonable to assume that S[i, j] is a zero-mean Gaussian. We can, however, make that assumption about S[i, j] − μ̂[i, j], where


μ̂[i, j] = (1/|N[i, j]|) Σ_{(k,l)∈N[i,j]} x[k, l]   (D.157)

is the average grayscale value in the neighborhood N [i, j] of pixel (i, j) containing
|N [i, j]| pixels (e.g., we can take the 3 × 3 neighborhood). Note that

Var [x[i, j]] = Var [S[i, j]] + σ 2 , (D.158)

where Var [x[i, j]] can be approximated4 by the sample variance in the neighbor-
hood N [i, j],
Var[x[i, j]] ≈ σ̂²[i, j] = (1/|N[i, j]|) Σ_{(k,l)∈N[i,j]} x²[k, l] − μ̂²[i, j].   (D.159)

Now, consider the signal x[i, j] − μ̂[i, j] = S[i, j] − μ̂[i, j] + ξ[i, j] as the equiva-
lent of the noisy signal x[i] in the derivation of the Wiener filter, and S[i, j] −
μ̂[i, j] the equivalent of the noise-free signal S[i]. From (D.158), the variance
of the noise-free signal Var [S[i, j] − μ̂[i, j]] = Var [S[i, j]] = Var [x[i, j]] − σ 2 ≈
σ̂ 2 [i, j] − σ 2 . Thus, we obtain the Wiener filter for denoising images in the form
(D.155),

Ŝ[i, j] = μ̂[i, j] + ((σ̂²[i, j] − σ²)/σ̂²[i, j]) (x[i, j] − μ̂[i, j]).   (D.160)
We summarize that the Wiener filter is an adaptive linear filter that is opti-
mal in the sense that it minimizes the MSE between the denoised image and the
noise-free image under the assumption that the zero-meaned pixel values form
a sequence of independent zero-mean Gaussian variables (not necessarily with
equal variances – we allow non-stationary signals) and the image is corrupted
with white Gaussian noise of known variance σ 2 . This is exactly the implementa-
tion of wiener2.m in Matlab. The command wiener2(X,[N N ],σ 2 ) returns a
denoised version of X, where N is the size of the square neighborhood N . If the
parameters are not specified by the user, Matlab determines the default value of
σ 2 as the average variance in 3 × 3 neighborhoods over the whole image.
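The adaptive filter (D.160) is easy to implement directly. The NumPy sketch below mirrors what wiener2 computes; edge replication at the boundary and clamping negative variance estimates to zero are implementation choices, not prescribed by the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def wiener_denoise(x, block=3, sigma2=None):
    """Pixel-wise adaptive Wiener filter (D.160) over a block x block neighborhood."""
    pad = block // 2
    xp = np.pad(np.asarray(x, dtype=float), pad, mode='edge')
    win = sliding_window_view(xp, (block, block))   # all block x block neighborhoods
    mu = win.mean(axis=(-2, -1))                    # local mean (D.157)
    var = (win ** 2).mean(axis=(-2, -1)) - mu ** 2  # local variance (D.159)
    if sigma2 is None:                              # wiener2-style default noise estimate
        sigma2 = var.mean()
    gain = np.maximum(var - sigma2, 0.0) / np.maximum(var, 1e-12)
    return mu + gain * (x - mu)                     # (D.160)
```

For an image corrupted by N(0, σ²) noise, wiener_denoise(noisy, 3, sigma2) lowers the mean-square error relative to the noisy input.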

D.10 Vector spaces with inner product

Some results in this appendix and in Appendix A use the Cauchy–Schwartz


inequality in abstract spaces. To help the reader understand these results on a
deeper level, we introduce in this section the concept of an abstract vector space
and formulate and prove the Cauchy–Schwartz inequality.

4 We are really making a tacit assumption that the pixel values are locally stationary.

A vector space V is a set of objects that can be added and multiplied by a scalar
from some field T . The addition of elements from V is associative and commuta-
tive and there exists a zero vector (so that x + 0 = x for all x ∈ V). Every element
x ∈ V also has an inverse element, −x ∈ V, so that x + (−x) = 0. The multiplica-
tion by a scalar, α ∈ T , is distributive in the sense that for all α, β ∈ T , x, y ∈ V,
α(x + y) = αx + αy and (α + β)x = αx + βx, and associative, α(βx) = (αβ)x.
Also, for the identity element in T , 1 · x = x for all x ∈ V.
An inner product in vector space is a mapping ⟨x, y⟩ : V × V → R with the following properties.

1. ⟨x, y⟩ = ⟨y, x⟩ for all x, y ∈ V (symmetry).5
2. ⟨αx + y, z⟩ = α⟨x, z⟩ + ⟨y, z⟩ for all x, y, z ∈ V, α ∈ T (linearity).
3. ⟨x, x⟩ ≥ 0 for all x ∈ V. √⟨x, x⟩ = ‖x‖ is called the norm of x.
4. ⟨x, x⟩ = 0 if and only if x = 0 (non-degeneracy).

We now provide several examples of important vector spaces with inner product.

Example D.9: [Euclidean space] V = Rn , T = R.


⟨x, y⟩ = ⟨(x[1], . . . , x[n]), (y[1], . . . , y[n])⟩ = Σ_{i=1}^n x[i]y[i]   (D.161)

is the usual dot product in Euclidean space. In Euclidean spaces, it is customary


to denote the dot product with a dot, x · y.

Example D.10: [L2 space] V = L2 (a, b) is the space of all quadratically inte-
grable functions on [a, b] with T = R. To be absolutely precise here, we consider
two functions f and g from this space as identical if they differ on a set of
Lebesgue measure 0. Thus, formally speaking, the elements of this vector space
are classes of equivalence of functions. We endow this space with an inner product

⟨f, g⟩ = ∫_a^b f(t)g(t) dt.   (D.162)

It is easy to verify that the four defining properties of an inner product are satisfied. The fourth property is satisfied because if ∫ f²(t)dt = 0, then f(t) = 0 almost everywhere (in the sense of being zero up to a set of Lebesgue measure 0).

5 For complex inner products that map to C, the set of all complex numbers, the symmetry is replaced with conjugate symmetry ⟨x, y⟩ = ⟨y, x⟩*.

In this book, we will also encounter spaces with a slightly different inner product defined as

⟨f, g⟩ = ∫_a^b w(t)f(t)g(t) dt,   (D.163)

where w(t) > 0 is a weight function on [a, b].

Example D.11: [Vector space of random variables] Let V be the space of all real-valued random variables with zero mean and the inner product defined as

⟨x, y⟩ = Cov[x, y].   (D.164)

It is easy to verify that this is, indeed, a correctly defined inner product. The fourth property is satisfied because ⟨x, x⟩ = Var[x] ≥ 0. If the variance of a real-valued random variable is zero and its mean is zero, it means that its pdf is zero almost everywhere and thus x ≡ 0.

D.10.1 Cauchy–Schwartz inequality


The Cauchy–Schwartz inequality holds in any vector space V with inner product. For any x, y ∈ V,

|⟨x, y⟩| ≤ ‖x‖ ‖y‖,   (D.165)

where the equality occurs if and only if the vectors x and y are collinear (one is a scalar multiple of the other).
First, note that if y = 0, the inequality is satisfied. Let λ be an arbitrary number. Then,

0 ≤ ⟨x − λy, x − λy⟩ = ‖x‖² + λ²‖y‖² − 2λ⟨x, y⟩.   (D.166)

By selecting the specific value λ = ⟨x, y⟩/‖y‖², we obtain the Cauchy–Schwartz inequality

0 ≤ ‖x‖² + (⟨x, y⟩²/‖y‖⁴)‖y‖² − 2⟨x, y⟩²/‖y‖² = ‖x‖² − |⟨x, y⟩|²/‖y‖².   (D.167)

Note that equality in the Cauchy–Schwartz inequality occurs if and only if there exists λ such that x − λy = 0, or in other words when x and y are collinear (one is a multiple of the other).
Steganography in Digital Media: Principles, Algorithms, and Applications
Jessica Fridrich
Cambridge University Press. Online ISBN: 9781139192903. Hardback ISBN: 9780521190190.

E Support vector machines

In this appendix, we describe the basic principles of classification using Support


Vector Machines (SVM). SVMs recently gained popularity because they offer
performance comparable to or better than other machine-learning tools and are
relatively easy to use. This material is especially relevant to Chapter 12 on blind
steganalysis, where SVMs are used to construct blind steganalyzers. For a more
detailed tutorial on SVMs the reader is referred to [33, 54].
In the first, rather theoretic section, we explain the main principles of SVMs
on linearly separable problems. The methodology is then extended to problems
where linear separation is not possible. Finally, we describe the most general
kernelized SVMs. The second part of the appendix deals with practical imple-
mentation issues when applying SVMs to real problems.
We start by defining the binary classification problem.

E.1 Binary classification

Let X be an arbitrary non-empty input space and Y be the label set Y =


{−1, +1}. For example, in blind steganalysis the points xi ∈ X are feature vec-
tors extracted from images and the binary label stands for either cover or stego
image. In this appendix, the input space is X = Rn . Let us suppose that a set
of l training examples xi is available together with their associated labels y[i]:
(x1 , y[1]), . . . , (xl , y[l]) ∈ X × Y. We will also assume that the pairs (xi , y[i]) are
realizations of a random variable described by a joint probability measure P (x, y)
on X × Y. Our goal is to use the training examples to build a decision function
f : X → Y that assigns a label to each x ∈ X while making as few errors as pos-
sible. The function f is a binary classifier because it classifies every input x into
two classes.
The ability of f to classify will be evaluated using the risk functional

R(f) = ∫_{X×Y} u(−y f(x)) dP(x, y),   (E.1)

where u is the step function

u(z) = 1 for z > 0, and u(z) = 0 for z ≤ 0.   (E.2)

Clearly, 0 ≤ R(f ) ≤ 1, and R(f ) = 0 when f (x) correctly assigns the labels to
all x ∈ X up to a set of measure zero (with respect to P ). We also stress at
this point that unless we know the measure P (x, y), we cannot guarantee that
any estimated decision function is optimal (that it minimizes the risk functional
R). Unfortunately, in most practical applications the measure will not be known.
Note that if P (x, y) were known, we could apply the apparatus of classical detec-
tion theory and derive optimal detectors (classifiers) using the Neyman–Pearson
or Bayesian approach as explained in Appendix D.

E.2 Linear support vector machines

E.2.1 Linearly separable training set


We say that the training set is linearly separable if there exists a hyperplane
with normal vector w ∈ Rn and b ∈ R,

f (x) = sign(w · x − b), (E.3)

such that the empirical risk (error) on the training set

Remp(f) = (1/l) Σ_{i=1}^l u(−y[i] f(xi)) = 0.   (E.4)

The function classifies the point x ∈ Rⁿ according to which side of the hyperplane w · x − b = 0 the point x lies on. Because the decision function is fully described by
the separating hyperplane, we will use the terms decision function, hyperplane,
or classifier interchangeably depending on the context.
If the training set is linearly separable, there may exist infinitely many deci-
sion functions f (x) = sign(w · x − b) perfectly classifying the training set with
Remp (f ) = 0. To lower the chance of making an incorrect decision on x not con-
tained in the training set, we select the separating hyperplane with maximum
distance from positive and negative training examples. This hyperplane, which
we denote f ∗ , is uniquely defined. It can be found by solving the following opti-
mization problem:

[w∗, b∗] = arg max_{w∈Rⁿ, b∈R} min_{x,i} {‖x − xi‖ | x ∈ Rⁿ, w · x − b = 0, i ∈ {1, . . . , l}}   (E.5)
subject to

y[i] (w · xi − b) > 0, for all i ∈ {1, . . . , l}. (E.6)


Figure E.1 Example of a linearly separable training set in X = R². The separating hyperplane is defined by the support vectors. Notice that other examples do not affect the solution of the optimization problem. Using the notation from the text, w∗ · x − b∗ = 0 for all examples lying on the separating hyperplane, while for the support vectors (w∗/ε) · x◦ − b∗/ε = +1 and (w∗/ε) · x• − b∗/ε = −1, depending on which side of the hyperplane the support vector lies. The distance between the support vectors and the separating hyperplane is |w∗ · x•/‖w∗‖ − b∗/‖w∗‖| = ε/‖w∗‖.

This optimization problem is quite difficult to solve in practice. It can, however,
be reformulated as convex quadratic programming,
[w∗, b∗] = arg min_{w∈Rn, b∈R} (1/2)‖w‖²   (E.7)

subject to

y[i] (w · xi − b) ≥ 1, for all i ∈ {1, . . . , l}. (E.8)

And this problem can be solved using standard quadratic-programming
tools [27].
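The constraints (E.8) and the objective ‖w‖²/2 are easy to check numerically. The sketch below, assuming numpy is available, uses a hypothetical 2-D toy set and a hand-derived maximum-margin hyperplane (both are illustrative assumptions, not from the text):

```python
import numpy as np

# Hypothetical toy training set in R^2 (illustrative, not from the text).
X = np.array([[2., 0.], [3., 1.], [0., 0.], [0., 1.]])
y = np.array([1, 1, -1, -1])

# Hand-derived canonical maximum-margin hyperplane for this set:
# the support vectors (2,0), (0,0), (0,1) satisfy y[i](w.x_i - b) = 1 exactly.
w, b = np.array([1., 0.]), 1.0

margins = y * (X @ w - b)
assert np.all(margins >= 1.0)      # constraints (E.8) hold
width = 2 / np.linalg.norm(w)      # geometric margin 2/||w||
```

Minimizing ‖w‖²/2 over all (w, b) that pass the assert above is exactly problem (E.7); for this toy set the margin 2/‖w‖ evaluates to 2.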
We now show that problems (E.5) and (E.7) are indeed equivalent. Let us
assume that [w∗ , b∗ ] is the solution to (E.5). Denoting by (x◦ , y ◦ ) and (x• , y • )
the examples from the positive and negative classes that are closest to the hy-
perplane, we must have

min_{x∈Rn} {‖x − x◦‖ | w∗ · x − b∗ = 0} = min_{x∈Rn} {‖x − x•‖ | w∗ · x − b∗ = 0}.   (E.9)

In other words the distances from the separating hyperplane to the closest points
from each class must be equal (see Figure E.1). If they were not, we could
move the hyperplane away from the closer class and thus decrease the minimum
in (E.5), which would contradict the optimality of the solution [w∗ , b∗ ]. The
closest examples x◦ , x• cannot lie on the separating hyperplane because we have
strict inequality in (E.6). Keeping in mind that x◦ is from the positive class,
there has to exist ε > 0 so that

w∗ · x◦ − b∗ = +ε,   (E.10)

w∗ · x• − b∗ = −ε.   (E.11)
366 Appendix E. Support vector machines

Figure E.2 Example of a training set on X = R2 that cannot be linearly separated. The
separating hyperplane is again defined by support vectors. Incorrectly classified examples
are displayed as large circles with their slack variables ξ•, ξ◦ > 0.

By normalizing the last pair of equations by ε, they can be rewritten as

(w∗/ε) · x◦ − b∗/ε = +1,   (E.12)

(w∗/ε) · x• − b∗/ε = −1.   (E.13)
Because x◦, x• are the closest examples, it is obvious that the hyperplane
[w∗/ε, b∗/ε] satisfies the conditions (E.8). We now calculate the margin be-
tween the classes, which is defined as the sum of distances of the points
x◦, x• from the separating hyperplane [w∗/ε, b∗/ε]. From (E.12)–(E.13), we
obtain (w∗/ε) · (x◦ − x•) = 2, and, after normalizing by ‖w∗‖/ε, (w∗/‖w∗‖) ·
(x◦ − x•) = 2ε/‖w∗‖. Realizing that the distance from any x to the hyperplane
[w, b] is equal to (w/‖w‖) · x − b/‖w‖, we see that the margin is equal to 2ε/‖w∗‖.
Thus, maximizing the margin 2ε/‖w‖ in (E.5) is the same as minimizing ‖w‖²/2
in (E.7), which finishes the proof of the equivalence of problems (E.5) and (E.7).
We note that our choice of the maximum margin hyperplane f ∗ does not
guarantee that f ∗ is optimal in terms of minimizing the overall risk R(f ). There
might exist f such that Remp (f ) > Remp (f ∗ ) and yet R(f ) < R(f ∗ ).

E.2.2 Non-separable training set


Now, we consider the case when the training data (x1 , y[1]), . . . , (xl , y[l]) cannot
be linearly separated by a hyperplane without error. In this case, we would like to
  
find a linear classifier f that minimizes the number of errors ∑_{i=1}^{l} u(−y[i]f(xi))

on the training set. If we exclude all examples incorrectly classified by f from
the training set, (x1 , y[1]), . . . , (xl , y[l]), the training set will become linearly
separable and we can find the maximum margin classifier f ∗ , as we did in the
previous section. The classifier f ∗ has the following important properties. First,
its empirical risk on the training set is minimal. Second, it has maximum distance
from correctly classified training examples.

Figure E.3 Comparison of the step function u(z) and its convex majorant hinge loss
h(z) = max{0, 1 + z}.

The classifier f ∗ with maximum margin and minimal loss R(f ∗ ) is found by
solving the following optimization problem:

[w∗, b∗] = arg min_{w,b,ξ} (1/2)‖w‖² + C · ∑_{i=1}^{l} u(ξ[i])   (E.14)

subject to constraints
y[i] ((w · xi ) − b) ≥ 1 − ξ[i], for all i ∈ {1, . . . , l}, (E.15)
ξ[i] ≥ 0, for all i ∈ {1, . . . , l}, (E.16)
for some suitably chosen value of the penalization constant C. The “slack” vari-
ables ξ[i] in (E.14) measure the distance of incorrectly classified examples xi
from the separating hyperplane. Of course, if xi is classified correctly, ξ[i] is zero
and thus u(ξ[i]) = 0.
Unfortunately, the optimization problem (E.14) is NP-complete. The complex-
ity can be significantly reduced by replacing the step function u(z) with the hinge
loss function h(z) = max{0, 1 + z}. Because h(z) is convex and h(z) ≥ u(z) for
z ≥ 0, it transforms (E.14) to a convex quadratic-programming problem

[w∗, b∗] = arg min_{w,b,ξ} (1/2)‖w‖² + C · ∑_{i=1}^{l} ξ[i]   (E.17)

subject to constraints (E.15)–(E.16). Notice that the optimization problem (E.17)
minimizes the overall distance of incorrectly classified training examples from the
hyperplane instead of their number. Support vector machines
that classify by solving (E.17) are called soft-margin SVMs (C-SVMs).
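The objective of (E.17) is easy to evaluate directly, because for a fixed hyperplane the optimal slack ξ[i] equals the hinge loss h(−y[i]f(xi)). A sketch on hypothetical toy data (the set and hyperplane are illustrative assumptions, with one deliberately misclassified negative example):

```python
import numpy as np

# Hypothetical non-separable toy set; (1.5, 0) is a negative example
# lying on the wrong side of the hyperplane w.x - b = 0.
X = np.array([[2., 0.], [3., 1.], [0., 0.], [1.5, 0.]])
y = np.array([1., 1., -1., -1.])
w, b, C = np.array([1., 0.]), 1.0, 1.0

# Optimal slacks for fixed (w, b): xi[i] = max(0, 1 - y[i](w.x_i - b)),
# i.e., the hinge loss h(-y[i]f(x_i)) of Figure E.3.
xi = np.maximum(0.0, 1.0 - y * (X @ w - b))
objective = 0.5 * (w @ w) + C * xi.sum()   # soft-margin objective (E.17)
assert xi.tolist() == [0.0, 0.0, 0.0, 1.5]
assert objective == 2.0
```

Only the misclassified example contributes a positive slack; increasing C shifts the trade-off toward fewer margin violations at the price of a larger ‖w‖.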
Even though the optimization problem (E.17) can be easily solved, in the
context of SVMs, it is usually solved in its dual form. The reason for this will

become clear once we move to kernelized SVMs in Section E.3. The constrained
optimization problem (E.17) can be approached in a standard manner using
Lagrange multipliers. Since the constraints are in the form of inequalities, the
multipliers must be non-negative (for equality constraints, there is no limit on
the multipliers’ values). The Lagrangian is

L(w, b, ξ, α, r) = (1/2) w · w + C ∑_{i=1}^{l} ξ[i]   (E.18)

− ∑_{i=1}^{l} α[i] {y[i] ((w · xi) − b) − 1 + ξ[i]} − ∑_{i=1}^{l} r[i]ξ[i]   (E.19)

with Lagrange multipliers α = (α[1], . . . , α[l]) and r = (r[1], . . . , r[l]), α ≥ 0,
r ≥ 0. A standard result from constrained optimization says that the solution
of the problem (E.17) is in the saddle point of the Lagrangian L(w, b, ξ, α, r) –
minimum with respect to variables (w, b, ξ) and maximum with respect to the
Lagrange multipliers (α, r).
The conditions for the minimum of L(w, b, ξ, α, r) at the extremal point (labeled
again with superscript ∗) are

∂L/∂w |_{w=w∗} = w∗ − ∑_{i=1}^{l} α[i]y[i]xi = 0,   (E.20)

∂L/∂b |_{b=b∗} = ∑_{i=1}^{l} α[i]y[i] = 0,   (E.21)

∂L/∂ξ[i] |_{ξ[i]=ξ[i]∗} = C − α[i] − r[i] = 0, for all i ∈ {1, . . . , l}.   (E.22)

After substituting (E.20)–(E.22) into the Lagrangian (E.19), we obtain the
formulation of the dual problem

max_{α,r} L(α, r) = ∑_{i=1}^{l} α[i] − (1/2) ∑_{i,j=1}^{l} α[i]α[j]y[i]y[j](xi · xj)   (E.23)

subject to constraints

∑_{i=1}^{l} α[i]y[i] = 0,   (E.24)

C ≥ α[i] ≥ 0, for all i ∈ {1, . . . , l}. (E.25)

Note that the formulation of the dual problem does not contain the Lagrange
multipliers r[i].
The main advantage of solving the dual problem (E.23) over the primal prob-
lem (E.17) is that the complexity (measured by the number of free variables) of
the dual problem depends on the number of training examples, while the com-
plexity of the primal problem depends on the dimension of the input space X .

After we introduce kernelized SVMs in Section E.3, we will see that the dimen-
sion of the primal problem can be much larger (even infinite) than the number
of training examples.
Denoting again the solutions of the dual problem (E.23) with superscript ∗,
we need to recover the solution of the primal problem (E.17), which is the pair
[w∗ , b∗ ], from α∗ = (α∗ [1], . . . , α∗ [l]). From (E.20), we can easily obtain the hy-
perplane normal w∗ as


w∗ = ∑_{i=1}^{l} α∗[i]y[i]xi.   (E.26)

The computation of the threshold b∗ is more involved and here we include only
its most frequently used form without proof,
b∗ = (1/|J|) ∑_{j∈J} [(w∗ · xj) − y[j]],   (E.27)

where J = {i ∈ {1, . . . , l}|0 < α∗ [i] < C}. Equation (E.27) can be obtained from
the so-called Karush–Kuhn–Tucker conditions for the primal problem [33, 240].
Solving either the primal or dual optimization problem is commonly called
training of SVMs. Technically, any optimization library that includes a routine
for quadratic programming can be used. Most general-purpose libraries, how-
ever, are usually able to solve only small-scale problems. Therefore, we highly
recommend using algorithms developed specifically for SVMs, such as LibSVM,
http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
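As an illustration of solving the dual, the sketch below runs simple coordinate ascent on (E.23). To keep it short, the threshold b is absorbed into w by appending a constant feature, which drops the equality constraint (E.24) — that simplification is ours, not the text's, and production solvers such as LibSVM enforce (E.24) exactly. The toy data are hypothetical.

```python
import numpy as np

def svm_dual_cd(X, y, C=10.0, epochs=200):
    """Coordinate ascent on the dual (E.23). The threshold b is absorbed
    into w via an appended constant feature, which removes the equality
    constraint (E.24); this is a simplification of the sketch only."""
    Xa = np.hstack([X, np.ones((len(X), 1))])    # constant feature for b
    alpha = np.zeros(len(Xa))
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        for i in range(len(Xa)):
            g = 1.0 - y[i] * (w @ Xa[i])              # dual gradient in alpha[i]
            a_new = np.clip(alpha[i] + g / (Xa[i] @ Xa[i]), 0.0, C)
            w += (a_new - alpha[i]) * y[i] * Xa[i]    # maintain w per (E.26)
            alpha[i] = a_new
    return alpha, w[:-1], -w[-1]   # so that f(x) = sign(w.x - b) as in (E.3)

# Hypothetical linearly separable toy set (illustrative).
X = np.array([[2., 0.], [3., 1.], [0., 0.], [0., 1.]])
y = np.array([1., 1., -1., -1.])
alpha, w, b = svm_dual_cd(X, y)
assert np.all(np.sign(X @ w - b) == y)   # training set classified correctly
assert np.all(alpha >= 0.0) and np.all(alpha <= 10.0)   # box constraints (E.25)
```

On this toy set the recovered hyperplane separates all four examples; the nonzero entries of alpha mark the support vectors.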

E.3 Kernelized support vector machines

The linear SVMs described in the previous section can implement only linear
decision functions f, which is rarely sufficient for real-world problems. In this
section, we extend SVMs to non-linear decision functions. The extension is sur-
prisingly simple.
The main idea is to map the input space X , which is the space where the
observed data lives, to a different space H, using a non-linear data-driven map-
ping φ : X → H, and then find the separating hyperplane in H (for linear SVMs
described above, X = H). The non-linearity introduced through the mapping φ
allows implementation of a non-linear decision boundary in the input space X
as a linear decision boundary in H.
While the input space X is usually given by the nature of the application (it
is the space of features extracted from images), the space H can be freely chosen
as long as it satisfies the following two conditions.

1. H must be a Hilbert space (a complete space endowed with an inner product
⟨·, ·⟩H).

2. There must exist a positive-definite¹ function k : X × X → R called a ker-
nel, so that ∀x, x′ ∈ X, k(x, x′) = ⟨φ(x), φ(x′)⟩H. The kernel function k :
X × X → R can be understood as a similarity measure on the input space
X.

These conditions ensure that the maximum margin hyperplane exists in H and
that it can be found by solving the dual problem (E.23).
The space H is a function space obtained by completing the set of all functions
on X that are linear combinations

f(x) = ∑_i a[i]k(xi, x)   (E.28)

whose inner product is defined as



⟨f(x), g(x)⟩H = ∑_{i,j} a[i]b[j]k(xi, xj),   (E.29)


where g(x) = ∑_j b[j]k(xj, x). Note that the Hilbert space is driven by the data
xi . The positive definiteness of the kernel guarantees that (E.29) is an inner
product.
One of the most popular kernel functions is the Gaussian kernel

 
k(x, x′) = exp(−γ‖x − x′‖²),   (E.30)

where γ > 0 is a parameter controlling the width of the kernel and ‖x‖ is the
Euclidean norm of x. The Hilbert space H induced by the Gaussian kernel has
infinite dimension. Other popular kernels are the linear kernel k(x, x′) = x · x′,
the polynomial kernel of degree d, defined as k(x, x′) = (r + γ x · x′)^d, and the
sigmoid kernel k(x, x′) = tanh(r + γ x · x′), both with two parameters γ, r.
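The defining property of a kernel — positive definiteness of every Gram matrix — can be checked numerically. A sketch for the Gaussian kernel (E.30) on hypothetical random feature vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                    # hypothetical feature vectors
gamma = 0.5
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2
K = np.exp(-gamma * sq)                        # Gaussian kernel Gram matrix (E.30)
assert np.allclose(K, K.T)                     # symmetric
assert np.all(np.diag(K) == 1.0)               # k(x, x) = exp(0) = 1
assert np.linalg.eigvalsh(K).min() > -1e-10    # positive (semi)definite
```

The non-negative spectrum is exactly the condition ∑ c[i]c[j]k(xi, xj) ≥ 0 from the footnote, checked for this particular data sample.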
Non-linear SVMs are implemented in the same way as in Section E.2.2, with
the exception that now all operations should be carried out in the Hilbert space
H. Because the dimensionality of the feature space H can be infinite, we now
must use the dual optimization problem (E.23) because its dimensionality is
determined by the cardinality of the training set. Fortunately, because xi always
appears in the inner product, we can simply replace the inner product with
its kernel-based expression k(x, x′) = ⟨φ(x), φ(x′)⟩H. This substitution, called
the “kernel trick,” is possible in all algorithms where the calculation with data
appears exclusively in the form of inner products.

¹ k : X × X → R is positive definite if and only if ∀n ≥ 1, ∀x1, . . . , xn ∈ X, ∀c[1], . . . , c[n] ∈ R,
∑_{i,j=1}^{n} c[i]c[j]k(xi, xj) ≥ 0.

The dual optimization problem (E.23) can thus be rewritten as follows:


max_{α,r} L(α, r) = ∑_{i=1}^{l} α[i] − (1/2) ∑_{i,j=1}^{l} α[i]α[j]y[i]y[j] ⟨φ(xi), φ(xj)⟩H   (E.31)

= ∑_{i=1}^{l} α[i] − (1/2) ∑_{i,j=1}^{l} α[i]α[j]y[i]y[j] k(xi, xj),   (E.32)

with constraints


∑_{i=1}^{l} α[i]y[i] = 0,   (E.33)

C ≥ α[i] ≥ 0, ∀i ∈ {1, . . . , l}. (E.34)

As the constraints do not contain any vector manipulation, they stay the same.

In the equation w∗ = ∑_{i=1}^{l} α∗[i]y[i]φ(xi) of the optimal hyperplane w∗, all
manipulations are carried out in H and we cannot convert it to the input space X.
Fortunately, we do not need to know w∗ explicitly because the decision func-
tion (E.3) can be rewritten as

f(x) = sign (⟨w∗, φ(x)⟩H − b∗)   (E.35)

= sign (∑_{j=1}^{l} α∗[j]y[j] ⟨φ(xj), φ(x)⟩H − b∗)   (E.36)

= sign (∑_{j=1}^{l} α∗[j]y[j]k(xj, x) − b∗).   (E.37)

By the same mechanism, equation (E.27) for the threshold b∗ becomes

b∗ = (1/|J|) ∑_{j∈J} [(w∗ · φ(xj)) − y[j]] = (1/|J|) ∑_{j∈J} [∑_{i=1}^{l} α∗[i]y[i]k(xi, xj) − y[j]],   (E.38)

with J = {i ∈ {1, . . . , l}|0 < α∗ [i] < C} as before.


Soft-margin SVMs were proved to converge to an optimal classifier minimizing
the risk functional R(f ) as the number of training examples increases [223].
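The kernel trick can be demonstrated on the XOR configuration, which no linear classifier in the input space separates. The sketch below reuses simple coordinate ascent, now on the kernelized dual (E.32), and — as a simplification of ours — omits the threshold b∗, so the decision rule is (E.37) with b∗ = 0:

```python
import numpy as np

def kernel_svm_cd(K, y, C=10.0, epochs=300):
    """Coordinate ascent on the kernelized dual (E.32). The threshold b*
    is omitted for brevity (a simplification of this sketch)."""
    alpha = np.zeros(len(y))
    f = np.zeros(len(y))          # f[i] = sum_j alpha[j] y[j] K[i, j]
    for _ in range(epochs):
        for i in range(len(y)):
            a_new = np.clip(alpha[i] + (1.0 - y[i] * f[i]) / K[i, i], 0.0, C)
            f += (a_new - alpha[i]) * y[i] * K[:, i]
            alpha[i] = a_new
    return alpha

# XOR configuration: not linearly separable in X = R^2, but separable
# in the Hilbert space H induced by the Gaussian kernel (E.30).
X = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])
y = np.array([1., 1., -1., -1.])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-1.0 * sq)                      # Gaussian kernel, gamma = 1
alpha = kernel_svm_cd(K, y)
decision = K @ (alpha * y)                 # sum_j alpha[j] y[j] k(x_j, x_i)
assert np.all(np.sign(decision) == y)      # XOR separated via the kernel
```

Only the Gram matrix K enters the computation — the mapping φ into the infinite-dimensional H is never evaluated, which is exactly the point of the kernel trick.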

E.4 Weighted support vector machines

In steganalysis, it is important to control the false-positive rate of the stegan-


alyzer. To allow SVMs to adjust the false alarms, weighted SVMs use different
penalization coefficients for false positives and missed detections.

Denoting by I−, I+ the sets of indices of negative and positive examples with y[i] = −1
and y[i] = 1, respectively, the primal problem of weighted SVMs takes the form
min_{w,b,ξ} (1/2)‖w‖²H + C− ∑_{i∈I−} ξ[i] + C+ ∑_{i∈I+} ξ[i]   (E.39)

subject to constraints

y[i] (⟨w, xi⟩H − b) ≥ 1 − ξ[i], for all i ∈ {1, . . . , l},   (E.40)


ξ[i] ≥ 0, for all i ∈ {1, . . . , l}. (E.41)

We denoted by ‖ · ‖H the norm in the space H. If we compare the original primal


problem with equal costs of both detection errors (E.17) with the new formula-
tion (E.39), we can see how differently the costs are expressed. By adjusting the
penalization costs C + and C − , we can now put more importance on one or the
other error type.
Following the same steps as in Section E.2.2, the dual form of (E.39) can be
derived:

max_{α,r} L(α, r) = ∑_{i=1}^{l} α[i] − (1/2) ∑_{i,j=1}^{l} α[i]α[j]y[i]y[j]k(xi, xj),   (E.42)

subject to constraints


∑_{i=1}^{l} α[i]y[i] = 0,   (E.43)

C+ ≥ α[i] ≥ 0, for all i ∈ I+,   (E.44)

C− ≥ α[i] ≥ 0, for all i ∈ I−.   (E.45)

The dual problem of weighted SVMs (E.42) is again a convex quadratic-


programming problem, which is almost identical to (E.31), with the only ex-
ception that now the Lagrange multipliers α are bounded by different constants
depending on what training example they correspond to (e.g., whether the train-
ing example is a cover or a stego image).
The equation for b∗ becomes

b∗ = (1/(|J−| + |J+|)) ∑_{j∈J−∪J+} [∑_{i=1}^{l} α∗[i]y[i]k(xi, xj) − y[j]],   (E.46)

where J − = {i ∈ I − |0 < α∗ [i] < C − } and J + = {i ∈ I + |0 < α∗ [i] < C + }.


The decision function of a weighted SVM remains unchanged,
f(x) = sign (∑_{j=1}^{l} α∗[j]y[j]k(xj, x) − b∗).   (E.47)
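In a dual solver, the only change required by (E.44)–(E.45) is a per-example upper bound on α[i]. A sketch with hypothetical labels and costs (the concrete values are illustrative assumptions):

```python
import numpy as np

# Hypothetical labels (+1 = stego, -1 = cover) and asymmetric costs:
# a larger C- penalizes false alarms (misclassified covers) more heavily.
y = np.array([1., 1., -1., -1.])
C_plus, C_minus = 1.0, 8.0
ub = np.where(y > 0, C_plus, C_minus)     # box constraints (E.44)-(E.45)
assert ub.tolist() == [1.0, 1.0, 8.0, 8.0]

# Inside the coordinate update of a dual solver, alpha[i] is then clipped
# to [0, ub[i]] instead of the single interval [0, C] of (E.25).
alpha = np.array([0.5, 2.0, 3.0, 12.0])   # illustrative unclipped values
assert np.clip(alpha, 0.0, ub).tolist() == [0.5, 1.0, 3.0, 8.0]
```

Everything else — the kernel, the objective (E.42), and the decision rule (E.47) — stays unchanged.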

E.5 Implementation of support vector machines

The kernel type and its parameters as well as the penalization parameter(s)
have a great impact on the accuracy of the classifier. Unfortunately, there is
no universal methodology regarding how to best select them. We provide some
guidelines that often give good results.

E.5.1 Scaling
First, the input data needs to be preprocessed. Assuming X = Rn , which is the
case for steganalysis, the input data is scaled so that all elements of vectors xi
from the training set are in the range [−1, +1]. This scaling is very important as
it ensures that features with large numeric values do not dominate features with
small values. It also increases the numerical stability of the learning algorithm.
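A minimal sketch of such scaling; the affine map is learned on the training set only, and the helper names are ours:

```python
import numpy as np

def fit_scaling(X_train):
    """Learn per-feature affine maps sending the training range to [-1, +1]."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    return lo, span

def apply_scaling(X, lo, span):
    return 2.0 * (X - lo) / span - 1.0

# Hypothetical raw features with very different numeric ranges.
X_train = np.array([[0., 100.], [2., 300.], [1., 200.]])
lo, span = fit_scaling(X_train)
S = apply_scaling(X_train, lo, span)
assert S.min() == -1.0 and S.max() == 1.0
```

The same lo and span must be applied to the test examples; scaled test features may then fall slightly outside [−1, +1], which is harmless.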

E.5.2 Kernel selection


The next step is the selection of a proper kernel. Unless we have some side-
information about the problem we are facing, the Gaussian kernel k(x, x′) =
exp(−γ‖x − x′‖²) is typically a good first choice. This kernel is flexible enough
to solve many problems, yet it has only one free parameter, in comparison with
polynomial or sigmoid kernels, which depend on more free parameters. Moreover,
for all values of γ > 0, the Gaussian kernel is positive definite.

E.5.3 Determining parameters


Before training, we need to determine the kernel parameters (the width γ if we
use the Gaussian kernel) and the penalization parameter(s) (C + , C − ). A common
way to find (C + , C − , γ) is to carry out an exhaustive search on predefined points
from a grid G. At each point (C + , C − , γ) ∈ G, we train the SVM and estimate
its performance on unknown data. The estimated probabilities of false positives
and missed detection of an SVM trained with parameters (C + , C − , γ) will be
denoted as P̂FA (C + , C − , γ) and P̂MD (C + , C − , γ), respectively. The parameters
(C + , C − , γ) are then selected as a point from the grid G using a Bayesian or
Neyman–Pearson approach (see below).
The requirement of estimating the performance on unknown data is very im-
portant. It is easy to find (C + , C − , γ) so that the error on the training set is
zero, but this classifier will most likely exhibit a high error rate on the unknown
data (this problem is called overtraining or overfitting).
A popular way of estimating P̂FA (C + , C − , γ) and P̂MD (C + , C − , γ) on unknown
data is k-fold cross-validation. The available training examples are divided into k
subsets of approximately equal size. Then, the union of k − 1 subsets is used as a
training set to train the SVM, while the remaining subset is used to estimate the

error on “unknown” examples. This is repeated k times, each time with different
subsets. k-fold cross-validation usually gives estimates of error close to the error
we can expect on truly unknown data.
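The splitting step of k-fold cross-validation can be sketched as follows (the helper name is ours):

```python
import numpy as np

def kfold_splits(l, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(l)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_splits(10, 5))
assert len(splits) == 5
for train, val in splits:
    assert set(train).isdisjoint(val)                      # no leakage
    assert sorted(set(train) | set(val)) == list(range(10))
```

For each of the k pairs, an SVM is trained on train and its errors counted on val; averaging the k error rates gives the estimates P̂FA and P̂MD used in the grid search.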
When the costs of both errors are known and equal to w+ and w− for an error
on positive and negative classes, we can use weighted SVMs with the Bayesian
approach and minimize the total cost

w− p− PFA + w+ p+ PMD , (E.48)

where p− and p+ are a priori probabilities of the negative and positive classes.
In this case, the search for the parameters can be described as

(C + , C − , γ) = arg min w− p− P̂FA (C + , C − , γ) + w+ p+ P̂MD (C + , C − , γ), (E.49)

where the arg min is taken over all (C + , C − , γ) ∈ G,


G = {(w−p−2^i, w+p+2^i, 2^k) | i ∈ {−3, . . . , 9}, k ∈ {−5, . . . , 3}}.   (E.50)

Note that even though the grid G has 3 dimensions, its effective dimension is 2
because the ratio C + /C − = w+ p+ /(w− p− ) must stay constant.
In a Neyman–Pearson setting, we impose an upper bound on the probability
of false alarms, PFA ≤ εFA < 1, and minimize the probability of missed detection
PMD . The search for (C + , C − , γ) becomes

(C + , C − , γ) = arg min P̂MD (C + , C − , γ), (E.51)

where the arg min is carried over all (C + , C − , γ) ∈ G satisfying P̂FA (C + , C − , γ) ≤


εFA,

G = {(2^i, 2^j, 2^k) | i, j ∈ {−3, . . . , 9}, k ∈ {−5, . . . , 3}}.   (E.52)

In this case, the grid G has the effective dimension 3, which makes the search
computationally expensive. Frequently, suboptimal search [44] is used to alleviate
the computational complexity.
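The Neyman–Pearson grid search (E.51)–(E.52) can be sketched as follows; the error estimator here is a hypothetical stand-in for the cross-validated estimates P̂FA and P̂MD of a trained SVM:

```python
import numpy as np

# Grid G from (E.52).
G = [(2.0**i, 2.0**j, 2.0**k)
     for i in range(-3, 10) for j in range(-3, 10) for k in range(-5, 4)]

def mock_errors(C_plus, C_minus, gamma):
    """Hypothetical stand-in for cross-validated (P_FA, P_MD) estimates:
    larger C- lowers false alarms, larger C+ lowers missed detections."""
    p_fa = 0.5 / (1.0 + C_minus)
    p_md = 0.5 / (1.0 + C_plus) + 0.01 * abs(np.log2(gamma))
    return p_fa, p_md

eps_fa = 0.05
feasible = [p for p in G if mock_errors(*p)[0] <= eps_fa]   # P_FA bound
best = min(feasible, key=lambda p: mock_errors(*p)[1])      # minimize P_MD (E.51)
assert mock_errors(*best)[0] <= eps_fa
```

With a real estimator, each call to mock_errors would be replaced by training an SVM at that grid point and cross-validating it, which is why suboptimal search strategies are attractive.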

E.5.4 Final training


After selecting the kernel and determining its parameters including the penal-
ization parameter C, we use the whole training set to train the final SVM. This
process, which involves solving a quadratic-programming problem, will deter-
mine the vector (α∗ [1], . . . , α∗ [l]) and b∗ . The decision function of the final SVM
is
f(x) = sign (∑_{j=1}^{l} α∗[j]y[j]k(xj, x) − b∗).   (E.53)

E.5.5 Evaluating classification performance


The probability of detection as a function of the probability of false alarms,
PD (PFA ), is called the receiver operating characteristic (see Section D.1.1) and
it is often used to visualize the performance of detectors, including those imple-
mented as SVM classifiers. Since the SVM depends on several parameters, we
essentially have a family of parametrized classifiers. The usual method to draw
an ROC curve for a specific SVM is to train with fixed values of the penalization
parameters, C − , C + , and the kernel width γ, and change the threshold b∗ . The
ROC curve is then obtained in a parametric form
PFA = PFA (b∗ ), (E.54)
PD = PD (b∗ ), b∗ ∈ R. (E.55)
It is clear that by changing the threshold b∗ , we can achieve all values of
PFA ∈ [0, 1]. While this method to draw the ROC curve is simple and fast, it
does not really reflect the performance of the SVM. For example, if the penaliza-
tion parameters and the kernel width were determined under a Neyman–Pearson
setting for a given bound on false alarms, εFA, the ROC obtained this way is
suboptimal in the sense that the probability of detection PD(ε′FA), for ε′FA ≠ εFA,
may be lower than what we would obtain if we trained for ε′FA. We could alterna-
tively draw the ROC by varying the penalization parameters as well as the kernel
parameter. However, such an approach is often computationally very expensive
and thus almost never used in practice. An interesting approach with reduced
computational cost for fixed kernel while varying the penalization parameters
C + and C − was proposed in [13].
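Tracing the ROC by sweeping the threshold b∗ as in (E.54)–(E.55) can be sketched as follows; the decision values below are hypothetical stand-ins for ∑_j α∗[j]y[j]k(xj, x) evaluated on held-out cover and stego images:

```python
import numpy as np

def roc_points(s_cover, s_stego):
    """ROC by sweeping the threshold b: classify as 'stego' when the
    decision value exceeds b (cf. (E.54)-(E.55))."""
    thr = np.sort(np.unique(np.concatenate([s_cover, s_stego])))
    pts = [(1.0, 1.0)]                   # b below all values: everything flagged
    for b in thr:
        p_fa = np.mean(s_cover > b)      # false alarms among covers
        p_d = np.mean(s_stego > b)       # detections among stego images
        pts.append((p_fa, p_d))
    return pts

s_cover = np.array([-2.0, -1.0, -0.5, 0.3])   # hypothetical decision values
s_stego = np.array([-0.2, 0.4, 1.0, 2.0])
pts = roc_points(s_cover, s_stego)
assert pts[0] == (1.0, 1.0) and pts[-1] == (0.0, 0.0)
```

Each choice of b yields one (PFA, PD) point, which is exactly why the curve obtained this way reflects a single trained classifier rather than the family of classifiers retrained per operating point.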
Notation and symbols

The mathematical notation in this book was chosen to be compact and visually
distinct to make it easier for the reader to interpret formulas. For this reason, we
adhere to short names of variables, sets, and functions rather than long descrip-
tive names similar to variable naming in programming as the latter would lead
to long and rather unsightly mathematical formulas. The price paid for the short
variable names is an occasional reuse of some symbols across the text. However,
the author strongly believes that the meaning of the symbols should always be
clear and unambiguous from the context. The most frequently occurring key
symbols, such as the relative message length α, change rate β, cover object x, or
stego object y, are used exclusively and never reused. Whenever possible, vari-
able names are chosen mnemotechnically by the first letter of the concept they
stand for, such as h for histogram, A for alphabet, etc. In some cases, however,
if there exists a widely accepted notation, rather than coining a new symbol,
we accept the established notation. A good example is the parity-check matrix,
which is almost exclusively denoted as H in the coding literature.
Everywhere in this book, vectors and matrices of predefined dimensions are
printed in boldface with indices following the symbol in square brackets. Thus,
the ijth element of the parity-check matrix H is H[i, j] and the jth element
of the histogram is h[j]. This notation generalizes to higher-dimensional arrays.
The transpose of a matrix is denoted with a prime: H is the transpose of H.
Sequences of vectors or matrices will be indexed with a subscript. Thus, fk will be
a sequence of vectors f . If it is meaningful and useful, we may sometimes arrange
a sequence of vectors into a matrix and address individual vector elements using
two indices, e.g., f [i, k]. Often, quantities may depend on parameters, such as
the relative payload α. The histogram hα [j] stands for the histogram of the
stego image embedded with relative payload α. If the vector or matrix quantities
depend on other parameters or indices, we may put them as superscripts. An
example would be the histogram of the DCT mode (k, l) from a JPEG stego
(k,l)
image, hα [j]. Strings of symbols will also be typeset as vectors. The length of
string s is |s|.
Random variables are typeset in sans serif font x, y. A vector/matrix random
variable will then be boldface sans serif, x, y. When a random variable x follows
probability distribution f , we will write x ∼ f . By a common abuse of notation,
we will sometimes write x ∼ f , meaning that x, whose realization is vector x,

follows the distribution f . When the probability distribution is clear from the
context or unspecified, we will write Pr{·}, where the dot stands for an expression
involving a random variable. For example, Pr{x ≤ x} stands for the probability
that random variable x attains the value x or smaller. The expected value, vari-
ance, and covariance will be denoted as E[x], Var[x], and Cov[x, y] when it is clear
from the context with respect to what probability distribution we are taking the
expected value or variance. Sometimes, we may stress the probability distribu-
tion by writing Ex∼f [y(x)] or simply Ex [y(x)], which is the expected value of y
when x follows the distribution f of variable x. The sample mean will be denoted
with a bar, x̄.
Calligraphic font will be used for sets or regions, X , Y, with |X | denoting the
cardinality of X .
Estimated quantities will be denoted with a hat, meaning that θ̂ is an estimator
of θ. A tilde is often used for quantities subjected to distortion, e.g., ỹ is a noisy
version of y.
In this book, we use Landau’s big O symbol f = O(g) meaning that |f (x)| <
Ag(x) for all x and a constant A > 0. The asymptotic equality f (x) ≈ g(x) holds
for two functions if and only if limx→∞ f (x)/g(x) = 1. We will also use the symbol
≈ with scalars, in which case it stands for approximate equality (e.g., α ≈ 1 or
k ≈ m).
Among other, less common symbols, we point out the eXclusive OR operation
on bits, which we denote with ⊕. Concatenation of strings is denoted as &. We
reserve ∗ for convolution, and ⌊x⌋, ⌈x⌉ for rounding down and up, respectively.
The function round(x) is rounding to the closest integer, while trunc(x) is the
operation of truncation to a finite dynamic range (e.g., to bring the real values
of pixels back to the 0, . . . , 255 range of integers).
Below, we provide definitions for the most common symbols used throughout
the text.

α relative payload

A(ỹ|y) probability of receiving ỹ when sending y through a


noisy channel
A[., i] ith column of matrix A

A[j, .] jth row of matrix A

b(θ̂) bias of estimator θ̂

B bin

B(x, R) ball of radius R centered at x



β change rate

B(p) Bernoulli random variable when probability of 0 is p

Bi(n, p) binomial random variable with n trials of probability p

Bγ measure of blockiness with γ = 1, 2

χ2d chi-square distribution with d degrees of freedom

c[j] = (r[j], g[j], b[j]) an RGB color

C code

C(s) coset corresponding to syndrome s

C(xinv ) set of all covers with invariant component xinv

Ci , Ei , Oi trace sets used in structural steganalysis

Csteg steganographic capacity

C co-occurrence matrix, covariance matrix

δ(x) Kronecker delta,



δ(x) = 1 when x = 0, and δ(x) = 0 when x ≠ 0.
D matrix of quantized DCT coefficients, also a matrix used
in wet paper codes
d matrix of unquantized DCT coefficients

DKL (P ||Q) Kullback–Leibler (KL) divergence between probability


distributions P and Q
dH (x, y) Hamming distance between two words

dγ (x, y) distortion measure



dγ(x, y) = ∑_{i=1}^{n} |x[i] − y[i]|^γ

dρ (x, y) measure of embedding impact with profile ρ[i]



dρ(x, y) = ∑_{i=1}^{n} ρ[i] (1 − δ(x[i] − y[i]))

dRGB (c, c ) distance between two RGB colors

d2 deflection coefficient

DCT discrete cosine transform

diag(x) diagonal matrix with diagonal x

e, e embedding efficiency, lower embedding efficiency

e(s) coset leader of coset corresponding to syndrome s

e[i] quantization error at element i, noise residual

Erf(x) error function

f feature vector

F2 finite field containing two elements

Fq finite field containing q elements

Φ(x) cumulative distribution function of standard normal


variable, also a mapping used in targeted steganalysis
G generator matrix

g(r) dual histogram for value r

Γ(x) gamma function

γ gamma factor in gamma correction, parameter in dγ ,


generic threshold
h(x) hash (message digest) function

h histogram

h(k,l) histogram of the (k, l)th DCT mode (in a JPEG file)

H(x) entropy of random variable x

Hmin (x) minimal entropy of random variable x

H(x|y) conditional entropy of x given y

H(x), H −1 (x) binary entropy function, its inverse

Hq (x), Hq−1 (x) q-ary entropy function, its inverse

H parity-check matrix

H0 , H1 null and alternative hypothesis

Hp binary Hamming code with codimension p

Ik k × k identity matrix

I(β) Fisher information w.r.t. β

I(x; y) mutual information between random variables x and y

IDCT inverse DCT transform

J1 , J2 JPEG images

k∈K secret stego key from the set of all stego keys

k(x, y) kernel (used in support vector machines)

Λ lattice

log logarithm to the base e = 2.718281828045 . . .

L(x) likelihood ratio for measurements x

L2 (a, b) vector space of all quadratically integrable functions on


interval [a, b]
μ, μ[k], μc [k] mean, kth moment, kth central moment

m∈M message from the set of all messages



Mx moment-generating function of random variable x

MSE(θ̂) mean-square error of estimator θ̂

N (μ, σ2 ) Gaussian random variable with mean μ and variance σ 2

n01 , nAC , nc number of DCT coefficients different from 0 and 1, num-


ber of AC DCT coefficients, number of bits to represent
a color
Oh,b cover oracle drawing the next b cover bits based on the
history of h cover bits
π(x) symbol-assignment function

P set of pixel pairs

Ψ isomorphism

p[i] probability mass function

pv [i] p-value (in histogram attack)

Pc , pc , Ps , ps distributions of cover and stego images

Path random path through image

PE minimum average probability of error (PMD + PFA )/2

PD , PMD , PFA probability of detection, missed detection, and false


alarm
PD−1(1/2) false-alarm probability at 50% detection

Q(x) the complementary cumulative distribution function of


a standard normal variable
Q[i, j] quantization matrix

QΛ quantizer to lattice Λ

Q(y, u|x) covert channel (matrix of conditional probabilities)

qf quality factor

q size of a q-ary alphabet



ρ[i] embedding impact at pixel i

ρ(x, y) correlation coefficient for random variables x and y

rH , rV , rD horizontal, vertical, and diagonal noise residuals

R covering radius

Ra average distance to code

R(f ) risk functional

R0 region of acceptance of null hypothesis

R1 critical region (acceptance of alternative hypothesis)

s syndrome

S set of changeable (dry) pixels in writing on wet paper

Sf (x) sparsity measure defined by the feature set f

θ unknown parameter to be estimated

ϑ(x, y) number of embedding changes

t(B) texture measure of block B

σ2 variance

V variation

Vq (n, r) volume of a ball with radius r in an n-dimensional space

U (a, b) uniform distribution on the interval [a, b]

V Voronoi region, also an abstract vector space

W noise residual

W Wiener filter

w(x) Hamming weight of x ∈ {0, 1}n

wθ weighted stego image

x∈C cover image from the set of all possible covers C

xinv , xemb random variables describing the cover component that


is invariant with respect to embedding and the one used
for embedding (in model-based steganography)
X , Y, Z, V, W primary sets used in Sample Pairs Analysis

y stego image

Y, Cr , Cb luminance and two chrominance signals used in JPEG

Z(c1 , c2 ) color cut for colors c1 , c2 (in Pairs Analysis)

Z, R the set of all integers, real numbers

ZM finite cyclic group of order M

{0, 1}∗ bit strings of arbitrary length

[a, b], (a, b) closed, open interval

Operations and relations

The following operations are used throughout the text:

→ mapping

→^D convergence in distribution

→^P convergence in probability

≜ symbol used in definitions

∼ x ∼ f , random variable x follows distribution f

≈ approximate equality for scalars, asymptotic equality for


functions

& concatenation of strings

⊕ XOR (eXclusive OR)

|A| cardinality of set A

|s| length of string s

⟨x, y⟩ inner (dot) product between vectors in abstract vector
spaces

‖x‖ = √⟨x, x⟩ norm of vector x

x·y dot product between vectors in Euclidean space

⌈x⌉ rounding up to the closest integer larger than or equal to x

⌊x⌋ rounding down to the closest integer smaller than or
equal to x
Emb, Ext embedding and extraction algorithms (mappings)

PEmb, PExt probabilistic embedding and extraction algorithms (mappings)
LSBflip(x) operation that flips the LSB of integer x

LSB(x) the least significant bit of integer x

LSBF5 (x) the least significant bit of integer x as redefined in the F5 algorithm
round(x) rounding x to the closest integer

sign(x) sign of x

trunc(x) truncating x to a finite dynamic range. For an 8-bit range,

trunc(x) = 255 for x > 255, trunc(x) = x for x ∈ [0, 255], trunc(x) = 0 for x < 0
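The three operations above are easily expressed in code. The following Python sketch (illustrative only; the function names mirror the notation, not an implementation from the book) shows trunc for an 8-bit range together with LSB(x) and LSBflip(x):

```python
def trunc(x, lo=0, hi=255):
    """Truncate x to a finite dynamic range (8-bit by default)."""
    return hi if x > hi else lo if x < lo else x

def lsb(x):
    """Least significant bit of integer x."""
    return x & 1

def lsb_flip(x):
    """Flip the LSB of integer x."""
    return x ^ 1

print(trunc(300), trunc(-5), trunc(128))  # 255 0 128
print(lsb(6), lsb_flip(6))                # 0 7
```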
Cambridge Books Online (http://ebooks.cambridge.org/)

Steganography in Digital Media: Principles, Algorithms, and Applications
Jessica Fridrich
Online ISBN: 9781139192903; Hardback ISBN: 9780521190190
Cambridge University Press
Glossary, pp. 387–408
Glossary

±1 embedding. A steganographic method in which message bits are embedded


by randomly modifying cover elements by at most 1 to match the LSB with
the message bit (also called LSB matching).
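For one cover element, ±1 embedding can be sketched as follows (an illustrative Python sketch, not code from the book; the boundary handling at 0 and 255 is one common convention):

```python
import random

def pm1_embed(pixel, bit, rng=random.Random(42)):
    """Embed one message bit by LSB matching (+-1 embedding).
    If the pixel's LSB already equals the bit, leave it unchanged;
    otherwise randomly add or subtract 1, staying within [0, 255]."""
    if pixel & 1 == bit:
        return pixel
    if pixel == 0:
        return 1
    if pixel == 255:
        return 254
    return pixel + rng.choice((-1, 1))

stego = [pm1_embed(p, b) for p, b in zip([12, 13, 0, 255], [1, 1, 1, 0])]
print([p & 1 for p in stego])  # recovered message bits: [1, 1, 1, 0]
```

Unlike plain LSB replacement, the recipient still reads the message from the LSBs, but the embedding change is ±1 rather than LSB flipping, which avoids the pairs-of-values artifacts exploited by LSB steganalysis.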
–F5. A variation of the steganographic algorithm F5 in which the embedding
operation of decreasing the absolute value of the DCT coefficient is replaced
with increasing its absolute value.
ABG. Anti-Blooming Gate built into an imaging sensor to bleed off overflow
from saturated pixels.
AC coefficient. DCT coefficient other than the DC term (from alternating
current).
Achievable rate. Relative payload that can be undetectably embedded in a
cover source using a given steganographic method.
Acrostics. Linguistic steganography.
A/D. Analog/Digital.
Adaptive
– selection channel. A selection channel determined by the cover content.
– steganography. Steganography in which the data-hiding process is depen-
dent on the cover content.
Additive color model. Model of color in which color is created by adding
several base colors, such as red, green, and blue.
Adjacency histogram. Histogram of adjacent cover elements.
Advantage. Success rate of a warden (judge) in detecting steganographic con-
tent.
AES. Advanced Encryption Standard. A symmetric cryptosystem.
ALE. Amplitude of Local Extrema. A steganalytic method for detection of ±1
embedding.

Alice. A fictitious character from the prisoners’ problem who secretly commu-
nicates with Bob.
Alphabet. A set of symbols used to construct codes.
Amplifier noise. Image noise component due to signal amplification.
APS. Active Pixel Sensor.
Arithmetic coder. A lossless compression algorithm.
Attack. An algorithm whose goal is to detect the usage of steganography.
AUC/AUR. Two acronyms for Area Under ROC Curve.
Average distance to code. The average Hamming distance from a randomly
(uniformly) selected word to a code.
Ball (of radius r and center x). A set of points at distance at most r from x.
Basis pattern. One of the 64 patterns forming the discrete cosine transform
(DCT).
Batch steganography. Steganographic communication by splitting payload
into multiple cover objects.
Bayer pattern. A popular arrangement of red, green, and blue color filters
allowing a sensor to register color.
BCH. A class of parametrized error-correcting codes invented by Bose, Ray-
Chaudhuri, and Hocquenghem.
Benchmarking. A procedure aimed to compare and evaluate systems.
Bernoulli random variable. A random variable B(p) reaching values in {0, 1}
with probabilities p and 1 − p.
Between-image error. A component of the estimation error of quantitative
steganalyzers specific to each image.
Bias. A systematic estimator error, b(θ) = E[θ̂] − θ, where θ is a parameter that
is being estimated.
Binary
– arithmetic. Arithmetic in finite field {0, 1}.
– classification. A classification to two classes.
– entropy function. Entropy of a Bernoulli random variable B(x), H(x) =
−x log x − (1 − x) log(1 − x).
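A direct Python sketch of the binary entropy function (base-2 logarithm, with the usual convention 0 log 0 = 0):

```python
from math import log2

def binary_entropy(x):
    """H(x) = -x log2(x) - (1 - x) log2(1 - x), entropy of Bernoulli B(x)."""
    if x in (0.0, 1.0):
        return 0.0  # convention: 0 log 0 = 0
    return -x * log2(x) - (1 - x) * log2(1 - x)

print(binary_entropy(0.5))   # 1.0, maximal uncertainty
print(binary_entropy(0.11))  # roughly 0.5
```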
Blind steganalysis. An approach to steganalysis whose aim is to detect an
arbitrary steganographic method.
Blockiness. Sum of discontinuities at boundaries of 8 × 8 blocks in a decom-
pressed JPEG file.

Blooming. An imaging artifact caused by charge overflow from saturated pixels.


BMP. Bitmap raster image format.
Bob. A fictitious character from the prisoners’ problem who secretly communi-
cates with Alice.
Borel set. A subset of real numbers that can be written as a countable union
or intersection of intervals.
bpnc. Unit of relative payload for JPEG images (bits per non-zero DCT coeffi-
cient).
bpp. Unit of relative payload for spatial formats (bits per pixel).
C-SVM. Soft-margin SVM.
Cachin’s definition of steganographic security. An information-theoretic
definition of steganographic security.
Calibration. Process of estimating the cover image from the stego image.
Cardan’s grille. A linguistic-steganography method in which a secret message
is extracted by applying a mask over text.
Cauchy distribution. A thick-tail probability distribution used to model the
distribution of DCT coefficients in a JPEG file.
Cauchy–Schwarz inequality. For any two vectors x, y ∈ V from an abstract vector space, |⟨x, y⟩| ≤ ‖x‖ ‖y‖.
CCD. Charge-Coupled Device (type of imaging sensor ).
cdf. Cumulative distribution function.
Central moment. Statistical moment of a zero-meaned random variable.
CFA. Color Filter Array. An array of filters bonded to pixels of an imaging
sensor used to acquire color.
Change rate. The relative number of embedding changes with respect to the
number of all possible changes a steganographic method can introduce. Usually
denoted β.
Changeable pixels/elements. Cover elements that are allowed to be modified
during steganographic embedding.
Channel distortion. Signal distortion in a communication channel.
Charge transfer. Process of moving charge from a CCD sensor.
Charge-transfer efficiency. Percentage of signal preserved during charge
transfer.

Chebyshev sum-inequality. For any non-decreasing sequence a[i], non-increasing sequence b[i], and non-negative sequence p[i],

(Σ_{i=0}^{N−1} p[i]) (Σ_{i=0}^{N−1} p[i]a[i]b[i]) ≤ (Σ_{i=0}^{N−1} p[i]a[i]) (Σ_{i=0}^{N−1} p[i]b[i]).

Chernoff–Stein lemma. A statement that binds KL divergence between dis-


tributions and the probability of missed detection in Neyman–Pearson hy-
pothesis testing.
Chi-square
– distribution. Probability distribution of a sum of squares of iid zero-mean
Gaussian variables.
– test. A statistical test used to determine whether or not discrete data
follows a known distribution.
Chrominance. Component of signal communicating color.
Ciphertext. Stream of symbols obtained by encrypting plaintext.
Classifier. An algorithm capable of distinguishing objects from several classes.
CLT. Central Limit Theorem.
CMOS. Complementary Metal–Oxide–Semiconductor (type of imaging sensor ).
CMY. Cyan, Magenta, Yellow, subtractive color model.
CMYK. Cyan, Magenta, Yellow, blacK subtractive color model.
Code. A set of strings of symbols from an alphabet.
– length. The number of alphabet symbols in each codeword.
– dimension. The dimension of a linear code considered as a vector subspace.
– codimension. Dimension of the orthogonal complement of a code. For a
linear code with length n and dimension k, the codimension is n − k.
Codeword. An element of a code.
COG. Center Of Gravity. The mass center of data points assuming all data
points have the same mass.
Color
– channel. A component of color signal covering one color.
– correction. A transformation of color registered by the sensor.
– cube. The set of all RGB colors expressible as triples of integers (R, G, B) ∈ {0, . . . , 255}³.
– cut. A binary sequence obtained by scanning an image by rows and regis-
tering 0 and 1 for two fixed colors.
– filter array. See CFA.
– interpolation. See demosaicking.
– model. See color representation.

– quantization. Process of decreasing the number of unique colors in an image. Part of conversion to palette format.
– representation. Also called color model. A method for representing color
in a computer using a vector of numerical values.
– transformation. Process of converting one color representation to another.
Complementary cumulative distribution function. The tail probability
of a random variable, 1 − F (x), where F (x) is the cumulative distribution
function.
Complete feature set. A set of numerical features that completely character-
izes natural images.
Composite hypothesis testing. A hypothesis-testing problem in which some
distributions are not known or depend on unknown parameters.
Compression ratio. Ratio between the original size of a file and its compressed
version.
Conditional entropy. Entropy of a random variable conditional on another
random variable.
Co-occurrence matrix. Histogram of pairs of cover elements along a certain
direction (e.g., horizontal pairs, vertical, diagonal, etc.).
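A co-occurrence matrix of horizontal neighbors can be computed with a few lines of Python (an illustrative sketch; the sparse dict representation is a choice of convenience, not the book's):

```python
def cooccurrence(image, shift=(0, 1)):
    """Sample co-occurrence matrix of pixel pairs along a given direction
    ((0, 1) = horizontal neighbors). Returns a dict {(v1, v2): count}."""
    dy, dx = shift
    counts = {}
    for y in range(len(image) - dy):
        for x in range(len(image[0]) - dx):
            pair = (image[y][x], image[y + dy][x + dx])
            counts[pair] = counts.get(pair, 0) + 1
    return counts

img = [[0, 0, 1],
       [1, 1, 1]]
print(cooccurrence(img))  # {(0, 0): 1, (0, 1): 1, (1, 1): 2}
```

Passing shift=(1, 0) or shift=(1, 1) gives the vertical and diagonal co-occurrences mentioned in the definition.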
Correlation. Optimal detector of known signals corrupted by additive white
Gaussian noise.
– coefficient. A scalar value expressing linear dependence between two ran-
dom variables.
Coset. The set of all words with the same syndrome.
– leader. A member of a coset with the smallest Hamming weight.
Cover
– element. An individual entity (e.g., pixel, DCT coefficient ) from which the
cover object is constituted.
– modification. A steganographic method in which the cover object is mod-
ified to embed the secret message.
– object. The original unmodified object before embedding a secret message
in it.
– selection. A steganographic method in which the cover object is selected
from a database so that the cover communicates the required message.
– source. Random variable defining the process of selecting objects as cover
objects for steganographic communication.
– synthesis. A steganographic method in which the cover object is synthesized
to communicate the required message.
Covering radius. The smallest radius of a ball needed to cover all possible
words with balls centered at all codewords.
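For tiny codes the covering radius can be found by brute force, as in this Python sketch (exponential in n, so illustrative only); the [3, 1] repetition code {000, 111} has covering radius 1:

```python
from itertools import product

def hamming_dist(a, b):
    """Number of positions in which two words differ."""
    return sum(x != y for x, y in zip(a, b))

def covering_radius(code, n):
    """Max over all n-bit words of the distance to the nearest codeword."""
    return max(min(hamming_dist(w, c) for c in code)
               for w in product((0, 1), repeat=n))

repetition = [(0, 0, 0), (1, 1, 1)]
print(covering_radius(repetition, 3))  # 1
```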

Covert communication/channel. A non-obvious information exchange usually obtained by hiding messages in other objects. Another term for steganography.
Cramer–Rao lower bound. A bound on variance of an estimator.
Critical region. A set of measurements for which a detector decides the alter-
native hypothesis (signal present).
Cross-validation. A method for determining parameters of a support vector
machine.
CRT. Cathode-Ray Tube. An old monitor construction that uses a glass tube.
CRW. Canon raw image format.
Cryptographically secure PRNG. A pseudo-random number generator sat-
isfying the property that it be computationally intractable to derive its seed
from a sequence of pseudo-random numbers generated by it.
Cryptosystem
– asymmetric (public key). An encryption system in which the process of
encrypting (controlled by the encryption or public key) is different from
decryption (controlled by the decryption key). The system is conceived so
that knowledge of the encryption key does not allow deriving the decryp-
tion key or decrypt because the task of finding the decryption key leads to
an intractable problem. An example of a public-key cryptosystem is RSA.
– symmetric (private key). An encryption scheme controlled by a secret
key that needs to be shared between the communicating parties before
starting the communication. Examples are DES and AES.
CTE. Charge-Transfer Efficiency.
Dark current. Sensor response when it is not lit by light.
Data masking. A covert communication method in which data is given statis-
tical properties of typical cover objects.
DC term. The DCT coefficient corresponding to spatial frequency (0, 0) (from
direct current).
DCT. Discrete Cosine Transform.
– coefficient. A coefficient obtained via DCT.
– mode. DCT coefficients corresponding to a particular spatial frequency.
Dead pixel. A defective pixel always registering no (low) signal.
Decision function. Another word for detector used in support vector machines.
Decompression. Process of obtaining the original data from its compressed
form.

Defective
– memory. A memory in which some cells are defective or permanently stuck
to zero or one.
– pixel. A pixel whose response to light does not comply with design speci-
fications.
Deflection coefficient. A numerical quantity measuring the performance of a
detector.
Demosaicking. Process of interpolating colors from a signal acquired by an
imaging sensor equipped with a color filter array.
Denoising. Process of removing noise from a signal.
DES. Data-Encryption Standard. A symmetric cryptosystem.
Detection probability. Probability of correctly detecting a stego image as
stego.
Detector. An algorithm that detects presence of a signal.
DFT. Discrete Fourier Transform.
Digital image acquisition. Process of acquiring an image through an imaging
sensor.
Discrete cosine transformation. An orthogonal transformation used in JPEG
format.
Distance to code. Distance between a given word and the closest codeword
from the code.
Distortion. Signal degradation typically due to noise or processing.
Distortion-limited embedder. An embedding algorithm whose embedding dis-
tortion is bounded.
Dithering. Process of spreading quantization error to neighboring pixels to
prevent creating bands during color quantization.
DNG. Digital NeGative file format.
Double JPEG compression. Image that was JPEG compressed twice, each
time with a different quantization matrix.
Dry elements. See changeable elements.
Dual
– code. The orthogonal complement of a linear code.
– histogram. A feature used in blind steganalysis of JPEG images.
DWT. Discrete Wavelet Transform.

Embedding
– algorithm/mapping. Algorithm that embeds a secret message in a cover
object.
– capacity. Maximal number of bits that can be embedded using a given
steganographic method.
– distortion. Distortion imposed onto the cover object due to embedding a
secret message.
– efficiency. The expected number of bits communicated per unit expected
embedding distortion.
– impact. Increase in statistical detectability of embedding changes.
– operation. Procedure that modifies individual cover elements during em-
bedding.
– path. A path along which message bits are embedded.
– while dithering. A steganographic method for palette images.
Empirical risk. Risk function evaluated from data samples.
Entropy. Uncertainty (information content) of a random variable.
– encoder/compressor. A lossless compression algorithm capable of com-
pressing a stream of symbols close to its entropy.
– decoder/decompressor. Algorithm inverting the action of an entropy
encoder.
Erasure channel. A communication channel in which some symbols may get
replaced by an erasure symbol.
Error function. Erf(x) = (2/√π) ∫₀ˣ e^(−t²) dt.
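The integral can be checked numerically against Python's built-in math.erf (a sketch using a simple midpoint rule, not a production-grade quadrature):

```python
import math

def erf_numeric(x, steps=10_000):
    """Erf(x) = (2/sqrt(pi)) * integral of exp(-t^2) over [0, x], midpoint rule."""
    h = x / steps
    total = sum(math.exp(-((i + 0.5) * h) ** 2) for i in range(steps))
    return 2.0 / math.sqrt(math.pi) * h * total

print(abs(erf_numeric(1.0) - math.erf(1.0)) < 1e-6)  # True
```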
Estimator. An algorithm that computes an estimate of an unknown parameter.
Eve. A fictitious character (the warden) from the prisoners’ problem.
Extraction algorithm/mapping. Algorithm that extracts secret message
from the stego object.
EzStego. A steganographic program for palette images.
F5. A steganographic algorithm for the JPEG format.
False alarm. An error when a cover object is mistakenly identified as stego.
– probability. Probability of encountering a false alarm.
Feasible covert channel. A covert channel that complies with the bound on
the embedding distortion and perfect security.
Feature. A numerical quantity extracted from an object.
Finite field. A finite alphabet of symbols that can be added or multiplied so
that the operations have similar properties as operations of addition and
multiplication on real numbers.

Fisher information. A fundamental quantity expressing the amount of information that measurements (that depend on an unknown parameter) give about the parameter value.
Floyd–Steinberg dithering. Algorithm for dithering by diffusing color quan-
tization error to neighboring pixels.
Forensic steganalysis. Effort directed towards extracting the message from a
stego object. It may include various tasks, such as estimating attributes of the
message (e.g., its length) and determining the steganographic method used
for embedding the message and/or the stego and encryption keys.
Galois field. See finite field.
Gamma
– correction. Non-linear transform of light intensity expressed as power law.
– function. Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt.
Gaussian
– kernel. Kernel used in SVMs, k(x, x′) = e^(−γ‖x − x′‖²).

– variable. A random variable with mean μ and variance σ² with the following probability distribution: (1/(√(2π)σ)) e^(−(x−μ)²/(2σ²)).

Generalized Gaussian distribution. A unimodal probability distribution which can be thought of as a generalized form of the Gaussian distribution.
Generator matrix. A matrix whose rows form the basis of a linear code.
GIF. Graphic Interchange File format for images.
Gifshuffle. A steganographic program for palette images.
GLRT. Generalized Likelihood-Ratio Test.
Golay code. One of the perfect codes.
Hamming
– code. A perfect linear code with parity-check matrix containing all non-
zero words of fixed length as columns (all columns are different up to a
multiplication by an element from the finite field ).
– distance. The number of elements in which two words differ.
– weight. The number of non-zero symbols in a word.
Hash. A bit string extracted from an object (digest).
– function (one-way function). A function that assigns a bit string of fixed
length to any bit string on its input. A cryptographic hash f (x) must

satisfy the property that it be computationally intractable to invert f (i.e., find y such that f (y) = x) and to find collisions (i.e., x and y such that f (x) = f (y)).
HCF. Histogram Characteristic Function. Fourier transform of the histogram.
Hilbert kernel density estimator. A specific sparsity measure.
Hilbert space. A complete linear vector space with inner product.
Hinge loss function. A convex loss function used in definition of risk function
in soft-margin support vector machines.
Histogram
– attack. A statistical attack on LSB embedding based on detecting artifacts
in the histogram of the stego image.
– characteristic function. Amplitude of the Fourier transform of a his-
togram.
Homogeneous bit pair. A pair of bits 00 or 11.
Hot pixel. A defective pixel with very high response independent of the incident
light.
iid. Independent and identically distributed.
IIF. Type of an image format.
Imaging sensor. A semiconductor device used to acquire digital images in
cameras.
Inter-block feature. A feature obtained by quantifying a relationship between
DCT coefficients from different blocks.
Intra-block feature. A feature obtained by quantifying a relationship between
DCT coefficients from the same block.
IQR. Inter-Quartile Range. A robust measure of spread corresponding to the
range defined by the first and third quartile.
Isolation. Distance between a palette color and another, different, closest palette
color.
Isomorphic code. A geometrically identical code.
J-divergence. A symmetrized KL divergence DKL (f ||g) + DKL (g||f ).
JPEG. Joint Photographic Expert Group image format.
JPEG2000. A modern version of JPEG based on discrete wavelet transform.
JPEG80. Image database RAW compressed with 80% quality JPEG.

JPEG compatibility steganalysis. A steganalysis method that detects presence of embedding changes by verifying the compatibility of each 8 × 8 pixel block with JPEG compression.
JP Hide&Seek. A steganographic program for JPEG images.
Jsteg. A steganographic program for JPEG images.
Kernel. A similarity function used to define kernelized SVMs.
Kronecker delta. Function defined as δ(0) = 1 and δ(x) = 0 otherwise.
Kullback–Leibler divergence (distance). A fundamental concept from in-
formation theory. It is a measure of distance between two random variables.
Also called relative entropy.
Kurtosis. The fourth normalized central moment.
Large payload. Payload whose length is close to embedding capacity.
Lattice. A discrete set of points in space, usually exhibiting a regular structure.
Law of rare events. Probabilistic description of random phenomena that occur
independently and uniformly in time (see Poisson distribution).
LCD. Liquid-Crystal Display.
LibSVM. A popular support-vector-machine library.
Likelihood-ratio test. An optimal detector for a simple hypothesis-testing
problem.
Linear
– CCD. A CCD sensor formed by a one-dimensional row of photodetectors.
– code. A set of codewords forming a vector space.
– SVM. A support vector machine realized with a separating hyperplane
directly in the feature space.
LMP. Locally Most Powerful detector.
Log inequality. log x ≤ x − 1 valid for all x > 0.
Log-sum inequality. For any non-negative numbers r₁, . . . , rₖ and positive s₁, . . . , sₖ,

Σ_{i=1}^{k} rᵢ log(rᵢ/sᵢ) ≥ (Σ_{i=1}^{k} rᵢ) log((Σ_{j=1}^{k} rⱼ)/(Σ_{j=1}^{k} sⱼ)).
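The inequality is easy to verify numerically (a Python sketch with arbitrary example sequences; natural log is used here, but the inequality holds in any base):

```python
from math import log

def log_sum_lhs(r, s):
    """Left-hand side: sum_i r_i log(r_i / s_i), with 0 log 0 = 0."""
    return sum(ri * log(ri / si) for ri, si in zip(r, s) if ri > 0)

def log_sum_rhs(r, s):
    """Right-hand side: (sum r_i) log(sum r_j / sum s_j)."""
    R, S = sum(r), sum(s)
    return R * log(R / S) if R > 0 else 0.0

r, s = [1.0, 2.0, 3.0], [2.0, 1.0, 4.0]
print(log_sum_lhs(r, s) >= log_sum_rhs(r, s))  # True
```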

Lossless compression. Compression of data that does not incur any loss of
information.

Lower embedding efficiency. The ratio between the payload length and the
largest number of embedding changes that may be needed to embed that
payload.
LSB. Least Significant Bit. The last bit in a big-endian representation of an
integer.
– embedding. A steganographic method in which message bits are embedded
in a sequence of natural numbers by replacing their least significant bits
with message bits.
– matching. See ±1 embedding.
– pair. A pair of values differing only in their LSBs.
– plane. Array of LSBs.
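LSB embedding (LSB replacement) amounts to clearing and overwriting the last bit of each visited cover element, as in this illustrative Python sketch (not code from the book):

```python
def lsb_embed(pixels, bits):
    """LSB replacement: overwrite the least significant bits of the first
    len(bits) cover elements with message bits."""
    return [(p & ~1) | b for p, b in zip(pixels, bits)] + pixels[len(bits):]

cover = [118, 119, 120, 121, 6]
message = [1, 0, 0, 1]
stego = lsb_embed(cover, message)
print(stego)                       # [119, 118, 120, 121, 6]
print([p & 1 for p in stego[:4]])  # [1, 0, 0, 1]
```

Note that values in an LSB pair (2k, 2k+1) can only be exchanged for each other, which is the asymmetry exploited by the histogram attack and Sample Pairs Analysis.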
LSE. Least-Square Estimation. A general estimation method in which data
model parameters are determined by minimizing the squared error between
the model and observations.
Luminance. Color component carrying information about light intensity.
LT
– code. Sparse linear code designed for the erasure channel.
– process. An algorithm based on LT codes used for realizing non-shared
selection channels.
Machine learning. The field of artificial intelligence.
Macroblock. A block of 16 × 16 pixels used in JPEG format.
MAD. Median Absolute Deviation. A robust measure of spread.
MAE. Median Absolute Error. A robust measure of error spread.
MAP. Maximum A Posteriori Estimation. A general estimation method for
determining an unknown parameter of a distribution. The parameter is de-
termined by maximizing the probability of the parameter given the measure-
ments.
Margin. The distance between the support vector closest to the separating hyperplane and the hyperplane itself.
Markov
– chain. A sequence of random variables xi in which each variable depends
only on the previous variable (Pr{xi |xi−1 } = Pr{xi |xi−1 , . . . , x1 }). For dis-
crete variables, a Markov chain is completely described using the transition
probability matrix A[k, l] = Pr{xi = l|xi−1 = k}.
– features. Features for blind steganalysis of JPEG images obtained by quan-
tifying the relationship between neighboring DCT coefficients of the same
block.

– random field. A two-dimensional generalization of a Markov chain. An array of random variables in which each variable depends only on the variables from its neighborhood.
Matrix embedding. A syndrome coding method for decreasing the embedding
distortion.
– theorem. A general methodology for constructing matrix-embedding meth-
ods from linear codes.
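As an illustration, the binary [7, 4] Hamming code (whose parity-check matrix has the binary expansions of 1, . . . , 7 as columns) lets the sender embed 3 message bits in 7 cover LSBs with at most one embedding change. A Python sketch (the packing of the 3 message bits into one integer is a convenience of this example):

```python
def syndrome(bits):
    """Syndrome under the [7,4] Hamming parity-check matrix whose j-th
    column is the binary expansion of j (j = 1..7); returned packed
    as an integer 0..7 representing 3 bits."""
    s = 0
    for j, b in enumerate(bits, start=1):
        if b:
            s ^= j
    return s

def matrix_embed(cover_lsbs, msg):
    """Change at most one of the 7 LSBs so that the syndrome equals msg."""
    s = syndrome(cover_lsbs) ^ msg
    stego = list(cover_lsbs)
    if s:
        stego[s - 1] ^= 1  # flip the single bit indexed by the syndrome
    return stego

x = [0, 1, 1, 0, 1, 0, 0]
y = matrix_embed(x, 0b101)
print(syndrome(y))                        # 5 = 0b101, the message
print(sum(a != b for a, b in zip(x, y)))  # at most 1 embedding change
```

This is the q = 3 member of the Hamming matrix-embedding family, which embeds q bits per 2^q − 1 cover elements with at most one change.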
Matrix LT process. An algorithm based on LT codes. It permutes columns
and rows of a matrix and brings it to an upper-diagonal form.
Maximum a posteriori estimator. An estimator of unknown parameter θ from data x that maximizes Pr{θ|x}: θ̂ = arg max_θ Pr{θ|x} = arg max_θ Pr{x|θ}Pr{θ}.
Maximum-likelihood estimator. An estimator of unknown parameter θ from
data x that maximizes Pr{x|θ}, θ̂ = arg maxθ Pr{x|θ}.
Maximum mean discrepancy. A two-sample statistic used to classify between
two categories.
MB1 (MB2). Model-Based Steganography with (without) deblocking.
Median absolute error. The median of absolute values of error realizations.
Meet-in-the-middle. A brute-force algorithm for finding coset leaders.
Message digest function. A function that returns a digest (hash) of a given
object.
Message source. Random variable defining the statistical properties of com-
municated messages, such as their length.
MEX. Matlab EXecutable file.
Microdot. Stego object downsized to such minuscule dimensions that it resem-
bles a speck of dirt.
Microlens. A miniature lens bonded to each pixel to help direct light to the
photodetector.
Mimic function. A mimic function changes a file to match its statistical prop-
erties to those of another file.
Minimal
– distance. Hamming distance between two closest codewords from a code.
– entropy. An information-theoretic concept characterizing the uncertainty
of a random variable.
– embedding impact. A general principle for constructing steganographic
schemes.

Missed detection. An error made when a stego object is mistakenly identified as a cover object.
– probability. The probability of failing to recognize a stego image as con-
taining a secret message.
MLE. Maximum-Likelihood Estimation. A general estimation method for deter-
mining an unknown parameter of a distribution. The parameter is determined
by maximizing the probability of observing the measurements conditioned on
the unknown parameter.
MMx. A steganographic algorithm for the JPEG format.
Model-based steganography. A steganographic system designed to preserve
a model of cover source.
MPEG. A video format based on JPEG.
MSE. Mean-Square Error.
Multi-classification. Classification into more than two classes.
Mutual information (between two random variables). Information about a
random variable conveyed by another random variable.
MVU estimator. Minimum-Variance Unbiased estimator.
Negligible function. A function of x that falls to zero faster than any power
of x.
NEF. Nikon raw image format.
Neyman–Pearson hypothesis testing. A principle for constructing detectors
in which one imposes an upper bound on the probability of a false alarm and
minimizes the probability of missed detection.
Noise
– moments. Statistical moments of the noise residual.
– reduction. Process of suppressing noise in a signal.
– residual. The result of subtracting the signal and its denoised version.
Non-shared selection channel. A selection channel that is not shared between
the sender and the recipient.
Normalized histogram. A sample probability mass function.
NRCS. Natural Resources Conservation Service image database.
nsF5. An improved version of the F5 algorithm with removed shrinkage.
OC-NM. One-Class Neighbor Machine, a classifier that recognizes a single class.
OG. An abbreviation for OutGuess.

One-time pad. A simple symmetric cryptosystem in which message bits are XORed with a random stream of bits shared between Alice and Bob.
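Because XOR is its own inverse, encryption and decryption are the same operation. A minimal Python sketch (the pad must be truly random, as long as the message, and never reused):

```python
import secrets

def otp(message, pad):
    """XOR a byte string with a one-time pad of equal length."""
    assert len(pad) == len(message)
    return bytes(m ^ k for m, k in zip(message, pad))

plaintext = b"meet at dawn"
pad = secrets.token_bytes(len(plaintext))  # shared in advance by Alice and Bob
ciphertext = otp(plaintext, pad)
print(otp(ciphertext, pad))  # b'meet at dawn'
```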
One-way function. A mapping h for which it is computationally infeasible to
find x for given h(x).
Optimal parity embedding. A steganographic algorithm for palette images.
Oracle. A device that can be queried to produce a cover object from a given
source.
Orthogonal complement. The set of all vectors orthogonal to every vector of a given set.
OutGuess. A steganographic program for hiding messages in JPEG files.
p-value. The tail probability of a chi-square distribution.
Pairs Analysis. A quantitative steganalysis method for palette or grayscale
images.
Palette. A look-up table of all colors that may occur in a palette image.
– color. A color stored in a palette.
Parity. A bit assigned to a cover element.
– check matrix. A generator matrix of the dual code.
– function. A symbol-assignment function that assigns bits.
Passive warden. A warden who does not interfere with communication.
Payload. Length of the secret message.
pdf. Probability density function (defined for continuous variables).
Perfect
– code. A code that saturates the sphere-covering inequality.
– security. Impossibility to distinguish between cover and stego objects.
Perturbed quantization. A steganographic embedding principle in which em-
bedding occurs during processing the cover object while limiting the changes
to those elements of the cover object that experience the largest quantization
error.
Photodetector. Another term for pixel.
Photodiode. Another term for pixel.
Photoelectric effect. A phenomenon during which a photon creates an
electron–hole pair.
Photonic noise. See shot noise.
Pixel. An element of an imaging sensor capable of registering incident light
intensity.

– well. A portion of pixel collecting free electrons released due to the photo-
electric effect at image acquisition.
Plaintext. Message to be encrypted.
pmf. Probability mass function (defined for discrete variables).
PNG. Portable Network Graphics image format.
Poisson distribution. A mathematical form of the law of rare events.
PQe. A version of the perturbed quantization algorithm for JPEG images.
PQt. A version of the perturbed quantization algorithm for JPEG images.
Precover. A hypothetical cover used to justify a priori distribution of some
cover-image statistics.
Primary set. A set of pixel pairs with specified relationship (used to construct
Sample Pairs Analysis attack on LSB embedding).
Prisoners’ problem. A fictitious scenario involving Alice, Bob, and warden
Eve, used to demonstrate the problem of steganography and steganalysis.
PRNG. Pseudo-Random Number Generator.
PRNU. Pixel Photo-Response Non-Uniformity. A systematic imperfection due
to varying response to light of individual pixels.
Probabilistic algorithm. An algorithm whose mechanism involves random-
ness.
Pseudo-random selection channel. A selection channel determined using a
PRNG.
PSNR. Peak Signal-to-Noise Ratio.
Public-key
– encryption. Asymmetric cryptosystem in which encryption is realized using the so-called public key and decryption with the private key.
– steganography. Steganographic channel with public access to the secret
message, which is encrypted using a private key.
q-ary. Related to an alphabet consisting of q symbols.
QIM. Quantization Index Modulation (a robust watermarking technique).
Quality factor. A scalar value that controls the quality of a JPEG file.
Quantitative steganalysis. Steganalysis whose objective is to estimate the
embedded payload or the change rate (or the number of embedding changes).
Quantization. Process of rounding real-valued quantities to represent them
with bits.
– error. Distortion induced by quantization.

– noise. Distortion introduced by quantization.


– step. During quantization, values are rounded to multiples of the quanti-
zation step.
– matrix (table). A matrix of quantization steps used in JPEG compression.
Quantum efficiency. The probability that a photodetector absorbs a photon.
Rainbow coloring. Assignment of colors to lattice points so that the colors of
all neighboring lattice points are different.
Random
– binning. A methodology for creating random codes.
– code. A code generated randomly.
Raster format. Image representation by storing pixel values as one or more
matrices of the same dimensions as the image.
RAW. A database of never-compressed natural images taken by digital cameras.
Raw sensor output. Unprocessed signal obtained by reading out the charge of
individual pixels.
Readout noise. Noise introduced during the process of extracting the charge
at each pixel.
Regularity condition. We say that a pdf p(x; θ) satisfies the regularity condition if

E_{p(x;θ)}[∂ log p(x; θ)/∂θ] = 0 for all θ.

This condition is satisfied for most well-behaved distributions.


Rejection sampling. A steganographic method in which the cover or its ele-
ments are selected by randomly drawing them from their corresponding sets
till a correct message is embedded.
Relative
– distortion. Distortion measured per cover element.
– embedding capacity. Embedding capacity per cover element.
– entropy. See KL divergence.
– message length/payload. Ratio between the payload and embedding ca-
pacity. Usually denoted α. Also called relative payload.
Repetition code. A code whose codewords are strings of repeating symbols.
RGB. Red, Green, and Blue channels in a color signal.
Risk functional. Performance measure of SVMs.
Robustness. Ability to withstand distortion.

ROC curve. Receiver Operating Characteristic of a detector showing the probability of detection versus probability of false alarm.
RS analysis. A quantitative steganalysis method for detection of LSB embedding
in images.
RSA. (Rivest Shamir Adleman) The first and the most popular public-key cryp-
tosystem created from the fact that it is computationally hard to factorize a
composite number into the product of primes.
RSD. Robust Soliton Distribution.
S-Tools. A steganographic program.
Sample Pairs Analysis. A targeted steganalysis method for LSB embedding in
spatial domain.
SCAN. An image database containing scans of natural images.
SDCS. Sum and Difference Covering Set.
Secure payload. The number of bits that can be communicated at a given
security level using a specific imperfect steganographic scheme. The secure
payload grows only with the square root of the number of pixels in the cover.
Selection channel. A portion of the cover object used for embedding the secret
message.
Sensor. See imaging sensor.
– noise. The ensemble of noise sources introduced during image acquisition.
– resolution. The number of pixels on the sensor.
Separating hyperplane. Hyperplane separating two classes.
Sequential selection channel. A selection channel defined by some simple
scan of the image, e.g., by rows or columns.
Shot noise. Noise in signal acquired by an imaging sensor due to quantum
properties of light.
Shrinkage. Shrinkage occurs when a non-zero DCT coefficient is modified to
zero after message embedding.
Side-information. Information available to Alice but not Bob.
Skewness. The third normalized central moment.
SLR. Single-Lens Reflex camera.
SNR. Signal-to-Noise Ratio.
Soft-margin SVM. An SVM that allows misclassifications by penalizing them.
SPA. Sample Pairs Analysis, a quantitative steganalysis of LSB embedding in
the spatial domain.
Space-filling curve (planar). A curve that goes through every point in a unit
square.
Sparse codes. Codes with codewords of small Hamming weight.
Sparsity measure. A function evaluating sparsity of data points at a given
point.
Spatial frequency. A pair of integers characterizing a DCT mode.
Sphere-covering bound. An inequality relating the number of codewords in a code, its length, and its covering radius.
Spread-spectrum (data hiding). A method for information hiding in which a
bit is spread into a longer random sequence that is superimposed on the cover
object. Usually used to achieve robustness with respect to channel distortion.
Square-root law. Thesis claiming that the steganographic capacity of imperfect
stegosystems grows only as the square root of the number of cover elements.
Standard quantization matrix. A family of quantization matrices
parametrized by quality factor as specified in the JPEG format standard.
Statistical
– detectability. A measure of how detectable embedding changes are using
standard methods of hypothesis testing.
– restoration. A general principle for constructing steganographic methods
that preserve selected statistics of the cover object.
Steganalysis. The counterpart of steganography. Effort directed towards dis-
covering the presence of a secret message.
Steganalysis-aware steganography. Steganography constructed to avoid
known steganalysis attacks.
Steganalyzer. A specific implementation of some steganalysis attack.
Steganographic
– capacity. Maximum number of message bits that can be embedded in a
cover object without introducing statistically detectable artifacts.
– channel. A covert communication system consisting of the cover source,
stego key source, and message source, and the physical communication
channel used to exchange messages.
– communication. See steganographic channel.
– file system. A tool to thwart “rubber-hose attacks” that allows the user
to plausibly deny that encrypted files reside on the disk.
– scheme. See steganographic channel.
– security. The impossibility of constructing an attack on a steganographic scheme.
Steganography. The art of communicating messages in a covert manner.
Steghide. A steganographic program that embeds messages in JPEG files while preserving the histogram of DCT coefficients.
Stego
– key. A secret shared between the sender and the recipient that drives the
embedding and extraction algorithms.
– key source. Random variable defining the process of selecting the stego
key in steganographic communication.
– noise. The difference between cover and stego objects.
– object. The result of embedding a message in the cover object.
Stegosystem. Another term for steganographic channel.
Stirling’s formula. An approximate formula for factorial.
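The formula itself, n! ≈ √(2πn)(n/e)ⁿ, is easy to check numerically (an illustrative sketch; the relative error decays roughly as 1/(12n)):

```python
import math

def stirling(n):
    """Stirling's approximation to n!."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

n = 20
rel_err = abs(stirling(n) - math.factorial(n)) / math.factorial(n)
assert rel_err < 0.005   # close to 1/(12 * 20) ~ 0.0042
```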
Stochastic modulation. A steganographic scheme that embeds messages in raster images by adding noise with a predefined probability density function.
Structural steganalysis. A generalized extensible formulation of SPA aimed
at detection of LSB embedding.
Student’s t-distribution. A statistical distribution obtained by normalizing
zero-mean Gaussian samples by their sample variance.
Subtractive color model. Color model in which colors are created by removing
base colors.
Support vector. A data point determining the separating hyperplane in SVMs.
Support vector machine (SVM). A tool for machine learning.
– kernelized. An SVM implemented using a kernel.
– soft-margin. An SVM capable of discriminating between non-separable
data samples.
– weighted. An SVM whose decision errors are controlled by penalization
parameters.
SVM. Support vector machine.
Symbol-assignment function. A mapping that assigns an alphabet symbol
(or an element of a finite field) to every cover element (every pixel or DCT
coefficient). Symbol-assignment functions enable using the apparatus of cod-
ing to improve embedding efficiency (matrix embedding) and to communicate
using non-shared selection channels.
Syndrome. A bit stream obtained by multiplying a word and the parity-check
matrix.
– coding. A method for communicating a secret message as a syndrome of
a linear code.
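As a concrete illustration (a toy Python sketch, not the book's pseudocode), syndrome coding with the [7,4] Hamming code embeds 3 message bits into 7 cover LSBs while changing at most one of them — the classic matrix-embedding example:

```python
# Parity-check matrix of the [7,4] Hamming code: the columns are the
# binary expansions of 1..7 (least-significant bit in row 0).
H = [[(j >> k) & 1 for j in range(1, 8)] for k in range(3)]

def syndrome(x):
    """Compute the syndrome Hx over GF(2)."""
    return [sum(h * v for h, v in zip(row, x)) % 2 for row in H]

def embed(x, m):
    """Matrix embedding: flip at most one element of x so that Hx = m."""
    s = [(a + b) % 2 for a, b in zip(syndrome(x), m)]  # syndrome mismatch
    y = list(x)
    if any(s):
        y[s[0] + 2 * s[1] + 4 * s[2] - 1] ^= 1  # flip the matching column
    return y

x = [1, 0, 1, 1, 0, 0, 1]   # LSBs of 7 cover elements
m = [0, 1, 1]               # 3 message bits
y = embed(x, m)
assert syndrome(y) == m and sum(a != b for a, b in zip(x, y)) <= 1
```

The recipient needs only the public matrix H to extract the message as the syndrome of the received LSBs.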
System attack. An attack on a steganographic scheme that utilizes implementation errors or other auxiliary information rather than the statistical distribution of cover-object elements.
Systematic form. A linear code is in systematic form if its generator matrix is
[I; A], where I is the identity matrix.
Systematic imperfection. Distortion of image signal that is not random in
nature and repeats in all images.
Tail inequality. An upper bound on the partial sum of binomial coefficients, valid for 0 ≤ β ≤ 1/2:
$$\sum_{i=0}^{\lfloor \beta n \rfloor} \binom{n}{i} \le 2^{nH(\beta)},$$
where H denotes the binary entropy function.
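A quick numerical sanity check of the bound (an illustrative sketch; H2 below is the binary entropy function):

```python
import math

def H2(p):
    """Binary entropy function in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, beta = 100, 0.3
partial_sum = sum(math.comb(n, i) for i in range(int(beta * n) + 1))
assert partial_sum <= 2 ** (n * H2(beta))
```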

Targeted steganalysis. An approach to steganalysis designed to detect a specific steganographic method.
TDI. Time-Delay Integration imaging.
Ternary
– alphabet. An alphabet consisting of three symbols.
– code. A code built over a ternary alphabet.
TIFF. Tagged-Image File Format.
Timestamped bits. Bits tagged with a scalar value (time).
Trace set. A set of pixel pairs used in structural steganalysis.
Transform image format. Format that represents an image in a transform
domain.
Tristimulus theory. A theory of color vision based on the fact that the human
retina contains three types of cones sensitive to red, green, and blue colors.
True-color image. An image in which each pixel is represented using 24 bits
(8 bits per color channel).
UMP. Uniformly Most Powerful detector.
Unbiased estimator. An estimator whose bias is zero, E[θ̂] = θ for all θ.
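For example (an illustrative sketch, not from the book), the sample mean is an unbiased estimator of the mean: averaging the estimate over all equally likely samples recovers the true parameter exactly:

```python
import itertools

# All 2**3 equally likely samples of size 3 from a fair coin {0, 1};
# the true mean is 0.5.
samples = list(itertools.product([0, 1], repeat=3))
total = sum(sum(s) for s in samples)          # 12 ones across 8 samples
avg_estimate = total / (3 * len(samples))     # E[sample mean]
assert avg_estimate == 0.5                    # equals the true mean
```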
Undetectability. The inability to prove the existence of a secret message in a
stego object.
Variation. A feature used in blind steganalysis of JPEG images.
Voronoi cell. A region defined for a lattice.
WAM. Blind steganalyzer constructed to detect spatial-domain steganography
using Wavelet Absolute Moments of noise residuals.
Warden. The subject or a computer program that monitors the traffic between
Alice and Bob in the prisoners’ problem.
– active. A warden that slightly distorts the communication so that it still
conveys the same overt meaning. The goal is to prevent steganography.
– malicious. A warden who tries to trick the communicating parties by
impersonating them or by other means that are based on the warden’s
knowledge of the steganographic protocol.
– passive. A warden that passively observes the communication.
Watermarking. Data-hiding application in which the hidden data supplements
the cover object.
Wavelet transform. A signal transform with basis functions that are localized
simultaneously both in the spatial and in the frequency domain.
Weak Law of large numbers. The law that states that the sample mean
converges to the mean in probability.
Weighted SVM. A support vector machine in which misclassifications are pe-
nalized on the basis of weights assigned to false alarms and missed detections.
Wet paper codes. Codes that enable communication using non-shared selection
channels.
Wet pixels. Pixels that are not to be changed during steganographic embedding.
White balance. Color transformation adjusting the gain of each color channel.
Wiener filter. An adaptive linear denoising filter.
Within-image error. A component of the estimation error of quantitative ste-
ganalyzers that depends on the placement of embedding changes in the stego
image.
WMF. Windows Meta File image format.
Word. A string of symbols from an alphabet.
WS. Weighted stego method for detection of LSB embedding.
XOR. Exclusive or.
YUV. Luminance and two chrominance signals in a color signal.
Zig-zag scan. Scanning order of quantized DCT coefficients used during JPEG
compression.
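The scan order can be generated by walking the anti-diagonals of the block and alternating direction (an illustrative Python sketch, not the book's pseudocode):

```python
def zigzag_order(n=8):
    """(row, col) pairs of an n x n block in zig-zag scan order, the
    order used to serialize quantized DCT coefficients in JPEG."""
    order = []
    for d in range(2 * n - 1):   # anti-diagonal: row + col = d
        diag = [(i, d - i) for i in range(max(0, d - n + 1), min(d, n - 1) + 1)]
        # odd diagonals run top-right to bottom-left, even ones the reverse
        order.extend(diag if d % 2 else reversed(diag))
    return order

zz = zigzag_order()
# scan starts (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ...
```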
ZZW construction. A general method for constructing matrix-embedding
methods from existing methods.
References

[1] Fingerprints for car parts. The Economist, December 8, 2005.


[2] G. F. Amelio, M. F. Tompsett, and G. E. Smith. Experimental verification of the charge
coupled device concept. Bell Systems Technical Journal, 49:593–600, 1970.
[3] R. Anderson. Stretching the limits of steganography. In R. J. Anderson, editor, Infor-
mation Hiding, 1st International Workshop, volume 1174 of Lecture Notes in Computer
Science, pages 39–48, Cambridge, May 30–June 1, 1996. Springer-Verlag, Berlin.
[4] R. J. Anderson, R. Needham, and A. Shamir. The steganographic file system. In D. Auc-
smith, editor, Information Hiding, 2nd International Workshop, volume 1525 of Lecture
Notes in Computer Science, pages 73–82, Portland, OR, April 14–17, 1998. Springer-
Verlag, Berlin.
[5] R. J. Anderson and F. A. P. Petitcolas. On the limits of steganography. IEEE Journal
of Selected Areas in Communication, 16(4):474–481, 1998.
[6] N. Aoki. A band extension technique for G.711 speech using steganography. IEICE
Transactions on Communications, E89-B(6):1896–1898, 2006.
[7] N. Aoki. Potential of value-added speech communications by using steganography. In
B.-Y. Liao, editor, Proceedings of the 3rd International Conference on Intelligent Infor-
mation Hiding and Multimedia Signal Processing, volume 2, pages 251–254, Kaohsiung,
Taiwan, November 26–28, 2007.
[8] A. Aspect, J. Dalibard, and G. Roger. Experimental test of Bell’s inequalities using
time-varying analyzers. Physical Review Letters, 49:1804, 1982.
[9] I. Avcibas. Audio steganalysis with content-independent distortion measures. IEEE
Signal Processing Letters, 13:92–95, February 2006.
[10] I. Avcibas, M. Kharrazi, N. D. Memon, and B. Sankur. Image steganalysis with binary
similarity measures. EURASIP Journal on Applied Signal Processing, 17:2749–2757,
2005.
[11] I. Avcibas, N. D. Memon, and B. Sankur. Steganalysis using image quality metrics. In
E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security
and Watermarking of Multimedia Contents III, volume 4314, pages 523–531, San Jose,
CA, January 22–25, 2001.
[12] I. Avcibas, N. D. Memon, and B. Sankur. Image steganalysis with binary similarity
measures. In Proceedings IEEE, International Conference on Image Processing, ICIP
2002, volume 3, pages 645–648, Rochester, NY, September 22–25, 2002.
[13] F. Bach, D. Heckerman, and E. Horvitz. On the path to an ideal ROC curve: Consider-
ing cost asymmetry in learning classifiers. In R. G. Cowell and Z. Ghahramani, editors,
Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics
(AISTATS), pages 9–16. Society for Artificial Intelligence and Statistics, 2005. Available
410 REFERENCES

electronically at http://www.gatsby.ucl.ac.uk/aistats/.
[14] M. Backes and C. Cachin. Public-key steganography with active attacks. In J. Kilian,
editor, 2nd Theory of Cryptography Conference, volume 3378 of Lecture Notes in Com-
puter Science, pages 210–226, Cambridge, MA, February 10–12, 2005. Springer-Verlag,
Heidelberg.
[15] F. Bacon. Of the Advancement and Proficiencie of Learning or the Partitions of Sci-
ences, volume VI. Leon Lichfield, Oxford, for R. Young and E. Forest, 1640.
[16] M. Barni and F. Bartolini. Watermarking Systems Engineering: Enabling Digital Assets
Security and Other Applications, volume 21 of Signal Processing and Communications.
Boca Raton, FL: CRC Press, 2004.
[17] R. J. Barron, B. Chen, and G. W. Wornell. The duality between information embedding
and source coding with side information and some applications. IEEE Transactions on
Information Theory, 49(5):1159–1180, 2003.
[18] R. Bergmair. Towards linguistic steganography: A systematic investigation of ap-
proaches, systems, and issues. Final year thesis, April 2004. University of Derby,
http://bergmair.cjb.net/pub/towlingsteg-rep-inoff-a4.ps.gz.
[19] J. Bierbrauer. Introduction to Coding Theory. London: Chapman & Hall/CRC, 2004.
[20] J. Bierbrauer and J. Fridrich. Constructing good covering codes for applications in
steganography. LNCS Transactions on Data Hiding and Multimedia Security, 4920:1–
22, 2008.
[21] R. Böhme. Weighted stego-image steganalysis for JPEG covers. In K. Solanki, K. Sul-
livan, and U. Madhow, editors, Information Hiding, 10th International Workshop, vol-
ume 5284 of Lecture Notes in Computer Science, pages 178–194, Santa Barbara, CA,
June 19–21, 2007. Springer-Verlag, New York.
[22] R. Böhme. Improved Statistical Steganalysis Using Models of Heterogeneous Cover Sig-
nals. PhD thesis, Faculty of Computer Science, Technische Universität Dresden, 2008.
[23] R. Böhme and A. D. Ker. A two-factor error model for quantitative steganalysis. In
E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security,
Steganography, and Watermarking of Multimedia Contents VIII, volume 6072, pages
59–74, San Jose, CA, January 16–19, 2006.
[24] R. Böhme and A. Westfeld. Feature-based encoder classification of compressed audio
streams. ACM Multimedia System Journal, 11(2):108–120, 2005.
[25] I. A. Bolshakov. Method of linguistic steganography based on collocationally-verified
synonymy. In J. Fridrich, editor, Information Hiding, 6th International Workshop, vol-
ume 3200 of Lecture Notes in Computer Science, pages 180–191, Toronto, May 23–25,
2005. Springer-Verlag, Berlin.
[26] K. Borders and A. Prakash. Web tap: Detecting covert web traffic. In V. Atluri, B. Pfitz-
mann, and P. Drew McDaniel, editors, Proceedings 11th ACM Conference on Computer
and Communications Security (CCS), pages 110–120, Washington, DC, October 25–29,
2004.
[27] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge: Cambridge University
Press, 2004.
[28] W. S. Boyle and G. E. Smith. Charge coupled semiconductor devices. Bell Systems
Technical Journal, 49:587–593, 1970.
[29] J. Brassil, S. Low, N. F. Maxemchuk, and L. O’Gorman. Hiding information in docu-
ment images. In Proceedings of the Conference on Information Sciences and Systems,
CISS, pages 482–489, Johns Hopkins University, Baltimore, MD, March 22–24, 1995.
REFERENCES 411

[30] R. P. Brent, S. Gao, and A. G. B. Lauder. Random Krylov spaces over finite fields.
SIAM Journal of Discrete Mathematics, 16(2):276–287, 2003.
[31] D. Brewster. Microscope, volume XIV. Encyclopaedia Britannica or the Dictionary of
Arts, Sciences, and General Literature, Edinburgh, IX – Application of photography
to the microscope, 8th edition, 1857.
[32] C. W. Brown and B. J. Shepherd. Graphics File Formats. Greenwich, CT: Manning
Publications Co., 1995.
[33] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2(2):121–167, 1998.
[34] S. Cabuk, C. E. Brodley, and C. Shields. IP covert timing channels: Design and de-
tection. In V. Atluri, B. Pfitzmann, and P. Drew McDaniel, editors, Proceedings 11th
ACM Conference on Computer and Communications Security (CCS), pages 178–187,
Washington, DC, October 25–29, 2004.
[35] C. Cachin. An information-theoretic model for steganography. In D. Aucsmith, editor,
Information Hiding, 2nd International Workshop, volume 1525 of Lecture Notes in
Computer Science, pages 306–318, Portland, OR, April 14–17, 1998. Springer-Verlag,
New York.
[36] G. Cancelli and M. Barni. MPSteg-color: A new steganographic technique for color
images. In T. Furon, F. Cayre, G. Doërr, and P. Bas, editors, Information Hiding,
9th International Workshop, volume 4567 of Lecture Notes in Computer Science, pages
1–15, Saint Malo, June 11–13, 2007. Springer-Verlag, Berlin.
[37] G. Cancelli, G. Doërr, I. J. Cox, and M. Barni. A comparative study of ±1 steganalyzers.
In Proceedings IEEE International Workshop on Multimedia Signal Processing, pages
791–796, Cairns, Australia, October 8–10, 2008.
[38] G. Cancelli, G. Doërr, I. J. Cox, and M. Barni. Detection of ±1 LSB steganography
based on the amplitude of histogram local extrema. In Proceedings IEEE, International
Conference on Image Processing, ICIP 2008, pages 1288–1291, San Diego, CA, October
12–15, 2008.
[39] R. Chandramouli and N. D. Memon. A distributed detection framework for steganaly-
sis. In J. Dittmann, K. Nahrstedt, and P. Wohlmacher, editors, Proceedings of the 3rd
ACM Multimedia & Security Workshop, pages 123–126, Los Angeles, CA, November 4,
2000.
[40] R. Chandramouli and N. D. Memon. Analysis of LSB based image steganography tech-
niques. In Proceedings IEEE International Conference on Image Processing, ICIP 2001,
Thessaloniki, October 7–10, 2001. CD ROM version.
[41] M. Chapman, G. I. Davida, and M. Rennhard. A practical and effective approach to
large-scale automated linguistic steganography. In Proceedings of the 4th International
Conference on Information Security, volume 2200 of Lecture Notes in Computer Sci-
ence, pages 156–165, Malaga, October 1–3, 2001. Springer-Verlag, Berlin.
[42] B. Chen and G. Wornell. Quantization index modulation: A class of provably good
methods for digital watermarking and information embedding. IEEE Transactions on
Information Theory, 47(4):1423–1443, 2001.
[43] M. Chen, J. Fridrich, and M. Goljan. Determining image origin and integrity using
sensor noise. IEEE Transactions on Information Forensics and Security, 1(1):74–90,
March 2008.
[44] H. G. Chew, R. E. Bogner, and C. C. Lim. Dual-ν support vector machine with er-
ror rate and training size biasing. In Proceedings IEEE, International Conference on
412 REFERENCES

Acoustics, Speech, and Signal Processing, volume 2, pages 1269–1272, Salt Lake City,
UT, May 7–11, 2001.
[45] C. T. Clelland, V. Risca, and C. Bancroft. Hiding messages in DNA microdots. Nature,
399:533–534, June 10, 1999.
[46] G. D. Cohen, I. Honkala, S. Litsyn, and A. Lobstein. Covering Codes, volume 54.
Amsterdam: Elsevier, North-Holland Mathematical Library, 1997.
[47] E. Cole. Hiding in Plain Sight: Steganography and the Art of Covert Communication.
New York: Wiley Publishing Inc., 2003.
[48] S. Coll and S. B. Glasser. Terrorists turn to web as base of operations. Washington
Post, page A01, August 7, 2005.
[49] P. Comesana and F. Pérez-Gonzáles. On the capacity of stegosystems. In J. Dittmann
and J. Fridrich, editors, Proceedings of the 9th ACM Multimedia & Security Workshop,
pages 3–14, Dallas, TX, September 20–21, 2007.
[50] T. M. Cover and J. A. Thomas. Elements of Information Theory. New York: John
Wiley & Sons, Inc., 1991.
[51] I. J. Cox, M. L. Miller, J. A. Bloom, J. Fridrich, and T. Kalker. Digital Watermarking
and Steganography. Morgan Kaufman Publishers Inc., San Francisco, CA, 2007.
[52] R. Crandall. Some notes on steganography. Steganography Mailing List, available from
http://os.inf.tu-dresden.de/~westfeld/crandall.pdf, 1998.
[53] S. Craver. On public-key steganography in the presence of an active warden. In D. Auc-
smith, editor, Information Hiding, 2nd International Workshop, volume 1525 of Lecture
Notes in Computer Science, pages 355–368, Portland, OR, April 14–17, 1998. Springer-
Verlag, New York.
[54] N. Cristianini and J. Shawe-Taylor. Support Vector Machines and Other Kernel-Based
Learning Methods. Cambridge: Cambridge University Press, 2000.
[55] O. Dabeer, K. Sullivan, U. Madhow, S. Chandrasekaran, and B. S. Manjunath. De-
tection of hiding in the least significant bit. IEEE Transactions on Signal Processing,
52:3046–3058, 2004.
[56] N. Dedic, G. Itkis, L. Reyzin, and S. Russell. Upper and lower bounds on black-box
steganography. In J. Kilian, editor, Theory of Cryptography, volume 3378 of Lecture
Notes in Computer Science, pages 227–244, Cambridge, MA, February 10–12, 2005.
Springer-Verlag, London.
[57] Y. Desmedt. Subliminal-free authentication and signature. In C. G. Günther, editor,
Advances in Cryptology – EUROCRYPT ’88, Workshop on the Theory and Application
of Cryptographic Techniques, volume 330 of Lecture Notes in Computer Science, pages
22–33, Davos, May 25–27, 1988. Springer-Verlag, Berlin.
[58] J. Dittmann, T. Vogel, and R. Hillert. Design and evaluation of steganography for
voice-over-IP. In Proceedings IEEE International Symposium on Circuits and Systems
(ISCAS), Kos, May 21–24, 2006.
[59] E. R. Dougherty. Random Processes for Image and Signal Processing, volume Mono-
graph PM44. Washington, DC: SPIE Press, International Society for Optical Engineer-
ing, 1998.
[60] I. Dumer. Handbook of Coding Theory: Volume II, chapter 23, Concatenated Codes
and Their Multilevel Generalizations, pages 1911–1988. Elsevier Science, Amsterdam,
1998.
[61] S. Dumitrescu and X. Wu. LSB steganalysis based on higher-order statistics. In A. M.
Eskicioglu, J. Fridrich, and J. Dittmann, editors, Proceedings of the 7th ACM Multi-
REFERENCES 413

media & Security Workshop, pages 25–32, New York, NY, August 1–2, 2005.
[62] S. Dumitrescu, X. Wu, and N. D. Memon. On steganalysis of random LSB embedding
in continuous-tone images. In Proceedings IEEE, International Conference on Image
Processing, ICIP 2002, pages 324–339, Rochester, NY, September 22–25, 2002.
[63] S. Dumitrescu, X. Wu, and Z. Wang. Detection of LSB steganography via Sample
Pairs Analysis. In F. A. P. Petitcolas, editor, Information Hiding, 5th International
Workshop, volume 2578 of Lecture Notes in Computer Science, pages 355–372, Noord-
wijkerhout, October 7–9, 2002. Springer-Verlag, New York.
[64] J. Eggers, R. Bäuml, and B. Girod. A communications approach to steganography. In
E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security
and Watermarking of Multimedia Contents IV, volume 4675, pages 26–37, San Jose,
CA, January 21–24, 2002.
[65] T. Ernst. Schwarzweisse Magie. Der Schlüssel zum dritten Buch der Steganographia des
Trithemius. Daphnis, 25(1), 1996.
[66] J. M. Ettinger. Steganalysis and game equilibria. In D. Aucsmith, editor, Informa-
tion Hiding, 2nd International Workshop, volume 1525 of Lecture Notes in Computer
Science, pages 319–328, Portland, OR, April 14–17, 1998. Springer-Verlag, New York.
[67] H. Farid and L. Siwei. Detecting hidden messages using higher-order statistics and
support vector machines. In F. A. P. Petitcolas, editor, Information Hiding, 5th Inter-
national Workshop, volume 2578 of Lecture Notes in Computer Science, pages 340–354,
Noordwijkerhout, October 7–9, 2002. Springer-Verlag, New York.
[68] T. Filler and J. Fridrich. Binary quantization using belief propagation over factor graphs
of LDGM codes. In 45th Annual Allerton Conference on Communication, Control, and
Computing, Allerton, IL, September 26–28, 2007.
[69] T. Filler and J. Fridrich. Complete characterization of perfectly secure stegosystems
with mutually independent embedding. In Proceedings IEEE, International Conference
on Acoustics, Speech, and Signal Processing, Taipei, April 19–24, 2009.
[70] T. Filler and J. Fridrich. Fisher information determines capacity of -secure steganog-
raphy. In S. Katzenbeisser and A.-R. Sadeghi, editors, Information Hiding, 11th Inter-
national Workshop, volume 5806 of Lecture Notes in Computer Science, pages 31–47,
Darmstadt, June 7–10, 2009. Springer-Verlag, New York.
[71] T. Filler, A. D. Ker, and J. Fridrich. The Square Root Law of steganographic capacity
for Markov covers. In N. D. Memon, E. J. Delp, P. W. Wong, and J. Dittmann, editors,
Proceedings SPIE, Electronic Imaging, Security and Forensics of Multimedia XI, volume
7254, pages 08 1–08 11, San Jose, CA, January 18–21, 2009.
[72] G. Fisk, M. Fisk, C. Papadopoulos, and J. Neil. Eliminating steganography in Internet
traffic with active wardens. In F. A. P. Petitcolas, editor, Information Hiding, 5th
International Workshop, volume 2578 of Lecture Notes in Computer Science, pages
18–35, Noordwijkerhout, October 7–9, 2003. Springer-Verlag, New York.
[73] M. A. Fitch and R. E. Jamison. Minimum sum covers of small cyclic groups. Congressus
Numerantium, 147:65–81, 2000.
[74] J. D. Foley, A. van Dam, S. K. Feiner, J. F. Hughes, and R. L. Phillips. Introduction
to Computer Graphics. New York: Addison-Wesley, 1997.
[75] E. Franz. Steganography preserving statistical properties. In F. A. P. Petitcolas, ed-
itor, Information Hiding, 5th International Workshop, volume 2578 of Lecture Notes
in Computer Science, pages 278–294, Noordwijkerhout, October 7–9, 2002. Springer-
Verlag, New York.
414 REFERENCES

[76] E. Franz. Embedding considering dependencies between pixels. In E. J. Delp and P. W.


Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Forensics, Steganogra-
phy, and Watermarking of Multimedia Contents X, volume 6819, pages D 1–D 12, San
Jose, CA, January 27–31, 2008.
[77] E. Franz and A. Schneidewind. Pre-processing for adding noise steganography. In
M. Barni, J. Herrera, S. Katzenbeisser, and F. Pérez-González, editors, Information
Hiding, 7th International Workshop, volume 3727 of Lecture Notes in Computer Sci-
ence, pages 189–203, Barcelona, June 6–8, 2005. Springer-Verlag, Berlin.
[78] J. Fridrich. Feature-based steganalysis for JPEG images and its implications for future
design of steganographic schemes. In J. Fridrich, editor, Information Hiding, 6th Inter-
national Workshop, volume 3200 of Lecture Notes in Computer Science, pages 67–81,
Toronto, May 23–25, 2004. Springer-Verlag, New York.
[79] J. Fridrich. Asymptotic behavior of the ZZW embedding construction. IEEE Transac-
tions on Information Forensics and Security, 4(1):151–153, March 2009.
[80] J. Fridrich and R. Du. Secure steganographic methods for palette images. In A. Pfitz-
mann, editor, Information Hiding, 3rd International Workshop, volume 1768 of Lecture
Notes in Computer Science, pages 47–60, Dresden, September 29–October 1, 1999.
Springer-Verlag, New York.
[81] J. Fridrich, P. Lisoněk, and D. Soukal. On steganographic embedding efficiency. In J. L.
Camenisch, C. S. Collberg, N. F. Johnson, and P. Sallee, editors, Information Hiding,
8th International Workshop, volume 4437 of Lecture Notes in Computer Science, pages
282–296, Alexandria, VA, July 10–12, 2006. Springer-Verlag, New York.
[82] J. Fridrich and M. Goljan. Digital image steganography using stochastic modulation. In
E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security and
Watermarking of Multimedia Contents V, volume 5020, pages 191–202, Santa Clara,
CA, January 21–24, 2003.
[83] J. Fridrich and M. Goljan. On estimation of secret message length in LSB steganog-
raphy in spatial domain. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE,
Electronic Imaging, Security, Steganography, and Watermarking of Multimedia Con-
tents VI, volume 5306, pages 23–34, San Jose, CA, January 19–22, 2004.
[84] J. Fridrich, M. Goljan, and R. Du. Detecting LSB steganography in color and gray-scale
images. IEEE Multimedia, Special Issue on Security, 8(4):22–28, October–December
2001.
[85] J. Fridrich, M. Goljan, and R. Du. Steganalysis based on JPEG compatibility. In A. G.
Tescher, editor, Special Session on Theoretical and Practical Issues in Digital Water-
marking and Data Hiding, SPIE Multimedia Systems and Applications IV, volume 4518,
pages 275–280, Denver, CO, August 20–24, 2001.
[86] J. Fridrich, M. Goljan, and D. Hogea. Steganalysis of JPEG images: Breaking the F5
algorithm. In Information Hiding, 5th International Workshop, volume 2578 of Lec-
ture Notes in Computer Science, pages 310–323, Noordwijkerhout, October 7–9, 2002.
Springer-Verlag, New York.
[87] J. Fridrich, M. Goljan, D. Hogea, and D. Soukal. Quantitative steganalysis of digi-
tal images: Estimating the secret message length. ACM Multimedia Systems Journal,
9(3):288–302, 2003.
[88] J. Fridrich, M. Goljan, and D. Soukal. Higher-order statistical steganalysis of palette
images. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging,
Security and Watermarking of Multimedia Contents V, pages 178–190, Santa Clara,
REFERENCES 415

CA, January 21–24, 2003.


[89] J. Fridrich, M. Goljan, and D. Soukal. Perturbed quantization steganography using
wet paper codes. In J. Dittmann and J. Fridrich, editors, Proceedings of the 6th ACM
Multimedia & Security Workshop, pages 4–15, Magdeburg, September 20–21, 2004.
[90] J. Fridrich, M. Goljan, and D. Soukal. Searching for the stego key. In E. J. Delp and
P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Steganography,
and Watermarking of Multimedia Contents VI, volume 5306, pages 70–82, San Jose,
CA, January 19–22, 2004.
[91] J. Fridrich, M. Goljan, and D. Soukal. Efficient wet paper codes. In M. Barni, J. Herrera,
S. Katzenbeisser, and F. Pérez-González, editors, Information Hiding, 7th International
Workshop, Lecture Notes in Computer Science, pages 204–218, Barcelona, June 6–8,
2005. Springer-Verlag, Berlin.
[92] J. Fridrich, M. Goljan, and D. Soukal. Perturbed quantization steganography. ACM
Multimedia System Journal, 11(2):98–107, 2005.
[93] J. Fridrich, M. Goljan, and D. Soukal. Steganography via codes for memory with de-
fective cells. In 43rd Annual Allerton Conference on Communication, Control, and
Computing, Allerton, IL, September 28–30, 2005.
[94] J. Fridrich, M. Goljan, D. Soukal, and T. Holotyak. Forensic steganalysis: Determining
the stego key in spatial domain steganography. In E. J. Delp and P. W. Wong, editors,
Proceedings SPIE, Electronic Imaging, Security, Steganography, and Watermarking of
Multimedia Contents VII, volume 5681, pages 631–642, San Jose, CA, January 16–20,
2005.
[95] J. Fridrich, T. Pevný, and J. Kodovský. Statistically undetectable JPEG steganography:
Dead ends, challenges, and opportunities. In J. Dittmann and J. Fridrich, editors,
Proceedings of the 9th ACM Multimedia & Security Workshop, pages 3–14, Dallas, TX,
September 20–21, 2007.
[96] J. Fridrich and D. Soukal. Matrix embedding for large payloads. IEEE Transactions on
Information Forensics and Security, 1(3):390–394, 2006.
[97] F. Galand and C. Fontaine. How Reed–Solomon codes can improve steganographic
schemes. EURASIP Journal on Information Security, 2009. Article ID 274845,
doi:10.1155/2009/274845.
[98] F. Galand and G. Kabatiansky. Information hiding by coverings. In Proceedings IEEE,
Information Theory Workshop, ITW 2003, pages 151–154, Paris, March 31–April 4,
2003.
[99] S. I. Gel’fand and M. S. Pinsker. Coding for channel with random parameters. Problems
of Control and Information Theory, 9(1):19–31, 1980.
[100] S. Gianvecchio and H. Wang. Detecting covert timing channels: An entropy-based ap-
proach. In P. Ning, S. De Capitani di Vimercati, and P. F. Syverson, editors, Proceed-
ings 14th ACM Conference on Computer and Communication Security (CCS), pages
307–316, Alexandria, VA, October 28–31, 2007.
[101] J. Giffin, R. Greenstadt, P. Litwack, and R. Tibbetts. Covert messaging through TCP
timestamps. In R. Dingledine and P. Syverson, editors, Proceedings Privacy Enhancing
Technologies Workshop (PET), volume 2482 of Lecture Notes in Computer Science,
pages 194–208, San Francisco, CA, April 14–15, 2002. Springer-Verlag, Berlin.
[102] M. Goljan, J. Fridrich, and T. Holotyak. New blind steganalysis and its implications. In
E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security,
Steganography, and Watermarking of Multimedia Contents VIII, volume 6072, pages
1–13, San Jose, CA, January 16–19, 2006.
[103] R. L. Graham and N. J. A. Sloane. On additive bases and harmonious graphs. SIAM
Journal on Algebraic and Discrete Methods, 1(4):382–404, December 1980.
[104] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel
method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors,
Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press,
Cambridge, MA, 2007.
[105] H. Haanpää. Minimum sum and difference covers of abelian groups. Journal of Integer
Sequences, 7(2), 2004. Article 04.2.6.
[106] T. G. Handel and M. T. Stanford III. Hiding data in the OSI network model. In Infor-
mation Hiding, 1st International Workshop, volume 1174 of Lecture Notes in Computer
Science, pages 23–38, Cambridge, May 30–June 1, 1996. Springer-Verlag, Berlin.
[107] J. J. Harmsen and W. A. Pearlman. Steganalysis of additive noise modelable infor-
mation hiding. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic
Imaging, Security and Watermarking of Multimedia Contents V, volume 5020, pages
131–142, Santa Clara, CA, January 21–24, 2003.
[108] C. Heegard and A. A. El-Gamal. On the capacity of computer memory with defects.
IEEE Transactions on Information Theory, 29(5):731–739, 1983.
[109] Herodotus. The Histories. Penguin Books, London, 1996. Translated by Aubrey de
Sélincourt.
[110] S. Hetzl and P. Mutzel. A graph-theoretic approach to steganography. In J. Dittmann,
S. Katzenbeisser, and A. Uhl, editors, Communications and Multimedia Security, 9th
IFIP TC-6 TC-11 International Conference, CMS 2005, volume 3677 of Lecture Notes
in Computer Science, pages 119–128, Salzburg, September 19–21, 2005.
[111] F. S. Hill, Jr. Computer Graphics Using OpenGL. Upper Saddle River, NJ: Prentice
Hall, 2nd edition, 2000.
[112] N. J. Hopper. On steganographic chosen covertext security. In 32nd Annual Interna-
tional Colloquium on Automata, Languages and Programming, (ICALP 2005), pages
311–321, Lisbon, July 11–15, 2005.
[113] N. J. Hopper, J. Langford, and L. von Ahn. Provably secure steganography. In M. Yung,
editor, Advances in Cryptology, CRYPTO ’02, 22nd Annual International Cryptology
Conference, volume 2442 of Lecture Notes in Computer Science, pages 77–92, Santa
Barbara, CA, August 18–22, 2002. Springer-Verlag.
[114] C. Hsu and C. Lin. A comparison of methods for multi-class support vector machines.
Technical report, Department of Computer Science and Information Engineering, Na-
tional Taiwan University, Taipei, 2001.
[115] D. F. Hsu and X. Jia. Additive bases and extremal problems in groups, graphs and
networks. Utilitas Mathematica, 66:61–91, 2004.
[116] G. A. Francia III and T. S. Gomez. Steganography obliterator: An attack on the least
significant bits. In Proceedings 3rd Annual Conference on Information Security Curricu-
lum Development (InfoSecCD ’06), pages 85–91, Kennesaw State University, Kennesaw,
GA, September 22–23, 2006.
[117] D. Johnson. White King and Red Queen: How the Cold War Was Fought on the Chess-
board. Houghton Mifflin Company, Boston and New York, 2008.
[118] M. K. Johnson, S. Lyu, and H. Farid. Steganalysis of recorded speech. In E. J. Delp and
P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Steganography,
and Watermarking of Multimedia Contents VII, volume 5681, pages 664–672, San Jose,
CA, January 16–20, 2005.
[119] N. F. Johnson and S. Jajodia. Exploring steganography: Seeing the unseen. IEEE
Computer, 31:26–34, February 1998.
[120] N. F. Johnson and S. Jajodia. Steganalysis of images created using current steganogra-
phy software. In D. Aucsmith, editor, Information Hiding, 2nd International Workshop,
volume 1525 of Lecture Notes in Computer Science, pages 273–289, Portland, OR, April
14–17, 1998. Springer-Verlag, New York.
[121] N. F. Johnson and S. Jajodia. Steganalysis: The investigation of hidden information. In
Proceedings IEEE, Information Technology Conference, Syracuse, NY, September 1–3,
1998.
[122] N. F. Johnson and P. Sallee. Detection of hidden information, covert channels and
information flows. In John G. Voeller, editor, Wiley Handbook of Science Technology
for Homeland Security. New York: Wiley & Sons, Inc, April 4, 2008.
[123] S. Katzenbeisser and F. A. P. Petitcolas, editors. Information Hiding Techniques for
Steganography and Digital Watermarking. New York: Artech House, 2000.
[124] S. Katzenbeisser and F. A. P. Petitcolas. Defining security in steganographic systems.
In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security
and Watermarking of Multimedia Contents IV, volume 4675, pages 50–56, San Jose,
CA, January 21–24, 2002.
[125] S. M. Kay. Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory,
volume I. Upper Saddle River, NJ: Prentice Hall, 1998.
[126] S. M. Kay. Fundamentals of Statistical Signal Processing, Volume II: Detection Theory,
volume II. Upper Saddle River, NJ: Prentice Hall, 1998.
[127] A. D. Ker. Improved detection of LSB steganography in grayscale images. In J. Fridrich,
editor, Information Hiding, 6th International Workshop, volume 3200 of Lecture Notes
in Computer Science, pages 97–115, Toronto, May 23–25, 2004. Springer-Verlag, Berlin.
[128] A. D. Ker. A general framework for structural analysis of LSB replacement. In M. Barni,
J. Herrera, S. Katzenbeisser, and F. Pérez-González, editors, Information Hiding, 7th
International Workshop, volume 3727 of Lecture Notes in Computer Science, pages
296–311, Barcelona, June 6–8, 2005. Springer-Verlag, Berlin.
[129] A. D. Ker. Resampling and the detection of LSB matching in color bitmaps. In
E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security,
Steganography, and Watermarking of Multimedia Contents VII, volume 5681, pages
1–15, San Jose, CA, January 16–20, 2005.
[130] A. D. Ker. Steganalysis of LSB matching in grayscale images. IEEE Signal Processing
Letters, 12(6):441–444, June 2005.
[131] A. D. Ker. Fourth-order structural steganalysis and analysis of cover assumptions. In
E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security,
Steganography, and Watermarking of Multimedia Contents VIII, volume 6072, pages
25–38, San Jose, CA, January 16–19, 2006.
[132] A. D. Ker. A capacity result for batch steganography. IEEE Signal Processing Letters,
14(8):525–528, 2007.
[133] A. D. Ker. A fusion of maximal likelihood and structural steganalysis. In T. Furon,
F. Cayre, G. Doërr, and P. Bas, editors, Information Hiding, 9th International Work-
shop, volume 4567 of Lecture Notes in Computer Science, pages 204–219, Saint Malo,
June 11–13, 2007. Springer-Verlag, Berlin.
[134] A. D. Ker. Optimally weighted least-squares steganalysis. In E. J. Delp and P. W.
Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Steganography, and
Watermarking of Multimedia Contents IX, volume 6505, pages 6 1–6 16, San Jose, CA,
January 29–February 1, 2007.
[135] A. D. Ker. Steganalysis of embedding in two least significant bits. IEEE Transactions
on Information Forensics and Security, 2:46–54, 2007.
[136] A. D. Ker. The ultimate steganalysis benchmark? In J. Dittmann and J. Fridrich,
editors, Proceedings of the 9th ACM Multimedia & Security Workshop, pages 141–148,
Dallas, TX, September 20–21, 2007.
[137] A. D. Ker. Locating steganographic payload via WS residuals. In A. D. Ker,
J. Dittmann, and J. Fridrich, editors, Proceedings of the 10th ACM Multimedia &
Security Workshop, pages 27–32, Oxford, September 22–23, 2008.
[138] A. D. Ker and R. Böhme. Revisiting weighted stego-image steganalysis. In E. J. Delp
and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Forensics,
Steganography, and Watermarking of Multimedia Contents X, volume 6819, pages 5 1–5
17, San Jose, CA, January 27–31, 2008.
[139] A. D. Ker and I. Lubenko. Feature reduction and payload location with WAM steganal-
ysis. In N. D. Memon, E. J. Delp, P. W. Wong, and J. Dittmann, editors, Proceedings
SPIE, Electronic Imaging, Security and Forensics of Multimedia XI, volume 7254, pages
0A 1–0A 13, San Jose, CA, January 18–21, 2009.
[140] A. D. Ker, T. Pevný, J. Kodovský, and J. Fridrich. The Square Root Law of stegano-
graphic capacity. In A. D. Ker, J. Dittmann, and J. Fridrich, editors, Proceedings of
the 10th ACM Multimedia & Security Workshop, pages 107–116, Oxford, September
22–23, 2008.
[141] Y. Kim, Z. Duric, and D. Richards. Modified matrix encoding technique for mini-
mal distortion steganography. In J. L. Camenisch, C. S. Collberg, N. F. Johnson, and
P. Sallee, editors, Information Hiding, 8th International Workshop, volume 4437 of
Lecture Notes in Computer Science, pages 314–327, Alexandria, VA, July 10–12, 2006.
Springer-Verlag, New York.
[142] G. Kipper. Investigator’s Guide to Steganography. Boca Raton, FL: CRC Press, 2004.
[143] J. Kodovský and J. Fridrich. Influence of embedding strategies on security of stegano-
graphic methods in the JPEG domain. In E. J. Delp and P. W. Wong, editors, Proceed-
ings SPIE, Electronic Imaging, Security, Forensics, Steganography, and Watermarking
of Multimedia Contents X, volume 6819, pages 2 1–2 13, San Jose, CA, January 27–31,
2008.
[144] J. Kodovský and J. Fridrich. On completeness of feature spaces in blind steganalysis.
In A. D. Ker, J. Dittmann, and J. Fridrich, editors, Proceedings of the 10th ACM
Multimedia & Security Workshop, pages 123–132, Oxford, September 22–23, 2008.
[145] G. Kolata. A mystery unraveled, twice. The New York Times, pages F1–F6, April 14,
1998.
[146] G. Kolata. Veiled messages of terror may lurk in cyberspace. The New York Times,
October 30, 2001.
[147] N. Komaki, N. Aoki, and T. Yamamoto. A packet loss concealment technique for VoIP
using steganography. IEICE Transactions on Fundamentals of Electronics, Communi-
cations, and Computer Sciences, E86-A(8):2069–2072, 2003.
[148] O. Koval, S. Voloshynovskiy, T. Holotyak, and T. Pun. Information theoretic analysis of
steganalysis in real images. In S. Voloshynovskiy, J. Dittmann, and J. Fridrich, editors,
Proceedings of the 8th ACM Multimedia & Security Workshop, pages 11–16, Geneva,
September 26–27, 2006.
[149] C. Krätzer and J. Dittmann. Pros and cons of mel-cepstrum based audio steganalysis
using SVM classification. In T. Furon, F. Cayre, G. Doërr, and P. Bas, editors, Infor-
mation Hiding, 9th International Workshop, pages 359–377, Saint Malo, June 11–13,
2007.
[150] C. Krätzer, J. Dittmann, A. Lang, and T. Kühne. WLAN steganography: A first prac-
tical review. In S. Voloshynovskiy, J. Dittmann, and J. Fridrich, editors, Proceedings
of the 8th ACM Multimedia & Security Workshop, pages 17–22, Geneva, September
26–27, 2006.
[151] M. Kutter and F. A. P. Petitcolas. A fair benchmark for image watermarking systems.
In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security
and Watermarking of Multimedia Contents I, volume 3657, pages 226–239, San Jose,
CA, 1999.
[152] A. V. Kuznetsov and B. S. Tsybakov. Coding in a memory with defective cells. Problems
of Information Transmission, 10:132–138, 1974.
[153] T. Van Le. Efficient provably secure public key steganography. Technical report,
Florida State University, 2003. Cryptography ePrint Archive,
http://eprint.iacr.org/2003/156.
[154] T. Van Le and K. Kurosawa. Efficient public key steganography secure against
adaptively chosen stegotext attacks. Technical report, Florida State University, 2003.
Cryptography ePrint Archive, http://eprint.iacr.org/2003/244.
[155] K. Lee, C. Jung, S. Lee, and J. Lim. New steganalysis methodology: LR cube analysis
for the detection of LSB steganography. In M. Barni, J. Herrera, S. Katzenbeisser, and
F. Pérez-González, editors, Information Hiding, 7th International Workshop, volume
3727 of Lecture Notes in Computer Science, pages 312–326, Barcelona, June 6–8, 2005.
Springer-Verlag, Berlin.
[156] K. Lee and A. Westfeld. Generalized category attack – improving histogram-based
attack on JPEG LSB embedding. In T. Furon, F. Cayre, G. Doërr, and P. Bas, edi-
tors, Information Hiding, 9th International Workshop, volume 4567 of Lecture Notes
in Computer Science, pages 378–392, Saint Malo, June 11–13, 2007. Springer-Verlag,
Berlin.
[157] K. Lee, A. Westfeld, and S. Lee. Category attack for LSB embedding of JPEG images.
In Y.-Q. Shi and B. Jeon, editors, Digital Watermarking, 5th International Workshop,
volume 4283 of Lecture Notes in Computer Science, pages 35–48,
Jeju Island, November 8–10, 2006. Springer-Verlag, Berlin.
[158] X. Li, B. Yang, D. Cheng, and T. Zeng. A generalization of LSB matching. IEEE Signal
Processing Letters, 16(2):69–72, February 2009.
[159] X. Li, T. Zeng, and B. Yang. Detecting LSB matching by applying calibration technique
for difference image. In A. D. Ker, J. Dittmann, and J. Fridrich, editors, Proceedings
of the 10th ACM Multimedia & Security Workshop, pages 133–138, Oxford, September
22–23, 2008.
[160] X. Li, T. Zeng, and B. Yang. Improvement of the embedding efficiency of LSB matching
by sum and difference covering set. In Proceedings IEEE, International Conference on
Multimedia and Expo, pages 209–212, Hannover, June 23–26, 2008.
[161] Tsung-Yuan Liu and Wen-Hsiang Tsai. A new steganographic method for data hiding
in Microsoft Word documents by a change tracking technique. IEEE Transactions on
Information Forensics and Security, 2(1):24–30, March 2007.
[162] D. Llamas, C. Allison, and A. Miller. Covert channels in internet protocols: A survey.
In M. Merabti and R. Pereira, editors, Proceedings 6th Annual Postgraduate Sym-
posium about the Convergence of Telecommunications, Networking and Broadcasting
(PGNET), Liverpool, June 27–28, 2005.
[163] P. Lu, X. Luo, Q. Tang, and L. Shen. An improved sample pairs method for detection of
LSB embedding. In J. Fridrich, editor, Information Hiding, 6th International Workshop,
volume 3200 of Lecture Notes in Computer Science, pages 116–127, Toronto, May 23–
25, 2004. Springer-Verlag, Berlin.
[164] M. Luby. LT codes. In 43rd Annual IEEE Symposium on Foundations of Computer
Science, FOCS 2002, pages 271–282, Vancouver, November 16–19, 2002.
[165] N. B. Lucena, G. Lewandowski, and S. J. Chapin. Covert channels in IPv6. In
G. Danezis and D. Martin, editors, Proceedings Privacy Enhancing Technologies Work-
shop (PET), volume 3856 of Lecture Notes in Computer Science, pages 147–166,
Dubrovnik, May 30–June 1, 2006. Springer-Verlag, Berlin.
[166] A. Lysyanskaya and M. Meyerovich. Steganography with imperfect sampling. Technical
report, Brown University, 2005. Cryptography ePrint Archive,
http://eprint.iacr.org/2005/305.
[167] S. Lyu and H. Farid. Steganalysis using color wavelet statistics and one-class support
vector machines. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic
Imaging, Security, Steganography, and Watermarking of Multimedia Contents VI, vol-
ume 5306, pages 35–45, San Jose, CA, January 19–22, 2004.
[168] S. Lyu and H. Farid. Steganalysis using higher-order image statistics. IEEE Transac-
tions on Information Forensics and Security, 1(1):111–119, 2006.
[169] W. Mazurczyk and K. Szczypiorski. Steganography of VoIP streams. In Proceedings
of the 3rd International Symposium on Information Security, volume 5332 of Lecture
Notes in Computer Science, pages 1001–1018, Monterrey, Mexico, November 10–11,
2008. Springer-Verlag, Berlin.
[170] A. D. McDonald and M. G. Kuhn. StegFS: A steganographic file system for Linux.
In Information Hiding, 3rd International Workshop, volume 1768 of Lecture Notes in
Computer Science, pages 454–468, Dresden, September 29–October 1, 1999. Springer-
Verlag, Berlin.
[171] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge:
Cambridge University Press, 2003.
[172] S. Meignen and H. Meignen. On the modeling of DCT and subband image data for
compression. IEEE Transactions on Image Processing, 4(2):186–193, February 1995.
[173] Y. Miche, B. Roue, A. Lendasse, and P. Bas. A feature selection methodology for
steganalysis. In B. Günsel, A. K. Jain, A. M. Tekalp, and B. Sankur, editors, Multimedia
Content Representation, Classification and Security, International Workshop, volume
4105 of Lecture Notes in Computer Science, pages 49–56, Istanbul, September 11–13,
2006. Springer-Verlag.
[174] J. Mielikainen. LSB matching revisited. IEEE Signal Processing Letters, 13(5):285–287,
May 2006.
[175] M. K. Mihcak, I. Kozintsev, K. Ramchandran, and P. Moulin. Low-complexity image
denoising based on statistical modeling of wavelet coefficients. IEEE Signal Processing
Letters, 6(12):300–303, December 1999.
[176] D. S. Mitrinovic, J. E. Pecaric, and A. M. Fink. Classical and New Inequalities in
Analysis. Kluwer Academic Publishers, Dordrecht, 1993.
[177] I. S. Moskowitz, R. E. Newman, D. P. Crepeau, and A. R. Miller. Covert channels and
anonymizing networks. In S. Jajodia, P. Samarati, and P. F. Syverson, editors, Proceed-
ings Workshop on Privacy in the Electronic Society (WPES), pages 79–88, Washington,
DC, October 30, 2003.
[178] P. Moulin, M. K. Mihcak, and G. I. Lin. An information-theoretic model for image wa-
termarking and data hiding. In Proceedings IEEE, International Conference on Image
Processing, ICIP 2000, volume 3, pages 667–670, Vancouver, September 10–13, 2000.
[179] P. Moulin and J. A. Sullivan. Information-theoretic analysis of information hiding.
IEEE Transactions on Information Theory, 49(3):563–593, March 2003.
[180] P. Moulin and Y. Wang. New results on steganographic capacity. In Proceedings of the
Conference on Information Sciences and Systems, CISS, Princeton, NJ, March 17–19,
2004.
[181] A. Munoz and J. M. Moguerza. Estimation of high-density regions using one-class
neighbor machines. IEEE Transactions on Pattern Analysis and Machine Intelligence,
26(3):476–480, 2006.
[182] B. Murphy and C. Vogel. Statistically constrained shallow text marking: Techniques,
evaluation paradigm, and results. In E. J. Delp and P. W. Wong, editors, Proceedings
SPIE, Electronic Imaging, Security, Steganography, and Watermarking of Multimedia
Contents IX, volume 6505, pages Z 1–Z 9, San Jose, CA, January 29–February 1, 2007.
[183] H. Noda, M. Niimi, and E. Kawaguchi. Application of QIM with dead zone for his-
togram preserving JPEG steganography. In Proceedings IEEE, International Confer-
ence on Image Processing, ICIP 2005, volume II, pages 1082–1085, Genova, September
11–14, 2005.
[184] H.-O. Peitgen, H. Jürgens, and D. Saupe. Chaos and Fractals: New Frontiers of Science.
Berlin: Springer-Verlag, 1992.
[185] W. Pennebaker and J. Mitchell. JPEG: Still Image Data Compression Standard. Van
Nostrand Reinhold, New York, 1993.
[186] F. Perez-Gonzalez and S. Voloshynovskiy, editors. Fundamentals of Digital Image Wa-
termarking. New York: Wiley Blackwell, 2005.
[187] F. A. P. Petitcolas. MP3Stego software. 1998.
[188] K. Petrowski, M. Kharrazi, H. T. Sencar, and N. D. Memon. Psteg: Steganographic em-
bedding through patching. In Proceedings IEEE, International Conference on Acous-
tics, Speech, and Signal Processing, pages 537–540, Philadelphia, PA, March 18–23,
2005.
[189] T. Pevný and J. Fridrich. Multiclass blind steganalysis for JPEG images. In E. J. Delp
and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Steganogra-
phy, and Watermarking of Multimedia Contents VIII, volume 6072, pages O 1–O 13,
San Jose, CA, January 16–19, 2006.
[190] T. Pevný and J. Fridrich. Merging Markov and DCT features for multi-class JPEG ste-
ganalysis. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic Imaging,
Security, Steganography, and Watermarking of Multimedia Contents IX, volume 6505,
pages 3 1–3 14, San Jose, CA, January 29–February 1, 2007.
[191] T. Pevný and J. Fridrich. Benchmarking for steganography. In K. Solanki, K. Sullivan,
and U. Madhow, editors, Information Hiding, 10th International Workshop, volume
5284 of Lecture Notes in Computer Science, pages 251–267, Santa Barbara, CA, June
19–21, 2008. Springer-Verlag, New York.
[192] T. Pevný and J. Fridrich. Detection of double-compression for applications in steganog-
raphy. IEEE Transactions on Information Forensics and Security, 3(2):247–258, 2008.
[193] T. Pevný and J. Fridrich. Multiclass detector of current steganographic methods for
JPEG format. IEEE Transactions on Information Forensics and Security, 3(4):635–
650, December 2008.
[194] T. Pevný and J. Fridrich. Novelty detection in blind steganalysis. In A. D. Ker,
J. Dittmann, and J. Fridrich, editors, Proceedings of the 10th ACM Multimedia &
Security Workshop, pages 167–176, Oxford, September 22–23, 2008.
[195] T. Pevný, J. Fridrich, and A. D. Ker. From blind to quantitative steganalysis. In N. D.
Memon, E. J. Delp, P. W. Wong, and J. Dittmann, editors, Proceedings SPIE, Electronic
Imaging, Security and Forensics of Multimedia XI, volume 7254, pages 0C 1–0C 14,
San Jose, CA, January 18–21, 2009.
[196] A. C. Popescu. Statistical Tools for Digital Image Forensics. PhD thesis, Department
of Computer Science, Dartmouth College, 2005.
[197] S. Pradhan, J. Chou, and K. Ramchandran. Duality between source coding and channel
coding and its extension to the side information case. IEEE Transactions on Informa-
tion Theory, 49(5):1181–1203, 2003.
[198] N. Provos. Defending against statistical steganalysis. In 10th USENIX Security Sym-
posium, pages 323–335, Washington, DC, August 13–17, 2001.
[199] N. Provos and P. Honeyman. Detecting steganographic content on the internet. Tech-
nical report, 01–11, CITI, August 2001.
[200] R. Radhakrishnan, M. Kharrazi, and N. D. Memon. Data masking: A new approach
for data hiding? Journal of VLSI Signal Processing Systems, 41(3):293–303, November
2005.
[201] J. A. Reeds. Solved: The ciphers in Book III of Trithemius’s Steganographia. Cryptolo-
gia, 22:291–319, October 1998.
[202] X.-M. Ru, H.-J. Zhang, and X. Huang. Steganalysis of audio: Attacking the Steghide.
In Proceedings of the International Conference on Machine Learning and Cybernetics,
volume 7, pages 3937–3942, Guangzhou, August 18–21, 2005.
[203] J. Rutenberg. A nation challenged: Videotape. The New York Times, February 1, 2002.
[204] P. Sallee. Model-based steganography. In T. Kalker, I. J. Cox, and Y. Man Ro, editors,
Digital Watermarking, 2nd International Workshop, volume 2939 of Lecture Notes in
Computer Science, pages 154–167, Seoul, October 20–22, 2003. Springer-Verlag, New
York.
[205] P. Sallee. Model-based methods for steganography and steganalysis. International Jour-
nal of Image Graphics, 5(1):167–190, 2005.
[206] K. Sayood. Introduction to Data Compression (3rd edition). New York: Morgan Kauf-
mann, 2000.
[207] B. Schneier. Applied Cryptography. New York: John Wiley & Sons, 1996.
[208] D. Schönfeld and A. Winkler. Embedding with syndrome coding based on BCH codes.
In S. Voloshynovskiy, J. Dittmann, and J. Fridrich, editors, Proceedings of the 8th ACM
Multimedia & Security Workshop, pages 214–223, Geneva, September 26–27, 2006.
[209] T. Sharp. An implementation of key-based digital signal steganography. In I. S.
Moskowitz, editor, Information Hiding, 4th International Workshop, volume 2137 of
Lecture Notes in Computer Science, pages 13–26, Pittsburgh, PA, April 25–27, 2001.
Springer-Verlag, New York.
[210] Y. Q. Shi, C. Chen, and W. Chen. A Markov process based approach to effective
attacking JPEG steganography. In J. L. Camenisch, C. S. Collberg, N. F. Johnson,
and P. Sallee, editors, Information Hiding, 8th International Workshop, volume 4437 of
Lecture Notes in Computer Science, pages 249–264, Alexandria, VA, July 10–12, 2006.
Springer-Verlag, New York.
[211] F. Y. Shih. Digital Watermarking and Steganography: Fundamentals and Techniques.
Boca Raton, FL: CRC Press, 2007.
[212] B. Shimanovsky, J. Feng, and M. Potkonjak. Hiding data in DNA. In F. A. P. Petitcolas,
editor, Information Hiding, 5th International Workshop, volume 2578 of Lecture Notes
in Computer Science, pages 373–386, Noordwijkerhout, October 7–9, 2002. Springer-
Verlag, Berlin.
[213] M. Sidorov. Hidden Markov models and steganalysis. In J. Dittmann and J. Fridrich,
editors, Proceedings of the 6th ACM Multimedia & Security Workshop, pages 63–67,
Magdeburg, September 20–21, 2004.
[214] G. J. Simmons. The prisoner’s problem and the subliminal channel. In D. Chaum,
editor, Advances in Cryptology, CRYPTO ’83, pages 51–67, Santa Barbara, CA, August
22–24, 1983. New York: Plenum Press.
[215] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. NeuroCOLT2
Technical Report NC2-TR-1998-030, 1998.
[216] T. Sohn, J. Seo, and J. Moon. A study on the covert channel detection of TCP/IP
header using support vector machine. In S. Qing, D. Gollmann, and J. Zhou, editors,
Proceedings of the 5th International Conference on Information and Communications
Security, volume 2836 of Lecture Notes in Computer Science, pages 313–324, Huhe-
haote, October 10–13, 2003. Springer-Verlag, Berlin.
[217] K. Solanki, A. Sarkar, and B. S. Manjunath. YASS: Yet another steganographic scheme
that resists blind steganalysis. In T. Furon, F. Cayre, G. Doërr, and P. Bas, editors,
Information Hiding, 9th International Workshop, volume 4567 of Lecture Notes in Com-
puter Science, pages 16–31, Saint Malo, June 11–13, 2007. Springer-Verlag, New York.
[218] K. Solanki, K. Sullivan, U. Madhow, B. S. Manjunath, and S. Chandrasekaran. Provably
secure steganography: Achieving zero K–L divergence using statistical restoration. In
Proceedings IEEE, International Conference on Image Processing, ICIP 2006, pages
125–128, Atlanta, GA, October 8–11, 2006.
[219] A. Somekh-Baruch and N. Merhav. On the capacity game of public watermarking
systems. IEEE Transactions on Information Theory, 50(3):511–524, 2004.
[220] D. Soukal, J. Fridrich, and M. Goljan. Maximum likelihood estimation of secret message
length embedded using ±k steganography in spatial domain. In E. J. Delp and P. W.
Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Steganography, and
Watermarking of Multimedia Contents VII, volume 5681, pages 595–606, San Jose,
CA, January 16–20, 2005.
[221] M. R. Spiegel. Schaum’s Outline of Theory and Problems of Statistics. McGraw-Hill,
New York, 3rd edition, 1961.
[222] M. Stamm and K. J. Ray Liu. Blind forensics of contrast enhancement in digital images.
In Proceedings IEEE, International Conference on Image Processing, ICIP 2008, pages
3112–3115, San Diego, CA, October 12–15, 2008.
[223] I. Steinwart. On the influence of the kernel on the consistency of support vector ma-
chines. Journal of Machine Learning Research, 2:67–93, 2001. Available electronically
at http://www.jmlr.org/papers/volume2/steinwart01a/steinwart01a.ps.gz.
[224] G. W. W. Stevens. Microphotography – Photography and Photofabrication at Extreme
Resolutions. London, Chapman & Hall, 1968.
[225] K. Sullivan, U. Madhow, B. S. Manjunath, and S. Chandrasekaran. Steganalysis for
Markov cover data with applications to images. IEEE Transactions on Information
Forensics and Security, 1(2):275–287, June 2006.
[226] A. Tacticus. How to Survive Under Siege/Aineas the Tactician. Oxford: Clarendon
Ancient History Series, 1990.
[227] C. M. Taskiran, U. Topkara, M. Topkara, and E. J. Delp. Attacks on lexical natural
language steganography systems. In E. J. Delp and P. W. Wong, editors, Proceedings
SPIE, Electronic Imaging, Security, Steganography, and Watermarking of Multimedia
Contents VIII, volume 6072, pages 97–105, San Jose, CA, January 16–19, 2006.
[228] D. S. Taubman and M. W. Marcellin. JPEG 2000 Image Compression Fundamentals,
Standards, and Practices. Kluwer Academic Publishers, Boston, MA, 2002.
[229] J. Taylor and A. Verbyla. Joint modeling of location and scale parameters of the t-
distribution. Statistical Modeling, 4:91–112, 2004.
[230] J. Tobin and R. Dobard. Hidden in Plain View: The Secret Story of Quilts and the
Underground Railroad. Doubleday, New York, 1999.
[231] B. S. Tsybakov. Defect and error correction. Problemy Peredachi Informatsii, 11:21–30,
July–September 1975. Translated from Russian.
[232] R. Tzschoppe, R. Bäuml, J. B. Huber, and A. Kaup. Steganographic system based
on higher-order statistics. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE,
Electronic Imaging, Security and Watermarking of Multimedia Contents V, volume
5020, pages 156–166, Santa Clara, CA, January 21–24, 2003.
[233] M. van Dijk and F. Willems. Embedding information in grayscale images. In Proceedings
of the 22nd Symposium on Information and Communication Theory, pages 147–154,
Enschede, May 15–16, 2001.
[234] L. von Ahn and N. Hopper. Public-key steganography. In C. Cachin and J. Camenisch,
editors, Advances in Cryptology – EUROCRYPT 2004, International Conference on the
Theory and Applications of Cryptographic Techniques, volume 3027 of Lecture Notes
in Computer Science, pages 323–341, Interlaken, May 2–6, 2004. Springer-Verlag,
Heidelberg.
[235] Y. Wang and P. Moulin. Steganalysis of block-structured stegotext. In E. J. Delp and
P. W. Wong, editors, Proceedings SPIE, Electronic Imaging, Security, Steganography,
and Watermarking of Multimedia Contents VI, volume 5306, pages 477–488, San Jose,
CA, January 19–22, 2004.
[236] Y. Wang and P. Moulin. Statistical modelling and steganalysis of DFT-based image
steganography. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE, Electronic
Imaging, Security, Steganography, and Watermarking of Multimedia Contents VIII,
volume 6072, pages 2 1–2 11, San Jose, CA, January 16–19, 2006.
[237] Y. Wang and P. Moulin. Perfectly secure steganography: Capacity, error exponents,
and code constructions. IEEE Transactions on Information Theory, Special Issue on
Security, 55(6):2706–2722, June 2008.
[238] P. Wayner. Mimic functions. Cryptologia, 16(3):193–214, July 1992.
[239] P. Wayner. Disappearing Cryptography. Morgan Kaufmann, San Francisco, CA, 2nd
edition, 2002.
[240] J. Werner. Optimization – Theory and Applications. Braunschweig: Vieweg, 1984.
[241] A. Westfeld. High capacity despite better steganalysis (F5 – a steganographic algo-
rithm). In I. S. Moskowitz, editor, Information Hiding, 4th International Workshop,
volume 2137 of Lecture Notes in Computer Science, pages 289–302, Pittsburgh, PA,
April 25–27, 2001. Springer-Verlag, New York.
[242] A. Westfeld. Detecting low embedding rates. In F. A. P. Petitcolas, editor, Informa-
tion Hiding, 5th International Workshop, volume 2578 of Lecture Notes in Computer
Science, pages 324–339, Noordwijkerhout, October 7–9, 2002. Springer-Verlag, Berlin.
[243] A. Westfeld. Space filling curves in steganalysis. In E. J. Delp and P. W. Wong, editors,
Proceedings SPIE, Electronic Imaging, Security, Steganography, and Watermarking of
Multimedia Contents VII, volume 5681, pages 28–37, San Jose, CA, January 16–20,
2005.
[244] A. Westfeld. Generic adoption of spatial steganalysis to transformed domain. In
K. Solanki, K. Sullivan, and U. Madhow, editors, Information Hiding, 10th Interna-
tional Workshop, volume 5284 of Lecture Notes in Computer Science, pages 161–177,
Santa Barbara, CA, June 19–21, 2008. Springer-Verlag, New York.
[245] A. Westfeld and R. Böhme. Exploiting preserved statistics for steganalysis. In
J. Fridrich, editor, Information Hiding, 6th International Workshop, volume 3200 of
Lecture Notes in Computer Science, pages 82–96, Toronto, May 23–25, 2004. Springer-
Verlag, Berlin.
[246] A. Westfeld and A. Pfitzmann. Attacks on steganographic systems. In A. Pfitzmann,
editor, Information Hiding, 3rd International Workshop, volume 1768 of Lecture Notes
in Computer Science, pages 61–75, Dresden, September 29–October 1, 1999. Springer-
Verlag, New York.
[247] E. H. Wilkins. A History of Italian Literature. Oxford University Press, London, 1954.
[248] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes. North-
Holland, Amsterdam, 1977.
[249] P. W. Wong, H. Chen, and Z. Tang. On steganalysis of plus–minus one embedding
in continuous-tone images. In E. J. Delp and P. W. Wong, editors, Proceedings SPIE,
Electronic Imaging, Security, Steganography, and Watermarking of Multimedia Con-
tents VII, volume 5681, pages 643–652, San Jose, CA, January 16–20, 2005.
[250] F. B. Wrixon. Codes, Ciphers and Other Cryptic and Clandestine Communication. Black
Dog & Leventhal Publishers, New York, 1998.
[251] Z. Wu and W. Yang. G.711-based adaptive speech information hiding approach. In
D.-S. Huang, K. Li, and G. W. Irwin, editors, Proceedings of the International
Conference on Intelligent Computing, volume 4113 of Lecture Notes in Computer Sci-
ence, pages 1139–1144, Kunming, August 16–19, 2006. Springer-Verlag, Berlin.
[252] G. Xuan, Y. Q. Shi, J. Gao, D. Zou, C. Yang, Z. Z. P. Chai, C. Chen, and W. Chen. Ste-
ganalysis based on multiple features formed by statistical moments of wavelet charac-
teristic functions. In M. Barni, J. Herrera, S. Katzenbeisser, and F. Pérez-González, ed-
itors, Information Hiding, 7th International Workshop, volume 3727 of Lecture Notes in
Computer Science, pages 262–277, Barcelona, June 6–8, 2005. Springer-Verlag, Berlin.
[253] C. Yang, F. Liu, X. Luo, and B. Liu. Steganalysis frameworks of embedding in multiple
least-significant bits. IEEE Transactions on Information Forensics and Security, 3:662–
672, 2008.
[254] R. Zamir, S. Shamai, and U. Erez. Nested linear/lattice codes for structured multiter-
minal binning. IEEE Transactions on Information Theory, 48(6):1250–1276, 2002.
[255] T. Zhang and X. Ping. A fast and effective steganalytic technique against Jsteg-like
algorithms. In Proceedings of the ACM Symposium on Applied Computing, pages 307–
311, Melbourne, FL, March 9–12, 2003.
[256] T. Zhang and X. Ping. A new approach to reliable detection of LSB steganography in
natural images. Signal Processing, 83(10):2085–2094, October 2003.
[257] W. Zhang, X. Zhang, and S. Wang. Maximizing steganographic embedding efficiency
by combining Hamming codes and wet paper codes. In K. Solanki, K. Sullivan, and
U. Madhow, editors, Information Hiding, 10th International Workshop, volume 5284
of Lecture Notes in Computer Science, pages 60–71, Santa Barbara, CA, June 19–21,
2008. Springer-Verlag, New York.
[258] X. Zhang, W. Zhang, and S. Wang. Efficient double-layered steganographic embedding.
Electronics Letters, 43:482–483, April 2007.
Plate section

Steganography in Digital Media: Principles, Algorithms, and Applications
Jessica Fridrich. Cambridge University Press.
Online ISBN: 9781139192903. Hardback ISBN: 9780521190190.
Plate 1: Fig. 2.3
Plate 2: Fig. 2.5
Plate 3: Fig. 3.7
Plate 4: Fig. 3.8
Plate 5: Fig. 5.1
Plate 6: Fig. 5.5
Plate 7: Fig. 5.6
Plate 8: Fig. 5.7, panels (a)–(d)