JESSICA FRIDRICH
Binghamton University, State University of New York (SUNY)
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521190190
© Cambridge University Press 2010
A catalogue record for this publication is available from the British Library
Time will bring to light whatever is hidden; it will cover up and conceal what is
now shining in splendor.
Preface page xv
Acknowledgments xxiii
1 Introduction 1
1.1 Steganography throughout history 3
1.2 Modern steganography 7
1.2.1 The prisoners’ problem 9
1.2.2 Steganalysis is the warden’s job 10
1.2.3 Steganographic security 11
1.2.4 Steganography and watermarking 12
Summary 13
4 Steganographic channel 47
4.1 Steganography by cover selection 50
4.2 Steganography by cover synthesis 51
4.3 Steganography by cover modification 53
Summary 56
Exercises 57
5 Naive steganography 59
5.1 LSB embedding 60
5.1.1 Histogram attack 64
5.1.2 Quantitative attack on Jsteg 66
5.2 Steganography in palette images 68
5.2.1 Embedding in palette 68
5.2.2 Embedding by preprocessing palette 69
5.2.3 Parity embedding in sorted palette 70
5.2.4 Optimal-parity embedding 72
5.2.5 Adaptive methods 73
5.2.6 Embedding while dithering 75
Summary 76
Exercises 76
6 Steganographic security 81
6.1 Information-theoretic definition 82
6.1.1 KL divergence as a measure of security 83
6.1.2 KL divergence for benchmarking 85
6.2 Perfectly secure steganography 88
6.2.1 Perfect security and compression 89
6.2.2 Perfect security with respect to model 91
6.3 Secure stegosystems with limited embedding distortion 92
6.3.1 Spread-spectrum steganography 93
6.3.2 Stochastic quantization index modulation 95
6.3.3 Further reading 97
6.4 Complexity-theoretic approach 98
6.4.1 Steganographic security by Hopper et al. 100
6.4.2 Steganographic security by Katzenbeisser and Petitcolas 101
6.4.3 Further reading 102
Summary 103
Exercises 103
10 Steganalysis 193
10.1 Typical scenarios 194
10.2 Statistical steganalysis 195
10.2.1 Steganalysis as detection problem 196
10.2.2 Modeling images using features 196
10.2.3 Optimal detectors 197
10.2.4 Receiver operating characteristic (ROC) 198
10.3 Targeted steganalysis 201
10.3.1 Features 201
10.3.2 Quantitative steganalysis 205
10.4 Blind steganalysis 207
10.4.1 Features 208
10.4.2 Classification 209
10.5 Alternative use of blind steganalyzers 211
10.5.1 Targeted steganalysis 211
10.5.2 Multi-classification 211
10.5.3 Steganography design 212
10.5.4 Benchmarking 212
10.6 Influence of cover source on steganalysis 212
10.7 System attacks 215
10.8 Forensic steganalysis 217
Summary 218
Exercises 219
A Statistics 293
A.1 Descriptive statistics 293
A.1.1 Measures of central tendency and spread 294
A.1.2 Construction of PRNGs using compounding 296
A.2 Moment-generating function 297
A.3 Jointly distributed random variables 299
A.4 Gaussian random variable 302
A.5 Multivariate Gaussian distribution 303
A.6 Asymptotic laws 305
A.7 Bernoulli and binomial distributions 306
A.8 Generalized Gaussian, generalized Cauchy, Student’s t-distributions 307
A.9 Chi-square distribution 310
A.10 Log–log empirical cdf plot 310
Preface
steganalysis for digital media files. Even though this field is still developing at a
fast pace and many fundamental questions remain unresolved, the foundations
have been laid and basic principles established. This book was written to provide
the reader with the basic philosophy and building blocks from which many prac-
tical steganographic and steganalytic schemes are constructed. The selection of
the material presented in this book represents the author’s view of the field and
is by no means an exhaustive survey of steganography in general. The selected
examples from the literature were included to illustrate the basic concepts and
provide the reader with specific technical solutions. Thus, any omissions in the
references should not be interpreted as indications regarding the quality of the
omitted work.
This book was written as a primary text for a graduate or senior undergradu-
ate course on steganography. It can also serve as a supporting text for virtually
any course dealing with aspects of media security, privacy, and secure commu-
nication. The research problems presented here may be used as motivational
examples or projects to illustrate concepts taught in signal detection and esti-
mation, image processing, and communication. The author hopes that the book
will also be useful to researchers and engineers actively working in multimedia
security and assist those who wish to enter this beautiful and rapidly evolving
multidisciplinary field in their search for open and relevant research topics.
The text naturally evolved from lecture notes for a graduate course on
steganography that the author has taught at Binghamton University, New York
for several years. This pedigree influenced the presentation style of this book as
well as its layout and content. The author tried to make the material as self-
contained as possible within reasonable limits. Steganography is built upon the
pillars of information theory, estimation and detection theory, coding theory,
and machine learning. The book contains five appendices that cover all topics
in these areas that the reader needs to become familiar with to obtain a firm
grasp of the material. The prerequisites for this book are truly minimalistic and
consist of college-level calculus and probability and statistics.
Each chapter starts with simple reasoning aimed at provoking readers to think
on their own and thus better see the need for the content that follows. The
introduction of every chapter and section is written in a narrative style meant
to provide the big picture before detailed technical arguments are presented. The
overall structure of the book and numerous cross-references help those who wish
to read just selected chapters. To aid the reader in implementing the techniques,
most algorithms described in this book are accompanied by pseudo-code.
Furthermore, practitioners will likely appreciate experiments on real media files
that demonstrate the performance of the techniques in real life. The lessons
learned serve as motivation for subsequent sections and chapters. In order to
make the book accessible to a wide spectrum of readers, most technical arguments
are presented in their simplest core form rather than the most general fashion,
while referring the interested reader to literature for more details. Each chapter is
closed with a brief summary that highlights the most important facts. Readers
can test their newly acquired knowledge on carefully chosen exercises placed
at the end of the chapters. More involved exercises are supplied with hints or
even a brief sketch of the solution. Instructors are encouraged to choose selected
exercises as homework assignments.

Cover type     Count   Share
Images          1689   56.1%
Audio            445   14.8%
Disk space       416   13.8%
Text             255    8.5%
Video             86    2.8%
Other files      81    2.7%
Network          39    1.3%

Number of steganographic software applications that can hide data in electronic media
as of June 2008. Adapted from [122] and reprinted with permission of John Wiley &
Sons, Inc.
All concepts and methods presented in this book are illustrated on the ex-
ample of digital images. There are several valid reasons for this choice. First
and foremost, digital images are by far the most common type of media for
which steganographic applications are currently available. Furthermore, many
basic principles and methodologies can be readily extended from images to other
digital media, such as video and audio. It is also considerably easier to explain
the perceptual impact of modifying an image rather than an audio clip sim-
ply because images can be printed on paper. Lastly, when compared with other
digital objects, the field of image steganography and steganalysis is by far the
most advanced today, with numerous techniques available for most typical image
formats.
The first chapter contains a brief historical narrative that starts with the rather
amusing ancient methods, continues with more advanced ideas for data hiding in
written documents as well as techniques used by spies during times of war, and
ends with modern steganography in digital files. By introducing three fictitious
characters, prisoners Alice and Bob and warden Eve, we informally describe
secure steganographic communication as the famous prisoners’ problem in which
Alice and Bob try to secretly communicate without arousing the suspicion of Eve,
who is eagerly eavesdropping. These three characters will be used in the book
to make the language more accessible and a little less formal when explaining
technical aspects of data-hiding methods. The chapter is closed with a section
that highlights the differences between digital watermarking and steganography.
Knowing how visual data is represented in a computer is a necessary prereq-
uisite to understand the technical material in this book. Chapter 2 first explains
basic color models used for representing color in a computer. Then, we describe
the structure of the most common raster, palette, and transform image formats,
including the JPEG. The description of each format is supplied with instruc-
tions on how to work with such images in Matlab to give the reader the ability
to conveniently implement most of the methods described in this book.
Since the majority of digital images are obtained using a digital camera, cam-
corder, or scanner, Chapter 3 deals with the process of digital image acquisition
through an imaging sensor. Throughout the chapter, emphasis is given to those
aspects of this process that are relevant to steganography. This includes the
processing pipeline inside typical digital cameras and sources of noise and im-
perfections. Noise is especially relevant to steganography because the seemingly
useless stochastic components of digital images could conceivably convey secret
messages.
In Chapter 4, we delve deeper into the subject of steganography. Three basic
principles for constructing steganographic methods are introduced: steganogra-
phy by cover selection, cover synthesis, and cover modification. Even though the
focus of this book is on data-hiding methods that embed secret messages by
slightly modifying the original (cover) image, all three principles can be used
to build steganographic methods in practice. This chapter also introduces ba-
sic terminology and key building blocks that form the steganographic channel
– the source of cover objects, source of secret messages and secret keys, the
data-hiding and data-extraction algorithms, and the physical channel itself. The
physical properties of the channel are determined by the actions of the warden
Eve, who can position herself as a passive observer or as someone actively
involved with the flow of data through the channel. Discussions throughout the
chapter pave the way towards the information-theoretic definition of stegano-
graphic security given in Chapter 6.
The content of Chapter 5 was chosen to motivate the reader to ask basic ques-
tions about what it means to undetectably embed secret data in an image and
to illustrate various (and sometimes unexpected) difficulties one might run into
when attempting to realize some intuitive hiding methods. The chapter contains
examples of some early naive steganographic methods for the raster, palette,
and JPEG formats, most of which use some version of the least-significant-bit
(LSB) embedding method. The presentation of each method continues with crit-
ical analysis of how the steganographic method can be broken and why. The
author hopes that this early exposure of specific embedding methods will make
the reader better understand the need for a rather precise technical approach in
the remaining chapters.
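To give a taste of the methods analyzed there, naive LSB embedding can be sketched in a few lines of Python (our own illustrative sketch, not the book's pseudo-code; practical schemes select the embedding path pseudo-randomly from the stego key rather than sequentially):

```python
def lsb_embed(cover, bits):
    """Replace the least-significant bit of the first len(bits) samples
    of an 8-bit cover signal with the message bits."""
    stego = list(cover)
    for i, b in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | b  # clear the LSB, then set it to b
    return stego

def lsb_extract(stego, n_bits):
    """Read the message back as the LSBs of the first n_bits samples."""
    return [s & 1 for s in stego[:n_bits]]

pixels = [12, 200, 37, 64]            # 8-bit grayscale samples
stego = lsb_embed(pixels, [1, 0, 1])  # changes each sample by at most 1
print(stego)                          # [13, 200, 37, 64]
print(lsb_extract(stego, 3))          # [1, 0, 1]
```

Because flipping an LSB changes a sample value by at most one, the embedding is visually imperceptible; yet, as the critical analysis in Chapter 5 shows, it leaves statistically detectable traces.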
Chapter 6 introduces the central concept, which is a formal information-
theoretic definition of security in steganography based on the Kullback–Leibler
divergence between the distributions of cover and stego objects. This definition
puts steganography on a firm mathematical ground that allows methodological
development by studying security with respect to a cover model. The concept of
security is further explained by showing connections between security and detec-
tion theory and by providing examples of undetectable steganographic schemes
built using the principles outlined in Chapter 4. We also introduce the concept
computed from image noise residuals. This steganalyzer is also used to demon-
strate how much statistical detectability in practice depends on the source of
cover images.
Chapter 13 discusses the most fundamental problem of steganography, which
is the issue of computing the largest payload that can be securely embedded in
an image. Two very different concepts are introduced – the steganographic ca-
pacity and secure payload. Steganographic capacity is the largest rate at which
perfectly secure communication is possible. It is not a property of one specific
steganographic scheme but rather a maximum taken over all perfectly secure
schemes. In contrast, secure payload is defined as the number of bits that can be
communicated at a given security level using a specific imperfect steganographic
scheme. The secure payload grows only with the square root of the number of pix-
els in the image. This so-called square-root law is experimentally demonstrated
on a specific steganographic scheme that embeds bits in the JPEG domain. The
secure payload is more relevant to practitioners because all practical stegano-
graphic schemes that hide messages in real digital media are not likely to be
perfectly secure and thus fall under the square-root law.
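The scaling just described can be illustrated with a short worked example (the numbers below are hypothetical; the proportionality constant depends on the scheme, the cover source, and the chosen security level):

```latex
% Square-root law: for an imperfect scheme, the secure payload of an
% n-pixel image grows as
M(n) = r\sqrt{n},
% where r is a constant determined by the scheme, the cover source, and
% the security level. With a hypothetical r = 10\ \mathrm{bits/pixel^{1/2}}:
M(10^6) = 10\sqrt{10^6} = 10^4\ \text{bits}, \qquad
M(4\cdot 10^6) = 10\sqrt{4\cdot 10^6} = 2\cdot 10^4\ \text{bits}.
% Quadrupling the number of pixels only doubles the secure payload.
```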
To make this text self-contained, five appendices accompany the book. Their
style and content are fully compatible with the rest of the book in the sense
that the student does not need any more prerequisites than a basic knowledge
of calculus and statistics. The author anticipates that students not familiar with
certain topics will find it convenient to browse through the appendices and either
refresh their knowledge or learn about certain topics in an elementary fashion
accessible to a wide audience.
Appendix A contains the basics of descriptive statistics, including statistical
moments, the moment-generating function, robust measures of central tendency
and spread, asymptotic laws, and description of some key statistical distributions,
such as the Bernoulli, binomial, Gaussian, multivariate Gaussian, generalized
Gaussian, and generalized Cauchy distributions, Student’s t-distribution, and
the chi-square distribution.
As some of the chapters rely on basic knowledge of information theory, Ap-
pendix B covers selected key concepts of entropy, conditional entropy, joint en-
tropy, mutual information, lossless compression, and KL divergence and some
of its key properties, such as its relationship to hypothesis testing and Fisher
information.
The theory of linear codes over finite fields is the subject of Appendix C. The
reader is introduced to the basic concepts of a generator and parity-check matrix,
covering radius, average distance to code, sphere-covering bound, orthogonality,
dual code, systematic form of a code, cosets, and coset leaders.
Appendix D contains elements of signal detection and estimation. The author
explains the Neyman–Pearson and Bayesian approach to hypothesis testing, the
concepts of a receiver-operating-characteristic (ROC) curve, the deflection coef-
ficient, and the connection between hypothesis testing and Fisher information.
The appendix continues with composite hypothesis testing, the chi-square test,
and the locally most powerful detector. The topics of estimation theory covered
in the appendix include the Cramér–Rao lower bound, least-squares estimation,
maximum-likelihood and maximum a posteriori estimation, and the Wiener filter.
The appendix is closed with the Cauchy–Schwarz inequality in Hilbert spaces
with inner product, which is needed for proofs of some of the propositions in this
book.
Readers not familiar with support vector machines (SVMs) will find Ap-
pendix E especially useful. It starts with the formulation of a binary classification
problem and introduces linear support vector machines as a classification tool.
Linear SVMs are then progressively generalized to non-separable problems and
then put into kernelized form as typically used in practice. The weighted form
of SVMs is described as well because it is useful to achieve a trade-off between
false alarms and missed detections and for drawing an ROC curve. The appendix
also explains practical issues with data preprocessing and training SVMs that
one needs to be aware of when using SVMs in applications, such as in blind
steganalysis.
Because the focus of this book is strictly on steganography in digital sig-
nals, methods for covert communication in other objects are not covered. In-
stead, the author refers the reader to other publications. In particular, lin-
guistic steganography and data-hiding aspects of some cryptographic applica-
tions are covered in [238, 239]. The topic of covert channels in natural lan-
guage is also covered in [18, 25, 41, 161, 182, 227]. A comprehensive bibli-
ography of all articles published on covert communication in linguistic struc-
tures, including watermarking applications, is maintained by Bergmair at
http://semantilog.ucam.org/biblingsteg/. Topics dealing with steganography in
Internet protocols are studied in [106, 162, 163, 165, 177, 216]. Covert timing
channels and their security are covered in [26, 34, 100, 101]. The intriguing
topic of steganography in Voice over IP applications, such as Skype, appears
in [6, 7, 58, 147, 150, 169, 251]. Steganographic file systems [4, 170] are useful
tools to thwart “rubber-hose attacks” on cryptosystems when a person is coerced
to reveal encryption keys after encrypted files have been found on a computer
system. A steganographic file system allows the user to plausibly deny that en-
crypted files reside on the disk. In-depth analysis of current steganographic soft-
ware and the topics of data hiding in elements of operating systems are provided
in [142]. Finally, the topics of audio steganography and steganalysis appeared
in [9, 24, 118, 149, 187, 202].
Acknowledgments
I would like to acknowledge the role of several individuals who helped me
commit to writing this book. First and foremost, I am indebted to Richard
Simard for encouraging me to enter the field of steganography and for support-
ing research on steganography. This book would not have materialized without
the constant encouragement of George Klir and Monika Fridrich. Finally, the
privilege of co-authoring a book with Ingemar Cox [51] provided me with energy
and motivation I would not have been able to find otherwise.
Furthermore, I am happy to acknowledge the help of my PhD students for
their kind assistance that made the process of preparing the manuscript in TEX
a rather pleasant experience instead of the nightmare that would for sure have
followed if I had been left alone with a TEX compiler. In particular, I am im-
mensely thankful to TEX guru Tomáš Filler for his truly significant help with
formatting the manuscript, preparing the figures, and proof-reading the text,
to Tomáš Pevný for contributing material for the appendix on support vector
machines, and to Jan Kodovský for help with combing the citations and proof-
reading. I would also like to thank Ellen Tilden and my students from the ECE
562 course on Fundamentals of Steganography, Tony Nocito, Dae Kim, Zhao Liu,
Zhengqing Chen, and Ran Ren, for help with sanitizing this text to make it as
free of typos as possible.
Discussions with my colleagues, Andrew D. Ker, Miroslav Goljan, Andreas
Westfeld, Rainer Böhme, Pierre Moulin, Neil F. Johnson, Scott Craver, Patrick
Bas, Teddy Furon, and Xiaolong Li were very useful and helped me clarify some
key technical issues. The encouragement I received from Mauro Barni, Deepa
Kundur, Slava Voloshynovskiy, Jana Dittmann, Gaurav Sharma, and Chet Hos-
mer also helped with shaping the final content of the manuscript. Special thanks
are due to George Normandin and Jim Moronski for their feedback and many
useful discussions about imaging sensors and to Josef Sofka for providing a pic-
ture of a CCD sensor. A special acknowledgement goes to Binghamton University
Art Director David Skyrca for the beautiful cover design.
Finally, I would like to thank Nicole and Kathy Fridrich for their patience and
for helping me to get into the mood of sharing.
1 Introduction
A woman named Alice sends the following e-mail to her friend Bob, with whom
she shares an interest in astronomy:
My friend Bob,
until yesterday I was using binoculars for stargazing. Today, I decided to try my new
telescope. The galaxies in Leo and Ursa Major were unbelievable! Next, I plan to check
out some nebulas and then prepare to take a few snapshots of the new comet. Although
I am satisfied with the telescope, I think I need to purchase light pollution filters to
block the xenon lights from a nearby highway to improve the quality of my pictures.
Cheers,
Alice.
At first glance, this letter appears to be a conversation between two avid ama-
teur astronomers. Alice seems to be excited about her new telescope and eagerly
shares her experience with Bob. In reality, however, Alice is a spy and Bob is her
superior awaiting critical news from his secret agent. To avoid drawing unwanted
attention, they decided not to use cryptography to communicate in secrecy. In-
stead, they agreed on another form of secret communication – steganography.
Upon receiving the letter from Alice, Bob suspects that Alice might be using
steganography and decides to follow a prearranged protocol. Bob starts by listing
the first letters of all words from Alice's letter and obtains the following
sequence:

MfBuyIwubfsTIdttmntTgiLaUMwuNIptcosnatpttafsotncAIaswttItIntplpftbtxlfanhtitqompCA.

He then writes down the decimal digits of

π = 3.141592653589793 . . .

and reads the message from the extracted sequence of letters by putting down
the third letter in the sequence, then the next first letter, the next fourth letter,
etc. The resulting message is
buubdlupnpsspx.
Finally, Bob replaces each letter with the letter that precedes it in the alphabet
and deciphers the secret message
attack tomorrow.
Let us take a look at the tasks that Alice needs to carry out to communicate
secretly with Bob. She first encrypts her message by substituting each letter with
the one that follows it in the English alphabet (e.g., a is substituted with b, b
with c, . . . , and z with a). Note that this simple substitution cipher could be
replaced by a more secure encryption algorithm if desired. Then, Alice needs
to write an almost arbitrary but meaningful(!) letter while making sure that
the words whose location is determined by the digits of π start with the letters
of the encrypted message. Of course, instead of the decimal expansion of π,
Alice and Bob could have agreed on a different integer sequence, such as one
generated from a pseudo-random number generator seeded with a shared key.
The shared information that determines the location of the message letters is
called the steganographic key or stego key. Without knowing this key, it is not
only difficult to read the message but also difficult for an eavesdropper to prove
that the text contains a secret message.
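The extraction side of this protocol is mechanical enough to sketch in a few lines of Python (our own illustration, not code from the book; the cover text is Alice's letter quoted above, and the key is the sequence of decimal digits of π):

```python
# Alice's letter (the cover text) and the shared stego key.
COVER = ("My friend Bob, until yesterday I was using binoculars for "
         "stargazing. Today, I decided to try my new telescope. The galaxies "
         "in Leo and Ursa Major were unbelievable! Next, I plan to check out "
         "some nebulas and then prepare to take a few snapshots of the new "
         "comet. Although I am satisfied with the telescope, I think I need "
         "to purchase light pollution filters to block the xenon lights from "
         "a nearby highway to improve the quality of my pictures. "
         "Cheers, Alice.")
PI_DIGITS = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7]

def extract(cover, key):
    """Collect the first letters of all words; each key digit tells how far
    to advance along that sequence before reading the next letter."""
    initials = [word[0] for word in cover.split()]
    pos, cipher = -1, []
    for digit in key:
        pos += digit
        cipher.append(initials[pos].lower())
    return "".join(cipher)

def decrypt(cipher):
    """Undo the substitution cipher: shift every letter back by one."""
    return "".join(chr((ord(c) - ord("a") - 1) % 26 + ord("a")) for c in cipher)

cipher = extract(COVER, PI_DIGITS)
print(cipher)           # buubdlupnpsspx
print(decrypt(cipher))  # attacktomorrow (the space is not encoded)
```

Note that only the extraction needs to be this simple; Alice's embedding step, composing a natural-sounding letter under these constraints, is the hard part.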
Note that the hidden message is unrelated to the content of the letter, which
only serves as a decoy or “cover” to hide the very fact that a secret message is
being sent. In fact, this is the defining property of steganography.
The word steganography is a composite of the Greek words steganos, which means
“covered,” and graphia, which means “writing.” In other words, steganography is
the art of concealed communication where the very existence of a message is se-
cret. The term steganography was used for the first time by Johannes Trithemius
(1462–1516) in his Polygraphia and in the trilogy Steganographia (see Figure 1.1).
While the first two volumes described ancient methods for encoding messages
(cryptography), the third volume (1499) appeared to deal with occult powers,
black magic, and methods for communication with spirits. The volume was pub-
lished in Frankfurt in 1606 and in 1609 the Catholic Church put it on the list of
“libri prohibiti” (forbidden books). Soon, scholars began suspecting that the book
was a code and attempted to decipher the mystery. Efforts to decode the book’s
secret message came to a successful end in 1996 and 1998 when two researchers in-
dependently [65, 201] revealed the hidden messages encoded in numbers through
several look-up tables included in the book [145]. The messages turned out to
be quite mundane. The first one was the Latin equivalent of “The quick brown
fox jumps over the lazy dog,” which is a sentence that contains every letter of
the alphabet. The second message was: “The bearer of this letter is a rogue and
a thief. Guard yourself against him. He wants to do something to you.” Finally,
the third was the start of the 21st Psalm.
The first written evidence about steganography being used to send messages
is due to Herodotus [109], who tells of a slave sent by his master, Histiæus, to
the Ionian city of Miletus with a secret message tattooed on his scalp. After
the tattooing of the message, the slave grew his hair back in order to conceal
the message. He then traveled to Miletus and, upon arriving, shaved his head
to reveal the message to the city’s regent, Aristagoras. The message encouraged
Aristagoras to start a revolt against the Persian king.
Herodotus also documented the story of Demeratus, who used steganography
to alert Sparta about the planned invasion of Greece by the Persian Great King
Xerxes. To conceal his message, Demeratus scraped the wax off the surface of
a wooden writing tablet, scratched the message into the wood, and then coated
the tablet with a fresh layer of wax to make it appear to be a regular blank
writing tablet that could be safely carried to Sparta without arousing suspicion.
Aeneas the Tactician [226] is credited with inventing many ingenious stegano-
graphic techniques, such as hiding messages in women’s earrings or using pigeons
to deliver secret messages. Additionally, he described some simple methods for
hiding messages in text by modifying the height of letter strokes or by marking
letters in a text using small holes.
Hiding messages in text is called linguistic steganography or acrostics. Acros-
tics was a very popular ancient steganographic method. To embed a unique
“signature” in their work, some poets encoded secret messages as initial letters
of sentences or successive tercets in a poem. One of the best-known examples
Figure 1.1 The title page of Steganographia by Johannes Trithemius, the inventor of the
word “steganography.” Reproduced by kind permission of the Syndics of Cambridge
University Library.
This way, the message could be extracted even from printed or photocopied
documents.
In 1857, Brewster [31] proposed a very ingenious technique that was actually
used in several wars in the nineteenth and twentieth centuries. The idea is to
shrink the message so much that it starts resembling specks of dirt but can still
be read under high magnification. The technological obstacles to using this idea
in practice were overcome by the French photographer Dagron, who developed
technology for shrinking text to microscopic dimensions. Such small objects could
be easily hidden in nostrils, ears, or under fingernails [224]. In World War I, the
Germans used such “microdots” hidden in corners of postcards slit open with a
knife and resealed with starch. The modern twentieth-century microdots could
hold up to one page of text and even contain photographs. The Allies discovered
the usage of microdots in 1941. A modern version of the concept of the microdot
was recently proposed for hiding information in DNA for the purpose of tagging
important genetic material [45, 212]. Microdots in the form of dust were also
recently proposed to identify car parts [1].
Perhaps the best-known form of steganography is writing with invisible ink.
The first invisible inks were organic liquids, such as milk, urine, vinegar, diluted
honey, or sugar solution. Messages written with such ink were invisible once
the paper had dried. To make them perceptible, the letter was simply heated
up above a candle. Later, more sophisticated versions were invented by replacing
the message-extraction algorithm with safer alternatives, such as using ultraviolet
light.
In 1966, an inventive and impromptu steganographic method enabled a pris-
oner of war, Commander Jeremiah Denton, to secretly communicate one word
when he was forced by his Vietnamese captors to give an interview on TV. Know-
ing that he could not say anything critical of his captors, as he spoke, he blinked
his eyes in Morse code, spelling out T-O-R-T-U-R-E.
Steganography became the subject of a dispute during the match between
Viktor Korchnoi and Anatoly Karpov for the World Championship in chess in
1978 [117]. During one of the games, Karpov’s assistants handed him a tray with
yogurt. This was technically against the rules, which prohibited contact between
the player and his team during play. The head of Korchnoi’s delegation, Petra
Leeuwerik, immediately protested, arguing that Karpov’s team could be passing
him secret messages. For example, a violet yogurt could mean that Karpov should
offer a draw, while a sliced mango could inform the player that he should decline a
draw. The time of serving the food could also be used to send additional messages
(steganography in timing channels). This protest, which was a consequence of
the extreme paranoia that dominated chess matches during the Cold War, was
taken quite seriously. The officials limited Karpov to consumption of only one
type of yogurt (violet) at a fixed time during the game. Using the terminology of
this book, we can interpret this protective measure as an act of an active warden
to prevent usage of steganography.
In the 1990s, the story of a “quilt code” allegedly used in the Underground
Railroad surfaced in the media. The Underground Railroad appeared sponta-
neously as a clandestine network of secret pathways and safe houses that helped
black slaves in the USA escape from slavery during the first part of the nine-
teenth century. According to the story told by a South Carolina woman named
Ozella Williams [230], people sympathetic to the cause displayed quilts on their
fences to non-verbally inform the escapees about the direction of their journey
or which action they should take next. The messages were supposedly hidden
in the geometrical patterns commonly found in American patchwork quilts (see
Figure 1.2). Since it was common to air quilts on fences, the master or mistress
would not be suspicious about the quilts being on display.
The recent explosion of interest in steganography is due to a rather sudden
and widespread use of digital media as well as the rapid expansion of the Internet
(Figure 1.3 shows the annual count of research articles on the subject of steganog-
raphy published by the IEEE). It is now a common practice to share pictures,
video, and sound with our friends and family. Such objects provide a very favor-
able environment for concealing secret messages for one good reason: typical dig-
ital media files consist of a large number of individual samples (e.g., pixels) that
can be imperceptibly modified to encode a secret message. And there is no need to
develop technical expertise for those who wish to use steganography because the
hiding process itself can be carried out by a computer program that anyone can
download from the Internet for free. As of writing this book in late 2008, one can
Figure 1.3 The growth of the field is witnessed by the number of articles annually
published by IEEE that contain the keywords “steganography” or “steganalysis.”
Figure 1.4 The number of newly released steganographic software applications or new
versions per year. Adapted from [122] and reprinted with permission of John Wiley &
Sons, Inc.
is intercepted, even though the content of the message is protected, the fact that
the subjects are communicating secretly is obvious. In some situations, it may be
important to avoid drawing attention and instead embed sensitive data in other
objects so that the fact that secret information is being sent is not obvious in
the first place. This is the approach taken by steganography.
Every steganographic system discussed in this book consists of two basic com-
ponents – the embedding and extraction algorithms. The embedding algorithm
accepts three inputs – the secret message to be communicated, the secret shared
key that controls the embedding and extraction algorithms, and the cover ob-
ject , which will be modified to convey the message. The output of the embedding
algorithm is called the stego object. When the stego object is presented as an
input to the message-extraction algorithm, it produces the secret message.
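The two components can be sketched in code. The following toy scheme (an illustration, not any particular method from this book) embeds message bits as least-significant bits of pixels visited along a pseudorandom path seeded by the shared stego key; the extractor regenerates the same path from the key:

```python
import random

def embed(cover, key, message_bits):
    """Embedding algorithm: cover object + key + message -> stego object."""
    stego = list(cover)
    # The key seeds a PRNG that selects the pseudorandom pixel path.
    path = random.Random(key).sample(range(len(cover)), len(message_bits))
    for pos, bit in zip(path, message_bits):
        stego[pos] = (stego[pos] & ~1) | bit  # overwrite the LSB with a message bit
    return stego

def extract(stego, key, n_bits):
    """Extraction algorithm: stego object + key -> message."""
    path = random.Random(key).sample(range(len(stego)), n_bits)
    return [stego[pos] & 1 for pos in path]
```

Running `extract(embed(cover, key, bits), key, len(bits))` recovers `bits` exactly, while each pixel of the cover changes by at most one intensity level.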
Steganography offers a feasible alternative to encryption in oppressive regimes
where using cryptography might attract unwanted attention or in countries where
the use of cryptography is legally prohibited. An interesting documented use of
steganography was presented at the 4th International Workshop on Information
Hiding [209]. Two subjects developed a steganographic scheme of their own to
hide messages in uncompressed digital images and then used it successfully for
several years when one of them was residing in a hostile country that explicitly
prohibited use of encryption. The reason for their paranoia was a story told by
their friend who already resided in the area, who had tried to send an encrypted
e-mail only to have it returned to him by the local Internet service provider with
the message appended, “Please, don’t send encrypted emails – we can’t read
them.”
The actions of an active warden will likely inform Alice and Bob that they
are under surveillance. Instead of actively blocking the covert communication
channel, Eve may decide not to intervene at all and instead try to extract the
messages to learn about the prisoners’ escape plans. Effort directed towards
extracting the secret message belongs to the field of forensic steganalysis. If Eve
is successful and gains access to the stego (encryption) key, a host of other options
opens up for her. She can now be devious and impersonate the prisoners to trick
them into revealing more secrets. Such a warden is called malicious.
should be compliant with those of other images produced by her camera. Thus,
any automatic steganalysis system only inspecting statistical properties of im-
ages would label the image as compliant with the legitimate use of the channel.
A human warden will, of course, have full access to the communicated message.
It is intuitively clear that Alice and Bob can increase their sense of security
and decrease the chance of being caught by Eve if they communicate only very
short messages. This would, however, make their communication less efficient
and quite likely impractical. Therefore, Alice and Bob need to know how large a
message they can hide in a given object without introducing artifacts that would
trigger Eve’s detector. The size of the critical message is called the steganographic
capacity. The research in steganography focuses on design of algorithms that
permit sending messages that are as long as possible without making the stego
objects statistically distinguishable from cover objects.
Summary
• Steganography is the practice of communicating a secret message by hiding it
in a cover object.
• Steganography is usually described via the prisoners’ problem, in which two
prisoners, Alice and Bob, want to hatch an escape plan but their communication
is monitored by the warden (Eve), who will cut the communication once
she suspects a covert exchange of data.
• The most important property of steganography is statistical undetectability,
which means that it should be impossible for Eve to prove the existence of a
secret message in a cover. Statistically undetectable steganographic schemes
are called secure.
• A warden who merely observes the traffic between Alice and Bob is called
passive. An active or malicious warden tampers with the communication in
order to prevent the prisoners from using steganography or to trick them into
revealing their communication.
• Digital watermarking is a data-hiding application that is related to steganography
but is fundamentally quite different. While in steganography the secret
message usually has no relationship to the cover object, which plays the role
of a mere decoy, watermarks typically supply additional information about
the cover. Moreover, and most importantly, watermarks do not have to be
embedded undetectably.
2 Digital image formats
Digital images are commonly represented in four basic formats – raster, palette,
transform, and vector. Each representation has its advantages and is suitable
for certain types of visual information. Likewise, when Alice and Bob design
their steganographic method, they need to consider the unique properties of
each individual format. This chapter explains how visual data is represented and
stored in several common image formats, including raster and palette formats,
and the most popular format in use today, the JPEG. The material included in
this chapter was chosen for its relevance to applications in steganography and
is thus necessarily somewhat limited. The topics covered here form the minimal
knowledge base the reader needs to become familiar with. Those with sufficient
background may skip this chapter entirely and return to it later on an as-needed
basis. An excellent and detailed exposition of the theory of color models and their
properties can be found in [74]. A comprehensive description of image formats
appears in [32].
In Section 2.1, the reader is first introduced to the basic concept of color
as perceived by humans and then learns how to represent color quantitatively
using several different color models. Section 2.2 provides details of the processing
needed to represent a natural image in the raster (BMP, TIFF) and palette
formats (GIF, PNG). Section 2.3 is devoted to the popular transform-domain
format JPEG, which is the most common representation of natural images today.
For all three formats, the reader is instructed how to work with such images in
Matlab.
blue light. The cones that register the blue light have the smallest sensitivity
to light intensity, while the cones that respond to green light have the highest
sensitivity. Electrical signals produced by the cones are fed to the brain, allowing
us to perceive color. This is the tristimulus theory of color perception.
This theory leads to the so-called additive color model. According to this
model, any color is obtained as a linear combination of three basic colors (or
color channels) – red, green, and blue. Denoting the amount of each color as
R, G, and B, where each number is from the interval [0, 1] (zero intensity to
full intensity), each color can be represented as a three-dimensional vector in the
RGB color cube (R, G, B) ∈ [0, 1]3 . Hardware systems that emit light are usually
modeled as additive. For example, old computer monitors with the Cathode-
Ray Tube (CRT) screens create colors by combining three RGB phosphors on
the screen. Liquid-Crystal Display (LCD) panels combine the light from three
adjacent pixels. Full intensity of all three colors is perceived as white, while low
intensity in all is perceived as dark or black.
The subtractive color model is used for hardware devices that create colors
by absorption of certain wavelengths rather than emission of light. A good ex-
ample of a subtractive color device is a printer. The standard basic colors for
subtractive systems are, by convention, cyan, magenta, and yellow, leading to
color representation using the vector CMY. These three colors are obtained by
removing from white the colors red, green, and blue, respectively. The CMY sys-
tem is augmented with a fourth color, black (abbreviated as K) to improve the
printing contrast and save on color toners.
The following relationship holds between the additive RGB and subtractive
CMY systems:
C = 1 − R, (2.1)
M = 1 − G, (2.2)
Y = 1 − B. (2.3)
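Equations (2.1)–(2.3) translate directly into code; a minimal sketch, with channel values in [0, 1]:

```python
def rgb_to_cmy(r, g, b):
    # C = 1 - R, M = 1 - G, Y = 1 - B  (equations (2.1)-(2.3))
    return (1 - r, 1 - g, 1 - b)

def cmy_to_rgb(c, m, y):
    # The transform is an involution, so the inverse has the same form.
    return (1 - c, 1 - m, 1 - y)
```

For example, pure red (1, 0, 0) maps to (0, 1, 1): printing red requires full magenta and yellow ink but no cyan.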
Although the additive RGB color model describes the colors perceivable by
humans quite well, it is redundant because the three signals are highly correlated
among themselves and is thus not the most economical for transmission. A very
popular color system is the YUV model originally developed for transmission of
color TV signals. The requirement of backward compatibility with old black-and-
white TVs led the designers to form the color TV signal as luminance augmented
with chrominance signals. The reader is forewarned that from now on the letter Y
will always stand for luminance and not yellow as in the CMY(K) color system.
The luminance Y is defined as a weighted linear combination of the RGB
channels with weights determined by the sensitivity of the human eye to the
three RGB colors,

Y = 0.299R + 0.587G + 0.114B. (2.4)
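A one-line sketch of the luminance computation, assuming the standard Rec. 601 broadcast-TV weights (0.299, 0.587, 0.114); note that the green channel, to which the eye is most sensitive, receives the largest weight:

```python
def luminance(r, g, b):
    # Rec. 601 weights (assumed here); green dominates, blue contributes least.
    return 0.299 * r + 0.587 * g + 0.114 * b
```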
Table 2.1. Typical color bit depth for various applications. For nc ≥ 8, the sampling of
color images is in bits per color channel.
nc Colors Application
the JPEG format. Table 2.2 shows the color bit depth allowed by each format.
For palette formats, the bit depth represents the range of palette indices.
The majority of readers would probably agree that the most intuitive way to
represent natural images in a computer is to sample the colors on a sufficiently
dense rectangular grid. This approach also nicely plays into how digital images
are usually acquired through an imaging sensor (Chapter 3). Images stored in
such spatial-domain formats form very large files that allow the steganographer
to hide relatively large messages. In this section, we describe the details of the
raster and palette representations.
and PNG. Their color sampling is shown in Table 2.2. These formats may use
lossless compression [206] to provide a smaller file size. For example, BMP may
use runlength compression (optionally), PNG uses DEFLATE, while TIFF allows
multiple different compression schemes.
In this book, a grayscale image in raster format will be represented us-
ing an M × N matrix of integers from the range {0, . . . , 2nc − 1}, where
typically nc = 8. A true-color BMP image will be represented with three
such matrices. The Matlab command for importing a BMP image is X
= imread(’my_image.bmp’). Alternatively, saving a uint8 matrix of in-
tegers X to a BMP image is obtained using the command imwrite(X,
’my_stego_image.bmp’, ’bmp’). The reader is urged to check the Matlab help
facility for the list of formats supported by his/her version of Matlab.
[Diagram: during error diffusion, the truncation error e[i, j] of the current pixel x[i, j] of the original 24-bit image is added to the next pixel x[i, j + 1] in the scan.]
In this algorithm, the pixel colors are truncated to the closest palette color,
and, at the same time, the next pixel to be visited is modified by the truncation
error at the current pixel. If a pixel is truncated, say, to a color that is less red,
a small amount of red is added to the next pixel to locally preserve the overall
color balance in the image.
This process can be further refined by spreading the truncation error among
more pixels. We just need to make sure that the pixels are spatially close to
the current pixel and that no pixels are modified that have already been visited.
Weights can be assigned as fixed numbers or random variables with sum equal
to 1 across all pixels receiving a portion of the truncation error.
One of the most popular dithering algorithms is Floyd–Steinberg dithering.
To explain this algorithm, we now represent the pixels in the image as a two-
dimensional array and assume that the dithering algorithm starts in the upper
y[i, j] = c[k] where c[k] is the closest color to x[i, j], (2.13)
e[i, j] = x[i, j] − y[i, j], (2.14)
x[i, j + 1] = x[i, j + 1] + α0,1 e[i, j], (2.15)
x[i + 1, j + 1] = x[i + 1, j + 1] + α1,1 e[i, j], (2.16)
x[i + 1, j] = x[i + 1, j] + α1,0 e[i, j], (2.17)
x[i + 1, j − 1] = x[i + 1, j − 1] + α1,−1 e[i, j]. (2.18)
Typical values of the coefficients are α0,1 = 7/16, α1,1 = 1/16, α1,0 = 5/16, and
α1,−1 = 3/16. The dithering process basically arranges for a trade-off between lim-
ited color resolution and spatial resolution. Because human eyes have the ability
to integrate colors in a small patch when looking from a distance, dithering al-
lows us to perceive new shades of color not present in the palette. The dithering
process introduces characteristic structures or patterns (noisiness) that may be-
come visible in areas of small color gradient. The stochastic error spread helps
by breaking any regular dithering patterns that may otherwise arise, thereby
creating a more visually pleasing image. As an example, in Figure 2.3 we show a
magnified portion of a true-color image after storing it as a GIF image with 256
colors in the palette obtained using the median-cut algorithm (the color image is
displayed in Plate 1). The colors were dithered using Floyd–Steinberg dithering.
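The update rules (2.13)–(2.18) can be sketched as follows for the simplest case, a grayscale image quantized to the two-color "palette" {0, 255} (an illustrative sketch, not the book's implementation; color images apply the same diffusion per channel):

```python
def floyd_steinberg(img):
    """img: list of rows of floats in [0, 255]; returns a dithered copy."""
    x = [row[:] for row in img]
    h, w = len(x), len(x[0])
    for i in range(h):
        for j in range(w):
            old = x[i][j]
            new = 255.0 if old >= 128 else 0.0   # closest palette color (2.13)
            x[i][j] = new
            e = old - new                        # truncation error (2.14)
            # Spread e to not-yet-visited neighbors, (2.15)-(2.18),
            # with the typical weights 7/16, 1/16, 5/16, 3/16.
            if j + 1 < w:                x[i][j + 1]     += e * 7 / 16
            if i + 1 < h and j + 1 < w:  x[i + 1][j + 1] += e * 1 / 16
            if i + 1 < h:                x[i + 1][j]     += e * 5 / 16
            if i + 1 < h and j > 0:      x[i + 1][j - 1] += e * 3 / 16
    return x
```

On a constant mid-gray input the output is a mix of black and white pixels whose average stays close to the input value, which is exactly the trade-off between color resolution and spatial resolution described above.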
The Image Processing Toolbox of Matlab offers a number of routines that make
working with GIF images in Matlab very easy. A GIF image can be read using
the command [Ind, Map] = imread(’my_image.gif’). The variable Map is an
n × 3 double array of palette colors consisting of n ≤ 256 colors. Each row of Map
is one palette color in the RGB format scaled so that R, G, B ∈ [0, 1]. The variable
Ind is the array of indices to the palette of the same dimensions as the image.
Modified versions of both arrays can be used to write the modified image to disk
as a GIF file using the command imwrite(Ind, Map, ’my_stego_image.gif’,
’gif’).
Figure 2.3 Magnified portion of a true-color image (left) and the same portion after
storing the image as GIF (right). A color version of this figure appears in Plate 1.
3. DCT transform. The Y Cr Cb signals from each block are transformed from
the spatial domain to the frequency domain with the DCT. The DCT can be
thought of as a change of basis representing 8 × 8 matrices.
4. Quantization. The resulting transform coefficients are quantized by dividing
them by an integer value (quantization step) and rounded to the nearest in-
teger. The luminance and chrominance signals may use different quantization
tables. Larger values of the quantization steps produce a higher compression
ratio but introduce more perceptual distortion.
5. Encoding and lossless compression. The quantized DCT coefficients are
arranged in a zig-zag order, encoded using bits, and then losslessly compressed
using Huffman or arithmetic coding. The resulting bit stream is prepended
with a header and stored with the extension ’.jpg’ or ’.jpeg.’ For applications
in steganography, it will not be necessary to understand the details of this
last step.
padded parts are not displayed. Also, before applying the DCT, all pixel values
are shifted by subtracting 128 from them.
The DCT of an 8 × 8 block B of (shifted) pixel values is

d[k, l] = Σ_{i,j=0..7} f[i, j; k, l] B[i, j]  (2.19)
        = (w[k]w[l]/4) Σ_{i,j=0..7} cos[(π/16)k(2i + 1)] cos[(π/16)l(2j + 1)] B[i, j],  (2.20)

where f[i, j; k, l] = (w[k]w[l]/4) cos[(π/16)k(2i + 1)] cos[(π/16)l(2j + 1)], w[0] = 1/√2,
w[k > 0] = 1. The coefficient d[0, 0] is called the DC coefficient (or the DC term),
while the remaining coefficients with k + l > 0 are called AC coefficients.
The DCT is invertible and the inverse transform (IDCT) is
B[i, j] = Σ_{k,l=0..7} (w[k]w[l]/4) cos[(π/16)k(2i + 1)] cos[(π/16)l(2j + 1)] d[k, l].  (2.21)
The fact that the transform involves real numbers rather than integers increases
its complexity and memory requirements, which could be an issue for mobile
electronic devices, such as digital cameras or cell phones. Fortunately, the JPEG
format allows various implementations of the transform that work only with
integers and are thus much faster and more easily implemented in hardware.
The fact that there exist various implementations of the transform means that
one image could be compressed to several slightly different JPEG files. In fact,
this difference may not be that small for some images and may influence the
statistical distribution of DCT coefficients [95].
The DCT can be interpreted as a change of basis in the vector space of all 8 × 8
matrices, where the sum of matrices and multiplication by a scalar are defined in
the usual elementwise manner and the dot product between matrices X and Y
is X · Y = Σ_{i,j=0..7} X[i, j]Y[i, j]. For a fixed pair (k, l), we call the 8 × 8 matrix
f [i, j; k, l] the (k, l)th basis pattern. All 64 such patterns, depicted in Figure 2.4,
form an orthonormal system because

Σ_{i,j=0..7} f[i, j; k, l] f[i, j; k′, l′] = δ(k − k′)δ(l − l′) for all k, k′, l, l′ ∈ {0, . . . , 7},  (2.22)
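The transform pair (2.19)–(2.21) can be coded directly from the basis patterns f[i, j; k, l]; a straightforward, unoptimized sketch (real codecs use fast integer approximations, as noted above):

```python
import math

def w(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0

def f(i, j, k, l):
    """Value of the (k, l)th basis pattern at pixel (i, j)."""
    return (w(k) * w(l) / 4) * math.cos(math.pi / 16 * k * (2 * i + 1)) \
                             * math.cos(math.pi / 16 * l * (2 * j + 1))

def dct2(B):
    """Forward DCT (2.19)-(2.20) of an 8x8 block."""
    return [[sum(f(i, j, k, l) * B[i][j] for i in range(8) for j in range(8))
             for l in range(8)] for k in range(8)]

def idct2(d):
    """Inverse DCT (2.21)."""
    return [[sum(f(i, j, k, l) * d[k][l] for k in range(8) for l in range(8))
             for j in range(8)] for i in range(8)]
```

Because the basis patterns are orthonormal, `idct2(dct2(B))` reproduces `B` up to floating-point rounding.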
Figure 2.4 All 64 orthonormal basis patterns used in JPEG compression. Below, an
example of an expansion of a pattern into a linear combination of basis patterns. Image
provided courtesy of Andreas Westfeld.
2.3.3 Quantization
The purpose of quantization is to enable representation of DCT coefficients using
fewer bits, which necessarily results in loss of information. During quantization,
the DCT coefficients d[k, l] are divided by quantization steps from the quantiza-
tion matrix Q[k, l] and rounded to integers
D[k, l] = round(d[k, l]/Q[k, l]),  k, l ∈ {0, . . . , 7},  (2.24)
Figure 2.5 A magnified portion of a true-color image in BMP format (left) and the same
portion after compressing with JPEG quality factor qf = 20 (right). A color version of this
figure appears in Plate 2.
Q_qf = max{1, round(2 Q50 (1 − qf /100))},  qf > 50,
Q_qf = min{255 · 1, round(Q50 · 50/qf )},  qf ≤ 50,  (2.25)

with the operations applied elementwise; 1 denotes the 8 × 8 matrix of ones.
where the 50% quality standard JPEG quantization matrix (for the luminance
component Y ) is
Q50^(lum) =
  16  11  10  16   24   40   51   61
  12  12  14  19   26   58   60   55
  14  13  16  24   40   57   69   56
  14  17  22  29   51   87   80   62
  18  22  37  56   68  109  103   77
  24  35  55  64   81  104  113   92
  49  64  78  87  103  121  120  101
  72  92  95  98  112  100  103   99    (2.26)
The chrominance quantization matrices are obtained using the same mechanism
with the 50% quality chrominance quantization matrix,
Q50^(chr) =
  17  18  24  47  99  99  99  99
  18  21  26  66  99  99  99  99
  24  26  56  99  99  99  99  99
  47  66  99  99  99  99  99  99
  99  99  99  99  99  99  99  99
  99  99  99  99  99  99  99  99
  99  99  99  99  99  99  99  99
  99  99  99  99  99  99  99  99    (2.27)
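The quality-factor scaling (2.25) applied to the 50%-quality luminance table can be sketched as follows (Python's `round` may differ from a particular JPEG implementation in half-way cases):

```python
# The standard 50%-quality luminance quantization table, Q50 in (2.26).
Q50_LUM = [
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
]

def quant_table(qf, Q50=Q50_LUM):
    """Elementwise scaling of Q50 to quality factor qf, as in (2.25)."""
    if qf > 50:
        scale = lambda q: max(1, round(2 * q * (1 - qf / 100)))
    else:
        scale = lambda q: min(255, round(q * 50 / qf))
    return [[scale(q) for q in row] for row in Q50]
```

At qf = 50 the table is reproduced unchanged, at qf = 100 every step equals 1 (essentially lossless up to rounding), and lower quality factors yield larger steps and coarser quantization.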
We will denote the (k, l)th DCT coefficient in the bth block as D[k, l, b], b ∈
{1, . . . , NB }, where NB is the number of all 8 × 8 blocks in the image. Note
that for a color image, there will be three such three-dimensional arrays, one for
luminance and two for the chrominance signals. The pair (k, l) ∈ {0, . . . , 7} ×
{0, . . . , 7} is called the spatial frequency (or mode) of the DCT coefficient.
Before the DCT coefficients are encoded using bits and losslessly compressed,
the blocks are ordered from the upper left corner to the bottom right corner and
the individual coefficients from each block are arranged by scanning the block
using a zig-zag scan that starts at the spatial frequency (0, 0) and proceeds
towards (7, 7).
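The zig-zag order can be generated by walking the anti-diagonals k + l = 0, 1, . . . , 14 and alternating direction; a small illustrative sketch:

```python
def zigzag_order(n=8):
    """Return the (k, l) pairs of an n x n block in zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):            # s indexes the anti-diagonal k + l = s
        diag = [(k, s - k) for k in range(n) if 0 <= s - k < n]
        # Alternate the traversal direction on consecutive anti-diagonals.
        order.extend(diag if s % 2 else diag[::-1])
    return order
```

The scan starts (0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), . . . and ends at (7, 7), so low-frequency coefficients (which are rarely zero) come first and the long runs of zeros at high frequencies compress well.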
2.3.4 Decompression
The decompression works in the opposite order. After reading the quantized DCT
blocks from the JPEG file, each block of quantized DCT coefficients D is multi-
plied by the quantization matrix Q, d̃[k, l] = Q[k, l]D[k, l], k, l ∈ {0, . . . , 7}, and
the inverse DCT is applied to the 8 × 8 matrix d̃. The values are finally rounded
to integers and truncated to a finite dynamic range (usually {0, . . . , 255}). The
block of decompressed pixel values B̃ is thus
-1 6 3 -1 0 0 0 0 -16 90 37 -17 -1 -2 -2 -1
4 1 -4 -1 0 0 0 0 63 10 -46 -14 12 0 0 2
0 -1 0 0 0 0 0 0 -2 -9 -5 12 4 -5 -2 1
0 0 0 0 0 0 0 0 1 -3 -2 0 -3 -1 1 1
0 0 0 0 0 0 0 0 0 -2 -1 -1 0 1 1 -1
0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0
0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 -1 0 1 0 0 0 0
Figure 2.6 An 8 × 8 luminance block of pixels and its quantized DCT coefficients for
JPEG quality factor qf = 20 (left) and qf = 90 (right).
B̃ = trunc(round(IDCT(d̃))),  (2.29)
where IDCT(·) is the inverse DCT (2.21) and trunc(x) is the operation of truncat-
ing integers to a finite dynamic range (trunc(x) = x for x ∈ [0, 255], trunc(x) = 0
for x < 0, and trunc(x) = 255 for x > 255). Due to quantization, rounding, and
truncation, B̃ will in general differ from the original block B.
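The per-coefficient operations of quantization (2.24), dequantization, and the truncation in (2.29) can be sketched as follows (the IDCT step is as in (2.21) and omitted here):

```python
def quantize(d, Q):
    """D[k, l] = round(d[k, l] / Q[k, l]), equation (2.24)."""
    return [[round(d[k][l] / Q[k][l]) for l in range(8)] for k in range(8)]

def dequantize(D, Q):
    """d~[k, l] = Q[k, l] * D[k, l], the first decompression step."""
    return [[Q[k][l] * D[k][l] for l in range(8)] for k in range(8)]

def trunc(x):
    """Clip a rounded pixel value to the dynamic range {0, ..., 255}."""
    return min(255, max(0, x))
```

Quantizing and dequantizing moves each coefficient to the nearest multiple of its quantization step, so the reconstruction error per coefficient is at most Q[k, l]/2; this is the source of the difference between B̃ and B noted above.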
Figure 2.7 Histogram of luminance DCT coefficients for the image shown in Figure 5.1
stored as 95% quality JPEG.
Figure 2.8 Histogram of luminance DCT coefficients from the image shown in Figure 5.1
compressed as 95% quality JPEG and the generalized Gaussian fit.
% Read the DCT coefficient arrays and quantization tables using the
% JPEG Matlab Toolbox; jpeg_read returns a structure describing the file.
im=jpeg_read(’color_image.jpg’);
Lum=im.coef_arrays{im.comp_info(1).component_id};
ChromCr=im.coef_arrays{im.comp_info(2).component_id};
ChromCb=im.coef_arrays{im.comp_info(3).component_id};
Lum_quant_table=im.quant_tables{im.comp_info(1).quant_tbl_no};
Chrom_quant_table=im.quant_tables{im.comp_info(2).quant_tbl_no};
...
% Write a modified structure back to disk as a JPEG file.
jpeg_write(im_stego,’my_stego_image.jpg’);
Summary
• Human eyes are sensitive to electromagnetic radiation in the range of approximately
380 nm to 750 nm and each color is uniquely captured by the spectral
power within this range.
Exercises
2.1 [Palette ordering] Write a Matlab routine that orders a palette (repre-
sented as an n × 3 array) by luminance. Use the conversion formula (2.4) for the
ordering.
2.2 [Draw palette] Write a routine that displays the palette colors of a
GIF image as 16 × 16 little squares arranged by rows (from left to right, top to
bottom) into a square pattern. Inspect Figure 5.5 or Plate 6 for an example of
the output.
2.4 [JPEG histogram] Using the JPEG Matlab Toolbox, write a routine
that displays a histogram of DCT coefficients from a chosen DCT mode (k, l). Use
your routine and analyze a picture of a regular scene. Plot the histograms of DCT
modes (1, 2), (1, 4), and (1, 6). Notice that the histograms become progressively
“spikier.” This is because modes with higher spatial frequencies are more often
quantized to zero.
2.5 [Generalized Gaussian Fit] Write a routine that fits the generalized
Gaussian distribution to a histogram of DCT coefficients using the method of
moments (Section A.8). For debugging purposes, apply it to artificially generated
Gaussian data to see whether you are getting the expected result. Then apply it
to the three histograms obtained in the previous project. The shape parameter
should decrease with increasing spatial frequency of the DCT mode, thus
quantifying your observation that the histogram becomes spikier.
3 Digital image acquisition
transferred from the sensor for further processing. This topic is described in
Section 3.2. The processing of the acquired signal in the camera is the subject
of Sections 3.3 and 3.4 that explain how sensors register color through a color
filter array and how they process the signal for it to be viewable on a computer
monitor. Finally, in Section 3.5 we describe a topic of special interest to steganog-
raphers – imperfections and stochastic processes involved in image acquisition.
Figure 3.1 Kodak KAI 1100 CM-E CCD sensor with 11 megapixels.
well capacity and lower quantum efficiency and due to the fact that the various
electronic components are packed closer to each other and thus influence each
other more. Inhomogeneities of the silicon also become more influential with
decreased pixel size. Sensors with larger pixels (10 microns or larger) produce
images with a higher signal-to-noise ratio (SNR) but are more expensive.
To capture an image, both CCD and CMOS sensors perform a sequence of in-
dividual tasks. They first absorb photons, generate a charge using the photo-
electric phenomenon, then collect the charge, transfer it, and finally convert the
charge to a voltage. The CCD and CMOS sensors differ in how they transfer the
charge and convert it to voltage.
CCDs transfer charges between pixel wells by shifting them from one row of
pixels of the array to the next row, from top to bottom (see Figure 3.2). The
transfer happens in a parallel fashion through a vertical shift-register architec-
ture. The charge transfer is “coupled” (hence the term charge-coupled device)
in the sense that as one row of charge is moved down, the next row of charge
(which is coupled to it) shifts into the vacated pixels. The last row is a horizontal
shift register that serially transfers the charge out of the sensor for further pro-
cessing. Each charge is converted to voltage, amplified, and then sent to an A/D
converter, which converts it into a bit string. Even though the charge-transfer
process itself is not completely lossless, the Charge-Transfer Efficiency (CTE)
in today’s CCD sensors is very high (0.99999 or higher) and thus influences the
resulting image quality in a negligible manner at least as far as steganographic
applications are concerned. The charge conversion and amplification, however,
do introduce noise into the signal in the form of readout and amplifier noise.
Both noise signals can be well modeled as a sequence of iid Gaussian variables.
[Figure 3.2: charge-coupled transfer moves rows of charge down the sensor; the last row is a horizontal shift register feeding an amplifier and A/D converter. Figure 3.3: one 2 × 2 cell of the Bayer pattern, G B over R G.]
color image, each photodetector has a filter layer bonded to the silicon that allows
only light of a certain color to pass through, absorbing all other wavelengths. The
filters are assigned to pixels in a two-dimensional pattern called the Color Filter
Array (CFA). Most sensors use the Bayer pattern developed by Kodak in the
1970s. It is obtained by tiling the sensor periodically using 2 × 2 squares as
depicted in Figure 3.3. The Bayer pattern has twice as many green pixels as red
or blue, reflecting the fact that human eyes are more sensitive to green than red
or blue.
To form a complete digital image, the other two missing colors at each pixel
must be obtained by interpolation (also called demosaicking) from neighboring
pixels. A very simple (but not particularly good) color-interpolation algorithm
computes the missing colors as in Table 3.1. There exist numerous very sophisti-
cated content-adaptive color-interpolation algorithms that perform much better
than this simple algorithm.
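A toy illustration of the interpolation idea, estimating the green channel at non-green sites of the Bayer mosaic as the average of the four neighbors (bilinear interpolation; the pattern assignment follows the 2 × 2 cell with green on its diagonal, and real cameras use the content-adaptive methods mentioned above):

```python
def bayer_is_green(i, j):
    # Bayer 2x2 cell G B / R G: green sites lie on the diagonal of each cell.
    return (i + j) % 2 == 0

def interpolate_green(mosaic):
    """Fill in green at non-green sites as the mean of the 4-neighbors."""
    h, w = len(mosaic), len(mosaic[0])
    green = [row[:] for row in mosaic]
    for i in range(h):
        for j in range(w):
            if not bayer_is_green(i, j):
                nbrs = [mosaic[i + di][j + dj]
                        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= i + di < h and 0 <= j + dj < w]
                green[i][j] = sum(nbrs) / len(nbrs)
    return green
```

Even this crude filter makes the point relevant to steganography: every interpolated value is a deterministic function of its neighbors, so the demosaicked image carries inter-pixel dependences that embedding changes may disrupt.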
Color-interpolation algorithms may introduce artifacts into the resulting im-
age, such as aliasing (moiré patterns) or misaligned colors in the neighborhood
of edges. These artifacts have a very characteristic structure and thus cannot be
used for steganography.
A very important consequence of color interpolation for steganography is that
no matter how sophisticated the demosaicking algorithm is, it is a type of fil-
tering, and it will inevitably introduce dependences among neighboring pixels.
After all, the red color at a green pixel is some function of the neighboring col-
ors, etc. Small modifications of the colors due to data embedding may disrupt
these dependences and thus become statistically detectable.
We note that not all cameras use CFAs. Some high-end digital videocameras
use a prism that splits the light into three beams and sends them to three separate
sensors, each sensor registering one color at every pixel. This approach is not
usually taken in compact cameras because it makes the camera more bulky and
expensive. There also exist special sensors that can register all three colors at
every pixel, capitalizing on the fact that red, green, and blue light penetrates to
different depths of the silicon layer. By reading out the charge from each layer
separately, rather than from the whole photodetector as in conventional sensors,
all three colors are obtained at every pixel. This design is incorporated in the
Foveon X3 sensor, for example, in the Sigma SD9 camera.
To capture color, scanners typically use trilinear CCDs consisting of three
adjacent linear CCDs, each equipped with a different color filter.
Table 3.1. A simple example of a color-interpolation algorithm for the Bayer pattern.
Color-interpolation algorithm
3.5 Noise
There are numerous noise sources that influence the resulting image produced
by the sensor. Some are truly random processes, such as the shot noise caused by
quantum properties of light, while other sources are systematic in the sense that
they would be the same if we were to take the same image twice. It should be clear
that while systematic imperfections cannot be used for steganography, random
components can, and they are thus fundamental to our considerations. If we
knew the random noise component exactly, in principle we could subtract it from
the image and replace it with an artificially created noise signal with the same
statistical properties that would carry a secret message (Chapter 7). We now
review each noise source individually, pointing out their role in steganography.
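The distinction between systematic and random components can be sketched with a toy additive model (all names and values are hypothetical): two acquisitions of the same scene share the systematic pattern but differ in the random part.

```python
import random

def acquire(scene, pattern, sigma, rng):
    """Toy sensor output: scene + systematic pattern + iid Gaussian noise."""
    return [s + p + rng.gauss(0, sigma) for s, p in zip(scene, pattern)]
```

Repeating the acquisition with a fresh noise realization changes only the Gaussian term; averaging many acquisitions suppresses it and leaves scene plus pattern, which is exactly why systematic components carry no capacity for hiding.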
Dark current is the image one would obtain when taking a picture in com-
plete darkness. The main factors contributing to dark current are impurities in
the silicon wafer or imperfections in the silicon crystal lattice. Heat also leads to
1 Some sensors use Anti-Blooming Gates (ABGs) when the charge exceeds 50% of well capacity
to prevent blooming. This causes the photodetector response to become non-linear at high
intensity.
dark noise (with the noise energy doubling with an increase in temperature of
6–8◦ C). The thermal noise can be suppressed by cooling the sensor, which is typ-
ically done in astronomy. The number of thermal electrons is also proportional
to the exposure time. Some consumer cameras take a dark frame with a closed
shutter when the camera is powered up and subtract it from every image the
camera takes. One interesting example worth mentioning here is the KAI 2020
chip developed by Kodak. This chip calculates the dark current (sensor output
when not exposed to light) and subtracts it from the illuminated image. This
method is frequently used in CMOS sensors to suppress noise as well as other
artifacts.
Figure 3.4 shows an example of dark current on the raw sensor output obtained
using a 60-second exposure with the SBIG STL-1301E camera equipped with a
1280 × 1024 CCD sensor cooled to −15◦C.
Photo-Response Non-Uniformity (PRNU). Due to imperfections in the
manufacturing process and the silicon wafer, the dimensions as well as the quan-
tum efficiency of each pixel may slightly vary. Additional imperfections may be
introduced by anomalies in the CFA and microlenses. Therefore, even when tak-
ing a picture of an absolutely uniformly illuminated scene (e.g., the blue sky), the
sensor output will be slightly non-uniform even if we eliminated all other sources
of noise. This non-uniformity may have components of low spatial frequencies,
such as a gradient or darkening at image corners, circular or irregular blobs due
to dirty optics or dust on the protective glass of the sensor (Figure 3.5), and a
stochastic component resembling white noise due to anomalies in the CFA and
varying quantum efficiency among pixels. The PRNU is a systematic artifact in
the sense that two images of exactly the same scene would exhibit approximately
the same PRNU artifacts. (This is why it is sometimes called the fixed pattern
noise.) Thus, the PRNU does not increase the amount of information we can
embed in an image because it is systematic and not random.
Figure 3.5 Dust particles on the sensor protective glass show up as fuzzy dark spots
(circled).
Figure 3.6 Magnified portion of the stochastic component of PRNU in the red channel for
a Canon G2 camera. For display purposes, the numerical values were scaled to the range
[0, 255] and rounded to integers to form a viewable grayscale image.
It is worth mentioning that the PRNU can be used as a sensor fingerprint for
matching an image to the camera that took it [43] in the same way as bullet
scratches can be used to match a bullet to the gun barrel that fired it. Figure 3.6
shows a magnified portion of the PRNU in the red channel from a Canon G2
camera. To isolate the PRNU and suppress random noise sources, the pattern
was obtained by averaging the noise residuals of 300 images acquired in the TIFF
format. The noise residual for each image was obtained by taking the difference
between the image and its denoised version after applying a denoising filter.
Shot noise is the result of the quantum nature of light and makes the number
of electrons released at each photodetector essentially a random variable due to
the random variations of photon arrivals. Shot noise is a fundamental limitation
that cannot be circumvented. The presence of such random components during
image acquisition has direct implications for steganography. The number of
photons ξ captured by a photodetector during an exposure of duration Δt follows
the Poisson distribution

Pr{ξ = k} = e^{−λΔt} (λΔt)^k / k! = p[k], (3.8)
with mean value and variance (see Exercise 3.2)

E[ξ] = Σ_{k≥0} k p[k] = λΔt, (3.9)

Var[ξ] = Σ_{k≥0} k² p[k] − (λΔt)² = λΔt, (3.10)
where λ > 0 is the expected number of photons captured in a unit time interval.
With an increased number of photons, λΔt, the relative (percentage) variations
of ξ decrease because the ratio

√Var[ξ] / E[ξ] = 1/√(λΔt) → 0. (3.11)
This means that the shot noise decreases with increased pixel size and with
longer exposure times. Also, we note that for large λΔt the Poisson distribution
is well approximated with a Gaussian distribution N (λΔt, λΔt).
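The Poisson model is easy to check numerically. The following sketch (my own illustration, not from the book) draws simulated photon counts for two exposure levels and confirms that the sample mean and variance both approach λΔt while the relative variation decays as 1/√(λΔt):

```python
import numpy as np

# Illustration of Eqs. (3.8)-(3.11): photon counts are Poisson distributed,
# so their mean and variance both equal lam_dt = lambda * Delta_t.
rng = np.random.default_rng(seed=0)

def shot_noise_stats(lam_dt, n_trials=200_000):
    """Sample mean, variance, and relative variation of simulated photon counts."""
    counts = rng.poisson(lam=lam_dt, size=n_trials)
    mean, var = counts.mean(), counts.var()
    return mean, var, np.sqrt(var) / mean   # relative variation ~ 1/sqrt(lam_dt)

for lam_dt in (100, 10_000):
    mean, var, rel = shot_noise_stats(lam_dt)
    print(f"lam_dt={lam_dt}: mean={mean:.1f}, var={var:.1f}, rel. variation={rel:.4f}")
```

Larger pixels or longer exposures (larger λΔt) thus yield relatively weaker shot noise, as stated above.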
Charge-transfer efficiency. The transfer of charge to the output amplifier
in a CCD sensor is not a completely lossless phenomenon. This results in an
additional source of variations in the final collected charge. The charge-transfer
efficiency in the latest CCD designs is very close to 1 (e.g., 0.99999), which
means that it is entirely reasonable to simply neglect this effect for applications
in steganography and assume that the charge-transfer efficiency is 1.
Amplifier noise. The charge collected at each pixel is amplified using an
on-chip amplifier. This can be done in the last row of photodetectors for a CCD
sensor or at each pixel in a CMOS sensor. The amplifier noise is well modeled
as a Gaussian random variable with zero mean. It dominates the shot noise at
low light conditions.
Quantization noise. The amplified signal on the sensor output is further
transformed through a complicated chain of processing, such as demosaicking,
gamma correction, low-pass filtering to prevent aliasing during subsequent re-
sampling, etc. Finally, the image data can be converted to one of the image
formats, such as TIFF or JPEG, which introduces quantization errors.
The most important fact we need to realize for steganography is that the in-
camera processing will introduce local dependences among neighboring pixels.
This is what makes steganography in digital images very challenging because the
exact nature of these dependences is quite complex and different for each camera.
Figure 3.7 Hot pixel in an image (top) and its closeup (bottom). Because the hot pixel
had a red filter in front of it, the hot pixel appears red. Note the spread of the red color
due to demosaicking and other in-camera processing. For a color version of this figure, see
Plate 3.
Figure 3.8 Blooming artifacts due to overexposure. Also, notice the green artifact in the
lower right corner caused by multiple reflections of light in camera optics. A color version
of this figure is displayed in Plate 4.
Summary
• The photoelectric effect in silicon is the main physical principle on which
imaging sensors are based.
• There exist two competing sensor technologies – CCD and CMOS.
• Each imaging sensor consists of millions of individual photosensitive detectors
(photodiodes, photodetectors, or pixels).
• The light creates a charge at each pixel that is transferred, converted to
voltage, amplified, and quantized. CCD and CMOS differ mainly in how the
charge is transferred. In a CCD, the charge is transferred out of the sensor
in a sequential manner, while CMOS sensors are capable of transferring the
charge in a parallel fashion.
• The quantized signal generated by the sensor is further processed through a
complex chain of processing that involves white-balance (gain) adjustment,
demosaicking, color correction, denoising, filtering, gamma correction, and
finally conversion to some common image format (JPEG, TIFF).
• The processing introduces local dependences into the image.
• The image-acquisition process is influenced by many sources of imprecision
and noise due to physical properties of light (shot noise), slight differences in
pixel dimensions and silicon inhomogeneities (pixel-to-pixel non-uniformity),
optics (vignetting, chromatic aberration), charge transfer and readout (readout
noise, amplifier noise, reset noise), quantization noise, pixel defects (hot
and dead pixels), and defects due to charge overflow (blooming).
• Some sources of imprecision and noise are random in nature (e.g., shot noise,
readout noise), while others are systematic components that repeat from
image to image.
• The presence of truly random noise components in images acquired using
imaging sensors is quite fundamental and has direct implications for
steganography (Section 6.2.1).
Exercises
3.1 [Law of rare events] Assume that photons arrive sequentially in time
in an independent fashion with a constant average rate of arrival. Let λ be
the probability that one photon arrives in a unit time interval. Show that the
probability that k events occur in a unit time interval is
Pr{k events} = e^{−λ} λ^k / k!. (3.12)
Hint: Divide the unit time interval into n subintervals of length 1/n. Assuming
that the probability that two events occur in one subinterval is negligible, the
probability that exactly k photons arrive can be obtained from the binomial
distribution Bi(n, λ/n)
Pr{k events} = C(n, k) (λ/n)^k (1 − λ/n)^{n−k}, (3.13)

where C(n, k) denotes the binomial coefficient.
The result is then obtained by taking the limit n → ∞ for a fixed k.
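The limit in Exercise 3.1 can also be checked numerically. This sketch (not part of the book) compares Bi(n, λ/n) with the Poisson pmf (3.12) for growing n:

```python
from math import comb, exp, factorial

# Law of rare events: Bi(n, lambda/n) probabilities converge to the Poisson
# pmf (3.12) as n -> infinity with lambda fixed.
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam = 0.8
for n in (10, 100, 10_000):
    err = max(abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for k in range(8))
    print(f"n={n:>6}: max pointwise difference for k < 8 is {err:.2e}")
```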
3.2 [Poisson random variable] Show that the mean and variance of a Pois-
son random variable ξ
Pr{ξ = k} = e^{−λΔt} (λΔt)^k / k! = p[k] (3.14)

are

E[ξ] = Σ_{k≥0} k p[k] = λΔt, (3.15)

Var[ξ] = Σ_{k≥0} k² p[k] − (λΔt)² = λΔt. (3.16)
3.3 [Analysis of sensor imperfections] Take your digital camera and ad-
just its settings to the highest resolution and highest JPEG quality. Then take
N ≥ 10 images of blue sky, I[i, j; k], i = 1, . . . , m, j = 1, . . . , n, k = 1, . . . , N ,
where indices i and j determine the pixel at the position (i, j) in the kth image.
You might want to zoom in but not to the range of a digital zoom if your camera
has this capability. Make sure that the images do not contain stray objects, such
as airplanes, birds, or stars/planets. Compute the sample variance of all pixels
σ̂²[i, j] = (1/(N − 1)) Σ_{k=1}^{N} (I[i, j; k] − Ī[i, j])², (3.17)

where

Ī[i, j] = (1/N) Σ_{k=1}^{N} I[i, j; k] (3.18)
is the sample mean at pixel (i, j). Plot the histogram of the sample variance.
The variations at each pixel are due to combined random noise sources, such as
the shot noise or readout noise.
Extract the noise residual W from all images using the Wiener filter
W[·, ·; k] = I[·, ·; k] − W (I[·, ·; k]). (3.19)
In Matlab, the Wiener filter W is accessed using the routine wiener2.m. Then,
average all N noise residuals
K[i, j] = (1/N) Σ_{k=1}^{N} W[i, j; k]. (3.20)
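Exercise 3.3 can be prototyped on synthetic data when no camera images are at hand. The sketch below is my own; the flat-field model and noise levels are assumptions, and scipy's Wiener filter stands in for Matlab's wiener2.m. It follows Eqs. (3.17)–(3.20):

```python
import numpy as np
from scipy.signal import wiener

rng = np.random.default_rng(seed=1)
N, m, n = 12, 64, 64
prnu = 1 + 0.02 * rng.standard_normal((m, n))        # fixed pixel-to-pixel pattern
# N exposures of a flat scene: multiplicative PRNU plus random acquisition noise.
I = np.stack([200 * prnu + 3 * rng.standard_normal((m, n)) for _ in range(N)])

I_bar = I.mean(axis=0)                                  # sample mean, Eq. (3.18)
sigma2_hat = ((I - I_bar) ** 2).sum(axis=0) / (N - 1)   # sample variance, Eq. (3.17)

# Noise residuals (3.19) via a denoising filter, then their average (3.20).
W = np.stack([im - wiener(im, mysize=3) for im in I])
K = W.mean(axis=0)                                      # estimate of the PRNU pattern
print("mean per-pixel variance:", round(float(sigma2_hat.mean()), 2))
```

Averaging the residuals suppresses the random noise sources while the systematic PRNU pattern accumulates in K, exactly as in the sensor-fingerprint discussion above.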
4 Steganographic channel
Because in this book the covers are digital images, the cover-source attributes
include the image format, origin, resolution, typical content type, etc. The cover-
source properties are determined by objects that Alice and Bob would be ex-
changing if they were not secretly communicating. Modern steganography typi-
cally makes some fundamental assumption about the cover source that enables
formal analysis. For example, considering the cover source as a random variable
allows analysis of steganography using information theory (this topic is elaborated
upon in Section 6.1). Alternatively, the cover source can be interpreted as
an oracle that can be queried. This leads to the complexity-theoretic study of
steganography explained in Section 6.4.
The embedding algorithm is a procedure through which the sender determines
an image that communicates the required secret message. The procedure may
depend on a secret shared between Alice and Bob called the stego key. This
key is needed to correctly extract a secret message from the stego image. For
example, Alice can embed her secret bit stream as the least significant bits of
pixels chosen along a non-intersecting pseudo-randomly generated path through
the image (determined by the stego key).
The protocol that Alice and Bob use to select the stego keys is usually modeled
with a random variable on the space of all keys. For example, a reasonable
strategy is to select the stego key randomly (with uniform distribution) from the
set of all possible stego keys.
The message source has a major influence on the security of the steganographic
channel. Imagine two extreme situations. On the one hand, Alice and Bob al-
ways communicate only a short message, say 16 bits, in every stego image. On
the other hand, Alice and Bob have a need to communicate large messages and
frequently embed as many bits into the image as the embedding algorithm al-
lows. Intuitively, the prisoners are at much greater risk in the latter case. The
distribution of messages can be modeled using a random variable on the space
of all possible messages.
The actual communication channel used to send the images is assumed to be
monitored by a warden (Eve). Eve can assume three different roles. Position-
ing herself into the role of a passive observer, she merely inspects the traffic
and does not interfere with the communication itself. This is called the passive-
warden scenario. Alternatively, Eve may suspect that Alice and Bob might use
steganography and she can preventively attempt to disrupt the steganographic
channel by intentionally distorting the images exchanged by Alice and Bob. For
example, she may compress the image using JPEG, resize or crop the image,
apply a slight amount of filtering, etc. Unless Alice and Bob use steganography
that is resistant (robust) to such processing, the steganographic channel would
be broken by Eve’s actions. This is called the active-warden scenario. Finally, Eve
can be even more devious and she may try to guess the steganographic method
that Alice and Bob use and attempt to impersonate Alice or Bob or otherwise
intervene to confuse the communicating parties. There is a difference between
this so-called malicious warden and the active warden. The actions of an active
warden are aimed at making steganography impossible for Alice and Bob, while a
malicious warden does not necessarily intend to entirely disrupt the stego channel
but rather use it to her advantage to determine whether or not steganography
is taking place. In this book, we focus mainly on the passive-warden scenario
in which the communication channel is assumed error-free.2 This is the case
of communication via standard Internet protocols, where error-correction and
authentication tools guarantee error-free data transmission.
The discussion above underlines the need to view steganographic communica-
tion in a wider context. When designing a steganographic scheme, the prisoners
have a choice in selecting the basic elements to form their communication scheme.
As will be seen in the next chapter, it is not very difficult to write a computer
program that hides a large amount of data in an image. On the other hand,
writing a program that does so without introducing any detectable artifacts is
quite hard.
2 The active- and malicious-warden scenarios are discussed in [14, 53, 66, 72, 116, 178] and
the references therein.
Note that the digest consists of three bits. The problem may arise when Alice
decides to use this technique to communicate not just once but repetitively. If
Alice is equally likely to send each triple of bits out of all eight possible triples
of bits (which is a reasonable assumption if she is sending parts of an encrypted
document, for example), the stego images sent by her will equally likely produce
any of the eight possible bit triples as the digest. How do we know, however, that in natural
images the distribution of LSBs of the first three pixels in the upper left corner
follows this distribution? Most likely, it does not because these pixels are likely
to belong to a piece of sky and thus their values are far from being independent.
Note that this problem arises only because we consider multiple uses of the
steganographic channel and allow Eve to inspect all transmissions from Alice
rather than considering one image at a time. This observation is, in fact, quite
fundamental and will lead us to a formal definition of steganographic security
using information theory in Section 6.1.
The reader might suspect that had Alice used a better digest, such as the
first three bits of a cryptographic hash MD5 or SHA applied to the whole image
rather than just three pixels, intuitively, the above issue would be largely elim-
inated. We revisit this simple thought experiment in Section 6.4.1 dealing with
the complexity-theoretic definition of steganography.
Figure 4.2 The same 16 × 16 pixel block from four (uncompressed) TIFF images of blue
sky taken with the same camera within a short time interval. First note that the blocks
are not uniform even though the scene is perfectly uniform. Second, the blocks appear
different in every picture due to random noise sources.
is later analyzed for compatibility with features typically obtained from cover
images (see Chapter 10). Thus, all that Alice needs to achieve to evade detec-
tion is to make the stego image look like a cover in the feature space. The stego
“image” does not have to look like a natural image because, as long as its fea-
tures are compatible with features of natural images, it should pass through the
detector. A practical data-masking method was constructed in [200] by applying
a time-varying inverse Wiener filter shaping every 1024 message bits to match a
reference audio frame.
Emb : C × K × M → C, (4.5)
Ext : C × K → M, (4.6)
In other words, Alice can take any cover x ∈ C and embed in it any message
m ∈ M(x) using any key k ∈ K(x), obtaining the stego image y = Emb(x, k, m).
The number of messages that can be communicated in a specific cover x depends
on the steganographic scheme and it may also depend on the cover itself. For
example, if C is the set of all 512 × 512 grayscale images and Alice embeds
one message bit per pixel, then M = {0, 1}^{512×512} and |M(x)| = 2^{512×512} for
all x ∈ C. On the other hand, if C is the set of all 512 × 512 grayscale JPEG
images with quality factor qf = 75 and Alice embeds one bit per each non-zero
quantized DCT coefficient, the number of messages that can be embedded in a
specific cover depends on the cover itself because the number of non-zero DCT
coefficients in a JPEG file depends on the image content.
Figure 4.3 Steganography by cover modification (passive-warden case).
π : X → A, (4.10)
where X is the range of individual cover elements, such as pixels or DCT
coefficients. One frequently used bit-assignment (parity) function is the least
significant bit of the element, π(x) = x mod 2.
Throughout this book, the reader will learn about other symbol-assignment func-
tions.
If the embedding algorithm is designed to avoid making embedding changes
to certain areas of the cover image, we speak of adaptive steganography. For
example, we may wish to skip flat areas in the image and concentrate embedding
changes to textured regions. The subset of the image where embedding changes
are allowed is called the selection channel. Another example of a selection channel
is when the message bits are embedded along a pseudo-random path through
the image generated from the stego key. In general, it is in the interest of both
Alice and Bob to reveal as little information about the selection channel as
possible because this knowledge can help an attacker. If the selection channel
is available to Alice but not to Bob, it is called a non-shared selection channel.
Steganography by cover modification introduces distortion into the cover. The
distortion is typically measured with a mapping d(x, y), d : C × C → [0, ∞). One
commonly used family of distortion measures is parametrized by γ ≥ 1,
dγ(x, y) = Σ_{i=1}^{n} |x[i] − y[i]|^γ. (4.12)
For γ = 1,

d₁(x, y) = Σ_{i=1}^{n} |x[i] − y[i]| (4.13)

is the total absolute distortion, and the number of embedding changes is

ϑ(x, y) = Σ_{i=1}^{n} (1 − δ(x[i] − y[i])),

where δ is the Kronecker delta (2.23). Note that if the amplitude of embedding
changes is |x[i] − y[i]| = 1, dγ and ϑ coincide for all γ.
The distortion measures above are absolute in the sense that they measure the
total distortion. Often, it is useful to express distortion per cover element (e.g.,
per pixel), in which case we speak of relative distortion
d(x, y)/n. (4.16)
The quantity

β = ϑ(x, y)/n (4.17)

is called the change rate.
Two popular relative measures are the Mean-Square Error (MSE),

MSE = d₂(x, y)/n = (1/n) Σ_{i=1}^{n} |x[i] − y[i]|², (4.18)

and the Peak Signal-to-Noise Ratio (PSNR),

PSNR = 10 log₁₀ (x²_max / MSE), (4.19)

where x_max is the maximum value that x[i] can attain. For example, for 8-bit
grayscale images, x_max = 255.
The average embedding distortion is the expected value E [d(x, y)] taken over
all x ∈ C, k ∈ K, m ∈ M selected according to some fixed probability distribu-
tions from their corresponding sets.
A very important characteristic of a stegosystem that has a major influence
on its security is the embedding efficiency. We define it here rather informally as
the average number of bits embedded per average unit distortion,

e = E_x[log₂ |M(x)|] / E_{x,m}[d(x, y)]. (4.20)
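The distortion measures defined in this section are straightforward to compute. The sketch below (my own, with a toy cover/stego pair) evaluates dγ, the number of embedding changes ϑ, the change rate, and the MSE, and checks that dγ = ϑ whenever all change amplitudes equal 1:

```python
import numpy as np

def d_gamma(x, y, gamma):
    """Distortion d_gamma(x, y), Eq. (4.12)."""
    return float(np.sum(np.abs(x - y) ** gamma))

def num_changes(x, y):
    """Number of embedding changes theta(x, y)."""
    return int(np.sum(x != y))

x = np.array([10, 11, 12, 13, 14, 15])        # toy cover
y = np.array([10, 10, 12, 13, 15, 15])        # stego with two amplitude-1 changes

beta = num_changes(x, y) / x.size             # change rate, Eq. (4.17)
mse = d_gamma(x, y, 2) / x.size               # Mean-Square Error, Eq. (4.18)
print(f"d_1={d_gamma(x, y, 1)}, theta={num_changes(x, y)}, beta={beta:.3f}, MSE={mse:.3f}")

# With unit-amplitude changes, d_gamma coincides with theta for every gamma.
assert d_gamma(x, y, 1) == d_gamma(x, y, 3) == num_changes(x, y)
```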
Summary
• A steganographic channel consists of the source of covers, the message source,
embedding and extraction algorithms, the source of stego keys, and the
communication channel.
• If the channel is error-free, we speak of a passive warden. An active warden
intentionally distorts the communication with the hope of preventing usage of
steganography. A malicious warden tries to trick the communicating parties,
e.g., by impersonation.
• There exist three types of embedding algorithms: steganography by cover
selection, cover synthesis, and cover modification.
• The embedding capacity for a given cover is the maximal number of bits that
can be embedded in it.
• The relative embedding capacity is the ratio between the embedding capacity
and the number of elements in the cover where a message can be embedded.
Exercises
4.3 [Rule for adding PSNR] Let x[i] be an n-dimensional vector of real
numbers and let η 1 [i] ∼ N (0, σ12 ) and η 2 [i] ∼ N (0, σ22 ) be two iid Gaussian se-
quences with zero mean. Show that for n → ∞, the PSNR between x and
y = x + η₁ + η₂ satisfies

10^{−PSNR/10} = 10^{−PSNR₁/10} + 10^{−PSNR₂/10}, (4.26)

where PSNR₁ and PSNR₂ denote the PSNR between x and x + η₁ and between
x and x + η₂, respectively.
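A Monte Carlo check of the rule in Exercise 4.3 (my own sketch; it assumes x_max = 255 and n large enough for the sample MSEs to approach their expectations):

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n, x_max = 1_000_000, 255.0
x = rng.uniform(0, x_max, size=n)
eta1 = rng.normal(0, 2.0, size=n)             # sigma_1 = 2
eta2 = rng.normal(0, 3.0, size=n)             # sigma_2 = 3

def psnr(a, b):
    return 10 * np.log10(x_max**2 / np.mean((a - b) ** 2))

lhs = 10 ** (-psnr(x, x + eta1 + eta2) / 10)
rhs = 10 ** (-psnr(x, x + eta1) / 10) + 10 ** (-psnr(x, x + eta2) / 10)
print(f"10^(-PSNR/10) = {lhs:.6e}, sum of the two terms = {rhs:.6e}")
```

Both sides are MSE/x²_max, so the rule simply says that the MSEs of independent noises add.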
5 Naive steganography
The first steganographic techniques for digital media were constructed in the
mid 1990s using intuition and heuristics rather than from specific fundamental
principles. The designers focused on making the embedding imperceptible rather
than undetectable. This objective was undoubtedly caused by the lack of stegana-
lytic methods that used statistical properties of images. Consequently, virtually
all early naive data-hiding schemes were successfully attacked later. With the
advancement of steganalytic techniques, steganographic methods became more
sophisticated, which in turn initiated another wave of research in steganalysis,
etc. This characteristic spiral development can be expressed through the follow-
ing quotation:
Thus, one can think of the sequence (b[i, 1], . . . , b[i, nc ]) as the binary represen-
tation of x[i] in big-endian form (the most significant bit b[i, 1] is first). The LSB
is the last bit b[i, nc ].
LSB embedding, as its name suggests, works by replacing the LSBs of x[i] with
the message bits m[i], obtaining in the process the stego image y[i]. Algorithm 5.1
shows a pseudo-code for embedding a bit stream in an image along a pseudo-
random path generated from a secret key shared between Alice and Bob.
Note that in a color image the number of elements in the cover, n, is three
times larger than for a grayscale image. Thus, the pseudo-random path is chosen
across all pixels and color channels. The message embedded using Algorithm 5.1
can be extracted with the pseudo-code in Algorithm 5.2.
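The embedding/extraction pair can be sketched in a few lines of Python (the book gives pseudo-code in Algorithms 5.1 and 5.2; this implementation is mine and uses Python's random module as the path generator):

```python
import random

def lsb_embed(pixels, bits, key):
    """Replace LSBs of pixels visited along a keyed pseudo-random path."""
    path = list(range(len(pixels)))
    random.Random(key).shuffle(path)          # path determined by the stego key
    stego = list(pixels)
    for bit, i in zip(bits, path):
        stego[i] = (stego[i] & ~1) | bit      # set the least significant bit
    return stego

def lsb_extract(pixels, n_bits, key):
    """Read LSBs along the same keyed path; the key is needed for extraction."""
    path = list(range(len(pixels)))
    random.Random(key).shuffle(path)
    return [pixels[i] & 1 for i in path[:n_bits]]

cover = [34, 35, 36, 200, 201, 77, 78, 90]
message = [1, 0, 1, 1, 0]
stego = lsb_embed(cover, message, key="shared-secret")
assert lsb_extract(stego, len(message), key="shared-secret") == message
print("max per-pixel change:", max(abs(a - b) for a, b in zip(cover, stego)))
```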
The amplitude of changes in LSB embedding is 1, max_i |x[i] − y[i]| = 1, which
is the smallest possible change for any embedding operation. Under typical view-
ing conditions, the embedding changes in an 8-bit grayscale or true-color image
are not visually perceptible. Moreover, because natural images contain a small
amount of noise due to various noise sources present during image acquisition
Table 5.1. Relative counts of bits, neighboring bit pairs, and triples from an LSB plane
of the image shown in Figure 5.1.
Frequency of occurrence
Bits 0.49942, 0.50058
Pairs 0.24958, 0.24984, 0.24984, 0.25073
Triples 0.1246, 0.1250, 0.1247, 0.1251, 0.1250, 0.1249, 0.1251, 0.1256
(see Chapter 3), the LSB plane of raw, never-compressed natural images already
looks random. Figure 5.1 (and color Plate 5) shows the original cover image and
its LSB plane b[i, nc ] for the red channel.
Table 5.1 contains the frequencies of occurrence of single bits in b[i, nc ], i =
1, . . . , n, pairs of neighboring bits (b[i, nc ], b[i + 1, nc]), and triples of neighboring
bits (b[i, nc ], b[i + 1, nc ], b[i + 2, nc ]). The data are consistent with the claim that
the LSB plane is random. Even though this is not a proof of randomness,1 the
argument is convincing enough to make us intuitively believe that any attempts
to detect the act of randomly flipping a subset of bits from the LSB plane are
doomed to fail. This seemingly intuitive claim is far from the truth because LSB
embedding in images can be very reliably detected (see Chapter 11 on targeted
steganalysis). For now, we provide only a small hint.
Even if the LSB plane of covers was truly random, it may still be possible
to detect embedding changes due to flipping LSBs if, for example, the second
LSB plane b[i, nc − 1] and the LSB plane were somehow dependent! In the most
extreme case of dependence, if b[i, nc − 1] = b[i, nc ] for each i, detecting LSB
changes would be trivial. All we would have to do is to compare the LSB plane
with the second LSB plane.
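Statistics like those in Table 5.1 are easy to collect. The sketch below is my own; it uses synthetic random bits in place of the LSB plane of Figure 5.1 and computes the relative counts of single bits, neighboring pairs, and triples:

```python
import numpy as np

def relative_counts(bits, order):
    """Relative frequencies of all 2**order patterns of neighboring bits."""
    bits = np.asarray(bits, dtype=np.int64)
    # Pack each window of `order` consecutive bits into one integer pattern.
    patterns = sum(bits[i : len(bits) - order + 1 + i] << (order - 1 - i)
                   for i in range(order))
    counts = np.bincount(patterns, minlength=2**order)
    return counts / counts.sum()

rng = np.random.default_rng(seed=3)
lsb_plane = rng.integers(0, 2, size=100_000)    # stand-in for a real LSB plane
for order in (1, 2, 3):
    print(order, np.round(relative_counts(lsb_plane, order), 4))
```

For a truly random bit plane, all frequencies of a given order are close to 2^{−order}, matching the values reported in Table 5.1.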
LSB embedding belongs to the class of steganographic algorithms that embed
each message bit at one cover element. In other words, each bit is located at a
1 Pearson’s chi-square test from Section D.4 could replace intuition by verifying that the
distributions are uniform at a certain confidence level.
Figure 5.1 A true-color 800 × 548 image and the LSB plane of its red channel. Plate 5
shows the color version of the image.
LSBflip(x) = x + 1 when x is even, x − 1 when x is odd (5.2)

= x + 1 − 2(x mod 2) (5.3)

= x + (−1)^x. (5.4)
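A quick sanity check (mine, not the book's) that the three expressions (5.2)–(5.4) define the same operation on all 8-bit values:

```python
def lsb_flip_cases(x):
    return x + 1 if x % 2 == 0 else x - 1      # Eq. (5.2)

def lsb_flip_mod(x):
    return x + 1 - 2 * (x % 2)                 # Eq. (5.3)

def lsb_flip_pow(x):
    return x + (-1) ** x                       # Eq. (5.4)

assert all(lsb_flip_cases(x) == lsb_flip_mod(x) == lsb_flip_pow(x)
           for x in range(256))
print("all three forms agree on 0..255")
```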
where δ is the Kronecker delta (2.23). We will assume that Alice is embedding a
stream of m random bits. The assumption of randomness is reasonable because
Alice naturally wants to minimize the impact of embedding and thus compresses
the message and probably also encrypts it to further improve the security of
communication. We denote by α = m/n the relative payload Alice communicates.
Assuming she embeds the bits along a pseudo-random path through the image,
the probability that a pixel is not changed is equal to the probability that it
is not selected for embedding, 1 − α, plus the probability that it is selected, α,
multiplied by the probability that no change will be necessary, which happens
with probability 12 because we are embedding a random bit stream. Thus, for
any j,
Pr{y[i] = j | x[i] = j} = 1 − α + α/2 = 1 − α/2, (5.7)

Pr{y[i] ≠ j | x[i] = j} = α/2. (5.8)
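Equations (5.7)–(5.8), and the histogram effects derived next, can be observed in a small simulation. This sketch is my own; the cover is a synthetic image whose pixels take only the two values of one unbalanced LSB pair:

```python
import numpy as np

rng = np.random.default_rng(seed=4)
n, alpha = 1_000_000, 0.5
cover = rng.choice([40, 41], size=n, p=[0.8, 0.2])   # one unbalanced LSB pair
m = int(alpha * n)
stego = cover.copy()
path = rng.permutation(n)[:m]                        # pseudo-random embedding path
stego[path] = (stego[path] & ~1) | rng.integers(0, 2, size=m)  # LSB replacement

change_rate = float(np.mean(cover != stego))         # expected: alpha/2, Eq. (5.8)
h40, h41 = int(np.sum(stego == 40)), int(np.sum(stego == 41))
print(f"change rate {change_rate:.4f} vs alpha/2 = {alpha/2}")
print(f"h[40] + h[41] = {h40 + h41} (invariant); h[40] = {h40}, h[41] = {h41}")
```

The pair sum stays exactly n while the two counts move toward each other, previewing the evening-out effect quantified below.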
Because during LSB embedding the pixel values within one LSB pair {2k, 2k +
1}, k = 0, . . . , 2nc −1 − 1, are changed into each other but never to any other
value, the sum hα [2k] + hα [2k + 1] stays unchanged for any α and thus forms
an invariant under LSB embedding. Here, we denoted the histogram of the stego
image as hα . Thus, the expected value of hα [2k] is equal to the number of pixels
with values 2k that stay unchanged plus the number of pixels with values 2k + 1
that are changed into 2k,
Figure 5.2 Effect of LSB embedding on histogram. Top: Magnified portion of the
histogram of the image shown in Figure 5.1 after converting it to an 8-bit grayscale before
and after LSB embedding. (Shown are histogram values for grayscales between 28 and
39.) Bottom: Histogram of quantized DCT coefficients of the same image before and
after embedding using Jsteg (see Section 5.1.2). Left figures correspond to cover images,
right figures to stego.
E[hα[2k]] = (1 − α/2) h[2k] + (α/2) h[2k + 1], (5.9)

E[hα[2k + 1]] = (α/2) h[2k] + (1 − α/2) h[2k + 1]. (5.10)
Note that if Alice fully embeds her cover image with n bits (α = 1), we have
E[h₁[2k]] = E[h₁[2k + 1]] = (h[2k] + h[2k + 1])/2, k = 0, . . . , 2^{nc−1} − 1. (5.11)
We say that LSB embedding has a tendency to even out the histogram within
each bin. This leads to a characteristic staircase artifact in the histogram of
the stego image (Figure 5.2), which can be used as an identifying feature for
images fully embedded with LSB embedding. This observation is quantified in
the so-called histogram attack [246], which we now describe.
hα[2k] ≈ h̄[2k] = (hα[2k] + hα[2k + 1])/2, k = 0, . . . , 2^{nc−1} − 1, (5.12)

where h̄[2k] denotes the expected bin value after full embedding, computed
from the stego image itself.
H0 : hα ∼ h̄, (5.13)

H1 : hα ≁ h̄, (5.14)
which we approach using Pearson’s chi-square test [221] (also, see Appendix D).
This test determines whether the even grayscale values in the stego image follow
the known distribution h̄[2k], k = 0, . . . , 2^{nc−1} − 1. The chi-square test first
computes the test statistic S,

S = Σ_{k=0}^{d−1} (hα[2k] − h̄[2k])² / h̄[2k], (5.15)
where d = 2^{nc−1}. Under the null hypothesis, the even grayscale values follow
the probability mass function h̄[2k], and the test statistic (5.15) approximately
follows the chi-square distribution, S ∼ χ²_{d−1}, with d − 1 degrees of freedom
(see Section A.9 on the chi-square distribution). This holds as long as all d bins,
h̄[2k], k = 0, . . . , 2^{nc−1} − 1, are sufficiently populated. Any sparsely populated
bins must be merged so that h̄[2k] > 4 for all k to make S approximately
chi-square distributed.
One can intuitively see that if the even grayscales follow the expected distri-
bution, the value of S will be small, indicating the fact that the stego image is
fully embedded with LSB embedding. Large values of S mean that the match is
poor and notify us that the image under inspection is not fully embedded. Thus,
we can construct a detector of images fully embedded using LSB embedding by
setting a threshold γ on S and deciding “cover” when S > γ and “stego” otherwise.
The probability of failing to detect a fully embedded stego image (probability of
missed detection) is the conditional probability that S exceeds the threshold for
a stego image,

PMD(γ) = Pr{S > γ | stego}. (5.16)

We set the threshold γ so that the probability of a miss is at most PMD. Denoting
the probability density function of χ²_{d−1} as f_{χ²_{d−1}}(x), the threshold is
determined from (5.16),

PMD(γ) = ∫_γ^∞ f_{χ²_{d−1}}(x) dx = ∫_γ^∞ x^{(d−1)/2−1} e^{−x/2} / (2^{(d−1)/2} Γ((d−1)/2)) dx. (5.17)
The value PMD (γ) is called the p-value and it measures the statistical significance
of γ. It is the probability that a chi-square-distributed random variable with d − 1
degrees of freedom would attain a value larger than or equal to γ.
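The attack itself fits in a few lines. The following sketch is my own implementation (using scipy for the chi-square tail probability; sparse bins are simply dropped rather than merged, a simplification), applied to a synthetic cover histogram and to the same histogram after its pairs have been evened out as in (5.11):

```python
import numpy as np
from scipy.stats import chi2

def histogram_attack_pvalue(hist):
    """p-value of the histogram attack for a 256-bin grayscale histogram."""
    hist = np.asarray(hist, dtype=float)
    observed = hist[0::2]                     # even grayscale values
    expected = (hist[0::2] + hist[1::2]) / 2  # pair averages, Eq. (5.12)
    keep = expected > 4                       # drop sparsely populated bins
    S = np.sum((observed[keep] - expected[keep]) ** 2 / expected[keep])  # (5.15)
    return chi2.sf(S, df=int(keep.sum()) - 1) # Pr{chi2_(d-1) >= S}

rng = np.random.default_rng(seed=5)
cover_hist = rng.integers(500, 5000, size=256)
fully_embedded = cover_hist.reshape(-1, 2).mean(axis=1).repeat(2)  # evened pairs
print("cover p-value:", histogram_attack_pvalue(cover_hist))
print("fully embedded p-value:", histogram_attack_pvalue(fully_embedded))
```

A p-value near 1 flags a fully embedded image, while a p-value near 0 indicates a cover, matching the decision rule above.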
The histogram attack can identify images fully embedded with random mes-
sages (α = 1) and it can also be used to detect messages with α < 1 if the order of
embedding is known (e.g., sequential). In this case, the m = αn message bits are
Figure 5.3 The p-value for the histogram attack on a sequentially embedded 8-bit
grayscale image with relative message length α = 0.4.
embedded along a known path in the image represented using the vector of in-
dices Path[i], i = 1, . . . , n. If we evaluate the ith p-value pv[i] from the histogram
of {x[Path[1]], . . . , x[Path[i]]} of the stego image, then, after a short transient
phase, pv[i] will reach a value close to 1. It will suddenly fall to zero when we arrive at
the end of the message (at approximately i ≈ αn) and it will stay at zero until
we exhaust all pixels. This is because the test statistic S will cease to follow
the chi-square distribution. Figure 5.3 shows pv [i] for a sequentially embedded
message with α = 0.4 (the cover image is shown in Figure 5.1). Thus, for sequen-
tial embedding this test not only determines with very high probability that a
random message has been embedded but also estimates the message length.
If the embedding path is not known, the histogram attack is ineffective unless
the majority of pixels have been used for embedding. Attempts to generalize this
attack to randomly spread messages include [199, 242]. The most accurate ste-
ganalysis methods for LSB embedding are the detectors discussed in Chapter 11.
F (h) = 0. (5.18)
From equations (5.9) and (5.10), h can be expressed using E[hα] by solving the
system of two linear equations for h[2k] and h[2k + 1],

h[2k] = a E[hα[2k]] − b E[hα[2k + 1]], (5.19)

h[2k + 1] = −b E[hα[2k]] + a E[hα[2k + 1]], (5.20)

where a = (1 − α/2)/(1 − α) and b = (α/2)/(1 − α).
Moreover, the histogram is monotonically increasing for j < 0 and decreasing for
j > 0. Because the LSB pairs are . . . , {−4, −3}, {−2, −1}, {0, 1}, {2, 3}, . . . and
because LSB embedding evens out the differences in counts in each LSB pair,
h[2k] decrease and h[2k + 1] increase with embedding for k > 0 and the effect is
the opposite for k < 0 (h[2k] increase and h[2k + 1] decrease). Thus, we use the
following function of the histogram to attack Jsteg:
F(h) = Σ_{k>0} h[2k] + Σ_{k<0} h[2k + 1] − Σ_{k≥0} h[2k + 1] − Σ_{k<0} h[2k]. (5.22)
Note that for the cover image F (h) = 0 as required. Also, with embedding, F
increases due to the impact of LSB embedding on positive and negative even
and odd DCT values. By substituting (5.19) and (5.20) into (5.22) and using the
approximation hα ≈ E[hα], we obtain²

Σ_{k>0} (a hα[2k] − b hα[2k + 1]) + Σ_{k<0} (−b hα[2k] + a hα[2k + 1]) − Σ_{k>0} (−b hα[2k] + a hα[2k + 1]) − Σ_{k<0} (a hα[2k] − b hα[2k + 1]) = hα[1].   (5.23)
2 Note that because the DCT coefficients equal to 1 are skipped, h[1] = hα [1].
Rearranging the terms, an equation for the unknown relative message length α
is obtained,
(a + b) Σ_{k>0} (hα[2k] − hα[2k + 1]) + (a + b) Σ_{k<0} (hα[2k + 1] − hα[2k]) = hα[1],   (5.24)
where a + b = 1/(1 − α) (recall that hα is known as it is the histogram of the
stego image). Solving for α, we finally get for its estimate

α̂ = 1 − (Σ_{k≠0} Δhα[k]) / hα[1],   (5.25)

where Δhα[k] = hα[2k] − hα[2k + 1] for k > 0 and Δhα[k] = hα[2k + 1] − hα[2k] for k < 0.
H0 : α̂ = 0, (5.28)
H1 : α̂ > 0, (5.29)
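The estimator (5.25) is straightforward to implement. The sketch below (not the book's code; the histogram is assumed given as a dict mapping DCT value to count) computes α̂ directly from the stego histogram:

```python
def jsteg_alpha_hat(h, K=1024):
    """Estimate (5.25) of the relative message length from the stego
    histogram h (dict: DCT value -> count); K bounds the summation range."""
    num = 0.0
    for k in range(1, K):                      # terms with k > 0
        num += h.get(2 * k, 0) - h.get(2 * k + 1, 0)
    for k in range(-K, 0):                     # terms with k < 0
        num += h.get(2 * k + 1, 0) - h.get(2 * k, 0)
    return 1.0 - num / h.get(1, 1)
```

On the expected stego histogram of a symmetric cover, the estimate recovers α exactly, as (5.24) predicts.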
Palette images, such as GIF or the indexed form of PNG, represent the image
data using pointers to a palette of colors stored in the header. It is possible to
hide messages both in the palette and in the image data.
Figure 5.4 Histogram of the estimated message length α̂ for cover images (α = 0) and stego images embedded with Jsteg and relative payload α = 0.1, 0.2, 0.3, 0.5. The data was computed from 954 grayscale JPEG images with quality factor qf = 75.
Figure 5.5 Palette colors of image shown in Figure 5.1 saved as GIF. A color version of
this figure is shown in Plate 6.
Although this method provides high capacity, its steganographic security is low
because the palette of the stego image will have a very unusual structure [121] that
is unlikely to occur naturally during color quantization (see Section 2.2.2). It will
contain suspicious groups of 2, 4, or 8 close colors depending on the technique. It
is thus relatively easy to identify stego images simply by analyzing the palette.
What is even worse is that the detection will be equally reliable even for very
short messages.
Figure 5.6 Palette colors sorted by luminance (by rows). A color version of this figure is
shown in Plate 7.
(a) (b)
(c) (d)
Figure 5.7 A small magnified portion of the image shown in Figure 5.1 saved as 256-color
GIF (a), the same portion after embedding a maximum-length message using EzStego
(b), optimal-parity embedding (c), and embedding while dithering (d). A color version of
this figure is shown in Plate 8.
to assign parities (bits 0 and 1) to each palette color so that the closest color to
each palette color has the opposite parity. If this can be achieved, we can simply
embed messages as color parities of indices instead of their LSBs by swapping a
color for its closest neighbor from the palette. This is the idea behind optimal-
parity embedding [80], explained next.
is closest and has the opposite parity. Thus, we can construct a steganographic
algorithm that embeds one message bit at each palette index (each pixel) as the
color parity and this algorithm induces the smallest possible distortion. This is
because if the message bit does not match the color parity, we can swap the color
for another palette color that is the closest to it.
Note that there may be more than one parity assignment with the above
optimal property. The algorithm above, however, is deterministic and will always
produce one specific assignment. Also note that the optimal parity assignment is
only a function of the palette and not of the frequency with which the colors
appear in the image. This means that the recipient can construct the same
assignment from the stego image as the sender and thus read the message.
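One simple deterministic construction in this spirit processes pairs of palette colors from closest to farthest and forces opposite parities (a sketch only, not necessarily the exact algorithm of [80]):

```python
import itertools
import math

def parity_assignment(palette):
    """Greedy sketch: sort color pairs by distance, then force opposite
    parities pair by pair (assumes at least two palette colors)."""
    pairs = sorted(itertools.combinations(range(len(palette)), 2),
                   key=lambda ij: math.dist(palette[ij[0]], palette[ij[1]]))
    parity = {}
    for i, j in pairs:
        if i in parity and j in parity:
            continue
        if i not in parity and j not in parity:
            parity[i], parity[j] = 0, 1      # closest unassigned pair
        elif i in parity:
            parity[j] = 1 - parity[i]
        else:
            parity[i] = 1 - parity[j]
    return [parity[i] for i in range(len(palette))]
```

Because the assignment depends only on the palette, the recipient reconstructs it from the stego image and reads the message as the color parities.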
Assuming that color c[j] occurs in the cover image with frequency³ p[j], Σ_{j=1}^{N_p} p[j] = 1, if the message-carrying pixels are selected pseudo-randomly, the expected embedding distortion per visited pixel is

(1/m) E[d2(x, y)] = E[(1/m) Σ_{i=1}^{m} d²_RGB(x[i], y[i])] = (1/2) Σ_{j=1}^{N_p} p[j] s²[j],   (5.32)
because for a message of length m, there will be on average mp[j] pixels of color
c[j] containing message bits and one half of them will have to be modified to
c[j′], inducing distortion s²[j].
3 One can also say that p is the normalized color histogram or sample pmf of colors.
choose the color with the smallest isolation to minimize the overall embedding
distortion. The recipient simply reads the message by following the same steps,
extracting message bits from the parity of all blocks whose texture measure is
above the threshold.
The problem with this scheme is that the act of embedding may change the
texture measure to fall below the threshold and thus the recipient will not read
the bit embedded in that block. This problem is common to many steganographic
methods that use adaptive selection rules. For the scheme above, the solution is
simple. After embedding each bit, we need to check whether the block texture
is still above the threshold. If it is, the embedding can continue. If the texture
falls below the threshold, we need to embed the same bit in the next block
because the block where we just embedded will be skipped by the recipient. This
modification decreases the embedding efficiency because sometimes changes are
made that do not embed any bits. However, the decrease in embedding efficiency
is usually very small because the chances that the block’s texture will fall below
the threshold after embedding are small.
The problem above is a specific example of a selection channel that is not
completely shared between the sender and the recipient. Chapter 9 is devoted
to the problem of communicating with non-shared selection channels, where a
general solution is presented using so-called wet paper codes.
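The re-embedding fix described above can be sketched as follows (toy choices, not the book's code: a block is a list of pixel values, its parity is the sum modulo 2, its texture measure is the dynamic range, and the embedding change increments the first pixel):

```python
def texture(block):
    return max(block) - min(block)        # toy texture measure: dynamic range

def parity(block):
    return sum(block) % 2

def embed_adaptive(blocks, bits, threshold):
    """Sketch of the re-embedding fix: if the change drops a block's
    texture to or below the threshold, the same bit goes to the next block."""
    it = iter(bits)
    bit = next(it, None)
    for b in blocks:
        if bit is None:
            break
        if texture(b) <= threshold:
            continue                      # skipped by sender and recipient alike
        if parity(b) != bit:
            b[0] += 1                     # toy embedding change
        if texture(b) > threshold:
            bit = next(it, None)          # bit remains readable; advance
        # otherwise: re-embed the same bit in the next usable block
    return blocks

def extract_adaptive(blocks, nbits, threshold):
    return [parity(b) for b in blocks if texture(b) > threshold][:nbits]
```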
Summary
• The simplest steganographic method is Least-Significant-Bit (LSB) embedding.
• The effect of LSB embedding on an image histogram can be quantified. The embedding evens out the populations of both values from the same LSB pair {2k, 2k + 1}.
• The histogram attack is an attack on LSB embedding when the message placement is known.
• Jsteg is a steganographic algorithm for JPEG images that uses LSB embedding. It can be attacked by utilizing a priori knowledge about the cover-image histogram, such as its symmetry.
• Messages can be embedded in palette images either in the palette or in the indices (image data). Hiding in the palette provides limited capacity independent of the image size.
• Embedding in indices to the palette offers larger capacity but often creates easily discernible artifacts.
• Optimal-parity embedding is an assignment of bits to palette colors that can be used to minimize the embedding distortion.
• Embedding while dithering is an embedding method for palette images that minimizes the total distortion due to color quantization and embedding.
• A possible way to improve security is to confine the embedding changes to more textured or noisy areas of the cover (adaptive steganography).
Exercises
Emb : Two bits are embedded sequentially at each pixel by replacing two least
significant bits with two message bits. For example, if x[i] = 14 = (00001110)2
and we want to embed bit pairs 00, 01, 10, 11, x[i] is changed to 12, 13, 14, 15,
respectively. If x[i] = 32 = (00100000)2, we embed the same bit pairs by chang-
ing x[i] to 32, 33, 34, and 35, etc. Ext : Two bits are extracted sequentially as
the least two LSBs from the pixels to form the message.
Calculate the embedding efficiency for both d1 and d2 distortion under the as-
sumption that the message is a random bit stream and the pixel values are
uniformly distributed in {0, . . . , 255}.
then use the assumption that the pixel values are uniformly distributed to obtain
the average distortion per pixel.
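For reference, the expectations in this exercise can be checked by exhaustive enumeration over the pixel residue modulo 4 and the message bit pair, taking efficiency as embedded bits per unit expected distortion (a verification sketch, not the requested derivation):

```python
from itertools import product

def two_lsb_efficiency():
    """Exact expected distortions of 2-LSB replacement by enumerating the
    pixel residue mod 4 and the message bit pair (both uniform)."""
    d1 = d2 = 0.0
    for r, b in product(range(4), repeat=2):
        d1 += (b != r)            # d1: a change occurred
        d2 += (b - r) ** 2        # d2: squared error
    d1 /= 16
    d2 /= 16
    return 2 / d1, 2 / d2         # 2 embedded bits per pixel
```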
5.3 [LSB embedding of biased bit stream] Assume that the cover image is
an 8-bit grayscale image. Suppose that the secret message is a random biased bit
stream, i.e., the bits are iid realizations of a Bernoulli random variable ν ∼ B (p0 )
with probability mass function
Pr{ν = 0} = p0 , (5.36)
Pr{ν = 1} = p1 , (5.37)
p̂0 = arg min_{λ∈ℝ} Σ_{k=0}^{d/2−1} [(hα[2k] − 2λh[2k])² + (hα[2k + 1] − 2(1 − λ)h[2k])²],   (5.43)
where h[2k] is defined in (5.12), and compute the chi-square test statistic
S = Σ_{k=0}^{d/2−1} (hα[2k] − 2p̂0 h[2k])² / (2p̂0 h[2k]),   (5.44)
which now follows the chi-square distribution with d/2 − 2 = 126 degrees of freedom because we had to estimate the unknown parameter p0 from the data.
5.5 [Power of parity] Assume that the cover image has a biased distribution
of LSBs: the fraction r of pixels has LSBs equal to 1 and the fraction 1 − r
has LSBs equal to 0, 0 < r < 1. Let x[i] be the LSBs of pixels ordered along
a pseudo-random path. Consider the following sequence of bits b[i] obtained
as the XOR of LSBs of disjoint groups of m consecutive pixels b[1] = x[1] ⊕
x[2] ⊕ · · · ⊕ x[m], b[2] = x[m + 1] ⊕ x[m + 2] ⊕ · · · ⊕ x[2m], . . .. Show that the
bit stream b[i] becomes unbiased exponentially fast with m by proving that
|Pr{b[i] = 0} − Pr{b[i] = 1}| = |1 − 2r|^m.   (5.45)
Hint: For m even, Pr{b[i] = 0} = Pr{x[1] + x[2] + · · · + x[m] is even} and ex-
press the probabilities using r and 1 − r.
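The claim is easy to check numerically; the bias of the XOR can be computed exactly with a simple recursion on the parity (a verification sketch, not the requested proof):

```python
def xor_bias(r, m):
    """Exact bias of the XOR of m iid bits with Pr{x = 1} = r,
    by recursing on the probability that the running XOR is 0."""
    p0 = 1.0
    for _ in range(m):
        p0 = p0 * (1 - r) + (1 - p0) * r
    return abs(2 * p0 - 1)        # |Pr{b = 0} - Pr{b = 1}|
```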
5.8 [LSB embedding as noise adding] Explain why the impact of LSB
embedding in cover x cannot be written as adding to x an iid noise, y = x + ξ.
Hint: Are x and ξ independent?
5.9 [View bit planes] Write a routine in Matlab that displays a selected
bit plane (a two-color black-and-white image) of an image represented using a
uint8 array. For a color image, display the bit planes of red, green, and blue
channels separately as three two-color images. You may wish to use green–white
combination for the bit plane of the green channel, red–white combination for
the red channel, etc.
6 Steganographic security
assume that Eve knows the steganographic channel (Kerckhoffs’ principle) and
thus knows both Pc and Ps . This rather strong assumption is justified because
in real life the prisoners can never be sure how much Eve knows (she may be a
government agency with significant resources, for example) and thus it is pru-
dent to grant her omnipotence. Under this assumption, in Section 6.1 we define
steganographic security as the KL divergence between the distributions of cover
and stego images, Pc and Ps . The importance of this information-theoretic quan-
tity will become apparent later when we show how the KL divergence imposes
fundamental limits on Eve’s detector and how it can be used for comparing
steganographic schemes. In Section 6.2, we discuss several specific examples of
perfectly secure stegosystems and point out an interesting relationship between
perfect security and perfect compression. Section 6.3 investigates secure stegosys-
tems under the condition that the embedding distortion introduced by Alice is
limited (the case of a so-called distortion-limited embedder). In the same section,
we also show how certain algorithms originally proposed for robust watermarking
can be modified into secure stegosystems.
Even though the information-theoretic definition of security is well developed
and widely accepted in the steganographic community, there exist important
alternative approaches, which we also mention in this chapter. Inspired by
the concept of security of public-key cryptosystems, in Section 6.4 we explain a
complexity-theoretic approach to steganographic security, which makes an im-
portant connection between security in steganography and properties of some
common cryptographic primitives, such as one-way functions. This research di-
rection arose due to critique of the information-theoretic definition of security,
which ignores the important issue of computational complexity and Eve’s ability
to actually implement an attack.
DKL(Pc||Ps) = Σ_{x∈C} Pc(x) log (Pc(x)/Ps(x)),   (6.1)
is given by (6.1) with C = {0, 1}, pc(1) = PFA, and ps(0) = PMD,

dKL(PFA, PMD) ≜ DKL(pc||ps)   (6.4)
= (1 − PFA) log[(1 − PFA)/PMD] + PFA log[PFA/(1 − PMD)].   (6.5)
Because Eve's detector is a type of processing and processing cannot increase the KL divergence (Proposition B.6 in Appendix B), for an ε-secure stegosystem we must have

dKL(PFA, PMD) ≤ DKL(Pc||Ps) ≤ ε.   (6.6)
This inequality imposes a fundamental limit on the performance of any detec-
tor Eve can build. Requiring that the probability of false alarm for Eve’s detector
cannot be larger than some fixed value PFA , 0 < PFA < 1, the smallest possible
probability of missed detection she can achieve is
PMD(PFA) = min_{PMD∈[0,1]} {PMD | dKL(PFA, PMD) ≤ ε},   (6.7)
where the minimum is taken over all detectors whose probability of false alarm
is PFA . Figure 6.1 shows the probability of detection of a stego object
PD (PFA ) = 1 − PMD (PFA ) (6.8)
as a function of PFA for various values of . The curves are called Receiver-
Operating-Characteristic (ROC) curves (see Appendix D for the definition). The
region under each curve is the range within which the performance of Eve's detector must fall for a given value of ε. Note that Eve cannot minimize both errors at the same time as there appears to be a trade-off between the two types of error. In particular, if Eve is not allowed to falsely accuse Alice or Bob of using steganography, PFA = 0, Eve's detector will fail to detect ε-secure steganographic communication with probability at least

PMD ≥ e^{−ε}.   (6.9)

This can be easily seen by setting PFA = 0 in (6.6). Thus, the smaller ε is, or the closer the two distributions, Pc and Ps, are, the greater the likelihood that a covert communication will not be detected. This motivates choosing the KL divergence as a measure of steganographic security.
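The bound (6.7) has no closed form for PFA > 0, but it is easy to evaluate numerically. The sketch below finds the smallest PMD compatible with an ε-secure stegosystem by bisection, using the fact that dKL(PFA, ·) decreases on (0, 1 − PFA]:

```python
import math

def d_kl(pfa, pmd):
    """Binary KL divergence dKL(PFA, PMD) from (6.5)."""
    t1 = (1 - pfa) * math.log((1 - pfa) / pmd) if pfa < 1 else 0.0
    t2 = pfa * math.log(pfa / (1 - pmd)) if pfa > 0 else 0.0
    return t1 + t2

def min_pmd(pfa, eps, iters=80):
    """Smallest missed-detection probability compatible with an
    eps-secure stegosystem at false-alarm rate pfa, by bisection."""
    lo, hi = 1e-12, 1 - pfa          # d_kl(pfa, 1 - pfa) = 0, always feasible
    for _ in range(iters):
        mid = (lo + hi) / 2
        if d_kl(pfa, mid) <= eps:
            hi = mid                 # feasible: try a smaller pmd
        else:
            lo = mid
    return hi
```

At PFA = 0 the routine reproduces the bound (6.9), PMD = e^{−ε}, and at ε = 0 it returns the random-guessing line PMD = 1 − PFA.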
It is also instructive to inspect what kind of detector Eve obtains for secure stegosystems with ε = 0. Because the KL divergence is always non-negative, from (6.6) we have that dKL(PFA, PMD) = 0. As shown in Appendix B, this can happen only if distributions pc and ps are the same, or PMD = 1 − PFA. A detector whose false alarms and missed detections satisfy this relationship amounts to a detector that is just randomly guessing. To see this, imagine the following family of detectors parametrized by p ∈ [0, 1]:

Fp(x) = 1 with probability p, and Fp(x) = 0 with probability 1 − p.   (6.10)
Steganographic security 85
Figure 6.1 Probability of detection PD of a stego image as a function of false alarms, PFA. The ROC curves correspond to ε = 0.1, 0.2, . . . , 1, with ε = 1 corresponding to the top curve and ε = 0.1 to the bottom curve.
When a cover image is sent to this detector, Fp flips a biased coin and decides
“stego” with probability p (the false-alarm probability is PFA = p). On the other
hand, presenting the detector with a stego object, it is detected as cover with
probability 1 − p (the missed detection rate is PMD = 1 − p). Thus, this randomly
guessing detector satisfies PMD = 1 − PFA .
lim_{n→∞} (1/n) log PMD(PFA) = −DKL(Pc||Ps).   (6.11)
Alternatively, one could also say that the probability of a miss decays exponen-
tially with n, PMD (PFA ) ≈ e−nDKL (Pc ||Ps ) .
We stress that this result holds only in the limit of a large number of obser-
vations, n. For a finite number of observations, larger KL divergence does not
necessarily imply the existence of a better detector. Exercise 6.3 shows an ex-
ample of two families of distributions, gN and hN on C = {1, . . . , N }, for which
DKL (gN ||hN ) = N and for which any detector based on one observation will al-
ways have PD − PFA ≤ δN , where δN → 0. In other words, despite the fact that
the KL divergence between distributions gN and hN grows to infinity with N ,
our ability to decide between them on the basis of a single observation diminishes
to random guessing.
Benchmarking steganographic systems using their KL divergence as hinted
above is, however, still not without problems. Ignoring for now the issue of nu-
merically evaluating the KL divergence for a real stegosystem, the KL divergence
is a function of the relative payload (or, more accurately, the distribution of the
change rate β). Thus, it is conceivable that two stegosystems compare differently
for different payloads.
On the other hand, if the prisoners wish to stay undetected, with time they
must start embedding smaller and smaller payloads, otherwise Eve would detect
the covert communication with certainty. To see this, imagine that the prisoners
communicate using change rate bounded from below, β ≥ β0 > 0. Then, the KL
divergence between cover and stego objects would be bounded from below by
DKL (Pc ||Pβ0 ).1 Invoking the Chernoff–Stein lemma again, Eve’s detector would
thus achieve an arbitrarily small probability of missed detection, PMD , at any
bound on false alarms PFA . In other words, the prisoners would be caught with
probability approaching 1.
Thus, it makes sense to define the steganography benchmark by properties of
Eve’s best detector in the limit of β → 0. Let us assume for simplicity that covers
are represented by n iid realizations x[i], i = 1, . . . , n, of a scalar random variable
x with pmf Pc ≡ P0 . If the embedding modifications are also independent of each
other, the stego object is a sequence of iid realizations that follow distribution
Pβ , where β is the change rate. In the simplest case, Eve knows the payload and
1 Pβ0 stands for the distribution of stego objects modified with change rate β0 .
It is shown in Section D.2 that for small β and large n, Lβ(x)/n approaches the Gaussian distribution

Lβ(x)/n ∼ N(−(1/2)β²I(0), (1/n)β²I(0)) under H0,
Lβ(x)/n ∼ N((1/2)β²I(0), (1/n)β²I(0)) under H1,   (6.15)
where I(β) is the Fisher information for one observation (see Section D.2 for more details about Fisher information),

I(β) = Σ_x (1/Pβ(x)) (∂Pβ(x)/∂β)².   (6.16)
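For a toy iid cover model in which embedding flips values within each LSB pair {2k, 2k + 1} with change rate β, (6.16) can be evaluated directly (a sketch under this assumed embedding model):

```python
def fisher_information(p0, beta):
    """Fisher information (6.16) for an iid cover pmf p0 under independent
    flips within each pair {2k, 2k+1} with change rate beta (toy model)."""
    info = 0.0
    for k in range(len(p0) // 2):
        a, b = p0[2 * k], p0[2 * k + 1]
        p_even = (1 - beta) * a + beta * b      # P_beta(2k)
        p_odd = beta * a + (1 - beta) * b       # P_beta(2k+1)
        d = b - a                               # dP_beta(2k)/dbeta; odd bin is -d
        if p_even > 0:
            info += d * d / p_even
        if p_odd > 0:
            info += d * d / p_odd
    return info
```

A pmf that is already flat within each pair gives I(β) = 0, consistent with LSB-pair evening being undetectable in that case.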
with covers exhibiting dependences in the form of a Markov chain [69, 70]. For
such stegosystems, similar limiting behavior for small change rates can be estab-
lished for the KL divergence between the Markov chain of covers and the hidden
Markov chain of stego objects.2
Fisher information could potentially be used for benchmarking stegosystems
with real-life cover sources, such as digital images. Here, there is little hope,
however, that we could compute it analytically due to the difficulty of modeling
images. A plausible practical option is to model images using numerical features,
such as features used in blind steganalysis (see Chapter 12), and compute the
Fisher information experimentally from a database of cover and stego images
embedded with varying change rate. This is currently an active area of research
and the interested reader is referred to [136, 191] and the references therein.
C0 = arg min_{C′⊂C} |Pc(C′) − Pc(C − C′)|,   C1 = C − C0.   (6.20)
2 Steganographic embedding into a Markov chain produces stego objects that are no longer
Markov and instead form a hidden Markov chain (see, e.g., [213]).
Bob the object from C encoded by that codeword. Then, she continues with
the remainder of the message bits till the complete message has been sent. Bob
reads the secret message bits by compressing Alice’s objects using the same com-
pression scheme and concatenating the bit strings. Because Alice uses a perfect
prefix-free compressor, she will be sending objects that exactly follow the re-
quired distribution Pc and thus Ps = Pc , which means that the steganographic
method is perfectly secure. Note that the average steganographic capacity over
all covers is the entropy H(Pc ).
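A toy instance of this construction uses a three-object cover source whose dyadic probabilities (1/2, 1/4, 1/4) admit a perfect prefix-free code (the objects and code below are hypothetical, and the message is assumed to parse fully into codewords):

```python
CODE = {'A': '0', 'B': '10', 'C': '11'}   # perfect code for Pc = (1/2, 1/4, 1/4)
INV = {v: k for k, v in CODE.items()}

def embed(bits):
    """'Decompress' the message bits: each parsed codeword selects the
    cover object to send (the message must end on a codeword boundary)."""
    objs, buf = [], ''
    for b in bits:
        buf += b
        if buf in INV:
            objs.append(INV[buf])
            buf = ''
    assert buf == '', 'message must parse fully into codewords'
    return objs

def extract(objs):
    """Bob recompresses the received objects to recover the message."""
    return ''.join(CODE[o] for o in objs)
```

When the message bits are uniform iid, '0' is parsed with probability 1/2 and each two-bit codeword with probability 1/4, so the objects Alice sends follow exactly Pc and Ps = Pc.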
This stegosystem is rather academic because the complexity and dimensional-
ity of covers formed by digital media objects, such as images, will prevent us from
determining even a rough approximation to the distribution Pc . It is, however,
possible to realize this idea within a sufficiently simple model of covers (see Sec-
tion 7.1.2 on model-based steganography). The scheme also tells us something
quite fundamental. In Chapter 3, we learned that the process of image acqui-
sition using imaging sensors is influenced by multiple sources of imperfections
and noise. Some of the noise sources, such as the shot noise, are truly random
phenomena caused by the quantum properties of light.3 Thus, in the absence of
other imperfections, even if we took multiple images of exactly the same scene
with identical camera settings, we would obtain slightly different images. The
images could be described as a superposition of the true scene S[i] and a two-dimensional field of some (not necessarily independent or identically distributed) random variables η[i],

x[i] = S[i] + η[i], i = 1, . . . , n,

where n is the number of pixels in the image. The variables η[i] will be locally
dependent and also dependent on S due to demosaicking, in-camera processing,
and artifacts, such as blooming, and possibly due to JPEG compression. Because
the dependence is only local, η[i] and η[j] will be independent as long as the
distance between the pixels i and j is larger than some fixed threshold deter-
mined by the physical phenomena inside the sensor and the character of the
dependences. If we model the dependences as a Markov Random Field (MRF),
it is known that the entropy of an MRF increases linearly with the number of random variables [50]. Thus, with increasing number of pixels, n, the entropy of η will be proportional to the number of pixels, H(η) ∝ n.
3 Experiments convincingly violating Bell’s inequalities [8] showed that quantum mechanics is
free of “hidden” parameters and is thus indeterministic.
This thought experiment also tells us that with every digital image, x, acquired
by a sensor, there is a large cluster of natural images that slightly differ from
x in their noise components. And the size of this cluster increases exponentially
with the number of pixels in the image. Essentially all steganographic schemes
try, in one way or another, to reach into this image cluster by following certain
design principles and elements as discussed in Chapters 7–9.
4 For block transforms, such as the DCT in JPEG compression, one can use a more general
model and consider the image as 64 parallel channels, each modeled as an iid sequence with
a different distribution.
models) or a large image database to also estimate the distribution of θ for het-
erogeneous models. Adopting this approach, Alice can then guarantee that her
steganography will be undetectable within the model if her embedding preserves
the essential statistics. For an iid source, she needs to preserve the histogram,
while for the Markov model, she needs to preserve the transition-probability ma-
trix and the Markov property. This approach to steganography enables exact
mathematical proofs of security as well as progressive, methodological improve-
ment. There is some hope that with advances in statistical modeling of images,
this approach will lead to secure stegosystems.
So far, however, all steganographic schemes for digital images that follow this
paradigm (see the methods in Chapter 7) have been broken. This is because all
that Eve needs to do to attack a stegosystem that is provably secure within a
given model is to use a better model of covers that is not preserved by embed-
ding. Often, it is sufficient to merely identify a single statistical quantity that is
predictably perturbed by embedding, which is usually not very hard given the
complexity of typical digital media. We discuss these issues from the point of
view of steganalysis in Chapters 10–12.
Figure 6.2 A diagram of steganographic communication with an embedder whose distortion is limited by D1 and a noisy channel A(·|·) with distortion limited by D2.
schemes that would work for real digital media, the results provide quite valuable
fundamental insight.
Covers will be represented as sequences of n elements from some alphabet X ,
x ∈ C = X n . The steganographic channel, captured with the embedding mapping
Emb, extraction mapping Ext, source of covers, messages, and secret keys, is now
augmented with two more requirements that bound the expected embedding
distortion and the expected channel distortion that the stego object experiences
on its way from Alice to Bob,

E[d(x, y)] = Σ_x Σ_k Σ_m Pc(x) Pk(k) Pm(m) d(x, Emb(x, k, m)) ≤ D1,   (6.28)

Σ_{y,ỹ} Ps(y) A(ỹ|y) d(ỹ, y) ≤ D2,   (6.29)
where d(x, y) is some measure of distortion per cover element, such as the energy (1/n) d2(x, y) = (1/n) Σ_{i=1}^{n} (x[i] − y[i])². The matrix A(ỹ|y) captures the probabilistic
active warden (noisy channel) and stands for the conditional probability that the
noisy stego object, ỹ, is received by Bob when stego object y was sent by Alice.
The scalar values, D1 and D2 , are the bounds on the per-element embedding
and channel distortion. Again, the stegosystem is considered secure if Pc = Ps .
A diagram showing the main elements of the communication channel is displayed
in Figure 6.2.
Secure stegosystems with distortion-limited embedder exist and two examples,
for the passive-warden case, are given below.5
y[i] = γ x[i] + (2m − 1) w[i], i = 1, . . . , n,

for some scalar γ. We now need to determine γ and Dw so that the cover model is preserved and the embedding distortion is below D1.
Because w is independent of x, the stego object y is a sequence of iid Gaussian
random variables N (0, γ 2 σc2 + Dw ). To obtain a perfectly secure stegosystem, we
need to preserve the Gaussian model of the cover source. Because y is a zero-mean
Gaussian signal, we must preserve the variance γ 2 σc2 + Dw = σc2 . The expected
value of the embedding distortion is
(1/n) E[d2(x, y)] = (1/n) E[Σ_{i=1}^{n} (x[i] − y[i])²]   (6.31)
= (1/n) E[Σ_{i=1}^{n} ((1 − γ)x[i] − (2m − 1)w[i])²]   (6.32)
= (1 − γ)² σc² + Dw.
Bob receives ỹ = y + z, where each z[i] ∼ N(0, D2). The distribution of the test statistic ρ has the same mean but a higher variance. Nevertheless, with increasing n, its standard deviation is again proportional to 1/√n.
Finally, we note that, even though the spread-spectrum embedding method
is perfectly secure with a distortion-limited embedder, it has a low embedding
capacity far below the theoretical steganographic capacity for such a channel [49,
235] (also, see Chapter 13).
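The model-preserving choice γ²σc² + Dw = σc², i.e., γ = √(1 − Dw/σc²), is easy to verify empirically (a simulation sketch, not the book's code):

```python
import math
import random

def ss_embed(x, m, sigma_c2, Dw, rng):
    """Spread-spectrum sketch: y[i] = gamma*x[i] + (2m-1)*w[i] with gamma
    chosen so that gamma^2*sigma_c^2 + Dw = sigma_c^2 (model preservation)."""
    gamma = math.sqrt(1 - Dw / sigma_c2)
    sd_w = math.sqrt(Dw)
    return [gamma * xi + (2 * m - 1) * rng.gauss(0, sd_w) for xi in x]
```

Sampling a long N(0, σc²) cover and embedding either message bit leaves the sample variance of the stego sequence at σc², as required for Ps = Pc.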
Figure 6.3 Left: Lattice Λ◦ (circles) with its Voronoi cell V◦. Right: The fine lattice Λ◦ ∪ Λ+ and its Voronoi cell Vf. Note that Λ+ = Λ◦ + (1, 1).
Example 6.1: [Stochastic QIM with two lattices] Figure 6.3 shows an ex-
ample of a regular lattice in R2 ,
Λ = Λ◦ = {(2i, 2j)| i, j ∈ Z} (6.43)
and two dither vectors d◦ = (0, 0), d+ = (1, 1). Note that Λ+ = {(2i + 1, 2j + 1)| i, j ∈ Z}, V◦ = [−1, 1] × [−1, 1], and Vf = {x : |x[1]| + |x[2]| < 1} if we use the L1 norm. The message that can be sent in each block consisting of two cover elements is one bit, m ∈ {0, 1}.
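For the lattices of Example 6.1, basic QIM embedding and extraction reduce to rounding (a sketch; the stochastic variant discussed next additionally randomizes y within the Voronoi cell):

```python
def qim_embed(x, m):
    """Quantize x in R^2 to the nearest point of Lambda_m:
    Lambda_0 = (even, even) points, Lambda_1 = (odd, odd) points."""
    if m == 0:
        return tuple(2 * round(c / 2) for c in x)
    return tuple(2 * round((c - 1) / 2) + 1 for c in x)

def qim_extract(y):
    """Decode by deciding which sublattice has the closer point to y."""
    d0 = sum((a - b) ** 2 for a, b in zip(y, qim_embed(y, 0)))
    d1 = sum((a - b) ** 2 for a, b in zip(y, qim_embed(y, 1)))
    return 0 if d0 <= d1 else 1
```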
This simple version of QIM is not suitable for steganography as the stego
images confined to the fine lattice would certainly not follow the distribution of
covers and could easily be identified by Eve as suspicious. Alice needs to spread
the stego objects around the lattice points to preserve the distribution. This
is the idea behind the stochastic QIM [235]. For now, we will assume that the
covers are x ∈ RN rather than considering them broken up into disjoint blocks
of N elements each.
The Euclidean space is divided into M regions that are translates of each
other,
Rm = Λm + Vf , 1 ≤ m ≤ M. (6.44)
Note that Rm is the region of all noisy stego objects that carry the same mes-
sage, m. Let p[m] = Pc (Rm ) be the probability of a cover being in Rm . Let
us further assume that the probability that message m is sent is exactly p[m].
This could be arranged, for example, by prebiasing the stream of originally uni-
formly distributed message symbols using an entropy decoder (see more details
in Section 7.1.2 on model-based steganography).
We now define the embedding function Emb(x, m) for any cover x ∈ RN and
message m. Alice first identifies z ∈ Λm that is closest to x. If z already belongs
to Rm , no embedding change is required and Alice simply sends y = x. In the
Steganographic security 97
opposite case, Alice generates y randomly from the shifted Voronoi cell z + Vf
according to the distribution Pc constrained to the cell z + Vf , Pc (·)/Pc (z + Vf ).
Note that the bound on embedding distortion, D1 , determines how close to each
other the lattice points must be.
It should be clear that the stego object y ends up in Rm with probability p[m].
Also, within each cell, the probability distribution has the correct shape. The
only question is with what probability y appears in a given cell, or Pr{y ∈ z +
Vf }. It can be shown that for a well-behaved Pc, Pr{y ∈ z + Vf} ≈ Pc(z + Vf). More accurately, for any ε > 0, |Pr{y ∈ z + Vf} − Pc(z + Vf)| < ε for sufficiently small D1. Rather than proving this for the general case, we provide a qualitative
argument for the simple case of the fine lattice shown in Figure 6.3, right. It
shall be clear that the argumentation applies to more general cases.
Let us compute the probability of y ∈ z + Vf. In Figure 6.4, z + Vf is the diamond. Since p[0] = p[1] = 1/2, Pr{y ∈ z + Vf} is 1/2 times the probability that x falls into the cell directly plus 1/2 times the probability that we end up in one of the four triangles and move to the cell. Denoting the union of the four triangular regions as T, we have

Ps(z + Vf) = Pr{y ∈ z + Vf} = (1/2) Pc(z + Vf) + (1/2) Pc(T).   (6.45)
If the distribution Pc were locally linear around z, we would have the equality
Ps (z + Vf ) = Pr{y ∈ z + Vf } = Pc (z + Vf ). For general but well-behaved Pc ,6 the
difference between the probabilities Ps (z + Vf ) and Pc (z + Vf ) can be made ar-
bitrarily small for sufficiently small D1 . (Recall that D1 determines the spacing
between the points in the lattice.) In summary, we will have |Pc(x) − Ps(x)| < ε for all x for sufficiently small D1. Assuming Pc(x) > κ > 0 whenever Pc(x) ≠ 0 (which will be satisfied for any finite cover set C), we can use the result of Exercise 6.2 to claim that the KL divergence DKL(Pc||Ps) < ε/κ.
Figure 6.4 Example of a cell Vf surrounding a point z from the fine lattice formed by
circles and crosses. The four triangular regions shaded light gray are the regions from
where the cover x can be mapped to the cell during embedding.
Bernoulli sequences B(1/2) (binary sequences with probability of 0 equal to 1/2)
both under an active and under a passive warden. Covers with a multivariate
Gaussian distribution and their security and steganographic capacity are studied
in [235]. This work also contains a rather fundamental result that block embed-
ding in such Gaussian covers is insecure in the sense that the KL divergence grows
linearly with the number of blocks. Perfectly secure and ε-secure steganographic
methods for Gaussian covers are studied in [49]. The authors compute a lower
bound on the capacity of perfectly secure stegosystems and study the increase in
capacity for -secure steganography. They also describe a practical lattice-based
construction for high-capacity secure stegosystems. The work is contrasted with
results obtained for digital watermarking where the steganographic constraint
Pc = Ps is absent. Codes for construction of high-capacity stegosystems are de-
scribed in [237]. The authors also show how their approach can be generalized
from iid sources to Markov sources and sources over continuous alphabets.
tion through espionage. Thus, our paranoia dictates that we should assume that
Eve knows all details of the distribution. Realistically, however, this can only
be reasonable within a sufficiently simple model of covers in some artificially
conceived communication channel rather than in reality, where the covers are
very complex objects, such as digital-media files. Additionally, one may attack
the very assumption that covers can be described by a random variable at all.
We choose not to delve into this rather philosophical issue and instead refer the
reader to [22] for an intriguing discussion of this issue from the perspective of
epistemology.
Second, the information-theoretic approach by definition ignores complexity
issues. It is concerned only with the possibility of constructing an attack rather
than its practical realization. It is quite possible that, even though an attack on
a stegosystem can be mounted in principle, its practical realization may require
resources so excessive that no warden can implement the attack. For example, if
the computational complexity of the attack grows exponentially with the security
parameters of the stegosystem (stego key length and number of samples in the
cover), one could choose the key length and covers large enough to make sure
that any warden with polynomially bounded resources will be reduced to random
guessing. This idea is the basis of definitions of steganographic security based on
complexity-theoretic principles [113, 124].
To help explain the point further, we take a look at the related field of cryp-
tography. According to Shannon’s groundbreaking work on cryptosystems, the
only unconditionally secure cryptosystems are those whose key has the same
length as the text to be encrypted. This is what security defined using informa-
tion theory leads to. However, there exists an important and very useful class
of asymmetric cryptosystems (so-called public-key systems [207]) whose security
stems from the excessive computational complexity of constructing the attack
rather than the fundamental impossibility of constructing one. In this section, we
explain a similar approach to defining steganographic security that is based on
complexity-theoretic considerations.
In 2002, Hopper et al. [113] and Katzenbeisser and Petitcolas [124] inde-
pendently proposed a complexity-theoretic definition of steganographic security.
These two proposals share several important novel ideas. First, the requirement of
knowing the probability distribution of covers is replaced with a weaker assump-
tion of availability of an oracle O that samples from the set of covers according
to their distribution over the channel. Second, the security of a stegosystem is
established by means of a probabilistic game between a judge and the warden.
The warden is allowed to sample the oracle O and the embedding oracle is im-
plemented as a black box seeded with an unknown stego key. The warden is
then essentially asked to distinguish between the outputs of the two oracles. The
advantage of the warden is defined as the probability of correct decision minus
1/2. The stegosystem is secure if the warden's advantage is negligible (it falls to
zero faster than any power of 1/k, where k is the security parameter, such as
the number of bits in the stego key or the number of elements in the cover). We
first describe the approach that appeared in [113].
7 A timestamped bit is a pair {m[i], t[i]}, where t[i] is the time of sending the ith bit m[i].
8 A probabilistic algorithm uses randomness as part of its reasoning, usually realized using a
PRNG.
9 We use the prefix P in the embedding and extraction algorithms to stress that PEmb and PExt are now probabilistic algorithms rather than mappings as in the previous four sections or in Chapter 4. The symbol ∗ in PExt means that it can operate on bit strings of any length.
10 |s| denotes the length of bit string s.
and -insecurity on the basis of the warden’s ability to correctly identify the
oracle using polynomial-complexity calculations.
One of the most intriguing implications of this complexity-theoretic view of
steganographic security is the fact that secure stegosystems exist if and only if
secure one-way (hash) functions exist under the assumption that the minimal
entropy11 of the cover source is larger than 2. This ties the security in steganog-
raphy to the security of common cryptographic primitives. The proof of this
statement is constructive and the embedding method is essentially a combina-
tion of steganography by cover selection and cover synthesis (see Chapter 4) that
proceeds by embedding one bit in a block of b cover bits.
Alice uses a key-dependent hash function12 h(k, ·) that maps b timestamped
bits to {0, 1}. The hash function enters the embedding process in the follow-
ing manner. Having synthesized (embedded) h timestamped stego bits, Alice
generates the next b bits, y, by querying the oracle Oh,b until h(k, y) returns
the required message bit. The number of calls to the oracle is limited by a fixed
upper bound and thus there is a small probability that a message bit will not
be embedded in the block. This embedding algorithm, which is called rejection
sampling, is an example of steganography by cover selection. The extraction al-
gorithm simply applies the same hash function to each block of b timestamped
stego bits to recover one message bit from each block. Under some rather tech-
nical assumptions about the hash function, which are satisfied if secure one-way
functions exist, this embedding algorithm is secure in the sense of Section 6.4.1.
11 The minimal entropy, Hmin, of a source with pmf p[i] is defined as Hmin(p) = min_i log2(1/p[i]).
12 In [113], the authors use the term pseudo-random function.
PEmb(x2 , k, m). Then, the judge flips a fair coin and gives either the cover x1 or
the stego object PEmb(x2 , k, m) to the warden. The warden performs a polyno-
mial probabilistic test to decide whether she is observing a cover or stego object.
The advantage of the warden is the probability of a correct guess minus 1/2. The
stegosystem is secure for oracle O if the warden’s advantage is negligible.
It is possible to construct secure steganographic systems S by reducing an
intractable problem Pint to the steganographic decision problem for S. The proof
of security can then be realized by contradiction in the following manner. Under
the assumption that S is not secure in the above sense, there exists a probabilistic
game Z between the warden and the judge that allows the warden to detect
stego objects with non-negligible probability. If the stegosystem S is constructed
in such a way that the existence of Z implies that instances of Pint can be solved
with non-negligible probability, we obtain a contradiction with the intractability
of Pint . The interested reader is referred to [124], where the authors use this
approach to construct a stegosystem whose insecurity would lead to an attack
on the RSA cryptosystem [207].
Summary
- The information-theoretic definition of steganographic security assumes that the cover source is a random variable with known distribution Pc.
- Feeding covers into a stegosystem according to their distribution Pc, the distribution of stego objects follows distribution Ps.
- A steganographic system is perfectly secure (or ε-secure) if the Kullback–Leibler divergence DKL(Pc||Ps) = 0 (or DKL(Pc||Ps) ≤ ε).
- The KL divergence is a measure of how different the two distributions are.
- The existence of perfect compression of the cover source implies the existence of perfectly secure steganographic schemes.
- Describing the covers using a simplified model, we obtain a concept of security with respect to a model. Stegosystems preserving this model are undetectable within the model.
- Spread-spectrum methods and lattice-based methods (quantization index modulation) can be used to construct stegosystems with a distortion-limited embedder and a distortion-limited active warden.
- Steganographic security defined in the information-theoretic sense is concerned only with the possibility of mounting an attack and not its feasibility (computational complexity).
- An alternative definition of security is possible in which access to the distribution of covers is replaced with availability of an oracle that generates covers. Security is then defined as the inability of a polynomially bounded warden to construct a reliable attack.
- The complexity-theoretic security of stegosystems can be proved by reducing the steganalysis problem to a known intractable problem or by using secure one-way functions.
Exercises
6.1 [KL divergence between two Gaussians] Let f1 (x) and f2 (x) be the
pdfs of two Gaussian random variables N (μ1 , σ12 ) and N (μ2 , σ22 ). Prove for their
KL divergence
DKL(f1||f2) = ∫ f1(x) log(f1(x)/f2(x)) dx = (1/2) [log(σ2²/σ1²) + σ1²/σ2² − 1 + (μ2 − μ1)²/σ2²]. (6.46)
6.4 [KL divergence for LSB embedding in iid source] Assume that the
cover image is a sequence of independent and identically distributed random
variables with pmf p0[i]. After embedding relative payload α = 2β using LSB embedding, the stego image is a sequence of iid random variables with pmf pβ. Show that

DKL(p0||pβ) = (β²/2) Σ_{i=0}^{127} (p0[2i] − p0[2i+1])² (1/p0[2i] + 1/p0[2i+1]) + O(β³). (6.55)
Hint: Use Proposition B.7.
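To see the quadratic approximation (6.55) in action, the sketch below builds the stego pmf for a randomly generated cover pmf and compares the exact KL divergence against the leading term of (6.55). The assumed form of pβ (each member of an LSB pair {2i, 2i+1} flipped to its partner with probability β) is the standard model of LSB embedding with change rate β; KL is computed in nats:

```python
import math, random

def lsb_stego_pmf(p0, beta):
    """Pmf after LSB embedding: within each LSB pair {2i, 2i+1}, a value is
    flipped to its partner with probability beta (assumed model)."""
    pb = [0.0] * 256
    for i in range(128):
        pb[2 * i] = (1 - beta) * p0[2 * i] + beta * p0[2 * i + 1]
        pb[2 * i + 1] = beta * p0[2 * i] + (1 - beta) * p0[2 * i + 1]
    return pb

def kl(p, q):
    """Exact KL divergence in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_quadratic(p0, beta):
    """Leading term of (6.55)."""
    return (beta ** 2 / 2) * sum(
        (p0[2 * i] - p0[2 * i + 1]) ** 2 * (1 / p0[2 * i] + 1 / p0[2 * i + 1])
        for i in range(128))

random.seed(1)
p0 = [random.random() + 0.5 for _ in range(256)]  # random positive pmf
s = sum(p0)
p0 = [x / s for x in p0]
beta = 0.01
exact = kl(p0, lsb_stego_pmf(p0, beta))
approx = kl_quadratic(p0, beta)
print(exact, approx)  # for small beta the two agree up to O(beta^3)
```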
6.5 [KL divergence for ±1 embedding in iid source] Assume that the
cover image is a sequence of independent and identically distributed random
variables with range {0, . . . , 255} and pmf p0 [i]. After embedding relative pay-
load α = 2β using ±1 embedding (see Section 7.3.1), the stego image becomes a
sequence of iid random variables with pmf computed in Exercise 7.1. Show that
the KL divergence satisfies
DKL(p0||pβ) = β² Σ_{i=2}^{253} (p0[i−1] − 2p0[i] + p0[i+1])² / (8p0[i])
+ β² (p0[1] − 2p0[0])² / (8p0[0]) + β² (2p0[0] − 2p0[1] + p0[2])² / (8p0[1])
+ β² (p0[253] − 2p0[254] + p0[255])² / (8p0[254])
+ β² (p0[254] − 2p0[255])² / (8p0[255]) + O(β³). (6.56)
Hint: Use Proposition B.7.
7 Practical steganographic methods
Steganographic schemes from the first class are based on a simplified model of
the cover source. The schemes are designed to preserve the model and are thus
undetectable within this model. The remaining three design principles are heuris-
tic. The goal of the second principle is to masquerade the embedding as some
natural process, such as noise superposition during image acquisition. The third
principle uses known steganalysis attacks as guidance for the design. Finally,
the fourth principle first assigns a cost of making an embedding change at each
element of the cover and then embeds the secret message while minimizing the
total cost (impact) of embedding. It is also possible and, in fact, advisable, to
take into consideration all four principles.
We now describe each design philosophy in more detail and give examples of
specific embedding schemes.
is determined by the most imbalanced LSB pair, which is the pair {2k, 2k + 1}
with the largest ratio
max{h[2k], h[2k + 1]} / min{h[2k], h[2k + 1]}. (7.1)
Because the histogram of DCT coefficients in a single-compressed JPEG image
has a spike at zero (see the discussions in Chapter 2) and because the LSB pair
{0, 1} is skipped during embedding, the most imbalanced LSB pair is {−2, −1}.
Because h[−2] < h[−1] in typical cover images, after embedding the maximum
correctable payload all remaining coefficients with value −2 will have to be
modified to −1 in the correction phase.
Writing n01 for the number of all DCT coefficients not equal to 0 or 1, the
maximal correctable payload will be αmax n01 , 0 ≤ αmax ≤ 1. Because all αmax n01
coefficients are selected pseudo-randomly in the embedding phase, the number
of unused DCT coefficients with value −2 after embedding is (1 − αmax )h[−2].
In order to restore the number of −1s in the stego image, this value must be
larger than or equal to the expected decrease in the number of coefficients with
value −1, which is (αmax /2)h[−1] − (αmax /2)h[−2]. This is because, assuming a
random message is embedded, the probability that a message bit will match the
LSB of −1 is 1/2 and thus on average (αmax /2)h[−1] coefficients with value −1 will
be unchanged by embedding and the same number of them will be modified to
−2. The second term, (αmax /2)h[−2] is the expected number of coefficients with
value −2 that will be modified to −1 during embedding. Because h[−2] < h[−1],
the expected drop in the number of coefficients with value −1 is the difference
(αmax /2)h[−1] − (αmax /2)h[−2]. Thus, we obtain the following inequality and
eventually an upper bound on the maximum correctable payload αmax :
(1 − αmax) h[−2] ≥ (αmax/2) h[−1] − (αmax/2) h[−2], (7.2)

αmax ≤ 2h[−2] / (h[−1] + h[−2]). (7.3)
This condition guarantees that at the end of the embedding phase on average
there will be enough unused coefficients with magnitude −2 that can be flipped
back to −1 to make sure that the occurrences of the LSB pair {−2, −1} are pre-
served after embedding. As this is the most imbalanced LSB pair, the occurrences
of virtually all other LSB pairs can be preserved using the same correction step,
as well. We note that some very sparsely populated histogram bins in the tails
of the DCT histogram may not be restored correctly during the second phase,
but, since their numbers are statistically insignificant, the impact on statistical
detectability is negligible. The average capacity αmax of OutGuess for typical
natural images is around 0.2 bpnc (bits per non-zero DCT coefficient).
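The bound (7.3) is straightforward to evaluate from the histogram of DCT coefficients. A short sketch (the histogram counts are made up purely for illustration):

```python
def outguess_max_payload(h):
    """Upper bound (7.3) on the maximum correctable payload of OutGuess,
    given a histogram h (dict: DCT value -> count). The bound is driven by
    the most imbalanced LSB pair {-2, -1}, as derived in the text."""
    return 2 * h[-2] / (h[-1] + h[-2])

# hypothetical histogram counts for the pair {-2, -1}, for illustration only
h = {-2: 3000, -1: 21000}
print(outguess_max_payload(h))  # -> 0.25
```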
Steganographic schemes that embed messages in the spatial domain require
more complex models because neighboring pixels are more correlated than DCT
coefficients in a JPEG file. These correlations cannot be captured using first-order statistics. Instead, one can use the joint statistics of neighboring pixel
pairs [75], statistics of differences between neighboring pixels (see Section 11.1.3
on structural steganalysis), or Markov chains.
where C(xinv ) is the set of cover elements whose invariant part is xinv . The
embedding algorithm embeds a portion of the message in each C(xinv ) in the fol-
lowing manner. First, the message bits are encoded using symbols from Xemb
and then the symbols are prebiased so that they appear with probabilities
Pr{xemb |xinv = xinv }. This is achieved by running the message symbols through
an entropy decompressor, for a compression scheme designed to compress symbols
from Xemb distributed according to the same conditional probabilities.1 When
the decompressor is fed with xemb distributed uniformly in Xemb , it will output
symbols with probabilities Pr{xemb |xinv = xinv }. Thus, when xemb of all cover el-
ements from C(xinv ) are replaced with the transformed symbols, xemb , the stego
image elements will follow the cover-image model as desired.
The fraction of bits that can be embedded in each element of C(xinv) is the
entropy of Pr{xemb | xinv = xinv}.

(Figure: block diagram of model-based embedding, with blocks labeled Model, Entropy decoder, Message m, and Stego y.)
embedding for all i, we can use this invariant and fit a parametric model, h(x),
through the points ((2i + (2i + 1))/2, (h[2i] + h[2i + 1])/2). For example, we can
model the DCT coefficients using the generalized Cauchy model with pdf (see
Appendix A)
h(x) = ((p − 1)/(2s)) (1 + |x|/s)^(−p) (7.8)
and determine the parameters using maximum-likelihood estimation (see Exam-
ple D.8). Denoting the 7 most significant bits of an integer a as MSB7 (a), we
define the model using the conditional probabilities
Pr{xemb = 0 | xinv = MSB7(2i)} = h(2i) / (h(2i) + h(2i + 1)), (7.9)
Pr{xemb = 1 | xinv = MSB7(2i)} = h(2i + 1) / (h(2i) + h(2i + 1)). (7.10)
The sender continues with the embedding process by selecting all h[2i] +
h[2i + 1] DCT coefficients at the chosen DCT mode that are equal to 2i or
2i + 1. Their LSBs are replaced with a segment of the message that was de-
compressed using an arithmetic decompressor to the length h[2i] + h[2i + 1].
The decompressor is designed to transform a sequence of uniformly distributed
message bits to a biased sequence with 0s and 1s occurring with probabilities
(7.9)–(7.10). Because we are replacing a bit sequence with another bit sequence
with the same distribution, the model (7.9)–(7.10) will be preserved.
The recipient first constructs the model and computes the probabilities given
by (7.9)–(7.10). This can be achieved because h[2i] + h[2i + 1] is invariant with
respect to embedding changes! Individual message segments are extracted by
feeding the LSBs for each DCT mode and each LSB pair {2i, 2i + 1} into the
arithmetic compressor and concatenating them.
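A sketch of how the conditional probabilities (7.9)–(7.10) would be computed from a fitted generalized Cauchy model (7.8); the parameter values p and s below are hypothetical, standing in for the maximum-likelihood estimates:

```python
def gen_cauchy_pdf(x, p, s):
    """Generalized Cauchy model (7.8) for the DCT histogram."""
    return (p - 1) / (2 * s) * (1 + abs(x) / s) ** (-p)

def mb_probabilities(i, p, s):
    """Conditional probabilities (7.9)-(7.10) that the LSB is 0 or 1
    within the pair {2i, 2i+1} under the fitted model."""
    h0 = gen_cauchy_pdf(2 * i, p, s)
    h1 = gen_cauchy_pdf(2 * i + 1, p, s)
    return h0 / (h0 + h1), h1 / (h0 + h1)

# hypothetical model parameters (p, s) for illustration
pr0, pr1 = mb_probabilities(1, 2.5, 3.0)
print(pr0, pr1)  # pr0 > pr1 because the model decays with |x|
```

Message bits decompressed to match these probabilities then replace the LSBs of the corresponding coefficients, preserving the model as described above.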
This specific example of model-based steganography is designed to preserve the
model of the histograms of all individual DCT modes. This is in contrast with
OutGuess that preserves the sample histogram of all DCT coefficients and not
necessarily the histograms of the DCT modes. For 80% quality JPEG images, the
average embedding capacity of this algorithm is approximately 0.8 bpnc, which
is remarkably large considering the scope of the model and four times larger than
for OutGuess.
We now calculate the embedding efficiency of this algorithm defined as the
average number of bits embedded per unit distortion. In order to simplify the
notation, we set p0 = Pr{xemb = 0 | xinv = MSB7(2i)}.
An embedding change is performed when the coefficient’s LSB does not match the
prebiased message bit. Because the probability that the prebiased message bit is 0
is the same as the probability that the LSB of the coefficient is 0, which is p0 , the
embedding needs to change the LSB with probability p0 (1 − p0 ) + (1 − p0 )p0 =
2p0 (1 − p0 ). Therefore, the embedding efficiency is

e(p0) = H(p0) / (2p0(1 − p0)),

where H(p0) = −p0 log2 p0 − (1 − p0) log2(1 − p0) is the binary entropy of the prebiased bit stream.

(Figure: embedding efficiency e(p0) for p0 ∈ [0, 1]; the efficiency attains its minimum value of 2 at p0 = 1/2 and grows toward the ends of the interval, reaching about 4.5 at the plotted extremes.)
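If the number of bits carried per coefficient is taken to be the binary entropy H(p0) of the prebiased bit stream (an assumption consistent with the model-based construction), while the expected fraction of changed coefficients is 2p0(1 − p0) as derived above, the efficiency can be evaluated directly:

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def embedding_efficiency(p0):
    """Bits embedded per coefficient, H(p0), divided by the expected
    fraction of changed coefficients, 2*p0*(1 - p0)."""
    return binary_entropy(p0) / (2 * p0 * (1 - p0))

print(embedding_efficiency(0.5))  # -> 2.0 (uniform model: plain LSB replacement)
print(embedding_efficiency(0.8))  # larger: a biased model costs fewer changes per bit
```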
There exists a more advanced version of this algorithm [205] that attempts
to preserve one higher-order statistic called “blockiness” defined as the sum of
discontinuities along the boundaries of 8 × 8 pixel blocks in the spatial domain
(see (12.9) in Chapter 12). This is achieved using the idea of statistical restoration
by making additional modifications to unused DCT coefficients in an iterative
manner to adjust the blockiness to its original value in the cover image.2 In
Chapter 12, the original version of Model-Based Steganography is abbreviated
MBS1, while the more advanced version with deblocking is denoted as MBS2.
2 These additional changes, however, make this version of Model-Based Steganography more
detectable (see Chapter 12 and [190, 210]).
p[k] = ∫_{k−1/2}^{k+1/2} f(x) dx, (7.14)

where f(x) is the pdf of the stego noise η.
In theory, by adding realizations of the stego noise to the cover image we could
embed H(p) bits (the entropy of round(η)) at every pixel. Instead of trying to
develop a practical method that achieves this capacity, which is not an easy task,
we provide a simple suboptimal method.
It will be advantageous to work with bits represented using the pair {−1, 1}
rather than {0, 1}. Thus, from now till the end of this section, −1 and 1 represent
binary 0 and 1, respectively. Next, we define a parity function π with the following
antisymmetry property,
π(u + v, v) = −π(u, v) (7.15)
for all integers u and v. The following function, for example, satisfies this re-
quirement:
π(u, v) =  (−1)^u   for 2k|v| ≤ u ≤ 2k|v| + |v| − 1,
          −(−1)^u   for (2k − 1)|v| ≤ u ≤ 2k|v| − 1,   (7.16)
           0        for v = 0,

where k is an integer.
The embedding algorithm starts by generating a pseudo-random path through
the image using the stego key and two independent stego noise sequences r[i] and
s[i] with the probability mass function (7.14). The sender follows the embedding
path and embeds one message bit, m, at the ith pixel, x[i], if and only if r[i] ≠ s[i].
In this case, the sender can always embed one bit by adding either r[i] or s[i] to
x[i] because due to the antisymmetry property of the parity function π (7.15)
π(x[i] + s[i], r[i] − s[i]) = −π(x[i] + s[i] + r[i] − s[i], r[i] − s[i]) (7.17)
= −π(x[i] + r[i], r[i] − s[i]). (7.18)
In the case when r[i] = s[i], the sender does not embed any bit, replaces x[i]
with y[i] = x[i] + r[i], and continues with embedding at the next pixel along the
pseudo-random walk.
To complete the embedding algorithm, however, we need to resolve one more
technicality of the embedding process. If, during embedding, y[i] = x[i] + r[i]
gets out of its dynamic range, the sender has little choice but to slightly deviate
from the stego noise model and instead add r′[i] so that π(x[i] + r′[i], r[i] − s[i]) =
m with |r′[i] − r[i]| as small as possible.
The recipient first uses the stego key to generate the same stego noise se-
quences, r[i] and s[i], and the pseudo-random walk through the pixels. The mes-
sage bits are read from the stego image pixels, y[i], as parities m = π(y[i], r[i] −
s[i]). No bit is extracted from pixel y[i] if r[i] = s[i].
In summary, the sender starts by generating two independent stego noise se-
quences with the required probability mass function and then at each pixel at-
tempts to embed one message bit by adding one of the two samples of the stego
noise. This works due to the antisymmetry property of the parity function. If the
two stego noise samples are the same, the embedding process does not embed
any bits. The sender, however, still adds the noise to the image so that the stego
image is, indeed, obtained by adding noise of a given pmf to the cover image as
required.
Stochastic modulation embeds one bit at every pixel as long as r[i] ≠ s[i]. Thus,
the relative embedding capacity is 1 − Pr{r = s} = 1 − Σ_k p[k]² bits per pixel,
where the probabilities can be computed using (7.14). The random noise sources
during image acquisition, such as the shot noise and the readout noise, are well
modeled as Gaussian random variables with zero mean and variance σ 2 . In this
case, the relative embedding capacity per pixel is as shown in Figure 7.4. Because
the stego noise sequences are more likely to be equal for small σ, the capacity
increases with the noise variance. Also, note that the sender can communicate
about 0.7 bits per pixel using quantized standard Gaussian noise N (0, 1).
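The capacity 1 − Σ_k p[k]² is easy to evaluate for the quantized Gaussian noise model; the sketch below reproduces the value of roughly 0.7 bits per pixel quoted for N(0, 1):

```python
import math

def gaussian_pmf(sigma, kmax=60):
    """Pmf of round(eta) for eta ~ N(0, sigma^2): p[k] = Phi(k+1/2) - Phi(k-1/2)."""
    Phi = lambda x: 0.5 * (1 + math.erf(x / (sigma * math.sqrt(2))))
    return [Phi(k + 0.5) - Phi(k - 0.5) for k in range(-kmax, kmax + 1)]

def stochastic_modulation_capacity(sigma):
    """Relative embedding capacity 1 - sum_k p[k]^2, in bits per pixel."""
    p = gaussian_pmf(sigma)
    return 1 - sum(x * x for x in p)

c1 = stochastic_modulation_capacity(1.0)
print(c1)                                  # roughly 0.73 bits per pixel
print(stochastic_modulation_capacity(2.0))  # larger: capacity grows with the variance
```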
Note that stochastic modulation is suboptimal in general because only 1 − Σ_k p[k]²
bits are embedded at every pixel rather than H(p) bits, the entropy
of the stego noise (Exercise 7.4 shows that, indeed, 1 − Σ_k p[k]² ≤ H(p)). Al-
though it is not known how to construct steganographic systems reaching the the-
oretic embedding capacity for general stego noise, there exist capacity-reaching
constructions for special instances of p, such as when the stego noise amplitude
is at most 1 (see Section 9.4.5).
No matter how plausible the heuristic behind stochastic modulation sounds,
the fact is that it can be reliably detected using modern steganalysis methods
based on feature extraction and machine learning. The main reason is that dur-
ing image acquisition, the noise is injected before the signal is even quantized
in the A/D converter and further processed. Adding the noise to the final TIFF
or BMP image is not the same because this image already contains a multitude
of complex dependences among neighboring pixels due to in-camera process-
ing, such as demosaicking, color correction, and filtering. Thus, the stego noise
should be superimposed on the raw sensor output rather than the final image.
It is, however, not clear at this point how to embed bits in the pixel domain by
modifying the raw sensor output. A possible solution is to apply coding meth-
ods for communication with non-shared selection channels, which is the topic of
Chapter 9.
Figure 7.4 Relative embedding capacity of stochastic modulation (in bits per pixel)
realized by adding white Gaussian noise of variance σ 2 .
α ≤ Hd (D). (7.20)
Alternatively, the relative payload α can be embedded with distortion no smaller
than Hd−1 (α), where Hd−1 is the inverse function to Hd . This translates into the
We now show that the optimal stego noise distribution is the discrete generalized Gaussian

popt[k] = e^(−λ|k|^γ) / Z(λ), (7.23)

where Z(λ) = Σ_k e^(−λ|k|^γ) is the normalization factor. To see this, we write for the entropy H(p) of an arbitrary distribution p satisfying the bound (7.22)
H(p) = Σ_k p[k] log(1/p[k]) = Σ_k p[k] log(1/popt[k]) + Σ_k p[k] log(popt[k]/p[k]) (7.25)
≤ log Z(λ) + λ Σ_k p[k] |k|^γ ≤ log Z(λ) + λD, (7.26)
From the property of the KL divergence, equality is reached when p = popt,
which proves the optimality of popt (7.23). The parameter λ is determined from
the requirement Σ_k p[k] |k|^γ = D.
Note that when the distortion is measured as energy, γ = 2, the bound (7.22)
limits the variance of p. In this case, the stego noise with maximal entropy is
the discrete Gaussian, in compliance with the classical result from information
theory that the highest-entropy noise among all variance-bounded distributions
is the Gaussian distribution.
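A sketch of how the maximum-entropy noise (7.23) could be constructed in practice for γ = 2: solve for λ by bisection so that the distortion constraint is met with equality. The truncation to |k| ≤ kmax and the bisection bracket are implementation choices, not part of the derivation:

```python
import math

def discrete_gg(lam, gamma, kmax=100):
    """Discrete generalized Gaussian (7.23): p[k] ~ exp(-lam*|k|^gamma), |k| <= kmax."""
    w = [math.exp(-lam * abs(k) ** gamma) for k in range(-kmax, kmax + 1)]
    Z = sum(w)
    return [x / Z for x in w]

def expected_distortion(p, gamma, kmax=100):
    """sum_k p[k] |k|^gamma."""
    return sum(pk * abs(k) ** gamma for pk, k in zip(p, range(-kmax, kmax + 1)))

def solve_lambda(D, gamma, lo=1e-4, hi=50.0):
    """Bisection on lam: the distortion decreases monotonically as lam grows."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if expected_distortion(discrete_gg(mid, gamma), gamma) > D:
            lo = mid  # too much distortion -> concentrate the noise more
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_lambda(D=1.0, gamma=2)  # gamma = 2: discrete Gaussian with variance 1
p_opt = discrete_gg(lam, 2)
print(lam, expected_distortion(p_opt, 2))  # the distortion constraint is met
```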
For steganography whose embedding changes are restricted to ±1, the stego
noise distribution satisfies p[k] = 0 for k ∉ {−1, 0, 1} and dγ(p) becomes the
change rate (4.17) for any γ > 0. In this case, the function Hdγ (x) can be deter-
mined analytically [233] (also see Exercise 7.5) to be
Hdγ (x) = −x log2 x − (1 − x) log2 (1 − x) + x, (7.28)
the ternary entropy function (Section 8.6.2).
7.3.1 ±1 embedding
It was recognized early on that the embedding operation of flipping the LSB cre-
ates many problems due to its asymmetry (even values are never decreased and
odd values never increased during embedding). Flipping LSBs is an unnatural
operation that introduces characteristic artifacts into the histogram. An obvious
remedy is to use an embedding operation that is symmetrical. A trivial modifi-
cation of LSB embedding is the so-called ±1 embedding, also sometimes called
LSB matching. This embedding algorithm embeds message bits as LSBs of cover
elements; however, when an LSB needs to be changed, instead of flipping the
LSB, the value is randomly increased or decreased, with the obvious exception
that the values 0 and 255 are only increased or decreased, respectively. This has
the effect of modifying the LSB but, at the same time, other bits may be modi-
fied as well. In fact, in the most extreme case, all bits may be modified, such as
when the value 127 = (01111111)2 is changed to 128 = (10000000)2.
Note that the extraction algorithm of ±1 embedding is the same as for LSB
embedding – the message is read by extracting the LSBs of cover elements. The
±1 embedding is much more difficult to attack than LSB embedding. While
there exist astonishingly accurate attacks on LSB embedding (see Sample Pairs
Analysis in Chapter 11), no attacks on ±1 embedding with comparable accuracy
currently exist (Sections 11.4 and 12.5 contain examples of a targeted and a blind
attack).
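The embedding and extraction operations described above can be sketched in a few lines of Python. This is a minimal illustration: the pseudo-random walk and the stego key handling are omitted, and pixels are processed in order for clarity:

```python
import random

def pm1_embed(cover, bits, seed=42):
    """Sketch of ±1 embedding (LSB matching): message bits become the LSBs of
    the stego pixels, but a mismatched LSB is fixed by randomly adding or
    subtracting 1 (0 is only increased, 255 only decreased)."""
    rng = random.Random(seed)
    stego = list(cover)
    for i, bit in enumerate(bits):
        if stego[i] % 2 != bit:
            if stego[i] == 0:
                stego[i] += 1
            elif stego[i] == 255:
                stego[i] -= 1
            else:
                stego[i] += rng.choice((-1, 1))
    return stego

def pm1_extract(stego, n):
    """Extraction is identical to LSB embedding: read the LSBs."""
    return [y % 2 for y in stego[:n]]

cover = [0, 255, 127, 64, 65, 200]
bits = [1, 0, 0, 1, 1, 1]
stego = pm1_embed(cover, bits)
print(pm1_extract(stego, len(bits)) == bits)  # -> True
```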
matrix embedding. The operation of LSB flipping was replaced with decrementing
the absolute value of the DCT coefficient by one. This preserves the natural
shape of the DCT histogram, which looks after embedding as if the cover image
was originally compressed using a lower quality factor. The second novel design
element, the matrix embedding, is a coding scheme that decreases the number
of embedding changes. For now, we do not consider matrix embedding in the
algorithm description and instead postpone it to Chapter 8.
The F5 algorithm embeds message bits along a pseudo-random path deter-
mined from a user passphrase. The message bits are again encoded as LSBs of
DCT coefficients along the path. If the coefficient’s LSB needs to be changed, in-
stead of flipping the LSB, the absolute value of the DCT coefficient is decreased
by one. To avoid introducing easily detectable artifacts, the F5 skips over the
DC terms and all coefficients equal to 0. In contrast to Jsteg, OutGuess, or
Model-Based Steganography, F5 does embed into coefficients equal to 1.
Because the embedding operation decreases the absolute value of the coefficient
by 1, it can happen that a coefficient originally equal to 1 or −1 is modified to
zero (a phenomenon called “shrinkage”). Because the recipient will be reading the
message bits from LSBs of non-zero AC DCT coefficients along the same path,
the bit embedded during shrinkage would get lost.3 Thus, if shrinkage occurs,
the sender has to re-embed the same message bit, which, by the way, will always
be a 0, at the next coefficient. However, by re-embedding these 0-bits, the sender
will end up embedding a biased bit stream containing more zeros than ones.
This would mean that odd coefficient values will be more likely to be changed
than even-valued coefficients. Yet again, we run into an embedding asymmetry
that introduces “staircase” artifacts into the histogram. There are at least two
solutions to this problem. One is to embed the message bit m as the XOR of the
coefficient LSB and some random sequence of bits also generated from the stego
key. This way, shrinkage will be equally likely to occur when embedding a 1 or
a 0. The implementation of F5 solves the problem differently by redefining the
LSB for negative numbers:
LSBF5(x) = 1 − (x mod 2) for x < 0, and LSBF5(x) = x mod 2 otherwise. (7.29)
Because in natural images the numbers of coefficients equal to 1 and −1 are ap-
proximately the same (h[1] ≈ h[−1]), this simple measure will also cause shrink-
age to occur when embedding both 0s and 1s with approximately equal proba-
bility. The pseudo-code for the embedding and extraction algorithm for F5 that
does not employ matrix embedding is shown in Algorithms 7.1 and 7.2.
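The logic of Algorithms 7.1 and 7.2 can be sketched as follows. This is a simplified illustration, not the reference implementation: there is no key-driven permutation and no DC-term handling (the input is assumed to already be the sequence of AC coefficients):

```python
def lsb_f5(x):
    """LSB redefined for negative values, eq. (7.29)."""
    return (1 - x) % 2 if x < 0 else x % 2

def f5_embed(ac_coeffs, bits):
    """Simplified F5 sketch without matrix embedding: embed bits into non-zero
    AC coefficients, decrementing |c| on mismatch and re-embedding the same
    bit when shrinkage (c -> 0) occurs."""
    coeffs = list(ac_coeffs)
    j = 0  # index of the next message bit
    for i, c in enumerate(coeffs):
        if j >= len(bits):
            break
        if c == 0:
            continue  # zeros never carry a bit
        if lsb_f5(c) == bits[j]:
            j += 1
        else:
            c += 1 if c < 0 else -1  # decrease the absolute value by one
            coeffs[i] = c
            if c != 0:
                j += 1  # bit embedded
            # if c == 0: shrinkage -- the same bit is re-embedded later
    if j < len(bits):
        raise ValueError("cover too small for this message")
    return coeffs

def f5_extract(ac_coeffs, n):
    """Read bits as LSB_F5 of the non-zero AC coefficients."""
    return [lsb_f5(c) for c in ac_coeffs if c != 0][:n]

cover = [3, -1, 2, 0, -4, 1, 5, -2, 1, 6]
bits = [0, 1, 0, 1]
stego = f5_embed(cover, bits)
print(f5_extract(stego, len(bits)))  # -> [0, 1, 0, 1] (survives shrinkage at -1)
```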
To calculate the embedding capacity of F5, realize that F5 does not embed into
DCT coefficients equal to 0 and the DC term. Also, no bit is embedded during
3 Realize that the recipient has no means of determining whether the DCT coefficient was
originally zero or became zero due to embedding.
shrinkage, which will lead to loss of (h[1] + h[−1])/2 bits. Thus, given a JPEG
file with in total nAC AC DCT coefficients, the embedding capacity is nAC −
h[0] − (h[−1] + h[1])/2, where h is the histogram of all AC DCT coefficients.
For example, for a cover source formed by JPEG images with 80% quality, the
average embedding capacity is about 0.75 bits per non-zero DCT coefficient,
which is quite large.
We stress that the F5 algorithm does not preserve the histogram but pre-
serves its crucial characteristics, such as its monotonicity and monotonicity of
increments [241].
dρ(x, y) = Σ_{i=1}^{n} ρ[i] (1 − δ(x[i] − y[i])), (7.30)
where δ(x) is the Kronecker delta (2.23). We point out that (7.30) implicitly
assumes that the embedding impact is additive because it is defined as a sum
of impacts at individual pixels. In general, however, the embedding modifica-
tions could be interacting among themselves, reflecting the fact that making two
changes to adjacent pixels might be more or less detectable than making the
same changes to two pixels far apart from each other. A detectability measure
that takes interaction among pixels into account would not be additive. If the
0 ≤ e[i] ≤ Q/2. To embed a message bit as the LSB of the rounded value, the
sender will need to round z[i] in the opposite direction, which would result in an
increased quantization error of Q − e[i]. Thus, the embedding distortion would
be the difference between the two rounding errors ρ[i] = Q − 2e[i]. Note that it
is therefore advantageous for the sender to select for embedding those pixels with
the smallest ρ[i], which are exactly those 16-bit values z[i] that are close to the
middle of quantization intervals, e[i] ≈ Q/2. In this case, the recipient cannot
read the message because the selection channel is not available to him (he sees
only the final quantized values). This problem can be solved using special coding
techniques called wet paper codes explained in Chapter 9.
In general, it is a highly non-trivial problem to design steganographic schemes
that embed a given payload while minimizing the total embedding impact. There
exist many suboptimal schemes for special choices of the embedding impact. The
whole of Chapter 8 is devoted to design of steganographic schemes when ρ[i] = 1
and |x[i] − y[i]| = 1, or in other words when the embedding impact is the number
of embedding changes.
We have already encountered special cases of minimal-embedding-impact
steganography in Chapter 5. The optimal parity assignment was designed to
produce the smallest expected distortion when embedding in a palette image
along a pseudo-random path. The suboptimal scheme for palette images called
the embedding-while-dithering method also belongs to this category.
Since the design of steganographic schemes that attempt to minimize the em-
bedding impact requires knowledge of certain coding techniques, we do not de-
scribe any specific instances of embedding schemes in this chapter. Instead, we
postpone this topic to the next two chapters.
In the next section, we establish a fundamental performance bound for
minimal-embedding-impact steganography. In particular, we derive a quantitative relationship between the maximal payload one can embed and a given bound
on the embedding impact. Knowledge of this theoretical bound will give us an
opportunity to evaluate the performance of suboptimal steganographic schemes
and compare them.
Our problem is now reduced to finding the probability distribution p(s) on the
space of all possible flipping patterns s that minimizes the expected value of the
embedding impact
Σ_s d(s) p(s), (7.34)
where p(i, 0) and p(i, 1) are the probabilities that the ith pixel is not (is) modified
during embedding,
p(i, 0) = 1/(1 + e^{−λρ[i]}), p(i, 1) = e^{−λρ[i]}/(1 + e^{−λρ[i]}). (7.43)
This means that the joint probability distribution p(s) can be factorized and
thus we need to know only the marginal probabilities p(i, 1) that the ith pixel is
where the function H applied to a scalar is the binary entropy function H(x) =
−x log2 x − (1 − x) log2 (1 − x).
Note that in the special case when ρ[i] = 1, for all i, the embedding impact
per pixel is the change rate d/n = β, and we obtain
m = Σ_{i=1}^{n} H(p(i, 1)) = nH(e^{−λ}/(1 + e^{−λ})), (7.45)
d = E[Σ_{i=1}^{n} p(i, 1)ρ[i]] = n e^{−λ}/(1 + e^{−λ}), (7.46)
which gives the following relationship between the change rate and the relative
message length α = m/n:
β = H^{−1}(α). (7.47)
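Equations (7.43)–(7.46) are easy to evaluate numerically. The following is a minimal Python sketch (the function names are ours, not from the text); for the constant profile ρ[i] = 1 it reproduces the relationship (7.47), since the relative payload equals the binary entropy of the change rate.

```python
import math

def binary_entropy(x):
    """Binary entropy H(x) in bits, with H(0) = H(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def payload_and_impact(rho, lam):
    """Evaluate (7.43)-(7.46): for impact profile rho and parameter lam,
    return the embeddable payload m (bits) and expected impact d."""
    p1 = [math.exp(-lam * r) / (1 + math.exp(-lam * r)) for r in rho]
    m = sum(binary_entropy(p) for p in p1)    # (7.45)
    d = sum(p * r for p, r in zip(p1, rho))   # (7.46)
    return m, d

# Constant profile rho[i] = 1: m/n = H(beta) with beta = d/n, i.e. (7.47).
n, lam = 1000, 2.0
m, d = payload_and_impact([1.0] * n, lam)
assert abs(m / n - binary_entropy(d / n)) < 1e-9
```

Varying λ traces out the whole payload–impact trade-off curve for a given profile, which is how curves such as those in Figure 7.5 can be generated.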
Figure 7.5 Minimal relative embedding impact versus relative message length for four
embedding-impact profiles on [0, 1]. Constant: ρ(x) = 1, square-root profile: ρ(x) = √x,
linear: ρ(x) = x, square: ρ(x) = x².
For a continuous embedding-impact profile ρ(x) on [0, 1], the bound takes the parametric form
d(λ) = (1/n) G_ρ(λ), (7.52)
α(λ) = (1/log 2) [λF_ρ(λ) + log(1 + e^{−λρ(1)})], (7.53)
where
G_ρ(λ) = ∫₀¹ ρ(x) e^{−λρ(x)} / (1 + e^{−λρ(x)}) dx, (7.54)
F_ρ(λ) = ∫₀¹ (ρ(x) + xρ′(x)) e^{−λρ(x)} / (1 + e^{−λρ(x)}) dx. (7.55)
Figure 7.5 shows the embedding impact per pixel, d/n, as a function of relative
payload α for four profiles ρ(x). The square profile ρ(x) = x2 has the smallest
impact for a fixed payload among the four and the biggest difference occurs for
small payloads. This is understandable because the square profile is the smallest
among the four for small x.
Before closing this section, we provide one interesting result that essentially
states that among a certain class of embedding operations for JPEG images,
the embedding operation of F5 is optimal because, in some well-defined sense, it
minimizes the embedding impact.
We take the total distortion due to quantization and embedding as the measure
of embedding impact
d_{γ,emb} = Σ_{i=1}^{n} |d[i] − y[i]|^γ. (7.57)
Proof. Viewing d[i] as instances of a random variable, let f(x) be its probability
density function. In Chapter 2, we learned that the histogram of quantized
DCT coefficients in a JPEG file follows a distribution with a sharp peak at
zero that is monotonically decreasing for positive coefficient values and increas-
ing for negative coefficients. Thus, we assume that f is increasing on (−∞, 0)
and decreasing on (0, ∞) and therefore Lebesgue integrable. Let us inspect the
embedding distortion for one value of the quantized coefficient d ≠ 0, where,
say, d < 0 (this means that f is increasing at d) under the assumption that a
random fraction of β DCT coefficients are modified. The embedding operation
will thus randomly change a fraction ε of the βh[d] coefficients away from zero (increase their absolute value) and the fraction 1 − ε towards zero (decrease their
absolute value). Let us now express the increase of distortion due to embedding
with respect to the distortion solely due to rounding, d_{γ,emb} − d_{γ,round} (follow
Figure 7.6).
Figure 7.6 Illustrative example for derivations in the text (a sketch of the density f(x) over the quantization intervals around d − 1, d, and d + 1).
y ∈ (−1/2, 1/2) and g(x) ≥ 0 on (d, d + 1/2). Thus, we can write for d(ε)
d(ε) = εβ ∫₀^{1/2} f(d − y) g(d − y) dy + εβ ∫₀^{1/2} f(d + y) g(d + y) dy + C (7.60)
= εβ ∫₀^{1/2} (f(d + y) − f(d − y)) g(d + y) dy + C. (7.61)
Because f is increasing and g(d + y) ≥ 0 on (0, 1/2), d(ε) is minimized when ε = 0,
which corresponds to the F5 embedding operation.
Summary
• In this chapter, we study methods for construction of practical steganographic
schemes for digital images.
• There exist four heuristic principles that can be applied to decrease the KL
divergence between cover and stego images and thus improve the steganographic
security:
– Model-preserving steganography
– Making embedding mimic natural processing
– Steganalysis-aware steganography
– Minimum-embedding-impact steganography
• In model-preserving steganography, a simplified model of cover images is
formulated and the embedding is forced to preserve that model.
• The most frequently adopted model is to consider the cover as a sequence of
iid random variables. The model itself is either the sample distribution of cover
elements or a parametric fit through the sample distribution. Steganography
preserving the cover histogram is undetectable within the iid model.
• Model-preserving steganography can be attacked by identifying a quantity
that is not preserved under embedding.
• Stochastic modulation is an example of embedding designed to mimic the
image-acquisition process. The message is embedded by superimposing quantized
iid noise with a given distribution.
• In steganalysis-aware steganography, the designer of the stego system focuses
on making the impact of embedding undetectable using existing steganalysis
schemes.
• Minimum-embedding-impact steganography starts by assigning to each cover
element a scalar value expressing the impact of making an embedding change
at that element. The designer then attempts to embed messages by minimizing
the total embedding impact.
• Among all embedding operations that change a quantized DCT coefficient
towards zero and away from zero with fixed probability, the operation of
the F5 algorithm minimizes the combined distortion due to quantization and
embedding.
Exercises
for any pmf p. Hint: log p[k] = log(1 + (p[k] − 1)) ≤ p[k] − 1 by the log-inequality log(1 + x) ≤ x, ∀x > −1.
7.5 [Stego noise with amplitude 1] Show that for stego noise sequences with amplitude 1 (p[k] = 0 for k ∉ {−1, 0, 1}), the entropy of the
optimal distribution H(popt ) = Hdγ (D) = −D log2 D − (1 − D) log2 (1 − D) +
D for any γ > 0. Hint: First show that for p[−1] + p[1] = 2p, a constant,
−p[−1] log2 p[−1] − p[1] log2 p[1] is maximal when p[−1] = p[1] = p. Thus, we
need only to search for optimal distributions in the family of distributions
p = (p, 1 − 2p, p) parametrized by a scalar parameter p. For such distributions,
the distortion bound is 2p ≤ D and the result is obtained by setting p = D/2.
8 Matrix embedding
In the previous chapter, we learned that one of the general guiding principles for
design of steganographic schemes is the principle of minimizing the embedding
impact. The plausible assumption here is that it should be more difficult for Eve
to detect Alice and Bob’s clandestine activity if they leave behind smaller embed-
ding distortion or “impact.” This chapter introduces a very general methodology
called matrix embedding using which the prisoners can minimize the total num-
ber of changes they need to carry out to embed their message and thus increase
the embedding efficiency. Even though special cases of matrix embedding can be
explained in an elementary fashion on an intuitive level, it is extremely empow-
ering to formulate it within the framework of coding theory. This will require the
reader to become familiar with some basic elements of the theory of linear codes.
The effort is worth the results because the reader will be able to design more
secure stegosystems, acquire a deeper understanding of the subject, and realize
connections to an already well-developed research field. Moreover, according to
the studies that appeared in [143, 95], matrix embedding is one of the most
important design elements of practical stegosystems.
As discussed in Chapter 5, in LSB embedding or ±1 embedding one pixel
communicates exactly one message bit. This was the case of OutGuess as well
as Jsteg. Assuming the message bits are random, each pixel is thus modified
with probability 1/2 because this is the probability that the LSB will not match
the message bit. Thus, on average two bits are embedded using one change or,
equivalently, the embedding efficiency is 2. If the message length is smaller than
the embedding capacity of the cover image, it is possible to substantially increase
the embedding efficiency and thus embed the same payload with fewer embedding
changes.
To explain why this is possible at all, consider the following simple example.
Let us assume that we wish to embed a message of relative length α = 2/3 using
LSB embedding. Thus, given a cover image with n pixels, the message contains
nα = 2n/3 bits. This means that we need to embed 2 bits, m[1], m[2], in a group
of three pixels with grayscale values g[1], g[2], g[3]. Classical LSB embedding
would embed bit m[1] at g[1] and bit m[2] at g[2] while skipping g[3] and thus
embed with embedding efficiency 2. We can, however, do better by embedding
the bits as
m[1] = LSB(g[1]) ⊕ LSB(g[2]), (8.1)
m[2] = LSB(g[2]) ⊕ LSB(g[3]), (8.2)
where ⊕ is the exclusive or. If the cover values already satisfy both equations,
no embedding changes are necessary. If the first equation is satisfied but not the
second one, the sender can flip the LSB of g[3]. If the second equation is satisfied
but not the first one, the sender should flip g[1]. If neither is satisfied, the sender
will flip g[2]. Because the probability of each case is 1/4, the expected number of
changes is 0 × 1/4 + 1 × 1/4 + 1 × 1/4 + 1 × 1/4 = 3/4 and the embedding efficiency (in
bits per change) is e = 2/(3/4) = 8/3 > 2.
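This embedding rule can be sketched directly in code (a toy illustration; the function names are ours). The exhaustive loop below confirms the average of 3/4 changes per pixel triple.

```python
def embed2in3(g, m1, m2):
    """Embed bits (m1, m2) into a pixel triple g so that
    m1 = LSB(g[0]) ^ LSB(g[1]) and m2 = LSB(g[1]) ^ LSB(g[2]),
    flipping at most one LSB."""
    g = list(g)
    x = [v & 1 for v in g]
    ok1 = (x[0] ^ x[1]) == m1
    ok2 = (x[1] ^ x[2]) == m2
    if ok1 and not ok2:
        g[2] ^= 1        # first relation holds: fix the second via the third pixel
    elif ok2 and not ok1:
        g[0] ^= 1        # second relation holds: fix the first via the first pixel
    elif not (ok1 or ok2):
        g[1] ^= 1        # neither holds: the middle pixel enters both relations
    return g

def extract2in3(g):
    x = [v & 1 for v in g]
    return x[0] ^ x[1], x[1] ^ x[2]

# Exhaustive check over all 8 LSB patterns and 4 messages:
# 3/4 changes on average for 2 bits, i.e., embedding efficiency 8/3.
changes = 0
for lsb in range(8):
    g = [(lsb >> i) & 1 for i in range(3)]
    for m1 in (0, 1):
        for m2 in (0, 1):
            y = embed2in3(g, m1, m2)
            assert extract2in3(y) == (m1, m2)
            changes += sum(a != b for a, b in zip(g, y))
assert changes / 32 == 3 / 4
```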
Note that in this scheme, we can no longer say which pixels convey the message
bits. Both bits are communicated by the group as a whole. One can say that in
matrix embedding the message is communicated by the embedding changes as
well as their position. In fact, the message-extraction rule that the receiver uses
is multiplication1 of the vector of LSBs, x = LSB(g), by a matrix
m = ( 1 1 0
      0 1 1 ) x, (8.3)
which gave this method its name – matrix embedding [19, 52, 98, 233].
Before presenting a general approach to matrix embedding based on coding
theory, in Section 8.1 we generalize the simple trick of this introduction. Up
until now, the material can be comfortably grasped by all readers without any
background in coding theory. To prepare the reader for the material presented
in the rest of this chapter, Section 8.2 contains a brief overview of the theory
of binary linear codes. Readers not familiar with this subject are additionally
urged to read Appendix C, which contains a more detailed tutorial. Section 8.3
contains the main result of this chapter, the matrix embedding theorem. The
theorem gives us a general methodology to improve the embedding efficiency
of steganographic methods using linear codes. We also revisit Section 8.1 and
interpret the method from a new viewpoint. The theoretical limits of matrix
embedding methods are the subject of Section 8.4. A matrix embedding approach
suitable for hiding large payloads appears in Section 8.5. It also illustrates the
usefulness of random codes. Section 8.6 deals with embedding methods realized
using codes defined over larger alphabets (q-ary codes). Such codes can further
increase the embedding efficiency at the cost of larger distortion. Finally, in
Section 8.7 we explain an alternative approach to minimizing the number of
embedding changes that is based on sum and difference covering sets of finite
cyclic groups.
1 All arithmetic operations are performed in the usual binary arithmetic (8.14) (also, see
Appendix C for more details).
In this section, we explain the matrix embedding method used in the F5 al-
gorithm (Chapter 7), which was the first practical steganographic scheme to
incorporate this embedding mechanism. The reader will recognize later that the
method is based on binary Hamming codes.
Assume the sender wants to communicate a message with relative length
αp = p/(2^p − 1), p ≥ 1, which means that p message bits, m[1], . . . , m[p], need
to be embedded in 2^p − 1 pixels. We denote by x the vector of LSBs of 2^p − 1
pixels from the cover image (e.g., collected along a pseudo-random path if the
embedding uses a stego key for random spread of the message bits). The sender
and recipient share a p × (2^p − 1) binary matrix H that contains all non-zero
binary vectors of length p as its columns. An example of such a matrix for p = 3
is
H = ( 0 0 0 1 1 1 1
      0 1 1 0 0 1 1
      1 0 1 0 1 0 1 ). (8.4)
As in the previous simple example, the sender modifies the pixel values so that
the column vector of their LSBs, y, satisfies
m = Hy. (8.5)
We call the vector Hy the “syndrome” of y. If by chance the syndrome of the
cover pixels already communicates the correct message, Hx = m, which happens
with probability 1/2^p, the sender does not need to modify any of the 2^p − 1 cover
pixels, sets y = x, and proceeds to the next block of 2^p − 1 pixels and embeds
the next segment of p message bits.
When Hx = m, the sender looks up the difference Hx − m as a column in H
(there must be such a column because H contains all non-zero binary p-tuples).
Let us say that it is the jth column and we write it as H[., j]. By flipping the
LSB of the jth pixel and keeping the remaining pixels unchanged,
y[j] = 1 − x[j], (8.6)
y[k] = x[k], k = j, (8.7)
the syndrome of y now matches the message bits,
Hy = m. (8.8)
This is because Hy = Hx + H(y − x) = Hx − m + H[., j] + m = m, because
Hx − m = H[., j] and in binary arithmetic z + z = 0 for any z.
To complete the description of the steganographic algorithm, the recipient
follows the same path through the image as the sender and reads p message bits
from the LSBs of each block of 2^p − 1 pixels as the syndrome m = Hy.
Let us now compute the embedding efficiency of this embedding method. With
probability 1/2^p, the sender does not modify any of the 2^p − 1 pixels. Furthermore, she makes exactly one change with probability 1 − 1/2^p. The average number of embedding changes is thus 0 × 1/2^p + 1 × (1 − 1/2^p) = 1 − 1/2^p and the
embedding efficiency is
e_p = p/(1 − 2^{−p}). (8.9)
The embedding efficiency and the relative payload α_p = p/(2^p − 1) for different
values of p are shown in Table 8.1.
Table 8.1. Relative payload, αp , and embedding efficiency, ep , in bits per change for
matrix embedding using binary Hamming codes [2^p − 1, 2^p − 1 − p].
p αp ep
1 1.000 2.000
2 0.667 2.667
3 0.429 3.429
4 0.267 4.267
5 0.161 5.161
6 0.093 6.093
7 0.055 7.055
8 0.031 8.031
9 0.018 9.018
Note that with increasing p, the embedding efficiency, ep , increases while the
relative payload, αp , decreases. The improvement over embedding each message
bit at exactly one pixel is quite substantial. For example, a message of relative
length 0.093 can be embedded with embedding efficiency 6.093, which means
that a little over six bits are embedded with a single embedding change.
The motivational example from the beginning of this section is obtained for
p = 2. Also, notice that the classical LSB embedding corresponds to p = 1, in
which case the matrix H is 1 × 1 and the embedding efficiency is 2.
If the sender plans to communicate a message whose relative length α is not
equal to any αp , one needs to choose the largest p for which αp ≥ α to make
sure that the complete message can be embedded. This means that for α > 2/3,
this matrix embedding method does not bring any improvement over classical
embedding methods that embed one bit at each pixel.
The parameter p also needs to be communicated to the receiver, which can
be arranged in many different ways. One possibility is to reserve a few pixels
from the image using the stego key and embed the binary representation of p
in LSBs of those pixels and then use the rest of the image to communicate the
main payload.
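The choice of p for a given relative message length α can be sketched as follows (the function names are ours, not from the text):

```python
def hamming_params(p):
    """(alpha_p, e_p) for binary Hamming matrix embedding, as in Table 8.1."""
    n = 2 ** p - 1
    return p / n, p / (1 - 2 ** (-p))

def choose_p(alpha, p_max=20):
    """Largest p with alpha_p >= alpha, so the whole message still fits."""
    best = 1
    for p in range(1, p_max + 1):
        if hamming_params(p)[0] >= alpha:
            best = p
    return best

assert choose_p(0.4) == 3    # alpha_3 = 3/7 >= 0.4 > alpha_4 = 4/15
assert choose_p(0.8) == 1    # alpha > 2/3: no gain over per-pixel embedding
```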
As an illustration for p = 3, take cover pixels g = (11, 10, 15, 17, 13, 21, 19) with
LSBs x = π(g) = (1, 0, 1, 1, 1, 1, 1) and message m = (0, 0, 1)′. The syndrome
difference is
Hx − m = (0, 1, 0)′ − (0, 0, 1)′ = (0, 1, 1)′ = H[., 3],
the third column of H, so the sender changes 15 to 14. The stego LSBs
y = (1, 0, 0, 1, 1, 1, 1) now satisfy Hy = (0, 0, 1)′ = m.
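This computation can be reproduced with a short script (a sketch using our own helper names; the column ordering follows (8.4), where column j is the binary expansion of j, so the syndrome difference directly indexes the pixel to flip):

```python
p = 3
# Parity-check matrix (8.4): column j (j = 1, ..., 7) holds the binary
# expansion of j with the most significant bit on top.
H = [[(j >> (p - 1 - i)) & 1 for j in range(1, 2 ** p)] for i in range(p)]

def syndrome(bits):
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]

def embed_hamming(g, m):
    """Embed p message bits m in the LSBs of 2^p - 1 pixels g by flipping
    at most one LSB (syndrome coding with the matrix (8.4))."""
    g = list(g)
    x = [v & 1 for v in g]
    diff = [(s - t) % 2 for s, t in zip(syndrome(x), m)]   # Hx - m
    j = int(''.join(str(b) for b in diff), 2)  # index of matching column of H
    if j:                                      # j = 0 means Hx already equals m
        g[j - 1] ^= 1
    return g

g = [11, 10, 15, 17, 13, 21, 19]
y = embed_hamming(g, [0, 0, 1])
assert y == [11, 10, 14, 17, 13, 21, 19]          # 15 changed to 14
assert syndrome([v & 1 for v in y]) == [0, 0, 1]  # Hy = m
```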
This section briefly introduces selected basic concepts and notation from the
theory of binary linear codes that will be needed to understand the material in
the rest of this chapter. The reader is encouraged to read Appendix C, which
contains a brief tutorial on linear codes over finite fields. A good introduction to
coding theory is [248].
We denote by F_2^n the vector space of all binary vectors, x = (x[1], . . . , x[n]), of
length n, where addition and multiplication by a scalar are performed elementwise,
x + y = (x[1] + y[1], . . . , x[n] + y[n]), (8.10)
bx = (bx[1], . . . , bx[n]), (8.11)
for all x, y ∈ F_2^n, b ∈ {0, 1} = F_2. All operations in F_2 are in the usual binary
arithmetic
0 + 0 = 1 + 1 = 0, (8.12)
0 + 1 = 1 + 0 = 1, (8.13)
0 · 1 = 1 · 0 = 0, (8.14)
1 · 1 = 1, (8.15)
0 · 0 = 0. (8.16)
A binary linear code C of length n is a vector subspace of F_2^n. Its elements are
called codewords. As a subspace, the code C is closed under linear combination of
its elements, which means that ∀x, y ∈ C and ∀b ∈ {0, 1}, x + y ∈ C and bx ∈ C.
The code has dimension k (and codimension n − k) if it has a basis consisting of
k ≤ n linearly independent vectors. (We say that the code C is an [n, k] code.) A
code is completely described by its basis because all codewords can be obtained
as linear combinations of the basis vectors. Writing the basis vectors as rows of
a k × n matrix G, we obtain its generator matrix.
A linear code can alternatively be described using its (n − k) × n parity-check
matrix, H, whose n − k rows are linearly independent vectors that are orthogonal
to C, or
HG′ = 0, (8.17)
where the prime denotes transposition. The code can thus also be defined as
C = {c ∈ F_2^n | Hc = 0}. (8.18)
The generator and parity-check matrices are not unique for the same reason
that a vector subspace does not have a unique basis.
The space Fn2 can be endowed with a measure of distance called the Hamming
distance, defined as the number of places where x and y differ,
d_H(x, y) = |{i ∈ {1, . . . , n} | x[i] ≠ y[i]}|. (8.19)
In the context of steganography, dH (x, y) is the number of embedding changes.
The Hamming weight of x ∈ F_2^n is w(x) = d_H(x, 0), which is the number of non-
zero elements in x.
The ball of radius r centered at x is the set
B(x, r) = {y ∈ F_2^n | d_H(x, y) ≤ r}, (8.20)
and its volume is
V_2(r, n) = 1 + \binom{n}{1} + \binom{n}{2} + · · · + \binom{n}{r} = Σ_{i=0}^{r} \binom{n}{i}. (8.21)
The covering radius R of the code C is
R = max_{x ∈ F_2^n} d_H(x, C), (8.22)
where d_H(x, C) = min_{c∈C} d_H(x, c) is the distance between x and the code. In
other words, the covering radius is determined by the most distant point x from
the code.
Recalling the matrix embedding method from Section 8.1, it should not be
surprising that we define for each x ∈ F_2^n its syndrome as s = Hx ∈ F_2^{n−k}. For
any syndrome s ∈ F_2^{n−k}, its coset C(s) is
C(s) = {x ∈ F_2^n | Hx = s}. (8.23)
Note that C(0) = C and C(s1) ∩ C(s2) = ∅ for s1 ≠ s2. The whole space F_2^n can
thus be decomposed into 2^{n−k} cosets, each coset containing exactly 2^k elements,
∪_{s ∈ F_2^{n−k}} C(s) = F_2^n. (8.24)
Because C(s) is the set of all solutions to the equation Hx = s, from linear algebra
the coset can be written as C(s) = {x ∈ F_2^n | x = x̃ + c, c ∈ C} = x̃ + C, where x̃
is an arbitrary member of C(s).
A coset leader e(s) is a member of the coset C(s) with the smallest Hamming
weight. It is easy to see that the Hamming weight of any coset leader is at most
R, the covering radius of C. Take an arbitrary x ∈ F_2^n and calculate its syndrome
s = Hx. Then,
w(e(s)) = min_{c ∈ C} w(x − c) = d_H(x, C) ≤ R, (8.25)
because when c goes through all codewords, x − c goes through all members
of the coset C(s). Note that the inequality is tight as there exists z such that
R = d_H(z, C).
This result also implies that any syndrome, s ∈ F_2^{n−k}, can be obtained by
adding at most R columns of H. This is because C(s) = {x|Hx = s} and the
weight of a coset leader is the smallest number of columns that need to be
summed to obtain the coset syndrome s. Thus, one method to determine the
covering radius of a linear code is to first form its parity-check matrix and then
find the smallest number of columns of H that can generate any syndrome.
The matrix embedding theorem provides a recipe for how to turn any linear code
into a matrix embedding method. The parameters of the code will determine the
properties of the stegosystem, namely its payload and embedding efficiency. By
making the connection between steganography and coding, we will be able to
view the trick explained in Section 8.1 from a different angle and construct new
useful embedding methods.
We will assume that Alice and Bob use a bit-assignment function π : X →
{0, 1} that assigns a bit to each possible value of the cover element from X . For
example, X = {0, . . . , 255} for 8-bit grayscale images and π could be the LSB
of pixels (DCT coefficients), π(x) = x mod 2. Thus, a group of n pixels can be
represented using a binary vector x ∈ F_2^n. The sole purpose of the embedding
operation is to modify the bit assigned to the cover element, which could be
achieved by flipping the LSB or adding ±1 to the pixel value, etc. At this point,
it is immaterial how the pixels are changed because we measure the embedding
impact only as the number of embedding changes.
Recalling the definition from Section 4.3, the steganographic scheme is a pair
of mappings, Emb : F_2^n × M → F_2^n and Ext : F_2^n → M, with Ext(Emb(x, m)) = m for every cover x and message m.
Here, y = Emb(x, m) is the vector of bits extracted from the same block of n
pixels in the stego image and M is the set of all messages that can be communi-
cated. (The embedding capacity is log2 |M| bits.) Note that here we ignore the
role of the stego key as it is, again, not important for our reasoning.
Let us suppose that it is possible to embed every message in M using at most
R changes. The embedding efficiency is defined as
e = log_2 |M| / R_a, (8.30)
where R_a is the average number of embedding changes over all messages and
covers,
R_a = E[d_H(x, Emb(x, m))]; (8.31)
the lower embedding efficiency, e, is defined analogously with the worst-case number of changes R in place of R_a.
Proof. Let H be a parity-check matrix of the code, x ∈ F_2^n the vector of LSBs
of n pixels, and m ∈ F_2^{n−k} the vector of message bits. Define the embedding
mapping as
Emb(x, m) = x + e(m − Hx),
where we remind the reader that e(m − Hx) is a coset leader of the coset corresponding to the syndrome m − Hx. The corresponding extraction algorithm
is
Ext(y) = Hy.
In other words, the recipient reads the message by extracting the LSBs from the
stego image, y, and multiplying them by the parity-check matrix,
Hy = Hx + He = Hx + m − Hx = m. (8.35)
R_a = (1/2^{n−k}) Σ_{s ∈ F_2^{n−k}} w(e(s)) = (1/2^n) Σ_{s ∈ F_2^{n−k}} Σ_{x ∈ C(s)} d_H(x, C) = (1/2^n) Σ_{x ∈ F_2^n} d_H(x, C), (8.37)
which is the average distance to code, Ra . In other words, we have just proved
that e = (n − k)/Ra . The second equality follows from (8.25) expressing the fact
that all coset members x ∈ C(s) have the same distance to C: dH (x, C) = w(e(s)),
equal to the weight of a coset leader of C(s). The fact that cosets partition the
whole space implies the third and final equality.
Because the volume of each ball is exactly 1 + n = 2^p and all words with the
exception of the center codeword have distance 1 from the code, the average
distance to code is R_a = n/(n + 1) = 1 − 2^{−p}. Thus, the embedding efficiency
of matrix embedding based on the binary Hamming code H_p is e_p = p/(1 − 2^{−p}), in
agreement with the result obtained in Section 8.1.
The matrix embedding theorem tells us that linear codes can be used to construct
steganographic schemes that impose fewer embedding changes than the simple
paradigm of embedding one bit at every pixel. This is quite significant because
it gives us the ability to communicate longer messages for a fixed distortion
budget. It would be valuable to know the limits of this approach and determine
the performance (embedding efficiency) of the best possible matrix embedding
method and then attempt to reach this limit.
\binom{n}{0} + \binom{n}{1} + · · · + \binom{n}{R_n − 1} + ξ \binom{n}{R_n} = 2^{αn}, (8.39)
where 0 < ξ ≤ 1 is a real number. Note here that every [n, n(1 − α)] code has
2^{αn} cosets.
Besides the lower bound on the covering radius, R ≥ Rn , we obtain a lower
bound for the average distance to code Ra ,
We remark that codes for which the number of coset leaders with Hamming
weight i is exactly \binom{n}{i} and ξ = 1 have the largest possible embedding efficiency
within the class of codes of length n. Perfect codes have this property (see Ap-
pendix C). There are only four perfect codes: the trivial repetition code (see
Exercises 8.1 and 8.6), Hamming codes, the binary Golay code, and the ternary
Golay code [248].
The bounds derived above can be summarized in a proposition.
|M| ≤ Σ_{i=0}^{βn} \binom{n}{i} = V_2(βn, n) (8.42)
Figure 8.2 The binary entropy function H(x) = −x log2 x − (1 − x) log2(1 − x).
Proof.
1 = (β + 1 − β)^n = Σ_{i=0}^{n} \binom{n}{i} β^i (1 − β)^{n−i} (8.55)
= (1 − β)^n Σ_{i=0}^{n} \binom{n}{i} (β/(1 − β))^i ≥ (1 − β)^n Σ_{i=0}^{βn} \binom{n}{i} (β/(1 − β))^i (8.56)
≥ (1 − β)^n (β/(1 − β))^{βn} Σ_{i=0}^{βn} \binom{n}{i} = (1 − β)^{n(1−β)} β^{βn} Σ_{i=0}^{βn} \binom{n}{i} (8.57)
= 2^{n[(1−β) log2(1−β) + β log2 β]} V_2(βn, n), (8.58)
which is the tail inequality. The second inequality holds because β/(1 − β) ≤ 1
for β ≤ 1/2.
Therefore, the maximal number of bits one can embed using at most R changes
is
m_max ≤ nH(R/n). (8.61)
We can rewrite this to obtain a bound on the lower embedding efficiency, e, for
any relative payload α that can be embedded:
α ≤ m_max/n ≤ H(R/n), (8.62)
H^{−1}(α) ≤ R/n, (8.63)
n/R ≤ 1/H^{−1}(α), (8.64)
e = αn/R ≤ α/H^{−1}(α). (8.65)
The bound (8.65) on the lower embedding efficiency is also an asymptotic
bound on the embedding efficiency e,
e ≤ α/H^{−1}(α). (8.66)
This is because the relative average distance to code, Ra /n, which determines
embedding efficiency, and the relative covering radius, R/n, for which we have a
bound, are asymptotically identical as n → ∞ [81].
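The bound (8.66) is easy to evaluate by inverting the binary entropy function numerically, e.g. by bisection; a sketch (the helper names are ours):

```python
import math

def binary_entropy(x):
    """Binary entropy H(x) in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def binary_entropy_inv(alpha):
    """Inverse of H on [0, 1/2], computed by bisection (H is increasing there)."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if binary_entropy(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def efficiency_bound(alpha):
    """Upper bound (8.66) on embedding efficiency: alpha / H^{-1}(alpha)."""
    return alpha / binary_entropy_inv(alpha)

# Binary Hamming code with p = 3: payload alpha_3 = 3/7 at efficiency 3.429,
# while the bound at the same payload is roughly 4.8 -- room for improvement.
alpha3, e3 = 3 / 7, 3 / (1 - 2 ** -3)
assert e3 < efficiency_bound(alpha3)
```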
Knowing the performance bounds, a practical problem for the steganographer
is finding codes that could reach the theoretically optimal embedding efficiency
with low computational complexity. Figure 8.3 shows the bound on embedding
efficiency (8.66) and the performance of selected codes as a function of 1/α, where
α is the relative payload. We can see that, although the Hamming codes have
substantially higher embedding efficiency than the trivial paradigm of embedding
one message bit at one pixel with e = 2, they are still far from the theoretical
upper bound, indicating a space for improvement. A slightly better embedding
efficiency can be obtained using BCH codes [208] and certain classes of non-linear
codes [20].
An important result, which is beyond the scope of this book, states that the
bound (8.66) is tight and asymptotically achievable using linear codes. In partic-
ular, the relative codimension (n − k)/n of almost all random [n, k] codes asymp-
totically achieves H(R/n) for a fixed change rate R/n < 1/2 and n → ∞ (see, e.g.,
Theorem 12.3.5 in [46]). Thus, there exist embedding schemes based on linear
codes whose embedding efficiency is asymptotically optimal.
Therefore, at least theoretically it should be possible to construct good matrix
embedding schemes from random codes. However, while for structured codes,
such as the Hamming codes, finding a coset leader was particularly simple, for
Figure 8.3 Embedding efficiency of binary Hamming codes (crosses) and the bound on
embedding efficiency (8.66). The stars mark the embedding efficiency of sparse linear
codes of length n = 1, 000 and n = 10, 000 as reported in [68].
1. Find any coset member ỹ ∈ C(m − Hx) as the solution of Hỹ = m − Hx.
Because H is in its systematic form, for example ỹ = (m − Hx, 0), where
0 ∈ {0, 1}k .
2. Find the coset leader e(m − Hx). To do so, we need to find the codeword c̃
closest to ỹ,
c̃ = arg min_{c ∈ C} d_H(ỹ, c),
because the vector ỹ − c goes through all members of the coset. Thus,
e(m − Hx) = ỹ − c̃, which completes the description of the embedding map-
ping Emb via the matrix embedding theorem. If the dimension of the code
is small, the closest codeword can be quickly found by brute force or using
precalculated look-up tables.
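The two steps can be sketched for a small random code in systematic form (a brute-force illustration with our own names, not an optimized implementation; the search enumerates all 2^k codewords):

```python
import itertools
import random

n, k = 12, 4          # a small random [n, k] code; n - k = 8 message bits
random.seed(1)
# Parity-check matrix in systematic form H = [I_{n-k} | A].
A = [[random.randint(0, 1) for _ in range(k)] for _ in range(n - k)]
H = [[1 if c == r else 0 for c in range(n - k)] + A[r] for r in range(n - k)]
# Generator matrix G = [A' | I_k], so that H G' = 0 over F_2.
G = [[A[r][c] for r in range(n - k)] + [1 if j == c else 0 for j in range(k)]
     for c in range(k)]

def syndrome(x):
    return tuple(sum(h * b for h, b in zip(row, x)) % 2 for row in H)

def embed(x, m):
    """Syndrome coding: return y with Hy = m and d_H(x, y) minimal,
    found by brute-force search for the codeword closest to y_tilde."""
    # Step 1: any coset member; y_tilde = (m - Hx, 0...0) works since H = [I | A].
    s = [(mi - si) % 2 for mi, si in zip(m, syndrome(x))]
    y_tilde = s + [0] * k
    best = None
    for coeffs in itertools.product((0, 1), repeat=k):
        c = [sum(ci * G[i][j] for i, ci in enumerate(coeffs)) % 2
             for j in range(n)]
        e = [(a - b) % 2 for a, b in zip(y_tilde, c)]
        if best is None or sum(e) < sum(best):
            best = e      # step 2: coset-leader candidate with fewest changes
    return [(a + b) % 2 for a, b in zip(x, best)]

x = [random.randint(0, 1) for _ in range(n)]
m = tuple(random.randint(0, 1) for _ in range(n - k))
y = embed(x, m)
assert syndrome(y) == m
assert sum(a != b for a, b in zip(x, y)) <= n - k   # leader weight <= w(y_tilde)
```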
Figure 8.4 Embedding efficiency versus relative payload α for random codes of dimension
k = 10 and 14. Also shown is the theoretical upper bound for codes of arbitrary
length (8.66) and bounds for codes of a fixed length (8.41).
on syndrome coding, uses so-called Sum and Difference Covering Sets (SDCSs)
of finite cyclic groups [158, 160, 174] (also see Section 8.7).
All matrix embedding methods introduced in the previous sections used binary
codes. The reason for this was that each cover element was represented with a
bit through some bit-assignment function, π, such as its LSB. Fundamentally,
there is no reason why we could not use q-ary representation for q > 2. Consider,
for example, the ±1 embedding where it is allowed to change a pixel value by
±1. Each pixel can thus accept one of three possible values. By using a function
that assigns a ternary symbol at each pixel, for example, π(x) = x mod 3, one
ternary symbol can be embedded at each pixel rather than just one bit. Because
one ternary symbol conveys log2 3 bits of information, the number of embedding
changes would be decreased. Let us calculate the embedding efficiency of this
ternary ±1 embedding to see the improvement.
We already know that the embedding efficiency of classical (binary) ±1 embedding is 2 because the probability that a pixel needs to be modified is 1/2. Alternatively, two bits are embedded using one embedding change. When embedding
a random stream of ternary symbols, the probability that a pixel will not have
Thus, this is a [13, 10] code with covering radius R = 1. The radius is 1 because
we need only a linear combination of one column to obtain any syndrome. For
example, the syndrome (2, 2, 2)′, which does not appear directly as a column in
H, can be obtained as a linear combination (multiple) of just one column, the
seventh column.
In general, a q-ary Hamming code is a [(q^p − 1)/(q − 1), (q^p − 1)/(q − 1) − p] code with covering
radius R = 1. The length is n = (q^p − 1)/(q − 1) because there are q^p − 1 non-zero vectors of q-ary symbols of length p and each non-zero column appears in
q − 1 versions (there are q − 1 non-zero multiples). Of course, the codimension
is n − k = p.
Hamming codes can be used to construct steganographic schemes capable of embedding p q-ary symbols or p log2 q bits in (q^p − 1)/(q − 1) pixels by making on average 1 − q^{−p} changes (because the probability that the syndrome of the cover, Hx ∈ F_q^p, already matches the p message symbols is 1/q^p). Thus, we have the corresponding relative payload α_p (in bits per pixel) and embedding efficiency e_p (in bits per change)

α_p = p log2 q / [(q^p − 1)/(q − 1)],   (8.70)

e_p = p log2 q / (1 − q^{−p}).   (8.71)
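Formulas (8.70) and (8.71) are straightforward to evaluate; a small sketch (the function name is ours):

```python
from math import log2

def hamming_qary(p, q):
    """Relative payload (8.70) and embedding efficiency (8.71) of matrix
    embedding with the q-ary Hamming code of codimension p (a sketch)."""
    n = (q**p - 1) // (q - 1)              # code length in cover elements
    alpha = p * log2(q) / n                # bits per pixel
    e = p * log2(q) / (1 - q**(-p))        # bits per embedding change
    return alpha, e

# Ternary Hamming code with p = 3 embeds 3 log2(3) bits in 13 pixels
alpha, e = hamming_qary(3, 3)
assert abs(alpha - 3 * log2(3) / 13) < 1e-12
assert 4.9 < e < 5.0
```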
Example 8.6: [±1 embedding using ternary Hamming codes with codimension p]
Assuming the parity-check matrix (8.69), let g = (14, 13, 13, 12, 12, 16, 18,
20, 19, 21, 23, 22, 24) be the vector of grayscale values from a cover-image
block of 13 pixels. The corresponding vector of ternary symbols is x = g mod 3 =
(2, 1, 1, 0, 0, 1, 0, 2, 1, 0, 2, 1, 0), which has the following syndrome:
Hx = (2+1+2+1+1, 1+1+2+2+2, 1+1+1+2+1)^T mod 3 = (1, 2, 0)^T.   (8.72)
Let us say that we wish to embed a ternary message m = (0, 1, 2)^T. We now need
to find the element of the vector x that needs to be changed so that its syndrome
is the desired message. Proceeding as in the matrix embedding theorem, we
calculate

m − Hx = (0, 1, 2)^T − (1, 2, 0)^T = (2, 2, 2)^T = 2H[., 7],   (8.73)

which tells us to add 2 (mod 3) to the seventh element of x. This can be realized by a single ±1 change of the seventh pixel, g[7] = 18 → 17, because 17 mod 3 = 2.
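A runnable sketch of this example's mechanics. The specific parity-check matrix (8.69) is not reproduced in this excerpt, so the code below builds a standard ternary [13, 10] Hamming parity-check matrix (one column per one-dimensional subspace of F_3^3); the column ordering therefore need not match (8.69), but the covering-radius-1 behavior is the same, and every change of a ternary symbol by 1 or 2 (mod 3) is realizable by a single ±1 pixel change:

```python
import itertools
import random

def ternary_hamming_H(p=3):
    """Columns: all non-zero vectors of F_3^p whose first non-zero entry
    is 1 (one representative per 1-D subspace)."""
    cols = []
    for v in itertools.product(range(3), repeat=p):
        nz = [c for c in v if c != 0]
        if nz and nz[0] == 1:
            cols.append(v)
    return cols

def syndrome(H, x):
    p = len(H[0])
    return tuple(sum(H[j][i] * x[j] for j in range(len(H))) % 3
                 for i in range(p))

def embed(H, x, msg):
    """Modify at most one ternary symbol of x so that syndrome(H, x) == msg."""
    y = list(x)
    s = tuple((mi - si) % 3 for mi, si in zip(msg, syndrome(H, x)))
    if any(s):
        for j, col in enumerate(H):
            for c in (1, 2):
                if tuple((c * v) % 3 for v in col) == s:
                    y[j] = (y[j] + c) % 3   # a single +-1 pixel change suffices
                    return y
    return y

H = ternary_hamming_H()
assert len(H) == 13                        # (3^3 - 1)/2 columns
random.seed(1)
x = [random.randrange(3) for _ in range(13)]
msg = (0, 1, 2)
y = embed(H, x, msg)
assert syndrome(H, y) == msg
assert sum(a != b for a, b in zip(x, y)) <= 1   # covering radius R = 1
```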
This bound is reached by codes for which the number of coset leaders with Hamming weight i is exactly C(n, i)(q − 1)^i and ξ = 1. The only such codes are perfect codes and they enjoy the largest possible embedding efficiency within the class of linear codes of length n.
Next, we derive a bound for codes of arbitrary length capable of communicating a fixed relative payload. The maximal number of messages |M| that can be communicated with change rate β = R/n is bounded from above by the number of ways one can make up to βn changes in n pixels,

|M| ≤ V_q(βn, n) = Σ_{i=0}^{βn} C(n, i)(q − 1)^i.   (8.77)
The asymptotic behavior of the sum is again determined by its largest (last) term:

log2 V_q(βn, n) ≥ log2 C(n, βn) + βn log2 (q − 1)   (8.78)
≈ n[H(β) + β log2 (q − 1)] = nH_q(β),   (8.79)

where we used the asymptotic expression (8.53) for log2 C(n, βn) and

H_q(x) = −x log2 x − (1 − x) log2 (1 − x) + x log2 (q − 1)   (8.80)

is the q-ary entropy function shown in Figure 8.6 for q = 3.
The inequality (8.79) together with the equivalent of the tail inequality (see
Exercise 8.3)
nHq (β) ≥ log2 Vq (βn, n) (8.81)
proves that the relative payload that can be embedded using change rate β is
bounded by

α ≤ log2 |M| / n ≈ H_q(β).   (8.82)
Figure 8.6 The ternary entropy function H_3(x).
A matrix embedding scheme with the maximal embedding efficiency can thus
embed relative payload α bpp by making on average Hq−1 (α) changes per pixel.
Figure 8.5 shows the upper bound (8.84) on embedding efficiency for q = 2, 3, 5
as a function of α−1 , where α is the relative payload in bpp. Note that the
bounds start at the point α = log2 q, e = [q/(q − 1)] log2 q, which corresponds to
embedding at the largest relative payload of log2 q bpp. The same figure shows
the benefit of using q-ary codes for a fixed relative payload α. For example, for
α = 1, the ternary ±1 embedding can theoretically achieve embedding efficiency e ≈ 4.4, which is significantly higher than 2, the maximal efficiency of binary codes at this relative message length. The embedding efficiencies of binary and ternary Hamming codes for different values of p are also plotted.
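The ternary figure of e ≈ 4.4 at α = 1 can be reproduced numerically by inverting H_3 (a sketch; the bisection helper is ours):

```python
from math import log2

def Hq(x, q):
    """q-ary entropy function (8.80), in bits."""
    if x == 0.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x) + x * log2(q - 1)

def Hq_inv(alpha, q):
    """Invert Hq on [0, (q-1)/q], where it is increasing, by bisection."""
    lo, hi = 0.0, (q - 1) / q
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Hq(mid, q) < alpha else (lo, mid)
    return (lo + hi) / 2

beta = Hq_inv(1.0, 3)      # minimal change rate for alpha = 1 bpp, q = 3
e_bound = 1.0 / beta       # maximal embedding efficiency, bits per change
assert 4.35 < e_bound < 4.45
```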
Let us assume that we use optimal q-ary matrix embedding schemes with em-
bedding efficiency reaching the bound (8.84). This means that one can embed
relative payload α (bpp) by making on average Hq−1 (α) changes per pixel. As-
suming that the message is encoded using symbols from Fq and forms a random
stream, the magnitude of these changes is equally likely to reach any of the q − 1
non-zero values in D and the expected impact per changed pixel is
d_γ = (1/(q − 1)) Σ_{d∈D} |d|^γ,   (8.88)
where D stands for Dodd for q odd and Deven for q even. Thus, the expected
embedding impact per pixel when embedding relative payload α using optimal
q-ary matrix embedding is
Δ(α, q, γ) = β(α) d_γ = H_q^{−1}(α) (1/(q − 1)) Σ_{d∈D} |d|^γ,   (8.89)
because β(α) = Hq−1 (α) is the minimal change rate (8.82) for payload α embed-
dable using q-ary codes.
We can obtain some insight into the trade-off between the number of em-
bedding changes and their magnitude by determining the value of q that
minimizes Δ(α, q, γ). Figure 8.7 shows Δ(α, q, γ) for different values of q for
α ∈ {0.1, 0.2, . . . , 0.9} and γ = 1. Note that q = 3 leads to the smallest embedding impact for all relative payloads. This statement holds true for any γ ≥ 1 because Δ(α, 2, γ) = Δ(α, 2, 1) and Δ(α, 3, γ) = Δ(α, 3, 1) for all γ (the magnitude of embedding changes is 1 in these cases) and Δ(α, q, γ) ≥ Δ(α, q, 1) for all q > 3. Thus, as long as the
Figure 8.7 Expected embedding impact Δ(α, q, 1) for various values of α and q. The
minimal embedding impact is always obtained for q = 3.
embedding impact can be captured using dγ , we can conclude that the most se-
cure steganography is obtained for ternary codes and it does not pay off to make
fewer embedding changes with larger magnitude (or use q-ary codes with q > 3).
Of course, this conclusion hinges on the assumption that statistical detectability
can be captured with a distortion measure. This is not, however, entirely clear.
It is possible that for some combination of the embedding scheme and the cover-
source model, the KL divergence between cover and stego objects will be lower
for codes with q > 3. This issue is currently an open research problem.
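A numeric sketch of this comparison. The sets D_odd and D_even are defined earlier in the chapter and not shown in this excerpt; we assume D_odd = {±1, . . . , ±(q − 1)/2} for odd q and a single magnitude-1 change for q = 2, and restrict the comparison to those cases:

```python
from math import log2

def Hq(x, q):
    if x == 0.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x) + x * log2(q - 1)

def Hq_inv(alpha, q):
    lo, hi = 0.0, (q - 1) / q
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Hq(mid, q) < alpha else (lo, mid)
    return (lo + hi) / 2

def d_gamma(q, gamma):
    if q == 2:
        return 1.0                     # the single change has magnitude 1
    half = (q - 1) // 2                # assumed D_odd = {+-1, ..., +-(q-1)/2}
    return sum(2 * d**gamma for d in range(1, half + 1)) / (q - 1)

def impact(alpha, q, gamma):
    """Expected embedding impact per pixel, equation (8.89)."""
    return Hq_inv(alpha, q) * d_gamma(q, gamma)

for alpha in (0.3, 0.5, 0.9):
    vals = {q: impact(alpha, q, 1) for q in (2, 3, 5, 7)}
    assert min(vals, key=vals.get) == 3    # q = 3 minimizes the impact
```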
The main theme of this chapter is improving the embedding efficiency of stegano-
graphic schemes by decreasing the number of embedding changes. So far, we have
explored methods based on syndrome coding with linear codes (so-called matrix
embedding). Although this approach is the one most developed today, there ex-
ists an alternative and quite elegant approach that originated from the theory
of covering sets of cyclic groups. Moreover, it can be thought of as a generaliza-
tion of ±1 embedding to groups of multiple pixels. Additionally, by connecting a
known problem in steganography with another branch of mathematics, steganog-
raphy may benefit from future breakthroughs in this direction. In this section,
we explain the main ideas and include appropriate references where the reader
may find more detailed information. Following the spirit of this book, we first
introduce a simple example that will later be generalized.
Table 8.2. Required modification of the cover pair (x[1], x[2]) depending on the value of
Ext(x[1], x[2]) (the first column) to embed a quaternary symbol b ∈ Z4 = {0, 1, 2, 3}
(the first row).
x\b 0 1 2 3
0 (0,0) (1,0) (0,1) (–1,0)
1 (–1,0) (0,0) (1,0) (0,1)
2 (0,–1) (–1,0) (0,0) (1,0)
3 (1,0) (0,–1) (–1,0) (0,0)
Table 8.2 shows that it is always possible to embed a quaternary symbol in each
pair (x[1], x[2]) by modifying at most one element of the pair by ±1. For example,
when Ext(x[1], x[2]) = 2 and we wish to embed symbol 0, we modify x[2] →
x[2] − 1. (In fact, in this case, we can achieve this same effect by modifying x[2] →
x[2] + 1, etc.) Here, we ignore the boundary issues when the modified element
may get out of its dynamic range. If a random symbol stream is embedded using this method, we embed 2 bits by making a modification to one of the pixels with probability 12/16 = 3/4. Thus, the embedding efficiency of this method is e = 2/(3/4) = 8/3 = 2.66 . . .. This is a higher embedding efficiency than simply embedding each bit as the LSB of each pixel with efficiency 2. It is also higher than embedding a ternary symbol (log2 3 bits) using ±1 embedding, which has embedding efficiency (3/2) log2 3 = 2.377 . . ..
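Table 8.2 can be checked mechanically. The extraction function Ext is not spelled out in this excerpt; the table is consistent with the assumed choice Ext(x[1], x[2]) = (x[1] + 2x[2]) mod 4, which the sketch below uses:

```python
# Table 8.2: rows indexed by Ext(x1, x2), columns by the message symbol b.
TABLE = {
    0: [(0, 0), (1, 0), (0, 1), (-1, 0)],
    1: [(-1, 0), (0, 0), (1, 0), (0, 1)],
    2: [(0, -1), (-1, 0), (0, 0), (1, 0)],
    3: [(1, 0), (0, -1), (-1, 0), (0, 0)],
}

def ext(x1, x2):
    # Assumed extraction function; it is consistent with every table entry.
    return (x1 + 2 * x2) % 4

x1, x2 = 10, 20          # any pair away from the range boundary
changes = 0
for b in range(4):
    d1, d2 = TABLE[ext(x1, x2)][b]
    y1, y2 = x1 + d1, x2 + d2
    assert ext(y1, y2) == b              # the embedded symbol is recovered
    changes += (d1 != 0) + (d2 != 0)

efficiency = 2 / (changes / 4)           # 2 bits per pair, 3 changes per 4 symbols
assert changes == 3
assert abs(efficiency - 8 / 3) < 1e-12
```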
This simple idea, which originally appeared in [174], can be greatly gen-
eralized [158, 160]. For any positive integer M , we will denote by ZM =
{0, 1, . . . , M − 1} the finite cyclic group of order M where the group operation
is addition modulo M. The group (Z_M, +) obviously satisfies the axioms of a group because it is closed with respect to the group operation (for any a, b ∈ Z_M, the addition is defined as a + b mod M ∈ Z_M), the operation is also associative, there exists an identity element, which is 0, and each element, a, has an inverse element, M − a.
Let A = {a[1], . . . , a[n]} be a sequence of elements from Z_M. The set A is called a Sum and Difference Covering Set (SDCS) with parameters (n, k, M) if every b ∈ Z_M can be written as b = Σ_{i=1}^{n} s[i]a[i] mod M with integers s[i] satisfying Σ_{i=1}^{n} |s[i]| ≤ k,
which proves that the extraction function indeed obtains the correct message
symbol b. For practical implementations, the embedding function can be easily
implemented using a look-up table tying each message symbol with the vector s.
For further considerations, it will be convenient to denote by d_A(b) = Σ_{i=1}^{n} |s[i]| the minimal number of embedding modifications needed to embed symbol b. The embedding efficiency of the resulting steganographic scheme with SDCS A on covers with elements in Z_M is

e_A = log2 M / [(1/M) Σ_{b=0}^{M−1} d_A(b)].   (8.97)
Example 8.7: Consider A = {1, 2, 6} and convince yourself that for Σ_{i=1}^{3} |s[i]| ≤ 2 the sum Σ_{i=1}^{3} s[i]a[i] can attain all values between −8 and 8. Out of these M = 17 values, Σ_{i=1}^{3} |s[i]| = 0 for the value 0, Σ_{i=1}^{3} |s[i]| = 1 for values from the set {−6, −2, −1, 1, 2, 6}, which is {11, 15, 16, 1, 2, 6} in Z_17, and Σ_{i=1}^{3} |s[i]| = 2 for values in {−8, −7, −5, −4, −3, 3, 4, 5, 7, 8}. Thus, A is an SDCS with parameters (3, 2, 17) and the embedding efficiency of the associated ±1 embedding scheme is

e_A = log2 17 / [(1/17)(1 × 0 + 6 × 1 + 10 × 2)] ≈ 2.673.   (8.98)

This embedding scheme can embed log2 17 bits in n = 3 cover elements by making at most k = 2 modifications by ±1.
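The claims of this example can be verified by brute force (a sketch):

```python
from itertools import product
from math import log2

A, M, k = [1, 2, 6], 17, 2

# d_A(b): minimal number of +-1 modifications needed to reach residue b
d = {b: None for b in range(M)}
for s in product(range(-k, k + 1), repeat=len(A)):
    w = sum(abs(si) for si in s)
    if w > k:
        continue
    b = sum(si * ai for si, ai in zip(s, A)) % M
    if d[b] is None or w < d[b]:
        d[b] = w

assert all(v is not None for v in d.values())      # A covers all of Z_17
assert sorted(d.values()).count(1) == 6            # six residues need one change
assert sorted(d.values()).count(2) == 10           # ten residues need two changes
e = log2(M) / (sum(d.values()) / M)                # embedding efficiency (8.97)
assert abs(e - log2(17) * 17 / 26) < 1e-12
```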
If, for a given (n, k, M) SDCS, there does not exist an (n, k, M′) SDCS with M′ > M, we call the SDCS maximal (its associated steganographic method communicates the largest possible payload for a given choice of n and k). Because there are V_3(n, k) possible ways one can make k or fewer changes by ±1 to n cover elements, we obtain the following bound:

M ≤ Σ_{i=0}^{k} C(n, i) 2^i,   (8.99)
which is essentially the same bound as (8.77). For k = 1, the bound states M ≤
1 + 2n. In this case, it is possible to find the maximal SDCS with M = 1 + 2n,
which is A = {1, 2, . . . , n}. To see that A is an (n, 1, 1 + 2n) SDCS, the reader is
encouraged to inspect Exercise 8.5, where the embedding scheme is formulated
from a different perspective using rainbow coloring of lattices.
Finding maximal SDCSs is a rather difficult task. In fact, it is not easy to
find SDCSs in general. Several parametric constructions of SDCSs are described
in [160]. Table 8.3 contains examples of SDCSs useful for embedding large pay-
loads, all found by a computer search. The embedding efficiency and relative
embedding capacity of the associated embedding schemes are also displayed
in Figure 8.8, where we compare the embedding efficiency of steganographic
schemes constructed using SDCSs, Hamming codes, and the ternary repetition
code from Exercise 8.6.
Curiously, the existence of an (n, k, M) SDCS does not imply the existence of SDCSs with (n, k, M′), M′ < M. It is known, for example, that although there exists a (9, 2, 132) SDCS, there are no SDCSs with parameters (9, 2, x) for x ∈ {131, 129, 128, 127, 125}. The subject of finding SDCSs is related to other
problems from discrete geometry, graph theory, and sum cover sets [73, 103, 105,
115]. Progress in these areas is likely to find applications in steganography via
the considerations explained in this section.
Table 8.3. Examples of SDCSs and the relative payload, α, and embedding efficiency, e,
of their associated steganographic schemes. Note that SDCSs with an even M cannot be
enlarged to an odd M because we would lose the covering property. For example, for the
SDCS {1, 3, 9, 14}, 7 = −9 − 14 mod 30 and 7 would not be covered in Z31 .
Figure 8.8 Embedding efficiency of ±1 steganographic schemes based on SDCS
constructions from Table 8.3. For comparison, we include the embedding efficiency of
ternary (codimension 1 and 2) and binary Hamming codes (codimension 1–3), the ternary
Golay code, and the ternary repetition code from Exercise 8.6 for n = 4, 7, 10, 13, 16
(from the right).
Summary
• Matrix embedding (or syndrome coding) is a coding method that can increase the embedding efficiency of steganographic schemes.
• It can be applied only when the message length is smaller than the embedding capacity.
• The shorter the message, the larger the improvement due to matrix embedding.
• Matrix embedding is part of the theory of covering codes.
• Good matrix embedding schemes should have small average distance to code because it determines the embedding efficiency.
• Some of the simplest matrix embedding methods are based on Hamming codes.
• By assigning a q-ary symbol from a finite field to each cover element, it is possible to further increase embedding efficiency using q-ary codes.
• When the embedding impact is measured using distortion that takes into account the magnitude of modifications, ternary codes provide the minimal embedding impact. In particular, it does not pay off to make fewer embedding changes of magnitude larger than 1.
• It is possible to design steganographic schemes using sum and difference covering sets of finite cyclic groups. Current schemes find applications for embedding large payloads.
Table 8.4. Properties of optimal q-ary matrix embedding schemes when embedding into a cover containing n pixels. The function H_q(x) is the q-ary entropy function, H_q(x) = −x log2 x − (1 − x) log2 (1 − x) + x log2 (q − 1).
Exercises
8.1 [Binary repetition code] The binary repetition code of length n consists
of two codewords C = {(0, . . . , 0), (1, . . . , 1)}. Show that the covering radius of
this code and its average distance to code are
R = ⌈(n − 1)/2⌉,   (8.100)

R_a = (n/2) [1 − C(n − 1, R) 2^{−n+1}].   (8.101)
8.2 [Binary–ternary conversion] Write a computer program that converts
a stream of q-ary symbols represented using integers {0, 1, . . . , q − 1} to a bi-
nary stream and vice versa. Make sure that the length of the binary stream is
approximately n log2 q for good encoding.
8.3 [Tail inequality for q > 2] Prove the tail inequality (8.81). Hint: First,
you can assume that
0 ≤ β ≤ (q − 1)/q,   (8.102)
because if we restrict our attention to codes containing the all-ones vector, and thus all its multiples (0, . . . , 0), (1, . . . , 1), . . . , (q − 1, . . . , q − 1), no vector x ∈ F_q^n can be further from these q codewords than n(1 − 1/q) because the furthest vector contains n/q zeros, n/q ones, . . ., and n/q symbols q − 1. Then write
1 = (β + 1 − β)^n ≥ Σ_{i=0}^{βn} C(n, i) (q − 1)^i (1 − β)^n [β/((1 − β)(q − 1))]^i,   (8.103)
use the inequality β/[(1 − β)(q − 1)] ≤ 1, which follows from (8.102), and follow
the same steps as in the proof of the tail inequality for q = 2.
In other words, the neighborhood is formed by the lattice point itself and 2d
other points that differ from the center point in exactly one coordinate by 1.
Show that the following assignment of 2d + 1 colors, c ∈ {0, . . . , 2d}, to the lattice

c(i_1, . . . , i_d) = Σ_{k=1}^{d} k·i_k mod (2d + 1)   (8.105)
has the property that the neighborhood of every lattice point contains exactly
2d + 1 different colors.
8.5 [±1 embedding in groups] One possibility to avoid modifying the pixels
by more than 1, yet use the power of q-ary embedding, is to group pixels into
disjoint subsets of d pixels. By modifying each pixel by ±1, we obtain 2d possible
modifications plus one case when no modifications are carried out. The rainbow
coloring from the previous example will enable us to assign colors ((2d + 1)-ary
symbols) to each pixel group and embed one (2d + 1)-ary symbol by modifying
at most one pixel by 1 (±1 embedding). Show that the relative payload and
embedding efficiency of this steganographic method are
α_d = log2 (2d + 1) / d,    e_d = log2 (2d + 1) / [1 − (2d + 1)^{−1}].   (8.106)
8.6 [Large relative payload using ternary ±1 embedding] Let C be the ternary [n, 1] repetition code (one-dimensional subspace of F_3^n) with a ternary parity-check matrix H = [I_{n−1}, u], where u is the column vector of 2s. This code can be used to embed (n − 1) log2 3 bits per n pixels, which gives relative payload α_n = [(n − 1)/n] log2 3 → α_max = log2 3 bpp with increasing n. Thus this code is suitable for embedding large payloads close to the maximal relative payload α_max. If we denote the number of 0s, 1s, and 2s in an arbitrary vector of F_3^n by a, b, and c, respectively, then the average distance to C can be computed as

R_a = (1/3^n) Σ C(n, a) C(n − a, b) (n − max{a, b, c}),   (8.107)
where the sum extends over all triples {a, b, c} of non-negative integers such
that a + b + c = n. Thus, the embedding efficiency of this code is e = [(n −
1) log2 3]/Ra . For example, it is possible to embed α = 1.188 bpp with embed-
ding efficiency 2.918 bits per change for n = 4. The performance of this family
of codes is shown in Figure 8.8. In this exercise, prove the expression for the
average distance to code, Ra .
9 Non-shared selection channel
(1 − 2^k/2^n)^{2^{n−k+εn}} = [(1 − 1/2^{n−k})^{2^{n−k}}]^{2^{εn}} → 0 as n → ∞ for k/n = const.,   (9.1)

because (1 − 1/2^{n−k})^{2^{n−k}} → 1/e < 1. Thus, for any ε > 0 we can write k − εn bits into the memory with probability that approaches 1 exponentially fast.
Although asymptotically optimal, the random-binning argument above is not
a practical way to construct steganographic schemes with non-shared selection
channels due to the enormous size of the codebook that needs to be shared. In
Section 9.1, we describe an approach based on syndrome coding using random
linear codes, which is quite suitable for applications in steganography. A prac-
tical and fast wet paper code can be obtained using random linear codes called
LT codes (Section 9.2). To improve the embedding efficiency of wet paper codes,
in Section 9.3 we present another random construction. The usefulness of having
a practical solution for the non-shared selection channel is demonstrated in Sec-
tion 9.4, where we list several fascinating and very diverse applications of wet
paper codes.
Alternative approaches to writing on wet paper that are not discussed in this
book are based on maximum distance separable codes, such as the Reed–Solomon
codes [97], and on BCH codes [208].
sender’s goal is to communicate m < k message bits m ∈ {0, 1}m to the recipient
who has no information about the selection channel S.
Approaching this problem using the paradigm of syndrome coding, the message
is communicated to the recipient as a syndrome. To this end, the sender modifies
some changeable elements in the cover image so that the bits assigned to the stego
image, y, satisfy
Dy = m, (9.2)
where D is an m × n binary matrix shared by the sender and the recipient.
The recipient reads the message by multiplying the vector of stego image bits
y by the matrix D. While this appears identical to matrix embedding, there is
one important difference because the sender is now allowed to modify only the
changeable elements of x.
Using the variable v = y − x, (9.2) can be rewritten as
Dv = m − Dx, (9.3)
where v[i] = 0 for i ∉ S, and v[j], j ∈ S, are to be determined. Because v[i] = 0 for i ∉ S, the product, Dv, on the left-hand side can be simplified. The sender can remove from D all n − k columns corresponding to indices i ∉ S and also remove from v all n − k elements v[i], i ∉ S. To avoid introducing too many new symbols, we will keep the same symbol for the pruned vector v and write (9.3) as
Hv = z, (9.4)
where H is an m × k submatrix of D consisting of those columns of D with indices
from S. Note that v ∈ {0, 1}^k is an unknown vector holding the embedding changes and z = m − Dx ∈ {0, 1}^m is a known right-hand side. Equation (9.4)
is a system of m linear equations for k unknowns v. We now discuss several
options for solving this system.
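A minimal end-to-end sketch of this syndrome-coding setup (all names are ours). It forms the pruned system (9.4) and solves it by Gaussian elimination over GF(2); on a rank failure it simply regenerates D, mirroring the retry trick used later in the chapter:

```python
import random

def solve_wet_paper(D, x, S, msg):
    """Return stego bits y with D y = msg (mod 2), changing x only at the
    changeable indices S, or None if the pruned system is unsolvable."""
    n = len(x)
    z = [(mi + sum(Di[j] * x[j] for j in range(n))) % 2
         for mi, Di in zip(msg, D)]                  # z = msg - D x (mod 2)
    H = [[Di[j] for j in S] for Di in D]             # keep changeable columns
    rows, cols = len(H), len(S)
    piv, r = [], 0
    for c in range(cols):                            # Gaussian elimination
        pr = next((i for i in range(r, rows) if H[i][c]), None)
        if pr is None:
            continue
        H[r], H[pr] = H[pr], H[r]
        z[r], z[pr] = z[pr], z[r]
        for i in range(rows):
            if i != r and H[i][c]:
                H[i] = [(a + b) % 2 for a, b in zip(H[i], H[r])]
                z[i] = (z[i] + z[r]) % 2
        piv.append(c)
        r += 1
    if any(z[i] for i in range(r, rows)):
        return None                                  # inconsistent system
    v = [0] * cols
    for i, c in enumerate(piv):
        v[c] = z[i]
    y = list(x)
    for j, vj in zip(S, v):
        y[j] = (y[j] + vj) % 2
    return y

random.seed(2024)
n, k, mlen = 30, 18, 10
y = None
for _ in range(20):          # on a rank failure, regenerate D and retry
    S = sorted(random.sample(range(n), k))
    D = [[random.randrange(2) for _ in range(n)] for _ in range(mlen)]
    x = [random.randrange(2) for _ in range(n)]
    msg = [random.randrange(2) for _ in range(mlen)]
    y = solve_wet_paper(D, x, S, msg)
    if y is not None:
        break
assert y is not None
assert [sum(Di[j] * y[j] for j in range(n)) % 2 for Di in D] == msg
assert all(y[j] == x[j] for j in range(n) if j not in S)   # dry pixels intact
```

The recipient only computes D y, exactly as in matrix embedding; the wet constraint is entirely the sender's problem.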
If the solution exists, it can be found using standard linear-algebra methods,
such as Gaussian elimination. The solution will exist for any right-hand side z
if the rank of H is m (or the rows of H are linearly independent), which means
that we must have m ≤ k as the necessary condition. However, the complexity of Gaussian elimination will be prohibitively large, O(km^2), because we cannot
directly impose structure on H that would allow us to solve the system more
efficiently. This is because H was obtained as a submatrix of a larger, user-
selected matrix D through the selection channel over which the sender has no
control because it is determined by the cover image or side-information as in the
examples from the introduction to this chapter. In the next section, we introduce
a class of sparse matrices for which fast algorithms for solving (9.4) are available.
Before we do so, we make two more remarks.
If D is chosen randomly (e.g., generated from the stego key), the probability that H will be of full rank, rank(H) = m, is 1 − O(2^{m−k}) (see, for example, [31]) and thus approaches 1 exponentially fast with increasing k − m.
In the previous section, we showed that syndrome coding with matrix D shared
between the sender and the recipient can be used to communicate secret messages
using non-shared selection channels. While the recipient reads the message by
performing a simple matrix multiplication Dy, the sender needs to solve the
linear system (9.4). In this section, we show how to perform this task with low
complexity by choosing D from a special class of random sparse matrices.
The basic idea is to make D (and thus H) sparse so that it can be put with
high probability into upper-diagonal form simply by permuting its rows and
columns. Imagine that matrix H has a column with exactly one 1 in, say, the
j1 th row. The sender swaps this column with the first column and then swaps the
first and j1 th rows, which brings the 1 to the upper left corner of H. Note that
at this stage, for the permuted matrix, H[1, 1] = 1 and H[j, 1] = 0, for j > 1.
We now apply the same step again while ignoring the first column and the first
row of the permuted matrix. Let us assume that we can find again a column
with only one 1, say in the j2 th row,1 and swap the column with the second
column of H followed by swapping the second and j2 th rows. As a result, we
will obtain a matrix with 1s on the first two elements of its main diagonal and
0s below them, H[1, 1] = 1, H[2, 2] = 1, H[j, 1] = 0 for j > 1, and H[j, 2] = 0 for
j > 2. We continue this process, this time ignoring the first two columns and
1 Since we are ignoring the first row, this column may have another 1 as its first element.
rows, and eventually stop after m steps. At the end of this process, the row and
column permutations will produce a permuted matrix in an upper-diagonal form,
H[i, i] = 1 for i = 1, . . . , m, H[j, i] = 0 for j > i. Such a linear system can be
efficiently solved using the standard back-substitution as in Gaussian elimination.
(Note that the permutations preserve the low density of the matrix.) We call this
permutation procedure a matrix LT process because it was originally invented
for erasure-correcting codes called LT codes [164]. If, at some step during the
permutation process, we cannot find a column with exactly one 1, we say that the
matrix LT process has failed. The trick is to give H properties that will guarantee
that the matrix LT process will successfully finish with a high probability.
If the Hamming weights of columns of H follow a probability distribution
called the Robust Soliton Distribution (RSD) [164], the matrix LT process will
not fail with high probability. Imposing this distribution on the columns of D,
the columns of H will inherit it, too, because H is a submatrix of D obtained
by removing some of its columns. The RSD requires that the probability that a
column in D has Hamming weight i, 1 ≤ i ≤ m, be (1/η)(ν[i] + τ[i]), where

ν[i] = 1/m for i = 1,
ν[i] = 1/[i(i − 1)] for i = 2, . . . , m,   (9.5)

τ[i] = T/(im) for i = 1, . . . , m/T − 1,
τ[i] = [T log(T/δ)]/m for i = m/T,
τ[i] = 0 for i = m/T + 1, . . . , m,   (9.6)

η = Σ_{i=1}^{m} (ν[i] + τ[i]), T = c log(m/δ)√m, and δ and c are suitably chosen constants
whose choice will be discussed later. An example of the RSD for m = 100 is
shown in Figure 9.1. To generate a matrix with the number of ones in its columns
following the RSD, we can first generate a sequence of integers w[1], w[2], . . . that
follows the RSD. Then, the ith column of the matrix is generated by applying a
random permutation to a column containing w[i] ones and m − w[i] zeros.
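A sketch that evaluates (9.5)–(9.6). The text writes the spike position as m/T, which need not be an integer; we take the floor, and read log as the natural logarithm. Both are assumptions, but they reproduce the spike at weight 18 reported for Figure 9.1:

```python
from math import log, sqrt

def robust_soliton(m, c=0.1, delta=0.5):
    """Normalized weight probabilities (1/eta)(nu[i] + tau[i]), i = 1..m."""
    T = c * log(m / delta) * sqrt(m)
    spike = int(m / T)                 # assumed floor of m/T
    nu = [0.0] * (m + 1)
    tau = [0.0] * (m + 1)
    nu[1] = 1.0 / m
    for i in range(2, m + 1):
        nu[i] = 1.0 / (i * (i - 1))
    for i in range(1, spike):
        tau[i] = T / (i * m)
    tau[spike] = T * log(T / delta) / m
    eta = sum(nu) + sum(tau)
    return [(nu[i] + tau[i]) / eta for i in range(1, m + 1)]

p = robust_soliton(100)
assert abs(sum(p) - 1.0) < 1e-9
# For m = 100, c = 0.1, delta = 0.5 the spike falls at weight 18
assert max(range(4, 101), key=lambda i: p[i - 1]) == 18
```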
To obtain some insight into why this distribution looks the way it does, note
that the columns with few ones are more frequent than denser columns to guar-
antee that the LT process will always find a column with just one 1. Without
the rather mysterious spike at Hamming weight 18, however, the matrix would
become too sparse to be of full rank. Thus, the spike ensures that the rank of the
matrix is m. More rigorous analysis of the RSD appears in Chapter 50 of [171]
and the exercises therein.
The analysis of LT codes [164] implies that when the Hamming weights of columns of D (and thus of H) follow the RSD (9.5)–(9.6) the matrix LT process finishes successfully with probability P_pass > 1 − δ if the message length m and the number of changeable elements k satisfy

k ≥ θm = [1 + O(log^2(m/δ)/√m)] m.   (9.7)
Figure 9.1 Robust soliton distribution for Hamming weights of columns of D for δ = 0.5,
c = 0.1, and m = 100.
Yet again, we see that asymptotically the sender can communicate k bits because
θ → 1 as m → ∞. The capacity loss due to the finite value of m is about 6%
when m = 10, 000, c = 0.1, and δ = 5 (θ = 1.062), while the probability that the
LT process succeeds is about Ppass ≈ 0.75. This probability increases and the
capacity loss decreases with increasing message length (see Table 9.1).
We note at this point that in practice one can achieve low overhead and high
probability of a successful pass through the LT process with δ > 1, which is in
contradiction with its probabilistic meaning. This is possible because the inequal-
ity Ppass > 1 − δ guaranteed by (9.7) is not tight.
Assuming the maximal-length message is sent (m ≈ k), the average number of operations required to complete the LT process [91] is significantly smaller than for Gaussian elimination. The cost consists of evaluating the product Dx plus the complexity of the LT process itself. The gain in implementation efficiency over using simple Gaussian elimination is shown in Table 9.1.
9.2.1 Implementation
We now describe how the matrix LT process can be incorporated in a stegano-
graphic method. The sender starts by forming the matrix D with columns fol-
lowing the RSD. A stego key can be used to initialize the pseudo-random number
generator. Applying the bit-assignment function to the cover image, the sender
obtains the vector of bits x and computes the right-hand side z = m − Dx. The
matrix LT process is used to find the solution v to the linear system (9.4). Cover
elements i with v[i] = 1 should be modified to change their assigned bit.
Table 9.1. Running time (in seconds) for solving m × m and m × θm linear systems
using Gaussian elimination and the matrix LT process, respectively (c = 0.1, δ = 5);
Ppass is the probability of a successful pass through the LT process. The experiments
were performed on a single-processor Pentium PC with 3.4 GHz processor.
m Gauss LT θ Ppass
1,000 0.023 0.008 1.098 43%
10,000 17.4 0.177 1.062 75%
30,000 302 0.705 1.047 82%
100,000 9320 3.10 1.033 90%
The receiver forms the matrix D, applies the bit-assignment function to image
elements (obtaining vector y), and finally extracts the message as the syndrome
m = Dy. Note, however, that in order to do so the recipient needs to know the
message length m because the RSD (and thus D) depends on m as a parameter.
(The remaining parameters c and δ can be public knowledge.) The message
length m thus needs to be communicated to the receiver. For example, the sender
can reserve a small portion of the cover image (e.g., determined from the stego
key) where the parameter m will be communicated using a small matrix D0 with
uniform distribution of 0s and 1s, instead of the RSD, and solve the system using
Gaussian elimination. Because in typical applications m could be encoded using
no more than 20 bits, the Gaussian elimination does not present a significant
increase in complexity because solving a system of 20 equations should be fast.
The payload of m bits is then communicated in the rest of the image using the
matrix LT process whose matrix D follows the RSD.
To complete the algorithm description, we need to explain how the sender
solves the problem with occasional failures of the matrix LT process. Again,
among several different approaches that one can take, probably the simplest one
is to make D dependent on the message length, m, for example by making the
seed for the PRNG that generates the sequence of integers w[1], . . . dependent on
a combination of the stego key and message length. If a failure occurs, a dummy
bit is appended to the message, and the matrix D is generated again followed
by another run of the matrix LT process till a successful pass is obtained.
Algorithm 9.1 contains a pseudo-code for the matrix LT process to ease practical implementation. The input is a binary m × k matrix H and the right-hand side z ∈ {0, 1}^m. The output is the solution v to the system Hv = z.
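Algorithm 9.1 itself is not reproduced in this excerpt; the following is our compact sketch of the matrix LT process (pivot-column search followed by back-substitution):

```python
def matrix_lt_solve(H, z):
    """Solve H v = z (mod 2) by the matrix LT process: repeatedly pick a
    column with exactly one 1 in the still-active rows (pairing the column
    with that row), then back-substitute in reverse order of the pairing."""
    m, k = len(H), len(H[0])
    active = set(range(m))      # rows not yet paired with a pivot column
    free = set(range(k))        # columns not yet used as pivots
    order = []                  # (row, column) pivot pairs, in order found
    while len(order) < m:
        pick = None
        for c in sorted(free):
            rows_c = [r for r in active if H[r][c]]
            if len(rows_c) == 1:
                pick = (rows_c[0], c)
                break
        if pick is None:
            return None         # the matrix LT process failed
        r, c = pick
        active.remove(r)
        free.remove(c)
        order.append((r, c))
    v = [0] * k                 # unused (free) columns get v = 0
    for r, c in reversed(order):
        s = sum(H[r][j] * v[j] for j in range(k) if j != c) % 2
        v[c] = (z[r] + s) % 2
    return v

# A small sparse system that the process can solve:
H = [[1, 0, 1, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
z = [0, 1, 1]
v = matrix_lt_solve(H, z)
assert [sum(row[j] * v[j] for j in range(4)) % 2 for row in H] == z
# No column of weight one exists here, so the process reports failure:
assert matrix_lt_solve([[1, 1], [1, 1]], [0, 0]) is None
```

Any valid pivot choice works; the RSD of the previous section is what makes such a pivot column available with high probability at every step.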
on average 50% of ones and 50% of zeros. Since ones correspond to embedding
changes, the message will be embedded with embedding efficiency 2. If the mes-
sage is shorter than the maximal communicable message, m < k, there will be
more than one solution. In the language of coding theory, the solutions will form a coset C(z) = {x ∈ {0, 1}^n | Hx = z}. If the sender selects the solution with the
smallest number of ones (a coset leader) the embedding impact will be mini-
mized or, equivalently, the embedding efficiency maximized. Unfortunately, the
problem of finding a coset leader for general codes is NP-complete.
In this section, we describe a version of wet paper codes with improved embed-
ding efficiency. It is a block-based scheme that embeds small message segments
of p bits in each block using random codes of codimension p. For such codes, the
problem of finding the solution v with the smallest number of ones can be solved
simply using brute force.
Keeping the same notation, we assume there are k changeable pixels in a
cover image consisting of n pixels and we wish to communicate m < k message
bits. The sender and receiver agree on a small integer p (e.g., p ≈ 20) and us-
ing the stego key divide the cover image into nB = m/p disjoint pseudo-random
blocks, where each block will convey p message bits. Each block will thus con-
tain n/nB = pn/m cover elements (for simplicity we assume all quantities are
integers). Since the blocks are formed pseudo-randomly, there will be on average
2 Note that we measure the relative payload with respect to the number of changeable pixels,
k, rather than the number of all pixels, n.
described in Section 9.2.1. Knowing m, the recipient uses the secret stego key
and partitions the rest of the stego image into the same disjoint blocks as the
sender and extracts p message bits m from each block of pixels y as the syndrome
m = Dy.
9.3.1 Implementation
To assess the memory requirements and complexity of Algorithm 9.2, we con-
sider the worst case, when the sender needs to generate all sets U1 , . . . , UR/2 .
The cardinalities of Ui exponentially increase with i, reach a maximum at around
i ≈ Ra , the average distance to code, and then quickly fall off to zero for i > Ra .
We already know from Chapter 8 that with increasing length of the code (or
increasing p), Ra → R. This means that the above algorithm avoids computing
the largest of the sets Ui . Nevertheless, we will still need to keep in memory all
Ui, i = 1, . . . , R/2, and the indices j1, . . . , ji for each element of Ui. Because on
average |U1| = p/α, we have on average |Ui| ≤ (p/α choose i). Thus, the total memory
requirements are bounded by O(R/2 × (p/α choose R/2)) ≈ O(p 2^((p/α)H(Rα/(2p)))) ≈ O(p 2^(κp)),
where κ = H(H^{-1}(α)/2)/α < 1 and H(x) is the binary entropy function. (For
example, for α = 1/2, κ = 0.61.) Here, we used the asymptotic form for the
binomial number (8.49) from Chapter 8, (n choose k) ≈ 2^(nH(k/n)), and the fact that
R ≈ (p/α)H^{-1}(α) is the expected number of embedding changes for large p (see
Table 8.4).
Figure 9.2 Embedding efficiency e of wet paper codes realized using random linear codes
of codimension p = 6, 10, 14, 18, displayed as a function of 1/α. The solid curve is the
asymptotic upper bound on embedding efficiency.
F = Q ◦ T : Z^N → X^n, (9.9)

where X is the dynamic range of the transformed signal x = F(X) and the
real-valued map T : Z^N → R^n is some form of processing. The circle stands for
composition of mappings, Q ◦ T(X) = Q(T(X)). We denote by T(X) = u ∈ R^n
the intermediate image. The map Q is an integer scalar quantizer with range X,
extended to work on vectors by coordinates: Q(u) = (Q(u[1]), . . . , Q(u[n])). We
stress that the cover image is the signal x.
Following the principle of minimum embedding impact, we define ρ[i] using
the uniquely determined integer a, a ≤ u[i] < a + 1, as
ρ[i] = |(u[i] − a) − (a + 1 − u[i])| = 2|u[i] − (a + 1/2)|. (9.10)
3 Exercises 9.3–9.6 investigate a more general strategy under the assumption that wet paper
codes with the largest theoretical embedding efficiency are used.
In other words, ρ[i] is the increase in the quantization error when quantizing u[i]
to the second closest value to u[i] instead of the closest one. If the sender uses a
bit-assignment function that always assigns two different bits to integers a and
a + 1, the sender has the power to flip the bit assigned to x[i] by rounding to
the second closest value.
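The cost assignment (9.10) and the selection of the least detectable elements can be sketched as follows. This is a toy illustration with hypothetical helper names; a real embedder would obtain u from an actual processing map T applied to an image:

```python
import math

def pq_costs(u):
    """Embedding impact rho[i]: the increase in quantization error when u[i]
    is rounded to the second closest integer instead of the closest one."""
    costs = []
    for ui in u:
        a = math.floor(ui)                          # a <= u[i] < a + 1
        costs.append(abs((ui - a) - (a + 1 - ui)))  # = 2|u[i] - (a + 1/2)|
    return costs

def select_changeable(u, k):
    """Indices of the k elements with the smallest impact, i.e., those whose
    unquantized value lies closest to the center of a quantization bin."""
    costs = pq_costs(u)
    return sorted(range(len(u)), key=lambda i: costs[i])[:k]
```

An element sitting exactly at a bin center (u[i] = a + 1/2) has zero cost: rounding it either way produces the same quantization error, so flipping its assigned bit is free.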
We now give a few examples of mappings F that could be used for perturbed-
quantization steganography.
Example 9.1: [Downsampling] For grayscale images in raster format, the trans-
formation T maps an M1 × N1 matrix of integers X[i, j] into an m1 × n1 matrix
of real numbers u = u[r, s] using a resampling algorithm.
versions of perturbed quantization provide better security than the version from
Example 9.4 (see [95] and the steganalysis results in Table 12.8).
of all known steganographic methods for JPEG images [95] (as of late 2008).
The reader is encouraged to inspect the results of blind steganalysis presented
in Tables 12.7 and 12.8.
(m + m/e)/(m/e) = e + 1. (9.11)
4 π2(x) = LSB(⌊x/2⌋).
Figure 9.3 A block of n2^p cover elements used in the ZZW code construction.
v[s] = ⊕_{i=1}^{2^p} x[i, s], s = 1, . . . , n. (9.12)
will not know which subsets communicate this additional payload (the receiver
will not know the indices s1 , . . . , sr ), the sender must use wet paper codes, which
is the step described next.
Let H be the p × (2^p − 1) parity-check matrix of a binary Hamming code
Hp (Section 8.3.1). Compute the syndrome of each subset as s(s) = Hx[., s] ∈
{0, 1}^p, where x[., s] = (x[1, s], . . . , x[2^p − 1, s]) is written as a column vector.5
Concatenate all these syndromes into one column vector of np bits
Now realize that, due to the property of Hamming codes, each syndrome
s(si), i = 1, . . . , r, can be changed to an arbitrary syndrome by making at most
one change to x[1, si], . . . , x[2^p − 1, si].
Label all p bits of syndromes coming from subsets s1 , . . . , sr as dry (which
makes in total pr dry bits) and all remaining bits in (9.13) as wet. If there is more
than one block in the image, concatenate the vectors (9.13) from all blocks to
form one long vector of Lnp bits, where L is the number of blocks. This vector will
have p(r1 + · · · + rL ) dry bits or on average E[p(r1 + · · · + rL )] = LRa p dry bits.
Now form the random sparse matrix D with Lnp columns and p(r1 + · · · + rL )
rows so that its columns follow the RSD as described in Section 9.2. Thus, using
wet paper codes, we can communicate on average LpRa message bits in the whole
image (plus Lm bits embedded using C0 in each block). If the wet paper code
dictates that a syndrome s(s) be changed to s' ≠ s(s), we can arrange for this by
modifying exactly one bit in the corresponding vector of bits x[1, s], . . . , x[2^p −
1, s]. If no change in the syndrome s(s) is needed, all bits x[1, s], . . . , x[2^p − 1, s]
must stay unchanged. But, because we still need to change the XOR of all bits
x[1, s], . . . , x[2^p, s] in (9.12), we simply flip the 2^p th bit, x[2^p, s], because this bit
was put aside and does not participate in the syndrome calculation.
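The Hamming step of this construction is easy to sketch: for codimension p, the columns of H are the binary expansions of 1, . . . , 2^p − 1, so any target syndrome can be reached by flipping at most one of those bits. A minimal illustration (not the full ZZW embedder; names are ours):

```python
def hamming_H(p):
    """Parity-check matrix of the Hamming code: column j (0-indexed)
    is the binary expansion of the integer j + 1."""
    return [[(c >> r) & 1 for c in range(1, 2 ** p)] for r in range(p)]

def syndrome(H, bits):
    return [sum(H[r][j] * bits[j] for j in range(len(bits))) % 2
            for r in range(len(H))]

def move_syndrome(bits, s_target, p):
    """Flip at most one of the 2^p - 1 bits so the syndrome becomes s_target."""
    H = hamming_H(p)
    s = syndrome(H, bits)
    # The column to flip is the integer whose binary expansion is s XOR s_target.
    c = sum((s[r] ^ s_target[r]) << r for r in range(p))
    out = list(bits)
    if c != 0:          # c == 0 means the syndrome already matches
        out[c - 1] ^= 1
    return out
```

Flipping the bit in column c adds that column (the binary expansion of c) to the syndrome, which by construction equals s XOR s_target, so the new syndrome is exactly s_target.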
To summarize, we embed in each block of n2^p cover elements m + pRa bits
using on average Ra changes. We can also say that the relative payload

αp = (m + pRa)/(n 2^p) (9.14)

can be embedded with embedding efficiency

ep = (m + pRa)/Ra = p + m/Ra. (9.15)
This newly constructed family of codes Cp has one important property. With
increasing p, the embedding efficiency ep follows the upper bound on embedding
efficiency in the sense that the limit is finite [79],
lim_{p→∞} [ep − αp/H^{-1}(αp)] = 1/log 2 − m/Ra + log2(n/Ra) = Δ(Ra, n, m). (9.16)
5 Note that we are reserving the last element from each subset, x[2^p, s], to be used later.
Figure 9.4 Embedding efficiency of codes from the family (1/2, 2^p, 1 + p/2) for
p = 0, . . . , 6, shown together with the upper bound on embedding efficiency.
6 As explained in Chapter 7, this operation minimizes the total embedding impact in some
well-defined sense.
Summary
• When the placement of embedding changes is not shared with the recipient,
we speak of a non-shared selection channel. Other synonyms are writing on
wet paper or writing to memory with defective cells.
• There exist numerous situations in steganography when non-shared channels
arise, such as in minimum-impact steganography, adaptive steganography, and
public-key steganography.
• Communication using non-shared channels can be realized using syndrome
codes, also called wet paper codes.
• Syndrome codes with random matrices are capable of asymptotically communicating
the maximum possible payload but have complexity cubic in the
message length m.
• Sparse codes with a robust soliton distribution of ones in their columns are also
asymptotically optimal and can be implemented with complexity O(n log m),
where n is the number of cover elements.
Exercises
9.1 [Writing in memory with one stuck cell] When the number of stuck
cells is 1 or n − 1, it is easy to develop algorithms for writing in memory with
defective cells. For n − 1 stuck cells, or k = 1, we can write one bit into the
memory simply by adjusting the single functioning cell so that the message bit
m[1] equals the XOR of all bits in the memory.
The complementary case, when there is one stuck cell, is a little more compli-
cated. Let m[i], i = 1, . . . , n − 1, be the message to be written in the memory.
If the stuck cell is the jth cell, j > 1, we write m into the memory in the
following manner. If m[j − 1] = x[j], we can write the message as is because the
defect is compatible with the message: we write (m[1], m[2], . . . , m[n − 1]) into
cells (2, 3, . . . , n) and we write 1 into x[1]. If m[j − 1] ≠ x[j], we write the negation
of m, 1 − m, into cells (2, 3, . . . , n) and we write 0 into x[1].
If the stuck cell is the first cell, j = 1, we write the message into cells (2, 3, . . . , n)
as is (if the stuck bit x[1] = 1) and we write its negation into the same cells if
the stuck bit x[1] = 0. Convince yourself that the reading device can always read
the message correctly using the following rule:
if x[1] = 1, read the message as (x[2], x[3], . . . , x[n]), (9.17)
if x[1] = 0, read the message as (1 − x[2], 1 − x[3], . . . , 1 − x[n]). (9.18)
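The writing and reading procedures above can be sketched as follows (cells are 1-indexed as in the text; the function names are ours):

```python
def write_one_stuck(m, j, stuck_value, n):
    """Write the (n-1)-bit message m into an n-bit memory whose jth cell
    (1-indexed) is stuck at stuck_value; cell 1 stores the polarity flag."""
    x = [0] * n
    if j == 1:
        # Cell 1 itself is stuck; its value tells the reader the polarity.
        x[0] = stuck_value
        body = list(m) if stuck_value == 1 else [1 - b for b in m]
    else:
        # Message bit m[j-1] (1-indexed) lands in cell j; negate if incompatible.
        if m[j - 2] == stuck_value:
            x[0], body = 1, list(m)
        else:
            x[0], body = 0, [1 - b for b in m]
    x[1:] = body
    return x

def read(x):
    """Reading rule (9.17)-(9.18)."""
    return x[1:] if x[0] == 1 else [1 - b for b in x[1:]]
```

In every case the stuck cell ends up holding its forced value, and the flag in cell 1 tells the reader whether to negate.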
9.2 [Rank of a random matrix] Show that the probability, R[k], that a
randomly generated k × k binary matrix is of full rank is
"
k
1
R[k] = 1− → 0.2889 . . . , as k → ∞. (9.19)
i=1
2i
Note that here the rank should be computed in binary arithmetic. The rank of
a binary matrix computed in binary arithmetic and the rank computed in real
arithmetic may be different.
Hint: Use induction with respect to i for an i × k matrix.
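The limiting value in (9.19) is easy to verify numerically:

```python
def full_rank_probability(k):
    """Probability that a random k x k binary matrix has full rank over
    GF(2): the product over i = 1..k of (1 - 2^-i) from Eq. (9.19)."""
    r = 1.0
    for i in range(1, k + 1):
        r *= 1.0 - 2.0 ** (-i)
    return r
```

The product converges very quickly; already for moderate k the value is indistinguishable from the limit 0.2889. . . in double precision.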
9.3 [Minimum embedding impact I] Let the cover image have n pixels
with embedding impact ρ[i] and assume that the elements are already sorted
so that ρ[i] is non-decreasing, ρ[i] ≤ ρ[i + 1]. In order to embed m bits, the
sender marks pixels 1, . . . , k, m ≤ k, as changeable and embeds the message
into changeable pixels using binary wet paper codes with the best theoretically
possible embedding efficiency e = α/H^{-1}(α), where α = m/k. Show that the
value of k that minimizes the total embedding impact is

kopt = arg min_{m≤k≤n} H^{-1}(m/k) Σ_{i=1}^{k} ρ[i]. (9.20)
Hint: According to Table 8.4, the embedding will introduce kH^{-1}(m/k) changes
and the expected impact per changed pixel is (1/k) Σ_{i=1}^{k} ρ[i].
9.4 [Minimum embedding impact II] Fix the ratio β = m/n and con-
sider (9.20) in the limit for n → ∞. Assume that ρ[i] = ρ(i/n) for some non-
decreasing function (profile) ρ. Define x = k/n and show that the optimization
problem (9.20) becomes
xopt = kopt/n = arg min_{β≤x≤1} H^{-1}(β/x) R(x), (9.21)

where

R(x) = ∫_0^x ρ(t) dt. (9.22)
this profile c = 1.254, thus obtaining the following rule of thumb: in perturbed-quantization
steganography implemented using wet paper codes with optimal
embedding efficiency, it is better to use k ≈ m + m/4 rather than k = m.
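The optimization (9.20) can be explored numerically. The sketch below uses a linear impact profile and inverts the binary entropy function by bisection; the profile and parameters are illustrative only, not the ones behind the c = 1.254 constant:

```python
import math

def H(x):
    """Binary entropy function."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def H_inv(y):
    """Inverse of H on [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if H(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def k_opt(rho, m, n):
    """Minimize H_inv(m/k) * sum_{i<=k} rho[i] over m <= k <= n (Eq. 9.20)."""
    best_k, best_cost, csum = n, float("inf"), 0.0
    for k in range(1, n + 1):
        csum += rho[k - 1]
        if k < m:
            continue
        cost = H_inv(m / k) * csum
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Toy setup: n pixels with a linear impact profile rho(t) = t, already sorted.
n, m = 1000, 200
rho = [i / n for i in range(1, n + 1)]
```

For this toy profile, k_opt(rho, m, n) also comes out noticeably larger than m, in rough qualitative agreement with the rule of thumb above.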
Chapter 10 - Steganalysis pp. 193-220
10 Steganalysis
In the prisoners’ problem, Alice and Bob are allowed to communicate but all
messages they exchange are closely monitored by warden Eve looking for traces of
secret data that may be hidden in the objects that Alice and Bob exchange. Eve’s
activity is called steganalysis and it is a complementary task to steganography.
In theory, the steganalyst is successful in attacking the steganographic channel
(i.e., the steganography has been broken) if she can distinguish between cover
and stego objects with probability better than random guessing. Note that, in
contrast to cryptanalysis, it is not necessary to be able to read the secret message
to break a steganographic system. The important task of extracting the secret
message from an image once it is known to contain secretly embedded data
belongs to forensic steganalysis.
In Section 10.1, we take a look at various aspects of Eve’s job depending on
her knowledge about the steganographic channel. Then, in Section 10.2 we for-
mulate steganalysis as a problem in statistical signal detection. If Eve knows the
steganographic algorithm, she can accordingly target her activity to the specific
stegosystem, in which case we speak of targeted steganalysis (Section 10.3). On
the other hand, if Eve has no knowledge about the stegosystem the prisoners
may be using, she is facing the significantly more difficult problem of blind ste-
ganalysis detailed in Section 10.4 and Chapter 12. She now has to be ready to
discover traces of an arbitrary stegosystem. Both targeted and blind steganalysis
work with one or more numerical features extracted from images and then clas-
sify them into two categories – cover and stego. For targeted steganalysis, these
features are usually designed by analyzing specific traces of embedding, while
in blind steganalysis the features’ role is much more ambitious as their goal is
to completely characterize cover images in some low-dimensional feature space.
Blind steganalysis has numerous alternative applications, which are discussed in
Section 10.5.
The reliability of steganalysis is strongly influenced by the source of covers.
The prisoners may choose a source that will better mask their embedding and, on
the other hand, they may also make fatal errors and choose covers so improperly
that Eve will be able to detect even single-bit messages. The influence of covers
on steganalysis is discussed in Section 10.6.
Although this chapter and this book focus on steganalysis based on analyzing
statistical anomalies of pixel values that most steganographic algorithms leave
behind, Eve may utilize other auxiliary information available to her. For example,
she can inspect file headers or run brute-force attacks on the stego/encryption
keys, hoping to reveal a meaningful message when she runs across a correct key.
Such attacks are called system attacks (Section 10.7) and belong to forensic
steganalysis (Section 10.8).
The main purpose of this chapter is to prepare the reader for Chapters 11
and 12, where specific steganalysis methods are described. Readers not familiar
with the subject of statistical hypothesis testing would benefit from reading
Appendix D before continuing.
when Eve already suspects that somebody may be using steganography, addi-
tional intelligence may be gathered that may provide some information about the
cover source or the steganographic algorithm. For example, if Alice downloads a steganographic
tool while being eavesdropped on, Eve suddenly obtains prior information
about the embedding algorithm, as well as the stego key space. Or, if Eve knows
that Alice sends to Bob images obtained using her camera, Eve could purchase
a camera of the same model and tailor her attack to this cover source.
Steganalysis may become significantly easier when a suspect’s computer is
seized and Eve’s task is to steganalyze images on the hard disk. In this case,
the stego tool may still reside on the computer or its traces may be recoverable
even after it has been uninstalled (e.g., using WetStone's Gargoyle,
http://www.wetstonetech.com). This gives her valuable prior information about the
potential stego channel. In some situations, Eve may be able to find multiple
versions of one image that are nearly identical. She can first investigate whether
these slightly different versions are the result of steganography or some other
natural process, such as compressing one image using two different JPEG com-
pressors [95]. By comparing the images, she can learn about the nature of em-
bedding changes and their placement (selection channel) and possibly conclude
that the changes are due to embedding a secret message. Once Eve determines
that steganography is taking place, the steganalysis is complete and she may
continue with forensic steganalysis aimed at extracting the secret message itself
or she may decide to interrupt the communication channel if it is within her
competence.
In this section, you will learn that statistical steganalysis is a detection prob-
lem. In practice, it is typically achieved through some simplified model of the
cover source obtained by representing images using a set of numerical features.
Depending on the scope of the features, we recognize two major types of statis-
tical steganalysis – targeted and blind. The following sections describe several
basic strategies for selecting appropriate features for both targeted and blind
steganalysis. Examples of specific targeted and blind methods are postponed to
Chapters 11 and 12.
Before proceeding with the formulation of statistical steganalysis, we briefly
review some basic facts from Chapter 6. The information-theoretic definition of
steganographic security starts with the basic assumption that the cover source
can be described by a probability distribution, Pc, on the space of all possible
cover images, C. The value Pc(B) = ∫_B Pc(x)dx is the probability of selecting
a cover x ∈ B ⊂ C for hiding a message. Assuming that a given stegosystem
accepts on its input covers x ∈ C, x ∼ Pc, stego keys, and messages (both attaining
values on their sets according to some distributions), the distribution of stego
images is Ps.
H0 : x ∼ Pc , (10.1)
H1 : x ∼ Ps . (10.2)
Optimal Neyman–Pearson and Bayesian detectors can be derived for this prob-
lem using the likelihood-ratio test.
If Eve has no information about the stego system, the steganalysis problem
becomes
H0 : x ∼ Pc , (10.3)
H1 : x ≁ Pc , (10.4)
H0 : β = 0, (10.5)
H1 : β > 0. (10.6)
Eve may employ appropriate tools, such as the generalized likelihood-ratio test,
or convert this problem to simple hypothesis testing by considering β as a ran-
dom variable and making an assumption about its distribution (the Bayesian
approach).
features need to be chosen so that the clusters of f (x) and f (y) have as little
overlap as possible.
As derived in Section 6.1.1, the error probabilities of any detector must satisfy
the following inequality:
(1 − PFA) log((1 − PFA)/PMD) + PFA log(PFA/(1 − PMD)) ≤ DKL(pc||ps) ≤ DKL(Pc||Ps), (10.9)
where DKL is the Kullback–Leibler divergence between the two probability distri-
butions. For secure steganography, DKL (Pc ||Ps ) = 0, and the only detector that
Eve can build is one that randomly guesses between cover and stego, which means
that Eve cannot construct an attack. For ε-secure stegosystems, DKL(Pc||Ps) ≤ ε,
and the reliability of detection decreases with decreasing ε (see Figure 6.1).
In steganography, useful detectors must have a low probability of a false alarm.
This is because images detected as potentially containing secret messages are
likely to be subjected to further forensic analysis (see Section 10.8) to determine
the steganographic program, the stego key, and eventually extract the secret
message. This may require brute-force dictionary attacks that can be quite ex-
pensive and time-consuming. Thus, it is more valuable to have a detector with
very low PFA even though its probability of missed detection may be quite high
(e.g., PMD = 0.5 or higher). Because steganographic communication is typically
repetitive, even a detector with PMD = 0.5 is still quite useful as long as its
false-alarm probability is small.
The hypothesis-testing problem in steganalysis is thus almost exclusively for-
mulated using the Neyman–Pearson setting, where the goal is to construct a
detector with the highest detection probability PD = 1 − PMD while imposing a
bound on the probability of false alarms, PFA ≤ εFA. Even though it is possible
to associate cost with both types of error, Bayesian detectors are typically not
used in steganalysis, because the prior probabilities of encountering a cover or
stego image can rarely be accurately estimated.
Given the bound on false alarms, εFA, the optimal Neyman–Pearson detector
is the likelihood-ratio test

Decide H1 when L(x) = ps(x)/pc(x) > γ, (10.10)

where γ > 0 is a threshold determined from the condition

PFA = ∫_{R1} pc(x)dx = εFA, (10.11)
where R1 is the critical region. Detector performance is often compared using scalar quantities derived from the ROC curve, such as the following:
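For illustration, when the detection statistic is Gaussian with the same variance under both hypotheses, the likelihood-ratio test (10.10) reduces to thresholding the statistic itself, and the threshold follows from (10.11). A minimal sketch (the Gaussian model and parameter values are assumptions for the example, not the book's):

```python
import math

def np_detector_threshold(mu0, sigma, eps_fa):
    """For x ~ N(mu0, sigma^2) under H0 and N(mu1 > mu0, sigma^2) under H1,
    the LRT is monotone in x, so we only need t with
    P_FA = Pr{x > t | H0} = eps_fa. The Gaussian tail is
    Pr{x > t} = 0.5 * erfc((t - mu0) / (sigma * sqrt(2)));
    invert it by bisection on the monotone tail probability."""
    lo, hi = mu0 - 10 * sigma, mu0 + 10 * sigma
    for _ in range(80):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc((mid - mu0) / (sigma * math.sqrt(2))) > eps_fa:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def decide(x, t):
    """Neyman-Pearson decision: flag H1 (stego) when the statistic exceeds t."""
    return "stego" if x > t else "cover"
```

With eps_fa = 0.05 the threshold lands at the familiar 95th percentile of the null distribution.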
• The area, ρ, between the ROC curve and the diagonal line, normalized so that
ρ = 0 when the ROC coincides with the diagonal line and ρ ≈ 1 for ROC
curves corresponding to nearly perfect detectors (Figure 10.1(c)). Mathematically,

ρ = 2 ∫_0^1 PD(x)dx − 1. (10.13)
The minimum is reached at a point where the tangent to the ROC curve has
slope 1/2. In Figure 10.2, this point is marked with a circle.
• The false-alarm rate at probability of detection equal to PD = 1/2,

PD^{-1}(1/2). (10.16)

This point is marked with a square in Figure 10.2.
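Both scalar measures are straightforward to compute from a sampled ROC curve; a sketch using the trapezoidal rule for the integral in (10.13) (function names are ours):

```python
def roc_area_rho(pfa, pd):
    """rho = 2 * (area under the sampled ROC) - 1, Eq. (10.13).
    pfa must be sorted increasing and span [0, 1]."""
    area = sum((pfa[i + 1] - pfa[i]) * (pd[i + 1] + pd[i]) / 2
               for i in range(len(pfa) - 1))
    return 2 * area - 1

def pfa_at_half_detection(pfa, pd):
    """P_D^{-1}(1/2): first sampled false-alarm rate at which PD reaches 1/2."""
    for f, d in zip(pfa, pd):
        if d >= 0.5:
            return f
    return 1.0
```

A diagonal ROC (random guessing) gives ρ = 0, while an ROC that jumps to PD = 1 at PFA = 0 gives ρ = 1, matching the normalization described above.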
Figure 10.1 Examples of ROC curves. The x and y axes in all four graphs are the
probability of false alarms, PFA , and the probability of detection, PD , respectively.
(a) Example of an ROC curve; (b) ROC of a poor detector; (c) ROC of a very good
detector; (d) two hard-to-compare ROCs.
Figure 10.2 Scalar measures typically used to compare detectors. The square marks the
point on the ROC curve corresponding to the criterion PD^{-1}(1/2). The point marked with
a circle corresponds to the minimal total decision error PE.
10.3.1 Features
We now describe several strategies for constructing features for targeted ste-
ganalysis and illustrate them with specific examples.
Because in targeted steganalysis the embedding mechanism of the stego system
is known, it makes sense to choose as features the quantities that predictably
change with embedding. While it is usually relatively easy to identify many such
features, features that are especially useful are those that attain known values
on either stego or cover images.
In the expressions below, β is the change rate (the ratio between the number of
embedding changes and the number of all elements in the cover image) and fβ
is the feature computed from a stego image obtained by changing a fraction β
of its cover elements.
T1. [Testing for stego artifacts] Identify a feature that attains a specific known
value, fβ , on stego images and attains other, different values on cover images.
Then, formulate a composite hypothesis-testing problem as
H0 : f = fβ , (10.17)
H1 : f ≠ fβ . (10.18)
Note that here H0 is the hypothesis that the image under investigation is
stego, while H1 stands for the hypothesis that it is cover.
T2. [Known cover property] Identify a feature f that predictably changes
with embedding, fβ = Φ(f0; β), so that Φ can be inverted, f0 = Φ^{-1}(fβ; β). Assuming
F(f0) = 0 for some known function F : R^d → R^k, estimate β̂ from
F(Φ^{-1}(fβ; β̂)) = 0 and test
H0 : β̂ = 0, (10.19)
H1 : β̂ > 0. (10.20)
Note that now we have the more typical case when the hypothesis H0 stands
for cover and H1 for stego. Also, notice that a by-product of this approach
is an estimate of the change rate, which can usually be easily related to the
relative payload, α. For example, for methods that embed each message bit
at one cover element, β = α/2.
T3. [Calibration] In some cases, it is possible to estimate from the stego image
what the value of a feature would be if it were computed from the cover image.
This process is called calibration. Let fβ be the feature computed from the
stego image and f̂0 be the estimate of the cover feature. If the embedding
allows expressing fβ as a function of f0 and β, fβ = Φ(f0 ; β), we can again
estimate β from fβ = Φ(f̂0 ; β̂) and again test
H0 : β̂ = 0, (10.21)
H1 : β̂ > 0. (10.22)
This method also provides an estimate of the change rate β (or the message
payload α).
We now give examples of these strategies. Strategy T1 was pursued when deriv-
ing the histogram attack in Section 5.1.1. The attack was based on the observa-
tion that the histogram h of an 8-bit grayscale image fully embedded with LSB
Figure 10.3 Calibration is used to estimate some macroscopic quantities of the cover
image from the stego image.
Figure 10.4 Histogram of the DCT coefficient for the spatial frequency (1, 0) for the cover
image (+), F5 fully embedded stego image (), and the calibrated image (◦).
where the expected value is taken over embeddings inducing change rate β and
over pseudo-random walks. The second source of error is the cover assumption
F (f0 ) = 0. This is, again, an equality in expected value, this time over the covers
β̂ − β = w + b. (10.29)
can isolate the realization of the between-image error for a given image by simply
averaging all Nw estimates. The distribution of the between-image error is then
sampled by repeating this experiment over Nb images. Student’s t-distribution
is often a good fit for the between-image error.
As an example, we now look more closely at the quantitative steganalyzer for
Jsteg (Section 5.1.2). There, Figure 5.4 shows the histograms of estimated pay-
load for five payloads across Nb = 954 grayscale images. The histograms show
the mixture of both types of error. To separate them, we perform the following
experiment with the same database of images. For a fixed change rate β = 0.2,
each image was embedded with Jsteg Nw = 200 times with different messages
and stego keys (which determine the pseudo-random walk). By running the quan-
titative steganalyzer (5.25), we obtain a matrix of change-rate estimates β̂[i, j],
i = 1, . . . , Nw , j = 1, . . . , Nb . The distribution of the between-image error is ob-
tained by taking the average change-rate estimate over Nw embeddings for each
image,
b[j] = (1/Nw) Σ_{i=1}^{Nw} β̂[i, j] − β. (10.30)
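The decomposition into within- and between-image errors can be simulated. The sketch below builds a synthetic matrix of change-rate estimates; the error distributions are illustrative stand-ins, not the empirical ones reported in the figures:

```python
import random

def error_decomposition(beta_hat, beta):
    """Split change-rate estimates into between-image errors b[j] (Eq. 10.30)
    and within-image deviations w[i][j] around each image's own mean."""
    Nw, Nb = len(beta_hat), len(beta_hat[0])
    col_mean = [sum(beta_hat[i][j] for i in range(Nw)) / Nw for j in range(Nb)]
    b = [col_mean[j] - beta for j in range(Nb)]
    w = [[beta_hat[i][j] - col_mean[j] for j in range(Nb)] for i in range(Nw)]
    return b, w

# Synthetic estimates: a per-image bias (between-image component) plus
# per-embedding noise (within-image component).
random.seed(7)
beta, Nw, Nb = 0.2, 50, 20
bias = [random.gauss(0, 0.01) for _ in range(Nb)]
beta_hat = [[beta + bias[j] + random.gauss(0, 0.005) for j in range(Nb)]
            for _ in range(Nw)]
b, w = error_decomposition(beta_hat, beta)
```

Averaging over the Nw embeddings of each image cancels the within-image noise, which is exactly how (10.30) isolates the between-image error.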
The sample pdf of the between-image error is shown in Figure 10.5. The figure
also includes the log–log empirical cdf plot showing the tail probability Pr{b > x}
as a function of x and the corresponding Gaussian fit using a thin line (see
Appendix A for more details about the plot). It is apparent that this error
exhibits thick tails and the Gaussian model is not a good fit. The plot, however,
seems to indicate that Student’s t-distribution might be a reasonably good fit
because the tail of the experimental data becomes linear for large x, in agreement
with the model (Pr{x > x} ≈ x^{-ν} for a random variable x following Student's
t-distribution with ν degrees of freedom). The inter-quartile range (IQR) for the
between-image error is [0.198, 0.218], thus suggesting that the right tail is thicker
than the left tail.
Figure 10.6 shows the within-image error and the log–log empirical cdf plot
for the right tail obtained for one randomly chosen image from the database
based on Nw = 10, 000 embeddings. The within-image error appears to be well
modeled with a Gaussian distribution. Again, this property seems to be generic
among other quantitative steganalyzers [23].
We close this section with a few notes about some interesting recent devel-
opments in quantitative steganalysis. If the statistical distributions of random
variables involved in Strategy T2 can be derived or if at least reasonable as-
sumptions about them can be made, it becomes possible to derive more accurate
quantitative estimators using the maximum-likelihood principle,
Figure 10.5 Distribution of the between-image error and its log–log empirical cdf plot for
the quantitative steganalyzer for Jsteg (5.25). The thin line is a Gaussian fit.
Figure 10.6 Distribution of the within-image error and its log–log empirical cdf plot for one
image for the quantitative steganalyzer for Jsteg (5.25). The thin line is a Gaussian fit.
dimensional feature space, where the distributions of cover and stego images are
pc and ps . Ideally, we would want the feature space to be complete [144] in the
sense that for any steganographic scheme
so that we do not lose on our ability to distinguish between cover and stego
images by representing the images with their features. In practice, we may be
satisfied with a weaker property, namely the requirement that it be hard to
practically construct a stego scheme with DKL (ps ||pc ) = 0.
Next, we outline several general strategies for selecting good features for blind
steganalysis and then review the options available for constructing detectors.
10.4.1 Features
The impact of embedding can be considered as adding noise of certain specific
properties. Thus, many features are designed to be sensitive to adding noise while
at the same time being insensitive to the image content.
B1. [Noise moments] Transform the image to some domain, such as the Fourier
or wavelet domain, where it is easier to separate image content and noise.
Compute some statistical characteristics of the noise component (such as sta-
tistical moments of the sample distributions of transform coefficients). By
working with the noise residual instead of the image, we essentially improve
the signal-to-noise ratio (here, signal is the stego noise and noise is the cover
image itself) and thus improve the features’ sensitivity to embedding changes,
while decreasing their undesirable dependence on image content.
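Strategy B1 can be sketched with a crude local-average denoiser standing in for the wavelet decomposition. This is a toy residual extractor with hypothetical names; practical feature sets compute moments of coefficients in a transform domain:

```python
def noise_residual(img):
    """Residual = pixel minus the average of its existing neighbors
    (a crude content suppressor standing in for a wavelet decomposition)."""
    h, w = len(img), len(img[0])
    res = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            nb = [img[i + di][j + dj]
                  for di in (-1, 0, 1) for dj in (-1, 0, 1)
                  if (di, dj) != (0, 0)
                  and 0 <= i + di < h and 0 <= j + dj < w]
            res[i][j] = img[i][j] - sum(nb) / len(nb)
    return res

def residual_moments(res, k=4):
    """First k sample absolute moments of the flattened residual,
    usable as simple features for blind steganalysis."""
    vals = [v for row in res for v in row]
    return [sum(abs(v) ** m for v in vals) / len(vals) for m in range(1, k + 1)]
```

Because the residual suppresses smooth content, small stego perturbations account for a much larger share of its energy than of the raw pixel values, which is the signal-to-noise argument made above.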
B2. [Calibrated features] Identify a feature, f , that is likely to predictably
change with embedding and calibrate it. In other words, compute the dif-
ference
fβ − f̂0 . (10.33)
For these features to work best, it is advisable to construct them in the same
domain as where the embedding occurs. For example, when designing features
for detection of steganographic schemes that embed data in quantized DCT
coefficients of JPEG images, compute the features directly from the quan-
tized DCT coefficients. This makes sense because the embedding changes are
and μ̂, α̂, β̂ are its mean, shape, and width parameters estimated from h(kl) [i].
One might as well take the estimated parameters directly as features or use
their calibrated versions. Alternatively, non-parametric models may be used
as well (e.g., sample distributions of DCT coefficients or their groups).
Specific examples of features that can be used for blind steganalysis are given in
Chapter 12.
10.4.2 Classification
After selecting the feature set, Eve has at least two options to construct her de-
tector. One possibility is to mathematically describe the probability distribution
of cover-image features, for example, by fitting a parametric model, p̂c , to the
sample distribution of cover features and test
H0 : x ∼ p̂c , (10.37)
H1 : x ≁ p̂c . (10.38)
The second option is to use a large database of images and embed them using
every known steganographic method1 with uniformly distributed change rates
β (or payloads distributed according to some distribution if the change-rate
distribution is available) and then fit another distribution, p̂s , through the ex-
1 To be more precise, the selection of stego images in Eve’s training set should reflect the
probability with which they occur in the steganographic channel. These probabilities are,
however, unknown in general.
H0 : x ∼ p̂c , (10.39)
H1 : x ∼ p̂s . (10.40)
Even with a complete feature set, however, approaching the above detection
problems using the likelihood-ratio test (10.10) is typically not feasible. This is
because the feature spaces that aspire to be complete in the above practical sense
are still relatively high-dimensional (of the order of 10^3 or higher), making it
infeasible to obtain accurate parametric or non-parametric models of pc. Thus, in
practice the detection is formulated as classification. After all, Eve is interested
only in detecting stego images, which is a simpler problem than estimating the
underlying distributions. She trains a classifier on features f (x) for x drawn from
a sufficiently large database of cover and stego images to recognize both classes.
Eve can construct her classifier in two different manners. The first option is to
train a cover-versus-all-stego binary classifier on two classes: cover images and
stego images produced by a sufficiently large number of stego algorithms and an
appropriate distribution of message payloads. The hope is that if the classifier is
trained on all possible archetypes of embedding operations, it should be able to
generalize to previously unseen schemes. With this approach, there is always the
possibility that in the future some new steganographic algorithm may produce
stego images whose features will be incompatible with the distribution p̂s , in
which case such images may be misclassified as cover.
Alternatively, Eve can train a classifier that recognizes cover images in the
feature space and marks everything that does not resemble a cover image
as potentially stego. Mathematically, Eve needs to specify the null-hypothesis
region R0 containing features of cover images (R0 is the complement of the
critical region R1 ). This can be done, for example, by covering the support
of p̂c with hyperspheres [167, 168]. An alternative approach using a one-class
neighbor machine is explained in Section 12.3. This one-class approach to blind
steganalysis has several important advantages. First, the classifier training is
simplified because only cover images are used. Also, the classifier does not need
to be retrained when new embedding methods appear. The potential problem is
that the database has to be very large and diverse. We emphasize the adjective
diverse because we certainly do not wish to misidentify processed covers (e.g.,
sharpened images) as containing stego just because the classifier has not been
trained on them.
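The covering idea mentioned above can be sketched in a few lines. This is a toy illustration only: the hypersphere radius, the two-dimensional features, and the synthetic cover cluster are all hypothetical stand-ins for real image features; a practical detector would tune the radius to meet a bound on the false-alarm rate.

```python
import math
import random

def train_one_class(cover_feats, radius):
    """One-class cover model: cover the support of the cover features with
    hyperspheres of a fixed radius centered on the training samples
    (a simplified sketch of the covering idea)."""
    return [(c, radius) for c in cover_feats]

def looks_like_cover(model, feat):
    """Declare cover if the feature lies inside at least one hypersphere;
    everything that falls outside is flagged as potentially stego."""
    return any(math.dist(center, feat) <= r for center, r in model)

# Hypothetical 2-D features: covers cluster near the origin.
random.seed(1)
covers = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
model = train_one_class(covers, radius=0.5)

print(looks_like_cover(model, (0.1, -0.2)))   # inside the cover cluster
print(looks_like_cover(model, (8.0, 8.0)))    # outlier: flagged as stego
```

Note that the detector never sees stego examples during training, which is exactly why it does not need retraining when new embedding methods appear.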
In general, the modular structure of blind detectors makes them very flexi-
ble and gives them the ability to evolve with progress in steganography and in
machine learning. One can easily exchange the machine-learning engine, add more
features, or expand the training database. Note that all these actions require
retraining the classifier.
10.5.2 Multi-classification
An important advantage of blind steganalysis is that the position of the stego
image feature in the feature space provides additional information about the
embedding algorithm. In fact, the features of images embedded by different
steganographic algorithms form clusters, and thus it is possible to classify stego
images into
known steganographic schemes instead of the binary classification between cover
and stego images.
A binary classifier can be extended to classify into k > 2 classes by building
k(k − 1)/2 binary classifiers distinguishing between every pair of classes and
then fusing the results using a classical voting system. This method for multi-
classification is known as the Max–Wins principle [114].
Multi-classification into known steganographic methods is the first step to-
wards forensic steganalysis, whose goal is to identify the embedding algo-
rithm [189] and the secret stego key [94], and eventually extract the embedded
message.
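The Max–Wins fusion can be sketched as follows. Everything here is a hypothetical stand-in: the class names, the one-dimensional "features," and the nearest-center rules replace real trained binary classifiers; only the voting logic is the point.

```python
from collections import Counter
from itertools import combinations

def max_wins(feature, classes, pairwise):
    """Max-Wins fusion: each of the k(k-1)/2 binary classifiers votes for
    one of its two classes; the class with the most votes is the answer."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](feature)] += 1
    return votes.most_common(1)[0][0]

# Hypothetical 1-D feature space: each class clusters around a center,
# and each pairwise "classifier" is a nearest-center threshold rule.
centers = {"cover": 0.0, "LSB": 1.0, "F5": 2.0}
classes = sorted(centers)

def make_rule(a, b):
    return lambda f: a if abs(f - centers[a]) < abs(f - centers[b]) else b

pairwise = {(a, b): make_rule(a, b) for a, b in combinations(classes, 2)}

print(max_wins(0.1, classes, pairwise))
print(max_wins(1.9, classes, pairwise))
```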
10.5.4 Benchmarking
The feature set in a blind steganalyzer capable of detecting known steganographic
schemes is likely to be a good low-dimensional model of covers. This suggests
using the feature space as a simplified model of covers for benchmarking stegano-
graphic schemes by evaluating the KL divergence between the features of covers
and stego objects calculated from some fixed large database. Since the KL di-
vergence is generally hard to estimate accurately in high-dimensional spaces,
alternative statistics, such as the two-sample statistic called maximum mean
discrepancy, could be used instead [104, 191]. A third possibility is to bench-
mark for small payloads only, using the Fisher information evaluated in the
feature space as explained in Section 6.1.2.
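A two-sample MMD statistic with a Gaussian kernel can be sketched on scalar features. The feature values and the kernel width below are purely illustrative; real benchmarking would use high-dimensional features extracted from a fixed large database.

```python
import math
import random

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between two
    scalar samples, with Gaussian kernel k(u, v) = exp(-gamma (u - v)^2)."""
    k = lambda u, v: math.exp(-gamma * (u - v) ** 2)
    m, n = len(xs), len(ys)
    kxx = sum(k(a, b) for a in xs for b in xs) / (m * m)
    kyy = sum(k(a, b) for a in ys for b in ys) / (n * n)
    kxy = sum(k(a, b) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2 * kxy

random.seed(7)
cover = [random.gauss(0.0, 1) for _ in range(200)]
mild  = [random.gauss(0.05, 1) for _ in range(200)]  # features barely move
heavy = [random.gauss(1.0, 1) for _ in range(200)]   # features drift far

# A scheme whose stego features drift further from the cover features
# scores a larger MMD, i.e., it is the easier one to detect.
print(mmd2(cover, mild) < mmd2(cover, heavy))
```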
It is important to realize that benchmarking steganography in this manner is
with respect to the feature model and the image database. It is possible that
two steganographic techniques might rank differently using a different feature
set (model) or image database [37] (also see the discussion in the next section).
The problem of fair benchmarking is an active area of research.
The properties of the cover-image source have a major influence on the accuracy
of steganalysis. In general, the more “spread out” the cover-image pdf Pc is,
the easier it is for Alice to hide messages without increasing the KL divergence
DKL (Pc ||Ps ) and the more difficult it is to detect them for Eve. For this reason,
one should avoid using covers with little redundancy, such as images with a
low number of colors represented in palette image formats, because there the
spatial distribution of colors is more predictable. Among the most important
attributes of the cover source that influence steganalysis accuracy, we name the
color depth, image content, image size, and previous processing. Experimental
evaluation of the influence of the cover source on the reliability of blind and
targeted steganalysis has been the subject of [22, 23, 102, 129, 130, 139].
Images with a higher level of noise or complex texture have a more spread-
out distribution than images that were compressed using lossy compression or
denoised. Scans of films or analog photographs are especially difficult for ste-
ganalysis because high-resolution scans of photographs resolve the individual
grains in the photographic material, and this graininess manifests itself as high-
frequency noise. It is also generally easier to detect steganographic changes in
color images than in grayscale images, because color images provide more data
for statistical analysis and because Eve can utilize strong correlations between
color channels.
The image size has an important influence on steganalysis as well. Intuitively,
it should be more difficult to detect a fixed relative payload in smaller images
than in larger images because features computed from a shorter statistical sample
are inherently more noisy. This intuitive observation is analyzed in more detail
in Chapter 13 on steganographic capacity. The effect of image size on reliability
of steganalysis also means that JPEG covers with a low quality factor are harder
to steganalyze reliably because the size of the cover is determined by the number
of non-zero coefficients, which decreases with decreasing quality factor.
Image processing may play a decisive role in steganalysis. Processing that is
of low-pass character (denoising, blurring, and even lossy JPEG compression to
some degree) generally suppresses the noise naturally present in the image, which
makes the stego noise more detectable. This is especially true for spatial-domain
steganalysis. In fact, it is possible that a certain steganalysis technique can have
very good performance on one image database and, at the same time, be almost
useless on a different source of images. Thus, it is absolutely vital to test new
steganalysis techniques on as large and as diverse a source of covers as possible.
Sometimes, it may not be apparent at first sight that a certain processing step
introduces artifacts that heavily influence steganalysis. For example, a trans-
formation of grayscales, such as contrast adjustment, histogram equalization, or
gamma correction, generally does not influence the image noise component in
any significant manner. However, it may introduce characteristic spikes and ze-
ros into the histogram [222]. This unusual artifact is caused by discretization of
the grayscale transformation to force it to map integers to integers. The spikes
and valleys in the histogram that would otherwise not be present in the image
can aid steganalysis methods that use the histogram for their reasoning. A good
example is the superior performance for detection of ±1 embedding of ALE (it
uses Amplitudes of Local Extrema of the histogram) steganalysis [37, 38] on the
database of images supplied with the image-editing software Corel Draw. The
images happen to have been processed using a grayscale transformation, which
makes ALE perform very well.
JPEG images are less sensitive to processing that occurred prior to compres-
sion. This is because the compression has a tendency to suppress many artifacts
that are otherwise strikingly present in the uncompressed images. JPEG images,
Figure 10.7 Histogram of luminance DCT coefficients for spatial frequency (1, 1) for the
image shown in Figure 5.1 compressed with quality factor 70 (left), and the same for the
image first compressed with quality factor q_f^(1) = 85 and then with q_f^(2) = 70 (right). The
quantization steps for the primary and secondary compression for this DCT mode were
Q(1) [1, 1] = 7, Q(2) [1, 1] = 4.
are small, after transforming the block back to the DCT domain, the coefficients
will still exhibit traces of quantization due to the previous JPEG compression.
In fact, if the number of embedding changes is sufficiently small (say one or two
embedding changes per block), it is possible to recover the original block of cover-
image pixels simply by a brute-force search for the modified pixels. This way,
Eve can not only detect the presence of a secret message with high probability
but also identify which pixels have been modified! Steganalysis based on this
idea is called JPEG compatibility steganalysis [85].
When the steganographic algorithm F5 for JPEG images was introduced, most
researchers focused on attacking the impact of the embedding changes using
statistical steganalysis. It was only later that Niels Provos pointed out2 that
the JPEG compressor in F5 implementation always inserted the following JPEG
comment into the header of the stego image “JPEG Encoder Copyright 1998,
James R. Weeks and BioElectroMech,” which is rarely present in JPEG images
produced by common image-editing software. This comment thus serves as a
relatively reliable detector of JPEG images produced by F5. This is an example
of a system attack. Johnson [119, 121, 122] and [142] give examples of other
unintentional fingerprints left in stego images by various stego products.
Although these weaknesses are not as interesting from the mathematical point
of view, it is very important to know about them because they can markedly sim-
plify steganalysis or provide valuable side-information about the steganographic
channel.
The size of the stego key space can also be used to attack a steganographic
scheme even though the embedding changes it introduces are otherwise statis-
tically undetectable. The stego key usually determines a pseudo-random path
through the image where the message bits are embedded. A weak stego key or
a small stego key space creates a security weakness that can be used by Eve to
mount the system attack shown in Algorithm 10.2.
Algorithm 10.2 System attack on stego image y by trying all possible keys.
The stego key is found once a meaningful message is extracted.
// Input: stego image y
while (Keys left) {
k = NextKey();
m = Ext(y, k);
if (m meaningful) {
output(’Image is stego.’);
output(’Message = ’, m);
output(’Key = ’, k);
STOP;
}
}
output(’Image is cover.’);
Depending on the size of the stego key space, Eve can go through all stego
keys or use a dictionary attack. For each key tried, Eve extracts an alleged mes-
sage. Once she obtains a legible message, she will know that Alice and Bob use
steganography and she will have the correct stego key. At this point, a malicious
warden can choose to impersonate either party or simply block the communica-
tion.
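Algorithm 10.2 can be made runnable for a toy stegosystem. The extraction function, the all-ones marker standing in for a "meaningful" message, and the small key space are all invented for illustration; a real attack would test legibility of the extracted bit stream instead.

```python
import random

MSG_LEN = 32  # marker length in bits; all-ones plays the role of a legible message

def ext(stego, key):
    """Hypothetical extraction: the stego key seeds the pseudo-random
    embedding path and the message is read from the LSBs along it."""
    path = random.Random(key).sample(range(len(stego)), MSG_LEN)
    return [stego[i] & 1 for i in path]

def meaningful(bits):
    """Stand-in for recognizing a legible message (here: all-ones marker)."""
    return all(b == 1 for b in bits)

def key_search(stego, key_space):
    """Algorithm 10.2 as runnable code: try every key and stop once the
    extracted bit stream looks meaningful."""
    for k in key_space:
        if meaningful(ext(stego, k)):
            return k          # image is stego; key recovered
    return None               # image passes as cover

# Toy stego image: embed the marker along the path selected by key 42.
rng = random.Random(0)
image = [rng.randrange(256) for _ in range(1000)]
for i in random.Random(42).sample(range(len(image)), MSG_LEN):
    image[i] |= 1

print(key_search(image, range(100)))
```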
The attack above will not work if the message is encrypted prior to embedding,
because the warden cannot reliably distinguish between a random bit stream and
an encrypted message. However, even when the message is encrypted using a
strong encryption scheme with a secure key, the stego key still has to be strong,
because Eve can determine the stego key by other means than inspecting the
message. She can still run through all stego keys as above; however, this time
she will be checking the statistical properties of the pixels along the pseudo-
random path rather than the extracted message bits. The statistical properties
of pixels along the true embedding path should differ from those of a randomly
chosen path, as long as the image is not
fully embedded.3 To see this, assume that the steganographic scheme embeds the
payload by changing n0 pixels while visiting k < n pixels along some embedding
path in an n-pixel cover image. The change rate for the first k pixels along the
embedding path will be n0 /k, while the change rate when following a random
path through the image is only n0 /n < n0 /k. Thus, a sudden increase of this
ratio is indicative of the fact that a correct stego key was used.
3 Fully embedded images can most likely be reliably detected using other methods.
For example, for simple LSB embedding, n0 /k ≈ 1/2 while n0 /n = α/2 < 1/2,
where α is the relative payload. In this case, the warden can apply the histogram
attack from Chapter 5 as her detector. More examples of this type of system
attack can be found in [94].
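The ratio argument can be illustrated with a small simulation. The image model, the payload, and the keys are hypothetical; also, Eve does not have the cover in practice and would instead evaluate a statistic such as the histogram attack along the candidate path. Here we compare against the cover directly just to expose the jump from n0/n to n0/k at the correct key.

```python
import random

n = 10_000                         # pixels in the cover image
alpha = 0.2                        # relative payload (bits per pixel)
k = int(alpha * n)                 # pixels visited along the embedding path
rng = random.Random(1)
cover = [rng.randrange(256) for _ in range(n)]

# LSB embedding along the path selected by the true stego key: each
# visited pixel's LSB is replaced by a random message bit.
true_key = 7
stego = list(cover)
for i in random.Random(true_key).sample(range(n), k):
    stego[i] = (stego[i] & ~1) | rng.randrange(2)

def change_rate_along(key):
    """Fraction of changed pixels among the k pixels visited by the path
    of a candidate key (computed from the cover here for illustration)."""
    path = random.Random(key).sample(range(n), k)
    return sum(cover[i] != stego[i] for i in path) / k

print(change_rate_along(true_key))   # close to n0/k = 1/2
print(change_rate_along(999))        # close to n0/n = alpha/2 = 0.1
```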
The goal of steganalysis is to detect the presence of secret messages. In the clas-
sical prisoners’ problem, once Eve finds out that Alice and Bob communicate
using steganography, she decides to block the communication channel. In prac-
tice, however, Eve may not have the resources or authority to do so or may not
even want to because blocking the channel would only alert Alice and Bob that
they are being subjected to eavesdropping. Instead, Eve may try to determine
the steganographic algorithm and the stego key, and eventually extract the mes-
sage. Such activities belong to forensic steganalysis, which can be loosely defined
as a collection of tasks needed to identify individuals who are communicating in
secrecy, the stegosystem they are using, its parameters (the stego key), and the
message itself. We list these tasks below, with the caveat that in any given situ-
ation some of these tasks do not have to be carried out because the information
may already be a priori available.
The power of steganography is that it not only provides privacy in the sense
that no one can read the exchanged messages, but also hides the very presence
of secret communication. Thus, the primary problem in steganography detection
is to decide what communication to monitor in the first place. Since steganalysis
algorithms may be expensive and slow to run, focusing on the right channel is
of paramount importance.
Second, the communication through the monitored channel will be inspected
for the presence of secret messages using steganalysis methods. Once an im-
age has been detected as containing a secret message, it is further analyzed to
determine the steganographic method.
The warden can continue by trying to recover some attributes of the embedded
message and properties of the stego algorithm. For example, some detection algo-
rithms may provide Eve with an estimate of the number of embedding changes.
In this case, she can approximately infer the length of the secret message. If the
approximate location of the embedding changes can be determined, this may
point to a class of stego algorithms. For example, Eve can use the histogram
attack, described in Chapter 5, to determine whether the message has been se-
quentially embedded. If the prisoners reuse the stego key, then Eve may have
multiple stego images embedded with the same key, which can help her deter-
mine the embedding path [137, 139] and narrow down the class of possible stego
methods.
The character of the embedding changes also leaks information about the em-
bedding mechanism. If Eve can determine that the LSBs of pixels were modified,
she can then focus on methods that embed into LSBs. Eventually, Eve may guess
which stego method has been used and attempt to determine the stego key and
extract the embedded message. If the message is encrypted, Eve then needs to
perform cryptanalysis on the extracted bit stream.
In her endeavor, Eve can mount different types of attacks depending on the
information available to her [120]. The most common case, which we have already
considered, is the stego-image-only-attack in which Eve has only the stego image.
However, in some situations, Eve may have additional information available that
may aid in her effort. For example, in a criminal case the suspect’s computer
may be available, with several versions of the same image on the hard disk. This
will allow Eve to directly infer the location, number, and character of embedding
modifications. This scenario is known as a known-cover attack.
If the steganographic algorithm is known to Eve, her options further increase
as she can now mount two more attacks – the known-stego-method attack and
the known-message attack. Eve can, for example, embed the message with vari-
ous keys and identify the correct key by comparing the locations of embedding
changes in the resulting stego image with those in the stego image under inves-
tigation.
Summary
r Steganalysis is the complementary task to steganography. Its goal is to detect
the presence of secret messages.
r Steganography is broken when the mere presence of a secret message can
be proved. In particular, it is not necessary to read the message to break a
stegosystem.
r Activities directed towards extracting the message belong to forensic steganal-
ysis.
r Steganalysis can be formulated as the detection problem using a variety of
hypothesis-testing scenarios.
r There are two major types of steganalysis attacks – statistical and system
attacks.
r System attacks use some weakness in the implementation or protocol. Statisti-
cal attacks try to distinguish cover and stego images by computing statistical
quantities from images.
r All statistical steganalysis methods work by representing images in some fea-
ture space where a detector is constructed.
r If the features are designed to detect a specific stegosystem, we speak of
targeted steganalysis.
r Features designed to attack an arbitrary stegosystem lead to blind steganalysis
algorithms.
r The features for targeted schemes are designed by
– identifying quantities that predictably change with embedding,
– estimating these quantities for the cover image from the stego image (cali-
bration), or
– finding a function of such quantities that attains a known value on covers.
r Features for blind steganalysis are usually constructed in a heuristic manner
to be sensitive to typical steganographic changes and insensitive to image
content. Calibration can be used to achieve this goal. The features can be
computed in the spatial domain, frequency domain, or wavelet domain.
r The goal of quantitative steganalysis is to estimate the embedded payload (or,
more accurately, the number of embedding changes).
r The error of the estimate from quantitative steganalyzers has two components
– a within-image error and a between-image error. The within-image error
depends on the image content and the payload. It is well modeled using a
Gaussian distribution. The between-image error is the estimator bias for each
image caused by the properties of natural images. It is well modeled using
Student’s t-distribution.
Exercises
PD (PFA ) = Q(Q−1 (PFA ) − μ/σ), (10.41)
PD−1 (1/2) = Q(μ/σ), (10.42)
ρ = 1 − 2Q(μ/(σ√2)), (10.43)
PE = Q(μ/(2σ)). (10.44)
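These quantities involve the tail probability Q(x) of the standard normal. With hypothetical deflection parameters μ and σ, a quick numerical check confirms, for instance, that the detection probability in (10.41) equals 1/2 exactly at the false-alarm rate Q(μ/σ).

```python
import math

def Q(x):
    """Tail probability of the standard normal, Q(x) = P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def Qinv(p, lo=-10.0, hi=10.0):
    """Numerical inverse of Q by bisection (Q is strictly decreasing)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if Q(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu, sigma = 2.0, 1.5                          # hypothetical parameters

PD = lambda PFA: Q(Qinv(PFA) - mu / sigma)    # detection probability (10.41)
PFA_at_half = Q(mu / sigma)                   # false-alarm rate where PD = 1/2
rho = 1 - 2 * Q(mu / (sigma * math.sqrt(2)))  # (10.43)
PE = Q(mu / (2 * sigma))                      # minimum average error (10.44)

print(abs(PD(PFA_at_half) - 0.5) < 1e-9)
print(0 < rho < 1 and 0 < PE < 0.5)
```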
11 Selected targeted attacks
Steganalysis is the activity directed towards detecting the presence of secret mes-
sages. Due to their complexity and dimensionality, digital images are typically
analyzed in a low-dimensional feature space. If the features are selected wisely,
cover images and stego images will form clusters in the feature space with mini-
mal overlap. If the warden knows the details of the embedding mechanism, she
can use this side-information and design the features accordingly. This strategy
is known as targeted steganalysis. The histogram attack and the attack on
Jsteg from Chapter 5 are two examples of targeted attacks.
Three general strategies for constructing features for targeted steganalysis were
described in the previous chapter. This chapter presents specific examples of four
targeted attacks on steganography in images stored in raster, palette, and JPEG
formats. The first attack, called Sample Pairs Analysis, detects LSB embedding
in the spatial domain by considering pairs of neighboring pixels. It is one of the
most accurate methods for steganalysis of LSB embedding known today. Sec-
tion 11.1 contains a detailed derivation of this attack as well as several of its
variants formulated within the framework of structural steganalysis. The Pairs
Analysis attack is the subject of Section 11.2. It was designed to detect stegano-
graphic schemes that embed messages in LSBs of color indices to a preordered
palette. The EzStego algorithm from Chapter 5 is an example of this embedding
method. Pairs Analysis is based on an entirely different principle than Sample
Pairs Analysis because it uses information from pixels that can be very distant.
The third attack, presented in Section 11.3, is targeted steganalysis of the F5
algorithm that demonstrates the use of calibration as introduced in Section 10.3.
The last attack concerns detection of ±1 embedding in the spatial domain and
is detailed in Section 11.4. These steganalysis methods were chosen to illustrate
the basic principles on which many targeted attacks are based.
Figure 11.1 Left: Logarithm of the normalized adjacency histogram of horizontally
neighboring pixel pairs (r, s) from 6000 never-compressed raw digital-camera images (see
the description of Database RAW in Section 11.1.1). Right: A cross-section along the
minor diagonal of the adjacency histogram.
Because of correlations among pixels of natural images, we are more likely to see
a pair of neighboring pixels (r, s) than a pair (r′ , s′ ) whenever |r − s| < |r′ − s′ |. In
other words, the larger the difference |r − s| is, the less probable it is that such a
pair occurs in a natural image.1 Thus, unlike the histogram of pixels, the histogram of pixel
pairs has a predictable shape (see Figure 11.1). Let us now take a look at a pixel
pair (r, s) with r < s. The pair can take one of four different forms,
(r, s) ∈ {(2i, 2j), (2i, 2j + 1), (2i + 1, 2j), (2i + 1, 2j + 1)}. (11.1)
Let h[r, s] be the number of horizontally adjacent pixel pairs (r, s) in the image.
Because r < s, we expect the counts for the pair (2i, 2j + 1) to be the lowest
among the four pairs, and the counts for (2i + 1, 2j) to be the highest. LSB
embedding changes any pair into another with probability that depends on the
change rate β. Because the set of four pairs (11.1) is obviously closed under
LSB embedding, the count of pixel pairs in the stego image, hβ , is a convex
combination of the cover counts for all four pairs. For example, if the payload is
embedded pseudo-randomly in the image,
hβ [2i + 1, 2j] = (1 − β)² h[2i + 1, 2j] + β(1 − β)(h[2i, 2j] + h[2i + 1, 2j + 1]) + β² h[2i, 2j + 1], (11.2)
hβ [2i, 2j + 1] = (1 − β)² h[2i, 2j + 1] + β(1 − β)(h[2i, 2j] + h[2i + 1, 2j + 1]) + β² h[2i + 1, 2j]. (11.3)
Similar expressions can be obtained for the other two counts.
observation here is that hβ [2i + 1, 2j] will decrease with β because three out
of four terms in (11.2) are smaller than h[2i + 1, 2j]. By a similar argument,
hβ [2i, 2j + 1] will increase with β. Eventually, when β = 0.5, the expected values
of all four counts will be the same. We will put the pair (r, s) into set X whenever
r < s and s is even, and we will include the pair in set Y whenever r < s and s
is odd. With LSB embedding, the counts of pixel pairs in X will decrease, while
the counts of pairs in Y will increase.
A similar analysis can be carried out for pairs (r, s) for which r > s. There,
the situation is complementary in the sense that the counts of pairs (2i + 1, 2j)
will increase, while the counts of (2i, 2j + 1) will decrease. Thus, we include in
X pairs with r > s, s odd, while (r, s) ∈ Y when r > s and s even. This way, the
cardinality of X will decrease with LSB embedding, while the cardinality of Y
will increase, for all pairs (r, s), r ≠ s.
1 Because the order of pixels in the pair matters, we will denote the pair in round brackets
(r, s), rather than curly brackets {r, s}.
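The monotone behavior of |X| and |Y| can be observed on a synthetic cover. The random-walk "scan lines" standing in for natural-image rows and the change rate are illustrative choices; the set definitions follow (11.4)–(11.6).

```python
import random

def classify(r, s):
    """Primary-set membership following (11.4)-(11.6); Z holds r = s."""
    if r == s:
        return "Z"
    if (s % 2 == 0 and r < s) or (s % 2 == 1 and r > s):
        return "X"
    return "Y"

def counts(pixels):
    c = {"X": 0, "Y": 0, "Z": 0}
    for r, s in zip(pixels, pixels[1:]):
        c[classify(r, s)] += 1
    return c

# Synthetic smooth cover: short random walks mimic neighbor correlation.
rng = random.Random(3)
cover = []
for _ in range(40):
    v = 128
    for _ in range(500):
        cover.append(v)
        v = min(255, max(0, v + round(rng.gauss(0, 2))))

beta = 0.5   # change rate of a fully embedded image
stego = [p ^ 1 if rng.random() < beta else p for p in cover]

cc, sc = counts(cover), counts(stego)
print(cc["X"] > sc["X"])   # |X| decreases with LSB embedding
print(cc["Y"] < sc["Y"])   # |Y| increases with LSB embedding
```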
Figure 11.2 Transitions between primary sets, X , V, W, Z, under LSB flipping. Note that
Y = V ∪ W.
X = {(r, s) ∈ P|(s is even and r < s) or (s is odd and r > s)}, (11.4)
Y = {(r, s) ∈ P|(s is even and r > s) or (s is odd and r < s)}, (11.5)
Z = {(r, s) ∈ P|r = s}. (11.6)
The symmetry of the definitions of the sets X and Y indicates that for cover
images the cardinalities of X and Y should be the same,
|X | = |Y|, (11.7)
The set Y is further split into Y = V ∪ W, where W = {(r, s) ∈ P|(r, s) =
(2k, 2k + 1) or (r, s) = (2k + 1, 2k) for some k} and V = Y \ W. In other words,
W ∪ Z is the set of all pairs from P that belong to one LSB pair {2k, 2k + 1}.
The sets X , V, W, and Z are called primary sets. Note that P = X ∪ V ∪ W ∪ Z.
We now analyze what happens to a given pixel pair (r, s) under LSB embed-
ding. There are four possibilities, corresponding to the four modification patterns
00, 10, 01, and 11, where the first (second) symbol is 1 when the LSB of r (s) is
flipped and 0 when it is left unchanged.
LSB embedding may cause a given pixel pair to move from its primary set
to another primary set. The transitions of pairs between the primary sets are
depicted in Figure 11.2. When an arrow points from set A to set B, it means that
a pixel pair originally in A moves to B if modified by the modification pattern
associated with the arrow.
For each modification pattern Ω ∈ {00, 10, 01, 11} and any subset A ⊂ P, we
denote by φ(Ω, A) the expected fraction of pixel pairs in A modified with pattern
Ω. Under the assumption that the message bits are embedded along a random
path through the image, each pixel in the image is equally likely to be modi-
fied. Thus, the expected fraction of pixels modified with a specific modification
pattern Ω ∈ {00, 10, 01, 11} is the same for every primary set A ∈ {X , V, W, Z},
φ(Ω, X ) = · · · = φ(Ω, Z) = φ(Ω). With change rate β, we obtain the following
transition probabilities: φ(00) = (1 − β)², φ(10) = φ(01) = β(1 − β), and
φ(11) = β².
Together with the transition diagram of Figure 11.2, we can now express the
expected cardinalities of the primary sets for the stego image as functions of the
change rate β and the cardinalities of the cover image. Denoting the primary
sets after embedding with a prime, we obtain
From now on, we drop the expectations and assume that the cardinalities of the
primary sets from the stego image will be close to their expectations.
Following Strategy T2 from Chapter 10, our goal is to derive an equation for
the unknown quantity β using only the cardinalities of primed sets because they
can be calculated from the stego image. Equations (11.12) and (11.13) imply
that
The transition diagram shows that the embedding process does not modify the
union W ∪ Z. Denoting κ = |W| + |Z| = |W′ | + |Z′ |, on replacing |Z| with κ −
|W|, equation (11.14) becomes
Algorithm 11.1 Sample Pairs Analysis for estimating the change rate from a
stego image. The constant γ is a threshold on the test statistic β̂ set to achieve
PFA < FA , where FA is a bound on the false-alarm rate.
// Input M × N image x
// Form pixel pairs
P = {(x[i, j], x[i, j + 1])|i = 1, . . . , M, j = 1, . . . , N − 1}
x = y = 0;κ = 0;
for k = 1 to M (N − 1) {
(r, s) ← kth pair from P
if (s even & r < s) or (s odd & r > s){x = x + 1;}
if (s even & r > s) or (s odd & r < s){y = y + 1;}
if (⌊s/2⌋ = ⌊r/2⌋){κ = κ + 1;}
}
if κ = 0 {output(’SPA failed because κ = 0’); STOP}
a = 2κ; b = 2(2x − M (N − 1)); c = y − x;
β± = Re((−b ± √(b² − 4ac))/(2a));
β̂ = min(β+ , β− );
if β̂ > γ {
output(’Image is stego’);
output(’Estimated change rate = ’, β̂);
}
which is a quadratic equation for the unknown change rate β. This equation can
be directly solved for β because all coefficients in this equation can be evaluated
from the stego image (recall that κ = |W′ | + |Z′ |). The final estimate of the
change rate β̂ is obtained as the smaller root of this quadratic equation. Note
that when κ = 0, W ∪ Z = ∅ and thus |X | = |X ′ | = |Y| = |Y ′ | = |P|/2. In this
case (11.19) becomes a useless identity and we cannot estimate β. However, since
κ is the number of pixel pairs where both values belong to the same LSB pair,
this will happen with very small probability for natural images.
2 MAE is a more robust statistic than variance. Moreover, since we know from Section 10.3.2
that the error of quantitative steganalyzers often has thick tails matching Student’s t-
distribution, the variance may not even exist. Also, see discussion in Section A.1.
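Algorithm 11.1 transcribes directly into runnable code. The synthetic smooth cover below (random-walk rows) is a hypothetical stand-in for a natural image; on it, the estimator should return a value near zero for the cover and near the true change rate for the stego image.

```python
import random

def spa_change_rate(pixels, width):
    """Sample Pairs Analysis (a transcription of Algorithm 11.1): estimate
    the LSB change rate from horizontally adjacent pixel pairs."""
    x = y = kappa = npairs = 0
    for row_start in range(0, len(pixels), width):
        for j in range(width - 1):
            r, s = pixels[row_start + j], pixels[row_start + j + 1]
            npairs += 1
            if (s % 2 == 0 and r < s) or (s % 2 == 1 and r > s):
                x += 1
            elif (s % 2 == 0 and r > s) or (s % 2 == 1 and r < s):
                y += 1
            if s // 2 == r // 2:
                kappa += 1
    if kappa == 0:
        raise ValueError("SPA failed because kappa = 0")
    a, b, c = 2 * kappa, 2 * (2 * x - npairs), y - x
    disc = b * b - 4 * a * c
    if disc < 0:                    # Re(.): both roots share the real part
        return -b / (2 * a)
    return min((-b + disc**0.5) / (2 * a), (-b - disc**0.5) / (2 * a))

# Synthetic smooth cover: random-walk rows mimic neighbor correlation.
rng = random.Random(2)
width, height = 200, 100
pixels = []
for _ in range(height):
    v = rng.randrange(60, 196)
    for _ in range(width):
        pixels.append(v)
        v = min(255, max(0, v + round(rng.gauss(0, 2))))

beta = 0.25                         # true change rate
stego = [p ^ 1 if rng.random() < beta else p for p in pixels]

print(spa_change_rate(pixels, width))  # should be near 0 for the cover
print(spa_change_rate(stego, width))   # should be near the true rate 0.25
```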
Table 11.1. Median estimated change rate and its median absolute error (MAE)
obtained using Sample Pairs Analysis for raw digital-camera images and film scans.
Median/MAE of β̂
Cover (β = 0) β = 0.025 β = 0.125 β = 0.25
RAW 0.0015/0.0076 0.0264/0.0074 0.1261/0.0062 0.2507/0.0050
SCAN 0.0130/0.0331 0.0373/0.0316 0.1347/0.0254 0.2575/0.0177
Figure 11.3 Histogram of estimated message length α = 2β for images from Database
RAW embedded with α = 0.05 bpp.
channel available to Eve is. Let us assume that Eve knows that Alice and Bob
always embed relative payload α = 0.05 (change rate β ≈ 0.025) using LSB em-
bedding in randomly selected pixels. If, additionally, Eve knows that Alice and
Bob use digital-camera images, she uses the data from experiments on Database
RAW and draws an ROC using Algorithm 10.1. To avoid introducing any sys-
tematic errors due to a low number of images in the database, she divides the
database into two disjoint subsets D = D1 ∪ D2 and takes the cover features (the
features are the estimates β̂) only from images in D1 and stego features only from
images in D2 .
The ROC describes a class of detectors that differ by their threshold γ,
Figure 11.4 Histogram of estimated message length α = 2β for images from Database
SCAN embedded with α = 0.05 bpp.
must be concave (see Appendix D), we can further improve the detector’s ROC
by connecting the points (0, 0) and (u1 , PD (u1 )) (the origin and the square in
Figure 11.5), which corresponds to the following family of detectors Fu in the
interval u ∈ [0, u1 ]:
Fu (x) = Fu1 (x) with probability u/u1 ,
         F0 (x)  with probability 1 − u/u1 . (11.21)
Note that this detector would decrease the false-alarm rate at PD = 0.5 from
roughly 0.009 to 0.006. By imposing a bound on the false-alarm rate, Eve can
now select the threshold and use the appropriate detector in her eavesdropping.
For comparison, we show in Figure 11.6 the ROC curve for detecting LSB
embedding at relative payload α = 0.05 in raw scans of film. Note that the per-
formance of the detector is markedly worse than for digital-camera images. The
ROC can be made concave using the same procedure as above.
We note that if Eve were facing a different steganographic channel where the
change rate follows a known distribution fβ , she could construct a detector in the
same way as above with one difference: the change rate in the stego images in D2 would
have to follow fβ instead of being fixed at 0.025. If Eve has no prior information
about the payload, she can still obtain the distribution of β̂ on cover images,
fit a parametric model, and fix the threshold on the basis of the bound on false
alarms. This time, however, Eve will not be able to determine the probability of
correctly detecting a stego image as stego as this depends on the distribution of
payloads.
Figure 11.5 ROC for the SPA detector distinguishing cover digital-camera images
(Database RAW) from the same images embedded using LSB embedding in randomly
selected pixels with α = 0.05. Because the ROC very quickly reaches PD = 1, we display
the curve only in the range PFA ≤ 0.05. The circle corresponds to the point with
PD = 0.68 with PFA < 0.01. The line connecting the origin and the square symbol is a
concave hull of the ROC that can be obtained using the detector at u = u1 and the
detector at u = 0.
The set Ci contains all pairs whose values differ by i after right-shifting their binary representation (dividing by 2 and rounding down). There are four possibilities for (r, s) ∈ Ci: r − s = 2i, which covers two cases when either both r and s are even (the subset E2i) or both are odd (the subset O2i), or r − s = 2i − 1 (if r is even and s is odd, the subset O2i−1), or r − s = 2i + 1 (if r is odd and s is even, the subset E2i+1). Thus, each trace set Ci can be written as the disjoint union

$$C_i = E_{2i} \cup O_{2i-1} \cup E_{2i+1} \cup O_{2i}. \qquad (11.25)$$
[Figure 11.6 plot: ROC curve, PD versus PFA.]
Figure 11.6 ROC curve for the SPA detector of relative payload α = 0.05 in raw scans of
film (Database SCAN). When comparing this figure with Figure 11.5, note the range of
the x axis.
Note that trace sets Ci are invariant with respect to LSB embedding because the value of ⌊r/2⌋ does not depend on the LSB of r. The four trace subsets of Ci, however, are in general not invariant with respect to LSB embedding.
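The trace-set bookkeeping can be sketched in Python. The function names and the (E, O) labeling convention, subset chosen by the parity of s with the difference r − s as the second key, are our own, inferred from the four cases listed above:

```python
from collections import Counter

def trace_set_index(r, s):
    """Index i of the trace set C_i: the pair values differ by i after
    right-shifting their binary representation."""
    return (r >> 1) - (s >> 1)

def trace_subsets(pairs):
    """Classify pairs (r, s) into trace subsets.  Following the four
    cases above, a pair with difference j = r - s lands in E_j when s
    is even and in O_j when s is odd, so C_i splits into E_2i, O_2i-1,
    E_2i+1, and O_2i.  Returns a Counter keyed by ('E' or 'O', j)."""
    subsets = Counter()
    for r, s in pairs:
        label = 'E' if s % 2 == 0 else 'O'
        subsets[(label, r - s)] += 1
    return subsets
```

Flipping the LSB of r or s never changes `trace_set_index`, so |Ci| is invariant under LSB embedding, while the four subsets mix as in the transition diagram.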
The transition diagram between the trace subsets, including the probabilities of
transition, is shown in Figure 11.7. As an example, we explain the transitions
from E2i. All pairs (r, s) from E2i have both r and s even. The probability that a pair (r, s) ∈ E2i ends up again in E2i is the probability that neither r nor s gets flipped, which is (1 − β)². The probability of the transition E2i → O2i is the probability that both get flipped, which is β². The probabilities of the remaining two transitions, E2i → E2i+1 and E2i → O2i−1, are equal to the probability that one pixel gets flipped but not the other, which is β(1 − β).
The cardinality of each trace set after flipping a randomly chosen fraction β of LSBs is a random variable with the following expected values derived from the transition diagram:

$$\begin{pmatrix} \mathrm{E}[|E'_{2i}|] \\ \mathrm{E}[|O'_{2i-1}|] \\ \mathrm{E}[|E'_{2i+1}|] \\ \mathrm{E}[|O'_{2i}|] \end{pmatrix} = \begin{pmatrix} b^2 & ab & ab & a^2 \\ ab & b^2 & a^2 & ab \\ ab & a^2 & b^2 & ab \\ a^2 & ab & ab & b^2 \end{pmatrix} \begin{pmatrix} |E_{2i}| \\ |O_{2i-1}| \\ |E_{2i+1}| \\ |O_{2i}| \end{pmatrix}, \qquad (11.26)$$

where a = β, b = 1 − β. Here, we again use the prime to denote the sets obtained from the stego image. For any 0 ≤ β < 1/2, the matrix is invertible and we can solve (11.26) for the cover cardinalities, obtaining the inverse system (11.27).
[Figure 11.7 Transition diagram between the trace subsets E2i, O2i−1, E2i+1, and O2i: each subset stays put with probability (1 − β)², moves to its opposite subset with probability β², and moves to each of the two remaining subsets with probability β(1 − β).]
Thus, in theory, if we knew the change rate, we should be able to recover the original cardinalities of cover trace sets by substituting the trace-set cardinalities of the stego image, arguing that they should be close to their expected values as long as the number of pixels in each trace set is large. Alternatively, if we succeed in finding a condition that the trace-set cardinalities of covers must satisfy, we obtain equation(s) for the unknown change rate β. In other words, we are again following Strategy T2 from Chapter 10. By the same reasoning as in the derivation of SPA, we expect to see in covers approximately the same number of pairs with r − s = j independently of whether s is even or odd, or |Ej| ≈ |Oj|. LSB embedding violates this condition only for odd values of j. Indeed, the condition |E2i| = |O2i| implies E[|E′2i|] = E[|O′2i|], as can be easily verified from the first and fourth equations from (11.26). The condition |E2i+1| = |O2i+1| (11.28) leads to the following quadratic equation for β, obtained from the third equation from (11.27) and the second equation from (11.27) written for O2i+1 rather than O2i−1:
$$-\beta(1-\beta)\,|E'_{2i}| + \beta^2\,|O'_{2i-1}| + (1-\beta)^2\,|E'_{2i+1}| - \beta(1-\beta)\,|O'_{2i}| = -\beta(1-\beta)\,|E'_{2i+2}| + (1-\beta)^2\,|O'_{2i+1}| + \beta^2\,|E'_{2i+3}| - \beta(1-\beta)\,|O'_{2i+2}|, \qquad (11.29)$$
which simplifies to

$$\beta^2\left(|C'_i| - |C'_{i+1}|\right) + \beta\left(|E'_{2i+2}| + |O'_{2i+2}| - 2|E'_{2i+1}| + 2|O'_{2i+1}| - |E'_{2i}| - |O'_{2i}|\right) + |E'_{2i+1}| - |O'_{2i+1}| = 0. \qquad (11.30)$$

Here, we used the fact that |C′i| = |Ci| and |C′i| = |E′2i| + |O′2i−1| + |E′2i+1| + |O′2i|, which follows from (11.25). The smaller root of this quadratic equation is the change-rate estimate obtained from the ith trace set Ci.
At this point, we have multiple choices regarding how to aggregate the available
equations to estimate the change rate:
1. Sum all equations (11.30) for all indices i (or for some limited range, such
as |i| ≤ 50) and solve the resulting single quadratic equation. This option is
essentially the step taken in the generalized version of SPA as it appeared
in [63].
2. Solve (11.30) for some small values of |i|, e.g., |i| ≤ 2, obtaining individual estimates β̂i, and estimate the final change rate as β̂ = min_i β̂i. This choice has been shown to provide more stable results compared with SPA [127].
3. Solve (11.28) in the least-square sense [128, 163],

$$\hat\beta = \arg\min_\beta \sum_i \left(|E_{2i+1}| - |O_{2i+1}|\right)^2, \qquad (11.31)$$

where we substitute from (11.27) for the cardinalities of cover trace sets, replacing the expected values with the observed cardinalities of stego-image trace sets.
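A minimal sketch of option 2, solving the per-trace-set quadratic (11.30) numerically. The dictionary-based interface is our own; E, O, and C hold the stego-image cardinalities |E′j|, |O′j|, and |C′i|:

```python
import numpy as np

def beta_from_trace_set(E, O, C, i):
    """Smaller real root of the quadratic (11.30) for trace set C_i.
    E, O map an index j to the stego cardinalities |E'_j|, |O'_j|;
    C maps i to |C'_i| (which is invariant under LSB embedding)."""
    a = C[i] - C[i + 1]
    b = (E[2*i + 2] + O[2*i + 2] - 2*E[2*i + 1] + 2*O[2*i + 1]
         - E[2*i] - O[2*i])
    c = E[2*i + 1] - O[2*i + 1]
    roots = [r.real for r in np.roots([a, b, c]) if abs(r.imag) < 1e-7]
    return min(roots) if roots else None
```

On a synthetic example where the cover satisfies |E2i+1| = |O2i+1| and the stego cardinalities are generated by the transition matrix (11.26), the smaller root recovers the change rate exactly.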
The above formulation of SPA allows direct extensions to groups of more than two pixels (see the triples analysis and quadruples analysis by Ker [128, 131], which were reported to provide more accurate change-rate estimates, especially for short messages). An alternative extension of SPA to groups of multiple pixels
appeared in [61].
The least-square estimator (11.31) can be interpreted as a maximum-likelihood
estimator under the assumption that the differences between the cardinalities of
cover trace sets, |E2i+1 | − |O2i+1 |, are independent realizations of a Gaussian
random variable. In this sense, the least-square estimator is a step in the right
direction because it views the cover image as an entity unknown to the steganalyzer and postulates a statistical model for it. The iid Gaussian assumption is,
however, clearly false because the cardinalities of trace sets decrease with increas-
ing value of |i| (trace sets with large values of |i| are sparsely populated) and thus
exhibit larger relative variations. It is not immediately clear what assumptions
can be imposed on the cover trace sets to derive a “more proper” maximum-
likelihood variant of SPA. An interesting solution was proposed by Ker [133],
who introduced the concept of a precover consisting of pixel pairs with precisely
|E2i+1 | + |O2i+1 | pairs of pixels differing by 2i + 1. A specific cover is then ob-
tained by randomly (uniformly) associating each pair with either E2i+1 or O2i+1 .
This assumption allows one to model the differences |E2i+1 | − |O2i+1 | as random
variables with a binomial distribution (or their Gaussian approximation) and de-
rive an appropriate maximum-likelihood estimator of the change rate. This ML
version of SPA has been shown to provide improved estimates for low embedding
rates.
We close this section with a brief mention of other related approaches to de-
tection of LSB embedding in the spatial domain. Methods based on statistics of
differences between pairs of neighboring pixels include [88, 155, 256]. Approaches
that use a classical signal-detection framework appeared in [40] and [55]. A qual-
itatively different method called the WS method [83] (Weighted Stego image)
that uses local image estimators is presented in Exercises 11.3 and 11.4. An
improved version of this method has been shown to produce some of the most
accurate results for detection of LSB embedding in the spatial domain [138].
The WS method for the JPEG domain appeared in [21]. Another advantage of
the WS method is that it is less sensitive than SPA to the assumption that the
message is embedded along a pseudo-random path and thus gives better results
when the embedding path is non-random or adaptive. Structural steganalysis
has also been extended to detection of embedding in two LSBs in [135, 253].
Although Pairs Analysis can detect LSB embedding in grayscale and color images in general, it was originally developed for quantitative steganalysis of methods that embed messages in indices to a sorted palette using LSB embedding. The EzStego algorithm of Chapter 5 is a typical example of such schemes. Prior to embedding, EzStego first sorts the palette colors according to their luminance and then reindexes the image data accordingly so that the visual appearance of the image does not change. Then, the usual LSB embedding in indices is applied. Besides EzStego, early versions of Steganos (http://steganos.com) and Hide&Seek (ftp://ftp.funet.fi/pub/crypt/steganography/hdsk41.zip) also employed a similar method.
Before describing Pairs Analysis, we introduce notation and analyze the impact
of EzStego embedding on cover images. Let c[i] = (r[i], g[i], b[i]), i = 0, . . . , 255
be the 256 colors from the sorted palette. During LSB embedding, each color can
be changed only into the other color from the same color pair {c[2k], c[2k + 1]},
k = 0, . . . , 127. For a fixed k, we extract the colors c[2k] and c[2k + 1] from the
whole image, for example by scanning it by rows (Figure 11.8). This sequence
of colors can be converted to a binary vector of the same length by associating
a “0” with c[2k] and a “1” with c[2k + 1]. This binary vector will be called a
color cut for the pair {c[2k], c[2k + 1]} and will be denoted Z(c[2k], c[2k + 1]).
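A color cut can be computed directly from the flat list of palette indices. This sketch uses the convention above (0 for c[2k], 1 for c[2k + 1]); the helper names are ours, and the test below reproduces the example of Figure 11.8:

```python
def color_cut(indices, k):
    """Color cut Z(c[2k], c[2k+1]): scan the palette indices in row
    order, keep only values 2k and 2k+1, and map them to 0 and 1."""
    return [i - 2*k for i in indices if i in (2*k, 2*k + 1)]

def homogeneous_pairs(z):
    """Number of homogeneous bit pairs (00 or 11) in a binary vector."""
    return sum(1 for a, b in zip(z, z[1:]) if a == b)
```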
Because palette images have a small number of colors and natural images contain
macroscopic structure, Z is more likely to exhibit long runs of zeros or ones rather
than some random pattern. The embedding process will disturb this structure
[Figure 11.8 example: the 4 × 4 array of palette indices

8 4 5 0
5 8 9 5
0 4 4 5
7 3 6 4

scanned by rows; the pixels with colors 4 and 5 produce the color cut Z(4, 5) = (0, 1, 1, 1, 0, 0, 1, 0).]
Figure 11.8 An example of a color cut. The pixels shaded in gray represent two colors
from the same LSB pair. Crossed-out pixels in white represent the remaining colors.
and increase the entropy of Z. Finally, when the maximal-length message has
been embedded in the cover image (1 bit per pixel), Z will be a random binary
sequence and the entropy of Z will be maximal.
Let us now take a look at what happens during embedding to color cuts for
the “shifted” color pairs {c[2k − 1], c[2k]}, k = 1, . . . , 127. During embedding,
the colors c[2k − 2] and c[2k − 1] are exchanged for each other and so are the
colors c[2k] and c[2k + 1]. Even after embedding the maximal message (each pixel
modified with probability 12 ), the color cut Z(c[2k − 1], c[2k]) will still show some
residual structure. To see this, imagine a binary sequence W that was formed
from the cover image by scanning it by rows and associating a “0” with the colors
c[2k − 2] and c[2k − 1] and a “1” with the colors c[2k] and c[2k + 1]. Convince
yourself that, after embedding a maximal pseudo-random message in the image,
the color cut Z(c[2k − 1], c[2k]) is the same as starting with the sequence W and skipping each element of W with probability 1/2. Because W showed structure in
the cover image, most likely long runs of 0s and 1s, we see that randomly chosen
subsequences of W will show some residual structure as well.
We are now ready to describe the steganalysis method. Denoting the concatenation of bit strings with the symbol “&,” we first concatenate all color cuts Z(c[2k], c[2k + 1]) into one vector
Z = Z(c[0], c[1])& . . . &Z(c[254], c[255]), (11.32)
and all color cuts for shifted pairs Z(c[2k − 1], c[2k]) into

Z′ = Z(c[1], c[2]) & . . . & Z(c[253], c[254]) & Z(c[255], c[0]). (11.33)
Next, we define a simple measure of structure in a binary vector as the number of homogeneous bit pairs 00 and 11 in the vector. For example, a vector of n 1s will have n − 1 homogeneous bit pairs. We denote by R(β) the expected relative number of homogeneous bit pairs in Z after flipping the LSBs of indices of a fraction β of randomly chosen pixels, 0 ≤ β ≤ 1. We recognize β as the change rate. Similarly, let R′(β) be the expected relative number of homogeneous bit pairs in Z′. For β < 1/2, this change rate corresponds to relative payload α = 2β.
Exercise 11.1 shows that R(x) is a quadratic polynomial with its vertex at x = 1/2 and R(1/2) = (n − 1)/(2n) ≈ 1/2 (see Figure 11.9). The value R(β) is known
[Figure 11.9 plot: R(x) and R′(x) versus the change rate x, with marked points at x = β and x = 1 − β.]
Figure 11.9 Expected number of homogeneous pairs R(x) and R′(x) in color cuts Z and Z′ as a function of the change rate x. The circles correspond to y values that can be obtained from the stego image with unknown change rate β. The values R(0) and R′(0) are not known but satisfy R(0) = R′(0).
as it can be calculated from the stego image. Note that R(β) = R(1 − β) is also
known.
The value of R′(1/2) can be derived from Z′ (see Exercise 11.2), while R′(β) and R′(1 − β) can be calculated from the stego image and the stego image with all colors flipped, respectively. Modeling R′(x) as a second-degree polynomial, the difference D(x) = R(x) − R′(x) = Ax² + Bx + C is also a second-degree polynomial.
Finally, we accept one additional assumption, which says that the number of homogeneous pairs in Z and Z′ must be the same if no message has been embedded. This is, indeed, intuitive because there is no reason why the color cuts of pairs and shifted pairs in the cover image should have different structures.
In summary, we know the four values
C = 0, (11.39)
4D(1/2) = A + 2B, (11.40)
D(β) = Aβ 2 + Bβ, (11.41)
D(1 − β) = A(1 − β)2 + B(1 − β). (11.42)
The smaller of the two roots is our approximation to the unknown change rate
β. The pseudo-code for Pairs Analysis is shown in Algorithm 11.2.
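A numerical sketch of the final step. Writing d1 = D(β), d2 = D(1 − β), and dh = D(1/2) for the three values measurable from the stego image, eliminating A and B from (11.40)–(11.42) leads, by our own algebra rather than a formula quoted from the text, to the quadratic 4·dh·β² − (d1 − d2 + 4·dh)·β + d1 = 0, whose smaller root approximates β:

```python
import numpy as np

def pairs_analysis_beta(d1, d2, dh):
    """Smaller real root of 4*dh*b^2 - (d1 - d2 + 4*dh)*b + d1 = 0,
    where d1 = D(beta), d2 = D(1-beta), dh = D(1/2) are measured from
    the stego image.  (This closed form is our rearrangement of
    (11.39)-(11.42), not a formula reproduced from the text.)"""
    roots = [r.real for r in np.roots([4*dh, -(d1 - d2 + 4*dh), d1])
             if abs(r.imag) < 1e-7]
    return min(roots)
```

Choosing any A, B, and β, generating d1, d2, dh from D(x) = Ax² + Bx, and feeding them back recovers β as the smaller root.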
Figure 11.10 Pixel-scanning order along a Hilbert curve for a 32 × 32 image.
[Plot: estimated payload α̂.]
the histogram of absolute values of DCT coefficients for the cover image, $h_0^{(kl)}$, is estimated from the stego image using calibration as described in Section 10.3. Denoting the estimated histograms of absolute values of DCT coefficients corresponding to spatial frequency (k, l) as $\hat h_0^{(kl)}$, the change rate can be estimated from equations (11.46)–(11.47), where we substitute $\hat h_0^{(kl)}$ for the cover-image
histograms and replace the expected values with the sample values
$$h_\beta^{(kl)}[i] = (1-\beta)\,\hat h_0^{(kl)}[i] + \beta\,\hat h_0^{(kl)}[i+1], \quad i > 0, \qquad (11.48)$$

$$h_\beta^{(kl)}[0] = \hat h_0^{(kl)}[0] + \beta\,\hat h_0^{(kl)}[1]. \qquad (11.49)$$
This is a system of linear equations for various values of (k, l) and i for just one
unknown – the change rate β. Not all values (k, l) and i, however, should be used.
The histograms are in general less populated for higher spatial frequencies and
higher values of i. Again, we have several possibilities for how to aggregate the
equations to obtain the change-rate estimate (see Section 11.1.3). For brevity,
here we present only the approach proposed in [86].
The steganalyst is advised to obtain three least-square estimates β̂01, β̂10, β̂11 from histograms with (k, l) ∈ {(0, 1), (1, 0), (1, 1)} and i = 0, 1,

$$\hat\beta_{kl} = \arg\min_\beta \left( h_\beta^{(kl)}[0] - \hat h_0^{(kl)}[0] - \beta\,\hat h_0^{(kl)}[1] \right)^2 + \left( h_\beta^{(kl)}[1] - (1-\beta)\,\hat h_0^{(kl)}[1] - \beta\,\hat h_0^{(kl)}[2] \right)^2, \qquad (11.50)$$
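Because both residuals in (11.50) are linear in β, each β̂kl has a closed form. A sketch, with hypothetical argument names (the first two bins of the observed stego histogram and the first three bins of the calibrated cover estimate):

```python
import numpy as np

def f5_beta_kl(h_stego, h0_est):
    """Closed-form minimizer of (11.50) for one DCT mode (k, l).
    h_stego[0..1]: observed stego histogram bins; h0_est[0..2]:
    calibrated cover-histogram estimate.  Each residual has the form
    r_i = c_i - beta * d_i, so the least-square solution is
    beta = <c, d> / <d, d>."""
    c = np.array([h_stego[0] - h0_est[0], h_stego[1] - h0_est[1]], float)
    d = np.array([h0_est[1], h0_est[2] - h0_est[1]], float)
    return float(np.dot(c, d) / np.dot(d, d))
```

On synthetic data generated exactly according to (11.48)–(11.49), the estimate is exact.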
The main reason why LSB embedding in the spatial domain can be detected very reliably is the non-symmetrical character of this embedding operation. By symmetrizing the embedding operation to “add or subtract 1 at random” (±1 embedding) instead of flipping the LSB, the majority of accurate attacks on LSB embedding are thwarted. In this section, we show some targeted attacks on ±1
embedding, which are, in fact, applicable to more general embedding paradigms
in the spatial domain as long as the impact of embedding the message can be
described as adding iid noise to the image (e.g., stochastic modulation from
Section 7.2.1).
The first attempts to construct steganalytic methods for detection of embed-
ding by noise adding appeared in [39, 40, 242]. We now describe the approach
proposed in [107] and its extension [129, 130].
[Figure: cover histogram h and stego histogram hs; histogram count versus grayscale value.]
$$h_s = h \star f, \qquad (11.53)$$

or

$$h_s[i] = \sum_j h[j]\, f[i-j], \qquad (11.54)$$
where the indices i, j run over the index set determined by the number of colors
in the histogram (e.g., for 8-bit grayscale images, i, j ∈ {0, . . . , 255}). In (11.53),
f is the probability mass function of the stego noise. The specific form of this
convolution for ±1 embedding has been derived in Exercise 7.1.
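For ±1 embedding with change rate β, the stego-noise pmf is f = [β/2, 1 − β, β/2], so the stego histogram is a simple three-tap convolution. A sketch (boundary handling is simplified here, as noted in the comment):

```python
import numpy as np

def pm1_stego_histogram(h, beta):
    """Stego histogram as the convolution (11.53)-(11.54) of the cover
    histogram with the +-1 noise pmf f = [beta/2, 1-beta, beta/2].
    np.convolve zero-pads at the ends, so mass pushed outside the range
    is lost; real +-1 embedders treat the boundary bins 0 and 255
    specially, which this sketch ignores."""
    f = np.array([beta / 2, 1 - beta, beta / 2])
    return np.convolve(np.asarray(h, float), f, mode='same')
```

A spike of 10 pixels at one grayscale value, for example, spreads as 1, 8, 1 for β = 0.2, visibly smoothing the histogram.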
This observation gives us an idea for deriving useful features for steganalysis by analyzing the histogram smoothness. Due to the low-pass character of the convolution, hs will be smoother than h and thus its energy will be concentrated in lower frequencies. This can be captured by switching to the Fourier representation of the histograms and the noise pmf. For clarity, in this section we will denote Fourier-transformed quantities of all variables with the corresponding capital letters. The Discrete Fourier Transform (DFT) of an N-dimensional
vector x is defined as

$$X[k] = \sum_{j=0}^{N-1} x[j]\, e^{-\mathrm{i}\, 2\pi jk/N}, \qquad (11.55)$$
where i in (11.55) stands for the imaginary unit (i² = −1). The Fourier transform of the stego-image histogram is obtained as an elementwise multiplication of the Fourier transforms of the cover-image histogram and the noise pmf,
Hs [k] = H[k]F[k] for each k. (11.56)
The function Hs is called the Histogram Characteristic Function (HCF) of the
stego image. At this point, a numerical quantity is needed that can be computed
from the HCF and that would evaluate the location of the energy in the spec-
trum. Because the absolute value of the DFT is symmetrical about the midpoint
value k = N/2, a reasonable measure of the energy distribution is the Center Of
Gravity (COG) of |H| computed for indices k = 0, . . . , N/2 − 1,
$$\mathrm{COG}(H) = \frac{\sum_{k=0}^{N/2-1} k\,|H[k]|}{\sum_{k=0}^{N/2-1} |H[k]|}. \qquad (11.57)$$
It can be shown using the Chebyshev sum-inequality (see Exercise 11.5) that as
long as |F[k]| is non-increasing,4
COG(Hs ) ≤ COG(H), (11.58)
which can be intuitively expected because the stego image histogram is smoother and thus the energy of the HCF will shift towards lower frequencies (see Figure 11.13).
For steganalysis of color images, we will use the three-dimensional color his-
togram, h[j1 , j2 , j3 ], which denotes the number of pixels with their RGB color
(j1 , j2 , j3 ). Furthermore, the one-dimensional DFT (11.55) is replaced with its
three-dimensional version

$$H[k_1,k_2,k_3] = \sum_{j_1,j_2,j_3=0}^{N-1} h[j_1,j_2,j_3]\, e^{-\mathrm{i}\,2\pi (j_1k_1+j_2k_2+j_3k_3)/N}. \qquad (11.59)$$
4 Many known noise distributions, such as the Gaussian or Laplacian distributions, have monotonically decreasing |F[k]|.
[Figure 11.13 plot: |H[k]| and |Hs[k]|, power versus frequency k.]
Figure 11.13 Stego image HCF falls off to zero faster because the stego image histogram
is smoother.
[Figures 11.14 and 11.15 data: bar charts of the COG for cover and stego images from the cameras Canon G2, Canon PS S40, and Kodak DC290.]
Figure 11.15 Top: The COG (11.57) for the same images as in Figure 11.14 converted to
grayscale and JPEG compressed using quality factor 75. The embedding change rate was
β = 0.25 (w.r.t. the number of pixels). Bottom: The COG of cover and stego images of
the adjacency histogram (11.65) computed for the same grayscale JPEG images.
HCF appears to vary considerably across different image sources. This is true in
general for most spatial-domain steganalyzers, including the blind constructions
explained in Section 12.5 [37, 139].
Other methods for detection of noise adding and ±1 embedding include meth-
ods based on signal estimation [55, 220, 249], histogram artifacts [37, 38], blind
steganalyzers [10, 252], and blind steganalysis methods described in Section 12.5.
Summary
• LSB embedding in the spatial domain can be reliably detected even at low relative payloads due to the asymmetry of the embedding operation.
• Sample Pairs Analysis is an example of an attack on LSB embedding in pseudo-randomly spread pixels. It works by analyzing the embedding impact on subsets of pairs of neighboring pixels.
• Structural steganalysis is a reformulation of SPA using the concept of trace sets. This formulation makes it possible to derive detectors that can incorporate statistical assumptions about the cover image and provides a convenient framework for generalizing SPA to work with groups of more than two pixels.
• Pairs Analysis is another method for detection of LSB embedding, based on a different principle: it considers the spatial distribution of colors from each LSB pair in the entire image. It is especially suitable for attacking LSB embedding in palette image formats.
• The steganographic algorithm F5 can be attacked by first quantifying the relationship between the histograms of stego and cover images and then estimating the cover-image histogram using calibration.
• ±1 embedding in the spatial domain can be detected using the center of gravity of the histogram characteristic function (absolute value of the Fourier transform of the histogram) as the feature. This is because adding independent stego noise to the cover image smooths the histogram. The accuracy of this method can be improved by considering the adjacency histogram and by applying calibration (resampling the stego image).
Exercises
11.1 [Pairs analysis I] Prove that the expected value of the relative number of homogeneous pairs R(β) for color cut Z for change rate β ∈ [0, 1] is a parabola with its minimum at β = 1/2. In particular, R(1/2) = (n − 1)/(2n) ≈ 1/2 for large n, where n is the length of the color cut (number of pixels in the image).
Hint: Write the color cut Z of the cover image as a concatenation of r segments consisting of consecutive runs of k1, . . . , kr 0s or 1s, k1 + · · · + kr = n. Thus, $R(0) = \sum_{i=1}^{r} (k_i - 1) = n - r$. After changing the LSB of a random portion of β pixels, the probability that a homogeneous pair of consecutive bits will stay homogeneous is β² + (1 − β)². The expected number of homogeneous pairs in the ith segment is thus (β² + (1 − β)²)(ki − 1) + 2β(1 − β), where the last term comes from the right end of the segment (an additional pair will be formed at the boundary if the last bit in the segment flips and the first bit of the next segment does not flip, or vice versa). This last term is missing from the last segment.
11.2 [Pairs analysis II] Let Z′ = {b[i]}_{i=1}^{n} be the color cut for the shifted color pairs and R′(β) be the number of homogeneous pairs after flipping a portion β of pixels. Prove that the expected value of R′(1/2) is

$$\mathrm{E}\left[R'\left(\tfrac{1}{2}\right)\right] = \sum_{k=1}^{n-1} 2^{-k} h_k, \qquad (11.66)$$
11.3 [Weighted stego image] In this (and the next) exercise, you will derive
another attack on LSB embedding in the spatial domain called a Weighted Stego
(WS) attack. Let x[i], i = 1, . . . , n be the pixel values from an 8-bit grayscale
cover image containing n pixels. The value of x[i] after flipping its LSB will be
denoted with a bar,
x̄[i] = LSBflip(x[i]) = x[i] + 1 − 2(x[i] mod 2). (11.68)
Let y[i] denote the stego image after flipping the fraction β of pixels along a
pseudo-random path (this corresponds to embedding an unbiased binary message
of relative message length 2β). Let wθ [i] be the “weighted” stego image,
wθ [i] = y[i] + θ(ȳ[i] − y[i]), 0 ≤ θ ≤ 1, (11.69)
with weight θ. Let

$$D(\theta) = \sum_{i=1}^{n} \left(w_\theta[i] - x[i]\right)^2 \qquad (11.70)$$

be the sum of squares of the differences between the pixels of the weighted stego image and the cover image. Show that D(θ) is minimal for θ = β. In other words,

$$\beta = \arg\min_\theta D(\theta) = \arg\min_\theta \sum_{i=1}^{n} \left(w_\theta[i] - x[i]\right)^2. \qquad (11.71)$$
Hint: Substitute (11.69) into (11.71) and divide the sum over pixels for which
x[i] = y[i] (unmodified pixels) and y[i] = x̄[i] (flipped pixels). Then simplify and
12 Blind steganalysis
the properties of the cover source. For example, if Eve knows that Alice likes to
send to Bob images taken with her camera, the warden may purchase a camera
of the same model and use it to create the training database. If the warden has
little or no information about the cover source, as would be the case for an automatic traffic-monitoring device, constructing a blind steganalyzer is much
harder. There are significant differences in how difficult it is to detect stegano-
graphic embedding in various cover sources. For example, as already mentioned
in Chapter 11, scans of film or analog photographs are typically very noisy and
contain characteristic microscopic structure due to the grains present in film.
Because this structure is stochastic in nature, it complicates detection of embed-
ding changes. Thus, tuning a blind steganalyzer to produce a low false-positive
rate across all cover sources, without being overly restrictive for “well-behaved”
cover sources, such as low-noise digital-camera images, may be quite a difficult
task. This is a serious problem that is not easily resolved [37]. One possible av-
enue Eve can take is to first classify the digital image into several categories,
such as scan, digital-camera image, raw, JPEG, computer graphics, etc., and
then send the image to a classifier that was trained separately for each cover
category. Creating such a system requires large computational and storage re-
sources because each database should contain images that were also processed
using commonly used image-processing operations, such as denoising, filtering,
recoloring, resizing, rotation, cropping, etc. In general, the larger the database,
the more accurate and reliable the steganalyzer will be.
Many different blind steganalysis methods have been proposed in the litera-
ture [10, 11, 12, 67, 78, 102, 167, 168, 189, 190, 210, 225, 236, 252]. Even though,
in principle, a blind detector can be used to detect steganography in any image
format, one can expect that features computed in the same domain as where
the embedding is realized would be the most sensitive to embedding because
in this domain the changes to individual cover elements are lumped and independent. Therefore, we divide the description of blind steganalysis according
to the embedding domain. In the next section, we give a specific example of a
feature set for blind steganalysis of stegosystems that embed messages by manip-
ulating quantized DCT coefficients. Using this feature set, in Sections 12.2 and
12.3 we present the details of a specific implementation of a blind steganalyzer
using SVMs and a one-class neighbor machine. Both steganalyzers are tested
regarding how well they can detect stego images and how they generalize to pre-
viously unseen steganographic methods. An example of using blind steganalysis
for construction of targeted attacks is included in Section 12.4. Blind steganal-
ysis of stegosystems that embed messages in the spatial domain is included in
Section 12.5.
Blind steganalysis 253
The reader is referred to Chapter 2 for more details about the JPEG format.
The features should capture all relationships that exist among DCT coeffi-
cients. A good approach is to consider the coefficients as realizations of a random
variable that follows a certain statistical model and choose as features the model
parameters estimated from the data. Unfortunately, the great diversity of natural
images prevents us from finding one well-fitting model and it is thus necessary
to build the features from several models in order to obtain good steganalysis
results.
Additionally, the features are required to be sensitive to typical steganographic
embedding changes and not depend on the image content so that one can easily
separate the cluster of cover and stego image features. To satisfy this require-
ment, the features are calibrated as explained in Chapter 10 (see Figure 10.3).
In calibration, the stego JPEG image J1 is decompressed to the spatial domain, cropped by 4 pixels in both directions, and recompressed with the same quantization table as J1 to obtain J2. The calibrated form of feature f is thus the difference

$$f(J_1) - f(J_2). \qquad (12.3)$$
$$H[r] = \frac{1}{64\,N_B} \sum_{k,l=0}^{7} \sum_{b=1}^{N_B} \delta\left(r - D[k,l,b]\right), \qquad (12.4)$$

$$h^{(kl)}[r] = \frac{1}{N_B} \sum_{b=1}^{N_B} \delta\left(r - D[k,l,b]\right), \qquad (12.5)$$

$$g^{(r)}[k,l] = \frac{1}{N_B(r)} \sum_{b=1}^{N_B} \delta\left(r - D[k,l,b]\right). \qquad (12.6)$$
In words, $g^{(r)}[k, l]$ counts how many times the value r occurs as the (k, l)th DCT coefficient across all NB blocks, and $N_B(r) = \sum_{k,l} \sum_{b=1}^{N_B} \delta(r - D[k,l,b])$ is the normalization constant. The dual histogram captures the distribution of a given coefficient value r among different DCT modes. Note that if a steganographic method preserves all individual histograms, it also preserves all dual histograms and vice versa.
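Both histogram types can be sketched for a coefficient array D of shape (8, 8, NB), where the last axis runs over blocks; the function names are ours:

```python
import numpy as np

def individual_histogram(D, k, l, r):
    """h^{(kl)}[r] of (12.5): the fraction of the N_B blocks whose
    (k, l)th DCT coefficient equals r.  D has shape (8, 8, N_B)."""
    return float(np.sum(D[k, l, :] == r)) / D.shape[2]

def dual_histogram(D, r):
    """g^{(r)}[k, l] of (12.6): the distribution of the value r over
    the 64 DCT modes, normalized by the total count N_B(r) of r."""
    counts = np.sum(D == r, axis=2).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts
```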
If we were to take the complete vectors (12.4), (12.5), and matrices (12.6) as
features, the dimensionality of the feature space would be too large. Because
DCT coefficients typically follow a distribution with a sharp spike at r = 0 (see
Chapter 2), the sample pmf can be accurately estimated only around zero while
its values for larger values of r exhibit fluctuations that are of little value. The
same holds true of the individual DCT modes. The most populated are low-
frequency modes with small k + l. Thus, as the first set of features, we select
the first-order statistics shown in Table 12.1. We remind the reader that the
features for blind steganalysis are not used directly in this form but are calibrated
using (12.3).
$$C[s,t] = \frac{\displaystyle\sum_{i=1}^{M-8}\sum_{j=1}^{N} \delta\left(s - D[i,j]\right)\delta\left(t - D[i+8,j]\right)}{64\,(M/8-1)(N/8)} + \frac{\displaystyle\sum_{i=1}^{M}\sum_{j=1}^{N-8} \delta\left(s - D[i,j]\right)\delta\left(t - D[i,j+8]\right)}{64\,(M/8)(N/8-1)}. \qquad (12.7)$$
$$V = \frac{\displaystyle\sum_{i=1}^{M-8}\sum_{j=1}^{N} \left|D[i,j] - D[i+8,j]\right|}{64\,(M/8-1)(N/8)} + \frac{\displaystyle\sum_{i=1}^{M}\sum_{j=1}^{N-8} \left|D[i,j] - D[i,j+8]\right|}{64\,(M/8)(N/8-1)}. \qquad (12.8)$$
An integral measure of dependences among coefficients from neighboring
blocks is the blockiness defined as the sum of discontinuities along the 8 × 8
block boundaries in the spatial domain. Embedding changes are likely to in-
crease the blockiness rather than decrease it. We define two blockiness measures
for γ = 1 and γ = 2:
$$B_\gamma = \frac{\displaystyle\sum_{i=1}^{\lfloor (M-1)/8 \rfloor}\sum_{j=1}^{N} \left|x[8i,j] - x[8i+1,j]\right|^\gamma + \sum_{i=1}^{M}\sum_{j=1}^{\lfloor (N-1)/8 \rfloor} \left|x[i,8j] - x[i,8j+1]\right|^\gamma}{N \lfloor (M-1)/8 \rfloor + M \lfloor (N-1)/8 \rfloor}. \qquad (12.9)$$
Here, M and N are the image height and width in pixels and x[i, j], i =
1, . . . , M, j = 1, . . . , N , are grayscale values of the decompressed JPEG image.
These two features are the only features computed from the spatial representation of the JPEG image.
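A sketch of the blockiness (12.9) using 0-based indexing (the text indexes pixels from 1, so the boundary pair x[8i, j], x[8i + 1, j] corresponds to rows 8i − 1 and 8i here):

```python
import numpy as np

def blockiness(x, gamma=1):
    """B_gamma of (12.9): mean discontinuity across the 8x8 JPEG block
    boundaries of a grayscale image x of size M x N, normalized by the
    number of boundary pixel pairs."""
    M, N = x.shape
    xf = x.astype(float)
    v = sum((np.abs(xf[8*i - 1, :] - xf[8*i, :]) ** gamma).sum()
            for i in range(1, (M - 1) // 8 + 1))
    h = sum((np.abs(xf[:, 8*j - 1] - xf[:, 8*j]) ** gamma).sum()
            for j in range(1, (N - 1) // 8 + 1))
    return (v + h) / (N * ((M - 1) // 8) + M * ((N - 1) // 8))
```

For a 16 × 16 image with a step of 10 across the single horizontal block boundary, for example, B1 = 160/32 = 5.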
The higher-order functionals measuring inter-block dependences among DCT
coefficients are summarized in Table 12.2.
as well as for the vertical and both diagonal directions. To this end, it will be
useful to represent DCT coefficients again using the matrix D[i, j].
Let A[i, j] = |D[i, j]| be the matrix of absolute values of DCT coefficients in
the image. Instead of modeling directly the intra-block dependences among DCT
coefficients, which are quite weak, we will model the differences among them
because the differences will be more sensitive to embedding changes. Thus, we
form four difference arrays along four directions: horizontal, vertical, diagonal,
and minor diagonal (further denoted as Ah [i, j], Av [i, j], Ad [i, j], and Am [i, j]
respectively)
Since the range of differences between absolute values of neighboring DCT coeffi-
cients could be quite large, if the matrices Mh , Mv , Md , Mm were taken directly
as features, the dimensionality of the feature space would be impractically large.
Thus, we use only the central portion of the matrices, −4 ≤ s, t ≤ 4 with the note
that the values in the difference arrays Ah [i, j], Av [i, j], Ad [i, j], and Am [i, j]
larger than 4 are set to 4 and values smaller than −4 are set to −4 prior to
calculating Mh , Mv , Md , Mm . To further reduce the features’ dimensionality, all
four matrices are averaged,
1
M= (Mh + Mv + Md + Mm ) , (12.18)
4
which gives a total of 9 × 9 = 81 features (Table 12.3).
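Since equations (12.10)–(12.17) defining Mh, Mv, Md, Mm are not reproduced above, the following sketch makes assumptions: differences are clipped to [−4, 4] as described, and each directional matrix is taken as a normalized co-occurrence of horizontally adjacent entries of the corresponding difference array:

```python
import numpy as np

T = 4  # clipping threshold: a (2T+1) x (2T+1) = 9 x 9 matrix

def cooccurrence(diff):
    """Normalized co-occurrence matrix of horizontally adjacent entries
    of a clipped difference array (our stand-in for M_h etc., since the
    defining equations (12.10)-(12.17) are not reproduced here)."""
    d = np.clip(diff, -T, T).astype(int)
    M = np.zeros((2*T + 1, 2*T + 1))
    for s, t in zip(d[:, :-1].ravel(), d[:, 1:].ravel()):
        M[s + T, t + T] += 1
    return M / M.sum()

def averaged_features(A):
    """Average (12.18) of the four directional matrices computed from
    difference arrays of the absolute-value matrix A (2-D for brevity)."""
    Ah = A[:, :-1] - A[:, 1:]      # horizontal differences
    Av = A[:-1, :] - A[1:, :]      # vertical
    Ad = A[:-1, :-1] - A[1:, 1:]   # diagonal
    Am = A[1:, :-1] - A[:-1, 1:]   # minor diagonal
    return sum(cooccurrence(d) for d in (Ah, Av, Ad, Am)) / 4.0
```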
The final feature set summarized in Tables 12.1–12.3 contains 165 + 28 + 81 =
274 calibrated features: 165 first-order statistics, 28 inter-block statistics, and 81
different digital cameras with sizes ranging from 1.4 to 6 megapixels with an
average size of 3.2 megapixels. All images were originally acquired in the raw
format and then converted to grayscale and saved as 75% quality JPEG. The
database was divided into two disjoint sets D1 and D2 . The first set with 3500
images was used only for training the SVM classifier while the second set with
2504 images was used to evaluate the steganalyzer performance. To make the
evaluation of the steganalyzer more realistic, the two sets shared no images
taken by the same camera.
12.2.2 Algorithms
When training a blind steganalyzer, we need to present it with stego images
from as many (diverse) steganographic methods as possible to give it the abil-
ity to generalize to previously unseen stego images. The tacit assumption we
are making here is that the steganalyzer will be able to recognize stego images
embedded with an unknown steganographic scheme because the features will oc-
cupy a location in the feature space that is more compatible with stego images
rather than cover images. As will be seen in Section 12.2.6, this assumption is
not always satisfied and alternative approaches to blind steganalysis need to be
explored (Section 12.3).
All stego images for training were prepared from the training database D1 us-
ing six steganographic techniques – JP Hide&Seek, F5, Model-Based Steganogra-
phy without (MBS1) and with (MBS2) deblocking, Steghide, and OutGuess. The
algorithms for F5, Model-Based Steganography, and OutGuess were described
in Chapter 7. JP Hide&Seek is a more sophisticated version of Jsteg and its em-
bedding mechanism mostly modifies LSBs of DCT coefficients. The source code
is available from http://linux01.gwdg.de/~alatham/stego.html. Steghide is
another algorithm that preserves the global first-order statistics of DCT coeffi-
cients but using a different mechanism than OutGuess. The embedding is always
done by swapping coefficients rather than modifying their LSBs, which means
that no correction phase is needed. Steghide is described in [110]. These six
selected algorithms form the warden's knowledge base about steganography
from which she constructs her detector.
capacity), another third with payload 0.5, and the last third with 0.25.1 Thus,
each image was embedded by five of the algorithms, each with one payload out of the three.
The stego images for MBS2 were embedded with an even mixture of relative
payloads 0.3 and 0.15 of the embedding capacity of MBS1. This measure was
necessary because MBS2 often fails to embed longer messages. Thus, the train-
ing database contained a total of 6 × 3500 stego images embedded with an even
mixture of relative payloads 1, 0.5, and 0.25 (with the exception of MBS2).
For practical applications, the number of training stego images produced by
each embedding algorithm should reflect the a priori probabilities of encountering
stego images generated by each steganographic algorithm. In the absence of any
prior information, again one can use uniform distribution and assume that we
are equally likely to encounter a stego image from any of the stego algorithms.
Thus, the training database of stego images was formed by randomly selecting
3500 stego images from all 6 × 3500 stego images.
12.2.4 Training
The training consists of computing the feature vectors for each cover image from
D1 and for each stego image from the training database created in the previous
section. There are many tools one can use for classification purposes. Here, we
describe an approach based on soft-margin weighted support vector machines
(C-SVMs) with Gaussian kernel.2 The kernel width γ and the penalization pa-
rameter C are typically determined by a grid-search on a multiplicative grid,
such as
(C, γ) ∈ { (2^i, 2^j) | i ∈ {−3, . . . , 9}, j ∈ {−5, . . . , 3} },   (12.19)
1 This means that a larger number of bits was embedded with stego algorithms with higher em-
bedding capacity, which might correspond to how users would use the algorithms in practice.
Consequently, one cannot use such results to fairly compare different stego algorithms.
2 The reader is now encouraged to browse through Appendix E to become more familiar with
SVMs.
Table 12.4. Probability of detection PD when presenting the blind steganalyzer with
stego images embedded by “known” algorithms on which the steganalyzer was trained.
The false-alarm rate on the testing set of 2504 cover images was 1.04%.
Algorithm [bpnc]     PD
F5 [1.0]             99.96%
F5 [0.5]             99.60%
F5 [0.25]            90.73%
JP HS [1.0]          99.84%
JP HS [0.5]          98.28%
JP HS [0.25]         73.52%
MBS1 [1.0]           99.96%
MBS1 [0.5]           99.80%
MBS1 [0.3]           98.88%
MBS1 [0.15]          71.19%
MBS2 [0.3]           99.12%
MBS2 [0.15]          77.92%
OutGuess [1.0]       99.96%
OutGuess [0.5]       99.96%
OutGuess [0.25]      98.12%
Steghide [1.0]       99.96%
Steghide [0.5]       99.84%
Steghide [0.25]      96.37%
Cover                98.96%
Table 12.5. Probability of detection PD when presenting the blind steganalyzer with
stego images embedded by algorithms on which it was not trained.

Algorithm [bpnc]     PD
–F5 [1.0]            99.08%
–F5 [0.5]            99.60%
–F5 [0.25]           98.48%
MM2 [0.66]           99.64%
MM2 [0.42]           99.20%
MM2 [0.26]           53.67%
MM3 [0.66]           99.72%
MM3 [0.42]           99.32%
MM3 [0.26]           58.51%
Jsteg [1.00]         42.41%
Jsteg [0.50]         42.43%
Jsteg [0.25]         42.05%
Cover                98.96%
The stego images for MMx were embedded with relative payloads 0.66, 0.42, and 0.26
bpnc, which correspond to matrix embedding with binary Hamming codes
[2^p − 1, 2^p − p − 1] for p = 2, 3, 4 (see Table 8.1).
Because the embedding capacity of both –F5 and MMx is equal to the number of
non-zero DCT coefficients, payloads expressed in bpnc also express the relative
payload size with respect to the maximal embedding capacity of both algorithms.
Finally, for Jsteg the images were embedded with relative payloads 1.0, 0.5, and
0.25. The quality factor for all stego images was again set to 75, the quality
factor of all images from the training set.
The results shown in Table 12.5 demonstrate that the blind detector can gen-
eralize to –F5 and MMx. Even though the embedding mechanism of –F5 is very
different from those of the six algorithms on which the blind steganalyzer was
trained, images produced by –F5 are reliably detected.
Quite surprisingly, however, the images embedded by Jsteg are detected the
least reliably despite the fact that Jsteg is a relatively poor steganographic al-
gorithm that is easily detectable using a variety of targeted attacks, such as the
one explained in Section 5.1.2 or [21, 156, 157]. Jsteg introduces severe arti-
facts into the histogram of DCT coefficients and the steganalyzer has not seen
such artifacts before. Because it was tuned to a low probability of false alarm,
it conservatively assigns such images to the cover class. This analysis underscores
the need to train the classifier on as diverse a set of steganographic algorithms as
possible to give it the ability to generalize.
An alternative approach to blind steganalysis that is less prone to such catas-
trophic failures but gives an overall smaller accuracy on known algorithms is the
construction based on a one-class detector. In the next section, we describe one
simple approach to such one-class steganalyzers implemented using a one-class
neighbor machine.
where pc is the distribution of cover features. In other words, the sparsity mea-
sure characterizes the closeness of x to the training set. The OC-NM works by
identifying a threshold γ so that all features x with Sf (x) > γ are classified as
stego.
The training of an OC-NM is simple because we need only to find the threshold
γ. It begins with calculating the sparsity of all training samples m[i] = Sf (f [i, .]),
1 ≤ i ≤ l, and ordering them so that m[1] ≥ m[2] ≥ . . . ≥ m[l]. By setting γ =
m[PFA l], we ensure that a fraction of exactly PFA training features are classified
as stego. Assuming the features f [i, .] are iid samples drawn according to the
pdf pc , with l → ∞ the OC-NMs were shown to converge to a detector with the
required probability of false alarm [181].
Note that there is a key difference between utilizing the training features in
OC-NM and in classifiers based on SVMs. While SVMs use only a fraction of
them during classification (support vectors defining the hyperplane), OC-NMs
use all training features, which shows the relation to classifiers of the nearest-
neighbor type.
The original publication on OC-NMs [181] presents several types of sparsity
measures. In this book, we adopted the one based on the so-called Hilbert kernel
density estimator

Sf(x) = log ( 1 / Σ_{i=1}^{l} 1/‖x − f[i, .]‖_2^{hd} ),   (12.21)

where ‖x‖_2 is the Euclidean norm of x. The parameter h in (12.21) controls the
smoothness of the sparsity measure. Intuitively, when x “is surrounded” by the
training features, the sparsity measure will be small (negative). Points that are
farther away from the training features will lead to larger values of Sf .
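The OC-NM training procedure can be sketched as follows. The leave-one-out evaluation of the training sparsities and the small numerical guard are implementation choices of this sketch, not part of the original method, and all names are illustrative:

```python
import numpy as np

def sparsity(x, F, h=1.0):
    """Hilbert-kernel sparsity measure (12.21) of x w.r.t. training features F."""
    d = F.shape[1]
    dist = np.linalg.norm(F - x, axis=1)
    dist = np.maximum(dist, 1e-12)          # numerical guard (not in the book)
    return float(np.log(1.0 / np.sum(dist ** (-h * d))))

def train_ocnm(F, p_fa, h=1.0):
    """Find gamma so that a fraction p_fa of the training (cover) features
    is classified as stego; leave-one-out avoids zero distances."""
    m = np.array([sparsity(F[i], np.delete(F, i, axis=0), h) for i in range(len(F))])
    m = np.sort(m)[::-1]                    # m[0] >= m[1] >= ... >= m[l-1]
    k = max(int(np.floor(p_fa * len(F))), 1)
    return m[k - 1]                         # decide stego when S_f(x) > gamma

rng = np.random.default_rng(0)
F = rng.normal(size=(200, 10))              # stand-in for calibrated cover features
gamma = train_ocnm(F, p_fa=0.01)
print(sparsity(np.full(10, 8.0), F) > gamma)   # a far-away point is flagged as stego
```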
Table 12.6. Detection accuracy of OC-NM on F5, MBS1, MBS2, JP HS, OutGuess, and
Steghide. For comparison, the detection accuracy of the cover-versus-all-stego classifier
described in Section 12.2.5 is repeated in this table. We emphasize that the
cover-versus-all-stego classifier was trained on a mixture of stego images embedded by
the same six algorithms.
Table 12.6 shows the percentage of correctly classified stego images embed-
ded with six different steganographic algorithms and various payloads. We also
reprint the detection accuracy of the cover-versus-all-stego classifier from Sec-
tion 12.2 for comparison. As one could expect, the binary cover-versus-all-stego
classifier has a better performance because it was trained on a mixture of stego
images embedded using the same algorithms. The advantage of the OC-NM
becomes apparent when testing on algorithms unseen by the binary classifier
(Table 12.7). Here, the difference in performance is only marginal for –F5 and
MMx. The biggest difference occurs for Jsteg images, which are reliably detected
as stego by the OC-NM and almost completely missed by the cover-versus-all-
stego classifier. The last row of the table shows the false-alarm rates for both
methods, which are fairly similar.
Table 12.7. Detection accuracy of OC-NM on –F5, MMx, and Jsteg. For comparison,
the detection accuracy of the cover-versus-all-stego classifier described in Section 12.2.6
is repeated here. We note that the cover-versus-all-stego classifier was not trained on
stego images produced by these three algorithms.
Steganalyzers targeted at a specific embedding algorithm naturally have a better performance than the blind classifiers from Sections 12.2
and 12.3 because the distribution of stego images will be less spread out. This
section provides the reader with an example of how accurate the targeted attacks
are for the following steganographic algorithms: F5, –F5, nsF5, Model-Based
Steganography without deblocking (MBS1), JP Hide&Seek, Steghide, MMx, and
three versions of perturbed quantization in double-compressed JPEG images as
described in Chapter 9 (PQ, PQe, PQt).
As in the previous section, all classifiers were implemented as soft-margin
support vector machines with Gaussian kernel trained on 3500 cover images and
3500 stego images3 with an even mixture of short payloads 0.05, 0.1, 0.15, and
0.2 bpnc. The detection performance was evaluated on 2504 cover images from
database D2 and their stego versions embedded with the same payload mixture
as in the training set. For this experiment, the quality factor of all JPEG images
was set to 70. For the three perturbed quantization methods, the primary and
secondary quality factors were 85 and 70, respectively. The cover images for these
three methods were JPEG images doubly compressed with quality factors 85 and
70.
Table 12.8 shows the detection results obtained using the performance criteria
PE and PD^{-1}(1/2) (see Section 10.2.4).
Because the payloads embedded using each algorithm were the same, this table
tells us how detectable different steganographic methods are. Note that even
though the tested payloads are relatively short, very reliable targeted attacks
are possible on OutGuess, F5, –F5, Model-Based Steganography, Steghide, and
JP Hide&Seek. The improved version of F5 without shrinkage (nsF5) is the
best tested method that does not use any side-information at the sender (the
uncompressed JPEG image). As expected, the methods that do use this side-
information (versions of PQ and MMx) are among the least detectable.
To summarize, this section demonstrates a general approach to targeted ste-
ganalysis that does not require knowledge of the embedding mechanism. The
basic idea is to train a binary classifier to recognize the class of cover images and
stego images embedded with the particular steganographic method.
Let us assume that we have l stego images embedded with a mixture of change
rates β[i]. Denoting their feature vectors f [i, j], i = 1, . . . , l, j = 1, . . . , d, we will
seek the estimate of the change rate as a linear combination of the features by
minimizing

Σ_{i=1}^{l} (f[i, .]θ − β[i])².   (12.23)
This least-squares problem has a standard solution in the form (see Section D.8)

θ̂ = (f^T f)^{−1} f^T β.   (12.24)
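Under this linear model, the estimator (12.23)–(12.24) is a plain least-squares fit. The synthetic data below merely illustrate the mechanics; the feature dimensions and noise level are arbitrary choices of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
l, d = 500, 20                              # l training stego images, d features
theta_true = rng.normal(size=d)             # unknown "ground-truth" weights

f = rng.normal(size=(l, d))                 # rows are feature vectors f[i, .]
beta = f @ theta_true + rng.normal(scale=0.01, size=l)   # known change rates

# Normal equations (12.24): theta_hat = (f^T f)^{-1} f^T beta.
theta_hat = np.linalg.solve(f.T @ f, f.T @ beta)

beta_hat = f @ theta_hat                    # change-rate estimates
mae = np.median(np.abs(beta_hat - beta))    # the MAE criterion of Table 12.9
print(round(mae, 3))
```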
Figure 12.1 Scatter plot showing the estimated change rate for F5 (Section 7.3.2) and
Model-Based Steganography (Section 7.1.2) versus the true change rate. All estimates
were made on images from the testing set.
Table 12.9. Median absolute error (MAE) and bias for the quantitative change-rate
estimator built from a 275-dimensional feature set implemented using a linear
least-square fit (results for the testing set).
The blind steganalyzer for JPEG images explained in the previous sections used
features that were computed directly from DCT coefficients. This brought two
important advantages – the features were sensitive to embedding and, in combi-
nation with calibration, they were less sensitive to image content. The calibration
worked because quantized DCT coefficients are robust with respect to small (em-
bedding) changes when the quantization is performed on a desynchronized 8 × 8
grid. Unfortunately, this principle is unavailable in the spatial domain. However,
it is possible to calibrate and simultaneously increase the features’ sensitivity
to embedding by calculating the features from the stego image noise residual
r = y − F (y) obtained using a denoising filter F . Working with the noise resid-
ual instead of the whole image has two important advantages:
1. The cover image content is suppressed and thus the features exhibit less vari-
ation from image to image.
2. The SNR between the stego noise and the cover image is increased, which
leads to increased sensitivity of the features to embedding.
Other blind steganalyzers in the spatial domain are built on the same principle. The role of the denoising filter is replaced with local content
predictors [67, 252] or image-quality metrics [11]. All methods implicitly make
use of the difference between the stego image and its low-pass-filtered version.
The goal is, however, the same – to remove the image content and make the
features more sensitive to stego noise.
σ̂²[i, j] = min( σ̂₃²[i, j], σ̂₅²[i, j], σ̂₇²[i, j], σ̂₉²[i, j] ),   (12.25)

where the local variance estimated from the w × w neighborhood of wavelet coefficients is

σ̂w²[i, j] = max ( 0, (1/w²) Σ_{k,l=−⌊w/2⌋}^{⌊w/2⌋} (H[i + k, j + l])² − σn² ).   (12.26)
4 This choice is justifiable for detection of steganography that imposes independent modifica-
tions to cover-image pixels, such as ±1 embedding. Steganalysis of stego noise with different
spectral characteristics (e.g., with its energy shifted towards medium spatial frequencies as
in [36, 76]) may benefit from analyzing subbands from higher-level wavelet decomposition.
Figure 12.2 Example of a grayscale 512 × 512 image and its wavelet transform using an
8-tap Daubechies wavelet. The H, V, and D subbands contain the high-frequency noise
components of the image along the horizontal, vertical, and diagonal directions.
rH[i, j] = H[i, j] − ( σ̂²[i, j] / (σ̂²[i, j] + σn²) ) H[i, j].   (12.27)
The noise residuals rD and rV can be obtained using similar formulas.
The steganographic features are calculated for each subband separately as the
first nine central absolute moments μc [k], k = 1, . . . , 9, of the noise residual
μc[k] = C Σ_{i,j=1}^{256} |rH[i, j] − r̄H|^k,   (12.28)

where r̄H = C Σ_{i,j} rH[i, j] is the sample mean of rH and C = 1/256² is a
normalization constant. Since there are three subbands, there will be a total of 27
features for a grayscale image and 3 × 27 features for a color image.
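A sketch of this feature extraction in Python follows. For brevity it uses the Haar wavelet instead of the 8-tap Daubechies wavelet of Figure 12.2, treats the stego-noise variance σn² as a parameter, and assumes SciPy's `uniform_filter` for the local averages; all function names are illustrative:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def haar_subbands(x):
    """One-level 2-D Haar transform (the book uses an 8-tap Daubechies
    wavelet; Haar keeps this sketch short)."""
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)   # column-wise low-pass
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)   # column-wise high-pass
    H = (lo[0::2] - lo[1::2]) / np.sqrt(2)        # horizontal detail
    V = (hi[0::2] + hi[1::2]) / np.sqrt(2)        # vertical detail
    D = (hi[0::2] - hi[1::2]) / np.sqrt(2)        # diagonal detail
    return H, V, D

def residual_moments(image, sigma_n2=0.5):
    feats = []
    for S in haar_subbands(image.astype(float)):
        # Local variances (12.26) from w x w neighborhoods, w = 3, 5, 7, 9 ...
        est = [np.maximum(uniform_filter(S * S, size=w) - sigma_n2, 0.0)
               for w in (3, 5, 7, 9)]
        var = np.minimum.reduce(est)              # ... and their minimum (12.25)
        r = S - var / (var + sigma_n2) * S        # Wiener-type residual (12.27)
        c = np.abs(r - r.mean())
        feats += [np.mean(c ** k) for k in range(1, 10)]   # moments (12.28)
    return np.array(feats)                        # 3 subbands x 9 moments = 27

feats = residual_moments(np.random.default_rng(0).normal(size=(256, 256)))
print(feats.shape)    # (27,)
```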
Note that the amount of data samples (2562 ) in each subband does not al-
low estimating all moments accurately. In particular, only the sample moments
Table 12.10. Total error probability PE and PD^{-1}(1/2) for detectors constructed to detect
±1 embedding in raw images (RAW), their JPEG versions (JPEG80), and scans of film
(SCAN). The whole ROCs are displayed in Figure 12.3.

          PE       PD^{-1}(1/2)
RAW       16.6%    1.4%
JPEG80    1.7%     0%
SCAN      22.8%    6.1%
μc [k] for k ≤ 4 are relatively accurate estimates of the true moments. However,
even though the higher-order sample moments are not accurate estimates of the
moments, they can still be valuable as features in steganalysis as long as they
sensitively react to embedding changes.
Figure 12.3 ROC for detection of ±1 embedding in images from the training databases
RAW, JPEG80, and SCAN. The stego images contained an equal mixture of images
embedded with relative payloads 0.1, 0.25, and 0.50 bpp for RAW and JPEG80, and 0.25,
0.5, 0.75, and 1.0 bpp for SCAN.
Summary
• The goal of blind steganalysis is to detect an arbitrary steganographic method.
• It works by first selecting a set of numerical features that can be computed
from an image. The feature set plays the role of a low-dimensional model of
cover images. Then, a classifier is trained on many examples of cover and stego
image features to recognize cover and stego images in the feature space.
• The performance of a blind steganalyzer is typically evaluated on a separate
testing set using the ROC curve or selected numerical characteristics, such as
the total error probability, PE, or the false-alarm rate at 50% detection rate,
PD^{-1}(1/2).
• A reliable blind steganalysis classifier for JPEG images can be constructed
from various first-order and higher-order statistics of quantized DCT coefficients.
Calibration further improves the classifier performance because it
makes the features more sensitive to embedding while removing the influence
of the cover content.
• A blind steganalyzer can be built either as a binary cover-versus-all-stego
classifier or as a one-class classifier. In the first case, the two classes are formed by
cover images and stego images embedded by as many steganographic methods
as possible with varying payload. The alternative approach is to train a one-class
steganalyzer that recognizes only covers and marks features incompatible
with the training set as stego (or suspicious).
• The equivalent of calibration in the spatial domain is the principle of computing
the features from the image noise residual obtained using a denoising filter.
• An example of a blind steganalysis method in the spatial domain is shown.
It is based on computing higher-order moments of the image noise residual in
high-frequency wavelet subbands.
• Blind steganalysis can also be used to construct targeted and quantitative
attacks and to classify stego images into known steganographic programs.
13 Steganographic capacity
This chapter starts from the assumption that it is unlikely that a perfectly secure stegosystem can ever be built
for real covers due to their overwhelming complexity. Under the assumption that
the steganographic system is not perfectly secure, the secure payload is defined
in absolute terms as the critical size of payload for which the KL divergence be-
tween covers and stego images reaches a certain fixed value. The secure payload
defined in this way is bounded by the square root of the number of pixels, n,
which means that the safe communication rate approaches zero as 1/√n. The
result is supported in Section 13.2.2 by experiments carried out for a selected
embedding algorithm and a chosen blind steganalyzer.
Before starting with the technical arguments, we note that Alice may attempt
to determine the secure payload experimentally using current best blind stegana-
lyzers on a large database of cover images from the investigated cover source. By
analyzing images embedded with different payloads, she can estimate the secure
payload as the critical size of payload at which her steganalyzer starts making
random guesses (e.g., PE ≥ 0.5 − δ for some selected δ; see (10.15) for the defi-
nition of PE ). Secure payload defined this way informs the prisoners about the
longest message that can be safely embedded given the current state of the art
in steganalysis. The obvious disadvantage is that the estimate is now tied to a
specific steganalyzer. Alice can thus obtain only an upper bound on the true size
of the secure payload, which may dramatically change with further progress in
steganalysis. The reader is referred to [102, 95, 217] for some recent results.
"
n
A(ỹ|y) = A(ỹ[i]|y[i]). (13.1)
i=1
E [d(x, y)] = Pc (x) Pk (k) Pm (m)d (x, Embn (x, k, m)) ≤ D1 , (13.2)
x k m
Pc (y)A(ỹ|y)d(ỹ, y) ≤ D2 , (13.3)
y,ỹ
1 Here, we use the subscript n to stress the fact that Embn and Extn act on n-element covers.
In the theorem, the supremum is taken over all feasible channels Q and the
infimum over all attack channels A satisfying the distortion constraint.
For the passive-warden scenario, A(ỹ|y) = 1 when ỹ = y, and the capacity
result simplifies to the following corollaries.
Corollary 13.2. [Passive warden] For the passive-warden case (D2 = 0),
The capacity can be calculated for some simple cover sources, such as the
source of uniformly distributed random bits, X = {0, 1}, when covers x are
Bernoulli sequences B(1/2) (see Appendix A for the definition) and the distortion
is the Hamming distance dH [180]. It has also been established for iid Gaussian
sources [49].
where d = 1 − 2^{−H(D2)} and H(x) is the binary entropy function (Figure 8.2).
In this special case, the steganography constraint Pc = Ps does not influence the
steganographic capacity because the capacity-reaching distribution is Ps = Pc .
It is possible to construct steganographic schemes that reach the capacity using
random linear codes [254].
More realistic cover sources formed by iid Gaussian sequences were studied
in [49]. The authors computed a lower bound on the capacity of perfectly secure
stegosystems for such sources and studied the increase in capacity for ε-secure
systems. They also described a practical QIM2 lattice-based construction
for high-capacity secure stegosystems and contrasted their work with results ob-
tained for digital watermarking that do not include the steganographic constraint
Pc = Ps . The authors of [237] describe codes for construction of high-capacity
stegosystems and show how their approach can be generalized from iid sources
to Markov sources and sources over continuous alphabets.
Finally, we note that the work [180] also investigated steganographic capacity
for an alternative definition of an almost sure distortion of the attack channel in
which the channel distortion constraint is replaced with the almost-sure constraint Pr {d(y, ỹ) > D2} = 0,3
and showed that for this form of the active warden, the steganographic capacity
is also given by Theorem 13.1.
“Thanks to the Central limit theorem, the more covertext we give the warden, the
better he will be able to estimate its statistics, and so the smaller the rate at which
[the steganographer] will be able to tweak bits safely. The rate might even tend to
zero...”
The first insight into the problem was obtained from analysis of batch steganog-
raphy and pooled steganalysis [132]. In batch steganography, Alice and Bob try
to minimize the chances of being caught by dividing the payload into chunks
(not necessarily uniformly) and embedding each chunk in a different image. The
warden is aware of this fact and thus pools her results from testing all images
sent by both prisoners. One of the main results obtained in this study is the fact
that if there exists a detector for the stegosystem, the secure payload grows only
as the square root of the number of communicated covers. This result could be
interpreted as the square-root law for a single image by dividing it into smaller
blocks. In the next section, we describe a different proof of this law for covers
modeled as sequences of iid variables and for steganographic algorithms that do
not preserve the cover model. In Section 13.2.2, the SRL is experimentally veri-
fied using a blind steganalyzer in the DCT domain for the embedding operation
of F5 (Section 7.3.2).
In other words, it is assumed that every message can be embedded in every cover.
If the steganographic method has this property of homogeneity, one could define
the secure payload as log2 |M| for the largest set M for which the KL divergence
between cover and stego images is below a fixed threshold ε > 0.
Let us assume that the cover source produces sequences of n iid realizations
of a discrete random variable x with probability mass function p0 [i] > 0 for all i,
where i is an index whose range is determined by the range of cover elements (for
example, i ∈ {0, . . . , 255} for an 8-bit grayscale image). Let us further assume
that after embedding using change rate β, the stego image can be described as
pβ[i] = p0[i] + β r[i],   (13.13)
where r[i] are constants4 independent of β, such that r[k] ≠ 0 for at least one
k. Before discussing the plausibility of assuming that the impact of embedding
on the cover source is in the form of equation (13.13), notice that for imperfect
steganography there must indeed exist at least one r[k] ≠ 0, otherwise the
steganographic method would be undetectable within our cover-source model.
Assumption (13.13) holds for many practical stegosystems and digital-media
covers. In general, it is true whenever the embedding visits individual cover
elements and applies an independent embedding operation to each visited cover
element. This is true, for example, for steganographic schemes whose embedding
impact can be modeled as adding to x an independent random variable ξ with
pmf u that is linear in β. This is because the distribution of the stego image
elements y = x + ξ is the convolution u ⋆ p0, or pβ[i] = Σ_j u[i − j] p0[j].
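For ±1 embedding, the linearity asserted in (13.13) can be verified numerically. The kernel u = (β/2, 1 − β, β/2) and the toy 16-bin histogram below are illustrative choices, and boundary effects at the ends of the range are ignored in this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
p0 = rng.dirichlet(np.ones(16))        # toy cover histogram (pmf)

def stego_pmf(p0, beta):
    """pmf after +-1 embedding with change rate beta: convolution of p0 with
    u = (beta/2, 1 - beta, beta/2); boundary effects are ignored here."""
    u = np.array([beta / 2, 1.0 - beta, beta / 2])
    return np.convolve(p0, u, mode="same")

# (13.13): p_beta = p0 + beta * r with r independent of beta.
r = stego_pmf(p0, 1.0) - p0            # recover r from the case beta = 1
print(np.allclose(stego_pmf(p0, 0.3), p0 + 0.3 * r))   # True: linear in beta
```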
Examples of practical steganographic schemes that do satisfy (13.13) include
LSB embedding (see equation (5.10)), ±1 embedding (see Section 7.3.1 and Ex-
ercise 7.1), stochastic modulation (Section 7.2.1), and the embedding operation
of F5, –F5, nsF5, and MMx algorithms (see Exercise 7.2 for F5, Section 9.4.7 for
nsF5, and Section 9.4.3 for MMx). Even though the relationship (13.13) may not
be valid for steganography that preserves the first-order statistics (histogram) of
the cover, such as OutGuess, a similar dependence5 may exist for some proper
subsets of the cover or for local pairs of cover elements (since most stego meth-
ods disturb higher-order statistics). For example, OutGuess preserves only the
global histogram but not the histograms of individual DCT modes. Thus, as-
sumption (13.13) applies to virtually all practical steganographic schemes for
some proper model of the cover.
1. If the absolute number of embedding changes β(n)n increases faster than √n,
in the sense that lim_{n→∞} β(n)n/√n = ∞, then for sufficiently large n there
exist steganalysis detectors with arbitrarily small probability of false alarms
and missed detection.
2. If the absolute number of embedding changes increases slower than √n,
lim_{n→∞} β(n)n/√n = 0, then the steganographic algorithm can be made
ε-secure for any ε > 0 for sufficiently large n.
4 Note that since pβ is a pmf, we must have Σ_i r[i] = 0.
5 When pβ is the histogram of pairs (groups) of pixels, the dependence of pβ on β may not
be linear, but the arguments of this section apply as long as ∂pβ[k]/∂β |_{β=0} ≠ 0 for some
(multi-dimensional) index k.
3. Finally, when lim_{n→∞} β(n)n/√n = C, the stegosystem is
(C²/2) Σ_i r²[i]/p0[i]-secure for sufficiently large n.
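The three regimes can be illustrated numerically on a toy iid cover model. The pmf p0 and perturbation r below are arbitrary choices satisfying Σ_i r[i] = 0; for n iid cover elements the total KL divergence is n times the per-element divergence:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two pmfs."""
    return float(np.sum(p * np.log(p / q)))

p0 = np.array([0.25, 0.25, 0.25, 0.25])      # toy iid cover model
r = np.array([0.1, -0.1, 0.05, -0.05])       # embedding perturbation, sum(r) = 0

C = 1.0
limit = (C ** 2 / 2) * np.sum(r ** 2 / p0)   # the bound from Part 3

for n in (10 ** 4, 10 ** 6, 10 ** 8):
    beta = C / np.sqrt(n)                    # number of changes grows as sqrt(n)
    print(n, n * kl(p0 + beta * r, p0))      # stays near `limit`
    beta = n ** -0.25                        # changes grow faster than sqrt(n)
    print(n, n * kl(p0 + beta * r, p0))      # grows without bound
```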
Proof. Before proving the theorem, we clarify one technical issue. The theorem
is formulated with respect to the number of embedding changes rather than pay-
load. This is quite understandable because statistical detectability is influenced
by the impact of embedding and not the payload. However, for steganographic
schemes that exhibit a linear relationship between the number of changes and
payload, such as schemes that do not employ matrix embedding, the theorem
can be easily seen in terms of payload.
Proof of Part 1: Under Kerckhoffs’ principle, the warden knows the distri-
bution of cover elements p0 . Her task is to distinguish whether she is observing
realizations of a random variable with pmf pβ with β = 0 or with β > 0. To show
the first part of the SRL, the warden needs only to construct a sufficiently good
detector rather than the best possible one. In particular, it is sufficient to con-
strain ourselves to the index k for which r[k] = 0, say r[k] > 0. Let (1/n)hβ [i] be
the random variable corresponding to the value of the ith bin of the normalized
histogram of the stego image embedded with change rate β. We will investigate
the following scaled variable:
νβ,n = √n ( (1/n) hβ[k] − p0[k] ).   (13.14)
From the assumption of independence of stego elements, νβ,n is a binomial ran-
dom variable with expected value and variance
E[νβ,n] = β √n r[k],   (13.15)
Var[νβ,n] = pβ[k] (1 − pβ[k]).   (13.16)
This step will simplify further arguments and, it is hoped, highlight the main
idea without cluttering it with technicalities. The interested reader is referred
to Exercise 13.1 to see how the arguments can be carried out without assump-
tion (13.17).
The variance of νβ,n can be bounded as

Var[νβ,n] = σ²β,n ≤ 1/4   (13.18)

for all β ∈ [0, 1] because x(1 − x) ≤ 1/4.
The difference between the mean values,

E[νβ,n] − E[ν0,n] = E[νβ,n] = β √n r[k],   (13.19)

tends to ∞ with n → ∞ when β √n → ∞. Note that for β = 0, ν0,n is Gaussian
N(0, σ²0,n) with σ²0,n ≤ 1/4.
Consider the following detector: decide "stego" whenever νβ,n > T, where T is a
fixed threshold. We will now show that T can be chosen to make
the detector probability of false alarms and missed detections satisfy
PFA ≤ εFA and PMD ≤ εMD for arbitrary εFA, εMD > 0 for sufficiently large n. The threshold T(εFA) will be
determined from the requirement that the right tail, x ≥ T(εFA), of the Gaussian
variable N(0, σ²0,n) has probability at most εFA. This can be conveniently written
using the tail probability function Q(x) = (1/√(2π)) ∫_x^∞ e^{−t²/2} dt for a standard
normal random variable (see Appendix A) as6

εFA ≥ Q( T(εFA)/(1/2) ) ≥ Q( T(εFA)/σ0,n )  ⇒  (1/2) Q^{−1}(εFA) ≤ T(εFA),   (13.24)
or

(1/2) Q^{−1}(εMD) ≤ β √n r[k] − T(εFA),   (13.26)

which can be satisfied for sufficiently large n ≥ n_{εMD} because β √n → ∞.
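The argument leading to (13.24) and (13.26) translates directly into a small computation. SciPy's `norm.isf` plays the role of Q⁻¹; the function name, the doubling search, and all numeric values are illustrative choices of this sketch:

```python
import math
from scipy.stats import norm

def required_n(eps_fa, eps_md, beta_of_n, r_k):
    """Smallest n (found by doubling) at which the detector nu > T achieves
    P_FA <= eps_fa and P_MD <= eps_md, following (13.24) and (13.26)."""
    T = 0.5 * norm.isf(eps_fa)              # T(eps_FA) = (1/2) Q^{-1}(eps_FA)
    n = 1
    while 0.5 * norm.isf(eps_md) > beta_of_n(n) * math.sqrt(n) * r_k - T:
        n *= 2
    return n

# Constant change rate beta = 0.01 and r[k] = 0.05 (both illustrative).
n_min = required_n(1e-3, 1e-3, lambda n: 0.01, 0.05)
print(n_min)
```

With a constant change rate, β√n grows without bound, so any target error pair becomes achievable once n is large enough; doubling β roughly quarters the required n.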
Proof of Part 2: We now show the second part of the SRL. In particular, we
prove that if β √n → 0, the steganography is ε-secure for any ε > 0 for sufficiently
large n.
Let x and y be n-dimensional random variables representing the elements
of cover and stego images. Due to the independence assumption about cover
This establishes the result that DKL (x||y) can be made arbitrarily small for
sufficiently large n. Putting this another way, when the number of embedding
changes is smaller than the square root of the number of cover elements n, the
steganographic scheme becomes asymptotically perfectly secure.
Proof of Part 3: In the third case, when β √n → C, the cubic term also goes to
zero because β → 0. The quadratic term can be bounded as

(nβ²/2) Σ_i r²[i]/p0[i] ≤ (C²/2) Σ_i r²[i]/p0[i].   (13.31)
Having established the SRL of imperfect steganography, we now discuss its im-
plications. Put in simple words, the law states that the secure payload of a wide
class of practical embedding methods is proportional to the square root of the
number of cover elements that can be used for embedding. For the steganographer,
this means that an r-times larger cover image can hold only a √r-times
larger payload at the same level of statistical detectability (the same KL
divergence). The SRL also means that it is easier for the warden to detect the same
relative payload in larger images than in smaller ones.
The SRL has additional important implications for steganalysis. For example,
when evaluating the statistical detectability of a steganographic method for a
fixed relative payload, we will necessarily obtain better detection on a database
of large images than on small images. To make comparisons meaningful, we
should fix the “root rate,” defined as the number of bits per square root of the
number of cover elements, rather than the rate. This would also alleviate issues
with interpreting steganalysis results on a database of images with varying size.
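As a small numerical illustration of the root rate (a sketch, not code from the book; the constant c is arbitrary), the following compares the rate with the root rate when the payload grows as m(n) = c√n:

```python
import math

def rate(m_bits, n):
    """Relative payload: message bits per cover element."""
    return m_bits / n

def root_rate(m_bits, n):
    """Root rate: message bits per square root of the number of cover elements."""
    return m_bits / math.sqrt(n)

# Under the SRL, a payload of constant statistical detectability scales as
# m(n) = c * sqrt(n) for some constant c (the value 10.0 is arbitrary).
c = 10.0
for n in (10_000, 40_000, 160_000):
    m = c * math.sqrt(n)
    # The rate falls off as 1/sqrt(n) while the root rate stays constant at c.
    print(n, rate(m, n), root_rate(m, n))
```

Comparing root rates across databases of differently sized images therefore puts all image sets on an equal footing, while comparing rates does not.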
Steganographic capacity 287
For schemes that do use matrix embedding, due to the non-linear relationship between the change rate and the relative payload, the critical size of the payload does not necessarily scale with √n. This is because βn changes may communicate a payload of up to nH(β) bits, where H(x) is the binary entropy function. Because H(x) ≈ x log(1/x) for x → 0, we can say that the critical size of the payload picks up an additional logarithmic factor and scales as nβ log(1/β), which for β ≈ 1/√n behaves as √n log n.
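The extra logarithmic factor can be observed numerically. The following sketch (illustrative only) compares nH(β) with the leading term nβ log₂(1/β) for β = 1/√n:

```python
import math

def binary_entropy(x):
    """H(x) = -x*log2(x) - (1-x)*log2(1-x), the binary entropy function in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

for n in (10**4, 10**6, 10**8):
    beta = 1.0 / math.sqrt(n)           # change rate on the square-root boundary
    payload = n * binary_entropy(beta)  # bits communicated by beta*n changes
    leading = n * beta * math.log2(1.0 / beta)  # the x*log(1/x) approximation
    print(n, payload, leading, payload / leading)
```

The ratio slowly approaches 1, confirming that the x log(1/x) term dominates H(x) for small x.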
On the more fundamental level, the SRL essentially tells us that the conven-
tional approach to steganography is highly suboptimal because we already know
from Section 6.2 and Section 13.1 that when the cover source is known, the se-
cure payload is linear in the number of cover elements (the rate is positive). The
reader should also compare this result with the capacity of noisy communication
channels [50], where the number of message bits that can be sent without any
errors increases linearly with the number of communicated symbols.
Finally, we note that the law can be generalized in several directions. It has
been established for a more general class of Markov covers and steganography
realized by applying a mutually independent embedding operation at each cover
element [71]. Also, the linear form of pβ with respect to β in assumption (13.13)
can be relaxed. All that is really necessary to carry out essentially the same steps
in the proof is the assumption that

  ∂p_β[k]/∂β |_{β=0} ≠ 0 for some k.   (13.32)
Constraining ourselves to some right neighborhood of 0, β ∈ [0, β0 ), we could
expand pβ [i] using Taylor expansion on this interval and then repeat similar
arguments as in the proof of the SRL under assumption (13.13).
should all have the same quality factor to avoid biasing the experimental re-
sults. Ideally, the sets Si should contain images from different sources. When
attempting to comply with this requirement, however, an infeasibly high num-
ber of images would be needed. A more economical approach would be to use one
large database of raw, never-compressed images, D, and create the individual im-
age sets by cropping, scaling, and compressing the images from D. Downscaling
would, however, likely introduce an unwanted bias because downsampled images
contain more high-frequency content, which would produce JPEG images with a
different profile of non-zero DCT coefficients across the DCT modes. While crop-
ping avoids this problem, it may produce images with singular content, such as
portions of sky or grass. To avoid creating such pathological images, the cropping
was carried out so that the ratio of non-zero coefficients to the number of pixels
in the image was kept approximately the same as for the original-size image.
The images used in the experiments below were all created from a database of
6000 never-compressed raw images stored as 8-bit grayscales. From this database,
a total of L = 15 image sets that contained images with n = 20×10³, 40×10³, . . . , 300×10³ non-zero DCT coefficients were prepared by cropping. The
cover images from all image sets were singly compressed JPEGs with quality
factor 80.
The statistical detectability was evaluated using the targeted blind stegana-
lyzer containing 274 features described in Section 12.4 implemented using SVMs
with Gaussian kernel. Each image set, Si , was split into two disjoint subsets, one
of which was used for training and the other for testing the steganalyzer. The
training set contained 3500 images and the corresponding 3500 stego images,
while the testing set consisted of 2500 cover and the corresponding 2500 stego
images. The ability of the steganalyzer to detect the embedding on the test-
ing set was evaluated using 1 − PE , where PE = 12 (PFA + PMD ) is the minimal
average classification error under equal prior probabilities of cover and stego im-
ages defined in equation (10.15). For an undetectable steganography, PE ≈ 0.5,
while PE ≈ 1 corresponds to perfect detection. Finally, the training and testing
of the steganalyzer was repeated 100 times with different splitting of Si into the
training and testing subsets and the average PE reported as the result.
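The error measure P_E is straightforward to compute from detector outputs. The sketch below uses made-up scores and a hypothetical thresholding detector, not the SVM steganalyzer described above:

```python
def p_e(p_fa, p_md):
    """Minimal average classification error under equal priors, P_E = (P_FA + P_MD)/2."""
    return 0.5 * (p_fa + p_md)

def empirical_p_e(cover_scores, stego_scores, threshold):
    """Declare 'stego' when a score exceeds the threshold (hypothetical detector)."""
    p_fa = sum(s > threshold for s in cover_scores) / len(cover_scores)
    p_md = sum(s <= threshold for s in stego_scores) / len(stego_scores)
    return p_e(p_fa, p_md)

cover = [0.1, 0.2, 0.3, 0.4]   # toy detector outputs on cover images
stego = [0.6, 0.7, 0.8, 0.9]   # toy detector outputs on stego images
print(1 - empirical_p_e(cover, stego, threshold=0.5))  # perfect separation: 1.0
```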
First, the same fixed payload was embedded across all image sets. Intuitively,
we expect the detectability to be the largest for the image set with the smallest
images, S1 , while the smallest detectability, close to 0.5, is expected for the set
with the largest images, S15 . Figure 13.1 confirms this expectation. Next, the
payload embedded in images from set Si was made linearly proportional to the
number of non-zero DCT coefficients, ni . As can be seen from Figure 13.2, the
statistical detectability increases with increasing ni . Finally, the detectability
√
stays approximately constant when the payload is made proportional to ni
(Figure 13.3). The spread around the data points in all three figures corresponds
to a 90% confidence interval.
Figure 13.4 shows the result of the fourth experiment, where for each image set, S_i, the largest absolute payload, M(n_i), for which the steganalyzer obtained a fixed error P_E = 0.1 was determined.
Figures 13.1–13.3 (plots) Detection accuracy 1 − P_E for each image set for relative payloads bpnc = 0.025, 0.075, 0.125, and 0.175 bits per non-zero DCT coefficient.
Summary
• Informally, the steganographic capacity is the size of the maximal payload that can be undetectably embedded in the sense that the KL divergence between the set of cover and stego images is zero.
• The steganographic capacity can be rigorously defined in the limit when the number of elements forming the cover object grows to infinity. It is the largest communication rate taken across all stegosystems satisfying the requirement of perfect security, possibly with constraints on the embedding and/or channel distortion.
• The capacity can be computed analytically for sufficiently simple cover sources. It is rather difficult to estimate the capacity for real digital media due to its complexity.
• The secure payload is the largest payload that can be embedded using a given steganographic scheme with a fixed statistical detectability expressed by the KL divergence between cover and stego objects.
Figure 13.4 (plot) The largest payload m(n) for which the steganalyzer obtained P_E = 0.1, as a function of the number of non-zero DCT coefficients n, with power-law fits m(n) = n^0.532 and m(n) = n^0.535, in agreement with the square-root law.
Exercises
13.1 [Berry–Esséen theorem] A version of the Berry–Esséen theorem says that for a sequence of iid random variables x[i], i = 1, . . . , n, whose distribution satisfies E[x] = 0, E[x²] = σ² > 0, and E[|x|³] = ρ < ∞, the cumulative distribution function F_n(x) of the random variable

  √n x̄/σ   (13.33)

satisfies

  |F_n(x) − Φ(x)| < ρC/(σ³√n) for all x.   (13.34)
The symbol x̄ stands for the sample mean computed from n realizations, Φ(x)
is the cdf of a normal random variable, and C is a positive constant. The
theorem essentially says that the cdf of the scaled sample mean converges uniformly to the cdf of its limiting random variable and that the rate of convergence is 1/√n.
Use this theorem to carry out the arguments in the proof of the SRL of
imperfect steganography without the Gaussian assumption (13.17). Note that
the same bound holds true for the corresponding tail probability functions
1 − Fn (x) and Q(x) = 1 − Φ(x).
A Statistics
The purpose of this appendix is to provide the reader with some basic concepts
from statistics needed throughout the main text. A good introductory text on
statistics for signal-processing applications is [59]. A discrete-valued random variable x reaching values from a finite alphabet A with probabilities p(x), x ∈ A, 0 ≤ p(x) ≤ 1, ∑_{x∈A} p(x) = 1, is described by its probability mass function (pmf) p(x). The probability that x reaches a value in S ⊂ A is Pr{x ∈ S} = ∑_{x∈S} p(x).
Example A.1: [Fair die] x ∈ {1, 2, 3, 4, 5, 6}, p(i) = 1/6. The probability that x is even is Pr{x ∈ {2, 4, 6}} = 3 × 1/6 = 1/2.
  F(x) = Pr{x ≤ x} = ∫_{−∞}^{x} f(t) dt.   (A.1)

• lim_{x→∞} F(x) = 1.
294 Appendix A. Statistics
  Pr{x ≤ μ_{1/2}(x)} ≥ 1/2 and Pr{x ≥ μ_{1/2}(x)} ≥ 1/2.   (A.10)
When more than one value μ_{1/2}(x) satisfies this equation, we again take the mean of all such values. For example, for the fair die from Example A.1, any value in the interval (3, 4) satisfies (A.10). The median is thus defined as μ_{1/2} = (3 + 4)/2 = 3.5, which coincides with the mean. In general, the median of any probability distribution that is symmetrical about its mean must be equal to the mean.
Moreover, for a random variable with continuous pdf,

  ∫_{−∞}^{μ_{1/2}(x)} f(x) dx = 1/2.   (A.11)
The sample MAD for a random variable that represents an error term is called
the Median Absolute Error (MAE).
Another common robust measure of spread is the Inter-Quartile Range (IQR).
For a random variable x, it is defined as the interval [q1 , q3 ], where qi is the ith
quartile determined from Pr{x < qi } = i/4, i = 1, . . . , 4.
The next proposition gives a formula for the expected value of a random
variable obtained as a transformation of another random variable x.
This statement is more intuitive for a discrete variable, where the transformed random variable g(x) reaches values g(x) with probabilities p(x) and thus E[g(x)] = ∑_x g(x)p(x). To see this statement for the simpler case of a continuous variable, consider g strictly monotonic and differentiable. Then, assuming for example that g is increasing, we have for the cdf of y = g(x)
  F_y(y) = ∫_{−∞}^{g⁻¹(y)} f_x(x) dx,

due to linearity of the integral. In this special case, the pdf of y = ax + b for a > 0 is

  f_y(y) = (1/a) f_x((y − b)/a).   (A.18)
The inverse function to y = F(x) is x = tan(π(y − 1/2)). Finally, if u ∼ U(0, 1), the variable tan(π(u − 1/2)) follows the Cauchy distribution.
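This inverse-transform recipe can be used directly for sampling. A minimal sketch (function names are illustrative):

```python
import math
import random

def standard_cauchy(rng):
    """If u ~ U(0,1), then tan(pi*(u - 1/2)) follows the standard Cauchy distribution."""
    return math.tan(math.pi * (rng.random() - 0.5))

def cauchy_cdf(x):
    """F(x) = 1/2 + arctan(x)/pi for the standard Cauchy distribution."""
    return 0.5 + math.atan(x) / math.pi

rng = random.Random(42)
samples = [standard_cauchy(rng) for _ in range(100_000)]
# The empirical fraction of samples below 1 should land near F(1) = 3/4.
frac = sum(s <= 1.0 for s in samples) / len(samples)
print(frac, cauchy_cdf(1.0))
```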
Statistical moments are often useful for description of the shape of a probability
distribution. For a positive integer k, the kth moment of x is defined as
By differentiating k times,

  dᵏM_x(t)/dtᵏ = ∫_R xᵏ e^{tx} f(x) dx.   (A.26)

Thus,

  dᵏM_x(t)/dtᵏ |_{t=0} = ∫_R xᵏ f(x) dx = E[xᵏ] = μ[k].   (A.27)
  x² − 2μx + μ² − 2σ²tx = x² − 2x(μ + tσ²) + (μ + tσ²)² + μ² − (μ + tσ²)²   (A.29)
   = (x − (μ + tσ²))² − 2μtσ² − t²σ⁴.   (A.30)
From the moment-generating function, we can obtain the first four moments of
a Gaussian random variable simply by differentiating Mx (t) and evaluating the
derivatives at 0. Writing higher-order derivatives as roman numerals,

  M_x^I(t) = (μ + tσ²) M_x(t) = {at t = 0} = μ,   (A.32)
  M_x^II(t) = [σ² + (μ + tσ²)²] M_x(t) = {at t = 0} = μ² + σ²,   (A.33)
  M_x^III(t) = [2σ²(μ + tσ²) + (σ² + (μ + tσ²)²)(μ + tσ²)] M_x(t)   (A.34)
   = {at t = 0} = μ³ + 3μσ²,   (A.35)
  M_x^IV(t) = {at t = 0} = 3σ⁴ + 6μ²σ² + μ⁴.   (A.36)
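These closed forms can be sanity-checked by numerical integration; the following sketch uses a simple midpoint rule with arbitrarily chosen parameters:

```python
import math

def gaussian_moment(k, mu, sigma, steps=200_000, span=12.0):
    """Approximate E[x^k] for x ~ N(mu, sigma^2) with the midpoint rule."""
    lo, hi = mu - span * sigma, mu + span * sigma
    h = (hi - lo) / steps
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += x ** k * norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return total * h

mu, sigma = 1.5, 0.8
print(gaussian_moment(3, mu, sigma), mu**3 + 3 * mu * sigma**2)              # (A.35)
print(gaussian_moment(4, mu, sigma), 3*sigma**4 + 6*mu**2*sigma**2 + mu**4)  # (A.36)
```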
Let x and y be two real-valued random variables. Their joint probability density function f(x, y) ≥ 0 defines the probability that a joint event (x, y) will fall into a Borel subset B ⊂ R²,

  Pr{(x, y) ∈ B} = ∫_B f(x, y) dx dy.   (A.37)

As with the one-dimensional counterpart, ∫_{R²} f(x, y) dx dy = 1.
The marginal probabilities for each random variable are

  f_x(x) = ∫_R f(x, y) dy,  f_y(y) = ∫_R f(x, y) dx.   (A.38)
Example A.7 shows that the converse is not generally true. Two uncorrelated
variables may be dependent.
Linear dependence between two random variables is evaluated using the correlation coefficient

  ρ(x, y) = Cov[x, y] / √(Var[x] Var[y]).   (A.44)
Often, we will also need to know the variance of a sum of random variables y = ∑_{i=1}^n a[i] x_i. The variance of y is

  Var[y] = E[y²] − (E[y])² = E[(∑_{i=1}^n a[i] x_i)²] − (∑_{i=1}^n a[i] E[x_i])²   (A.47)
   = E[∑_{i=1}^n ∑_{j=1}^n a[i] a[j] x_i x_j] − ∑_{i=1}^n ∑_{j=1}^n a[i] a[j] E[x_i] E[x_j]   (A.48)
   = ∑_{i=1}^n ∑_{j=1}^n a[i] a[j] (E[x_i x_j] − E[x_i] E[x_j])   (A.49)
   = ∑_{i=1}^n ∑_{j=1}^n a[i] a[j] Cov[x_i, x_j] = aCaᵀ,   (A.50)

where C[i, j] = Cov[x_i, x_j] = E[x_i x_j] − E[x_i] E[x_j] is the covariance between random variables x_i and x_j forming the covariance matrix C.
If x_i are pairwise uncorrelated, i.e., Cov[x_i, x_j] = 0 for all i ≠ j, then

  Var[∑_{i=1}^n a[i] x_i] = ∑_{i=1}^n a[i]² Var[x_i].   (A.51)
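The quadratic form in (A.50) can be verified by simulation. In the sketch below, the mixing matrix and coefficient vector are arbitrary choices; correlated variables are built from independent standard normals:

```python
import random

random.seed(7)
n_vars, n_samples = 3, 100_000
a = [1.0, -2.0, 0.5]
# Mixing matrix M: x = M z with z iid N(0,1), so the true covariance is C = M M^T.
M = [[1.0, 0.0, 0.0], [0.7, 0.5, 0.0], [0.2, 0.3, 0.9]]

samples = []
for _ in range(n_samples):
    z = [random.gauss(0.0, 1.0) for _ in range(n_vars)]
    samples.append([sum(M[i][k] * z[k] for k in range(n_vars)) for i in range(n_vars)])

# Sample covariance matrix (the means are zero here).
C = [[sum(s[i] * s[j] for s in samples) / n_samples for j in range(n_vars)]
     for i in range(n_vars)]

y = [sum(a[i] * s[i] for i in range(n_vars)) for s in samples]
var_y = sum(v * v for v in y) / n_samples - (sum(y) / n_samples) ** 2
quad = sum(a[i] * C[i][j] * a[j] for i in range(n_vars) for j in range(n_vars))
print(var_y, quad)  # both estimates of the same variance
```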
By differentiating,

  f_{x+y}(z) = (d/dz) ∫_{−∞}^{∞} dx ∫_{−∞}^{z} f(x, u − x) du = ∫_{−∞}^{∞} f(x, z − x) dx.   (A.54)

When x and y are independent, f(x, y) = f_x(x)f_y(y) and

  f_{x+y}(z) = ∫_{−∞}^{∞} f_x(x) f_y(z − x) dx = (f_x ∗ f_y)(z).   (A.55)

In other words, the pdf of the sum of two independent random variables is the convolution of their pdfs.
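As a small numerical example (a sketch), convolving the U(0, 1) pdf with itself reproduces the triangular pdf of the sum of two independent uniform variables:

```python
def uniform_pdf(x):
    """pdf of U(0,1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def pdf_of_sum(z, f, g, steps=10_000, lo=-2.0, hi=3.0):
    """Numerically evaluate the convolution (f*g)(z) = int f(x) g(z - x) dx."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) * g(z - (lo + (i + 0.5) * h))
               for i in range(steps)) * h

# The sum of two independent U(0,1) variables has the triangular pdf
# f(z) = z on [0,1] and f(z) = 2 - z on [1,2].
for z in (0.5, 1.0, 1.5):
    print(z, pdf_of_sum(z, uniform_pdf, uniform_pdf))
```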
We now show that the sum of two independent Gaussian variables is again
Gaussian, with mean equal to the sum of means and variance equal to the sum
of variances. To this end, we use the following lemma.
which is the pdf of the Gaussian random variable N (μ1 + μ2 , σ12 + σ22 ). In gen-
eral, a linear combination of independent Gaussian variables is again a Gaussian
random variable, with mean equal to the same linear combination of means and
variance equal to a linear combination of variances, where the coefficients in the
linear combination are squared.
The Gaussian distribution is, without any doubt, the most famous and most important distribution for a signal-processing engineer. The pdf of a Gaussian (also called normal) random variable N(μ, σ²) with mean μ and variance σ² is

  f(x) = (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)}.   (A.59)
  Pr{|x − μ| ≤ σ} ≈ 68.27%,   (A.60)
  Pr{|x − μ| ≤ 2σ} ≈ 95.45%,   (A.61)
  Pr{|x − μ| ≤ 3σ} ≈ 99.73%.   (A.62)
The cdf for a standard Gaussian random variable x ∼ N(0, 1) will be denoted Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−t²/2} dt. We also define the complementary cumulative distribution function (the tail probability) as the probability that x reaches a value larger than x,

  Q(x) = Pr{x > x} = 1 − Φ(x) = (1/√(2π)) ∫_{x}^{∞} e^{−t²/2} dt.   (A.63)
Thus,

  Q(x) = (1/2)[1 − Erf(x/√2)].   (A.67)
The tail probability Q(x) cannot be expressed in a closed form. However, for
large x there exists an approximate asymptotic form
  Q(x) ≈ e^{−x²/2}/(√(2π)x),   (A.70)

which should be understood in the sense that f(x) ≈ g(x) for x → ∞ if lim_{x→∞} f(x)/g(x) = 1. This asymptotic expression can be easily obtained by integrating by parts (∫uv′ = uv − ∫u′v),

  Q(x) = ∫_x^∞ (t/(√(2π)t)) e^{−t²/2} dt = {u = 1/(√(2π)t), v′ = t e^{−t²/2}}   (A.71)
   = [−e^{−t²/2}/(√(2π)t)]_x^∞ − ∫_x^∞ e^{−t²/2}/(√(2π)t²) dt = e^{−x²/2}/(√(2π)x) − R(x),   (A.72)

where

  R(x) = ∫_x^∞ e^{−t²/2}/(√(2π)t²) dt ≤ (1/x²) ∫_x^∞ e^{−t²/2}/√(2π) dt = Q(x)/x².   (A.73)

Thus,

  [e^{−x²/2}/(√(2π)x)] / Q(x) = 1 + R(x)/Q(x) = 1 + O(1/x²),   (A.74)

which proves (A.70).
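The approximation and the 1 + O(1/x²) behavior of the ratio can be checked numerically; Q(x) is available through the complementary error function:

```python
import math

def Q(x):
    """Gaussian tail probability, Q(x) = 1 - Phi(x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def Q_asymptotic(x):
    """The leading term e^{-x^2/2} / (sqrt(2*pi) * x) from (A.70)."""
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)

for x in (1.0, 2.0, 4.0, 8.0):
    print(x, Q_asymptotic(x) / Q(x))  # ratio decreases toward 1 as x grows
```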
D is regular. To obtain the full pdf expressed in the variables x[i], we need to multiply the pdf by the Jacobian of the linear transform. This is because

  Pr{y ∈ B ⊂ Rⁿ} = ∫_B f(y) dy = {substitution y = h(x)} = ∫_B f(h(x)) |∂y/∂x| dx.   (A.80)

Thus, the pdf expressed in terms of x is f(h(x)) |∂y/∂x|. From (A.78), ∂y[i]/∂x[j] = D[i, j] and we have for the Jacobian

  |∂y/∂x| = |D| = √(|D| · |Dᵀ|) = √|C⁻¹| = 1/√|C|.   (A.81)

The pdf of x is thus

  f(x) = (1/√((2π)ⁿ|C|)) e^{−(1/2)(x−μ)ᵀ C⁻¹ (x−μ)}.   (A.82)
Notice that from (A.75) the mean values are E[x_i] = μ[i] (or, in vector form, E[x] = μ) and the covariances are

  E[(x_i − μ[i])(x_j − μ[j])] = E[(∑_{k=1}^n D⁻¹[i, k] y_k)(∑_{l=1}^n D⁻¹[j, l] y_l)]   (A.83)
   = E[∑_{k=1}^n D⁻¹[i, k] D⁻¹[j, k] y_k²]   (A.84)
   = ∑_{k=1}^n D⁻¹[i, k] D⁻¹[j, k] = (D⁻¹(D⁻¹)ᵀ)[i, j] = C[i, j].   (A.85)

In the derivations above, we used the fact that E[y_i y_j] = δ(i − j), because y_i are iid, and (Dᵀ)⁻¹ = (D⁻¹)ᵀ for any regular matrix D. Thus, the matrix C is the covariance matrix of the variables x₁, . . . , x_n.
Conversely, if a vector of random variables x follows the distribution (A.82) for a symmetric positive-definite matrix C, it is jointly Gaussian because any symmetric positive-definite matrix C can be written as C⁻¹ = DᵀD for some regular matrix D (this is the Cholesky decomposition). By transforming x using the linear transform y = D(x − μ), we can make all y_i ∼ N(0, 1) and independent, simply by reversing the steps above. This will also prove that E[x] = μ and E[(x_i − μ[i])(x_j − μ[j])] = C[i, j], the covariance matrix.
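The factorization also yields the standard recipe for sampling jointly Gaussian vectors: writing C = LLᵀ with L lower triangular (so L plays the role of D⁻¹ above), the vector x = μ + Lz with iid z_i ∼ N(0, 1) has covariance C. A minimal sketch with an arbitrary target C:

```python
import math
import random

def cholesky(C):
    """Lower-triangular L with L L^T = C for a symmetric positive-definite C."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return L

C_target = [[2.0, 0.6], [0.6, 1.0]]
mu = [1.0, -1.0]
L = cholesky(C_target)

rng = random.Random(3)
xs = []
for _ in range(100_000):
    z = [rng.gauss(0.0, 1.0) for _ in range(2)]
    xs.append([mu[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(2)])

m0 = sum(x[0] for x in xs) / len(xs)
m1 = sum(x[1] for x in xs) / len(xs)
cov01 = sum((x[0] - m0) * (x[1] - m1) for x in xs) / len(xs)
print(cov01)  # should land near C_target[0][1] = 0.6
```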
Note that jointly Gaussian random variables that are uncorrelated must also be independent. This is a rare case when uncorrelatedness does imply independence. This is because, if x_i and x_j are uncorrelated for every i ≠ j, their covariance matrix C = diag(σ[1]², . . . , σ[n]²) is diagonal, |C| = σ[1]² × · · · × σ[n]², and the pdf (A.82) can be factorized,

  f(x) = ∏_{i=1}^n (1/(√(2π)σ[i])) e^{−(x[i]−μ[i])²/(2σ[i]²)},

which means that the pdf is the product of its marginals, thus establishing their independence.
Note that uncorrelated Gaussians that are not jointly distributed do not have
to be independent. Consider this simple example.
as n → ∞.
The Central Limit Theorem (CLT) is one of the most fundamental theorems in
probability. It states that given a sequence of iid random variables x1 , . . . , xn with
finite mean value μ and variance σ 2 , the distribution of the average (x1 + · · · +
xn )/n approaches a Gaussian with mean μ and variance σ 2 /n independently of
the distribution of xi .
More precisely, denoting the partial sum s_n = x₁ + · · · + x_n, the following variable converges in distribution to N(0, 1):

  (s_n − nμ)/(σ√n) →ᴰ N(0, 1).   (A.90)
In this section, we introduce two discrete distributions that appear in the book.
A Bernoulli random variable B(p) is a random variable x with range {0, 1}
with
Pr{x = 1} = p, (A.91)
Pr{x = 0} = 1 − p. (A.92)
  E[x] = (1 − p) × 0 + p × 1 = p,   (A.93)
  Var[x] = (1 − p) × (0 − p)² + p × (1 − p)² = p(1 − p).   (A.94)
Assume we have an experiment with two possible outcomes “yes” or “no,” where
“yes” occurs with probability p and “no” with probability 1 − p. If the experiment
is repeated n times, the number of occurrences of “yes” is a random variable y following the binomial distribution Bi(n, p):

  Pr{y = k} = (n choose k) p^k (1 − p)^{n−k}, k = 0, 1, . . . , n,   (A.95)
  E[y] = np,   (A.96)
  Var[y] = np(1 − p).   (A.97)
Given n independent Bernoulli random variables B(p), x₁, . . . , x_n, we have ∑_{i=1}^n x_i ∼ Bi(n, p). Thus, by the CLT, Bi(n, p) converges in distribution to a Gaussian in the sense defined in Section A.6.
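The mean and variance formulas are easily confirmed by simulating the Bernoulli sum (the sample sizes below are arbitrary):

```python
import random

rng = random.Random(1)
n, p, trials = 400, 0.3, 10_000

# y = x_1 + ... + x_n with x_i ~ B(p) is Bi(n, p) distributed.
ys = [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]
mean = sum(ys) / trials
var = sum((y - mean) ** 2 for y in ys) / trials
print(mean, n * p)            # expectation np = 120
print(var, n * p * (1 - p))   # variance np(1-p) = 84
```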
  μ̂c[1] = (1/n) ∑_{i=1}^n |x[i] − μ̂|,   (A.99)
  μ̂c[2] = (1/n) ∑_{i=1}^n |x[i] − μ̂|²,   (A.100)

where

  μ̂ = (1/n) ∑_{i=1}^n x[i].
Figure A.1 (plot) Examples of the generalized Gaussian distribution for values of the shape parameter β = 4, 2, 1, 0.7 and α = 1, μ = 0.
  β̂ = G⁻¹(μ̂c[1]²/μ̂c[2]),   (A.101)
  α̂ = μ̂c[1] Γ(1/β̂)/Γ(2/β̂),   (A.102)

where

  G(x) = Γ²(2/x)/(Γ(1/x) Γ(3/x)),   (A.103)
  Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt.   (A.104)
Note that the method fails to estimate the parameters if μ̂c[1]²/μ̂c[2] > 3/4 because the range of G is (0, 3/4). The Gamma function is implemented in Matlab as gamma.m. The inverse function G⁻¹(x) must be implemented numerically (e.g., through a binary search).
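One possible implementation of this moment-matching estimator (a sketch in Python rather than Matlab; since G increases from 0 to 3/4, a simple bisection inverts it):

```python
import math
import random

def G(x):
    """G(x) = Gamma(2/x)^2 / (Gamma(1/x) * Gamma(3/x)) from (A.103)."""
    return math.gamma(2.0 / x) ** 2 / (math.gamma(1.0 / x) * math.gamma(3.0 / x))

def G_inverse(y, lo=0.05, hi=20.0, iters=60):
    """Invert the increasing function G on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if G(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_ggd(x):
    """Estimate the GGD shape beta and width alpha via (A.101)-(A.102)."""
    n = len(x)
    mu = sum(x) / n
    m1 = sum(abs(v - mu) for v in x) / n    # sample absolute central moment
    m2 = sum((v - mu) ** 2 for v in x) / n  # sample variance
    beta = G_inverse(m1 * m1 / m2)
    alpha = m1 * math.gamma(1.0 / beta) / math.gamma(2.0 / beta)
    return beta, alpha

# Gaussian data is the special case beta = 2 (with alpha = sigma * sqrt(2)).
rng = random.Random(5)
data = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
print(fit_ggd(data))
```

For Gaussian input the estimate should come out near β = 2 and α = σ√2.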
Another common model for the distribution of DCT coefficients is the Cauchy
distribution, which arises from the ratio of two normally distributed random
variables. Its generalized version has the following pdf:
  f(x; p, s, μ) = ((p − 1)/(2s)) (1 + |x − μ|/s)^{−p}   (A.105)
Figure A.2 (plot) Generalized Cauchy distribution for values of the shape parameter p = 2.0, 1.5, 1.2 and s = 1, μ = 0.
with three parameters – the mean μ, the width parameter s > 0, and the shape
parameter p > 1. The generalized Cauchy distribution has thick tails. It becomes
spikier with larger p and flatter with smaller p (see Figure A.2).
Student’s t-distribution finds applications in modeling the output from quan-
titative steganalyzers (see Chapter 10). It arises in the following problem. Let
x₁, . . . , x_n be iid Gaussian variables N(μ, σ²) and let x̄_n = (x₁ + · · · + x_n)/n, σ̂²_n = [1/(n − 1)] ∑_{i=1}^n (x_i − x̄_n)² be the sample mean and sample variance. While the normalized mean is Gaussian distributed if μ and σ² are known,

  (x̄_n − μ)/(σ/√n) ∼ N(0, 1),   (A.106)

replacing the standard deviation σ with its sample form leads to a more complicated random variable,

  (x̄_n − μ)/(σ̂_n/√n),   (A.107)

which follows Student’s t-distribution with ν = n − 1 degrees of freedom:

  f(x) = Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) · (1 + (x − μ)²/ν)^{−(ν+1)/2}.   (A.108)
The distribution is symmetric about its mean, which is μ for ν > 1 and is un-
defined otherwise. The parameter ν is called the tail index. When ν = 1, the
distribution is Cauchy. As ν → ∞, the distribution approaches the Gaussian
distribution. The variance is ν/(ν − 2) for ν > 2 and undefined otherwise. In
general, only moments of order strictly smaller than ν are defined. The tail probability of Student’s distribution satisfies 1 − F(x) ≈ x^{−ν}, where F(x) is the cumulative distribution function. Maximum-likelihood estimators (see Section D.7 and [229]) can be used in practice to fit Student’s distribution to data.
Figure A.3 Chi-square distribution for ν = 2, 3, 4, 5, 10 degrees of freedom.
For a given scalar random variable, x, the log–log empirical cdf plot shows the
tail probability Pr{x > x} (or complementary cdf) as a function of x in a log–
log plot. A separate plot is usually made for the right and left tails when the
B Information theory
If the base of the log is 2, we measure H in bits; if the base is the Euler number,
e = 2.71828..., we speak of “nats.” Since 0 ≤ p(x) ≤ 1, − log p(x) ≥ 0, and thus
H(x) ≥ 0. Note that entropy does not depend on the values x, only on their
probabilities p(x).
Entropy measures the uncertainty of the outcome of x. It is the average number
of bits communicated by one realization of x. Also, H(x) is the minimum average
number of bits needed to describe x.
In Chapter 6, we will encounter a less common notion of minimal entropy defined as

  H_min(x) = min_{x∈A} [− log p(x)].   (B.2)
The lower-case letter is used to stress the fact that, although entropy and dif-
ferential entropy share many properties, they also behave very differently. For
example, h(x) can be negative because we can certainly have f (x) > 1 on its
support (h(u) = − log 2 for u ∼ U(0, 1/2)). Moreover, h may not be preserved under a one-to-one mapping (cf. Proposition B.3).
314 Appendix B. Information theory
The inequality H(f(x)) ≤ H(x) will then be proved by summing (B.7) over all b ∈ f(A) because

  H(f(x)) = −∑_{b∈f(A)} p′(b) log p′(b) ≤ −∑_{b∈f(A)} ∑_{x∈f⁻¹(b)} p(x) log p(x)   (B.8)
   = −∑_{x∈A} p(x) log p(x) = H(x).   (B.9)

Because p(x)/p′(b) is itself a pmf on f⁻¹(b) (it sums to 1), from the non-negativity of entropy,

  0 ≤ −∑_{x∈f⁻¹(b)} (p(x)/p′(b)) log (p(x)/p′(b)) = (1/p′(b)) ∑_{x∈f⁻¹(b)} (−p(x)) log (p(x)/p′(b))   (B.10)
   = (1/p′(b)) ∑_{x∈f⁻¹(b)} {−p(x) log p(x) + p(x) log p′(b)}   (B.11)
   = (1/p′(b)) {∑_{x∈f⁻¹(b)} (−p(x) log p(x)) + p′(b) log p′(b)}.   (B.12)
The proposition simply states that by transforming a random variable, one can-
not make its output more uncertain. This is intuitively clear because either f
maps elements of A to different elements, which does not change entropy, or
it maps several elements to one (not a one-to-one map), which should decrease
uncertainty.
Entropy is the measure of uncertainty. The conditional entropy of x is the uncertainty of x given the outcome of another random variable y and it can be expressed using the conditional probability p(x|y = y):

  H(x|y = y) = −∑_{x∈A_x} p(x|y) log p(x|y).   (B.13)

Since, potentially, the alphabets for x and y may be different, we use subscripts to denote this fact, x ∈ A_x, y ∈ A_y. Using (B.13), we define the conditional entropy

  H(x|y) = −∑_{x∈A_x} ∑_{y∈A_y} p(x|y) log p(x|y) p(y)   (B.14)
   = −∑_{x∈A_x} ∑_{y∈A_y} p(x, y) log p(x|y).   (B.15)
Thus, the joint entropy is the sum of the entropy of y and the conditional
entropy H(x|y),
H(x, y) = H(x|y) + H(y). (B.19)
The mutual information I(x; y) measures how much information about x is
conveyed by y,
I(x; y) = H(x) − H(x|y). (B.20)
If H(x|y) = H(x), or I(x; y) = 0, the uncertainty of x is unaffected by y and thus
knowledge of y does not give us any information about x. On the other hand, if
H(x|y) = 0, x is completely determined by y and thus it delivers all information
about x. In other words, the mutual information I(x; y) = H(x).
Mutual information is symmetrical,
I(x; y) = I(y; x), (B.21)
because, expressing H(x|y) using (B.19),
I(x; y) = H(x) − H(x|y) = H(x) + H(y) − H(x, y), (B.22)
which is obviously symmetrical in both random variables.
There exists a fundamental relationship between mutual information and KL divergence (introduced in the next section):

  I(x; y) = −∑_{x∈A_x} p(x) log p(x) + ∑_{x∈A_x} ∑_{y∈A_y} p(x, y) log p(x|y)   (B.23)
   = −∑_{x∈A_x} ∑_{y∈A_y} p(x, y) log p(x) + ∑_{x∈A_x} ∑_{y∈A_y} p(x, y) log p(x|y)   (B.24)
   = ∑_{x∈A_x} ∑_{y∈A_y} p(x, y) log (p(x|y)/p(x)) = ∑_{x∈A_x} ∑_{y∈A_y} p(x, y) log (p(x, y)/(p(x)p(y))).   (B.25)
The KL divergence is defined for two pmfs p and q on A as

  D_KL(p||q) = ∑_{x∈A} p(x) log (p(x)/q(x)).   (B.28)
Example B.4: [KL divergence on a set of two elements] For A = {0, 1}, the KL divergence (B.28) becomes

  D_KL(p||q) = p(0) log (p(0)/q(0)) + (1 − p(0)) log ((1 − p(0))/(1 − q(0))).   (B.30)
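A direct computation (a sketch, valid for 0 < p(0), q(0) < 1) also shows that the KL divergence is not symmetric in its arguments:

```python
import math

def kl_binary(p0, q0):
    """D_KL between the Bernoulli pmfs (p0, 1-p0) and (q0, 1-q0) in bits; see (B.30)."""
    return p0 * math.log2(p0 / q0) + (1 - p0) * math.log2((1 - p0) / (1 - q0))

print(kl_binary(0.5, 0.5))    # identical distributions give divergence 0
print(kl_binary(0.5, 0.25))   # positive ...
print(kl_binary(0.25, 0.5))   # ... and different from the reversed order
```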
Proof. For the proof, it will be convenient to use the natural logarithm in the definition of the KL divergence. We will use the log inequality

  log t ≤ t − 1,   (B.32)

which holds for all t > 0 with equality if and only if t = 1. This can be seen by noting that log t is a concave function (because (log t)″ = −t⁻² < 0 for t > 0) and y = t − 1 is its tangent at t = 1.
Substituting t = q(x)/p(x) into (B.32), we obtain

  log (q(x)/p(x)) ≤ q(x)/p(x) − 1.   (B.33)

After multiplying by −p(x) (do not forget to flip the inequality sign because −p(x) ≤ 0),

  p(x) log (p(x)/q(x)) ≥ p(x) − q(x).   (B.34)
The equality can hold if and only if we had equality for each x, or q(x)/p(x) = 1 for all x, which means that the distributions are identical.
From here, we can draw some interesting conclusions. First, because the KL
divergence is always non-negative, we see that we must always have H(x) ≤
log |A|, the entropy is bounded by the logarithm of the number of elements in A.
The difference between this maximal value and the entropy H(x) is just the KL
divergence between the pmf of x and the uniformly distributed random variable
(which always has the maximal entropy, as can be easily verified).
The KL divergence comes up very frequently in information theory and it can
be argued that it is a more fundamental concept than the entropy itself. For
example, let us assume that we wish to compress a random variable x with pmf
p using Huffman code but we know only an estimate of p that we denote p̂. If
we construct a Huffman code based on our imprecise pmf p̂, we will on average
need DKL (p||p̂) more bits for compression of x due to the fact that we know its
pmf only approximately. The KL divergence has also a fundamental relationship
to hypothesis testing (see Appendix D). Moreover, in Section B.1 we showed
that the mutual information I(x; y) between two random variables x and y can
be written as the KL divergence between their joint probability mass function
p(x, y) and the product distribution of their marginals p(x)q(y) (this would be
the joint pmf if x and y were independent): I(x; y) = DKL (p(x, y)||p(x)q(y)).
In order to understand the information-theoretic definition of steganographic
security, we will need the following proposition, which is an equivalent of Propo-
sition B.3 for KL divergence.
Proof. We will need the following log-sum inequality, which holds for any non-negative r₁, . . . , r_k and positive s₁, . . . , s_k:

  ∑_{i=1}^k r_i log (r_i/s_i) ≥ (∑_{i=1}^k r_i) log (∑_{j=1}^k r_j / ∑_{j=1}^k s_j),   (B.39)

which can be proved again using the log inequality (B.32), in which we substitute

  t = (s_i/r_i)(∑_j r_j / ∑_j s_j):   (B.40)

  log (s_i/r_i) + log (∑_j r_j / ∑_j s_j) ≤ (s_i/r_i)(∑_j r_j / ∑_j s_j) − 1.   (B.41)

We now multiply (B.41) by r_i and sum over i,

  ∑_{i=1}^k r_i log (s_i/r_i) + (∑_{i=1}^k r_i) log (∑_j r_j / ∑_j s_j) ≤ (∑_j r_j / ∑_j s_j) ∑_{i=1}^k s_i − ∑_{i=1}^k r_i = 0,   (B.42)

which is (B.39) after moving the first term to the right-hand side.
Let p′(b) and q′(b) denote the probabilities of observing b for f(x) and f(y), respectively. Let x₁, . . . , x_k be all the elements of f⁻¹(b). We now use the log-sum inequality for r_i = p(x_i) and s_i = q(x_i), noting that ∑_i r_i = p′(b) and ∑_i s_i = q′(b), to obtain

  ∑_{x∈f⁻¹(b)} p(x) log (p(x)/q(x)) = ∑_{i=1}^k p(x_i) log (p(x_i)/q(x_i)) ≥ (∑_{j=1}^k p(x_j)) log (∑_{j=1}^k p(x_j) / ∑_{j=1}^k q(x_j))   (B.43)
   = p′(b) log (p′(b)/q′(b)).   (B.44)

Taking similar steps as in the proof of Proposition B.3, we obtain

  D_KL(p||q) = ∑_{x∈A} p(x) log (p(x)/q(x)) = ∑_{b∈f(A)} ∑_{x∈f⁻¹(b)} p(x) log (p(x)/q(x))   (B.45)
   ≥ ∑_{b∈f(A)} p′(b) log (p′(b)/q′(b)) = D_KL(p′||q′).   (B.46)
  = (β²/2) ∑_x (1/p(x; 0)) (∂p(x; 0)/∂β)² + O(β³)   (B.54)

because

  ∑_x ∂p(x; 0)/∂β = (∂/∂β) ∑_x p(x; 0) = (∂/∂β) 1 = 0,   (B.55)
  ∑_x ∂²p(x; 0)/∂β² = (∂²/∂β²) ∑_x p(x; 0) = (∂²/∂β²) 1 = 0.   (B.56)
  D_KL(p_x||p_y) = ∑_{x₁,...,x_n} p₁(x₁) · · · p_n(x_n) log ∏_{i=1}^n (p_i(x_i)/q_i(x_i))   (B.58)
   = ∑_{i=1}^n ∑_{x₁,...,x_n} p₁(x₁) · · · p_n(x_n) log (p_i(x_i)/q_i(x_i))   (B.59)
   = ∑_{i=1}^n ∑_{x_i} p_i(x_i) log (p_i(x_i)/q_i(x_i)) = ∑_{i=1}^n D_KL(p_i||q_i),   (B.60)

because on the second line we can sum over all x_j, j ≠ i, and use the fact that ∑_{x_j} p(x_j) = 1.
1 → 0 (B.63)
2 → 10 (B.64)
3 → 110 (B.65)
4 → 111 . (B.66)
It is clear that this compression scheme is prefix-free. Assume we keep tossing the tetrahedron while registering the results. We know that we will need at least n × H(t) bits to describe the tosses. The average number of bits needed to describe the realization of one toss with our encoding is (1/2) × 1 + (1/4) × 2 + (1/8) × 3 + (1/8) × 3 = 1.75. This is also the best encoding that we can have because H(t) = 1.75. Note that in this special case the codeword lengths do satisfy l(a_j) = − log₂ p(a_j) for each j.
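The arithmetic above can be replayed in a few lines (a sketch):

```python
import math

probs = [0.5, 0.25, 0.125, 0.125]   # Pr of the tetrahedron outcomes 1..4
code = ["0", "10", "110", "111"]    # the prefix-free code from the text

entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * len(c) for p, c in zip(probs, code))
print(entropy, avg_len)  # both 1.75, so the code meets the entropy bound
```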
C Linear codes
This appendix contains the basics of linear covering codes needed to explain the
material in Chapter 8 on matrix embedding and Chapter 9 on non-shared selec-
tion channels. Coding theory is the appropriate mathematical discipline to formu-
late and solve problems associated with minimal-embedding-impact steganogra-
phy introduced in Chapter 5. An excellent text on finite fields and coding theory
is [248].
We first start with the concept of a finite field and then introduce linear codes
while focusing on selected material relevant to the topics covered in this book.
When q is not prime, Aq with modulo q arithmetic is not a field because the
factors of q would not have a multiplicative inverse. For example, for q = 4, there
is no multiplicative inverse for 2.
A finite field with q elements exists if and only if q = pm for some positive
integer m and p prime. The field is formed by polynomials of degree m − 1
modulo an irreducible polynomial.
All finite fields with q elements are isomorphic and are called Galois fields Fq .
By isomorphism, we mean a one-to-one mapping, Ψ : Fq ↔ Gq , that preserves
all operations, e.g., ∀a, b ∈ Fq , Ψ(ab) = Ψ(a)Ψ(b) and Ψ(a + b) = Ψ(a) + Ψ(b).
The average distance to code is the expected value of d_H(x, C) over randomly uniformly distributed x ∈ F_qⁿ,

  R_a = (1/qⁿ) ∑_{x∈F_qⁿ} d_H(x, C).   (C.5)
The covering radius of a code is determined by the most distant word from the code,

  R = max_{x∈F_qⁿ} d_H(x, C).   (C.6)

Note that

  R ≥ R_a,   (C.7)
  ∪_{x∈C} B(x, R) = F_qⁿ.   (C.8)
The code length is n = 5 and its dimension is k = 3 (the number of rows in G).
The codewords are elements of F52 (they are five-tuples of bits). We know that
there must be 23 = 8 codewords. Three of them already appear as rows of G.
The fourth one is the all-zero codeword, (0, 0, 0, 0, 0), which is always an element
of every linear code. The remaining four codewords are obtained by adding the
rows of G: (1, 1, 0, 1, 0) is the sum of the first two rows, (1, 0, 1, 0, 1) is the sum
of the last two rows, and (0, 1, 1, 1, 1) is the sum of the first and third row. The
last codeword is the sum of all three rows, (0, 0, 0, 1, 0). Thus, a complete list of
all codewords is
C = {(0, 0, 0, 0, 0), (1, 0, 1, 1, 1), (0, 1, 1, 0, 1), (1, 1, 0, 0, 0), (1, 1, 0, 1, 0), (1, 0, 1, 0, 1), (0, 1, 1, 1, 1), (0, 0, 0, 1, 0)}. (C.10)
This code has many other generator matrices formed by any triple of linearly
independent codewords. The minimal distance of C is 1 because one of the code-
words has Hamming weight 1. The covering radius of C is R = 1 because no other
word in F52 is farther away from C than 1 (the reader is encouraged to verify this
statement by listing the distance to C for all 32 words in F52 ). In Section C.2.4, we
will learn a better and faster method for how to determine the covering radius
for such small codes. Because the distance to C is 0 for all 8 codewords and 1 for the remaining 32 − 8 = 24 words, the average distance to code is R_a = 24/32 = 3/4.
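These quantities are easy to confirm by brute force. A small sketch (helper names are made up) that enumerates the 2³ codewords generated by G and the distances of all 32 words of F_2^5 to the code:

```python
from itertools import product

# Generator matrix of the [5, 3] binary code from the example.
G = [(1, 0, 1, 1, 1), (0, 1, 1, 0, 1), (1, 1, 0, 0, 0)]

def span(G):
    # All 2^k codewords: XOR combinations of the rows of G.
    codewords = set()
    for coeffs in product((0, 1), repeat=len(G)):
        c = tuple(sum(a * g[i] for a, g in zip(coeffs, G)) % 2
                  for i in range(len(G[0])))
        codewords.add(c)
    return codewords

C = span(G)
dist = lambda x, y: sum(a != b for a, b in zip(x, y))
words = list(product((0, 1), repeat=5))
d_to_C = [min(dist(x, c) for c in C) for x in words]

d_min = min(sum(c) for c in C if any(c))  # minimal distance = min nonzero weight
R = max(d_to_C)                           # covering radius
R_a = sum(d_to_C) / len(words)            # average distance to code
print(len(C), d_min, R, R_a)  # 8 1 1 0.75
```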
The ball of radius 1 centered at codeword c = (1, 0, 1, 1, 1) is the set of six
words
B(c, 1) = {(1, 0, 1, 1, 1), (0, 0, 1, 1, 1), (1, 1, 1, 1, 1), (1, 0, 0, 1, 1), (1, 0, 1, 0, 1), (1, 0, 1, 1, 0)}. (C.11)
Using these operations, every linear code can be mapped to an isomorphic code
with generator matrix G = [Ik ; A], where Ik is the k × k unity matrix and A is
a k × (n − k) matrix. This form of the generator matrix is called the systematic
form.
The reader is encouraged to think about why adding two columns may not lead to an isomorphic code (come up with an example where adding one column to another changes the minimal distance).
x · y = x[1]y[1] + · · · + x[n]y[n], (C.12)
with all operations carried out in the corresponding finite field. Note that when x ∈ F_2^n contains
an even number of ones, x · x = 0, which is true for the standard dot product in
Euclidean space only when x = 0.
The orthogonal complement to C is the set of all vectors orthogonal to every codeword,
C⊥ = {x ∈ F_q^n | x · c = 0 for all c ∈ C},
where we wrote A[j, .] for the jth row of A. The equation above implies
x[1] = −A[1, .](x[k + 1], . . . , x[n])^T, . . . , x[k] = −A[k, .](x[k + 1], . . . , x[n])^T. (C.15)
We can choose x[k + 1], . . . , x[n] arbitrarily and always find x[1], . . . , x[k] so that
all k equations hold. Choose
(x[k + 1], . . . , x[n])^T ∈ {(1, 0, . . . , 0)^T, (0, 1, . . . , 0)^T, . . . , (0, 0, . . . , 1)^T}. (C.16)
By writing the solutions as rows of a matrix, we obtain the generator matrix, H,
of the dual code
H = (A[1, 1] A[2, 1] · · · A[k, 1] 1 0 · · · 0; A[1, 2] A[2, 2] · · · A[k, 2] 0 1 · · · 0; · · · ; A[1, n−k] A[2, n−k] · · · A[k, n−k] 0 0 · · · 1) = [A^T, I_{n−k}], (C.17)
which is called the parity-check matrix of C. The parity-check matrix is an equiv-
alent description of the code; G is k × n, H is (n − k) × n. The codewords are
defined implicitly through H because the rows of H are orthogonal to rows of
G, Hc = 0 for all codewords c ∈ C.
The parity-check matrix can be used to find the minimal distance of a code
(at least for small codes). The minimal distance d is the smallest number of
columns in H whose sum is 0 because Hc = 0 can be written as a linear com-
bination of columns of H: Hc = c[1]H[., 1] + · · · + c[n]H[., n] = 0. (Recall that
d = min_{c∈C, c≠0} w(c).)
Example C.4: Consider the code from Example C.3. We will first find the sys-
tematic form of the generator matrix and then the parity-check matrix. Using
the operations that lead to isomorphic codes, we can write
G = (1 0 1 1 1; 0 1 1 0 1; 1 1 0 0 0) ∼ (1 0 1 1 1; 0 1 1 0 1; 0 1 1 1 1) ∼ (1 0 1 1 1; 0 1 1 0 1; 0 0 0 1 0) (C.18)
∼ (1 0 1 1 1; 0 1 0 1 1; 0 0 1 0 0) ∼ (1 0 0 1 1; 0 1 0 1 1; 0 0 1 0 0) = [I_3; A]. (C.19)
The first operation involved adding the first row to the third row to manufacture
zeros in the first column (besides the first one). In the second operation, we
added the second row to the third one to obtain zeros in the second column.
Then, we swapped the third and fourth columns to obtain an upper-diagonal
matrix. The last operation turned the matrix to the desired systematic form by
adding the third row to the first row.
Thus, the parity-check matrix is obtained as
H = [A^T; I_{5−3}] = (1 1 0 1 0; 1 1 0 0 1). (C.20)
The reader is encouraged to verify that the rows of H are orthogonal to all rows
of G in systematic form.
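The suggested verification can be scripted directly. The sketch below (matrices copied from Example C.4) checks that every row of H has zero dot product modulo 2 with every row of the systematic generator matrix:

```python
# Systematic generator matrix G' = [I_3; A] and parity-check
# matrix H = [A^T; I_2] from Example C.4.
G_sys = [(1, 0, 0, 1, 1),
         (0, 1, 0, 1, 1),
         (0, 0, 1, 0, 0)]
H = [(1, 1, 0, 1, 0),
     (1, 1, 0, 0, 1)]

# Every row of H must be orthogonal (mod 2) to every row of G'.
for h in H:
    for g in G_sys:
        assert sum(a * b for a, b in zip(h, g)) % 2 == 0
print("all rows of H are orthogonal to all rows of G")
```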
Proof. Because the union of balls with radius equal to R, the covering radius, is the whole space, we must have ∪_{x∈C} B(x, R) = F_q^n. Now, if all balls were disjoint, we would have |∪_{x∈C} B(x, R)| = |F_q^n| = q^n = V_q(R, n) × q^k, which is equality in the sphere-covering bound, because all balls have the same volume V_q(R, n) and there are q^k of them. If the balls are not disjoint, we will count some words more than once and thus obtain the inequality q^n = |∪_{x∈C} B(x, R)| ≤ V_q(R, n) × q^k.
An [n, k] code is called perfect if the minimal distance d is odd and the following
equality holds:
V_q((d − 1)/2, n) = q^{n−k}. (C.22)
In other words, the balls with radius ⌊(d − 1)/2⌋ = (d − 1)/2 centered at codewords cover the whole space without overlap.
This means that balls of radius 1 centered at codewords cover the whole space
without overlap. This also implies that the covering radius R = 1. Because there
are q^p words in every ball, out of which only one is a codeword, the average distance to code is R_a = (q^p − 1)/q^p = 1 − q^{−p}.
Description of the Golay codes and the proof of Proposition C.6 can be found,
e.g., in [248].
Note that C(0) = C and C(s1) ∩ C(s2) = ∅ for s1 ≠ s2. The whole space F_q^n can thus be decomposed into q^{n−k} cosets, each coset containing q^k elements,
∪_{s∈F_q^{n−k}} C(s) = F_q^n. (C.25)
The coset leader e(s) is a member of the coset C(s) with the smallest Hamming weight. Its weight satisfies
w(e(s)) ≤ R (C.26)
because as c goes through C, x − c goes through all members of the coset C(s).
This result also implies that any syndrome, s ∈ F_q^{n−k}, can be obtained by
adding at most R columns of H. This is because C(s) = {x|Hx = s} and the
weight of a coset leader is the smallest number of columns that need to be
summed to obtain the coset syndrome s. Thus, one method to determine the
covering radius of a linear code is to first form its parity-check matrix and then
find the smallest number of columns of H that can generate any syndrome.
Example C.7: Using the code from Example C.3 again, this last result can im-
mediately give us the covering radius of the code from the parity-check matrix
H = (1 1 0 1 0; 1 1 0 0 1). (C.28)
Because the columns of H cover all four binary vectors of length 2, each syndrome can be written as a linear combination of at most one column and thus R = 1. Because k = 3
and n = 5, there are 2n−k = 4 cosets, each containing 2k = 8 words. The coset
corresponding to syndrome s00 = (0, 0) is the code C and the coset leader is the
all-zero codeword. The coset leader of the coset C(s01 ), s01 = (0, 1) , is e(s01 ) =
(0, 0, 0, 0, 1) because this is the word with the smallest Hamming weight for
which He = (0, 1) . The coset leader of the coset C(s10 ), s10 = (1, 0) , is e(s10 ) =
(0, 0, 0, 1, 0) because this is the word with the smallest Hamming weight for
which He = (1, 0) . Finally, the coset leader of the coset C(s11 ), s11 = (1, 1) , is
e(s11 ) = (0, 1, 0, 0, 0) because this is a word with the smallest Hamming weight
for which He = (1, 1) . Note that the coset leader for this coset is not unique
because (1, 0, 0, 0, 0) is also a coset leader.
The space F_2^5 = C ∪ (e(s01) + C) ∪ (e(s10) + C) ∪ (e(s11) + C) can thus be written as a disjoint union of four cosets.
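Coset leaders can be found mechanically by scanning all words in order of increasing Hamming weight and recording the first word seen for each syndrome. A sketch under these assumptions (helper names are made up):

```python
from itertools import product

# Parity-check matrix from Example C.7.
H = [(1, 1, 0, 1, 0),
     (1, 1, 0, 0, 1)]

def syndrome(x):
    # s = Hx over F_2.
    return tuple(sum(h[i] * x[i] for i in range(5)) % 2 for h in H)

# Scan words by increasing weight; the first word with a given syndrome
# is a minimum-weight member of its coset, i.e., a coset leader.
leaders = {}
for x in sorted(product((0, 1), repeat=5), key=sum):
    leaders.setdefault(syndrome(x), x)

print(leaders)
# The covering radius equals the largest coset-leader weight.
assert max(sum(e) for e in leaders.values()) == 1
```

For syndrome (1, 1) the leader is not unique; this scan happens to return (0, 1, 0, 0, 0), the same choice as in the example.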
D - Signal detection and estimation pp. 335-362
D Signal detection and estimation
In this appendix, we explain some elementary facts from statistical signal detec-
tion and estimation. The material is especially relevant to Chapter 6 on definition
of steganographic security and Chapters 10–12 that deal with steganalysis be-
cause the problem of detection of secret messages can be formulated within the
framework of statistical hypothesis testing. The reader is referred to [126] and
[125] for an in-depth treatment of signal detection and estimation for engineers.
We note that the material in this appendix applies to discrete random variables represented with probability mass functions by simply replacing the integrals with summations.
H0 : x ∼ p0 , (D.1)
H1 : x ∼ p1 . (D.2)
The two most frequently used criteria for constructing optimal detectors are:
• [Neyman–Pearson] Impose a bound on the probability of false alarms, P_FA ≤ ε_FA, and maximize the probability of detection, P_D(ε_FA) = 1 − P_MD(ε_FA), or, equivalently, minimize the probability of missed detection. The optimization process here needs to find, among all possible subsets of R^n, the critical region R1 that maximizes the detection probability
P_D = Pr{x ∈ R1 | x ∼ p1} = ∫_{R1} p1(x) dx. (D.7)
• [Bayesian] Assign positive costs to each error, C10 > 0 (cost of false alarm) and C01 > 0 (cost of missed detection), non-positive "costs" or gains to each correct decision, C00 ≤ 0 (H0 correctly detected as H0) and C11 ≤ 0 (H1 correctly detected as H1), and prior probabilities P(H0) and P(H1) that x ∼ p0 and x ∼ p1, and minimize the total cost
Σ_{i,j=0}^{1} C_ij Pr{x ∈ R_i | x ∼ p_j} P(H_j). (D.9)
Theorem D.1. [Likelihood-ratio test] The optimal detector for both scenarios above is the Likelihood-Ratio Test (LRT):
Decide H1 if and only if L(x) = p1(x)/p0(x) > γ, (D.10)
where L(x) is called the likelihood ratio, and the threshold γ is a solution to the following equation for the Neyman–Pearson scenario,
∫_{L(x)>γ} p0(x) dx = ε_FA. (D.11)
p1(x)/p0(x) > −λ. (D.16)
we express as
C00 (1 − ∫_{R1} p0(x) dx) P(H0) + C10 ∫_{R1} p0(x) dx P(H0) + C11 ∫_{R1} p1(x) dx P(H1) + C01 (1 − ∫_{R1} p1(x) dx) P(H1)
= C00 P(H0) + C01 P(H1) + ∫_{R1} [P(H0)(C10 − C00) p0(x) + P(H1)(C11 − C01) p1(x)] dx.
The total cost is therefore minimized when the critical region R1 contains exactly those x for which the integrand is negative,
P(H0)(C10 − C00) p0(x) + P(H1)(C11 − C01) p1(x) < 0. (D.19)
lim_{n→∞} (1/n) log P_MD(ε_FA) = −D_KL(p0 ‖ p1). (D.21)
which defines a line segment connecting the points (u, PD (u)) and (v, PD (v)).
Because the best detector must have PD (x) ≥ PD∗ (x), it means that the ROC
curve lies above this line segment and is thus concave.
A perfect detector with PD (x) = 1 for all x can be obtained only when the
supports of p0 and p1 are disjoint. The opposite case is when p0 = p1 , in which
case, no detector can be built and we have PD (x) = x, which corresponds to a
randomly-guessing detector.
L(x) = p1(x)/p0(x) = [∏_{i=1}^{n} (1/√(2πσ²)) e^{−(x[i]−w[i])²/(2σ²)}] / [∏_{i=1}^{n} (1/√(2πσ²)) e^{−x[i]²/(2σ²)}] (D.26)
= e^{(1/σ²)(Σ_{i=1}^{n} x[i]w[i] − (1/2) Σ_{i=1}^{n} w[i]²)} > γ, (D.27)
or
T(x) = Σ_{i=1}^{n} x[i]w[i] > σ² log γ + (1/2) Σ_{i=1}^{n} w[i]² = γ′. (D.29)
The result we have just obtained is the well-known fact that the optimal detector for a known signal corrupted by additive white Gaussian noise is the correlator.
We now determine the threshold γ for the Neyman–Pearson test and charac-
terize the detector performance. The test statistic T (x) is Gaussian under both
hypotheses because it is a linear combination of iid Gaussians (see Lemma A.6).
The reader can easily verify that
T(x) ∼ N(μ0, ν²) under H0, N(μ1, ν²) under H1, (D.30)
where μ0 = 0, μ1 = E, and ν² = σ²E with E = Σ_{i=1}^{n} w[i]², the energy of the
known signal. We now face a situation when the test statistic is a mean-shifted
Gauss–Gauss (Gaussians with the same variances and different means). When-
ever this situation occurs, the detector’s performance is completely described
using the so-called deflection coefficient
d² = (μ1 − μ0)²/ν². (D.31)
P_D(P_FA) = Q((μ0 − μ1 + νQ^{−1}(P_FA))/ν) (D.34)
= Q(Q^{−1}(P_FA) − (μ1 − μ0)/ν) (D.35)
= Q(Q^{−1}(P_FA) − √(d²)), (D.36)
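The closed-form performance (D.36) can be evaluated numerically with the standard-normal distribution; the signal w, the noise level σ, and the false-alarm bound below are made-up illustration values:

```python
import math
from statistics import NormalDist

# Neyman-Pearson correlator detector for a known signal w in white
# Gaussian noise of variance sigma^2 (illustrative values).
w = [1.0, -2.0, 0.5, 1.5]
sigma = 2.0
E = sum(wi * wi for wi in w)   # signal energy
d2 = E / sigma ** 2            # deflection coefficient (mu1 - mu0)^2 / nu^2

Q = lambda x: 1.0 - NormalDist().cdf(x)        # Gaussian tail probability
Qinv = lambda p: NormalDist().inv_cdf(1 - p)   # its inverse

P_FA = 0.05
# Threshold on T(x) = sum x[i]w[i]: under H0, T ~ N(0, sigma^2 * E).
nu = sigma * math.sqrt(E)
gamma = nu * Qinv(P_FA)
# Detection probability from (D.36).
P_D = Q(Qinv(P_FA) - math.sqrt(d2))
print(round(gamma, 3), round(P_D, 3))
```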
H0 : x[i] ∼ p0 , (D.37)
H1 : x[i] ∼ pβ , β > 0, (D.38)
Assuming the random variable log(pβ (x)/p0 (x)) has finite mean and variance,
by the central limit theorem, (1/n)Lβ (x) converges to a Gaussian distribution
whose mean and variance we now determine.
The expected value of (1/n)Lβ (x) under hypothesis H0 is −DKL (p0 ||pβ ) be-
cause
E_{p0}[(1/n) Lβ(x)] = E_{p0}[log(pβ(x)/p0(x))] = −D_KL(p0 ‖ pβ). (D.40)
It can be expanded using Proposition B.7 as
−D_KL(p0 ‖ pβ) = −(β²/2) I(0) + O(β³), (D.41)
where I(β) is the Fisher information of one observation,
I(β) = E_{pβ}[(∂ log pβ(x)/∂β)²] = Σ_x (1/pβ(x)) (∂pβ(x)/∂β)², (D.42)
where the second equality holds for the discrete case (in the continuous case, the
sum is replaced with an integral). See Section D.6 for more details about Fisher
information.
The expected value of (1/n)Lβ (x) under H1 is
E_{pβ}[(1/n) Lβ(x)] = D_KL(pβ ‖ p0). (D.43)
Because the leading term of pβ (x) in β is p0 (x), the leading terms of DKL (pβ ||p0 )
and DKL (p0 ||pβ ) are the same up to the sign,
E_{pβ}[(1/n) Lβ(x)] = (β²/2) I(0) + O(β³). (D.44)
To compute the variance of (1/n)Lβ under either hypothesis, notice that
Var[log(pβ(x)/p0(x))] = E[(log(pβ(x)/p0(x)))²] − (E[log(pβ(x)/p0(x))])². (D.45)
We already know that the second term is O(β 4 ) under both hypotheses. For the
first expectation, we expand
log(pβ(x)/p0(x)) = log(1 + (β/p0(x)) ∂pβ(x)/∂β|_{β=0} + O(β²)) (D.46)
= (β/p0(x)) ∂pβ(x)/∂β|_{β=0} + O(β²), (D.47)
and thus both p0 (x) (log(pβ (x)/p0 (x)))2 and pβ (x) (log(pβ (x)/p0 (x)))2 have the
same leading term,
(β²/p0(x)) (∂pβ(x)/∂β|_{β=0})². (D.48)
Therefore, under both hypotheses, the leading term of the variance is
Var[(1/n) Σ_{i=1}^{n} log(pβ(x[i])/p0(x[i]))] = (1/n) E[(log(pβ(x)/p0(x)))²] (D.49)
≐ (1/n) β² Σ_x (1/p0(x)) (∂pβ(x)/∂β|_{β=0})² (D.50)
= (1/n) β² I(0). (D.51)
Finally, we can conclude that for small β the likelihood-ratio test is the mean
shifted Gauss–Gauss problem
(1/n) Lβ(x) ∼ N(−β²I(0)/2, β²I(0)/n) under H0, N(β²I(0)/2, β²I(0)/n) under H1, (D.52)
and its performance is thus completely described using the deflection coefficient
(accurate up to the order of β 2 ), which is in turn proportional to the Fisher
information
d² = (β²I(0)/2 + β²I(0)/2)² / (β²I(0)/n) = nβ²I(0). (D.53)
Using the so-called J-divergence (sometimes called symmetric KL divergence), J(p0, pβ) = D_KL(p0 ‖ pβ) + D_KL(pβ ‖ p0), the deflection coefficient can also be written as d² ≈ n J(p0, pβ) for small β.
When one of the probability distributions in the hypothesis test is not known or
depends on unknown parameters, we obtain the so-called composite hypothesis-
testing problem, which is substantially more complicated than the simple test.
In general, no optimal detectors exist in this case and one has to resort to sub-
optimal detectors by accepting simplifying or additional assumptions.
In this book, we are primarily interested in the case when the distribution
under the alternative hypothesis depends on an unknown scalar parameter β,
such as the change rate (relative number of embedding changes due to message
hiding). In this case, we know that β ≥ 0, obtaining thus the one-sided hypothesis
test
H0 : x ∼ p0 , (D.56)
H1 : x ∼ pβ , β > 0, (D.57)
where x is again a vector of measurements (e.g., the histogram). Examples of
specific instances of this problem appear in Chapter 10.
If the threshold in the LRT can be set so that the detector has the highest
probability of detection PD for any value of the unknown parameter β, we speak
of a Uniformly Most Powerful (UMP) detector. Because UMP detectors rarely
exist, other approaches are often explored. One possibility is to constrain our-
selves to small values of the parameter β. In steganalysis, this is the case of
main interest because small payloads are harder to detect than large ones. If β is
the only unknown parameter, one can derive the Locally Most Powerful (LMP)
detector that will have a constant false-alarm rate for small β and thus one will
be able to set a decision threshold in Neyman–Pearson setting. To see this, we
expand log pβ using Taylor expansion around β = 0 and write the log-likelihood
ratio test as
Lβ(x) = log pβ(x) − log p0(x) = β ∂ log pβ(x)/∂β|_{β=0} + O(β²) > log γ, (D.58)
where I(0) is the Fisher information of one observation x[i] (D.42). We also
denoted the marginal distribution of pβ (x) constrained to the ith component
with the same symbol. Notice that the distribution of the test statistic under H0
does not depend on the unknown parameter β. This allows us to compute the
threshold γ for T(x) for a given bound P_FA ≤ ε_FA from
Pr{T(x) > γ | H0} = ∫_{T(x)>γ} p0(x) dx = ε_FA, (D.64)
where β̂0 and β̂1 are maximum-likelihood estimates of β under the corresponding
hypotheses
H0 : x[i] ∼ p, (D.69)
H1 : x[i] ≁ p. (D.70)
The test is directly applicable only to discrete random variables but can be used
for continuous variables after binning. The binning has typically little influence
on the result if the pdf is “reasonable.” For example, for a unimodal pdf, the bin
width can be chosen as σ̂/2, where σ̂ is the sample standard deviation of the
data.
Suppose we have d bins (also called categories). Let o[k] be the observed number of data samples in the kth bin, k = 1, . . . , d, and e[k] the expected occupancy of the kth bin if the data followed the known distribution p. Under the null
hypothesis, the test statistic
S = Σ_{k=1}^{d} (o[k] − e[k])²/e[k] (D.71)
where γ is determined from the bound on the probability of the Type I error (deciding H1 when H0 is correct), ε_FA = Pr{χ²_{d−1} > γ}, given by the complementary cumulative distribution function of the chi-square variable with d − 1 degrees of freedom,
ε_FA = Pr{χ²_{d−1} > γ} = 1 − (1/(2^{(d−1)/2} Γ((d−1)/2))) ∫_0^γ e^{−t/2} t^{(d−1)/2−1} dt. (D.73)
Example D.3: [Fair-die test] Throw a die n = 1000 times and calculate o[k] = number of throws with outcome k, k = 1, . . . , 6. For a fair die, e[k] = n/6 for all k = 1, . . . , 6 (we have d = 6 bins). Because e[k] = 1000/6 > 4, the statistic
S = Σ_{k=1}^{6} (o[k] − n/6)²/(n/6) (D.74)
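A sketch of the test statistic (D.74) on a hypothetical vector of observed counts (the counts are made up; the 95% critical value of the chi-square distribution with 5 degrees of freedom, approximately 11.07, is quoted from standard tables, not computed here):

```python
# Fair-die chi-square statistic on hypothetical observed counts
# from n = 1000 throws.
o = [170, 165, 160, 175, 168, 162]   # made-up observed counts per face
n = sum(o)                            # 1000
e = n / 6                             # expected count per face for a fair die

S = sum((ok - e) ** 2 / e for ok in o)
# Compare with the 95% critical value of chi-square with d - 1 = 5
# degrees of freedom (approximately 11.07): accept H0 if S is below it.
print(round(S, 3), S < 11.07)
```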
Â = x[1]. (D.78)
Â = (1/n) Σ_{i=1}^{n} x[i]. (D.79)
Var[(1/n) Σ_{i=1}^{n} x[i]] = (1/n²) Σ_{i=1}^{n} Var[x[i]] = (1/n²) nσ² = σ²/n, (D.80)
which is n times smaller than for the previous estimator. In fact, this is the best
unbiased estimator of A that we can hope for, in some well-defined sense.
Â = (a/n) Σ_{i=1}^{n} x[i], (D.82)
where a > 0 is a constant. The MSE of any estimator can be written as the sum of the estimator variance and the square of its bias, b(θ̂) = E[θ̂] − θ, because
MSE(θ̂) = E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²] = Var[θ̂] + b²(θ̂).
The last equality follows from the fact that E[θ̂ − E[θ̂]] = 0 and that E[θ̂] − θ is
a number and not a random variable. Thus, the MSE of estimator (D.82) is
MSE(Â) = Var[Â] + b²(Â) = a²σ²/n + (a − 1)² A². (D.86)
Setting the derivative with respect to a to zero,
(d/da) MSE(Â) = 2aσ²/n + 2(a − 1)A² = 0, (D.87)
yields the minimizer
a = A²/(A² + σ²/n). (D.88)
Because of the problem with realizability, we will next only consider unbiased
estimators. Among them, we select as the best the one with the smallest variance.
Such estimators are called Minimum-Variance Unbiased (MVU).
Even MVU estimators may not always exist because the minimum variance for
one fixed estimator may not be minimum for all θ. Again, we do not know when
to switch between estimators if there is an MVU for θ ∈ (−∞, θ0 ) and another
MVU for θ ∈ (θ0 , ∞). There are three possible approaches that we can choose
in practice [125]. We can first obtain the Cramer–Rao Lower Bound (CRLB)
on the estimator variance and show that our estimator’s variance is close to the
theoretical bound, or we can apply the Rao–Blackwell–Lehmann–Sheffe theorem,
or we can restrict our estimator to the class of linear estimators. In this appendix,
we explain the CRLB.
Before formulating the CRLB, we derive a few useful facts and introduce some
terminology.
We say that a pdf p(x; θ) satisfies the regularity condition if
E[∂ log p(x; θ)/∂θ] = 0 for all θ, (D.89)
where the expected value is taken with respect to the pdf p(x; θ). As in Sec-
tion D.1, we adopt the same convention and use x to denote a random variable.
The regularity condition means that
∫ p(x; θ) (∂ log p(x; θ)/∂θ) dx = ∫ (∂p(x; θ)/∂θ) dx = (∂/∂θ) ∫ p(x; θ) dx = ∂1/∂θ = 0 (D.90)
if we can exchange the partial derivative and the integral. It turns out that for
most real-life cases this exchange will be justified (there will be an integrable
majorant for the pdf for some range of the parameter). Thus, the regularity
condition is, in fact, not a strong condition and is practically always satisfied.
By differentiating the regularity condition partially with respect to θ, we es-
tablish one more useful fact,
E[∂² log p(x; θ)/∂θ²] = −E[(∂ log p(x; θ)/∂θ)²], (D.91)
Moreover, an unbiased estimator that attains the bound exists for all θ if and
only if
∂ log p(x; θ)/∂θ = I(θ)(g(x) − θ) (D.97)
for some functions g and I. The MVU estimator is
θ̂ = g(x) (D.98)
∂ log p(x; θ)/∂θ = I(θ)(g(x) − θ), (D.107)
∂² log p(x; θ)/∂θ² = I′(θ)(g(x) − θ) − I(θ), (D.108)
E[∂² log p(x; θ)/∂θ²] = E[I′(θ)(g(x) − θ)] − I(θ) = −I(θ). (D.109)
Thus,
I(θ) = −E[∂² log p(x; θ)/∂θ²] = E[(∂ log p(x; θ)/∂θ)²]. (D.110)
To calculate the variance of θ̂ = g(x), we rewrite (D.107), square, and take the
expected value of both sides:
(1/I(θ)) ∂ log p(x; θ)/∂θ = g(x) − θ, (D.111)
E[((1/I(θ)) ∂ log p(x; θ)/∂θ)²] = E[(g(x) − θ)²] = Var[θ̂], (D.112)
(1/I(θ)²) E[(∂ log p(x; θ)/∂θ)²] = Var[θ̂], (D.113)
1/I(θ) = Var[θ̂]. (D.114)
The last equality follows from the analytic expression for I(θ).
The quantity I(θ) = E[(∂ log p(x; θ)/∂θ)2 ] is called Fisher information. It is a
measure of how fast the pdf changes with θ at x. Intuitively, the faster the pdf
changes, the more accurately we can estimate θ from the measurements. The
larger it is, the smaller the bound on the unbiased estimator variance, Var[θ̂] ≥
1/I(θ).
The Fisher information is always non-negative and it is additive when the
observations are independent. This is because for independent observations
"
n
p(x; θ) = p(x[i]; θ) (D.115)
i=1
Example D.6: We can use the CRLB to prove that the sample mean is the MVU
estimator for DC level in AWGN (Example D.4). The pdf of each observation is
a Gaussian N (A, σ 2 ). Since all n observations are independent random variables,
their joint pdf is the product of their marginal pdfs and we have
p(x; A) = ∏_{i=1}^{n} (1/√(2πσ²)) e^{−(x[i]−A)²/(2σ²)} = (2πσ²)^{−n/2} e^{−(1/(2σ²)) Σ_{i=1}^{n} (x[i]−A)²}. (D.117)
Thus,
∂ log p(x; A)/∂A = −(1/(2σ²)) Σ_{i=1}^{n} (−2)(x[i] − A) = (1/σ²)(Σ_{i=1}^{n} x[i] − nA) = (n/σ²)(x̄ − A).
From here, the CRLB tells us that the sample mean g(x) = x̄ is the MVU estimator with variance σ²/n.
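A quick simulation (the values of A, σ, n, and the number of trials are illustrative) confirms that the empirical variance of the sample mean sits at the CRLB σ²/n:

```python
import random
import statistics

# Check numerically that the sample mean attains the CRLB sigma^2/n
# for a DC level A in white Gaussian noise.
random.seed(1)
A, sigma, n, trials = 3.0, 2.0, 50, 10000

estimates = []
for _ in range(trials):
    x = [A + random.gauss(0.0, sigma) for _ in range(n)]
    estimates.append(sum(x) / n)

crlb = sigma ** 2 / n                      # = 0.08
var_hat = statistics.variance(estimates)   # empirical variance of the estimator
print(round(crlb, 4), round(var_hat, 4))
```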
Now suppose we know the DC level A and wish to estimate the noise variance σ² instead. We start by writing down the pdf of the n observations, with σ² being the unknown parameter,
p(x; σ²) = ∏_{i=1}^{n} (1/√(2πσ²)) e^{−(x[i]−A)²/(2σ²)} = (2πσ²)^{−n/2} e^{−(1/(2σ²)) Σ_{i=1}^{n} (x[i]−A)²}. (D.119)
We have
∂ log p(x; σ²)/∂σ² = (∂/∂σ²)[−(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (x[i] − A)²] (D.120)
= −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (x[i] − A)² (D.121)
= (n/(2σ⁴))[(1/n) Σ_{i=1}^{n} (x[i] − A)² − σ²]. (D.122)
Thus, the sample variance σ̂² = (1/n) Σ_{i=1}^{n} (x[i] − A)² is an MVU estimator and
its variance is 2σ 4 /n. Note that while the MVU estimator of A does not need
knowledge of the noise variance σ 2 , the MVU estimator of variance needs knowl-
edge of A.
If both A and σ 2 are unknown, it is tempting to write an estimator for the
variance as
σ̂² = (1/n) Σ_{i=1}^{n} (x[i] − Â)² = (1/n) Σ_{i=1}^{n} (x[i] − x̄)². (D.123)
However, this plug-in estimator is biased. In fact, plug-in estimators are generally
biased unless they are linear in the parameter. The reader is challenged to verify
that
σ̂² = (1/(n − 1)) Σ_{i=1}^{n} (x[i] − x̄)² (D.124)
is an unbiased estimator of σ².
Even though the CRLB can be used for derivation of MVU estimators, it is rarely
used in this way except for some simple cases. In practice, alternative approaches
that typically give good estimators are used. The two most common principles
are maximum-likelihood and maximum a posteriori estimation.
In Maximum-Likelihood Estimation (MLE), the parameters are estimated from measurements x using the following optimization problem:
θ̂ = arg max_θ log p(x|θ),
and, for independent observations, the parameter can be obtained by solving the following algebraic equation:
Σ_{i=1}^{n} ∂ log p(x[i]|θ)/∂θ = 0. (D.128)
which can be solved for p and s. In this particular case, we can substitute for n/p
from the first equation into the second one and thus deal with only one algebraic
equation for s.
When we have prior information about the distribution of the parameters, we can
utilize this knowledge to obtain a better estimate. This approach to parameter
estimation is called Maximum A Posteriori (MAP) estimation,
θ̂ = arg max_θ [log p(x|θ) + log p(θ)],
where p(θ) is the a priori distribution of the parameter. Note that when p(θ) is uniform, ML and MAP estimates coincide. The reader is referred to [125] for more information about these two important estimation methods.
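For a Gaussian likelihood with a Gaussian prior on the mean, both estimates have closed forms, which makes the contrast easy to see in a few lines (the data and the prior parameters are made up):

```python
# MLE vs MAP for the mean theta of Gaussian data x[i] ~ N(theta, sigma^2)
# with a Gaussian prior theta ~ N(mu0, tau^2).
x = [2.1, 1.8, 2.4, 2.0, 1.7]     # hypothetical measurements
sigma2, mu0, tau2 = 0.25, 0.0, 1.0
n = len(x)

theta_mle = sum(x) / n
# Posterior mode of a Gaussian likelihood with a Gaussian prior:
theta_map = (sum(x) / sigma2 + mu0 / tau2) / (n / sigma2 + 1 / tau2)

print(round(theta_mle, 4), round(theta_map, 4))
# The MAP estimate is pulled from the MLE toward the prior mean mu0;
# as tau2 grows (flat prior), the two estimates coincide.
```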
y = f (x; θ) + η, (D.136)
J(θ) = Σ_{i=1}^{l} (f(x_i; θ) − y[i])² (D.137)
∂J(θ)/∂θ[k] = 2 Σ_{i=1}^{l} (f(x_i; θ) − y[i]) ∂f(x_i; θ)/∂θ[k] = 0. (D.138)
In general, this problem needs to be solved numerically, e.g., using the Newton–
Raphson method.
When the model is linear, a closed-form solution to the optimization can be
obtained. Considering the vectors as column vectors, f (x; θ) = x θ, θ ∈ Rd , and
(D.136) can be written in a compact matrix form as
y = Hθ + η, (D.139)
where the ith row of the l × d matrix H is xi . We will further assume that H is
of full rank.
Writing the functional J(θ) as
J(θ) = (y − Hθ)^T(y − Hθ) = y^T y − 2y^T Hθ + θ^T H^T Hθ,
we obtain its gradient
∂J(θ)/∂θ = −2H^T y + 2H^T Hθ. (D.142)
The least-square estimate of the parameter, θ̂, is obtained by setting the gradient to zero and solving for θ,
θ̂ = (H^T H)^{−1} H^T y.
The LSE is usually applied when the properties of the modeling noise are not known. We note that the LSE becomes the MLE when the modeling noise η is Gaussian, N(0, σ²I). This is because MLE maximizes
log p(y|θ) = −(l/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{l} (y[i] − H[i, .]θ)², (D.144)
which is maximized by minimizing J(θ).
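A minimal sketch of the normal equations θ̂ = (H^T H)^{−1} H^T y for a straight-line model y = ax + b, solved by Cramer's rule on the 2 × 2 system (the data points are made up and exactly linear, so the fit must recover them):

```python
# Linear least squares for y = a*x + b via the normal equations.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]      # exactly y = 2x + 1
l = len(xs)

# Rows of H are (x_i, 1); form H^T H theta = H^T y explicitly.
Sxx = sum(x * x for x in xs); Sx = sum(xs)
Sxy = sum(x * y for x, y in zip(xs, ys)); Sy = sum(ys)

# Cramer's rule on the 2x2 system [[Sxx, Sx], [Sx, l]] theta = [Sxy, Sy].
det = Sxx * l - Sx * Sx
a = (Sxy * l - Sx * Sy) / det
b = (Sxx * Sy - Sx * Sxy) / det
print(a, b)  # 2.0 1.0
```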
Consider a noisy signal x[i] = S[i] + ξ[i], i = 1, . . . , n, which is a superposition of the original noise-free signal S[i], the true scene, and Gaussian noise ξ[i]. The following assumptions are made about the original signal and the noise:
1. S[i] are zero-mean jointly Gaussian random variables with a known covariance
matrix C[i, j] = Cov [S[i], S[j]] = E [S[i]S[j]].
2. The noise is a sequence of Gaussian random variables with a known covariance
matrix. Here, we will assume that the covariance matrix is diagonal with
variance σ 2 on its diagonal (in other words, the sequence of random variables
ξ[i] ∼ N (0, σ 2 ) is iid). We also assume that S and ξ are independent of each
other.
Our task is to estimate S from x in the least-square sense so that the expected
value of the error
E[(S[i] − Ŝ[i])²] (D.146)
is minimum for each i. We seek the estimate as a linear function of the (noisy)
observations
Ŝ[i] = Σ_{j=1}^{n} W[i, j] x[j]. (D.147)
The requirement of minimal mean-square error means that we need to find the
matrix W so that the expected value of the error for the ith sample, e[i], is
minimal,
e[i](W[i, 1], . . . , W[i, n]) = E[(S[i] − Σ_{j=1}^{n} W[i, j] x[j])²]. (D.148)
We do so by differentiating e[i] with respect to its arguments and solving for the
local minimum,
∂e[i]/∂W[i, j] = E[−2(S[i] − Σ_{k=1}^{n} W[i, k] x[k]) x[j]] = 0 for all i, j. (D.149)
Rewriting further and substituting for x[j] = S[j] + ξ[j], we obtain for all i and
j
E[S[i](S[j] + ξ[j])] = E[Σ_{k=1}^{n} W[i, k] (S[k] + ξ[k]) (S[j] + ξ[j])]. (D.150)
Using the fact that ξ[j] and ξ[k] are independent for j = k and the fact that S
and ξ are also independent signals, we have E [S[k]ξ[j]] = 0 and E [ξ[k]ξ[j]] =
δ(k − j)σ 2 , where δ is the Kronecker delta. Thus, we can rewrite (D.150) in
matrix notation as
C = W(C + σ²I), (D.152)
which gives W = C(C + σ²I)^{−1}.
It should be clear that we have indeed found a minimum because the second partial derivative with respect to W[i, j] is ∂²e[i]/∂W²[i, j] = E[2x²[j]] > 0.
In the special case when the covariance matrix C is diagonal,3 we have C = diag(σ²[1], . . . , σ²[n]) and the inverse (C + σ²I)^{−1} as well as the matrix W are also diagonal, W = diag(w[1], . . . , w[n]), with
w[i] = σ²[i]/(σ²[i] + σ²). (D.154)
The denoised signal is
Ŝ[i] = (σ²[i]/(σ²[i] + σ²)) x[i]. (D.155)
μ̂[i, j] = (1/|N[i, j]|) Σ_{(k,l)∈N[i,j]} x[k, l]
is the average grayscale value in the neighborhood N[i, j] of pixel (i, j) containing |N[i, j]| pixels (e.g., we can take the 3 × 3 neighborhood). Note that
Var[x[i, j]] = Var[S[i, j]] + σ², (D.158)
where Var[x[i, j]] can be approximated4 by the sample variance in the neighborhood N[i, j],
Var[x[i, j]] ≈ σ̂²[i, j] = (1/|N[i, j]|) Σ_{(k,l)∈N[i,j]} x²[k, l] − μ̂²[i, j]. (D.159)
Now, consider the signal x[i, j] − μ̂[i, j] = S[i, j] − μ̂[i, j] + ξ[i, j] as the equiva-
lent of the noisy signal x[i] in the derivation of the Wiener filter, and S[i, j] −
μ̂[i, j] the equivalent of the noise-free signal S[i]. From (D.158), the variance
of the noise-free signal Var [S[i, j] − μ̂[i, j]] = Var [S[i, j]] = Var [x[i, j]] − σ 2 ≈
σ̂ 2 [i, j] − σ 2 . Thus, we obtain the Wiener filter for denoising images in the form
(D.155)
Ŝ[i, j] = μ̂[i, j] + ((σ̂²[i, j] − σ²)/σ̂²[i, j]) (x[i, j] − μ̂[i, j]). (D.160)
We summarize that the Wiener filter is an adaptive linear filter that is opti-
mal in the sense that it minimizes the MSE between the denoised image and the
noise-free image under the assumption that the zero-meaned pixel values form
a sequence of independent zero-mean Gaussian variables (not necessarily with
equal variances – we allow non-stationary signals) and the image is corrupted
with white Gaussian noise of known variance σ 2 . This is exactly the implementa-
tion of wiener2.m in Matlab. The command wiener2(X,[N N ],σ 2 ) returns a
denoised version of X, where N is the size of the square neighborhood N . If the
parameters are not specified by the user, Matlab determines the default value of
σ 2 as the average variance in 3 × 3 neighborhoods over the whole image.
4 We are really making a tacit assumption that the pixel values are locally stationary.
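A pure-Python sketch in the spirit of (D.160) and wiener2 (the edge handling and the clipping of the signal-variance estimate at zero are implementation choices assumed here, not taken verbatim from Matlab):

```python
# Adaptive local Wiener denoising with a 3x3 neighborhood; edge pixels
# simply use whatever neighbors are available.
def wiener_denoise(img, noise_var):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            nb = [img[k][l]
                  for k in range(max(0, i - 1), min(h, i + 2))
                  for l in range(max(0, j - 1), min(w, j + 2))]
            mu = sum(nb) / len(nb)                         # local mean
            var = sum(v * v for v in nb) / len(nb) - mu * mu  # local variance
            # Clip the signal-variance estimate at zero (an assumed choice
            # mirroring the spirit of wiener2).
            gain = max(var - noise_var, 0.0) / var if var > 0 else 0.0
            out[i][j] = mu + gain * (img[i][j] - mu)
    return out

flat = [[5.0] * 4 for _ in range(4)]
print(wiener_denoise(flat, 1.0))  # a flat image is returned unchanged
```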
A vector space V is a set of objects that can be added and multiplied by a scalar
from some field T . The addition of elements from V is associative and commuta-
tive and there exists a zero vector (so that x + 0 = x for all x ∈ V). Every element
x ∈ V also has an inverse element, −x ∈ V, so that x + (−x) = 0. The multiplica-
tion by a scalar, α ∈ T , is distributive in the sense that for all α, β ∈ T , x, y ∈ V,
α(x + y) = αx + αy and (α + β)x = αx + βx, and associative, α(βx) = (αβ)x.
Also, for the identity element in T , 1 · x = x for all x ∈ V.
An inner product in a vector space is a mapping ⟨x, y⟩ : V × V → R with the following properties.
We now provide several examples of important vector spaces with inner product.
⟨x, y⟩ = ⟨(x[1], . . . , x[n]), (y[1], . . . , y[n])⟩ = Σ_{i=1}^{n} x[i]y[i] (D.161)
Example D.10: [L2 space] V = L2 (a, b) is the space of all quadratically inte-
grable functions on [a, b] with T = R. To be absolutely precise here, we consider
two functions f and g from this space as identical if they differ on a set of
Lebesgue measure 0. Thus, formally speaking, the elements of this vector space
are classes of equivalence of functions. We endow this space with an inner product
⟨f, g⟩ = ∫_a^b f(t)g(t) dt. (D.162)
It is easy to verify that the four defining properties of an inner product are satisfied. The fourth property is satisfied because if ∫_a^b f²(t) dt = 0, then f(t) = 0 almost everywhere (in the sense of being zero up to a set of Lebesgue measure 0).
5 For complex inner products that map to C, the set of all complex numbers, the symmetry is replaced with conjugate symmetry ⟨x, y⟩ = ⟨y, x⟩*.
In this book, we will also encounter spaces with a slightly different inner product defined as
⟨f, g⟩ = ∫_a^b w(t)f(t)g(t) dt, (D.163)
E - Support vector machines pp. 363-376
E Support vector machines
R(f) = ∫_{X×Y} u(−y f(x)) dP(x, y), (E.1)
Clearly, 0 ≤ R(f ) ≤ 1, and R(f ) = 0 when f (x) correctly assigns the labels to
all x ∈ X up to a set of measure zero (with respect to P ). We also stress at
this point that unless we know the measure P (x, y), we cannot guarantee that
any estimated decision function is optimal (that it minimizes the risk functional
R). Unfortunately, in most practical applications the measure will not be known.
Note that if P (x, y) were known, we could apply the apparatus of classical detec-
tion theory and derive optimal detectors (classifiers) using the Neyman–Pearson
or Bayesian approach as explained in Appendix D.
A decision function that classifies all training examples correctly has zero empirical risk,
Remp (f ) = (1/l) Σ_{i=1}^{l} u(−y[i]f (xi )) = 0. (E.4)
The function classifies a point x ∈ Rn according to which side of the hyperplane
w · x = b it lies on. Because the decision function is fully described by
the separating hyperplane, we will use the terms decision function, hyperplane,
and classifier interchangeably depending on the context.
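A minimal sketch of this decision rule (the function name and the convention that sign(0) = +1 are illustrative assumptions):

```python
def decide(w, b, x):
    """Linear SVM decision function f(x) = sign(w . x - b).
    Points with w . x - b >= 0 are assigned the positive class."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if s >= 0 else -1

# The hyperplane x1 + x2 = 1 (w = (1, 1), b = 1) splits the plane:
assert decide((1.0, 1.0), 1.0, (2.0, 2.0)) == 1
assert decide((1.0, 1.0), 1.0, (-1.0, -1.0)) == -1
```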
If the training set is linearly separable, there may exist infinitely many deci-
sion functions f (x) = sign(w · x − b) perfectly classifying the training set with
Remp (f ) = 0. To lower the chance of making an incorrect decision on x not con-
tained in the training set, we select the separating hyperplane with maximum
distance from positive and negative training examples. This hyperplane, which
we denote f ∗ , is uniquely defined. It can be found by solving the following opti-
mization problem:
[w∗ , b∗ ] = arg max_{w,b} min_{1≤i≤l} y[i](w · xi − b)/‖w‖ (E.5)
subject to
y[i] ((w · xi ) − b) > 0, for all i ∈ {1, . . . , l}. (E.6)
Writing x◦ and x• for the closest training examples from the positive and negative classes, the solution satisfies
(w∗ · x◦ − b∗ )/‖w∗ ‖ = −(w∗ · x• − b∗ )/‖w∗ ‖.
In other words the distances from the separating hyperplane to the closest points
from each class must be equal (see Figure E.1). If they were not, we could
move the hyperplane away from the closer class and thus decrease the minimum
in (E.5), which would contradict the optimality of the solution [w∗ , b∗ ]. The
closest examples x◦ , x• cannot lie on the separating hyperplane because we have
strict inequality in (E.6). Keeping in mind that x◦ is from the positive class,
there has to exist ε > 0 so that
w∗ · x◦ − b∗ = +ε, (E.10)
w∗ · x• − b∗ = −ε. (E.11)
Figure E.2 Example of a training set on X = R2 that cannot be linearly separated. The
separating hyperplane is again defined by support vectors. Incorrectly classified examples
are displayed as large circles with their slack variables ξ • , ξ ◦ > 0.
Figure E.3 Comparison of the step function u(z) and its convex majorant hinge loss
h(z) = max{0, 1 + z}.
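The two losses in Figure E.3 can be sketched directly (illustrative code, not from the text):

```python
def u(z):
    """Step function u(z): 1 for z >= 0, else 0 (misclassification indicator)."""
    return 1.0 if z >= 0 else 0.0

def h(z):
    """Hinge loss h(z) = max{0, 1 + z}."""
    return max(0.0, 1.0 + z)

# h is a convex majorant of u: h(z) >= u(z) on a grid of test points.
grid = [i / 10.0 for i in range(-30, 51)]
assert all(h(z) >= u(z) for z in grid)
assert h(-2.0) == 0.0 and h(0.0) == 1.0 == u(0.0)
```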
The classifier f ∗ with maximum margin and minimal loss R(f ∗ ) is found by
solving the following optimization problem:
[w∗ , b∗ ] = arg min_{w,b,ξ} (1/2)‖w‖² + C · Σ_{i=1}^{l} u(ξ[i]) (E.14)
subject to constraints
y[i] ((w · xi ) − b) ≥ 1 − ξ[i], for all i ∈ {1, . . . , l}, (E.15)
ξ[i] ≥ 0, for all i ∈ {1, . . . , l}, (E.16)
for some suitably chosen value of the penalization constant C. The “slack” vari-
ables ξ[i] in (E.14) measure the distance of incorrectly classified examples xi
from the separating hyperplane. Of course, if xi is classified correctly, ξ[i] is zero
and thus u(ξ[i]) = 0.
Unfortunately, the optimization problem (E.14) is NP-complete. The complex-
ity can be significantly reduced by replacing the step function u(z) with the hinge
loss function h(z) = max{0, 1 + z}. Because h(z) is convex and h(z) ≥ u(z) for
z ≥ 0, it transforms (E.14) to a convex quadratic-programming problem
[w∗ , b∗ ] = arg min_{w,b,ξ} (1/2)‖w‖² + C · Σ_{i=1}^{l} ξ[i] (E.17)
subject to the same constraints (E.15) and (E.16). The advantage of this formulation will
become clear once we move to kernelized SVMs in Section E.3. The constrained
optimization problem (E.17) can be approached in a standard manner using
Lagrange multipliers. Since the constraints are in the form of inequalities, the
multipliers must be non-negative (for equality constraints, there is no limit on
the multipliers’ values). The Lagrangian is
L(w, b, ξ, α, r) = (1/2) w · w + C Σ_{i=1}^{l} ξ[i] (E.18)
− Σ_{i=1}^{l} α[i] {y[i] ((w · xi ) − b) − 1 + ξ[i]} − Σ_{i=1}^{l} r[i]ξ[i] (E.19)
subject to constraints
Σ_{i=1}^{l} α[i]y[i] = 0, (E.24)
Note that the formulation of the dual problem does not contain the Lagrange
multipliers r[i].
The main advantage of solving the dual problem (E.23) over the primal prob-
lem (E.17) is that the complexity (measured by the number of free variables) of
the dual problem depends on the number of training examples, while the com-
plexity of the primal problem depends on the dimension of the input space X .
After we introduce kernelized SVMs in Section E.3, we will see that the dimen-
sion of the primal problem can be much larger (even infinite) than the number
of training examples.
Denoting again the solutions of the dual problem (E.23) with superscript ∗,
we need to recover the solution of the primal problem (E.17), which is the pair
[w∗ , b∗ ], from α∗ = (α∗ [1], . . . , α∗ [l]). From (E.20), we can easily obtain the hy-
perplane normal w∗ as
w∗ = Σ_{i=1}^{l} α∗ [i]y[i]xi . (E.26)
The computation of the threshold b∗ is more involved and here we include only
its most frequently used form without proof,
b∗ = (1/|J |) Σ_{j∈J} ((w∗ · xj ) − y[j]), (E.27)
where J = {i ∈ {1, . . . , l}|0 < α∗ [i] < C}. Equation (E.27) can be obtained from
the so-called Karush–Kuhn–Tucker conditions for the primal problem [33, 240].
Solving either the primal or dual optimization problem is commonly called
training of SVMs. Technically, any optimization library that includes a routine
for quadratic programming can be used. Most general-purpose libraries, how-
ever, are usually able to solve only small-scale problems. Therefore, we highly
recommend using algorithms developed specifically for SVMs, such as LibSVM,
http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
E.3 Kernelized support vector machines
The linear SVMs described in the previous section can implement only linear
decision functions f, which is rarely sufficient for real-world problems. In this
section, we extend SVMs to non-linear decision functions. The extension is surprisingly simple.
The main idea is to map the input space X , which is the space where the
observed data lives, to a different space H, using a non-linear data-driven map-
ping φ : X → H, and then find the separating hyperplane in H (for linear SVMs
described above, X = H). The non-linearity introduced through the mapping φ
allows implementation of a non-linear decision boundary in the input space X
as a linear decision boundary in H.
While the input space X is usually given by the nature of the application (it
is the space of features extracted from images), the space H can be freely chosen
as long as it satisfies the following two conditions.
1. H is a vector space endowed with an inner product ⟨·, ·⟩_H.
2. H is complete with respect to the norm induced by this inner product (H is a Hilbert space).
These conditions ensure that the maximum-margin hyperplane exists in H and
that it can be found by solving the dual problem (E.23).
The space H is a function space obtained by completing the set of all functions
on X that are linear combinations
f (x) = Σ_i a[i]k(xi , x), (E.28)
endowed with the inner product
⟨f, g⟩_H = Σ_{i,j} a[i]b[j]k(xi , xj ), (E.29)
where g(x) = Σ_j b[j]k(xj , x). Note that the Hilbert space is driven by the data
xi . The positive definiteness of the kernel guarantees that (E.29) is an inner
product.
One of the most popular kernel functions is the Gaussian kernel
k(x, x′ ) = exp(−γ‖x − x′ ‖²), (E.30)
where γ > 0 is a parameter controlling the width of the kernel and ‖x‖ is the
Euclidean norm of x. The Hilbert space H induced by the Gaussian kernel has
infinite dimension. Other popular kernels are the linear kernel k(x, x′ ) = x · x′ ,
the polynomial kernel of degree d, defined as k(x, x′ ) = (r + γx · x′ )^d, and the
sigmoid kernel k(x, x′ ) = tanh(r + γx · x′ ), both with two parameters γ, r.
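The kernels above can be sketched as follows (parameter defaults are illustrative assumptions):

```python
import math

def gaussian(x, xp, gamma=0.5):
    """Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * d2)

def linear(x, xp):
    """Linear kernel k(x, x') = x . x'."""
    return sum(a * b for a, b in zip(x, xp))

def polynomial(x, xp, d=3, gamma=1.0, r=1.0):
    """Polynomial kernel of degree d: k(x, x') = (r + gamma * x . x')^d."""
    return (r + gamma * linear(x, xp)) ** d

def sigmoid(x, xp, gamma=1.0, r=0.0):
    """Sigmoid kernel k(x, x') = tanh(r + gamma * x . x')."""
    return math.tanh(r + gamma * linear(x, xp))

x = (1.0, 2.0)
assert gaussian(x, x) == 1.0            # zero distance maps to 1
assert linear(x, (3.0, 4.0)) == 11.0    # 1*3 + 2*4
assert polynomial(x, x, d=2) == 36.0    # (1 + 5)^2
```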
Non-linear SVMs are implemented in the same way as in Section E.2.2, with
the exception that now all operations should be carried out in the Hilbert space
H. Because the dimensionality of the feature space H can be infinite, we now
must use the dual optimization problem (E.23) because its dimensionality is
determined by the cardinality of the training set. Fortunately, because xi always
appears in the inner product, we can simply replace the inner product with
its kernel-based expression k(x, x′ ) = ⟨φ(x), φ(x′ )⟩_H . This substitution, called
the “kernel trick,” is possible in all algorithms where the calculation with data
appears exclusively in the form of inner products.
1 A mapping k : X × X → R is positive definite if and only if ∀n ≥ 1, ∀x1 , . . . , xn ∈ X , ∀c[1], . . . , c[n] ∈ R,
Σ_{i,j=1}^{n} c[i]c[j]k(xi , xj ) ≥ 0.
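The positive-definiteness condition in the footnote can be probed empirically for the Gaussian kernel; the following is a numerical spot check on random data, not a proof:

```python
import math
import random

def gaussian(x, xp, gamma=1.0):
    """Gaussian kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xp)))

# For any points x_1..x_n and coefficients c, the quadratic form
# sum_ij c[i] c[j] k(x_i, x_j) must be non-negative.
random.seed(0)
pts = [(random.random(), random.random()) for _ in range(8)]
for _ in range(100):
    c = [random.uniform(-1.0, 1.0) for _ in pts]
    q = sum(c[i] * c[j] * gaussian(pts[i], pts[j])
            for i in range(len(pts)) for j in range(len(pts)))
    assert q >= -1e-9  # non-negative up to floating-point rounding
```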
max_{α,r} L(α, r) = Σ_{i=1}^{l} α[i] − (1/2) Σ_{i,j=1}^{l} α[i]α[j]y[i]y[j] ⟨φ(xi ), φ(xj )⟩_H (E.31)
= Σ_{i=1}^{l} α[i] − (1/2) Σ_{i,j=1}^{l} α[i]α[j]y[i]y[j]k (xi , xj ) , (E.32)
with constraints
Σ_{i=1}^{l} α[i]y[i] = 0, (E.33)
As the constraints do not contain any vector manipulation, they stay the same.
In the equation w∗ = Σ_{i=1}^{l} α∗ [i]y[i]φ(xi ) of the optimal hyperplane w∗ , all manipulations are carried out in H and we cannot convert it to the input space X .
Fortunately, we do not need to know w∗ explicitly because the decision func-
tion (E.3) can be rewritten as
f (x) = sign( Σ_{j=1}^{l} α∗ [j]y[j]k (xj , x) − b∗ ). (E.37)
b∗ = (1/|J |) Σ_{j∈J} ((w∗ · φ(xj )) − y[j]) = (1/|J |) Σ_{j∈J} ( Σ_{i=1}^{l} α∗ [i]y[i]k (xi , xj ) − y[j] ), (E.38)
subject to constraints
Σ_{i=1}^{l} α[i]y[i] = 0, (E.43)
C+ ≥ α[i] ≥ 0, for all i ∈ I+ , (E.44)
C− ≥ α[i] ≥ 0, for all i ∈ I− . (E.45)
b∗ = (1/(|J− | + |J+ |)) Σ_{j∈J− ∪J+} ( Σ_{i=1}^{l} α∗ [i]y[i]k (xi , xj ) − y[j] ), (E.46)
The kernel type and its parameters as well as the penalization parameter(s)
have a great impact on the accuracy of the classifier. Unfortunately, there is
no universal methodology for selecting them. We provide some guidelines that
often give good results.
E.5.1 Scaling
First, the input data needs to be preprocessed. Assuming X = Rn , which is the
case for steganalysis, the input data is scaled so that all elements of vectors xi
from the training set are in the range [−1, +1]. This scaling is very important as
it ensures that features with large numeric values do not dominate features with
small values. It also increases the numerical stability of the learning algorithm.
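A minimal sketch of this scaling, assuming min–max normalization fitted on the training set (function names are illustrative); note that the same training-set minima and maxima must be reused when scaling the test set:

```python
def fit_scaler(train):
    """Record per-feature minimum and maximum on the training set."""
    dim = len(train[0])
    lo = [min(x[j] for x in train) for j in range(dim)]
    hi = [max(x[j] for x in train) for j in range(dim)]
    return lo, hi

def scale(x, lo, hi):
    """Map each feature linearly so the training range becomes [-1, +1].
    Constant features (hi == lo) are mapped to 0."""
    return [2.0 * (v - l) / (h - l) - 1.0 if h > l else 0.0
            for v, l, h in zip(x, lo, hi)]

train = [(0.0, 100.0), (5.0, 300.0), (10.0, 200.0)]
lo, hi = fit_scaler(train)
scaled = [scale(x, lo, hi) for x in train]
assert all(-1.0 <= v <= 1.0 for row in scaled for v in row)
assert scale((5.0, 200.0), lo, hi) == [0.0, 0.0]  # midpoints map to 0
```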
error on “unknown” examples. This is repeated k times, each time with different
subsets. k-fold cross-validation usually gives estimates of error close to the error
we can expect on truly unknown data.
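The fold construction underlying k-fold cross-validation can be sketched as follows (contiguous folds are assumed for simplicity; in practice the examples are usually shuffled first):

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous folds; each fold serves once
    as the held-out ("unknown") part while the rest is used for training."""
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 3)
assert len(folds) == 3
# The folds are disjoint and together cover all n examples.
assert sorted(i for f in folds for i in f) == list(range(10))
```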
When the costs of the two error types are known and equal to w+ and w− for an error
on the positive and negative class, respectively, we can use weighted SVMs with the
Bayesian approach and minimize the total cost w+ p+ PMD + w− p− PFA ,
where p− and p+ are the a priori probabilities of the negative and positive classes.
In this case, the search for the parameters amounts to a search over a grid G of
triples (C+ , C− , γ) minimizing this total cost.
Note that even though the grid G has 3 dimensions, its effective dimension is 2
because the ratio C + /C − = w+ p+ /(w− p− ) must stay constant.
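The effectively two-dimensional grid can be sketched as follows (the ratio value and the grid ranges are illustrative assumptions):

```python
# Hypothetical grid over (C+, C-, gamma) for a weighted SVM in the Bayesian
# setting: the ratio C+/C- = w+ p+ / (w- p-) is held fixed, so the grid is
# effectively two-dimensional (one scale for C and one for gamma).
ratio = 2.0                                   # assumed value of w+ p+ / (w- p-)
c_scales = [2.0 ** e for e in range(-3, 4)]   # candidate values of C-
gammas = [2.0 ** e for e in range(-5, 2)]     # candidate kernel widths
grid = [(ratio * c, c, g) for c in c_scales for g in gammas]

assert len(grid) == len(c_scales) * len(gammas)          # 2-D, not 3-D
assert all(abs(cp / cm - ratio) < 1e-12 for cp, cm, _ in grid)
```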
In a Neyman–Pearson setting, we impose an upper bound on the probability
of false alarms, PFA ≤ εFA < 1, and minimize the probability of missed detection
PMD . The search for (C+ , C− , γ) becomes a search over the full grid G.
In this case, the grid G has the effective dimension 3, which makes the search
computationally expensive. Frequently, suboptimal search [44] is used to alleviate
the computational complexity.
The mathematical notation in this book was chosen to be compact and visually
distinct to make it easier for the reader to interpret formulas. For this reason, we
adhere to short names of variables, sets, and functions rather than long descrip-
tive names similar to variable naming in programming as the latter would lead
to long and rather unsightly mathematical formulas. The price paid for the short
variable names is an occasional reuse of some symbols across the text. However,
the author strongly believes that the meaning of the symbols should always be
clear and unambiguous from the context. The most frequently occurring key
symbols, such as the relative message length α, change rate β, cover object x, or
stego object y, are used exclusively and never reused. Whenever possible, variable names are chosen mnemonically as the first letter of the concept they
stand for, such as h for histogram, A for alphabet, etc. In some cases, however,
if there exists a widely accepted notation, rather than coining a new symbol,
we accept the established notation. A good example is the parity-check matrix,
which is almost exclusively denoted as H in the coding literature.
Everywhere in this book, vectors and matrices of predefined dimensions are
printed in boldface with indices following the symbol in square brackets. Thus,
the ijth element of the parity-check matrix H is H[i, j] and the jth element
of the histogram is h[j]. This notation generalizes to higher-dimensional arrays.
The transpose of a matrix is denoted with a prime: H′ is the transpose of H.
Sequences of vectors or matrices will be indexed with a subscript. Thus, fk will be
a sequence of vectors f . If it is meaningful and useful, we may sometimes arrange
a sequence of vectors into a matrix and address individual vector elements using
two indices, e.g., f [i, k]. Often, quantities may depend on parameters, such as
the relative payload α. The histogram hα [j] stands for the histogram of the
stego image embedded with relative payload α. If the vector or matrix quantities
depend on other parameters or indices, we may put them as superscripts. An
example would be the histogram of the DCT mode (k, l) from a JPEG stego
image, h_α^{(k,l)}[j]. Strings of symbols will also be typeset as vectors. The length of
string s is |s|.
Random variables are typeset in sans serif font x, y. A vector/matrix random
variable will then be boldface sans serif, x, y. When a random variable x follows
probability distribution f , we will write x ∼ f . By a common abuse of notation,
we will sometimes write x ∼ f , meaning that x, whose realization is vector x,
follows the distribution f . When the probability distribution is clear from the
context or unspecified, we will write Pr{·}, where the dot stands for an expression
involving a random variable. For example, Pr{x ≤ x} stands for the probability
that random variable x attains the value x or smaller. The expected value, vari-
ance, and covariance will be denoted as E[x], Var[x], and Cov[x, y] when it is clear
from the context with respect to what probability distribution we are taking the
expected value or variance. Sometimes, we may stress the probability distribu-
tion by writing Ex∼f [y(x)] or simply Ex [y(x)], which is the expected value of y
when x follows the distribution f of variable x. The sample mean will be denoted
with a bar, x̄.
Calligraphic font will be used for sets or regions, X , Y, with |X | denoting the
cardinality of X .
Estimated quantities will be denoted with a hat, meaning that θ̂ is an estimator
of θ. A tilde is often used for quantities subjected to distortion, e.g., ỹ is a noisy
version of y.
In this book, we use Landau’s big O symbol f = O(g) meaning that |f (x)| <
Ag(x) for all x and a constant A > 0. The asymptotic equality f (x) ≈ g(x) holds
for two functions if and only if limx→∞ f (x)/g(x) = 1. We will also use the symbol
≈ with scalars, in which case it stands for approximate equality (e.g., α ≈ 1 or
k ≈ m).
Among other, less common symbols, we point out the eXclusive OR operation
on bits, which we denote with ⊕. Concatenation of strings is denoted as &. We
reserve ∗ for convolution, and ⌊x⌋, ⌈x⌉ for rounding down and up, respectively.
The function round(x) is rounding to the closest integer, while trunc(x) is the
operation of truncation to a finite dynamic range (e.g., to bring the real values
of pixels back to the 0, . . . , 255 range of integers).
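A sketch of round(x) and trunc(x) for 8-bit pixel values (the tie-breaking rule shown, half-way cases rounding up, is an assumption):

```python
import math

def round_half_up(x):
    """round(x): nearest integer, with half-way cases rounded up
    (implemented as floor(x + 0.5); illustrative of the text's round())."""
    return math.floor(x + 0.5)

def trunc(x, lo=0, hi=255):
    """trunc(x): clamp to the finite dynamic range [lo, hi], e.g. the
    0..255 integer range of 8-bit pixels."""
    return max(lo, min(hi, x))

assert round_half_up(2.5) == 3 and round_half_up(2.4) == 2
assert trunc(300) == 255 and trunc(-7) == 0 and trunc(128) == 128
```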
Below, we provide definitions for the most common symbols used throughout
the text.
α relative payload
B bin
β change rate
C code
d2 deflection coefficient
f feature vector
h histogram
h(k,l) histogram of the (k, l)th DCT mode (in a JPEG file)
H parity-check matrix
Ik k × k identity matrix
J1 , J2 JPEG images
k∈K secret stego key from the set of all stego keys
Λ lattice
Ψ isomorphism
QΛ quantizer to lattice Λ
qf quality factor
R covering radius
s syndrome
σ2 variance
V variation
W noise residual
W Wiener filter
y stego image
→ mapping
→^D convergence in distribution
→^P convergence in probability
sign(x) sign of x
Glossary
Alice. A fictitious character from the prisoners’ problem who secretly commu-
nicates with Bob.
Alphabet. A set of symbols used to construct codes.
Amplifier noise. Image noise component due to signal amplification.
APS. Active Pixel Sensor.
Arithmetic coder. A lossless compression algorithm.
Attack. An algorithm whose goal is to detect the usage of steganography.
AUC/AUR. Two acronyms for Area Under ROC Curve.
Average distance to code. The average Hamming distance from a randomly
(uniformly) selected word to a code.
Ball (of radius r and center x). A set of points at distance at most r from x.
Basis pattern. One of the 64 patterns forming the discrete cosine transform
(DCT).
Batch steganography. Steganographic communication by splitting payload
into multiple cover objects.
Bayer pattern. A popular arrangement of red, green, and blue color filters
allowing a sensor to register color.
BCH. A class of parametrized error-correcting codes invented by Bose, Ray-
Chaudhuri, and Hocquenghem.
Benchmarking. A procedure aimed to compare and evaluate systems.
Bernoulli random variable. A random variable B(p) reaching values in {0, 1}
with probabilities p and 1 − p.
Between-image error. A component of the estimation error of quantitative
steganalyzers specific to each image.
Bias. A systematic estimator error, b(θ) = E[θ̂] − θ, where θ is a parameter that
is being estimated.
Binary
– arithmetic. Arithmetic in finite field {0, 1}.
– classification. A classification to two classes.
– entropy function. Entropy of a Bernoulli random variable B(x), H(x) =
−x log x − (1 − x) log(1 − x).
Blind steganalysis. An approach to steganalysis whose aim is to detect an
arbitrary steganographic method.
Blockiness. Sum of discontinuities at boundaries of 8 × 8 blocks in a decom-
pressed JPEG file.
Defective
– memory. A memory in which some cells are defective or permanently stuck
to zero or one.
– pixel. A pixel whose response to light does not comply with design speci-
fications.
Deflection coefficient. A numerical quantity measuring the performance of a
detector.
Demosaicking. Process of interpolating colors from a signal acquired by an
imaging sensor equipped with a color filter array.
Denoising. Process of removing noise from a signal.
DES. Data-Encryption Standard. A symmetric cryptosystem.
Detection probability. Probability of correctly detecting a stego image as
stego.
Detector. An algorithm that detects presence of a signal.
DFT. Discrete Fourier Transform.
Digital image acquisition. Process of acquiring an image through an imaging
sensor.
Discrete cosine transformation. An orthogonal transformation used in JPEG
format.
Distance to code. Distance between a given word and the closest codeword
from the code.
Distortion. Signal degradation typically due to noise or processing.
Distortion-limited embedder. An embedding algorithm whose embedding dis-
tortion is bounded.
Dithering. Process of spreading quantization error to neighboring pixels to
prevent creating bands during color quantization.
DNG. Digital NeGative file format.
Double JPEG compression. Image that was JPEG compressed twice, each
time with a different quantization matrix.
Dry elements. See changeable elements.
Dual
– code. The orthogonal complement of a linear code.
– histogram. A feature used in blind steganalysis of JPEG images.
DWT. Discrete Wavelet Transform.
Embedding
– algorithm/mapping. Algorithm that embeds a secret message in a cover
object.
– capacity. Maximal number of bits that can be embedded using a given
steganographic method.
– distortion. Distortion imposed onto the cover object due to embedding a
secret message.
– efficiency. The expected number of bits communicated per unit expected
embedding distortion.
– impact. Increase in statistical detectability of embedding changes.
– operation. Procedure that modifies individual cover elements during em-
bedding.
– path. A path along which message bits are embedded.
– while dithering. A steganographic method for palette images.
Empirical risk. Risk function evaluated from data samples.
Entropy. Uncertainty (information content) of a random variable.
– encoder/compressor. A lossless compression algorithm capable of com-
pressing a stream of symbols close to its entropy.
– decoder/decompressor. Algorithm inverting the action of an entropy
encoder.
Erasure channel. A communication channel in which some symbols may get
replaced by an erasure symbol.
Error function. Erf(x) = (2/√π) ∫_0^x e^{−t²} dt.
Estimator. An algorithm that computes an estimate of an unknown parameter.
Eve. A fictitious character (the warden) from the prisoners’ problem.
Extraction algorithm/mapping. Algorithm that extracts secret message
from the stego object.
EzStego. A steganographic program for palette images.
F5. A steganographic algorithm for the JPEG format.
False alarm. An error when a cover object is mistakenly identified as stego.
– probability. Probability of encountering a false alarm.
Feasible covert channel. A covert channel that complies with the bound on
the embedding distortion and perfect security.
Feature. A numerical quantity extracted from an object.
Finite field. A finite alphabet of symbols that can be added or multiplied so
that the operations have similar properties as operations of addition and
multiplication on real numbers.
– variable. A random variable with mean μ and variance σ² with probability
density function (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)}.
Log-sum inequality. For non-negative numbers r1 , . . . , rk and s1 , . . . , sk ,
Σ_{i=1}^{k} ri log(ri /si ) ≥ (Σ_{i=1}^{k} ri ) log( Σ_{j=1}^{k} rj / Σ_{j=1}^{k} sj ).
Lossless compression. Compression of data that does not incur any loss of
information.
Lower embedding efficiency. The ratio between the payload length and the
largest number of embedding changes that may be needed to embed that
payload.
LSB. Least Significant Bit. The last bit in a big-endian representation of an
integer.
– embedding. A steganographic method in which message bits are embedded
in a sequence of natural numbers by replacing their least significant bits
with message bits.
– matching. See ±1 embedding.
– pair. A pair of values differing only in their LSBs.
– plane. Array of LSBs.
LSE. Least-Square Estimation. A general estimation method in which data
model parameters are determined by minimizing the squared error between
the model and observations.
Luminance. Color component carrying information about light intensity.
LT
– code. Sparse linear code designed for the erasure channel.
– process. An algorithm based on LT codes used for realizing non-shared
selection channels.
Machine learning. A field of artificial intelligence concerned with algorithms
that learn from data.
Macroblock. A block of 16 × 16 pixels used in JPEG format.
MAD. Median Absolute Deviation. A robust measure of spread.
MAE. Median Absolute Error. A robust measure of error spread.
MAP. Maximum A Posteriori Estimation. A general estimation method for
determining an unknown parameter of a distribution. The parameter is de-
termined by maximizing the probability of the parameter given the measure-
ments.
Margin. The distance between the separating hyperplane and the support vectors
closest to it.
Markov
– chain. A sequence of random variables xi in which each variable depends
only on the previous one (Pr{xi |xi−1 , . . . , x1 } = Pr{xi |xi−1 }). For discrete
variables, a Markov chain is completely described by the transition
probability matrix A[k, l] = Pr{xi = l|xi−1 = k}.
– features. Features for blind steganalysis of JPEG images obtained by quan-
tifying the relationship between neighboring DCT coefficients of the same
block.
– well. A portion of pixel collecting free electrons released due to the photo-
electric effect at image acquisition.
Plaintext. Message to be encrypted.
pmf. Probability mass function (defined for discrete variables).
PNG. Portable Network Graphics image format.
Poisson distribution. A mathematical form of the law of rare events.
PQe. A version of the perturbed quantization algorithm for JPEG images.
PQt. A version of the perturbed quantization algorithm for JPEG images.
Precover. A hypothetical cover used to justify a priori distribution of some
cover-image statistics.
Primary set. A set of pixel pairs with specified relationship (used to construct
Sample Pairs Analysis attack on LSB embedding).
Prisoners’ problem. A fictitious scenario involving Alice, Bob, and warden
Eve, used to demonstrate the problem of steganography and steganalysis.
PRNG. Pseudo-Random Number Generator.
PRNU. Pixel Photo-Response Non-Uniformity. A systematic imperfection due
to varying response to light of individual pixels.
Probabilistic algorithm. An algorithm whose mechanism involves random-
ness.
Pseudo-random selection channel. A selection channel determined using a
PRNG.
PSNR. Peak Signal-to-Noise Ratio.
Public-key
– encryption. Asymmetric cryptosystem in which encryption is realized using
the so-called public key and decryption with the private key.
– steganography. Steganographic channel with public access to the secret
message, which is encrypted using a private key.
q-ary. Related to an alphabet consisting of q symbols.
QIM. Quantization Index Modulation (a robust watermarking technique).
Quality factor. A scalar value that controls the quality of a JPEG file.
Quantitative steganalysis. Steganalysis whose objective is to estimate the
embedded payload or the change rate (or the number of embedding changes).
Quantization. Process of rounding real-valued quantities to represent them
with bits.
– error. Distortion induced by quantization.
Regularity condition. E_{p(x;θ)}[∂ log p(x; θ)/∂θ] = 0 for all θ.
Space-filling curve (planar). A curve that goes through every point in a unit
square.
Sparse codes. Codes with codewords of small Hamming weight.
Sparsity measure. A function evaluating sparsity of data points at a given
point.
Spatial frequency. A pair of integers characterizing a DCT mode.
Sphere-covering bound. An inequality that binds the number of codewords
in a code, its length, and its covering radius.
Spread-spectrum (data hiding). A method for information hiding in which a
bit is spread into a longer random sequence that is superimposed on the cover
object. Usually used to achieve robustness with respect to channel distortion.
Square-root law. Thesis claiming that the steganographic capacity of imperfect
stegosystems grows only as the square root of the number of cover elements.
Standard quantization matrix. A family of quantization matrices
parametrized by quality factor as specified in the JPEG format standard.
Statistical
– detectability. A measure of how detectable embedding changes are using
standard methods of hypothesis testing.
– restoration. A general principle for constructing steganographic methods
that preserve selected statistics of the cover object.
Steganalysis. The counterpart of steganography. Effort directed towards dis-
covering the presence of a secret message.
Steganalysis-aware steganography. Steganography constructed to avoid
known steganalysis attacks.
Steganalyzer. A specific implementation of some steganalysis attack.
Steganographic
– capacity. Maximum number of message bits that can be embedded in a
cover object without introducing statistically detectable artifacts.
– channel. A covert communication system consisting of the cover source,
stego key source, and message source, and the physical communication
channel used to exchange messages.
– communication. See steganographic channel.
– file system. A tool to thwart “rubber-hose attacks” that allows the user
to plausibly deny that encrypted files reside on the disk.
– scheme. See steganographic channel.
– security. Impossibility to construct an attack on a steganographic scheme.
Steganography. The art of communicating messages in a covert manner.
Volume of a Hamming ball. For β ≤ 1/2, Σ_{i=0}^{βn} (n choose i) ≤ 2^{nH(β)},
where H is the binary entropy function.
Warden. The subject or a computer program that monitors the traffic between
Alice and Bob in the prisoners’ problem.
– active. A warden that slightly distorts the communication so that it still
conveys the same overt meaning. The goal is to prevent steganography.
– malicious. A warden who tries to trick the communicating parties by
impersonating them or by other means that are based on the warden’s
knowledge of the steganographic protocol.
– passive. A warden that passively observes the communication.
Watermarking. Data-hiding application in which the hidden data supplements
the cover object.
Wavelet transform. A signal transform with basis functions that are localized
simultaneously both in the spatial and in the frequency domain.
Weak Law of large numbers. The law that states that the sample mean
converges to the mean in probability.
Weighted SVM. A support vector machine in which misclassifications are pe-
nalized on the basis of weights assigned to false alarms and missed detections.
Wet paper codes. Codes that enable communication using non-shared selection
channels.
Wet pixels. Pixels that are not to be changed during steganographic embedding.
White balance. Color transformation adjusting the gain of each color channel.
Wiener filter. An adaptive linear denoising filter.
Within-image error. A component of the estimation error of quantitative ste-
ganalyzers that depends on the placement of embedding changes in the stego
image.
WMF. Windows Meta File image format.
Word. A string of symbols from an alphabet.
WS. Weighted stego method for detection of LSB embedding.
XOR. Exclusive or.
YUV. Luminance and two chrominance signals in a color signal.
Zig-zag scan. Scanning order of quantized DCT coefficients used during JPEG
compression.
ZZW construction. A general method for constructing matrix-embedding
methods from existing methods.
References
electronically at http://www.gatsby.ucl.ac.uk/aistats/.