
cryptography

Article
Power Side-Channel Attack Analysis: A Review of
20 Years of Study for the Layman
Mark Randolph * and William Diehl
The Bradley Department of Electrical and Computer Engineering,
Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA; [email protected]
* Correspondence: [email protected]

Received: 2 March 2020; Accepted: 15 May 2020; Published: 19 May 2020 

Abstract: Physical cryptographic implementations are vulnerable to so-called side-channel attacks,
in which sensitive information can be recovered by analyzing physical phenomena of a device
during operation. In this survey, we trace the development of power side-channel analysis of
cryptographic implementations over the last twenty years. We provide a foundation by exploring, in
depth, several concepts, such as Simple Power Analysis (SPA), Differential Power Analysis (DPA),
Template Attacks (TA), Correlation Power Analysis (CPA), Mutual Information Analysis (MIA), and
Test Vector Leakage Assessment (TVLA), as well as the theories that underpin them. Our introduction,
review, presentation, and survey of topics are provided for the "non-expert", and are ideal for new
researchers entering this field. We conclude the work with a brief introduction to the use of test
statistics (specifically Welch’s t-test and Pearson’s chi-squared test) as a measure of confidence that a
device is leaking secrets through a side-channel and issue a challenge for further exploration.

Keywords: side-channel analysis; DPA; SPA; CPA; mutual information; t-test; chi-squared; survey

1. Introduction
Cryptography is defined in Webster’s 9th New Collegiate Dictionary as the art and science of
“enciphering and deciphering of messages in secret code or cipher”. Cryptographers have traditionally
gauged the power of the cipher, i.e., “a method of transforming a text in order to conceal its meaning”,
by the difficulty posed in defeating the algorithm used to encode the message. However, in the modern
age of electronics, there is another exposure that has emerged. With the movement of encoding from a
human to a machine, the examination of vulnerabilities to analytic attacks (e.g., differential [1–3] and
linear [4–6] cryptanalysis) is no longer enough; we must also look at how the algorithm is implemented.
Encryption systems are designed to scramble data and keep them safe from prying eyes, but
implementation of such systems in electronics is more complicated than the theory itself. Research has
revealed that there are often relationships between power consumption, electromagnetic emanations,
thermal signatures, and/or other phenomena and the encryptions taking place on the device. Over
the past two decades, this field of study, dubbed Side-Channel Analysis (SCA), has been active in
finding ways to characterize “side-channels”, exploit them to recover encryption keys, and protect
implementations from attack.
While much research has been done in the field of SCA over the past 20 years, it has been confined
to focused areas, and a singular paper has not been presented to walk through the history of the
discipline. The purpose of this paper is to provide a survey, in plain language, of major techniques
published, while pausing along the way to explain key concepts necessary for someone new to the
field (i.e., a layman) to understand these high points in the development of SCA.
In-depth treatment of each side-channel is not possible in a paper of this length. Instead, we
spend time exploring the power consumption side-channel as an exemplar. While each side-channel

Cryptography 2020, 4, 15; doi:10.3390/cryptography4020015 www.mdpi.com/journal/cryptography



phenomenon may bring some uniqueness in the way data are gathered (measurements are taken), in
the end, the way the information is exploited is very similar. Thus, allowing power analysis to guide
us is both reasonable and prudent.
The remainder of this paper is organized as follows: First, we discuss how power measurements
are gathered from a target device. Next, we explore what can be gained from direct observations of
a system’s power consumption. We then move into a chronological survey of methods to exploit
power side-channels and provide a brief introduction to leakage detection, using the t-test and χ2 -test.
Finally, we offer a challenge to address an open research question as to which test procedures are most
efficient for detecting side-channel-related phenomena that change during cryptographic processing,
and which statistical methods are best for explaining these changes.
This paper is meant to engage and inspire the reader to explore further. To that end, it is replete
with references for advanced reading. Major headings call out what many consider to be foundational
publications on the topic, and references throughout the text are intended as jumping-off points for
additional study. Furthermore, major developments are presented in two different ways. A theory
section is offered which presents a layperson’s introduction, followed by a more comprehensive
treatment of the advancement. We close the introduction by reiterating that side-channels can take
many forms, and we explicitly call out references for further study in power consumption [7,8],
electromagnetic emanations [9–11], thermal signatures [12–14], optical [15,16], timing [17,18], and
acoustics [19].

2. Measuring Power Consumption


Most modern-day encryption relies on electronics to manipulate ones and zeros. The way this
is physically accomplished is by applying or removing power to devices called transistors, to either
hold values or perform an operation on that value. To change a one to a zero or vice versa, a current is
applied to or removed from each transistor. The power consumption of an integrated circuit or larger
device then reflects the aggregate activity of its individual elements, as well as the capacitance and
other electrical properties of the system.
Because the amount of power used is related to the data being processed, power consumption
measurements contain information about the circuit’s calculations. It turns out that even the effects
of a single transistor do appear as weak correlations in power measurements [7]. When a device is
processing cryptographic secrets, its data-dependent power usage can expose those secrets to attack.
To actually measure the current draw on a circuit, we make use of Ohm's law: I = V/R, where
V is voltage, I is current, and R is resistance. By introducing a known fixed resistor into either the
current being supplied to or on the ground side of the encryption device, a quality oscilloscope can be
used to capture changes in voltage over time. Thanks to Ohm’s law, this change in voltage is directly
proportional to the change in current of the device. By using a stable resistor whose resistance does
not change with temperature, pressure, etc. (sometimes referred to as a sense resistor), we are able to
capture high-quality measurements.
Access to a device we wish to test can be either destructive, involving actually pulling pins off a
printed circuit board to insert a sense resistor, or benign, placing a resistor in line between a power
supply and a processor's power input. The closer to the circuit performing the encryption we can get our
probes, the lower the relative noise will be and, hence, the better the measurement will be.
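As a rough illustration of this measurement chain, the conversion from captured voltage samples to current is just a per-sample application of Ohm's law. The resistor value and the sample values below are invented for the sketch; they are not taken from any setup described in this paper.

```python
# Sketch only: convert oscilloscope voltage samples measured across a
# sense resistor into instantaneous current via Ohm's law (I = V / R).
# The 10-ohm value and the sample values are invented for illustration.

R_SENSE = 10.0  # ohms; a stable (temperature-insensitive) sense resistor

def voltage_to_current(voltage_samples, r_sense=R_SENSE):
    """Map each measured voltage drop (volts) to a current (amperes)."""
    return [v / r_sense for v in voltage_samples]

# Millivolt-scale drops across the resistor correspond to mA-scale currents.
trace_v = [0.012, 0.015, 0.011, 0.018]   # volts across the sense resistor
trace_i = voltage_to_current(trace_v)    # amperes drawn by the device
```

Because the resistance is known and fixed, the recorded voltage waveform is directly proportional to the device's current draw, which is why a stable sense resistor matters.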

3. Direct Observation: Simple Power Analysis


The most basic power side-channel attack we look into is solely the examination of graphs of
this electrical activity over time for a cryptographic hardware device. This discipline, known as
Simple Power Analysis (SPA), can reveal a surprisingly rich amount of information about the target and
underlying encryption code it employs.
3.1. Simple Power Analysis and Timing Attacks

Simple Power Analysis (SPA) is the term given to direct interpretation of observed power
consumption during cryptographic operations, and much can be learned from it. For example, in
Figure 1, the 16 rounds of the Data Encryption Standard (DES) [20] are clearly visible; and Figure 2
reveals the 10 rounds of the Advanced Encryption Standard (AES-128) [21]. While the captures
come from distinct devices using different acquisition techniques (subsequently described), both
examples provide glimpses into what is happening on the device being observed.

Figure 1. Simple Power Analysis (SPA) trace showing an entire Data Encryption Standard (DES) operation.

Figure 2. Simple Power Analysis (SPA) trace showing an entire Advanced Encryption Standard (AES) operation.
Each collection of power measurements taken over a period of interest (often a full cryptographic
operation) is referred to as a trace. Differences in the amount of current drawn by these devices are
very small (e.g., µA), and in practice, much noise is induced by the measuring device itself.
Fortunately, noise in measurement has been found to be normally distributed, and by taking the
average of multiple traces, the effect of the noise can be greatly reduced. Simply stated, given a large
number of data points, a normal distribution is most likely to converge to its mean (µ). In applications
where the focus is on differences between groups, normally distributed noise will converge to the
same expected value or mean (µ) for both groups and hence cancel out.

Figure 1 shows measurements from a Texas Instruments Tiva-C LaunchPad with a TM4C Series
microcontroller running the Data Encryption Standard (DES) algorithm and a Tektronix MDO4104C
Mixed Domain Oscilloscope used to capture this average of 512 traces. Figure 2 shows measurements
captured by using the flexible open-source workbench for side-channel analysis (FOBOS), using a
measurement peripheral of the eXtended eXtensible Benchmarking eXtension (XXBX), documented in
References [22,23], respectively. Measurements show accumulated results of 1000 traces from a NewAE
CW305 Artix 7 Field Programmable Gate Array (FPGA) target board running the AES-128 algorithm.

While Figure 2 clearly shows the 10 rounds of the AES function, it is the ability to see what
is going on during those rounds that will lead to the discovery of secret keys. Figure 3 is a more
detailed look, using the same data but a different relative time reference. From the trace alone,
the differences in power consumption between the 16 cycles of substitution (S-Boxes)/shift rows, four
cycles of mix columns, and one cycle of add round key operations are easily seen and compared to the
standard shown in the right panel of that figure. SPA focuses on the use of visual inspection to identify
power fluctuations that give away cryptographic operations. Patterns can be used to find sequences of
instructions executed on the device, with practical applications such as defeating implementations in
which branches and lookup tables are accessed to verify that the correct access codes have been entered.
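The trace-averaging step described earlier in this section (e.g., the 512-trace DES capture) can be sketched in a few lines. The signal, noise level, and trace count below are synthetic stand-ins, chosen only to show that the point-wise mean suppresses normally distributed noise.

```python
import numpy as np

# Sketch of noise reduction by averaging: measurement noise that is
# normally distributed around zero shrinks as more traces are averaged.
# All values here are synthetic; they are not from the paper's captures.
rng = np.random.default_rng(0)

signal = np.sin(np.linspace(0, 4 * np.pi, 1000))        # stand-in "true" trace
traces = signal + rng.normal(0, 0.5, size=(512, 1000))  # 512 noisy acquisitions

avg = traces.mean(axis=0)  # point-wise mean across all acquisitions

# The averaged trace is far closer to the underlying signal than any single
# acquisition; the mean absolute error drops by roughly sqrt(512).
err_single = np.abs(traces[0] - signal).mean()
err_avg = np.abs(avg - signal).mean()
```

The same averaging applies regardless of what produced the trace, which is why oscilloscope averaging modes are a standard first step in SPA measurement setups.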

Figure 3. AES trace detail with standard. (a) SPA trace; (b) AES block diagram.

3.1.1. Classic RSA Attack

A classic attack using SPA was published in 1996 by Paul Kocher against the RSA public key
cryptosystem [18].

RSA [24] is widely used for secure data transmission and is based on asymmetry, with one key to
encrypt and a separate but related key to decrypt data. Because the algorithm is considered somewhat
slow, it is often used to pass encrypted shared keys, making it of high interest to many would-be
attackers. RSA has been commonly used to authenticate users by having them encrypt a challenge
phrase with the only encryption key that will work with the paired public key, namely their private
key. If an attacker is able to observe this challenge-response, or better yet invoke it, he/she may be able
to easily recover the private key. To understand how this works, we first provide a simplified primer
on RSA.

The encryption portion of the RSA algorithm takes the message to be encoded as a number (for
example, 437265) and raises it to the power of the public key e modulo n: ((437265)^e mod n) = ciphertext.
The decryption portion takes the ciphertext and raises it to the power of the private key d modulo
n: ((ciphertext)^d mod n) = 437265. The key to making this work is the relationship between the key
pair e and d. How the keys are generated in RSA is complex, and the details can be found in the
standard. RSA is secure because its underlying security premise, that of factorization of large integers,
is a hard problem.
A historical method of performing the operation "(ciphertext)^d mod n" is binary exponentiation,
using an algorithm known as square and multiply, which can be performed in hardware. In its binary
form, each bit in the private key exponent d is examined. If it is a one, a square operation followed by a
multiply operation occurs, but if it is a zero, only a square operation is performed. By observing the
amount of power the device uses, it can be determined if the multiply was executed or not. (While
ways to blunt the square and multiply exploit have been found in modern cryptography (e.g., the
Montgomery Powering Ladder [25]), its discovery remains a milestone in SCA.)

Figure 4 [26] shows the output of an oscilloscope during a Simple Power Attack against the RSA
algorithm, using a square and multiply routine. Even without the annotations, it is easy to see a
difference in processes and how mere observation can yield the secret key. Because performing two
operations in the case of a "1" takes more time than a single operation in the case of a "0", this attack
can also be classified as a timing attack.

Figure 4. Power trace of a portion of an RSA exponentiation operation.
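A minimal sketch of the square-and-multiply routine, instrumented so that the operation sequence plays the role of the power trace in Figure 4, could look like the following; the logging is our own illustrative addition, not part of any real implementation:

```python
# Left-to-right square-and-multiply, instrumented to log its operations.
# The log stands in for the power trace: a "square, multiply" pair ("SM")
# reveals a 1 bit of the exponent, while a lone square ("S") reveals a 0.
def square_and_multiply(base, exponent, modulus, log):
    result = 1
    for bit in bin(exponent)[2:]:            # scan exponent bits MSB-first
        result = (result * result) % modulus  # always square
        log.append("S")
        if bit == "1":
            result = (result * base) % modulus  # multiply only on a 1 bit
            log.append("M")
    return result

ops = []
assert square_and_multiply(7, 0b1011, 33, ops) == pow(7, 0b1011, 33)
print("".join(ops))  # -> SMSSMSM, i.e., SM|S|SM|SM: exponent bits 1,0,1,1
```

Reading the exponent bits directly off the operation sequence is exactly the observation Kocher's attack exploits; countermeasures like the Montgomery ladder perform the same operations regardless of the bit value.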

There are many devices, such as smart cards, that can be challenged to authenticate themselves.
In authentication, the use of private and public keys is reversed. Here, the smart card is challenged to
use the onboard private key to encrypt a message presented, and the public key is used to decrypt
the message. Decryption by public key will only work if the challenge text is raised to the power of
the private key; so, once the pair is proven to work, the authentication is complete. One type of attack
merely requires a power tap and oscilloscope: Simply challenge the card to authenticate itself and
observe the power reading, to obtain the secret key as the device encrypts the challenge phrase, using
its private key. Several other ways to exploit the RSA algorithm in smartcards using SPA are discussed
in [27].

3.1.2. Breaking a High-Security Lock

During the DEFCON 24 Exhibition in 2016, Plore demonstrated the use of SPA to determine the
keycode required to open a Sargent and Greenleaf 6120-332 high-security electronic lock [28]. During
the demonstration, a random input was entered into the keypad, and power was measured as the
correct sequence was read out of an erasable programmable read-only memory (EPROM) for
comparison in a logic circuit. By simple visual inspection of the power traces returned from the
EPROM, the correct code from a "high-security lock" was compromised. Figure 5 is a screen capture
from [28] that shows power traces from two of the stored combinations being read out of the EPROM.
In this case, the final two digits, 4 (0100) and 5 (0101), can be seen in their binary form in the power trace.

Figure 5. SPA readout of lock keycode [28].

4. Milestones in the Development of Power Side-Channel Analysis and Attacks

Over time, power analysis attack techniques have coalesced into two groups: Model Based and
Profiling. In the first, a leakage model is used that defines a relationship between the power
consumption of the device and the secret key it is employing. Measurements are then binned into
classes based on the leakage model. Statistics of the classes (e.g., the mean of the power samples in a
first-order Differential Power Analysis) are used to reduce the classes to a single master trace. These
master traces can then be compared to modeled key guesses, using different statistics to look for a
match. As side-channel analysis advanced over time, different statistical tests were explored,
including difference of means [7,29,30], the Pearson correlation coefficient [31–34], Bayesian
classification [35,36], and others [37,38], to determine if the modeled guess matched the observed
output, resulting in the leakage being useful for determining secret keys. Profiling techniques, on the
other hand, use actual power measurements of a surrogate of the target device to build stencils of
how certain encryption keys leak information, in order to recover keys in the wild. We now explore,
in a chronological manner, how several of these techniques developed.

4.1. Differential Power Analysis


4.1. Differential Power Analysis
4.1.1. Theory
4.1.1. Theory
In 1999, Paul Kocher, Joshua Jaffe, and Benjamin Jun published an article entitled “Differential
In 1999, Paul Kocher, Joshua Jaffe, and Benjamin Jun published an article entitled “Differential
Power Analysis” [7], which is considered by many to be the bedrock for research involving SCA at the
Power Analysis” [7], which is considered by many to be the bedrock for research involving SCA at
device level.
the device level.
It is the first major work that expands testing from an algorithm’s mathematical structure to testing
It is the first major work that expands testing from an algorithm's mathematical structure to
a device implementing the cryptographic algorithm. The premise of Kocher et al. was that “security
testing a device implementing the cryptographic algorithm. The premise of Kocher et al. was that
faults often involve unanticipated interactions between components designed by different people”.
“security faults often involve unanticipated interactions between components designed by different
One of these unanticipated interactions is power consumption during the course of performing an
people”. One of these unanticipated interactions is power consumption during the course of
encryption. While we have already discussed this phenomenon in some length in our paper, it was not
performing an encryption. While we have already discussed this phenomenon in some length in our
widely discussed in academic literature prior to the 1999 publication.
paper, it was not widely discussed in academic literature prior to the 1999 publication.
The visual techniques of SPA are interesting, but difficult to automate and subject to interpretation. Additionally, in practice, the information about secret keys is often difficult to directly observe, creating a problem of how to distinguish them from within traces. Kocher et al. developed the idea of a model-based side-channel attack. By creating a selection function [39] (known now as a leakage model), traces are binned into two sets of data or classes. A statistic is chosen to compare one class to the other and determine if they are in fact statistically different from each other. In the classic Differential Power Analysis (DPA), the First moment, or mean, is first used to reduce all the traces in each class down to a master trace. The class master traces are then compared at each point in the trace, to determine if those points are significantly different from each other.
While the example we walk through in this paper utilizes the First moment (mean), Appendix A is provided as a reference for higher-order moments that have also been used.
As discussed in Section 2, by placing a resistor in line with the power being supplied to the encryption hardware and placing a probe on either side of the resistor, voltage changes can be observed and recorded. Without knowing the plaintext being processed, we recorded observed power traces and the resulting ciphertext for m encryption operations. For the sake of illustration, we focus on one point in time (clock cycle) as we walk through the concept. Then, we describe, in practical terms, how the algorithm is run to recover a complete subkey (K16 in this illustration).
Figure 6 is an illustration of the last round of the DES algorithm. We are going to create a selection function that computes one bit of the internal state right before the last round of DES. This target bit was chosen as the seventh bit of L15 and is the black bit circled and labeled “A” in Figure 6. (DES is the archetypal block cipher which operates on a block of 64 bits broken into two half-blocks as they run through a series of encryption operations. Here, L15 refers to the left half-block entering the last of 16 identical stages of processing (rounds).) [40]

Figure 6. The last round of DES internals (based on [40]).

We have access to the value of our target bit (A) exclusive or (XOR) the output of the Feistel function (Fout) and read it directly from the ciphertext we have captured (bit labeled ‘B’ in R16). Using ⊕ as a symbol for the XOR, we know that if A ⊕ Fout = B, then B ⊕ Fout = A. Therefore, we can make inferences about one bit of the output (Fout) of the Feistel function, F, and calculate the value for our target bit. With our selection function set as B ⊕ F = A, we examine the internals of the Feistel function, F, and see the practicality of this technique.
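Since XOR is its own inverse, this recovery can be checked in a couple of lines (a trivial Python sketch; the bit values are made up for illustration):

```python
# If A ⊕ Fout = B, then B ⊕ Fout = A: XOR-ing Fout back out of the
# observed ciphertext bit recovers the hidden target bit, for every
# possible combination of bit values.
for A in (0, 1):
    for F_out in (0, 1):
        B = A ^ F_out          # bit read from the ciphertext
        assert B ^ F_out == A  # target bit recovered
```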
In the DPA attack, the goal is to determine the encryption key (K16) by solving the equation B ⊕ F = A one bit at a time. Inside the Feistel function (Figure 7), we see that the single bit we are interested in is formed in a somewhat complex but predictable process:

Figure 7. Inside the DES Feistel function (based on [40]).

The first bit of the S1 S-box (circled and labeled C) is created after being fed six bits of input from H. Both the S-box functions and final permutation are part of the published standard for DES; therefore, given H, it is trivial to calculate C and Fout. The value of H is G ⊕ K, and G is the expansion of bits from R15 as shown. Notice that R15 is directly written into the ciphertext as L16 (Figure 6), giving direct access to its value. With the expansion function shown in Figure 7 published in the DES standard, the only unknown in the Feistel function is the value of the key (K).
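As a sketch of what such a selection function looks like in code, the toy version below guesses a 4-bit subkey, XORs it into known bits (H = G ⊕ K), and predicts one S-box output bit. The 4-bit S-box and bit widths are illustrative stand-ins, not the real DES S1 with its 6-bit input and expansion:

```python
# Toy selection function: predict the target bit A from known bits G and
# a guessed subkey K. TOY_SBOX is a made-up 4-bit S-box used only for
# illustration (the real DES S1 takes 6 input bits and returns 4).
TOY_SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
            0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]

def selection_function(g_bits: int, key_guess: int) -> int:
    h = (g_bits ^ key_guess) & 0xF   # H = G XOR K (expansion omitted)
    return TOY_SBOX[h] & 1           # one bit of the S-box output

# Each collected trace is then binned by the predicted bit for a guess.
bit = selection_function(0b1010, 0b0110)
```

For a wrong key guess, the predicted bit agrees with the device only about half the time, which is exactly what makes the binning statistic described next work.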
If we wash the complexity away, at the heart, we see an XOR happening between something we can read directly (R15), and 6 bits of a key we would like to resolve. Guesses are now made for the value of K, and the collected traces are divided into two classes based on the computed value (A). When the value of A is observed as 1, associated power traces are placed into one bin, and when it is computed as 0, associated power traces are placed into a different bin (class).
The average power trace for each class is computed to minimize noise from the measurement process, and the difference between the class averages is computed. If we made the correct assumption on the subkey (K), our calculation for A will be correct every time; if we do not feed the selection function the correct subkey, it will be correct half the time. (Given a random string of input, the probability of the output of the Feistel function returning a 0 is equal to the probability of it returning a 1.) Hence, the difference between the mean power trace for bin 1 and the mean power trace for bin 0 will result in a greater power signal return than for the case where the subkey was incorrect.
In practice, the algorithm is run as follows [41]. One target S-box is chosen for which all the possible input values (2^6) are listed. Since we know the ciphertexts, we can calculate the value of some of the bits in L15 for every possible S-box input value (2^6). We choose one of these bits as the target bit, and the value becomes our selection function D. If D = 1, the corresponding power measurement is put in sample set S1. If D = 0, it is binned to sample set S0. This process is repeated for all traces in m, leaving us with, for every ciphertext and all possible S-box input values, a classification of the corresponding measurement. The classifications are listed in an m × 2^6 matrix, with every row being a possible key for the target S-box, and every column the classification of one ciphertext and measurement.
For the DPA attack, we processed every row of the matrix, to construct the two sample sets S1 and S0. Next, we computed the pointwise mean of the samples in the sets, and we computed the difference. For the correct S-box input values, a peak in the difference of traces will appear.
Figure 8, composed of power traces provided by [7], clearly shows the results of this method. The first trace shows the average power consumption during DES operations on the test smart card. The second trace is a differential trace showing the correct guess for subkey (K), and the last two traces show incorrect guesses. While there is a modest amount of noise in the signal, it is easy to see correlation for the correct key guess.

Figure 8. DPA traces, one correct and two incorrect, with power reference [7].
4.1.2. Practice
Messerges et al., in their 2002 paper [42], excel in laying out equations for what we have just walked through in words, but for the purposes of this layman’s guide, we produce a simplified version here.
A DPA attack starts by running the encryption algorithm m times to capture Tm traces. We use the notation Ti[j] to stand for the jth time offset within the trace Ti. In addition to collecting power traces, the output ciphertext is captured and cataloged with Ci corresponding to its ith trace. Finally, the selection function is defined as D(Ci, Kn), where Kn is the key guess (simply K in Figure 7). Since D(Ci, Kn) is a binary selection function, the total number of times the selection function returns a “1” is given by the following:

∑_{i=1}^{m} D(Ci, Kn),    (1)

Moreover, the average power trace observed for the selection function “1’s” is as follows:

∑_{i=1}^{m} D(Ci, Kn) ∗ Ti[j] / ∑_{i=1}^{m} D(Ci, Kn)    (2)

Similarly, the total number of times the selection function returns a “0” is given by the following equation:

∑_{i=1}^{m} (1 − D(Ci, Kn)),    (3)

Furthermore, the average power trace observed for the selection function “0’s” is as follows:

∑_{i=1}^{m} (1 − D(Ci, Kn)) ∗ Ti[j] / ∑_{i=1}^{m} (1 − D(Ci, Kn))    (4)

Hence, each point j in the differential trace TΔ for the guess Kn is determined by the following equation:

TΔ[j] = ∑_{i=1}^{m} D(Ci, Kn) ∗ Ti[j] / ∑_{i=1}^{m} D(Ci, Kn) − ∑_{i=1}^{m} (1 − D(Ci, Kn)) ∗ Ti[j] / ∑_{i=1}^{m} (1 − D(Ci, Kn))    (5)

The guesses for Kn that produce the largest spikes in the differential trace TΔ are considered to be the most likely candidates for the correct value.
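The whole pipeline of Equations (1)–(5) reduces to a difference of class means; a compact sketch follows (the data layout and names here are our own, not from the paper):

```python
# Sketch of Equations (1)-(5): split traces into the D = 1 and D = 0
# classes and take the pointwise difference of the class mean traces.
def differential_trace(traces, ciphertexts, key_guess, select):
    ones = [t for t, c in zip(traces, ciphertexts) if select(c, key_guess)]
    zeros = [t for t, c in zip(traces, ciphertexts) if not select(c, key_guess)]
    mean = lambda ts, j: sum(t[j] for t in ts) / len(ts)
    return [mean(ones, j) - mean(zeros, j) for j in range(len(traces[0]))]

# Tiny worked example: sample 0 leaks the selected bit, sample 1 does
# not, so the differential trace spikes only at sample 0.
traces = [[1.0, 0.2], [0.9, 0.2], [0.1, 0.2], [0.0, 0.2]]
ciphertexts = [1, 1, 0, 0]
t_delta = differential_trace(traces, ciphertexts, 1, lambda c, k: c == k)
```

Repeating this for every candidate subkey and keeping the guess with the largest spike is the attack loop described above.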

4.1.3. Statistics: Univariate Gaussian Distribution


In DPA, traces are separated into two classes that need to be compared to see if they are distinct or
different from each other. In practice, each trace assigned to a class by the target bit will not be identical,
so a quick look at statistics is appropriate for understanding how they can actually be combined
and compared.
In SCA, we measure very small differences in power consumption, magnetic fields, light emission,
or other things that we suspect are correlated to the encryption being performed in the processor. In
power consumption, the “signal” we are measuring is voltage, and it remains a good exemplar for
signals of interest in other side-channels.
Electrical signals are inherently noisy. When we use a probe to take a measurement of voltage, it
is unrealistic to expect to see a perfect, consistent reading. For example, if we are measuring a 5 volt
power supply and take five measurements, we might collect the following measurements: 4.96, 5.00,
5.09, 4.99, and 4.98. One way of modeling this voltage source is as follows:

f (x) = Voltageactual + ε, (6)

where Voltageactual is the noise-free level, and ε is the additional noise. More formally, the noise ε
is a summation of several noise components, including external noise, intrinsic noise, quantization
noise, and other components [43–45] that will vary over time. In our example, Voltageactual would be
exactly 5 volts. Since ε is a random variable, every time we take a measurement, we can expect it to
have a different value. Further, because ε is a random variable, the value of our function, f(x), is also a random variable.
A simple and accurate model for these random variables uses a Gaussian or Laplace–Gauss
distribution (which is also known as a normal distribution as referenced on page 3). The probability
density function (PDF) provides a measure of the relative likelihood that a value of the random variable
would equal a particular value, and a Gaussian distribution is given by the following equation:

f(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)),    (7)

where µ is the mean, and σ is the standard deviation of the set of all the possible values taken on by the
random variable of interest.
The familiar bell-curve shape of the Gaussian distribution PDF for our example is shown in
Figure 9 and describes the probability of registering a particular voltage given that the power supply
is 5 volts (the mean here). For example, the likelihood of the measurement f (4.9) ≈ 0.7821 and
f (7.0) ≈ 0.0003. We are unlikely to see a reading of 7 volts but do expect to encounter a 4.9 ~ 78% of
the time.
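Equation (7) is easy to evaluate directly. The standard deviation is not stated in the text; σ = 0.5 is an assumed value chosen because it reproduces the quoted f(4.9) ≈ 0.7821:

```python
import math

# Univariate Gaussian PDF from Equation (7); sigma = 0.5 is assumed.
def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

print(round(gaussian_pdf(4.9, 5.0, 0.5), 4))  # 0.7821
print(gaussian_pdf(7.0, 5.0, 0.5) < 0.001)    # True: a 7 V reading is very unlikely
```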

Figure 9. Normal distribution plot.

In the case of a circuit (such as what we are measuring in DPA), we can express Equation (6)
as follows:

P(t) = ∑_g f(g, t) + ε,    (8)

where f (g,t) is the power consumption of gate, g, at time, t, and ε is a summation of the noise components
(or error) associated with the measurement. We can, and do, consider the function f (g,t) as a random
variable from an unknown probability distribution (our sample traces). So, formally, if all f (g,t) are
randomly and independently drawn from our sample set, then the Central Limit Theorem says that
P(t) is normally distributed [46] and a shape similar to Figure 9 will always accurately describe the
distribution of our measurement.
The trick in DPA (and all SCA dealing with comparison of one variable at a time) is to use
characteristics of these distributions for comparisons. In First Order DPA, we consider the highest
point of the curve which is the First moment or mean (average) and compare that between classes. In
Second Order DPA, we consider the spread of the curve to compare the classes [47,48]. In like manner,
other characteristics of these PDFs have been used to compare classes.

4.1.4. A Brief Look at Countermeasures


While the purpose of this paper is not to provide an exhaustive summation of each technique
surveyed, it should be mentioned that, as vulnerabilities are discovered, countermeasures are developed.
Further, as countermeasures are developed, new attacks are sought. For example, one class of
countermeasures involves introducing desynchronizations during the encryption process so that the
power traces no longer align within the same acquisition set. Several techniques, such as fake cycles
insertion, unstable clocking, or random delays [49], can be employed to induce alignment problems.
To counter these, signal processing can be employed in many cases to correct and align traces [50–52].
Other countermeasures seek to add noise or employ filtering circuitry [53]. Here we find much
research on adding additional side-channels, such as electromagnetic radiation [9,10], to increase the
signal-to-noise ratio enough for DPA to work. It is in the countering of countermeasures that more
involved processing techniques become more important.
Some countermeasures add noise intentionally that is not Gaussian, such as in a hardware
implementation that uses a masking countermeasure to randomly change the representation of the
secret parameters (e.g., implementation [18,54]). In this case, averaging alone to cancel noise in the
signal is not enough. Here, the mean (average) of both sets of traces after applying the selection
function may be statistically indistinguishable from each other. However, by carefully observing the
power distribution formed from the multiple traces within each class, higher-order moments such as
variance, skewness, or kurtosis (see Appendix A) often can be used to distinguish the distributions [55].

Consider two sets of data with the distribution given in Figure 10. Trying to distinguish the
two groups by comparison of means would fail, as both datasets have a mean of 15. However, by
comparing the spread of the curves, the two groups can easily be distinguished from each other. This
is equivalent to comparing the variances (Second moment) in data for each point, rather than the
mean (First moment), and is the zero-offset, second-order DPA attack (ZO2DPA) described in [55].
Similarly, if the countermeasure distributes the energy from handling the sensitive variable with a
profile that is skewed or shows distinctive kurtosis, the Third and Fourth moments (example curves
given in Figure 11) may be desired.
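The Figure 10 situation is easy to reproduce numerically (the data values below are invented purely for illustration):

```python
from statistics import mean, pvariance

# Two classes with the same First moment (mean = 15, as in Figure 10)
# that only the Second moment separates; the values are made up.
class_a = [14, 15, 16, 14, 15, 16]   # tight spread
class_b = [5, 15, 25, 5, 15, 25]     # wide spread, same mean

assert mean(class_a) == mean(class_b) == 15     # means: indistinguishable
assert pvariance(class_b) > pvariance(class_a)  # variances: distinguishable
```

Comparing `pvariance` instead of `mean` is the one-line change that turns a first-order comparison into a second-order one.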

Figure 10. Comparison of standard deviation for fixed mean.

Figure 11. Examples of variation in skewness and kurtosis.

4.1.5. Statistics: Multivariate Gaussian Distribution
The one-variable (univariate) Gaussian distribution just discussed works well for single point measurement comparisons, but what if we wish to compare multiple points on each trace to one another? Consider for now voltage measurements of traces taken precisely at clock cycle 10 (call these X) and clock cycle 15 (call these Y). At first flush, we could write down a model for X by using a normal distribution, and a separate model for Y by using a different normal distribution. However, in doing so, we would be saying that X and Y are independent; when X goes down, there is no guarantee that Y will follow it. In our example, we are measuring power changes on an encryption standard, and often a change in one clock cycle will directly influence a change a set number of clock cycles later; it does not always make sense to consider these variables independent.
Multivariate distributions allow us to model multiple random variables that may or may not influence each other. In a multivariate distribution, instead of using a single variance σ², we keep track of a whole matrix of covariances (how the variables change with respect to each other). For example, to model three points in time of our trace by using random variables (X, Y, Z), the matrix of covariances would be as follows:

Σ = | Var(X)     Cov(X, Y)   Cov(X, Z) |
    | Cov(Y, X)  Var(Y)      Cov(Y, Z) |    (9)
    | Cov(Z, X)  Cov(Z, Y)   Var(Z)    |
The distribution has a mean for each random variable given as follows:

µ = | µX |
    | µY |    (10)
    | µZ |
The PDF of the multivariate distribution is more complicated: instead of using a single number as an argument, it uses a vector with all of the variables in it (x = [x, y, z, . . .]^T). The probability density function (PDF) of a multivariate Gaussian distribution is given by the following:

f(x) = (1/√((2π)^k |Σ|)) e^(−(x−µ)^T Σ⁻¹ (x−µ)/2)    (11)

As with univariate distribution, the PDF of the multivariate distribution gives an indication of
how likely a certain observation is. In other words, if we put k points of our power trace into X and we
find that f (x) is very high, then we conclude that we have most likely found a good guess.
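For k = 2, Equation (11) can be written out with the explicit 2 × 2 inverse, so no linear-algebra library is needed (a sketch; the function name and layout are our own):

```python
import math

# Multivariate Gaussian PDF (Equation (11)) for k = 2, with the 2x2
# covariance inverse written out by hand.
def mvn_pdf_2d(x, mu, cov):
    (a, b), (c, d) = cov                                   # covariance matrix Σ
    det = a * d - b * c                                    # |Σ|
    inv = ((d / det, -b / det), (-c / det, a / det))       # Σ⁻¹
    dx = (x[0] - mu[0], x[1] - mu[1])                      # (x − µ)
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])   # (x−µ)ᵀ Σ⁻¹ (x−µ)
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-q / 2) / math.sqrt((2 * math.pi) ** 2 * det)

p = mvn_pdf_2d((0.0, 0.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
```

With an identity covariance matrix, the density at the mean reduces to the product of two independent standard normals, 1/(2π), which makes a handy sanity check.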

4.2. Template Attacks

4.2.1. Theory
In 2002, Suresh Chari, Josyula R. Rao, and Pankaj Rohatgi took advantage of multivariate
probability density functions in their paper entitled simply “Template Attacks” [35]. In their work, they
claim that Template Attacks are the “strongest form of side-channel attack possible in an information
theoretic sense”. They base this assertion on an adversary that can only obtain a single or small number
of side-channel samples. Instead of building a leakage model that would divide collections into classes,
they pioneered a profile technique that would directly compare collected traces with a known mold.
Building a dictionary of expected side-channel emanations beforehand allowed a simple lookup of
observed samples in the wild.
Template attacks are, as the name suggests, a way of comparing collected samples from a target
device to a stencil of how that target device processes data to obtain the secret key. To perform a
template attack, the attacker must first have access to a complete replica of the victim device that they
can fully control. Before the attack, a great deal of preprocessing is done to create the template. This
preprocessing may take tens of thousands of power traces, but once complete only requires a scant few
victim collections to recover keys. There are four phases in a template attack:

1. Using a clone of the victim device, use combinations of plaintexts and keys and gather a large
number of power traces. Record enough traces to distinguish each subkey’s value.
2. Create a template of the device’s operation. This template will highlight select “points of interest”
in the traces and derive a multivariate distribution of the power traces for this set of points.
3. Collect a small number of power traces from the victim device.
4. Apply the template to the collected traces. Examine each subkey and compute values most likely
to be correct by how well they fit the model (template). Continue until the key is fully recovered.

4.2.2. Practice
In the most basic case, a template is a lookup table that, when given input key k, will return the
distribution curve fk (x). By simply using the table in reverse, the observed trace can be matched to a
distribution curve in the table, and the value of k read out.
One of the main drawbacks of template attacks is that they require a large number of traces be
processed to build a table before the attack can take place. Consider a single 8-bit subkey for AES-128.
For this one subkey, we need to create power consumption models for each of the possible 28 = 256
values it can take on. Each of these 256 power consumption models requires tens of thousands of traces
to be statistically sound.
What seems an intractable problem becomes manageable because we do not have to model every
single key. By focusing on sensitive parts of an algorithm, like substitution boxes in AES, we can use
published information about ciphers to our advantage. We further concentrate on values for the keys
that are statistically far apart. One way to do this is to make one model for every possible Hamming
weight (explained in depth later), which reduces our number of models from 256 down to 9. This
reduces our resolution and means multiple victim samples will be needed, but reduces our search
space 28-fold.
The power in a Template Attack is in using the relationship between multiple points in each
template model instead of relying on the value of a single position. Modeling the entire sample space
for a trace (often 5000 samples or more) is impractical, and fortunately not required for two main
reasons: (1) Many times, our collection equipment samples multiple times per clock cycle; and (2)
our choice of subkey does not affect the entire trace. Practical experience has shown that, instead of a 5000-dimension distribution, a 3-dimensional or 5-dimensional one is often sufficient [56].
Finding the correct 3–5 points is not trivial computationally, but can be straightforward. One
simple approach is to look for points that vary strongly between separate operations by using the
sum-of-the-differences statistical method. By denoting an operation (employment of subkey or intermediate Hamming weight model) as k, and every sample as i, the average power, Mk,i, for Tk
traces is given by the following equation:

Mk,i = (1/Tk) ∑_{j=1}^{Tk} tj,i    (12)

The pairwise differences of these means are calculated and summed to give a master trace with
peaks where the samples’ averages are different.
Di = ∑_{k1,k2} |Mk1,i − Mk2,i|    (13)
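Equations (12) and (13) amount to a couple of loops; a sketch with illustrative names follows:

```python
from itertools import combinations

# Sketch of Equations (12)-(13): per-operation mean traces, then the
# summed pairwise differences used to locate points of interest.
def mean_trace(traces):
    return [sum(t[i] for t in traces) / len(traces) for i in range(len(traces[0]))]

def sum_of_differences(traces_by_op):
    means = {k: mean_trace(ts) for k, ts in traces_by_op.items()}
    length = len(next(iter(means.values())))
    return [sum(abs(means[k1][i] - means[k2][i])
                for k1, k2 in combinations(means, 2))
            for i in range(length)]

# D peaks at the samples where the operations' averages differ.
D = sum_of_differences({0: [[0.0, 0.5], [0.0, 0.5]],
                        1: [[1.0, 0.5], [1.0, 0.5]]})
```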

The peaks of Di are now “pruned” to pick points that are separated in time (distance in our
trace) from each other. Several methods can be used to accomplish this pruning, including those
that involved elimination of nearest neighbors. It is interesting to note that, while Template Attacks
were first introduced in 2002, several more modern machine-learning techniques seek to improve on
its basic premise. For example, Lerman et al. make use of machine learning to explore a procedure
which, amongst other things, optimizes this dimensionality reduction [57]. In fact, machine learning
represents a large area of growth in modern SCA, and has become the primary focus for modern
profiling techniques. As machine learning is beyond the scope of this work, we would direct the
interested reader to the following works: [57–69].
From the chosen peaks of Di , we have I points of interest, which are at sample locations
si , i ∈ {0, I − 1} within each trace. By building a multivariate PDF for each operation (employment of
subkey or intermediate Hamming weight model) at these sample points, the template can be compared

for each trace of the victim observed and matches determined. The PDF for each operation k is built
as follows:
Separate the template power traces by operation k and denote the total number of these as Tk . Let
t j,si represent the value at trace j and point of interest i and compute the average power µi :

µi = (1/Tk) ∑_{j=1}^{Tk} tj,si,    (14)

To construct the vector, use the following equation:

µ = | µ1 |
    | µ2 |    (15)
    | µ3 |
    | ⋮  |
Calculate the covariance, ci,i′, between the power at every pair of points of interest (i and i′), noting that this collapses to a variance on the diagonal of the matrix, where i = i′:

ci,i′ = (1/Tk) ∑_{j=1}^{Tk} (tj,si − µi)(tj,si′ − µi′),    (16)

To construct the matrix, use the following equation:

Σ = | v1    c1,2   c1,3   ⋯ |
    | c2,1  v2     c2,3   ⋯ |    (17)
    | c3,1  c3,2   v3     ⋯ |
    | ⋮     ⋮      ⋮      ⋱ |
Once mean and covariance matrices are constructed for every operation of interest, the template is
complete and the attack moves forward as follows. Deconstruct the collected traces from the victim into
vectors with values at only our template points of interest. Form a series of vectors, as shown below:

$$\mathbf{a}_j = \begin{pmatrix} a_{j,1} \\ a_{j,2} \\ a_{j,3} \\ \vdots \end{pmatrix} \qquad (18)$$
Using Equation (11), calculate the following for all attack traces:

$$p_{k,j} = f_k(\mathbf{a}_j) \qquad (19)$$

Equation (19) returns a probability that key k is correct for trace j, based on the PDF calculated in building the template. Combining these probabilities for all attack traces can be done in several ways. One of the most basic is to simply multiply the probabilities together and choose the key with the largest value:

$$P_k = \prod_{j} p_{k,j} \qquad (20)$$
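To make the template build and match concrete, a toy sketch in Python with NumPy follows. The two "operations", trace counts, points of interest, and noise level are invented for illustration, and the product of Equation (20) is computed as a sum of log-probabilities, a numerically stable equivalent:

```python
import numpy as np

rng = np.random.default_rng(1)

def build_template(traces):
    """Mean vector (Eqs. 14-15) and covariance matrix (Eqs. 16-17) over the
    profiling traces of one operation, restricted to the points of interest."""
    return traces.mean(axis=0), np.cov(traces, rowvar=False)

def log_pdf(a, mu, cov):
    """Log of the multivariate Gaussian PDF evaluated at attack vector a."""
    d = a - mu
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.inv(cov) @ d + logdet + len(a) * np.log(2 * np.pi))

# Toy leakage: two operations with different mean power at 3 points of interest.
means = {0: np.array([1.0, 2.0, 1.0]), 1: np.array([2.0, 1.0, 2.0])}
templates = {k: build_template(m + 0.1 * rng.standard_normal((200, 3)))
             for k, m in means.items()}

# Score 50 victim traces (generated here by operation 1) against each template;
# summing log-probabilities is the stable form of the product in Eq. (20).
attack = means[1] + 0.1 * rng.standard_normal((50, 3))
scores = {k: sum(log_pdf(a, mu, cov) for a in attack)
          for k, (mu, cov) in templates.items()}
best = max(scores, key=scores.get)   # operation 1 should win
```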
4.3. Correlation Power Analysis

4.3.1. Theory

In 2004, Eric Brier, Christophe Clavier, and Francis Olivier published a paper called “Correlation Power Analysis with a Leakage Model” [31] that took DPA and the Template Attack work a step further. In the paper, Brier et al. again examined the notion of looking at multiple positions in a power trace for correlation instead of being restricted to a single point in time. They use a model-based approach in their side-channel attack, but in place of comparing a descriptor for a single point in time (such as the difference of means), they employ a multivariate approach to form and distinguish classes.
When considering multiple bits at a time, it is important to realize that the power consumption is based solely on the number of bits that are a logical “1” and not on the number those bits are meant to represent. For example, in Figure 12, five different numbers are represented as light-emitting diodes (LEDs) in an 8-bit register. However, with the exception of the top row, zero, the subsequent rows all consume the same amount of power. Because of this, our model cannot be based on value and must be based on something else. For most techniques, this model is called Hamming weight and is based simply on the number of 1’s in the grouping of bits we are comparing.

Figure 12. Numbers in an 8-bit register.
One common setup for a CPA is shown in Figure 13. To mount this attack, we use a computer that can send random but known messages to the device we are attacking, and trigger a device to record power measurements of the data bus. After some amount of time, we end up with a data pair of known input and power measurements.

Figure 13. CPA attack.

Next, for each known input (Input Data), we guess at a key value (Hyp. Key), XOR those two bytes, and run that through a known lookup table (S-box) to arrive at a hypothetical output value (Hyp. Output). The hypothetical output value is then evaluated at its binary equivalent to arrive at its Hamming weight (Figure 14). To summarize, for every hypothetical key value, we are generating what the Hamming weight would be as seen on the collected power traces at some point in time (specifically, when that value goes over the data bus).

Figure 14. Visualized CPA processing.
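A minimal sketch of this hypothesis-building step is shown below. To keep the listing short, a 4-bit S-box (borrowed from the PRESENT cipher) stands in for the 256-entry AES S-box assumed in the text:

```python
# 4-bit PRESENT S-box used here purely as a small stand-in for the AES S-box.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def hw(x):
    """Hamming weight: the number of 1 bits in x."""
    return bin(x).count("1")

def hypothesis_matrix(inputs, key_space=16):
    """One row per known input, one column per key guess: the Hamming weight
    we would expect to see on the data bus if that guess were the real key."""
    return [[hw(SBOX[d ^ k]) for k in range(key_space)] for d in inputs]

hyp = hypothesis_matrix([0x3, 0x7, 0xA])
```

Each column of `hyp` is then compared against the measured traces; the column that best matches the measurements identifies the key guess.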

Figure 15 is a visual representation of the actual collected values in the left column, and the values of our model for different key guesses in the remaining columns. (In reality, the power measurements are captured as vectors, but for understanding we visualize them as a trace.) Since we have applied our model to all possible values for the secret key, the guess that produces the closest match to the measured power at a specific point in time along the trace must be the correct key value. The Pearson correlation coefficient is used to compare captured power measurements with each column of the estimate table and determine the closest match.

Figure 15. CPA final.

4.3.2. Practice

Hamming weight is defined as the number of bits set to 1 in a data word. In an m-bit microprocessor, binary data are coded $D = \sum_{j=0}^{m-1} d_j 2^j$, with the bit values $d_j = 0$ or 1. The Hamming weight is simply $HW(D) = \sum_{j=0}^{m-1} d_j$. If D contains m independent and uniformly distributed bits, the data word has a mean Hamming weight $\mu_H = m/2$ and variance $\sigma_H^2 = m/4$. Since we assume that the power used in a circuit correlates to the number of bits changing from one state to another, the term Hamming distance was coined. If we define the reference state for a data word as R, the difference between HW(R) and HW(D) is known as the Hamming distance, and can be computed simply as $HD = HW(D \oplus R)$. (In the theory portion of this section, the reference state for data was taken to be zero, and hence the Hamming distance collapsed to simply the Hamming weight. In this section, we consider a more realistic state where it is the change of voltage on the data bus, among all the other power being drawn, that we are interested in.) If D is a uniform random variable, $D \oplus R$ and $HW(D \oplus R)$ will be as well, and $HW(D \oplus R)$ has the same mean ($\mu_{HD} = m/2$) and variance ($\sigma_{HD}^2 = m/4$) as HW(D).
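These definitions, and the stated mean and variance, are easy to check directly; the word size m = 8 below is just an example:

```python
import numpy as np

def hamming_weight(x):
    return bin(x).count("1")

def hamming_distance(d, r):
    # HD = HW(D XOR R): the bits that change between reference state R and data D.
    return hamming_weight(d ^ r)

m = 8
hws = np.array([hamming_weight(w) for w in range(2 ** m)])

mean_hw = hws.mean()   # over all uniform 8-bit words: exactly m/2 = 4.0
var_hw = hws.var()     # exactly m/4 = 2.0
hd = hamming_distance(0b1010, 0b0110)   # two bits differ -> 2
```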
While HW(DR) does not represent the entire power consumption in a cryptographic
processor, modeling it as the major dynamic portion and adding a term for everything else, b, works
well. This brings us to a basic model for data dependency:
𝑊 = 𝑎HW(DR) + 𝑏 𝑜𝑟 𝑚𝑜𝑟𝑒 𝑠𝑖𝑚𝑝𝑙𝑦: 𝑊 = 𝑎HD + 𝑏, (21)
where 𝑎 is a scaling factor between the Hamming distance (HD) and power consumed (W).
Examining how the variables of W and HD change with respect to each other is interesting, but
Brier et al. take it a step further by quantifying the relationship between the two. Using the Pearson
Cryptography 2020, 4, 15 18 of 33

While HW(D ⊕ R) does not represent the entire power consumption in a cryptographic processor,
modeling it as the major dynamic portion and adding a term for everything else, b, works well. This
brings us to a basic model for data dependency:

$$W = a\,HW(D \oplus R) + b \quad \text{or, more simply:} \quad W = a\,HD + b \qquad (21)$$

where a is a scaling factor between the Hamming distance (HD) and power consumed (W).
Examining how the variables of W and HD change with respect to each other is interesting, but
Brier et al. take it a step further by quantifying the relationship between the two. Using the Pearson
correlation coefficient, they normalize the covariance by dividing it by the standard deviations for both
W and HD (σW σHD ), and reduce the covariance to a quantity (correlation index) that can be compared.

$$\rho_{W,HD} = \frac{\mathrm{cov}(W, HD)}{\sigma_W \sigma_{HD}} = \frac{a\,\sigma_{HD}}{\sigma_W} = \frac{a\,\sigma_{HD}}{\sqrt{a^2 \sigma_{HD}^2 + \sigma_b^2}} = \frac{a\sqrt{m}}{\sqrt{m a^2 + 4\sigma_b^2}} \qquad (22)$$

As in all correlations, the values satisfy the inequality $0 \le |\rho_{W,HD}| \le 1$, with the upper bound achieved if and only if the measured power perfectly correlates with the Hamming distance of the model. The lower bound is reached if the measured value and Hamming distance are independent, but the opposite does not hold: measured power and Hamming distance can be dependent and still have a correlation equal to zero.
Brier et al. also note that “a linear model implies some relationships between the variances of the different terms considered as random variables: $\sigma_W^2 = a^2 \sigma_H^2 + \sigma_b^2$”, and in Equation (22) show an easy way to calculate the Pearson correlation coefficient.
If the model used only applies to l independent bits of the m-bit data word, a partial correlation still exists and is given by the following:

$$\rho_{W,HD_{l/m}} = \frac{a\sqrt{l}}{\sqrt{m a^2 + 4\sigma_b^2}} = \rho_{W,HD}\,\frac{\sqrt{l}}{\sqrt{m}} \qquad (23)$$

CPA is most effective in analysis where the device leakage model is well understood. Measured
output is compared to these models by using correlation factors to rate how much they leak. Multi-bit
values in a register, or on a bus, are often targeted by using Hamming weight for comparison to their
leakage model expressed in Equation (23). In like manner, the Hamming distance between values in the register or bus and the values they replace is used to assess correlation to the model. CPA is important
in the progression of SCA study for its ability to consider multiple intermediates simultaneously, and
harness the power (signal-to-noise ratio) of multiple bit correlation.
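Putting the pieces together, a complete (simulated) CPA run can be sketched as follows. The 4-bit S-box, scaling factor a, noise level, and trace count are arbitrary stand-ins chosen so the example runs quickly; the leakage is generated from the linear model of Equation (21) and ranked with the correlation of Equation (22):

```python
import numpy as np

rng = np.random.default_rng(7)
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]     # 4-bit stand-in S-box
HW = np.array([bin(v).count("1") for v in range(16)])

true_key = 0x9
inputs = rng.integers(0, 16, size=500)

# Simulated measurements at the S-box output sample: W = a*HD + b (Eq. 21),
# with the "everything else" term b drawn here as Gaussian noise.
leak = 0.5 * HW[[SBOX[int(d) ^ true_key] for d in inputs]] \
       + rng.normal(0.0, 0.4, size=500)

# Correlate every key guess's hypothetical Hamming weights against the
# measurements; the correct key should maximize |rho| (Eq. 22).
rhos = [abs(np.corrcoef(HW[[SBOX[int(d) ^ k] for d in inputs]], leak)[0, 1])
        for k in range(16)]
recovered = int(np.argmax(rhos))
```

With this much signal relative to noise, the recovered guess matches `true_key`; shrinking the trace count or raising the noise makes the ranking less reliable, which is exactly the signal-to-noise trade the text describes.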
CPA is a parametric test and relies on the variables being compared being normally distributed. In fact, thus far in the survey, we have only seen parametric statistical methods being used. Nonparametric tests can sometimes better report whether groups in a sample are significantly different in some measured attribute, but they do not allow one to generalize from the sample to the population from which it was drawn. Having said that, nonparametric tests play an important part in SCA research.

4.4. Mutual Information Analysis

4.4.1. Theory
In 2008, Benedikt Gierlichs, Lejla Batina, Pim Tuyls, and Bart Preneel published a paper entitled
“Mutual Information Analysis: A Generic Side-Channel Distinguisher” [37]. This was the first time
that Mutual Information Analysis (MIA), which measures the total dependency between two random
variables, was proposed for use in DPA. MIA is a nonparametric test and was expected to be more
powerful than other distinguishers for three main reasons: (1) DPA (and indeed all model based SCAs)
to this point relied on linear dependencies only, and as such, were not taking advantage of all the
information of the trace measurements; (2) comparing all dependencies between complete observed
device leakage and modeled leakage should be stronger than data-dependent leakage, and hence be a
“generic” distinguisher; and (3) Mutual Information (MI) is multivariate by design. Data manipulated
(preprocessed) to consider multiple variables in univariate distinguishers loses information in the
translation, which is not the case with MI.
While investigations such as [70–73] have failed to bear out the first two expectations in practice,
the third has been substantiated in [37,70,71], so we spend some time explaining MI here.
MIA is another model-based power analysis technique and shares in common with this family the
desire to compare different partitions of classes for key guesses, to find the best fit to our leakage model.
MIA introduces the notion of a distinguisher to describe this process of finding the best fit and, instead
of the Pearson correlation coefficient that is used in CPA, uses the amount of difference or Entropy as a
distinguisher. This is revolutionary, as it marks a departure from using strictly two-dimensional (linear
or parametric) comparison to multidimensional space in SCA research.
We now lay out how this distinguisher works, but first, we must start with some background on
information theory.

4.4.2. Statistics: Information Theory


Claude Shannon in his 1948 paper “A Mathematical Theory of Communication” [74] took the
notion of information as the resolution of uncertainty and began the discipline of Information Theory.
Shannon’s work has been applied to several fields of study, including MIA, and for that reason, major
threads are explored in this section, and they are illustrated in Figure 16.
Figure 16. Entropy and information diagrams.

Shannon defined the entropy H (Greek capital eta) of a random variable X on a discrete space X as a measure of its uncertainty during an experiment. Based on the probability mass function for X, the entropy H(X) is given by Equation (24). (The probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value. This is analogous to the PDF discussed earlier for a continuous variable.)

$$H(X) = -\sum_{x \in X} \Pr(X = x) \ast \log_b(\Pr(X = x)) \qquad (24)$$
Note, in Equation (24), the logarithm base (b) is somewhat arbitrary and determines the units for the entropy. Common bases include 10 (hartleys), Euler’s number e (nats), 256 (bytes), and 2 (bits). In MIA, it is common to use base 2. We drop the subscript in the rest of this paper.

The joint entropy of two random variables (X, Y) is the uncertainty of the combination of these variables:

$$H(X, Y) = -\sum_{x \in X,\, y \in Y} \Pr(X = x, Y = y) \ast \log(\Pr(X = x, Y = y)) \qquad (25)$$
The joint entropy is largest when the variables are independent, as illustrated in Figure 16a, and
decreases by the quantity I(X;Y) with the increasing influence of one variable on the other (Figure 16b,c).
This mutual information, I(X;Y), is a general measure of the dependence between random variables,
and it quantifies the information obtained on X, having observed Y.
The conditional entropy is a measure of the uncertainty of a random variable X on a discrete space X during an experiment, given that the random variable Y is known. In Figure 16d, the amount of information that Y provides about X is shown in gray. The quantity H(X|Y) is seen as the blue circle H(X), less the information provided by Y in gray. The conditional entropy of a random variable X having observed Y leads to a reduction in the uncertainty of X and is given by the following equation:

$$H(X|Y) = -\sum_{x \in X,\, y \in Y} \Pr(X = x, Y = y) \ast \log(\Pr(X = x \mid Y = y)) \qquad (26)$$

The conditional entropy is largest, equal to H(X), when the variables are independent, as in Figure 16a, and decreases to zero when the relationship between the variables is deterministic.
In like manner, entropy of a random variable X on discrete spaces can be extended to continuous
spaces, where it is useful in expressing measured data from analog instruments.
$$H(X) = -\int_X \Pr(X = x) \ast \log(\Pr(X = x))\,dx \qquad (27)$$

$$H(X, Y) = -\int_{X,Y} \Pr(X = x, Y = y) \ast \log(\Pr(X = x, Y = y))\,dx\,dy \qquad (28)$$

$$H(X|Y) = -\int_{X,Y} \Pr(X = x, Y = y) \ast \log(\Pr(X = x \mid Y = y))\,dx\,dy \qquad (29)$$

Mutual information in the discrete domain can be expressed directly, as follows:

$$I(X; Y) = \sum_{x \in X,\, y \in Y} \Pr(X = x, Y = y) \ast \log\!\left(\frac{\Pr(X = x, Y = y)}{\Pr(X = x) \ast \Pr(Y = y)}\right) \qquad (30)$$

Mutual information in the continuous domain can be expressed directly, as follows:

$$I(X; Y) = \int_X \int_Y \Pr(X = x, Y = y) \ast \log\!\left(\frac{\Pr(X = x, Y = y)}{\Pr(X = x) \ast \Pr(Y = y)}\right) dx\,dy \qquad (31)$$

Finally, it can be shown that the mutual information between a discrete random variable X and a continuous random variable Y is defined as follows:

$$I(X; Y) = \sum_{x \in X} \int_Y \Pr(X = x, Y = y) \ast \log\!\left(\frac{\Pr(X = x, Y = y)}{\Pr(X = x) \ast \Pr(Y = y)}\right) dy \qquad (32)$$
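The discrete definitions above can be checked numerically; the 2 × 2 joint PMF below is an arbitrary toy example:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a PMF given as an array of probabilities (Eq. 24)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Arbitrary toy joint PMF Pr(X = x, Y = y) over 2 x 2 outcomes.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

h_x = entropy(px)                      # H(X) = 1 bit here
h_joint = entropy(pxy.ravel())         # H(X, Y), Eq. (25)
h_x_given_y = h_joint - entropy(py)    # chain rule: H(X|Y) = H(X,Y) - H(Y)
mi = h_x - h_x_given_y                 # I(X;Y) = H(X) - H(X|Y)

# Direct evaluation of Eq. (30) agrees with the entropy-difference form.
mi_direct = float(np.sum(pxy * np.log2(pxy / np.outer(px, py))))
```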

4.4.3. Practice
Throughout this survey, we have seen that side-channel attacks for key recovery have a similar
attack model. A model is designed based on knowledge of the cryptographic algorithm being attacked,
such that when fed with the correct key guess, its output will be as close as possible to the output of the
device being attacked when given the same input. Figure 17 illustrates that model from a high-level
perspective and terminates in an expression for what is called the distinguisher. A distinguisher is any
statistic which is used to compare side-channel measurements with hypothesis-dependent predictions,
in order to uncover the correct hypothesis.
Figure 17. Side-channel attack model.
MIA’s distinguisher is defined as follows:

$$D(K) = I(L_k + \varepsilon;\, M_k) = H(L_k + \varepsilon) - H((L_k + \varepsilon) \mid M_k), \qquad (33)$$
where H is the differential entropy of Equation (27). Here, we are using mutual information as a measure of how well knowledge provided by the chosen model reduces uncertainty in what we physically observe. In the ideal case of a perfect model, the uncertainty (entropy) of the observed value (L_k + ε), given the model (blue shading), would approach zero, as depicted in Figure 18a, and would approach its maximum value, the entropy of the observation alone (red circle), when the model is not related to the observed leakage (Figure 18b).

Figure 18. Mutual information analysis distinguisher.
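A sketch of the distinguisher of Equation (33) on simulated leakage follows. The histogram-based entropy estimate, 4-bit S-box, noise level, and trace count are our own illustrative choices, with histograms serving as a crude PDF estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]     # 4-bit stand-in S-box
HW = np.array([bin(v).count("1") for v in range(16)])

true_key = 0x5
data = rng.integers(0, 16, size=3000)
leak = HW[[SBOX[int(d) ^ true_key] for d in data]] + rng.normal(0.0, 0.3, 3000)
bins = np.linspace(leak.min(), leak.max(), 20)

def hist_entropy(samples):
    """Histogram estimate of the entropy of the sample distribution (bits)."""
    p = np.histogram(samples, bins=bins)[0] / len(samples)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def distinguisher(k):
    """D(K) = H(L + e) - H((L + e) | M_k), Eq. (33), where the model M_k is
    the Hamming weight predicted for key guess k."""
    model = HW[[SBOX[int(d) ^ k] for d in data]]
    h_cond = sum(np.mean(model == m) * hist_entropy(leak[model == m])
                 for m in np.unique(model))
    return hist_entropy(leak) - h_cond

scores = [distinguisher(k) for k in range(16)]
recovered = int(np.argmax(scores))
```

Conditioning on the correct model collapses the leakage into narrow clusters, driving the conditional entropy down and the mutual information up, so the correct guess scores highest.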

This can be compared with the generic Pearson correlation coefficient that measures the joint
variability of the model and observations. The distinguisher for the Pearson correlation coefficient is
given as follows:
$$D(K) = \rho(L_k + \varepsilon,\, M_k) = \frac{\mathrm{Cov}(L_k + \varepsilon,\, M_k)}{\sigma_{L_k + \varepsilon} \ast \sigma_{M_k}} \qquad (34)$$
Note that Equation (22) is the special case of the Pearson correlation coefficient when the model is
defined to be the Hamming distance.
The contrast between these two distinguishers is worth noting, and it shows a bifurcation in the
literature for approaches. The MIA model is multivariate by nature and reveals bit patterns that move
together in both the model and observed readings, without having to manipulate the data [70,73,75,76].
Because the correlation coefficient is a univariate mapping, bit patterns that are suspected in combination
of being influencers on the model must first be combined through transformation to appear as one
before processing for linear relationship [31,77,78]. Anytime transformations are invoked, information
is lost, and while efficiency is gained, the model loses robustness. As an aside, the term “higher order”
is applied to cases where multiple bits are considered together as influencers in the model. Further, the
MIA model is not a linear model, while correlation is. MIA allows models with polynomials of orders
greater than one (also called higher order), to fit observed data, while correlation is strictly linear.
Correlation assumes a normal distribution in the data, while MIA does not; herein lies the trade-off.
By restricting a distinguisher to a normal distribution, the comparison becomes a matter of simply
using well-studied statistical moments to judge fit. Indeed, we have seen throughout the survey the
use of mean (first moment), variance (second moment), and combinations of the same. The use of
mutual information is not restricted to a fixed distribution, and in fact, determining the distribution of
the random variables modeled and collected is problematic. Several techniques have been studied for
determining the probability density functions, such as histograms [37], kernel density estimation [70,79],
data clustering [80], and vector quantization [81,82].
As the title of the Gierlichs et al. paper [37] suggests, MIA truly is a generic distinguisher in the
sense that it can capture linear, non-linear, univariate, and multivariate relationships between models
and actual observed leakages. However, while MIA offers the ability to find a fit where other methods such as correlation do not, it is often possible, by first fixing the distribution (i.e., assuming normally distributed data), to be much more efficient and converge on an answer faster by using a more limited distinguisher [70].
MIA suffers in many cases from seeking to be generic. Whereas CPA and other distinguishers
assume a normal distribution with well-behaved characteristics, the distribution in MIA is problematic
to estimate. Moreover, the outcomes of MIA are extremely sensitive to the choice of estimator [76].

5. An Expanding Focus and Way Ahead


In the previous section, we discussed the two groups of power analysis attack techniques:
Model Based (i.e., Simple Power Analysis, Differential Power Analysis, Correlation Power Analysis,
and Mutual Information Analysis) and Profiling (i.e., Template Attacks and Machine Learning). We then highlighted some key accomplishments in developing attacks within each of these
branches of side-channel analysis. In this section, we note a pivot away from developing specific
attacks to implement, to a broader look at determining if a “black box” device running a known
cryptographic algorithm is leaking information in a side-channel. Here, we explore the theory and
practice of Test Vector Leakage Assessment (TVLA), introduce two common statistical methods used,
and issue a challenge for further study.

5.1. Test Vector Leakage Assessment

5.1.1. Theory
In 2011, Gilbert Goodwill, Benjamin Jun, Josh Jaffe, and Pankaj Rohatgi published an article titled
“A Testing Methodology for Side-Channel Resistance Validation” [83], which expanded the focus of
side-channel analysis. Here, instead of developing attacks to recover secret key information, they
proposed a way to detect and analyze leakage directly in a device under test (DUT). Their TVLA
method seeks to determine if countermeasures put in place to correct known vulnerabilities in the
hardware implementation of cryptographic algorithms are effective, or if the device is still leaking
information. The paper proposes a step-by-step methodology for testing devices regardless of how
they implement the encryption algorithm in hardware based on two modes of testing.
The first mode of TVLA testing is the non-specific leakage test, which examines differences in
collected traces formed from a DUT encrypting fixed vs. varying data. This technique seeks to amplify
leakages and identify vulnerabilities in the generic case, where exploits might not even have been
discovered yet.
The second mode of TVLA testing specifies and compares two classes (A and B) of collected
traces with the classes selected according to known sensitive intermediate bits. Differences between
the classes can be determined by a number of statistical tests, although the paper focuses exclusively
on the Welch’s t-test. Differences between classes indicate leakage and an exposure for attack.
TVLA differs from what we have explored so far in that, instead of a single selection function, several
are employed in a series of tests. Test vectors are standardized to focus on common leakage models for
a particular algorithm, and collected traces are split into two classes based on a number of selection
functions. (Specific leakage models (e.g., Hamming weight, weighted sum, toggle count, zero value,
variance [84]) are woven into the test vectors to optimize class differences when using selection
functions based on known vulnerability areas in algorithm implementation.) For example, TVLA
testing the AES algorithm uses Hamming weight to target S-box outputs, as we saw in CPA attacks.
Unlike CPA, after TVLA has separated the traces into classes, it leaves open the possibility to test
for differences by using the Pearson correlation coefficient, difference of means, or any of a host of
statistical devices. We will explore how Welch’s t-test and Pearson’s χ²-test can be used for this task,
and leave open the exploration of other methods to the reader.
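As a sketch of the first of these, Welch’s t-test at a single trace sample point can be implemented directly. The Gaussian toy data and mean offset below are invented for illustration, and the ±4.5 threshold is the customary TVLA pass/fail criterion:

```python
import numpy as np

rng = np.random.default_rng(11)

def welch_t(a, b):
    """Welch's t-statistic between two classes of measurements at one sample
    point (unequal variances and class sizes are allowed)."""
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

# Invented data for one sample point: the fixed-input class leaks a small
# mean offset relative to the random-input class.
fixed_cls = rng.normal(0.2, 1.0, 5000)    # leaky point
random_cls = rng.normal(0.0, 1.0, 5000)
quiet_cls = rng.normal(0.0, 1.0, 5000)    # non-leaky point for contrast

t_leaky = welch_t(fixed_cls, random_cls)
t_quiet = welch_t(quiet_cls, random_cls)

# The customary TVLA criterion flags a point as leaking when |t| exceeds 4.5
# in both of two independently collected trace groups.
leaks = abs(t_leaky) > 4.5
```

In a full assessment this statistic is computed at every sample of the trace, and only points crossing the threshold in both independent groups are treated as valid leakage points.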

5.1.2. Practice
As part of creating a strict protocol for testing, Goodwill et al. proposed creating two datasets
whose specific combinations are parsed in different ways to look for leakage. Dataset 1 is created
by providing the algorithm being tested a fixed encryption key, and a pseudorandom test string of
data 2n times to produce 2n traces. Dataset 2 is created by providing the algorithm under test the
same encryption key used in Dataset 1, and a fixed test string of data n times to produce n traces. The
length of the test string and encryption is chosen to match the algorithm being explored. Importantly,
the repeating fixed block of Dataset 2 is chosen to isolate changes to sensitive variables by using the
criterion outlined in [83].
Once the datasets are completed, they are combined and parsed into two groups for performing
independent testing with each group, following an identical procedure. If a testing point shows leakage
in one group that does not appear in the second group, it is treated as an anomaly. However, if identical
test points in both groups pass the threshold for leakage, that point is regarded as a valid leakage point.
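The data-collection protocol above can be sketched as follows. This is a toy Python illustration under our own assumptions: `capture` stands in for real trace acquisition, and the even/odd interleaving used to form the two groups is one reasonable choice, not a scheme mandated by [83].

```python
import random

def capture(key: int, plaintext: int):
    """Placeholder for a real power-trace measurement (assumption:
    a deterministic toy 4-sample trace derived from key and plaintext)."""
    rng = random.Random(key * 65537 + plaintext)
    return [rng.random() for _ in range(4)]

def build_datasets(key: int, fixed_pt: int, n: int):
    """Dataset 1: fixed key, pseudorandom data, 2n traces.
    Dataset 2: same key, fixed data block, n traces."""
    rng = random.Random(0)
    dataset1 = [capture(key, rng.randrange(2 ** 16)) for _ in range(2 * n)]
    dataset2 = [capture(key, fixed_pt) for _ in range(n)]
    return dataset1, dataset2

def split_into_groups(dataset1, dataset2):
    """Combine and parse into two groups, each tested independently;
    a point counts as leaking only if it exceeds the threshold in both."""
    group1 = dataset1[0::2] + dataset2[0::2]
    group2 = dataset1[1::2] + dataset2[1::2]
    return group1, group2
```

In a real campaign, the statistical test of choice is then run separately on each group, and only test points that exceed the threshold in both groups are reported as leakage.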
Figure 19 is a graphical depiction of this test vector leakage assessment (TVLA) methodology for
AES encryption, and the structure it depicts is common to all algorithm testing in that it has a general
case (Test 0) and specific testing (Tests 1–896). Since the test vectors and tests are constructed to exploit
the encryption algorithm rather than a particular implementation, they remain useful regardless of how
the device processes data.
The general case (non-specific leakage test) is composed of fixed and random datasets (Group
1: {e,a} and Group 2: {f,b} in Figure 19). Sometimes referred to as the non-specific leakage test, this
method discovers differences in the collected traces between operations on fixed and varying data.
The fixed input can amplify leakages and identify vulnerabilities where specific attacks might not have
even been developed yet. Because of this, the fixed vs. random test gives a sense of vulnerability, but
does not necessarily guarantee an attack is possible.
Specific testing targets intermediate variables in the cryptographic algorithm, using only the
random dataset (Group 1: {a,b} and Group 2: {c,d} in the figure). Sensitive variables within these
intermediates are generally starting points for attack, as we have seen from the early days of DPA.
Typical intermediates to investigate include look-up table operations, S-box outputs, round outputs,
or the XOR during a round input or output. This random vs. random testing reveals specific
vulnerabilities in the algorithm implementation being tested.
Both the general test and specific tests partition data into two subsets or classes. For the general
test, the discriminator is the dataset source itself, while in the specific tests, other criteria are chosen.
In the case of AES, there are 896 specific tests conducted by using five parsing functions that divide
test vectors into classes A (green boxes) and B (blue boxes), as shown in Figure 19, according to
known sensitive intermediate bits. Any statistically significant differences between classes A and B are
evidence that vulnerabilities in the algorithm employed have been left unprotected, and a sensitive
computational intermediate is still influencing the side-channel.

Cryptography 2020, 4, 15 24 of 33

Figure 19. Example tests for AES.

Statistical testing is important for determining if the two classes are, in fact, different. Goodwill
et al.'s paper focuses exclusively on Welch's t-test and the difference between the means of each
class. We suggest that other methods should be explored. To that end, we present a brief introduction
to Welch's t-test and Pearson's χ²-test before summarizing this survey and issuing a challenge to the
reader to explore further.

5.1.3. Statistics: Welch’s t-Test


Statistical tests generally provide a confidence level to accept (or reject) an underlying
hypothesis [46]. In the case where a difference between two populations is considered, the hypotheses
are most often posed as follows:
H0 (null hypothesis): The samples in both sets are drawn from the same population.
Ha (alternate hypothesis): The samples in both sets are not drawn from the same population.
Welch’s t-test, where the test statistic follows a Student’s t-distribution, accepts (or fails to accept)
the null hypothesis by comparing the estimated means of the two populations. Each set (e.g., class)
is reduced to its sample mean (XA , XB ), the sample standard deviation (SA , SB ), and the number of
data points within each class used to compute those values (NA, NB ). The test statistic (tobs ) is then
calculated by using Equation (35) and compared to a t-distribution, using both the (tobs ) and a value
known as degrees of freedom (Equation (36)). Degrees of freedom of an estimate is the number of
independent pieces of information that went into calculating the estimate. In general, it is not the same
as the number of items in the sample [46].

\[ t_{obs} = \frac{\overline{X}_A - \overline{X}_B}{\sqrt{\frac{S_A^2}{N_A} + \frac{S_B^2}{N_B}}} \]  (35)

\[ v = \frac{\left( \frac{S_A^2}{N_A} + \frac{S_B^2}{N_B} \right)^2}{\frac{\left( S_A^2 / N_A \right)^2}{N_A - 1} + \frac{\left( S_B^2 / N_B \right)^2}{N_B - 1}} \]  (36)

The probability (p) that the samples in both sets are drawn from the same population can be
calculated by using Equation (37), where Γ(.) denotes the gamma function. For example, if p is
computed to be 0.005, the probability that there is a difference between the classes is 99.5%.
\[ p = 2 \int_{|t|}^{\infty} f(t, v)\,dt, \quad f(t, v) = \frac{\Gamma\left( \frac{v+1}{2} \right)}{\sqrt{\pi v}\,\Gamma\left( \frac{v}{2} \right)} \left( 1 + \frac{t^2}{v} \right)^{-\frac{v+1}{2}} \]  (37)

The t-distribution is a series of probability density curves (based on degrees of freedom), but for
sample sizes n > 100, they converge to the normal distribution curve. This allows a fixed confidence
interval to be set and hence a fixed criterion for testing. Goodwill et al. suggest a threshold of
t_obs = ±4.5 be used. For n = 100, this yields a probability that 99.95% of all observations will fall within
±4.5, and for n = 5000, this probability rises to 99.999%. To make the argument even more convincing,
Goodwill et al. add the criterion to the protocol that, to reject a device, t_obs must exceed the
threshold for the same test at the same pointwise time mark in both groups.
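Equations (35) and (36), together with the ±4.5 threshold, fit in a few lines of Python. This is a minimal sketch with variable names mirroring the text; it operates on one point in time across all traces of each class.

```python
import math

def welch_t(xs_a, xs_b):
    """Welch's t statistic (Equation (35)) and degrees of freedom
    (Equation (36)) for two sample sets."""
    na, nb = len(xs_a), len(xs_b)
    mean_a, mean_b = sum(xs_a) / na, sum(xs_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in xs_a) / (na - 1)  # S_A^2
    var_b = sum((x - mean_b) ** 2 for x in xs_b) / (nb - 1)  # S_B^2
    t_obs = (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)
    v = (var_a / na + var_b / nb) ** 2 / (
        (var_a / na) ** 2 / (na - 1) + (var_b / nb) ** 2 / (nb - 1))
    return t_obs, v

def leaks(t_obs, threshold=4.5):
    """The pass/fail criterion suggested by Goodwill et al."""
    return abs(t_obs) > threshold
```

For the toy sets [1, 2, 3, 4, 5] and [2, 3, 4, 5, 6], this gives t_obs = −1 with v = 8, well inside the ±4.5 threshold.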

5.1.4. Statistics: Pearson's χ²-Test


Pearson’s chi-squared test of independence is used to evaluate the dependence between unpaired
observations on two variables. Its null hypothesis states that the occurrences of these observations
are independent, and its alternative hypotheses is that the occurrences of these observations are not
independent. In contrast to Welch’s t-test, traces (i.e., observations) are not averaged to form a master
Pearson’s chi-squared test of independence is used to evaluate the dependence between
unpaired observations on two variables. Its null hypothesis states that the occurrences of these
observations are independent, and its alternative hypotheses is that the occurrences of these
observations are
Cryptography 2020, not independent. In contrast to Welch’s t-test, traces (i.e., observations) are
4, 15 not
26 of 33
averaged to form a master trace that will then be compared to the master trace in a second class.
Rather, each trace is examined at each point in time, and its magnitude is recorded in a contingency
trace that
table, withwill
the then be compared
frequencies tocell
of each the of
master traceused
the table in a to
second
derive class. Rather,
the test each trace
statistic, whichisfollows
examined a χat
2-

each point
distribution.in time, and its magnitude is recorded in a contingency table, with the frequencies of each
cell of
Tothe tableunderstand
better used to derive thethe test statistic,
concept of a χ2which follows athe
-test, consider χ2 -distribution.
following example. Assume two
2
classes, one with 300 and the other with 290 samples. The distribution ofexample.
To better understand the concept of a χ -test, consider the following each classAssume
is giventwo classes,
by Figure
one with 300 and the other with 290 samples. The distribution of each class
20. Where the t-test would compute mean (Class A: 2.17, Class B: 2.82) and standard deviation (Class is given by Figure 20.
Where
A: 1.05,the t-test
Class would
B: 1.10) to compute mean
characterize the(Class
entireA: 2.17,the
class, Class B: 2.82)
χ2-test andcharacterizes
instead standard deviation (Class
the class by the A:
1.05, Class B: 1.10) to characterize the entire class, the χ 2 -test instead characterizes the class by the
distribution of observations within each class and forms the following contingency Table 1.
distribution of observations within each class and forms the following contingency Table 1.

(a) (b)
Figure
Figure 20.
20. Histograms
Histogramsfor
for two
two example
example sets
sets of
of data.
data. (a)
(a)Class
ClassA;
A;(b)
(b)Class
ClassB.
B.

Table 1.
Table Contingency table
1. Contingency table for
for two
two example
example sets
sets of
of data.
data.

Fi,j =i,j0
jF j = 0 j =j 1= 1 j = j2= 2j = 3 j=3
Total Total
i=0 i100
=0 100 9292 65 65 43 300
43 300
i=1 50
i=1 50 5656 80 80 104 104
290 290
Total 150 148 145 147 590
Total 150 148 145 147 590

Finally, the χ² test statistic is built from the table. By denoting the number of rows and columns
of the contingency table as r and c, respectively, the frequency of the i-th row and j-th column as F_{i,j},
and the total number of samples as N, the χ² test statistic x and the degrees of freedom, v, are computed
as follows:

\[ x = \sum_{i=0}^{r-1} \sum_{j=0}^{c-1} \frac{\left( F_{i,j} - E_{i,j} \right)^2}{E_{i,j}} \]  (38)

\[ v = (r - 1) \ast (c - 1) \]  (39)

\[ E_{i,j} = \frac{\left( \sum_{k=0}^{c-1} F_{i,k} \right) \ast \left( \sum_{k=0}^{r-1} F_{k,j} \right)}{N} \]  (40)

where E_{i,j} is the expected frequency for a given cell.
In our example, the degrees of freedom, v, can easily be calculated with the number of rows and
columns: v = (2 − 1) ∗ (4 − 1) = 3. As an exemplar, we calculate the expected frequency:

\[ E_{0,0} = \frac{(100 + 92 + 65 + 43) \ast (100 + 50)}{590} \approx 76.27 \]
and provide the complete Table 2.

Table 2. Expected values for two example sets of data.

E_ij     j = 0   j = 1   j = 2   j = 3
i = 0    76.27   75.25   73.73   74.75
i = 1    73.73   72.75   71.27   72.25

Using both tables, we can compute the portions of the χ² value corresponding to each cell. As an
exemplar for cell i = 0, j = 0, we have the following:

\[ \frac{(100 - 76.27)^2}{76.27} \approx 7.38 \]
By summing up these portions for all cells, we arrive at the x value:

7.38 + 3.73 + 1.03 + 13.48 + 7.64 + 3.85 + 1.07 + 13.95 = 52.13

The probability (p) that the observations are independent and that the samples in both sets are
drawn from the same population can be calculated by using Equation (41), where Γ(·) denotes the
gamma function. In our example, p is calculated to be 2.81 × 10⁻¹¹, which amounts to a probability of
approximately 99.999999999% that the classes are different from each other.

\[ p = \int_{x}^{\infty} f(x, v)\,dx, \quad f(x, v) = \begin{cases} \dfrac{x^{\frac{v}{2} - 1} \ast e^{-\frac{x}{2}}}{2^{\frac{v}{2}} \ast \Gamma\left( \frac{v}{2} \right)} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases} \]  (41)
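The whole worked example can be reproduced with a short Python sketch of Equations (38)–(40). (Computing p itself requires the incomplete gamma function and is omitted here; the function name below is our own.)

```python
def chi2_statistic(F):
    """Pearson's chi-squared statistic x (Equation (38)) and degrees of
    freedom v (Equation (39)) for a contingency table F, using the
    expected frequencies E_ij of Equation (40)."""
    r, c = len(F), len(F[0])
    n = sum(sum(row) for row in F)
    row_tot = [sum(row) for row in F]
    col_tot = [sum(F[i][j] for i in range(r)) for j in range(c)]
    x = 0.0
    for i in range(r):
        for j in range(c):
            e = row_tot[i] * col_tot[j] / n      # Equation (40)
            x += (F[i][j] - e) ** 2 / e          # Equation (38)
    return x, (r - 1) * (c - 1)                  # Equation (39)

# Table 1 from the text:
F = [[100, 92, 65, 43],
     [50, 56, 80, 104]]
x, v = chi2_statistic(F)  # x ≈ 52.13, v = 3
```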

6. Summary
In this paper, we have provided a review of 20 years of power side-channel analysis development,
with an eye toward someone just entering the field. We discussed how power measurements are
gathered from a target device and explored what can be gained from direct observations of a system’s
power consumption. We moved through a chronological survey of key papers explaining methods
that exploit side channels and emphasized two major branches of side-channel analysis: model based
and profiling techniques. Finally, we described an expanding focus in research that includes methods
to detect if a device is leaking information that makes it vulnerable, without always mounting a
time-consuming attack to recover secret key information.
Two different areas have emerged for further study. The first is the choice of a selection function:
What is the discriminator that parses collected traces into two or more classes, to determine if the
device is leaking information about its secret keys? The second follows from the first: How can classes
be distinguished as different from each other? In the TVLA section, we showed two different test
statistics (t-test, χ²-test) and left that as a jumping-off point for further study.
If we have met our goal of engaging and inspiring you, the reader, we charge you with exploring
further the many references provided. Spend some time exploring those areas that interest you. Seek
out the latest research being done and join us in expanding this fascinating field.

Author Contributions: Writing—original draft, M.R.; writing—review and editing, W.D. Both the authors have
read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. Descriptive Statistics


Descriptive statistics is a field of study that seeks to describe sets of data. Here, we find more
tools for comparing our datasets, so we pause for an aside.
Consider the following dataset: [12 14 14 17 18]. We are really looking at distances, as the 12
really means the distance away from zero. When we try to find the average distance to 0, we use
the following:
\[ \frac{\sum (x_i - 0)}{n} \;\ldots\; \text{here } \frac{(12 - 0) + 2 \ast (14 - 0) + (17 - 0) + (18 - 0)}{5} = 15 \]  (A1)

This averaging of distance is called the First moment of the distribution, but is more commonly
known as the mean or average and represented as µ′₁. However, there are other datasets that have the
same distance from zero (e.g., [15 15 15 15 15]):

\[ \frac{5 \ast (15 - 0)}{5} = 15, \]  (A2)
which poses the question, how do we distinguish them?
Consider taking the square of the distance; we get the following:

\[ \frac{\sum (x_i - 0)^2}{n} \;\ldots\; \text{here } \frac{(12 - 0)^2 + 2 \ast (14 - 0)^2 + (17 - 0)^2 + (18 - 0)^2}{5} = 229.8 \]  (A3)

vs.

\[ \frac{5 \ast (15 - 0)^2}{5} = 225 \]  (A4)
Here, the numbers differ because of the spread or variance about the first dataset. This squaring
of the differences is called the Second moment (crude) of the distribution. Once we remove the bias of
the reference point zero and instead measure from the mean (First moment), we have what is known
as the Second moment (centered) or variance, represented as µ′₂. Continuing with our example, we
obtain the following:

\[ \frac{\sum (x_i - \mu'_1)^2}{n} \;\ldots\; \text{here } \frac{(12 - 15)^2 + 2 \ast (14 - 15)^2 + (17 - 15)^2 + (18 - 15)^2}{5} = 4.8 \]  (A5)
vs.

\[ \frac{5 \ast (15 - 15)^2}{5} = 0 \]  (A6)
Higher-order moments follow in like fashion. For the Second moment, centering alone standardized
it by removing previous bias. For higher-order moments, standardization requires an additional
adjustment that nets out the effects of the prior moments, so that each higher-order moment contributes
only new information.
The Third moment of the distribution is more commonly known as skewness µ′₃ and is given by
the following equation, where n is the number of samples and σ is the standard deviation:

\[ \frac{1}{n} \ast \frac{\sum (x - \mu'_1)^3}{\sigma^3} \]  (A7)
The Fourth moment of the distribution is more commonly known as kurtosis µ′₄ and is given by
the following:

\[ \frac{1}{n} \ast \frac{\sum (x - \mu'_1)^4}{\sigma^4} \]  (A8)
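The four moments above can be checked against the example dataset with a short Python sketch:

```python
import math

# The running example from this appendix.
data = [12, 14, 14, 17, 18]
n = len(data)

mean = sum(data) / n                                    # First moment
variance = sum((x - mean) ** 2 for x in data) / n       # Second moment (centered)
sigma = math.sqrt(variance)
skewness = sum((x - mean) ** 3 for x in data) / (n * sigma ** 3)  # Equation (A7)
kurtosis = sum((x - mean) ** 4 for x in data) / (n * sigma ** 4)  # Equation (A8)
```

This reproduces the mean of 15 and the variance of 4.8 computed above.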

References
1. Biham, E.; Shamir, A. Differential cryptanalysis of DES-like cryptosystems. In Proceedings of the Advances
in Cryptology—CRYPTO’90, Berlin, Germany, 11–15 August 1990; pp. 2–21.
2. Miyano, H. A method to estimate the number of ciphertext pairs for differential cryptanalysis. In Advances
in Cryptology—ASIACRYPT’91, Proceedings of the International Conference on the Theory and Application of
Cryptology, Fujiyosida, Japan, 11–14 November 1991; Springer: Berlin, Germany, 1991; pp. 51–58.
3. Jithendra, K.B.; Shahana, T.K. Enhancing the uncertainty of hardware efficient Substitution box based on
differential cryptanalysis. In Proceedings of the 6th International Conference on Advances in Computing,
Control, and Telecommunication Technologies (ACT 2015), Trivandrum, India, 31 October 2015; pp. 318–329.
4. Matsui, M. Linear cryptanalysis method for DES cipher. In Advances in Cryptology—EUROCRYPT’93,


Proceedings of the Workshop on the Theory and Application of Cryptographic Techniques, Lofthus, Norway, 23–27
May 1993; Springer: Berlin, Germany, 1993; pp. 386–397.
5. Courtois, N.T. Feistel schemes and bi-linear cryptanalysis. In Advances in Cryptology—CRYPTO 2004,
Proceedings of the 24th Annual International Cryptology Conference, Santa Barbara, CA, USA, 15–19 August 2004;
Springer: Berlin, Germany, 2004; pp. 23–40.
6. Soleimany, H.; Nyberg, K. Zero-correlation linear cryptanalysis of reduced-round LBlock. Des. Codes Cryptogr.
2014, 73, 683–698. [CrossRef]
7. Kocher, P.; Jaffe, J.; Jun, B. Differential power analysis. In Proceedings of the 19th Annual International
Cryptology Conference (CRYPTO 1999), Santa Barbara, CA, USA, 15–19 August 1999; pp. 388–397.
8. Mangard, S.; Oswald, E.; Popp, T. Power Analysis Attacks: Revealing the Secrets of Smart Cards; Springer: Berlin,
Germany, 2007; p. 338. [CrossRef]
9. Agrawal, D.; Archambeault, B.; Rao, J.R.; Rohatgi, P. The EM sidechannel(s). In Proceedings of the 4th
International Workshop on Cryptographic Hardware and Embedded Systems (CHES 2002), Redwood Shores,
CA, USA, 13–15 August 2002; pp. 29–45.
10. Gandolfi, K.; Mourtel, C.; Olivier, F. Electromagnetic analysis: Concrete results. In Proceedings of the 3rd
International Workshop on Cryptographic Hardware and Embedded Systems (CHES 2001), Paris, France,
14–16 May 2001; pp. 251–261.
11. Kasuya, M.; Machida, T.; Sakiyama, K. New metric for side-channel information leakage: Case study on EM
radiation from AES hardware. In Proceedings of the 2016 URSI Asia-Pacific Radio Science Conference (URSI
AP-RASC), Piscataway, NJ, USA, 21–25 August 2016; pp. 1288–1291.
12. Gu, P.; Stow, D.; Barnes, R.; Kursun, E.; Xie, Y. Thermal-aware 3D design for side-channel information leakage.
In Proceedings of the 34th IEEE International Conference on Computer Design (ICCD 2016), Scottsdale, AZ,
USA, 2–5 October 2016; pp. 520–527.
13. Hutter, M.; Schmidt, J.-M. The temperature side channel and heating fault attacks. In Proceedings of the
12th International Conference on Smart Card Research and Advanced Applications (CARDIS 2013), Berlin,
Germany, 27–29 November 2013; pp. 219–235.
14. Masti, R.J.; Rai, D.; Ranganathan, A.; Muller, C.; Thiele, L.; Capkun, S. Thermal Covert Channels on Multi-core
Platforms. In Proceedings of the 24th USENIX Security Symposium, Washington, DC, USA, 12–14 August
2015; pp. 865–880.
15. Ferrigno, J.; Hlavac, M. When AES blinks: Introducing optical side channel. IET Inf. Secur. 2008, 2, 94–98.
[CrossRef]
16. Stellari, F.; Tosi, A.; Zappa, F.; Cova, S. CMOS circuit analysis with luminescence measurements and
simulations. In Proceedings of the 32nd European Solid State Device Research Conference, Bologna, Italy,
24–26 September 2002; pp. 495–498.
17. Brumley, D.; Boneh, D. Remote timing attacks are practical. Comput. Netw. 2005, 48, 701–716. [CrossRef]
18. Kocher, P.C. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In
Advances in Cryptology—CRYPTO ’96 Proceedings of the 16th Annual International Cryptology Conference, Santa
Barbara, CA, USA, 18–22 August 1996; Springer: Berlin/Heidelberg, Germany, 1996; pp. 104–113.
19. Toreini, E.; Randell, B.; Hao, F. An Acoustic Side Channel Attack on Enigma; Newcastle University: Newcastle,
UK, 2015.
20. National Bureau of Standards. Data Encryption Standard; Federal Information Processing Standards
Publication (FIPS PUB) 46: Washington, DC, USA, 1977.
21. National Institute of Standards and Technology. Advanced Encryption Standard (AES); Federal Information
Processing Standards Publication (FIPS PUB) 197: Washington, DC, USA, 2001.
22. Cryptographic Engineering Research Group (CERG), Flexible Open-Source Workbench for Side-Channel
Analysis (FOBOS). Available online: https://cryptography.gmu.edu/fobos/ (accessed on 1 March 2020).
23. Cryptographic Engineering Research Group (CERG), eXtended eXtensible Benchmarking eXtension (XXBX).
Available online: https://cryptography.gmu.edu/xxbx/ (accessed on 1 March 2020).
24. Rivest, R.L.; Shamir, A.; Adleman, L. A method for obtaining digital signatures and public-key cryptosystems.
Commun. ACM 1978, 21, 120–126. [CrossRef]
25. Joye, M.; Sung-Ming, Y. The Montgomery powering ladder. In Cryptographic Hardware and Embedded
Systems—CHES 2002, Proceedings of the 4th International Workshop, Redwood Shores, CA, USA, 13–15 August
2002; Revised Papers; Springer: Berlin, Germany, 2002; pp. 291–302.
26. Rohatgi, P. Protecting FPGAs from Power Analysis. Available online: https://www.eetimes.com/protecting-
fpgas-from-power-analysis (accessed on 21 April 2020).
27. Messerges, T.S.; Dabbish, E.A.; Sloan, R.H. Power analysis attacks of modular exponentiation in smartcards.
In Proceedings of the 1st Workshop on Cryptographic Hardware and Embedded Systems (CHES 1999),
Worcester, MA, USA, 12–13 August 1999; pp. 144–157.
28. Plore. Side Channel Attacks on High Security Electronic Safe Locks. Available online: https://www.youtube.
com/watch?v=lXFpCV646E0 (accessed on 15 January 2020).
29. Aucamp, D. Test for the difference of means. In Proceedings of the 14th Annual Meeting of the American
Institute for Decision Sciences, San Francisco, CA, USA, 22–24 November 1982; pp. 291–293.
30. Cohen, A.E.; Parhi, K.K. Side channel resistance quantification and verification. In Proceedings of the 2007
IEEE International Conference on Electro/Information Technology (EIT 2007), Chicago, IL, USA, 17–20 May
2007; pp. 130–134.
31. Brier, E.; Clavier, C.; Olivier, F. Correlation power analysis with a leakage model. In Proceedings of the 6th
International Workshop on Cryptographic Hardware and Embedded Systems (CHES 2004), Cambridge, MA,
USA, 11–13 August 2004; pp. 16–29.
32. Souissi, Y.; Bhasin, S.; Guilley, S.; Nassar, M.; Danger, J.L. Towards Different Flavors of Combined Side Channel
Attacks. In Topics in Cryptology–CT-RSA 2012, Proceedings of the Cryptographers’ Track at the RSA Conference
2012, San Francisco, CA, USA, 27 February–2 March 2012; Springer: Berlin, Germany, 2012; pp. 245–259.
33. Zhang, H.; Li, J.; Zhang, F.; Gan, H.; He, P. A study on template attack of chip base on side channel power
leakage. Dianbo Kexue Xuebao/Chin. J. Radio Sci. 2015, 30, 987–992. [CrossRef]
34. Socha, P.; Miskovsky, V.; Kubatova, H.; Novotny, M. Optimization of Pearson correlation coefficient calculation
for DPA and comparison of different approaches. In Proceedings of the 2017 IEEE 20th International
Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), Los Alamitos, CA, USA,
19–21 April 2017; pp. 184–189.
35. Chari, S.; Rao, J.R.; Rohatgi, P. Template attacks. In Proceedings of the 4th International Workshop on
Cryptographic Hardware and Embedded Systems (CHES 2002), Redwood Shores, CA, USA, 13–15 August
2002; pp. 13–28.
36. Chen, L.; Wang, S. Semi-naive bayesian classification by weighted kernel density estimation. In Proceedings
of the 8th International Conference on Advanced Data Mining and Applications (ADMA 2012), Nanjing,
China, 15–18 December 2012; pp. 260–270.
37. Gierlichs, B.; Batina, L.; Tuyls, P.; Preneel, B. Mutual information analysis: A generic side-channel distinguisher.
In Cryptographic Hardware and Embedded Systems—CHES 2008, Proceedings of the 10th International Workshop,
Washington, DC, USA, 10–13 August 2008; Springer: Berlin, Germany, 2008; pp. 426–442.
38. Souissi, Y.; Nassar, M.; Guilley, S.; Danger, J.-L.; Flament, F. First principal components analysis: A new side
channel distinguisher. In Proceedings of the 13th International Conference on Information Security and
Cryptology (ICISC 2010), Seoul, Korea, 1–3 December 2010; pp. 407–419.
39. Whitnall, C.; Oswald, E.; Standaert, F.X. The myth of generic DPA...and the magic of learning. In Topics in
Cryptology—CT-RSA 2014, Proceedings of the Cryptographer’s Track at the RSA Conference, San Francisco, CA,
USA, 25–28 February 2014; Springer: Berlin, Germany, 2014; pp. 183–205.
40. Wong, D. Explanation of DPA: Differential Power Analysis (from the paper of Kocher et al); YouTube: San Bruno,
CA, USA, 2015.
41. Aigner, M.; Oswald, E. Power Analysis Tutorial; Institute for Applied Information Processing and
Communication; University of Technology Graz: Graz, Austria, 2008.
42. Messerges, T.S.; Dabbish, E.A.; Sloan, R.H. Examining smart-card security under the threat of power analysis
attacks. IEEE Trans. Comput. 2002, 51, 541–552. [CrossRef]
43. Messerges, T.S.; Dabbish, E.A.; Sloan, R.H. Investigations of power analysis attacks on smart cards.
In Proceedings of the USENIX Workshop on Smartcard Technology, Berkeley, CA, USA, 10–11 May 1999;
pp. 151–161.
44. Kiyani, N.F.; Harpe, P.; Dolmans, G. Performance analysis of OOK modulated signals in the presence of
ADC quantization noise. In Proceedings of the IEEE 75th Vehicular Technology Conference, VTC Spring
2012, Yokohama, Japan, 6 May–9 June 2012.
45. Le, T.H.; Clediere, J.; Serviere, C.; Lacoume, J.L. Noise reduction in side channel attack using fourth-order
cumulant. IEEE Trans. Inf. Forensics Secur. 2007, 2, 710–720. [CrossRef]
46. Ott, R.L.; Longnecker, M. An Introduction to Statistical Methods & Data Analysis, Seventh ed.; Cengage Learning:
Boston, MA, USA, 2016; p. 1179.
47. Messerges, T.S. Using second-order power analysis to attack DPA resistant software. In Cryptographic
Hardware and Embedded Systems—CHES 2000, Proceedings of the Second International Workshop, Worcester, MA,
USA, 17–18 August 2000; Springer: Berlin, Germany, 2000; pp. 238–251.
48. Oswald, E.; Mangard, S.; Herbst, C.; Tillich, S. Practical second-order DPA attacks for masked smart card
implementations of block ciphers. In Topics in Cryptology-CT-RSA 2006, Proceedings of the Cryptographers’
Track at the RAS Conference 2006, San Jose, CA, USA, 13–17 February 2006; Springer: Berlin, Germany, 2006;
pp. 192–207.
49. Clavier, C.; Coron, J.-S.; Dabbous, N. Differential power analysis in the presence of hardware countermeasures.
In Proceedings of the 2nd International Workshop on Cryptographic Hardware and Embedded Systems
(CHES 2000), Worcester, MA, USA, 17 August 2000; pp. 252–263.
50. Debande, N.; Souissi, Y.; Nassar, M.; Guilley, S.; Thanh-Ha, L.; Danger, J.L. Re-synchronization by moments:
An efficient solution to align Side-Channel traces. In Proceedings of the 2011 IEEE International Workshop
on Information Forensics and Security (WIFS 2011), Piscataway, NJ, USA, 29 November–2 December 2011;
p. 6.
51. Qizhi, T.; Huss, S.A. A general approach to power trace alignment for the assessment of side-channel
resistance of hardened cryptosystems. In Proceedings of the 2012 Eighth International Conference on
Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), Los Alamitos, CA, USA, 18–20
July 2012; pp. 465–470.
52. Thiebeauld, H.; Gagnerot, G.; Wurcker, A.; Clavier, C. SCATTER: A New Dimension in Side-Channel. In
Constructive Side-Channel Analysis and Secure Design, Proceedings of the 9th International Workshop (COSADE
2018), Singapore, 23–24 April 2018; Springer: Berlin, Germany, 2008; pp. 135–152.
53. Shamir, A. Protecting smart cards from passive power analysis with detached power supplies. In Cryptographic
Hardware and Embedded Systems—CHES 2000, Proceedings of the Second International Workshop, Worcester, MA,
USA, 17–18 August 2000; Springer: Berlin, Germany, 2000; pp. 71–77.
54. Coron, J.-S. Resistance against differential power analysis for elliptic curve cryptosystems. In Proceedings of
the 1st Workshop on Cryptographic Hardware and Embedded Systems (CHES 1999), Worcester, MA, USA,
12–13 August 1999; pp. 292–302.
55. Waddle, J.; Wagner, D. Towards efficient second-order power analysis. In Proceedings of the 6th International
Workshop on Cryptographic Hardware and Embedded Systems (CHES 2004), Cambridge, MA, USA, 11–13
August 2004; pp. 1–15.
56. ChipWhisperer®. Template Attacks. Available online: https://wiki.newae.com/Template_Attacks (accessed
on 3 April 2020).
57. Lerman, L.; Bontempi, G.; Markowitch, O. Power analysis attack: An approach based on machine learning.
Int. J. Appl. Cryptogr. 2014, 3, 97–115. [CrossRef]
58. Markowitch, O.; Lerman, L.; Bontempi, G. Side Channel Attack: An Approach Based on Machine Learning; Center
for Advanced Security Research Darmstadt: Darmstadt, Germany, 2011.
59. Hospodar, G.; Gierlichs, B.; De Mulder, E.; Verbauwhede, I.; Vandewalle, J. Machine learning in side-channel
analysis: A first study. J. Cryptogr. Eng. 2011, 1, 293–302. [CrossRef]
60. Ramezanpour, K.; Ampadu, P.; Diehl, W. SCAUL: Power Side-Channel Analysis with Unsupervised Learning.
arXiv e-Prints 2020, arXiv:2001.05951.
61. Hettwer, B.; Gehrer, S.; Guneysu, T. Applications of machine learning techniques in side-channel attacks: A
survey. J. Cryptogr. Eng. 2019. [CrossRef]
62. Lerman, L.; Martinasek, Z.; Markowitch, O. Robust profiled attacks: Should the adversary trust the dataset?
IET Inf. Secur. 2017, 11, 188–194. [CrossRef]
63. Martinasek, Z.; Iglesias, F.; Malina, L.; Martinasek, J. Crucial pitfall of DPA Contest V4.2 implementation.
Secur. Commun. Netw. 2016, 9, 6094–6110. [CrossRef]
64. Martinasek, Z.; Zeman, V.; Malina, L.; Martinásek, J. k-Nearest Neighbors Algorithm in Profiling Power
Analysis Attacks. Radioengineering 2016, 25, 365–382. [CrossRef]
65. Golder, A.; Das, D.; Danial, J.; Ghosh, S.; Sen, S.; Raychowdhury, A. Practical Approaches toward
Deep-Learning-Based Cross-Device Power Side-Channel Attack. IEEE Trans. Very Large Scale Integr.
(Vlsi) Syst. 2019, 27, 2720–2733. [CrossRef]
66. Jin, S.; Kim, S.; Kim, H.; Hong, S. Recent advances in deep learning-based side-channel analysis. ETRI J.
2020, 42, 292–304. [CrossRef]
67. Libang, Z.; Xinpeng, X.; Junfeng, F.; Zongyue, W.; Suying, W. Multi-label Deep Learning based Side Channel
Attack. In Proceedings of the 2019 Asian Hardware Oriented Security and Trust Symposium (AsianHOST),
Piscataway, NJ, USA, 16–17 December 2019; p. 6.
68. Yu, W.; Chen, J. Deep learning-assisted and combined attack: A novel side-channel attack. Electron. Lett.
2018, 54, 1114–1116. [CrossRef]
69. Wang, H.; Brisfors, M.; Forsmark, S.; Dubrova, E. How Diversity Affects Deep-Learning Side-Channel
Attacks. In Proceedings of the 5th IEEE Nordic Circuits and Systems Conference, NORCAS 2019: NORCHIP
and International Symposium of System-on-Chip, SoC 2019, Helsinki, Finland, 29–30 October 2019; IEEE
Circuits and Systems Society (CAS). Tampere University: Tampere, Finland, 2019.
70. Batina, L.; Gierlichs, B.; Prouff, E.; Rivain, M.; Standaert, F.-X.; Veyrat-Charvillon, N. Mutual information
analysis: A comprehensive study. J. Cryptol. 2011, 24, 269–291. [CrossRef]
71. Prouff, E.; Rivain, M. Theoretical and practical aspects of mutual information based side channel analysis. In
Proceedings of the 7th International Conference on Applied Cryptography and Network Security (ACNS
2009), Paris-Rocquencourt, France, 2–5 June 2009; pp. 499–518.
72. Standaert, F.-X.; Gierlichs, B.; Verbauwhede, I. Partition vs. comparison side-channel distinguishers: An
empirical evaluation of statistical tests for univariate side-channel attacks against two unprotected CMOS
devices. In Proceedings of the 11th International Conference on Information Security and Cryptology (ICISC
2008), Seoul, Korea, 3–5 December 2008; pp. 253–267.
73. Veyrat-Charvillon, N.; Standaert, F.-X. Mutual information analysis: How, when and why? In Proceedings
of the 11th International Workshop on Cryptographic Hardware and Embedded Systems (CHES 2009),
Lausanne, Switzerland, 6–9 September 2009; pp. 429–443.
74. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [CrossRef]
75. Carbone, M.; Teglia, Y.; Ducharme, G.R.; Maurine, P. Mutual information analysis: Higher-order statistical
moments, efficiency and efficacy. J. Cryptogr. Eng. 2017, 7, 1–17. [CrossRef]
76. Whitnall, C.; Oswald, E. A Comprehensive Evaluation of Mutual Information Analysis Using a Fair
Evaluation Framework. In Advances in Cryptology—CRYPTO 2011, Proceedings of the 31st Annual Cryptology
Conference, Santa Barbara, CA, USA, 14–18 August 2011; Springer: Berlin, Germany, 2011; pp. 316–334.
77. Fan, H.-F.; Yan, Y.-J.; Xu, J.-F.; Ren, F. Simulation of correlation power analysis against AES cryptographic
chip. Comput. Eng. Des. 2010, 31, 260–262.
78. Socha, P.; Miskovsky, V.; Kubatova, H.; Novotny, M. Correlation power analysis distinguisher based on the
correlation trace derivative. In Proceedings of the 21st Euromicro Conference on Digital System Design
(DSD 2018), Prague, Czech Republic, 29–31 August 2018; pp. 565–568.
79. Raatgever, J.W.; Duin, R.P.W. On the variable kernel model for multivariate nonparametric density estimation.
In Proceedings of the COMPSTAT 1978 Computational Statistics; Physica: Wien, Austria, 1978; pp. 524–533.
80. Batina, L.; Gierlichs, B.; Lemke-Rust, K. Differential cluster analysis. In Proceedings of the 11th International
Workshop on Cryptographic Hardware and Embedded Systems (CHES 2009), Lausanne, Switzerland, 6–9
September 2009; pp. 112–127.
81. Silva, J.; Narayanan, S.S. On data-driven histogram-based estimation for mutual information. In Proceedings
of the 2010 IEEE International Symposium on Information Theory (ISIT 2010), Austin, TX, USA, 13–18
June 2010; pp. 1423–1427.
82. Lange, M.; Nebel, D.; Villmann, T. Partial Mutual Information for Classification of Gene Expression Data
by Learning Vector Quantization. In Advances in Self-Organizing Maps and Learning Vector Quantization,
Proceedings of the 10th International Workshop (WSOM 2014), Mittweida, Germany, 2–4 July 2014; Springer:
Berlin/Heidelberg, Germany, 2014; pp. 259–269.
83. Goodwill, G.; Jun, B.; Jaffe, J.; Rohatgi, P. A testing methodology for side-channel resistance validation. In
NIST Non-Invasive Attack Testing Workshop; NIST: Gaithersburg, MD, USA, 2011.
84. Mather, L.; Oswald, E.; Bandenburg, J.; Wojcik, M. Does my device leak information? An a priori statistical
power analysis of leakage detection tests. In Proceedings of the 19th International Conference on the
Theory and Application of Cryptology and Information Security (ASIACRYPT 2013), Bengaluru, India, 1–5
December 2013; pp. 486–505.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).