
Reliable CRC-Based Error Detection Constructions for Finite Field

Multipliers With Applications in Cryptography

Abstract: Finite-field multiplication has received prominent attention in the literature with

applications in cryptography and error-detecting codes. For many cryptographic algorithms, this

arithmetic operation is a complex, costly, and time-consuming task that may require millions of

gates. In this work, we propose efficient hardware architectures based on cyclic redundancy

check (CRC) as error-detection schemes for postquantum cryptography (PQC) with case studies

for the Luov cryptographic algorithm. Luov was submitted for the National Institute of Standards

and Technology (NIST) PQC standardization competition and was advanced to the second

round. The CRC polynomials selected are in-line with the required error-detection capabilities

and with the field sizes as well. We have developed verification codes through which software

implementations of the proposed schemes are performed to verify the derivations of the

formulations. Additionally, hardware implementations of the original multipliers with the

proposed error-detection schemes are performed over a Xilinx field-programmable gate array

(FPGA), verifying that the proposed schemes achieve high error coverage with acceptable

overhead.
CHAPTER 1

INTRODUCTION

Many modern, sensitive applications and systems use finite-field operations in their schemes,

among which finite-field multiplication has received prominent attention. Finite-field multipliers

perform multiplication modulo an irreducible polynomial that defines the finite field. For

postquantum cryptography (PQC), the inputs can be very large, and the finite-field multipliers

may require millions of logic gates. Therefore, it is a complex task to implement such

architectures resilient to natural and malicious faults; consequently, research has focused on

ways to eliminate errors and obtain more reliability with acceptable overhead [1]–[6]. Moreover,

there has been previous work on countering fault attacks and providing reliability for PQC.

Sarker et al. [7] used error-detection schemes of number theoretic transform (NTT) to detect both

permanent and transient faults. Mozaffari-Kermani et al. [8] performed fault detection for

stateless hash-based PQC signatures. Additionally, error-detection hash trees for stateless hash-

based signatures are proposed in [9] to make such schemes more reliable against natural faults

and help protect them against malicious faults. In [10], algorithm-oblivious constructions are

proposed through recomputing with swapped ciphertext and additional authenticated blocks,

which can be applied to the Galois counter mode (GCM) architectures using different finite-field

multipliers in GF(2^128). Several countermeasures based on error-detection checksum codes and

spatial/temporal redundancies for the NTRU encryption algorithm have been presented in [11].
Our proposed error-detection architectures are adapted to the Luov cryptographic algorithm [12];

however, they can be applied to different PQC algorithms that use finite-field multipliers. The

Luov algorithm was submitted to the National Institute of Standards and Technology (NIST)

standardization competition [13] and was advanced to the second round [14]. Cyclic redundancy

check (CRC) error-detection schemes are applied in our proposed hardware constructions to

make sure that they are overhead-aware with high error coverage. Our contributions in this brief

are summarized as follows.

1) Error-detection schemes for the finite-field multipliers over GF(2^m) with m > 1 used in the Luov

cryptographic algorithm are proposed. These error-detection architectures are based on CRC-5.

Additionally, we explore and study both primitive and standardized generator polynomials for

CRC-5, comparing their complexity.

2) We derive new formulations for the error-detection schemes of Luov’s algorithm, performing

software implementations for the sake of verifications. We note that such derivation covers a

wide range of applications and security levels. Nevertheless, the presented schemes are not

confined to these case studies.

3) The proposed error-detection architectures are embedded into the original finite-field

multipliers. We perform the implementations using Xilinx field-programmable gate array

(FPGA) family Kintex Ultrascale+ for device xcku5p-ffvd900-1-i to confirm that the schemes

are overhead-aware and that they provide high error coverage.


LITERATURE SURVEY:

1. Reliable Hardware Architectures for the Third-Round SHA-3 Finalist

Grostl Benchmarked on FPGA Platform:

The third round of the SHA-3 competition is ongoing, with the winning function to be selected in 2012. Although much attention has been devoted to the performance and security of

these candidates, the approaches for increasing their reliability have not been presented to date.

In this paper, for the first time, we propose a high-performance scheme for fault detection of the

SHA-3 round-three candidate Grostl which is inspired by the Advanced Encryption Standard

(AES). We propose a low-overhead fault detection scheme by presenting closed formulations for

the predicted signatures of different transformations of this SHA-3 third-round finalist. These

signatures are derived to achieve low overhead and include one or multi-bit parities and

byte/word-wide predicted signatures. The proposed reliable hardware architectures for Grostl are

implemented on Xilinx Virtex-6 FPGA family to benchmark their hardware and timing

characteristics. The results of our evaluations show high error coverage and acceptable overhead

for the proposed scheme.

2. Efficient and Reliable Error Detection Architectures of Hash-Counter-


Hash Tweakable Enciphering Schemes:

Through pseudorandom permutation, tweakable enciphering schemes (TES) constitute block

cipher modes of operation which perform length-preserving computations. The state-of-the-art

research has focused on different aspects of TES, including implementations on hardware [field-programmable gate array (FPGA)/application-specific integrated circuit (ASIC)] and software

(hard/soft-core microcontrollers) platforms, algorithmic security, and applicability to sensitive,

security-constrained usage models. In this article, we propose efficient approaches for protecting

such schemes against natural and malicious faults. Specifically, noting that intelligent attackers

do not merely get confined to injecting multiple faults, one major benchmark for the proposed

schemes is evaluation toward biased and burst fault models. We evaluate a variant of TES, i.e.,

the Hash-Counter-Hash scheme, which involves polynomial hashing as other variants are either

similar or do not constitute finite field multiplication which, by far, is the most involved

operation in TES. In addition, we benchmark the overhead and performance degradation on the

ASIC platform. The results of our error injection simulations and ASIC implementations show

the suitability of the proposed approaches for a wide range of applications including deeply

embedded systems.

3. Reliable Hardware Architectures for Cryptographic Block Ciphers LED

and HIGHT:

Cryptographic architectures provide different security properties to sensitive usage models.

However, unless reliability of architectures is guaranteed, such security properties can be

undermined through natural or malicious faults. In this paper, two underlying block ciphers

which can be used in authenticated encryption algorithms are considered, i.e., light encryption

device and high security and lightweight block ciphers. The former is of the Advanced

Encryption Standard type and has been considered area-efficient, while the latter constitutes a

Feistel network structure and is suitable for low-complexity and low-power embedded security

applications. In this paper, we propose efficient error detection architectures including variants of
recomputing with encoded operands and signature-based schemes to detect both transient and

permanent faults. Authenticated encryption is applied in cryptography to provide confidentiality,

integrity, and authenticity simultaneously to the message sent in a communication channel. In

this paper, we show that the proposed schemes are applicable to the case study of simple

lightweight CFB for providing authenticated encryption with associated data. The error

simulations are performed using Xilinx Integrated Synthesis Environment tool and the results are

benchmarked for the Xilinx FPGA family Virtex-7 to assess the reliability capability and

efficiency of the proposed architectures.

4. Reliable and Error Detection Architectures of Pomaranch for False-

Alarm-Sensitive Cryptographic Applications:

Efficient cryptographic architectures are used extensively in sensitive smart infrastructures.

Among these architectures are those based on stream ciphers for protection against

eavesdropping, especially when these smart and sensitive applications provide life-saving or vital

mechanisms. Nevertheless, natural defects call for protection through design for fault detection

and reliability. In this paper, we present implications of fault detection cryptographic

architectures (Pomaranch in the hardware profile of European Network of Excellence for

Cryptology) for smart infrastructures. In addition, we present low-power architectures for its

nine-to-seven uneven substitution box [tower field architectures in GF(33)]. Through error

simulations, we assess resiliency against false-alarms which might not be tolerated in sensitive

intelligent infrastructures as one of our contributions. We further benchmark the feasibility of the

proposed approaches through application-specific integrated circuit realizations. Based on the


reliability objectives, the proposed architectures are a step-forward toward reaching the desired

objective metrics suitable for intelligent, emerging, and sensitive applications.

5. Reliable Architecture-Oblivious Error Detection Schemes for Secure

Cryptographic GCM Structures:

To augment the confidentiality property provided by block ciphers with authentication, the

Galois Counter Mode (GCM) has been standardized by the National Institute of Standards

and Technology. The GCM is used as an add-on to 128-bit block ciphers, such as the

Advanced Encryption Standard (AES), SMS4, or Camellia, to verify the integrity of data.

Prior works on the error detection of the GCM either use linear codes to protect the GCM

architectures or are based on AES–GCM architectures, confining the mechanisms to the AES

block cipher. Although such structures are efficient, they are not only confined to specific

architectures of the GCM but might also not fully take advantage of the parallel architectures

of the GCM. Moreover, linear codes have been shown to be potentially ineffective with

respect to biased faults. In this paper, we propose algorithm-oblivious constructions through

recomputing with swapped ciphertext and additional authenticated blocks, which can be

applied to the GCM architectures using different finite field multipliers in GF(2^128). Such

obliviousness for the proposed constructions used in the GCM gives freedom to the

designers. We present the results of error simulations and application-specific integrated

circuit implementations to demonstrate the utility of the presented schemes. Based on the

overhead/degradation tolerance for implementation/performance metrics, one can fine-tune

the proposed method to achieve more reliable architectures for the GCM.
CHAPTER 2

CYCLIC REDUNDANCY CHECK

Cyclic Redundancy Check:

The cyclic redundancy check, or CRC, is a technique for detecting errors in digital data, but not

for making corrections when errors are detected. It is used primarily in data transmission. In the

CRC method, a certain number of check bits, often called a checksum, are appended to the

message being transmitted. The receiver can determine whether or not the check bits agree with

the data, to ascertain with a certain degree of probability whether or not an error occurred in

transmission. If an error occurred, the receiver sends a “negative acknowledgement” (NAK) back

to the sender, requesting that the message be retransmitted.

The technique is also sometimes applied to data storage devices, such as a disk drive. In this

situation each block on the disk would have check bits, and the hardware might automatically

initiate a reread of the block when an error is detected, or it might report the error to software.

The material that follows speaks in terms of a “sender” and a “receiver” of a “message,” but it

should be understood that it applies to storage writing and reading as well.

Background:
There are several techniques for generating check bits that can be added to a message. Perhaps

the simplest is to append a single bit, called the “parity bit,” which makes the total number of 1-bits in the code vector (message with parity bit appended) even (or odd). If a single bit gets

altered in transmission, this will change the parity from even to odd (or the reverse). The sender

generates the parity bit by simply summing the message bits modulo 2—that is, by exclusive

or’ing them together. It then appends the parity bit (or its complement) to the message. The

receiver can check the message by summing all the message bits modulo 2 and checking that the

sum agrees with the parity bit. Equivalently, the receiver can sum all the bits (message and

parity) and check that the result is 0 (if even parity is being used).

This simple parity technique is often said to detect 1-bit errors. Actually it detects errors in any

odd number of bits (including the parity bit), but it is a small comfort to know you are detecting

3-bit errors if you are missing 2-bit errors.
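As a small illustrative sketch (not from the original text), the sender's parity generation and the receiver's check, including the 2-bit blind spot just described, can be written as:

```python
def parity_bit(bits: int) -> int:
    """Even-parity bit: the XOR (sum modulo 2) of all message bits."""
    p = 0
    while bits:
        p ^= bits & 1
        bits >>= 1
    return p

msg = 0b11010010                 # four 1-bits, so the even-parity bit is 0
p = parity_bit(msg)
codeword = (msg << 1) | p        # message with parity bit appended

assert parity_bit(codeword) == 0          # receiver: all bits XOR to 0
assert parity_bit(codeword ^ 0b11) == 0   # a 2-bit error goes undetected
```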

For bit serial sending and receiving, the hardware to generate and check a single parity bit is very

simple. It consists of a single exclusive or gate together with some control circuitry. For bit

parallel transmission, an exclusive or tree may be used, as illustrated in the figure below. Efficient ways to compute the parity bit in software are described in the literature.


FIGURE . Exclusive or tree

Other techniques for computing a checksum are to form the exclusive or of all the bytes in the

message, or to compute a sum with end-around carry of all the bytes. In the latter method the

carry from each 8-bit sum is added into the least significant bit of the accumulator. It is believed

that this is more likely to detect errors than the simple exclusive or, or the sum of the bytes with

carry discarded.
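These two byte-oriented checksums can be sketched in a few lines of Python (the sample bytes are arbitrary illustration values):

```python
def xor_checksum(data: bytes) -> int:
    """Exclusive-or of all the bytes in the message."""
    c = 0
    for b in data:
        c ^= b
    return c

def end_around_carry_sum(data: bytes) -> int:
    """8-bit sum with end-around carry: each carry out of the top bit
    is added back into the least significant bit of the accumulator."""
    s = 0
    for b in data:
        s += b
        s = (s & 0xFF) + (s >> 8)   # fold the carry back in
    return s

data = bytes([0xF0, 0x20, 0xF0])
assert xor_checksum(data) == 0x20
assert end_around_carry_sum(data) == 0x02
```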

A technique that is believed to be quite good in terms of error detection, and which is easy to

implement in hardware, is the cyclic redundancy check. This is another way to compute a

checksum, usually 8, 16, or 32 bits in length, that is appended to the message. We will briefly

review the theory and then give some algorithms for computing in software a commonly used

32-bit CRC checksum.

Theory:

The CRC is based on polynomial arithmetic, in particular, on computing the remainder of

dividing one polynomial in GF(2) (Galois field with two elements) by another. It is a little like

treating the message as a very large binary number, and computing the remainder on dividing it

by a fairly large prime. Intuitively, one would expect this to give a reliable checksum.

A polynomial in GF(2) is a polynomial in a single variable x whose coefficients are 0 or 1.

Addition and subtraction are done modulo 2—that is, they are both the same as the exclusive or

operator. For example, the sum of the polynomials x^3 + x + 1 and x^2 + x is x^3 + x^2 + 1, as is their difference. These polynomials are not usually written with minus signs, but they

could be, because a coefficient of –1 is equivalent to a coefficient of 1.

Multiplication of such polynomials is straightforward. The product of one coefficient by another

is the same as their combination by the logical and operator, and the partial products are summed

using exclusive or. Multiplication is not needed to compute the CRC checksum.

Division of polynomials over GF(2) can be done in much the same way as long division of

polynomials over the integers. The reader might like to verify that, in such a division, the quotient multiplied by the divisor, plus the remainder, equals the dividend.
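A worked example in Python (an illustrative sketch; the operands are chosen to match the shift-register example later in this chapter, with bit i of an integer encoding the coefficient of x^i):

```python
def gf2_divmod(dividend: int, divisor: int):
    """Long division of GF(2) polynomials; returns (quotient, remainder)."""
    q, r = 0, dividend
    dd = divisor.bit_length() - 1          # degree of the divisor
    while r and r.bit_length() - 1 >= dd:
        shift = (r.bit_length() - 1) - dd
        q ^= 1 << shift                    # record this quotient term
        r ^= divisor << shift              # subtract (XOR) the shifted divisor
    return q, r

def gf2_mul(a: int, b: int) -> int:
    """Carry-less multiplication of GF(2) polynomials."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    return p

# (x^7 + x^6 + x^5 + x^2 + x) divided by (x^3 + x + 1):
q, r = gf2_divmod(0b11100110, 0b1011)
# quotient times divisor, plus remainder, equals the dividend:
assert gf2_mul(q, 0b1011) ^ r == 0b11100110
```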

The CRC method treats the message as a polynomial in GF(2). For example, the message

11001001, where the order of transmission is from left to right (110…), is treated as a representation of the polynomial x^7 + x^6 + x^3 + 1. The sender and receiver agree on a certain fixed polynomial

called the generator polynomial. For example, for a 16-bit CRC the CCITT 1 has chosen the

polynomial x^16 + x^12 + x^5 + 1, which is now widely used for a 16-bit CRC checksum. To

compute an r-bit CRC checksum, the generator polynomial must be of degree r. The sender

appends r 0-bits to the m-bit message and divides the resulting polynomial of degree m + r − 1 by the generator polynomial. This produces a remainder polynomial of degree r − 1 (or less). The remainder polynomial has r coefficients, which are the checksum. The quotient polynomial is discarded. The data transmitted (the code vector) is the original m-bit message followed by the r-bit checksum.
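The sender-side procedure just described can be sketched as follows (illustrative Python; the 8-bit message and degree-3 generator are toy values):

```python
def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of GF(2) polynomial division; bit i encodes x^i."""
    dd = divisor.bit_length() - 1
    while dividend and dividend.bit_length() - 1 >= dd:
        dividend ^= divisor << ((dividend.bit_length() - 1) - dd)
    return dividend

def crc_checksum(message: int, generator: int) -> int:
    """Append r 0-bits to the message and keep the r-bit remainder."""
    r = generator.bit_length() - 1
    return gf2_mod(message << r, generator)

G = 0b1011                     # x^3 + x + 1, so r = 3
M = 0b11100110                 # an 8-bit message
c = crc_checksum(M, G)         # the 3-bit checksum
code_vector = (M << 3) | c     # message followed by checksum
assert gf2_mod(code_vector, G) == 0   # receiver's remainder check passes
```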

There are two ways for the receiver to assess the correctness of the transmission. It can compute

the checksum from the first m bits of the received data, and verify that it agrees with the last r

received bits. Alternatively, and following usual practice, the receiver can divide all the received

bits by the generator polynomial and check that the r-bit remainder is 0. To see that the

remainder must be 0, let M be the polynomial representation of the message, and let R be the

polynomial representation of the remainder that was computed by the sender. Then the

transmitted data corresponds to the polynomial Mx^r − R (or, equivalently, Mx^r + R). By the way R was computed, we know that Mx^r = QG + R, where G is the generator polynomial and Q is the quotient (that was discarded). Therefore the transmitted data, Mx^r + R = QG + R + R = QG, is clearly a

multiple of G. If the receiver is built as nearly as possible just like the sender, the receiver will

append r 0-bits to the received data as it computes the remainder R. But the received data with 0-

bits appended is still a multiple of G, so the computed remainder is still 0.


That’s the basic idea, but in reality the process is altered slightly to correct for such deficiencies

as the fact that the method as described is insensitive to the number of leading and trailing 0-bits

in the data transmitted. In particular, if a failure occurred that caused the received data, including

the checksum, to be all-0, it would be accepted.
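This insensitivity is easy to reproduce (an illustrative sketch reusing the toy generator G = 1011 and a valid message-plus-checksum frame):

```python
def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of GF(2) polynomial division; bit i encodes x^i."""
    dd = divisor.bit_length() - 1
    while dividend and dividend.bit_length() - 1 >= dd:
        dividend ^= divisor << ((dividend.bit_length() - 1) - dd)
    return dividend

G = 0b1011
codeword = 0b11100110100                # valid frame: remainder is 0
assert gf2_mod(codeword, G) == 0
assert gf2_mod(codeword << 1, G) == 0   # a spurious trailing 0 slips through
assert gf2_mod(0, G) == 0               # an all-0 frame is accepted, too
```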

Choosing a “good” generator polynomial is something of an art, and beyond the scope of this

text. Two simple observations: For an r-bit checksum, G should be of degree r, because

otherwise the first bit of the checksum would always be 0, which wastes a bit of the checksum.

Similarly, the last coefficient should be 1 (that is, G should not be divisible by x), because

otherwise the last bit of the checksum would always be 0 (because if G is divisible by x, then R

must be also). The following facts about generator polynomials are proved in [PeBr] and/or

[Tanen]:

• If G contains two or more terms, all single-bit errors are detected.

• If G is not divisible by x (that is, if the last term is 1), and e is the least positive integer such that G evenly divides x^e + 1, then all double errors that are within a frame of e bits are detected. A particularly good polynomial in this respect is x^15 + x^14 + 1, for which e = 32767.

• If x + 1 is a factor of G, all errors consisting of an odd number of bits are detected.

• An r-bit CRC checksum detects all burst errors of length r or less. (A burst error of length r is a string of r bits in which the first and last are in error, and the intermediate bits may or may not be in error.)

It is interesting to note that if a code of any type can detect all double-bit and single-bit errors,

then it can in principle correct single-bit errors. To see this, suppose data containing a single-bit
error is received. Imagine complementing all the bits, one at a time. In all cases but one, this

results in a double-bit error, which is detected. But when the erroneous bit is complemented, the

data is error-free, which is recognized. In spite of this, the CRC method does not seem to be used

for single-bit error correction. Instead, the sender is requested to repeat the whole transmission if

any error is detected.
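The complement-each-bit argument can be checked directly (an illustrative sketch with the toy generator and codeword used earlier; the flipped bit position is arbitrary):

```python
def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of GF(2) polynomial division; bit i encodes x^i."""
    dd = divisor.bit_length() - 1
    while dividend and dividend.bit_length() - 1 >= dd:
        dividend ^= divisor << ((dividend.bit_length() - 1) - dd)
    return dividend

G = 0b1011
codeword = 0b11100110100            # 11 bits, remainder 0
received = codeword ^ (1 << 6)      # a single-bit error in transit
# Complement each bit in turn; only the erroneous position restores a
# zero remainder, which locates (and so could correct) the error.
candidates = [i for i in range(11) if gf2_mod(received ^ (1 << i), G) == 0]
assert candidates == [6]
```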

Practice:

The table below shows the generator polynomials used by some common CRC standards. The “Hex” column shows the hexadecimal representation of the generator polynomial; the most significant bit is omitted, as it is always 1.

TABLE: GENERATOR POLYNOMIALS OF SOME CRC CODES

CRC-12:     x^12 + x^11 + x^3 + x^2 + x + 1    Hex 80F
CRC-16:     x^16 + x^15 + x^2 + 1              Hex 8005
CRC-CCITT:  x^16 + x^12 + x^5 + 1              Hex 1021
CRC-32:     x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1    Hex 04C11DB7

The CRC standards differ in ways other than the choice of generating polynomial. Most initialize

by assuming that the message has been preceded by certain nonzero bits, others do no such

initialization. Most transmit the bits within a byte least significant bit first, some most significant

bit first. Most append the checksum least significant byte first, others most significant byte first.

Some complement the checksum.
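These variations (initialization, bit reflection, final complement) are commonly captured by a single parameter model, often called the Rocksoft model. A sketch, with the widely published parameter sets and check values for two of the standards discussed here:

```python
def reflect(v: int, width: int) -> int:
    """Reverse the bit order of a width-bit value."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (v & 1)
        v >>= 1
    return out

def crc(data: bytes, width: int, poly: int, init: int,
        refin: bool, refout: bool, xorout: int) -> int:
    """Parameter-driven CRC covering the variations described above."""
    reg = init
    top = 1 << (width - 1)
    mask = (1 << width) - 1
    for byte in data:
        if refin:                   # transmit bytes least significant bit first?
            byte = reflect(byte, 8)
        reg ^= byte << (width - 8)
        for _ in range(8):
            reg = ((reg << 1) ^ poly) if reg & top else (reg << 1)
            reg &= mask
    if refout:
        reg = reflect(reg, width)
    return reg ^ xorout             # optional final complement

# CRC-CCITT in XMODEM form: zero init, no reflection, no complement
assert crc(b"123456789", 16, 0x1021, 0, False, False, 0) == 0x31C3
# CRC-32: all-ones init, LSB-first bytes, complemented checksum
assert crc(b"123456789", 32, 0x04C11DB7, 0xFFFFFFFF,
           True, True, 0xFFFFFFFF) == 0xCBF43926
```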


CRC-12 is used for transmission of 6-bit character streams, and the others are for 8-bit

characters, or 8-bit bytes of arbitrary data. CRC-16 is used in IBM’s BISYNCH communication

standard. The CRC-CCITT polynomial, also known as ITU-TSS, is used in communication

protocols such as XMODEM, X.25, IBM’s SDLC, and ISO’s HDLC [Tanen]. CRC-32 is also

known as AUTODIN-II and ITU-TSS (ITU-TSS has defined both 16- and 32-bit polynomials).

It is used in PKZip, Ethernet, AAL5 (ATM Adaptation Layer 5), FDDI (Fiber Distributed Data

Interface), the IEEE-802 LAN/MAN standard, and in some DOD applications. It is the one for

which software algorithms are given here.

The first three polynomials in the table have x + 1 as a factor. The last (CRC-32) does not.

To detect the error of erroneous insertion or deletion of leading 0’s, some protocols prepend one

or more nonzero bits to the message. These don’t actually get transmitted, they are simply used

to initialize the key register (described below) used in the CRC calculation. A value of r 1-bits

seems to be universally used. The receiver initializes its register in the same way.

The problem of trailing 0’s is a little more difficult. There would be no problem if the receiver

operated by comparing the remainder based on just the message bits to the checksum received.

But, it seems to be simpler for the receiver to calculate the remainder for all bits received

(message and checksum) plus r appended 0-bits. The remainder should be 0. But, with a 0

remainder, if the message has trailing 0-bits inserted or deleted, the remainder will still be 0, so

this error goes undetected.

The usual solution to this problem is for the sender to complement the checksum before

appending it. Because this makes the remainder calculated by the receiver nonzero (usually), the
remainder will change if trailing 0’s are inserted or deleted. How then does the receiver

recognize an error-free transmission? Using the “mod” notation for remainder, we know that Mx^r mod G = R. Denoting the “complement” of the polynomial R by R̄ = R + (x^(r−1) + x^(r−2) + … + 1), the transmitted data corresponds to the polynomial Mx^r + R̄. Thus the checksum calculated by the receiver for an error-free transmission should be

(Mx^r + R̄)x^r mod G = (R + R̄)x^r mod G = (x^(r−1) + x^(r−2) + … + 1)x^r mod G.

This is a constant (for a given G). For CRC-32 this polynomial, called the residual or residue, is hex C704DD7B [Black].
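This constant can be observed with Python's built-in CRC-32 (an illustrative sketch; `binascii.crc32` implements the standard reflected, complemented algorithm, so the constant appears as 0x2144DF1C, the bit-reflected, complemented rendering of C704DD7B):

```python
import binascii
import struct

msg = b"an arbitrary message"
checksum = binascii.crc32(msg)             # CRC-32 complements its checksum
frame = msg + struct.pack("<I", checksum)  # checksum appended LSB byte first
# Every error-free frame reduces to the same constant:
assert binascii.crc32(frame) == 0x2144DF1C
# With the complemented checksum, a spurious trailing 0-byte is now detected:
assert binascii.crc32(frame + b"\x00") != 0x2144DF1C
```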

Hardware:

To develop a hardware circuit for computing the CRC checksum, we reduce the polynomial

division process to its essentials. The process employs a shift register, which we denote by
CRC. This is of length r (the degree of G) bits, not r + 1 bits as you might expect. When the

subtractions (exclusive or’s) are done, it is not necessary to represent the high-order bit,

because the high-order bits of G and the quantity it is being subtracted from are both 1. The

division process might be described informally as follows:

Initialize the CRC register to all 0-bits.
Get first/next message bit m.
If the high-order bit of CRC is 1,
    Shift CRC and m together left 1 position, and XOR the result with the low-order r bits of G.
Otherwise,
    Just shift CRC and m left 1 position.
If there are more message bits, go back to get the next one.

It might seem that the subtraction should be done first, and then the shift. It would be done

that way if the CRC register held the entire generator polynomial, which in bit form is r + 1

bits. Instead, the CRC register holds only the low-order r bits of G, so the shift is done first,

to align things properly.

Below is shown the contents of the CRC register for the generator G = x^3 + x + 1 and the message M = x^7 + x^6 + x^5 + x^2 + x. Expressed in binary, G = 1011 and M = 11100110.

000 Initial CRC contents. High-order bit is 0, so just shift in first message bit, giving:
001 High-order bit is 0, so just shift in second message bit, giving:
011 High-order bit is 0 again, so just shift in third message bit, giving:
111 High-order bit is 1, so shift in the fourth message bit and then XOR with 011, giving:
101 High-order bit is 1, so shift in the fifth message bit and then XOR with 011, giving:
001 High-order bit is 0, so just shift in sixth message bit, giving:
011 High-order bit is 0, so just shift in seventh message bit, giving:
111 High-order bit is 1, so shift in the eighth message bit and then XOR with 011, giving:
101 There are no more message bits, so this is the remainder.
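The register trace above can be reproduced with a short simulation (an illustrative Python sketch of the same bit-serial division):

```python
def crc_shift_register(message_bits, generator):
    """Bit-serial polynomial division: returns M mod G using an r-bit
    register, where r is the degree of G."""
    r = generator.bit_length() - 1
    low = generator & ((1 << r) - 1)     # low-order r bits of G (011 here)
    reg = 0
    for m in message_bits:
        high = (reg >> (r - 1)) & 1      # peek at the high-order bit of CRC
        reg = ((reg << 1) | m) & ((1 << r) - 1)  # shift CRC and m together
        if high:
            reg ^= low                   # XOR with the low-order bits of G
    return reg

# G = 1011 (x^3 + x + 1), M = 11100110: the remainder is 101, as traced.
assert crc_shift_register([1, 1, 1, 0, 0, 1, 1, 0], 0b1011) == 0b101
```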

These steps can be implemented with the (simplified) circuit shown in Figure ,

which is known as a feedback shift register.

FIGURE. Polynomial division circuit for G = x^3 + x + 1.

The three boxes in the figure represent the three bits of the CRC register. When a message

bit comes in, if the high-order bit (the x^2 box) is 0, simultaneously the message bit is shifted into the x^0 box, the bit in x^0 is shifted to x^1, the bit in x^1 is shifted to x^2, and the bit in x^2 is

discarded. If the high-order bit of the CRC register is 1, then a 1 is present at the lower input

of each of the two exclusive or gates. When a message bit comes in, the same shifting takes

place but the three bits that wind up in the CRC register have been exclusive or’ed with

binary 011. When all the message bits have been processed, the CRC holds M mod G.

If the circuit of the preceding figure were used for the CRC calculation, then after processing the

message, r (in this case 3) 0-bits would have to be fed in. Then the CRC register would have

the desired checksum, Mx^r mod G. But, there is a way to avoid this step with a simple

rearrangement of the circuit.


Instead of feeding the message in at the right end, feed it in at the left end, r steps away, as

shown in the next figure. This has the effect of premultiplying the input message M by x^r. But premultiplying and postmultiplying are the same for polynomials. Therefore, as each message

bit comes in, the CRC register contents are the remainder for the portion of the message

processed, as if that portion had r 0-bits appended.

FIGURE. CRC circuit for G = x^3 + x + 1.

The figure below shows the circuit for the CRC-32 polynomial, x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1.

FIGURE . CRC circuit for CRC-32

Software:

A basic software implementation of CRC-32 follows the same division process. The CRC-32 protocol

initializes the CRC register to all 1’s, transmits each byte least significant bit first, and
complements the checksum. We assume the message consists of an integral number of bytes.
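The original figure with the code is not reproduced here, but a minimal bit-at-a-time version might look as follows (a sketch using the reflected form 0xEDB88320 of the CRC-32 polynomial; table-driven versions are much faster):

```python
def crc32(data: bytes) -> int:
    """Basic (bit-at-a-time) CRC-32: all-ones initialization, bytes
    processed least significant bit first, complemented checksum."""
    reg = 0xFFFFFFFF
    for byte in data:
        reg ^= byte
        for _ in range(8):
            if reg & 1:
                reg = (reg >> 1) ^ 0xEDB88320  # reflected CRC-32 polynomial
            else:
                reg >>= 1
    return reg ^ 0xFFFFFFFF

# Standard check value for the nine ASCII digits:
assert crc32(b"123456789") == 0xCBF43926
```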

CHAPTER 3

PRELIMINARIES:

There are five popular PQC algorithm classes: code-based, hash-based, isogeny-based, lattice-

based, and multivariate-quadratic-equation-based cryptosystems [15]. Code-based

cryptography differs from others in that its security relies on the hardness of decoding in a

linear error-correcting code. Hash-based cryptography creates signature algorithms based on

the security of a selected cryptographic hash function. The security of isogeny-based

cryptography is based on the hardness of finding an isogeny between two given

supersingular elliptic curves. Lattice-based cryptography is capable of creating a public-key

cryptosystem based on lattices. Lastly, the security of multivariate-quadratic-equation-based

cryptography depends on the difficulty of solving a system of multivariate polynomials over a

finite field. Such cryptographic schemes use large field sizes to provide the needed security

levels.

Luov is a multivariate public key cryptosystem and an adaptation of the unbalanced oil and vinegar (UOV) signature scheme, with a restriction on the coefficients of the public key. The scheme uses two finite fields: F_2, the binary field of two elements, and F_2^m, its extension of degree m. The central map F: (F_2^m)^n → (F_2^m)^o is a quadratic map, where o and v satisfy n = o + v, the coefficients α_{i,j,k}, β_{i,k}, and γ_k are chosen from the base field F_2, and the components f_1, ..., f_o are of the form

f_k(x) = Σ_{i=1}^{v} Σ_{j=i}^{n} α_{i,j,k} x_i x_j + Σ_{i=1}^{n} β_{i,k} x_i + γ_k.

Fig. 1. Finite-field multiplier with the proposed error-detection schemes based on CRC.

These finite-field multiplications are very complex and require a large area footprint. Therefore, it

is a complex task to implement such architectures resilient to natural and malicious faults. The

aim of this work is to provide countermeasures against natural faults and fault injections for the

finite-field multipliers used in cryptosystems such as the Luov algorithm as a case study, noting

that the proposed error-detection schemes can be adapted to other applications and cryptographic

algorithms whose building blocks need finite-field multiplications. Readers who are interested in

knowing more details about the Luov cryptographic algorithm are encouraged to refer to [12].

PROPOSED FAULT-DETECTION ARCHITECTURES:

The multiplication of any two elements A and B of GF(2^m), following the approach in [16], can be presented as

A · B mod f(x) = Σ_{i=0}^{m−1} b_i · ((A · α^i) mod f(x)) = Σ_{i=0}^{m−1} b_i · X(i),

where the set of α^i’s is the polynomial basis of element A, the set of b_i’s is the coefficients of B, f(x) is the field polynomial, X(i) = α · X(i−1) mod f(x), and X(0) = A. To perform finite-field multiplication, three different modules are needed: sum, α, and pass-thru modules. The sum module adds two elements in GF(2^m) using m two-input XOR gates, the α module multiplies an element of GF(2^m) by α and then reduces the result modulo f(x), and lastly, the pass-thru module multiplies a GF(2^m) element by a GF(2) element. One finite-field multiplication uses a total of m − 1 sum modules, m − 1 α modules, and m pass-thru modules to get the output. Fault injection can occur in any of these modules, and formulations for parity signatures in GF(2^m)

injection can occur in any of these modules, and formulations for parity signatures in G F(2m )

are derived in [16]. Parity signatures provide an error flag (EF) on each module. The major

drawback of parity signatures is that their error coverage is approximately 50%, that is, if the

number of faults is even, the approach would not be able to detect the faults. This highly

predictable countermeasure can be circumvented by intelligent fault injection.
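To make the module decomposition above concrete, here is a minimal software sketch (our own illustrative Python, with hypothetical function names mirroring the sum, α, and pass-thru modules; it is not the paper's hardware design). It performs the bit-serial multiplication over GF(2^16) using the irreducible polynomial f(x) = x^16 + x^12 + x^3 + x + 1 that this brief later applies:

```python
# Integers serve as bit-vectors over GF(2); bit i holds coefficient a_i.
M = 16
F_POLY = (1 << 12) | (1 << 3) | (1 << 1) | 1   # f(x) - x^16, the reduction part

def alpha_module(x):
    """Multiply an element of GF(2^16) by alpha and reduce modulo f(x)."""
    x <<= 1
    if x & (1 << M):                 # an x^16 term appeared -> reduce
        x = (x ^ (1 << M)) ^ F_POLY
    return x

def sum_module(x, y):
    """Add two GF(2^16) elements (bitwise XOR, i.e., m two-input XOR gates)."""
    return x ^ y

def pass_thru_module(b, x):
    """Multiply a GF(2^16) element by a GF(2) element b (AND gating)."""
    return x if b else 0

def gf_mult(A, B):
    """A*B mod f(x) = sum_i b_i * X(i), with X(i) = alpha*X(i-1), X(0) = A."""
    X, acc = A, 0
    for i in range(M):
        acc = sum_module(acc, pass_thru_module((B >> i) & 1, X))
        X = alpha_module(X)
    return acc
```

Each loop iteration corresponds to one pass-thru/sum stage followed by one α stage, roughly mirroring the module counts quoted above. For instance, gf_mult(0x8000, 2) computes x^15 · x = x^16 mod f(x) = x^12 + x^3 + x + 1.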

In this work, our aim is to derive error-detection schemes that provide broader and higher error coverage than parity signatures and to explore the application of such schemes to the Luov algorithm. Thus, we derive and apply CRC signatures [17] to the finite-field multipliers used in the Luov algorithm. This is a step forward toward detecting natural and malicious intelligent faults, especially since, as discussed in this brief, both primitive and standardized CRCs with different fault-multiplicity coverage are considered. CRC was first proposed in 1961, and it is based on the theory of cyclic error-correcting codes. To implement CRC, a generator polynomial g(x) is required. The message serves as the dividend, the quotient is discarded, and the remainder forms the check value. In CRC, a fixed number of check bits are appended to the data, and these check bits are inspected when the output is received to detect any errors. The entire finite-field multiplier with our error-detection schemes is shown in Fig. 1, where ACRC and PCRC denote the actual CRC signatures and the predicted CRC signatures, respectively. In Fig. 1, only one EF is shown for clarity; however, for CRC-5, which is the case study proposed in this brief, five EFs are computed on each module. In Fig. 2, the α module is shown in more depth to clarify how the proposed CRC signatures work in each finite-field multiplier.

For the sum and pass-thru modules, the derivation follows the parity-signature approach described in [16]. For the sum module in CRC-1, the predicted parity p̂_X is the sum of the parity bits of the input elements A and B in GF(2^m), i.e., p̂_X = p_A + p_B. Furthermore, for the pass-thru module in CRC-1, p̂_X = b · p_A, where b is an element in GF(2). For any other CRC-n scheme, instead of summing all the bits, n bits are checked at a time in the sum and pass-thru modules. For the α module, we have

for which a set of derivations is needed to implement CRC-n into it. In Table I, the generator polynomials used to derive the CRC-5 signatures are shown. The generator polynomial g0(x) is one of the standards used for radio-frequency identification [18]. The other three generator polynomials, g1(x), g2(x), and g3(x), are primitive polynomials. The benefit of using a primitive polynomial as the generator is that the resulting code has full total block length, which means that all 1-bit errors within that block length have distinct remainders.

TABLE I
STANDARDIZED (STAND.) AND PRIMITIVE (PRIM.) GENERATOR POLYNOMIALS AND THEIR CORRESPONDING CRC SIGNATURES

Moreover, since the remainder is a linear function of the block, all 2-bit errors within

that block length can be identified. For the α module of Luov's finite-field multipliers, g0(x) = x^5 + x^3 + 1 is used as the standardized generator polynomial for CRC-5. To find its CRC signatures, this fixed polynomial is used as follows.

According to (1), we obtain A(x) · x = a15 · x^16 + a14 · x^15 + ··· + a1 · x^2 + a0 · x. Then, applying the irreducible polynomial f(x) = x^16 + x^12 + x^3 + x + 1, one obtains


To calculate the PCRC-5 for GF(2^16) in the α module (PCRC5_16), the generator polynomial is applied as

To calculate the ACRC-5 for GF(2^16) in the α module (ACRC5_16), we rename the coefficients of (3), a14 as γ15, ..., a0 as γ1:


The predicted output and the actual output are divided into five parity groups, as shown in (4) and (6), respectively. These parity groups are XORed with each other to determine whether any fault, for example, a bit flip, has occurred during the α-module operation. In total, each α module outputs five EFs. Fig. 2 shows the implementation of the α module with the proposed error-detection schemes. A(x) is the input, of the form a_{m-1}x^{m-1} + ··· + a1x + a0, and it goes to two different modules that run in parallel. In the α module, (1) takes place. The output from the α module is divided into five groups in the ACRC module, denoted as x_a^1–x_a^5 in Fig. 2. Meanwhile, A(x) is divided into five groups in the PCRC module, denoted as x_p^1–x_p^5. Once the two CRC modules are done, each group is XORed with its respective counterpart to produce five EFs, represented as EF1–EF5. As an example, to obtain EF1, x_p^1 (or a15 + a13 + a12 + a10 + a9 + a8 + a6 + a4 for g0(x)) is XORed with x_a^1 (or γ14 + γ13 + γ11 + γ10 + γ9 + γ7 + γ5 + γ0 for g0(x)), which are calculated in (4) and (6), respectively. For our case study, the outputs are divided into five groups since we use CRC-5; however, if any other CRC-n is used, there will be n EFs and the actual and predicted outputs will be divided into n groups.

Fig. 2. Proposed error-detection constructions for the α module.

TABLE II
OVERHEADS OF THE PROPOSED ERROR-DETECTION SCHEMES FOR THE FINITE-FIELD MULTIPLIERS USED IN THE LUOV ALGORITHM DURING THE POLYNOMIAL GENERATION ON XILINX FPGA FAMILY KINTEX ULTRASCALE+ FOR DEVICE XCKU5P-FFVD900-1-I

In Table I, the CRC signatures for the different primitive

polynomials are shown. We note that the choice of CRC can be tailored to the reliability requirements and the overhead that can be tolerated. In other words, for applications such as game consoles, in which performance is critical (and power consumption is not, because these are plugged in), one can increase the size of the CRC. However, for deeply embedded systems such as implantable and wearable medical devices, a smaller CRC is preferred.
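As a software illustration of how the EFs are formed (a sketch under our own hypothetical naming, not the closed-form signature equations (4) and (6) of this brief), the following Python computes the CRC-5 remainder with the standardized g0(x) = x^5 + x^3 + 1 on both the predicted and the actual α-module outputs and XORs them into a 5-bit error-flag vector. Note that here the predicted CRC is simply recomputed from the fault-free output, whereas the brief derives it directly from the bits of A(x):

```python
G0 = 0b101001          # g0(x) = x^5 + x^3 + 1 as a bit-vector

def crc5(value, nbits=16):
    """Remainder of a 16-bit polynomial modulo g0(x), by long division over GF(2)."""
    r = value
    for i in range(nbits - 1, 4, -1):   # clear bits 15..5, high to low
        if r & (1 << i):
            r ^= G0 << (i - 5)
    return r                            # 5-bit remainder

F_POLY = (1 << 12) | (1 << 3) | (1 << 1) | 1   # reduction part of f(x)

def alpha_module(x, fault=0):
    """Multiply by alpha mod f(x); `fault` optionally injects bit flips."""
    x <<= 1
    if x & (1 << 16):
        x = (x ^ (1 << 16)) ^ F_POLY
    return x ^ fault

def error_flags(A, fault=0):
    """XOR of predicted and actual CRC-5 -> EF1..EF5 as a 5-bit vector."""
    actual = alpha_module(A, fault)
    # Illustrative shortcut: take the CRC of the fault-free output as the
    # prediction instead of the paper's closed-form PCRC signatures.
    predicted = crc5(alpha_module(A))
    return predicted ^ crc5(actual)
```

Because the remainder is linear over GF(2), any single injected bit flip leaves a nonzero flag vector, matching the 1-bit and 2-bit detection guarantees discussed above.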

ERROR COVERAGE AND FPGA IMPLEMENTATIONS:

Finite-field multiplication is a costly operation and requires a large footprint. We implement the Luov polynomial generation to show that the proposed error-detection schemes provide high error coverage with acceptable overhead. This implementation produces a polynomial p(x) = a_{m-1}x^{m-1} + ··· + a1x + a0, which requires m − 1 finite-field multiplications and m − 1 XOR operations. As pointed out before, each finite-field multiplication uses three different modules: the α, sum, and pass-thru modules. A total of m − 1 α modules, m − 1 sum modules, and m pass-thru modules are needed to perform each finite-field multiplication. Moreover, a total of m − 1 sum modules are needed to perform an XOR operation. For each architecture, the error coverage is calculated as 100 · (1 − (1/2)^sign)%, where sign denotes the number of signatures.

Luov uses the finite field GF(2^16), i.e., m = 16. Implementing its polynomials in the form p(x) = a15x^15 + ··· + a1x + a0 requires 14 finite-field multiplications and 15 XOR operations. Since each finite-field multiplication uses m − 1 α modules, m − 1 sum modules, and m pass-thru modules, 14 × 15 α modules, 14 × 15 sum modules, and 14 × 16 pass-thru modules are needed. In total, 14 multiplications · (15 α + 15 sum + 16 pass-thru) + 15 XOR = 659 signatures are implemented. The error coverage percentage for the generation of Luov's polynomial using the finite field GF(2^16) is thus 100 · (1 − (1/2)^659)%. In Table II, we present the overhead of our error-detection architectures in terms of area (configurable logic blocks (CLBs)), delay, power consumption (at a frequency of 50 MHz), throughput, and efficiency for the generation of the polynomial p(x) = a_{m-1}x^{m-1} + ··· + a1x + a0.
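The signature count and coverage figure quoted above can be checked numerically (a simple sanity check, not part of the original implementation):

```python
# 14 finite-field multiplications, each using 15 alpha, 15 sum, and
# 16 pass-thru modules, plus 15 sum modules for the XOR operations;
# each module carries one CRC signature.
signatures = 14 * (15 + 15 + 16) + 15
assert signatures == 659

# Error coverage = 100 * (1 - (1/2)^sign) percent; with 659 signatures
# the undetected-fault probability (1/2)^659 is negligible, so the
# coverage evaluates to 100% within double precision.
coverage = 100 * (1 - 0.5 ** signatures)
```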

We utilize Xilinx FPGA family Kintex Ultrascale+ for device xcku5p-ffvd900-1-i, using Verilog

as the hardware design entry and Vivado as the tool for the implementations. As shown in Table II, when CRC signatures are applied to the original architecture, higher error coverage comes at the cost of higher overhead in terms of area, delay, and power, and of reduced throughput and efficiency. CLBs, which are the main resources for implementing general-purpose combinational and sequential circuits, are read from Vivado's place-utilization report to obtain the area. To determine the delay, we use the Timing Constraints Wizard in Vivado, setting a primary clock-period constraint of 20 ns, which equals a frequency of 50 MHz. We also report the total on-chip power, which is the power consumed internally within the FPGA, obtained by adding the device static power and the design power. Throughput is obtained by dividing the total number of output bits by the delay, and efficiency is obtained by dividing throughput by area. As seen in this table, acceptable overheads are obtained, with efficiency degradations of at most 19%. The error-detection architecture that uses the primitive generator polynomial g2(x) has the least area overhead, at 9.17%; however, the error-detection implementation using g0(x), the standardized generator polynomial for CRC-5, performs the fastest, with the least delay overhead, at 3.71%.
To the best of our knowledge, there has not been any prior work on this type of error-detection method for Luov's finite-field multipliers. For a qualitative comparison to verify that the incurred overheads are acceptable, let us go over some case studies. Subramanian et al. [19] presented a signature-based fault diagnosis for the cryptographic block ciphers LED and HIGHT, obtaining combined area and delay overheads of 21.9% and 31.9% for LED and HIGHT, respectively. Additionally, Mozaffari-Kermani et al. [6] presented the fault diagnosis of the Pomaranch cipher, obtaining a combined area and throughput overhead of 35.5%. The proposed

schemes in this brief have combined area and delay overheads of less than 32% (worst case

scenario). In [7], the worst case area overhead obtained by applying error-detection schemes of

NTT architectures is 24%.

The worst case area overhead of [8] and [9] is more than 33% with a performance degradation of

more than 14% when fault-detection architectures are applied to stateless hash-based signatures.

These and similar prior works on classical cryptography confirm that the proposed error-detection architectures incur overheads similar to those of other fault-detection schemes, achieving an acceptable overhead. These degradations are acceptable for providing error detection to original architectures that otherwise lack the capability to thwart natural or malicious faults.
CHAPTER 4

XILINX Software

Xilinx Tools is a suite of software tools used for the design of digital circuits implemented using Xilinx Field
Programmable Gate Array (FPGA) or Complex Programmable Logic Device (CPLD). The design
procedure consists of (a) design entry, (b) synthesis and implementation of the design, (c) functional simulation
and (d) testing and verification. Digital designs can be entered in various ways using the above CAD tools:
using a schematic entry tool, using a hardware description language (HDL) – Verilog or VHDL – or a combination of both. In this lab we will only use the design flow that involves the use of Verilog HDL.

The CAD tools enable you to design combinational and sequential circuits starting with Verilog HDL design
specifications. The steps of this design procedure are listed below:

1. Create Verilog design input file(s) using a template-driven editor.

2. Compile and implement the Verilog design file(s).
3. Create the test vectors and simulate the design (functional simulation) without using a PLD (FPGA or CPLD).
4. Assign input/output pins to implement the design on a target device.
5. Download the bitstream to an FPGA or CPLD device.
6. Test the design on the FPGA/CPLD device.

A Verilog input file in the Xilinx software environment consists of the following segments:

Header: module name, list of input and output ports.


Declarations: input and output ports, registers and wires.

Logic Descriptions: equations, state machines and logic functions.

End: endmodule

All your designs for this lab must be specified in the above Verilog input format. Note that the
state diagram segment does not exist for combinational logic designs.

2. Programmable Logic Device:FPGA

In this lab, digital designs will be implemented on the Basys2 board, which has a Xilinx Spartan-3E XC3S250E FPGA with a CP132 package. This FPGA part belongs to the Spartan family of FPGAs. These
devices come in a variety of packages. We will be using devices that are packaged in 132 pin package with the
following part number: XC3S250E-CP132. This FPGA is a device with about 50K gates. Detailed information
on this device is available at the Xilinx website.

3. Creating a NewProject
Xilinx Tools can be started by clicking on the Project Navigator Icon on the Windows desktop. This should
open up the Project Navigator window on your screen. This window shows (see Figure 1) the last accessed
project.
Figure 1: Xilinx Project Navigator window (snapshot from Xilinx ISE software)

3.1 Opening a project

Select File->New Project to create a new project. This will bring up a new project window (Figure 2) on the
desktop. Fill up the necessary entries as follows:
Figure 2: New Project Initiation window (snapshot from Xilinx ISE software)

Project Name: Write the name of your new project.

Project Location: The directory where you want to store the new project. (Note: DO NOT specify the project location as a folder on the Desktop or a folder in the Xilinx\bin directory. Your H: drive is the best place to put it. The project location path must NOT have any spaces in it, e.g., C:\Nivash\TA\new lab\sample exercises\o_gate is NOT to be used.)

Leave the top level module type as HDL.

Example: If the project name were “o_gate”, enter “o_gate” as the project name and then click “Next”.
Clicking on NEXT should bring up the following window:

Figure 3: Device and Design Flow of Project (snapshot from Xilinx ISE software)

For each of the properties given below, click on the ‘value’ area and select from the list of values that
appear.

o Device Family: Family of the FPGA/CPLD used. In this laboratory we will be using the Spartan-3E FPGAs.
o Device: The number of the actual device. For this lab you may enter XC3S250E (this can be found on the attached prototyping board).
o Package: The type of package with the number of pins. The Spartan FPGA used in this lab is packaged in the CP132 package.
o Speed Grade: The speed grade is “-4”.
o Synthesis Tool: XST [VHDL/Verilog].
o Simulator: The tool used to simulate and verify the functionality of the design. The ModelSim simulator is integrated in the Xilinx ISE; hence, choose “Modelsim-XE Verilog” as the simulator, or even the Xilinx ISE Simulator can be used.
o Then click on NEXT to save the entries.

All project files such as schematics, netlists, Verilog files, VHDL files, etc., will be stored in a subdirectory
with the project name. A project can only have one top level HDL source file (or schematic). Modules can be
added to the project to create a modular, hierarchical design (see Section 9).

In order to open an existing project in Xilinx Tools, select File->Open Project to show the list of projects on
the machine. Choose the project you want and click OK.

Clicking on NEXT on the above window brings up the following window:

Figure 4: Create New source window (snapshot from Xilinx ISE software)

If creating a new source file, click on NEW SOURCE.


3.2 Creating a Verilog HDL input file for a combinational logic design

In this lab we will enter a design using a structural or RTL description using the Verilog HDL. You can create a
Verilog HDL input file (.v file) using the HDL Editor available in the Xilinx ISE Tools (or any text editor).

In the previous window, click on the NEW SOURCE

A window pops up as shown in Figure 4. (Note: “Add to project” option is selected by


default. If you do not select it then you will have to add the new source file to the project
manually.)
Figure 5: Creating Verilog-HDL source file (snapshot from Xilinx ISE software)

Select Verilog Module and in the “File Name:” area, enter the name of the Verilog source file you are going to

create. Also make sure that the option Add to project is selected so that the source need not be added to the
project again. Then click on Next to accept the entries. This pops up the following window (Figure 5).
Figure 6: Define Verilog Source window (snapshot from Xilinx ISE software)

In the Port Name column, enter the names of all input and output pins and specify the Direction
accordingly. A Vector/Bus can be defined by entering appropriate bit numbers in the MSB/LSB
columns. Then click on Next> to get a window showing all the new source information (Figure
6). If any changes are to be made, just click on <Back to go back and make changes. If
everything is acceptable, click on Finish > Next > Next > Finish to continue.

Figure 7: New Project Information window (snapshot from Xilinx ISE software)


Once you click on Finish, the source file will be displayed in the sources window in the
Project Navigator (Figure 1).

If a source has to be removed, just right-click on the source file in the Sources in Project window in the Project Navigator and select Remove. Then select Project -> Delete Implementation Data from the Project Navigator menu bar to remove any related files.

3.3 Editing the Verilog source file

The source file will now be displayed in the Project Navigator window (Figure 8). The source file window can be used as a text editor to make any necessary changes to the source file. All the input/output pins will be displayed. Save your Verilog program periodically by selecting File->Save from the menu. You can also edit Verilog programs in any text editor and add them to the project directory using “Add Copy Source”.

Figure 8: Verilog Source code editor window in the Project Navigator (from Xilinx ISE
software)

Adding logic in the generated Verilog source code template:

A brief Verilog tutorial is available in Appendix-A; refer to it for the language syntax and the construction of logic equations.

The Verilog source code template generated shows the module name, the list of ports and also the
declarations (input/output) for each port. Combinational logic code can be added to the verilog code
after the declarations and before the endmodule line.

For example, the output z of an OR gate with inputs a and b can be described as:

assign z = a | b;

Remember that the names are case sensitive.

Other constructs for modeling the logic function:

A given logic function can be modeled in many ways in Verilog. Here is another example in which the logic function is implemented as a truth table using a case statement:

module or_gate(a, b, z);
  input a;
  input b;
  output z;

  reg z;

  always @(a or b)
  begin
    case ({a, b})
      2'b00: z = 1'b0;
      2'b01: z = 1'b1;
      2'b10: z = 1'b1;
      2'b11: z = 1'b1;
    endcase
  end
endmodule

Suppose we want to describe an OR gate. It can be done using the logic equation as shown in Figure 9a or
using the case statement (describing the truth table) as shown in Figure 9b. These are just two example
constructs to design a logic function. Verilog offers numerous such constructs to efficiently model designs. A
brief tutorial of Verilog is available in Appendix-A.
Figure 9: OR gate description using assign statement (snapshot from Xilinx ISE
software)
Figure 10: OR gate description using case statement (from Xilinx ISE software)

4. Synthesis and Implementation of the Design

The design has to be synthesized and implemented before it can be checked for correctness, by running
functional simulation or downloaded onto the prototyping board. With the top-level Verilog file opened (can be
done by double-clicking that file) in the HDL editor window in the right half of the Project Navigator, and the
view of the project being in the Module view , the implement design option can be seen in the process view.
Design entry utilities and Generate Programming File options can also be seen in the process view. The former can be used to include user constraints, if any, and the latter will be discussed later.
To synthesize the design, double click on the Synthesize Design option in the Processes
window.

To implement the design, double-click the Implement Design option in the Processes window. It will go through steps like Translate, Map, and Place & Route. If any of these steps could not be completed, or completed with errors, an X mark is placed in front of it; otherwise, a tick mark is placed after each of them to indicate successful completion. If everything is done successfully, a tick mark will be placed before the Implement Design option. If there are warnings, a warning mark appears in front of the option. One can look at the warnings or errors in the Console window present at the bottom of the Navigator window. Every time the design file is saved, all these marks disappear, asking for a fresh compilation.
Figure 11: Implementing the Design (snapshot from Xilinx ISE software)

The schematic diagram of the synthesized Verilog code can be viewed by double-clicking View RTL Schematic under the Synthesize-XST menu in the Process window. This is a handy way to debug the code if the output does not meet the specifications on the prototype board.

By double clicking it opens the top level module showing only input(s) and output(s) as shown below.

Figure 12: Top Level Hierarchy of the design


By double clicking the rectangle, it opens the realized internal logic as
shown below.

Figure 13: Realized logic by the Xilinx ISE for the Verilog code

5. Functional Simulation of Combinational Designs

5.1 Adding the test vectors

To check the functionality of a design, we have to apply test vectors and simulate the circuit. In order to apply test vectors, a test bench file is written. Essentially, it will supply all the inputs to the designed module and check its outputs. Example: For the 2-input OR gate, the steps to generate the test bench are as follows:

In the Sources window (top left corner) right click on the file that you want to generate
the test bench for and select ‘New Source’

Provide a name for the test bench in the file name text box and select ‘Verilog test fixture’ among the
file types in the list on the right side as shown in figure 11.

Figure 14: Adding test vectors to the design (snapshot from Xilinx ISE software)
Click on ‘Next’ to proceed. In the next window select the source file with which you want to associate the
test bench.

Figure 15: Associating a module to a testbench (snapshot from Xilinx ISE software)

Click on Next to proceed. In the next window, click on Finish. You will now be provided with a template for your test bench. If it does not open automatically, click the radio button next to Simulation.
You should now be able to view your test bench template. The code generated would be something like this:
module o_gate_tb_v;

  // Inputs
  reg a;
  reg b;

  // Outputs
  wire z;

  // Instantiate the Unit Under Test (UUT)
  o_gate uut (
    .a(a),
    .b(b),
    .z(z)
  );

  initial begin
    // Initialize Inputs
    a = 0;
    b = 0;

    // Wait 100 ns for global reset to finish
    #100;

    // Add stimulus here

  end

endmodule

The Xilinx tool detects the inputs and outputs of the module that you are going to test and assigns them initial values. In order to test the gate completely, we shall provide all the different input combinations. ‘#100’ is the time delay for which the inputs have to maintain their current values. After 100 units of time have elapsed, the next set of values can be assigned to the inputs.
Complete the test bench as shown below:
module o_gate_tb_v;

  // Inputs
  reg a;
  reg b;

  // Outputs
  wire z;

  // Instantiate the Unit Under Test (UUT)
  o_gate uut (
    .a(a),
    .b(b),
    .z(z)
  );

  initial begin
    // Initialize Inputs
    a = 0;
    b = 0;

    // Wait 100 ns for global reset to finish
    #100;

    a = 0;
    b = 1;

    // Wait 100 ns for global reset to finish
    #100;

    a = 1;
    b = 0;

    // Wait 100 ns for global reset to finish
    #100;

    a = 1;
    b = 1;

    // Wait 100 ns for global reset to finish
    #100;

  end

endmodule

Save your test bench file using the File menu.

5.2 Simulating and Viewing the Output Waveforms

Now, under the Processes window (making sure that the test bench file in the Sources window is selected), expand the ModelSim simulator tab by clicking on the + sign next to it. Double-click on Simulate Behavioral Model. You will probably receive a compiler error. This is nothing to worry about – answer “No” when asked if you wish to abort simulation. This should cause ModelSim to open. Wait for it to complete execution. If you wish to not receive the compiler error, right-click on Simulate Behavioral Model and select Process Properties. Mark the checkbox next to “Ignore Pre-Compiled Library Warning Check”.


Figure 16: Simulating the design (snapshot from Xilinx ISE software)

5.3 Saving the simulation results

To save the simulation results, go to the waveform window of the ModelSim simulator and click on File -> Print to Postscript -> give the desired filename and location.

Note that by default, the waveform is “zoomed in” to the nanosecond level. Use the zoom controls to display the entire waveform.

Alternatively, a normal print-screen option can be used on the waveform window and the image subsequently stored in Paint.

Figure 17: Behavioral Simulation output waveform (snapshot from ModelSim)
For taking printouts for the lab reports, convert the black background to white in Tools -> Edit Preferences.
Then click Wave Windows -> Wave Background attribute.

Figure 18: Changing Waveform Background in ModelSim


CONCLUSION:

In this work, we have derived error-detection schemes for the finite-field multipliers used in

postquantum cryptographic algorithms such as Luov, noting that the proposed error-detection

schemes can be adapted to other applications and cryptographic algorithms whose building

blocks need finite-field multiplications. The error-detection architectures proposed in this work

are based on CRC-5 signatures and we have performed software implementations for the sake of

verification. Additionally, we have explored and studied both primitive and standardized

generator polynomials for CRC-5, comparing the complexity for each of them. We have

embedded the proposed error-detection schemes into the original finite-field multipliers of the

Luov algorithm, obtaining high error coverage with acceptable overhead.
