Implementing BLAKE With AVX, AVX2, and XOP
Abstract. In 2013 Intel will release the AVX2 instructions, which introduce 256-bit single-
instruction multiple-data (SIMD) integer arithmetic. This will enable desktop and server
processors from this vendor to support 4-way SIMD computation of 64-bit add-rotate-xor
algorithms, as well as 8-way 32-bit SIMD computations. AVX2 also includes interesting in-
structions for cryptographic functions, like any-to-any permute and vectorized table-lookup.
In this paper, we explore the potential of AVX2 to speed-up the SHA-3 finalist BLAKE,
and present the first working assembly implementations of BLAKE-256 and BLAKE-512
with AVX2. We then investigate the potential of the recent AVX and XOP instructions to
accelerate BLAKE, and report new speed records on Sandy Bridge and Bulldozer microar-
chitectures (7.47 and 11.64 cycles per byte for BLAKE-256, 5.71 and 6.95 for BLAKE-512).
Keywords: hash functions, SHA-3, implementation, SIMD
1 Introduction
NIST plans to announce the winner of the SHA-3 competition in the second quarter of 2012. At the
time of writing, no significant security weakness is known for any of the five finalists, and all seem
to provide a comfortable security margin. Performance and ease of implementation will thus be
decisive in the choice of SHA-3. An important performance criterion is hashing speed on high-
end CPUs, as found in laptops, desktops, or servers. Arguably, systems hashing large amounts of
data—be it many short messages or fewer large ones—will only switch from SHA-2 to SHA-3 if
the latter is noticeably faster; fast hashing is for example needed for authenticating data in secure
cloud storage services.
This paper focuses on the hashing speed of the SHA-3 finalist BLAKE. We investigate how the
current and future instructions sets by the CPU vendors Intel (with AVX and AVX2) and AMD
(with AVX and XOP) can be exploited to create efficient implementations of BLAKE-256 and
BLAKE-512, the two main instances of BLAKE. Previous implementations of BLAKE exploited
the SSE instruction sets, which provide single-instruction multiple-data (SIMD) instructions over the
128-bit XMM registers. Thanks to BLAKE’s inherent internal parallelism, such instructions often
lead to a significant speed-up compared to non-SIMD code. The 2011 AVX and XOP instruction
sets and the future AVX2 extend SIMD capabilities to 256-bit registers, and thus provide new
avenues for optimized implementations of BLAKE.
We wrote C and assembly implementations of BLAKE-256 and BLAKE-512 for AVX2, whose
correctness was verified through Intel's emulator. As AVX2 is not available in today's CPUs, we made
best-effort heuristic performance estimates based on the information available. We also wrote
an assembly implementation of BLAKE-256 running at 7.47 cycles per byte on our Sandy Bridge
CPU, a new record on this platform. On the same machine, our implementation of BLAKE-512 for
AVX runs at 5.71 cycles per byte, another record. On AMD’s hardware, our XOP implementations
also beat previous ones, with respectively 11.64 and 6.95 cycles per byte for BLAKE-256 and
BLAKE-512.
Besides setting new speed records on recent CPUs, our work shows that BLAKE can benefit
from a number of the sophisticated instructions integrated in modern hardware architectures,
although these were often added for completely different purposes.
∗ A preliminary version of this work was presented at the Third SHA-3 Conference. Additions in this revised and augmented version include observations on tree hashing, an AVX implementation of BLAKE-512, improved AVX implementations of BLAKE-256, and XOP implementations and analyses.
The paper starts with a brief description of BLAKE (§2) followed by an overview of AVX,
AVX2, and XOP (§3). We then successively consider implementations of BLAKE-512 with AVX2
(§4), then of BLAKE-256 with AVX2 (§5) and AVX (§6), and finally of both with XOP (§7),
before concluding (§8).
In 2008 Intel announced the Advanced Vector Extensions (AVX), introducing 256-bit wide vector
instructions. These improve on the previous SSE extensions, which work on 128-bit XMM registers.
In addition to SIMD operations extending SSE’s capabilities from 128- to 256-bit width, AVX
brings to implementers non-destructive operations with a 3- and 4-operand syntax (including for
legacy 128-bit SIMD extensions), as well as relaxed memory alignment constraints, compared to
SSE.
AVX operates on 256-bit SIMD registers called YMM, divided into two 128-bit lanes, such that the low lanes (lower 128 bits) are aliased to the respective 128-bit XMM registers. Most instructions work “in-lane”, that is, each source element is combined only with elements of the same lane.
Some more expensive “cross-lane” instructions do exist, most notably shuffles.
AVX2 is an extension of AVX announced in 2011 that promotes most of the 128-bit SIMD
integer instructions to 256-bit capabilities. AVX2 supports 4-way 64-bit integer addition, XOR, and
vector shifts, thus enabling SIMD implementations of BLAKE-512. AVX2 also includes instructions
to perform any-to-any permutation of words over a 256-bit register and vectorized table lookups to load elements from memory into YMM registers (see the instructions vperm* and vpgatherd* in §§3.2). The use of AVX2 to optimize SHA-2 implementations was recently proposed in [5].
AVX is supported by Intel processors based on the Sandy Bridge microarchitecture (and future ones); the first such processors, Core i7 and Core i5 models, were commercialized in January 2011. AVX2 will be introduced in Intel's Haswell 22 nm architecture, to be released in 2013.
We focus on a small subset of the AVX2 instructions, giving for each a brief explanation of what it does. For better understanding, the most sophisticated instructions are also given an equivalent description in C syntax using only general-purpose registers. Table 1 summarizes the main instructions along with their C intrinsic functions.
ARX SIMD. To implement add-rotate-xor (ARX) algorithms with AVX2, the following in-
structions are available: vpaddd for 8-way 32-bit integer addition, vpaddq for 4-way 64-bit integer
addition, vpxor for 256-bit wide XOR, and vpsllvd, vpsrlvd, vpsllvq, and vpsrlvq for variable
left and right shifts of 32- and 64-bit words (that is, each word within a YMM register may be shifted by a different amount).
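To illustrate how these instructions combine, the 64-bit rotation used throughout BLAKE-512 can be built from two shifts and an XOR. The sketch below (the helper name rotr64 is ours) uses the variable-shift forms listed above; with a fixed rotation count, the immediate-shift forms vpsllq/vpsrlq serve equally well:
#include <immintrin.h>
/* rotate each of the four 64-bit lanes of x right by c bits */
static inline __m256i rotr64(__m256i x, int c) {
    return _mm256_xor_si256(_mm256_srlv_epi64(x, _mm256_set1_epi64x(c)),
                            _mm256_sllv_epi64(x, _mm256_set1_epi64x(64 - c)));
}
/* example: row4 = rotr64(_mm256_xor_si256(row4, row1), 25); */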
Cross-lane permutes. AVX2 provides instructions to realize any permutation of 32- and 64-bit words within a YMM register: vpermd shuffles the 32-bit words of a full YMM register across lanes, taking two YMM registers as inputs, one as the source and the other as the permutation indices:
uint32_t a[8], b[8], c[8];
for (i = 0; i < 8; ++i) c[i] = a[b[i]];
vpermq is similar to vpermd but shuffles 64-bit words and takes an immediate operand, rather than a register, as the permutation:
uint64_t a[4], c[4]; int b;
for (i = 0; i < 4; ++i) c[i] = a[(b >> (2*i)) % 4];
Vectorized table look-ups. The “gather” instructions are among the most remarkable of the
AVX2 extensions: vpgatherdd performs eight table lookups in parallel, as in the code below:
uint8_t *b; uint32_t scale, idx[8], c[8];
for (i = 0; i < 8; ++i) c[i] = *(uint32_t *)(b + idx[i]*scale);
vpblendd (_mm256_blend_epi32), similar to the SSE4.1 pblendw instruction, permits the selection of words from two different sources according to an immediate mask, placing them in a third destination register:
uint32_t a[8], b[8], c[8]; int sel;
for (i = 0; i < 8; ++i)
    if ((sel >> i) & 1) c[i] = b[i];
    else                c[i] = a[i];
vextracti128 (_mm256_extracti128_si256) and vinserti128 (_mm256_inserti128_si256) extract an XMM register from, or insert one into, the lower or upper half of a YMM register. vextracti128 is equivalent to:
uint32_t a[8], c[4]; int imm;
for (i = 0; i < 4; ++i) c[i] = a[i + 4*imm];
Processors carrying the AVX2 instruction set are only expected to be available in 2013. There is
currently no hard data on the performance of the instructions described above; one can, however,
make some educated guesses, using Sandy Bridge as a starting point.
The vpaddd, vpaddq, vpsllvd, vpsllvq, vpsrlvd, vpsrlvq, and vpxor instructions can be expected to perform on par with Sandy Bridge's vpxor instruction, which requires a single
cycle to complete. The vpermd and vpermq instructions cross register lanes; on Sandy Bridge, this
adds one extra cycle of latency. We can estimate that this penalty gets no worse on Haswell, and
that vpermd and vpermq require two cycles to complete. The gather instructions remain the most elusive; it is unknown whether they are implemented as a large number of micro-ops or use dedicated circuitry. Assuming only one cache line is accessed, one can expect at least four cycles of latency
for the memory load, plus two for the extra logic.
We speculate that instruction parallelism in AVX2-compatible processors will resemble existing
SSE2 parallelism available in current processors. Current Sandy Bridge processors are capable of
executing three AVX instructions per cycle, namely one floating-point multiply, one floating-point
add, and one logical operation. We expect future processors to be able to sustain such throughput
with integer instructions, as it happens today with XMM registers.
3.4 XOP
In 2007, AMD announced its SSE5 set of new instructions. These featured 3-operand instruc-
tions, more powerful permutations, native integer rotations, and fused-multiply-add capabilities.
After the announcement of AVX, however, SSE5 was shelved in favor of AVX plus XOP, FMA4,
and CVT16. The XOP instruction set [4] extends AVX with new integer multiply-and-accumulate
(vpmac*), rotation (vprot*), shift (vpsha*, vpshl*), permutation (vpperm), and conditional move
(vpcmov) instructions working on XMM registers. These instructions have a latency of at least two cycles. XOP instructions are integrated in AMD's Bulldozer microarchitecture, which first appeared in the FX-series 32 nm processors released in October 2011.
Below we present the most useful XOP instructions for BLAKE:
Rotate instructions. Whereas SSE and AVX require rotations to be implemented with a combination of two shifts and an XOR, XOP introduces rotate instructions with either fixed or variable counts: the 3-operand vprotd (intrinsics _mm_roti_epi32 and _mm_rot_epi32) sets its destination XMM register to the four 32-bit words of a source register rotated by possibly different counts (positive for left rotation, negative for right); vprotq (intrinsics _mm_roti_epi64 and _mm_rot_epi64) is the equivalent instruction for 2-way 64-bit vectorized rotation.
Conditional move. The vpcmov instruction takes four operands: the destination register has each of its bits set to the corresponding bit of either the first or the second source operand, depending on a third selector operand; this is similar to C's ternary operator “?:”. vpcmov accepts XMM or YMM registers as operands; for the latter, the instruction is equivalent to
uint64_t a[4], b[4], c[4], d[4];
for (i = 0; i < 4; ++i) d[i] = (a[i] & c[i]) | (b[i] & ~c[i]);
Byte permutation. With the vpperm instruction, XOP offers more than a simple byte permu-
tation: given two source XMM registers (that is, 256 bits) and a 16-byte selector, vpperm fills the
destination XMM register with bytes that are either a byte chosen from the two source registers,
or a constant 00 or ff. Furthermore, bit-wise logical operations can be applied to source bytes
(invert, reverse, etc.).
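To make the selection mechanism concrete, the following behavioral description (in the same style as the C equivalents of §§3.2) models only the byte-selection part of vpperm; the operations encoded in the top bits of each selector byte (inversion, bit reversal, forcing 00 or ff) are ignored here, and the full definition is given in [4]:
uint8_t a[16], b[16], c[16], sel[16]; int s;
for (i = 0; i < 16; ++i) {
    s = sel[i] & 0x1f;                  /* low 5 bits select one of 32 source bytes */
    c[i] = (s < 16) ? a[s] : b[s - 16]; /* 0-15 from the first source, 16-31 from the second */
}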
4 Implementing BLAKE-512 with AVX2
This section first presents a basic SIMD implementation of BLAKE-512, using AVX2's 4-way 64-bit SIMD instructions in exactly the same way as BLAKE-256 uses SSE2's 4-way 32-bit instructions. We then discuss optimizations exploiting instructions proper to AVX2. For ease of
understanding, we present C code using intrinsics for AVX2 instructions; excerpts of our assembly
implementation can be found in Appendix B.1, and the full assembly will be publicly available.
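As an example of this basic approach, the diagonalization between the column and diagonal steps can be done with vpermq; the sketch below (register names row1 to row4 follow the convention used later in this paper, and the exact formulation in our code may differ) rotates the second, third, and fourth state rows by one, two, and three 64-bit words so that the diagonals line up as columns:
row2 = _mm256_permute4x64_epi64(row2, _MM_SHUFFLE(0, 3, 2, 1));
row3 = _mm256_permute4x64_epi64(row3, _MM_SHUFFLE(1, 0, 3, 2));
row4 = _mm256_permute4x64_epi64(row4, _MM_SHUFFLE(2, 1, 0, 3));
The inverse permutations restore the rows after the diagonal step (cf. the UNDIAG macro in Appendix B.1).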
We used Intel’s Software Development Emulator3 to test the correctness of the AVX2 im-
plementations, and the latest trunk build of the Yasm assembler4 (as the latest release did not
support AVX2) to compile them.
A simple optimization consists in implementing the rotation by 32 bits with the vpshufd instruction, which performs an “in-lane” shuffle of 32-bit words. That is, the line
row4 = _mm256_xor_si256(_mm256_srli_epi64(row4, 32),
                        _mm256_slli_epi64(row4, 32));
can be replaced by
row4 = _mm256_shuffle_epi32(row4, _MM_SHUFFLE(2, 3, 0, 1));
The rotations by 16 bits can likewise be implemented using vpshufb, in the same fashion as the ssse3 implementation (see Appendix A.3):
row4 = _mm256_shuffle_epi8(row4, r16);
where r16 is the alias of a YMM register holding, for each byte position of each lane, the index of the source byte of row4 to select.
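The exact constant used in our code is not reproduced here; one consistent choice (our construction) is the following, where byte i of each 64-bit word of the result is byte (i+2) mod 8 of the source word, that is, every 64-bit word is rotated right by 16 bits:
const __m256i r16 = _mm256_setr_epi8( 2,  3,  4,  5,  6,  7,  0,  1, 10, 11, 12, 13, 14, 15,  8,  9,
                                      2,  3,  4,  5,  6,  7,  0,  1, 10, 11, 12, 13, 14, 15,  8,  9);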
Based on the estimates in §§3.3, we expect to save at least one cycle per rotation by 16 or 32 bits; with four such rotations per round, this amounts to four cycles per round and 64 cycles per compression function, that is, at least 0.5 cycle per byte.
Based on the estimates in §§3.3, one can attempt to predict the speed of an implementation of BLAKE-512 with AVX2. For simplicity, we assume that no message caching is used (as we will see later, this seems to be a reasonable assumption). A rough performance estimate may rely on the following assumptions:
This section shows how BLAKE-256 can benefit from AVX2. Unlike BLAKE-512, BLAKE-256 is not naturally adaptable to 256-bit vectors, as at most four Gi functions can be computed independently within a round. Nevertheless, it is possible to take advantage of AVX2 to speed up BLAKE-256 (the message-caching trick also applies, but we discuss it in §6, as it is not specific to AVX2).
Excerpts of our assembly implementation appear in Appendix B.2.
The first way to improve message loads is to use the vpgatherdd instruction from the AVX2 instruction set. To perform the full 16-word message permutation required in each round, only four instructions are needed:
__m128i m0 = _mm_i32gather_epi32(m, sigma[r][0], 4);
__m128i m1 = _mm_i32gather_epi32(m, sigma[r][1], 4);
__m128i m2 = _mm_i32gather_epi32(m, sigma[r][2], 4);
__m128i m3 = _mm_i32gather_epi32(m, sigma[r][3], 4);
This can be further improved by using only two YMM registers to store the permuted message:
__m256i m01 = _mm256_i32gather_epi32(m, sigma[r][0], 4);
__m256i m23 = _mm256_i32gather_epi32(m, sigma[r][1], 4);
The individual 128-bit blocks of message are then accessible through the vextracti128 instruction.
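For instance, after the two gathers above, the four 128-bit message blocks could be recovered as follows (variable names ours):
__m128i m0 = _mm256_extracti128_si256(m01, 0);
__m128i m1 = _mm256_extracti128_si256(m01, 1);
__m128i m2 = _mm256_extracti128_si256(m23, 0);
__m128i m3 = _mm256_extracti128_si256(m23, 1);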
One must also consider the possibility that vpgatherdd will not have acceptable performance,
perhaps due to specific processor design idiosyncrasies; AVX2 can still help us, via the vpermd
and vpblendd instructions:
tmp0 = _mm256_permutevar8x32_epi32(m01, sigma00);
tmp1 = _mm256_permutevar8x32_epi32(m23, sigma01);
tmp2 = _mm256_permutevar8x32_epi32(m01, sigma10);
tmp3 = _mm256_permutevar8x32_epi32(m23, sigma11);
m01  = _mm256_blend_epi32(tmp0, tmp1, mask0);
m23  = _mm256_blend_epi32(tmp2, tmp3, mask1);
In the above code, we first move the elements of the first YMM register to their positions in the permuted order, then do the same with the elements of the second register; a single blend instruction then combines them into the first half of the permuted message. The process is repeated for the second half. Once again, individual 128-bit blocks are available via vextracti128.
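As a worked example (constants derived by us from σ1, the permutation of the second round, whose first eight entries are 14, 10, 4, 8, 13, 6, 1, 12), the first half of the permuted message could be built as follows; index positions not taken from a given source are don't-care and set to 0, and the blend mask has a bit set wherever the word comes from m23:
sigma00 = _mm256_setr_epi32(0, 0, 4, 0, 0, 6, 1, 0); /* indices into m01 (words 0-7) */
sigma01 = _mm256_setr_epi32(6, 2, 0, 0, 5, 0, 0, 4); /* indices into m23, minus 8    */
tmp0 = _mm256_permutevar8x32_epi32(m01, sigma00);
tmp1 = _mm256_permutevar8x32_epi32(m23, sigma01);
m01  = _mm256_blend_epi32(tmp0, tmp1, 0x9b);         /* positions 0,1,3,4,7 from m23 */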
We ran benchmarks on an Intel Core i7 2630QM (2 GHz, Sandy Bridge), reusing tools from SU-
PERCOP [6] for measuring speed on long messages, and compiling our code with Intel’s icc com-
piler, with options -fast -xHost -funroll-loops -unroll-aggressive -inline-forceinline
-no-inline-max-total-size (that is, maximal code inlining and loop unrolling). BLAKE-256
was measured at 7.47 cycles per byte, and BLAKE-512 at 5.71 cycles per byte. Surprisingly, message caching neither improved nor degraded performance.
This section shows the main XOP-specific optimizations for BLAKE-256 and BLAKE-512, with a
focus on the former. Although only a limited number of XOP instructions can be exploited, they
provide a significant speed-up compared to implementations using AVX but not XOP. The latest
version of our xop implementations can be found in SUPERCOP.
The first optimization is straightforward: it consists in performing rotations with the dedicated vprot* instructions. In BLAKE-256, the rotations by 16 and 8 bits, previously implemented with SSSE3's pshufb, can also be replaced with vprotd. The first half of G can thus be coded as
row1 = _mm_add_epi32(_mm_add_epi32(row1, buf), row2);
row4 = _mm_xor_si128(row4, row1);
row4 = _mm_roti_epi32(row4, -16);
row3 = _mm_add_epi32(row3, row4);
row2 = _mm_xor_si128(row2, row3);
row2 = _mm_roti_epi32(row2, -12);
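The second half of G follows the same pattern, with rotations by 8 and 7 and with buf now holding the second message/constant pair (a sketch in the same style; the actual code may differ in register allocation):
row1 = _mm_add_epi32(_mm_add_epi32(row1, buf), row2);
row4 = _mm_xor_si128(row4, row1);
row4 = _mm_roti_epi32(row4, -8);
row3 = _mm_add_epi32(row3, row4);
row2 = _mm_xor_si128(row2, row3);
row2 = _mm_roti_epi32(row2, -7);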
XOP can be used to implement BLAKE's message permutation without memory look-ups, that is, by reorganizing the ordered words m0, ..., m15 within registers, similarly to the approach in §§5.1. The key operation is vpperm's byte selection, which allows us to copy up to four arbitrary message words out of eight into an XMM register. For example, in the first column step of the first round, an XMM register needs to be loaded with m0, m2, m4, m6; with XMM registers m0 and m1 respectively holding words m0 to m3 and m4 to m7, this can be done as
selector = _mm_set_epi32(0x1b1a1918, 0x13121110, 0x0b0a0908, 0x03020100);
s0 = _mm_perm_epi8(m0, m1, selector);
A complete definition of the vpperm selector can be found in [4, p. 235]. Note that, unlike message words, constant words can be loaded directly, to be XORed with the message:
s1 = _mm_set_epi32(0xec4e6c89, 0x299f31d0, 0x03707344, 0x85a308d3);
buf = _mm_xor_si128(s0, s1);
The same procedure can be followed when the four message words to be loaded span three or four message registers (where the i-th register, i = 0, 1, 2, 3, holds m4i to m4i+3). An example of the latter case occurs in the first message load of the fourth round, where we need the following code:
s0 = _mm_perm_epi8(m0, m1, _mm_set_epi32(SEL(0), SEL(0), SEL(3), SEL(7)));
s0 = _mm_perm_epi8(s0, m2, _mm_set_epi32(SEL(7), SEL(2), SEL(1), SEL(0)));
s0 = _mm_perm_epi8(s0, m3, _mm_set_epi32(SEL(3), SEL(5), SEL(1), SEL(0)));
s1 = _mm_set_epi32(0x3f84d5b5, 0xc0ac29b7, 0x85a308d3, 0x38d01377);
buf = _mm_xor_si128(s0, s1);
In total, 78 calls to vpperm are necessary to implement the first 10 permutations (that is, when message caching is used), and 94 if the first rounds' loads are recomputed (see Table 2 for the detailed distribution). These numbers may be reduced with new implementation techniques eliminating redundancies, for example by reusing previously loaded message registers to avoid loads requiring three vpperm calls.
Note that one could use vpinsrd instead of vpperm for single-word insertions. This does not improve speed, however: due to the decoupling of the integer and floating-point units, vpinsrd has a latency of 12 cycles on Bulldozer, as opposed to just 2 for vpperm.
Table 2. Number of message loads requiring either one, two, or three calls to vpperm, as a function
of the permutation.
A similar approach can be used to implement BLAKE-512 message loads; however, it requires about twice as many calls to vpperm, since that instruction does not support 256-bit YMM registers.
7.3 Results
BLAKE-512. On long messages, our xop implementation runs at 6.95 cycles per byte, against
9.09 for the fastest non-XOP implementation (our AVX code). The code was compiled with options
-mxop -fomit-frame-pointer -O3 -funroll-loops.
As with BLAKE-256, one can attempt to lower-bound the speed of BLAKE-512 on Bulldozer. A straightforward bound assuming that any two 2-way vector operations execute in parallel is likely to be too loose, as some operations share a single execution unit (this approach gives a bound of 6 cycles per byte). Assuming one cycle of penalty due to the rotations being bound to a single execution unit, this leaves us at 13 cycles per Gi, or 6.5 cycles per byte (note that the addition with message words can be computed in parallel with a previous operation).
8 Conclusion
We first considered the future AVX2 256-bit vector extensions, and identified the most useful
instructions to implement add-rotate-xor algorithms, and BLAKE in particular. We wrote assem-
bly implementations of BLAKE-256 and BLAKE-512 exploiting AVX2’s unique features, such
as SIMD memory look-up. Although we could test the correctness of our implementations using
Intel’s emulator, actual benchmarks will have to wait until 2013 for processors implementing the
Haswell microarchitecture. We observed that AVX2 may boost the performance of BLAKE-256
in tree and multistream mode, thanks to its ability to compute two instances with a single vector
state.
We then reviewed the applicability of the recent AVX and XOP advanced vector instructions, as found in Intel's and AMD's latest CPUs respectively, to implementations of BLAKE-256. While AVX provides a minor speed-up compared to SSE4 implementations, the powerful XOP instructions lead to a considerable improvement of more than 3 and 2 cycles per byte for BLAKE-256 and BLAKE-512, respectively. This is mainly due to the dedicated rotation instructions, and to the vpperm instruction, which allows permuted message blocks to be reconstructed very efficiently. Although message loads take up a considerable number of instructions, our proposed technique of message caching does not seem to improve (nor degrade) speed, be it on Intel's or AMD's hardware.
Acknowledgments
We thank NAGRA (Kudelski Group) for supporting the purchase of a computer equipped with a
Bulldozer processor.
References
1. Aumasson, J.P., Henzen, L., Meier, W., Phan, R.C.W.: SHA-3 proposal BLAKE. Submission to the
SHA-3 competition (2008) http://www.131002.net/blake/.
2. Intel: Advanced vector extensions programming reference. http://software.intel.com/en-us/avx/
(March 2008) Document no. 319433-002.
3. Intel: Advanced vector extensions programming reference. http://software.intel.com/en-us/avx/
(June 2011) Document no. 319433-012a.
4. AMD: AMD64 Architecture Programmer's Manual Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions. http://developer.amd.com/documentation/guides/Pages/default.aspx#manuals (November 2009)
5. Gueron, S., Krasnov, V.: Parallelizing message schedules to accelerate the computations of hash func-
tions. Cryptology ePrint Archive, Report 2012/067 (2012) http://eprint.iacr.org/.
6. Bernstein, D.J., Lange, T.: eBACS: ECRYPT Benchmarking of Cryptographic Systems. http://
bench.cr.yp.to/ Accessed May 16, 2012.
7. Intel: C++ Intrinsics Reference (2007) Document no. 312482-002US.
8. Coke, J., Baliga, H., Cooray, N., Gamsaragan, E., Smith, P., Yoon, K., Abel, J., Valles, A.: Im-
provements in the Intel Core 2 Penryn Processor Family Architecture and Microarchitecture. Intel
Technology Journal 12(3) (October 2008) 179–193
• The ssse3 implementation of BLAKE-256 uses the pshufb instruction (intrinsic _mm_shuffle_epi8) to perform rotations by 16 and 8 bits, as well as the initial conversion of the message from little-endian to big-endian byte order, since both can be expressed as byte shuffles (in the sse2 implementations rotations were implemented as two shifts and an XOR); a sketch of such shuffle constants is given after this list. This brings a significant speed-up on Core 2 processors based on the Penryn microarchitecture, which introduced a dedicated shuffle unit to complete pshufb within one micro-operation, against four on the first Core 2 chips [8].
• The sse41 implementation of BLAKE-256 uses the pblendw instruction (_mm_blend_epi16) in combination with SSE2's pshufd, pslldq, and others to load m and u words according to the σ permutations without using table lookups.
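As announced above, here is one way to define the pshufb control words for the two rotations (our constants; within each 32-bit word, byte i of the result is byte (i+2) mod 4, respectively (i+1) mod 4, of the source, that is, right rotations by 16 and 8 bits):
const __m128i r16 = _mm_setr_epi8(2, 3, 0, 1, 6, 7, 4, 5, 10, 11, 8, 9, 14, 15, 12, 13);
const __m128i r8  = _mm_setr_epi8(1, 2, 3, 0, 5, 6, 7, 4, 9, 10, 11, 8, 13, 14, 15, 12);
row4 = _mm_shuffle_epi8(row4, r16);   /* row4 >>> 16 */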
In general, the ssse3 implementation is faster than sse2, and sse41 is faster than both. For example, the 20110708 measurements of SUPERCOP on sandy0 (a machine equipped with a Sandy Bridge Core i7, without AVX activated) report sse41 as the fastest implementation of BLAKE-256, with the ssse3 and sse2 implementations being respectively 4 % and 24 % slower.
Recently, SUPERCOP included the vect128 and vect128-mmxhack implementations of BLAKE-
256 by Leurent, which slightly outperform the sse41 implementation. The main peculiarity of Leurent's code is its implementation of the σ permutations: vect128 “byte-slices” each message word across four XMM registers and uses the pshufb instruction to reorder them according to σ; vect128-mmxhack instead uses MMX and general-purpose registers to store and unpack the message words in the correct order into XMM registers.
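The byte-slicing idea can be sketched as follows (this is an illustration of the technique, not Leurent's actual code, and it assumes sigma[r] is stored as 16 bytes): if slice[j] holds byte j of each of the sixteen message words, a single pshufb per slice reorders all words according to σr:
__m128i slice[4];
__m128i idx = _mm_loadu_si128((const __m128i *) sigma[r]);
for (j = 0; j < 4; ++j)
    slice[j] = _mm_shuffle_epi8(slice[j], idx);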
This section presents excerpts from our assembly implementations. The full implementations will
be made publicly available.
Message loading:
%macro MSGLOAD 1
%ifdef CACHING
%if %1 < 6
        ; store the message words loaded for this round so that
        ; later rounds can reuse them (message caching)
        vmovdqa [rsp + 128 + %1*128 + 00], ymm4
        vmovdqa [rsp + 128 + %1*128 + 32], ymm5
        vmovdqa [rsp + 128 + %1*128 + 64], ymm6
        vmovdqa [rsp + 128 + %1*128 + 96], ymm7
%endif
%endif
%endmacro
%macro UNDIAG 0
        ; un-rotate the second, third, and fourth state rows after
        ; the diagonal step
        vpermq ymm1, ymm1, 0x93
        vpermq ymm2, ymm2, 0x4e
        vpermq ymm3, ymm3, 0x39
%endmacro

%macro ROUND 1
        MSGLOAD %1
        G ymm4, ymm5
        DIAG
        G ymm6, ymm7
        UNDIAG
%endmacro
AVX2 allows the use of vpgatherdd for direct load of permuted message words from memory:
%macro MSGLOAD 1
%ifdef CACHING
%if %1 < 4
        ; caching path: store the gathered words for reuse in later rounds
        vmovdqa [rsp + 16*4 + %1*64 + 00], ymm4
        vmovdqa [rsp + 16*4 + %1*64 + 32], ymm6
%endif
%endif
%endmacro
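For reference, a gather-based (non-caching) load might look as follows; this is a hypothetical sketch with illustrative register assignments, not an excerpt from our code (rsi points to the message block, ymm8 holds the eight 32-bit indices from sigma[r], and the mask register consumed by the gather is first set to all ones):
        vpcmpeqd ymm9, ymm9, ymm9
        vpgatherdd ymm4, [rsi + ymm8*4], ymm9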