How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
December 31, 2022

In this post, I'll iteratively optimize an implementation of matrix multiplication written in CUDA. My goal is not to build a cuBLAS replacement, but to deeply understand the most important performance characteristics of the GPUs that are used for modern deep learning. This includes coalescing global memory accesses, shared memory caching and occupancy optimizations, among others.

Matrix multiplication on GPUs may currently be the most important algorithm that exists, considering it makes up almost all the FLOPs during the training and inference of large deep-learning models. So how much work is it to write a performant CUDA SGEMM from scratch? I'll start with a naive kernel and step-by-step apply optimizations until we get within 80% of the performance of cuBLAS (NVIDIA's official matrix library):

Kernel                 GFLOPs   Performance relative to cuBLAS (fp32)
1: Naive                  309     1.3%
2: GMEM Coalescing       2006     8.2%
3: SMEM Blocktiling      2984    12.2%
4: 1D Warptiling         8626    35.3%
5: 2D Warptiling        16134    66.0%
6: Vectorize loads      20358    83.2%
0: cuBLAS               24441   100.0%


Kernel 1: Naive Implementation


In the CUDA programming model, computation is ordered in a three-level hierarchy. Each invocation of a CUDA kernel creates a new grid, which consists of multiple blocks. Each block consists of up to 1024 individual threads. Threads that are in the same block have access to the same shared memory region (SMEM).

The number of threads in a block can be configured using a variable normally called blockDim, which is a vector consisting of three ints. The entries of that vector specify the sizes of blockDim.x, blockDim.y and blockDim.z, as visualized below:

Similarly, the number of blocks in a grid is configurable using the gridDim variable. When we launch a new kernel from the host, it creates a single grid, containing the blocks and threads as specified. It's important to keep in mind that the thread hierarchy we just talked about mostly concerns program correctness. For program performance, as we'll see later, it's not a good idea to treat all threads in the same block as equals.

For our first kernel, we’ll use the grid, block and thread
hierarchy to assign each thread a unique entry in the result
matrix C. Then that thread will compute the dot product of
the corresponding row of A and column of B, and write the
result to C. Due to each location of C being written to by only
one thread, we have to do no synchronization. We’ll launch the
kernel like so:

// create as many blocks as necessary to map all of C
dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32), 1);
// 32 * 32 = 1024 threads per block
dim3 blockDim(32, 32, 1);
// launch the asynchronous execution of the kernel on the device.
// The function call returns immediately on the host.
sgemm_naive<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);

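CEIL_DIV isn't defined in the excerpt above. A minimal sketch of what it presumably looks like, assuming the usual integer ceiling-division helper:

// Integer ceiling division: how many size-N tiles are needed to cover M.
// (Assumed definition; the post's actual helper isn't shown in this excerpt.)
#define CEIL_DIV(M, N) (((M) + (N) - 1) / (N))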

CUDA code is written from a single-thread perspective. In the code of the kernel, we access the blockIdx and threadIdx built-in variables. These will return different values based on the thread that's accessing them.

__global__ void sgemm_naive(int M, int N, int K, float alpha, const float *A,
                            const float *B, float beta, float *C) {
  // compute position in C that this thread is responsible for
  const uint x = blockIdx.x * blockDim.x + threadIdx.x;
  const uint y = blockIdx.y * blockDim.y + threadIdx.y;

  // `if` condition is necessary for when M or N aren't multiples of 32
  if (x < M && y < N) {
    float tmp = 0.0;
    for (int i = 0; i < K; ++i) {
      tmp += A[x * K + i] * B[i * N + y];
    }
    // C = α*(A@B)+β*C
    C[x * N + y] = alpha * tmp + beta * C[x * N + y];
  }
}

To visualize this simple kernel:

This kernel takes about 0.5s to process three 4092² fp32 matrices on my A6000 GPU. Let's do some non-implementation-specific calculations:

Lower Bounding the Fastest Possible Runtime

For a matrix multiplication of two 4092² matrices, followed by an addition of a 4092² matrix (to make the GEMM):

1. Total FLOPs: 2*4092³ + 4092² = 137 GFLOPs

2. Total data to read (minimum!): 3 * 4092² * 4B = 201MB

3. Total data to store: 4092² * 4B = 67MB

So 268MB is the absolute minimum of memory that any implementation would have to transfer from/to global GPU memory, assuming it has a big enough cache. Let's calculate some upper bounds on kernel performance. The GPU is advertised with 30TFLOPs/s of fp32 compute throughput and 768GB/s of global memory bandwidth. If we achieved those numbers, we'd need 4.5ms for the calculation and 0.34ms for the memory transfers. So in our napkin math, the calculation takes ~10x more time than the memory accesses. This means our final optimized kernel will be compute-bound, as long as we end up having to transfer <10x the absolute minimum memory volume of 268MB.
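As a sanity check on this napkin math, here's a small standalone host-side sketch (my own illustration, not code from the post) that recomputes both bounds from the numbers above:

// Napkin-math lower bounds for a 4092^2 SGEMM on an (advertised) A6000.
#include <cstdio>

int main() {
  const double n = 4092.0;
  const double flops = 2 * n * n * n + n * n;  // multiply-adds plus the C update
  const double bytes = 4 * n * n * 4.0;        // read A, B, C and write C, 4B floats
  const double peak_flops = 30e12;             // ~30 TFLOP/s fp32
  const double peak_bw = 768e9;                // ~768 GB/s global memory bandwidth
  printf("compute-bound time: %.2f ms\n", 1e3 * flops / peak_flops);
  printf("memory-bound time:  %.2f ms\n", 1e3 * bytes / peak_bw);
  return 0;
}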

Now that we've calculated some lower bounds for our fp32 GEMM calculation, let's get back to the kernel at hand, to figure out why it's so much slower than it could be.

Memory Access Pattern of the Naive Kernel

In our kernel, two threads in the same block with ThreadIds (0, 0) and (0, 1) will load the same column of B but different rows of A. If we assume the worst case of zero caching, then each thread has to load 2*4092+1 floats from global memory. As we have 4092² threads total, this would result in 548GB of memory traffic.

Below is a visualization of the memory access pattern of our naive kernel, taking two threads A (red) and B (green) as an example:

So to recap, when I run this kernel on an A6000 GPU it achieves ~300GFLOPs when multiplying two 4092x4092 float32 matrices. Pretty bad, considering that the A6000 is advertised as being able to achieve almost 30 TFLOPs.


So how can we start to make this faster? One way is to optimize the memory access pattern of our kernel such that global memory accesses can be coalesced (=combined) into fewer accesses.

Kernel 2: Global Memory Coalescing


Before we get into global memory coalescing, we need to learn about the concept of a warp. For execution, the threads of a block are grouped into so-called warps, consisting of 32 threads. A warp is then assigned to a warp scheduler, which is the physical core that executes the instructions. There are four warp schedulers per multiprocessor. The grouping into warps happens based on a consecutive threadId. If we set the blockDim to be multi-dimensional, then the threadId is calculated like so:

threadId = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z)

Then, threads with neighbouring threadId become part of the same warp. Below I tried to illustrate this, using a smaller "warpsize" of 8 threads (real warps always contain 32 threads):
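In code, which warp a thread lands in follows directly from this linearized id; a small sketch (my illustration, not from the post):

// Linearized thread id within the block, as defined above.
const uint threadId =
    threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
// Threads sharing a warpId execute together; warpSize is the built-in (=32).
const uint warpId = threadId / warpSize;
const uint laneId = threadId % warpSize;  // position within the warp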

The concept of a warp is relevant for this second kernel, as sequential memory accesses by threads that are part of the same warp can be grouped and executed as one. This is referred to as global memory coalescing. It's the most important thing to keep in mind when optimizing a kernel's GMEM memory accesses toward achieving the peak bandwidth.

Below is an example, where consecutive memory accesses by threads in the same warp are grouped, allowing each warp to execute 8 memory accesses using only 2 32B loads:


In reality, the GPU supports 32B, 64B and 128B memory accesses. So, if each thread is loading a 32bit float from global memory, the warp scheduler (probably the MIO) can coalesce this 32*4B=128B load into a single transaction. This is only possible if the floats loaded are consecutive in memory, and if access is aligned. If they aren't, or if access cannot be coalesced for some other reason, then the GPU will execute as many 32B loads as necessary to fetch all floats, leading to a lot of wasted bandwidth. Profiling our naive kernel, we can observe the detrimental effect of non-coalesced access as we achieve only 15GB/s of GMEM throughput.

Looking back at the previous kernel, we assigned threads their entry of C like so:

const uint x = blockIdx.x * blockDim.x + threadIdx.x;
const uint y = blockIdx.y * blockDim.y + threadIdx.y;

Hence, threads of the same warp (those with consecutive threadIdx.x) were loading the rows of A non-consecutively from memory. The naive kernel's pattern of accessing the memory of A looked more like so:

To enable coalescing, we can change how we assign positions of the result matrix C to threads. This change in the global memory access pattern is illustrated below:


To implement this, we only need to change the first two lines:

const int x = blockIdx.x * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
const int y = blockIdx.y * BLOCKSIZE + (threadIdx.x % BLOCKSIZE);

if (x < M && y < N) {
  float tmp = 0.0;
  for (int i = 0; i < K; ++i) {
    tmp += A[x * K + i] * B[i * N + y];
  }
  C[x * N + y] = alpha * tmp + beta * C[x * N + y];
}

And we call it like so:

// gridDim stays the same
dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
// make blockDim 1-dimensional, but don't change the number of threads
dim3 blockDim(32 * 32);
sgemm_coalescing<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);

Global memory coalescing increases memory throughput from 15GB/s to 110GB/s. Performance reaches 2000 GFLOPS, a big improvement compared to the 300 GFLOPS of the first, naive kernel. For the next kernel, we'll use the GPU's fast on-chip memory, called shared memory, to cache data that will be re-used.

Kernel 3: Shared Memory Cache-Blocking

Next to the large global memory, a GPU has a much smaller region of memory that is physically located on the chip, called shared memory (SMEM). Physically, there's one shared memory per SM. Logically, this shared memory is partitioned among the blocks. This means that a thread can communicate with the other threads in its block via the shared memory chunk. On my A6000 GPU, each block has access to a maximum of 48KB of shared memory.

As the shared memory is located on-chip, it has a much lower latency and higher bandwidth than global memory. I couldn't find good benchmark results for the Ampere architecture, but for Volta (released in 2017) the benchmarks performed in this paper report 750GiB/s of global memory bandwidth, and 12,080GiB/s of shared memory bandwidth.

So for this next kernel, we'll load a chunk of A and a chunk of B from global memory into shared memory. Then we'll perform as much work as possible on the two chunks, with each thread still being assigned one entry of C. We'll move the chunks along the columns of A and the rows of B, performing partial sums on C until the result is computed.

This is illustrated below:

The important parts of the code are below, with variable names corresponding to the plot above:


// advance pointers to the starting position for this block
A += cRow * CHUNKSIZE * K;                    // row=cRow, col=0
B += cCol * CHUNKSIZE;                        // row=0, col=cCol
// the pointer to the output location stays fixed during the whole kernel
C += cRow * CHUNKSIZE * N + cCol * CHUNKSIZE; // row=cRow, col=cCol

// allocate buffer for the current chunk of A and chunk of B in fast SMEM.
// The amount of SMEM required by this kernel is fixed at compile time;
// this will become important for occupancy calculations later.
__shared__ float As[CHUNKSIZE * CHUNKSIZE];
__shared__ float Bs[CHUNKSIZE * CHUNKSIZE];

float tmp = 0.0;
// the outer loop advances A along the columns and B along
// the rows until we have fully calculated the result in C
for (int outer = 0; outer < numBlockSteps; ++outer) {
  // Have each thread load one of the elements in A & B from
  // global memory into shared memory.
  // Make the innerRow (=threadIdx.x) the consecutive index
  // to allow for global memory access coalescing
  As[innerCol * CHUNKSIZE + innerRow] = A[innerCol * K + innerRow];
  Bs[innerCol * CHUNKSIZE + innerRow] = B[innerCol * N + innerRow];

  // block threads in this block until the cache is fully populated
  __syncthreads();

  // advance pointers onto the next chunk
  A += CHUNKSIZE;
  B += CHUNKSIZE * N;

  // execute the dotproduct on the currently cached chunk
  for (int inner = 0; inner < CHUNKSIZE; ++inner) {
    tmp +=
        As[innerCol * CHUNKSIZE + inner] * Bs[inner * CHUNKSIZE + innerRow];
  }
  // need to sync again at the end, to avoid faster threads
  // fetching the next chunk into the cache before slower threads are done
  __syncthreads();
}
C[innerCol * N + innerRow] = alpha * tmp + beta * C[innerCol * N + innerRow];
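The excerpt doesn't show how cRow, cCol, innerRow, innerCol and numBlockSteps are set up. A plausible reconstruction, consistent with the indexing above (the exact names and derivation are my assumption):

// Hypothetical index setup for the cache-blocking kernel above.
const uint cRow = blockIdx.x;                      // which CHUNKSIZE-row block of C we compute
const uint cCol = blockIdx.y;                      // which CHUNKSIZE-column block of C we compute
const uint innerRow = threadIdx.x % CHUNKSIZE;     // consecutive between neighbouring threads => coalesced
const uint innerCol = threadIdx.x / CHUNKSIZE;
const int numBlockSteps = CEIL_DIV(K, CHUNKSIZE);  // number of chunks to slide across K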

This kernel achieves ~2200 GFLOPS, a 50% improvement over the previous version. We're still far away from hitting the ~30 TFLOPs that the GPU can provide. This is obvious from the roofline plot below:


At a CHUNKSIZE of 32, this uses 2*32*32*4B=8KB of shared memory space. My A6000 GPU has a maximum of 48KB of shared memory space available for each block, so we're far away from hitting that limit. This is not necessarily a problem, as there are downsides to increasing per-block shared-memory usage. Each multiprocessor (SM) has a maximum of 100KB of SMEM available. This means that if we'd modify our kernel to use the full 48KB of SMEM available, each SM could only keep two blocks loaded at the same time. In CUDA parlance, increasing per-block SMEM utilization can decrease occupancy. Occupancy is defined as the ratio between the number of active warps per SM and the maximum possible number of active warps per SM.

High occupancy is useful because it allows us to hide the high latency of our operations, by having a bigger pool of issue-able instructions available. There are three main limits to keeping more active blocks loaded on an SM: register count, warp count and SMEM capacity. Let's do an example calculation for our current kernel.

Occupancy Calculation for Kernel 3

Here are the relevant hardware stats for my GPU, obtained from the cudaGetDeviceProperties API (Multiprocessors are the SMs we talked about earlier):

Metric                                      Value
Name                                        NVIDIA RTX A6000
Compute Capability                          8.6
max threads per block                       1024
max threads per multiprocessor              1536
threads per warp                            32
warp allocation granularity                 4
max regs per block                          65536
max regs per multiprocessor                 65536
reg allocation unit size                    256
reg allocation granularity                  warp
total global mem                            48685 MB
max shared mem per block                    48 KB
CUDA runtime shared mem overhead per block  1024 B
shared mem per multiprocessor               102400 B
multiprocessor count                        84
max warps per multiprocessor                48

And here are the resource demands for our kernel:

Registers per Thread   37
SMEM per Block         8192 B
Threads per Block      1024

Work is scheduled onto the SMs on a block granularity. Each SM will load more blocks, as long as it has enough resources to accommodate them. Calculation:

- Shared memory: 8192B/Block + 1024B/Block for CUDA runtime usage = 9216B/Block. (102400B per SM) / (9216B per Block) = 11.11 ⇒ 11 Blocks upper limit.
- Threads: 1024 Threads per Block, max 1536 threads per SM ⇒ Upper limit 1 block.
- Registers: 37 regs per thread * 32 threads per warp = 1184 regs per warp. Register allocation granularity is 256 regs on a warp level, hence rounding up to 1280 regs per warp. We have (1024 threads / 32) = 32 warps per block, hence 1280 regs per warp * 32 warps per block = 40960 regs per block. Max 65536 regs per SM ⇒ upper limit 1 block.

So this kernel is limited by the number of threads per block, and the number of registers per thread. We cannot load more than one block per SM, giving us a final occupancy of 32 active warps / 48 max active warps = 66%.
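The same arithmetic can be reproduced programmatically from the device properties. Below is a small host-side sketch (my illustration, not from the post), with the kernel's resource demands hard-coded from the table above:

// Reproduce the per-SM block limits and occupancy for kernel 3.
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  const int threadsPerBlock = 1024;
  const int regsPerThread = 37;          // from the compiler/profiler
  const int smemPerBlock = 8192 + 1024;  // our __shared__ arrays + CUDA runtime overhead

  // Registers: allocation is rounded up to 256 regs at warp granularity.
  const int warpsPerBlock = threadsPerBlock / prop.warpSize;                    // 32
  const int regsPerWarp = ((regsPerThread * prop.warpSize + 255) / 256) * 256;  // 1280
  const int regLimit = prop.regsPerMultiprocessor / (regsPerWarp * warpsPerBlock);
  const int threadLimit = prop.maxThreadsPerMultiProcessor / threadsPerBlock;   // 1536/1024 = 1
  const int smemLimit = (int)(prop.sharedMemPerMultiprocessor / smemPerBlock);  // 11

  const int blocksPerSM = std::min({regLimit, threadLimit, smemLimit});
  const int occupancyPct = 100 * blocksPerSM * warpsPerBlock * prop.warpSize /
                           prop.maxThreadsPerMultiProcessor;                    // 66
  printf("blocks per SM: %d, occupancy: %d%%\n", blocksPerSM, occupancyPct);
  return 0;
}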

A 66% occupancy is not too bad, so this doesn't explain why our kernel runs so slowly. Looking at the profiler gives us some hints. First, if we look at the mix of executed instructions, most of them are memory loads:


Our inner loop looks like this in PTX (Godbolt link):

ld.shared.f32 %f91, [%r8+3456];
ld.shared.f32 %f92, [%r7+108];
fma.rn.f32 %f93, %f92, %f91, %f90;

That's not good, given that a memory load is bound to have a higher latency than a simple FMA, and given that we know our kernel should be compute bound. We see this effect when looking at the profiler's sampling of warp states. This quantifies how many cycles were spent in each state per executed instruction:

The meaning of the states is documented in the Kernel Profiling Guide. For Stall MIO Throttle it reads:

Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, as well as shared memory instructions.

We're not using special math instructions, nor dynamic branches, so it's clear that we're stalling waiting for our SMEM accesses to return. So how do we make our kernel issue fewer SMEM instructions? One way is to have each thread compute more than one output element, which allows us to perform more of the work in registers and rely less on SMEM.

Kernel 4: 1D Blocktiling for Calculating Multiple Results per Thread

So this next kernel works like our last kernel, but adds a new inner loop for calculating multiple C entries per thread. We now use a SMEM cache size of BM*BK + BN*BK = 64*8 + 64*8 = 1024 floats, for a total of 4KB per block. Below is a visualization. I have highlighted two of the threads and the values they access in the inner loop in orange and red.


All of the important changes for this kernel happen in the inner loop. The loading from GMEM to SMEM stays largely the same as before. Let's have a look:

// allocate thread-local cache for results in registerfile
float threadResults[TM] = {0.0};

// outer loop over block tiles
for (uint bkIdx = 0; bkIdx < K; bkIdx += BK) {
  // populate the SMEM caches (same as before)
  As[innerRowA * BK + innerColA] = A[innerRowA * K + innerColA];
  Bs[innerRowB * BN + innerColB] = B[innerRowB * N + innerColB];
  __syncthreads();

  // advance blocktile for outer loop
  A += BK;
  B += BK * N;

  // calculate per-thread results
  for (uint dotIdx = 0; dotIdx < BK; ++dotIdx) {
    // we make the dotproduct loop the outside loop, which facilitates
    // reuse of the Bs entry, which we can cache in a tmp variable
    float Btmp = Bs[dotIdx * BN + threadCol];
    for (uint resIdx = 0; resIdx < TM; ++resIdx) {
      threadResults[resIdx] +=
          As[(threadRow * TM + resIdx) * BK + dotIdx] * Btmp;
    }
  }
  __syncthreads();
}
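The listing stops before the results are written back to C. A sketch of what that epilogue plausibly looks like, assuming threadRow and threadCol index the thread's TM-row strip within the blocktile (the names are assumptions):

// write out the TM results this thread accumulated (assumed epilogue)
for (uint resIdx = 0; resIdx < TM; ++resIdx) {
  C[(threadRow * TM + resIdx) * N + threadCol] =
      alpha * threadResults[resIdx] +
      beta * C[(threadRow * TM + resIdx) * N + threadCol];
}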

This kernel achieves ~8600 GFLOPs, 2.2x faster than our previous kernel. Let's calculate how many memory accesses each thread performed in our previous kernel, where each thread calculated one result:

- GMEM: K/32 iterations of outer loop * 2 loads
- SMEM: K/32 iterations of outer loop * BLOCKSIZE (=32) * 2 loads
- Memory accesses per result: K/16 GMEM, K*2 SMEM

And for our new kernel, where each thread calculates eight
results:

- GMEM: K/8 iterations of outer loop * 2 loads
- SMEM: K/8 iterations of outer loop * BK(=8) * (1 + TM(=8))
- Memory accesses per result: K/32 GMEM, K*9/8 SMEM

As expected, we now spend much fewer cycles per instruction stalling due to memory pressure:

Sidenote on Compiler Optimizations

Above we explicitly cached the entry of B into Btmp and reordered the two inner loops for efficiency. If we don't do that, then the code looks like this:

for (uint resIdx = 0; resIdx < TM; ++resIdx) {
  for (uint dotIdx = 0; dotIdx < BK; ++dotIdx) {
    threadResults[resIdx] +=
        As[(threadRow * TM + resIdx) * BK + dotIdx] * Bs[dotIdx * BN + threadCol];
  }
}


Interestingly, this has no adverse effect on performance. This is surprising since our inner two loops now incur BK (=8) * TM (=8) * 2 = 128 SMEM accesses, instead of the previous 72. Looking at the assembly (Godbolt link) has the answer:

// first inner-most loop
ld.shared.f32 %f45, [%r9];
ld.shared.f32 %f46, [%r8];
fma.rn.f32 %f47, %f46, %f45, %f212;
ld.shared.f32 %f48, [%r9+256];
ld.shared.f32 %f49, [%r8+4];
fma.rn.f32 %f50, %f49, %f48, %f47;
ld.shared.f32 %f51, [%r9+512];
ld.shared.f32 %f52, [%r8+8];
fma.rn.f32 %f53, %f52, %f51, %f50;
ld.shared.f32 %f54, [%r9+768];
ld.shared.f32 %f55, [%r8+12];
fma.rn.f32 %f56, %f55, %f54, %f53;
ld.shared.f32 %f57, [%r9+1024];
ld.shared.f32 %f58, [%r8+16];
fma.rn.f32 %f59, %f58, %f57, %f56;
ld.shared.f32 %f60, [%r9+1280];
ld.shared.f32 %f61, [%r8+20];
fma.rn.f32 %f62, %f61, %f60, %f59;
ld.shared.f32 %f63, [%r9+1536];
ld.shared.f32 %f64, [%r8+24];
fma.rn.f32 %f65, %f64, %f63, %f62;
ld.shared.f32 %f66, [%r9+1792];
ld.shared.f32 %f67, [%r8+28];
fma.rn.f32 %f212, %f67, %f66, %f65;
// second inner-most loop
ld.shared.f32 %f68, [%r8+32];
fma.rn.f32 %f69, %f68, %f45, %f211;
ld.shared.f32 %f70, [%r8+36];
fma.rn.f32 %f71, %f70, %f48, %f69;
ld.shared.f32 %f72, [%r8+40];
fma.rn.f32 %f73, %f72, %f51, %f71;
ld.shared.f32 %f74, [%r8+44];
fma.rn.f32 %f75, %f74, %f54, %f73;
ld.shared.f32 %f76, [%r8+48];
fma.rn.f32 %f77, %f76, %f57, %f75;
ld.shared.f32 %f78, [%r8+52];
fma.rn.f32 %f79, %f78, %f60, %f77;
ld.shared.f32 %f80, [%r8+56];
fma.rn.f32 %f81, %f80, %f63, %f79;
ld.shared.f32 %f82, [%r8+60];
fma.rn.f32 %f211, %f82, %f66, %f81;
// ... continues like this for inner-loops 3-8 ...

The compiler unrolls both loops and then eliminates the repeated SMEM loads of the Bs entries, so we end up with the same amount of SMEM accesses as our optimized CUDA code.

When the PTX is compiled to SASS, the SMEM loads from As are vectorized:

https://siboehm.com/articles/22/CUDA-MMM 15/23
1/5/23, 3:36 PM How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog

LDS R26, [R35.X4+0x800] // a 32b load from Bs
LDS.128 R8, [R2]        // a 128b load from As
LDS.128 R12, [R2+0x20]
LDS R24, [R35.X4+0x900]
LDS.128 R20, [R2+0x60]
LDS R36, [R35.X4+0xb00]
LDS.128 R16, [R2+0x40]
LDS.128 R4, [R2+0x80]
LDS R38, [R35.X4+0xd00]

Areas of Improvement: Arithmetic Intensity

Our current kernel still suffers from the same stalling-for-memory problem as kernel 3, just to a lesser extent. So we'll just apply the same optimization again: computing even more results per thread. The main reason this makes our kernel run faster is that it increases arithmetic intensity. Below I tried to make it more immediately obvious why calculating more results per thread raises arithmetic intensity:

In conclusion, all our kernels perform the same number of FLOPs, but we can reduce the number of GMEM accesses by calculating more results per thread. We'll continue optimizing arithmetic intensity for as long as we're still memory bound.


Kernel 5: Increasing Arithmetic Intensity via 2D Blocktiling

The basic idea for kernel 5 will be to compute a grid of 8*8 elements of C per thread. The first stage of the kernel is for all threads to work together to populate the SMEM cache. We'll have each thread load multiple elements. This code looks like so:

for (uint loadOffset = 0; loadOffset < BM; loadOffset += strideA) {
  As[(innerRowA + loadOffset) * BK + innerColA] =
      A[(innerRowA + loadOffset) * K + innerColA];
}
for (uint loadOffset = 0; loadOffset < BK; loadOffset += strideB) {
  Bs[(innerRowB + loadOffset) * BN + innerColB] =
      B[(innerRowB + loadOffset) * N + innerColB];
}
__syncthreads();

Now that the SMEM cache is populated, we have each thread multiply its relevant SMEM entries and accumulate the result into local registers. Below I illustrated the (unchanged) outer loop along the input matrices, and the three inner loops for the dot product and the TN and TM dimension:


The interesting parts of the code look like this:

// allocate thread-local cache for results in registerfile
float threadResults[TM * TN] = {0.0};
// register caches for As and Bs
float regM[TM] = {0.0};
float regN[TN] = {0.0};

// outer-most loop over block tiles
for (uint bkIdx = 0; bkIdx < K; bkIdx += BK) {
  // populate the SMEM caches
  for (uint loadOffset = 0; loadOffset < BM; loadOffset += strideA) {
    As[(innerRowA + loadOffset) * BK + innerColA] =
        A[(innerRowA + loadOffset) * K + innerColA];
  }
  for (uint loadOffset = 0; loadOffset < BK; loadOffset += strideB) {
    Bs[(innerRowB + loadOffset) * BN + innerColB] =
        B[(innerRowB + loadOffset) * N + innerColB];
  }
  __syncthreads();

  // advance blocktile
  A += BK;     // move BK columns to right
  B += BK * N; // move BK rows down

  // calculate per-thread results
  for (uint dotIdx = 0; dotIdx < BK; ++dotIdx) {
    // load relevant As & Bs entries into registers
    for (uint i = 0; i < TM; ++i) {
      regM[i] = As[(threadRow * TM + i) * BK + dotIdx];
    }
    for (uint i = 0; i < TN; ++i) {
      regN[i] = Bs[dotIdx * BN + threadCol * TN + i];
    }
    // perform outer product on register cache, accumulate
    // into threadResults
    for (uint resIdxM = 0; resIdxM < TM; ++resIdxM) {
      for (uint resIdxN = 0; resIdxN < TN; ++resIdxN) {
        threadResults[resIdxM * TN + resIdxN] +=
            regM[resIdxM] * regN[resIdxN];
      }
    }
  }
  __syncthreads();
}
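The excerpt also leaves out how the thread and load indices, and the strides strideA/strideB, are derived. A plausible setup, assuming for example BM = BN = 128, BK = 8 and TM = TN = 8 (i.e. 256 threads per block); the exact values and names are assumptions:

// Hypothetical index/stride setup for the 2D blocktiling kernel above.
const uint threadCol = threadIdx.x % (BN / TN);  // which TN-wide column strip of the blocktile
const uint threadRow = threadIdx.x / (BN / TN);  // which TM-tall row strip of the blocktile
// indices for the strided GMEM->SMEM loads
const uint innerRowA = threadIdx.x / BK;
const uint innerColA = threadIdx.x % BK;
const uint strideA = blockDim.x / BK;            // rows of As covered per load iteration
const uint innerRowB = threadIdx.x / BN;
const uint innerColB = threadIdx.x % BN;
const uint strideB = blockDim.x / BN;            // rows of Bs covered per load iteration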

In the inner loop, we can reduce the number of SMEM accesses by making dotIdx the outer loop, and explicitly loading the values we need for the two inner loops into registers. Below is a drawing of the dotIdx loop across time, to visualize which SMEM entries get loaded into thread-local registers at each step:

Resulting performance: 16TFLOPs, another 2x improvement. Let's repeat the memory access calculation. We're now calculating TM*TN = 8*8 = 64 results per thread.

- GMEM: K/8 (outer loop iters) * 2 (A+B) * 1024/256 (sizeSMEM/numThreads) loads
- SMEM: K/8 (outer loop iters) * 8 (dotIdx) * 2 (A+B) * 8 loads
- Memory accesses per result: K/64 GMEM, K/4 SMEM

Slowly, performance is reaching acceptable levels; however, warp stalls due to memory pipeline congestion are still too frequent. For kernel 6 we'll take two measures to try to improve that: transposing As to enable auto-vectorization of SMEM loads, and promising the compiler alignment on the GMEM accesses.

Kernel 6: Vectorize SMEM and GMEM Accesses

The first optimization that I already hinted at earlier is to transpose As. This will allow us to load from As using vectorized SMEM loads (LDS.128 in SASS). Below is the same visualization of the three inner loops as for kernel 5, but now with As transposed in memory:

Looking at the assembly, we see that loading As into the registers, which used to be a 32b LDS load, is now also a 128b LDS.128 load, just like it had already been for Bs. This gives us a 500GFLOPs speedup, or ~3%.

Next, we'll vectorize all loads and stores from/to GMEM using vector datatypes, namely float4.

The code looks like this:

float4 tmp =
    reinterpret_cast<float4 *>(&A[innerRowA * K + innerColA * 4])[0];
// transpose A during the GMEM to SMEM transfer
As[(innerColA * 4 + 0) * BM + innerRowA] = tmp.x;
As[(innerColA * 4 + 1) * BM + innerRowA] = tmp.y;
As[(innerColA * 4 + 2) * BM + innerRowA] = tmp.z;
As[(innerColA * 4 + 3) * BM + innerRowA] = tmp.w;

reinterpret_cast<float4 *>(&Bs[innerRowB * BN + innerColB * 4])[0] =
    reinterpret_cast<float4 *>(&B[innerRowB * N + innerColB * 4])[0];
__syncthreads();

This leads to the 32b GMEM load and store instructions (LDG.E and STG.E) being replaced with their 128b counterparts (LDG.E.128 and STG.E.128). Initially, I was confused as to why running this:

reinterpret_cast<float4 *>(&Bs[innerRowB * BN + innerColB * 4])[0] =
    reinterpret_cast<float4 *>(&B[innerRowB * N + innerColB * 4])[0];

would be any faster than just manually unrolling the access (or
using pragma unroll):

Bs[innerRowB * BN + innerColB * 4 + 0] = B[innerRowB * N + innerColB * 4 + 0];
Bs[innerRowB * BN + innerColB * 4 + 1] = B[innerRowB * N + innerColB * 4 + 1];
Bs[innerRowB * BN + innerColB * 4 + 2] = B[innerRowB * N + innerColB * 4 + 2];
Bs[innerRowB * BN + innerColB * 4 + 3] = B[innerRowB * N + innerColB * 4 + 3];

Shouldn't the compiler just be able to coalesce the 2nd version and also generate 128b loads? I think the reason is that the compiler has no way to verify that the float* B pointer that is passed to the kernel is 128b aligned, which would be a requirement for using LDG.E.128. So the reinterpret_cast's only purpose is to promise the compiler that the float* B pointer will be aligned.
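The stores of the final results to C can be vectorized in the same way. A sketch of what that epilogue might look like for the 2D-blocktiled kernel (loop structure and names are my assumptions, not the post's exact code):

// write out the TM*TN results, four columns at a time via float4
for (uint resIdxM = 0; resIdxM < TM; ++resIdxM) {
  for (uint resIdxN = 0; resIdxN < TN; resIdxN += 4) {
    // load the existing C values in one 128b transaction
    float4 tmp = reinterpret_cast<float4 *>(
        &C[(threadRow * TM + resIdxM) * N + threadCol * TN + resIdxN])[0];
    // C = alpha*(A@B) + beta*C, computed in registers
    tmp.x = alpha * threadResults[resIdxM * TN + resIdxN + 0] + beta * tmp.x;
    tmp.y = alpha * threadResults[resIdxM * TN + resIdxN + 1] + beta * tmp.y;
    tmp.z = alpha * threadResults[resIdxM * TN + resIdxN + 2] + beta * tmp.z;
    tmp.w = alpha * threadResults[resIdxM * TN + resIdxN + 3] + beta * tmp.w;
    // store back in one 128b transaction (STG.E.128)
    reinterpret_cast<float4 *>(
        &C[(threadRow * TM + resIdxM) * N + threadCol * TN + resIdxN])[0] = tmp;
  }
}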

Kernel 6 achieves 20TFLOPs, which is as close to the 24TFLOPs of the cuBLAS implementation as we'll get for now.

Kernel 7: TBD
I wrote this post as my worklog while I optimized the
SGEMM kernel from scratch. As such, these are the
optimizations I want to experiment with next:

- Bank conflicts: Kernel 6 runs into SMEM bank conflicts while loading from As & Bs. I should try to find a way to avoid this.
- Increasing register usage: Kernel 6 has higher occupancy than necessary. Each warp spends 1.5 cycles per instruction stalling while waiting to get scheduled. Therefore it should be possible to use more registers, which will lower occupancy but may make it possible to use double buffering or increase arithmetic intensity.

Conclusion

Writing this post was a similar experience to my previous post on optimizing SGEMM on CPU: optimizing SGEMM iteratively is one of the best ways to deeply understand the performance characteristics of the hardware. When writing the CUDA programs, I was surprised by how easy it was to implement the code once I had made a good visualization of how I wanted the kernel to work.

As always, all my code is available on GitHub.

Lastly, a big thanks to the creators of Godbolt.org (for looking at PTX and SASS assembly) and Excalidraw (for drawing the kernels)! Both of these tools are a joy to use and have helped me learn much faster.

Further Resources and References

- I started writing this post because I stumbled over wangzyon's Github repository, first experimenting with his kernels and then rewriting everything from scratch. Also relevant is this Nvidia Blogpost about the CUTLASS library.
- Mandatory references: the official CUDA Toolkit Programming Guide and the CUDA Best Practices Guide.
- The Kernel Profiling Guide contains even more info on low-level hardware details like caches and pipelines, and on the various metrics that can be collected.
- Onur Mutlu is a professor at ETH who uploads his lectures to Youtube. Particularly relevant for this post are Computer Architecture and Acceleration on Heterogeneous Systems.
- Understanding Latency Hiding on GPUs, a Ph.D. thesis that goes in-depth on how to design workloads such that they fully utilize memory bandwidth and computation.
- Lei Mao (an engineer at Nvidia) has good CUDA content on his blog, including about proper CUDA error handling.
- It seems like there aren't any good official resources for understanding SASS. There's Nvidia's Docs on CUDA binary utilities. More useful might be looking at Open Source SASS assemblers, like Da Yan's turingas.


How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog - December 31, 2022 - Simon Boehm
