How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
| Kernel | GFLOPs | Performance relative to cuBLAS (fp32) |
| --- | --- | --- |
| 2: GMEM Coalescing | 2006 | 8.2% |
| 3: SMEM Blocktiling | 2984 | 12.2% |
| 4: 1D Warptiling | 8626 | 35.3% |
| 5: 2D Warptiling | 16134 | 66.0% |
| 6: Vectorize loads | 20358 | 83.2% |
For our first kernel, we'll use the grid, block and thread hierarchy to assign each thread a unique entry in the result matrix C. That thread then computes the dot product of the corresponding row of A and column of B and writes the result to C. Because each location of C is written by only one thread, no synchronization is needed. We'll launch the kernel like so:
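A minimal sketch of such a kernel and its launch, assuming row-major matrices and illustrative names (`sgemm_naive`, `CEIL_DIV`, `run_sgemm_naive` are not fixed by the text):

```cpp
// Sketch: one thread per entry of the M x N output C. A is M x K, B is K x N.
#define CEIL_DIV(x, y) (((x) + (y) - 1) / (y))

__global__ void sgemm_naive(int M, int N, int K, float alpha, const float *A,
                            const float *B, float beta, float *C) {
  // each thread owns exactly one entry C[row][col]
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  const int col = blockIdx.y * blockDim.y + threadIdx.y;
  if (row < M && col < N) {
    float tmp = 0.0f;
    for (int k = 0; k < K; ++k) {
      tmp += A[row * K + k] * B[k * N + col];
    }
    // GEMM: C = alpha * (A @ B) + beta * C
    C[row * N + col] = alpha * tmp + beta * C[row * N + col];
  }
}

void run_sgemm_naive(int M, int N, int K, float alpha, const float *A,
                     const float *B, float beta, float *C) {
  // 32 x 32 = 1024 threads per block; enough blocks to cover every entry of C
  dim3 blockDim(32, 32, 1);
  dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32), 1);
  // the launch is asynchronous: control returns to the host immediately
  sgemm_naive<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);
}
```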
Now that we've calculated some lower bounds for our fp32 GEMM calculation, let's get back to the kernel at hand and figure out why it's so much slower than it could be.
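For reference, a back-of-the-envelope version of those bounds, assuming square fp32 matrices with M = N = K = 4096 (an illustrative size, not necessarily the one benchmarked here):

$$
\text{FLOPs} = 2MNK = 2 \cdot 4096^3 \approx 137\ \text{GFLOP}
$$

$$
\text{min. GMEM reads} = (MK + KN) \cdot 4\,\text{B} \approx 134\ \text{MB}, \qquad
\text{min. GMEM writes} = MN \cdot 4\,\text{B} \approx 67\ \text{MB}
$$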
```cpp
threadId = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z)
```
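Threads are grouped into warps of 32 by consecutive threadId, so threads that differ only in threadIdx.x end up in the same warp. To get coalesced GMEM accesses, the row and column of C can be derived from a 1D thread index so that the threads of a warp touch consecutive memory addresses. A sketch of that mapping, assuming a tile width of BLOCKSIZE = 32 (not necessarily the exact indexing used here):

```cpp
// Block launched with BLOCKSIZE * BLOCKSIZE threads in 1D. Consecutive
// threadIdx.x values map to consecutive columns, so a warp reads consecutive
// entries of B and C, which the hardware coalesces into wide transactions.
const int cRow = blockIdx.x * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
const int cCol = blockIdx.y * BLOCKSIZE + (threadIdx.x % BLOCKSIZE);
```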
The important parts of the code are below, with variable names corresponding to the plot above:
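In outline, a shared-memory blocktiling kernel of this kind looks roughly like the sketch below. BLOCKSIZE, As, Bs, threadRow and threadCol are assumed names, and M, N, K are assumed to be multiples of BLOCKSIZE (no edge handling); this is not necessarily the exact listing behind the plot.

```cpp
template <const int BLOCKSIZE>
__global__ void sgemm_smem_sketch(int M, int N, int K, float alpha,
                                  const float *A, const float *B, float beta,
                                  float *C) {
  // which BLOCKSIZE x BLOCKSIZE tile of C this block computes
  const int cRow = blockIdx.x;
  const int cCol = blockIdx.y;
  // position of this thread inside the tile
  // (block launched with BLOCKSIZE * BLOCKSIZE threads in 1D)
  const int threadRow = threadIdx.x / BLOCKSIZE;
  const int threadCol = threadIdx.x % BLOCKSIZE;

  __shared__ float As[BLOCKSIZE * BLOCKSIZE];
  __shared__ float Bs[BLOCKSIZE * BLOCKSIZE];

  // advance pointers to this block's starting tiles
  A += cRow * BLOCKSIZE * K;
  B += cCol * BLOCKSIZE;
  C += cRow * BLOCKSIZE * N + cCol * BLOCKSIZE;

  float tmp = 0.0f;
  for (int bkIdx = 0; bkIdx < K; bkIdx += BLOCKSIZE) {
    // each thread stages one element of A and one of B into SMEM
    As[threadRow * BLOCKSIZE + threadCol] = A[threadRow * K + threadCol];
    Bs[threadRow * BLOCKSIZE + threadCol] = B[threadRow * N + threadCol];
    __syncthreads(); // wait until the whole tile is resident

    A += BLOCKSIZE;     // next tile of A: BLOCKSIZE columns to the right
    B += BLOCKSIZE * N; // next tile of B: BLOCKSIZE rows down

    // partial dot product, read entirely from shared memory
    for (int dotIdx = 0; dotIdx < BLOCKSIZE; ++dotIdx) {
      tmp += As[threadRow * BLOCKSIZE + dotIdx] *
             Bs[dotIdx * BLOCKSIZE + threadCol];
    }
    __syncthreads(); // don't overwrite the tile while it's still being read
  }
  C[threadRow * N + threadCol] =
      alpha * tmp + beta * C[threadRow * N + threadCol];
}
```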
| Metric | Value |
| --- | --- |
| Name | NVIDIA RTX A6000 |
| multiprocessor count | 84 |
| max warps per multiprocessor | 48 |
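Just from these two figures, the upper bound on resident parallelism is:

$$
84\ \text{SMs} \times 48\ \text{warps/SM} = 4032\ \text{warps} = 129{,}024\ \text{threads resident at once}
$$

Occupancy is then the ratio of active warps per SM to that per-SM maximum of 48.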
Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, as well as shared memory instructions.
All of the important changes for this kernel happen in the inner loop. The loading from GMEM to SMEM stays largely the same as before. Let's have a look:

```cpp
// ...
    }
  }
  __syncthreads();
}
```
And for our new kernel, where each thread calculates eight
results:
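A sketch of how that inner loop can look when each thread accumulates TM = 8 results in registers; TM, threadResults, Btmp, As, Bs, BK, BN, threadRow and threadCol are assumed names, and the surrounding tile setup is omitted:

```cpp
// Sketch: 1D tiling of results. Each thread owns a TM-tall column of the
// output tile and keeps its partial sums in registers, so each value loaded
// from SMEM is reused TM times.
const int TM = 8;                  // results per thread (assumed)
float threadResults[TM] = {0.0f};  // accumulators in the register file

for (int dotIdx = 0; dotIdx < BK; ++dotIdx) {
  // load the Bs entry once into a register and reuse it for all TM results
  const float Btmp = Bs[dotIdx * BN + threadCol];
  for (int resIdx = 0; resIdx < TM; ++resIdx) {
    threadResults[resIdx] +=
        As[(threadRow * TM + resIdx) * BK + dotIdx] * Btmp;
  }
}
```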
```cpp
// advance blocktile
A += BK;     // move BK columns to the right
B += BK * N; // move BK rows down
```
```cpp
float4 tmp =
    reinterpret_cast<float4 *>(&A[innerRowA * K + innerColA * 4])[0];
// transpose A during the GMEM to SMEM transfer
As[(innerColA * 4 + 0) * BM + innerRowA] = tmp.x;
As[(innerColA * 4 + 1) * BM + innerRowA] = tmp.y;
As[(innerColA * 4 + 2) * BM + innerRowA] = tmp.z;
As[(innerColA * 4 + 3) * BM + innerRowA] = tmp.w;
```
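For comparison, a B tile that isn't transposed can be staged with one 128-bit load/store pair per thread; a sketch using the same kind of assumed index names (innerRowB, innerColB, BN):

```cpp
// Sketch: Bs keeps B's row-major layout, so four consecutive floats can be
// moved GMEM -> SMEM as a single float4 per thread.
reinterpret_cast<float4 *>(&Bs[innerRowB * BN + innerColB * 4])[0] =
    reinterpret_cast<float4 *>(&B[innerRowB * N + innerColB * 4])[0];
```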
…would be any faster than just manually unrolling the access (or using `#pragma unroll`):
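To make the comparison concrete, here is a generic illustration of the two spellings the sentence refers to, not the worklog's actual snippet; regM, As, TM, BK, threadRow and dotIdx are assumed names:

```cpp
float regM[4];

// Variant 1: a fixed-trip-count loop the compiler is asked to unroll.
#pragma unroll
for (int i = 0; i < 4; ++i) {
  regM[i] = As[(threadRow * TM + i) * BK + dotIdx];
}

// Variant 2: the same access manually unrolled; with a known trip count the
// compiler typically emits identical straight-line code for both.
regM[0] = As[(threadRow * TM + 0) * BK + dotIdx];
regM[1] = As[(threadRow * TM + 1) * BK + dotIdx];
regM[2] = As[(threadRow * TM + 2) * BK + dotIdx];
regM[3] = As[(threadRow * TM + 3) * BK + dotIdx];
```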
Kernel 7: TBD
I wrote this post as my worklog while I optimized the
SGEMM kernel from scratch. As such, these are the
optimizations I want to experiment with next:
Conclusion
Writing this post was a similar experience to my previous post on optimizing SGEMM on CPU: optimizing SGEMM iteratively is one of the best ways to deeply understand the performance characteristics of the hardware. When writing the CUDA programs, I was surprised by how easy it was to implement the code once I had made a good visualization of how I wanted the kernel to work.
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog - December 31, 2022 - Simon Boehm