This document describes matrix multiplication on CPUs and GPUs using CUDA. It begins by performing matrix multiplication on the host CPU, iterating over the rows and columns of the matrices with nested loops. It then moves the computation to the GPU using CUDA threads, with each thread calculating one element of the product matrix. Several techniques for improving GPU utilization are then discussed, such as assigning each thread a tile of the matrix and using multiple thread blocks.
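As a rough sketch of the progression described here, the host-side triple loop and a minimal CUDA kernel in which each thread computes one element of the product might look like the following. The kernel name, the matrix dimension N, the 16x16 block shape, and the launch configuration are illustrative assumptions, not taken from the document itself.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512  // illustrative square-matrix size (assumption)

// Host (CPU) version: nested loops over rows, columns, and the inner dimension.
void matmul_host(const float *A, const float *B, float *C) {
    for (int row = 0; row < N; ++row)
        for (int col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
}

// GPU version: each thread computes one element of the product matrix.
__global__ void matmul_kernel(const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

int main(void) {
    size_t bytes = N * N * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < N * N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Multiple thread blocks tile the output matrix: 16x16 threads per block,
    // enough blocks in each dimension to cover all N x N elements.
    dim3 threads(16, 16);
    dim3 blocks((N + threads.x - 1) / threads.x, (N + threads.y - 1) / threads.y);
    matmul_kernel<<<blocks, threads>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * N);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

The refinements the document goes on to discuss, such as giving each thread a tile of the output rather than a single element, build on this one-thread-per-element baseline.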