This includes:

- Multiply with different modes (map)
  1. Performance of sequential vs OpenMP-based vector multiply.
  2. Comparing various launch configs for CUDA-based vector multiply.
- Sum with different storage types (reduce)
  1. Performance of vector element sum using float vs bfloat16 as the storage type.
- Sum with different modes (reduce)
  1. Performance of sequential vs OpenMP-based vector element sum.
  2. Performance of memcpy vs in-place CUDA-based vector element sum.
  3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
  4. Comparing various launch configs for CUDA-based vector element sum (in-place).
- Sum with in-place strategies of CUDA mode (reduce)
  1. Comparing various launch configs for CUDA-based vector element sum (in-place).
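
To make the sequential vs OpenMP comparisons above concrete, here is a minimal sketch of what the two variants of each operation might look like. The function names (`multiplySeq`, `multiplyOmp`, `sumSeq`, `sumOmp`) and the static schedule are assumptions chosen for illustration, not necessarily what the experiments use. If the code is compiled without OpenMP support, the pragmas are ignored and both variants run sequentially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sequential element-wise multiply (map), a[i] = x[i] * y[i].
// Assumes a, x and y all have the same length.
void multiplySeq(std::vector<float>& a, const std::vector<float>& x, const std::vector<float>& y) {
  const std::size_t N = x.size();
  for (std::size_t i = 0; i < N; ++i)
    a[i] = x[i] * y[i];
}

// Hypothetical helper: the same map, with loop iterations split across OpenMP threads.
// A signed loop index is used because older OpenMP versions require it.
void multiplyOmp(std::vector<float>& a, const std::vector<float>& x, const std::vector<float>& y) {
  const long N = static_cast<long>(x.size());
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < N; ++i)
    a[i] = x[i] * y[i];
}

// Hypothetical helper: sequential element sum (reduce).
float sumSeq(const std::vector<float>& x) {
  float s = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i)
    s += x[i];
  return s;
}

// Hypothetical helper: element sum where each thread accumulates a private
// partial sum, and OpenMP combines the partial sums at the end of the loop.
float sumOmp(const std::vector<float>& x) {
  const long N = static_cast<long>(x.size());
  float s = 0.0f;
  #pragma omp parallel for schedule(static) reduction(+:s)
  for (long i = 0; i < N; ++i)
    s += x[i];
  return s;
}
```

A benchmark along these lines would call each pair on the same input, time both, and check that the results agree. Because floating-point addition is not associative, the parallel sum may differ slightly from the sequential one on large inputs, so a tolerance-based comparison is usually appropriate there.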