CUDA: Implementing Matrix Transpose (with Complete Source Code)

Author: 源代码大师

License: CC 4.0 BY-SA

Category: CUDA实战教程 (CUDA hands-on tutorials). Tags: matrix, linear algebra, CUDA

Reposting is not permitted; violations will be pursued.

Article link: https://blog.csdn.net/it_xiangqiang/article/details/136465921

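This article presents a CUDA implementation of matrix transpose, optimized with two-dimensional thread blocks and shared memory. Each thread block handles one TILE_DIM x TILE_DIM tile of the matrix. The threads first read the input-matrix data into shared memory, synchronize, and then write to the output matrix; finally, part of the transposed result is verified on the host.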
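
To make the tile indexing concrete, here is a minimal host-side C++ sketch that emulates the same tile-by-tile flow on the CPU. The function name `transposeTiledCpu` and the constant `kTileDim` are hypothetical, chosen only for illustration. The loops over `by`/`bx` stand in for the block indices, the loops over `ty`/`tx` stand in for the thread indices, and the boundary between the two inner loop nests stands in for the synchronization point. It illustrates the indexing only and is not the article's kernel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical tile size for this sketch; it plays the role of TILE_DIM.
constexpr int kTileDim = 32;

// CPU emulation of a tiled transpose.
// `in` is a height x width matrix in row-major order;
// the returned matrix is width x height in row-major order.
std::vector<float> transposeTiledCpu(const std::vector<float>& in, int width, int height) {
    std::vector<float> out(static_cast<std::size_t>(width) * height, 0.0f);
    const int tilesX = (width + kTileDim - 1) / kTileDim;   // tiles along the columns
    const int tilesY = (height + kTileDim - 1) / kTileDim;  // tiles along the rows
    float tile[kTileDim][kTileDim];                         // stands in for shared memory

    for (int by = 0; by < tilesY; ++by) {
        for (int bx = 0; bx < tilesX; ++bx) {
            // Phase 1: every "thread" (ty, tx) loads one element into the tile.
            for (int ty = 0; ty < kTileDim; ++ty) {
                for (int tx = 0; tx < kTileDim; ++tx) {
                    const int row = by * kTileDim + ty;
                    const int col = bx * kTileDim + tx;
                    if (row < height && col < width) {
                        tile[ty][tx] = in[static_cast<std::size_t>(row) * width + col];
                    }
                }
            }
            // Phase 2 (after the synchronization point): the block coordinates are
            // swapped, and each "thread" reads the tile with swapped indices.
            for (int ty = 0; ty < kTileDim; ++ty) {
                for (int tx = 0; tx < kTileDim; ++tx) {
                    const int newRow = bx * kTileDim + ty;  // row in the output
                    const int newCol = by * kTileDim + tx;  // column in the output
                    if (newRow < width && newCol < height) {
                        out[static_cast<std::size_t>(newRow) * height + newCol] = tile[tx][ty];
                    }
                }
            }
        }
    }
    return out;
}
```

Because the tile is read as `tile[tx][ty]` in the second phase, consecutive `tx` values write consecutive output addresses, which is the property a shared-memory tile is meant to provide on the GPU.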

CUDA: Implementing Matrix Transpose

The following is example code that implements matrix transpose with CUDA:

#include <iostream>
#include <cuda_runtime_api.h>

#define TILE_DIM 32
#define BLOCK_ROWS 8

// CUDA kernel: matrix transpose
__global__ void transpose(float* in, float* out, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int blockIdx_x, blockIdx_y;
    int tx, ty;
    int row, col;

    // Compute the input-matrix indices
    blockIdx_x = blockIdx.x;
    blockIdx_y = blockIdx.y;
    tx = threadIdx.x;
    ty = threadIdx.y;
    row = blockIdx_y * TILE_DIM + ty;
    col = blockIdx_x * TILE_DIM + tx;

    // Read the input-matrix data into shared memory
    if (row < height && col < width) {
        tile[ty][tx] = in[row * width + col];
    }
    __syncthreads();

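    // Compute the output-matrix indices (the block coordinates are swapped)
    int newRow = blockIdx_x * TILE_DIM + ty;
    // NOTE: the source listing is cut off after "int newCol ="; everything from
    // here to the end of the kernel is an assumed completion, not the original code.
    // It assumes the kernel is launched with TILE_DIM x TILE_DIM threads per block.
    int newCol = blockIdx_y * TILE_DIM + tx;

    // Write the transposed element; the output matrix has `width` rows and `height` columns
    if (newRow < width && newCol < height) {
        out[newRow * height + newCol] = tile[tx][ty];
    }
}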
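
The summary mentions checking part of the transposed result on the host, but that part of the listing is not shown here. Below is a minimal sketch of one way such a check could look in plain C++ host code. Both function names are hypothetical and are not taken from the original source: `tilesFor` computes how many tiles are needed to cover one dimension (rounding up, which is why the kernel needs its bounds checks), and `countMismatches` compares a result copied back from the device against the definition out[c][r] = in[r][c].

```cpp
#include <cstddef>
#include <vector>

// Number of tiles of size `tileDim` needed to cover `n` elements (ceiling division).
// Hypothetical helper; a launch would use it for each grid dimension.
int tilesFor(int n, int tileDim) {
    return (n + tileDim - 1) / tileDim;
}

// Counts the elements of `out` (width x height, row-major) that differ from the
// transpose of `in` (height x width, row-major). Exact comparison is fine here
// because a transpose only copies values and performs no arithmetic.
std::size_t countMismatches(const std::vector<float>& in,
                            const std::vector<float>& out,
                            int width, int height) {
    std::size_t mismatches = 0;
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            const std::size_t src = static_cast<std::size_t>(r) * width + c;
            const std::size_t dst = static_cast<std::size_t>(c) * height + r;
            if (out[dst] != in[src]) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

For example, with a tile size of 32, a matrix of width 100 and height 70 would need `tilesFor(100, 32)` = 4 blocks in x and `tilesFor(70, 32)` = 3 blocks in y. A check that reports zero mismatches only shows the output matches the definition for the tested input; it says nothing about performance.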
