Lab Report 6

The document summarizes work done on GPU programming labs. It includes tasks on exploring GPU properties, parallelizing vector computations on GPU using CUDA, and performing matrix multiplication on GPU. The tasks are verified by comparing outputs with MATLAB calculations. Rectangular matrix multiplication is implemented on GPU by evolving the code for square matrices. Verification by MATLAB shows the GPU code generates correct results.


Department of Electrical Engineering

Faculty Member: Dr. Usman Zabit / Jafar Hussain Dated: 23rd October 2019

Semester: Fall 2019 (7th) Section: Group-1

EE-423 – Embedded System Design

Lab 6: Introduction to GPU Programming

with Cuda

PLO4 PLO5 PLO8 PLO9

Name                     Reg. No   Viva / Quiz /   Analysis of data   Modern Tool   Ethics and   Individual and
                                   Performance     in Lab Report      Usage         Safety       Team Work
                                   5 Marks         5 Marks            5 Marks       5 Marks      5 Marks

Muhammad Talha           198257
Rama Ali                 182621
Muhammad Fahad Baig      174885


1.1 Learning Objectives
By the end of this lab you will be able to:
1. Exploit the GPU architecture to parallelize previously sequential routines.
2. Appreciate the relative strengths of the CPU and GPU for various genres of computation.
3. Express the SIMD nature of computations through the CUDA extensions to the C programming
language.

1.2 Deliverables
You are required to submit
• Code
• Observations and experiences
at the beginning of the next lab.
Lab Tasks
Task B: GPUs & their Properties
Compile and run prop.cu and observe the number of GPUs in your system and their
specifications.

Output:
Task C: Parallelizing a Vector Computation

Task C-I: Block-level Parallelism


Compile and run vec_cpu.cu and vec_gpu.cu; observe the results.

Output:
Task C-II: Thread-level Parallelism
Change blockIdx.x to threadIdx.x in line 9 of the code in snippet 3.3.2. Also replace <<<N,1>>>
with <<<1,N>>> in line 38. Compile and execute the code.

Output:

Observing the maximum thread dimensions allowed for GPU in properties, are you
prompted by the expected result? If not, what reason could have made it possible?
The maximum number of threads per block is 512, whereas N is 10000. Yet no error or warning is
raised. A possible reason is that once the first 512 operations have executed in parallel, the block
automatically processes the next 512, and so on serially until all N elements are complete.

Recommend the maximum thread and block dimensions for optimum parallel processing
in GPUs.

Maximum thread dimensions: (512, 512, 64)

Maximum grid dimensions: (65535, 65535, 1)

Task C-III: Threads & Blocks Combined

Call the kernel with the following snippet now:

#define threadsPerBlock 64
compute<<<(N + threadsPerBlock - 1) / threadsPerBlock, threadsPerBlock>>>(dev_a, dev_b, dev_c);

and change the indexing technique to

// indexing with block ID and thread ID combined
int i = blockIdx.x * blockDim.x + threadIdx.x;

Output:

Task D: Matrix Multiplication on GPU


Task D-I
Compile the code in multSq.cu and observe the output. Verify using MATLAB or any other
available tool.
Output:

MATLAB code:

A = zeros(64,64);
B = zeros(64,64);
for i = 1:64
for j = 1:64
A(i,j) = i-1+j-1;
B(i,j) = i-1-j+1;
end
end
C = A*B;
x = diag(C);
x(1:32)
MATLAB Output:

The output is the same as for the GPU code, hence verified.

Task D-II:
Evolve the code snippet in section 3.4.1 for rectangular matrix multiplication.

Our Code:

#include <stdio.h>

#define R1 16
#define C1 25
#define R2 25
#define C2 16

__global__ void matMult(int *matProd, int *matA, int *matB)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int tmpSum = 0;

    if (row < R1 && col < C2)
    {
        for (int i = 0; i < C1; i++)
            tmpSum += matA[row*C1 + i] * matB[i*C2 + col];
        matProd[row*C2 + col] = tmpSum;
    }
}

int main()
{
    // initialize, allocate and define host memory
    int matA[R1*C1] = { 0 };
    int matB[R2*C2] = { 0 };
    int matProd[R1*C2] = { 0 };

    for (int i = 0; i < R1; ++i)
        for (int j = 0; j < C1; ++j)
            matA[i*C1 + j] = i + j;

    for (int i = 0; i < R2; ++i)
        for (int j = 0; j < C2; ++j)
            matB[i*C2 + j] = i - j;

    // initialize and allocate device memory
    int *dev_matProd, *dev_matA, *dev_matB;

    cudaMalloc((void **)&dev_matA, R1*C1*sizeof(int));
    cudaMalloc((void **)&dev_matB, R2*C2*sizeof(int));
    cudaMalloc((void **)&dev_matProd, R1*C2*sizeof(int));

    // copy data to device memory
    cudaMemcpy(dev_matA, matA, R1*C1*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_matB, matB, R2*C2*sizeof(int), cudaMemcpyHostToDevice);

    // one block per output row, one thread per output column
    matMult<<<R1, C2>>>(dev_matProd, dev_matA, dev_matB);

    // check for successful kernel execution
    if (cudaDeviceSynchronize() != cudaSuccess)
    {
        printf("Error\n");
        return -1;
    }

    // copy results from device to host memory
    cudaMemcpy(matProd, dev_matProd, R1*C2*sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < R1/2; ++i)  // inspecting the first few diagonals
        printf(" > Diagonal %d of product is %d.\n", i, matProd[i*C2 + i]);

    // free device memory
    cudaFree(dev_matA);
    cudaFree(dev_matB);
    cudaFree(dev_matProd);

    return 0;
}

Output:

MATLAB Code (for verifying result):

A = zeros(16,25);
B = zeros(25,16);
for i = 1:16
for j = 1:25
A(i,j) = i-1+j-1;
end
end
for i = 1:25
for j = 1:16
B(i,j) = i-1-j+1;
end
end
C = A*B;
x = diag(C);
x(1:8)

MATLAB Output:

Conclusion:

We get the same result for rectangular matrix multiplication in MATLAB as for our CUDA C
code on the GPU. Hence, our code is verified.
