Matrix Computations on the GPU
CUBLAS, CUSOLVER and MAGMA by example
Andrzej Chrzȩszczyk
Jan Kochanowski University, Kielce, Poland
Jacob Anders
CSIRO, Canberra, Australia
Version 2017
Foreword
When code running on the CPU or GPU accesses data allocated this way, the CUDA
system takes care of migrating memory pages to the memory of the accessing
processor. Let us note however, that a carefully tuned CUDA program that
uses streams and cudaMemcpyAsync to efficiently overlap execution with
data transfer may perform better than a CUDA program that only uses
unified memory. Users of unified memory are still free to use cudaMemcpy or
cudaMemcpyAsync for performance optimization. Additionally, applications
can guide the driver using cudaMemAdvise and explicitly migrate memory
using cudaMemPrefetchAsync. Note also that unified memory examples,
which do not call cudaMemcpy, require an explicit cudaDeviceSynchronize
before the host program can safely use the output from the GPU. The memory
allocated with cudaMallocManaged should be released with cudaFree.
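To make this concrete, here is a minimal unified-memory sketch; the vector
length and all variable names are illustrative assumptions, not code taken
from the book's examples:

#include <cuda_runtime_api.h>
#include <stdio.h>
int main(void){
  int n=1024, dev=-1;
  float *x;
  cudaMallocManaged(&x,n*sizeof(float));   // managed memory, visible to CPU and GPU
  for(int j=0;j<n;j++) x[j]=1.0f;          // initialize on the host
  cudaGetDevice(&dev);
  cudaMemAdvise(x,n*sizeof(float),cudaMemAdviseSetPreferredLocation,dev); // hint
  cudaMemPrefetchAsync(x,n*sizeof(float),dev,0); // migrate pages to the GPU up front
  // ... kernels or CUBLAS calls operating on x would go here ...
  cudaDeviceSynchronize();  // required before the host safely reads GPU output
  printf("x[0]=%g\n",x[0]);
  cudaFree(x);              // managed memory is released with cudaFree
  return 0;
}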
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 CUDA Toolkit 4
1.1 Installing CUDA Toolkit . . . . . . . . . . . . . . . . . . . . 4
1.2 Measuring GPUs performance . . . . . . . . . . . . . . . . . 5
2 CUBLAS by example 8
2.1 General remarks on the examples . . . . . . . . . . . . . . . 8
2.2 CUBLAS Level-1. Scalar and vector based operations . . . . 9
2.2.1 cublasIsamax, cublasIsamin - maximal, minimal elements . . . . . . . . . . 9
2.2.2 cublasIsamax, cublasIsamin - unified memory version . . . . . . . . . . 11
2.2.3 cublasSasum - sum of absolute values . . . . . . . . . 11
2.2.4 cublasSasum - unified memory version . . . . . . . . 12
2.2.5 cublasSaxpy - compute αx + y . . . . . . . . . . . . 13
2.2.6 cublasSaxpy - unified memory version . . . . . . . . 14
2.2.7 cublasScopy - copy vector into vector . . . . . . . . 15
2.2.8 cublasScopy - unified memory version . . . . . . . . 16
2.2.9 cublasSdot - dot product . . . . . . . . . . . . . . . 17
2.2.10 cublasSdot - unified memory version . . . . . . . . . 18
2.2.11 cublasSnrm2 - Euclidean norm . . . . . . . . . . . . 19
2.2.12 cublasSnrm2 - unified memory version . . . . . . . . 20
2.2.13 cublasSrot - apply the Givens rotation . . . . . . . 21
2.2.14 cublasSrot - unified memory version . . . . . . . . . 23
2.2.15 cublasSrotg - construct the Givens rotation matrix . 24
2.2.16 cublasSrotm - apply the modified Givens rotation . . 25
2.2.17 cublasSrotm - unified memory version . . . . . . . . 27
2.2.18 cublasSrotmg - construct the modified Givens rotation matrix . . . . . . . . . . 28
2.2.19 cublasSscal - scale the vector . . . . . . . . . . . . 29
2.2.20 cublasSscal - unified memory version . . . . . . . . 30
2.2.21 cublasSswap - swap two vectors . . . . . . . . . . . . 31
2.2.22 cublasSswap - unified memory version . . . . . . . . 33
CUDA Toolkit
+--------------------------------------------------------------------------+
|Processes: GPU Memory |
| GPU PID Type Process name Usage |
|==========================================================================|
| 0 1006 G /usr/lib/xorg/Xorg 131MiB |
| 0 1546 G compiz 88MiB |
+--------------------------------------------------------------------------+
It seems that one of the simplest ways of benchmarking systems with GPU
devices is to use the Magma library. The library will be introduced in one of
the next chapters, but let us remark now that, as a by-product of the Magma
installation, one obtains the directory testing with ready-to-use testing
binaries. Below we present the results of running four of them. We tested
a system with Linux Ubuntu 16.04 and
Let us repeat that the executables used in this section are contained in the
testing subdirectory of the Magma installation directory.
Solving the general NxN linear system in double precision.
./testing_dgesv --lapack
% N NRHS CPU Gflop/s (sec) GPU Gflop/s (sec)
%===================================================
1088 1 86.97 ( 0.01) 58.13 ( 0.01)
2112 1 92.03 ( 0.07) 117.56 ( 0.05)
3136 1 102.69 ( 0.20) 145.29 ( 0.14)
4160 1 111.11 ( 0.43) 179.07 ( 0.27)
5184 1 120.00 ( 0.77) 200.15 ( 0.46)
6208 1 126.85 ( 1.26) 211.75 ( 0.75)
7232 1 132.66 ( 1.90) 219.35 ( 1.15)
8256 1 137.11 ( 2.74) 225.68 ( 1.66)
9280 1 142.41 ( 3.74) 232.11 ( 2.30)
10304 1 145.79 ( 5.00) 236.96 ( 3.08)
CUBLAS by example
One can add the error-checking code from the CUBLAS Library User Guide
example with minor modifications.
and
If the extension .c is preferred in the second case, then all occurrences of the
function cudaMallocManaged should have a third (integer) argument equal to 1.
If the g++ command is preferred, then the syntax of the form
cudaMallocManaged((void**)&x,n*sizeof(float),1); should be used
instead of cudaMallocManaged(&x,n*sizeof(float),1);,
the constant EXIT_SUCCESS should be replaced by 0, and the header
cuda_runtime_api.h should be included.
An example of compilation with g++:
stat=cublasIsamax(handle,n,d_x,1,&result);
printf ( " max | x [ i ]|:%4.0 f \ n " , fabs ( x [ result -1])); // print
// max {| x [0]| ,... ,| x [n -1]|}
// find the smallest index of the element of d_x with minimum
// absolute value
stat=cublasIsamin(handle,n,d_x,1,&result);
printf ( " min | x [ i ]|:%4.0 f \ n " , fabs ( x [ result -1])); // print
// min {| x [0]| ,... ,| x [n -1]|}
cudaFree ( d_x ); // free device memory
cublasDestroy ( handle ); // destroy CUBLAS context
free ( x ); // free host memory
return EXIT_SUCCESS ;
}
// x : 0 , 1 , 2 , 3 , 4 , 5 ,
// max | x [ i ]|: 5
// min | x [ i ]|: 0
cublasIsamax(handle,n,x,1,&result);
cudaDeviceSynchronize();
printf ( " max | x [ i ]|:%4.0 f \ n " , fabs ( x [ result -1]));
// max {| x [0]| ,... ,| x [n -1]|}
// find the smallest index of the element of x with minimal
// absolute value
cublasIsamin(handle,n,x,1,&result);
cudaDeviceSynchronize();
printf ( " min | x [ i ]|:%4.0 f \ n " , fabs ( x [ result -1]));
// min {| x [0]| ,... ,| x [n -1]|}
cudaFree ( x ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// x : 0 , 1 , 2 , 3 , 4 , 5 ,
// max | x [ i ]|: 5
// min | x [ i ]|: 0
stat=cublasSasum(handle,n,d_x,1,&result);
// print the result
printf ( " sum of the absolute values of elements of x :%4.0 f \ n " ,
result );
cudaFree ( d_x ); // free device memory
cublasDestroy ( handle ); // destroy CUBLAS context
free ( x ); // free host memory
return EXIT_SUCCESS ;
}
// x : 0 , 1 , 2 , 3 , 4 , 5 ,
// sum of the absolute values of elements of x : 15
// | 0 | + | 1 | + | 2 | + | 3 | + | 4 | + | 5 | = 1 5
cublasSasum(handle,n,x,1,&result);
cudaDeviceSynchronize();
// print the result
printf ( " sum of the absolute values of elements of x :%4.0 f \ n " ,
result );
cudaFree ( x ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// x : 0 , 1 , 2 , 3 , 4 , 5 ,
// sum of the absolute values of elements of x : 15
// | 0 | + | 1 | + | 2 | + | 3 | + | 4 | + | 5 | = 1 5
// y after Saxpy :
// 0 , 3 , 6 , 9 ,12 ,15 ,// 2* x + y = 2*{0 ,1 ,2 ,3 ,4 ,5} + {0 ,1 ,2 ,3 ,4 ,5}
cudaMallocManaged (& x , n * sizeof ( float )); // unified mem . for x
for ( j =0; j < n ; j ++)
x [ j ]=( float ) j ; // x ={0 ,1 ,2 ,3 ,4 ,5}
cudaMallocManaged (& y , n * sizeof ( float )); // unified mem . for y
for ( j =0; j < n ; j ++)
y [ j ]=( float ) j ; // y ={0 ,1 ,2 ,3 ,4 ,5}
printf ( "x , y :\ n " );
for ( j =0; j < n ; j ++)
printf ( " %2.0 f , " ,x [ j ]); // print x , y
printf ( " \ n " );
cublasCreate (& handle ); // initialize CUBLAS context
float al =2.0; // al =2
// multiply the vector x by the scalar al and add to y
// y = al * x + y , x , y - n - vectors ; al - scalar
cublasSaxpy(handle,n,&al,x,1,y,1);
cudaDeviceSynchronize();
printf ( " y after Saxpy :\ n " ); // print y after Saxpy
for ( j =0; j < n ; j ++)
printf ( " %2.0 f , " ,y [ j ]);
printf ( " \ n " );
cudaFree ( x ); // free memory
cudaFree ( y ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// x , y :
// 0 , 1 , 2 , 3 , 4 , 5 ,
// y after Saxpy :
// 0 , 3 , 6 , 9 ,12 ,15 ,// 2* x + y = 2*{0 ,1 ,2 ,3 ,4 ,5} + {0 ,1 ,2 ,3 ,4 ,5}
cublasScopy(handle,n,x,1,y,1);
cudaDeviceSynchronize();
printf ( " y after copy :\ n " );
for ( j =0; j < n ; j ++)
printf ( " %2.0 f , " ,y [ j ]); // print y
printf ( " \ n " );
cudaFree ( x ); // free memory
cudaFree ( y ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return 0;
}
// x : 0, 1, 2, 3, 4, 5,
for complex x, y.
// dot product x . y : // x . y =
// 55 // 1 *1 + 2 *2 + 3 *3 + 4 *4 + 5 *5
cublasSdot(handle,n,x,1,y,1,&result);
cudaDeviceSynchronize();
printf ( " dot product x . y :\ n " );
printf ( " %7.0 f \ n " , result ); // print the result
cudaFree ( x ); // free memory
cudaFree ( y ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// x , y :
// 0 , 1 , 2 , 3 , 4 , 5 ,
// dot product x . y : // x . y =
// 55 // 1 *1 + 2 *2 + 3 *3 + 4 *4 + 5 *5
stat=cublasSnrm2(handle,n,d_x,1,&result);
printf ( " Euclidean norm of x : " );
printf ( " %7.3 f \ n " , result ); // print the result
cudaFree ( d_x ); // free device memory
cublasDestroy ( handle ); // destroy CUBLAS context
free ( x ); // free host memory
return EXIT_SUCCESS ;
}
// x : 0, 1, 2, 3, 4, 5,
// || x ||=
// Euclidean norm of x : 7.416 //\ sqrt { 0 ^ 2 + 1 ^ 2 + 2 ^ 2 + 3 ^ 2 + 4 ^ 2 + 5 ^ 2 }
float result ;
// Euclidean norm of the vector x : \ sqrt { x [0]^2+...+ x [n -1]^2}
cublasSnrm2(handle,n,x,1,&result);
cudaDeviceSynchronize();
printf ( " Euclidean norm of x : " );
printf ( " %7.3 f \ n " , result ); // print the result
cudaFree ( x ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return 0;
}
// x : 0, 1, 2, 3, 4, 5,
// || x ||=
// Euclidean norm of x : 7.416 //\ sqrt { 0 ^ 2 + 1 ^ 2 + 2 ^ 2 + 3 ^ 2 + 4 ^ 2 + 5 ^ 2 }
// on the device
float * d_x ; // d_x - x on the device
float * d_y ; // d_y - y on the device
cudaStat = cudaMalloc (( void **)& d_x , n * sizeof (* x )); // device
// memory alloc for x
cudaStat = cudaMalloc (( void **)& d_y , n * sizeof (* y )); // device
// memory alloc for y
stat = cublasCreate (& handle ); // initialize CUBLAS context
stat = cublasSetVector (n , sizeof (* x ) ,x ,1 , d_x ,1); // cp x - > d_x
stat = cublasSetVector (n , sizeof (* y ) ,y ,1 , d_y ,1); // cp y - > d_y
float c =0.5;
float s =0.8660254; // s = sqrt (3.0)/2.0
// Givens rotation
// [ c s ] [ row ( x ) ]
// multiplies 2 x2 matrix [ ] with 2 xn matrix [ ]
// [-s c ] [ row ( y ) ]
//
// [1/2 sqrt (3)/2] [0 ,1 ,2 ,3 , 4 , 5]
// [ - sqrt (3)/2 1/2 ] [0 ,1 ,4 ,9 ,16 ,25]
// x after Srot :
// 0.000 , 1.367 , 4.468 , 9.302 , 15.871 , 24.173 ,
// y after Srot :
// 0.000 , -0.367 , 0.266 , 1.899 , 4.532 , 8.165 ,
// // [ x ] [ 0.5 0.867] [0 1 2 3 4 5]
// // [ ]= [ ]*[ ]
// // [ y ] [ -0.867 0.5 ] [0 1 4 9 16 25]
cublasSrot(handle,n,x,1,y,1,&c,&s);
cudaDeviceSynchronize();
printf ( " x after Srot :\ n " ); // print x after Srot
for ( j =0; j < n ; j ++)
printf ( " %7.3 f , " ,x [ j ]);
printf ( " \ n " );
// x : 0, 1, 2, 3, 4, 5,
// y : 0, 1, 4, 9, 16 , 25 ,
// x after Srot :
// 0.000 , 1.367 , 4.468 , 9.302 , 15.871 , 24.173 ,
// y after Srot :
// 0.000 , -0.367 , 0.266 , 1.899 , 4.532 , 8.165 ,
// // [ x ] [ 0.5 0.867] [0 1 2 3 4 5]
// // [ ]= [ ]*[ ]
// // [ y ] [ -0.867 0.5 ] [0 1 4 9 16 25]
stat=cublasSrotg(handle,&a,&b,&c,&s);
// After Srotg :
// a : 1.41421 // \ sqrt {1^2+1^2}
// c : 0.70711 // cos ( pi /4)
// s : 0.70711 // sin ( pi /4)
// // [ 0.70711 0.70711] [1] [1.41422]
// // [ ]*[ ]=[ ]
// // [ -0.70711 0.70711] [1] [ 0 ]
// x after Srotm :
// 0.000 , 1.500 , 5.000 , 10.500 , 18.000 , 27.500 ,
// y after Srotm :
// 0.000 , -0.500 , 0.000 , 1.500 , 4.000 , 7.500 ,
// // [ x ] [ 0.5 1 ] [0 1 2 3 4 5]
// // [ ]= [ ]*[ ]
// // [ y ] [ -1 0.5] [0 1 4 9 16 25]
cublasSrotm(handle,n,x,1,y,1,param);
cudaDeviceSynchronize();
printf ( " x after Srotm x :\ n " ); // print x after Srotm
for ( j =0; j < n ; j ++)
printf ( " %7.3 f , " ,x [ j ]);
printf ( " \ n " );
// x after Srotm :
// 0.000 , 1.500 , 5.000 , 10.500 , 18.000 , 27.500 ,
// y after Srotm :
// 0.000 , -0.500 , 0.000 , 1.500 , 4.000 , 7.500 ,
// // [ x ] [ 0.5 1 ] [0 1 2 3 4 5]
// // [ ]= [ ]*[ ]
// // [ y ] [ -1 0.5] [0 1 4 9 16 25]
float x1 =1.0 f ; // x1 =1
float y1 =2.0 f ; // y1 =2
printf ( " x1 : %7.3 f \ n " , x1 ); // print x1
printf ( " y1 : %7.3 f \ n " , y1 ); // print y1
// find modified Givens rotation matrix H ={{ h11 , h12 } ,{ h21 , h22 }}
// such that the second entry of H *{\ sqrt { d1 }* x1 ,\ sqrt { d2 }* y1 }^ T
// is zero
stat=cublasSrotmg(handle,&d1,&d2,&x1,&y1,param);
printf ( " After srotmg :\ n " );
printf ( " param [0]: %4.2 f \ n " , param [0]);
printf ( " h11 : %7.5 f \ n " , param [1]);
printf ( " h22 : %7.5 f \ n " , param [4]);
// check if the second entry of H *{\ sqrt { d1 )* x1 ,\ sqrt { d2 }* y1 }^ T
// is zero ; the values of d1 , d2 , x1 are overwritten so we use
// their initial values
printf ( " %7.5 f \ n " ,( -1.0)* sqrt (5.0)*1.0+
param [4]* sqrt (5.0)*2.0);
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// d1 : 5.000 // [ d1 ] [5] [ x1 ] [1] [0.5 1 ]
// d2 : 5.000 // [ ]=[ ] , [ ]=[ ] , H =[ ]
// x1 : 1.000 // [ d2 ] [5] [ x2 ] [2] [ -1 0.5]
// y1 : 2.000
// After srotmg :
// param [0]: 1.00
// h11 : 0.50000
// h22 : 0.50000
x = αx.
# define n 6 // length of x
int main ( void ){
cudaError_t cudaStat ; // cudaMalloc status
cublasStatus_t stat ; // CUBLAS functions status
cublasHandle_t handle ; // CUBLAS context
int j ; // index of elements
float * x ; // n - vector on the host
x =( float *) malloc ( n * sizeof (* x )); // host memory alloc for x
for ( j =0; j < n ; j ++)
x [ j ]=( float ) j ; // x ={0 ,1 ,2 ,3 ,4 ,5}
printf ( " x :\ n " );
for ( j =0; j < n ; j ++)
printf ( " %2.0 f , " ,x [ j ]); // print x
printf ( " \ n " );
// on the device
float * d_x ; // d_x - x on the device
cudaStat = cudaMalloc (( void **)& d_x , n * sizeof (* x )); // device
// memory alloc for x
stat = cublasCreate (& handle ); // initialize CUBLAS context
stat = cublasSetVector (n , sizeof (* x ) ,x ,1 , d_x ,1); // cp x - > d_x
float al =2.0; // al =2
// scale the vector d_x by the scalar al : d_x = al * d_x
stat=cublasSscal(handle,n,&al,d_x,1);
stat = cublasGetVector (n , sizeof ( float ) , d_x ,1 ,x ,1); // cp d_x - > x
printf ( " x after Sscal :\ n " ); // print x after Sscal :
for ( j =0; j < n ; j ++)
printf ( " %2.0 f , " ,x [ j ]); // x ={0 ,2 ,4 ,6 ,8 ,10}
printf ( " \ n " );
cudaFree ( d_x ); // free device memory
cublasDestroy ( handle ); // destroy CUBLAS context
free ( x ); // free host memory
return EXIT_SUCCESS ;
}
// x :
// 0 , 1 , 2 , 3 , 4 , 5 ,
// x after Sscal :
// 0 , 2 , 4 , 6 , 8 ,10 , // 2*{0 ,1 ,2 ,3 ,4 ,5}
float * x ; // n - vector
cudaMallocManaged (& x , n * sizeof ( float )); // unified mem . for x
for ( j =0; j < n ; j ++)
x [ j ]=( float ) j ; // x ={0 ,1 ,2 ,3 ,4 ,5}
printf ( " x :\ n " );
for ( j =0; j < n ; j ++)
printf ( " %2.0 f , " ,x [ j ]); // print x
printf ( " \ n " );
cublasCreate (& handle ); // initialize CUBLAS context
float al =2.0; // al =2
// scale the vector x by the scalar al : x = al * x
cublasSscal(handle,n,&al,x,1);
cudaDeviceSynchronize();
printf ( " x after Sscal :\ n " ); // print x after Sscal :
for ( j =0; j < n ; j ++)
printf ( " %2.0 f , " ,x [ j ]); // x ={0 ,2 ,4 ,6 ,8 ,10}
printf ( " \ n " );
cudaFree ( x ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// x :
// 0 , 1 , 2 , 3 , 4 , 5 ,
// x after Sscal :
// 0 , 2 , 4 , 6 , 8 ,10 , // 2*{0 ,1 ,2 ,3 ,4 ,5}
x ← y, y ← x.
// x after Sswap :
// 0 , 2 , 4 , 6 , 8 ,10 , // x <- y
// y after Sswap :
// 0, 1, 2, 3, 4, 5, // y <- x
cublasSswap(handle,n,x,1,y,1);
cudaDeviceSynchronize();
printf ( " x after Sswap :\ n " ); // print x after Sswap :
for ( j =0; j < n ; j ++)
printf ( " %2.0 f , " ,x [ j ]); // x ={0 ,2 ,4 ,6 ,8 ,10}
printf ( " \ n " );
printf ( " y after Sswap :\ n " ); // print y after Sswap :
for ( j =0; j < n ; j ++)
printf ( " %2.0 f , " ,y [ j ]); // y ={0 ,1 ,2 ,3 ,4 ,5}
printf ( " \ n " );
cudaFree ( x ); // free memory
cudaFree ( y ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// x:
// 0, 1, 2, 3, 4, 5,
// y:
// 0 , 2 , 4 , 6 , 8 ,10 ,
// x after Sswap :
// 0 , 2 , 4 , 6 , 8 ,10 , // x <- y
// y after Sswap :
// 0, 1, 2, 3, 4, 5, // y <- x
y = α op(A)x + βy,
cublasSgbmv(handle,CUBLAS_OP_N,m,n,kl,ku,&al,a,m,x,1,&bet,y,1);
cudaDeviceSynchronize();
printf ( " y after Sgbmv :\ n " ); // print y after Sgbmv
for ( j =0; j < m ; j ++){
printf ( " %7.0 f " ,y [ j ]);
printf ( " \ n " );
}
cudaFree ( a ); // free memory
y = α op(A)x + βy,
// y after Sgemv :
// 115 // [11 17 23 29 35] [1]
// 120 // [12 18 24 30 36] [1]
// 125 // [13 19 25 31 37]* [1]
// 130 // [14 20 26 32 38] [1]
// 135 // [15 21 27 33 39] [1]
// 140 // [16 22 28 34 40]
cublasSgemv(handle,CUBLAS_OP_N,m,n,&al,a,m,x,1,&bet,y,1);
cudaDeviceSynchronize();
printf ( " y after Sgemv :\ n " );
for ( j =0; j < m ; j ++){
printf ( " %5.0 f " ,y [ j ]); // print y after Sgemv
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cudaFree ( y ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// a :
// 11 17 23 29 35
// 12 18 24 30 36
// 13 19 25 31 37
// 14 20 26 32 38
// 15 21 27 33 39
// 16 22 28 34 40
// y after Sgemv :
// 115 // [11 17 23 29 35] [1]
// 120 // [12 18 24 30 36] [1]
// 125 // [13 19 25 31 37]* [1]
// 130 // [14 20 26 32 38] [1]
// 135 // [15 21 27 33 39] [1]
// 140 // [16 22 28 34 40]
A = αxy^T + A or A = αxy^H + A,
}
cudaFree ( d_a ); // free device memory
cudaFree ( d_x ); // free device memory
cudaFree ( d_y ); // free device memory
cublasDestroy ( handle ); // destroy CUBLAS context
free ( a ); // free host memory
free ( x ); // free host memory
free ( y ); // free host memory
return EXIT_SUCCESS ;
}
// a :
// 11 17 23 29 35
// 12 18 24 30 36
// 13 19 25 31 37
// 14 20 26 32 38
// 15 21 27 33 39
// 16 22 28 34 40
// a after Sger :
// 13 19 25 31 37
// 14 20 26 32 38
// 15 21 27 33 39
// 16 22 28 34 40
// 17 23 29 35 41
// 18 24 30 36 42
cudaMallocManaged (& y , m * sizeof ( float )); // unified mem . for y
// define an mxn matrix a column by column
int ind =11; // a :
for ( j =0; j < n ; j ++){ // 11 ,17 ,23 ,29 ,35
for ( i =0; i < m ; i ++){ // 12 ,18 ,24 ,30 ,36
a [ IDX2C (i ,j , m )]=( float ) ind ++; // 13 ,19 ,25 ,31 ,37
} // 14 ,20 ,26 ,32 ,38
} // 15 ,21 ,27 ,33 ,39
// 16 ,22 ,28 ,34 ,40
printf ( " a :\ n " );
for ( i =0; i < m ; i ++){
for ( j =0; j < n ; j ++){
printf ( " %4.0 f " ,a [ IDX2C (i ,j , m )]); // print a row by row
}
printf ( " \ n " );
}
for ( i =0; i < m ; i ++) x [ i ]=1.0 f ; // x ={1 ,1 ,1 ,1 ,1 ,1}^ T
for ( i =0; i < n ; i ++) y [ i ]=1.0 f ; // y ={1 ,1 ,1 ,1 ,1}^ T
cublasCreate (& handle ); // initialize CUBLAS context
float al =2.0 f ; // al =2
// rank -1 update of a : a = al * x * y ^ T + a
// a - mxn matrix ; x -m - vector , y - n - vector ; al - scalar
cublasSger(handle,m,n,&al,x,1,y,1,a,m);
cudaDeviceSynchronize();
// a :
// 11 17 23 29 35
// 12 18 24 30 36
// 13 19 25 31 37
// 14 20 26 32 38
// 15 21 27 33 39
// 16 22 28 34 40
// a after Sger :
// 13 19 25 31 37
// 14 20 26 32 38
// 15 21 27 33 39
// 16 22 28 34 40
// 17 23 29 35 41
// 18 24 30 36 42
y = α Ax + βy,
// y after Ssbmv :
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cudaFree ( y ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
}
// y after Ssbmv :
// 28 // [11 17 ] [1] [28]
// 47 // [17 12 18 ] [1] [47]
// 50 // [ 18 13 19 ] [1] = [50]
// 53 // [ 19 14 20 ] [1] [53]
// 56 // [ 20 15 21] [1] [56]
// 37 // [ 21 16] [1] [37]
y = α Ax + βy,
// 11 12 13 14 15 16
// 17 18 19 20 21
// 22 23 24 25
// 26 27 28
// 29 30
// 31
// y after Sspmv :
// 81 // [11 12 13 14 15 16] [1] [ 81]
// 107 // [12 17 18 19 20 21] [1] [107]
// 125 // [13 18 22 23 24 25] [1] = [125]
// 137 // [14 19 23 26 27 28] [1] [137]
// 145 // [15 20 24 27 29 30] [1] [145]
// 151 // [16 21 25 28 30 31] [1] [151]
cudaDeviceSynchronize();
printf ( " y after Sspmv :\ n " ); // print y after Sspmv
for ( j =0; j < n ; j ++){
printf ( " %7.0 f " ,y [ j ]);
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cudaFree ( y ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// upper triangle of a :
// 11 12 13 14 15 16
// 17 18 19 20 21
// 22 23 24 25
// 26 27 28
// 29 30
// 31
// y after Sspmv :
// 81 // [11 12 13 14 15 16] [1] [ 81]
// 107 // [12 17 18 19 20 21] [1] [107]
// 125 // [13 18 22 23 24 25] [1] = [125]
// 137 // [14 19 23 26 27 28] [1] [137]
// 145 // [15 20 24 27 29 30] [1] [145]
// 151 // [16 21 25 28 30 31] [1] [151]
A = αxx^T + A,
A = α(xy^T + yx^T) + A,
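// the Sspr2 call itself does not appear in this excerpt; a sketch consistent
// with the packed matrix a and the vectors x , y used below ( the scalar
// name al is an assumption ):
cublasSspr2(handle,CUBLAS_FILL_MODE_UPPER,n,&al,x,1,y,1,a);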
cudaDeviceSynchronize();
// print the updated upper triangle of a row by row
printf ( " upper triangle of updated a after Sspr2 :\ n " );
l = n ; j =0; m =0;
while (l >0){
for ( i =0; i < m ; i ++) printf ( " " );
for ( i = j ;i < j + l ; i ++) printf ( " %3.0 f " ,a [ i ]);
printf ( " \ n " );
m ++; j = j + l ;l - -;
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cudaFree ( y ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// upper triangle of a :
// 11 12 13 14 15 16
// 17 18 19 20 21
// 22 23 24 25
// 26 27 28
// 29 30
// 31
y = αAx + βy,
where A is an n × n symmetric matrix, x, y are vectors and α, β are scalars.
The matrix A can be stored in lower (CUBLAS_FILL_MODE_LOWER) or upper
(CUBLAS_FILL_MODE_UPPER) mode.
// y after Ssymv :
// 82
// 108
// 126
// 138
// 146
// 152
//
// [11 12 13 14 15 16] [1] [1] [ 82]
// [12 17 18 19 20 21] [1] [1] [108]
// 1*[13 18 22 23 24 25]*[1] + 1*[1] = [126]
// [14 19 23 26 27 28] [1] [1] [138]
// [15 20 24 27 29 30] [1] [1] [146]
// [16 21 25 28 30 31] [1] [1] [152]
float * y ; // n - vector
cudaMallocManaged (& a , n * n * sizeof ( float )); // unif . memory for a
cudaMallocManaged (& x , n * sizeof ( float )); // unif . memory for x
cudaMallocManaged (& y , n * sizeof ( float )); // unif . memory for y
// define the lower triangle of an nxn symmetric matrix a
// in lower mode column by column
int ind =11; // a :
for ( j =0; j < n ; j ++){ // 11
for ( i =0; i < n ; i ++){ // 12 ,17
if (i >= j ){ // 13 ,18 ,22
a [ IDX2C (i ,j , n )]=( float ) ind ++; // 14 ,19 ,23 ,26
} // 15 ,20 ,24 ,27 ,29
} // 16 ,21 ,25 ,28 ,30 ,31
}
// print the lower triangle of a row by row
printf ( " lower triangle of a :\ n " );
for ( i =0; i < n ; i ++){
for ( j =0; j < n ; j ++){
if (i >= j )
printf ( " %5.0 f " ,a [ IDX2C (i ,j , n )]);
}
printf ( " \ n " );
}
for ( i =0; i < n ; i ++){ x [ i ]=1.0 f ; y [ i ]=1.0;} // x ={1 ,1 ,1 ,1 ,1 ,1}^ T
// y ={1 ,1 ,1 ,1 ,1 ,1}^ T
cublasCreate (& handle ); // initialize CUBLAS context
float al =1.0 f ; // al =1.0
float bet =1.0 f ; // bet =1.0
// symmetric matrix - vector multiplication :
// y = al * a * x + bet * y
// a - nxn symmetric matrix ; x , y - n - vectors ;
// al , bet - scalars
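// the Ssymv call and synchronization are omitted in this excerpt; a sketch
// consistent with the names defined above ( handle , n , al , bet , a , x , y ):
cublasSsymv(handle,CUBLAS_FILL_MODE_LOWER,n,&al,a,n,x,1,&bet,y,1);
cudaDeviceSynchronize();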
// y after Ssymv :
// 82
// 108
// 126
// 138
// 146
// 152
//
// [11 12 13 14 15 16] [1] [1] [ 82]
// [12 17 18 19 20 21] [1] [1] [108]
// 1*[13 18 22 23 24 25]*[1] + 1*[1] = [126]
// [14 19 23 26 27 28] [1] [1] [138]
// [15 20 24 27 29 30] [1] [1] [146]
// [16 21 25 28 30 31] [1] [1] [152]
A = αxx^T + A,
where A is an n × n symmetric matrix, x is a vector and α is a scalar.
A is stored in column-major format in lower (CUBLAS_FILL_MODE_LOWER) or
upper (CUBLAS_FILL_MODE_UPPER) mode.
}
// print the lower triangle of a row by row
printf ( " lower triangle of a :\ n " );
for ( i =0; i < n ; i ++){
for ( j =0; j < n ; j ++){
if (i >= j )
printf ( " %5.0 f " ,a [ IDX2C (i ,j , n )]);
}
printf ( " \ n " );
}
for ( i =0; i < n ; i ++){ x [ i ]=1.0 f ;} // x ={1 ,1 ,1 ,1 ,1 ,1}^ T
// on the device
float * d_a ; // d_a - a on the device
float * d_x ; // d_x - x on the device
cudaStat = cudaMalloc (( void **)& d_a , n * n * sizeof (* a )); // device
// memory alloc for a
cudaStat = cudaMalloc (( void **)& d_x , n * sizeof (* x )); // device
// memory alloc for x
stat = cublasCreate (& handle ); // initialize CUBLAS context
stat = cublasSetMatrix (n ,n , sizeof (* a ) ,a ,n , d_a , n ); // a -> d_a
stat = cublasSetVector (n , sizeof (* x ) ,x ,1 , d_x ,1); // cp x - > d_x
float al =1.0 f ; // al =1.0
// symmetric rank -1 update of d_a : d_a = al * d_x * d_x ^ T + d_a
// d_a - nxn symmetric matrix ; d_x - n - vector ; al - scalar
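// the Ssyr call itself is omitted in this excerpt; a sketch consistent
// with the device arrays d_a , d_x and the scalar al defined above:
stat=cublasSsyr(handle,CUBLAS_FILL_MODE_LOWER,n,&al,d_x,1,d_a,n);
stat=cublasGetMatrix(n,n,sizeof(*a),d_a,n,a,n); // copy d_a -> a for printing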
cudaDeviceSynchronize();
// print the lower triangle of the updated a after Ssyr
printf ( " lower triangle of updated a after Ssyr :\ n " );
for ( i =0; i < n ; i ++){
for ( j =0; j < n ; j ++){
if (i >= j )
printf ( " %5.0 f " ,a [ IDX2C (i ,j , n )]);
}
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// lower triangle of a :
// 11
// 12 17
// 13 18 22
// 14 19 23 26
// 15 20 24 27 29
// 16 21 25 28 30 31
A = α(xy^T + yx^T) + A,
// lower triangle of a :
// 11
// 12 17
// 13 18 22
// 14 19 23 26
// 15 20 24 27 29
// 16 21 25 28 30 31
cudaDeviceSynchronize();
// print the lower triangle of the updated a
printf ( " lower triangle of a after Ssyr2 :\ n " );
for ( i =0; i < n ; i ++){
for ( j =0; j < n ; j ++){
if (i >= j )
printf ( " %5.0 f " ,a [ IDX2C (i ,j , n )]);
}
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
// lower triangle of a :
// 11
// 12 17
// 13 18 22
// 14 19 23 26
// 15 20 24 27 29
// 16 21 25 28 30 31
x = op(A)x,
// x after Stbmv :
// 11 // [11 0 0 0 0 0] [1]
// 29 // [17 12 0 0 0 0] [1]
// 31 // = [ 0 18 13 0 0 0]*[1]
// 33 // [ 0 0 19 14 0 0] [1]
// 35 // [ 0 0 0 20 15 0] [1]
// 37 // [ 0 0 0 0 21 16] [1]
// x after Stbmv :
// 11 // [11 0 0 0 0 0] [1]
// 29 // [17 12 0 0 0 0] [1]
// 31 // = [ 0 18 13 0 0 0]*[1]
// 33 // [ 0 0 19 14 0 0] [1]
// 35 // [ 0 0 0 20 15 0] [1]
// 37 // [ 0 0 0 0 21 16] [1]
x = op(A)x,
cudaDeviceSynchronize();
printf ( " x after Stpmv :\ n " ); // print x after Stpmv
for ( j =0; j < n ; j ++){
printf ( " %7.0 f " ,x [ j ]);
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// x after Stpmv :
// 11 // [11 0 0 0 0 0] [1]
// 29 // [12 17 0 0 0 0] [1]
// 53 // = [13 18 22 0 0 0]*[1]
// 82 // [14 19 23 26 0 0] [1]
// 115 // [15 20 24 27 29 0] [1]
// 151 // [16 21 25 28 30 31] [1]
op(A)x = b,
cudaDeviceSynchronize();
printf ( " multiplication result :\ n " ); // print x after Strmv
for ( j =0; j < n ; j ++){
printf ( " %7.0 f " ,x [ j ]);
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// multiplication result :
// 11 // [11 0 0 0 0 0] [1]
// 29 // [12 17 0 0 0 0] [1]
// 53 // = [13 18 22 0 0 0]*[1]
// 82 // [14 19 23 26 0 0] [1]
// 115 // [15 20 24 27 29 0] [1]
// 151 // [16 21 25 28 30 31] [1]
op(A)x = b,
cudaDeviceSynchronize();
printf ( " solution :\ n " ); // print x after Strsv
for ( j =0; j < n ; j ++){
printf ( " %9.6 f " ,x [ j ]);
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// solution :
y = αAx + βy,
// y after Chemv :
// 82+0* I
// 108+0* I
// 126+0* I
// 138+0* I
// 146+0* I
// 152+0* I
//
// [11 12 13 14 15 16] [1] [1] [ 82]
// [12 17 18 19 20 21] [1] [1] [108]
// 1*[13 18 22 23 24 25]*[1] + 1*[1] = [126]
// [14 19 23 26 27 28] [1] [1] [138]
// [15 20 24 27 29 30] [1] [1] [146]
// [16 21 25 28 30 31] [1] [1] [152]
// y after Chemv :
// 82+0* I
// 108+0* I
// 126+0* I
// 138+0* I
// 146+0* I
// 152+0* I
//
// [11 12 13 14 15 16] [1] [1] [ 82]
// [12 17 18 19 20 21] [1] [1] [108]
// 1*[13 18 22 23 24 25]*[1] + 1*[1] = [126]
// [14 19 23 26 27 28] [1] [1] [138]
// [15 20 24 27 29 30] [1] [1] [146]
// [16 21 25 28 30 31] [1] [1] [152]
y = αAx + βy,
y = αAx + βy,
// y after Chpmv :
// 81+0* I // [11 12 13 14 15 16] [1] [0] [ 81]
// 107+0* I // [12 17 18 19 20 21] [1] [0] [107]
// 125+0* I // 1*[13 18 22 23 24 25]*[1] + 1*[0] = [125]
// 137+0* I // [14 19 23 26 27 28] [1] [0] [137]
// 145+0* I // [15 20 24 27 29 30] [1] [0] [145]
// 151+0* I // [16 21 25 28 30 31] [1] [0] [151]
// y after Chpmv :
// 81+0* I // [11 12 13 14 15 16] [1] [0] [ 81]
// 107+0* I // [12 17 18 19 20 21] [1] [0] [107]
// 125+0* I // 1*[13 18 22 23 24 25]*[1] + 1*[0] = [125]
// 137+0* I // [14 19 23 26 27 28] [1] [0] [137]
// 145+0* I // [15 20 24 27 29 30] [1] [0] [145]
// 151+0* I // [16 21 25 28 30 31] [1] [0] [151]
A = αxx^H + A,
// on the device
cuComplex * d_a ; // d_a - a on the device
cuComplex * d_x ; // d_x - x on the device
cudaStat = cudaMalloc (( void **)& d_a , n * n * sizeof ( cuComplex ));
// device memory alloc for a
cudaStat = cudaMalloc (( void **)& d_x , n * sizeof ( cuComplex ));
// device memory alloc for x
stat = cublasCreate (& handle ); // initialize CUBLAS context
// copy the matrix and vector from the host to the device
stat = cublasSetMatrix (n ,n , sizeof (* a ) ,a ,n , d_a , n ); // a -> d_a
stat = cublasSetVector (n , sizeof (* x ) ,x ,1 , d_x ,1); // x -> d_x
float al =1.0 f ; // al =1
// rank -1 update of the Hermitian matrix d_a :
// d_a = al * d_x * d_x ^ H + d_a
// d_a - nxn Hermitian matrix ; d_x - n - vector ; al - scalar
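// the Cher call itself is omitted in this excerpt; a sketch consistent with
// the device arrays d_a , d_x and the real scalar al defined above:
stat=cublasCher(handle,CUBLAS_FILL_MODE_LOWER,n,&al,d_x,1,d_a,n);
stat=cublasGetMatrix(n,n,sizeof(*a),d_a,n,a,n); // copy d_a -> a for printing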
cudaDeviceSynchronize();
// print the lower triangle of updated a
printf ( " lower triangle of updated a after Cher :\ n " );
for ( i =0; i < n ; i ++){
for ( j =0; j < n ; j ++){
if (i >= j )
printf ( " %5.0 f +%1.0 f * I " ,a [ IDX2C (i ,j , n )]. x , a [ IDX2C (i ,j , n )]. y );
}
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// lower triangle of a :
// 11+0* I
// 12+0* I 17+0* I
// 13+0* I 18+0* I 22+0* I
// 14+0* I 19+0* I 23+0* I 26+0* I
// 15+0* I 20+0* I 24+0* I 27+0* I 29+0* I
// 16+0* I 21+0* I 25+0* I 28+0* I 30+0* I 31+0* I
A = αxy^H + ᾱyx^H + A,
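// the Cher2 call does not appear in this excerpt; a sketch consistent with
// the unified-memory arrays a , x , y used below ( the cuComplex scalar al
// is an assumption ):
cublasCher2(handle,CUBLAS_FILL_MODE_LOWER,n,&al,x,1,y,1,a,n);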
cudaDeviceSynchronize();
// print the lower triangle of updated a
printf ( " lower triangle of updated a after Cher2 :\ n " );
for ( i =0; i < n ; i ++){
for ( j =0; j < n ; j ++){
if (i >= j )
printf ( " %5.0 f +%1.0 f * I " ,a [ IDX2C (i ,j , n )]. x , a [ IDX2C (i ,j , n )]. y );
}
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( x ); // free memory
cudaFree ( y ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// lower triangle of a :
// 11+0* I
// 12+0* I 17+0* I
// 13+0* I 18+0* I 22+0* I
// 14+0* I 19+0* I 23+0* I 26+0* I
// 15+0* I 20+0* I 24+0* I 27+0* I 29+0* I
// 16+0* I 21+0* I 25+0* I 28+0* I 30+0* I 31+0* I
A = αxx^H + A,
A = αxy^H + ᾱyx^H + A,
C = αop(A)op(B) + βC,
ind =11; // b:
for ( j =0; j < n ; j ++){ // 11 ,16 ,21 ,26
for ( i =0; i < k ; i ++){ // 12 ,17 ,22 ,27
b [ IDX2C (i ,j , k )]=( float ) ind ++; // 13 ,18 ,23 ,28
} // 14 ,19 ,24 ,29
} // 15 ,20 ,25 ,30
// print b row by row
printf ( " b :\ n " );
for ( i =0; i < k ; i ++){
for ( j =0; j < n ; j ++){
printf ( " %5.0 f " ,b [ IDX2C (i ,j , k )]);
}
printf ( " \ n " );
}
// define an mxn matrix c column by column
ind =11; // c:
for ( j =0; j < n ; j ++){ // 11 ,17 ,23 ,29
for ( i =0; i < m ; i ++){ // 12 ,18 ,24 ,30
c [ IDX2C (i ,j , m )]=( float ) ind ++; // 13 ,19 ,25 ,31
} // 14 ,20 ,26 ,32
} // 15 ,21 ,27 ,33
// 16 ,22 ,28 ,34
// print c row by row
printf ( " c :\ n " );
for ( i =0; i < m ; i ++){
for ( j =0; j < n ; j ++){
printf ( " %5.0 f " ,c [ IDX2C (i ,j , m )]);
}
printf ( " \ n " );
}
// on the device
float * d_a ; // d_a - a on the device
float * d_b ; // d_b - b on the device
float * d_c ; // d_c - c on the device
cudaStat = cudaMalloc (( void **)& d_a , m * k * sizeof (* a )); // device
// memory alloc for a
cudaStat = cudaMalloc (( void **)& d_b , k * n * sizeof (* b )); // device
// memory alloc for b
cudaStat = cudaMalloc (( void **)& d_c , m * n * sizeof (* c )); // device
// memory alloc for c
stat = cublasCreate (& handle ); // initialize CUBLAS context
// copy matrices from the host to the device
stat = cublasSetMatrix (m ,k , sizeof (* a ) ,a ,m , d_a , m ); // a -> d_a
stat = cublasSetMatrix (k ,n , sizeof (* b ) ,b ,k , d_b , k ); // b -> d_b
stat = cublasSetMatrix (m ,n , sizeof (* c ) ,c ,m , d_c , m ); // c -> d_c
float al =1.0 f ; // al =1
float bet =1.0 f ; // bet =1
// matrix - matrix multiplication : d_c = al * d_a * d_b + bet * d_c
// d_a - mxk matrix , d_b - kxn matrix , d_c - mxn matrix ;
// al , bet - scalars
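// the Sgemm call itself is omitted in this excerpt; a sketch consistent
// with the device arrays and scalars defined above:
stat=cublasSgemm(handle,CUBLAS_OP_N,CUBLAS_OP_N,m,n,k,&al,d_a,m,d_b,k,&bet,d_c,m);
stat=cublasGetMatrix(m,n,sizeof(*c),d_c,m,c,m); // copy d_c -> c for printing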
// c after Sgemm :
// 1566 2147 2728 3309
// 1632 2238 2844 3450
// 1698 2329 2960 3591 // c = al * a * b + bet * c
// 1764 2420 3076 3732
// 1830 2511 3192 3873
// 1896 2602 3308 4014
// c :
// 11 17 23 29
// 12 18 24 30
// 13 19 25 31
// 14 20 26 32
// 15 21 27 33
// 16 22 28 34
// c after Sgemm :
// 1566 2147 2728 3309
// 1632 2238 2844 3450
// 1698 2329 2960 3591 // c = al * a * b + bet * c
// 1764 2420 3076 3732
// 1830 2511 3192 3873
// 1896 2602 3308 4014
cudaDeviceSynchronize();
printf ( " c after Ssymm :\ n " ); // print c after Ssymm
for ( i =0; i < m ; i ++){
for ( j =0; j < n ; j ++){
printf ( " %7.0 f " ,c [ IDX2C (i ,j , m )]);
}
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( b ); // free memory
cudaFree ( c ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// lower triangle of a :
// 11
// 12 17
// 13 18 22
// 14 19 23 26
// 15 20 24 27 29
// 16 21 25 28 30 31
// b (= c ):
// 11 17 23 29
// 12 18 24 30
// 13 19 25 31
// 14 20 26 32
// 15 21 27 33
// 16 22 28 34
// c after Ssymm :
// 1122 1614 2106 2598
// 1484 2132 2780 3428
// 1740 2496 3252 4008 // c = al * a * b + bet * c
// 1912 2740 3568 4396
// 2025 2901 3777 4653
// 2107 3019 3931 4843
C = α op(A)op(A)^T + βC,
}
printf ( " \ n " );
}
// on the device
float * d_a ; // d_a - a on the device
float * d_c ; // d_c - c on the device
cudaStat = cudaMalloc (( void **)& d_a , n * k * sizeof (* a )); // device
// memory alloc for a
cudaStat = cudaMalloc (( void **)& d_c , n * n * sizeof (* c )); // device
// memory alloc for c
stat = cublasCreate (& handle ); // initialize CUBLAS context
// copy matrices from the host to the device
stat = cublasSetMatrix (n ,k , sizeof (* a ) ,a ,n , d_a , n ); // a -> d_a
stat = cublasSetMatrix (n ,n , sizeof (* c ) ,c ,n , d_c , n ); // c -> d_c
float al =1.0 f ; // al =1
float bet =1.0 f ; // bet =1
// symmetric rank - k update : d_c = al * d_a * d_a ^ T + bet * d_c ;
// d_c - symmetric nxn matrix , d_a - general nxk matrix ;
// al , bet - scalars
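// the Ssyrk call itself is omitted in this excerpt; a sketch consistent
// with the device arrays and scalars defined above:
stat=cublasSsyrk(handle,CUBLAS_FILL_MODE_LOWER,CUBLAS_OP_N,n,k,&al,d_a,n,&bet,d_c,n);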
// lower triangle of c :
// 11
// 12 17
// 13 18 22
// 14 19 23 26
// 15 20 24 27 29
// 16 21 25 28 30 31
// a :
// 11 17 23 29
// 12 18 24 30
// 13 19 25 31
// 14 20 26 32
// 15 21 27 33
// 16 22 28 34
where op(A), op(B) are n×k matrices, C is a symmetric n×n matrix stored
in lower (CUBLAS_FILL_MODE_LOWER) or upper (CUBLAS_FILL_MODE_UPPER)
mode and α, β are scalars. The value of op(A) can be equal to A in the
CUBLAS_OP_N case or A^T (transposition) in the CUBLAS_OP_T case, and
similarly for op(B).
// c = al ( a * b ^ T + b * a ^ T ) + bet * c
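// the Ssyr2k call itself does not appear in this excerpt; a sketch
// consistent with the arrays a , b , c and the scalars al , bet suggested
// by the comments ( their exact definitions are not shown here ):
cublasSsyr2k(handle,CUBLAS_FILL_MODE_LOWER,CUBLAS_OP_N,n,k,&al,a,n,b,n,&bet,c,n);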
cudaDeviceSynchronize();
printf ( " lower triangle of updated c after Ssyr2k :\ n " );
for ( i =0; i < n ; i ++){
for ( j =0; j < n ; j ++){
if (i >= j ) // print the lower triangle
printf ( " %7.0 f " ,c [ IDX2C (i ,j , n )]); // of c after Ssyr2k
}
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( b ); // free memory
cudaFree ( c ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return 0;
}
// lower triangle of c :
// 11
// 12 17
// 13 18 22
// 14 19 23 26
// 15 20 24 27 29
// 16 21 25 28 30 31
// a (= b ):
// 11 17 23 29
// 12 18 24 30
// 13 19 25 31
// 14 20 26 32
// 15 21 27 33
// 16 22 28 34
// lower triangle of updated c after Ssyr2k :
// 3571
// 3732 3905
// 3893 4074 4254
// 4054 4243 4431 4618
// 4215 4412 4608 4803 4997
// 4376 4581 4785 4988 5190 5391
// c = al ( a * b ^ T + b * a ^ T ) + bet * c
// b :
// 11 17 23 29 35
// 12 18 24 30 36
// 13 19 25 31 37
// 14 20 26 32 38
// 15 21 27 33 39
// 16 22 28 34 40
// c after Strmm :
// 121 187 253 319 385
// 336 510 684 858 1032
// 645 963 1281 1599 1917 // c = al * a * b
// 1045 1537 2029 2521 3013
// 1530 2220 2910 3600 4290
// 2091 2997 3903 4809 5715
}
// define an mxn matrix b column by column
ind =11; // b:
for ( j =0; j < n ; j ++){ // 11 ,17 ,23 ,29 ,35
for ( i =0; i < m ; i ++){ // 12 ,18 ,24 ,30 ,36
b [ IDX2C (i ,j , m )]=( float ) ind ++; // 13 ,19 ,25 ,31 ,37
} // 14 ,20 ,26 ,32 ,38
} // 15 ,21 ,27 ,33 ,39
// 16 ,22 ,28 ,34 ,40
printf ( " b :\ n " );
for ( i =0; i < m ; i ++){
for ( j =0; j < n ; j ++){
printf ( " %5.0 f " ,b [ IDX2C (i ,j , m )]); // print b row by row
}
printf ( " \ n " );
}
cublasCreate (& handle ); // initialize CUBLAS context
float al =1.0 f ;
// triangular matrix - matrix multiplication : c = al * a * b ;
// a - mxm triangular matrix in lower mode ,
// b , c - mxn general matrices ; al - scalar
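// the Strmm call itself is omitted in this excerpt; a sketch consistent
// with the matrices a , b , c and the scalar al named above ( cuBLAS Strmm
// writes the product into a separate output matrix c ):
cublasStrmm(handle,CUBLAS_SIDE_LEFT,CUBLAS_FILL_MODE_LOWER,CUBLAS_OP_N,CUBLAS_DIAG_NON_UNIT,m,n,&al,a,m,b,m,c,m);
cudaDeviceSynchronize();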
// c after Strmm :
// 121 187 253 319 385
// 336 510 684 858 1032
// 645 963 1281 1599 1917 // c = al * a * b
// 1045 1537 2029 2521 3013
// 1530 2220 2910 3600 4290
// 2091 2997 3903 4809 5715
}
// print the lower triangle of a row by row
printf ( " lower triangle of a :\ n " );
for ( i =0; i < m ; i ++){
for ( j =0; j < m ; j ++){
if (i >= j )
printf ( " %5.0 f " ,a [ IDX2C (i ,j , m )]);
}
printf ( " \ n " );
}
// define an mxn matrix b column by column
ind =11; // b :
for ( j =0; j < n ; j ++){ // 11 ,17 ,23 ,29 ,35
for ( i =0; i < m ; i ++){ // 12 ,18 ,24 ,30 ,36
b [ IDX2C (i ,j , m )]=( float ) ind ; // 13 ,19 ,25 ,31 ,37
ind ++; // 14 ,20 ,26 ,32 ,38
} // 15 ,21 ,27 ,33 ,39
} // 16 ,22 ,28 ,34 ,40
printf ( " b :\ n " );
for ( i =0; i < m ; i ++){
for ( j =0; j < n ; j ++){
printf ( " %5.0 f " ,b [ IDX2C (i ,j , m )]); // print b row by row
}
printf ( " \ n " );
}
// on the device
float * d_a ; // d_a - a on the device
float * d_b ; // d_b - b on the device
cudaStat = cudaMalloc (( void **)& d_a , m * m * sizeof (* a )); // device
// memory alloc for a
cudaStat = cudaMalloc (( void **)& d_b , m * n * sizeof (* b )); // device
// memory alloc for b
stat = cublasCreate (& handle ); // initialize CUBLAS context
// copy matrices from the host to the device
stat = cublasSetMatrix (m ,m , sizeof (* a ) ,a ,m , d_a , m ); // a -> d_a
stat = cublasSetMatrix (m ,n , sizeof (* b ) ,b ,m , d_b , m ); // b -> d_b
float al =1.0 f ; // al =1
// solve d_a * x = al * d_b ; the solution x overwrites rhs d_b ;
// d_a - mxm triangular matrix in lower mode ;
// d_b , x - mxn general matrices ; al - scalar
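// the Strsm call itself is omitted in this excerpt; a sketch consistent
// with the device arrays d_a , d_b and the scalar al defined above:
stat=cublasStrsm(handle,CUBLAS_SIDE_LEFT,CUBLAS_FILL_MODE_LOWER,CUBLAS_OP_N,CUBLAS_DIAG_NON_UNIT,m,n,&al,d_a,m,d_b,m);
stat=cublasGetMatrix(m,n,sizeof(*b),d_b,m,b,m); // copy the solution d_b -> b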
cudaDeviceSynchronize();
printf ( " solution x from Strsm :\ n " );
for ( i =0; i < m ; i ++){
for ( j =0; j < n ; j ++){
printf ( " %11.5 f " ,b [ IDX2C (i ,j , m )]); // print b after Strsm
}
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( b ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// lower triangle of a :
// 11
// 12 17
// 13 18 22
// 14 19 23 26
// 15 20 24 27 29
// 16 21 25 28 30 31
// b :
// 11 17 23 29 35
// 12 18 24 30 36
// 13 19 25 31 37
// 14 20 26 32 38
// 15 21 27 33 39
// 16 22 28 34 40
// lower triangle of a :
// 11+ 0* I
// 12+ 0* I 17+ 0* I
// 13+ 0* I 18+ 0* I 22+ 0* I
// 14+ 0* I 19+ 0* I 23+ 0* I 26+ 0* I
// 15+ 0* I 20+ 0* I 24+ 0* I 27+ 0* I 29+ 0* I
// 16+ 0* I 21+ 0* I 25+ 0* I 28+ 0* I 30+ 0* I 31+ 0* I
// b , c :
// c after Chemm :
// 1122+0* I 1614+0* I 2106+0* I 2598+0* I 3090+0* I //
// 1484+0* I 2132+0* I 2780+0* I 3428+0* I 4076+0* I //
// 1740+0* I 2496+0* I 3252+0* I 4008+0* I 4764+0* I // c=a*b+c
// 1912+0* I 2740+0* I 3568+0* I 4396+0* I 5224+0* I //
// 2025+0* I 2901+0* I 3777+0* I 4653+0* I 5529+0* I //
// 2107+0* I 3019+0* I 3931+0* I 4843+0* I 5755+0* I //
a [ IDX2C (i ,j , m )]. y );
}
printf ( " \ n " );
}
// define mxn matrices b , c column by column
ind =11; // b , c :
for ( j =0; j < n ; j ++){ // 11 ,17 ,23 ,29 ,35
for ( i =0; i < m ; i ++){ // 12 ,18 ,24 ,30 ,36
b [ IDX2C (i ,j , m )]. x =( float ) ind ; // 13 ,19 ,25 ,31 ,37
b [ IDX2C (i ,j , m )]. y =0.0 f ; // 14 ,20 ,26 ,32 ,38
c [ IDX2C (i ,j , m )]. x =( float ) ind ; // 15 ,21 ,27 ,33 ,39
c [ IDX2C (i ,j , m )]. y =0.0 f ; // 16 ,22 ,28 ,34 ,40
ind ++;
}
}
// print b (= c ) row by row
printf ( "b , c :\ n " );
for ( i =0; i < m ; i ++){
for ( j =0; j < n ; j ++){
printf ( " %5.0 f +%2.0 f * I " ,b [ IDX2C (i ,j , m )]. x ,
b [ IDX2C (i ,j , m )]. y );
}
printf ( " \ n " );
}
cublasCreate (& handle ); // initialize CUBLAS context
cuComplex al ={1.0 f ,0.0 f }; // al =1
cuComplex bet ={1.0 f ,0.0 f }; // bet =1
// Hermitian matrix - matrix multiplication :
// c = al * a * b + bet * c ;
// a - mxm Hermitian matrix ; b , c - mxn general matrices ;
// al , bet - scalars
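// the Chemm call itself is omitted in this excerpt; a sketch consistent
// with the unified-memory matrices a , b , c and the scalars al , bet above:
cublasChemm(handle,CUBLAS_SIDE_LEFT,CUBLAS_FILL_MODE_LOWER,m,n,&al,a,m,b,m,&bet,c,m);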
cudaDeviceSynchronize();
printf ( " c after Chemm :\ n " );
for ( i =0; i < m ; i ++){
for ( j =0; j < n ; j ++){ // print c after Chemm
printf ( " %5.0 f +%1.0 f * I " ,c [ IDX2C (i ,j , m )]. x ,
c [ IDX2C (i ,j , m )]. y );
}
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( b ); // free memory
cudaFree ( c ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// lower triangle of a :
// 11+ 0* I
// 12+ 0* I 17+ 0* I
// 13+ 0* I 18+ 0* I 22+ 0* I
// 14+ 0* I 19+ 0* I 23+ 0* I 26+ 0* I
// 15+ 0* I 20+ 0* I 24+ 0* I 27+ 0* I 29+ 0* I
// 16+ 0* I 21+ 0* I 25+ 0* I 28+ 0* I 30+ 0* I 31+ 0* I
// b , c :
// 11+ 0* I 17+ 0* I 23+ 0* I 29+ 0* I 35+ 0* I
// 12+ 0* I 18+ 0* I 24+ 0* I 30+ 0* I 36+ 0* I
// 13+ 0* I 19+ 0* I 25+ 0* I 31+ 0* I 37+ 0* I
// 14+ 0* I 20+ 0* I 26+ 0* I 32+ 0* I 38+ 0* I
// 15+ 0* I 21+ 0* I 27+ 0* I 33+ 0* I 39+ 0* I
// 16+ 0* I 22+ 0* I 28+ 0* I 34+ 0* I 40+ 0* I
// c after Chemm :
// 1122+0* I 1614+0* I 2106+0* I 2598+0* I 3090+0* I //
// 1484+0* I 2132+0* I 2780+0* I 3428+0* I 4076+0* I //
// 1740+0* I 2496+0* I 3252+0* I 4008+0* I 4764+0* I // c=a*b+c
// 1912+0* I 2740+0* I 3568+0* I 4396+0* I 5224+0* I //
// 2025+0* I 2901+0* I 3777+0* I 4653+0* I 5529+0* I //
// 2107+0* I 3019+0* I 3931+0* I 4843+0* I 5755+0* I //
C = α op(A)op(A)^H + βC,
// on the device
cuComplex * d_a ; // d_a - a on the device
cuComplex * d_c ; // d_c - c on the device
cudaStat = cudaMalloc (( void **)& d_a , n * k * sizeof ( cuComplex ));
// device memory alloc for a
cudaStat = cudaMalloc (( void **)& d_c , n * n * sizeof ( cuComplex ));
// device memory alloc for c
stat = cublasCreate (& handle ); // initialize CUBLAS context
}
printf ( " \ n " );
}
if (i >= j )
printf ( " %5.0 f +%2.0 f * I " ,c [ IDX2C (i ,j , n )]. x ,
c [ IDX2C (i ,j , n )]. y );
}
printf ( " \ n " );
}
// define nxk matrices a , b column by column
ind =11; // a , b :
for ( j =0; j < k ; j ++){ // 11 ,17 ,23 ,29 ,35
for ( i =0; i < n ; i ++){ // 12 ,18 ,24 ,30 ,36
a [ IDX2C (i ,j , n )]. x =( float ) ind ; // 13 ,19 ,25 ,31 ,37
a [ IDX2C (i ,j , n )]. y =0.0 f ; // 14 ,20 ,26 ,32 ,38
b [ IDX2C (i ,j , n )]. x =( float ) ind ++; // 15 ,21 ,27 ,33 ,39
b [ IDX2C (i ,j , n )]. y =0.0 f ; // 16 ,22 ,28 ,34 ,40
}
}
// print a (= b ) row by row
printf ( "a , b :\ n " );
for ( i =0; i < n ; i ++){
for ( j =0; j < k ; j ++){
printf ( " %5.0 f +%2.0 f * I " ,a [ IDX2C (i ,j , n )]. x ,
a [ IDX2C (i ,j , n )]. y );
}
printf ( " \ n " );
}
// on the device
cuComplex * d_a ; // d_a - a on the device
cuComplex * d_b ; // d_b - b on the device
cuComplex * d_c ; // d_c - c on the device
cudaStat = cudaMalloc (( void **)& d_a , n * k * sizeof ( cuComplex ));
// device memory alloc for a
cudaStat = cudaMalloc (( void **)& d_b , n * k * sizeof ( cuComplex ));
// device memory alloc for b
cudaStat = cudaMalloc (( void **)& d_c , n * n * sizeof ( cuComplex ));
// device memory alloc for c
stat = cublasCreate (& handle ); // initialize CUBLAS context
stat = cublasSetMatrix (n ,k , sizeof (* a ) ,a ,n , d_a , n ); // a -> d_a
stat = cublasSetMatrix (n ,k , sizeof (* a ) ,b ,n , d_b , n ); // b -> d_b
stat = cublasSetMatrix (n ,n , sizeof (* c ) ,c ,n , d_c , n ); // c -> d_c
cuComplex al ={1.0 f ,0.0 f }; // al =1
float bet =1.0 f ; // bet =1
// Hermitian rank -2 k update :
// d_c = al * d_a * d_b ^ H +\ bar { al }* d_b * d_a ^ H + bet * d_c
// d_c - nxn , hermitian matrix ; d_a , d_b - nxk general matrices ;
// al , bet - scalars
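// the Cher2k call itself is omitted in this excerpt; a sketch consistent
// with the device arrays and scalars defined above ( note that beta is a
// real scalar for Cher2k ):
stat=cublasCher2k(handle,CUBLAS_FILL_MODE_LOWER,CUBLAS_OP_N,n,k,&al,d_a,n,d_b,n,&bet,d_c,n);
stat=cublasGetMatrix(n,n,sizeof(*c),d_c,n,c,n); // copy d_c -> c for printing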
cudaDeviceSynchronize();
// print the updated lower triangle of c row by row
printf ( " lower triangle of c after Cher2k :\ n " );
for ( i =0; i < n ; i ++){
for ( j =0; j < n ; j ++){ // print c after Cher2k
if (i >= j )
printf ( " %6.0 f +%2.0 f * I " ,c [ IDX2C (i ,j , n )]. x ,
c [ IDX2C (i ,j , n )]. y );
}
printf ( " \ n " );
}
cudaFree ( a ); // free memory
cudaFree ( b ); // free memory
cudaFree ( c ); // free memory
cublasDestroy ( handle ); // destroy CUBLAS context
return EXIT_SUCCESS ;
}
// lower triangle of c :
// 11+ 0* I
// 12+ 0* I 17+ 0* I
// 13+ 0* I 18+ 0* I 22+ 0* I
// 14+ 0* I 19+ 0* I 23+ 0* I 26+ 0* I
// 15+ 0* I 20+ 0* I 24+ 0* I 27+ 0* I 29+ 0* I
// 16+ 0* I 21+ 0* I 25+ 0* I 28+ 0* I 30+ 0* I 31+ 0* I
// a , b :
// 11+ 0* I 17+ 0* I 23+ 0* I 29+ 0* I 35+ 0* I
// 12+ 0* I 18+ 0* I 24+ 0* I 30+ 0* I 36+ 0* I
// 13+ 0* I 19+ 0* I 25+ 0* I 31+ 0* I 37+ 0* I
// 14+ 0* I 20+ 0* I 26+ 0* I 32+ 0* I 38+ 0* I
// 15+ 0* I 21+ 0* I 27+ 0* I 33+ 0* I 39+ 0* I
// 16+ 0* I 22+ 0* I 28+ 0* I 34+ 0* I 40+ 0* I
CUSOLVER by example
• C - complex single-precision,
• Z - complex double-precision.
cudaStatus = cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // timer stop
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " getrf + getrs time : % lf sec .\ n " , accum ); // print el . time
cudaStatus = cudaMemcpy (& info_gpu , d_info , sizeof ( int ) ,
cudaMemcpyDeviceToHost ); // d_info -> info_gpu
printf ( " after getrf + getrs : info_gpu = % d \ n " , info_gpu );
cudaStatus = cudaMemcpy ( B1 , d_B , N * sizeof ( float ) ,
cudaMemcpyDeviceToHost ); // copy d_B - > B1
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B1 [ i ]);
printf ( " ... " ); // print first components of the solution
printf ( " \ n " );
// free memory
cudaStatus = cudaFree ( d_A );
cudaStatus = cudaFree ( d_B );
cudaMallocManaged (& pivot , N * sizeof ( int )); // unif . mem . for pivot
cudaMallocManaged (& info , sizeof ( int )); // unif . mem . for info
cusolverStatus = cusolverDnSgetrf_bufferSize ( handle , N , N ,
A , N , & Lwork ); // compute buffer size and prep . memory
cudaMallocManaged (& Work , Lwork * sizeof ( float ));
clock_gettime ( CLOCK_REALTIME ,& start ); // timer start
// LU factorization of A , with partial pivoting and row
// interchanges ; row i is interchanged with row pivot ( i );
cusolverStatus = cusolverDnSgetrf(handle,N,N,A,N,Work,
pivot, info);
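// the Sgetrs call is not shown in this excerpt; with the unified-memory
// arrays above and a single right-hand side ( an assumption ), solving
// A * x = B would presumably read:
cusolverStatus = cusolverDnSgetrs(handle, CUBLAS_OP_N, N, 1, A, N, pivot, B, N, info);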
cudaStatus = cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // timer stop
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " getrf + getrs time : % lf sec .\ n " , accum ); // pr . elaps . time
printf ( " after getrf + getrs : info = % d \ n " , * info );
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B [ i ]);
printf ( " ... " ); // print first components of the solution
printf ( " \ n " );
// free memory
cudaStatus = cudaFree ( A );
cudaStatus = cudaFree ( B );
cudaStatus = cudaFree ( pivot );
cudaStatus = cudaFree ( info );
cudaStatus = cudaFree ( Work );
cusolverStatus = cusolverDnDestroy ( handle );
cudaStatus = cudaDeviceReset ();
return 0;
}
// getrf + getrs time : 0.295533 sec .
// after getrf + getrs : info = 0
// solution : 1.04225 0.873826 , 1.05703 , 1.03822 , 0.883831 , ...
cudaStatus = cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // timer stop
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " getrf + getrs time : % lf sec .\ n " , accum ); // pr . elaps . time
cudaStatus = cudaMemcpy (& info_gpu , d_info , sizeof ( int ) ,
cudaMemcpyDeviceToHost ); // d_info -> info_gpu
printf ( " after getrf + getrs : info_gpu = % d \ n " , info_gpu );
cudaStatus = cudaMemcpy ( B1 , d_B , N * sizeof ( double ) ,
cudaMemcpyDeviceToHost ); // copy d_B - > B1
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B1 [ i ]);
printf ( " ... " ); // print first components of the solution
printf ( " \ n " );
// free memory
cudaStatus = cudaFree ( d_A );
cudaStatus = cudaFree ( d_B );
cudaStatus = cudaFree ( d_pivot );
cudaStatus = cudaFree ( d_info );
cudaStatus = cudaFree ( d_Work );
free ( A ); free ( B ); free ( B1 );
cusolverStatus = cusolverDnDestroy ( handle );
cudaStatus = cudaDeviceReset ();
return 0;
}
// getrf + getrs time : 1.511761 sec .
// after getrf + getrs : info_gpu = 0
// solution : 1 , 1 , 1 , 1 , 1 , ...
cusolverStatus = cusolverDnDgetrf(handle,N,N,A,N,Work,
pivot, info);
// use the LU factorization to solve the system A * x = B ;
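// the Dgetrs call itself is omitted in this excerpt; with the names above
// and a single right-hand side ( an assumption ) it would presumably read:
cusolverStatus = cusolverDnDgetrs(handle, CUBLAS_OP_N, N, 1, A, N, pivot, B, N, info);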
cudaStatus = cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // timer stop
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " getrf + getrs time : % lf sec .\ n " , accum ); // pr . elaps . time
printf ( " after getrf + getrs : info = % d \ n " , * info );
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B [ i ]);
printf ( " ... " ); // print first components of the solution
printf ( " \ n " );
// free memory
cudaStatus = cudaFree ( A );
cudaStatus = cudaFree ( B );
cudaStatus = cudaFree ( pivot );
cudaStatus = cudaFree ( info );
cudaStatus = cudaFree ( Work );
cusolverStatus = cusolverDnDestroy ( handle );
cudaStatus = cudaDeviceReset ();
return 0;
}
// getrf + getrs time : 1.595864 sec .
// after getrf + getrs : info = 0
// solution : 1 , 1 , 1 , 1 , 1 , ...
cudaFree ( d_R );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// Sqeqrf time : 0.434779 sec .
// after geqrf : info_gpu = 0
// after orgqr : info_gpu = 0
// | I - Q ** T * Q | = 2.515004 E -04
cudaFree ( Info );
cudaFree ( work );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// Sgeqrf time :0.470025
// after geqrf : info = 0
// after orgqr : info = 0
// || I - Q ^ T * Q || = 2.515004 E -04
cudaStat = cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " Dgeqrf time :% lf sec .\ n " , accum ); // print elapsed time
cudaStat = cudaDeviceSynchronize ();
cudaStat = cudaMemcpy (& info_gpu , devInfo , sizeof ( int ) ,
cudaMemcpyDeviceToHost ); // copy devInfo - > info_gpu
// check orgqr error code
printf ( " after orgqr : info_gpu = % d \ n " , info_gpu );
cudaStat = cudaMemcpy (Q , d_A , sizeof ( double )* lda *n ,
cudaMemcpyDeviceToHost ); // copy d_A - > Q
memset (R , 0 , sizeof ( double )* n * n ); // nxn matrix of zeros
for ( int j = 0 ; j < n ; j ++){
R [ j + n * j ] = 1.0; // ones on the diagonal
}
cudaStat = cudaMemcpy ( d_R , R , sizeof ( double )* n *n ,
cudaMemcpyHostToDevice ); // copy R - > d_R
// compute R = -Q ** T * Q + I
cublas_status = cublasDgemm_v2 ( cublasH , CUBLAS_OP_T , CUBLAS_OP_N ,
n , n , m , & h_minus_one , d_A , lda , d_A , lda , & h_one , d_R , n );
double dR_nrm2 = 0.0; // norm value
// compute the norm of R = -Q ** T * Q + I
cublas_status = cublasDnrm2_v2 ( cublasH , n *n , d_R ,1 ,& dR_nrm2 );
printf ( " || I - Q ^ T * Q || = % E \ n " , dR_nrm2 ); // print the norm
// free memory
cudaFree ( d_A );
cudaFree ( d_tau );
cudaFree ( devInfo );
cudaFree ( d_work );
cudaFree ( d_R );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// Dgeqrf time : 3.324072 sec .
// after geqrf : info_gpu = 0
// after orgqr : info_gpu = 0
// | I - Q ** T * Q | = 4.646390 E -13
cudaStat = cudaDeviceSynchronize ();
// check orgqr error code
printf ( " after orgqr : info = % d \ n " , * Info );
memset (R , 0 , sizeof ( double )* n * n ); // nxn matrix of zeros
for ( int j = 0 ; j < n ; j ++){
R [ j + n * j ] = 1.0; // ones on the diagonal
}
// compute R = -Q ** T * Q + I
cublas_status = cublasDgemm_v2 ( cublasH , CUBLAS_OP_T , CUBLAS_OP_N ,
n , n , m , & h_minus_one , A , lda , A , lda , & h_one , R , n );
double nrm2 = 0.0; // norm value
// compute the norm of R = -Q ** T * Q + I
cublas_status = cublasDnrm2_v2 ( cublasH , n *n ,R ,1 ,& nrm2 );
printf ( " || I - Q ^ T * Q || = % E \ n " , nrm2 ); // print the norm
// free memory
cudaFree ( A );
cudaFree ( Q );
cudaFree ( R );
cudaFree ( tau );
cudaFree ( Info );
cudaFree ( work );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// Dgeqrf time :3.398122 sec .
// after geqrf : info = 0
// after orgqr : info = 0
// || I - Q ^ T * Q || = 4.646390 E -13
cudaStat1 = cudaDeviceSynchronize ();
cudaStat1 = cudaMemcpy (X , d_B , sizeof ( float )* ldb * nrhs ,
cudaMemcpyDeviceToHost ); // copy d_B - > X
printf ( " solution : " ); // show first components of the solution
for ( int i = 0; i < 5; i ++) printf ( " %g , " , X [ i ]);
printf ( " ... " );
printf ( " \ n " );
// free memory
cudaFree ( d_A );
cudaFree ( d_tau );
cudaFree ( d_B );
cudaFree ( devInfo );
cudaFree ( d_work );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// Sgeqrf time : 0.435715 sec .
// after geqrf : info_gpu = 0
// after ormqr : info_gpu = 0
// solution : 1.00008 , 1.02025 , 1.00586 , 0.999749 , 1.00595 , ...
cudaStat1 = cudaDeviceSynchronize ();
printf ( " solution : " ); // show first components of the solution
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B [ i ]);
printf ( " ... " );
printf ( " \ n " );
// free memory
cudaFree ( A );
cudaFree ( tau );
cudaFree ( B );
cudaFree ( B1 );
cudaFree ( Info );
cudaFree ( work );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// Sgeqrf time : 0.465168 sec .
// after geqrf : info = 0
// after ormqr : info = 0
// solution : 1.00008 , 1.02025 , 1.00586 , 0.999749 , 1.00595 , ...
cudaMemcpyHostToDevice ); // A -> d_A
cudaStat2 = cudaMemcpy ( d_B ,B , sizeof ( double )* ldb * nrhs ,
cudaMemcpyHostToDevice ); // B -> d_B
// compute buffer size for geqrf and prepare worksp . on device
cusolver_status = cusolverDnDgeqrf_bufferSize ( cusolverH , m , m ,
d_A , lda , & lwork );
cudaStat1 = cudaMalloc (( void **)& d_work , sizeof ( double )* lwork );
clock_gettime ( CLOCK_REALTIME ,& start ); // start timer
// QR factorization for d_A ; R stored in upper triangle of
// d_A , elementary reflectors vectors stored in lower triangle
// of d_A , elementary reflectors scalars stored in d_tau
cudaFree ( d_tau );
cudaFree ( d_B );
cudaFree ( devInfo );
cudaFree ( d_work );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// Dgeqrf time : 3.333913 sec .
// after geqrf : info_gpu = 0
// after ormqr : info_gpu = 0
// solution : 1 , 1, 1, 1, 1 , ...
bet ,B , incy ); // B = A * B1
double * tau , * work ; // elem . reflectors scalars , workspace
int * Info ; // info
int lwork = 0; // workspace size
const double one = 1;
// create cusolver and cublas handles
cusolver_status = cusolverDnCreate (& cusolverH );
cublas_status = cublasCreate (& cublasH );
cudaMallocManaged (& tau , m * sizeof ( double )); // unif . mem . for tau
cudaMallocManaged (& Info , sizeof ( int )); // unif . mem . for Info
// compute buffer size for geqrf and prepare workspace
cusolver_status = cusolverDnDgeqrf_bufferSize ( cusolverH , m , m ,
A , lda , & lwork );
cudaMallocManaged (& work , lwork * sizeof ( double )); // mem . for work
clock_gettime ( CLOCK_REALTIME ,& start ); // start timer
// QR factorization for A ; R stored in upper triangle of A
// elementary reflectors vectors stored in lower triangle of A
// elementary reflectors scalars stored in tau
cudaStat1 = cudaDeviceSynchronize ();
// check error code of ormqr function
printf ( " after ormqr : info = % d \ n " , * Info );
// write the original system A * X =( Q * R )* X = B in the form
// R * X = Q ^ T * B and solve the obtained triangular system
cudaFree ( B );
cudaFree ( B1 );
cudaFree ( Info );
cudaFree ( work );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// Dgeqrf time : 3.386223 sec .
// after geqrf : info = 0
// after ormqr : info = 0
// solution : 1 , 1 , 1 , 1 , 1 , ...
A X = B,
cusolverStatus = cusolverDnSpotrf(handle,uplo,N,A,N,Work,
Lwork,info);
cudaStatus = cudaDeviceSynchronize ();
// solve A * X =B , where A is factorized by potrf function
// B is overwritten by the solution
cusolverStatus = cusolverDnSpotrs(handle,uplo,N,1,A,N,
B,N,info);
cudaStatus = cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " Spotrf + Spotrs time : % lf sec .\ n " , accum ); // pr . el . time
printf ( " after Spotrf + Spotrs : info = % d \ n " , * info );
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B [ i ]); // print
printf ( " ... " ); // first components of the solution
printf ( " \ n " );
// free memory
cudaStatus = cudaFree ( A );
cudaStatus = cudaFree ( B );
cudaStatus = cudaFree ( B1 );
cudaStatus = cudaFree ( info );
cudaStatus = cudaFree ( Work );
A X = B,
cudaStatus = cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " solution : " );
printf ( " Dpotrf + Dpotrs time : % lf sec .\ n " , accum ); // pr . el . time
cudaStatus = cudaMemcpy (& info_gpu , d_info , sizeof ( int ) ,
cudaMemcpyDeviceToHost ); // copy d_info -> info_gpu
printf ( " after Dpotrf + Dpotrs : info_gpu = % d \ n " , info_gpu );
cudaStatus = cudaMemcpy (B , d_B , N * sizeof ( double ) ,
cudaMemcpyDeviceToHost ); // copy solution to host d_B -> B
cudaError cudaStatus ;
cusolverStatus_t cusolverStatus ;
cusolverDnHandle_t handle ;
double * Work ; // workspace
int * info , Lwork ; // info , workspace size
cudaStatus = cudaGetDevice (0);
cusolverStatus = cusolverDnCreate (& handle ); // create handle
cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER ;
cudaMallocManaged (& info , sizeof ( int )); // unified mem . for info
// compute workspace size and prepare workspace
cusolverStatus = cusolverDnDpotrf_bufferSize ( handle ,
uplo ,N ,A ,N ,& Lwork );
cudaMallocManaged (& Work , Lwork * sizeof ( double )); // mem . for Work
clock_gettime ( CLOCK_REALTIME ,& start ); // start timer
// Cholesky decomposition d_A = L * L ^T , lower triangle of d_A is
// replaced by the factor L
cusolverStatus = cusolverDnDpotrf(handle,uplo,N,A,N,Work,
Lwork,info);
cudaStatus = cudaDeviceSynchronize ();
// solve A * X =B , where A is factorized by potrf function
// B is overwritten by the solution
cusolverStatus = cusolverDnDpotrs(handle,uplo,N,1,A,N,
B,N,info);
cudaStatus = cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " Dpotrf + Dpotrs time : % lf sec .\ n " , accum ); // pr . el . time
printf ( " after Dpotrf + Dpotrs : info = % d \ n " , * info );
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B [ i ]); // print
printf ( " ... " ); // first components of the solution
printf ( " \ n " );
// free memory
cudaStatus = cudaFree ( A );
cudaStatus = cudaFree ( B );
cudaStatus = cudaFree ( B1 );
cudaStatus = cudaFree ( info );
cudaStatus = cudaFree ( Work );
cusolverStatus = cusolverDnDestroy ( handle );
cudaStatus = cudaDeviceReset ();
return 0;
}
A = L ∗ D ∗ LT ,
ssytrs(&upl,&N,&nrhs,A,&N,piv,B,&N,&info);
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " Ssytrf + ssytrs time : % lf sec .\ n " , accum ); // pr . el . time
cudaStatus = cudaMemcpy (& info_gpu , d_info , sizeof ( int ) ,
cudaMemcpyDeviceToHost ); // copy d_info -> info_gpu
printf ( " after Sytrf : info_gpu = % d \ n " , info_gpu );
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B [ i ]);
printf ( " ...\ n " ); // first components of the solution
// free memory
cudaStatus = cudaFree ( d_A );
cudaStatus = cudaFree ( d_pivot );
cudaStatus = cudaFree ( d_info );
cudaMallocManaged (& pivot , N * sizeof ( int )); // unif . mem . for pivot
cudaMallocManaged (& info , sizeof ( int )); // unif . mem . for info
cusolverStatus = cusolverDnSsytrf_bufferSize ( handle ,N ,A ,N ,
& Lwork ); // compute buffer size and prepare memory
cudaMallocManaged (& Work , Lwork * sizeof ( float )); // mem . for Work
clock_gettime ( CLOCK_REALTIME ,& start ); // start timer
// Bunch - Kaufman factorization of an NxN symmetric indefinite
// matrix A = L * D * L ^T , where D is symmetric , block - diagonal ,
// with 1 x1 or 2 x2 blocks , L is a product of permutation and
// triangular matrices
ssytrs(&upl,&N,&nrhs,A,&N,pivot,B,&N,info);
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " Ssytrf + ssytrs time : % lf sec .\ n " , accum ); // pr . el . time
printf ( " after Ssytrf : info = % d \ n " , * info );
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B [ i ]);
printf ( " ...\ n " ); // first components of the solution
// free memory
cudaStatus = cudaFree ( A );
cudaStatus = cudaFree ( B );
cudaStatus = cudaFree ( B1 );
cudaStatus = cudaFree ( pivot );
cudaStatus = cudaFree ( info );
cudaStatus = cudaFree ( Work );
cusolverStatus = cusolverDnDestroy ( handle );
cudaStatus = cudaDeviceReset ();
return 0;
}
dsytrs(&upl,&N,&nrhs,A,&N,piv,B,&N,&info);
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " Dsytrf + dsytrs time : % lf sec .\ n " , accum ); // pr . el . time
cudaStatus = cudaMemcpy (& info_gpu , d_info , sizeof ( int ) ,
cudaMemcpyDeviceToHost ); // copy d_info -> info_gpu
printf ( " after Dsytrf : info_gpu = % d \ n " , info_gpu );
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B [ i ]);
printf ( " ...\ n " ); // first components of the solution
// free memory
cudaStatus = cudaFree ( d_A );
cudaStatus = cudaFree ( d_pivot );
cudaStatus = cudaFree ( d_info );
cudaStatus = cudaFree ( Work );
cusolverStatus = cusolverDnDestroy ( handle );
cudaStatus = cudaDeviceReset ();
return 0;
}
// Dsytrf + dsytrs time : 1.173202 sec .
// after Dsytrf : info_gpu = 0
// solution : 1 , 1 , 1 , 1 , 1 , ...
cusolverStatus = cusolverDnDsytrf(handle,uplo,N,A,N,pivot,
Work,Lwork, info );
cudaStatus = cudaDeviceSynchronize ();
// solve the system A * X = B , where A = L * D * L ^ T - symmetric
// coefficient matrix factored using Bunch - Kaufman method ,
// B is overwritten by the solution
dsytrs(&upl,&N,&nrhs,A,&N,pivot,B,&N,info);
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " Dsytrf + dsytrs time : % lf sec .\ n " , accum ); // pr . el . time
printf ( " after Dsytrf : info = % d \ n " , * info );
printf ( " solution : " );
for ( int i = 0; i < 5; i ++) printf ( " %g , " , B [ i ]);
printf ( " ...\ n " ); // first components of the solution
// free memory
cudaStatus = cudaFree ( A );
cudaStatus = cudaFree ( B );
cudaStatus = cudaFree ( B1 );
cudaStatus = cudaFree ( pivot );
cudaStatus = cudaFree ( info );
cudaStatus = cudaFree ( Work );
cusolverStatus = cusolverDnDestroy ( handle );
cudaStatus = cudaDeviceReset ();
return 0;
}
// Dsytrf + dsytrs time : 1.279214 sec .
// after Dsytrf : info = 0
// solution : 1 , 1 , 1 , 1 , 1 , ...
cudaStat = cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " SVD time : % lf sec .\ n " , accum ); // print elapsed time
cudaStat = cudaMemcpy (U , d_U , sizeof ( float )* lda *m ,
cudaMemcpyDeviceToHost ); // copy d_U -> U
cudaStat = cudaMemcpy ( VT , d_VT , sizeof ( float )* lda *n ,
cudaMemcpyDeviceToHost ); // copy d_VT -> VT
cudaStat = cudaMemcpy (S , d_S , sizeof ( float )* n ,
cudaMemcpyDeviceToHost ); // copy d_S -> S
cudaStat = cudaMemcpy (& info_gpu , devInfo , sizeof ( int ) ,
cudaMemcpyDeviceToHost ); // devInfo -> info_gpu
printf ( " after gesvd : info_gpu = % d \ n " , info_gpu );
// multiply d_VT by the diagonal matrix corresponding to d_S
cublas_status = cublasSdgmm ( cublasH , CUBLAS_SIDE_LEFT ,n ,n ,
d_VT , lda , d_S , 1 , d_W , lda ); // d_W = d_S * d_VT
cudaStat = cudaMemcpy ( d_A ,A , sizeof ( float )* lda *n ,
cudaMemcpyHostToDevice ); // copy A -> d_A
// compute the difference d_A - d_U * d_S * d_VT
cublas_status = cublasSgemm_v2 ( cublasH , CUBLAS_OP_N , CUBLAS_OP_N ,
m , n , n , & h_minus_one , d_U , lda , d_W , lda , & h_one , d_A , lda );
float dR_fro = 0.0; // variable for the norm
// compute the norm of the difference d_A - d_U * d_S * d_VT
cublas_status = cublasSnrm2_v2 ( cublasH , lda *n , d_A ,1 ,& dR_fro );
printf ( " | A - U * S * VT | = % E \ n " , dR_fro ); // print the norm
// free memory
cudaFree ( d_A );
cudaFree ( d_S );
cudaFree ( d_U );
cudaFree ( d_VT );
cudaFree ( devInfo );
cudaFree ( d_work );
cudaFree ( d_rwork );
cudaFree ( d_W );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
cudaStat = cudaMalloc (( void **)& d_S , sizeof ( double )* n );
cudaStat = cudaMalloc (( void **)& d_U , sizeof ( double )* lda * m );
cudaStat = cudaMalloc (( void **)& d_VT , sizeof ( double )* lda * n );
cudaStat = cudaMalloc (( void **)& devInfo , sizeof ( int ));
cudaStat = cudaMalloc (( void **)& d_W , sizeof ( double )* lda * n );
cudaStat = cudaMemcpy ( d_A , A , sizeof ( double )* lda *n ,
cudaMemcpyHostToDevice ); // copy A -> d_A
// compute buffer size and prepare workspace
cusolver_status = cusolverDnDgesvd_bufferSize ( cusolverH ,m ,n ,
& lwork );
cudaStat = cudaMalloc (( void **)& d_work , sizeof ( double )* lwork );
// compute the singular value decomposition of d_A
// and optionally the left and right singular vectors :
// d_A = d_U * d_S * d_VT ; the diagonal elements of d_S
// are the singular values of d_A in descending order
// the first min (m , n ) columns of d_U contain the left sing . vec .
// the first min (m , n ) cols of d_VT contain the right sing . vec .
signed char jobu = 'A'; // all m columns of d_U returned
signed char jobvt = 'A'; // all n columns of d_VT returned
clock_gettime ( CLOCK_REALTIME ,& start ); // start timer
cudaFree ( d_A );
cudaFree ( d_S );
cudaFree ( d_U );
cudaFree ( d_VT );
cudaFree ( devInfo );
cudaFree ( d_work );
cudaFree ( d_rwork );
cudaFree ( d_W );
cublasDestroy ( cublasH );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// SVD time : 22.178122 sec .
// after gesvd : info_gpu = 0
// | A - U * S * VT | = 8.710823 E -12
cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " SVD time : % lf sec .\ n " , accum ); // print elapsed time
printf ( " after gesvd : info = % d \ n " , * Info );
// multiply VT by the diagonal matrix corresponding to S
cublasDdgmm ( cublasH , CUBLAS_SIDE_LEFT ,n ,n ,
VT , lda , S , 1 , W , lda ); // W = S * VT
cudaMemcpy ( A1 ,A , sizeof ( double )* lda *n ,
cudaMemcpyHostToDevice ); // copy A -> A1
// compute the difference A1 - U * S * VT
cublasDgemm_v2 ( cublasH , CUBLAS_OP_N , CUBLAS_OP_N ,
m , n , n , & h_minus_one , U , lda , W , lda , & h_one , A1 , lda );
double nrm = 0.0; // variable for the norm
// compute the norm of the difference A1 - U * S * VT
cublasDnrm2_v2 ( cublasH , lda *n , A1 ,1 ,& nrm );
printf ( " | A - U * S * VT | = % E \ n " , nrm ); // print the norm
// free memory
cudaFree ( A );
cudaFree ( A1 );
cudaFree ( U );
cudaFree ( VT );
cudaFree ( S );
cudaFree ( W );
cudaFree ( Info );
cudaFree ( work );
cudaFree ( rwork );
cublasDestroy ( cublasH );
cudaDeviceSynchronize ();
clock_gettime ( CLOCK_REALTIME ,& stop ); // stop timer
accum =( stop . tv_sec - start . tv_sec )+ // elapsed time
( stop . tv_nsec - start . tv_nsec )/( double ) BILLION ;
printf ( " syevd time : % lf sec .\ n " , accum ); // print elapsed time
printf ( " after syevd : info = % d \ n " , * Info );
printf ( " eigenvalues :\ n " ); // print first eigenvalues
for ( int i = 0 ; i < 3 ; i ++){
printf ( " W [% d ] = % E \ n " , i +1 , W [ i ]);
}
// free memory
cudaFree ( A );
cudaFree ( W );
cudaFree ( Info );
cudaFree ( work );
cusolverDnDestroy ( cusolverH );
cudaDeviceReset ();
return 0;
}
// syevd time : 2.246703 sec .
// after syevd : info = 0
// eigenvalues :
// W [1] = -2.582270 E +01
// W [2] = -2.566824 E +01
// W [3] = -2.563596 E +01
MAGMA by example
• Sparse eigenvalues.
• Sparse preconditioners.
To be precise, there also exist some mixed precision routines of the type sc,
dz, ds, zc, but we have decided to omit the corresponding examples.
• We shall restrict our examples to the most popular real, single and
double precision versions. The single precision versions are important
because in users' hands there are millions of inexpensive GPUs which
have restricted double precision capabilities. Installing Magma on
such devices can be a good starting point for more advanced studies.
On the other hand, in many applications double precision is necessary,
so we have decided to present our examples in both versions (in the
Magma BLAS case only in single precision). In most examples we
measure the computation times, so one can compare the performance
in single and double precision.
$make
Let us remark that only two examples in the present chapter contain the
cudaDeviceSynchronize() function. The function magma_sync_wtime used
in the remaining examples contains the synchronization command, so
cudaDeviceSynchronize() is not necessary there. Note however that, for ex-
ample in subsection 4.2.4 (vectors swapping with unified memory), omit-
ting cudaDeviceSynchronize() leads to wrong results (the vectors are not
swapped).
a ← b, b ← a.
magma_int_t err ;
// allocate the vectors on the host
err = magma_smalloc_cpu ( & a , m ); // host mem . for a
err = magma_smalloc_cpu ( & b , m ); // host mem . for b
// allocate the vector on the device
err = magma_smalloc ( & d_a , m ); // device memory for a
err = magma_smalloc ( & d_b , m ); // device memory for b
// a ={ sin (0) , sin (1) ,... , sin (m -1)}
for ( int j =0; j < m ; j ++) a [ j ]= sin (( float ) j );
// b ={ cos (0) , cos (1) ,... , cos (m -1)}
for ( int j =0; j < m ; j ++) b [ j ]= cos (( float ) j );
printf ( " a : " );
for ( int j =0; j <4; j ++) printf ( " %6.4 f , " ,a [ j ]); printf ( " ...\ n " );
printf ( " b : " );
for ( int j =0; j <4; j ++) printf ( " %6.4 f , " ,b [ j ]); printf ( " ...\ n " );
// copy data from host to device
magma_ssetvector ( m , a , 1 , d_a ,1 , queue ); // copy a -> d_a
magma_ssetvector ( m , b , 1 , d_b ,1 , queue ); // copy b -> d_b
// swap the vectors
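The listing above stops just before the swap itself. The following minimal
sketch, an illustration rather than one of the listings of this section, performs
the analogous swap on managed memory with magma_sswap; it shows the point
of the remark above, namely that in the unified memory version (subsection
4.2.4) an explicit cudaDeviceSynchronize() is needed before the host can
safely read the swapped vectors. The vector length and the initial values are
arbitrary choices.

#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>
#include "magma_v2.h"

int main(void){
  magma_init();                                 // initialize Magma
  magma_queue_t queue = NULL;
  magma_queue_create(0, &queue);
  magma_int_t m = 1024;                         // length of a, b
  float *a, *b;                                 // managed vectors
  cudaMallocManaged(&a, m*sizeof(float));       // unified mem. for a
  cudaMallocManaged(&b, m*sizeof(float));       // unified mem. for b
  for (int j = 0; j < m; j++) { a[j] = sinf((float) j); b[j] = cosf((float) j); }
  magma_sswap(m, a, 1, b, 1, queue);            // swap a <-> b on the device
  cudaDeviceSynchronize();                      // required before the host reads a, b
  printf("a[0] = %6.4f  b[0] = %6.4f\n", a[0], b[0]);  // a[0] = cos(0) = 1
  cudaFree(a); cudaFree(b);                     // free unified memory
  magma_queue_destroy(queue);
  magma_finalize();                             // finalize Magma
  return 0;
}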
c = α op(A) b + β c,
where A is a matrix, b, c are vectors, α, β are scalars and op(A) can be equal
to A (MagmaNoTrans case), A^T (transposition) in MagmaTrans case or A^H
(conjugate transposition) in MagmaConjTrans case.
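A minimal, self-contained sketch of this operation (an illustration, not one
of the listings of this section) may look as follows; the matrix of ones and
the sizes are arbitrary test data chosen so that the expected result c = A b
is easy to check.

#include <stdio.h>
#include "magma_v2.h"

int main(void){
  magma_init();                                  // initialize Magma
  magma_queue_t queue = NULL;
  magma_queue_create(0, &queue);
  magma_int_t m = 1024, n = 1024;                // a - mxn matrix
  float *a, *b, *c;                              // host arrays
  float *d_a, *d_b, *d_c;                        // device arrays
  magma_smalloc_cpu(&a, m*n);                    // host memory for a
  magma_smalloc_cpu(&b, n);                      // host memory for b
  magma_smalloc_cpu(&c, m);                      // host memory for c
  magma_smalloc(&d_a, m*n);                      // device memory for a
  magma_smalloc(&d_b, n);                        // device memory for b
  magma_smalloc(&d_c, m);                        // device memory for c
  for (magma_int_t j = 0; j < m*n; j++) a[j] = 1.0f;   // a - matrix of ones
  for (magma_int_t j = 0; j < n; j++)   b[j] = 1.0f;   // b - vector of ones
  for (magma_int_t j = 0; j < m; j++)   c[j] = 0.0f;
  magma_ssetmatrix(m, n, a, m, d_a, m, queue);   // copy a -> d_a
  magma_ssetvector(n, b, 1, d_b, 1, queue);      // copy b -> d_b
  magma_ssetvector(m, c, 1, d_c, 1, queue);      // copy c -> d_c
  float alpha = 1.0f, beta = 0.0f;
  // d_c = alpha*d_a*d_b + beta*d_c
  magma_sgemv(MagmaNoTrans, m, n, alpha, d_a, m, d_b, 1, beta, d_c, 1, queue);
  magma_sgetvector(m, d_c, 1, c, 1, queue);      // copy d_c -> c
  printf("c[0] = %g (expected %d)\n", c[0], (int) n);
  magma_free_cpu(a); magma_free_cpu(b); magma_free_cpu(c);
  magma_free(d_a); magma_free(d_b); magma_free(d_c);
  magma_queue_destroy(queue);
  magma_finalize();                              // finalize Magma
  return 0;
}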
for ( int j =0; j <4; j ++) printf ( " %9.4 f , " , c2 [ j ]);
printf ( " ...\ n " );
magma_free_pinned ( a ); // free host memory
magma_free_pinned ( b ); // free host memory
magma_free_pinned ( c ); // free host memory
magma_free_pinned ( c2 ); // free host memory
magma_free ( d_a ); // free device memory
magma_free ( d_b ); // free device memory
magma_free ( d_c ); // free device memory
magma_queue_destroy ( queue );
magma_finalize (); // finalize Magma
return 0;
}
// magma_sgemv time : 0.00016 sec .
// after magma_sgemv :
// c2 : 507.9388 , 498.1866 , 503.1055 , 508.1643 ,...
magma_sgemv(MagmaNoTrans,m,n,alpha,a,m,b,1,beta,c,1,queue);
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " magma_sgemv time : %7.5 f sec .\ n " , gpu_time );
printf ( " after magma_sgemv :\ n " );
printf ( " c : " );
for ( int j =0; j <4; j ++) printf ( " %9.4 f , " ,c [ j ]);
printf ( " ...\ n " );
magma_free ( a ); // free memory
magma_free ( b ); // free memory
magma_free ( c ); // free memory
magma_queue_destroy ( queue );
magma_finalize (); // finalize Magma
return 0;
}
// magma_sgemv time : 0.00504 sec .
// after magma_sgemv :
// c : 507.9388 , 498.1866 , 503.1055 , 508.1643 ,...
c = αAb + βc,
where A is an m × m symmetric matrix, b, c are vectors and α, β are scalars.
The matrix A can be stored in lower (MagmaLower) or upper (MagmaUpper)
mode.
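As an illustration, the following hedged sketch calls magma_ssymv on man-
aged memory; the symmetric matrix of ones and the vector of ones are
arbitrary test data, and cudaDeviceSynchronize() is again needed before
the host reads the result.

#include <stdio.h>
#include <cuda_runtime.h>
#include "magma_v2.h"

int main(void){
  magma_init();                                    // initialize Magma
  magma_queue_t queue = NULL;
  magma_queue_create(0, &queue);
  magma_int_t m = 1024;                            // a - mxm symmetric matrix
  float *a, *b, *c;                                // managed arrays
  cudaMallocManaged(&a, m*m*sizeof(float));        // unified mem. for a
  cudaMallocManaged(&b, m*sizeof(float));          // unified mem. for b
  cudaMallocManaged(&c, m*sizeof(float));          // unified mem. for c
  for (magma_int_t j = 0; j < m*m; j++) a[j] = 1.0f;   // a - symmetric (ones)
  for (magma_int_t j = 0; j < m; j++) { b[j] = 1.0f; c[j] = 0.0f; }
  float alpha = 1.0f, beta = 0.0f;
  // c = alpha*a*b + beta*c; only the lower triangle of a is referenced
  magma_ssymv(MagmaLower, m, alpha, a, m, b, 1, beta, c, 1, queue);
  cudaDeviceSynchronize();                         // sync before the host reads c
  printf("c[0] = %g (expected %d)\n", c[0], (int) m);
  cudaFree(a); cudaFree(b); cudaFree(c);           // free unified memory
  magma_queue_destroy(queue);
  magma_finalize();                                // finalize Magma
  return 0;
}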
C = α op(A) op(B) + β C,
where A, B, C are matrices and α, β are scalars. The value of op(A) can be
equal to A in MagmaNoTrans case, A^T (transposition) in MagmaTrans case,
or A^H (conjugate transposition) in MagmaConjTrans case and similarly for
op(B).
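The sketch below (an illustration, not one of the listings of this section)
shows the corresponding magma_sgemm call on matrices of ones, arbitrary
test data for which every entry of the product equals k.

#include <stdio.h>
#include "magma_v2.h"

int main(void){
  magma_init();                                    // initialize Magma
  magma_queue_t queue = NULL;
  magma_queue_create(0, &queue);
  magma_int_t m = 512, n = 512, k = 512;           // a - mxk, b - kxn, c - mxn
  float *a, *b, *c, *d_a, *d_b, *d_c;
  magma_smalloc_cpu(&a, m*k);                      // host memory
  magma_smalloc_cpu(&b, k*n);
  magma_smalloc_cpu(&c, m*n);
  magma_smalloc(&d_a, m*k);                        // device memory
  magma_smalloc(&d_b, k*n);
  magma_smalloc(&d_c, m*n);
  for (magma_int_t j = 0; j < m*k; j++) a[j] = 1.0f;   // a, b - matrices of ones
  for (magma_int_t j = 0; j < k*n; j++) b[j] = 1.0f;
  for (magma_int_t j = 0; j < m*n; j++) c[j] = 0.0f;
  magma_ssetmatrix(m, k, a, m, d_a, m, queue);     // copy a -> d_a
  magma_ssetmatrix(k, n, b, k, d_b, k, queue);     // copy b -> d_b
  magma_ssetmatrix(m, n, c, m, d_c, m, queue);     // copy c -> d_c
  float alpha = 1.0f, beta = 0.0f;
  // d_c = alpha*d_a*d_b + beta*d_c
  magma_sgemm(MagmaNoTrans, MagmaNoTrans, m, n, k, alpha,
              d_a, m, d_b, k, beta, d_c, m, queue);
  magma_sgetmatrix(m, n, d_c, m, c, m, queue);     // copy d_c -> c
  printf("c[0] = %g (expected %d)\n", c[0], (int) k);
  magma_free_cpu(a); magma_free_cpu(b); magma_free_cpu(c);
  magma_free(d_a); magma_free(d_b); magma_free(d_c);
  magma_queue_destroy(queue);
  magma_finalize();                                // finalize Magma
  return 0;
}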
magma_sgemm(MagmaNoTrans,MagmaNoTrans,m,n,k,alpha,a,m,b,k,
beta,c,m,queue);
// c :
// 498.3723 , 521.3933 , 507.0844 , 515.5119 ,...
// 504.1406 , 517.1718 , 509.3519 , 511.3415 ,...
// 511.1694 , 530.6165 , 517.5001 , 524.9462 ,...
// 505.5946 , 522.4631 , 511.7729 , 516.2770 ,...
// . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
C = α op(A) op(A)^T + β C,
}
// magma_ssyrk time : 0.09162 sec .
// after magma_ssyrk :
// c :
// 1358.9562 ,...
// 1027.0094 , 1382.1946 ,...
// 1011.2416 , 1022.4153 , 1351.7262 ,...
// 1021.8580 , 1037.6437 , 1025.0333 , 1376.4917 ,...
// . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
magma_strmm(MagmaLeft,MagmaUpper,MagmaNoTrans,MagmaNonUnit,
m,n,alpha,d_a,m,d_c,m,queue);
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " magma_strmm time : %7.5 f sec .\ n " , gpu_time );
// copy data from device to host
magma_sgetmatrix ( m , n , d_c , m , c ,m , queue ); // copy d_c -> c
printf ( " after magma_strmm :\ n " );
printf ( " c :\ n " );
for ( int i =0; i <4; i ++){
for ( int j =0; j <4; j ++) printf ( " %10.4 f , " ,c [ i * m + j ]);
printf ( " ...\ n " );}
printf ( " . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \ n " );
magma_free_pinned ( a ); // free host memory
magma_free_pinned ( c ); // free host memory
magma_free ( d_a ); // free device memory
magma_free ( d_c ); // free device memory
magma_queue_destroy ( queue ); // destroy queue
magma_finalize (); // finalize Magma
return 0;
}
// magma_strmm time : 0.04829 sec .
// after magma_strmm :
// c :
magma_strmm(MagmaLeft,MagmaUpper,MagmaNoTrans,MagmaNonUnit,
m,n,alpha,a,m,c,m,queue);
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " magma_strmm time : %7.5 f sec .\ n " , gpu_time );
printf ( " after magma_strmm :\ n " );
printf ( " c :\ n " );
for ( int i =0; i <4; i ++){
for ( int j =0; j <4; j ++) if (i >= j ) printf ( " %10.4 f , " ,c [ i * m + j ]);
printf ( " ...\ n " );}
printf ( " . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . \ n " );
magma_free ( a ); // free memory
magma_free ( c ); // free memory
magma_queue_destroy ( queue ); // destroy queue
magma_finalize (); // finalize Magma
return 0;
}
// magma_strmm time : 0.12141 sec .
// after magma_strmm :
// c :
// 2051.0024 , 2038.8608 , 2033.2482 , 2042.2589 ,...
// 2040.4783 , 2027.2789 , 2025.2496 , 2041.6721 ,...
// 2077.4158 , 2052.2390 , 2050.5039 , 2074.0791 ,...
// 2028.7070 , 2034.3572 , 2003.8625 , 2031.4501 ,...
// . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
C = α A + C,
// c :
magmablas_sgeadd(m,n,alpha,a,m,c,m,queue);
A X = B,
magma_sgesv(m,n,a,m,piv,c,m,&info);
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " magma_sgesv time : %7.5 f sec .\ n " , gpu_time ); // time
printf ( " upper left corner of the solution :\ n " );
magma_sprint ( 4 , 4 , c , m ); // part of the solution
magma_free_pinned ( a ); // free host memory
magma_free_pinned ( b ); // free host memory
magma_free_pinned ( c ); // free host memory
magma_sgesv(m,n,a,m,piv,c,m,&info);
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " magma_sgesv time : %7.5 f sec .\ n " , gpu_time ); // time
printf ( " solution :\ n " );
magma_sprint ( 4 , 4 , c , m ); // part of the solution
magma_free ( a ); // free memory
magma_free ( b ); // free memory
magma_free ( c ); // free memory
magma_free ( piv ); // free memory
magma_finalize (); // finalize Magma
return 0;
}
// expected solution :
// [
// 0.3924 0.5546 0.6481 0.5479
// 0.9790 0.7204 0.4220 0.4588
// 0.5246 0.0813 0.8202 0.6163
// 0.6624 0.8634 0.8748 0.0717
// ];
// magma_sgesv time : 0.42720 sec .
// solution :
// [
// 0.3927 0.5548 0.6484 0.5483
// 0.9788 0.7204 0.4217 0.4586
// 0.5242 0.0815 0.8199 0.6161
// 0.6626 0.8631 0.8749 0.0717
// ];
A X = B,
magma_dgesv(m,n,a,m,piv,c,m,&info);
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " magma_dgesv time : %7.5 f sec .\ n " , gpu_time ); // time
printf ( " solution :\ n " );
magma_dprint ( 4 , 4 , c , m ); // part of the solution
magma_free_pinned ( a ); // free host memory
magma_free_pinned ( b ); // free host memory
magma_free_pinned ( c ); // free host memory
free ( piv ); // free host memory
magma_finalize (); // finalize Magma
return 0;
}
// expected solution :
// [
// 0.5440 0.5225 0.8499 0.4012
// 0.4878 0.9321 0.2277 0.7495
// 0.0124 0.7743 0.5884 0.3296
// 0.2166 0.6253 0.8843 0.3685
// ];
// magma_dgesv time : 1.81342 sec .
// solution :
// [
// 0.5440 0.5225 0.8499 0.4012
// 0.4878 0.9321 0.2277 0.7495
// 0.0124 0.7743 0.5884 0.3296
// 0.2166 0.6253 0.8843 0.3685
// ];
blasf77_dgemm ( " N " ," N " ,&m ,& n ,& n ,& alpha ,a ,& m ,b ,& m ,& beta ,c ,& m );
// solve the linear system a * x = c
// c - mxn matrix , a - mxm matrix ;
// c is overwritten by the solution
gpu_time = magma_sync_wtime ( NULL );
magma_dgesv(m,n,a,m,piv,c,m,&info);
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " magma_dgesv time : %7.5 f sec .\ n " , gpu_time ); // time
printf ( " solution :\ n " );
magma_dprint ( 4 , 4 , c , m ); // part of the solution
magma_free ( a ); // free memory
magma_free ( b ); // free memory
magma_free ( c ); // free memory
magma_free ( piv ); // free memory
magma_finalize (); // finalize Magma
return 0;
}
// expected solution :
// [
// 0.5440 0.5225 0.8499 0.4012
// 0.4878 0.9321 0.2277 0.7495
// 0.0124 0.7743 0.5884 0.3296
// 0.2166 0.6253 0.8843 0.3685
// ];
// magma_dgesv time : 1.69905 sec .
// solution :
// [
// 0.5440 0.5225 0.8499 0.4012
// 0.4878 0.9321 0.2277 0.7495
// 0.0124 0.7743 0.5884 0.3296
// 0.2166 0.6253 0.8843 0.3685
// ];
A X = B,
cudaMallocManaged (& a , mm * sizeof ( float )); // unified mem . for a
cudaMallocManaged (& b , mn * sizeof ( float )); // unified mem . for b
cudaMallocManaged (& c , mn * sizeof ( float )); // unified mem . for c
cudaMallocManaged (& piv , m * sizeof ( int )); // unified mem . for piv
// generate matrices
lapackf77_slarnv (& ione , ISEED ,& mm , a ); // randomize a
lapackf77_slarnv (& ione , ISEED ,& mn , b ); // randomize b
printf ( " expected solution :\ n " );
magma_sprint ( 4 , 4 , b , m ); // part of the expected solution
// right hand side c = a * b
blasf77_sgemm ( " N " ," N " ,&m ,& n ,& m ,& alpha ,a ,& m ,b ,& m ,& beta ,c ,& m );
// solve the linear system a * x =c , a - mxm matrix ,
// c - mxn matrix , c is overwritten by the solution ;
// LU decomposition with partial pivoting and row
// interchanges is used , row i is interchanged with row piv ( i )
gpu_time = magma_sync_wtime ( NULL );
A X = B,
// expected solution :
// [
// 0.5440 0.5225 0.8499 0.4012
// 0.4878 0.9321 0.2277 0.7495
// 0.0124 0.7743 0.5884 0.3296
// 0.2166 0.6253 0.8843 0.3685
// ];
// magma_dgesv_gpu time : 1.55957 sec .
// solution :
// [
// 0.5440 0.5225 0.8499 0.4012
// 0.4878 0.9321 0.2277 0.7495
// 0.0124 0.7743 0.5884 0.3296
// 0.2166 0.6253 0.8843 0.3685
// ];
A = P L U,
// [
// 0.9682 0.9682 0.9682 0.9682
// 1.0134 1.0134 1.0134 1.0134
// 1.0147 1.0147 1.0147 1.0147
// 1.0034 1.0034 1.0034 1.0034
// ];
A = P L U,
where P is a permutation matrix, L is lower triangular with unit diagonal,
and U is upper triangular. The matrix A to be factored is defined on the
host. On exit A contains the factors L, U. The information on the inter-
changed rows is contained in piv. See magma-X.Y.Z/src/sgetrf.cpp for
more details.
Using the obtained factorization one can replace the problem of solving a
general linear system by solving two triangular systems with matrices L and
U respectively. The Lapack function dgetrs uses the LU factorization to
solve a general linear system (it is faster to use Lapack dgetrs than to copy
A to the device).
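As an illustration of this remark, the following minimal sketch (not one of
the listings of this section) factorizes a diagonally dominant matrix on the
host with magma_dgetrf and then solves the system with Lapack dgetrs;
the test matrix and the right hand side are arbitrary choices for which the
exact solution is the vector of ones.

#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"
#include "magma_lapack.h"

int main(void){
  magma_init();                                    // initialize Magma
  magma_int_t m = 2048, nrhs = 1, info;
  magma_int_t *ipiv;                               // pivot indices
  double *a, *b;                                   // a - mxm matrix, b - rhs
  magma_dmalloc_cpu(&a, m*m);                      // host memory for a
  magma_dmalloc_cpu(&b, m);                        // host memory for b
  ipiv = (magma_int_t*) malloc(m*sizeof(magma_int_t));
  // a - diagonally dominant matrix, so the exact solution of a*x = b
  // with b = a*(vector of ones) is the vector of ones
  for (magma_int_t j = 0; j < m*m; j++) a[j] = 1.0;
  for (magma_int_t j = 0; j < m; j++) a[j + m*j] += (double) m;
  for (magma_int_t i = 0; i < m; i++) b[i] = 2.0*m;    // row sums of a
  // LU factorization a = P*L*U computed by Magma on the host matrix a
  magma_dgetrf(m, m, a, m, ipiv, &info);
  // solve the two triangular systems with Lapack dgetrs;
  // b is overwritten by the solution
  lapackf77_dgetrs("N", &m, &nrhs, a, &m, ipiv, b, &m, &info);
  printf("after dgetrf+dgetrs: info = %d\n", (int) info);
  printf("b[0] = %g (expected 1)\n", b[0]);
  free(ipiv); magma_free_cpu(a); magma_free_cpu(b);
  magma_finalize();                                // finalize Magma
  return 0;
}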
lapackf77_dlaset ( MagmaFullStr ,& n ,& nrhs ,& alpha ,& alpha ,b ,& n );
// b - nxnrhs matrix of ones
printf ( " upper left corner of the expected solution :\ n " );
magma_dprint ( 4 , 4 , b , m ); // part of the expected solution
// right hand side c = a * b
blasf77_dgemm ( " N " ," N " ,&m ,& nrhs ,& n ,& alpha ,a ,& m ,b ,& m ,& beta ,
c ,& m ); // right hand side c = a * b
// solve the linear system a * x =c , a - mxn matrix , c - mxnrhs ma -
// trix , c is overwritten by the solution ; LU decomposition
// with partial pivoting and row interchanges is used ,
// row i is interchanged with row piv ( i )
gpu_time = magma_sync_wtime ( NULL );
interchanges:
A = P L U,
where P is a permutation matrix, L is lower triangular with unit diagonal,
and U is upper triangular. The matrix A to be factored and the factors L, U
are defined on the device. The information on the interchanged rows is
contained in piv. See magma-X.Y.Z/src/sgetrf_gpu.cpp for more details.
Using the obtained factorization one can replace the problem of solving a
general linear system by solving two triangular systems with matrices L
and U respectively. The function magma_sgetrs_gpu uses the L, U factors
defined on the device by magma_sgetrf_gpu to solve in single precision a
general linear system
A X = B.
The right hand side B is a matrix defined on the device. On exit it is
replaced by the solution. See magma-X.Y.Z/src/sgetrs_gpu.cpp for more
details.
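A hedged, self-contained sketch of this pair of functions is given below; the
diagonally dominant test matrix is an arbitrary choice for which the exact
solution is the vector of ones, and the leading dimensions are kept equal to
m for simplicity.

#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"

int main(void){
  magma_init();                                    // initialize Magma
  magma_queue_t queue = NULL;
  magma_queue_create(0, &queue);
  magma_int_t m = 2048, nrhs = 1, info;
  magma_int_t *ipiv = (magma_int_t*) malloc(m*sizeof(magma_int_t));
  float *a, *b, *d_a, *d_b;
  magma_smalloc_cpu(&a, m*m);                      // host memory for a
  magma_smalloc_cpu(&b, m);                        // host memory for b
  magma_smalloc(&d_a, m*m);                        // device memory for a
  magma_smalloc(&d_b, m);                          // device memory for b
  // diagonally dominant a, right hand side b = a*(vector of ones)
  for (magma_int_t j = 0; j < m*m; j++) a[j] = 1.0f;
  for (magma_int_t j = 0; j < m; j++) a[j + m*j] += (float) m;
  for (magma_int_t i = 0; i < m; i++) b[i] = 2.0f*m;
  magma_ssetmatrix(m, m, a, m, d_a, m, queue);     // copy a -> d_a
  magma_ssetvector(m, b, 1, d_b, 1, queue);        // copy b -> d_b
  magma_sgetrf_gpu(m, m, d_a, m, ipiv, &info);     // LU factorization on device
  // solve d_a*x = d_b using the factors; d_b is overwritten by the solution
  magma_sgetrs_gpu(MagmaNoTrans, m, nrhs, d_a, m, ipiv, d_b, m, &info);
  magma_sgetvector(m, d_b, 1, b, 1, queue);        // copy the solution to host
  printf("after sgetrf_gpu+sgetrs_gpu: info = %d\n", (int) info);
  printf("b[0] = %g (expected 1)\n", b[0]);
  free(ipiv); magma_free_cpu(a); magma_free_cpu(b);
  magma_free(d_a); magma_free(d_b);
  magma_queue_destroy(queue);
  magma_finalize();                                // finalize Magma
  return 0;
}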
4.3.14 magma_sgetrf_gpu, magma_sgetrs_gpu - unified memory version
4.3.16 magma_dgetrf_gpu, magma_dgetrs_gpu - unified memory version
magma_int_t mn = m * n ; // size of a
magma_int_t nnrhs = n * nrhs ; // size of b
magma_int_t mnrhs = m * nrhs ; // size of c
double * a ; // a - mxn matrix
double * b ; // b - nxnrhs matrix
double * c ; // c - mxnrhs matrix
magma_int_t ione = 1;
magma_int_t ISEED [4] = { 0 ,0 ,0 ,1 }; // seed
const double alpha = 1.0; // alpha =1
const double beta = 0.0; // beta =0
cudaMallocManaged (& a , mn * sizeof ( double )); // unified mem . for a
cudaMallocManaged (& b , nnrhs * sizeof ( double )); // unif . mem . for b
cudaMallocManaged (& c , mnrhs * sizeof ( double )); // unif . mem . for c
cudaMallocManaged (& piv , m * sizeof ( int )); // unified mem . for piv
// generate matrices a , b ;
lapackf77_dlarnv (& ione , ISEED ,& mn , a ); // randomize a
lapackf77_dlaset ( MagmaFullStr ,& n ,& nrhs ,& alpha ,& alpha ,b ,& n );
// b - nxnrhs matrix of ones
printf ( " upper left corner of the expected solution :\ n " );
magma_dprint ( 4 , 4 , b , m ); // part of the expected solution
// right hand side c = a * b
blasf77_dgemm ( " N " ," N " ,&m ,& nrhs ,& n ,& alpha ,a ,& m ,b ,& m ,& beta ,c ,& m );
// solve the linear system a * x =c , a - mxn matrix ,
// c - mxnrhs matrix , c is overwritten by the solution ;
// LU decomposition with partial pivoting and row interchanges
// is used , row i is interchanged with row piv ( i )
gpu_time = magma_sync_wtime ( NULL );
magma_int_t n2 = m * n ; // size of a , r
magma_int_t nnrhs = n * nrhs ; // size of b
magma_int_t mnrhs = m * nrhs ; // size of c
float *a , * r ; // a , r - mxn matrices on the host
float *b , * c ; // b - nxnrhs , c - mxnrhs matrices on the host
magmaFloat_ptr d_la [ num_gpus ];
float alpha =1.0 , beta =0.0; // alpha =1 , beta =0
magma_int_t n_local ;
magma_int_t ione = 1 , info ;
magma_int_t i , min_mn = min (m , n ) , nb ;
magma_int_t ldn_local ; // m * ldn_local - size of the part of a
magma_int_t ISEED [4] = {0 ,0 ,0 ,1}; // on i - th device
nb = magma_get_sgetrf_nb (m , n ); // optim . block size for sgetrf
// allocate memory on cpu
ipiv =( magma_int_t *) malloc ( min_mn * sizeof ( magma_int_t ));
// host memory for ipiv
err = magma_smalloc_cpu (& a , n2 ); // host memory for a
err = magma_smalloc_pinned (& r , n2 ); // host memory for r
err = magma_smalloc_pinned (& b , nnrhs ); // host memory for b
err = magma_smalloc_pinned (& c , mnrhs ); // host memory for c
// allocate device memory on num_gpus devices
for ( i =0; i < num_gpus ; i ++){
n_local = (( n / nb )/ num_gpus )* nb ;
if ( i < ( n / nb )% num_gpus )
n_local += nb ;
else if ( i == ( n / nb )% num_gpus )
n_local += n % nb ;
ldn_local = (( n_local +31)/32)*32;
magma_setdevice ( i );
err = magma_smalloc (& d_la [ i ] , m * ldn_local ); // device memory
} // on i - th device
magma_setdevice (0);
// generate matrices
lapackf77_slarnv ( & ione , ISEED , & n2 , a ); // randomize a
lapackf77_slaset ( MagmaFullStr ,& n ,& nrhs ,& alpha ,& alpha ,b ,& n );
// b - nxnrhs matrix of ones
lapackf77_slacpy ( MagmaFullStr ,& m ,& n ,a ,& m ,r ,& m ); // a - > r
printf ( " upper left corner of the expected solution :\ n " );
magma_sprint (4 ,4 , b , m ); // part of the expected solution
blasf77_sgemm ( " N " ," N " ,&m ,& nrhs ,& n ,& alpha ,a ,& m ,b ,& m ,
& beta ,c ,& m ); // right hand side c = a * b
// LAPACK version of LU decomposition
cpu_time = magma_wtime ();
lapackf77_sgetrf (& m , &n , a , &m , ipiv , & info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " lapackf77_sgetrf time : %7.5 f sec .\ n " , cpu_time );
// copy the corresponding parts of the matrix r to num_gpus
magma_ssetmatrix_1D_col_bcyclic ( num_gpus , m , n , nb , r , m ,
d_la , m , queues );
// MAGMA
// LU decomposition on num_gpus devices with partial pivoting
// and row interchanges , row i is interchanged with row ipiv ( i )
magma_int_t n_local ;
magma_int_t ione = 1 , info ;
magma_int_t i , min_mn = min (m , n ) , nb ;
magma_int_t ldn_local ; // mxldn_local - size of the part of a
magma_int_t ISEED [4] = {0 ,0 ,0 ,1}; // on i - th device
nb = magma_get_dgetrf_nb (m , n ); // optim . block size for dgetrf
// allocate memory on cpu
ipiv =( magma_int_t *) malloc ( min_mn * sizeof ( magma_int_t ));
// host memory for ipiv
err = magma_dmalloc_cpu (& a , n2 ); // host memory for a
err = magma_dmalloc_pinned (& r , n2 ); // host memory for r
err = magma_dmalloc_pinned (& b , nnrhs ); // host memory for b
err = magma_dmalloc_pinned (& c , mnrhs ); // host memory for c
// allocate device memory on num_gpus devices
for ( i =0; i < num_gpus ; i ++){
n_local = (( n / nb )/ num_gpus )* nb ;
if ( i < ( n / nb )% num_gpus )
n_local += nb ;
else if ( i == ( n / nb )% num_gpus )
n_local += n % nb ;
ldn_local = (( n_local +31)/32)*32;
magma_setdevice ( i );
err = magma_dmalloc (& d_la [ i ] , m * ldn_local ); // device memory
} // on i - th device
magma_setdevice (0);
// generate matrices
lapackf77_dlarnv ( & ione , ISEED , & n2 , a ); // randomize a
lapackf77_dlaset ( MagmaFullStr ,& n ,& nrhs ,& alpha ,& alpha ,b ,& n );
// b - nxnrhs matrix of ones
lapackf77_dlacpy ( MagmaFullStr ,& m ,& n ,a ,& m ,r ,& m ); // a - > r
printf ( " upper left corner of the expected solution :\ n " );
magma_dprint (4 ,4 , b , m ); // part of the expected solution
blasf77_dgemm ( " N " ," N " ,&m ,& nrhs ,& n ,& alpha ,a ,& m ,b ,& m ,
& beta ,c ,& m ); // right hand side c = a * b
// LAPACK version of LU decomposition
cpu_time = magma_wtime ();
lapackf77_dgetrf (& m , &n , a , &m , ipiv , & info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " lapackf77_dgetrf time : %7.5 f sec .\ n " , cpu_time );
// copy the corresponding parts of the matrix r to num_gpus
magma_dsetmatrix_1D_col_bcyclic ( num_gpus , m , n , nb , r , m ,
d_la , m , queues );
// MAGMA
// LU decomposition on num_gpus devices with partial pivoting
// and row interchanges , row i is interchanged with row ipiv ( i )
gpu_time = magma_sync_wtime ( NULL );
// host
magma_dgetmatrix_1D_col_bcyclic ( num_gpus , m , n , nb , d_la ,
m , r , m , queues );
magma_setdevice (0);
// solve on the host the system r * x = c ; x overwrites c ,
// using the LU decomposition obtained on num_gpus devices
lapackf77_dgetrs ( " N " ,&m ,& nrhs ,r ,& m , ipiv ,c ,& m ,& info );
// print part of the solution from dgetrf_mgpu and dgetrs
printf ( " upper left corner of the solution \ n \
from dgetrf_mgpu + dgetrs :\ n " ); // part of the solution from
magma_dprint ( 4 , 4 , c , m ); // magma_dgetrf_mgpu + dgetrs
free ( ipiv ); // free host memory
free ( a ); // free host memory
magma_free_pinned ( r ); // free host memory
magma_free_pinned ( b ); // free host memory
magma_free_pinned ( c ); // free host memory
for ( i =0; i < num_gpus ; i ++){
magma_free ( d_la [ i ] ); // free device memory
}
for ( int dev = 0; dev < num_gpus ; ++ dev ) {
magma_queue_destroy ( queues [ dev ] );
}
magma_finalize (); // finalize Magma
}
// upper left corner of the expected solution :
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
// lapackf77_dgetrf time : 2.82328 sec .
// magma_dgetrf_mgpu time : 1.41692 sec .
// upper left corner of the solution
// from dgetrf_mgpu + dgetrs :
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
// allocate matrices
cudaMallocManaged (& a , mm * sizeof ( double )); // unified mem . for a
cudaMallocManaged (& r , mm * sizeof ( double )); // unified mem . for r
cudaMallocManaged (& c , mm * sizeof ( double )); // unified mem . for c
cudaMallocManaged (& dwork , ldwork * sizeof ( double )); // mem . dwork
cudaMallocManaged (& piv , m * sizeof ( int )); // unified mem . for piv
// generate random matrix a
lapackf77_dlarnv (& ione , ISEED ,& mm , a ); // randomize a
magmablas_dlacpy ( MagmaFull ,m ,m ,a ,m ,r ,m , queue ); // a - > r
// find the inverse matrix : a * X = I using the LU factorization
// with partial pivoting and row interchanges computed by
// magma_dgetrf_gpu ; row i is interchanged with row piv ( i );
// a - mxm matrix ; a is overwritten by the inverse
gpu_time = magma_sync_wtime ( NULL );
A X = B,
magma_sposv(MagmaLower,m,n,a,m,c,m,&info);
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " magma_sposv time : %7.5 f sec .\ n " , gpu_time ); // Magma
printf ( " upper left corner of the Magma solution :\ n " ); // time
magma_sprint ( 4 , 4 , c , m ); // part of Magma solution
free ( a ); // free host memory
free ( b ); // free host memory
free ( c ); // free host memory
magma_finalize (); // finalize Magma
return 0;
}
// upper left corner of the expected solution :
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
// magma_sposv time : 0.41469 sec .
// upper left corner of the Magma solution :
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
double gpu_time ;
magma_int_t info ;
magma_int_t m = 8192; // a - mxm matrix
magma_int_t n = 100; // b , c - mxn matrices
magma_int_t mm = m * m ; // size of a
magma_int_t mn = m * n ; // size of b , c
float * a ; // a - mxm matrix
float * b ; // b - mxn matrix
float * c ; // c - mxn matrix
magma_int_t ione = 1;
magma_int_t ISEED [4] = { 0 ,0 ,0 ,1 }; // seed
const float alpha = 1.0; // alpha =1
const float beta = 0.0; // beta =0
// allocate matrices
cudaMallocManaged (& a , mm * sizeof ( float )); // unif . memory for a
cudaMallocManaged (& b , mn * sizeof ( float )); // unif . memory for b
cudaMallocManaged (& c , mn * sizeof ( float )); // unif . memory for c
// generate random matrices a , b ;
lapackf77_slarnv (& ione , ISEED ,& mm , a ); // randomize a
lapackf77_slaset ( MagmaFullStr ,& m ,& n ,& alpha ,& alpha ,b ,& m );
// symmetrize a and increase diagonal
magma_smake_hpd ( m , a , m );
printf ( " upper left corner of the expected solution :\ n " );
magma_sprint ( 4 , 4 , b , m ); // part of the expected solution
// right hand side c = a * b
blasf77_sgemm ( " N " ," N " ,&m ,& n ,& n ,& alpha ,a ,& m ,b ,& m ,& beta ,c ,& m );
// solve the linear system a * x = c
// c - mxn matrix , a - mxm symmetric , positive def . matrix ;
// c is overwritten by the solution ,
// use the Cholesky factorization a = L * L ^ T
gpu_time = magma_sync_wtime ( NULL );
magma_sposv(MagmaLower,m,n,a,m,c,m,&info);
magma_dposv(MagmaLower,m,n,a,m,c,m,&info);
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " magma_dposv time : %7.5 f sec .\ n " , gpu_time ); // Magma
printf ( " upper left corner of the Magma solution :\ n " ); // time
magma_dprint ( 4 , 4 , c , m ); // part of Magma solution
free ( a ); // free host memory
free ( b ); // free host memory
free ( c ); // free host memory
magma_finalize (); // finalize Magma
return 0;
}
// upper left corner of the expected solution :
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
// magma_dposv time : 1.39989 sec .
// upper left corner of the Magma solution :
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
magma_dposv(MagmaLower,m,n,a,m,c,m,&info);
A X = B,
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
A X = B,
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
// magma_spotrf + spotrs time : 0.48314 sec .
// upper left corner of the Magma / Lapack solution :
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
A X = B,
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
A X = B,
4.4.14 magma_spotrf_gpu, magma_spotrs_gpu - unified memory version
// ];
// magma_spotrf_gpu + magma_spotrs_gpu time : 0.09600 sec .
// upper left corner of the solution :
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
A X = B,
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
// magma_dpotrf_gpu + magma_dpotrs_gpu time : 0.93016 sec .
// upper left corner of the solution :
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
4.4.16 magma_dpotrf_gpu, magma_dpotrs_gpu - unified memory version
A X = B,
A X = B,
A A^{-1} = A^{-1} A = I.
// [
// 1.0000 0.0000 -0.0000 0.0000
// 0.0000 1.0000 0.0000 -0.0000
// -0.0000 0.0000 1.0000 0.0000
// 0.0000 -0.0000 0.0000 1.0000
// ];
A A^{-1} = A^{-1} A = I.
magma_int_t ione = 1;
magma_int_t ISEED [4] = {0 ,0 ,0 ,1}; // seed
const double alpha = 1.0; // alpha =1
const double beta = 0.0; // beta =0
// allocate matrices
cudaMallocManaged (& a , mm * sizeof ( double )); // unified mem . for a
cudaMallocManaged (& r , mm * sizeof ( double )); // unified mem . for r
cudaMallocManaged (& c , mm * sizeof ( double )); // unified mem . for c
// generate random matrix a
lapackf77_dlarnv (& ione , ISEED ,& mm , a ); // randomize a
// symmetrize a and increase diagonal
magma_dmake_hpd ( m , a , m );
lapackf77_dlacpy ( MagmaFullStr ,& m ,& m ,a ,& m ,r ,& m ); // a - > r
// find the inverse matrix a ^ -1: a * X = I for mxm symmetric ,
// positive definite matrix a using the Cholesky decomposition
// obtained by magma_dpotrf ; a is overwritten by the inverse
gpu_time = magma_sync_wtime ( NULL );
A A^{-1} = A^{-1} A = I.
computed by magma_spotrf_gpu. The matrix A is defined on the device and
on exit it is replaced by its inverse. See magma-X.Y.Z/src/spotri_gpu.cpp
for more details.
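The following minimal sketch (an illustration, not one of the listings of this
section) combines magma_spotrf_gpu and magma_spotri_gpu on a simple
positive definite test matrix; note that on exit only the lower triangle of the
device array holds the inverse.

#include <stdio.h>
#include "magma_v2.h"

int main(void){
  magma_init();                                    // initialize Magma
  magma_queue_t queue = NULL;
  magma_queue_create(0, &queue);
  magma_int_t m = 2048, info;
  float *a, *d_a;
  magma_smalloc_cpu(&a, m*m);                      // host memory for a
  magma_smalloc(&d_a, m*m);                        // device memory for a
  // a - symmetric positive definite matrix (ones plus m on the diagonal)
  for (magma_int_t j = 0; j < m*m; j++) a[j] = 1.0f;
  for (magma_int_t j = 0; j < m; j++) a[j + m*j] += (float) m;
  magma_ssetmatrix(m, m, a, m, d_a, m, queue);     // copy a -> d_a
  // Cholesky factorization d_a = L*L^T (lower triangle of d_a -> L)
  magma_spotrf_gpu(MagmaLower, m, d_a, m, &info);
  // inversion using the Cholesky factor; the lower triangle of d_a
  // is overwritten by the lower triangle of the inverse
  magma_spotri_gpu(MagmaLower, m, d_a, m, &info);
  magma_sgetmatrix(m, m, d_a, m, a, m, queue);     // copy d_a -> a
  printf("after spotrf_gpu+spotri_gpu: info = %d\n", (int) info);
  printf("first diagonal entry of the inverse: %g\n", a[0]);
  magma_free_cpu(a); magma_free(d_a);
  magma_queue_destroy(queue);
  magma_finalize();                                // finalize Magma
  return 0;
}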
A A^{-1} = A^{-1} A = I.
4.5.1 magma_sgels_gpu - the least squares solution of a linear system using QR decomposition in single precision, GPU interface
This function solves in single precision the least squares problem
min_X ||A X − B||,
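A hedged sketch of a call to magma_sgels_gpu follows. The random overde-
termined system and the workspace size, which follows the lower bound used
in the Magma testing codes, are assumptions made for this illustration; the
right hand side is constructed so that the least squares solution is close to
the vector of ones.

#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"

int main(void){
  magma_init();                                    // initialize Magma
  magma_queue_t queue = NULL;
  magma_queue_create(0, &queue);
  magma_int_t m = 1024, n = 512, nrhs = 1, info;   // overdetermined system
  float *a, *b, *hwork, *d_a, *d_b;
  magma_smalloc_cpu(&a, m*n);                      // host memory for a
  magma_smalloc_cpu(&b, m);                        // host memory for b
  magma_smalloc(&d_a, m*n);                        // device memory for a
  magma_smalloc(&d_b, m);                          // device memory for b
  for (magma_int_t j = 0; j < m*n; j++)            // random a
    a[j] = rand() / (float) RAND_MAX;
  // right hand side b = a*(vector of ones), so the least squares
  // solution is (close to) the vector of ones
  for (magma_int_t i = 0; i < m; i++) {
    b[i] = 0.0f;
    for (magma_int_t j = 0; j < n; j++) b[i] += a[i + m*j];
  }
  magma_ssetmatrix(m, n, a, m, d_a, m, queue);     // copy a -> d_a
  magma_ssetmatrix(m, nrhs, b, m, d_b, m, queue);  // copy b -> d_b
  // host workspace; size taken from the lower bound used in Magma
  // testing codes (an assumption here)
  magma_int_t nb = magma_get_sgeqrf_nb(m, n);
  magma_int_t lwork = (m - n + nb)*(nrhs + nb) + nrhs*nb;
  magma_smalloc_cpu(&hwork, lwork);
  // least squares solution of min ||d_a*x - d_b||; the first n rows
  // of d_b are overwritten by the solution
  magma_sgels_gpu(MagmaNoTrans, m, n, nrhs, d_a, m, d_b, m,
                  hwork, lwork, &info);
  magma_sgetmatrix(n, nrhs, d_b, m, b, m, queue);  // copy the solution to host
  printf("after sgels_gpu: info = %d\n", (int) info);
  printf("solution: %g, %g, ...\n", b[0], b[1]);
  magma_free_cpu(a); magma_free_cpu(b); magma_free_cpu(hwork);
  magma_free(d_a); magma_free(d_b);
  magma_queue_destroy(queue);
  magma_finalize();                                // finalize Magma
  return 0;
}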
4.5.3 magma_dgels_gpu - the least squares solution of a linear system using QR decomposition in double precision, GPU interface
This function solves in double precision the least squares problem
min_X ||A X − B||,
// [
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// 1.0000 1.0000 1.0000 1.0000
// ];
cudaMallocManaged (& c1 , mnrhs * sizeof ( double )); // uni . mem . for c1
// Get size for workspace
lhwork = -1;
lapackf77_dgeqrf (& m , &n , a , &m , tau , tmp , & lhwork , & info );
l1 = ( magma_int_t ) MAGMA_D_REAL ( tmp [0] );
lhwork = -1;
lapackf77_dormqr ( MagmaLeftStr , MagmaTransStr ,
&m , & nrhs , & min_mn , a , &m , tau ,
c , &m , tmp , & lhwork , & info );
l2 = ( magma_int_t ) MAGMA_D_REAL ( tmp [0] );
lhwork = max ( max ( l1 , l2 ) , lworkgpu );
cudaMallocManaged (& hwork , lhwork * sizeof ( double )); // mem . - hwork
lapackf77_dlarnv ( & ione , ISEED , & mn , a ); // randomize a
lapackf77_dlaset ( MagmaFullStr ,& n ,& nrhs ,& alpha ,& alpha ,b ,& m );
// b - mxnrhs matrix of ones
blasf77_dgemm ( " N " ," N " ,&m ,& nrhs ,& n ,& alpha ,a ,& m ,b ,& m ,& beta ,
c ,& m ); // right hand side c = a * b
// so the exact solution is the matrix of ones
// MAGMA
magma_dsetmatrix ( m , n , a ,m , a1 , ldda , queue ); // copy a -> a1
magma_dsetmatrix ( m , nrhs , c ,m , c1 , lddb , queue ); // c -> c1
gpu_time = magma_sync_wtime ( NULL );
// solve the least squares problem min || a1 *x - c1 ||
// using the QR decomposition ,
// the solution overwrites c
4.5.5 magma_sgels3_gpu - the least squares solution of a linear system using QR decomposition in single precision, GPU interface
This function solves in single precision the least squares problem
min_X ||A X − B||,
// [
// 1.4768 -10.2685 -5.7679 -5.9999
// -2.2205 2.5265 -0.9392 3.5168
// -2.9938 1.8580 -3.0401 4.5591
// 1.8242 3.6547 3.5300 -1.1885
// ];
cudaMallocManaged (& b1 , lddb * nrhs * sizeof ( float )); // mem . for b1
cudaMallocManaged (& piv , n * sizeof ( magma_int_t )); // mem . for piv
// Get size for workspace
lhwork = -1;
lapackf77_sgeqrf (& m , &n , a , &m , tau , tmp , & lhwork , & info );
l1 = ( magma_int_t ) MAGMA_D_REAL ( tmp [0] );
lhwork = -1;
lapackf77_sormqr ( MagmaLeftStr , MagmaTransStr ,
&m , & nrhs , & min_mn , a , & lda , tau ,
x , & ldb , tmp , & lhwork , & info );
l2 = ( magma_int_t ) MAGMA_S_REAL ( tmp [0] );
lhwork = max ( max ( l1 , l2 ) , lworkgpu );
// magma_sgels3 needs this workspace
cudaMallocManaged (& hwork , lhwork * sizeof ( float )); // mem . f . hwork
// randomize the matrices a , b
lapackf77_slarnv ( & ione , ISEED , & n2 , a ); // random a
n2 = m * nrhs ; // size of b , x , r
lapackf77_slarnv ( & ione , ISEED , & n2 , b ); // random b
// make copies of a and b : a - > a2 , b -> r ( they are overwrit .)
lapackf77_slacpy ( MagmaFullStr ,& m ,& nrhs ,b ,& ldb ,r ,& ldb );
lapackf77_slacpy ( MagmaFullStr ,& m ,& m ,a ,& lda , a2 ,& lda );
// copies of a , b for MAGMA
magma_ssetmatrix (m ,n ,a , lda , a1 , ldda , queue ); // copy a -> a1
magma_ssetmatrix ( m , nrhs ,b , ldb , b1 , lddb , queue ); // b -> b1
gpu_perf = magma_sync_wtime ( NULL );
// solve the least squares problem min || a1 *x - b1 ||
// using the QR decomposition ,
// the solution overwrites b1
// MAGMA version
4.5.7 magma_dgels3_gpu - the least squares solution of a linear system using QR decomposition in double precision, GPU interface
This function solves in double precision the least squares problem
min_X ||A X − B||,
return EXIT_SUCCESS ;
}
// MAGMA time : 3.032 sec .
// upper left corner of the Magma solution :
// [
// -2.9416 0.1624 0.2631 -2.0923
// -0.0242 0.5965 -0.4656 -0.3765
// 0.6595 0.5525 0.5783 -0.1609
// -0.5521 -1.2515 0.0901 -0.2223
// ];
// LAPACK time : 18.957 sec .
// upper left corner of the Lapack solution :
// [
// -2.9416 0.1624 0.2631 -2.0923
// -0.0242 0.5965 -0.4656 -0.3765
// 0.6595 0.5525 0.5783 -0.1609
// -0.5521 -1.2515 0.0901 -0.2223
// ];
// upper left corner of the Lapack dgesv solution
// for comparison :
// [
// -2.9416 0.1624 0.2631 -2.0923
// -0.0242 0.5965 -0.4656 -0.3765
// 0.6595 0.5525 0.5783 -0.1609
// -0.5521 -1.2515 0.0901 -0.2223
// ];
// [
// -2.9416 0.1624 0.2631 -2.0923
// -0.0242 0.5965 -0.4656 -0.3765
// 0.6595 0.5525 0.5783 -0.1609
// -0.5521 -1.2515 0.0901 -0.2223
// ];
A = Q R,
where A is an m × n general matrix defined on the host, R is upper tri-
angular (upper trapezoidal in general case) and Q is orthogonal. On exit
the upper triangle (trapezoid) of A contains R. The orthogonal matrix Q
is represented as a product of elementary reflectors H(1) . . . H(min(m, n)),
where H(k) = I − τ_k v_k v_k^T. The real scalars τ_k are stored in an array τ
and the nonzero components of the vectors v_k are stored on exit in parts of
columns of A corresponding to the lower triangular (trapezoidal) part of A:
v_k(1 : k − 1) = 0, v_k(k) = 1 and v_k(k + 1 : m) is stored in A(k + 1 : m, k).
See magma-X.Y.Z/src/sgeqrf.cpp for more details.
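The following minimal sketch (not one of the listings of this section) calls
magma_sgeqrf on a random host matrix; the workspace size is chosen as in
the surrounding listings and the random test data are arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"

int main(void){
  magma_init();                                    // initialize Magma
  magma_int_t m = 2048, n = 2048, info;
  magma_int_t min_mn = (m < n) ? m : n;
  float *a, *tau, *hwork;
  magma_smalloc_cpu(&a, m*n);                      // host memory for a
  magma_smalloc_cpu(&tau, min_mn);                 // scalars of the reflectors
  for (magma_int_t j = 0; j < m*n; j++)            // random a
    a[j] = rand() / (float) RAND_MAX;
  // workspace size chosen as in the surrounding listings
  magma_int_t nb = magma_get_sgeqrf_nb(m, n);      // optimal block size
  magma_int_t lwork = n*nb;
  if (2*nb*nb > lwork) lwork = 2*nb*nb;
  magma_smalloc_cpu(&hwork, lwork);                // host workspace
  // QR factorization a = Q*R; R is stored in the upper triangle of a,
  // the reflectors v_k in the lower triangle, the scalars tau_k in tau
  magma_sgeqrf(m, n, a, m, tau, hwork, lwork, &info);
  printf("after sgeqrf: info = %d\n", (int) info);
  printf("R(1,1) = %g\n", a[0]);                   // first entry of R
  magma_free_cpu(a); magma_free_cpu(tau); magma_free_cpu(hwork);
  magma_finalize();                                // finalize Magma
  return 0;
}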
lapackf77_sgeqrf (& m ,& n ,a ,& m , tau , tmp ,& lhwork ,& info );
lhwork = ( magma_int_t ) MAGMA_S_REAL ( tmp [0] );
lhwork = max ( lhwork , max ( n * nb ,2* nb * nb ));
magma_smalloc_cpu (& hwork , lhwork ); // host memory for hwork
// Randomize the matrix
lapackf77_slarnv ( & ione , ISEED , & n2 , a ); // randomize a
lapackf77_slacpy ( MagmaFullStr ,& m ,& n ,a ,& m ,r ,& m ); // a - > r
// MAGMA
gpu_time = magma_sync_wtime ( NULL );
// compute a QR factorization of a real mxn matrix a
// a = Q *R , Q - orthogonal , R - upper triangular
A = Q R,
where A is an m × n general matrix defined on the host, R is upper tri-
angular (upper trapezoidal in general case) and Q is orthogonal. On exit
the upper triangle (trapezoid) of A contains R. The orthogonal matrix Q
is represented as a product of elementary reflectors H(1) . . . H(min(m, n)),
where H(k) = I − τ_k v_k v_k^T. The real scalars τ_k are stored in an array τ
and the nonzero components of the vectors v_k are stored on exit in parts of
columns of A corresponding to the lower triangular (trapezoidal) part of A:
v_k(1 : k − 1) = 0, v_k(k) = 1 and v_k(k + 1 : m) is stored in A(k + 1 : m, k).
See magma-X.Y.Z/src/dgeqrf.cpp for more details.
A = Q R,
{
magma_init (); // initialize Magma
magma_queue_t queue = NULL ;
magma_int_t dev =0;
magma_queue_create ( dev ,& queue );
double gpu_time , cpu_time ;
magma_int_t m = 8192 , n = 8192 , n2 = m *n , ldda ;
float *a , * r ; // a , r - mxn matrices
float * a1 ; // a1 mxn matrix , used by Magma sgeqrf2_gpu
float * tau ; // scalars defining the elementary reflectors
float * hwork , tmp [1]; // hwork - workspace ; tmp - used in
magma_int_t info , min_mn ; // comp . workspace size
magma_int_t ione = 1 , lhwork ; // lhwork - workspace size
magma_int_t ISEED [4] = {0 ,0 ,0 ,1}; // seed
ldda = (( m +31)/32)*32; // ldda = m if 32 divides m
min_mn = min (m , n );
float mzone = MAGMA_D_NEG_ONE ;
float matnorm , work [1]; // used in difference computations
// prepare unified memory
cudaMallocManaged (& tau , min_mn * sizeof ( float )); // u . mem . for tau
cudaMallocManaged (& a , n2 * sizeof ( float )); // unified mem . for a
cudaMallocManaged (& r , n2 * sizeof ( float )); // unified mem . for r
cudaMallocManaged (& a1 , ldda * n * sizeof ( float )); // uni . mem . for a1
// Get size for workspace
lhwork = -1;
lapackf77_sgeqrf (& m ,& n ,a ,& m , tau , tmp ,& lhwork ,& info );
lhwork = ( magma_int_t ) MAGMA_D_REAL ( tmp [0] );
cudaMallocManaged (& hwork , lhwork * sizeof ( float )); // mem . f . hwork
// Randomize the matrix
lapackf77_slarnv ( & ione , ISEED , & n2 , a ); // randomize a
// MAGMA
magma_ssetmatrix ( m , n , a ,m , a1 , ldda , queue ); // copy a -> a1
gpu_time = magma_sync_wtime ( NULL );
// compute a QR factorization of a real mxn matrix a1
// a1 = Q *R , Q - orthogonal , R - upper triangular
A = Q R,
where A is an m × n general matrix defined on the device, R is upper
triangular (upper trapezoidal in general case) and Q is orthogonal. On exit
the upper triangle (trapezoid) of A contains R. The orthogonal matrix Q
is represented as a product of elementary reflectors H(1) . . . H(min(m, n)),
where H(k) = I − τ_k v_k v_k^T. The real scalars τ_k are stored in an array τ
and the nonzero components of the vectors v_k are stored on exit in parts of
columns of A corresponding to the lower triangular (trapezoidal) part of A:
v_k(1 : k − 1) = 0, v_k(k) = 1 and v_k(k + 1 : m) is stored in A(k + 1 : m, k).
See magma-X.Y.Z/src/dgeqrf_gpu.cpp for more details.
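As a hedged illustration we sketch below a call to the simpler
magma_dgeqrf2_gpu variant (also mentioned in the unified memory listing
above), which is assumed here to keep the scalars τ_k on the host and to
need no additional device workspace, unlike the routine described above;
the random test matrix is arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"

int main(void){
  magma_init();                                    // initialize Magma
  magma_queue_t queue = NULL;
  magma_queue_create(0, &queue);
  magma_int_t m = 2048, n = 2048, info;
  magma_int_t min_mn = (m < n) ? m : n;
  magma_int_t ldda = ((m + 31)/32)*32;             // ldda = m if 32 divides m
  double *a, *tau, *d_a;
  magma_dmalloc_cpu(&a, m*n);                      // host memory for a
  magma_dmalloc_cpu(&tau, min_mn);                 // scalars of the reflectors
  magma_dmalloc(&d_a, ldda*n);                     // device memory for a
  for (magma_int_t j = 0; j < m*n; j++)            // random a
    a[j] = rand() / (double) RAND_MAX;
  magma_dsetmatrix(m, n, a, m, d_a, ldda, queue);  // copy a -> d_a
  // QR factorization of the matrix stored on the device; R overwrites
  // the upper triangle of d_a, the reflectors the lower one
  magma_dgeqrf2_gpu(m, n, d_a, ldda, tau, &info);
  magma_dgetmatrix(m, n, d_a, ldda, a, m, queue);  // copy d_a -> a
  printf("after dgeqrf2_gpu: info = %d\n", (int) info);
  printf("R(1,1) = %g\n", a[0]);                   // first entry of R
  magma_free_cpu(a); magma_free_cpu(tau); magma_free(d_a);
  magma_queue_destroy(queue);
  magma_finalize();                                // finalize Magma
  return 0;
}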
A = Q R,
lapackf77_sgeqrf (& m ,& n ,a ,& m , tau , h_work ,& lhwork ,& info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " Lapack sgeqrf time : %7.5 f sec .\ n " , cpu_time );
// MAGMA // print Lapack time
magma_ssetmatrix_1D_col_bcyclic ( num_gpus , m , n , nb , r ,m , d_la ,
ldda , queues ); // distribute r -> num_gpus devices
gpu_time = magma_sync_wtime ( NULL );
// QR decomposition on num_gpus devices
A = Q R,
where A is an m×n general matrix, R is upper triangular (upper trapezoidal
in general case) and Q is orthogonal. The matrix A and the factors are
n_local [ i ] += n % nb ;
magma_setdevice ( i );
magma_dmalloc (& d_la [ i ] , ldda * n_local [ i ]); // device memory
// on num_gpus GPUs
printf ( " device %2 d n_local =%4 d \ n " ,( int )i ,( int ) n_local [ i ]);
}
magma_setdevice (0);
// Get size for host workspace
lhwork = -1;
lapackf77_dgeqrf (& m , &n , a , &m , tau , tmp , & lhwork , & info );
lhwork = ( magma_int_t ) MAGMA_D_REAL ( tmp [0] );
magma_dmalloc_cpu (& h_work , lhwork ); // host memory for h_work
// Lapack dgeqrf needs this array
// Randomize the matrix a and copy a -> r
lapackf77_dlarnv (& ione , ISEED ,& n2 , a );
lapackf77_dlacpy ( MagmaFullStr ,& m ,& n ,a ,& m ,r ,& m ); // a - > r
// LAPACK
cpu_time = magma_wtime ();
// QR decomposition on the host
lapackf77_dgeqrf (& m ,& n ,a ,& m , tau , h_work ,& lhwork ,& info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " Lapack dgeqrf time : %7.5 f sec .\ n " , cpu_time );
// MAGMA // print Lapack time
magma_dsetmatrix_1D_col_bcyclic ( num_gpus , m , n , nb , r ,m , d_la ,
ldda , queues ); // distribute r -> num_gpus devices
gpu_time = magma_sync_wtime ( NULL );
// QR decomposition on num_gpus devices
return EXIT_SUCCESS ;
}
// Number of GPUs to be used = 1
// device 0 n_local =8192
// Lapack dgeqrf time : 16.91422 sec .
// Magma dgeqrf_mgpu time : 2.99641 sec .
// difference : 4.933266 e -15
A = L Q,
A = L Q,
where A is an m × n general matrix defined on the host, L is lower tri-
angular (lower trapezoidal in general case) and Q is orthogonal. On exit
the lower triangle (trapezoid) of A contains L. The orthogonal matrix Q
is represented as a product of elementary reflectors H(1) . . . H(min(m, n)),
where H(k) = I − τ_k v_k v_k^T. The real scalars τ_k are stored in an array τ
and the nonzero components of the vectors v_k are stored on exit in parts of
rows of A corresponding to the upper triangular (trapezoidal) part of A:
v_k(1 : k − 1) = 0, v_k(k) = 1 and v_k(k + 1 : n) is stored in A(k, k + 1 : n). See
magma-X.Y.Z/src/dgelqf.cpp for more details.
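A minimal sketch of a magma_dgelqf call follows (an illustration, not one of
the listings of this section); the Lapack workspace query and the reuse of the
geqrf block size for the workspace lower bound are assumptions, and the
random test matrix is arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include "magma_v2.h"
#include "magma_lapack.h"

int main(void){
  magma_init();                                    // initialize Magma
  magma_int_t m = 2048, n = 2048, info;
  magma_int_t min_mn = (m < n) ? m : n;
  double *a, *tau, *hwork, tmp[1];
  magma_dmalloc_cpu(&a, m*n);                      // host memory for a
  magma_dmalloc_cpu(&tau, min_mn);                 // scalars of the reflectors
  for (magma_int_t j = 0; j < m*n; j++)            // random a
    a[j] = rand() / (double) RAND_MAX;
  // workspace size: Lapack workspace query, enlarged to m*nb, where
  // the geqrf block size is reused here as an assumption
  magma_int_t lwork = -1;
  lapackf77_dgelqf(&m, &n, a, &m, tau, tmp, &lwork, &info);
  lwork = (magma_int_t) tmp[0];
  magma_int_t nb = magma_get_dgeqrf_nb(m, n);
  if (m*nb > lwork) lwork = m*nb;
  magma_dmalloc_cpu(&hwork, lwork);                // host workspace
  // LQ factorization a = L*Q; L is stored in the lower triangle of a,
  // the reflectors v_k in the upper triangle, the scalars tau_k in tau
  magma_dgelqf(m, n, a, m, tau, hwork, lwork, &info);
  printf("after dgelqf: info = %d\n", (int) info);
  printf("L(1,1) = %g\n", a[0]);                   // first entry of L
  magma_free_cpu(a); magma_free_cpu(tau); magma_free_cpu(hwork);
  magma_finalize();                                // finalize Magma
  return 0;
}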
A = L Q,
where A is an m × n general matrix defined on the device, L is lower
triangular (lower trapezoidal in general case) and Q is orthogonal. On exit
the lower triangle (trapezoid) of A contains L. The orthogonal matrix Q
is represented as a product of elementary reflectors H(1) . . . H(min(m, n)),
where H(k) = I − τ_k v_k v_k^T. The real scalars τ_k are stored in an array τ
and the nonzero components of the vectors v_k are stored on exit in parts of
rows of A corresponding to the upper triangular (trapezoidal) part of A:
v_k(1 : k − 1) = 0, v_k(k) = 1 and v_k(k + 1 : n) is stored in A(k, k + 1 : n). See
magma-X.Y.Z/src/sgelqf_gpu.cpp for more details.
A = L Q,
where A is an m × n general matrix defined on the device, L is lower
triangular (lower trapezoidal in general case) and Q is orthogonal. On exit
the lower triangle (trapezoid) of A contains L. The orthogonal matrix Q
is represented as a product of elementary reflectors H(1) . . . H(min(m, n)),
where H(k) = I − τk vk vk^T. The real scalars τk are stored in an array τ
and the nonzero components of vectors vk are stored on exit in parts of
rows of A corresponding to the upper triangular (trapezoidal) part of A:
vk (1 : k − 1) = 0, vk (k) = 1 and vk (k + 1 : n) is stored in A(k, k + 1 : n). See
magma-X.Y.Z/src/dgelqf_gpu.cpp for more details.
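For the GPU interface only the matrix lives on the device; tau and the work buffer stay on the host. The sketch below reuses the QR blocking size to choose lwork, which is an assumption rather than the documented minimum.

// sketch only: d_a is an m x n device matrix with leading dimension ldda
magma_int_t info, nb = magma_get_dgeqrf_nb(m, n);  // block size (assumed reusable)
magma_int_t lwork = m*nb;
double *tau, *work;                                // host arrays
magma_dmalloc_cpu(&tau, (m < n ? m : n));
magma_dmalloc_cpu(&work, lwork);
magma_dgelqf_gpu(m, n, d_a, ldda, tau, work, lwork, &info);
// on exit: L in the lower triangle of d_a, the reflectors v_k above it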
cudaMallocManaged (& h_work , lwork * sizeof ( float )); // mem . h_work
// Randomize the matrix a and copy a -> r
cudaMallocManaged (& h_work , lwork * sizeof ( double )); // mem . h_work
// Randomize the matrix a and copy a -> r
lapackf77_dlarnv (& ione , ISEED ,& n2 , a );
lapackf77_dlacpy ( MagmaFullStr ,& m ,& n ,a ,& m ,r ,& m ); // a - > r
// LAPACK
for ( j = 0; j < n ; j ++)
jpvt [ j ] = 0;
cpu_time = magma_wtime ();
// QR decomposition with column pivoting , Lapack version
lapackf77_dgeqp3 (& m ,& n ,r ,& m , jpvt , tau , h_work ,& lwork ,& info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " LAPACK time : %7.3 f sec .\ n " , cpu_time ); // Lapack time
// MAGMA
lapackf77_dlacpy ( MagmaFullStr ,& m ,& n ,a ,& m ,r ,& m ); // a - > r
for ( j = 0; j < n ; j ++)
jpvt [ j ] = 0 ;
gpu_time = magma_sync_wtime ( NULL );
// QR decomposition with column pivoting , Magma version
magma_sgeev ( MagmaNoVec , MagmaVec ,n ,r ,n , wr1 , wi1 , VL ,n , VR ,n ,
              h_work , lwork ,& info );
printf ( " first 5 eigenvalues of a :\ n " );
for ( j =0; j <5; j ++)
printf ( " % f +% f * I \ n " , wr1 [ j ] , wi1 [ j ]); // print eigenvalues
printf ( " left upper corner of right eigenvectors matrix :\ n " );
magma_sprint (5 ,5 , VR , n ); // print right eigenvectors
// Lapack version // in columns
lapackf77_sgeev ( " N " ," V " ,&n ,a ,& n , wr2 , wi2 , VL ,& n , VR ,& n ,
h_work ,& lwork ,& info );
// difference in real parts of eigenvalues
blasf77_saxpy ( &n , & mione , wr1 , & incr , wr2 , & incr );
error = lapackf77_slange ( " M " , &n , & ione , wr2 , &n , work );
printf ( " difference in real parts : % e \ n " , error );
// difference in imaginary parts of eigenvalues
blasf77_saxpy ( &n , & mione , wi1 , & inci , wi2 , & inci );
error = lapackf77_slange ( " M " , &n , & ione , wi2 , &n , work );
printf ( " difference in imaginary parts : % e \ n " , error );
// right eigenvectors
float * wr1 , * wr2 ; // wr1 , wr2 - real parts of eigenvalues
float * wi1 , * wi2 ; // wi1 , wi2 - imaginary parts of
magma_int_t ione = 1 , i , j , info , nb ; // eigenvalues
float mione = -1.0 , error , * h_work ; // h_work - workspace
magma_int_t incr = 1 , inci = 1 , lwork ; // lwork - worksp . size
nb = magma_get_sgehrd_nb ( n ); // optimal block size for sgehrd
float work [1]; // used in difference computations
lwork = n *(2+ nb );
lwork = max ( lwork , n *(5+2* n ));
cudaMallocManaged (& wr1 , n * sizeof ( float )); // unified memory for
cudaMallocManaged (& wr2 , n * sizeof ( float )); // real parts of eig
cudaMallocManaged (& wi1 , n * sizeof ( float )); // unified memory for
cudaMallocManaged (& wi2 , n * sizeof ( float )); // imag . parts of eig
cudaMallocManaged (& a , n2 * sizeof ( float )); // unif . memory for a
cudaMallocManaged (& r , n2 * sizeof ( float )); // unif . memory for r
cudaMallocManaged (& VL , n2 * sizeof ( float )); // u . mem . for left and
cudaMallocManaged (& VR , n2 * sizeof ( float )); // right eigenvect .
cudaMallocManaged (& h_work , lwork * sizeof ( float ));
// define a , r // [1 0 0 0 0 ...]
for ( i =0; i < n ; i ++){ // [0 2 0 0 0 ...]
a [ i * n + i ]=( float )( i +1); // a = [0 0 3 0 0 ...]
r [ i * n + i ]=( float )( i +1); // [0 0 0 4 0 ...]
} // [0 0 0 0 5 ...]
printf ( " upper left corner of a :\ n " ); // .............
magma_sprint (5 ,5 , a , n ); // print a
// compute the eigenvalues and the right eigenvectors
// for a general , real nxn matrix ,
// Magma version , left eigenvectors not computed ,
// right eigenvectors are computed
magma_sgeev ( MagmaNoVec , MagmaVec ,n ,r ,n , wr1 , wi1 , VL ,n , VR ,n ,
              h_work , lwork ,& info );
magma_dgeev ( MagmaNoVec , MagmaVec ,n ,r ,n , wr1 , wi1 , VL ,n , VR ,n ,
              h_work , lwork ,& info );
printf ( " first 5 eigenvalues of a :\ n " );
for ( j =0; j <5; j ++)
printf ( " % f +% f * I \ n " , wr1 [ j ] , wi1 [ j ]); // print eigenvalues
printf ( " left upper corner of right eigenvectors matrix :\ n " );
magma_dprint (5 ,5 , VR , n ); // print right eigenvectors
// Lapack version // in columns
lapackf77_dgeev ( " N " ," V " ,&n ,a ,& n , wr2 , wi2 , VL ,& n , VR ,& n ,
h_work ,& lwork ,& info );
magma_dgeev ( MagmaNoVec , MagmaVec ,n ,r ,n , wr1 , wi1 , VL ,n , VR ,n ,
              h_work , lwork ,& info );
printf ( " first 5 eigenvalues of a :\ n " );
for ( j =0; j <5; j ++)
printf ( " % f +% f * I \ n " , wr1 [ j ] , wi1 [ j ]); // print eigenvalues
printf ( " left upper corner of right eigenvectors matrix :\ n " );
magma_dprint (5 ,5 , VR , n ); // print right eigenvectors
// Lapack version // in columns
lapackf77_dgeev ( " N " ," V " ,&n ,a ,& n , wr2 , wi2 , VL ,& n , VR ,& n ,
h_work ,& lwork ,& info );
// difference in real parts of eigenvalues
blasf77_daxpy ( &n , & mione , wr1 , & incr , wr2 , & incr );
error = lapackf77_dlange ( " M " , &n , & ione , wr2 , &n , work );
printf ( " difference in real parts : % e \ n " , error );
// difference in imaginary parts of eigenvalues
blasf77_daxpy ( &n , & mione , wi1 , & inci , wi2 , & inci );
error = lapackf77_dlange ( " M " , &n , & ione , wi2 , &n , work );
printf ( " difference in imaginary parts : % e \ n " , error );
magma_free ( wr1 ); // free memory
magma_free ( wr2 ); // free memory
magma_free ( wi1 ); // free memory
magma_free ( wi2 ); // free memory
magma_free ( a ); // free memory
magma_free ( r ); // free memory
magma_free ( VL ); // free memory
magma_free ( VR ); // free memory
magma_free ( h_work ); // free memory
magma_finalize (); // finalize Magma
return EXIT_SUCCESS ;
}
// upper left corner of a :
// [
// 1.0000 0. 0. 0. 0.
// 0. 2.0000 0. 0. 0.
// 0. 0. 3.0000 0. 0.
// 0. 0. 0. 4.0000 0.
// 0. 0. 0. 0. 5.0000
// ];
// first 5 eigenvalues of a :
// 1.000000+0.000000*I
// 2.000000+0.000000*I
// 3.000000+0.000000*I
// 4.000000+0.000000*I
// 5.000000+0.000000*I
// left upper corner of right eigenvectors matrix :
// [
// 1.0000 0. 0. 0. 0.
// 0. 1.0000 0. 0. 0.
// 0. 0. 1.0000 0. 0.
// 0. 0. 0. 1.0000 0.
// 0. 0. 0. 0. 1.0000
// ];
// difference in real parts : 0.000000 e +00
// difference in imaginary parts : 0.000000 e +00
Similarly, the second parameter answers the question whether the right eigen-
vectors are to be computed. The computed eigenvectors are normalized to
have Euclidean norm equal to one. If computed, the left eigenvectors are
stored in columns of an array VL and the right eigenvectors in columns of
VR. The real and imaginary parts of eigenvalues are stored in arrays wr, wi
respectively. See magma-X.Y.Z/src/sgeev.cpp for more details.
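Since the matrix is real, complex eigenvalues come in conjugate pairs and, in the LAPACK storage convention used by these routines, such a pair occupies two consecutive columns of VR: the eigenvectors are VR(:,j) ± i·VR(:,j+1). A small sketch of reading the results back after the call:

// sketch: wr, wi hold the real/imaginary parts of the eigenvalues and
// VR the right eigenvectors (column-major, leading dimension n)
for (int j = 0; j < n; j++) {
  if (wi[j] == 0.0f) {                 // real eigenvalue
    printf("lambda=%f, eigenvector in column %d of VR\n", wr[j], j);
  } else {                             // conjugate pair: VR(:,j) +/- i*VR(:,j+1)
    printf("lambda=%f+/-%f*I, eigenvectors in columns %d,%d of VR\n",
           wr[j], wi[j], j, j + 1);
    j++;                               // skip the second member of the pair
  }
}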
magma_sgeev ( MagmaNoVec , MagmaNoVec ,n ,r ,n , wr1 , wi1 , VL ,n , VR ,n ,
              h_work , lwork ,& info );
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " sgeev gpu time : %7.5 f sec .\ n " , gpu_time ); // Magma
// LAPACK // time
cpu_time = magma_wtime ();
// compute the eigenvalues of a general , real nxn matrix a ,
// Lapack version
lapackf77_sgeev ( " N " , " N " , &n , a , &n ,
wr2 , wi2 , VL , &n , VR , &n , h_work , & lwork , & info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " sgeev cpu time : %7.5 f sec .\ n " , cpu_time ); // Lapack
free ( wr1 ); // time
free ( wr2 ); // free host memory
free ( wi1 ); // free host memory
free ( wi2 ); // free host memory
free ( a ); // free host memory
magma_free_pinned ( r ); // free host memory
magma_free_pinned ( VL ); // free host memory
magma_free_pinned ( VR ); // free host memory
magma_free_pinned ( h_work ); // free host memory
magma_finalize ( ); // finalize Magma
return EXIT_SUCCESS ;
}
// sgeev gpu time : 46.21376 sec .
// sgeev cpu time : 157.79790 sec .
magma_sgeev ( MagmaNoVec , MagmaNoVec ,n ,r ,n , wr1 , wi1 , VL ,n , VR ,n ,
              h_work , lwork ,& info );
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " sgeev gpu time : %7.5 f sec .\ n " , gpu_time ); // Magma
// LAPACK // time
cpu_time = magma_wtime ();
// compute the eigenvalues of a general , real nxn matrix a ,
// Lapack version
lapackf77_sgeev ( " N " , " N " , &n , a , &n ,
wr2 , wi2 , VL , &n , VR , &n , h_work , & lwork , & info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " sgeev cpu time : %7.5 f sec .\ n " , cpu_time ); // Lapack
magma_free ( wr1 ); // time
magma_free ( wr2 ); // free memory
magma_free ( wi1 ); // free memory
magma_free ( wi2 ); // free memory
magma_free ( a ); // free memory
magma_free ( r ); // free memory
magma_free ( VL ); // free memory
magma_free ( VR ); // free memory
magma_free ( h_work ); // free memory
magma_finalize ( ); // finalize Magma
return EXIT_SUCCESS ;
}
// sgeev gpu time : 40.60117 sec .
// sgeev cpu time : 108.51452 sec .
VR. The real and imaginary parts of eigenvalues are stored in arrays wr, wi
respectively. See magma-X.Y.Z/src/dgeev.cpp for more details.
magma_dgeev ( MagmaNoVec , MagmaNoVec ,n ,r ,n , wr1 , wi1 , VL ,n , VR ,n ,
              h_work , lwork ,& info );
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " dgeev gpu time : %7.5 f sec .\ n " , gpu_time ); // Magma
// LAPACK // time
cpu_time = magma_wtime ();
// compute the eigenvalues of a general , real nxn matrix a ,
// Lapack version
lapackf77_dgeev ( " N " , " N " , &n , a , &n ,
wr2 , wi2 , VL , &n , VR , &n , h_work , & lwork , & info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " dgeev cpu time : %7.5 f sec .\ n " , cpu_time ); // Lapack
magma_dgeev ( MagmaNoVec , MagmaNoVec ,n ,r ,n , wr1 , wi1 , VL ,n , VR ,n ,
              h_work , lwork ,& info );
gpu_time = magma_sync_wtime ( NULL ) - gpu_time ;
printf ( " dgeev gpu time : %7.5 f sec .\ n " , gpu_time ); // Magma
// LAPACK // time
cpu_time = magma_wtime ();
// compute the eigenvalues of a general , real nxn matrix a ,
// Lapack version
lapackf77_dgeev ( " N " , " N " , &n , a , &n ,
wr2 , wi2 , VL , &n , VR , &n , h_work , & lwork , & info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " dgeev cpu time : %7.5 f sec .\ n " , cpu_time ); // Lapack
magma_free ( wr1 ); // time
magma_free ( wr2 ); // free memory
magma_free ( wi1 ); // free memory
magma_free ( wi2 ); // free memory
magma_free ( a ); // free memory
magma_free ( r ); // free memory
magma_free ( VL ); // free memory
magma_free ( VR ); // free memory
magma_free ( h_work ); // free memory
magma_finalize ( ); // finalize Magma
return EXIT_SUCCESS ;
}
// dgeev gpu time : 62.50911 sec .
// dgeev cpu time : 185.40615 sec .
Q^T A Q = H,
where Q is an orthogonal matrix and H has zero elements below the first
subdiagonal. The orthogonal matrix Q is represented as a product of el-
ementary reflectors H(ilo) . . . H(ihi), where H(k) = I − τk vk vk^T. The real
scalars τk are stored in an array τ and the information on vectors vk is
stored on exit in the lower triangular part of A below the first subdiago-
nal: vk (1 : k) = 0, vk (k + 1) = 1 and vk (ihi + 1 : n) = 0; vk (k + 2 : ihi) is
stored in A(k + 2 : ihi, k). The function uses also an array dT defined on the
device, storing blocks of triangular matrices used in the reduction process.
See magma-X.Y.Z/src/sgehrd.cpp for more details.
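A minimal sketch of the call (variable names other than the MAGMA functions are illustrative; ilo = 1, ihi = n means the whole matrix is reduced, and the lwork value is an assumption):

// sketch only: a is an n x n host matrix with leading dimension n
magma_int_t info, ilo = 1, ihi = n;
magma_int_t nb = magma_get_sgehrd_nb(n);   // optimal block size
magma_int_t lwork = n*nb;
float *tau, *work, *dT;
magma_smalloc_cpu(&tau, n - 1);            // scalar factors tau_k
magma_smalloc_cpu(&work, lwork);
magma_smalloc(&dT, nb*n);                  // device blocks of triangular matrices
magma_sgehrd(n, ilo, ihi, a, n, tau, work, lwork, dT, &info);
// on exit: H in the upper triangle and first subdiagonal of a,
// reflector information below the first subdiagonal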
lapackf77_sgehrd (& n ,& ione ,& n , r1 ,& n , tau , h_work ,& lwork ,& info );
cpu_time = magma_wtime () - cpu_time ;
printf ( " Lapack time : %7.3 f sec .\ n " , cpu_time );
{
int i , j ;
for ( j =0; j <n -1; j ++)
for ( i = j +2; i < n ; i ++)
r1 [ i + j * n ] = MAGMA_S_ZERO ;
}
// difference
blasf77_saxpy (& n2 ,& mone ,r ,& ione , r1 ,& ione );
printf ( " max difference : % e \ n " ,
lapackf77_slange ( " M " , &n , &n , r1 , &n , work ));
free ( a ); // free host memory
free ( tau ); // free host memory
magma_free_pinned ( h_work ); // free host memory
magma_free_pinned ( r ); // free host memory
magma_free_pinned ( r1 ); // free host memory
magma_free ( dT ); // free device memory
magma_finalize ( ); // finalize Magma
return EXIT_SUCCESS ;
}
// Magma time : 0.365 sec .
// upper left corner of the Hessenberg form :
// [
// 0.1206 -19.4276 -11.6704 0.5872 -0.0777
// -26.2667 765.4211 444.0294 -6.4941 0.5035
// 0. 444.5269 258.5998 -4.0942 0.2565
// 0. 0. -15.2374 0.3507 0.0222
// 0. 0. 0. -13.0577 -0.1760
// ];
// Lapack time : 0.916 sec .
// max difference : 1.018047 e -03
Q^T A Q = H,
where Q is an orthogonal matrix and H has zero elements below the first
subdiagonal. The orthogonal matrix Q is represented as a product of el-
ementary reflectors H(ilo) . . . H(ihi), where H(k) = I − τk vk vk^T. The real
scalars τk are stored in an array τ and the information on vectors vk is
stored on exit in the lower triangular part of A below the first subdiago-
nal: vk (1 : k) = 0, vk (k + 1) = 1 and vk (ihi + 1 : n) = 0; vk (k + 2 : ihi) is
stored in A(k + 2 : ihi, k). The function uses also an array dT defined on the
device, storing blocks of triangular matrices used in the reduction process.
See magma-X.Y.Z/src/dgehrd.cpp for more details.
cudaMallocManaged (& r , n2 * sizeof ( double )); // unif . memory for r
cudaMallocManaged (& r1 , n2 * sizeof ( double )); // unif . mem . for r1
cudaMallocManaged (& h_work , lwork * sizeof ( double )); // m . f . h_work
cudaMallocManaged (& dT , nb * n * sizeof ( double )); // unif . mem . for dT
// Randomize the matrix a and copy a -> r , a -> r1
lapackf77_dlarnv ( & ione , ISEED , & n2 , a );
lapackf77_dlacpy ( MagmaFullStr ,& n ,& n ,a ,& n ,r ,& n );
lapackf77_dlacpy ( MagmaFullStr ,& n ,& n ,a ,& n , r1 ,& n );
// MAGMA
gpu_time = magma_sync_wtime ( NULL );
// reduce the matrix r to upper Hessenberg form by an
// orthogonal transformation , Magma version
// first 5 eigenvalues of a :
// 1.000000
// 2.000000
// 3.000000
// 4.000000
// 5.000000
// left upper corner of the matrix of eigenvectors :
// [
// 1.0000 0. 0. 0. 0.
// 0. 1.0000 0. 0. 0.
// 0. 0. 1.0000 0. 0.
// 0. 0. 0. 1.0000 0.
// 0. 0. 0. 0. 1.0000
// ];
// difference in eigenvalues : 0.000000 e +00
// define a , r // [1 0 0 0 0 ...]
for ( i =0; i < n ; i ++){ // [0 2 0 0 0 ...]
a [ i * n + i ]=( float )( i +1); // a = [0 0 3 0 0 ...]
r [ i * n + i ]=( float )( i +1); // [0 0 0 4 0 ...]
} // [0 0 0 0 5 ...]
printf ( " upper left corner of a :\ n " ); // .............
magma_sprint (5 ,5 , a , n ); // print part of a
// compute the eigenvalues and eigenvectors for a symmetric ,
// real nxn matrix ; Magma version
// first 5 eigenvalues of a :
// 1.000000
// 2.000000
// 3.000000
// 4.000000
// 5.000000
// [
// 1.0000 0. 0. 0. 0.
// 0. 1.0000 0. 0. 0.
// 0. 0. 1.0000 0. 0.
// 0. 0. 0. 1.0000 0.
// 0. 0. 0. 0. 1.0000
// ];
// difference in eigenvalues : 0.000000 e +00
// first 5 eigenvalues of a :
// 1.000000
// 2.000000
// 3.000000
// 4.000000
// 5.000000
magma_int_t incr = 1;
magma_int_t ISEED [4] = {0 ,0 ,0 ,1}; // seed
cudaMallocManaged (& w1 , n * sizeof ( float )); // unified memory for
cudaMallocManaged (& w2 , n * sizeof ( float )); // eigenvalues
cudaMallocManaged (& a , n2 * sizeof ( float )); // unif . memory for a
cudaMallocManaged (& r , n2 * sizeof ( float )); // unif . memory for r
// Query for workspace sizes
float aux_work [1];
magma_int_t aux_iwork [1];
magma_ssyevd ( MagmaVec , MagmaLower ,n ,r ,n , w1 , aux_work , -1 ,
aux_iwork , -1 ,& info );
lwork = ( magma_int_t ) aux_work [0];
liwork = aux_iwork [0]; // unified memory for workspace :
cudaMallocManaged (& iwork , liwork * sizeof ( magma_int_t ));
cudaMallocManaged (& h_work , lwork * sizeof ( float ));
// Randomize the matrix a and copy a -> r
lapackf77_slarnv (& ione , ISEED ,& n2 , a );
lapackf77_slacpy ( MagmaFullStr ,& n ,& n ,a ,& n ,r ,& n );
gpu_time = magma_sync_wtime ( NULL );
// compute the eigenvalues and eigenvectors for a symmetric ,
// real nxn matrix ; Magma version
magma_int_t incr = 1;
magma_int_t ISEED [4] = {0 ,0 ,0 ,1}; // seed
cudaMallocManaged (& w1 , n * sizeof ( double )); // unified memory for
cudaMallocManaged (& w2 , n * sizeof ( double )); // eigenvalues
cudaMallocManaged (& a , n2 * sizeof ( double )); // unif . memory for a
cudaMallocManaged (& r , n2 * sizeof ( double )); // unif . memory for r
// Query for workspace sizes
double aux_work [1];
magma_int_t aux_iwork [1];
magma_dsyevd ( MagmaVec , MagmaLower ,n ,r ,n , w1 , aux_work , -1 ,
aux_iwork , -1 ,& info );
lwork = ( magma_int_t ) aux_work [0];
liwork = aux_iwork [0]; // unified memory for workspace :
cudaMallocManaged (& iwork , liwork * sizeof ( magma_int_t ));
cudaMallocManaged (& h_work , lwork * sizeof ( double ));
// Randomize the matrix a and copy a -> r
lapackf77_dlarnv (& ione , ISEED ,& n2 , a );
lapackf77_dlacpy ( MagmaFullStr ,& n ,& n ,a ,& n ,r ,& n );
gpu_time = magma_sync_wtime ( NULL );
// compute the eigenvalues and eigenvectors for a symmetric ,
// real nxn matrix ; Magma version
// define a , r // [1 0 0 0 0 ...]
for ( i =0; i < n ; i ++){ // [0 2 0 0 0 ...]
a [ i * n + i ]=( float )( i +1); // a = [0 0 3 0 0 ...]
r [ i * n + i ]=( float )( i +1); // [0 0 0 4 0 ...]
} // [0 0 0 0 5 ...]
printf ( " upper left corner of a :\ n " ); // .............
magma_sprint (5 ,5 , a , n ); // print part of a
magma_ssetmatrix ( n , n , a , n , d_r , n , queue ); // copy a -> d_r
// compute the eigenvalues and eigenvectors for a symmetric ,
// real nxn matrix ; Magma version
// [
// 1.0000 0. 0. 0. 0.
// 0. 1.0000 0. 0. 0.
// 0. 0. 1.0000 0. 0.
// 0. 0. 0. 1.0000 0.
// 0. 0. 0. 0. 1.0000
// ];
// difference in eigenvalues : 0.000000 e +00
} // [0 0 0 0 5 ...]
printf ( " upper left corner of a :\ n " ); // .............
magma_sprint (5 ,5 , a , n ); // print part of a
magma_ssetmatrix ( n , n , a , n , a1 , n , queue ); // copy a -> a1
// compute the eigenvalues and eigenvectors for a symmetric ,
// real nxn matrix ; Magma version
// first 5 eigenvalues of a :
// 1.000000
// 2.000000
// 3.000000
// 4.000000
// 5.000000
// [
// 1.0000 0. 0. 0. 0.
// 0. 1.0000 0. 0. 0.
// 0. 0. 1.0000 0. 0.
// 0. 0. 0. 1.0000 0.
// 0. 0. 0. 0. 1.0000
// ];
// difference in eigenvalues : 0.000000 e +00
4.7.11 magma_dsyevd_gpu - compute the eigenvalues and optionally
eigenvectors of a symmetric real matrix in double precision,
GPU interface, small matrix
This function computes in double precision all eigenvalues and, optionally,
eigenvectors of a real symmetric matrix A defined on the device. The first
parameter can take the values MagmaVec or MagmaNoVec and answers the
question whether the eigenvectors are desired. If the eigenvectors are de-
sired, it uses a divide and conquer algorithm. The symmetric matrix A
can be stored in lower (MagmaLower) or upper (MagmaUpper) mode. If the
eigenvectors are desired, then on exit A contains orthonormal eigenvectors.
The eigenvalues are stored in an array w. See magma-X.Y.Z/src/dsyevd_
gpu.cpp for more details.
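A minimal sketch of the GPU-interface call with the usual two-step workspace query; the host workspace wA of size n×n required by this interface is the main difference from the CPU version, and all names other than the MAGMA calls are illustrative.

// sketch only: d_r is an n x n symmetric matrix on the device, ldda >= n
double *w, *wA, *h_work, aux_work[1];
magma_int_t *iwork, aux_iwork[1], lwork, liwork, info;
magma_dmalloc_cpu(&w, n);                  // eigenvalues
magma_dmalloc_cpu(&wA, n*n);               // host workspace of this interface
magma_dsyevd_gpu(MagmaVec, MagmaLower, n, d_r, ldda, w, wA, n,
                 aux_work, -1, aux_iwork, -1, &info);   // workspace query
lwork  = (magma_int_t) aux_work[0];
liwork = aux_iwork[0];
magma_dmalloc_cpu(&h_work, lwork);
magma_imalloc_cpu(&iwork, liwork);
magma_dsyevd_gpu(MagmaVec, MagmaLower, n, d_r, ldda, w, wA, n,
                 h_work, lwork, iwork, liwork, &info);
// eigenvalues in w; with MagmaVec the eigenvectors overwrite d_r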
// first 5 eigenvalues of a :
// 1.000000
// 2.000000
// 3.000000
// 4.000000
// 5.000000
// 1.000000
// 2.000000
// 3.000000
// 4.000000
// 5.000000
A = u σ v^T,
magma_int_t info ;
magma_int_t ione = 1;
double work [1] , error = 1.; // used in difference computations
double mone = -1.0 , * h_work ; // h_work - workspace
magma_int_t lwork ; // workspace size
magma_int_t ISEED [4] = {0 ,0 ,0 ,1}; // seed
cudaMallocManaged (& a , n2 * sizeof ( double )); // unif . memory for a
cudaMallocManaged (& vt , n * n * sizeof ( double )); // unif . mem for vt
cudaMallocManaged (& u , m * m * sizeof ( double )); // unif . memory for u
cudaMallocManaged (& s1 , min_mn * sizeof ( double )); // un . mem . for s1
cudaMallocManaged (& s2 , min_mn * sizeof ( double )); // un . mem . for s2
cudaMallocManaged (& r , n2 * sizeof ( double )); // unif . memory for r
magma_int_t nb = magma_get_dgesvd_nb ( m , n ); // optim . block size
lwork = min_mn * min_mn +2* min_mn +2* min_mn * nb ;
cudaMallocManaged (& h_work , lwork * sizeof ( double )); // m . f . h_work
// Randomize the matrix a
lapackf77_dlarnv (& ione , ISEED ,& n2 , a );
lapackf77_dlacpy ( MagmaFullStr ,& m ,& n ,a ,& m ,r ,& m ); // a - > r
// MAGMA
gpu_time = magma_wtime ();
// compute the singular value decomposition of r ( copy of a )
// and optionally the left and right singular vectors :
// r = u * sigma * vt ; the diagonal elements of sigma ( s1 array )
// are the singular values of a in descending order
// the first min (m , n ) columns of u contain the left sing . vec .
// the first min (m , n ) columns of vt contain the right sing . vec .
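The call that these comments describe has the following general form, written here as a sketch with the variable names of this listing; the MagmaSomeVec options (return only the first min(m,n) singular vectors) are an assumption.

// hedged sketch of the gesvd call; s1 receives the singular values,
// u the left and vt the right singular vectors
magma_dgesvd( MagmaSomeVec, MagmaSomeVec, m, n, r, m, s1,
              u, m, vt, n, h_work, lwork, &info);
gpu_time = magma_wtime() - gpu_time;     // stop the timer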
Q^T A P = B,
magma_smalloc_cpu (& offdiag , minmn -1); // host mem . for offdiag
magma_smalloc_pinned (& r , m * n ); // host memory for r
lhwork = ( m + n )* nb ;
magma_smalloc_pinned (& h_work , lhwork ); // host mem . for h_work
// Randomize the matrix a
lapackf77_slarnv ( & ione , ISEED , & n2 , a );
lapackf77_slacpy ( MagmaFullStr ,& m ,& n ,a ,& m ,r ,& m ); // a - > r
// MAGMA
gpu_time = magma_wtime ();
// reduce the matrix r to upper bidiagonal form by orthogonal
// transformations : q ^ T * r *p , the obtained diagonal and the
// superdiagonal are written to diag and offdiag arrays resp .;
// the elements below the diagonal , represent the orthogonal
// matrix q as a product of elementary reflectors described
// by tauq ; elements above the first superdiagonal represent
// the orthogonal matrix p as a product of elementary reflect -
// ors described by taup ;
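A sketch of the corresponding call, using the variable names of this listing (the argument order is assumed to follow the standard gebrd interface):

// hedged sketch of the gebrd call; diag/offdiag receive the bidiagonal,
// tauq/taup the scalar factors of the reflectors for q and p
magma_sgebrd( m, n, r, m, diag, offdiag, tauq, taup,
              h_work, lhwork, &info);
gpu_time = magma_wtime() - gpu_time;     // stop the timer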
Q^T A P = B,