LAPACK and ScaLAPACK Examples
We start by defining a special matrix called STRIDWAD (Sparse TRIDiagonal With Anti-Diagonal). This matrix will be
stored using the dense format (i.e. all elements are defined and are stored in memory). The matrix has the following properties:
It can be of any order N (where N is an even integer)
When STRIDWAD is multiplied by the vector X whose components are X(i)=i it produces a vector C
whose elements are C(i)=N+1 except for the last element which is C(N)=2N+2.
These characteristics make it very easy to work with STRIDWAD while developing computer programs for testing purposes.
In practice we will use large values of N but for illustration purposes the STRIDWAD system for N=6 would be:
[ 3. -1. 0. 0. 0. 1. ] [ 1. ] [ 7. ]
[ -1. 3. -1. 0. 1. 0. ] [ 2. ] [ 7. ]
[ 0. -1. 3. 0. 0. 0. ] [ 3. ] [ 7. ]
[ 0. 0. 0. 3. -1. 0. ] [ 4. ] = [ 7. ]
[ 0. 1. 0. -1. 3. -1. ] [ 5. ] [ 7. ]
[ 1. 0. 0. 0. -1. 3. ] [ 6. ] [ 14. ]
For N=18 it takes this form:
[ 3. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. ] [ 1. ] [ 19. ]
[ -1. 3. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. ] [ 2. ] [ 19. ]
[ 0. -1. 3. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. ] [ 3. ] [ 19. ]
[ 0. 0. -1. 3. -1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. ] [ 4. ] [ 19. ]
[ 0. 0. 0. -1. 3. -1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. ] [ 5. ] [ 19. ]
[ 0. 0. 0. 0. -1. 3. -1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. ] [ 6. ] [ 19. ]
[ 0. 0. 0. 0. 0. -1. 3. -1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. ] [ 7. ] [ 19. ]
[ 0. 0. 0. 0. 0. 0. -1. 3. -1. 0. 1. 0. 0. 0. 0. 0. 0. 0. ] [ 8. ] [ 19. ]
[ 0. 0. 0. 0. 0. 0. 0. -1. 3. 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 9. ] = [ 19. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 3. -1. 0. 0. 0. 0. 0. 0. 0. ] [ 10. ] [ 19. ]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. -1. 3. -1. 0. 0. 0. 0. 0. 0. ] [ 11. ] [ 19. ]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. -1. 3. -1. 0. 0. 0. 0. 0. ] [ 12. ] [ 19. ]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. -1. 3. -1. 0. 0. 0. 0. ] [ 13. ] [ 19. ]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. -1. 3. -1. 0. 0. 0. ] [ 14. ] [ 19. ]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1. 3. -1. 0. 0. ] [ 15. ] [ 19. ]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1. 3. -1. 0. ] [ 16. ] [ 19. ]
[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1. 3. -1. ] [ 17. ] [ 19. ]
[ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1. 3. ] [ 18. ] [ 38. ]
The elements on the two diagonals adjacent to the main diagonal (the subdiagonal and the superdiagonal) are -1.
The diagonal going from the lower left corner to the upper right corner is called the anti-diagonal; its elements are 1.
Where the anti-diagonal crosses the superdiagonal and the subdiagonal (in rows N/2 and N/2+1) the 1's cancel out with the -1's, leaving 0.
The components of the X vector are x(i)=i and all the components of C are N+1 except for the last one.
The last component of C is 2*(N+1), where N is the order of the STRIDWAD system.
So in fact what we have established here is an IDENTITY, which for computational purposes can be used in the following two
ways:
If we assume X is known and defined by x(i)=i, we can use a matrix multiply program
to calculate the product of the STRIDWAD matrix and X and compare the results to C.
If we assume C is known, we can use a linear solver to solve the STRIDWAD system for X and compare the solution to x(i)=i.
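The identity can also be checked by hand for small N, because each row of STRIDWAD has at most four non-zero entries. The following is a minimal bash sketch of that row-by-row check, written in the language of the helper scripts used later in this tutorial. The function names are hypothetical and are not part of the tutorial's sources.

```bash
#!/bin/bash
# Row i of STRIDWAD times x, with x(j)=j, computed without storing the matrix.
stridwad_row_times_x() {
    local n=$1 i=$2
    local nh=$(( n / 2 ))
    local sum=$(( 3 * i ))                 # main diagonal: A(i,i) = 3
    # superdiagonal A(i,i+1) = -1: absent in row N, cancelled in row N/2
    if (( i < n && i != nh )); then sum=$(( sum - (i + 1) )); fi
    # subdiagonal A(i,i-1) = -1: absent in row 1, cancelled in row N/2+1
    if (( i > 1 && i != nh + 1 )); then sum=$(( sum - (i - 1) )); fi
    # anti-diagonal A(i,N-i+1) = 1: cancelled in rows N/2 and N/2+1
    if (( i != nh && i != nh + 1 )); then sum=$(( sum + (n - i + 1) )); fi
    echo "$sum"
}

# All N components of the product, separated by blanks.
stridwad_rhs() {
    local n=$1 i out=""
    for (( i = 1; i <= n; i++ )); do
        out+="$(stridwad_row_times_x "$n" "$i") "
    done
    echo "${out% }"
}

stridwad_rhs 6    # prints: 7 7 7 7 7 14
```

For N=6 every component is N+1=7 except the last, which is 2N+2=14, in agreement with the N=6 system shown earlier.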
Here is the listing of the program used to calculate the Right Hand Side (RHS) vector assuming that X is known and defined
by x(i)=i:
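program TMA
implicit none
integer :: istat,i,N_PRT
real(kind=8),dimension(:,:),allocatable :: A
real(kind=8),dimension(:,:),allocatable :: C
real(kind=8),dimension(:,:),allocatable :: D
real(kind=8),dimension(:,:),allocatable :: X
integer :: NH, NH1, nprocs, KOUNT
integer*8 :: N, mem_allocated, mem_total
real(kind=8) :: mem_gb, EPS
parameter (N = 55388)
! parameter (N = 18)
! --------------------------------------------------------------------------
! Explicit interfaces are required because the subroutines below use
! assumed-shape array arguments
interface
subroutine init_my_matrix (N,A)
implicit none
integer*8 :: N
real(kind=8) :: A(:,:)
end subroutine init_my_matrix
subroutine init_my_rhs (N,C)
implicit none
integer*8 :: N
real(kind=8) :: C(:,:)
end subroutine init_my_rhs
subroutine init_my_vector (N,X)
implicit none
integer*8 :: N
real(kind=8) :: X(:,:)
end subroutine init_my_vector
end interface
! ---------------------------------------------------------------------------
!
! Trace intermediate steps only for small values of N
! (N_PRT = 40 is inferred from the sample output for N=55388)
N_PRT = 40
EPS = 0.00000001
NPROCS = 1
! N must be EVEN
NH = N/2
NH1 = NH + 1
!
! ----- Print out nprocs -----
WRITE(6,101) nprocs,N
101 FORMAT(" nprocs = ",i3," N = ",i12)
call flush(6_4)
!
!
! ----- Allocate LHS (A), RHS (C), product (D) and vector X -----
!
allocate (a(N, N ), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for A"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for A"
else
mem_allocated = 8*N*N
mem_total = mem_allocated
mem_gb = mem_allocated / 1000000000.0
WRITE(6,103) mem_allocated,mem_gb
103 FORMAT(" memory alloc (A) = ",i12," bytes i.e. ", f8.5," Gb")
endif
allocate (c(N, 1 ), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for C"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for C"
else
mem_allocated = 8*N
mem_total = mem_total + mem_allocated
mem_gb = mem_allocated / 1000000000.0
WRITE(6,104) mem_allocated,mem_gb
104 FORMAT(" memory alloc (C) = ",i12," bytes i.e. ", f8.5," Gb")
endif
allocate (d(N, 1 ), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for D"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for D"
else
mem_allocated = 8*N
mem_total = mem_total + mem_allocated
mem_gb = mem_allocated / 1000000000.0
WRITE(6,105) mem_allocated,mem_gb
105 FORMAT(" memory alloc (D) = ",i12," bytes i.e. ", f8.5," Gb")
endif
allocate (x(N, 1 ), stat=istat)
if (istat/=0) then
print *,"ERROR: ALLOCATE FAILS for X"
call flush(6_4)
stop "ERR:ALLOCATE FAILS for X"
else
mem_allocated = 8*N
mem_total = mem_total + mem_allocated
mem_gb = mem_allocated / 1000000000.0
WRITE(6,106) mem_allocated,mem_gb
106 FORMAT(" memory alloc (X) = ",i12," bytes i.e. ", f8.5," Gb")
endif
! Report the running total rather than the size of the last array
mem_gb = mem_total / 1000000000.0
WRITE(6,102) mem_total,mem_gb
102 FORMAT(" memory allocated = ",i12," bytes i.e. ", f8.5," Gb")
call flush(6_4)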
!
! ----- Initialize: A X = C
!
call init_my_matrix (N,A)
call init_my_vector (N,X)
call init_my_rhs (N,C)
D = MATMUL(A,X)
!
write (6,500)
500 FORMAT(/" C(I): ")
IF (N .le. N_PRT) THEN
DO I=1,N
WRITE (6,501) I,C(I,1)
501 FORMAT(" I = ",I3," ",f6.2)
ENDDO
ELSE
DO I=N-N_PRT,N
WRITE (6,502) I,A(I,I),D(I,1)
502 FORMAT(" I=",I10," A(I,I) = ",f13.2," D(I,1) = ",f13.2)
ENDDO
ENDIF
KOUNT = 0
DO I=1,N
if (abs(C(I,1)-D(I,1)) .gt. EPS) then
KOUNT = KOUNT + 1
write(6,107) I
107 format("For I = ",i12," C .neq. D")
endif
ENDDO
write(6,108) KOUNT
108 format(/"KOUNT = ",i10)
!
! ----- Cleanup arrays -----
!
deallocate (A,C,D,X)
!
end program TMA
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_MATRIX (N,A)
IMPLICIT NONE
INTEGER :: NH,NH1
INTEGER*8 :: N
REAL(kind=8) :: A(:,:)
INTEGER :: I
NH = N/2
NH1 = NH + 1
! Zero the whole matrix first: ALLOCATE does not initialize A
A = 0.0d0
! Diagonal elements
DO I = 1,N
A(I,I) = 3.0
ENDDO
! Upper diagonal
DO 180 I = 1,N-1
if (I .EQ. NH) go to 180
A(I,I+1) = -1.0d0
180 CONTINUE
! Lower diagonal
DO 160 I = 1,N-1
if (I .EQ. NH) go to 160
A(I+1,I) = -1.0d0
160 CONTINUE
!
! ANTI-DIAGONAL
!
DO 190 I = 1,N
if (I .EQ. NH .OR. I .EQ. NH1) go to 190
A(I,N-I+1) = 1.0d0
190 CONTINUE
return
end
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_RHS (N,C)
IMPLICIT NONE
INTEGER*8 :: N
REAL(kind=8) :: C(:,:)
INTEGER :: I
DO I= 1, N-1
C(I,1) = N + 1.0
ENDDO
C(N,1) = 2*N + 2
return
end
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_VECTOR (N,X)
IMPLICIT NONE
INTEGER*8 :: N
REAL(kind=8) :: X(:,:)
INTEGER :: I
DO I= 1, N
X(I,1) = I
ENDDO
return
end
Output for N=18 was:
C(I):
I = 1 19.00
I = 2 19.00
I = 3 19.00
I = 4 19.00
I = 5 19.00
I = 6 19.00
I = 7 19.00
I = 8 19.00
I = 9 19.00
I = 10 19.00
I = 11 19.00
I = 12 19.00
I = 13 19.00
I = 14 19.00
I = 15 19.00
I = 16 19.00
I = 17 19.00
I = 18 38.00
KOUNT = 0
Output for N=55388 (the highest possible value that fits in memory) was:
nprocs = 1 N = 55388
memory alloc (A) = 24542644352 bytes i.e. 24.54264 Gb
memory alloc (C) = 443104 bytes i.e. 0.00044 Gb
memory alloc (D) = 443104 bytes i.e. 0.00044 Gb
memory alloc (X) = 443104 bytes i.e. 0.00044 Gb
memory allocated = 24543530560 bytes i.e. 24.54353 Gb
C(I):
I= 55348 A(I,I) = 3.00 D(I,1) = 55389.00
I= 55349 A(I,I) = 3.00 D(I,1) = 55389.00
I= 55350 A(I,I) = 3.00 D(I,1) = 55389.00
...
I= 55386 A(I,I) = 3.00 D(I,1) = 55389.00
I= 55387 A(I,I) = 3.00 D(I,1) = 55389.00
I= 55388 A(I,I) = 3.00 D(I,1) = 110778.00
KOUNT = 0
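The memory figures in the output above follow directly from the dense storage format: every element is an 8-byte real, so A needs 8*N*N bytes and each N x 1 array needs 8*N bytes. A minimal bash sketch of this arithmetic follows. The function names are hypothetical, and it assumes bash integers are 64 bits wide, as they are on typical 64-bit systems.

```bash
#!/bin/bash
# Bytes needed for a dense N x N matrix of 8-byte reals.
dense_matrix_bytes() {
    local n=$1
    echo $(( 8 * n * n ))
}

# Bytes needed for one N x 1 array of 8-byte reals.
vector_bytes() {
    local n=$1
    echo $(( 8 * n ))
}

n=55388
echo "A: $(dense_matrix_bytes "$n") bytes"    # prints: A: 24542644352 bytes
echo "C: $(vector_bytes "$n") bytes"          # prints: C: 443104 bytes
```

Because the requirement grows with the square of N, the matrix A dominates the total. The three vectors together add well under 2 MB at this order.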
LAPACK EXAMPLE
In the previous section the STRIDWAD matrices of order N were defined. We now want to use those matrices as input to the
LAPACK routines dgetrf/dgetrs to solve STRIDWAD systems of different orders N. The programs should also report the memory used,
cpu time and elapsed time.
The following makefile is used to compile the program and submit the job to a compute node:
# Determine the platform that is being used for this compilation
N:= 1
P:= 4
EXE = DP_LAPACK_SOLVE
EXT:= f90
SRC:= ${EXE}.$(EXT)
LIB:= -L/opt/sharcnet/acml/current/pathscale64/lib -lacml
$(EXE): $(SRC)
@echo Check if N is in the range [1,40]
./check_N $N
# @echo COMPILING LAPACK DP program with N = $(N),000
# @echo PLATFORM used = $(PLATFORM)
# @echo Edit the parameter_N.inc file
cp parameter_N.inc_basic parameter_N.inc
./ed_parm_N parameter_N.inc $N
# @echo "Using following parameter_N.inc file:"
# cat parameter_N.inc
# @echo "LIB = " $(LIB)
pathf90 -o $(EXE) $(SRC) $(LIB)
cp sub_batch_DP_LAPACK_SOLVE_basic sub_batch_DP_LAPACK_SOLVE
./ed_procs_P sub_batch_DP_LAPACK_SOLVE $P
./sub_batch_DP_LAPACK_SOLVE
superclean: clean
@echo ' '
@echo Removing Output files generated by $(EXE)
@echo '--------------------------------------------------'
@echo ' '
rm -rf ${CLU}_SERIAL_DP_SOLVE_*
@echo ' '
@echo Output files have been removed.
@echo '-------------------------------'
@echo ' '
clean:
@echo ' '
@echo Cleaning executable $(EXE)
@echo '-----------------------------------'
@echo ' '
rm -rf $(EXE)
rm -rf ED_ERROR
@echo ' '
@echo Executable $(EXE) has been removed.
@echo '--------------------------------------------'
@echo ' '
help:
@echo "+-----------------------------------------------------------------------------------+"
@echo "| |"
@echo "| Makefile for running LAPACK example using dgetrf/dgetrs |"
@echo "| |"
@echo "| Usage: make compile program for N=1 (i.e. order=1,000) |"
@echo "| |"
@echo "| make N=n compile program for N=n (i.e. order=n*1000) |"
@echo "| where 1 <= n <= 40 |"
@echo "| |"
@echo "| make P=p submit job with -q threaded -n p |"
@echo "| where 1 <= p <= 4 or 6 (depending on cluster) |"
@echo "| |"
@echo "| make superclean remove output files and executable |"
@echo "| |"
@echo "| make clean remove executable |"
@echo "| |"
@echo "| make help display makefile usage information |"
@echo "| |"
@echo "+-----------------------------------------------------------------------------------+"
@echo " "
@echo PLATFORM $(PLATFORM)
The make utility will use the above makefile, and we can override any parameters appearing in the makefile. Thus, if we type
the command:
make N=2 P=6
make will use the parameters N=2 and P=6 instead of those appearing in the makefile.
Then make proceeds to specify the executable, source and required library, and executes the commands required to generate
the target $(EXE). The first command for generating this target invokes the script check_N, which takes as argument the
parameter $N.
#!/bin/bash
if [ $# -ne 1 ]; then
echo "Must have 1 argument: value of N"
echo "Usage: $0 <value_of_N> "
exit -1
fi
N=$1
exit
If the value of $N is in the specified range the check_N script returns 0 and make proceeds to the next command, but if $N is
out of the specified range, then the check_N script returns -1 and the make command terminates with an error message.
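The range test itself does not appear in the listing above. The following bash function is a minimal sketch of one way to write it, assuming the range [1,40] quoted in the makefile. The name check_n_range is hypothetical and this is not the original script.

```bash
#!/bin/bash
# Succeeds (status 0) only if the argument is an integer in [1,40].
check_n_range() {
    local n=$1
    [[ $n =~ ^[0-9]+$ ]] || return 1        # reject empty or non-numeric input
    (( 10#$n >= 1 && 10#$n <= 40 ))         # force base 10 so "08" is not read as octal
}

for n in 0 1 40 41 abc; do
    if check_n_range "$n"; then
        echo "N=$n accepted"
    else
        echo "N=$n rejected"
    fi
done
```

In a script, bash reports `exit -1` to the caller as status 255. make stops on any non-zero status, so the exact value does not matter to it.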
If the check_N script returned 0 then make proceeds to the next command, which copies the parameter_N.inc_basic file into
the parameter_N.inc file. The next script, ed_parm_N, takes two arguments, parameter_N.inc and $N, and modifies the file
parameter_N.inc, which is used by the Fortran program that we will compile in the next command. Here is a listing of the
second script:
#!/bin/bash
if [ $# -ne 2 ]; then
echo "Must have 2 arguments: Filename to change and value to substitute x for"
echo "Usage: $0 <filename> <value_for_x> "
exit 1
fi
Filename=$1
ed $Filename <<EOF 2> ED_ERROR
1,2s/x/$2/
w
q
EOF
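The ed command `1,2s/x/$2/` replaces the first x on each of the first two lines of the file with the value given. The sketch below reproduces that substitution with sed on a temporary file so the effect can be seen without touching parameter_N.inc. The template contents and the function name are hypothetical, since the real parameter_N.inc_basic is not shown here.

```bash
#!/bin/bash
# Print FILE with the first "x" on lines 1 and 2 replaced by VALUE,
# the same substitution the ed command 1,2s/x/VALUE/ performs in place.
substitute_x() {
    local file=$1 value=$2
    sed "1,2s/x/$value/" "$file"
}

template=$(mktemp)
# Hypothetical template; the real parameter_N.inc_basic may look different.
printf '%s\n' '      integer*8 :: N' '      parameter (N = x000)' > "$template"
substitute_x "$template" 2
rm -f "$template"
```

With this assumed template, the second line becomes `parameter (N = 2000)`, which matches the makefile convention that N=n stands for order n*1000.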
Now, the make command is ready to compile the source program using the updated version of the include file
parameter_N.inc.
If errors are detected in the compilation then make will terminate, otherwise the next command is executed in the makefile
script, which is the script:
#!/bin/bash
if [ $# -ne 2 ]; then
echo "Must have 2 arguments: Filename to change and number of procs to use"
echo "Usage: $0 <filename> <num_of_processors> "
exit -1
fi
Filename=$1
If p is out of range the make command exits, otherwise the next command is executed which is a script submitting the job to a
queue (in which the parameter nprocs will be replaced by what is specified in the makefile):
#!/bin/bash
For completeness we list next the whole Fortran source file and the working files parameter_N.inc_basic and
sub_batch_DP_LAPACK_SOLVE_basic:
program DP_LAPACK_SOLVE
!
! file name = DP_LAPACK_SOLVE.f90
!
implicit none
integer :: istat,info,i,j,N_PRT,milestone
real(kind=8),dimension(:,:),allocatable :: a
real(kind=8),dimension(:,:),allocatable :: c
integer,dimension(:),allocatable :: ipiv
! parameter (N = 64000)
! parameter (N = 100000)
! parameter (N = 35000)
include "parameter_N.inc"
integer :: day(10),hour(10),minute(10),second(10),millisec(10)
character*30 :: descr_milestone(10)
integer :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
integer :: day_end,hour_end,minute_end,second_end,millisec_end
character*23 :: date_time
real(kind=8) :: T1, T2
real(kind=8) :: TOT_TIME
real(kind=8) :: tm
! --------------------------------------------------------------------------
interface
subroutine timestamp(date_time,day,hour,minute,second,millisec)
implicit none
character*23 :: date_time
integer :: day, hour, minute, second, millisec
integer :: elements(8)
character*3 :: months(12)
end subroutine timestamp
end interface
! ---------------------------------------------------------------------------
!
! Trace intermediate steps only for small values of N
DEBUG = 1
root = 0
IAM = 0
NPROCS = 1
! N must be EVEN
NH = N/2
NH1 = NH + 1
milestone = 1
IF (IAM .eq. 0) write(6,1006) ITEM,iam,date_time,milestone
1006 format(I4," iam ",i3," BEGIN ",a23," milestone = ",i2)
1008 format(I4," iam ",i3," ",a23," milestone = ",i2)
call flush(6_4)
call cpu_time(T1)
!
! ----- Initialize LHS and RHS
!
call init_my_matrix (n,a,nprow,npcol,myrow,mycol)
call init_my_rhs (n,c,nprow,npcol,myrow,mycol)
!
milestone = milestone + 1
IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
call flush(6_4)
call timestamp( date_time,day(milestone), hour(milestone), &
& minute(milestone),second(milestone),millisec(milestone) )
milestone = milestone + 1
IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
call flush(6_4)
call timestamp( date_time,day(milestone), hour(milestone), &
& minute(milestone),second(milestone),millisec(milestone) )
milestone = milestone + 1
IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
call flush(6_4)
call timestamp( date_time,day(milestone), hour(milestone), &
& minute(milestone),second(milestone),millisec(milestone) )
IF (IAM.EQ.0) THEN
IF (N .le. N_PRT) THEN
ITEM = 6000
write (6,500) ITEM
500 FORMAT(I4," SOLUTION: ")
DO I=1,N
ITEM = 6000 + I
WRITE (6,501) ITEM,I,C(I,1)
501 FORMAT(I4," I = ",I3," ",f6.2)
ENDDO
ELSE
DO I=N-N_PRT,N
WRITE (6,502) I,C(I,1)
502 FORMAT(" I=",I10," ",f13.2)
ENDDO
ENDIF
ENDIF
milestone = milestone + 1
IF (IAM .eq. 0) write(6,1008) ITEM,iam,date_time,milestone
call flush(6_4)
call timestamp( date_time,day(milestone), hour(milestone), &
& minute(milestone),second(milestone), millisec(milestone) )
descr_milestone(milestone) = "print and clean up"
ITEM = 7000
do i=1,milestone-1
ITEM = 7000 + I
tm = 86400.0*(day(i+1)-day(i)) + 3600.0*(hour(i+1)-hour(i)) + &
& 60.0*(minute(i+1)-minute(i)) + (second(i+1)-second(i)) + &
& 0.001 * (millisec(i+1)-millisec(i))
IF (IAM .EQ. 0) write(6,1005) ITEM,iam, i, i+1, tm, &
& descr_milestone(i)
1005 format(I4," iam ",i3," Time from milestone ",i2," to milestone "&
& ,I2," equals ",f10.2,3x,a30)
enddo
ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1011) ITEM,iam,date_time
1011 format(I4," iam ",i3," END ",a23)
call cpu_time(T2)
TOT_TIME = T2-T1
ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1012) ITEM,iam,TOT_TIME
1012 format(I4," IAM ",i3," Total CPU Time/processor = ",f11.4, &
& " seconds")
! ---------------------------------------------------------------------------
!
! ----- Cleanup arrays -----
!
deallocate (a,c,ipiv)
!
end program DP_LAPACK_SOLVE
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_MATRIX (N,A,NPROW,NPCOL,MYROW,MYCOL)
IMPLICIT NONE
integer*8 :: N
INTEGER :: NH,NH1,NPROW,NPCOL,MYROW,MYCOL
REAL(kind=8) :: A(:,:)
INTEGER :: I, J
NH = N/2
NH1 = NH + 1
! Zero the whole matrix first: ALLOCATE does not initialize A
A = 0.0d0
! Diagonal elements
DO I = 1,N
A(I,I) = 3.0
ENDDO
! Upper diagonal
DO 180 I = 1,N-1
if (I .EQ. NH) go to 180
A(I,I+1) = -1.0d0
180 CONTINUE
! Lower diagonal
DO 160 I = 1,N-1
if (I .EQ. NH) go to 160
A(I+1,I) = -1.0d0
160 CONTINUE
!
! ANTI-DIAGONAL
!
DO 190 I = 1,N
if (I .EQ. NH .OR. I .EQ. NH1) go to 190
A(I,N-I+1) = 1.0d0
190 CONTINUE
return
end
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_RHS (N,C,NPROW,NPCOL,MYROW,MYCOL)
IMPLICIT NONE
integer*8 :: N
INTEGER :: NPROW,NPCOL,MYROW,MYCOL
REAL(kind=8) :: C(:,:)
INTEGER :: I
DO I= 1, N-1
C(I,1) = N + 1.0
ENDDO
C(N,1) = 2*N + 2
return
end
!
!-----------------------------------------------------------------------
!
subroutine elapsed_time(day_beg, hour_beg, minute_beg, second_beg,&
& millisec_beg, day_end, hour_end, minute_end, second_end,&
& millisec_end,tid, ITEM )
!date_and_time
! VALUES
! must be of type default integer and of rank one. It is an INTENT(OUT)
! argument. Its size must be at least eight. The values returned in VALUES
! are as follows:
! VALUES(1)
! is the year (for example, 1998), or -HUGE (0) if no date is available.
! VALUES(2)
! is the month of the year, or -HUGE (0) if no date is available.
! VALUES(3)
! is the day of the month, or -HUGE (0) if no date is available.
! VALUES(4)
! is the time difference with respect to Coordinated Universal Time (UTC)
! in minutes, or -HUGE (0) if this information is not available.
!
! VALUES(5)
! is the hour of the day, in the range 0 to 23, or -HUGE (0) if there is
! no clock.
!
! VALUES(6)
! is the minutes of the hour, in the range 0 to 59, or -HUGE (0) if there
! is no clock.
! !
! VALUES(7)
! is the seconds of the minute, in the range 0 to 60, or -HUGE (0) if
! there is no clock.
!
! VALUES (8)
! is the milliseconds of the second, in the range 0 to 999, or -HUGE (0)
! if there is no clock.
implicit none
integer :: day_beg, hour_beg, minute_beg, second_beg, millisec_beg
integer :: day_end, hour_end, minute_end, second_end, millisec_end
integer :: day, hour, minute, second, millisec, tid, ITEM
if ( .not. ( day .eq. 0 .and. hour .eq. 0 .and. minute .eq. 0 &
& .and. second .eq. 0 .and. millisec .eq. 0 ) ) then
ITEM = ITEM + 1
IF (tid .eq. 0 ) THEN
write(6,8000) ITEM,tid,day
8000 format(I4," IAM = ",I3," ELAPSED0 days = ",i3)
write(6,8001) ITEM,tid,hour
8001 format(I4," IAM = ",I3," ELAPSED1 hours = ",i3)
write(6,8002) ITEM,tid,minute
8002 format(I4," IAM = ",I3," ELAPSED2 minute = ",i3)
write(6,8003) ITEM,tid,second
8003 format(I4," IAM = ",I3," ELAPSED3 second = ",i3)
write(6,8004) ITEM,tid,millisec
8004 format(I4," IAM = ",I3," ELAPSED4 millisec = ",i3)
ENDIF
endif
return
!
!-----------------------------------------------------------------------
!
day = elements(3)
hour = elements(5)
minute = elements(6)
second = elements(7)
millisec = elements(8)
else
date_time=' '
day = 0
hour = 0
minute = 0
second = 0
millisec = 0
endif
!
!-----------------------------------------------------------------------
!
ScaLAPACK EXAMPLE
ScaLAPACK is a distributed memory version of LAPACK. It uses the Parallel Basic Linear Algebra Subprograms (PBLAS)
as computational building blocks. ScaLAPACK uses the block-cyclic decomposition scheme to distribute block-partitioned
matrices, which should make the computation balanced and scalable.
According to the two-dimensional block cyclic data distribution scheme, an M_ by N_ dense matrix is first decomposed into
MB_ by NB_ blocks starting at its upper left corner.
These blocks are then uniformly distributed in each dimension of the Process Grid.
Thus, every process owns a collection of blocks, which are locally and contiguously stored in a two-dimensional "column
major" array. The partitioning of a matrix into blocks and the mapping of these blocks onto a Process Grid is illustrated with a
global 9x9 matrix A. The first step in this process is to partition the matrix A into blocks. Let us use 2x2 blocks and assume that
the 2-D Process Grid is 2x3.
The Bij denote the blocks of A. With 2x2 blocks the 9x9 matrix has 5 block rows and 5 block columns, and the blocks in the
last block row and last block column are only partially filled.
Initially, the 2x3 Process Grid is empty and looks like this:
        0       1       2
   0    .       .       .
   1    .       .       .
We identify each process in the Process Grid by two coordinates (row,col). Thus, for the 2x3 Process Grid the processes would
be:
        0       1       2
   0  (0,0)   (0,1)   (0,2)
   1  (1,0)   (1,1)   (1,2)
The distribution process starts by taking the global Bij in the first block row and distributing them to the first row (row 0) of the Process Grid.
Take the global Bij in the next block row and distribute them to the next row of the Process Grid (if the previous distribution was on
the last row of the Process Grid, restart with row 0).
Take the global Bij in the third block row and distribute them to the first row of the Process Grid (restart with row 0).
Take the global Bij in the fourth block row and distribute them to the next row of the Process Grid (row 1).
Take the global Bij in the fifth block row and distribute them to the first row of the Process Grid (restart with row 0).
Block rows 1, 3 and 5 therefore end up on row 0 of the Process Grid, and block rows 2 and 4 on row 1.
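The walkthrough above amounts to a modulo rule, applied in the same way to block rows and block columns. The bash sketch below prints the owner of every block for the 9x9 example. It assumes blocks are numbered from 1 and that the first block is placed on process (0,0). The function name block_owner is hypothetical.

```bash
#!/bin/bash
# Owner (grid row, grid column) of global block B(bi,bj) on an nprow x npcol grid.
block_owner() {
    local bi=$1 bj=$2 nprow=$3 npcol=$4
    echo "$(( (bi - 1) % nprow )),$(( (bj - 1) % npcol ))"
}

# 9x9 matrix with 2x2 blocks: ceil(9/2) = 5 block rows and 5 block columns
n=9; nb=2; nprow=2; npcol=3
nblocks=$(( (n + nb - 1) / nb ))
for (( bi = 1; bi <= nblocks; bi++ )); do
    row=""
    for (( bj = 1; bj <= nblocks; bj++ )); do
        row+="B${bi}${bj}->($(block_owner "$bi" "$bj" "$nprow" "$npcol")) "
    done
    echo "${row% }"
done
```

Each output line corresponds to one step of the walkthrough: all blocks of a block row go to the same grid row, and within it they cycle over grid columns 0, 1, 2, 0, 1.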
Before calling solvers and other functions from the ScaLAPACK library the user needs to configure the processes into a
BLACS 2D Process Grid. ScaLAPACK requires that the data arrays be block-cyclically distributed across the 2D Process
Grid.
(1) First, each process obtains its process number (for example with the BLACS routine BLACS_PINFO),
which identifies the process in the process grid, and how many processes there are in total.
(2) Next, you should set up a rectangular (2D) grid that is as close as possible to a square. The following algorithm was developed by
Carlo Cavazzoni of CINECA (it is not part of BLACS), and you must include it in your source:
call gridsetup (nprocs,nprow,npcol)
INPUT OUT OUT
This computes the number of rows and columns we'll have in the process grid, such that NPROW x NPCOL == NPROCS.
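One simple way to obtain such a grid is to take the largest divisor of NPROCS that does not exceed its square root as NPROW. The bash sketch below illustrates that idea only. It is not the source of gridsetup, and the function name grid_setup is hypothetical.

```bash
#!/bin/bash
# Choose nprow <= npcol with nprow * npcol == nprocs and nprow as large as possible.
grid_setup() {
    local nprocs=$1 nprow=1 i
    for (( i = 1; i * i <= nprocs; i++ )); do
        if (( nprocs % i == 0 )); then nprow=$i; fi
    done
    echo "$nprow $(( nprocs / nprow ))"
}

for p in 6 64 128; do
    read -r nprow npcol <<< "$(grid_setup "$p")"
    echo "nprocs=$p -> ${nprow}x${npcol}"
done
```

For 6 processes this gives the 2x3 grid used in the example above. For 64 and 128 processes it gives 8x8 and 8x16.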
(3) Finally, initialize the process grid with BLACS_GET and BLACS_GRIDINIT.
MPI has communicators, BLACS has contexts. For this example, the default context, which includes all processors, is all we
need. The BLACS_GET call simply returns a handle to the default context for use in subsequent BLACS calls.
BLACS_GRIDINIT informs BLACS of the grid extent which we computed in step (2). Behind the scenes, BLACS then assigns
processors to points in this virtual grid.
Note: 'r' ==> ROW-MAJOR ordering i.e. fill in first row left to right then next row.
'c' ==> COLUMN-MAJOR ordering.
We can now use subroutine BLACS_GRIDINFO to query BLACS on the information we supplied plus additional information
BLACS has computed such as its coordinate, e.g., COL and ROW, in the grid by specifying "context":
call blacs_gridinfo( context, nprow, npcol, myrow, mycol )
INPUT OUT OUT OUT OUT
Subroutine report_back prints the information we got from the call to blacs_gridinfo.
(a) determine a block size for the global arrays and block-cyclically distribute these blocks across the 2D processor grid
defined in steps (1), (2) and (3):
Here "Block size" refers to the dimensions of the subdivisions into which the global array is decomposed.
The ScaLAPACK User Guide recommends a block size of 64x64 for large arrays.
Carlo Cavazzoni of CINECA has written a subroutine to compute the block size for a given array: "Blockset" chooses a good
block size nb based on the size of the global array (i.e. dimension of matrix A), n, and number of rows and columns in the
processor grid (nprow,npcol). It also honors a maximum block size value, which, following the User Guide recommendation,
can simply be set at 64. We will then simply call blockset to determine the block size, nb:
call blockset( nb, 64, n, nprow, npcol)
OUT INPUT INPUT INPUT INPUT
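The text describes what blockset takes into account but not the rule it applies. The bash sketch below shows one plausible rule consistent with that description: give every process at least one block in each dimension, then cap the result at the maximum block size. It is an assumption for illustration, not the source of blockset, and the function name block_size is hypothetical.

```bash
#!/bin/bash
# Arguments: maximum block size, matrix order n, grid rows, grid columns.
block_size() {
    local nbmax=$1 n=$2 nprow=$3 npcol=$4
    local nb=$(( n / nprow ))
    if (( n / npcol < nb )); then nb=$(( n / npcol )); fi   # smaller of the two quotients
    if (( nb > nbmax )); then nb=$nbmax; fi                 # honor the maximum
    if (( nb < 1 )); then nb=1; fi                          # never return zero
    echo "$nb"
}

block_size 64 1000 8 8    # prints: 64
block_size 64 9 2 3       # prints: 3
```

Under this rule the maximum of 64 takes effect whenever the matrix order is at least 64 times the larger grid dimension.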
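(b) determine the dimensions of the local portion of each array
The ScaLAPACK function NUMROC is used to determine the number of rows and columns of matrix A that are stored on this process:
l_nrowsa = numroc(n,nb,myrow,0,nprow)
l_ncolsa = numroc(n,nb,mycol,0,npcol)
Similar calls are required for matrix C (the right hand side).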
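NUMROC only needs integer arithmetic, so its result is easy to reproduce by hand. The bash function below is a sketch of that arithmetic, with its arguments in the same order as the Fortran call (n, nb, iproc, isrcproc, nprocs). It is meant for checking small cases and does not replace the library routine.

```bash
#!/bin/bash
# Number of rows (or columns) of a block-cyclically distributed dimension of
# length n, block size nb, that land on process iproc out of nprocs processes.
numroc() {
    local n=$1 nb=$2 iproc=$3 isrcproc=$4 nprocs=$5
    local mydist=$(( (nprocs + iproc - isrcproc) % nprocs ))  # distance from source process
    local nblocks=$(( n / nb ))                               # number of full blocks
    local count=$(( (nblocks / nprocs) * nb ))                # full rounds over all processes
    local extrablks=$(( nblocks % nprocs ))                   # full blocks left after those rounds
    if (( mydist < extrablks )); then
        count=$(( count + nb ))                               # one more full block
    elif (( mydist == extrablks )); then
        count=$(( count + n % nb ))                           # the trailing partial block, if any
    fi
    echo "$count"
}

# 9x9 matrix, 2x2 blocks, 2x3 process grid
echo "local rows:    $(numroc 9 2 0 0 2) $(numroc 9 2 1 0 2)"                      # prints: local rows:    5 4
echo "local columns: $(numroc 9 2 0 0 3) $(numroc 9 2 1 0 3) $(numroc 9 2 2 0 3)"  # prints: local columns: 4 3 2
```

The local counts along each dimension add up to 9, so process (0,0) holds a 5x4 local array and process (1,2) a 4x2 one.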
(c) allocate local memory for each process' portion of the array
The following Fortran statement will allocate the required local memory for the segment of the matrix A used by this processor:
allocate (a(l_nrowsa, l_ncolsa ), stat=istat)
Similar statements are required for the arrays C and IPIV, etc ...
(d) distribute the actual data values into the allocated memory
The ScaLAPACK subroutine PDELSET is used to store the global values of matrices A and C to the appropriate locations in
each processor.
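PDELSET is called with global indices and stores the value only on the process that owns that element. The bash sketch below shows the index arithmetic along one dimension of the grid: which process owns global index g, and at which local index the element is stored. It assumes 1-based indices and that the first block lives on process 0. The function name global_to_local is hypothetical.

```bash
#!/bin/bash
# Print "owner local_index" for 1-based global index g, block size nb and
# nprocs processes along one dimension of the process grid.
global_to_local() {
    local g=$1 nb=$2 nprocs=$3
    local blk=$(( (g - 1) / nb ))                             # 0-based global block number
    local proc=$(( blk % nprocs ))                            # blocks are dealt out cyclically
    local loc=$(( (blk / nprocs) * nb + (g - 1) % nb + 1 ))   # position inside the local array
    echo "$proc $loc"
}

# Rows of the 9x9 example: block size 2, 2 process rows
for g in 1 3 5 9; do
    echo "global row $g -> $(global_to_local "$g" 2 2)"
done
```

Global row 9 maps to local row 5 on process row 0, which agrees with the 5 local rows that NUMROC reports for that process row.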
When we are done with a BLACS context, it should be released by calling subroutine BLACS_GRIDEXIT, and when we are
done with BLACS altogether, we should call BLACS_EXIT. (These calls are similar to the MPI functions MPI_Comm_free
and MPI_Finalize.)
In the next section a full listing of the program designed to solve a system of STRIDWAD equations of an even order N will
appear. This program uses ScaLAPACK routines by distributing the coefficient matrix A and right hand side vector on the
processors of the 2D Process Grid.
The timing routines are used to measure the cpu and elapsed times in different segments of the program. Several milestones are
set up throughout the program and the timings between these milestones are reported at the end of the program.
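The elapsed time between two milestones is obtained from the differences of the recorded day, hour, minute, second and millisecond values. The bash sketch below applies the same formula in integer milliseconds. The function name elapsed_ms is hypothetical.

```bash
#!/bin/bash
# Elapsed milliseconds between two timestamps, each given as
# day hour minute second millisecond.
elapsed_ms() {
    local d1=$1 h1=$2 m1=$3 s1=$4 ms1=$5
    local d2=$6 h2=$7 m2=$8 s2=$9 ms2=${10}
    local secs=$(( (d2 - d1) * 86400 + (h2 - h1) * 3600 + (m2 - m1) * 60 + (s2 - s1) ))
    echo $(( secs * 1000 + (ms2 - ms1) ))
}

# From day 1, 23:59:59.900 to day 2, 00:00:00.100
elapsed_ms 1 23 59 59 900  2 0 0 0 100    # prints: 200
```

Individual differences may be negative, as in this example, but the sum is still correct. The day value is the day of the month, so a run that crosses a month boundary would need extra handling.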
The makefile and program listings follow:
The make command is used to execute the commands in the makefile. The makefile has several macros with default values, but
the user can override these values. Several scripts are invoked to verify some parameters, and if these parameters are out of
range the script returns a non-zero return code, which stops make from executing any further commands in the makefile.
For example if the default parameters N, P and FPP_OPT were set to:
N:= 1
P:= 64
FPP_OPT:= -DPRINT_TIMES -DNO_PRINT_MEMORY_DISTR
the utility program make will proceed to read the makefile and carry out the following tasks:
Check if N is in the range [1,40] (N times 1000 is the order of the system to solve)
Create the include file parameter_N.inc by making a copy of
file parameter_N.inc_basic and substituting the value of N (here 1)
Invoking fpp with option -DPRINT_TIMES and generating a new source
file ScaLAPACK_SOLVE2.f90
Compiling the latter with the command compile, using the libraries defined in the makefile by
macro $(LIB)
Generating a script to submit the job based on file sub_batch_ScaLAPACK_SOLVE_basic
and setting the parameter for the number of processors to use
Submitting the job to the test queue
Since we are submitting to the test queue we must wait until the job finishes running; we also need to type the command:
make clean
If we want to run the case for a system of order 4000 using 128 processors and do not want to time the routines but want to see
how the matrix A is distributed on the processors we could type:
make N=4 P=128 "FPP_OPT=-DNO_PRINT_TIMES -DPRINT_MEMORY_DISTR"
The macro USE_QUEUE is used to decide if the job to be submitted should be run in the test queue or not. This macro must
be set up manually by editing the makefile.
Thus, for jobs to be submitted to the test queue the macro should look like this:
USE_QUEUE:= \-t
# USE_QUEUE:=\-v
For longer production jobs (not to be submitted to the test queue) the macro should look like this:
# USE_QUEUE:= \-t
USE_QUEUE:=\-v
and the script sub_batch_ScaLAPACK_SOLVE_basic could use a value larger than 60m for parameter -r instead of -r
60m.
As you become familiar with the macros in the makefile, you can adjust them to fit your needs. Here is the makefile used in
this online tutorial:
# Determine the platform that is being used for this compilation
N:= 1
P:= 64
FPP_OPT:= -DPRINT_TIMES -DNO_PRINT_MEMORY_DISTR
USE_QUEUE:= \-t
# USE_QUEUE:=\-v
EXE = ScaLAPACK_SOLVE
EXT:= f90
SRC:= ${EXE}.$(EXT)
EXE2:=$(EXE)2
SRC2:=${EXE2}.$(EXT)
$(EXE): $(SRC)
@echo Check if N is in the range [1,40]
./check_N $N
# @echo COMPILING LAPACK DP program with N = $(N),000
# @echo PLATFORM used = $(PLATFORM)
# @echo Edit the parameter_N.inc file
cp parameter_N.inc_basic parameter_N.inc
./ed_parm_N parameter_N.inc $N
# @echo "Using following parameter_N.inc file:"
# cat parameter_N.inc
# @echo "LIB = " $(LIB)
cp sub_batch_ScaLAPACK_SOLVE_basic sub_batch_ScaLAPACK_SOLVE
./ed_procs_P sub_batch_ScaLAPACK_SOLVE $(USE_QUEUE) $P
./sub_batch_ScaLAPACK_SOLVE
superclean: clean
@echo ' '
@echo Removing Output files generated by $(EXE)
@echo '--------------------------------------------------'
@echo ' '
rm -rf ${CLU}_SERIAL_DP_SOLVE_*
@echo ' '
@echo Output files have been removed.
@echo '-------------------------------'
@echo ' '
clean:
@echo ' '
@echo Cleaning executable $(EXE)
@echo '-----------------------------------'
@echo ' '
rm -rf $(EXE)
rm -rf ED_ERROR
@echo ' '
@echo Executable $(EXE) has been removed.
@echo '--------------------------------------------'
@echo ' '
help:
@echo "+-----------------------------------------------------------------------------------+"
@echo "| |"
@echo "| Makefile for running ScaLAPACK example using pdgetrf/pdgetrs |"
@echo "| |"
@echo "| Usage: make compile program for N=1 (i.e. order=1,000) |"
@echo "| |"
@echo "| make N=n compile program for N=n (i.e. order=n*1000) |"
@echo "| where 1 <= n <= 40 |"
@echo "| |"
@echo "| make P=p submit job with -q threaded -n p |"
@echo "| where 1 <= p <= 4 or 6 (depending on cluster) |"
@echo "| |"
@echo "| make superclean remove output files and executable |"
@echo "| |"
@echo "| make clean remove executable |"
@echo "| |"
@echo "| make help display makefile usage information |"
@echo "| |"
@echo "+-----------------------------------------------------------------------------------+"
@echo " "
@echo PLATFORM $(PLATFORM)
Full fortran program listing using ScaLAPACK routines to solve a STRIDWAD system
The full fortran 90 program (which includes fpp directives) is listed next:
program ScaLAPACK_SOLVE
!
! file name = ScaLAPACK_SOLVE.f90
!
implicit none
include 'mpif.h'
integer :: istat,info,i,j,N_PRT,milestone
real(kind=8),dimension(:,:),allocatable :: a
real(kind=8),dimension(:,:),allocatable :: c
real(kind=8),dimension(:),allocatable :: solution
integer,dimension(:),allocatable :: ipiv
integer*8 :: N
real(kind=8) :: memory_local, memory_sum
integer :: root, ierr
real(kind=8),dimension(:),allocatable :: memory_a
logical :: PRT_MEM
include "parameter_N.inc"
integer,parameter :: descriptor_len=9
integer :: desca( descriptor_len )
integer :: descc( descriptor_len )
#ifdef PRINT_TIMES
integer :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
integer :: day_end,hour_end,minute_end,second_end,millisec_end
character*23 :: date_time
real(kind=8) :: T1, T2
real(kind=8) :: TOT_TIME
real(kind=8) :: tm
#endif
! --------------------------------------------------------------------------
interface
subroutine report_back ( context, &
iam,nprocs,myrow,nprow,mycol,npcol)
implicit none
integer :: context
integer :: iam,nprocs,myrow,nprow,mycol,npcol
integer :: i,j,pnum
end subroutine report_back
#ifdef PRINT_TIMES
subroutine elapsed_time(day_beg,hour_beg, minute_beg,second_beg,&
& millisec_beg, day_end, hour_end, minute_end, second_end, &
& millisec_end,tid,ITEM )
implicit none
integer :: day_beg,hour_beg,minute_beg,second_beg,millisec_beg
integer :: day_end,hour_end,minute_end,second_end,millisec_end
integer :: day,hour, minute, second, millisec, tid, ITEM
end subroutine elapsed_time
subroutine timestamp(date_time,day,hour,minute,second,millisec)
implicit none
character*23 :: date_time
integer :: day, hour, minute, second, millisec
integer :: elements(8)
character*3 :: months(12)
end subroutine timestamp
#endif
end interface
! ---------------------------------------------------------------------------
#ifdef PRINT_MEMORY_DISTR
PRT_MEM = .TRUE.
#else
PRT_MEM = .FALSE.
#endif
!
! Trace intermediate steps only for small values of N
DEBUG = 1
root = 0
! N must be EVEN
NH = N/2
NH1 = NH + 1
! ---------------------------------------------------------------------------
#ifdef PRINT_TIMES
call timestamp(date_time,day_beg,hour_beg,minute_beg,second_beg, &
& millisec_beg )
call cpu_time(T1)
milestone = 1
call timestamp( date_time,day(milestone), hour(milestone), &
& minute(milestone),second(milestone),millisec(milestone) )
#endif
l_nrowsc = numroc(n,nb,myrow,0,nprow)
l_ncolsc = numroc(1,nb,mycol,0,npcol)
call descinit( descc, n, 1, nb, nb, 0, 0, context, l_nrowsc, info )
!
! ----- Allocate LHS, RHS, pivot, and solution -----
!
allocate (a(l_nrowsa, l_ncolsa ), stat=istat)
if (istat/=0) stop "ERR:ALLOCATE FAILS for A"
memory_local = l_nrowsa*l_ncolsa*8
IF (IAM.EQ.0) THEN
IF (N .le. N_PRT .or. PRT_MEM ) THEN
ITEM = 6500
memory_sum = 0
write (6,550) ITEM,N
550 FORMAT(I4," N = ",i12 /" Distributed memory for A: ")
DO I=0,nprocs-1
ITEM = 6500 + I
WRITE (6,551) ITEM,I,memory_a(I+1)
memory_sum = memory_sum + memory_a(I+1)
551 FORMAT(I4," processor = ",I3," ",f14.1)
ENDDO
ITEM = ITEM + 1
WRITE (6,552) ITEM,memory_sum,float(N)*float(N)*8
552 FORMAT(I4," TOTAL MEMORY for A = ",f14.1," == N*N*8 = ",f14.1)
ENDIF
call flush(6_4)
ENDIF
!
! ----- Initialize LHS and RHS
!
call init_my_matrix (n,a,nprow,npcol,myrow,mycol,desca)
call init_my_rhs (n,c,nprow,npcol,myrow,mycol,descc)
!
! ----- Show how arrays distributed
!
IF (DEBUG .eq. 1) THEN
IF (IAM.EQ.0) THEN
write(6,300) ITEM,n,n
300 format(i4," DISTRIBUTION OF ARRAY: A - Global dimension:", &
& i3,":",i3)
ENDIF
IAorC = 0
call printlocals ( context, &
& a,iam,nb,nprocs,myrow,mycol,l_nrowsa,l_ncolsa,IAorC)
IF (IAM.EQ.0) THEN
write(6,400) ITEM,N,1
400 FORMAT(i4," DISTRIBUTION OF ARRAY: C - Global dimension:", &
& i3,":",i3)
ENDIF
IAorC = 1000
call printlocals ( context, &
& c,iam,nb,nprocs,myrow,mycol,l_nrowsc,l_ncolsc,IAorC)
ENDIF
#ifdef PRINT_TIMES
milestone = milestone + 1
call timestamp( date_time,day(milestone), hour(milestone), &
& minute(milestone),second(milestone),millisec(milestone) )
#endif
#ifdef PRINT_TIMES
milestone = milestone + 1
call timestamp( date_time,day(milestone), hour(milestone), &
& minute(milestone),second(milestone),millisec(milestone) )
#endif
#ifdef PRINT_TIMES
milestone = milestone + 1
call timestamp( date_time,day(milestone), hour(milestone), &
& minute(milestone),second(milestone),millisec(milestone) )
#endif
IF (IAM.EQ.0) THEN
IF (N .le. N_PRT) THEN
ITEM = 6000
write (6,500) ITEM
500 FORMAT(I4," SOLUTION: ")
DO I=1,N
ITEM = 6000 + I
WRITE (6,501) ITEM,I,SOLUTION(I)
501 FORMAT(I4," I = ",I3," ",f6.2)
ENDDO
ELSE
DO I=N-N_PRT,N
WRITE (6,502) I,SOLUTION(I)
502 FORMAT(" I=",I10," ",f13.2)
ENDDO
ENDIF
ENDIF
#ifdef PRINT_TIMES
milestone = milestone + 1
call timestamp( date_time,day(milestone), hour(milestone), &
& minute(milestone),second(milestone), millisec(milestone) )
#endif
#ifdef PRINT_TIMES
ITEM = 7000
IF (IAM .eq. 0) WRITE(6,1007) ITEM,iam,milestone
1007 FORMAT(I4," iam = ",i3," FINAL value of milestone = ",i2)
do i=1,milestone-1
ITEM = 7000 + I
tm = 86400.0*(day(i+1)-day(i)) + 3600.0*(hour(i+1)-hour(i)) + &
& 60.0*(minute(i+1)-minute(i)) + (second(i+1)-second(i)) + &
& 0.001 * (millisec(i+1)-millisec(i))
IF (IAM .EQ. 0) write(6,1005) ITEM,iam, i, i+1, tm, &
& descr_milestone(i)
1005 format(I4," iam ",i3," Time from milestone ",i2," to milestone "&
& ,I2," equals ",f10.2,3x,a30)
enddo
ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1011) ITEM,iam,date_time
1011 format(I4," iam ",i3," END ",a23)
call cpu_time(T2)
TOT_TIME = T2-T1
ITEM = ITEM + 1
IF (IAM .EQ. 0) write(6,1012) ITEM,iam,TOT_TIME
1012 format(I4," IAM ",i3," Total CPU Time/processor = ",f11.4, &
& " seconds")
#endif
! ---------------------------------------------------------------------------
!
! ----- Cleanup arrays -----
!
deallocate (a,c,ipiv,solution,memory_a)
!
! ----- Exit BLACS cleanly -----
!
call blacs_gridexit( context )
call blacs_exit( 0 )
end
subroutine gridsetup(nproc,nprow,npcol)
!
! This subroutine factorizes the number of processors (nproc)
! into nprow and npcol, the dimensions of the 2D processor mesh.
!
! Written by Carlo Cavazzoni
!
implicit none
integer nproc,nprow,npcol
integer sqrtnp,i
return
end
!
!-----------------------------------------------------------------------
!
subroutine blockset( nb, nbuser, n, nprow, npcol)
!
! This subroutine tries to choose an optimal block size
! for the distributed matrix.
!
! Written by Carlo Cavazzoni, CINECA
!
implicit none
integer*8 :: N
integer nb, nprow, npcol, nbuser
return
end subroutine blockset
!
!-----------------------------------------------------------------------
!
subroutine report_back ( context, &
iam,nprocs,myrow,nprow,mycol,npcol)
!
! Each processor identifies itself and its place in the processor grid
!
implicit none
integer :: context
integer :: iam,nprocs,myrow,nprow,mycol,npcol
integer :: i,j,pnum,ITEM
do i=0,nprocs-1
call blacs_barrier (context, 'a')
if (iam.eq.i) then
ITEM = 3000 + 100*MYROW + 10*MYCOL
write(6,100) ITEM,iam,nprocs,myrow,mycol,myrow,nprow,mycol, &
& npcol
do pnum=0,nprocs-1
if (iam .eq. pnum) then
ITEM = IAorC + 4000 + 100*MYROW + 10*MYCOL
write (6,100) ITEM,iam,myrow,mycol,nb,l_nrows,l_ncols, &
& l_nrows*l_ncols*8
100 format (i4," proc:",i3," grid position:",i3,",",i3, &
& " blksz:",i3," numroc:",i3,":",i3," Memory Allocated "&
& ,i6," bytes")
do i=1,l_nrows
ITEM = IAorC + 4000 + 100*MYROW + 10*MYCOL + I
write (6,200) ITEM,(a(i,j),j=1,l_ncols)
200 format (I4," ",40(" ",f5.1))
call flush(6_4)
enddo
endif
call blacs_barrier (context, 'a')
enddo
call flush(6_4)
end subroutine printlocals
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_MATRIX (N,A,NPROW,NPCOL,MYROW,MYCOL,DESCA)
IMPLICIT NONE
integer*8 :: N
INTEGER :: NH,NH1,NPROW,NPCOL,MYROW,MYCOL
INTEGER :: DESCA(:)
REAL(kind=8) :: A(:,:)
REAL(kind=8) :: AII,AII1,AI1I,AINI1
INTEGER :: I, J
! Loop over the elements of the global array; PDELSET stores a value
! only on the process whose local portion holds that element.
NH = N/2
NH1 = NH + 1
! Diagonal elements
DO I = 1,N
! A(I,I) = 3.0
AII = 3.0
J = I
CALL PDELSET(A,I,J,DESCA,AII)
ENDDO
! Upper diagonal
DO 180 I = 1,N-1
if (I .EQ. NH) go to 180
! A(I,I+1) = -1.0d0
AII1 = -1.0d0
J = I+1
CALL PDELSET(A,I,J,DESCA,AII1)
180 CONTINUE
! Lower diagonal
DO 160 I = 1,N-1
if (I .EQ. NH) go to 160
! A(I+1,I) = -1.0d0
AI1I = -1.0d0
J = I
CALL PDELSET(A,I+1,J,DESCA,AI1I)
160 CONTINUE
!
! ANTI-DIAGONAL
!
DO 190 I = 1,N
if (I .EQ. NH .OR. I .EQ. NH1) go to 190
! A(I,N-I+1) = 1.0d0
AINI1 = 1.0d0
J = N-I+1
CALL PDELSET(A,I,J,DESCA,AINI1)
190 CONTINUE
return
end
!
!-----------------------------------------------------------------------
!
SUBROUTINE INIT_MY_RHS (N,C,NPROW,NPCOL,MYROW,MYCOL,DESCC)
IMPLICIT NONE
integer*8 :: N
INTEGER :: NPROW,NPCOL,MYROW,MYCOL
INTEGER :: DESCC(:)
REAL(kind=8) :: C(:,:)
REAL(kind=8) :: CI, CN
INTEGER :: I
DO I= 1, N-1
! C(I) = N + 1.0
CI = N + 1.0
CALL PDELSET(C,I,1,DESCC,CI)
ENDDO
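! C(N) = 2*N + 2
! After the loop above has finished, I equals N, so this call sets
! the last element of the right-hand side.
CN = 2*N + 2
CALL PDELSET(C,I,1,DESCC,CN)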
return
end
!
!-----------------------------------------------------------------------
!
subroutine get_solution (n,c,descc,solution)
implicit none
integer*8 :: n
integer :: descc(:)
real(kind=8) :: c(:,:),solution(:)
integer :: i
do i= 1, n
call pdelget('A',' ',solution(i),c,i,1,descc)
enddo
return
end
!
!-----------------------------------------------------------------------
!
#ifdef PRINT_TIMES
subroutine elapsed_time(day_beg, hour_beg, minute_beg, second_beg,&
& millisec_beg, day_end, hour_end, minute_end, second_end,&
& millisec_end,tid, ITEM )
!date_and_time
! VALUES
! must be of type default integer and of rank one. It is an INTENT(OUT)
! argument. Its size must be at least eight. The values returned in VALUES
! are as follows:
! VALUES(1)
! is the year (for example, 1998), or -HUGE (0) if no date is available.
! VALUES(2)
! is the month of the year, or -HUGE (0) if no date is available.
! VALUES(3)
! is the day of the month, or -HUGE (0) if no date is available.
! VALUES(4)
! is the time difference with respect to Coordinated Universal Time (UTC)
! in minutes, or -HUGE (0) if this information is not available.
!
! VALUES(5)
! is the hour of the day, in the range 0 to 23, or -HUGE (0) if there is
! no clock.
!
! VALUES(6)
! is the minutes of the hour, in the range 0 to 59, or -HUGE (0) if there
! is no clock.
!
! VALUES(7)
! is the seconds of the minute, in the range 0 to 60, or -HUGE (0) if
! there is no clock.
!
! VALUES(8)
! is the milliseconds of the second, in the range 0 to 999, or -HUGE (0)
! if there is no clock.
implicit none
integer :: day_beg, hour_beg, minute_beg, second_beg, millisec_beg
integer :: day_end, hour_end, minute_end, second_end, millisec_end
integer :: day, hour, minute, second, millisec, tid, ITEM
if ( .not. ( day .eq. 0 .and. hour .eq. 0 .and. minute .eq. 0 &
& .and. second .eq. 0 .and. millisec .eq. 0 ) ) then
ITEM = ITEM + 1
IF (tid .eq. 0 ) THEN
write(6,8000) ITEM,tid,day
8000 format(I4," IAM = ",I3," ELAPSED0 days = ",i3)
write(6,8001) ITEM,tid,hour
8001 format(I4," IAM = ",I3," ELAPSED1 hours = ",i3)
write(6,8002) ITEM,tid,minute
8002 format(I4," IAM = ",I3," ELAPSED2 minute = ",i3)
write(6,8003) ITEM,tid,second
8003 format(I4," IAM = ",I3," ELAPSED3 second = ",i3)
write(6,8004) ITEM,tid,millisec
8004 format(I4," IAM = ",I3," ELAPSED4 millisec = ",i3)
ENDIF
endif
return
end subroutine elapsed_time
#endif
!
!-----------------------------------------------------------------------
!
#ifdef PRINT_TIMES
subroutine timestamp( date_time, day,hour,minute,second,millisec)
implicit none
character*23 :: date_time
integer :: day, hour, minute, second, millisec
integer :: elements(8)
character*3 :: months(12)
day = elements(3)
hour = elements(5)
minute = elements(6)
second = elements(7)
millisec = elements(8)
else
date_time=' '
day = 0
hour = 0
minute = 0
second = 0
millisec = 0
endif
end subroutine timestamp
#endif
!
!-----------------------------------------------------------------------
!
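The element pattern that INIT_MY_MATRIX and INIT_MY_RHS set through PDELSET can be checked on a single processor without BLACS or ScaLAPACK. The following is a minimal sketch, not part of the original listing: the module name `stridwad_sketch` and the routines `build_stridwad` and `max_rhs_error` are hypothetical names introduced only for this illustration. It builds a small dense STRIDWAD matrix with the same three loops and measures how far A*x, with x(i)=i, is from the expected right-hand side.

```fortran
! Serial sketch (hypothetical helper names): build a dense STRIDWAD matrix
! of small even order and compare A*x with the expected right-hand side.
module stridwad_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Fill the dense n-by-n matrix with the same pattern that
  ! INIT_MY_MATRIX sets element by element through PDELSET.
  subroutine build_stridwad(n, a)
    integer, intent(in) :: n               ! order of the matrix, must be even
    real(dp), intent(out) :: a(n, n)
    integer :: i, nh
    nh = n / 2
    a = 0.0_dp
    do i = 1, n
      a(i, i) = 3.0_dp                     ! main diagonal
    end do
    do i = 1, n - 1
      if (i == nh) cycle                   ! rows nh and nh+1 are not coupled
      a(i, i + 1) = -1.0_dp                ! upper diagonal
      a(i + 1, i) = -1.0_dp                ! lower diagonal
    end do
    do i = 1, n
      if (i == nh .or. i == nh + 1) cycle  ! middle rows have no anti-diagonal entry
      a(i, n - i + 1) = 1.0_dp             ! anti-diagonal
    end do
  end subroutine build_stridwad

  ! Largest deviation between A*x (with x(i) = i) and the expected
  ! right-hand side: c(i) = n + 1 for i < n and c(n) = 2n + 2.
  function max_rhs_error(n) result(err)
    integer, intent(in) :: n
    real(dp) :: err
    real(dp) :: a(n, n), x(n), c(n)
    integer :: i
    call build_stridwad(n, a)
    x = [(real(i, dp), i = 1, n)]
    c = real(n + 1, dp)
    c(n) = real(2 * n + 2, dp)
    err = maxval(abs(matmul(a, x) - c))
  end function max_rhs_error
end module stridwad_sketch

program check_stridwad
  use stridwad_sketch
  implicit none
  print '(a, es10.2)', ' max |A*x - c| for N = 6:  ', max_rhs_error(6)
  print '(a, es10.2)', ' max |A*x - c| for N = 18: ', max_rhs_error(18)
end program check_stridwad
```

A deviation of zero for both orders indicates that the pattern reproduces the defining property of STRIDWAD. The same check is a convenient way to validate the solution returned by the distributed solver for small N.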
Results for solutions of the STRIDWAD system for different orders of matrix A
LAPACK RESULTS
These are the results for serial LAPACK runs (nprocs=1). On one processor the largest order for which a result was obtained is 30,000; no timing is available for N=32,000.
N (order of matrix)   CPU time per processor [sec]
10000                  312.33
20000                 1808.32
30000                 5487.23
32000                     N/A
ScaLAPACK RESULTS
CONCLUSIONS
With ScaLAPACK on 256 processors, systems of order up to 250,000 can be solved, compared to an order of 30,000 for serial LAPACK.
Distributing the matrix over many processors reduces the run time and, because each processor stores only its share of the matrix, makes larger orders feasible.
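The memory side of this conclusion can be illustrated with a little arithmetic. The sketch below is an illustration under stated assumptions, not a measurement: it counts only the dense matrix A at 8 bytes per element (the same N*N*8 figure the program prints), ignores the right-hand side, pivot vector, and any workspace, assumes a perfectly even split across processes, and takes 1 GB as 10^9 bytes. The names `dense_memory_sketch`, `matrix_bytes`, and `bytes_per_process` are hypothetical.

```fortran
! Sketch (hypothetical helper names): storage needed for a dense N-by-N
! matrix of 8-byte reals, in total and per process for an even split.
module dense_memory_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! Bytes needed to hold a dense n-by-n matrix of 8-byte reals.
  pure function matrix_bytes(n) result(nbytes)
    integer(int64), intent(in) :: n
    integer(int64) :: nbytes
    nbytes = 8_int64 * n * n
  end function matrix_bytes

  ! Average share of the matrix per process, assuming an even split.
  pure function bytes_per_process(n, nprocs) result(share)
    integer(int64), intent(in) :: n
    integer, intent(in) :: nprocs
    real(real64) :: share
    share = real(matrix_bytes(n), real64) / real(nprocs, real64)
  end function bytes_per_process
end module dense_memory_sketch

program memory_estimate
  use dense_memory_sketch
  implicit none
  real(real64), parameter :: gb = 1.0e9_real64   ! assumption: 1 GB = 10**9 bytes
  print '(a, f8.2, a)', ' N =  30000,   1 process  : ', &
        bytes_per_process(30000_int64, 1) / gb, ' GB per process'
  print '(a, f8.2, a)', ' N = 250000,   1 process  : ', &
        bytes_per_process(250000_int64, 1) / gb, ' GB per process'
  print '(a, f8.2, a)', ' N = 250000, 256 processes: ', &
        bytes_per_process(250000_int64, 256) / gb, ' GB per process'
end program memory_estimate
```

Under these assumptions a matrix of order 30,000 occupies 7.2 GB on one process, while a matrix of order 250,000 occupies 500 GB in total but only about 1.95 GB per process when spread over 256 processes.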
All ScaLAPACK routines assume that the data has been distributed on the process grid prior to the invocation of the routine.
Detailed descriptions of the appropriate calling sequences for each of the ScaLAPACK routines can be found in the leading
comments of the source code or the ScaLAPACK Users' Guide:
http://netlib2.cs.utk.edu/scalapack/slug/scalapack_slug.html
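To make the distribution requirement concrete, the following minimal sketch shows how a one-dimensional block-cyclic layout assigns global rows to process rows. It is written from the layout rule itself rather than taken from the library: `block_cyclic_sketch`, `local_count`, and `global_to_local` are hypothetical names, the first block is assumed to live on process 0 (as in the `numroc(n,nb,myrow,0,nprow)` calls in the listing), and the values N=18, NB=2, NPROW=2 are chosen only for illustration. `local_count` is intended to mirror the quantity that NUMROC returns under that assumption.

```fortran
! Sketch (hypothetical helper names) of a 1-D block-cyclic layout with the
! first block on process 0.
module block_cyclic_sketch
  implicit none
contains
  ! Number of rows (or columns) out of n, in blocks of size nb, that are
  ! stored on process iproc when the blocks are dealt out to nprocs processes.
  pure function local_count(n, nb, iproc, nprocs) result(cnt)
    integer, intent(in) :: n, nb, iproc, nprocs
    integer :: cnt, nblocks, extra
    nblocks = n / nb                 ! number of complete blocks
    cnt = (nblocks / nprocs) * nb    ! full rounds that every process receives
    extra = mod(nblocks, nprocs)     ! complete blocks left after the full rounds
    if (iproc < extra) then
      cnt = cnt + nb                 ! one additional complete block
    else if (iproc == extra) then
      cnt = cnt + mod(n, nb)         ! the trailing partial block, if any
    end if
  end function local_count

  ! Owning process and 1-based local index of the 1-based global index ig.
  pure subroutine global_to_local(ig, nb, nprocs, owner, il)
    integer, intent(in) :: ig, nb, nprocs
    integer, intent(out) :: owner, il
    integer :: blk
    blk = (ig - 1) / nb              ! 0-based index of the block holding ig
    owner = mod(blk, nprocs)         ! blocks are dealt out round-robin
    il = (blk / nprocs) * nb + mod(ig - 1, nb) + 1
  end subroutine global_to_local
end module block_cyclic_sketch

program show_distribution
  use block_cyclic_sketch
  implicit none
  integer, parameter :: n = 18, nb = 2, nprow = 2   ! illustrative values only
  integer :: ip, ig, owner, il
  do ip = 0, nprow - 1
    print '(a, i2, a, i3, a)', ' process row ', ip, ' holds ', &
          local_count(n, nb, ip, nprow), ' rows'
  end do
  do ig = 1, n
    call global_to_local(ig, nb, nprow, owner, il)
    print '(a, i3, a, i2, a, i3)', ' global row ', ig, &
          ' -> process row ', owner, ', local row ', il
  end do
end program show_distribution
```

The same rule applies independently to the columns with NPCOL in place of NPROW, which is why the listing computes a local row count and a local column count before allocating each array.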