
High Performance Portable Libraries for Dense Linear Algebra

Introduction

• In this lecture we will cover the following topic:
  – High performance portable dense linear algebra libraries
• The following libraries will be introduced:
  – BLAS
  – LAPACK
  – BLACS
  – PBLAS
  – ScaLAPACK
• This will also be covered:
  – FORTRAN 77+

The Overall Picture

• The libraries form a layered stack (each layer builds on the ones below):
  – ScaLAPACK
  – LAPACK, PBLAS
  – BLAS, BLACS
  – MPI

NETLIB

• All software discussed in this lecture can be downloaded free of
  charge from NETLIB (repository for freely distributable numerical
  software)
  – http://www.netlib.org/
• Collection of all NETLIB links mentioned in the notes:
  – http://www.netlib.org/blas/
  – http://www.netlib.org/atlas/
  – http://www.netlib.org/blas/gemm_based/
  – http://www.netlib.org/lapack/
  – http://www.netlib.org/blacs/
  – http://www.netlib.org/scalapack/ (includes PBLAS)
Crash Course: FORTRAN 77+

• FORTRAN 77+ is used in these notes to refer to the dialect of
  FORTRAN 77 used by the LAPACK and ScaLAPACK developers.
  – Straight FORTRAN 77 is quite arcane and most compilers have
    implemented a set of extensions.
• FORTRAN has been the language of choice for scientific and
  engineering computing for a long time, partly because it:
  – Has extensive compiler support for multi-dimensional arrays
  – Has restrictions in the language that allow aggressive compiler
    optimizations
  – Has language support (in FORTRAN 90 and onwards) for dynamic
    memory management, derived types, object orientation, operator
    overloading, generic interfaces, array expressions, distributed
    arrays (co-arrays), etc.

Fixed Source Format

• FORTRAN 77+ has a strict source format known as the "fixed source
  format" (declared obsolescent in later standards)
• Columns are used for different things:
  – 1: Comment column
  – 2-5: Label columns
  – 6: Continuation column
  – 7-72: Statement columns
  – 73-: Truncated (silently)
• Set your editor to expand tabs to spaces; use 3 as the tabstop
  (two tabs takes you to column 7).
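• For instance, a minimal fixed-format fragment showing each column
  role: a comment line, statements starting in column 7, a continuation
  line with $ in column 6, and a label in columns 2-5:

*     Column 1 marks this line as a comment
      X = 1.0
      Y = X + 2.0 +
     $    3.0
   10 CONTINUE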

IF-THEN-ELSE

• IF-statement:
  – IF( <logical expression> ) <statement>
• IF-construct:
  • IF( <logical expression> ) THEN
      <block>
    [ELSE IF( <logical expression> ) THEN
      <block>]
    [ELSE
      <block>]
    END IF

GO TO, CONTINUE and Labels

• Labels
  – Integers from 1 to 9999
  – Placed in columns 2 to 5
  – Used as targets for GO TO statements and in DO loops
• GO TO-statement:
  – GO TO <label>
  – Transfers control to the statement labeled with <label>
• CONTINUE-statement:
  – CONTINUE
  – A do-nothing statement often used as a target statement and as the
    DO loop end statement.
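• A small sketch (illustrative values) combining the IF-construct, a
  label, GO TO, and CONTINUE:

      IF( N .LE. 0 ) GO TO 20
      IF( N .GT. 100 ) THEN
         N = 100
      ELSE
         N = N + 1
      END IF
   20 CONTINUE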

DO

• DO-construct:
  – DO <label> <var> = <low>, <high>[, <step>]
      <block>
    <label> CONTINUE
  – Example:
    • DO 10 J = 1, M, 2
         ...
      10 CONTINUE
  – New syntax:
    • DO J = 1, M, 2
         ...
      END DO

PROGRAM

• In FORTRAN you do not have a special function called MAIN, instead
  you have the PROGRAM construct:
  – PROGRAM [name]
      <declarations>
      <statements>
    END [PROGRAM name]
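• Putting the two together, a minimal complete program (illustrative)
  that sums the odd integers up to M:

      PROGRAM SUMODD
      INTEGER J, M, S
      M = 10
      S = 0
      DO 10 J = 1, M, 2
         S = S + J
   10 CONTINUE
      WRITE(*,*) 'SUM = ', S
      END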

SUBROUTINEs and FUNCTIONs

• SUBROUTINEs (think of C functions returning void)
  – CALL mysub(<arglist>)
• FUNCTIONs (think of C functions returning non-void)
  – <lval> = myfunc(<arglist>)
• Declaring a SUBROUTINE:
  – SUBROUTINE <name>(<dummy arglist>)
      <dummy argument type declarations>
    END [SUBROUTINE name]
• Declaring a FUNCTION:
  – <type> FUNCTION <name>(<dummy arglist>)
      <dummy argument type declarations>
    END [FUNCTION name]
  – Example:
    • INTEGER FUNCTION MAX(a, b)
      INTEGER a, b
      MAX = a
      IF( b .GT. a ) MAX = b
      END

Arithmetic Operators

  FORTRAN   C/Java
  +         +
  -         -
  *         *
  /         /
  **        N/A
  N/A       +=
  N/A       ++
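• The slides show a FUNCTION example; a corresponding SUBROUTINE sketch
  (hypothetical names) relies on FORTRAN's pass-by-reference semantics:

      SUBROUTINE SWAP(A, B)
*     Exchanges the values of the two INTEGER arguments.
      INTEGER A, B, T
      T = A
      A = B
      B = T
      END

*     Called as:  CALL SWAP(I, J)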
Logical Operators

  FORTRAN   C/Java
  .GT.      >
  .LT.      <
  .LE.      <=
  .GE.      >=
  .EQ.      ==
  .NE.      !=
  .EQV.     (logical ==)
  .NEQV.    (logical !=, exclusive or)
  .AND.     &&
  .OR.      ||
  .NOT.     !

Data Types

• INTEGER
  – Signed 32-bit (usually) integer
• LOGICAL
  – .TRUE. or .FALSE.
• CHARACTER(<length>) or CHARACTER (just one character)
  – 'string' or "string"
• REAL
  – Single precision IEEE (usually) floating point, f = 5E+0
• DOUBLE PRECISION
  – Double precision IEEE (usually) floating point, d = 5D+0
• COMPLEX
  – Single precision IEEE (usually) complex number, c = (r, i)
• COMPLEX*16
  – Double precision IEEE (usually) complex number, c = (r, i)
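• A small illustrative set of declarations and assignments using these
  types (values chosen arbitrarily):

      INTEGER          N
      LOGICAL          DONE
      CHARACTER*6      NAME
      REAL             F
      DOUBLE PRECISION D
      COMPLEX          C
      COMPLEX*16       Z
      N    = 42
      DONE = .FALSE.
      NAME = 'MATRIX'
      F    = 5E+0
      D    = 5D+0
      C    = (1.0, 2.0)
      Z    = (1D0, 2D0)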

Arrays (Matrices and Vectors)

• Declaring a vector of 50 INTEGERs
  – INTEGER vec(50)
• Declaring a 25 x 47 matrix of INTEGERs
  – INTEGER mtx(25, 47)
• Indexing starts from 1 (unless explicitly stated in the declaration)
• Indexing the top left element of the matrix:
  – mtx(1, 1)
• Indexing the bottom right element of the matrix:
  – mtx(25, 47)

Automatic Arrays

• The size of the array is either known at compile time or determined
  by dummy arguments, and the array is not a dummy argument itself.
• Storage will be allocated (think of it as being allocated on the
  stack) at runtime and deallocated automatically when the variable
  falls out of scope.
  – Example:
    • SUBROUTINE auto(N)
      INTEGER A(N)
      END
Assumed Shape Arrays

• The shape (extent of all dimensions) need not be known at compile
  time.
• An array where the extent of one or more dimensions is determined by
  dummy arguments is referred to as an assumed shape array.
  – Useful for passing arrays as arguments to subroutines.
  – Example:
    • SUBROUTINE mysub(A, LDA, M, N)
      INTEGER LDA, M, N
      REAL A(LDA, N)
      END

Assumed Size Arrays

• The extent of the last dimension of a FORTRAN array need not be known
  at compile time (or at runtime for that matter) to generate indexing
  code (the first dimension in the case of C).
• An array declared with unknown last dimension extent is referred to
  as an assumed size array.
  – REAL A(LDA, *)
• Indexing code: A(i, j) -> A + (i-1) + (j-1)*LDA
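• As an illustration (hypothetical routine name), the LDA convention is
  what lets a callee address a submatrix of a larger array:

      SUBROUTINE SCALE2(A, LDA, M, N, S)
*     Scales the M-by-N submatrix stored in A, with leading
*     dimension LDA, by the scalar S.
      INTEGER LDA, M, N, I, J
      REAL A(LDA, *), S
      DO 20 J = 1, N
         DO 10 I = 1, M
            A(I, J) = S * A(I, J)
   10    CONTINUE
   20 CONTINUE
      END

*     The caller can pass a corner of a larger 10 x 10 array B:
*     REAL B(10, 10)
*     CALL SCALE2(B(2, 3), 10, 4, 5, 2.0)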

Comments

• Comment lines are created by putting (almost) any character (usually
  * or c) in the first column:
  – Example:
    • c This is a comment
      * This is also a comment
      A = 1 c This is not a comment

Continuation (long statements)

• Long statements (going beyond column 72) can be broken into several
  lines by placing (almost) any character (usually numbers, $, &, +) in
  the continuation column (column 6)
  – Example:
    • A(1, 2) = longvariablename +
      $          anotherlongvariable
FORTRAN 77+/C "Interoperability"

• Calling FORTRAN 77+ from C:
  – These are the usual type relationships:
    • LOGICAL            (?)
      INTEGER            int
      CHARACTER          char
      REAL               float
      DOUBLE PRECISION   double
      COMPLEX            float[2]
      COMPLEX*16         double[2]
  – Everything in FORTRAN is passed by reference
    • This is usually implemented by passing a pointer.
    • INTEGER            int*
      DOUBLE PRECISION   double*
      CHARACTER          char*
  – Symbols are usually lower case with an added underscore:
    • SUBROUTINE MySUB(...) -> mysub_
  – Symbols with an underscore sometimes get an extra underscore:
    • SUBROUTINE My_SUB(...) -> my_sub__

Other things to know about FORTRAN

• FORTRAN is case insensitive
• FORTRAN passes everything by reference
• FORTRAN 77 has no type checking of arguments
• FORTRAN 77 has no support for recursive subroutines or functions
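• A hedged sketch of what this means in practice: the FORTRAN routine
  below would typically be callable from C through a prototype like the
  one shown in the comments (symbol name and calling convention are
  compiler dependent):

      SUBROUTINE ADDONE(N, X)
*     Adds 1.0 to each of the N elements of X.
      INTEGER N
      DOUBLE PRECISION X(*)
      INTEGER I
      DO 10 I = 1, N
         X(I) = X(I) + 1D0
   10 CONTINUE
      END

*     Typical matching C prototype and call (compiler dependent):
*       void addone_(int *n, double *x);
*       addone_(&n, x);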

Storage Formats used by the Libraries

• General matrices
  – Column Major
• Symmetric and triangular matrices
  – Column Major, Packed
• Band matrices
  – Diagonal Storage
• Tridiagonal matrices
  – Diagonal Storage

Full Storage Format

  Matrix indices                Memory placement
  11 12 13 14 15 16 17 18 19     0  9 18 27 36 45 54 63 72
  21 22 23 24 25 26 27 28 29     1 10 19 28 37 46 55 64 73
  31 32 33 34 35 36 37 38 39     2 11 20 29 38 47 56 65 74
  41 42 43 44 45 46 47 48 49     3 12 21 30 39 48 57 66 75
  51 52 53 54 55 56 57 58 59     4 13 22 31 40 49 58 67 76
  61 62 63 64 65 66 67 68 69     5 14 23 32 41 50 59 68 77
  71 72 73 74 75 76 77 78 79     6 15 24 33 42 51 60 69 78
  81 82 83 84 85 86 87 88 89     7 16 25 34 43 52 61 70 79
  91 92 93 94 95 96 97 98 99     8 17 26 35 44 53 62 71 80

• Element (i, j) of this 9 x 9 matrix is stored at offset
  (i-1) + (j-1)*9 in column-major order.
Standard Packed Storage Format

  Matrix indices               Memory placement
  11  *  *  *  *  *  *  *  *    0  *  *  *  *  *  *  *  *
  21 22  *  *  *  *  *  *  *    1  9  *  *  *  *  *  *  *
  31 32 33  *  *  *  *  *  *    2 10 17  *  *  *  *  *  *
  41 42 43 44  *  *  *  *  *    3 11 18 24  *  *  *  *  *
  51 52 53 54 55  *  *  *  *    4 12 19 25 30  *  *  *  *
  61 62 63 64 65 66  *  *  *    5 13 20 26 31 35  *  *  *
  71 72 73 74 75 76 77  *  *    6 14 21 27 32 36 39  *  *
  81 82 83 84 85 86 87 88  *    7 15 22 28 33 37 40 42  *
  91 92 93 94 95 96 97 98 99    8 16 23 29 34 38 41 43 44

Rectangular Full Packed Storage Format

  Matrix indices     Memory placement
  11 66 76 86 96      0  9 18 27 36
  21 22 77 87 97      1 10 19 28 37
  31 32 33 88 98      2 11 20 29 38
  41 42 43 44 99      3 12 21 30 39
  51 52 53 54 55      4 13 22 31 40
  61 62 63 64 65      5 14 23 32 41
  71 72 73 74 75      6 15 24 33 42
  81 82 83 84 85      7 16 25 34 43
  91 92 93 94 95      8 17 26 35 44

Banded Storage Format

  Full matrix indices
  11 12  *  *  *  *  *  *  *
  21 22 23  *  *  *  *  *  *
  31 32 33 34  *  *  *  *  *
   * 42 43 44 45  *  *  *  *
   *  * 53 54 55 56  *  *  *
   *  *  * 64 65 66 67  *  *
   *  *  *  * 75 76 77 78  *
   *  *  *  *  * 86 87 88 89
   *  *  *  *  *  * 97 98 99

  Banded storage (diagonals stored as rows)
  Matrix indices                Memory placement
   * 12 23 34 45 56 67 78 89     0  9 18 27 36 45 54 63 72
  11 22 33 44 55 66 77 88 99     1 10 19 28 37 46 55 64 73
  21 32 43 54 65 76 87 98  *     2 11 20 29 38 47 56 65 74
  31 42 53 64 75 86 97  *  *     3 12 21 30 39 48 57 66 75

BLAS

• Basic Linear Algebra Subprograms (BLAS)
  – http://www.netlib.org/blas/               Reference implementation
  – http://www.netlib.org/atlas/              Auto-tuning HPC implementation (ATLAS)
  – http://www.netlib.org/blas/gemm_based/    GEMM-based BLAS by Kågström et al.
  – http://www.tacc.utexas.edu/resources/software/   GotoBLAS
• Interfaces:
  – FORTRAN (official)
  – C, C++, Java, ... (unofficial)
• Language:
  – C, assembler, FORTRAN, ... (depends on vendor)
BLAS - Content

• BLAS contains subroutines and functions for a number of basic linear
  algebra operations:
  – Dot product
  – Givens rotation generation and application
  – Vector updates
  – Matrix-vector product update
  – Triangular system solve (with single or multiple right hand sides)
  – Matrix-matrix product update
  – ...
• The routines operate on various storage formats and on four data
  types (single, double, complex, double complex).

Coding Conventions

• _XXYY
  – _: Data type
    • S, D, C, or Z
  – XX: Type of matrix
    • GE, GB: GEneral, General Banded
    • HE, HB, HP: HErmitian, Hermitian Banded, Hermitian Packed
    • SY, SB, SP: SYmmetric, Symmetric Banded, Symmetric Packed
    • TR, TB, TP: TRiangular, Triangular Banded, Triangular Packed
  – YY: Operation
    • S: "Solve"
    • M: "Matrix"
    • V: "Vector"
    • R: Rank-1
    • R2: Rank-2
    • RK: Rank-k
    • R2K: Rank-2k
• Example:
  – DTRSM:
    • Double precision
    • TRiangular
    • Solve
    • Multiple right hand sides
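• Following the same scheme, DGEMM is the double precision GEneral
  Matrix-Matrix product. A minimal sketch of a call against the
  reference BLAS interface (values chosen arbitrarily):

      PROGRAM GEMMEX
*     Computes C := ALPHA*A*B + BETA*C with DGEMM.
      INTEGER          N
      PARAMETER        ( N = 2 )
      DOUBLE PRECISION A(N,N), B(N,N), C(N,N)
      DATA A / 1D0, 2D0, 3D0, 4D0 /
      DATA B / 1D0, 0D0, 0D0, 1D0 /
      DATA C / 0D0, 0D0, 0D0, 0D0 /
      CALL DGEMM( 'N', 'N', N, N, N, 1D0, A, N, B, N, 0D0, C, N )
      WRITE(*,*) C
      END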

Memory Traffic - Limitations

• Memory bandwidth and latency cannot match the high performance of
  floating point computations on the chip.
• Solution:
  – Exploit caches through data locality in space and time
• The solution requires:
  – An operation that has much inherent locality
• Metric for estimating inherent locality in linear algebra:
  – (number of floating point operations) /
    (number of memory locations referenced)
  – i.e., flop/memref

Locality - Examples

• Example of poor inherent locality: AXPY (a*x + y)
  – 2 vector loads (x, y)
  – 1 vector store (y)
  – 2 vector operations (*, +)
  – flop/memref = 2/3
• Example of good inherent locality: GEMM (a*A*B + b*C)
  – ~2*N^3 flops
  – ~3*N^2 loads
  – ~1*N^2 stores
  – flop/memref ~ N/2
Level 1, 2, 3

• Level-1 BLAS: Vector operations (~1 flop/memref)
  – _dot, _axpy, _swap, _copy, _scal, ...
• Level-2 BLAS: Matrix-Vector operations (~1 flop/memref)
  – _gemv, _symv, _trsv, ...
• Level-3 BLAS: Matrix-Matrix operations (~N flop/memref)
  – _gemm, _syrk, _trsm, ...

LAPACK

• Linear Algebra PACKage (LAPACK)
  – http://www.netlib.org/lapack/         Official LAPACK releases
  – http://www.netlib.org/lapack/lanws/   Publications related to LAPACK and DLA
• Some vendors provide their own optimized LAPACK routines as well as
  BLAS routines:
  – IBM: ESSL (Proprietary)
  – AMD: ACML (Free?)
  – Intel: MKL (Proprietary?)
  – Cray: libsci (Proprietary?)
• Interfaces:
  – FORTRAN (official)
  – C, C++, Java, ... (unofficial)
• Language:
  – FORTRAN 77+

LAPACK - Content

• Compared with BLAS, the high level algorithms and the tricky
  numerical algorithms go into LAPACK.
  – Factorizing matrices
    • LU, Cholesky, QR, QL, RQ, LQ, ...
  – Applying factored-form orthogonal matrices
  – Solving linear equations
  – Solving linear least squares problems
  – Decomposing matrices
    • SVD, Schur, ...
  – Computing eigenvalues and eigenvectors
    • Symmetric, non-symmetric, ...
  – Error bounds, condition estimation

Workspace Management

• Many routines in LAPACK require auxiliary workspace to function
  and/or to run faster.
• Users must provide this storage.
• Routines take workspace via their arguments, typically:
  – WORK: Workspace
  – LWORK: Length of workspace
• Routines requiring workspace allow workspace queries.
  – Workspace query:
    • Call with LWORK = -1
    • WORK(1) contains the required workspace on return
    • Cast to INTEGER: INT(WORK(1))
  – If you do a workspace query the routine will not modify any of its
    other arguments.
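• As an illustration, the usual two-call pattern with a workspace query
  (a sketch using DGEQRF, which appears in the examples below; the WORK
  array must be declared at least LWORK elements long):

*     First call: workspace query only (LWORK = -1).
      LWORK = -1
      CALL DGEQRF( M, N, A, LDA, TAU, WORK, LWORK, INFO )
*     The optimal workspace size is returned in WORK(1).
      LWORK = INT( WORK(1) )
*     Second call: the actual factorization.
      CALL DGEQRF( M, N, A, LDA, TAU, WORK, LWORK, INFO )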
Error Reporting

• LAPACK routines have an extra INTEGER argument at the end of their
  argument lists: INFO
  – The value of INFO tells what went wrong (if anything):
    • 0: Success
    • < 0: Argument number -INFO contained an illegal value (fatal,
      programming error)
    • > 0: Something went wrong during the computation (exact
      interpretation is routine specific)
  – Example: DGETRF (LU factorization)
    • INFO > 0: U(INFO, INFO) is exactly zero

LAPACK - Examples

• Solving a linear system after LU factorization
  – DGETRS( TRANS, N, NRHS, A, LDA, IPIV, B, LDB, INFO )
• Computing a QR factorization
  – DGEQRF( M, N, A, LDA, TAU, WORK, LWORK, INFO )
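• A minimal sketch of solving A*x = b with DGETRF and DGETRS (error
  handling reduced to checking INFO; the data values are arbitrary):

      PROGRAM SOLVE
      INTEGER          N, NRHS, INFO
      PARAMETER        ( N = 3, NRHS = 1 )
      INTEGER          IPIV(N)
      DOUBLE PRECISION A(N,N), B(N,NRHS)
      DATA A / 4D0, 1D0, 0D0,
     $         1D0, 4D0, 1D0,
     $         0D0, 1D0, 4D0 /
      DATA B / 1D0, 2D0, 3D0 /
*     LU factorization with partial pivoting: A = P*L*U.
      CALL DGETRF( N, N, A, N, IPIV, INFO )
      IF( INFO .NE. 0 ) STOP 'DGETRF FAILED'
*     Solve using the factors; the solution overwrites B.
      CALL DGETRS( 'N', N, NRHS, A, N, IPIV, B, N, INFO )
      IF( INFO .NE. 0 ) STOP 'DGETRS FAILED'
      WRITE(*,*) B
      END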

BLACS

• Basic Linear Algebra Communication Subprograms (BLACS)
  – http://www.netlib.org/blacs/   Official BLACS releases
• Purpose:
  – Communication of submatrices appropriate for dense linear algebra
    algorithms (e.g., ScaLAPACK)
• Objection:
  – "I know MPI inside out, why should I learn BLACS?"
• Answer:
  – It will hopefully be apparent at the end of this segment.
• Interfaces:
  – FORTRAN, C (official)
• Language:
  – C

2D-Grid, Scope, Context

• Processes are arranged in a logical 2D-grid.
• Each process is a member of three scopes:
  – 'All': All processes in the grid
  – 'Row': All processes on the same row of the grid
  – 'Column': All processes on the same column of the grid
• BLACS communication is tied to a context (think of MPI
  communicators), which is an integer.
Submatrix Communication

• The BLACS unit of communication is a submatrix of some specified size
  and shape.
• Two types of submatrices:
  – General submatrices:
    • Parameters: M, N, A, LDA
  – Trapezoidal submatrices (generalization of triangular):
    • Parameters: M, N, A, LDA, UPLO, DIAG
• Packing of matrices is hidden from the user
• Types supported:
  – I: Integer
  – S: Single precision
  – D: Double precision
  – C: Complex single precision
  – Z: Complex double precision

Point-to-Point

• Send:
  – xGESD2D(CTXT, M, N, A, LDA, RDST, CDST)
  – xTRSD2D(CTXT, UPLO, DIAG, M, N, A, LDA, RDST, CDST)
• Receive:
  – xGERV2D(CTXT, M, N, A, LDA, RSRC, CSRC)
  – xTRRV2D(CTXT, UPLO, DIAG, M, N, A, LDA, RSRC, CSRC)
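• A sketch of sending a 2 x 2 double precision block from the process at
  grid position (0, 0) to (1, 0), assuming the grid and context have
  been set up as shown later in the BLACS setup slide:

      IF( MYROW .EQ. 0 .AND. MYCOL .EQ. 0 ) THEN
*        Send the leading 2 x 2 block of A to process (1, 0).
         CALL DGESD2D( CTXT, 2, 2, A, LDA, 1, 0 )
      ELSE IF( MYROW .EQ. 1 .AND. MYCOL .EQ. 0 ) THEN
*        Receive it from process (0, 0).
         CALL DGERV2D( CTXT, 2, 2, A, LDA, 0, 0 )
      END IF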

Collectives

• Broadcast (send):
  – xGEBS2D(CTXT, SCOPE, TOP, M, N, A, LDA)
  – xTRBS2D(CTXT, SCOPE, TOP, UPLO, DIAG, M, N, A, LDA)
• Broadcast (receive):
  – xGEBR2D(CTXT, SCOPE, TOP, M, N, A, LDA, RSRC, CSRC)
  – xTRBR2D(CTXT, SCOPE, TOP, UPLO, DIAG, M, N, A, LDA, RSRC, CSRC)
• Combine operations (SUM, MAX, MIN):
  – xGSUM2D(CTXT, SCOPE, TOP, M, N, A, LDA, RDST, CDST)
  – xGAMX2D(CTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG, RDST, CDST)
  – xGAMN2D(CTXT, SCOPE, TOP, M, N, A, LDA, RA, CA, RCFLAG, RDST, CDST)

Collectives: Topology

• Topologies (TOP) specify the communication pattern:
  – 'I': Increasing ring
  – 'D': Decreasing ring
  – 'S': Split ring
  – 'M': Multi-ring
  – '1': 1-tree
  – 'B': Bidirectional exchange
  – ' ': Default (may use MPI_Bcast)
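• A sketch of broadcasting an M x N block along a process row with the
  default topology (the root is the process in grid column 0):

      IF( MYCOL .EQ. 0 ) THEN
*        Root of the row broadcast: send A to the rest of the row.
         CALL DGEBS2D( CTXT, 'Row', ' ', M, N, A, LDA )
      ELSE
*        The other processes on the row receive from column 0.
         CALL DGEBR2D( CTXT, 'Row', ' ', M, N, A, LDA, MYROW, 0 )
      END IF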
BLACS - Setup (FORTRAN)

• Initializing BLACS:
  – CALL BLACS_PINFO(ME, NP)
• Initializing a context:
  – CALL BLACS_GET(0, 0, CTXT)
    CALL BLACS_GRIDINIT(CTXT, 'Row', P, Q)
    CALL BLACS_GRIDINFO(CTXT, P, Q, MYROW, MYCOL)
• Getting someone's rank from coordinates:
  – RANK = BLACS_PNUM(CTXT, ROW, COL)
• Getting someone's coordinates from rank:
  – CALL BLACS_PCOORD(CTXT, RANK, ROW, COL)
• Exiting BLACS:
  – CALL BLACS_EXIT(0)

PBLAS

• Parallel BLAS
  – http://www.netlib.org/scalapack/   The PBLAS reference implementation
                                       is part of ScaLAPACK
• Interfaces:
  – FORTRAN
  – C?
• Language:
  – C
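• Putting the calls together, a minimal sketch of a program that sets up
  a 2 x 2 grid and shuts down again (grid dimensions hard coded for
  illustration):

      PROGRAM HELLO
      INTEGER ME, NP, CTXT, P, Q, MYROW, MYCOL
*     Find out how many processes there are.
      CALL BLACS_PINFO( ME, NP )
*     Get a default system context and map the processes onto a
*     2 x 2 grid in row-major order.
      P = 2
      Q = 2
      CALL BLACS_GET( 0, 0, CTXT )
      CALL BLACS_GRIDINIT( CTXT, 'Row', P, Q )
      CALL BLACS_GRIDINFO( CTXT, P, Q, MYROW, MYCOL )
      WRITE(*,*) 'PROCESS', ME, 'AT GRID POSITION', MYROW, MYCOL
      CALL BLACS_EXIT( 0 )
      END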

2D Block Cyclic Distribution

• PBLAS operates on data distributed using the 2D block cyclic
  distribution.
• Recall (5 x 5 matrix, 2 x 2 blocks, 2 x 2 process grid):

  Global 5 x 5 matrix     Local parts per process
  11 12 13 14 15          P(0,0): 11 12 15     P(0,1): 13 14
  21 22 23 24 25                  21 22 25             23 24
  31 32 33 34 35                  51 52 55             53 54
  41 42 43 44 45          P(1,0): 31 32 35     P(1,1): 33 34
  51 52 53 54 55                  41 42 45             43 44

Matrix Descriptors

• Descriptors are used in PBLAS and ScaLAPACK to encapsulate
  information about a distributed matrix.
• A descriptor is a 9-item integer vector:
  – INTEGER DESCA(9)
  – DESCA(1): (DTYPE) Descriptor type (1 for dense matrices)
    DESCA(2): (CTXT)  BLACS context
    DESCA(3): (M)     Number of rows in the global matrix
    DESCA(4): (N)     Number of columns in the global matrix
    DESCA(5): (MB)    Row blocking factor
    DESCA(6): (NB)    Column blocking factor
    DESCA(7): (RSRC)  Row index of the owner of A(1, 1)
    DESCA(8): (CSRC)  Column index of the owner of A(1, 1)
    DESCA(9): (LLD)   Leading dimension of the local matrix
PBLAS - Example

• Parallel version of DGEMM:
  – CALL PDGEMM( TRANSA, TRANSB,
                 M, N, K,
                 ALPHA, A, IA, JA, DESC_A,
                        B, IB, JB, DESC_B,
                 BETA,  C, IC, JC, DESC_C )
• Notice:
  – PBLAS has interfaces that take descriptions of submatrices
    explicitly
  – BLAS, on the other hand, takes submatrices implicitly

ScaLAPACK

• SCAlable LAPACK (distributed memory)
  – http://www.netlib.org/scalapack/   Official ScaLAPACK releases

ScaLAPACK - Content

• Most of LAPACK
• No support for band and packed matrices
• Missing some more advanced algorithms
  – SVD and QR-with-pivoting least squares
  – Generalized least squares
  – Non-symmetric eigenvalue problems
  – D&C SVD
  – ...

ScaLAPACK - Coding Conventions

• Symbols are similar to LAPACK (just add a P prefix)
• Submatrices are referenced explicitly in the interface:
  – A(I, J), LDA        LAPACK submatrix reference
  – A, I, J, DESCA      ScaLAPACK submatrix reference
Utilities: DESCINIT

• SUBROUTINE DESCINIT(DESC, M, N, MB, NB, RSRC, CSRC, CTXT, LLD, INFO)
• Initializes all elements of a descriptor.
• Arguments:
  – DESC         Descriptor to initialize (output)
  – M, N         Size of the global matrix
  – MB, NB       Blocking factors
  – RSRC, CSRC   Coordinates of the owner of the (1, 1) matrix element
  – CTXT         BLACS context
  – LLD          Leading dimension of the local matrix (use NUMROC to
                 find it)
  – INFO         Error reporting, 0: success (output)

Utilities: NUMROC

• INTEGER FUNCTION NUMROC(N, NB, ME, SRC, NP)
• Finds the number of rows (or columns) mapped to a specific grid row
  (or column).
• Arguments:
  – N     Extent of the matrix dimension
  – NB    Blocking factor in the matrix dimension
  – ME    Row (or column) index of the processor of interest
  – SRC   Row (or column) index of the source
  – NP    Number of processes in the grid dimension
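• A sketch of how the two utilities are typically combined to build a
  descriptor for an M x N matrix with MB x NB blocks (MYROW, NPROW and
  CTXT are assumed to come from the BLACS setup shown earlier):

      INTEGER          M, N, MB, NB, LLD, INFO
      INTEGER          DESCA(9)
      INTEGER          NUMROC
      EXTERNAL         NUMROC
      M  = 100
      N  = 100
      MB = 8
      NB = 8
*     Number of local rows owned by this process row; the local
*     leading dimension must be at least this large.
      LLD = MAX( 1, NUMROC( M, MB, MYROW, 0, NPROW ) )
*     Fill in the 9-element descriptor for the distributed matrix.
      CALL DESCINIT( DESCA, M, N, MB, NB, 0, 0, CTXT, LLD, INFO )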

Utilities: INFOG2L
• SUBROUTINE INFOG2L(GRINDX, GCINDX, DESC, NPROW, NPCOL,
MYROW, MYCOL, LRINDX, LCINDX, RSRC, CSRC)
• Given a global matrix element (GRINDX, GCINDX), returns
the corresponding local matrix element (LRINDX, LCINDX)
and coordinates of processor that owns that element
(RSRC, CSRC).
• Arguments:
– GRINDX, GCINDX Global matrix element
– DESC Descriptor of matrix
– NPROW, NPCOL Grid size
– MYROW, MYCOL My coordinates
– LRINDX, LCINDX Local matrix element (output)
– RSRC, CSRC Owner of element (output)
