Intel® Math Kernel Library for Linux* Developer Guide
Revision: 061
Contents
Legal Information
Getting Help and Support
Introducing the Intel® Math Kernel Library
What's New
Notational Conventions
Related Information
Legal Information
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this
document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of
merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from
course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information
provided here is subject to change without notice. Contact your Intel representative to obtain the latest
forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors which may cause deviations from
published specifications. Current characterized errata are available on request.
Intel, the Intel logo, Intel Atom, Intel Core, Intel Xeon Phi, VTune and Xeon are trademarks of Intel
Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Java is a registered trademark of Oracle and/or its affiliates.
Copyright 2006-2018, Intel Corporation.
This software and the related documents are Intel copyrighted materials, and your use of them is governed
by the express license under which they were provided to you (License). Unless the License provides
otherwise, you may not use, modify, copy, publish, distribute, disclose or transmit this software or the
related documents without Intel's prior written permission.
This software and the related documents are provided as is, with no express or implied warranties, other
than those that are expressly stated in the License.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and
SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-
dependent optimizations in this product are intended for use with Intel microprocessors. Certain
optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to
the applicable product User and Reference Guides for more information regarding the specific instruction
sets covered by this notice.
Notice revision #20110804
Getting Help and Support
NOTE
It is your responsibility when using Intel MKL to ensure that input data has the required format and does not contain invalid characters, which can cause unexpected behavior of the library.
The library requires subroutine and function parameters to be valid before they are passed. While some Intel MKL routines perform limited parameter checking, your application should validate parameters itself, for example, by checking for NULL pointers.
Introducing the Intel® Math Kernel Library
What's New
This Developer Guide documents Intel® Math Kernel Library (Intel® MKL) 2019.
The Developer Guide has been updated to fix inaccuracies in the document.
Notational Conventions
The following term is used in reference to the operating system.
Linux*  This term refers to information that is valid on all supported Linux* operating systems.
<parent directory>  The installation directory that includes the Intel MKL directory; for example, the directory for Intel® Parallel Studio XE Composer Edition.
Italic  Italic is used for emphasis and also indicates document names in body text, for example: see Intel MKL Developer Reference.
Monospace lowercase  Indicates filenames, directory names, and pathnames, for example: ./benchmarks/linpack
Monospace lowercase mixed with uppercase  Indicates commands and command-line options, for example:
icc myprog.c -L$MKLPATH -I$MKLINCLUDE -lmkl -liomp5 -lpthread
[ items ]  Square brackets indicate that the items enclosed in brackets are optional.
{ item | item }  Braces indicate that only one of the items listed between braces should be selected. A vertical bar ( | ) separates the items.
Related Information
To learn how to use the library in your application, use this guide in conjunction with the following
documents:
• The Intel® Math Kernel Library Developer Reference, which provides reference information on routine
functionalities, parameter descriptions, interfaces, calling syntaxes, and return values.
• The Intel® Math Kernel Library for Linux* OS Release Notes.
Getting Started
See Also
Notational Conventions
C shell (csh): mklvars.csh
For example:
• The command
mklvars.sh ia32
sets the environment for Intel MKL to use the IA-32 architecture.
• The command
mklvars.sh intel64 mod ilp64
sets the environment for Intel MKL to use the Intel 64 architecture, ILP64 programming interface, and
Fortran modules.
• The command
mklvars.sh intel64 mod
sets the environment for Intel MKL to use the Intel 64 architecture, LP64 interface, and Fortran modules.
NOTE
Supply the parameter specifying the architecture first, if it is needed. Values of the other two
parameters can be listed in any order.
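For a bash shell, a typical way to run the script is to source it so that the variables persist in the current session. The installation prefix shown below is illustrative; adjust it to your Intel MKL installation:

```shell
# Illustrative install prefix; adjust to where Intel MKL is installed.
source /opt/intel/mkl/bin/mklvars.sh intel64 mod ilp64
# The script exports MKLROOT among other variables:
echo "$MKLROOT"
```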
See Also
High-level Directory Structure
Intel® MKL Interface Libraries and Modules
Fortran 95 Interfaces to LAPACK and BLAS
Setting the Number of Threads Using an OpenMP* Environment Variable
Caution
Before uninstalling Intel MKL, remove the above commands from all profile files where the script
execution was added. Otherwise you may experience problems logging in.
See Also
Scripts to Set Environment Variables
Compiler Support
Intel® MKL supports compilers identified in the Release Notes. However, the library has been successfully
used with other compilers as well.
When building Intel MKL code examples for either C or Fortran, you can select a compiler: Intel®, GNU*, or
PGI*.
Intel MKL provides a set of include files to simplify program development by specifying enumerated values
and prototypes for the respective functions. Calling Intel MKL functions from your application without an
appropriate include file may lead to incorrect behavior of the functions.
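For instance, a C compile-and-link line that makes the Intel MKL include files visible might look like the following sketch. The paths and library selection are illustrative (an Intel 64, LP64, OpenMP-threaded configuration is assumed); see Linking in Detail for the full syntax:

```shell
# Illustrative: -I picks up mkl.h so calls get correct prototypes and
# enumerated values; the libraries assume the LP64 OpenMP-threaded layer.
icc -I${MKLROOT}/include myprog.c -o myprog \
    -L${MKLROOT}/lib/intel64 \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
```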
See Also
Intel® MKL Include Files
For each component, the examples are grouped in subdirectories mainly by Intel MKL function domains and
programming languages. For instance, the blas subdirectory (extracted from the examples_core archive)
contains a makefile to build the BLAS examples and the vmlc subdirectory contains the makefile to build the
C examples for Vector Mathematics functions. Source code for the examples is in the next-level sources
subdirectory.
See Also
High-level Directory Structure
What You Need to Know Before You Begin Using the Intel® Math Kernel Library
Target platform: Identify the architecture of your target machine:
• IA-32 or compatible
• Intel® 64 or compatible
Reason: Because Intel MKL libraries are located in directories corresponding to your particular architecture (see Architecture Support), you should provide proper paths on your link lines (see Linking Examples). To configure your development environment for use with Intel MKL, set your environment variables using the script corresponding to your architecture (see Scripts to Set Environment Variables for details).
Mathematical problem: Identify all Intel MKL function domains that you require:
• BLAS
• Sparse BLAS
• LAPACK
• PBLAS
• ScaLAPACK
• Sparse Solver routines
• Parallel Direct Sparse Solvers for Clusters
• Vector Mathematics functions (VM)
• Vector Statistics functions (VS)
• Fourier Transform functions (FFT)
• Cluster FFT
• Trigonometric Transform routines
• Poisson, Laplace, and Helmholtz Solver routines
• Optimization (Trust-Region) Solver routines
• Data Fitting Functions
• Extended Eigensolver Functions
Reason: The function domain you intend to use narrows the search in the Intel MKL
Developer Reference for specific routines you need. Additionally, if you are using the
Intel MKL cluster software, your link line is function-domain specific (see Working
with the Intel® Math Kernel Library Cluster Software). Coding tips may also depend
on the function domain (see Other Tips and Techniques to Improve Performance).
Programming language: Intel MKL provides support for both Fortran and C/C++ programming. Identify the language interfaces that your function domains support (see Appendix A: Intel® Math Kernel Library Language Interfaces Support).
Reason: Intel MKL provides language-specific include files for each function domain to simplify program development (see Language Interfaces Support, by Function Domain).
For a list of language-specific interface libraries and modules and an example of how to generate them, see also Using Language-Specific Interfaces with Intel® Math Kernel Library.
Range of integer data: If your system is based on the Intel 64 architecture, identify whether your application performs calculations with large data arrays (of more than 2^31-1 elements).
Reason: To operate on large data arrays, you need to select the ILP64 interface, where integers are 64-bit; otherwise, use the default LP64 interface, where integers are 32-bit (see Using the ILP64 Interface vs. LP64 Interface).
Number of threads: If your application uses an OpenMP* threading run-time library, determine the number of threads you want Intel MKL to use.
Reason: By default, the OpenMP* run-time library sets the number of threads for Intel MKL. If you need a different number, you have to set it yourself using one of the available mechanisms. For more information, see Improving Performance with Threading.
Linking model: Decide which linking model is appropriate for linking your application with Intel MKL libraries:
• Static
• Dynamic
Reason: The link line syntax and libraries for static and dynamic linking are
different. For the list of link libraries for static and dynamic models, linking
examples, and other relevant topics, like how to save disk space by creating a
custom dynamic library, see Linking Your Application with the Intel® Math Kernel
Library.
MPI used: Decide which MPI you will use with the Intel MKL cluster software. You are strongly encouraged to use the latest available version of Intel® MPI.
Reason: To link your application with ScaLAPACK and/or Cluster FFT, the libraries
corresponding to your particular MPI should be listed on the link line (see Working
with the Intel® Math Kernel Library Cluster Software).
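The thread-count decision above can be applied before the application runs; the value 4 and the program name are illustrative:

```shell
# Illustrative: request four threads for Intel MKL specifically.
# MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS for Intel MKL calls.
export MKL_NUM_THREADS=4
# Alternatively, set the OpenMP-wide default:
export OMP_NUM_THREADS=4
./myprog
```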
Architecture Support
Intel® Math Kernel Library (Intel® MKL) for Linux* provides architecture-specific implementations for
supported platforms. The following table lists the supported architectures and directories where each
architecture-specific implementation is located.
See Also
High-level Directory Structure
Notational Conventions
Detailed Structure of the IA-32 Architecture Directory lib/ia32
Detailed Structure of the Intel® 64 Architecture Directory lib/intel64
<mkl directory> Installation directory of the Intel® Math Kernel Library (Intel® MKL)
Structure of the Intel® Math Kernel Library
Directory Contents
examples  Source and data files for Intel MKL examples. Provided in archives corresponding to Intel MKL components selected during installation.
include/ia32  Fortran 95 .mod files for the IA-32 architecture and Intel® Fortran compiler
include/intel64/lp64  Fortran 95 .mod files for the Intel® 64 architecture, Intel Fortran compiler, and LP64 interface
include/intel64/ilp64  Fortran 95 .mod files for the Intel® 64 architecture, Intel Fortran compiler, and ILP64 interface
interfaces/fftw2x_cdft MPI FFTW 2.x interfaces to the Intel MKL Cluster FFT
interfaces/fftw3x_cdft MPI FFTW 3.x interfaces to the Intel MKL Cluster FFT
interfaces/fftw2xf FFTW 2.x interfaces to the Intel MKL FFT (Fortran interface)
interfaces/fftw3xf FFTW 3.x interfaces to the Intel MKL FFT (Fortran interface)
lib/ia32_lin Static libraries and shared objects for the IA-32 architecture
lib/intel64_lin Static libraries and shared objects for the Intel® 64 architecture
See Also
Notational Conventions
Using Code Examples
Layer Description
Interface Layer This layer matches compiled code of your application with the threading and/or
computational parts of the library. This layer provides:
• LP64 and ILP64 interfaces.
• Compatibility with compilers that return function values differently.
See Also
Using the ILP64 Interface vs. LP64 Interface
Linking Your Application with the Intel® Math Kernel Library
Linking with Threading Libraries
Linking Your Application with the Intel® Math Kernel Library
Using the Intel® Parallel Studio XE Composer Edition compiler: see Using the -mkl Compiler Option.
Explicit dynamic linking: see Using the Single Dynamic Library for how to simplify your link line.
Explicitly listing libraries on your link line: see Selecting Libraries to Link with for a summary of the libraries.
Using an internally provided tool: see Using the Command-line Link Tool to determine libraries, options, and environment variables or even compile and build your application.
-mkl or -mkl=parallel  to link with a certain Intel MKL threading layer depending on the threading option provided:
• For -qopenmp: the OpenMP threading layer for Intel compilers
• For -tbb: the Intel® Threading Building Blocks (Intel® TBB) threading layer
-mkl=cluster  to link with Intel MKL cluster components (sequential) that use Intel MPI.
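The options above can be sketched as compiler invocations; the program name is illustrative, and mpiicc (the Intel MPI compiler wrapper) is assumed to be available for the cluster case:

```shell
# OpenMP threading layer for Intel compilers:
icc myprog.c -mkl=parallel -qopenmp
# Intel TBB threading layer:
icc myprog.c -mkl=parallel -tbb
# Sequential cluster components with Intel MPI:
mpiicc myprog.c -mkl=cluster
```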
NOTE
The -qopenmp option has higher priority than -tbb in choosing the Intel MKL threading layer for
linking.
For more information on the -mkl compiler option, see the Intel Compiler User and Reference Guides.
On Intel® 64 architecture systems, for each variant of the -mkl option, the compiler links your application
using the LP64 interface.
If you specify any variant of the -mkl compiler option, the compiler automatically includes the Intel MKL
libraries. In cases not covered by the option, use the Link-line Advisor or see Linking in Detail.
See Also
Listing Libraries on a Link Line
Using the ILP64 Interface vs. LP64 Interface
Using the Link-line Advisor
Intel® Software Documentation Library for Intel® compiler documentation
Using the Single Dynamic Library
The Single Dynamic Library (SDL) enables you to select the interface and threading library for Intel MKL at run time. By default, linking
with SDL provides:
• Intel LP64 interface on systems based on the Intel® 64 architecture
• Intel interface on systems based on the IA-32 architecture
• Intel threading
To use other interfaces or change threading preferences, including use of the sequential version of Intel MKL,
you need to specify your choices using functions or environment variables as explained in section
Dynamically Selecting the Interface and Threading Layer.
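For example, when linking against the SDL, the run-time selection can be sketched with environment variables; the values shown are the documented layer names, and the program name is illustrative:

```shell
# Select the ILP64 interface and the sequential threading layer at run time;
# these variables are read by libmkl_rt.so when the application starts.
export MKL_INTERFACE_LAYER=ILP64
export MKL_THREADING_LAYER=SEQUENTIAL
./myprog
```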
The Single Dynamic Library (SDL) automatically links interface, threading, and computational libraries and
thus simplifies linking. The following table lists Intel MKL libraries for dynamic linking using SDL. See
Dynamically Selecting the Interface and Threading Layer for how to set the interface and threading layers at
run time through function calls or environment settings.
See Also
Layered Model Concept
Using the Link-line Advisor
Using the -mkl Compiler Option
Working with the Cluster Software
High-level Directory Structure
The tool not only provides the options, libraries, and environment variables to use, but also performs
compilation and building of your application.
The tool mkl_link_tool is installed in the <mkl directory>/tools directory.
See the knowledge base article at http://software.intel.com/en-us/articles/mkl-command-line-link-tool for
more information.
Linking Examples
See Also
Using the Link-line Advisor
Examples for Linking with ScaLAPACK and Cluster FFT
NOTE
If you successfully completed the Scripts to Set Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.
-Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a
-Wl,--end-group
-liomp5 -lpthread -lm
• Static linking of myprog.f, Fortran 95 BLAS interface, and OpenMP* threaded Intel MKL:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -I$MKLINCLUDE/ia32 -lmkl_blas95
-Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a
-Wl,--end-group -liomp5 -lpthread -lm
• Static linking of myprog.c and Intel MKL threaded with Intel® Threading Building Blocks (Intel® TBB),
provided that the LIBRARY_PATH environment variable contains the path to Intel TBB library:
icc myprog.c -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_tbb_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -ltbb -lstdc++ -lpthread -lm
• Dynamic linking of myprog.c and Intel MKL threaded with Intel TBB, provided that the LIBRARY_PATH
environment variable contains the path to Intel TBB library:
icc myprog.c -L$MKLPATH -I$MKLINCLUDE -lmkl_intel -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread -lm
See Also
Fortran 95 Interfaces to LAPACK and BLAS
Examples for linking a C application using cluster components
Examples for linking a Fortran application using cluster components
Using the Single Dynamic Library
Linking with System Libraries for specifics of linking with a GNU compiler
NOTE
If you successfully completed the Scripts to Set Environment Variables step of the Getting Started process, you can omit -I$MKLINCLUDE in all the examples and omit -L$MKLPATH in the examples for dynamic linking.
• Static linking of myprog.f and OpenMP* threaded Intel MKL supporting the LP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE
-Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm
• Dynamic linking of myprog.f and OpenMP* threaded Intel MKL supporting the LP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm
• Static linking of myprog.f and the sequential version of Intel MKL supporting the LP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread -lm
Linking in Detail
This section recommends which libraries to link with depending on your Intel MKL usage scenario and
provides details of the linking.
NOTE
The syntax below is for dynamic linking. For static linking, replace each library name preceded with "-l" with the path to the library file. For example, replace -lmkl_core with $MKLPATH/libmkl_core.a, where $MKLPATH is the appropriate user-defined environment variable.
<files to link>
-L<MKL path> -I<MKL include>
[-I<MKL include>/{ia32|intel64|{ilp64|lp64}}]
[-lmkl_blas{95|95_ilp64|95_lp64}]
[-lmkl_lapack{95|95_ilp64|95_lp64}]
[<cluster components>]
-lmkl_{intel|intel_ilp64|intel_lp64|intel_sp2dp|gf|gf_ilp64|gf_lp64}
-lmkl_{intel_thread|gnu_thread|pgi_thread|tbb_thread|sequential}
-lmkl_core
[-liomp5] [-lpthread] [-lm] [-ldl] [-ltbb -lstdc++]
In the case of static linking, enclose the cluster components, interface, threading, and computational libraries in grouping symbols (for example, -Wl,--start-group $MKLPATH/libmkl_cdft_core.a $MKLPATH/libmkl_blacs_intelmpi_ilp64.a $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group).
The order of listing libraries on the link line is essential, except for the libraries enclosed in the grouping
symbols above.
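As a concrete instance of the syntax above, a dynamic link line for the Intel compiler, LP64 interface, and OpenMP threading might look like this sketch, with <files to link> replaced by an illustrative myprog.c:

```shell
icc myprog.c -L${MKLROOT}/lib/intel64 -I${MKLROOT}/include \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
    -liomp5 -lpthread -lm -ldl
```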
See Also
Using the Link-line Advisor
Linking Examples
Working with the Cluster Software
The following table lists available interface layers for the IA-32 architecture along with the values to be used to set each layer.
Specifying the Interface Layer for IA-32 Architecture
Interface Layer  Value of MKL_INTERFACE_LAYER  Value of the Parameter of mkl_set_interface_layer
See Also
Using the Single Dynamic Library
Layered Model Concept
Directory Structure in Detail
Caution
Linking of an application compiled with the -i8 or -DMKL_ILP64 option to the LP64 libraries may
result in unpredictable consequences and erroneous output.
To determine the type of an integer parameter of a function, use appropriate include files. For functions that
support only a Fortran interface, use the C/C++ include files *.h.
The above table explains which integer parameters of functions become 64-bit and which remain 32-bit for
ILP64. The table applies to most Intel MKL functions except some Vector Mathematics and Vector Statistics
functions, which require integer parameters to be 64-bit or 32-bit regardless of the interface:
• Vector Mathematics: The mode parameter of the functions is 64-bit.
• Random Number Generators (RNG):
All discrete RNG functions except viRngUniformBits64 are 32-bit.
The viRngUniformBits64 generator function and vslSkipAheadStream service function are 64-bit.
• Summary Statistics: The estimate parameter of the vslsSSCompute/vsldSSCompute function is 64-bit.
Refer to the Intel MKL Developer Reference for more information.
To better understand ILP64 interface details, see also examples.
Limitations
All Intel MKL function domains support ILP64 programming except the FFTW interfaces to Intel MKL:
• FFTW 2.x wrappers do not support ILP64.
• FFTW 3.x wrappers support ILP64 through a dedicated set of plan_guru64 functions.
See Also
High-level Directory Structure
Intel® MKL Include Files
Language Interfaces Support, by Function Domain
Layered Model Concept
Directory Structure in Detail
See Also
Fortran 95 Interfaces to LAPACK and BLAS
Compiler-dependent Functions and Fortran 90 Modules
libmkl_sequential.so
See Also
Layered Model Concept
Notational Conventions
Static Linking Dynamic Linking
libmkl_core.a libmkl_core.so
Computational Libraries for the Intel® 64 or Intel® Many Integrated Core Architecture
See Also
Linking with ScaLAPACK and Cluster FFT
Using the Link-line Advisor
Using the ILP64 Interface vs. LP64 Interface
See Also
Scripts to Set Environment Variables
Layered Model Concept
NOTE
To link with Intel MKL statically using a GNU or PGI compiler, link also the system library libdl by
adding -ldl to your link line. The Intel compiler always passes -ldl to the linker.
See Also
Linking Examples
NOTE
The objects in Intel MKL static libraries are position-independent code (PIC), which is not typical for
static libraries. Therefore, the custom shared object builder can create a shared object from a subset
of Intel MKL functions by picking the respective object files from the static libraries.
Value  Comment
libia32  The builder uses static Intel MKL interface, threading, and core libraries to build a custom shared object for the IA-32 architecture.
libintel64  The builder uses static Intel MKL interface, threading, and core libraries to build a custom shared object for the Intel® 64 architecture.
soia32  The builder uses the single dynamic library libmkl_rt.so to build a custom shared object for the IA-32 architecture.
sointel64  The builder uses the single dynamic library libmkl_rt.so to build a custom shared object for the Intel® 64 architecture.
help  The command prints help for the custom shared object builder.
The <options> placeholder stands for the list of parameters that define macros to be used by the makefile.
The following table describes these parameters:
Parameter [Values]  Description
interface = {lp64|ilp64}  Defines whether to use the LP64 or ILP64 programming interface for the Intel 64 architecture. The default value is lp64.
threading = {parallel|sequential}  Defines whether to use Intel MKL in threaded or sequential mode. The default value is parallel.
export = <file name>  Specifies the full name of the file that contains the list of entry-point functions to be included in the shared object. The default name is user_example_list (no extension).
name = <so name>  Specifies the name of the library to be created. By default, the name of the created library is mkl_custom.so.
xerbla = <error handler>  Specifies the name of the object file <user_xerbla>.o that contains the user's error handler. The makefile adds this error handler to the library for use instead of the default Intel MKL error handler xerbla. If you omit this parameter, the native Intel MKL xerbla is used. See the description of the xerbla function in the Intel MKL Developer Reference on how to develop your own error handler.
MKLROOT = <mkl directory>  Specifies the location of Intel MKL libraries used to build the custom shared object. By default, the builder uses the Intel MKL installation directory.
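A builder invocation consistent with the parameters above might look like this sketch; it is assumed to run from the <mkl directory>/tools/builder directory, and the file names are illustrative:

```shell
# Illustrative: build a custom IA-32 shared object named mkl_small.so from the
# functions listed in my_func_list.txt, with a user-supplied xerbla handler.
make libia32 export=my_func_list.txt name=mkl_small xerbla=my_xerbla.o
```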
In this case, the command creates the mkl_small.so library for processors using the IA-32 architecture.
The command takes the list of functions from my_func_list.txt file and uses the user's error handler
my_xerbla.o.
The process is similar for processors using the Intel® 64 architecture.
See Also
Using the Single Dynamic Library
1. Link your application with installed Intel MKL libraries to make sure the application builds.
2. Remove all Intel MKL libraries from the link line and link again.
Unresolved symbols indicate the Intel MKL functions that your application uses.
3. Include these functions in the list.
Important
Each time your application starts using more Intel MKL functions, update the list to include the new
functions.
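The unresolved-symbol step can be scripted. The following sketch assumes GNU ld diagnostics of the form "undefined reference to `name'" and uses a fabricated sample message for illustration:

```shell
# Extract unique function names from GNU ld "undefined reference" messages.
extract_mkl_symbols() {
  grep 'undefined reference' \
    | sed "s/.*undefined reference to \`\([^']*\)'.*/\1/" \
    | sort -u
}

# Illustrative input; in practice pipe in the real linker output, e.g.
#   icc myprog.o -o myprog 2>&1 | extract_mkl_symbols > my_func_list.txt
printf '%s\n' \
  "myprog.c:(.text+0x1a): undefined reference to \`cblas_dgemm'" \
  "myprog.c:(.text+0x40): undefined reference to \`vdExp'" \
  | extract_mkl_symbols
# prints: cblas_dgemm
#         vdExp
```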
See Also
Specifying Function Names
NOTE
The lists of functions are provided in the <mkl directory>/tools/builder folder merely as
examples. See Composing a List of Functions for how to compose lists of functions for your custom
shared object.
Tip
Names of Fortran-style routines (BLAS, LAPACK, etc.) can be uppercase or lowercase, with or without the trailing underscore. For example, these names are equivalent:
BLAS: dgemm, DGEMM, dgemm_, DGEMM_
LAPACK: dgetrf, DGETRF, dgetrf_, DGETRF_
Properly capitalize names of C support functions in the function list. To do this, follow the guidelines below:
1. In the mkl_service.h include file, look up a #define directive for your function
(mkl_service.h is included in the mkl.h header file).
2. Take the function name from the replacement part of that directive.
For example, the #define directive for the mkl_disable_fast_mm function is
#define mkl_disable_fast_mm MKL_Disable_Fast_MM.
Capitalize the name of this function in the list like this: MKL_Disable_Fast_MM.
For the names of the Fortran support functions, see the tip.
NOTE
If selected functions have several processor-specific versions, the builder automatically includes them
all in the custom library and the dispatcher manages them.
See Also
Managing Multi-core Performance
Managing Performance and Memory 4
• All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.
• All Vector Mathematics functions (except service functions).
• FFT.
For a list of FFT transforms that can be threaded, see Threaded FFT Problems.
LAPACK Routines
In this section, ? stands for the precision prefix of the respective routine and can have the value s, d,
c, or z.
The following routines are threaded with OpenMP* for Intel® Core™2 Duo and Intel® Core™ i7 processors:
• Level 1 BLAS:
?axpy, ?copy, ?swap, ddot/sdot, cdotc, drot/srot
• Level 2 BLAS:
?gemv, ?trsv, ?trmv, dsyr/ssyr, dsyr2/ssyr2, dsymv/ssymv
Most FFT problems are threaded. In particular, computation of multiple transforms in one call (number of
transforms > 1) is threaded. Details of which transforms are threaded follow.
One-dimensional (1D) transforms
1D transforms are threaded in many cases.
1D complex-to-complex (c2c) transforms of size N using interleaved complex data layout are threaded under
the following conditions depending on the architecture:
Architecture Conditions
LAPACK Routines
The following LAPACK routines are threaded with Intel TBB:
?geqrf, ?gelqf, ?getrf, ?potrf, ?unmqr*, ?ormqr*, ?unmrq*, ?ormrq*, ?unmlq*, ?ormlq*, ?unmql*,
?ormql*, ?sytrd, ?hetrd, ?syev, ?heev, and ?latrd.
A number of other LAPACK routines, which are based on threaded LAPACK or BLAS routines, make effective
use of Intel TBB threading:
?getrs, ?gesv, ?potrs, ?bdsqr, and ?gels.
Sparse BLAS Routines
The following Sparse BLAS inspector-executor application programming interface routines are threaded
with Intel TBB:
• mkl_sparse_?_mv using the general compressed sparse row (CSR) and block sparse row (BSR) matrix
formats.
• mkl_sparse_?_mm using the general CSR sparse matrix format and both row- and column-major storage
formats for the dense matrix.
If you parallelize the program using a technology other than Intel OpenMP and Intel TBB (for example,
pthreads on Linux*): If more than one thread calls Intel MKL, and the function being called is threaded,
it may be important that you turn off Intel MKL threading. Set the number of threads to one by any of
the available means (see Techniques to Set the Number of Threads).
If you parallelize the program using OpenMP directives and/or pragmas and compile the program using a
non-Intel compiler: To avoid simultaneous activities of multiple threading RTLs, link the program
against the Intel MKL threading library that matches the compiler you use (see Linking Examples on how
to do this). If this is not possible, use Intel MKL in the sequential mode. To do this, link with the
appropriate threading library: libmkl_sequential.a or libmkl_sequential.so (see Appendix C:
Directory Structure in Detail).
If you thread the program using Intel TBB threading technology and compile the program using a
non-Intel compiler: To avoid simultaneous activities of multiple threading RTLs, link the program
against the Intel MKL Intel TBB threading library and the Intel TBB RTL if it matches the compiler you
use. If this is not possible, use Intel MKL in the sequential mode. To do this, link with the
appropriate threading library: libmkl_sequential.a or libmkl_sequential.so (see Appendix C:
Directory Structure in Detail).
If you run multiple programs calling Intel MKL on a multiprocessor system, for example, a program
parallelized using a message-passing interface (MPI): The threading RTLs from different programs you
run may place a large number of threads on the same processor on the system and therefore overuse the
machine resources. In this case, one of the solutions is to set the number of threads to one by any of
the available means (see Techniques to Set the Number of Threads). The Intel® Distribution for
LINPACK* Benchmark section discusses another solution for a hybrid (OpenMP* + MPI) mode.
See Also
Using Additional Threading Control
Linking with Compiler Support RTLs
NOTE
A call to the mkl_set_num_threads or mkl_domain_set_num_threads function changes the number
of OpenMP threads available to all in-progress calls (in concurrent threads) and future calls to Intel
MKL and may result in slow Intel MKL performance and/or race conditions reported by run-time tools,
such as Intel® Inspector.
To avoid such situations, use the mkl_set_num_threads_local function (see the "Support Functions"
section in the Intel MKL Developer Reference for the function description).
When choosing the appropriate technique, take into account the following rules:
• The Intel MKL threading controls take precedence over the OpenMP controls because they are inspected
first.
• A function call takes precedence over any environment settings. The exception, which is a consequence of
the previous rule, is that a call to the OpenMP subroutine omp_set_num_threads() does not have
precedence over the settings of Intel MKL environment variables such as MKL_NUM_THREADS. See Using
Additional Threading Control for more details.
• You cannot change run-time behavior in the course of the run using the environment variables because
they are read only once at the first call to Intel MKL.
If you use the Intel TBB threading technology, read the documentation for the tbb::task_scheduler_init
class at https://www.threadingbuildingblocks.org/documentation to find out how to specify the number of
threads.
See Also
Using Additional Threading Control
c[i*SIZE+j] = 0.0;
}
}
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
printf("row\ta\tc\n");
for (i = 0; i < 10; i++) {
printf("%d:\t%f\t%f\n", i, a[i*SIZE], c[i*SIZE]);
}
free(a);
free(b);
free(c);
return 0;
}
ALPHA = 1.1
BETA = -1.2
DO I=1,N
DO J=1,N
A(I,J) = I+J
B(I,J) = I*J
C(I,J) = 0.0
END DO
END DO
CALL DGEMM('N','N',N,N,N,ALPHA,A,N,B,N,BETA,C,N)
print *,'Row A C'
DO i=1,10
write(*,'(I4,F20.8,F20.8)') I, A(1,I),C(1,I)
END DO
CALL OMP_SET_NUM_THREADS(1)
DO I=1,N
DO J=1,N
A(I,J) = I+J
B(I,J) = I*J
C(I,J) = 0.0
END DO
END DO
CALL DGEMM('N','N',N,N,N,ALPHA,A,N,B,N,BETA,C,N)
print *,'Row A C'
DO i=1,10
write(*,'(I4,F20.8,F20.8)') I, A(1,I),C(1,I)
END DO
CALL OMP_SET_NUM_THREADS(2)
DO I=1,N
DO J=1,N
A(I,J) = I+J
B(I,J) = I*J
C(I,J) = 0.0
END DO
END DO
CALL DGEMM('N','N',N,N,N,ALPHA,A,N,B,N,BETA,C,N)
print *,'Row A C'
DO i=1,10
write(*,'(I4,F20.8,F20.8)') I, A(1,I),C(1,I)
END DO
STOP
END
NOTE
Some Intel MKL routines may use fewer OpenMP threads than suggested by the threading controls if
either the underlying algorithms do not support the suggested number of OpenMP threads or the
routines perform better with fewer OpenMP threads because of lower OpenMP overhead and/or better
data locality. Set the MKL_DYNAMIC environment variable to FALSE or call mkl_set_dynamic(0) to
use the suggested number of OpenMP threads whenever the algorithms permit and regardless of
OpenMP overhead and data locality.
Section "Number of User Threads" in the "Fourier Transform Functions" chapter of the Intel MKL Developer
Reference shows how the Intel MKL threading controls help to set the number of threads for the FFT
computation.
The table below lists the Intel MKL environment variables for threading control, their equivalent
functions, and their OpenMP counterparts:
NOTE
Call mkl_set_num_threads() to force Intel MKL to use a given number of OpenMP threads and
prevent it from reacting to the environment variables MKL_NUM_THREADS, MKL_DOMAIN_NUM_THREADS,
and OMP_NUM_THREADS.
The example below shows how to force Intel MKL to use one thread:
#include <mkl.h>
...
mkl_set_num_threads ( 1 );
See the Intel MKL Developer Reference for the detailed description of the threading control functions, their
parameters, calling syntax, and more code examples.
MKL_DYNAMIC
The MKL_DYNAMIC environment variable enables Intel MKL to dynamically change the number of threads.
The default value of MKL_DYNAMIC is TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.
When MKL_DYNAMIC is TRUE, Intel MKL may use fewer OpenMP threads than the maximum number you
specify.
For example, MKL_DYNAMIC set to TRUE enables optimal choice of the number of threads in the following
cases:
• If the requested number of threads exceeds the number of physical cores (perhaps because of using the
Intel® Hyper-Threading Technology), Intel MKL scales down the number of OpenMP threads to the number
of physical cores.
• If Intel MKL is able to detect the presence of a message-passing interface (MPI) but cannot determine
whether it has been called in a thread-safe mode, Intel MKL runs one OpenMP thread.
When MKL_DYNAMIC is FALSE, Intel MKL uses the suggested number of OpenMP threads whenever the
underlying algorithms permit. For example, if you attempt to do a size one matrix-matrix multiply across
eight threads, the library may instead choose to use only one thread because it is impractical to use eight
threads in this event.
If Intel MKL is called from an OpenMP parallel region in your program, Intel MKL uses only one thread by
default. If you want Intel MKL to go parallel in such a call, link your program against an OpenMP threading
RTL supported by Intel MKL and set the environment variables:
• OMP_NESTED to TRUE
• OMP_DYNAMIC and MKL_DYNAMIC to FALSE
• MKL_NUM_THREADS to some reasonable value
With these settings, Intel MKL uses MKL_NUM_THREADS threads when it is called from the OpenMP parallel
region in your program.
In general, set MKL_DYNAMIC to FALSE only under circumstances that Intel MKL is unable to detect, for
example, to use nested parallelism where the library is already called from a parallel section.
MKL_DOMAIN_NUM_THREADS
The MKL_DOMAIN_NUM_THREADS environment variable suggests the number of OpenMP threads for a
particular function domain.
MKL_DOMAIN_NUM_THREADS accepts a string value <MKL-env-string>, which must have the following
format:
<MKL-env-string> ::= <MKL-domain-env-string> { <delimiter><MKL-domain-env-string> }
<delimiter> ::= [ <space-symbol>* ] ( <space-symbol> | <comma-symbol> | <semicolon-symbol> |
<colon-symbol> ) [ <space-symbol>* ]
<MKL-domain-env-string> ::= <MKL-domain-env-name><uses><number-of-threads>
<MKL-domain-env-name> ::= MKL_DOMAIN_ALL | MKL_DOMAIN_BLAS | MKL_DOMAIN_FFT |
MKL_DOMAIN_VML | MKL_DOMAIN_PARDISO
<uses> ::= [ <space-symbol>* ] ( <space-symbol> | <equality-sign> | <comma-symbol> )
[ <space-symbol>* ]
<number-of-threads> ::= <positive-number>
<positive-number> ::= <decimal-positive-number> | <octal-number> | <hexadecimal-number>
In the syntax above, values of <MKL-domain-env-name> indicate function domains as follows:
MKL_DOMAIN_PARDISO indicates Intel MKL PARDISO, a direct sparse solver based on Parallel Direct
Sparse Solver (PARDISO*).
For example,
MKL_DOMAIN_ALL 2 : MKL_DOMAIN_BLAS 1 : MKL_DOMAIN_FFT 4
MKL_DOMAIN_ALL=2 : MKL_DOMAIN_BLAS=1 : MKL_DOMAIN_FFT=4
MKL_DOMAIN_ALL=2, MKL_DOMAIN_BLAS=1, MKL_DOMAIN_FFT=4
MKL_DOMAIN_ALL=2; MKL_DOMAIN_BLAS=1; MKL_DOMAIN_FFT=4
MKL_DOMAIN_ALL = 2 MKL_DOMAIN_BLAS 1 , MKL_DOMAIN_FFT 4
MKL_DOMAIN_ALL,2: MKL_DOMAIN_BLAS 1, MKL_DOMAIN_FFT,4 .
The global variables MKL_DOMAIN_ALL, MKL_DOMAIN_BLAS, MKL_DOMAIN_FFT, MKL_DOMAIN_VML, and
MKL_DOMAIN_PARDISO, as well as the interface for the Intel MKL threading control functions, can be found in
the mkl.h header file.
The table below illustrates how values of MKL_DOMAIN_NUM_THREADS are interpreted.
MKL_DOMAIN_ALL=4
All parts of Intel MKL should try four OpenMP threads. The actual number of threads may still be
different because of the MKL_DYNAMIC setting or system resource issues. The setting is equivalent to
MKL_NUM_THREADS=4.
MKL_DOMAIN_ALL=1, MKL_DOMAIN_BLAS=4
All parts of Intel MKL should try one OpenMP thread, except for BLAS, which is suggested to try four
threads.
MKL_DOMAIN_VML=2
VM should try two OpenMP threads. The setting affects no other part of Intel MKL.
Be aware that the domain-specific settings take precedence over the overall ones. For example, the
"MKL_DOMAIN_BLAS=4" value of MKL_DOMAIN_NUM_THREADS suggests trying four OpenMP threads for BLAS,
regardless of later setting MKL_NUM_THREADS, and a function call "mkl_domain_set_num_threads ( 4,
MKL_DOMAIN_BLAS );" suggests the same, regardless of later calls to mkl_set_num_threads().
However, a function call with input "MKL_DOMAIN_ALL", such as "mkl_domain_set_num_threads (4,
MKL_DOMAIN_ALL);" is equivalent to "mkl_set_num_threads(4)", and thus it will be overwritten by later
calls to mkl_set_num_threads. Similarly, the environment setting of MKL_DOMAIN_NUM_THREADS with
"MKL_DOMAIN_ALL=4" will be overwritten with MKL_NUM_THREADS = 2.
Whereas the MKL_DOMAIN_NUM_THREADS environment variable enables you to set several variables at once,
for example, "MKL_DOMAIN_BLAS=4,MKL_DOMAIN_FFT=2", the corresponding function does not take string
syntax. So, to do the same with function calls, you may need to make several calls, which in this
example are as follows:
mkl_domain_set_num_threads ( 4, MKL_DOMAIN_BLAS );
mkl_domain_set_num_threads ( 2, MKL_DOMAIN_FFT );
MKL_NUM_STRIPES
The MKL_NUM_STRIPES environment variable controls the Intel MKL threading algorithm for ?gemm functions.
When MKL_NUM_STRIPES is set to a positive integer value nstripes, Intel MKL tries to use a number of
partitions equal to nstripes along the leading dimension of the output matrix.
The following table explains how the value nstripes of MKL_NUM_STRIPES defines the partitioning algorithm
used by Intel MKL for ?gemm output matrix; max_threads_for_mkl denotes the maximum number of OpenMP
threads for Intel MKL:
Value of MKL_NUM_STRIPES and the resulting partitioning algorithm:
1 < nstripes < (max_threads_for_mkl/2)
2D partitioning with the number of partitions equal to nstripes:
• Horizontal, for column-major ordering.
• Vertical, for row-major ordering.
nstripes = 1
1D partitioning algorithm along the opposite direction of the leading dimension.
A figure in the original guide shows the partitioning of an output matrix for nstripes = 4 and a total
of 8 OpenMP threads for column-major and row-major orderings.
You can use support functions mkl_set_num_stripes and mkl_get_num_stripes to set and query the
number of stripes, respectively.
Usage model: disable Intel MKL internal threading for the whole application
When used: Intel MKL internal threading interferes with the application's own threading or may slow
down the application.
Example: the application is threaded at top level, or the application runs concurrently with other
applications.
Options:
• Link statically or dynamically with the sequential library
• Link with the Single Dynamic Library mkl_rt.so and select the sequential library using an environment
variable or a function call:
• Set MKL_THREADING_LAYER=sequential
• Call mkl_set_threading_layer(MKL_THREADING_SEQUENTIAL)‡
• Call mkl_set_num_threads().
Use to globally set a desired number of OpenMP threads for Intel MKL at run time.
• Call mkl_domain_set_num_threads().
Use if at some point application threads start working with different Intel MKL function domains.
• Call mkl_set_num_threads_local().
Use to set the number of OpenMP threads for Intel MKL called from a particular thread.
NOTE
If your application uses OpenMP* threading, you may need to provide additional settings:
• Set the environment variable OMP_NESTED=TRUE, or alternatively call omp_set_nested(1), to
enable OpenMP nested parallelism.
• Set the environment variable MKL_DYNAMIC=FALSE, or alternatively call mkl_set_dynamic(0), to
prevent Intel MKL from dynamically reducing the number of OpenMP threads in nested parallel
regions.
‡For details of the mentioned functions, see the Support Functions section of the Intel MKL Developer
Reference, available in the Intel Software Documentation Library.
See Also
Linking with Threading Libraries
Dynamically Selecting the Interface and Threading Layer
Intel MKL-specific Environment Variables for OpenMP Threading Control
MKL_DOMAIN_NUM_THREADS
Avoiding Conflicts in the Execution Environment
Intel Software Documentation Library
To resolve this issue, before calling Intel MKL, set an affinity mask for each OpenMP thread using the
KMP_AFFINITY environment variable or the sched_setaffinity system function. The following code
example shows how to resolve the issue by setting an affinity mask by operating system means using the
Intel compiler. The code calls the function sched_setaffinity to bind the threads to the cores on
different sockets. Then the Intel MKL FFT function is called:
CPU_ZERO(&new_mask);
Compile the application with the Intel compiler using the following command:
icc test_application.c -openmp
where test_application.c is the filename for the application.
Build the application. Run it in two threads, for example, by using the environment variable to set the
number of threads:
env OMP_NUM_THREADS=2 ./a.out
See the Linux Programmer's Manual (in man pages format) for particulars of the sched_setaffinity
function used in the above example.
#include "mkl.h"
int main(void) {
return 0;
}
• For multi-threaded Intel MKL, compile with MKL_DIRECT_CALL preprocessor macro:
• To use Intel MKL in the sequential mode, compile with MKL_DIRECT_CALL_SEQ preprocessor macro:
# include "mkl_direct_call.fi"
program DGEMM_MAIN
....
* Call Intel MKL DGEMM
....
call sub1()
stop 1
end
end
• For multi-threaded Intel MKL, compile with -fpp option for Intel Fortran compiler (or with -Mpreprocess
for PGI compilers) and with MKL_DIRECT_CALL preprocessor macro:
NOTE
In order to further improve performance, a dedicated JIT API has been introduced. In addition to
enabling benefits from tailored GEMM kernels, this API enables you to call directly the generated
kernel and remove any library overhead. For more information see the JIT API documentation.
To enable JIT code generation for ?gemm, compile your C or Fortran code with the preprocessor macro shown
depending on whether a threaded or sequential mode of Intel MKL is required:
Notes
• Just-in-time code generation introduces a runtime overhead at the first call of ?gemm for a given set
of input parameters: layout (parameter for C only), transa, transb, m, n, k, alpha, lda, ldb,
beta, and ldc. To benefit from JIT code generation, use this feature when you need to call the
same GEMM kernel (same set of input parameters - layout (parameter for C only), transa,
transb, m, n, k, alpha, lda, ldb, beta, and ldc) many times (for example, several hundred
calls).
• If MKL_DIRECT_CALL_JIT is enabled, every call to ?gemm might generate a kernel that is stored by
Intel MKL. The memory used to store those kernels cannot be freed by the user and will be freed
only if mkl_finalize is called or if Intel MKL is unloaded. To limit the memory footprint of the
feature, the number of stored kernels is limited to 1024 on IA-32 and 4096 on Intel® 64.
Important
Because error checking is limited, you are responsible for checking the correctness of function
parameters to avoid unsafe and incorrect code execution.
NOTE
The direct call feature substitutes the names of Intel MKL functions with longer counterparts, which
can cause the lines to exceed the column limit for a fixed format Fortran source code compiled with
PGI compilers. Because the compilers ignore any part of the line that exceeds the limit, the behavior
of the program can be unpredictable.
Coding Techniques
This section discusses coding techniques to improve performance on processors based on supported
architectures.
To improve performance, properly align arrays in your code. Additional conditions can improve performance
for specific function domains.
call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork,
iwork, ifail, info)
call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail,
info)
See Also
Managing Performance of the Cluster Fourier Transform Functions
Operating on Denormals
The IEEE 754-2008 standard, "IEEE Standard for Floating-Point Arithmetic", defines denormal (or
subnormal) numbers as non-zero numbers smaller than the smallest possible normalized numbers for a
specific floating-point format. Floating-point operations on denormals are slower than on normalized
operands because denormal operands and results are usually handled through a software assist mechanism
rather than directly in hardware. This software processing causes Intel MKL functions that consume
denormals to run slower than with normalized floating-point numbers.
You can mitigate this performance issue by setting the appropriate bit fields in the MXCSR floating-point
control register to flush denormals to zero (FTZ) or to replace any denormals loaded from memory with zero
(DAZ). Check your compiler documentation to determine whether it has options to control FTZ and DAZ.
Note that these compiler options may slightly affect accuracy.
See Also
Intel Software Documentation Library
• Call
mkl_set_memory_limit (MKL_MEM_MCDRAM, <limit_in_mbytes>)
• Set the environment variable:
• For the bash shell:
MKL_FAST_MEMORY_LIMIT="<limit_in_mbytes>"
• For a C shell (csh or tcsh):
setenv MKL_FAST_MEMORY_LIMIT "<limit_in_mbytes>"
The setting of the limit affects all Intel MKL functions, including user-callable memory functions such as
mkl_malloc. Therefore, if an application calls mkl_malloc, mkl_calloc, or mkl_realloc, which always
try to allocate memory in MCDRAM, make sure that the limit is sufficient.
If you replace Intel MKL memory management functions with your own functions (for details, see Redefining
Memory Functions), Intel MKL uses your functions and does not work with the memkind library directly.
Memory Renaming
In addition to the memkind library, Intel MKL memory management by default uses standard C run-time
memory functions to allocate or free memory. These functions can be replaced using memory renaming.
Intel MKL accesses the memory functions by pointers i_malloc, i_free, i_calloc, and i_realloc,
which are visible at the application level. You can programmatically redefine values of these pointers to the
addresses of your application's memory management functions.
Redirecting the pointers is the only correct way to use your own set of memory management functions. If
you call your own memory functions without redirecting the pointers, the memory will get managed by two
independent memory management packages, which may cause unexpected memory issues.
2. Redefine values of pointers i_malloc, i_free, i_calloc, and i_realloc prior to the first call to
Intel MKL functions, as shown in the following example:
#include "i_malloc.h"
. . .
i_malloc = my_malloc;
i_calloc = my_calloc;
i_realloc = my_realloc;
i_free = my_free;
. . .
// Now you may call Intel MKL functions
See Also
Using High-bandwidth Memory with Intel MKL
Language-specific Usage
Options 5
The Intel® Math Kernel Library (Intel® MKL) provides broad support for Fortran and C/C++ programming.
However, not all functions support both Fortran and C interfaces. For example, some LAPACK functions have
no C interface. You can call such functions from C using mixed-language programming.
If you want to use LAPACK or BLAS functions that support Fortran 77 in the Fortran 95 environment,
additional effort may be initially required to build compiler-specific interface libraries and modules from the
source code provided with Intel MKL.
See Also
Language Interfaces Support, by Function Domain
File name Contains
libfftw2xf_intel.a Interfaces for FFTW version 2.x (Fortran interface for Intel compilers) to call Intel MKL FFT.
libfftw2xf_gnu.a Interfaces for FFTW version 2.x (Fortran interface for GNU compiler) to call Intel MKL FFT.
libfftw3xc_intel.a Interfaces for FFTW version 3.x (C interface for Intel compiler) to call Intel MKL FFT.
libfftw3xf_intel.a Interfaces for FFTW version 3.x (Fortran interface for Intel compilers) to call Intel MKL FFT.
libfftw3xf_gnu.a Interfaces for FFTW version 3.x (Fortran interface for GNU compilers) to call Intel MKL FFT.
libfftw3x_cdft.a Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFT.
libfftw3x_cdft_ilp64.a Interfaces for MPI FFTW version 3.x (C interface) to call Intel MKL cluster FFT supporting the ILP64 interface.
See Also
Fortran 95 Interfaces to LAPACK and BLAS
Important
The parameter INSTALL_DIR is required.
As a result, the required library is built and installed in the <user dir>/lib directory, and the .mod files are
built and installed in the <user dir>/include/<arch>[/{lp64|ilp64}] directory, where <arch> is one of
{ia32, intel64}.
By default, the ifort compiler is assumed. You may change the compiler with an additional parameter of
make:
FC=<compiler>.
For example, the command
make libintel64 FC=pgf95 INSTALL_DIR=<userpgf95 dir> interface=lp64
builds the required library and .mod files and installs them in subdirectories of <userpgf95 dir>.
To delete the library from the building directory, use one of the following commands:
• For the IA-32 architecture,
make cleania32 INSTALL_DIR=<user dir>
• For the Intel® 64 architecture,
make cleanintel64 [interface=lp64|ilp64] INSTALL_DIR=<user dir>
• For all the architectures,
make clean INSTALL_DIR=<user dir>
Caution
Even if you have administrative rights, avoid setting INSTALL_DIR=../.. or INSTALL_DIR=<mkl
directory> in a build or clean command above because these settings replace or delete the Intel
MKL prebuilt Fortran 95 library and modules.
In cases where RTL dependencies might arise, the functions are delivered as source code and you need to
compile the code with whatever compiler you are using for your application.
In particular, Fortran 90 modules result in compiler-specific code generation that requires RTL support.
Therefore, Intel MKL delivers these modules compiled with the Intel compiler, along with source code, to be
used with different compilers.
Calling LAPACK, BLAS, and CBLAS Routines from C/C++ Language Environments
Not all Intel MKL function domains support both C and Fortran environments. To use Intel MKL Fortran-style
functions in C/C++ environments, you should observe certain conventions, which are discussed for LAPACK
and BLAS in the subsections below.
Caution
Avoid calling BLAS 95/LAPACK 95 from C/C++. Such calls require skills in manipulating the descriptor
of a deferred-shape array, which is a Fortran 90 type. Moreover, BLAS 95/LAPACK 95 routines contain
links to a Fortran RTL.
For example, if a two-dimensional matrix A of size mxn is stored densely in a one-dimensional array B,
you can access a matrix element like this:
A[i][j] = B[i*n+j] in C (i=0, ... , m-1, j=0, ... , n-1)
See Example "Calling a Complex BLAS Level 1 Function from C++" on how to call BLAS routines from C.
See also the Intel(R) MKL Developer Reference for a description of the C interface to LAPACK functions.
CBLAS
Instead of calling BLAS routines from a C-language program, you can use the CBLAS interface.
CBLAS is a C-style interface to the BLAS routines. You can call CBLAS routines using regular C-style calls. Use
the mkl.h header file with the CBLAS interface. The header file specifies enumerated values and prototypes
of all the functions. It also determines whether the program is being compiled with a C++ compiler, and if it
is, the included file will be correct for use with C++ compilation. Example "Using CBLAS Interface Instead of
Calling BLAS Directly from C" illustrates the use of the CBLAS interface.
C Interface to LAPACK
Instead of calling LAPACK routines from a C-language program, you can use the C interface to LAPACK
provided by Intel MKL.
The C interface to LAPACK is a C-style interface to the LAPACK routines. This interface supports matrices in
row-major and column-major order, which you can define in the first function argument matrix_order. Use
the mkl.h header file with the C interface to LAPACK. mkl.h includes the mkl_lapacke.h header file, which
specifies constants and prototypes of all the functions. It also determines whether the program is being
compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. You
can find examples of the C interface to LAPACK in the examples/lapacke subdirectory in the Intel MKL
installation directory.
Intel MKL provides complex types MKL_Complex8 and MKL_Complex16, which are structures equivalent to
the Fortran complex types COMPLEX(4) and COMPLEX(8), respectively. The MKL_Complex8 and
MKL_Complex16 types are defined in the mkl_types.h header file. You can use these types to define
complex data. You can also redefine the types with your own types before including the mkl_types.h header
file. The only requirement is that the types must be compatible with the Fortran complex layout, that is, the
complex type must be a pair of real numbers for the values of real and imaginary parts.
For example, you can use the following definitions in your C++ code:
#define MKL_Complex8 std::complex<float>
and
#define MKL_Complex16 std::complex<double>
See Example "Calling a Complex BLAS Level 1 Function from C++" for details. You can also define these
types in the command line:
-DMKL_Complex8="std::complex<float>"
-DMKL_Complex16="std::complex<double>"
See Also
Intel® Software Documentation Library for the Intel® Fortran Compiler documentation
Calling BLAS Functions that Return the Complex Values in C/C++ Code
Complex values that functions return are handled differently in C and Fortran. Because BLAS is Fortran-style,
you need to be careful when handling a call from C to a BLAS function that returns complex values. However,
in addition to normal function calls, Fortran enables calling functions as though they were subroutines, which
provides a mechanism for returning the complex value correctly when the function is called from a C
program. When a Fortran function is called as a subroutine, the return value is the first parameter in the
calling sequence. You can use this feature to call a BLAS function from C.
The following example shows how a call to a Fortran function as a subroutine converts to a call from C and
the hidden parameter result gets exposed:
Normal Fortran function call: result = cdotc( n, x, 1, y, 1 )
A call to the function as a subroutine: call cdotc( result, n, x, 1, y, 1)
A call to the function from C: cdotc( &result, &n, x, &one, y, &one )
NOTE
Intel MKL has both upper-case and lower-case entry points in the Fortran-style (case-insensitive)
BLAS, with or without the trailing underscore. So, all these names are equivalent and acceptable:
cdotc, CDOTC, cdotc_, and CDOTC_.
The above example shows one of the ways to call several level 1 BLAS functions that return complex values
from your C and C++ applications. An easier way is to use the CBLAS interface. For instance, you can call the
same function using the CBLAS interface as follows:
cblas_cdotc( n, x, 1, y, 1, &result )
NOTE
The complex value comes last on the argument list in this case.
The following examples show use of the Fortran-style BLAS interface from C and C++, as well as the CBLAS
(C language) interface:
#include <stdio.h>
#include "mkl.h"
#define N 5
int main()
{
    int n = N, inca = 1, incb = 1, i;
    MKL_Complex16 a[N], b[N], c;
    for( i = 0; i < n; i++ ){
        a[i].real = (double)i; a[i].imag = (double)i * 2.0;
        b[i].real = (double)(n - i); b[i].imag = (double)i * 2.0;
    }
    zdotc( &c, &n, a, &inca, b, &incb );
    printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.real, c.imag );
    return 0;
}
#include <complex>
#include <iostream>
#define MKL_Complex16 std::complex<double>
#include "mkl.h"
#define N 5
int main()
{
    int n, inca = 1, incb = 1, i;
    std::complex<double> a[N], b[N], c;
    n = N;
    for( i = 0; i < n; i++ ){
        a[i] = std::complex<double>( (double)i, (double)i * 2.0 );
        b[i] = std::complex<double>( (double)(n - i), (double)i * 2.0 );
    }
    zdotc( &c, &n, a, &inca, b, &incb );
    std::cout << "The complex dot product is: " << c << std::endl;
    return 0;
}
Example "Using CBLAS Interface Instead of Calling BLAS Directly from C"
This example uses CBLAS:
#include <stdio.h>
#include "mkl.h"
typedef struct{ double re; double im; } complex16;
#define N 5
int main()
{
    int n, inca = 1, incb = 1, i;
    complex16 a[N], b[N], c;
    n = N;
    for( i = 0; i < n; i++ ){
        a[i].re = (double)i; a[i].im = (double)i * 2.0;
        b[i].re = (double)(n - i); b[i].im = (double)i * 2.0;
    }
    cblas_zdotc_sub(n, a, inca, b, incb, &c );
    printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.re, c.im );
    return 0;
}
Obtaining Numerically Reproducible Results 6
Intel® Math Kernel Library (Intel® MKL) offers functions and environment variables that help you obtain
Conditional Numerical Reproducibility (CNR) of floating-point results when calling the library functions from
your application. These new controls enable Intel MKL to run in a special mode, when functions return bitwise
reproducible floating-point results from run to run under the following conditions:
• Calls to Intel MKL occur in a single executable
• The number of computational threads used by the library does not change in the run
It is well known that for general single and double precision IEEE floating-point numbers, the associative
property does not always hold, meaning (a+b)+c may not equal a+(b+c). Let's consider a specific example.
In infinite precision arithmetic, 2^(-63) + 1 + (-1) = 2^(-63). If this same computation is done on a computer using
double precision floating-point numbers, a rounding error is introduced, and the order of operations becomes
important:
(2^(-63) + 1) + (-1) ≃ 1 + (-1) = 0
versus
2^(-63) + (1 + (-1)) ≃ 2^(-63) + 0 = 2^(-63)
This inconsistency in results due to order of operations is precisely what the new functionality addresses.
The application related factors that affect the order of floating-point operations within a single executable
program include selection of a code path based on run-time processor dispatching, alignment of data arrays,
variation in number of threads, threaded algorithms and internal floating-point control settings. You can
control most of these factors by controlling the number of threads and floating-point settings and by taking
steps to align memory when it is allocated (see the Getting Reproducible Results with Intel® MKL knowledge
base article for details). However, run-time dispatching and certain threaded algorithms do not allow users to
make changes that can ensure the same order of operations from run to run.
Intel MKL does run-time processor dispatching in order to identify the appropriate internal code paths to
traverse for the Intel MKL functions called by the application. The code paths chosen may differ across a wide
range of Intel processors and Intel architecture compatible processors and may provide differing levels of
performance. For example, an Intel MKL function running on an Intel® Pentium® 4 processor may run one
code path, while on the latest Intel® Xeon® processor it will run another code path. This happens because
each unique code path has been optimized to match the features available on the underlying processor. One
key way that the new features of a processor are exposed to the programmer is through the instruction set
architecture (ISA). Because of this, code branches in Intel MKL are designated by the latest ISA they use for
optimizations: from the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) to the Intel® Advanced Vector
Extensions 2 (Intel® AVX2). The feature-based approach introduces a challenge: if any of the internal
floating-point operations are done in a different order or are re-associated, the computed results may differ.
Dispatching optimized code paths based on the capabilities of the processor on which the code is running is
central to the optimization approach used by Intel MKL. So it is natural that consistent results require some
performance trade-offs. If limited to a particular code path, performance of Intel MKL can in some
circumstances degrade by more than half. To understand this, note that matrix-multiply performance
nearly doubled with the introduction of new processors supporting Intel AVX2 instructions. Even if the code
branch is not restricted, performance can degrade by 10-20% because the new functionality restricts
algorithms to maintain the order of operations.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and
SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-
dependent optimizations in this product are intended for use with Intel microprocessors. Certain
optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to
the applicable product User and Reference Guides for more information regarding the specific instruction
sets covered by this notice.
Notice revision #20110804
NOTE
On non-Intel CPUs and on Intel CPUs that do not support Intel AVX2, this environment setting may
cause results to differ because the AUTO branch is used instead, while the above function call returns
an error and does not enable the CNR mode.
NOTE
On non-Intel CPUs, this environment setting may cause results to differ because the AUTO branch is
used instead, while the above function call returns an error and does not enable the CNR mode.
To ensure Intel MKL calls return the same results on all Intel or Intel compatible CPUs supporting Intel SSE2
instructions:
1. Make sure that your application uses a fixed number of threads
2. (Recommended) Properly align input and output arrays in Intel MKL function calls
3. Do either of the following:
• Call
mkl_cbwr_set(MKL_CBWR_COMPATIBLE)
• Set the environment variable:
export MKL_CBWR=COMPATIBLE
NOTE
The special MKL_CBWR_COMPATIBLE/COMPATIBLE option is provided because Intel and Intel
compatible CPUs have a few instructions, such as approximation instructions rcpps/rsqrtps, that may
return different results. This option ensures that Intel MKL does not use these instructions and forces a
single Intel SSE2 only code path to be executed.
Next steps
See Specifying the Code Branches for details of specifying the branch using environment variables.
See Support Functions for Conditional Numerical Reproducibility for how to configure the CNR mode of Intel
MKL using functions.
See Intel MKL PARDISO - Parallel Direct Sparse Solver Interface for how to configure the CNR mode for
PARDISO.
See Also
Code Examples
Value Description
AUTO CNR mode uses the standard ISA-based dispatching model while ensuring fixed cache sizes,
deterministic reductions, and static scheduling.
AVX512_E1 CNR mode uses the branch for Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
with support for Vector Neural Network Instructions.
AVX512_MIC_E1 CNR mode uses the branch for Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
with support for Vector Neural Network Instructions on Intel® Xeon Phi™ processors.
NOTE
• If the value of the branch is incorrect or your processor or operating system does not support the
specified ISA, CNR ignores this value and uses the AUTO branch without providing any warning
messages.
• Calls to functions that define the behavior of CNR must precede any of the math library functions
that they control.
• Settings specified by the functions take precedence over the settings specified by the environment
variable.
See the Intel MKL Developer Reference for how to specify the branches using functions.
See Also
Getting Started with Conditional Numerical Reproducibility
Reproducibility Conditions
To get reproducible results from run to run, ensure that the number of threads is fixed and constant.
Specifically:
• If you are running your program with OpenMP* parallelization on different processors, explicitly specify
the number of threads.
• To ensure that your application has deterministic behavior with OpenMP* parallelization and does not
adjust the number of threads dynamically at run time, set MKL_DYNAMIC and OMP_DYNAMIC to FALSE. This
is especially needed if you are running your program on different systems.
• If you are running your program with the Intel® Threading Building Blocks parallelization, numerical
reproducibility is not guaranteed.
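For the bash shell, the conditions above can be sketched as follows; the thread count of 4 is an arbitrary example value that you would replace with the fixed count shared by all of your runs:

```shell
# Fix the number of threads explicitly (example value)
export MKL_NUM_THREADS=4
# Prevent Intel MKL and OpenMP from adjusting the thread count at run time
export MKL_DYNAMIC=FALSE
export OMP_DYNAMIC=FALSE
```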
NOTE
• As usual, you should align your data, even in CNR mode, to obtain the best possible performance.
While CNR mode fully supports unaligned input and output data, using unaligned data might reduce
the performance of some Intel MKL functions on earlier Intel processors. Refer to coding techniques
that improve performance for more details.
• Conditional Numerical Reproducibility does not ensure that bitwise-identical NaN values are
generated when the input data contains NaN values.
• If dynamic memory allocation fails on one run but succeeds on another run, you may fail to get
reproducible results between these two runs.
See Also
MKL_DYNAMIC
Coding Techniques
Specifying Code Branches
Code Examples
The following simple programs show how to obtain reproducible results from run to run of Intel MKL
functions. See the Intel MKL Developer Reference for more examples.
C Example of CNR
#include <stdio.h>
#include <mkl.h>
int main(void) {
    int my_cbwr_branch;
    /* Align all input/output data on 64-byte boundaries */
    /* for best performance of Intel MKL */
    void *darray;
    int darray_size = 1000;
    /* Set alignment value in bytes */
    int alignment = 64;
    /* Allocate aligned array */
    darray = mkl_malloc(sizeof(double) * darray_size, alignment);
    /* Find the available MKL_CBWR_BRANCH automatically */
    my_cbwr_branch = mkl_cbwr_get_auto_branch();
    /* User code without Intel MKL calls */
    /* Piece of the code where CNR of Intel MKL is needed */
    /* The performance of Intel MKL functions might be reduced in CNR mode */
    /* If the "if" statement below is commented out, Intel MKL will run in a regular mode, */
    /* and data alignment will allow you to get best performance */
    if (mkl_cbwr_set(my_cbwr_branch)) {
        printf("Error in setting MKL_CBWR_BRANCH! Aborting...\n");
        return 1;
    }
    /* CNR calls to Intel MKL + any other code */
    /* Free the allocated aligned array */
    mkl_free(darray);
    return 0;
}
IF (MKL_CBWR_SET(MY_CBWR_BRANCH) .NE. MKL_CBWR_SUCCESS) THEN
PRINT *, 'Error in setting MKL_CBWR_BRANCH! Aborting…'
RETURN
ENDIF
! CNR calls to Intel MKL + any other code
! Free the allocated array
DEALLOCATE(DARRAY)
END
Coding Tips 7
This section provides coding tips for managing data alignment and version-specific compilation.
See Also
Mixed-language Programming with the Intel® Math Kernel Library Tips on language-specific
programming
Managing Performance and Memory Coding tips related to performance improvement and use of
memory functions
Obtaining Numerically Reproducible Results Tips for obtaining numerically reproducible results of
computations
#if defined(_IA32)
integer mkl_malloc
#else
integer*8 mkl_malloc
#endif
external mkl_malloc, mkl_free, mkl_app
...
double precision darray
pointer (p_wrk,darray(1))
integer workspace
...
! Allocate aligned workspace
p_wrk = mkl_malloc( %val(8*workspace), %val(alignment) )
...
! call the program using Intel MKL
call mkl_app( darray )
...
! Free workspace
call mkl_free(p_wrk)
These symbols enable conditional compilation of code that uses new features introduced in a particular
version of the library.
To perform conditional compilation:
1. Depending on your compiler, include in your code the file where the macros are defined:
Intel® Fortran Compiler:
include "mkl.fi"
!DEC$IF DEFINED INTEL_MKL_VERSION
!DEC$IF INTEL_MKL_VERSION .EQ. 110204
* Code to be conditionally compiled
!DEC$ENDIF
!DEC$ENDIF
C/C++ Compiler. Fortran Compiler with Enabled Preprocessing:
#include "mkl.h"
#ifdef INTEL_MKL_VERSION
#if INTEL_MKL_VERSION == 110204
... Code to be conditionally compiled
#endif
#endif
Managing Output 8
Using Intel MKL Verbose Mode
When building applications that call Intel MKL functions, it may be useful to determine:
• which computational functions are called,
• what parameters are passed to them, and
• how much time is spent to execute the functions.
You can get an application to print this information to a standard output device by enabling Intel MKL
Verbose. Functions that can print this information are referred to as verbose-enabled functions.
When Verbose mode is active in an Intel MKL domain, every call of a verbose-enabled function finishes with
printing a human-readable line describing the call. However, if your application gets terminated for some
reason during the function call, no information for that function will be printed. The first call to a verbose-
enabled function also prints a version information line.
To enable the Intel MKL Verbose mode for an application, do one of the following:
• set the environment variable MKL_VERBOSE to 1, or
• call the support function mkl_verbose(1).
To disable the Intel MKL Verbose mode, call the mkl_verbose(0) function. Both enabling and disabling of the
Verbose mode using the function call take precedence over the environment setting. For a full description of
how the mkl_verbose function works by language, see either the Intel MKL Developer Reference for C or
the Intel MKL Developer Reference for Fortran. Both references are available in the Intel® Software
Documentation Library.
Intel MKL Verbose mode is global state, not thread-local. In other words, if an application changes the
mode from multiple threads, the result is undefined.
WARNING
The performance of an application may degrade with the Verbose mode enabled, especially when the
number of calls to verbose-enabled functions is large, because every call to a verbose-enabled
function requires an output operation.
See Also
Intel Software Documentation Library
• Intel MKL version. This information is separated by a comma from the rest of the line.
• The name of the function. Although the name printed may differ from the name used in the source code
of the application (for example, the cblas_ prefix of CBLAS functions is not printed), you can easily
recognize the function by the printed name.
• Values of the arguments.
  • The values are listed in the order of the formal argument list. The list directly follows the function
  name; it is parenthesized and comma-separated.
  • Arrays are printed as addresses (to see the alignment of the data).
  • Integer scalar parameters passed by reference are printed by value. Zero values are printed for NULL
  references.
  • Character values are printed without quotes.
  • For all parameters passed by reference, the values printed are the values returned by the function.
  For example, the printed value of the info parameter of a LAPACK function is its value after the
  function execution.
• Time taken by the function.
  • The time is printed in convenient units (seconds, milliseconds, and so on), which are explicitly
  indicated.
  • The time may fluctuate from run to run.
  See Managing Multi-core Performance for options to set an affinity mask.
• Value of the MKL_CBWR environment variable. The value printed is prefixed with CNR:. See Getting
Started with Conditional Numerical Reproducibility.
• Status of the Intel MKL memory manager. The value printed is prefixed with FastMM:. See Avoiding
Memory Leaks in Intel MKL for a description of the Intel MKL memory manager.
• Values of Intel MKL environment variables defining the general and domain-specific numbers of threads,
separated by a comma. The first value printed is prefixed with NThr:. See Intel MKL-specific Environment
Variables for Threading Control.
Important
ScaLAPACK, Cluster FFT, and Cluster Sparse Solver function domains are not installed by default. To
use them, explicitly select the appropriate component during installation.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and
SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-
dependent optimizations in this product are intended for use with Intel microprocessors. Certain
optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to
the applicable product User and Reference Guides for more information regarding the specific instruction
sets covered by this notice.
Notice revision #20110804
See Also
Intel® Math Kernel Library Structure
Managing Performance of the Cluster Fourier Transform Functions
Intel® Distribution for LINPACK* Benchmark
Working with the Intel® Math Kernel Library Cluster Software 9
<MKL cluster library> One of the libraries for ScaLAPACK or Cluster FFT with the appropriate
architecture and programming interface (LP64 or ILP64).
Available libraries are listed in Appendix C: Directory Structure
in Detail. For example, for the LP64 interface, it is
-lmkl_scalapack_lp64 or -lmkl_cdft_core. Cluster Sparse
Solver does not require an additional computation library.
<MKL core libraries> Processor optimized kernels, threading library, and system
library for threading support, linked as described in Listing
Libraries on a Link Line.
<MPI linker script> A linker script that corresponds to the MPI version.
For example, if you are using Intel MPI, want to statically link with ScaLAPACK using the LP64 interface, and
have only one MPI process per core (and thus do not use threading), specify the following linker options:
-L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_scalapack_lp64.a
$MKLPATH/libmkl_blacs_intelmpi_lp64.a $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a
$MKLPATH/libmkl_core.a -static_mpi -Wl,--end-group -lpthread -lm
NOTE
Grouping symbols -Wl,--start-group and -Wl,--end-group are required for static linking.
Tip
Use the Link-line Advisor to quickly choose the appropriate set of <MKL cluster library>, <BLACS>, and
<MKL core libraries>.
See Also
Linking Your Application with the Intel(R) Math Kernel Library
Examples of Linking for Clusters
Caution
Avoid over-prescribing the number of OpenMP threads, which may occur, for instance, when the
number of MPI ranks per node and the number of OpenMP threads per node are both greater than
one. The number of MPI ranks per node multiplied by the number of OpenMP threads per node should
not exceed the number of hardware threads per node.
If you are using your login environment to set an environment variable, such as OMP_NUM_THREADS,
remember that changing the value on the head node and then doing your run, as you do on a shared-
memory (SMP) system, does not change the variable on all the nodes because mpirun starts a fresh default
shell on all the nodes. To change the number of OpenMP threads on all the nodes, in .bashrc, add a line at
the top, as follows:
OMP_NUM_THREADS=1; export OMP_NUM_THREADS
You can run multiple CPUs per node using MPICH. To do this, build MPICH to enable multiple CPUs per node.
Be aware that certain MPICH applications may fail to work perfectly in a threaded environment (see the
Known Limitations section in the Release Notes). If you encounter problems with MPICH when the number of
OpenMP threads is greater than one, first try setting the number of threads to one and see whether the
problem persists.
Important
For Cluster Sparse Solver, set the number of OpenMP threads to a number greater than one because
the implementation of the solver only supports a multithreaded algorithm.
See Also
Techniques to Set the Number of Threads
• 0 or undefined.
Intel MKL assumes that the thread support level of the Intel MPI Library is MPI_THREAD_SINGLE and
defaults to sequential execution.
• 1, 2, or 3.
This value determines the thread support level that Intel MKL assumes:
• 1 - MPI_THREAD_FUNNELED
• 2 - MPI_THREAD_SERIALIZED
• 3 - MPI_THREAD_MULTIPLE
In all these cases, Intel MKL determines the number of MPI processes per node using the other
environment variables listed and defaults to a number of threads equal to the number of available cores
per node divided by the number of MPI processes per node.
Important
Instead of relying on the discussed implicit settings, explicitly set the number of threads for Intel MKL.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and
SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-
dependent optimizations in this product are intended for use with Intel microprocessors. Certain
optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to
the applicable product User and Reference Guides for more information regarding the specific instruction
sets covered by this notice.
Notice revision #20110804
See Also
Managing Multi-core Performance
Intel® Software Documentation Library for more information on Intel MPI Library
To build a custom BLACS library, from the above directory run the make command.
The command builds a static custom BLACS library libmkl_blacs_custom_lp64.a using the MPI compiler from
the current shell environment. Look into the <mkl directory>/interfaces/mklmpi/makefile for targets and
variables that define how to build the custom library. In particular, you can specify the compiler through the
MPICC variable.
For more control over the building process, refer to the documentation available through the command
make help
See Also
Linking with Intel MKL Cluster Software
Directory Structure in Detail
mpicc <user files to link> \
-Wl,--start-group \
$MKLPATH/libmkl_cdft_core.a \
$MKLPATH/libmkl_blacs_intelmpi_lp64.a \
$MKLPATH/libmkl_intel_lp64.a \
$MKLPATH/libmkl_intel_thread.a \
$MKLPATH/libmkl_core.a \
-Wl,--end-group \
-liomp5 -lpthread
To link dynamically with Cluster Sparse Solver for a cluster of systems based on the Intel® 64 architecture,
use the following link line:
mpicc <user files to link> \
-L$MKLPATH \
-lmkl_blacs_intelmpi_lp64 \
-lmkl_intel_lp64 \
-lmkl_intel_thread -lmkl_core \
-liomp5 -lpthread
See Also
Linking with Intel MKL Cluster Software
Using the Link-line Advisor
To link statically with Cluster Sparse Solver for a cluster of systems based on the Intel® 64 architecture, use
the following link line:
mpiifort <user files to link> \
-Wl,--start-group \
$MKLPATH/libmkl_blacs_intelmpi_lp64.a \
$MKLPATH/libmkl_intel_lp64.a \
$MKLPATH/libmkl_intel_thread.a \
$MKLPATH/libmkl_core.a \
-Wl,--end-group \
-liomp5 -lpthread
See Also
Linking with Intel MKL Cluster Software
Using the Link-line Advisor
Managing Behavior of the Intel(R) Math Kernel Library with Environment Variables 10
In these commands, <mode-string> controls error handling behavior and computation accuracy. It consists
of one or several comma-separated values of the mode parameter listed in the table below and meets these
requirements:
• Not more than one accuracy control value is permitted
• Any combination of error control values except VML_ERRMODE_DEFAULT is permitted
• No denormalized numbers control values are permitted
Values of the mode Parameter
Value of mode Description
Accuracy Control
VML_HA high accuracy versions of VM functions
VML_LA low accuracy versions of VM functions
VML_EP enhanced performance accuracy versions of VM functions
Denormalized Numbers Handling Control
VML_FTZDAZ_ON Faster processing of denormalized inputs is enabled.
VML_FTZDAZ_OFF Faster processing of denormalized inputs is disabled.
Error Mode Control
VML_ERRMODE_IGNORE On computation error, VM Error status is updated, but otherwise no
action is set. Cannot be combined with other VML_ERRMODE
settings.
VML_ERRMODE_NOERR On computation error, VM Error status is not updated and no action is
set. Cannot be combined with other VML_ERRMODE settings.
VML_ERRMODE_STDERR On error, the error text information is written to stderr.
VML_ERRMODE_EXCEPT On error, an exception is raised.
VML_ERRMODE_CALLBACK On error, an additional error handler function is called.
VML_ERRMODE_DEFAULT On error, an exception is raised and an additional error handler
function is called.
These commands provide an example of valid settings for the MKL_VML_MODE environment variable in your
command shell:
• For the bash shell:
export MKL_VML_MODE=VML_LA,VML_ERRMODE_ERRNO,VML_ERRMODE_STDERR
• For a C shell (csh or tcsh):
setenv MKL_VML_MODE VML_LA,VML_ERRMODE_ERRNO,VML_ERRMODE_STDERR
NOTE
VM ignores the MKL_VML_MODE environment variable in the case of incorrect or misspelled settings of
mode.
Important
While this table explains the settings that usually improve performance under certain conditions, the
actual performance highly depends on the configuration of your cluster. Therefore, experiment with the
listed values to speed up your computations.
-1 (default) Enables CFFT to decide which of the two above values to use
depending on the value of DFTI_TRANSPOSE.
enable_soi Not applicable A flag that enables low-communication Segment Of Interest FFT (SOI
FFT) algorithm for one-dimensional complex-to-complex CFFT, which
requires fewer MPI communications than the standard nine-step (or
six-step) algorithm.
Caution
While using fewer MPI communications, the SOI FFT algorithm incurs a
minor loss of precision (about one decimal digit).
The following example illustrates usage of the environment variable assuming the bash shell:
export MKL_CDFT=wo_omatcopy=1,alltoallv=4,enable_soi
mpirun -ppn 2 -n 16 ./mkl_cdft_app
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and
SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-
dependent optimizations in this product are intended for use with Intel microprocessors. Certain
optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to
the applicable product User and Reference Guides for more information regarding the specific instruction
sets covered by this notice.
For more details about the mkl_enable_instructions function, including the argument values, see the
Intel MKL Developer Reference.
For example:
• To turn on automatic CPU-based dispatching of Intel AVX-512 with support of AVX512_4FMAPS and
AVX512_4VNNI instruction groups on systems based on Intel Xeon Phi processors, do one of the
following:
• Call
mkl_enable_instructions(MKL_ENABLE_AVX512_MIC_E1)
• Set the environment variable:
• For the bash shell:
export MKL_ENABLE_INSTRUCTIONS=AVX512_MIC_E1
• For a C shell (csh or tcsh):
setenv MKL_ENABLE_INSTRUCTIONS AVX512_MIC_E1
• To configure the library not to dispatch more recent architectures than Intel AVX2, do one of the
following:
• Call
mkl_enable_instructions(MKL_ENABLE_AVX2)
• Set the environment variable:
• For the bash shell:
export MKL_ENABLE_INSTRUCTIONS=AVX2
• For a C shell (csh or tcsh):
setenv MKL_ENABLE_INSTRUCTIONS AVX2
NOTE
Settings specified by the mkl_enable_instructions function take precedence over the settings
specified by the MKL_ENABLE_INSTRUCTIONS environment variable.
Tip
After configuring your CDT, you can benefit from the Eclipse-provided code assist feature. See
Code/Context Assist description in the CDT Help for details.
To configure your Eclipse IDE CDT to link with Intel MKL, you need to perform the steps explained below. The
specific instructions for performing these steps depend on your version of the CDT and on the tool-chain/
compiler integration. Refer to the CDT Help for more details.
To configure your Eclipse IDE CDT, do the following:
1. Open Project Properties for your project.
2. Add the Intel MKL include path, that is, <mkl directory>/include, to the project's include paths.
3. Add the Intel MKL library path for the target architecture to the project's library paths. For example, for
the Intel® 64 architecture, add <mkl directory>/lib/intel64_lin.
4. Specify the names of the Intel MKL libraries to link with your application. For example, you may need
the following libraries: mkl_intel_lp64, mkl_intel_thread, mkl_core, and iomp5.
NOTE
Because compilers typically require library names rather than file names, omit the "lib" prefix and the ".a" or ".so" extension.
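The naming convention in the note above can be sketched as a small helper. The function name is hypothetical, for illustration only:

```python
def linker_name(filename):
    """Convert a library file name (e.g. "libmkl_core.so") to the name
    passed to the linker (e.g. "mkl_core"), per the note above:
    drop the "lib" prefix and the ".a" or ".so" extension."""
    name = filename
    if name.startswith("lib"):
        name = name[3:]
    for ext in (".a", ".so"):
        if name.endswith(ext):
            name = name[: -len(ext)]
    return name
```

For example, `linker_name("libmkl_intel_lp64.a")` yields `mkl_intel_lp64`, which is what you would enter in the Eclipse library list.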
See Also
Intel® MKL Libraries to Link with
Linking in Detail
Intel® Math Kernel Library Benchmarks 12
Acknowledgement
This product includes software developed at the University of Tennessee, Knoxville, Innovative Computing
Laboratories.
xlinpack_xeon64 The 64-bit program executable for a system with Intel Xeon
processor using Intel® 64 architecture.
These files are not available immediately after installation and appear as a result of execution of an
appropriate runme script.
See Also
High-level Directory Structure
To run the software for one of the pre-determined sample problem sizes, execute the appropriate script:
./runme_xeon32
./runme_xeon64
To run the software for other problem sizes, see the extended help included with the program. Extended help
can be viewed by running the program executable with the -e option:
./xlinpack_xeon32 -e
./xlinpack_xeon64 -e
The pre-defined data input files lininput_xeon32 and lininput_xeon64 are examples. Different systems have different numbers of processors and amounts of memory and therefore require new input files. The extended help can give insight into proper ways to change the sample input files.
Each input file requires the following minimum amount of memory:
lininput_xeon32 2 GB
lininput_xeon64 16 GB
If the system has less memory than the above sample data input requires, you may need to edit or create
your own data input files, as explained in the extended help.
The Intel Optimized LINPACK Benchmark determines the optimal number of OpenMP threads to use. To run a
different number, you can set the OMP_NUM_THREADS or MKL_NUM_THREADS environment variable inside a
sample script. If you run the Intel Optimized LINPACK Benchmark without setting the number of threads, it
defaults to the number of physical cores.
NOTE
Performance of statically and dynamically linked prebuilt binaries may be different. The performance of
both depends on the version of Intel MPI you are using. You can build binaries statically or dynamically
linked against a particular version of Intel MPI by yourself.
HPL code is homogeneous by nature: it requires that each MPI process runs in an environment with similar
CPU and memory constraints. The Intel Distribution for LINPACK Benchmark supports heterogeneity,
meaning that the data distribution can be balanced to the performance requirements of each node, provided
that there is enough memory on that node to support additional work. For information on how to configure
Intel MKL to use the internode heterogeneity, see Heterogeneous Support in the Intel Distribution for
LINPACK Benchmark.
runme_intel64_dynamic Sample run script for the Intel® 64 architecture and binary
dynamically linked against Intel MPI library.
runme_intel64_static Sample run script for the Intel® 64 architecture and binary
statically linked against Intel MPI library.
Prebuilt libraries and utilities for building with a customized MPI implementation
See Also
High-level Directory Structure
Building the Intel Distribution for LINPACK Benchmark for a Customized MPI
Implementation
The Intel Distribution for LINPACK Benchmark contains a sample build script build.sh. If you are using a
customized MPI implementation, this script builds a binary using Intel MKL MPI wrappers. To build the binary,
follow these steps:
1. Specify the location of Intel MKL to be used (MKLROOT)
2. Set up your MPI environment
3. Run the script build.sh
See Also
Contents of the Intel Distribution for LINPACK Benchmark
NOTE
The Intel Distribution for LINPACK Benchmark may contain additional optimizations compared to the
reference Netlib HPL implementation.
Configuring Parameters
The most significant parameters in HPL.dat are P, Q, NB, and N. Specify them as follows:
• P and Q - the number of rows and columns in the process grid, respectively.
P*Q must be the number of MPI processes that HPL is using.
Choose P≤Q.
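One way to satisfy these two constraints is to take the factorization of the MPI process count that is closest to square. The following sketch (not part of the benchmark; the function name is illustrative) picks P as the largest divisor of the process count not exceeding its square root:

```python
from math import isqrt

def choose_grid(nprocs):
    """Pick a process grid P x Q with P*Q == nprocs and P <= Q,
    taking P as the largest divisor of nprocs not exceeding sqrt(nprocs)."""
    for p in range(isqrt(nprocs), 0, -1):
        if nprocs % p == 0:
            return p, nprocs // p
```

For instance, 16 MPI processes give a 4x4 grid, and 12 processes give 3x4.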
• NB - the block size of the data distribution. Recommended values:
Processor                                                                    NB
Intel Xeon Processor E26*/E26* v2 (codenamed Sandy Bridge or Ivy Bridge)     256
Intel Xeon Processor supporting Intel® Advanced Vector Extensions 512
(Intel® AVX-512) instructions (codenamed Skylake Server)                     384
• N - the problem size:
• For homogeneous runs, choose N divisible by NB*LCM(P,Q), where LCM is the least common multiple
of the two numbers.
• For heterogeneous runs, see Heterogeneous Support in the Intel Distribution for LINPACK Benchmark
for how to choose N.
NOTE
Increasing N usually increases performance, but the size of N is bounded by memory. In general, you
can compute the memory required to store the matrix (which does not count internal buffers) as
8*N*N/(P*Q) bytes, where N is the problem size and P and Q are the process grids in HPL.dat. A
general rule of thumb is to choose a problem size that fills 80% of memory.
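The sizing rule in the note, combined with the divisibility recommendation for homogeneous runs, can be sketched as follows (function name and defaults are illustrative; assumes 8 bytes per double-precision matrix element and one MPI process per node):

```python
from math import sqrt, lcm

def choose_n(p, q, mem_bytes_per_process, nb, utilization=0.8):
    """Largest N whose matrix (8*N*N/(P*Q) bytes per process) fits the
    target memory utilization, rounded down to a multiple of NB*LCM(P, Q)
    as recommended for homogeneous runs."""
    n_max = int(sqrt(utilization * p * q * mem_bytes_per_process / 8))
    step = nb * lcm(p, q)
    return (n_max // step) * step
```

For a 10x10 grid with 64 GiB per node and NB=256, this gives N = 826880, in line with the N of roughly 820000 used in the heterogeneous-support example later in this chapter.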
NOTE
High-bandwidth Multi-Channel Dynamic Random Access Memory (MCDRAM) on the second-generation
Intel Xeon Phi processors may appear to be a NUMA node. However, because there are no CPUs on
this node, do not run an MPI process for it.
See Also
Notational Conventions
Building the Intel Distribution for LINPACK Benchmark for a Customized MPI Implementation
Building the Netlib HPL from Source Code
Using High-bandwidth Memory with Intel MKL
If the heterogeneous factor is 2.5, roughly 2.5 times the work will be put on the more powerful nodes. The
more work you put on the more powerful nodes, the more memory you might be wasting on the other nodes
if all nodes have equal amount of memory. If your cluster includes many different types of nodes, you may
need multiple heterogeneous factors.
Let P be the number of rows and Q the number of columns in your processor grid (PxQ). The work must be
homogeneous within each processor column because vertical operations, such as pivoting or panel
factorization, are synchronizing operations. When there are two different types of nodes, use MPI to process
all the faster nodes first and make sure the "PMAP process mapping" (line 9) of HPL.dat is set to 1 for
Column-major mapping. Because all the nodes must be the same within a process column, the number of
faster nodes must always be a multiple of P, and you can specify the faster nodes by setting the number of
process columns C for the faster nodes with the -c command-line parameter. The -f 1.0 -c 0 setting
corresponds to the default homogeneous behavior.
To understand how to choose the problem size N for a heterogeneous run, first consider a homogeneous
system, where you might choose N as follows:
N ~= sqrt(Memory Utilization * P * Q * Memory Size in Bytes / 8)
Memory Utilization is usually around 0.8 for homogeneous Intel Xeon processor systems. On a
heterogeneous system, you may apply a different formula for N for each set of nodes that are the same and
then choose the minimum N over all sets. Suppose you have a cluster with only one heterogeneous factor F
and the number of processor columns (out of the total Q) in the group with that heterogeneous factor equal
to C. That group contains P*C nodes. First compute the sum of the parts: S = F*P*C + P*(Q-C). Note that on
a homogeneous system S = P*Q, F = 1, and C = Q. Take N as
N ~= sqrt(Memory Utilization * P * Q * ((F*P*C)/S) * Memory Size in Bytes / 8)
or simply scale down the value of N for the homogeneous system by sqrt(F*P*C/S).
Example
Suppose the cluster has 100 nodes each having 64 GB of memory, and 20 of the nodes are 2.7 times as
powerful as the other 80. Run one MPI process per node for a total of 100 MPI processes. Assume a square
processor grid P=Q=10, which conveniently divides up the faster nodes evenly. Normally, the HPL
documentation recommends choosing a matrix size that consumes 80 percent of available memory. If N is
the size of the matrix, the matrix consumes 8N^2/(P*Q) bytes. So a homogeneous run might look like:
./xhpl -n 820000 -b 256 -p 10 -q 10
If you redistribute the matrix and run the heterogeneous Intel Distribution for LINPACK Benchmark, you can
take advantage of the faster nodes. But because some of the nodes will contain 2.7 times as much data as
the other nodes, you must shrink the problem size (unless the faster nodes also happen to have 2.7 times as
much memory). Instead of 0.8*64GB*100 total memory size, we have only 0.8*64GB*20 + 0.8*64GB/2.7*80
total memory size, which is less than half the original space. So the problem size in this case would
be 526000. Because P=10 and there are 20 faster nodes, two processor columns are faster. If you arrange
MPI to send these nodes first to the application, the command line looks like:
./xhpl -n 526000 -b 1024 -p 10 -q 10 -f 2.7 -c 2
The -m parameter may be misleading for heterogeneous calculations because it calculates the problem size
assuming all the nodes have the same amount of data.
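The arithmetic behind this example can be reproduced with a short calculation (a sketch only, using the 0.8 memory-utilization figure and 64 GiB per node from the example):

```python
from math import sqrt

# Cluster from the example: P = Q = 10 grid, 20 of 100 nodes are
# 2.7x faster, occupying C = 2 process columns.
P, Q, F, C = 10, 10, 2.7, 2
mem = 64 * 2**30          # bytes of memory per node
util = 0.8                # target memory utilization

# Homogeneous problem size: N ~= sqrt(util * P * Q * mem / 8)
n_hom = sqrt(util * P * Q * mem / 8)

# Heterogeneous: scale down by sqrt(F*P*C / S), where S = F*P*C + P*(Q-C)
S = F * P * C + P * (Q - C)
n_het = n_hom * sqrt(F * P * C / S)

print(round(n_hom, -3), round(n_het, -3))  # -> 829000.0 526000.0
```

The results are consistent with the N values of roughly 820000 (homogeneous) and 526000 (heterogeneous) used in the command lines above.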
Warning
The number of faster nodes must be C*P. If the number of faster nodes is not divisible by P, you
might not be able to take advantage of the extra performance potential by giving the faster nodes
extra work.
While it suffices to simply provide the -f and -c command-line parameters if you need only one heterogeneous
factor, you must add lines to the HPL.dat input to support multiple heterogeneous factors. For the above
example (two processor columns have nodes that are 2.7 times faster), instead of passing the -f and -c
command-line parameters you can modify the HPL.dat input file by adding these two lines to the end:
1
0 1 2.7
NOTE
Numbering of processor columns starts at 0. The start and stopping numbers must be between 0 and
Q-1 (inclusive).
If instead there are three different types of nodes in a cluster and you need at least two heterogeneous
factors, change the number in the first row above from 1 to 2 and follow that line with two lines specifying
the start column, stopping column, and heterogeneous factor.
When choosing parameters for heterogeneous support in HPL.dat, primarily focus on the most powerful
nodes. The larger the heterogeneous factor, the more balanced the cluster may be from a performance
viewpoint, but the more imbalanced from a memory viewpoint. At some point, further performance balancing
might affect the memory too much. If this is the case, try to reduce any changes done for the faster nodes
(such as in block sizes). Experiment with values in HPL.dat carefully because wrong values may greatly
hinder performance.
When tuning on a heterogeneous cluster, do not immediately attempt a heterogeneous run, but do the
following:
1. Break the cluster down into multiple homogeneous clusters.
2. Make heterogeneous adjustments for performance balancing. For instance, if you have two different
sets of nodes where one is three times as powerful as the other, it must do three times the work.
3. Figure out the approximate size of the problem (per node) that you can run on each piece.
4. Do some homogeneous runs with those problem sizes per node and the final block size needed for the
heterogeneous run and find the best parameters.
5. Use these parameters for an initial heterogeneous run.
Environment Variables
The table below lists Intel MKL environment variables to control runs of the Intel Distribution for LINPACK
Benchmark.
HPL_LOG Controls the level of detail for the HPL output. An integer ranging from 0 to 2:
• 0 - no log is displayed.
• 1 - only one root node displays a log, exactly the same as the ASYOUGO option provides.
HPL_SWAPWIDTH Specifies width for each swap operation. 16 or 24. The default is 24.
You can set Intel Distribution for LINPACK Benchmark environment variables using the PMI_RANK and
PMI_SIZE environment variables of the Intel MPI library, and you can create a shell script to automate the
process.
1 Nothing specified All Intel Xeon processors in the cluster are used.
2 HPL_HOST_CORE=1-3,8-10 Only Intel Xeon processor cores 1-3 and 8-10 are used.
3 HPL_HOST_NODE=1 Only Intel Xeon processor cores on NUMA node 1 are used.
See Also
Heterogeneous Support in the Intel Distribution for LINPACK Benchmark
xhpcg_avx The Intel AVX optimized version of the benchmark, optimized for systems
based on the first and the second generations of Intel Xeon processor E3
family, Intel Xeon processor E5 family, or Intel Xeon processor E7 family.
xhpcg_avx2 The Intel AVX2 optimized version of the benchmark, optimized for systems
based on the third and later generations of the Intel Xeon processor E3 family,
Intel Xeon processor E5 family, Intel Xeon processor E7 family, and future Intel
processors with Intel AVX2 support. Running the Intel AVX optimized version of
the benchmark on an Intel AVX2 enabled system produces non-optimal
performance. The Intel AVX2 optimized version of the benchmark does not run
on systems that do not support Intel AVX2.
xhpcg_knl The Intel Xeon Phi processor (formerly Knights Landing) optimized version of
the benchmark is designed for systems based on Intel Xeon Phi processors
with Intel AVX-512 support. Running the Intel AVX or AVX2 optimized versions
of the benchmark on an Intel AVX-512 enabled system produces non-optimal
performance. The Intel Xeon Phi processor optimized version of the benchmark
does not run on systems that do not support Intel AVX-512.
xhpcg_skx The Intel Xeon Scalable processor (formerly Skylake) optimized version of the
benchmark is designed for systems based on Intel Xeon Scalable processors
and future Intel processors with Intel AVX-512 support. Running the Intel AVX
or AVX2 optimized versions of the benchmark on an Intel AVX-512 enabled
system produces non-optimal performance. The Intel Xeon Scalable processor
optimized version of the benchmark does not run on systems that do not
support Intel AVX-512.
The Intel Optimized HPCG package also includes the source code necessary to build the Intel AVX, Intel
AVX2, and Intel AVX-512 optimized versions of the benchmark for other MPI implementations, such as SGI
MPT*, MPICH2, or Open MPI. Build instructions are available in the QUICKSTART file included with the
package.
See Also
High-level Directory Structure
• The Intel Xeon Phi processor optimized version performs best with four MPI processes per processor
and two threads for each processor core, with SMT turned on. Specifically, for a 128-node cluster
with one Intel Xeon Phi processor 7250 per node, run the executable in this manner:
#> mpiexec.hydra -n 512 -ppn 4 env OMP_NUM_THREADS=34 KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_knl -n160
6. When the benchmark completes execution, which usually takes a few minutes, find the YAML file with
official results in the current directory. The performance rating of the benchmarked system is in the last
section of the file:
HPCG result is VALID with a GFLOP/s rating of: [GFLOP/s]
Function Domain                                                              Fortran      C/C++
                                                                             interface    interface
LAPACK routines for solving least-squares problems, eigenvalue and
singular value problems, and Sylvester's equations                           Yes          Yes
Auxiliary and utility LAPACK routines                                        Yes          Yes
Parallel Basic Linear Algebra Subprograms (PBLAS)                            Yes
ScaLAPACK †                                                                  Yes
Fast Poisson, Laplace, and Helmholtz Solver (Poisson Library)                Yes          Yes
Intel® Math Kernel Library Language Interfaces Support A
Include Files
The table below lists Intel MKL include files.
BLACS mkl_blacs.h‡‡
PBLAS mkl_pblas.h‡‡
ScaLAPACK mkl_scalapack.h‡‡
See Also
Language Interfaces Support, by Function Domain
Support for Third-Party Interfaces B
Important
For ease of use, the FFTW3 interface is also integrated in Intel MKL.
Caution
The FFTW2 and FFTW3 interfaces are not compatible with each other. Avoid linking to both of them. If
you must do so, first modify the wrapper source code for FFTW2:
1. Change every instance of fftw_destroy_plan in the fftw2xc interface to
fftw2_destroy_plan.
2. Change all the corresponding file names accordingly.
3. Rebuild the pertinent libraries.
See Also
High-level Directory Structure
Using Language-Specific Interfaces with Intel(R) MKL
Directory Structure in Detail C
[Tables of file names, contents, default-installation status, and optional components, grouped into Interface Layer, Threading Layer, Computational Layer, Cluster Libraries, and Message Catalogs sections for each library directory.]
Index
A
affinity mask 51
aligning data, example 76
architecture support 18

B
BLAS
  calling routines from C 63
  Fortran 95 interface to 62
  OpenMP* threaded routines 38

C
C interface to LAPACK, use of 63
C, calling LAPACK, BLAS, CBLAS from 63
C/C++, Intel(R) MKL complex types 64
calling
  BLAS functions from C 65
  CBLAS interface from C 65
  complex BLAS Level 1 function from C 65
  complex BLAS Level 1 function from C++ 65
  Fortran-style routines from C 63
CBLAS interface, use of 63
Cluster FFT
  environment variable for 90
  linking with 82
  managing performance of 90
cluster software, Intel(R) MKL 82
cluster software, linking with
  commands 82
  linking examples 86
Cluster Sparse Solver, linking with 82
code examples, use of 15
coding
  data alignment 76
  techniques to improve performance 56
compilation, Intel(R) MKL version-dependent 77
compiler run-time libraries, linking with 33
compiler-dependent function 62
complex types in C and C++, Intel(R) MKL 64
computation results, consistency 68
computational libraries, linking with 32
conditional compilation 77
configuring Eclipse* CDT 94
consistent results 68
conventions, notational 11
custom shared object
  building 34
  composing list of functions 35
  specifying function names 36

D
data alignment, example 76
denormal number, performance 57
direct call, to Intel(R) Math Kernel Library computational kernels 53
directory structure
  high-level 18
  in-detail
dispatch Intel(R) architectures, configure with an environment variable 92
dispatch, new Intel(R) architectures, enable with an environment variable 92

E
Eclipse* CDT
  configuring 94
environment variables
  for threading control 45
  setting for specific architecture and programming interface 13
  to control dispatching for Intel(R) architectures 92
  to control threading algorithm for ?gemm 48
  to enable dispatching of new architectures 92
  to manage behavior of function domains 89
  to manage behavior of Intel(R) Math Kernel Library with 89
  to manage performance of cluster FFT 90
examples, linking
  for cluster software 86
  general 24

F
FFT interface
  OpenMP* threaded problems 38
FFTW interface support 111
Fortran 95 interface libraries 30
function call information, enable printing 79

H
header files, Intel(R) MKL 109
heterogeneity
  of Intel(R) Distribution for LINPACK* Benchmark 97
heterogeneous cluster
  support by Intel(R) Distribution for LINPACK* Benchmark for Clusters 101
high-bandwidth memory, use in Intel(R) Math Kernel Library 57
HT technology, configuration tip 51

I
ILP64 programming, support for 29
improve performance, for matrices of small sizes 53
include files, Intel(R) MKL 109
information, for function call, enable printing 79
installation, checking 13
Intel(R) Distribution for LINPACK* Benchmark
  heterogeneity of 97
Intel(R) Distribution for LINPACK* Benchmark for Clusters
  heterogeneous support 101
Intel(R) Hyper-Threading Technology, configuration tip 51
Intel(R) Optimized High Performance Conjugate Gradient Benchmark
  getting started 106
  overview 105
Intel® Threading Building Blocks, functions threaded with 40
interface
  Fortran 95, libraries 30
  LP64 and ILP64, use of 29
interface libraries and modules, Intel(R) MKL 60
interface libraries, linking with 29

K
kernel, in Intel(R) Math Kernel Library, direct call to 53

L
language interfaces support 108
language-specific interfaces
  interface libraries and modules 60
LAPACK
  C interface to, use of 63
  calling routines from C 63
  Fortran 95 interface to 62
  OpenMP* threaded routines 38
  performance of packed routines 56
layers, Intel(R) MKL structure 19
libraries to link with
  computational 32
  interface 29
  run-time 33
  system libraries 34
  threading 31
link tool, command line 23
link-line syntax 26
linking examples
  cluster software 86
  general 24
linking with
  compiler run-time libraries 33
  computational libraries 32
  interface libraries 29
  system libraries 34
  threading libraries 31
linking, quick start 21
linking, Web-based advisor 23
LINPACK benchmark 95

M
memory functions, redefining 58
memory management 57
memory renaming 58
memory, high-bandwidth, use in Intel(R) Math Kernel Library 57
message-passing interface
  custom, usage 85
  Intel(R) Math Kernel Library interaction with 84
mixed-language programming 63
module, Fortran 95 62
MPI
  custom, usage 85
  Intel(R) Math Kernel Library interaction with 84
multi-core performance 51

N
notational conventions 11
number of threads
  changing at run time 43
  changing with OpenMP* environment variable 42
  Intel(R) MKL choice, particular cases 46
  setting for cluster 83
  techniques to set 42
numerically reproducible results 68

O
OpenMP* threaded functions 38
OpenMP* threaded problems 38

P
parallel performance 41
parallelism, of Intel(R) MKL 38
performance
  multi-core 51
  with denormals 57
  with subnormals 57
performance improvement, for matrices of small sizes 53
performance, of Intel(R) MKL, improve on specific processors 57

R
results, consistent, obtaining 68
results, numerically reproducible, obtaining 68

S
ScaLAPACK, linking with 82
SDL 22, 27
Single Dynamic Library 22, 27
structure
  high-level 18
  in-detail
  model 19
support, technical 7
supported architectures 18
syntax, link-line 26
system libraries, linking with 34

T
technical support 7
thread safety, of Intel(R) MKL 38
threaded functions, with Intel® Threading Building Blocks 40
threading control, Intel(R) MKL-specific 45
threading libraries, linking with 31

U
unstable output, getting rid of 68

V
Vector Mathematics
  default mode, setting with environment variable 89
  environment variable to set default mode 89
verbose mode, of Intel(R) MKL 79