Use This Installation Guide To Install An HPC Cluster Using OpenHPC and Warewulf Open Source Software. Based On The Core Installation Recipes
Contents

Section 1: Reference Design
    Document Conventions
    Preparation
Installation
Section 2: Install the Linux* Operating System
    Post-Install Configuration
Section 3: Install the Cluster
    Configure the Head Node
    Build the compute node image
    Install oneAPI HPC Toolkit
    Install and Configure Omni-Path Fabric Software
    Configure IPoIB
    Reboot the head node
    Assemble the node image
    Provision the cluster
Section 4: Verify Cluster Design with Intel Cluster Checker
    Run Cluster Checker
Optional Components

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or visiting www.intel.com/design/literature.htm.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

This document is adapted from Install Guide: CentOS 8.2 Base OS Warewulf/SLURM Edition for Linux* (x86_64), by OpenHPC, a Linux Foundation Collaborative Project, under CC-BY-4.0:
https://github.com/openhpc/ohpc/releases/download/v2.0.GA/Install_guide-CentOS8-Warewulf-SLURM-2.0-x86_64.pdf

Cover page: OpenHPC Logo, by OpenHPC under CC-BY-4.0:
https://github.com/openhpc/ohpc/tree/2.x/docs/recipes/install/common/figures
This reference design is validated internally on a specific hardware and software configuration but is
presented here more generically. When the instructions are followed exactly as written, the cluster will
install and work correctly on the listed hardware configuration.
Even though other configurations are not validated, the use of similar hardware and software is expected to
work correctly. Changes to the chassis, memory size, number of nodes, and processor or storage models
should have no impact on cluster operation but may impact performance. This recipe may not work as
written with changes made to the server board, fabric, or software stack.
The validation cluster consists of 1 head node and 4 compute nodes, with a 1-Gbps Ethernet interface
connected to an external network, and all nodes connected through a 10-Gbps Ethernet network. MPI
messaging is performed over Intel® Omni-Path fabric. The Bill of Materials (BOM) specifies the minimum
configuration only.
Ethernet and fabric interfaces may be on-board or add-on PCI Express interfaces.
QTY  Item             Configuration
1    Head Node        2 Intel® Xeon® Scalable Processors
                      Minimum 32 GB ECC memory
                      Intel® SSD Data Center Family, SATA, 800 GB
                      1 GbE Intel® Ethernet network adapter
                      10 GbE Intel® Ethernet network adapter
                      Intel® Omni-Path Host Fabric Interface (HFI)
4    Compute Node     2 Intel® Xeon® Scalable Processors
                      Minimum 96 GB ECC memory
                      10 GbE Intel® Ethernet interface
                      Intel® Omni-Path Host Fabric Interface (HFI)
1    Ethernet Switch  Low-latency 10-Gbps Ethernet switch
1    Fabric Switch    Intel® Omni-Path Edge Switch
Table 0-1. Hardware Bill of Materials for the cluster
The following software stack has been validated. Changes to the versions shown below may result in
installation, operational, or performance issues.
Software required for optional components is covered separately in those sections.
Software Version
CentOS* Linux* OS installation DVD (minimal or full) 8 build 2105
Linux* kernel update from CentOS 4.18.0-305.3.1.el8
Intel® oneAPI HPC Toolkit 2021.3.0
Intel® Cluster Checker 2021.3.0
HPC Platform RPM packages for EL8 2018.0
OpenHPC* distribution 2.3
Intel® oneAPI Toolkits (including Intel® Cluster Checker tool), Intel® HPC Platform, and OpenHPC* packages are
available through online repositories. Instructions are provided in this document to configure them.
Omni-Path Fabric and Omni-Path Express Fabric are now products of Cornelis Networks; however, this recipe
uses the previous Intel® Omni-Path fabric. Software to support the fabric is included in the Linux distribution.
CentOS installation ISOs are downloaded from www.centos.org.
Certain conventions used in this reference design are intended to make the document easier to use and
provide flexibility to support different cluster designs.
This document is structured for direct migration into a configuration manager; for example, translating to an
Ansible Playbook, with sections into Ansible plays and steps into Ansible tasks. To facilitate this, steps are
written to accomplish a single task or command line operation. The intent is that each step will be easily
understood for migration into configuration manager modules. At the same time, clear directions are provided
for building the same cluster manually.
Much of the configuration normally performed during head node operating system installation has been
moved to later command line steps. This is done to provide a bare-minimum system to the configuration
manager.
Ansible configuration will be covered in a separate, future document.
Values that are configurable by you are distinguished by yellow underline text throughout this document.
Many values used for software configuration depend on the hardware and software listed in the bill of
materials, the document date of release, and developer preference. For example, IP addresses and hostnames
are expected to be different in the final cluster installation. Depending on your cluster requirements, other
suitable values may be used in place of the ones in this document.
Values that are configurable, and their default values, include:
Head node:
Hostname: frontend
Domain Name: cluster
Management IP Address: 192.168.1.100
Management Network Device (Ethernet): eno2
External Network Device (Ethernet): eno1
Messaging Hostname: frontend-ib0
Messaging IP Address: 192.168.5.100
Messaging Fabric Device (Intel® Omni-Path): ib0
Compute nodes:
Hostnames: cXX
Management IP Addresses: 192.168.1.XX
Management Network Device (Ethernet): eth0
Messaging IP Addresses: 192.168.5.XX
Messaging Network Device (Intel® Omni-Path): ib0
IPMI IP Addresses: 192.168.2.XX
Where devices and special nodes are required on the management or messaging fabric, IP addresses above
200 in the last IP address octet will be assigned to these devices.
Hostname representation in this document varies to best fit the situation. When used in hostnames, NNN and
XXX are meant to represent a strictly 3-digit, zero-padded number (e.g. c001, c002…c200). When used in IP
addresses, however, leading zeroes are not used (e.g. 192.168.1.1). As a means of disambiguation, XXX is used
to denote a single host, while NNN is used to specify the last host in the range.
Padding compute node hostnames is helpful to maintain sort order, but it is not required.
The class C network used in this document supports up to 254 hosts including the head node. These settings
allow for a convenient pairing of node hostnames to IP addresses (matching cXXX to 192.168.1.XXX) at the
cost of scalability past 253 compute nodes. To build a cluster larger than 253 nodes, different IP conventions
must be used.
This cluster is simple Beowulf style, consisting of a single head node managing all cluster functions and one or
more compute nodes for processing.
[Figure: Cluster topology. The head node's public interface connects to the external network; the head node and compute nodes are joined by a 10-Gbps Ethernet management and storage network and by an Intel® Omni-Path messaging fabric (ib0).]
Warewulf provides a method to gather MAC addresses as nodes are booted. However, this method is inefficient
for large clusters. The preferred method to identify nodes is to collect MAC addresses beforehand. In order to
complete these instructions, the MAC addresses on the management fabric are required. For each compute
node, identify the Ethernet port connected to the management network and record the interface's MAC
address.
In order to facilitate automatic or remote restart using IPMI (Intelligent Platform Management Interface) or
Redfish, the IP address of the Baseboard Management Controller (BMC) is required. For each compute node
record the BMC IP address for the Ethernet port connected to the management network.
On most serverboards, MAC addresses for LAN-on-Motherboard Ethernet interfaces are consecutive. The
MAC address of the first BMC is usually the MAC address after the MAC address of the last onboard NIC. For
example:
NIC1: A4:BF:01:DD:24:D8 MAC address for provisioning.
NIC2: A4:BF:01:DD:24:D9
IPMI on NIC1: A4:BF:01:DD:24:DA Used for Warewulf node reboot
IPMI on NIC2: A4:BF:01:DD:24:DB
IPMI on dedicated NIC: A4:BF:01:DD:24:DC Used for remote KVM
You must identify the correct order of enumerated network devices on the head node.
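For reference, a quick way to collect this information on a node that is already booted into Linux is to list the interfaces and query the BMC. This is a minimal sketch; it assumes the ipmitool package is installed and that BMC channel 1 carries the LAN settings.

# List interface names and MAC addresses; record the port cabled to the management network.
ip -o link show
# Read the BMC LAN configuration (IP address and MAC address of the management controller).
ipmitool lan print 1 | grep -E 'IP Address|MAC Address'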
The remaining steps in this document are all designed to be completed from a command line interface, unless
otherwise stated. The command line interface is the default local login method since no GUI (X Window
Server) is provided in the OS installation instructions, although there is no restriction against installing a GUI if
needed. Remote SSH login is also available by default.
Use whichever login method makes it easiest to copy and paste commands from this document.
The head node is configured as the primary node in the cluster and is set up to manage and install all compute
nodes.
2. At the CentOS 8 boot menu, Select “Install CentOS 8” and press Enter.
By default, “Test this media” is selected.
If prompted, press Enter again to continue. Installation will begin shortly if Enter is not pressed.
It will take about one minute for the Linux* kernel and the Anaconda installer to load.
5. Click Continue.
The Installation Summary menu will appear next.
Note that an SMT warning may be present at the bottom of the screen. If you do not want SMT active
on your head node, you can change BIOS settings after installation is complete.
If you are not installing the recipe on the exact hardware configuration in this recipe, your network interface
names may be different. Different firmware versions may also result in name changes. You will need to
substitute the interface names on your system with the ones used in this document.
The network interface(s) will be configured after installation.
For this implementation, “Automatically configure partitioning” will be used on a single system SSD.
Partitioning can be configured to meet your design requirements. For example, if you use a second SSD or
HDD for /home directories, it should be configured here.
7. Click Done.
If the drive isn’t empty, the INSTALLATION OPTIONS dialog will appear.
9. Confirm that the text under Software Selection reads “Minimal Install”.
b. In the Base Environment list box on the left side, select “Minimal Install”.
c. Click Done.
12. Enter the same password in Root Password and Confirm text boxes.
After completing configuration options, the installation should appear with no red text or warning
indicators.
Figure 2-3 Installation Summary
Wait for the installation to complete. A progress bar estimates the remaining time.
17. Click Reboot System when the installation is complete. Remove the CentOS* install media.
The system should already be configured to boot from the primary drive. Wait until reboot is complete
and the new OS is loaded.
Basic configuration and updates to the operating system are performed before installing any management or
application software.
The network interfaces were not configured during installation. This cluster network will also be the
messaging network on Ethernet-only installations.
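As a sketch of the management interface configuration described above, using the default values from the Document Conventions section (eno2, 192.168.1.100) and NetworkManager's nmcli; adjust the connection name and addressing to your site:

nmcli connection add type ethernet con-name mgmt ifname eno2 \
    ipv4.method manual ipv4.addresses 192.168.1.100/24
nmcli connection up mgmt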
If you need to disable SELinux, do that now. This document provides required steps to install the cluster with
SELinux enabled and enforcing.
Provisioning services for the cluster use DHCP, TFTP, and HTTP network protocols and default firewall rules
may block these services. However, disabling the firewall on an Internet-accessible device is not secure. The
following changes will allow all connections within the cluster while maintaining the default firewall
configuration on the external interface.
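One way to implement this, assuming the management interface is eno2 and firewalld is in use, is to place the internal interface in the trusted zone while leaving the external interface in its default zone:

firewall-cmd --permanent --zone=trusted --change-interface=eno2
firewall-cmd --reload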
For a cluster to function properly, the date and time must be synchronized on all nodes. It is strongly
recommended that a central network time server be available that is synchronized to an authoritative time
source.
You can list all available time zones by running timedatectl list-timezones.
b. For each time source in your network, add the following line to the file
server <timeserver_ip>
where <timeserver_ip> is the IP address of your timeserver. You can use the "pool" option to
reference a pool of timeservers.
c. Remove the line " pool 2.centos.pool.ntp.org iburst" if you will not use it.
Advisory
If the frontend does not have access to a time server, or if the time server is unreliable, it is necessary to
allow Chrony to act as the time server for compute nodes. This can be done by editing the /etc/chrony.conf
file and adding the line:
local stratum 10
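A minimal sketch of the resulting chrony setup, assuming a single site time server (the <timeserver_ip> placeholder is intentionally not filled in):

# /etc/chrony.conf fragment
#   server <timeserver_ip> iburst
#   local stratum 10        # only if no reliable upstream source is available
systemctl restart chronyd
chronyc sources             # verify the configured sources are reachable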
The OpenHPC community provides a release package that includes GPG keys for package signing and YUM
repository configuration. This package can be installed directly from the OpenHPC build server. In addition,
the head node must have access to the standard CentOS 8 and EPEL repositories. Mirrors are readily available
for both repositories.
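A sketch of installing the ohpc-release package directly from the OpenHPC build server; the exact URL below is an assumption based on the OpenHPC 2.x repository layout and should be checked against the current release:

dnf -y install \
    http://repos.openhpc.community/OpenHPC/2/CentOS_8/x86_64/ohpc-release-2-1.el8.x86_64.rpm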
The public EPEL repository is enabled automatically by the ohpc-release package. This requires that the
CentOS Extras repository is enabled, which is default.
By default, CentOS installs recommended dependencies in addition to required dependencies. Changing this
option will reduce the number of packages installed on the server.
This step is optional.
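If you choose to make this change, the dnf option is install_weak_deps; for example:

echo "install_weak_deps=False" >> /etc/dnf/dnf.conf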
A local repository is used to store additional non-OS rpms for installation by YUM or DNF. It simplifies
dependency resolution and allows simplified installation and removal of packages.
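A minimal sketch of creating such a repository; the /opt/local-repo path is a hypothetical location chosen for illustration:

dnf -y install createrepo_c
mkdir -p /opt/local-repo                # copy the additional RPMs here
createrepo_c /opt/local-repo
cat > /etc/yum.repos.d/local.repo <<'EOF'
[local]
name=Local non-OS packages
baseurl=file:///opt/local-repo
enabled=1
gpgcheck=0
EOF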
It is recommended to make all HPC node installations compliant with the HPC Platform specification written by
Intel. This step requires two additional software components not provided by YUM repositories.
The HPC Platform RPMs are installed on the head node to meet both the requirements and advisories.
The head node (frontend) is not added automatically to the hosts file, so it is done manually.
If there are additional devices on the network, other than nodes to be provisioned by the cluster manager, add
them here using the same steps. For example, the Intel® Omni-Path fabric switch or a managed Ethernet
switch for the management network may be added here.
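For example, using the default hostname and management address from the Document Conventions section, the head node entry can be appended as follows:

echo "192.168.1.100 frontend.cluster frontend" >> /etc/hosts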
While not always required, the head node is rebooted so that all updates and changes are activated. It is
necessary if the kernel is updated. This will also confirm there are no boot errors resulting from package
updates.
Warewulf is a software package used to create and control nodes in a cluster. Most packages are installed on a
provisioning node and other node images are created offline. Warewulf installs all other nodes in the cluster
from the provisioning node. In this guide, the cluster head node is configured as the provisioning node, and
then set up to manage and install all other cluster nodes.
The cluster installation is performed through a series of steps:
1. The head node is configured.
2. A base compute node image is created and configured.
3. Special-purpose node images, if needed, are created and configured.
4. Physical node information is added to the cluster manager.
5. The cluster is provisioned.
6. The base cluster is evaluated and validated.
7. Optional components are added to the cluster.
8. The complete cluster is evaluated and validated.
Packages that were part of the minimum installation have been updated to assure that they are the latest
version. Since new packages are downloaded directly from the online repository, they will be the latest
version.
To add support for provisioning services, install the base OpenHPC* group package followed by the Warewulf
provisioning system packages.
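A sketch of the base group package installation, assuming the ohpc-base meta-package name used by the OpenHPC repositories:

dnf -y install ohpc-base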
2. Install Warewulf
dnf -y install ohpc-warewulf
Advisory
If you encounter Bugzilla bug 1762314, where dnf reports a missing Perl package, you can correct it by
using the following command.
dnf -y module enable perl-DBI
Warewulf will need to be updated to match settings for this cluster. In addition, hybridizing the compute node
significantly reduces its image size—without impacting performance—if rarely accessed directories are
excluded from memory.
b. Update exclusions.
In the EXCLUDE section, modify the file so that the following lines, and only the following lines, are
included and uncommented. These directories and files will not be copied to the VNFS image.
exclude += /tmp/*
exclude += /var/log/[!m]*
exclude += /var/tmp/*
exclude += /var/cache
exclude += /opt/*
exclude += /home/*
hybridpath = /opt/ohpc/admin/images/%{name}
b. Add the following line at the beginning of the file, after the comment lines:
drivers += updates, updates/kernel
Additional configuration on the head node is needed. The head node will act as an NTP time source, NFS
server, and logfile collector for other nodes in the cluster.
The C compiler and additional tools are used by several of the standard and optional components. Singularity
is a Linux container solution, with support for both its own and Docker containers.
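As a sketch, assuming the GNU toolchain from the CentOS repositories and the singularity-ohpc package name from the OpenHPC repository:

dnf -y install gcc gcc-c++ make
dnf -y install singularity-ohpc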
The /home directory is shared as read/write, and the /opt directory is shared read-only across the cluster.
#WWEXPORT:/var/chroots:192.168.1.0/255.255.255.0
#/var/chroots 192.168.1.0/255.255.255.0(ro,no_root_squash)
#WWEXPORT:/usr/local:192.168.1.0/255.255.255.0
#/usr/local 192.168.1.0/255.255.255.0(ro,no_root_squash)
#WWEXPORT:/opt:192.168.1.0/255.255.255.0
/opt 192.168.1.0/255.255.255.0(ro,no_root_squash)
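After editing /etc/exports, the NFS server is enabled and the exports re-read; showmount offers a quick check that the shares are visible:

systemctl enable --now nfs-server
exportfs -a
showmount -e localhost   # /home and /opt should be listed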
System logging for the cluster can be consolidated to the head node to provide easy access and reduce the
memory requirements on the diskless compute node.
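A sketch of enabling log reception on the head node, assuming rsyslog with UDP transport on the default port 514:

# /etc/rsyslog.conf -- uncomment or add the UDP input module:
#   module(load="imudp")
#   input(type="imudp" port="514")
systemctl restart rsyslog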
The OpenHPC* build of Warewulf includes enhancements and enabling for CentOS 8. The wwmkchroot
command creates a minimal chroot image in the /opt/ohpc/admin/images/ subdirectory.
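A sketch of the image creation step; the centos-8 template name and the image path are assumptions to adjust to your installation:

wwmkchroot -v centos-8 /opt/ohpc/admin/images/centos-8.4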
To access the remote repositories by hostname (and not IP addresses), the chroot environment also needs to
be updated to enable DNS resolution. If the head node has a working DNS configuration in place, the chroot
environment can use this configuration file.
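For example, the head node's resolver configuration can simply be copied into the image (path assumed to match the image created above):

cp -p /etc/resolv.conf /opt/ohpc/admin/images/centos-8.4/etc/resolv.conf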
By default, the CentOS* Base repos point to the latest version of CentOS 8. Since this reference design is not
always tested against the latest version, the repos for a specific version of CentOS* and its updates are used
instead.
In addition, defining a permanent chroot location will simplify modification of the node image.
You will add new variables to the root .bash_profile file.
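A sketch of defining the CHROOT variable, assuming the image path used earlier; subsequent commands in this section reference $CHROOT:

echo "export CHROOT=/opt/ohpc/admin/images/centos-8.4" >> /root/.bash_profile
source /root/.bash_profile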
The node image is configured to use its own DNF configuration and repository information. Note the use of
the installroot and root flags in these commands. This instructs dnf and rpm to install to the specified chroot
as opposed to the functional filesystem on the head node.
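For example, installing the compute meta-package into the image (ohpc-base-compute is the OpenHPC meta-package name; the additional packages are illustrative):

dnf -y --installroot=$CHROOT install ohpc-base-compute
dnf -y --installroot=$CHROOT install kernel chrony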
Best practice for clusters dictates that packages should never be installed or removed directly on running
production nodes. Instead, update the node image and synchronize this to running systems.
The force option is never included in delete commands on the compute node, to avoid accidental deletion of
files from the frontend. Each file deletion will need to be confirmed so that files are removed only from the node image.
You may include the force (-f) option at your own risk.
Cross-mount the local repository from the head node into the chroot directory.
Update the initial packages from the online repositories to ensure they are at the latest version.
Basic client settings on the image are configured to match settings made earlier on the head node, including
authentication, log file consolidation, shared storage, and time synchronization. Resource management and
fabric support are added in their own sections.
43. Disable the firewall on the compute nodes, if it exists. If the firewall is not installed, this command will
report “unit firewalld.service does not exist”.
systemctl --root=$CHROOT disable firewalld.service
44. Ensure that the authorized_keys file has the correct permissions
chmod 600 $CHROOT/root/.ssh/authorized_keys
On the chroot image, add Network Time Protocol (NTP) support and identify the head node as the NTP server.
b. Delete or comment out all existing lines that begin with "pool" or “server”.
The /home directory is shared for read and write across the cluster. The /opt directory is shared for read-only
access to all nodes.
b. Comment out or remove the mounts for /var/chroots and /usr/local. When complete, the NFS
mounts will be:
192.168.1.100:/home /home nfs defaults 0 0
192.168.1.100:/opt /opt nfs defaults 0 0
Disable logging on compute nodes except for emergency and boot logs.
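A sketch of the corresponding rsyslog changes in the node image, assuming the head node address 192.168.1.100 and the legacy forwarding syntax; the sed expressions simply comment out the default local-file rules:

echo "*.* @192.168.1.100:514" >> $CHROOT/etc/rsyslog.conf
sed -i -e 's/^\*\.info/#\*\.info/' \
       -e 's/^authpriv/#authpriv/' \
       -e 's/^mail/#mail/' \
       -e 's/^cron/#cron/' \
       -e 's/^uucp/#uucp/' $CHROOT/etc/rsyslog.conf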
The full toolkit is installed, which includes compilers, libraries, and performance tools. It is also possible to
install only the runtime components; instructions for custom installations are found at:
https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-intel-oneapi-hpc-
linux/top.html
RPM packages for the oneAPI toolkit are available on a publicly accessible YUM repository. The configured
repository will permit simple upgrades to the latest version of the toolkit.
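A sketch of the repository configuration and toolkit installation, using the publicly documented Intel oneAPI YUM repository; the intel-hpckit meta-package name is an assumption to verify against current Intel documentation:

cat > /etc/yum.repos.d/oneAPI.repo <<'EOF'
[oneAPI]
name=Intel oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
EOF
dnf -y install intel-hpckit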
50. Install the RPM signature key from the Internet repository.
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2023.PUB
The oneAPI toolkit includes Tcl-based environment modules that allow users to quickly change their
development environment. OpenHPC uses the Lmod environment module system, which is backward
compatible with most Tcl environment modules.
To set up oneAPI environment modules, a shell script is provided in the main oneAPI directory.
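A sketch of running that script and making the generated modules visible to Lmod; the paths assume a default oneAPI installation under /opt/intel/oneapi:

cd /opt/intel/oneapi && ./modulefiles-setup.sh
module use /opt/intel/oneapi/modulefiles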
Lmod does not support absolute environment module paths. Some modules must be updated with a
workaround for incompatible features.
proc ModulesHelp { } {
global version
puts stderr "\nThis module loads the Intel compiler environment.\n"
puts stderr "\nSee the man pages for icc, icpc, and ifort for detailed information"
puts stderr "on available compiler options and command-line syntax."
puts stderr "\nVersion $version\n"
}
setenv ACL_SKIP_BSP_CONF 1
family "compiler"
63. Create an environment module to load the oneAPI MPI Library toolchain.
proc ModulesHelp { } {
global version
puts stderr "\nThis module loads the Intel MPI environment.\n"
puts stderr " mpiifort (Fortran source)"
puts stderr " mpiicc (C source)"
puts stderr " mpiicpc (C++ source)\n"
puts stderr "Version $version\n"
}
family "MPI"
New distro releases include Omni-Path drivers and basic tools. For most applications, this should be sufficient.
When complete, reboot the head node to confirm that Omni-Path software is active, and that the subnet
manager is running.
These packages can be installed in a single dnf command. They are only separated here for clarity.
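A sketch of the installation on the head node, assuming the distribution package names for the Omni-Path tools and fabric manager:

dnf -y install opa-basic-tools libpsm2
dnf -y install opa-fm              # fabric (subnet) manager, run on the head node
systemctl enable opafm.service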
A configuration file will be created on the head node. For IPoIB interfaces on compute nodes, the template is
imported and copied to each node during provisioning.
b. Add the following line to the file, immediately after the current frontend entry.
192.168.5.100 frontend-ib0.cluster frontend-ib0
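A sketch of the head node ifcfg-ib0 file described above, using the default messaging address and the ifcfg network-scripts style:

cat > /etc/sysconfig/network-scripts/ifcfg-ib0 <<'EOF'
DEVICE=ib0
TYPE=InfiniBand
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.5.100
NETMASK=255.255.255.0
CONNECTED_MODE=yes
MTU=65520
EOF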
You will import a template network script for Intel® Omni-Path interface that is specifically designed to be
deployed by Warewulf* for compute nodes.
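A sketch of importing the template into the Warewulf datastore and setting its destination path; the example file path is the one shipped with the OpenHPC Warewulf packages:

wwsh file import /opt/ohpc/pub/examples/network/centos/ifcfg-ib0.ww
wwsh -y file set ifcfg-ib0.ww --path=/etc/sysconfig/network-scripts/ifcfg-ib0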
74. Enable connected mode for the compute nodes and enable setting non-default MTUs
77. Update the system settings for optimized IP over Fabric performance on Omni-Path
c. Add the following entry to the end of the line. The comma is added directly after the last entry;
delete any trailing spaces first.
, ifcfg-ib0.ww
This is the last reboot. All changes could be manually activated, but a reboot will confirm that all services start
with configuration changes applied correctly. This is important to check, in the event the head node is
rebooted in the future.
Warewulf* employs a two-stage process for provisioning nodes. First, a bootstrap is used to initialize the
kernel and install process, then an encapsulated image containing the full system is loaded on the node.
The bootstrap image includes the runtime kernel and associated modules, as well as simple scripts for
provisioning.
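A sketch of building the bootstrap image from the kernel currently running on the head node:

wwbootstrap $(uname -r)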
With the local site customizations in place, the following step uses the wwvnfs command to assemble a VNFS
capsule from the chroot environment defined for the compute instance.
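A sketch of the VNFS assembly, assuming the CHROOT variable defined earlier:

wwvnfs --chroot $CHROOT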
In the steps below, the variable “XX” is used in hostnames, node IPs, and MAC addresses. It must be replaced
by the node number (01-99).
Note
In the step below, XX is used in both IP addresses and hostnames. Omit leading zeroes for
IP addresses otherwise Warewulf* will interpret the first zero to mean the number is in octal
base. It is safe (and recommended) to include leading zeroes in hostnames.
Associations for nodes, boot images, and vnfs images are configured in the file:
84. Repeat the following command for each compute node in the cluster.
This will add Intel® Xeon server nodes to the Warewulf datastore. Replace <mac_address> with the
correct value for that node.
wwsh -y node new cXX --ipaddr=192.168.1.XX --hwaddr=<mac_address>
86. Repeat the following command once for each compute node in the cluster with Intel® Omni-Path
Fabric.
Add the IPoIB interface to each node. Also increase the MTU of the ib0 interface to 65,520 bytes, up from
the default of 2,044 bytes.
wwsh -y node set cXXX -D ib0 --ipaddr=192.168.5.XXX --netmask=255.255.255.0 \
--network=192.168.5.0 --mtu 65520
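A sketch of associating the bootstrap, VNFS image, and provisioned files with all compute nodes; the image names and file list are assumptions to match the choices made earlier:

wwsh -y provision set "c*" --vnfs=centos-8.4 --bootstrap=$(uname -r) \
     --files=dynamic_hosts,passwd,group,shadow,network,ifcfg-ib0.ww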
Intel® Cluster Checker (CLCK) is one of many tools installed as part of the Intel® oneAPI HPC Toolkit. It is a
powerful tool designed to assess cluster health. It can also be installed independently if Intel Parallel Studio
Runtimes are installed.
Intel® Cluster Checker provides a convenient suite of diagnostics that can be used to aid in isolating hardware
and software problems on an installed cluster.
c. Change the network interface for the head node. It should be:
<network_interface>ens513f0</network_interface>
If this user does not already exist, the user is created. While Cluster Checker can be run as the root user, it is
more appropriate to evaluate functionality as a normal system user.
5. Synchronize files
wwsh file resync
The nodefile defines the cluster nodes to check. For each node in the cluster, there is a line in the nodefile that
lists the node hostname and defines the role of that node in the cluster. In addition, it also defines the
subcluster to which the node belongs.
Each line has the format:
hostname # role: <role_name>
The next steps create this list in the clck user home directory. The default role is “compute” and it can be
omitted for compute nodes.
b. List the short hostname of every node in the cluster, one per line, including the head node.
d. For each compute node, enter the hostname and add on the same line:
# role: compute
The completed file should look similar to this:
frontend # role: head
c01 # role: compute
c02 # role: compute
c03 # role: compute
c04 # role: compute
Intel® Cluster Checker executes in two phases. In the data collection phase, Intel® Cluster Checker collects data
from the cluster for use in analysis. In the analysis phase, Intel® Cluster Checker analyzes the data in the
database and produces the results of analysis. It is possible to invoke these phases together or separately and
to customize their scope.
The clck command executes data collection followed immediately by analysis and displays the results of
analysis. It is also possible to run data collection and analysis separately, using the commands clck-collect
and clck-analyze.
The following commands are run as the clck user.
10. Execute the data collection and analysis with the health framework definition.
This will gather all information required to evaluate the overall health and conformance of the cluster,
including consistency, functionality, and performance.
clck -f nodefile -F health_extended_user
The analyzer returns the list of checks performed, the list of nodes checked, and the results of the
analysis into the file clck_results.log. The results are a synopsis of the cluster health.
If applications are built with Intel® Compilers or are dynamically linked to Intel® Performance Libraries*, the
Intel® Parallel Studio XE runtime should be made available on the target platform. Otherwise, the application
must provide all required libraries in its installation. Intel® OneAPI runtimes should provide the required
runtimes, but these instructions are provided if earlier versions are required.
The runtime packages are available at no cost and are available through online repositories.
Packages are installed on the head node only and shared through NFS.
3. Install the main package. This will install all required packages and dependencies.
dnf -y install intel-psxe-runtime
Slurm requires a system user for the resource management daemons. The default configuration file supplied
with the OpenHPC build of Slurm requires that the "slurm" user account exist.
Normal user SSH access to compute nodes is restricted by PAM (pluggable authentication module) via Slurm.
Since this may not be ideal in certain instances, a user group is created for which these restrictions do not
apply.
The global Slurm configuration file and the cryptographic key that is required by the munge authentication
library have to be available on every host in the resource management pool.
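A sketch of distributing the munge key through Warewulf so it is provisioned to every compute node; it assumes the key already exists at the default location on the head node:

wwsh file import /etc/munge/munge.key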
This procedure depends on the hardware configuration of the cluster. The Slurm configuration file needs to
be updated with the node names of the compute nodes, the properties of their processors, and the Slurm
partitions, or queues, they belong to. The relevant parameters are listed below; a sketch of example definitions follows the list.
• NodeName: compute nodes name
• Sockets: number of physical CPUs in a compute node.
• CoresPerSocket: number of CPU cores per physical CPU.
• ThreadsPerCore: number of threads that can be executed in parallel on one core.
• PartitionName: Slurm partition (also called a "queue"). Users can access the resources on compute nodes
assigned to the partition. The name of the partition will be visible to the user.
• Nodes: NodeNames that belong to a given partition.
• Default: The default partition, which is used when a user doesn’t explicitly specify a partition.
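A sketch of the NodeName and PartitionName lines for the four-node cluster in this Bill of Materials; the socket, core, and thread counts below are placeholders, so use the values reported by running slurmd -C on a compute node:

NodeName=c[01-04] Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 State=UNKNOWN
PartitionName=normal Nodes=c[01-04] Default=YES MaxTime=24:00:00 State=UP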
For a cluster based on the Bill of Materials of this Reference Design, the Slurm configuration file needs to be
updated. This will require two changes.
1. Add the host name of the Slurm controller server to the Slurm configuration file.
2. Update the NodeName definition to reflect the hardware capabilities.
Other configuration changes will also be made.
b. Replace it with
ControlMachine=frontend
23. Update the client so that configuration is pulled from the server.
echo SLURMD_OPTIONS="--conf-server 192.168.1.100" > $CHROOT/etc/sysconfig/slurmd
27. Allow unrestricted SSH access for the users in the ssh-allowed group.
Users not belonging to ssh-allowed will be subject to SSH restrictions.
Once the resource manager is enabled for production use, users should be able to run jobs. To test this, create
and use a “test" user on the head node. Then compile and execute an application interactively through the
resource manager.
Note the use of prun for parallel job execution, which wraps the underlying native job launch
mechanism being used.
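A sketch of such an interactive test, assuming the environment modules created earlier provide the Intel compiler and MPI (module names will vary) and that the test user is synchronized to the compute nodes:

useradd -m test
wwsh file resync passwd shadow group
su - test
module load compiler mpi            # adjust to the module names created earlier
cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpiicc hello.c -o hello             # Intel MPI compiler wrapper
srun -n 8 -N 2 --pty /bin/bash      # request an interactive allocation
prun ./hello                        # launch through the OpenHPC prun wrapper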