A Comprehensive Guide To HPC in The Data Center
To determine whether your organization could benefit from HPC, consider the complexity of
computing tasks your data center handles, the budget you can feasibly allocate toward
obtaining and running numerous servers, the expertise and training requirements of your
staff and the size of your data center facilities. Data centers that handle compute-intensive
tasks, such as government resource tracking or financial risk modeling, are among the strongest candidates for HPC.
By understanding the major requirements and limiting factors for an HPC infrastructure, you
can determine whether HPC is right for your business and how best to implement it.
Only a thorough understanding of use cases, utilization and return on investment metrics
leads to successful HPC projects.
You can scale up that simple model with multiple primary nodes that each support many
worker nodes, which means the typical HPC deployment consists of multiple servers --
usually virtualized to multiply the number of effective servers available to the cluster. The
dedicated cluster network also requires high-bandwidth TCP/IP network gear, such as Gigabit
Ethernet NICs and switches. The number of servers and switches depends on the size of the
cluster, as well as the capability of each server.
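The sketch below illustrates how a job might be divided across the worker nodes of such a cluster. The article does not prescribe a particular programming model, so take this as one illustrative option; it assumes the mpi4py library and an MPI runtime are installed on the cluster.

```python
# Minimal sketch of distributing work across HPC cluster nodes with MPI.
# Assumes mpi4py and an MPI runtime (e.g., Open MPI) are installed;
# launch with something like: mpirun -n 8 python partial_sums.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the cluster job
size = comm.Get_size()   # total number of processes across all nodes

# Each worker computes a partial sum over its own slice of the problem.
N = 1_000_000
local_total = sum(range(rank, N, size))

# The primary process (rank 0) gathers and combines the partial results.
grand_total = comm.reduce(local_total, op=MPI.SUM, root=0)
if rank == 0:
    print(f"Sum of 0..{N-1} computed across {size} processes: {grand_total}")
```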
A business new to HPC often starts with a limited hardware deployment scaled to just a few
racks, and then scales out the cluster later. You can limit your initial number of servers and switches, then add more as workloads grow.
Some organizations might adopt an HPC framework such as Hadoop to manage their HPC
clusters. Hadoop includes components such as the Hadoop Distributed File System (HDFS),
Hadoop Common, MapReduce and YARN, which offer many of the same functions listed above.
HPC projects require an output, which can take the form of visualization, modeling or other
reporting software that delivers computing results to administrators. Tools such as Hunk and
Platfora serve this purpose.
HPC deployments require a careful assessment of data center facilities and detailed
evaluations of system power and cooling requirements versus capacity. If the facilities are
inadequate for an HPC deployment, you must seek alternatives to in-house HPC.
Facilities challenges. Available physical data center space, power and cooling required to
handle additional racks filled with servers and network gear limit many organizations hoping
to implement HPC. Server upgrades can help. By deploying larger and more capable servers
to support additional VMs, you can effectively add more HPC "nodes" without adding more
physical servers. In addition, grouping VMs within the same physical server can ease network
demands, because co-located VMs exchange data without crossing the physical cluster network.
Adding racks of high-density servers can add a considerable cooling load to a data center's
air handling system. When additional cooling isn't available, evaluate colocation or cloud
options, or consider advanced cooling technologies such as immersion cooling for HPC racks.
Businesses with frequent but modest HPC tasks can choose to build and maintain a limited
HPC cluster for the convenience and security of local data processing, and still turn to the
public cloud for occasional, more demanding HPC projects that they cannot support in-house.
Compare Hadoop vs. Spark vs. Kafka for your big data strategy
A big data framework is a collection of software components that can be used to build a
distributed system for the processing of large data sets, comprising structured,
semistructured or unstructured data. These data sets can be from multiple sources and
range in size from terabytes to petabytes to exabytes.
The chief components of Apache Hadoop are the Hadoop Distributed File System (HDFS) and
a data processing engine that implements the MapReduce program to filter and sort data.
Also included is YARN, a resource manager for the Hadoop cluster.
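To make the MapReduce model concrete, here is a minimal word-count sketch written as a mapper and a reducer that could be submitted through Hadoop Streaming. The script names, paths and submission details are illustrative assumptions, not details from the article.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word read from standard input.
# Hadoop Streaming pipes each input split through this script.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

Hadoop sorts the mapper output by key before it reaches the reducer, so identical words arrive consecutively:

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word emitted by the mapper.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is typically submitted with the hadoop jar command pointing at the Hadoop Streaming JAR, passing -mapper, -reducer, -input and -output arguments; the exact paths depend on the installation.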
Apache Spark can also run on HDFS or an alternative distributed file system. It was
developed to perform faster than MapReduce by processing and retaining data in memory
for subsequent steps, rather than writing results straight back to storage. This can make
Spark up to 100 times faster than Hadoop for smaller workloads.
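The PySpark sketch below shows that in-memory approach: an intermediate dataset is cached once and then reused by several actions without being written back to storage. The session name and input path are placeholders.

```python
# Minimal PySpark sketch: cache an intermediate result in memory and reuse it.
# Assumes pyspark is installed; the input path is a placeholder.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()

words = (
    spark.read.text("hdfs:///data/logs/*.txt")            # placeholder path
    .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .filter(F.col("word") != "")
)

words.cache()  # keep the parsed words in memory for the steps below

total_words = words.count()                                # first pass materializes the cache
top_words = words.groupBy("word").count().orderBy(F.desc("count")).limit(10)
top_words.show()                                           # second pass is served from memory

spark.stop()
```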
WHAT IS THE DIFFERENCE BETWEEN HADOOP AND KAFKA?
Apache Kafka is a distributed event streaming platform designed to process real-time data
feeds. This means data is processed as it passes through the system.
Like Hadoop, Kafka runs on a cluster of server nodes, making it scalable. Some server nodes
form a storage layer, called brokers, while others handle the continuous import and export
of data streams.
Strictly speaking, Kafka is not a rival platform to Hadoop. Organizations can use it alongside
Hadoop as part of an overall application architecture where it handles and feeds incoming
data streams into a data lake for a framework, such as Hadoop, to process.
Because of its ability to handle thousands of messages per second, Kafka is useful for
applications such as website activity tracking or telemetry data collection in large-scale IoT
deployments.
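A small sketch of that pattern follows, using the kafka-python client to publish website activity events. The broker address, topic name and event fields are assumptions for illustration, not values from the article.

```python
# Minimal Kafka producer sketch using the kafka-python client.
# Broker address, topic name and event fields are illustrative assumptions.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Publish a stream of page-view events to the "site-activity" topic.
for page in ["/home", "/pricing", "/docs"]:
    event = {"page": page, "ts": time.time(), "user": "anonymous"}
    producer.send("site-activity", value=event)

producer.flush()   # make sure buffered events reach the brokers
producer.close()
```

A consumer on the other side of the topic could then feed these events into a data lake for a framework such as Hadoop or Spark to process.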
Kafka was originally developed at social network LinkedIn to analyze the connections among
its millions of users. It is perhaps best viewed as a framework capable of capturing data in
real time from numerous sources and sorting it into topics to be analyzed for insights into
the data.
That analysis is likely to be performed using a tool such as Spark, which is a cluster
computing framework that can execute code developed in languages such as Java, Python or
Scala. Spark also includes Spark SQL, which provides support for querying structured and
semistructured data.
Apache Hive enables SQL developers to use Hive Query Language (HQL) statements that are
similar to standard SQL employed for data query and analysis. Hive can run on HDFS and is
best suited for data warehousing tasks, such as extract, transform and load (ETL), reporting
and data analysis.
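To make the comparison concrete, the sketch below runs a SQL-style aggregation through Spark SQL; an equivalent HQL statement in Hive would read much the same. The table, column names and path are invented for illustration.

```python
# Spark SQL sketch: register a DataFrame as a temporary view and query it with SQL.
# Table, column names and the input path are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-etl-demo").getOrCreate()

sales = spark.read.json("hdfs:///data/sales/*.json")   # placeholder path
sales.createOrReplaceTempView("sales")

summary = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
""")
summary.show()

spark.stop()
```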
Which processing units for AI does your organization require?
Equipping servers with GPUs has become one of the most common infrastructure
approaches for AI. You can use the massively parallel architecture of a GPU chip to
accelerate the bulk floating-point operations involved in processing AI models.
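As an illustration of offloading that kind of floating-point work to a GPU, the sketch below uses PyTorch, which is an assumption on my part rather than a framework named in the article; it falls back to the CPU when no CUDA-capable GPU is present.

```python
# Sketch of offloading a bulk floating-point operation to a GPU with PyTorch.
# PyTorch is an illustrative choice, not one named in the article; the code
# falls back to the CPU when no CUDA-capable GPU is available.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A large matrix multiplication -- the kind of dense floating-point work
# that a GPU's massively parallel architecture accelerates.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b

print(f"Computed a {c.shape[0]}x{c.shape[1]} product on {device}")
```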
Field-programmable gate arrays (FPGAs) are essentially chips crammed with logic blocks that
you can configure and reconfigure as required to perform different functions. Application-specific
integrated circuits (ASICs) have logic functions built into the silicon during manufacturing.
Both can act as hardware accelerators for AI workloads.
Although GPUs and similar types of hardware accelerators get the most attention when it
comes to AI, CPUs remain relevant for many areas of AI and machine learning. For example,
Intel has added features to its server CPUs to help accelerate AI workloads. The latest Xeon
Scalable family includes Intel Deep Learning Boost, which adds new instructions to
accelerate the kind of calculations involved in inferencing. This means that these CPUs can
accelerate certain AI workloads with no additional hardware required.
STORAGE FOR AI
Organizations should not overlook storage when it comes to infrastructure to support AI.
Training a machine learning model requires a huge volume of sample data, and systems
must be fed data as fast as they can take it to keep performance up.
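On the software side, a common way to keep the compute layer fed is to read and prefetch training samples in parallel. The sketch below uses PyTorch's DataLoader as an illustrative example under that assumption; the dataset here is synthetic rather than read from real storage.

```python
# Sketch of parallel data loading so storage and I/O keep accelerators fed.
# Uses PyTorch's DataLoader as an illustrative example; the dataset is synthetic.
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Stand-in for a large training set that would be read from fast storage.
    samples = torch.randn(10_000, 3, 224, 224)
    labels = torch.randint(0, 10, (10_000,))
    dataset = TensorDataset(samples, labels)

    loader = DataLoader(
        dataset,
        batch_size=64,
        shuffle=True,
        num_workers=4,       # parallel worker processes read ahead of the GPU
        pin_memory=True,     # page-locked memory speeds host-to-GPU transfers
        prefetch_factor=2,   # batches each worker keeps queued
    )

    for batch, targets in loader:
        pass  # the training step for each batch would go here

if __name__ == "__main__":
    main()   # guard matters because data-loading workers may be spawned as subprocesses
```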
Several specialized systems offer higher performance for AI workloads. Nvidia builds its DGX
servers around its GPUs, with an architecture optimized to keep those GPUs fed with data.
Storage vendors have also partnered with Nvidia to provide validated reference
architectures that pair high-performance storage arrays with Nvidia DGX systems. For
example, DDN optimized its Accelerated, Any-Scale AI portfolio for all types of access
patterns and data layouts used in training AI models, and vendors such as NetApp and Pure
Storage offer similar storage architectures.