
Case Study

Modeling the impact of CPU properties to optimize and predict packet-processing performance
Intel and AT&T have collaborated in a proof of concept (POC) to model and analyze
the performance of packet-processing workloads on various CPUs. This POC has
established a highly accurate model that can be used to simulate edge router
workloads on x86 systems. Developers and architects can use this methodology to
more accurately predict performance improvements for network workloads across
future CPU product generations and hardware accelerators. This approach can help
developers gain better insight into the impacts of new software and
x86 hardware components on packet processing throughput.


AUTHORS
AT&T authors: Kartik Pandit, Vishwa M. Prasad
Intel authors: Bianny Bian, Atul Kwatra, Patrick Lu, Mike Riess, Wayne Willey, Huawei Xie, Gen Xu

Challenges in packet-processing
Today's packet-processing devices face an enormous performance challenge. Not only is more and more data being transmitted, but packet processing tasks — which need to be executed at line speed — have become increasingly complex.
Along with the usual forwarding of data, packet processing systems are also responsible for other functions. These functions include traffic management (shaping, timing, scheduling), security processing, and quality of service (QoS).
Making the issue even more of a challenge is the proliferation of internet devices and sensors. Data is now often produced faster than it can be transmitted, stored, or processed. Most of this data is transmitted as packet streams over packet-switched networks. Along the way, various network devices — such as switches, routers, and adapters — are used to forward and process the data streams.

At the same time, technology and customer requirements are also changing rapidly. These changes are forcing vendors to seek data-streaming software solutions that deliver shorter development cycles. Unfortunately, shorter development cycles often mean that developers must make more revisions to update initial design shortcomings, and/or allow less time between required updates.
Overall, this is a complex set of conditions that creates the challenge of optimizing and predicting packet-processing performance in future architectures.

Network virtualization
Network function virtualization (NFV) refers to a network infrastructure concept. NFV virtualizes network node functions to make it easier to deploy and manage network services. With NFV, service providers can simplify and speed up scaling new network functions and applications, and better use network resources.
A virtualized network function (VNF) refers to a virtualized function used in an NFV architecture. These functions used to be carried out by dedicated hardware. With VNF, they are virtualized, performed by software, and run in one or more virtual machines. Common VNFs include routing, load balancing, caching, intrusion detection devices, and firewalls.

Proposed solution: Network function virtualization (NFV)
One proposed solution to the challenge of data streaming is network function virtualization (NFV).
To explore the effectiveness of this potential NFV solution, Intel and AT&T have collaborated in a proof of concept (POC) project. In this POC, we focused on modeling and analyzing the impact of different CPU properties on packet processing workloads.1
We performed an extensive analysis of packet processing workloads via detailed simulations on various microarchitectures. We then compared the simulation results to measurements made on physical hardware systems. This allowed us to validate the accuracy of both our model and the POC simulations.

POC results
The results of our POC show that performance scales linearly with the number of cores, and scales almost linearly with core frequency.1 Packet size did not have a significant performance impact in terms of packets per second (PPS) on our workload, nor did the size or performance of the LLC cache.1
Our study and detailed results tell us that, when selecting a hardware component for an edge router workload, developers should consider prioritizing core count and frequency.
One of the key components in our POC was our use of Intel® CoFluent™ Technology (Intel® CoFluent™), a modeling and simulation tool for both hardware and software. We found our Intel CoFluent model to be highly accurate. The average difference (delta) between the simulation predictions and comparative measurements made on actual physical architectures was under 4%.1
Because of this, we believe similar models could help developers make faster, better choices of components for new designs and optimizations. In turn, this could help developers reduce risk, minimize time to market for both new and updated products, and reduce overall development costs.

Table of Contents
Challenges in packet-processing ................................................................... 1 Results and model validation ...................................................................... 15
Proposed solution: Network function virtualization (NFV) ............................. 2 Establishing a baseline for simulation accuracy ............................................. 15
POC results ...................................................................................................... 2 Measuring performance at different core frequencies .................................... 15
Analyzing performance for different cache sizes ........................................... 16
Proof of concept for network function virtualization (NFV) ...................... 4
Measuring performance for different numbers of cores ................................. 17
Goals of the POC ............................................................................................. 4
Simulating hardware acceleration components .............................................. 17
Predictive model to meet POC goals ................................................................ 4
Performance sensitivity from generation to generation .................................. 18
Key components used in this POC ................................................................ 4 Core cycles per instruction (CPI) .............................................................. 18
Network processing equipment ........................................................................ 4 Maximum capacity at the LLC level ......................................................... 18
Packet processing workload ............................................................................. 4 Ideal world versus reality .......................................................................... 19
DPDK framework for packet processing ......................................................... 5 Performance sensitivities based on traffic profile .......................................... 19
Hardware and software simulation tools .......................................................... 5 Performance scaled linearly with the number of cores .............................. 19
Traditional hardware and software simulation tools .................................... 5 Execution flow was steady ........................................................................ 19
Better solutions model both hardware and software .................................... 5 Next steps...................................................................................................... 19
Intel® CoFluent™ Technology........................................................................ 6
Simulation at a functional level ................................................................... 6 Summary ...................................................................................................... 20
Layered and configurable architecture......................................................... 6 Key findings .................................................................................................. 20
CPU frequency .......................................................................................... 20
Upstream and downstream workloads ......................................................... 6
LLC cache ................................................................................................. 20
Upstream pipeline ............................................................................................ 6 Performance .............................................................................................. 20
Execution flow of the upstream pipeline ..................................................... 7 Packet size................................................................................................. 20
6 Stages in a typical upstream pipeline ........................................................ 8 Conclusion ..................................................................................................... 20
Downstream pipeline ....................................................................................... 8
Appendix A. Performance in the downstream pipeline ............................ 21
Physical system setup ..................................................................................... 8
Packetgen in the physical DUT ........................................................................ 9 Appendix B. Acronyms and terminology ................................................... 22
Generating a packet-traffic profile ................................................................. 10
Appendix C. Authors ................................................................................... 23
Performance and sensitivities of the traffic profile ......................................... 10
Hyperthreading was disabled to expose the impact of other elements ....... 10
Lookup table size affected performance .................................................... 10 List of tables
Developing the Intel CoFluent simulation model....................................... 11 Table 1. Test configuration based on the pre-production
Simulating the packetgen ............................................................................... 11 Intel® Xeon® processor, 1.8GHz (Skylake) ................................ 9
Simulating the network .................................................................................. 12 Table 2. Test configuration based on the Intel® Xeon® E5-2630,
Modeling the upstream pipeline ..................................................................... 12 2.2GHz (Broadwell) ..................................................................... 9
Implementing lookup algorithms ................................................................... 12 Table 3. Test configuration based on the Intel® Xeon® processor
Developing a model of the cost of performance ............................................. 12 E5-2680, 2.5 GHz (Haswell) ...................................................... 10
Hardware performance considerations ...................................................... 13 Table A-1. Test configuration based on the Intel® Xeon® processor
Impact of cache on pipeline performance .................................................. 13 Gold 6152, 2.1 GHz .................................................................. 21
Characterizing the cost model ................................................................... 13
Establishing the execution cost of the model ............................................. 14 List of figures
Simulation constraints and conditions ............................................................ 14 Figure 1. Edge router with upstream and downstream pipelines ................. 5
Support for the upstream pipeline.............................................................. 14 Figure 2. Edge router’s upstream software pipeline .................................... 6
Cache analysis supported for LLC ............................................................ 14 Figure 3. Edge router’s downstream software pipeline ............................... 6
Dropping or dumping packets was not supported ...................................... 14 Figure 4. Model development ................................................................... 11
Critical data paths simulated...................................................................... 14 Figure 5. Model of upstream software pipeline ......................................... 12
Hardware analysis and data collection ....................................................... 14 Figure 6. Simple models for estimating cycles per instruction .................. 13
Hardware analysis tools ................................................................................. 14 Figure 7. Baseline performance measurements. ........................................ 15
Event Monitor (EMON) tool ..................................................................... 14 Figure 8. Performance measured at different frequencies ......................... 16
Sampling Enabling Product (SEP) tool ..................................................... 14 Figure 9. Performance measured for different LLC cache sizes ................ 16
EMON Data Processor (EDP) tool ............................................................ 14 Figure 10. Measuring performance on multi-core CPUs ............................. 17
Collecting performance-based data ................................................................ 15 Figure 11. Performance comparison of baseline configuration versus a
Workload performance measurements and analysis for model inputs ............ 15 simulation that includes an FPGA-accelerated ACL lookup ...... 17
Figure 12. Comparison of performance from CPU generation
to generation .............................................................................. 18
Figure A-1. Throughput scaling, as tested on the Intel® Xeon®
processor E5-2680 ..................................................................... 21
Figure A-2. Throughput scaling at 2000 MHz on different architectures ...... 21

Proof of concept for network function virtualization (NFV)

Upstream and downstream pipeline traffic
Upstream traffic is traffic that moves from end users toward the network's core. Downstream traffic is traffic that moves toward the end user.

The goal of this joint Intel-AT&T POC was to generate information that could help developers choose designs that could be best optimized for network traffic — and make such choices faster and more accurately. We also wanted to generate information that would help developers predict packet-processing performance for future architectures more accurately, for both hardware and software developments.

Goals of the POC
Our joint Intel-AT&T team had two main goals:
• Quantify and measure the performance of the upstream traffic pipeline of the Intel® Data Plane Development Kit (DPDK) virtual provider edge router (vPE).
• Identify CPU characteristics and components that have a significant impact on network performance for workloads running on an x86 system.
In this paper, we quantify the vPE upstream traffic pipeline using the Intel CoFluent modeling and simulation solution. We validated our model by comparing the simulation results to performance measurements on physical hardware.

Predictive model to meet POC goals
To achieve our goals, our team needed to develop a highly accurate Intel CoFluent simulation model of the Intel DPDK vPE router workload. Such a model would help us characterize the network traffic pipelines. It would also allow us to project more accurate optimizations for future CPU product generations.
To do this, we first developed a predictive model based on performance data from an existing x86 hardware platform. We then compared network performance on that physical architecture to the performance projected by our simulation. These comparative measurements would help us determine the accuracy of our simulation model. A high degree of accuracy would help build confidence in using Intel CoFluent to effectively characterize network function virtualization workloads.
For this POC, our team focused on modeling upstream pipelines. Upstream traffic moves from end users toward the network's core (see Figure 1, next page). In the future, we hope to develop a similar predictive model to analyze the performance of downstream pipelines, where traffic moves toward the end user.
Longer term, our goal is to use these predictive models and simulations to identify performance bottlenecks in various designs of architecture, microarchitecture, and software. That future work would model both upstream and downstream pipelines. We hope to use that knowledge to recommend changes that will significantly improve the performance of future x86 architectures for packet processing workloads. This information should make it easier for developers to choose components that will best optimize NFV workloads for specific business needs.

Key components used in this POC
For this NFV POC, we needed to identify the critical hardware characteristics that had the most impact on the network processing equipment and the packet processing workload. To do this, we modeled and simulated a hardware system as typically used for NFV.

Network processing equipment
Network processing equipment can usually be divided into three categories:
• Easily programmable, general CPUs
• High performance (but hardwired) application-specific integrated circuits (ASICs)
• Middle-ground network-processing units (NPUs), such as field-programmable gate arrays (FPGAs)
Of those three categories, we focused this POC on the impact of general CPU characteristics on packet processing throughput.

Packet processing workload
Packet-processing throughput is dependent on several hardware characteristics. These include:
• CPU speed
• Number of programmable cores in the CPU
• Cache size, bandwidth, and hit/miss latencies for level 1 cache (L1), level 2 cache (L2), and level 3 cache (L3; also called last level cache, or LLC)
• Memory bandwidth and read/write latencies
• Network interface card (NIC) throughput
In our POC, the packet processing workload is the DPDK vPE virtual router.

Figure 1. Edge router with upstream and downstream pipelines. Upstream traffic moves from end users toward the network’s core.
Downstream traffic moves toward the end user(s).

DPDK framework for packet processing
The Intel DPDK is a set of libraries and drivers for fast packet processing. The DPDK packet framework gives developers a standard methodology for building complex packet processing pipelines. The DPDK provides pipeline configuration files and functional blocks for building different kinds of applications. For example, for our POC, we used the functions to build our internet protocol (IP) pipeline application.
One of the benefits of DPDK functions is that they help with the rapid development of packet processing applications that run on multicore processors. For example, in our POC, the edge router pipeline is built on the IP pipeline application (based on the DPDK functions), to run on our four physical hardware DUTs. Our IP pipeline models a provider edge router between the access network and the core network (see Figure 1).

Hardware and software simulation tools
Optimizing a design for network traffic is typically done using traditional simulation tools and a lot of manual effort. We were looking for a better approach that would make it easier and faster for developers to choose the best components for their needs.

Traditional hardware and software simulation tools
For system analysis, traditional simulation-based modeling tools range from solely software-oriented approaches to solely hardware-oriented approaches. Unfortunately, these traditional tools have not been able to meet the complex performance challenges driven by today's packet-processing devices.
At one end of the traditional analysis spectrum are the software-oriented simulations. In these simulations, software behavior and interactions are defined against a specific execution time. However, solutions based solely on software analyses do not take hardware efficiency into consideration. Hardware efficiency has a significant impact on system performance.
At the other end of the spectrum are hardware-oriented simulators. These simulators model system timing on a cycle-by-cycle basis. These models are highly accurate, but suffer from very slow simulation speeds. Because of this, they are not usually used to analyze complete, end-to-end systems. Instead, they are used mainly for decision-making at the microarchitecture level.

Better solutions model both hardware and software
Solutions that model only software performance, or that model only hardware performance, are not effective for modeling the performance of a complete system. The best solution for modeling a complete system would be:
• Highly configurable
• Able to simulate both software and hardware aspects of an environment
• Easy to execute without the overhead of setting up actual packet-processing applications

Figure 2. Edge router’s upstream software pipeline.

Figure 3. Edge router’s downstream software pipeline.

Intel® CoFluent™ Technology
For our joint Intel-AT&T POC, we needed a more effective tool than a software-only or hardware-only analysis tool. To reach our goals, we chose the Intel CoFluent modeling and simulation solution. Intel CoFluent is an application that helps developers characterize and optimize both hardware and software environments. As shown by the results of this POC, the Intel CoFluent model proved to be highly accurate when compared to measurements taken on a physical system.1

Simulation at a functional level
With Intel CoFluent, the computing and communication behavior of the software stack is abstracted and simulated at a functional level. Software functions are then dynamically mapped onto hardware components. The timing of the hardware components — CPU, memory, network, and storage — is modeled according to payload and activities, as perceived by software.

Layered and configurable architecture
For our POC, Intel CoFluent was ideal because the simulator can estimate complete system designs. Even more, Intel CoFluent can do so without the need for embedded application code, firmware, or even a precise platform description. In our POC, this meant we did not have to create and set up actual packet processing applications for our model, but could simulate them instead.
Another key advantage of using Intel CoFluent for our POC is the tool's layered and configurable architecture. The layered and configurable capabilities help developers optimize early architecture and designs, and predict system performance. Intel CoFluent also includes a low-overhead, discrete-event simulation engine. This engine enables fast simulation speed and good scalability.

Upstream and downstream workloads
For this project, our team examined primarily the upstream traffic pipeline. Figure 2 shows the edge router's upstream software pipeline. (Figure 3 shows the edge router's downstream software pipeline.)

Upstream pipeline
In the edge router's upstream traffic pipeline there are several actively running components. Our POC used a physical model to validate the results of our simulation experiments. This physical test setup consisted of three actively running components:
• Ixia* packet generator (packetgen), or the software packetgen
• Intel® Ethernet controller (the NIC)
• Upstream software pipeline stages running on one or more cores

In our physical model, the Ixia packetgen injects packets into the Ethernet controller. This simulates the activity of packets arriving from the access network. The Ethernet controller receives packets from the access network, and places them in its internal buffer.

Execution flow of the upstream pipeline
The upstream software pipeline can run on a single core, or the workload can be distributed amongst several cores. Each core iterates each pipeline assigned to it, and runs the pipeline's standard flow.
Here is the general execution flow of the typical packet processing pipeline (a simplified sketch of this loop follows the downstream pipeline configuration below):
• Receive packets from input ports.
• Perform port-specific action handlers and table look-ups.
• Execute entry actions on a lookup hit, or execute the default actions on a lookup miss. (The table entry action usually sends packets to the output port, or dumps or drops the packets.)

Downstream pipeline stages
Although we did not simulate the downstream pipeline for this project, we did collect some data on this pipeline (see Appendix A).
As shown in Figure 3 (previous page), the second stage of the downstream pipeline is the routing stage. This stage demonstrates the use of the hash and LPM (longest prefix match) libraries in the Data Plane Development Kit (DPDK). The hash and LPM libraries are used to implement packet forwarding. In this pipeline stage, the lookup method is either hash-based or LPM-based, and is selected at runtime.

Hash lookup method
When the lookup method is hash-based, a hash object is used to emulate the downstream pipeline's flow classification stage. The hash object is correlated with a flow table, in order to map each input packet to its flow at runtime. The hash lookup key is represented by a unique DiffServ 5-tuple.
The DiffServ 5-tuple is composed of several fields that are read from the input packet. These fields are the source IP address, destination IP address, transport protocol, source port number, and destination port number. The ID of the output interface for the input packet is read from the identified flow table entry. The set of flows used by the application is statically configured, and is loaded into the hash upon initialization.

LPM lookup method
When the lookup method is LPM-based, an LPM object is used to emulate the pipeline's forwarding stage for internet protocol version 4 (IPv4) packets. The LPM object is used as the routing table, in order to identify the next hop for each input packet at runtime.

Configuration of the downstream pipeline
Below is the configuration code we used for the first stage in the downstream pipeline. Additional information and analysis of the downstream pipeline will be a future POC project.

[PIPELINE1]
type = ROUTING
core = 1
pktq_in = RXQ0.0 RXQ1.0
pktq_out = SWQ0 SWQ1 SINK0
encap = ethernet_qinq
ip_hdr_offset = 270

Traffic Manager Pipeline: This is a pass-through stage with the following configuration:

[PIPELINE2]
type = PASS-THROUGH
core = 1
pktq_in = SWQ0 SWQ1 TM0 TM1
pktq_out = TM0 TM1 SWQ2 SWQ3

Transmit Pipeline: Also a pass-through stage:

[PIPELINE3]
type = PASS-THROUGH
core = 1
pktq_in = SWQ2 SWQ3
pktq_out = TXQ0.0 TXQ1.0
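To make the general execution flow listed above more concrete, the following is a minimal, illustrative C sketch of the per-core pipeline loop (receive, table lookup, entry or default action, transmit). It is pseudocode-style C, not the actual DPDK Packet Framework API; the helper functions (receive_burst, table_lookup, and so on) are hypothetical names used only for illustration.

/* Illustrative per-core pipeline loop; helper names are hypothetical. */
#include <stdint.h>
#include <stddef.h>

#define BURST_SIZE 32

struct packet;                                 /* opaque packet descriptor */
struct table_entry { int output_port; };

/* Hypothetical helpers standing in for the framework's RX/TX and tables. */
size_t receive_burst(int port, struct packet *pkts[], size_t max);
struct table_entry *table_lookup(const struct packet *pkt);
void apply_entry_action(struct packet *pkt, const struct table_entry *e);
void apply_default_action(struct packet *pkt); /* e.g., drop or dump       */
void transmit(int port, struct packet *pkt);

void run_pipeline(int in_port)
{
    struct packet *pkts[BURST_SIZE];

    for (;;) {
        /* 1. Receive a burst of packets from the input port. */
        size_t n = receive_burst(in_port, pkts, BURST_SIZE);

        for (size_t i = 0; i < n; i++) {
            /* 2. Port-specific action handler and table lookup. */
            struct table_entry *e = table_lookup(pkts[i]);

            /* 3. Entry action on a hit, default action on a miss. */
            if (e != NULL) {
                apply_entry_action(pkts[i], e);
                transmit(e->output_port, pkts[i]);
            } else {
                apply_default_action(pkts[i]);
            }
        }
    }
}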

6 Stages in a typical upstream pipeline
In the specific case of the upstream traffic pipeline of the DPDK vPE, there are usually 6 stages. Figure 2 (earlier in this paper) shows an overview of the 6 typical stages. The first pipeline stage drains packets from the Ethernet controller. The last stage in the chain queues up the packets and sends them to the core network through the Ethernet controller.
In our POC, we modeled and simulated all key stages of the upstream pipeline, and verified those results against known hardware configurations.

Downstream pipeline
In the edge router's downstream traffic pipeline there are 3 actively running components and 4 typical pipeline stages. Figure 3 (earlier in this paper) shows an overview of the four typical stages. The three components are:
• DPDK packetgen
• Intel® Ethernet controller (the NIC)
• Downstream software pipeline stages, running on one or more cores
In our POC, for the downstream pipeline, the packetgen injects packets into the Ethernet controller. This simulates packets entering the access network from the core. The first stage of the edge router's downstream pipeline pops packets from the internal buffer of the Ethernet controller. The last stage sends packets to the access network via the Ethernet controller.
Note that the scope of this project did not allow a complete analysis of the downstream pipeline. The downstream pipeline uses different software and has different functionality and pipeline stages, as compared to the upstream pipeline. We do provide some of the data and insights for the downstream pipeline that we observed while conducting our POC (see Appendix A). However, full analysis and verification of those initial results will have to be a future project.

Project names for devices under test (DUTs)
Intel internal project code names are often used to refer to various processors during development, proof of concepts (POCs), and other research projects.
In the joint Intel and AT&T POC, we used three main test configurations, one of which was a pre-production processor (Skylake). Some of the devices under test (DUTs) were used to confirm simulation results and establish the accuracy of the simulations. Some were used to confirm simulation results for projecting optimal configurations for future generations of processors. A fourth, production version of the Skylake microarchitecture was used to characterize some aspects of the downstream pipeline (see Appendix A).
The three main DUTs for our POC were based on these processors, with these project code names:
• Skylake-based DUT: Pre-production Intel® Xeon® processor, 1.8 GHz
• Broadwell-based DUT: Intel® Xeon® processor E5-2630, 2.2 GHz
• Haswell-based DUT: Intel® Xeon® processor E5-2680, 2.5 GHz

Physical system setup
When our team began setting up this POC, we started with a description of a typical physical architecture. We then set up a hardware DUT that would match that architecture as closely as possible. We set up additional DUTs to provide configurations for comparisons and verifications.
Tables 1 and 2 (next page) describe the two DUTs we built for the first phase of our NFV POC. We used these DUTs to take performance measurements on the upstream pipeline. We compared those measurements to the corresponding elements of our Intel CoFluent simulations. The physical DUTs helped us determine the accuracy of the virtual Intel CoFluent model that we used for our simulations and projections.
In our POC, the devices under test (DUTs) used IxNetwork* client software to connect to an Ixia traffic generator. Ixia generates simulated edge traffic into the DUT, and reports measurements of the maximum forwarding performance of the pipeline. In our model, we did not include considerations of packet loss.

The DUT described in Table 1 is based on a pre-production Intel® Xeon® processor, 1.8 GHz, with 32 cores. The Intel-internal project code name for this pre-production processor is "Skylake."
The DUT described in Table 2 is based on an Intel® Xeon® processor E5-2630, 2.2 GHz. The Intel-internal project code name for this processor is "Broadwell."
In order to explore the performance sensitivity of one processor generation versus another, we set up an additional DUT, as described in Table 3. This DUT is based on an Intel® Xeon® processor E5-2680, 2.5 GHz. The Intel-internal project code name for this processor is "Haswell."

Packetgen in the physical DUT
As mentioned earlier in the description of the upstream pipeline, our physical systems included the Ixia packetgen. In the upstream pipeline, the job of this hardware-based packetgen is to generate packets and work with the packet receive (RX) and transmit (TX) functions. Basically, the packetgen sends packets into the receive unit or out of the transmit unit. This is just one of the key hardware functions that was simulated in our Intel CoFluent model.

Table 1. Test configuration based on the pre-production Intel® Xeon® processor, 1.8 GHz (Skylake)
Processor: Pre-production Intel® Xeon® processor, 1.8 GHz; speed 1800 MHz; 32 cores / 64 threads; 22528 KB LLC cache
Memory: 64 GB capacity; DDR4; rank 2; 2666 MHz; 6 channels/socket; 16 GB per DIMM
NIC: X710-DA4 Ethernet controller (4x10G); igb_uio driver
OS: Ubuntu 16.04.2 LTS; kernel 4.4.0-64-lowlatency
BIOS: Hyper-threading off

Table 2. Test configuration based on the Intel® Xeon® processor E5-2630, 2.2 GHz (Broadwell)
Processor: Intel® Xeon® processor E5-2630, 2.2 GHz; speed 2200 MHz; 10 cores / 20 threads; 25600 KB LLC cache
Memory: 64 GB capacity; DDR4; rank 2; 2133 MHz; 4 channels/socket; 16 GB per DIMM
NIC: X710-DA4 Ethernet controller (4x10G); igb_uio driver
OS: Ubuntu 16.04.2 LTS; kernel 4.4.0-64-lowlatency
BIOS: Hyper-threading off

Table 3. Test configuration based on the Intel® Xeon® processor E5-2680, 2.5 GHz (Haswell)
Processor: Intel® Xeon® processor E5-2680 v3, 2.5 GHz; speed 2500 MHz; 24 cores / 24 threads; 30720 KB LLC cache
Memory: 256 GB capacity; DDR4; rank 2; 2666 MHz; 6 channels/socket; 16 GB per DIMM
NIC: Ethernet controller (4x10G); igb_uio driver
OS: Ubuntu 16.04.2 LTS; kernel 4.4.0-64-lowlatency
BIOS: Hyper-threading off

Generating a packet-traffic profile
Once we set up our physical DUTs, we needed to estimate the performance effect of a cache miss in the routing table lookup on these architectures. To do this, for each packet, we increased the destination IP by a fixed stride of 0.4.0.0. This caused each destination IP lookup to hit at a different memory location in the routing table.
For the source IP setting, any IP stride should be appropriate, as long as the stride succeeds on the access control list (ACL) table lookup. (The exact relationship of cache misses and traffic characteristics is not described in this POC, and will be investigated in a future study.)
For our POC, we chose the following IP range settings to traverse the LPM (longest prefix match) table. For lpm24, the memory range is 64 MB, which exceeds the LLC size, and can trigger a miss in the LLC cache.

range 0 dst ip start 0.0.0.0
range 0 dst ip min 0.0.0.0
range 0 dst ip max 255.255.255.255
range 0 dst ip inc 0.4.0.0

In our physical test model, we used default settings for other parameters, such as the media access control (MAC) address, source (SRC) transmission control protocol (TCP) port, and destination (DST) TCP port.

Performance and sensitivities of the traffic profile
In order to get the most accurate results, we needed to characterize the traffic profile in detail for both the hardware DUTs and our Intel CoFluent models and simulations.

Hyperthreading was disabled to expose the impact of other elements
There are a number of system and application parameters that can impact performance, including hyperthreading. For example, when we ran the workload with hyperthreading enabled, we gained about 25% more performance per core.1
However, hyperthreading shares some hardware resources between cores, and this can mask core performance issues. Also, the performance delivered by hyperthreading can make it hard to identify the impact of other, more subtle architectural elements. Since we were looking for the impact of those other packet-handling elements, we disabled hyperthreading for this POC.

Lookup table size affected performance
While setting up the experiments, we observed a performance difference (delta) that depended on the size of the application's lookup table. Because of this, for our POC, we decided to use the traffic profile described under "Generating a packet-traffic profile." This ensured that we had some LLC misses in our model.
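As a rough sanity check on the 64 MB figure quoted above, the short arithmetic sketch below assumes the lpm24 table holds one entry per /24 prefix, with 4 bytes per entry (an assumption that is consistent with the 64 MB figure in the text, not something measured in the POC). With a 0.4.0.0 destination stride, successive lookups land in table entries about 4 KB apart, so consecutive packets do not reuse cache lines.

/* Illustrative arithmetic behind the lpm24 traffic profile.
 * Entry size is an assumption consistent with the 64 MB figure in the text. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t entries     = 1ULL << 24;        /* one tbl24 entry per /24 prefix */
    uint64_t entry_bytes = 4;                 /* assumed entry size             */
    uint64_t table_bytes = entries * entry_bytes;            /* 64 MB           */

    uint32_t stride      = 0x00040000u;       /* destination increment 0.4.0.0  */
    uint32_t index_step  = stride >> 8;       /* tbl24 index advances by 1024   */
    uint64_t byte_step   = (uint64_t)index_step * entry_bytes;   /* 4096 bytes  */

    printf("lpm24 span: %llu MB, per-packet table step: %llu bytes\n",
           (unsigned long long)(table_bytes >> 20),
           (unsigned long long)byte_step);
    return 0;
}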

Figure 4. Model development. This figure shows how we modeled the flow for the simulation of the entire virtual provider edge (vPE) router pipeline.

Developing the Intel CoFluent simulation model
To develop the simulation model for this project, we performed an analysis of the source code, and developed a behavior model. We then developed a model of the performance cost in order to create a virtualized network function (VNF) development model (see Figure 4).
To create our VNF simulation, we built an Intel CoFluent model of actively running components and pipeline stages that corresponded to those in the physical DUTs. We mapped these pipeline stages to a CPU core as the workload. We then simulated the behavior of each pipeline stage.
In any performance cost model, underlying hardware can have an impact on the pipeline flow. For this reason, we modeled all key hardware components except storage. It was not necessary to model storage because our workload did not perform any actual storage I/O.
For our project, the actively running components were the CPU, Ethernet controller, and packet generator (packetgen). One of the benefits of using the Intel CoFluent framework for these simulations is that Intel CoFluent can schedule these components at a user-specified granularity of nanoseconds or even smaller.

Simulating the packetgen
In a simulation, there is no physical hardware to generate packets, so we needed to add that functionality to our model. For this POC, we did not actually simulate the packetgen itself. Instead, we used queues to represent the packetgen.
To do this, we first simulated a queue of packets that were sent to the Ethernet controller at a defined rate. In other words, we created receive (RX) and transmit (TX) queues for our model. We took a packet off the RX queue every so many milliseconds for the RX stage in the pipeline. This simulated a packet arriving at a specific rate. We did a similar simulation for the TX stage in the pipeline.
The rate at which packets entered and exited the queues was determined by the way the physical DUT behaved, so the simulation would model the physical DUT as closely as possible.
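The queue-based stand-in for the packetgen can be pictured with a small discrete-event sketch like the one below. It is illustrative only and is not Intel CoFluent code; the fixed inter-arrival time, the ring size, and the handler name are assumptions made for the example.

/* Illustrative discrete-event sketch of the queue-based packetgen stand-in:
 * packets are injected into an RX queue at a fixed rate, and the modeled
 * RX stage pops one packet per interval. Not Intel CoFluent code. */
#include <stdio.h>
#include <stdint.h>

#define RX_RING 1024

static uint64_t rx_queue[RX_RING];
static unsigned rx_head, rx_tail;

static void pipeline_model_process(uint64_t pkt_id, uint64_t now_ns)
{
    /* Stand-in for the modeled RX stage handing the packet to the pipeline. */
    (void)pkt_id; (void)now_ns;
}

int main(void)
{
    const uint64_t interarrival_ns = 100;     /* assumed injection rate        */
    uint64_t now_ns = 0, next_pkt = 0;

    for (int step = 0; step < 1000; step++) {
        /* Packetgen model: enqueue one packet per inter-arrival interval. */
        rx_queue[rx_tail++ % RX_RING] = next_pkt++;

        /* RX stage model: pop the packet that "arrived" at this instant. */
        if (rx_head != rx_tail)
            pipeline_model_process(rx_queue[rx_head++ % RX_RING], now_ns);

        now_ns += interarrival_ns;            /* advance simulated time        */
    }
    printf("simulated %llu packets over %llu ns\n",
           (unsigned long long)next_pkt, (unsigned long long)now_ns);
    return 0;
}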

Figure 5. Model of upstream software pipeline. The ACL pipeline stage supports auditing of incoming traffic. Note that, in our proof-of-concept (POC),
the queueing stage has two locations, and performs packet receiving (RX) or packet transmitting (TX), depending on its location in the pipeline.

Simulating the network
In this POC, the Ethernet controller was simulated with a very simple throughput model, which receives or sends packets at a user-specified rate.
Because we wanted to characterize the performance impact of CPU parameters on the software packet processing pipeline, we did not implement the internal physical layer (PHY), MAC, switching, or first-in-first-out (FIFO) logic. We specifically defined our test to make sure we would not see system bottlenecks from memory bus bandwidth or from the bandwidth of the Peripheral Component Interconnect Express (PCIe) bus. Because of that, we did not need to model the effect of that network transaction traffic versus system bandwidth.

Modeling the upstream pipeline
In our POC, we simulated all key stages of the upstream packet processing pipeline. The ACL filter, flow classifier, metering and policing, and routing stages were modeled individually. The packet RX stage and the queuing and packet TX stage are also usually separate pipeline stages. In our POC, we modeled the packet RX and packet TX stages as a single packet queueing stage that was located at both the beginning and the end of the pipeline (see Figure 5).

Implementing lookup algorithms
One of the things we needed to do for our model was to implement lookup algorithms. To do this, we first had to consider the pipelines. As shown earlier in Figure 2, an upstream pipeline usually consists of 3 actively running components and 6 typical pipeline stages.
Note that the ACL pipeline stage is a multiple-bit trie implementation (a tree-like structure), while the routing pipeline stage uses an LPM lookup algorithm based on a full implementation of the binary tree. For our POC, we implemented the ACL lookup algorithm and the LPM lookup algorithm to support auditing of the incoming traffic. We also implemented these two algorithms to support routing of traffic to different destinations.
In addition, the flow classification stage used a hash table lookup algorithm, while the flow action stage used an array table lookup algorithm. We implemented both of these algorithms in our model. (A generic sketch of this division of labor appears at the end of this section.)

Developing a model of the cost of performance
It's important to understand that Intel CoFluent is a high-level framework for simulating behavior. This means that the framework doesn't actually execute CPU instructions, access cache or memory cells, or perform network I/O. Instead, Intel CoFluent uses algorithms to simulate these processes with great accuracy (as shown in this POC).1
In order to develop a model of the execution cost of performance, we tested the accuracy of the simulations in all phases of our POC by comparing the simulation results to measurements of actual physical architectures on various DUTs. The delta between the estimated execution times from the simulation and the measurements made on the physical DUTs ranged from 0.4% to 3.3%.1 This gave us a high degree of confidence in our cost model. (Specifics on the accuracy of our model and simulations, as compared to the DUTs, are discussed later in this paper under "Results and model validation.")
With the very small delta seen between performance on the simulations versus the physical DUTs, we expect to be able to use other Intel CoFluent models in the future, to effectively identify the best components for other packet-traffic workloads under development.
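Returning to the lookup stages described above, the sketch below illustrates how a hash lookup (flow classification) can feed an array lookup (flow action). This is generic, illustrative C rather than the DPDK or Intel CoFluent implementation; the toy hash function, the table sizes, and the structure layouts are simplified assumptions.

/* Illustrative flow classification (hash lookup) feeding a flow-action
 * (array lookup) stage; simplified, not the DPDK implementation. */
#include <stdint.h>

#define FLOW_TABLE_SIZE 4096               /* power of two for cheap masking */

struct five_tuple {                        /* DiffServ 5-tuple used as key   */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

struct flow_entry  { struct five_tuple key; int valid; uint32_t flow_id; };
struct flow_action { uint32_t meter_profile; uint32_t out_port; };

static struct flow_entry  flow_table[FLOW_TABLE_SIZE];   /* hash table      */
static struct flow_action action_table[FLOW_TABLE_SIZE]; /* array table     */

static uint32_t hash_tuple(const struct five_tuple *t)
{
    /* Toy hash for illustration only. */
    uint32_t h = t->src_ip ^ t->dst_ip ^ t->proto;
    h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
    return (h * 2654435761u) & (FLOW_TABLE_SIZE - 1);
}

/* Flow classification: hash lookup maps a packet's 5-tuple to a flow ID. */
static int classify(const struct five_tuple *t, uint32_t *flow_id)
{
    const struct flow_entry *e = &flow_table[hash_tuple(t)];
    if (e->valid &&
        e->key.src_ip == t->src_ip && e->key.dst_ip == t->dst_ip &&
        e->key.src_port == t->src_port && e->key.dst_port == t->dst_port &&
        e->key.proto == t->proto) {
        *flow_id = e->flow_id;
        return 1;                          /* hit                            */
    }
    return 0;                              /* miss: default action applies   */
}

/* Flow action: simple array indexing by flow ID. */
static const struct flow_action *lookup_action(uint32_t flow_id)
{
    return &action_table[flow_id & (FLOW_TABLE_SIZE - 1)];
}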

Hardware performance considerations
To build an NFV model of the true cost of performance, we had to consider hardware, not just software. For example, elements that affect a typical hardware configuration include core frequency, cache size, memory frequency, NIC throughput, and so on. Also, different cache sizes can trigger different cache miss rates in the LPM table or the ACL table lookup. Any of these hardware-configuration factors could have a significant effect on our cost model.

Impact of cache on pipeline performance
Another key consideration in our POC was how much the packet processing performance could be affected by different hardware components. Consider the example of cache, and specifically the impact on pipeline performance of cache misses at different cache levels. Besides the packet RX and packet TX pipelines, all edge router pipelines must perform a table lookup, then a table entry handler on a hit — or a default table entry handler on a miss. These operations are mostly memory operations.
Cache misses can have very different access latencies for the memory operations at different cache levels. For example, latencies range from roughly 4 cycles for an L1 cache hit, to about 12 cycles for an L2 cache hit, to hundreds of cycles for a DRAM memory access. The longer the latency of the memory access, the greater the impact on the CPU microarchitecture pipeline.
The impact of latency on the pipeline is called the blocking probability or blocking factor (as compared to a zero cache miss). The longer the memory access latency, the more the CPU execution pipeline is blocked. The blocking factor is the ratio of that latency as compared to zero cache misses.
You might expect the blocking factor to be 1 when memory access cycles aren't hidden by the CPU's execution pipeline, but that is not actually the case. A miss does not necessarily result in the processor being blocked. In reality, the CPU can execute other instructions even while some instructions are blocked at the memory access point. Because of this, some instructions are executed as if overlapped. The result is that, regardless of the DUT configuration, the blocking factor is not usually 1.
The challenge for developers is that the cache miss rate has a critical impact on performance. To address this challenge, we needed to quantify the impact of this miss rate, and integrate its consequences into our cost model. To do this, we configured the cache size via the Intel® Platform Quality of Service Technology (PQOS) utility. We also increased the number of destination IPs to traverse the LPM table. This allowed us to introduce different LLC cache miss rates into our model.
• Real-world performance versus ideal cache miss rate. To estimate the impact of cache miss rate on performance, we regressed the equation for core cycles per instruction (CPI). In regressing the equation for CPI, we used LLC misses per instruction (MPI) and LLC miss latency (ML) as predictor variables (see Figure 6). In other words, we regressed the blocking factor and the core CPI metric. This gave us a way to estimate the extra performance cost imposed by different hardware cache configurations.
Note: L1 and L2 also have a significant impact on performance. However, the impact of L1 and L2 can't actually be quantified here, since we cannot change the sizes of the L1 and L2 cache.
• Linear performance. Core frequency refers to clock cycles per second (CPS), which is used as an indicator of the processor's speed. The higher the core frequency, the faster the core can execute an instruction. Again we used a regression model to estimate the packet processing throughput for the upstream software pipeline, by using the core frequency as a predictor variable.

Figure 6. Simple models for estimating cycles per instruction (CPI) when cache misses occur, and for estimating path length for the pipeline.

Characterizing the cost model
In our POC, the cost model is a model of the execution cost of specific functions in the execution flow of the upstream software pipeline. In other words, the cost model is the execution latency.
The cost model for each software pipeline consists of characterizing the pipeline in terms of CPI and path length. We determined CPI using the simple model shown in Figure 6. (A sketch of the overall form of this cost model follows below.)
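The following is a minimal sketch of the overall form of this cost model, assuming the CPI regression has the shape suggested by Figure 6: a base CPI plus a blocking-factor-weighted LLC stall term, multiplied by the path length (the instruction count per unit of data, defined in the next subsection) to get cycles, and divided into the core frequency to get throughput. Every numeric value below is an illustrative placeholder, not a fitted value from the POC.

/* Illustrative form of the cost model: CPI from LLC misses, then
 * cycles = CPI x path length, then throughput from core frequency.
 * All numeric values are placeholders, not fitted POC results. */
#include <stdio.h>

int main(void)
{
    /* Assumed CPI regression: cpi = cpi_base + blocking_factor * MPI * ML */
    double cpi_base        = 0.60;     /* assumed CPI with near-zero LLC misses */
    double blocking_factor = 0.30;     /* assumed fraction of miss latency
                                          that actually stalls the pipeline     */
    double mpi             = 0.002;    /* assumed LLC misses per instruction    */
    double miss_latency    = 200.0;    /* assumed LLC miss latency, in cycles   */
    double cpi = cpi_base + blocking_factor * mpi * miss_latency;

    /* Path length: x86 instructions retired per 1 Mb of data sent (assumed). */
    double path_length_per_mbit = 1.2e6;

    double cycles_per_mbit = cpi * path_length_per_mbit;
    double core_hz         = 1.8e9;                 /* 1.8 GHz core            */
    double mbit_per_s      = core_hz / cycles_per_mbit;
    double packet_bits     = 64.0 * 8.0;            /* assumed 64-byte packets */
    double pps             = mbit_per_s * 1e6 / packet_bits;

    printf("CPI %.2f, ~%.0f Mbit/s, ~%.2f Mpps per core\n",
           cpi, mbit_per_s, pps / 1e6);
    return 0;
}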

In order to get the execution latency for a specific length of the pipeline, we also had to estimate the path length for that section of the pipeline. The path length is the number of x86 instructions retired per 1 Mb of data sent. Again, see Figure 6.
In our model, multiplying the two variables — CPI and path length — gives the execution time of the software pipeline in terms of CPU cycles. With that information, we were able to simulate CPI and path length, using the Intel CoFluent tool, in order to compute the end-to-end packet throughput.

Establishing the execution cost of the model
We began building our Intel CoFluent cost model based on the DPDK code. With the previous considerations taken into account, we used the DPDK code to measure the instructions and cycles spent in the different pipeline stages. These cycles were assumed to be the basic execution cost of the model. Figure 4, earlier in this paper, shows an overview of the cost model.

Simulation constraints and conditions
For the upstream pipeline, we modeled the hardware parameters (such as CPU frequency and LLC cache size), packet size, pipeline configurations, and flow configurations. We focused on the areas we expected would have the biggest impact on performance. We then verified the results of our simulations against performance measurements made on the physical hardware DUTs.
In order to create an effective model for a complete system, we accepted some conditions for this project.

Support for the upstream pipeline
As mentioned earlier, this POC was focused on the upstream pipeline. The scope of our model did not support simulating the downstream pipeline. However, we hope to conduct future POCs to explore packet performance in that pipeline. The information we did collect on the downstream pipeline is presented in Appendix A, at the end of this paper.

Cache analysis supported for LLC
In this study, our test setup did not allow us to change the L1 and L2 sizes to evaluate the impact of L1 and L2 on performance. Because of this, our model supported only the LLC cache size sensitivity analysis, and not an analysis of L1 or L2.

Dropping or dumping packets was not supported
Dropping packets is an error-handling method, and dumping packets is a debugging or profiling tool. Dropping and dumping packets doesn't always occur in the upstream pipeline. If it does, it can occur at a low rate during the running lifetime of that pipeline.
Our test model did not support dropping or dumping packets. If we had included dropping packets and/or the debugging tools in our POC model, they could have introduced more overhead to the simulator. This could have slowed the simulation speed and skewed our results.
We suspect that dropping and dumping packets might not be critical to performance in most scenarios, but we would need to create another model to explore those impacts. That would be an additional project for the future.

Critical data paths simulated
With those three constraints in place, we modeled and simulated the most critical data paths of the upstream pipeline. This allowed us to examine the most important performance considerations of that pipeline.

Hardware analysis and data collection
Verifying the accuracy of any simulation is an important phase of any study. In order to test the accuracy of our Intel CoFluent simulations against the hardware DUT configurations, we needed to look at how VNF performance data would be collected and analyzed.

Hardware analysis tools
We used three hardware analysis tools to help with our VNF verifications: the Event Monitor (EMON) tool, the Sampling Enabling Product (SEP) tool, and the EMON Data Processor (EDP) tool. These tools were developed by Intel, and are available for download from the Intel Developer Zone.

Event Monitor (EMON) tool
EMON is a low-level command-line tool for processors and chipsets. The tool logs event counters against a timebase. For our POC, we used EMON to collect and log hardware performance counters.
You can download EMON as part of the Intel® VTune Amplifier suite. Intel VTune Amplifier is a performance analysis tool that helps users develop serial and multithreaded applications.

Sampling Enabling Product (SEP) tool
SEP is a command-line performance data collector. It performs event-based sampling (EBS) by leveraging the counter overflow feature of the test hardware's performance monitoring unit (PMU). The tool captures the processor's execution state each time a performance counter overflow raises an interrupt.
Using SEP allowed us to directly collect the performance data — including cache misses — of the target hardware system. SEP is part of the Intel VTune Amplifier.

EMON Data Processor (EDP) tool
EDP is an Intel-internal analysis tool that processes EMON performance samples for analysis. EDP analyzes key hardware events such as CPU utilization, core CPI, cache misses and miss latency, and so on.

Collecting performance-based data
In general, there are two ways to collect performance-based data from hardware counters:
• Counting mode, implemented in EMON, which is part of the Intel VTune suite
• Sampling mode, implemented in SEP, which is also part of the Intel VTune suite
Counting mode reports the number of occurrences of a counter in a specific period of time. This mode lets us calculate precise bandwidth and latency information for a given configuration. Counting mode is best for observing the precise use of system resources. However, counting mode does not report software hotspots. For example, it does not report where the code takes up the most cycles, or where it generates the most cache misses. Sampling mode is better for collecting that kind of information.
EMON outputs the raw counter information in the form of comma-separated values (CSV data). We used the EMON Data Processor (EDP) tool to import those raw counter CSV files, and convert them into higher-level metrics. For our POC, we converted our data into Microsoft Excel* format, so we could interpret the data more easily.

Workload performance measurements and analysis for model inputs
We used various metrics to collect data from the hardware platform. Using these metrics let us input the data more easily into our model projections (our results) and/or calibrate our modeling output. Such metrics included:
• Instructions per cycle (IPC)
• Core frequency
• L1, L2, and LLC cache sizes
• Cache associativity
• Memory channels
• Number of processor cores
We also collected application-level performance metrics in order to calibrate the model's projected results.

Results and model validation
Our NFV project provided significant performance data for various hardware configurations and their correspondingly modeled simulations. It also allowed us to compare the performance of different simulation models, from real-world configurations to worst-case configurations, to ideal configurations.
Our results show that it is possible to use a VNF model to estimate both best-case and worst-case packet performance for any given production environment.
The next several discussions explain how we established the accuracy of our simulation, and describe our key results.

Establishing a baseline for simulation accuracy
To establish a baseline for the accuracy of our VNF simulation model, we first measured packet performance on a physical Skylake-based DUT and compared it to our Intel CoFluent simulation model.
Figure 7 shows performance when measured under the default CPU frequency, with the default LLC cache size of 22 MB. As you can see in Figure 7, the simulation projections (our results) are very close — within 3.3% — to the measurements made on the real-world architecture.1

[Figure 7 (chart): Baseline delta of performance on physical hardware versus simulation. Throughput (MPPS) for the pre-production Intel® Xeon® processor, 1.8 GHz (Skylake) versus the Intel® CoFluent™ Technology simulation: 3.72 MPPS measured versus 3.60 MPPS simulated.]
Figure 7. Baseline performance measurements, with default CPU frequency and default LLC cache size of 22 MB.1 The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake).

Measuring performance at different core frequencies
Figure 8 (next page) shows the results of measuring performance at different core frequencies on both DUTs and simulations. These measurements include using Intel® Turbo Boost at maximum frequency, and adjusting the core frequency using a Linux* P-state driver.
In Figure 8, the yellow line represents the difference between measurements of the simulation as compared to measurements taken on the physical DUTs. As you can see, the Intel CoFluent simulation provides estimated measurements that are within 0.7% to 3.3% of the measurements made on the actual physical hardware.1

Figure 8. Performance measured at different frequencies for our simulation versus on the physical device under test (DUT).1 The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake).

Figure 9. Performance measured for different LLC cache sizes in our simulations versus the physical DUT.1 The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake).

Analyzing performance for different cache sizes

Figure 9 shows packet performance as measured for different LLC cache sizes: 2 MB, 4 MB, 8 MB, and 22 MB. In our simulation, cache size was adjusted using the Intel PQOS utility.

Our POC showed that a 2 MB LLC cache size causes a dramatically larger miss rate (a 32% miss rate) than a 22 MB LLC cache size (a 1.6% miss rate).1 However, almost 90% of memory accesses hit in L1 cache.1 Because of this, adjusting the LLC cache size decreases performance by a maximum of only 10%.1

In Figure 9, the yellow line again shows the difference between measurements made on the physical DUT, and measurements of the simulation. The delta remains very small, between 0.5% and 3.3%.1
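To see why a much larger LLC miss rate produces only a modest slowdown, it helps to estimate the average cost of a memory access from the hit rates at each cache level. The sketch below uses a simple weighted-latency model; the cache latencies and the L2 hit rate are assumed round numbers for illustration only, not measurements from this POC.

```python
# Illustrative average memory-access-cost estimate from cache hit rates.
# The latencies (in core cycles) and the L2 hit rate are assumed round numbers,
# not POC measurements; only the ~90% L1 hit rate and the LLC miss rates
# (1.6% with a 22 MB LLC, 32% with a 2 MB LLC) come from this paper.

def avg_access_cycles(l1_hit, l2_hit, llc_miss,
                      l1_lat=4, l2_lat=14, llc_lat=50, dram_lat=200):
    """Weighted average cycles per memory access for a three-level cache hierarchy."""
    beyond_l2 = 1.0 - l1_hit - l2_hit            # accesses that reach the LLC
    llc_hit = beyond_l2 * (1.0 - llc_miss)       # served by the LLC
    dram = beyond_l2 * llc_miss                  # go all the way to memory
    return l1_hit * l1_lat + l2_hit * l2_lat + llc_hit * llc_lat + dram * dram_lat

big_llc = avg_access_cycles(l1_hit=0.90, l2_hit=0.07, llc_miss=0.016)
small_llc = avg_access_cycles(l1_hit=0.90, l2_hit=0.07, llc_miss=0.32)
print(f"22 MB LLC: ~{big_llc:.1f} cycles/access; 2 MB LLC: ~{small_llc:.1f} cycles/access")
```

Under these assumptions, the average access cost rises only moderately when the LLC miss rate jumps from 1.6% to 32%, because roughly 90% of accesses never leave L1. That is consistent with the at-most-10% throughput impact described above.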

Performance when scaling CPU cores (2.2 GHz) Performance without a hardware accelerator and
with a hardware accelerator
30 120%
0.4% 1.2% 4.5
3.0%
25 100% 4.4
Throughput (MPPS)

24.5 23.8 4.3 4.39

Throughput (MPPS)
20 80% 4.2
4.1
15 60% 4.0
3.9
10 40%
3.8
8.2 8.3 3.7 3.80
5 20%
3.6
4.1 4.1
0 0% 3.5
Baseline hardware Simulated FPGA-
1 core 2 cores 6 cores
configuration without field- accelerated ACL lookup
Pre-production Intel® Xeon® processor, 1.8GHz (Skylake) programmable gate array
(FPGA)-accelerated
Intel® CoFluent™ Technology simulation access control list (ACL)
Delta between measurements on physical system and simulation lookup

Figure 10. Measuring performance on multi-core CPUs.1 The DUT Pre-production Intel® Xeon® processor, 1.8GHz (Skylake)
for these measurements was based on a pre-production Intel® CoFluent™ Technology simulation
Intel® Xeon® processor, 1.8 GHz (Skylake).
Figure 11. Performance comparison of baseline configuration
versus a simulation that includes a field-programmable
Figure 11. Performance comparison of baseline configuration gate array (FPGA)-accelerated access control list
Measuring performance
versus a simulation that includes a field-programmable (ACL) lookup.1 The DUT for these measurements was
for different numbers
gate arrayof cores
(FPGA)-accelerated access control list based on a pre-production Intel® Xeon® processor,
(ACL) lookup.1Figure 10. Measuring performance 1.8 GHz (Skylake).
Figure 10 showson themulti-core
throughput results
CPUs. 1 forDUT for these measurements
The
measurements taken on DUTs with various
was based on a pre-production Intel®Simulating
Xeon® processor,
hardware
numbers of cores. These measurements 1.8 GHzwere
(Skylake).
acceleration components Figure 12. Comparison of performance from CPU generation to
made on the pre-production Skylake-based generation.1Figure 11. Performance comparison
DUT, and compared with the results projected Previously, we showed how we broke down It’s important to note that this performance
of baseline configuration versus a simulation that
by our simulation. Note that in this POC, the the distribution of CPU cycles and result represents only
includes a field-programmable one
gate functionality
array (FPGA)- of the
upstream pipeline ran on a single core, even pipeline that
instructions amongst different stages of accelerated access control listwas simulated for FPGA. The
when run on processors with multiple cores. 15.5%
.1 result we saw here
(ACL) lookup The DUT for these measurements
pipelines. Just as we did in that analysis, we does not represent
was
can do a similar what-if analysis to based
identify on a the results of
pre-production the full
Intel® capability
Xeon® of the FPGA
processor, 1.8
As you can see in Figure 10, the throughput used for this workload. Still, this kind of
GHz
the best hardware accelerators for our model. (Skylake).
scales linearly as more cores are added. In this what-if analysis can help developers more
test, the packets were spread evenly across the For example, in one what-if analysis, we accurately estimate the cost and efficiency of
cores by way of a manual test configuration. replaced the ACL lookup with an FPGA adopting FPGA accelerators or of using some
In our test, all pipeline stages were bound to accelerator that is 10 times as efficient as the other method to offload hardware
one fixed core, and the impact of core-to-core standard software implementation in the functionalities.
movement was very small. DPDK. We found that swapping this
Again, the yellow line represents the component sped up the performance of the
difference between measurements of the overall upstream traffic pipeline by over 15%
physical system, and measurements of the (see Figure 11).1
simulation. For performance based on cache
size, the delta is still very small, between
0.4% and 3.0% for simulation predictions as
compared to measurements made on
the DUT.1
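A rough way to reproduce this kind of what-if estimate is an Amdahl's-law-style calculation: if the ACL lookup consumes a given fraction of the pipeline's CPU cycles and an accelerator makes that stage 10 times as efficient, the overall speedup follows directly. The sketch below is illustrative only; the 15% cycle fraction is a hypothetical placeholder, whereas in the POC the real fraction came from the measured distribution of cycles across pipeline stages.

```python
# What-if estimate for offloading one pipeline stage to an accelerator,
# Amdahl's-law style. The stage cycle fraction is a hypothetical placeholder;
# in the POC, the real fraction came from the measured distribution of CPU
# cycles across the upstream pipeline stages.

def pipeline_speedup(stage_cycle_fraction: float, stage_speedup: float) -> float:
    """Overall pipeline speedup when a single stage is accelerated."""
    remaining = 1.0 - stage_cycle_fraction
    return 1.0 / (remaining + stage_cycle_fraction / stage_speedup)

acl_fraction = 0.15        # assumed share of pipeline cycles spent in ACL lookup
accelerator_gain = 10.0    # accelerator assumed to be 10x as efficient as software

speedup = pipeline_speedup(acl_fraction, accelerator_gain)
print(f"Estimated pipeline speedup: {speedup:.3f}x "
      f"({(speedup - 1.0) * 100:.1f}% higher throughput)")
```

With these placeholder numbers, the estimate lands in the same range as the Figure 11 result; the value of the method is that any stage and any assumed accelerator efficiency can be plugged in.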

Figure 12. Comparison of performance from CPU generation to generation.1

Best case and worst case traffic profiles

At the time of this joint NFV project, a production traffic profile was not available for analyzing bottlenecks in a production deployment. However, we did analyze both the worst-case profile and the best-case profile. In the worst-case profile, every packet is a new flow. In the best-case profile, there is only one flow. We did not set out to study these scenarios specifically, but the traffic profile we used provided information on both best- and worst-case profiles.

Our results showed that the difference in performance between the worst-case and best-case profiles was only 7%.1 That could offer developers a rough estimation of what performance could be like between best-case and worst-case packet performance for any given production environment.

It's important to understand that our results show only a rough estimation of that difference for our pipeline model and our particular type of application. The packet performance gap between your own actual best- and worst-case traffic profiles could be significantly different.

Performance sensitivity from generation to generation

Figure 12 shows the performance of the upstream pipeline on a Broadwell-based microarchitecture, as compared to a Skylake-based microarchitecture. The Intel CoFluent simulation gives us an estimated delta of less than 4% for measurements of packet throughput on simulated generations of microarchitecture, as compared to measurements on the physical DUTs.1

Core cycles per instruction (CPI)

Our POC results tell us that several specific factors affect CPI and performance. For example, the edge routers on both the Broadwell and Skylake microarchitectures have the same program path length. However, Skylake has a much lower core CPI than Broadwell (lower CPI is better). The core CPI on the Broadwell-based DUT is 0.87, while the core CPI on the Skylake DUT is only 0.50.1

Broadwell also has only 256 KB of L2 cache, while Skylake has 2 MB of L2 cache (more cache is better). Also, when there is a cache miss in L2, the L2 message-passing interface (MPI) on the Skylake-based DUT delivers 6x the throughput of the L2 MPI on Broadwell.1

Our POC measurements tell us that all of these factors contribute to the higher core CPI seen for the Broadwell microarchitecture, versus the greater performance delivered by Skylake.

Maximum capacity at the LLC level

One of the ways we used our simulations was to understand performance when assuming maximum capacity at the LLC level. This analysis assumed an infinite-sized LLC, with no LLC misses.

Our analysis shows that packet throughput can achieve a theoretical maximum of 3.98 MPPS (million packets per second) per core on Skylake-based microarchitectures.1

Fused µOPs

Along with traditional micro-ops fusion, Haswell supports macro fusion. In macro fusion, specific types of x86 instructions are combined in the pre-decode phase, and then sent through a single decoder. They are then translated into a single micro-op.

Ideal world versus reality

In an ideal world, we would eliminate all pipeline stalls in the core pipeline, eliminate all branch mispredictions, eliminate all translation lookaside buffer (TLB) misses, and assume that all memory accesses hit at the L1 data cache. This would allow us to achieve the optimal core CPI.

For example, a Haswell architecture can commit up to 4 fused µOPs each cycle per thread. Therefore, the optimal CPI for the 4-wide microarchitecture pipeline is theoretically 0.25 CPI. For Skylake, the processor's additional microarchitecture features can lower the ideal CPI even further.

If we managed to meet optimal conditions for Haswell, when CPI reaches 0.25, we could double the packet performance seen today, which would then be about 8 MPPS (7.96 MPPS) per core. (Actual CPI is based on the application, of course, and on how well the application is optimized for its workload.)
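The relationship between frequency, CPI, and per-core throughput described above can be written as a simple back-of-the-envelope model: throughput is roughly frequency / (CPI × instructions per packet). The sketch below applies it; the path length is back-calculated from the figures quoted in this paper (about 3.98 MPPS per core at 1.8 GHz with a core CPI of 0.50), so treat it as an illustration rather than a measured value.

```python
# Rough per-core throughput model: MPPS ~= frequency / (CPI * instructions per packet).
# The path length below is back-calculated from numbers quoted in this paper
# (about 3.98 MPPS per core at 1.8 GHz with a core CPI of 0.50), so it is an
# illustrative assumption, not a measured value.

def mpps_per_core(freq_hz: float, cpi: float, insns_per_packet: float) -> float:
    """Estimate packet throughput for one core, in million packets per second."""
    instructions_per_second = freq_hz / cpi
    return instructions_per_second / insns_per_packet / 1e6

PATH_LENGTH = 1.8e9 / 0.50 / 3.98e6   # roughly 905 instructions per packet (derived)

print(mpps_per_core(1.8e9, 0.50, PATH_LENGTH))  # ~3.98 MPPS: the measured Skylake case
print(mpps_per_core(1.8e9, 0.25, PATH_LENGTH))  # ~7.96 MPPS: the ideal 4-wide pipeline
print(mpps_per_core(3.0e9, 0.50, PATH_LENGTH))  # frequency scaling is linear in this model
```

Halving CPI (the ideal 4-wide case) or raising the core frequency scales the estimate proportionally, which matches the near-linear frequency scaling we observed.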
Performance sensitivities based on traffic profile

Our POC shows that the packet processing performance for the upstream pipeline on the edge router will change depending on the traffic profile you use.

Performance scaled linearly with the number of cores

Using the traffic profile we chose, we found that we could sustain performance at about 4 MPPS per core on a Skylake architecture.1 We tested this performance on systems with 1, 2, and 6 cores, and found that performance scaled linearly with the number of cores (see Figure 10, earlier in this paper).1

Note that, when adding more cores, each core can use less LLC cache, which may cause a higher LLC cache miss rate. Also, as mentioned earlier in this paper, under the heading "Impact of cache on pipeline performance," adjusting the LLC cache size could impact performance. So adding more cores could actually increase fetch latency and cause core performance to drop.

Our results include the performance analysis for different cache sizes, from 2 MB to 22 MB. We did not obtain results for cache sizes smaller than 2 MB.

Execution flow was steady

We also discovered that our test vPE application delivered a steady execution flow. That meant we had a predictable number of instructions per packet. We can take that further to mean that the higher the core clock frequency, the more throughput we could achieve.

For our POC conditions (traffic profile, 1.8 GHz to 2.1 GHz processors, workload type), we found that performance of the vPE scales linearly as frequency increases.1

Next steps

As we continue to model packet performance, it's unavoidable that we will have to deal with hardware concurrency and the interactions typically seen with various core and non-core components. The complexity of developing and verifying such systems will require significant resources. However, we believe we could gain significant insights from additional POCs.

We suggest that next steps include:

• Using our Intel CoFluent model to identify component characteristics that have the greatest impact on the performance of networking workloads. This could help developers choose the best components for cluster designs that are focused on particular types of workloads.
• Modeling and improving packet-traffic profiles to support a multi-core paradigm. This paradigm would allow scheduling of different parts of the workload pipeline onto different cores.
• Modeling and improving traffic profiles to study the impact of the number of new flows per second, load balancing across cores, and other performance metrics.
• Modeling and simulating the downstream pipeline.

Summary

Most of today's data is transmitted over packet-switched networks, and the amount of data being transmitted is growing dramatically. This growth, along with a complex set of conditions, creates an enormous performance challenge for developers who work with packet traffic.

To identify ways to resolve this challenge, Intel and AT&T collaborated to perform a detailed POC on several packet processing configurations. For this project, our joint team used a simulation tool (Intel CoFluent) on a packet-processing workload that was based on the DPDK library and run on x86 architectures. The simulation tool demonstrated results (projections) with an accuracy of 96% to 97% when compared to the measurements made on physical hardware configurations.1

Our results provide insight into the kinds of changes that can have an impact on packet traffic throughput. We were also able to identify the details of some of those impacts. This included how significant the changes were, based on different hardware characteristics. Finally, our POC analyzed component and processor changes that could provide significant performance gains.

Key findings

Here are some of our key findings:

CPU frequency

CPU frequency has a high impact on packet performance; the scaling is nearly linear.1 Even if the underlying architecture has different core frequencies, the execution efficiency (reflected in core CPI) for these cores is almost the same.

LLC cache

The size and performance of LLC cache had little influence on our DPDK packet processing workload. This is because, in our POC, most memory accesses hit in L1 and L2, not in LLC; and there is a low miss rate in L1 and L2.

Note that the size of LLC cache can cause higher miss rates on different types of VNF workloads. However, on the VNF workload we simulated, the effect was small because of the high L1 and L2 hit rates.

We did not include studies in our POC to understand the effects caused by other VNF workloads running on different cores on the same socket, and affecting the LLC in different ways. That was not in the scope of our POC, and would be a future project.

Performance

Performance scales linearly with the number of cores.1 Our results show this is due to the small impact of LLC, since there is such a low miss rate in L1 and L2. Basically, when a memory access misses in L1 or L2, the system will search the LLC. Since L1 and L2 are dedicated to each core, and since the LLC is shared by all cores, the more cores there are, the higher the potential rate of LLC misses.

Packet size

Packet size does not have a significant performance impact in terms of packets per second (PPS).1 For example, look at the edge router, which is a packet-forwarding approach. With an edge router, only the packet header (the MAC/IP/TCP header) is processed for classifying the flow, for determining quality of service (QoS), or for making routing decisions. The edge router doesn't touch the packet payload, so increasing the packet size (payload size) will not consume extra CPU cycles.

Contrast this with BPS (bytes per second), where BPS scales with PPS and packet size. This BPS-to-PPS scaling will continue until the bandwidth limit is reached, either at the Ethernet controller or at the system interconnect.
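As a quick illustration of where that limit sits, the sketch below converts a packet rate into bits per second on the wire for several frame sizes and compares it with the theoretical limit of a single 10 Gb/s Ethernet port (counting the standard 20 bytes of per-frame preamble and inter-frame gap). The 4 MPPS rate is simply the approximate per-core figure discussed earlier, used here for illustration.

```python
# PPS-to-BPS conversion and a 10 GbE line-rate check.
# Each Ethernet frame carries an extra 20 bytes on the wire:
# 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap.
LINK_BPS = 10e9          # a single 10 Gb/s Ethernet port
WIRE_OVERHEAD = 20       # extra bytes per frame on the wire

def line_rate_pps(frame_bytes: int) -> float:
    """Maximum packets per second a 10 Gb/s port can carry at this frame size."""
    return LINK_BPS / ((frame_bytes + WIRE_OVERHEAD) * 8)

def offered_bps(pps: float, frame_bytes: int) -> float:
    """Bits per second on the wire for a given packet rate and frame size."""
    return pps * (frame_bytes + WIRE_OVERHEAD) * 8

for size in (64, 512, 1518):
    print(f"{size:5d} B frames: line rate {line_rate_pps(size) / 1e6:6.2f} Mpps, "
          f"4 Mpps offered = {offered_bps(4e6, size) / 1e9:5.2f} Gb/s")
```

At 64-byte frames, a core sustaining about 4 MPPS uses only a fraction of a 10 Gb/s port, while at larger frame sizes a single port saturates well below 4 MPPS. That crossover is the bandwidth limit at which BPS stops scaling with packet size.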
Conclusion

Our detailed POC tells us that, when selecting a hardware component for an edge router workload, developers should consider prioritizing core number and frequency.

In terms of scaling for future products, our model was able to project potential performance gains very effectively. For example, our simulation model showed a detailed distribution of CPU cycles and instructions for each workload stage. Developers can use this type of information to better estimate the performance gains when FPGA hardware accelerators or other ASIC off-loading methods are applied.

Our POC also demonstrated that our Intel CoFluent model is highly accurate in simulating vPE workloads on x86 systems. The average correlation between our simulations and the known-good physical architectures is within 96%.1 This correlation holds true even across the scaling of different hardware configurations.

The accuracy of these Intel CoFluent simulations can help developers prove the value of modeling and simulating their designs. With faster, more accurate simulations, developers can improve their choices of components used in new designs, reduce risk, and speed up development cycles. This can then help them reduce time to market for both new and updated products, and help reduce overall development costs.

Appendix A.
Performance in the downstream pipeline

Our joint team characterized the downstream pipeline on two of our devices under test (DUTs). A full study of the downstream pipeline was not in the scope of our proof of concept (POC) study. However, this appendix provides some preliminary results that we observed while studying the upstream pipeline.

Note that the downstream pipeline uses different software, with different functionality, and has different stages than the upstream pipeline. Full verification of results seen from the downstream pipeline will have to be a future project.

Table 3, earlier in this paper, describes the hardware DUT configuration for the Haswell-based microarchitecture used to determine throughput scaling of the downstream pipeline. Table A-1 (below) describes the hardware DUT configuration of a fourth, production version of the Skylake-based microarchitecture on which we also obtained downstream pipeline results. This fourth, production-version processor was the Intel® Xeon® processor Gold 6152, 2.1 GHz.

In this POC, we measured throughput on a single core for all stages of the downstream pipeline.

Figure A-1 shows throughput scaling as a function of frequency for the downstream pipeline. These measurements were made on the Intel® Xeon® processor E5-2680, 2.5 GHz DUT (Haswell-based architecture). Figure A-2 shows throughput measured at 2000 MHz on two architectures: the Haswell-based architecture, and the production version of the Skylake-based architecture.

As mentioned earlier, our POC focused on the upstream pipeline. A more detailed analysis of the downstream pipeline will be the subject of future work.

Table A-1. Test configuration based on the Intel® Xeon® processor Gold 6152, 2.1 GHz

Component   Description           Details
Processor   Product               Intel® Xeon® Gold processor 6152, 2.1 GHz
            Speed (MHz)           2095
            Number of CPUs        44 cores / 88 threads
            LLC cache             30976 KB
Memory      Capacity              256 GB
            Type                  DDR4
            Rank                  2
            Speed (MHz)           2666
            Channel/socket        6
            Per DIMM size         16 GB
NIC         Ethernet controller   4x10G
            Driver                igb_uio
OS          Distribution          Ubuntu 16.04.2 LTS
BIOS        Hyper-threading       Off

Figure A-1. Throughput scaling, as tested on the Intel® Xeon® processor E5-2680, 2.5 GHz (Haswell) device under test.1 Throughput scaling is measured in million packets per second (MPPS) as a function of CPU speed.

Figure A-2. Throughput scaling at 2000 MHz on different architectures.1 Throughput is measured in million packets per second (MPPS).

Appendix B.
Acronyms and terminology

This appendix defines and/or explains terms and acronyms used in this paper.

ACL  Access control list.
ASICs  Application-specific integrated circuits.
BF  Blocking probability, or "blocking factor."
BPS  Bytes per second.
CPI  Cycles per instruction.
CPIcore  CPI assuming infinite LLC (no off-chip accesses).
CPS  Clock cycles per second.
CSV  Comma-separated values.
DPDK  Data plane development kit.
DRAM  Dynamic random-access memory.
DST  Destination.
DUT  Device under test.
EBS  Event-based sampling.
EDP  EMON Data Processor tool. EDP is an Intel-developed tool used for hardware analysis.
EMON  Event monitor. EMON is an Intel-developed, low-level command-line tool for analyzing processors and chipsets.
FIFO  First in, first out.
FPGA  Field-programmable gate array.
I/O  Input / output.
IP  Internet protocol.
IPC  Instructions per cycle.
IPv4  Internet protocol version 4.
L1  Level 1 cache.
L2  Level 2 cache.
L3  Level 3 cache. Also called last-level cache.
LLC  Last-level cache. Also called level 3 cache.
LPM  Longest prefix match.
MAC  Media access control.
ML  Miss latency or memory latency, as measured in core clock cycles.
MPI  Message-passing interface. Also misses per instruction (with regards to LLC).
MPPS  Million packets per second.
NFV  Network function virtualization.
NIC  Network interface card.
NPU  Network-processing unit.
packetgen  Packet generator.
PCI-e  Peripheral Component Interconnect Express.
PHY  Physical layer.
POC  Proof of concept.
PQOS  Intel® Platform Quality of Service Technology utility.
PPS  Packets per second.
QoS  Quality of service.
RX  Receive, receiving.
SEP  Sampling Enabling Product, an Intel-developed tool used for hardware analysis.
SRC  Source.
TCP  Transmission control protocol.
TLB  Translation lookaside buffer.
TX  Transmit, transmitting.
VNF  Virtualized network function.
vPE  Virtual provider edge (router).

Appendix C.
Authors

AT&T authors
Kartik Pandit, AT&T
Vishwa M. Prasad, AT&T

Intel authors
Bianny Bian, Intel
Atul Kwatra, Intel
Patrick Lu, Intel
Mike Riess, Intel
Wayne Willey, Intel
Huawei Xie, Intel
Gen Xu, Intel

For information about Intel CoFluent technology, visit intel.cofluent.com


To download some of the test tools used in this POC, visit the Intel Developer Zone.

1 Results are based on Intel benchmarking and are provided for information purposes only.

Tests document performance of components on a particular test, in specific systems. Results have been estimated or simulated using internal Intel analyses or
architecture simulation or modeling, and are provided for informational purposes only. Any differences in system hardware, software, or configuration may affect actual
performance.

Performance results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and
"Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features
or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities
arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your
product order.

Information in this document is provided as-is. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel assumes no liability whatsoever, and Intel disclaims all express or implied warranty relating to this information, including liability or warranties relating to fitness
for a particular purpose, merchantability, or infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and
MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These
optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for
more information regarding the specific instruction sets covered by this notice.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting
www.intel.com/design/literature.htm.

Intel, the Intel logo, Xeon, and CoFluent are trademarks of Intel Corporation in the U.S. and/or other countries.

AT&T and the AT&T logo are trademarks of AT&T Inc. in the U.S. and/or other countries.

Copyright © 2018 Intel Corporation. All rights reserved.

Copyright © 2018 AT&T Intellectual Property. All rights reserved.

*Other names and brands may be claimed as the property of others.

Printed in USA
