Modeling the impact of CPU properties to optimize and predict packet-processing performance
AUTHORS
AT&T authors: Kartik Pandit, Vishwa M. Prasad
Intel authors: Bianny Bian, Atul Kwatra, Patrick Lu, Mike Riess, Wayne Willey, Huawei Xie, Gen Xu

Challenges in packet-processing

Today's packet-processing devices face an enormous performance challenge. Not only is more and more data being transmitted, but packet-processing tasks, which need to be executed at line speed, have become increasingly complex.

Along with the usual forwarding of data, packet-processing systems are also responsible for other functions. These functions include traffic management (shaping, timing, scheduling), security processing, and quality of service (QoS).

Making the issue even more of a challenge is the proliferation of internet devices and sensors. Data is now often produced faster than it can be transmitted, stored, or processed. Most of this data is transmitted as packet streams over packet-switched networks. Along the way, various network devices, such as switches, routers, and adapters, are used to forward and process the data streams.
Table of Contents

Challenges in packet-processing
Proposed solution: Network function virtualization (NFV)
  POC results
Proof of concept for network function virtualization (NFV)
  Goals of the POC
  Predictive model to meet POC goals
Key components used in this POC
  Network processing equipment
  Packet processing workload
  DPDK framework for packet processing
  Hardware and software simulation tools
    Traditional hardware and software simulation tools
    Better solutions model both hardware and software
  Intel® CoFluent™ Technology
    Simulation at a functional level
    Layered and configurable architecture
Upstream and downstream workloads
  Upstream pipeline
    Execution flow of the upstream pipeline
    6 Stages in a typical upstream pipeline
  Downstream pipeline
Physical system setup
  Packetgen in the physical DUT
  Generating a packet-traffic profile
  Performance and sensitivities of the traffic profile
    Hyperthreading was disabled to expose the impact of other elements
    Lookup table size affected performance
Developing the Intel CoFluent simulation model
  Simulating the packetgen
  Simulating the network
  Modeling the upstream pipeline
  Implementing lookup algorithms
  Developing a model of the cost of performance
    Hardware performance considerations
    Impact of cache on pipeline performance
    Characterizing the cost model
    Establishing the execution cost of the model
  Simulation constraints and conditions
    Support for the upstream pipeline
    Cache analysis supported for LLC
    Dropping or dumping packets was not supported
    Critical data paths simulated
Hardware analysis and data collection
  Hardware analysis tools
    Event Monitor (EMON) tool
    Sampling Enabling Product (SEP) tool
    EMON Data Processor (EDP) tool
  Collecting performance-based data
  Workload performance measurements and analysis for model inputs
Results and model validation
  Establishing a baseline for simulation accuracy
  Measuring performance at different core frequencies
  Analyzing performance for different cache sizes
  Measuring performance for different numbers of cores
  Simulating hardware acceleration components
  Performance sensitivity from generation to generation
    Core cycles per instruction (CPI)
    Maximum capacity at the LLC level
    Ideal world versus reality
  Performance sensitivities based on traffic profile
    Performance scaled linearly with the number of cores
    Execution flow was steady
Next steps
Summary
  Key findings
    CPU frequency
    LLC cache
    Performance
    Packet size
  Conclusion
Appendix A. Performance in the downstream pipeline
Appendix B. Acronyms and terminology
Appendix C. Authors

List of tables
  Table 1. Test configuration based on the pre-production Intel® Xeon® processor, 1.8 GHz (Skylake)
  Table 2. Test configuration based on the Intel® Xeon® processor E5-2630, 2.2 GHz (Broadwell)
  Table 3. Test configuration based on the Intel® Xeon® processor E5-2680, 2.5 GHz (Haswell)
  Table A-1. Test configuration based on the Intel® Xeon® processor Gold 6152, 2.1 GHz

List of figures
  Figure 1. Edge router with upstream and downstream pipelines
  Figure 2. Edge router's upstream software pipeline
  Figure 3. Edge router's downstream software pipeline
  Figure 4. Model development
  Figure 5. Model of upstream software pipeline
  Figure 6. Simple models for estimating cycles per instruction
  Figure 7. Baseline performance measurements
  Figure 8. Performance measured at different frequencies
  Figure 9. Performance measured for different LLC cache sizes
  Figure 10. Measuring performance on multi-core CPUs
  Figure 11. Performance comparison of baseline configuration versus a simulation that includes an FPGA-accelerated ACL lookup
  Figure 12. Comparison of performance from CPU generation to generation
  Figure A-1. Throughput scaling, as tested on the Intel® Xeon® processor E5-2680
  Figure A-2. Throughput scaling at 2000 MHz on different architectures
Figure 1. Edge router with upstream and downstream pipelines. Upstream traffic moves from end users toward the network’s core.
Downstream traffic moves toward the end user(s).
DPDK framework for packet processing

The Intel DPDK is a set of libraries and drivers for fast packet processing. The DPDK packet framework gives developers a standard methodology for building complex packet-processing pipelines. The DPDK provides pipeline configuration files and functional blocks for building different kinds of applications. For example, for our POC, we used the functions to build our internet protocol (IP) pipeline application.

One of the benefits of DPDK functions is that they help with the rapid development of packet-processing applications that run on multicore processors. For example, in our POC, the edge router pipeline is built on the IP pipeline application (based on the DPDK functions), to run on our four physical hardware DUTs. Our IP pipeline models a provider edge router between the access network and the core network (see Figure 1).

Hardware and software simulation tools

Optimizing a design for network traffic is typically done using traditional simulation tools and a lot of manual effort. We were looking for a better approach that would make it easier and faster for developers to choose the best components for their needs.

Traditional hardware and software simulation tools

For system analysis, traditional simulation-based modeling tools range from solely software-oriented approaches to solely hardware-oriented approaches. Unfortunately, these traditional tools have not been able to meet the complex performance challenges driven by today's packet-processing devices.

At one end of the traditional analysis spectrum are the software-oriented simulations. In these simulations, software behavior and interactions are defined against a specific execution time. However, solutions based solely on software analyses do not take hardware efficiency into consideration. Hardware efficiency has a significant impact on system performance.

At the other end of the spectrum are hardware-oriented simulators. These simulators model system timing on a cycle-by-cycle basis. These models are highly accurate, but suffer from very slow simulation speeds. Because of this, they are not usually used to analyze complete, end-to-end systems. Instead, they are used mainly for decision-making at the microarchitecture level.

Better solutions model both hardware and software

Solutions that model only software performance, or that model only hardware performance, are not effective for modeling the performance of a complete system. The best solution for modeling a complete system would be:

• Highly configurable
• Able to simulate both software and hardware aspects of an environment
• Easy to execute, without the overhead of setting up actual packet-processing applications
6 Stages in a typical upstream pipeline

In the specific case of the upstream traffic pipeline of the DPDK vPE, there are usually 6 stages. Figure 2 (earlier in this paper) shows an overview of the 6 typical stages. The first pipeline stage drains packets from the Ethernet controller. The last stage in the chain queues up the packets and sends them to the core network through the Ethernet controller.

In our POC, we modeled and simulated all key stages of the upstream pipeline, and verified those results against known hardware configurations.

Downstream pipeline

In the edge router's downstream traffic pipeline, there are 3 actively running components and 4 typical pipeline stages. Figure 3 (earlier in this paper) shows an overview of the four typical stages. The three components are:

• DPDK packetgen
• Intel® Ethernet controller (the NIC)
• Downstream software pipeline stages, running on one or more cores

In our POC, for the downstream pipeline, the packetgen injects packets into the Ethernet controller. This simulates packets entering the access network from the core. The first stage of the edge router's downstream pipeline pops packets from the internal buffer of the Ethernet controller. The last stage sends packets to the access network via the Ethernet controller.

Note that the scope of this project did not allow a complete analysis of the downstream pipeline. The downstream pipeline uses different software and has different functionality and pipeline stages, as compared to the upstream pipeline. We do provide some of the data and insights for the downstream pipeline that we observed while conducting our POC (see Appendix A). However, full analysis and verification of those initial results will have to be a future project.

Project names for devices under test (DUTs)

Intel internal project code names are often used to refer to various processors during development, proof of concepts (POCs), and other research projects. In the joint Intel and AT&T POC, we used three main test configurations, one of which was a pre-production processor (Skylake). Some of the devices under test (DUTs) were used to confirm simulation results and establish the accuracy of the simulations. Some were used to confirm simulation results for projecting optimal configurations for future generations of processors. A fourth, production version of the Skylake microarchitecture was used to characterize some aspects of the downstream pipeline (see Appendix A).

The three main DUTs for our POC were based on these processors, with these project code names:

• Skylake-based DUT: Pre-production Intel® Xeon® processor, 1.8 GHz
• Broadwell-based DUT: Intel® Xeon® processor E5-2630, 2.2 GHz
• Haswell-based DUT: Intel® Xeon® processor E5-2680, 2.5 GHz

Physical system setup

When our team began setting up this POC, we started with a description of a typical physical architecture. We then set up a hardware DUT that would match that architecture as closely as possible. We set up additional DUTs to provide configurations for comparisons and verifications.

Tables 1 and 2 (below) describe the two DUTs we built for the first phase of our NFV POC. We used these DUTs to take performance measurements on the upstream pipeline. We compared those measurements to the corresponding elements of our Intel CoFluent simulations. The physical DUTs helped us determine the accuracy of the virtual Intel CoFluent model that we used for our simulations and projections.

In our POC, the devices under test (DUTs) used IxNetwork* client software to connect to an Ixia traffic generator. Ixia generates simulated edge traffic into the DUT, and reports measurements of the maximum forwarding performance of the pipeline. In our model, we did not include considerations of packet loss.
The DUT described in Table 1 is based on a pre-production Intel® Xeon® processor, 1.8 GHz, with 32 cores. The Intel-internal project code name for this pre-production processor is "Skylake."

The DUT described in Table 2 is based on an Intel® Xeon® processor E5-2630, 2.2 GHz. The Intel-internal project code name for this processor is "Broadwell."

In order to explore the performance sensitivity of one processor generation versus another, we set up an additional DUT, as described in Table 3. This DUT is based on an Intel® Xeon® processor E5-2680, 2.5 GHz. The Intel-internal project code name for this processor is "Haswell."

Table 1. Test configuration based on the pre-production Intel® Xeon® processor, 1.8 GHz (Skylake)
• Processor: pre-production Intel® Xeon® processor, 1.8 GHz; speed 1800 MHz; 32 cores / 64 threads; 22528 KB LLC cache
• Memory: 64 GB DDR4; rank 2; speed 2666 MHz; 6 channels per socket; 16 GB per DIMM
• NIC: Ethernet controller X710-DA4 (4x10G); driver igb_uio
• OS: Ubuntu 16.04.2 LTS; kernel 4.4.0-64-lowlatency
• BIOS: Hyper-threading off

Table 2. Test configuration based on the Intel® Xeon® processor E5-2630, 2.2 GHz (Broadwell)
• Processor: Intel® Xeon® processor E5-2630, 2.2 GHz; speed 2200 MHz; 10 cores / 20 threads; 25600 KB LLC cache
• Memory: 64 GB DDR4; rank 2; speed 2133 MHz; 4 channels per socket; 16 GB per DIMM
• NIC: Ethernet controller X710-DA4 (4x10G); driver igb_uio
• OS: Ubuntu 16.04.2 LTS; kernel 4.4.0-64-lowlatency
• BIOS: Hyper-threading off

Packetgen in the physical DUT

As mentioned earlier in the description of the upstream pipeline, our physical systems included the Ixia packetgen. In the upstream pipeline, the job of this hardware-based packetgen is to generate packets and work with the packet receive (RX) and transmit (TX) functions. Basically, the packetgen sends packets into the receive unit or out of the transmit unit. This is just one of the key hardware functions that was simulated in our Intel CoFluent model.
Table 3. Test configuration based on the Intel® Xeon® processor E5-2680, 2.5 GHz (Haswell)
• Processor: Intel® Xeon® processor E5-2680 v3, 2.5 GHz; speed 2500 MHz; 24 cores / 24 threads; 30720 KB LLC cache
• Memory: 256 GB DDR4; rank 2; speed 2666 MHz; 6 channels per socket; 16 GB per DIMM
• NIC: Ethernet controller (4x10G); driver igb_uio
• OS: Ubuntu 16.04.2 LTS; kernel 4.4.0-64-lowlatency
• BIOS: Hyper-threading off

Generating a packet-traffic profile

Once we set up our physical DUTs, we needed to estimate the performance effect of a cache miss in the routing table lookup on these architectures. To do this, for each packet, we increased the destination IP by a fixed stride of 0.4.0.0. This caused each destination IP lookup to hit a different memory location in the routing table.

For the source IP setting, any IP stride should be appropriate, as long as the stride succeeds on the access control list (ACL) table lookup. (The exact relationship of cache misses and traffic characteristics is not described in this POC, and will be investigated in a future study.)

For our POC, we chose the following IP range settings to traverse the LPM (longest prefix match) table. For lpm24, the memory range is 64 MB, which exceeds the LLC size, and can trigger a miss in the LLC cache.

range 0 dst ip start 0.0.0.0
range 0 dst ip min 0.0.0.0
range 0 dst ip max 255.255.255.255
range 0 dst ip inc 0.4.0.0

In our physical test model, we used default settings for other parameters, such as the media access control (MAC) address, source (SRC) transmission control protocol (TCP) port, and destination (DST) TCP port.

Performance and sensitivities of the traffic profile

In order to get the most accurate results, we needed to characterize the traffic profile in detail for both the hardware DUTs and our Intel CoFluent models and simulations.

Hyperthreading was disabled to expose the impact of other elements

There are a number of system and application parameters that can impact performance, including hyperthreading. For example, when we ran the workload with hyperthreading enabled, we gained about 25% more performance per core.1

However, hyperthreading shares some hardware resources between the threads on a core, and this can mask core performance issues. Also, the performance delivered by hyperthreading can make it hard to identify the impact of other, more subtle architectural elements. Since we were looking for the impact of those other packet-handling elements, we disabled hyperthreading for this POC.

Lookup table size affected performance

While setting up the experiments, we observed a performance difference (delta) that depended on the size of the application's lookup table. Because of this, for our POC, we decided to use the traffic profile described under "Generating a packet-traffic profile." This ensured that we had some LLC misses in our model.
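The arithmetic below is a rough sketch of why the destination-IP stride chosen above produces those LLC misses. It assumes a flat first-level /24 LPM table (as in DPDK's rte_lpm tbl24), derives the entry size from the 64 MB footprint quoted above, and uses a 64-byte cache line; these details are illustrative assumptions rather than values reported by the POC.

# Illustrative only: why a destination-IP increment of 0.4.0.0 defeats the LLC.
TABLE_ENTRIES = 2 ** 24                       # one entry per /24 prefix (lpm24 first-level table)
TABLE_BYTES = 64 * 1024 * 1024                # 64 MB footprint quoted in the paper
ENTRY_BYTES = TABLE_BYTES // TABLE_ENTRIES    # -> 4 bytes per entry (derived assumption)

STRIDE = 0x00040000                           # 0.4.0.0 added to the destination IP per packet
CACHE_LINE_BYTES = 64
LLC_BYTES = 22528 * 1024                      # 22 MB LLC on the Skylake-based DUT

index_stride = STRIDE >> 8                    # the table is indexed by the top 24 bits of the IP
byte_stride = index_stride * ENTRY_BYTES      # distance between consecutive lookups

print(index_stride)                           # 1024 entries
print(byte_stride)                            # 4096 bytes: a new cache line on every packet
print(byte_stride > CACHE_LINE_BYTES)         # True
print(TABLE_BYTES > LLC_BYTES)                # True: the table cannot fit in the LLC
print(TABLE_BYTES // byte_stride)             # 16384 distinct lines touched before the pattern wraps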
Figure 4. Model development. This figure shows how we modeled the flow for the simulation of the entire virtual edge provider (vPE) router pipeline.
Figure 5. Model of upstream software pipeline. The ACL pipeline stage supports auditing of incoming traffic. Note that, in our proof-of-concept (POC),
the queueing stage has two locations, and performs packet receiving (RX) or packet transmitting (TX), depending on its location in the pipeline.
Simulating the network

In this POC, the Ethernet controller was simulated based on a very simple throughput model, which receives or sends packets at a user-specified rate.

For our POC, since we wanted to characterize the performance impact of CPU parameters on the software packet-processing pipeline, we did not implement the internal physical layer (PHY), MAC, switching, or first-in-first-out (FIFO) logic. We specifically defined our test to make sure we would not see system bottlenecks from memory bus bandwidth or from the bandwidth of the Peripheral Component Interconnect Express (PCI-e). Because of that, we did not need to model the effect of that network transaction traffic versus system bandwidth.

Modeling the upstream pipeline

In our POC, we simulated all key stages of the upstream packet-processing pipeline. The ACL filters, flow classifier, metering and policing, and routing stages were modeled individually. The packet RX stage and the queuing and packet TX stage are also usually separate pipeline stages. In our POC, we modeled the packet RX and packet TX stages as a single packet queueing stage that was located at both the beginning and the end of the pipeline (see Figure 5).

Implementing lookup algorithms

One of the things we needed to do for our model was to implement lookup algorithms. To do this, we first had to consider the pipelines. As shown earlier in Figure 2, an upstream pipeline usually consists of 3 actively running components and 6 typical pipeline stages.

Note that the ACL pipeline stage is a multiple-bit trie implementation (a tree-like structure). The routing pipeline stage uses an LPM lookup algorithm, which is based on a full implementation of the binary tree. For our POC, we implemented the ACL lookup algorithm and the LPM lookup algorithm to support auditing of the incoming traffic. We also implemented these two algorithms to support routing of traffic to different destinations.

In addition, the flow classification used a hash table lookup algorithm, while flow action used an array table lookup algorithm. We implemented both of these algorithms in our model.

Developing a model of the cost of performance

It's important to understand that Intel CoFluent is a high-level framework for simulating behavior. This means that the framework doesn't actually execute CPU instructions, access cache or memory cells, or perform network I/O. Instead, Intel CoFluent uses algorithms to simulate these processes with great accuracy (as shown in this POC).1

In order to develop a model of the execution cost of performance, we tested the accuracy of the simulations in all phases of our POC by comparing the simulation results to measurements of actual physical architectures of various DUTs. The delta between the estimated execution times from the simulation, and measurements made on the physical DUTs, ranged from 0.4% to 3.3%.1 This gave us a high degree of confidence in our cost model.

(Specifics on the accuracy of our model and simulations, as compared to the DUTs, are discussed later in this paper under "Results and model validation.")

With the very small delta seen between performance on the simulations versus the physical DUTs, we expect to be able to use other Intel CoFluent models in the future, to effectively identify the best components for other packet-traffic workloads under development.
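As a small illustration of the rate-based Ethernet-controller model described under "Simulating the network" above, the sketch below generates packet arrival timestamps at a user-specified rate. It is not Intel CoFluent code; the names, structure, and example rate are invented for illustration.

# Minimal sketch of a rate-based NIC model: the controller is reduced to
# "packets appear at a user-specified rate", with no PHY, MAC, switching,
# or FIFO behavior modeled.
from dataclasses import dataclass
from typing import Iterator, Tuple

@dataclass
class Packet:
    seq: int
    size_bytes: int

def packet_source(rate_pps: float, packet_size: int, count: int) -> Iterator[Tuple[float, Packet]]:
    """Yield (arrival_time_in_seconds, packet) pairs at a fixed rate."""
    interval = 1.0 / rate_pps
    for seq in range(count):
        yield seq * interval, Packet(seq=seq, size_bytes=packet_size)

# Example: a 3.8 million packets-per-second stream of 64-byte packets.
for t, pkt in packet_source(rate_pps=3.8e6, packet_size=64, count=3):
    print(f"t={t * 1e6:.3f} us  packet #{pkt.seq}")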
In order to get the execution latency for a specific length of the pipeline, we also had to estimate the path length for that section of the pipeline. The path length is the number of x86 instructions retired per 1 Mb of data sent. Again, see Figure 6.

In our model, multiplying the two variables, CPI and path length, gives the execution time of the software pipeline in terms of CPU cycles. With that information, we were able to simulate CPI and path length, using the Intel CoFluent tool, in order to compute the end-to-end packet throughput.

Establishing the execution cost of the model

We began building our Intel CoFluent cost model based on the DPDK code. With the previous considerations taken into account, we used the DPDK code to measure the instructions and cycles spent in the different pipeline stages. These cycles were assumed to be the basic execution cost of the model. Figure 4, earlier in this paper, shows an overview of the cost model.

Simulation constraints and conditions

For the upstream pipeline, we modeled hardware parameters (such as CPU frequency and LLC cache size), packet size, pipeline configurations, and flow configurations. We focused on the areas we expected would have the biggest impact on performance. We then verified the results of our simulations against performance measurements made on the physical hardware DUTs.

In order to create an effective model for a complete system, we accepted some conditions for this project.

Support for the upstream pipeline

As mentioned earlier, this POC was focused on the upstream pipeline. The scope of our model did not support simulating the downstream pipeline. However, we hope to conduct future POCs to explore packet performance in that pipeline. The information we did collect on the downstream pipeline is presented in Appendix A, at the end of this paper.

Cache analysis supported for LLC

In this study, our test setup did not allow us to change the L1 and L2 sizes to evaluate their impact on performance. Because of this, our model supported only the LLC cache size sensitivity analysis, and not an analysis of L1 or L2.

Dropping or dumping packets was not supported

Dropping packets is an error-handling method, and dumping packets is a debugging or profiling tool. Dropping and dumping packets doesn't always occur in the upstream pipeline. If it does, it can occur at a low rate during the running lifetime of that pipeline.

Our test model did not support dropping or dumping packets. If we had included dropping packets and/or the debugging tools in our POC model, they could have introduced more overhead to the simulator. This could have slowed the simulation speed and skewed our results. We suspect that dropping and dumping packets might not be critical to performance in most scenarios, but we would need to create another model to explore those impacts. That would be an additional project for the future.

Critical data paths simulated

With those three constraints in place, we modeled and simulated the most critical data paths of the upstream pipeline. This allowed us to examine the most important performance considerations of that pipeline.

Hardware analysis and data collection

Verifying accuracy is an important phase of any simulation study. In order to test the accuracy of our Intel CoFluent simulations against the hardware DUT configurations, we needed to look at how VNF performance data would be collected and analyzed.

Hardware analysis tools

We used three hardware analysis tools to help with our VNF verifications: the Event Monitor (EMON) tool, the Sampling Enabling Product (SEP) tool, and the EMON Data Processor (EDP) tool. These tools were developed by Intel, and are available for download from the Intel Developer Zone.

Event Monitor (EMON) tool

EMON is a low-level command-line tool for processors and chipsets. The tool logs event counters against a timebase. For our POC, we used EMON to collect and log hardware performance counters.

You can download EMON as part of the Intel® VTune Amplifier suite. Intel VTune Amplifier is a performance analysis tool that helps users develop serial and multithreaded applications.

Counter data of this kind does not report hotspots. For example, it does not report where the code takes up the most cycles, or where it generates the most cache misses. Sampling mode is better for collecting that kind of information.

Sampling Enabling Product (SEP) tool

SEP is a command-line performance data collector. It performs event-based sampling (EBS) by leveraging the counter overflow feature of the test hardware's performance monitoring unit (PMU). The tool captures the processor's execution state each time a performance counter overflow raises an interrupt.

Using SEP allowed us to directly collect the performance data, including cache misses, of the target hardware system. SEP is part of the Intel VTune Amplifier.

EMON Data Processor (EDP) tool

EDP is an Intel-internal analysis tool that processes EMON performance samples for analysis. EDP analyzes key hardware events such as CPU utilization, core CPI, cache misses and miss latency, and so on.
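To make the cost model described above concrete, here is a hedged numeric sketch of how per-stage path lengths and a measured CPI combine into cycles and throughput. The stage names follow the modeled upstream pipeline; the instruction counts and CPI are placeholder values, not the measurements from this POC, and path length is expressed per packet here rather than per megabit of data sent.

# Sketch of the cost-model arithmetic: cycles = path length x CPI,
# throughput = core frequency / cycles per packet. All numbers are placeholders.
CORE_HZ = 1.8e9            # Skylake-based DUT core frequency
CPI = 0.50                 # core cycles per instruction (placeholder of the right order)

# Instructions retired per packet in each modeled upstream stage (placeholders).
path_length_per_packet = {
    "queue/RX": 150,
    "ACL lookup": 300,
    "flow classification": 250,
    "metering and policing": 200,
    "routing (LPM)": 250,
    "queue/TX": 150,
}

instructions_per_packet = sum(path_length_per_packet.values())
cycles_per_packet = instructions_per_packet * CPI
throughput_pps = CORE_HZ / cycles_per_packet

print(f"{instructions_per_packet} instructions per packet")
print(f"{cycles_per_packet:.0f} cycles per packet")
print(f"{throughput_pps / 1e6:.2f} MPPS on one core")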
Results and model validation

Our results show that it is possible to use a VNF model to estimate both best-case and worst-case packet performance for any given production environment.

Figure 7. Baseline performance measurements. [Chart: baseline throughput, in million packets per second (MPPS), for the physical hardware versus the Intel® CoFluent™ Technology simulation.]

Figure 8. Performance measured at different frequencies for our simulation versus on the physical device under test (DUT).1 The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake). [Chart: throughput (MPPS) at core frequencies from 1.1 GHz to 3.0 GHz for the physical system and the simulation, with the delta between the two shown as a percentage.]

Analyzing performance for different cache sizes

Figure 9 shows packet performance as measured for different LLC cache sizes: 2 MB, 4 MB, 8 MB, and 22 MB. In our simulation, cache size was adjusted using the Intel PQOS utility.

Our POC showed that a 2 MB LLC cache size causes a dramatically larger miss rate (a 32% miss rate) than a 22 MB LLC cache size (a 1.6% miss rate).1 However, almost 90% of memory accesses hit in the L1 cache.1 Because of this, adjusting the LLC cache size decreases performance by a maximum of only 10%.1

In Figure 9, the yellow line again shows the difference between measurements made on the physical DUT and measurements of the simulation. The delta remains very small, between 0.5% and 3.3%.1

Figure 9. Performance measured for different LLC cache sizes in our simulations versus the physical DUT.1 The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake). [Chart: throughput (MPPS) at 1.8 GHz with 2 MB, 4 MB, 8 MB, and 22 MB of LLC, for the physical system and the simulation.]
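A short arithmetic sketch of why the large swing in LLC miss rate translates into a modest throughput change: if roughly 90% of memory accesses are satisfied by L1, only about 10% ever reach the LLC, so even a 32% LLC miss rate turns only a few percent of all accesses into DRAM trips. The 90% figure and the miss rates come from the POC; treating every non-L1 access as an LLC access is a simplifying assumption.

# Roughly: share of ALL memory accesses that fall through to DRAM for each LLC size.
l1_hit_fraction = 0.90                   # from the POC: ~90% of accesses hit in L1
llc_fraction = 1.0 - l1_hit_fraction     # simplification: the remainder are resolved at the LLC or beyond

for label, llc_miss_rate in (("22 MB LLC", 0.016), ("2 MB LLC", 0.32)):
    dram_fraction = llc_fraction * llc_miss_rate
    print(f"{label}: {dram_fraction:.2%} of all memory accesses go to DRAM")
# -> 22 MB LLC: 0.16%   2 MB LLC: 3.20%
# A few percent of accesses paying DRAM latency is consistent with the <=10%
# throughput difference reported above.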
Measuring performance for different numbers of cores

Figure 10 shows the throughput results for measurements taken on DUTs with various numbers of cores. These measurements were made on the pre-production Skylake-based DUT, and compared with the results projected by our simulation. Note that in this POC, the upstream pipeline ran on a single core, even when run on processors with multiple cores.

As you can see in Figure 10, the throughput scales linearly as more cores are added. In this test, the packets were spread evenly across the cores by way of a manual test configuration. In our test, all pipeline stages were bound to one fixed core, and the impact of core-to-core movement was very small.

Again, the yellow line represents the difference between measurements of the physical system and measurements of the simulation. For performance based on the number of cores, the delta is still very small, between 0.4% and 3.0% for simulation predictions as compared to measurements made on the DUT.1

Figure 10. Measuring performance on multi-core CPUs.1 The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake). [Chart: throughput (MPPS) when scaling from 1 core to 2 cores to 6 cores, for the physical system and the simulation; the deltas were 0.4%, 1.2%, and 3.0%.]

Simulating hardware acceleration components

Previously, we showed how we broke down the distribution of CPU cycles and instructions amongst the different stages of the pipelines. Just as we did in that analysis, we can do a similar what-if analysis to identify the best hardware accelerators for our model.

For example, in one what-if analysis, we replaced the ACL lookup with an FPGA accelerator that is 10 times as efficient as the standard software implementation in the DPDK. We found that swapping this component sped up the performance of the overall upstream traffic pipeline by over 15% (see Figure 11).1

It's important to note that this performance result represents only one functionality of the pipeline that was simulated for FPGA. The result we saw here does not represent the full capability of the FPGA used for this workload. Still, this kind of what-if analysis can help developers more accurately estimate the cost and efficiency of adopting FPGA accelerators, or of using some other method to offload hardware functionalities.

Figure 11. Performance comparison of baseline configuration versus a simulation that includes a field-programmable gate array (FPGA)-accelerated access control list (ACL) lookup.1 The DUT for these measurements was based on a pre-production Intel® Xeon® processor, 1.8 GHz (Skylake). [Chart: about 3.80 MPPS for the baseline hardware configuration without the FPGA-accelerated ACL lookup, versus roughly 15.5% higher throughput with the simulated FPGA-accelerated ACL lookup.]
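The what-if result above can be cross-checked with an Amdahl's-law style calculation: if the ACL lookup consumes a fraction f of the per-packet cycles and becomes 10 times faster, the end-to-end speedup is 1 / ((1 - f) + f/10). The sketch below inverts that relationship numerically; the roughly 15% ACL share it recovers is an inference from the published speedup, not a figure reported in the paper.

# Amdahl-style cross-check of the FPGA what-if analysis (illustrative).
def overall_speedup(f: float, k: float = 10.0) -> float:
    """End-to-end speedup when a stage taking fraction f of the cycles gets k times faster."""
    return 1.0 / ((1.0 - f) + f / k)

# Find the ACL share of per-packet cycles consistent with a ~15.5% overall gain.
target = 1.155
lo, hi = 0.0, 1.0
for _ in range(60):                      # bisection; overall_speedup() is monotonic in f
    mid = (lo + hi) / 2.0
    if overall_speedup(mid) < target:
        lo = mid
    else:
        hi = mid

print(f"Implied ACL share of per-packet cycles: ~{hi:.1%}")         # ~14.9%
print(f"Check: speedup at that share = {overall_speedup(hi):.3f}")  # ~1.155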
Best case and worst case traffic profiles

At the time of this joint NFV project, a production traffic profile was not available for analyzing bottlenecks in a production deployment. However, we did analyze both the worst-case profile and the best-case profile. In the worst-case profile, every packet is a new flow. In the best-case profile, there is only one flow. We did not set out to study these scenarios specifically, but the traffic profile we used provided information on both best- and worst-case profiles.

Our results showed that the difference in performance between worst-case and best-case profiles was only 7%.1 That could offer developers a rough estimation of what performance could be like between best-case and worst-case packet performance for any given production environment.

It's important to understand that our results show only a rough estimation of that difference for our pipeline model and our particular type of application. The packet performance gap between your own actual best- and worst-case traffic profiles could be significantly different.

Performance sensitivity from generation to generation

Figure 12 shows the performance of the upstream pipeline on a Broadwell-based microarchitecture, as compared to a Skylake-based microarchitecture. The Intel CoFluent simulation gives us an estimated delta of less than 4% for measurements of packet throughput on simulated generations of microarchitecture, as compared to measurements on the physical DUTs.1

Figure 12. Comparison of performance from CPU generation to generation.1 [Chart: throughput (MPPS) measured on the hardware device under test (DUT) and projected by the Intel® CoFluent™ Technology simulation, for the Intel® Xeon® processor E5-2630L v4, 1.8 GHz (Broadwell) and the pre-production Intel® Xeon® processor, 1.8 GHz (Skylake).]

Core cycles per instruction (CPI)

Our POC results tell us that several specific factors affect CPI and performance. For example, the edge routers on both Broadwell and Skylake microarchitectures have the same program path length. However, Skylake has a much lower core CPI than Broadwell (lower CPI is better). The core CPI on the Broadwell-based DUT is 0.87, while the core CPI on the Skylake DUT is only 0.50.1

Broadwell also has only 256 KB of L2 cache, while Skylake has 2 MB of L2 cache (more cache is better). Also, when there is a cache miss in L2, the L2 message-passing interface (MPI) on the Skylake-based DUT is 6x the throughput of L2 MPI delivered by Broadwell.1

Our POC measurements tell us that all of these factors contribute to the higher core CPI seen for Broadwell microarchitectures, versus the greater performance delivered by Skylake.

Maximum capacity at the LLC level

One of the ways we used our simulations was to understand performance when assuming maximum capacity at the LLC level. This analysis assumed an infinite-sized LLC, with no LLC misses.
Our analysis shows that packet throughput can achieve a theoretical maximum of 3.98 MPPS (million packets per second) per core on Skylake-based microarchitectures.1

Ideal world versus reality

In an ideal world, we would eliminate all pipeline stalls in the core pipeline, eliminate all branch mispredictions, eliminate all translation lookaside buffer (TLB) misses, and assume that all memory accesses hit in the L1 data cache. This would allow us to achieve the optimal core CPI.

For example, a Haswell architecture can commit up to 4 fused µOPs each cycle per thread. Therefore, the optimal CPI for the 4-wide microarchitecture pipeline is theoretically 0.25. For Skylake, the processor's additional microarchitecture features can lower the ideal CPI even further.

If we managed to meet optimal conditions for Haswell, when CPI reaches 0.25, we could double the packet performance seen today, which would then be about 8 MPPS (7.96 MPPS) per core. (Actual CPI is based on the application, of course, and on how well the application is optimized for its workload.)

Fused µOPs

Along with traditional micro-ops fusion, Haswell supports macro fusion. In macro fusion, specific types of x86 instructions are combined in the pre-decode phase, and then sent through a single decoder. They are then translated into a single micro-op.

Performance sensitivities based on traffic profile

Our POC shows that the packet processing performance for the upstream pipeline on the edge router will change depending on the traffic profile you use.

Performance scaled linearly with the number of cores

Using the traffic profile we chose, we found that we could sustain performance at about 4 MPPS per core on a Skylake architecture.1 We tested this performance on systems with 1, 2, and 6 cores, and found that performance scaled linearly with the number of cores (see Figure 10, earlier in this paper).1

Note that, when adding more cores, each core can use less LLC cache, which may cause a higher LLC cache miss rate. Also, as mentioned earlier in this paper, under the heading "Impact of cache on pipeline performance," adjusting the LLC cache size could impact performance. So adding more cores could actually increase fetch latency and cause core performance to drop.

Our results include the performance analysis for different cache sizes, from 2 MB to 22 MB. We did not obtain results for cache sizes smaller than 2 MB.

Execution flow was steady

We also discovered that our test vPE application delivered a steady execution flow. That meant we had a predictable number of instructions per packet. We can take that further to mean that the higher the core clock frequency, the more throughput we could achieve.

For our POC conditions (traffic profile, 1.8 GHz to 2.1 GHz processors, workload type), we found that performance of the vPE scales linearly as frequency increases.1

Next steps

As we continue to model packet performance, it's unavoidable that we will have to deal with hardware concurrency and the interactions typically seen with various core and non-core components. The complexity of developing and verifying such systems will require significant resources. However, we believe we could gain significant insights from additional POCs.

We suggest that next steps include:

• Use our Intel CoFluent model to identify component characteristics that have the greatest impact on the performance of networking workloads. This could help developers choose the best components for cluster designs that are focused on particular types of workloads.
• Model and improve packet-traffic profiles to support a multi-core paradigm. This paradigm would allow scheduling of different parts of the workload pipeline onto different cores.
• Model and improve traffic profiles to study the impact of the number of new flows per second, load balancing across cores, and other performance metrics.
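As a closing cross-check of the "Ideal world versus reality" estimate above, the arithmetic sketch below shows where the projected doubling comes from: with the program path length and frequency held constant, per-core throughput scales inversely with CPI. The CPI and MPPS figures are the ones quoted in that discussion.

# Throughput per core ~ frequency / (CPI x instructions per packet).
# Holding frequency and path length fixed, throughput scales with 1 / CPI.
measured_cpi = 0.50        # core CPI measured on the Skylake-based DUT
ideal_cpi = 0.25           # theoretical optimum for a 4-wide pipeline
measured_mpps = 3.98       # per-core ceiling reported for the Skylake-based DUT

ideal_mpps = measured_mpps * (measured_cpi / ideal_cpi)
print(f"Ideal-CPI ceiling: ~{ideal_mpps:.2f} MPPS per core")   # ~7.96 MPPS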
Appendix A. Performance in the downstream pipeline

Table 3, earlier in this paper, describes the hardware DUT configuration for the Haswell-based microarchitecture used to determine throughput scaling of the downstream pipeline. Table A-1 describes the hardware DUT configuration of a fourth, production version of the Skylake-based microarchitecture on which we also obtained downstream pipeline results. This fourth, production-version processor was the Intel® Xeon® processor Gold 6152, 2.1 GHz; like the other DUTs, it ran Ubuntu 16.04.2 LTS with hyper-threading disabled in the BIOS.

In this POC, we measured throughput on a single core for all stages of the downstream pipeline.

Figure A-1 shows throughput scaling as a function of frequency for the downstream pipeline. These measurements were made on the Intel® Xeon® processor E5-2680, 2.5 GHz DUT (Haswell-based architecture). Figure A-2 shows throughput measured at 2000 MHz on two architectures: the Haswell-based architecture, and the production version of the Skylake-based architecture.

Figure A-1. Throughput scaling, as tested on the Intel® Xeon® processor E5-2680, 2.5 GHz (Haswell) device under test.1 Throughput scaling is measured in million packets per second (MPPS) as a function of CPU speed.

Figure A-2. Throughput scaling at 2000 MHz on different architectures.1 Throughput scaling is measured in million packets per second (MPPS) as a function of CPU speed.

As mentioned earlier, our POC focused on the upstream pipeline. A more detailed analysis of the downstream pipeline will be the subject of future work.
Appendix B. Acronyms and terminology

This appendix defines and/or explains terms and acronyms used in this paper.

ACL: Access control list.
ASICs: Application-specific integrated circuits.
BF: Blocking probability, or "blocking factor."
BPS: Bytes per second.
CPI: Cycles per instruction.
CPIcore: CPI assuming infinite LLC (no off-chip accesses).
CPS: Clock cycles per second.
CSV: Comma-separated values.
DPDK: Data plane development kit.
DRAM: Dynamic random-access memory.
DST: Destination.
DUT: Device under test.
EBS: Event-based sampling.
EDP: EMON Data Processor tool. EDP is an Intel-developed tool used for hardware analysis.
EMON: Event monitor. EMON is an Intel-developed, low-level command-line tool for analyzing processors and chipsets.
FIFO: First in, first out.
FPGA: Field-programmable gate array.
I/O: Input/output.
IP: Internet protocol.
IPC: Instructions per cycle.
IPv4: Internet protocol version 4.
L1: Level 1 cache.
L2: Level 2 cache.
L3: Level 3 cache. Also called last-level cache.
LLC: Last-level cache. Also called level 3 cache.
LPM: Longest prefix match.
MAC: Media access control.
ML: Miss latency or memory latency, as measured in core clock cycles.
MPI: Message-passing interface. Also misses per instruction (with regard to the LLC).
MPPS: Million packets per second.
NFV: Network function virtualization.
NIC: Network interface card.
NPU: Network-processing unit.
packetgen: Packet generator.
PCI-e: Peripheral Component Interconnect Express.
PHY: Physical layer.
POC: Proof of concept.
PQOS: Intel® Platform Quality of Service Technology utility.
PPS: Packets per second.
QoS: Quality of service.
RX: Receive, receiving.
SEP: Sampling Enabling Product, an Intel-developed tool used for hardware analysis.
SRC: Source.
TCP: Transmission control protocol.
TLB: Translation lookaside buffer.
TX: Transmit, transmitting.
VNF: Virtualized network function.
vPE: Virtual provider edge (router).
Appendix C. Authors
1 Results are based on Intel benchmarking and are provided for information purposes only.
Tests document performance of components on a particular test, in specific systems. Results have been estimated or simulated using internal Intel analyses or
architecture simulation or modeling, and are provided for informational purposes only. Any differences in system hardware, software, or configuration may affect actual
performance.
Performance results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and
"Meltdown." Implementation of these updates may make these results inapplicable to your device or system.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features
or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities
arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your
product order.
Information in this document is provided as-is. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel assumes no liability whatsoever, and Intel disclaims all express or implied warranty relating to this information, including liability or warranties relating to fitness
for a particular purpose, merchantability, or infringement of any patent, copyright, or other intellectual property right.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and
MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These
optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for
more information regarding the specific instruction sets covered by this notice.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting
www.intel.com/design/literature.htm.
Intel, the Intel logo, Xeon, and CoFluent are trademarks of Intel Corporation in the U.S. and/or other countries.
AT&T and the AT&T logo are trademarks of AT&T Inc. in the U.S. and/or other countries.
Printed in USA