MultiCore Architecture
Trend in Processor Making
Nabendu Karmakar
The evolution of computer systems has been enormous. From the age of bulky, heavy computers we have moved to the thinnest notebooks. From the 4-bit Intel 4004 we have moved up to the Intel Core i7 Extreme. From the first computer, ENIAC, we now have palmtops. Computing itself has changed along the way: machines have been upgraded, and we have moved from single-core processors to multi-core processors. The single-core processor, which served the computing world for quite a long time, is now vanishing; multi-core CPUs are in charge. With plenty of new functionality, great features, and continual upgrades, multi-core processors are surely the product of the future.
Contents

1.1 Processors
6. Terminology
7. Multi-core Basics
8. Multi-core Implementation
10.3 Multithreading
11.4 Starvation
1.1 Processors:

A processor executes programs by carrying out instructions, which it reads in as chunks of data. The data in the instruction tells the processor what to do. The instructions are very basic things like reading data from memory or sending data to the user display, but they are processed so rapidly that we experience the results as the smooth operation of a program. Processors were originally developed with only one core. The core is the part of the processor that actually performs the reading and executing of instructions.
Intel manufactured the first microprocessor, the 4-bit 4004, in 1971; it was basically just a number-crunching machine. Shortly afterwards Intel developed the 8008 and 8080, both 8-bit, and Motorola followed suit with its 6800, an equivalent to Intel's 8080. The companies then fabricated 16-bit microprocessors: Motorola had its 68000, and Intel the 8086 and 8088. The 8086 would become the basis for Intel's 32-bit 80386 and, later, the popular Pentium line that appeared in the first mainstream consumer PCs.
A single-core processor is a processor that contains only one core; this kind of processor was the norm in early computing systems.
The local processor core does not see, and does not have to process, outside memory commands, although some commands may cause data in its cache to be invalidated or flushed.
All of this is well-understood. But lately Moore’s Law has begun to show
signs of failing. It is not actually Moore’s Law that is showing weakness, but
the performance increases people expect and which occur as a side effect of
Moore’s Law. One often associates performance with high processor clock
frequencies. In the past, reducing the size of transistors has meant reducing
the distances between the transistors and decreasing transistor switching
times. Together, these two effects have contributed significantly to faster
processor clock frequencies. Another reason processor clocks could increase
is the number of transistors available to implement processor functions.
Most processor functions, for example, integer addition, can be implemented
in multiple ways. One method uses very few transistors, but the path from
start to finish is very long. Another method shortens the longest path, but it
uses many more transistors. Clock frequencies are limited by the time it
takes a clock signal to cross the longest path within any stage. Longer paths
require slower clocks.
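The trade-off can be made concrete with a toy model. The sketch below contrasts a ripple-carry adder, which uses few gates but whose carry path grows linearly with word width, against a carry-lookahead adder, which shortens the path at the cost of many more gates. All gate counts and unit delays here are illustrative assumptions, not real silicon figures.

```cpp
// Toy model of the transistor-count vs. critical-path trade-off for
// integer addition. Gate counts and unit delays are made-up constants
// chosen only to show the shape of the trade-off.
#include <cstdio>

struct AdderCost {
    int gates;      // rough proxy for transistor count
    int pathDelay;  // longest path through the logic, in gate delays
};

// Ripple-carry adder: very few gates, but the carry must ripple
// through every bit position, so the path grows linearly with width.
AdderCost rippleCarry(int bits) {
    return {5 * bits, 2 * bits};
}

// Carry-lookahead adder: extra lookahead logic computes carries in a
// tree of 4-bit groups, so the path grows roughly logarithmically.
AdderCost carryLookahead(int bits) {
    int levels = 0;
    for (int w = bits; w > 1; w = (w + 3) / 4) ++levels;
    return {14 * bits, 4 + 2 * levels};
}

int main() {
    for (int bits : {8, 16, 32, 64}) {
        AdderCost rc = rippleCarry(bits);
        AdderCost cl = carryLookahead(bits);
        std::printf("%2d-bit  ripple: %4d gates, %3d delays | "
                    "lookahead: %4d gates, %2d delays\n",
                    bits, rc.gates, rc.pathDelay, cl.gates, cl.pathDelay);
    }
}
```

Since the clock period must cover the longest path within a stage, the wide, gate-hungry design buys clock frequency with transistors.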
Heat dissipation also limits how fast the processor clock can go. This is offset, somewhat, by reducing the transistor size, because
smaller transistors can operate on lower voltages, which allows the chip to
produce less heat. Unfortunately, transistors are now so small that the
quantum behavior of electrons can affect their operation. According to
quantum mechanics, very small particles such as electrons are able to
spontaneously tunnel, at random, over short distances. The transistor base
and emitter are now close enough together that a measurable number of
electrons can tunnel from one to the other, causing a small amount of
leakage current to pass between them, which causes a small short in the
transistor.
Speeding up processor frequency had run its course in the earlier part
of this decade; computer architects needed a new approach to improve
performance. Adding an additional processing core to the same chip would, in
theory, result in twice the performance and dissipate less heat; though in
practice the actual speed of each core is slower than the fastest single core
processor. In September 2005 the IEE Review noted that “power
consumption increases by 60% with every 400MHz rise in clock speed”.
This led designers to put two cores – a dual-core processor – on a single chip. The trade-off that must now
be made is that each processor core is slower than a single-core processor,
but there are two cores, and together they may be able to provide greater
throughput even though the individual cores are slower. Each following
generation will likely increase the number of cores and decrease the clock
frequency.
For dual-core to be effective, the workload must also have parallelism
that can use both cores. When an application is not multi-threaded, or it is
limited by memory performance or by external devices such as disk drives,
dual-core may not offer much benefit, or it may even deliver less
performance. Opteron processors use a memory controller that is integrated
into the same chip and is clocked at the same frequency as the processor.
Since dual-core processors use a slower clock, memory latency will be slower
for dual-core Opteron processors than for single-core, because commands
take longer to pass through the memory controller.
Memory throughput, on the other hand, increases in some cases. Two cores can issue more requests to the memory controller than a single core can, which allows the controller to interleave commands to memory more efficiently.
The operating system not only has to be aware that the system is NUMA (that is, it has Non-Uniform Memory Access), but it must also be prepared to deal with the more complex memory arrangement; it must be dual-core-aware. The performance
implications of operating systems that are dual-core-aware will not be
explored here, but we state without further justification that operating
systems without such awareness show considerable variability when used
with dual-core processors. Operating systems that are dual-core-aware show
better performance, though there is still room for improvement.
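One concrete form of such awareness is keeping a thread on, or near, the core whose memory it uses. Below is a minimal sketch assuming Linux and the GNU extension pthread_setaffinity_np; it illustrates the idea of affinity, not how any particular operating system implements dual-core awareness internally.

```cpp
// Minimal sketch: pin the calling thread to a single core on Linux so
// the scheduler does not migrate it away from the memory it is using.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for CPU_SET and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);  // restrict the thread to this one core
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "could not pin to core %d (error %d)\n", core, rc);
}

int main() {
    pinToCore(0);  // e.g. keep this thread on core 0, close to its data
    std::puts("running pinned to core 0");
    return 0;
}
```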
6. Terminology:
The terms multi-core and dual-core most commonly refer to some sort of
central processing unit (CPU), but are sometimes also applied to digital signal
processors (DSP) and system-on-a-chip (SoC).
7. Multi-Core Basics:
The following isn’t specific to any one multicore design, but rather is a
basic overview of multi-core architecture. Although manufacturers' designs differ from one another, multicore architectures share certain basic aspects. The basic configuration of a microprocessor is seen in Figure 4.
Closest to the processor is Level 1 (L1) cache; this is very fast memory
used to store data frequently used by the processor. Level 2 (L2) cache is
just off-chip, slower than L1 cache, but still much faster than main memory;
L2 cache is larger than L1 cache and used for the same purpose. Main
memory is very large and slower than cache and is used, for example, to
store a file currently being edited in Microsoft Word. Most systems have between 1GB and 4GB of main memory, compared to approximately 32KB of L1 cache and 2MB of L2 cache. Finally, when data isn't located in cache or main memory, the system must retrieve it from the hard disk, which takes orders of magnitude more time than reading from the memory system.
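These latency differences are easy to observe from software. The following is a minimal sketch; the working-set sizes and the pointer-chasing access pattern are illustrative choices, and the absolute numbers vary from machine to machine.

```cpp
// Rough demonstration of the memory hierarchy: chasing a shuffled cycle
// of indices slows down as the working set outgrows L1, then L2, and
// finally fits only in main memory.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

static double nsPerAccess(std::size_t n, std::size_t steps) {
    // Build one random cycle visiting every element exactly once.
    std::vector<std::size_t> order(n), next(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    for (std::size_t k = 0; k < n; ++k)
        next[order[k]] = order[(k + 1) % n];

    std::size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) i = next[i];  // dependent loads
    auto t1 = std::chrono::steady_clock::now();

    volatile std::size_t sink = i;  // keep the loop from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    // ~16KB, ~256KB and ~64MB working sets, assuming 8-byte elements.
    for (std::size_t n : {2048, 32768, 8388608})
        std::printf("%8zu elements: %.2f ns/access\n", n, nsPerAccess(n, 5'000'000));
}
```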
8. Multi-core Implementation:
Figure 6 shows block diagrams for the Core 2 Duo and the Athlon 64 X2, respectively. Both Intel and AMD are popular in the microprocessor market, and both architectures are homogeneous dual-core processors.
The Core 2 Duo adheres to a shared memory model with private L1 caches
and a shared L2 cache which “provides a peak transfer rate of 96 GB/sec.”
If an L1 cache miss occurs, both the L2 cache and the second core's L1 cache
are traversed in parallel before sending a request to main memory. In
contrast, the Athlon follows a distributed memory model with discrete L2
caches. These L2 caches share a system request interface, eliminating the
need for a bus. The system request interface also connects the cores with an
on-chip memory controller and an interconnect called HyperTransport.
HyperTransport effectively reduces the number of buses required in a
system, reducing bottlenecks and increasing bandwidth. The Core 2 Duo
instead uses a bus interface. The Core 2 Duo also has explicit thermal and power control units on-chip. There is no definitive performance advantage of
a bus vs. an interconnect, and the Core 2 Duo and Athlon 64 X2 achieve
similar performance measures, each using a different communication
protocol.
The CELL is a heterogeneous multicore processor consisting of nine cores: one Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs).

Fig 8. Tilera TILE64

Tilera's TILE64 (Fig. 8) goes even further, placing 64 cores, or "tiles", on a single chip. An application that is written to take advantage of these additional cores will run far faster than if it were run on a single core. Imagine having a project to finish,
but instead of having to work on it alone you have 64 people to work for
you. Each core has its own L1 and L2 cache, for a total of 5MB on-chip,
and a switch that connects the core into the mesh network rather than a bus
or interconnect. The TILE64 also includes on-chip memory and I/O
controllers. Like the CELL processor, unused tiles (cores) can be put into a
sleep mode to further decrease power consumption. The TILE64 uses a 3-way VLIW (very long instruction word) pipeline to deliver 12 times as many instructions as a single-issue, single-core processor. When VLIW is combined
with the MIMD (multiple instructions, multiple data) processors, multiple
operating systems can be run simultaneously and advanced multimedia
applications such as video conferencing and video-on-demand can be run
efficiently.
Processors plug into the system board through a socket. Current technology
allows for one processor socket to provide access to one logical core. But
this approach is expected to change, enabling one processor socket to
provide access to two, four, or more processor cores. Future processors will
be designed to allow multiple processor cores to be contained inside a single
processor module. For example, a tightly coupled set of dual processor cores
could be designed to compute independently of each other—allowing
applications to interact with the processor cores as two separate processors
even though they share a single socket. This design would allow the OS to
“thread” the application across the multiple processor cores and could help
improve processing efficiency.
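From software, the cores a socket provides are simply visible as additional hardware threads. Here is a minimal sketch in standard C++ that asks the platform how many logical cores it exposes and starts one worker per core; the fallback value of two is an arbitrary assumption.

```cpp
// Query how many hardware threads the platform exposes and spread
// independent work across them, one worker per logical core.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    unsigned cores = std::thread::hardware_concurrency();  // may be 0 if unknown
    if (cores == 0) cores = 2;                             // fallback assumption

    std::vector<std::thread> workers;
    for (unsigned id = 0; id < cores; ++id)
        workers.emplace_back([id] {
            std::printf("worker %u running on one of the cores\n", id);
        });
    for (auto& w : workers) w.join();
}
```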
A multicore structure would also include cache modules. These modules could
either be shared or independent. Actual implementations of multicore
processors would vary depending on manufacturer and product development
over time. Variations may include shared or independent cache modules, bus
implementations, and additional threading capabilities such as Intel Hyper-
Threading (HT) Technology. A multicore arrangement that provides two or
more low-clock speed cores could be designed to provide excellent
performance while minimizing power consumption and delivering lower heat
output than configurations that rely on a single high-clock-speed core. The
following example shows how multicore technology could manifest in a
standard server configuration and how multiple low-clock-speed cores could
deliver greater performance than a single high-clock-speed core for
networked applications.
This example uses some simple math and basic assumptions about the
scaling of multiple processors and is included for demonstration purposes
only. Until multicore processors are available, scaling and performance can
only be estimated based on technical models. The example described in this
article shows one possible method of addressing relative performance levels
as the industry begins to move from platforms based on single-core
processors to platforms based on multicore processors. Other methods are
possible, and actual processor performance and processor scalability are tied to many platform-specific factors.
A typical configuration might use dual 3.6 GHz 64-bit Intel Xeon™
processors supporting HT Technology. In the future, organizations might
deploy the same application on a similar server that instead uses a pair of
dual-core processors at a clock speed lower than 3.6 GHz. The four cores in
this example configuration might each run at 2.8 GHz. The following simple
example can help explain the relative performance of a low-clock-speed,
dual-core processor versus a high-clock-speed, dual-processor counterpart.
Dual-processor systems available today offer a scalability of roughly 80
percent for the second processor, depending on the OS, application, compiler,
and other factors.
That means the first processor may deliver 100 percent of its processing
power, but the second processor typically suffers some overhead from
multiprocessing activities. As a result, the two processors do not scale linearly; that is, a dual-processor system does not achieve a 200 percent performance increase over a single-processor system, but instead provides approximately 180 percent of the performance of a single processor.
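The arithmetic behind these figures can be written out directly. The sketch below assumes, purely for illustration, that the same 80 percent scaling factor applies to every core after the first; the text above states it only for the second processor, and real scaling would differ.

```cpp
// Toy calculator for the relative-performance example above.
// Assumption (illustration only): every core after the first
// contributes 80% of its clock speed due to multiprocessing overhead.
#include <cstdio>

double relativePerf(int cores, double clockGHz, double scaling = 0.8) {
    double total = clockGHz;              // first core contributes 100%
    for (int c = 1; c < cores; ++c)
        total += clockGHz * scaling;      // each additional core: 80%
    return total;
}

int main() {
    // Two single-core 3.6 GHz processors vs. four 2.8 GHz cores:
    std::printf("2 x 3.6 GHz: %.2f relative GHz\n", relativePerf(2, 3.6)); // 6.48
    std::printf("4 x 2.8 GHz: %.2f relative GHz\n", relativePerf(4, 2.8)); // 9.52
}
```

Under that assumption, four 2.8 GHz cores come out well ahead of two 3.6 GHz cores despite the lower clock speed.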
Fig 9. Sample core speed and anticipated total relative power in a system using two single-core processors
Fig 10. Sample core speed and anticipated total relative power in a system using two dual-core processors
Having multiple cores on a single chip gives rise to some problems and
challenges. Power and temperature management are two concerns that can
increase exponentially with the addition of multiple cores. Memory/cache
coherence is another challenge, since all designs discussed above have
distributed L1 and in some cases L2 caches which must be coordinated. And
finally, using a multicore processor to its full potential is another issue. If
programmers don’t write applications that take advantage of multiple cores
there is no gain, and in some cases there is a loss of performance.
Applications need to be written so that different parts can be run concurrently, without any ties to other parts of the application that are being run simultaneously.
If two cores were placed on a single chip without any modification, the chip
would, in theory, consume twice as much power and generate a large
amount of heat.
In practice, each core runs at a lower clock frequency and voltage, which reduces the power demand, and the heat is spread out across the chip. As seen in Figure 7, the majority
of the heat in the CELL processor is dissipated in the Power Processing
Element and the rest is spread across the Synergistic Processing Elements.
The CELL processor follows a common trend to build temperature monitoring
into the system, with its one linear sensor and ten internal digital sensors.
Consider what happens when one core writes a value to a specific location: when the second core
attempts to read that value from its cache it won’t have the updated copy
unless its cache entry is invalidated and a cache miss occurs. This cache miss
forces the second core's cache entry to be updated. If this coherence policy weren't in place, garbage data would be read and invalid results would be
produced, possibly crashing the program or the entire computer.
In general there are two schemes for cache coherence, a snooping protocol
and a directory-based protocol. The snooping protocol only works with a bus-
based system, and uses a number of states to determine whether or not it
needs to update cache entries and if it has control over writing to the block.
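As a concrete illustration of such a state-based scheme, the sketch below models a simplified MESI-style snooping protocol (Modified, Exclusive, Shared, Invalid) for a single cache line. It is a textbook simplification under stated assumptions, not any particular vendor's implementation.

```cpp
// Simplified MESI-style state machine for one cache line, as used by
// bus-based snooping protocols. Each cache watches ("snoops") bus
// traffic and updates the state of its copy accordingly. Toy model:
// a real protocol distinguishes Exclusive vs. Shared on a read miss.
#include <cstdio>

enum class State { Modified, Exclusive, Shared, Invalid };

// Transition when the local core accesses the line.
State onLocalAccess(State s, bool write) {
    if (write) return State::Modified;            // we now own the only valid copy
    return s == State::Invalid ? State::Shared    // read miss: fetch the line
                               : s;               // read hit: state unchanged
}

// Transition when another core's access is observed on the bus.
State onRemoteAccess(State s, bool write) {
    if (write) return State::Invalid;             // another writer invalidates us
    if (s == State::Modified || s == State::Exclusive)
        return State::Shared;                     // remote read demotes our copy
    return s;
}

int main() {
    State core1 = State::Invalid;
    core1 = onLocalAccess(core1, /*write=*/true);   // core 1 writes: Modified
    core1 = onRemoteAccess(core1, /*write=*/false); // core 2 reads: Shared
    core1 = onRemoteAccess(core1, /*write=*/true);  // core 2 writes: Invalid here
    std::printf("final state of core 1's copy: %d\n", static_cast<int>(core1));
}
```

A directory-based protocol tracks the sharers of each line in a directory instead of broadcasting on a bus, but per-line states play a similar role.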
10.3 Multithreading:
The last, and most important, issue is using multithreading or other parallel
processing techniques to get the most performance out of the multicore
processor. “With the possible exception of Java, there are no widely used commercial development languages with [multithreaded] extensions.” Moreover, to get the full benefit, programs must be written to support thread-level parallelism (TLP). Rebuilding applications to be multithreaded means a complete rework by
programmers in most cases. Programmers have to write applications with
subroutines able to be run in different cores, meaning that data
dependencies will have to be resolved or accounted for (e.g. latency in
communication or using a shared cache).
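Here is a minimal sketch of what such a rework looks like in practice, splitting one loop into independent slices with standard C++ threads; the array contents and slicing scheme are illustrative choices.

```cpp
// Rework of a serial loop into multithreaded form: each thread sums a
// disjoint slice of the array, so there are no data dependencies between
// the subroutines; partial results are combined only after all threads join.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<long> data(1'000'000, 1);
    unsigned n = std::max(1u, std::thread::hardware_concurrency());

    std::vector<long> partial(n, 0);
    std::vector<std::thread> threads;
    std::size_t chunk = data.size() / n;

    for (unsigned t = 0; t < n; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = (t + 1 == n) ? data.size() : lo + chunk;
        threads.emplace_back([&, t, lo, hi] {
            // Each thread writes only partial[t] and reads only its slice.
            partial[t] = std::accumulate(data.begin() + lo, data.begin() + hi, 0L);
        });
    }
    for (auto& th : threads) th.join();

    long total = std::accumulate(partial.begin(), partial.end(), 0L);
    std::printf("total = %ld\n", total);  // expect 1000000
}
```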
Extra memory will be useless if the amount of time required for memory
requests doesn’t improve as well. Redesigning the interconnection network
between cores is a major focus of chip manufacturers. A faster network
means a lower latency in inter-core communication and memory
transactions. Intel is developing its QuickPath Interconnect, a 20-bit-wide bus running between 4.8 and 6.4 GHz; AMD's new HyperTransport 3.0 is a 32-bit-wide bus that runs at 5.2 GHz. A different kind of interconnect is seen in the TILE64's iMesh, which consists of five networks used to fulfill
I/O and off-chip memory communication. Using five mesh networks gives the
Tile architecture a per tile (or core) bandwidth of up to 1.28 Tbps (terabits
per second).
In May 2007, Intel fellow Shekhar Borkar stated that “The software has to
also start following Moore’s Law, software has to double the amount of
parallelism that it can support every two years.” Since the number of cores in
a processor is set to double every 18 months, it only makes sense that the
software running on these cores takes this into account.
11.4 Starvation:
With a shared cache, for example Intel Core 2 Duo’s shared L2 cache,
if a proper replacement policy isn’t in place one core may starve for cache
usage and continually make costly calls out to main memory. The
replacement policy should include stipulations for evicting cache entries that
other cores have recently loaded. This becomes more difficult with an increased number of cores, which effectively reduces the amount of evictable cache space without increasing cache misses.
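A toy sketch of the idea follows; the policy, data structures, and grace-window parameter are illustrative assumptions, not a real cache design. It models an LRU-style shared cache that refuses to evict an entry another core loaded very recently, so one busy core cannot immediately flush its neighbor's working set.

```cpp
// Toy shared-cache model: LRU eviction, except that entries loaded by a
// *different* core within the last GRACE accesses are protected, so one
// core cannot starve the other of cache space. Illustrative only.
#include <cstdio>
#include <iterator>
#include <list>
#include <unordered_map>

struct Entry { int key; int core; unsigned long loadedAt; };

class SharedCache {
    std::size_t capacity;
    unsigned long clock = 0;
    static constexpr unsigned long GRACE = 4;
    std::list<Entry> lru;  // front = most recently used
    std::unordered_map<int, std::list<Entry>::iterator> index;

public:
    explicit SharedCache(std::size_t cap) : capacity(cap) {}

    void access(int key, int core) {
        ++clock;
        auto it = index.find(key);
        if (it != index.end()) {                     // hit: move to front
            lru.splice(lru.begin(), lru, it->second);
            return;
        }
        if (lru.size() == capacity) evictFor(core);  // miss: make room
        lru.push_front({key, core, clock});
        index[key] = lru.begin();
        std::printf("core %d loaded key %d (miss)\n", core, key);
    }

private:
    void evictFor(int core) {
        // Prefer the least-recently-used entry that is not protected;
        // fall back to plain LRU if every entry is protected.
        for (auto it = lru.rbegin(); it != lru.rend(); ++it) {
            bool shielded = it->core != core && clock - it->loadedAt <= GRACE;
            if (!shielded) {
                index.erase(it->key);
                lru.erase(std::next(it).base());
                return;
            }
        }
        index.erase(lru.back().key);  // all protected: evict the true LRU victim
        lru.pop_back();
    }
};

int main() {
    SharedCache cache(3);
    cache.access(2, /*core=*/1);  // core 1's entry
    cache.access(1, 0);
    cache.access(3, 0);
    cache.access(4, 0);  // LRU victim would be key 2, but core 1 loaded it
                         // recently, so core 0's own key 1 is evicted instead
}
```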
High server density in the data center can create significant power
consumption and cooling requirements. A multicore architecture can help
alleviate the environmental challenges created by high-clock-speed, single-
core processors. Heat is a function of several factors, two of which are
processor density and clock speed. Other drivers include cache size and the
size of the core itself. In traditional architectures, heat generated by each
new generation of processors has increased at a greater rate than clock
speed.
When comparing one generation of servers to the next, a direct comparison should not focus on the number of
processor cores but rather on the number of sockets. However, the most
effective comparison is ultimately not one of processors or sockets alone, but
a thorough comparison of the entire platform—including scalability, availability,
memory, I/O, and other features. By considering the entire platform and all the features it provides, rather than focusing on processor clock speed alone, an organization can make a far more meaningful comparison.
Nowadays multi-core processors are becoming very popular. Listed below are some of the multi-core processors that have been widely adopted:
This model broke down when high clock frequencies pushed power consumption and heat dissipation to detrimental levels. Adding multiple cores within a processor made it possible to run at lower frequencies, but brought interesting new problems of its own.