This document summarizes two shared-memory architectures, bus-based and directory-based:
1) Bus-based architectures connect CPUs, caches, and shared memory over a shared bus, whose bandwidth limits scalability. The document discusses the cache coherence problem and snooping protocols such as MESI that address it.
2) Directory-based architectures avoid broadcast snooping and scale better by using point-to-point messages. A directory records, for each memory block, which processors' caches hold a copy (one presence bit per processor), and coherence is enforced through directory lookups and targeted invalidations.
This document discusses parallel processing and multiple processor architectures. It covers single instruction, single data stream (SISD); single instruction, multiple data stream (SIMD); multiple instruction, single data stream (MISD); and multiple instruction, multiple data stream (MIMD) architectures. It then discusses the taxonomy of parallel processor architectures including tightly coupled symmetric multiprocessors (SMPs), non-uniform memory access (NUMA) systems, and loosely coupled clusters. It covers parallel organizations for these different architectures.
This document discusses multiprocessors and multiprocessing. It covers topics such as why you would want a multiprocessor, cache coherence issues that arise in shared memory multiprocessors, and different approaches to cache coherence like snoopy protocols and directory-based schemes. It also discusses classification of multiprocessors based on factors like the Flynn taxonomy, interconnection network, memory topology, and programming model.
1. Multicore processors require applications to be parallelized to avoid performance stagnation. Data management, programming models, and on-chip communication all impact performance.
2. Symmetric multiprocessors (SMPs) have uniform memory access times, while distributed memory systems have faster local memory access but slower remote access.
3. Shared memory architectures allow any processor to access any memory location directly using load/store instructions, while message passing involves explicit data transfer between processes using send and receive calls.
This document discusses different types of parallel processing architectures including single instruction single data stream (SISD), single instruction multiple data stream (SIMD), multiple instruction single data stream (MISD), and multiple instruction multiple data stream (MIMD). It provides details on tightly coupled symmetric multiprocessors (SMPs) and non-uniform memory access (NUMA) systems. It also covers cache coherence protocols like MESI and approaches to improving processor performance through multithreading and chip multiprocessing.
The Google File System (GFS) is a scalable distributed file system designed by Google to provide reliable, scalable storage and high performance for large datasets and workloads. It uses low-cost commodity hardware and is optimized for large files, streaming reads and writes, and high throughput. The key aspects of GFS include using a single master node to manage metadata, chunking files into 64MB chunks distributed across multiple chunk servers, replicating chunks for reliability, and optimizing for large sequential reads and appends. GFS provides high availability, fault tolerance, and data integrity through replication, fast recovery, and checksum verification.
This document discusses CPU scheduling and multithreaded programming. It covers key concepts in CPU scheduling like multiprogramming, CPU-I/O burst cycles, and scheduling criteria. It also discusses dispatcher role, multilevel queue scheduling, and multiple processor scheduling challenges. For multithreaded programming, it defines threads and their benefits. It compares concurrency and parallelism and discusses multithreading models, thread libraries, and threading issues.
This document provides an overview of memory management techniques in operating systems. It discusses contiguous memory allocation, segmentation, paging, and swapping. Contiguous allocation allocates processes to contiguous sections of memory which can lead to fragmentation issues. Segmentation divides memory into logical segments defined by segment tables. Paging divides memory into fixed-size pages and uses page tables to map virtual to physical addresses, avoiding external fragmentation. Swapping moves processes between main memory and disk to allow more processes to reside in memory than will physically fit. The document describes the hardware and data structures used to implement these techniques.
This document provides an overview of CPU caches, including definitions of key terms like SMP, NUMA, data locality, cache lines, and cache architectures. It discusses cache hierarchies, replacement strategies, write policies, inter-socket communication, and cache coherency protocols. Latency numbers for different levels of cache and memory are presented. The goal is to provide information to help improve application performance.
This document provides an overview of CPU caches, including definitions of key terms like SMP, NUMA, data locality, cache lines, and cache architectures. It discusses cache hierarchies, replacement strategies, write policies, inter-socket communication, and cache coherency protocols. Latency numbers for different levels of cache and memory are presented.
Computer Organization & Architecture, Chapter 1, by Shah Rukh Rayaz
The document provides an introduction to computer organization and architecture. It discusses the structure and function of computers, including data processing, storage, and movement functions. It also explains why this course is studied. The document then outlines the topics that will be covered in subsequent chapters, including computer evolution and performance, basic computer components and functions, and interconnection structures. It provides an overview of cache memory principles and the memory hierarchy in general.
This document discusses parallel hardware and techniques for exploiting parallelism. It covers instruction level parallelism techniques like pipelining and simultaneous multithreading. It also discusses parallel architectures like SIMD, vector processors, shared memory systems, distributed memory systems, and interconnection networks. Cache coherence protocols like MESI are presented to ensure data consistency across cores that share memory. Examples of multicore CPUs and supercomputers are provided to illustrate these concepts.
This document discusses parallel processors and multicore architecture. It begins with an introduction to parallel processors, including concurrent access to memory and cache coherency. It then discusses multicore architecture, where a single physical processor contains the logic of two or more cores. This allows increasing processing power while keeping clock speeds and power consumption lower than would be needed for a single high-speed core. Cache coherence methods like write-through, write-back, and directory-based approaches are also summarized for maintaining consistency across cores' caches when accessing shared memory.
The document discusses concurrency and parallelism issues and compares functional programming and reactive programming approaches. It notes that concurrency can cause problems with shared mutable state and discusses techniques like locks, synchronization, and non-blocking I/O to address issues. Functional programming aims to avoid shared mutable state through pure functions, lazy evaluation, and immutable data, while reactive programming uses asynchronous message-driven architectures. Both approaches aim to maximize multicore CPU usage by keeping threads busy processing non-blocking operations.
Cache coherence is an issue that arises in multiprocessing systems where multiple processors have cached copies of shared memory locations. If a processor modifies its local copy, it can create an inconsistent global view of memory.
There are two main approaches to maintaining cache coherence - snoopy bus protocols and directory schemes. Snoopy bus protocols use a shared bus for processors to monitor memory transactions and invalidate local copies when needed. Directory schemes track which processors are sharing each block of data using a directory structure.
One common snoopy protocol is MESI, which uses cache states of Modified, Exclusive, Shared, and Invalid to track the ownership of cache lines and ensure coherency is maintained when a line is modified.
Flynn's taxonomy classifies computer architectures based on the number of instruction and data streams. The main categories are:
1) SISD - Single instruction, single data stream (von Neumann architecture)
2) SIMD - Single instruction, multiple data streams (vector/MMX processors)
3) MIMD - Multiple instruction, multiple data streams (most multiprocessors including multi-core)
Multiprocessor architectures can be organized as shared memory (SMP/UMA) or distributed memory (message passing/DSM). Shared memory allows automatic sharing but can have memory contention issues, while distributed memory requires explicit communication but scales better. Achieving high parallel performance depends on minimizing the sequential portion of the program.
This document provides an introduction to multi-core processors. It discusses that a multi-core processor contains two or more processors on a single integrated circuit. This leads to enhanced performance, reduced power consumption, and more efficient simultaneous processing of multiple tasks. However, developing multithreaded applications for multi-core processors can be difficult, time-consuming, and error-prone. Adding more cores also introduces additional overheads and latencies between communicating and non-communicating cores. There are different types of multi-core architectures including symmetric multiprocessing (SMP) and asymmetric multiprocessing (AMP). Effective use of multi-core processors requires considerations around cache coherency, load balancing, interrupt handling, and concurrency management.
The document discusses different cache coherence protocols used in multi-core systems, including snooping-based and directory-based approaches. It describes the MSI protocol, a 3-state snooping protocol, and the MESI protocol, a 4-state optimization. It also covers the Dragon update protocol, a 4-state write-back update snooping approach, and compares the number of bus transactions required for different protocols and memory access patterns.
This document discusses high performance computing (HPC) and parallel computing. It defines HPC as aggregating computing power to solve large problems. Parallel computing uses multiple processors working together on common tasks. There are three main approaches to parallel computing: shared memory, where all processors access a common pool of memory; distributed memory, where each processor has its own local memory; and hybrid distributed shared memory. Parallel computers enable solving problems that require fast solutions or large amounts of memory, like weather forecasting.
This document discusses operating system structures and components. It describes four main OS designs: monolithic systems, layered systems, virtual machines, and client-server models. For each design, it provides details on how the system is organized and which components are responsible for which tasks. It also discusses some advantages and disadvantages of the different approaches. The document concludes by explaining how client-server models address issues with distributing OS functions to user space by having some critical servers run in the kernel while still communicating with user processes.
2. Readings
• Read on your own:
– Shen & Lipasti, Chapter 11
– G. S. Sohi, S. E. Breach, and T. N. Vijaykumar, "Multiscalar Processors," Proc. 22nd Annual International Symposium on Computer Architecture, June 1995.
– Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. 23rd Annual International Symposium on Computer Architecture, May 1996 (B5)
• To be discussed in class:
– Review #6 due 11/17/2017: Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," International Symposium on Computer Architecture (ISCA), pp. 367-379, June 2016. Online PDF
3. Executing Multiple Threads
• Thread-level parallelism
• Synchronization
• Multiprocessors
• Explicit multithreading
• Data parallel architectures
• Multicore interconnects
• Implicit multithreading: Multiscalar
• Niagara case study
4. Thread-level Parallelism
• Instruction-level parallelism
– Reaps performance by finding independent work in a single thread
• Thread-level parallelism
– Reaps performance by finding independent work across multiple threads
• Historically, requires explicitly parallel workloads
– Originated from mainframe time-sharing workloads
– Even then, CPU speed >> I/O speed
– Had to overlap I/O latency with "something else" for the CPU to do
– Hence, the operating system would schedule other tasks/processes/threads that were "time-sharing" the CPU
6. Thread-level Parallelism
• Initially motivated by time-sharing of single CPU
– OS, applications written to be multithreaded
• Quickly led to adoption of multiple CPUs in a single system
– Enabled scalable product line from entry-level single-CPU systems
to high-end multiple-CPU systems
– Same applications, OS, run seamlessly
– Adding CPUs increases throughput (performance)
• More recently:
– Multiple threads per processor core
• Coarse-grained multithreading (aka “switch-on-event”)
• Fine-grained multithreading
• Simultaneous multithreading
– Multiple processor cores per die
• Chip multiprocessors (CMP)
• Chip multithreading (CMT)
7. Thread-level Parallelism
• Parallelism limited by sharing
– Amdahl’s law:
• Access to shared state must be serialized
• Serial portion limits parallel speedup
– Many important applications share (lots of) state
• Relational databases (transaction processing): GBs of shared state
– Even completely independent processes “share” virtualized
hardware through O/S, hence must synchronize access
• Access to shared state/shared variables
– Must occur in a predictable, repeatable manner
– Otherwise, chaos results
• Architecture must provide primitives for serializing access
to shared state
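For reference, the Amdahl's-law bound mentioned above can be written out explicitly (this formula is added here; it is not on the slide):

```latex
% Amdahl's law: if a fraction f of the work is parallelizable across N processors,
% the remaining (1 - f) is serial and bounds the achievable speedup.
\[
  \text{Speedup}(N) \;=\; \frac{1}{(1 - f) + \dfrac{f}{N}},
  \qquad
  \lim_{N \to \infty} \text{Speedup}(N) \;=\; \frac{1}{1 - f}.
\]
% Example: with f = 0.95 (5% serial), even infinitely many processors give at most 20x.
```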
9. Some Synchronization Primitives
• Only one is necessary
– Others can be synthesized
• Fetch-and-add: atomic load/add/store operation; permits atomic increment, can be used to synthesize locks for mutual exclusion
• Compare-and-swap: atomic load/compare/conditional store; stores only if the load returns an expected value
• Load-linked/store-conditional: atomic load/conditional store; stores only if the load/store pair is atomic, i.e. there is no intervening store
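To illustrate the "others can be synthesized" point, here is a minimal sketch (ours, not from the slides) of building fetch-and-add out of compare-and-swap using C++11 atomics; the function name fetch_and_add is hypothetical.

```cpp
#include <atomic>

// Sketch: synthesize fetch-and-add from compare-and-swap (CAS).
// Retry until the CAS succeeds, i.e. no other thread updated the
// location between our load and our store.
int fetch_and_add(std::atomic<int>& x, int delta) {
    int old = x.load();
    while (!x.compare_exchange_weak(old, old + delta)) {
        // on failure, compare_exchange_weak reloads 'old' with the current value
    }
    return old;  // value observed before the add, as fetch-and-add requires
}
```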
10. Synchronization Examples
• All three guarantee the same semantics:
– Initial value of A: 0
– Final value of A: 4
• Variant (b) uses an additional lock variable AL to protect the critical section with a spin lock
– This is the most common synchronization method in modern multithreaded applications
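The slide's code examples are not reproduced in this text, so the following is a hypothetical reconstruction of variant (b): four threads each increment a shared counter A once, guarded by a spin lock on a lock variable AL. The names A and AL follow the slide; everything else is an assumption.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> AL{0};  // lock variable: 0 = free, 1 = held
int A = 0;               // shared data protected by AL

void add_one() {
    // Spin until we atomically change AL from 0 (free) to 1 (held).
    int expected = 0;
    while (!AL.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
        expected = 0;  // CAS overwrote 'expected' with the observed value; reset and retry
    }
    A = A + 1;                                // critical section
    AL.store(0, std::memory_order_release);   // release the lock
}

int main() {
    std::vector<std::thread> ts;
    for (int i = 0; i < 4; ++i) ts.emplace_back(add_one);
    for (auto& t : ts) t.join();
    std::printf("A = %d\n", A);  // prints A = 4
    return 0;
}
```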
11. Multiprocessor Systems
• Focus on shared-memory symmetric multiprocessors
– Many other types of parallel processor systems have been
proposed and built
– Key attributes are:
• Shared memory: all physical memory is accessible to all CPUs
• Symmetric processors: all CPUs are alike
– Other parallel processors may:
• Share some memory, share disks, share nothing
• Have asymmetric processing units
• Shared memory idealisms
– Fully shared memory: usually nonuniform latency
– Unit latency: approximate with caches
– Lack of contention: approximate with caches
– Instantaneous propagation of writes: coherence required
15. Invalidate Protocol
• Basic idea: maintain single writer property
– Only one processor has write permission at any point in time
• Write handling
– On write, invalidate all other copies of data
– Make data private to the writer
– Allow writes to occur until data is requested
– Supply modified data to requestor directly or through memory
• Minimal set of states per cache line:
– Invalid (not present)
– Modified (private to this cache)
• State transitions:
– Local read or write: I->M, fetch modified
– Remote read or write: M->I, transmit data (directly or through memory)
– Writeback: M->I, write data to memory
16. Invalidate Protocol
Optimizations
• Observation: data can be read-shared
– Add S (shared) state to protocol: MSI
• State transitions:
– Local read: I->S, fetch shared
– Local write: I->M, fetch modified; S->M, invalidate other copies
– Remote read: M->S, supply data
– Remote write: M->I, supply data; S->I, invalidate local copy
• Observation: data can be write-private (e.g. stack frame)
– Avoid invalidate messages in that case
– Add E (exclusive) state to protocol: MESI
• State transitions:
– Local read: I->E if only copy, I->S if other copies exist
– Local write: E->M silently, S->M, invalidate other copies
18. Sample Invalidate Protocol (MESI)
Event and local coherence controller responses and actions for each current state s (s' refers to the next state):
• Invalid (I)
– Local Read (LR): issue bus read; if no sharers then s' = E, else s' = S
– Local Write (LW): issue bus write; s' = M
– Local Eviction (EV): s' = I
– Bus Read (BR): do nothing
– Bus Write (BW): do nothing
– Bus Upgrade (BU): do nothing
• Shared (S)
– LR: do nothing
– LW: issue bus upgrade; s' = M
– EV: s' = I
– BR: respond shared
– BW: s' = I
– BU: s' = I
• Exclusive (E)
– LR: do nothing
– LW: s' = M
– EV: s' = I
– BR: respond shared; s' = S
– BW: s' = I
– BU: error
• Modified (M)
– LR: do nothing
– LW: do nothing
– EV: write data back; s' = I
– BR: respond dirty; write data back; s' = S
– BW: respond dirty; write data back; s' = I
– BU: error
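The protocol actions above map directly onto a small state machine. The following is an illustrative sketch only (the enum names, the other_sharers_exist flag, and the next_state function are ours, not from the slides); it covers next-state selection and leaves bus requests, data responses, and write-backs as comments.

```cpp
#include <cassert>

enum class State { I, S, E, M };
enum class Event { LocalRead, LocalWrite, LocalEvict, BusRead, BusWrite, BusUpgrade };

// Next-state function for one cache line, following the MESI actions listed above.
// Bus traffic and data movement are indicated only by comments.
State next_state(State s, Event e, bool other_sharers_exist) {
    switch (s) {
    case State::I:
        if (e == Event::LocalRead)  return other_sharers_exist ? State::S : State::E;  // issue bus read
        if (e == Event::LocalWrite) return State::M;   // issue bus write
        return State::I;                               // eviction or snooped events: nothing to do
    case State::S:
        if (e == Event::LocalRead)  return State::S;   // hit, do nothing
        if (e == Event::LocalWrite) return State::M;   // issue bus upgrade
        if (e == Event::BusRead)    return State::S;   // respond shared
        return State::I;                               // eviction, bus write, or bus upgrade
    case State::E:
        if (e == Event::LocalRead)  return State::E;   // hit
        if (e == Event::LocalWrite) return State::M;   // silent upgrade, no bus traffic
        if (e == Event::BusRead)    return State::S;   // respond shared
        if (e == Event::BusUpgrade) assert(false);     // protocol error per the table
        return State::I;                               // eviction or bus write
    case State::M:
        if (e == Event::LocalRead || e == Event::LocalWrite) return State::M;  // hit
        if (e == Event::BusRead)    return State::S;   // respond dirty, write data back
        if (e == Event::BusUpgrade) assert(false);     // protocol error per the table
        return State::I;                               // eviction or bus write: write data back
    }
    assert(false);
    return State::I;
}
```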
19. Implementing Cache Coherence
• Snooping implementation
– Origins in shared-memory-bus systems
– All CPUs could observe all other CPUs requests on the bus;
hence “snooping”
• Bus Read, Bus Write, Bus Upgrade
– React appropriately to snooped commands
• Invalidate shared copies
• Provide up-to-date copies of dirty lines
– Flush (writeback) to memory, or
– Direct intervention (modified intervention or dirty miss)
• Snooping suffers from:
– Scalability: shared busses not practical
– Ordering of requests without a shared bus
– Lots of prior work on scaling snoop-based systems
20. Alternative to Snooping
• Directory implementation
– Extra bits stored in memory (directory) record MSI state of line
– Memory controller maintains coherence based on the current state
– Other CPUs’ commands are not snooped, instead:
• Directory forwards relevant commands
– Ideal filtering: only observe commands that you need to observe
– Meanwhile, bandwidth at directory scales by adding memory
controllers as you increase size of the system
• Leads to very scalable designs (100s to 1000s of CPUs)
• Directory shortcomings
– Indirection through directory has latency penalty
– Directory overhead for all memory, not just what is cached
– If shared line is dirty in other CPU’s cache, directory must forward
request, adding latency
– This can severely impact performance of applications with heavy
sharing (e.g. relational databases)
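As a rough sketch of the "extra bits stored in memory" idea above (all type and field names here are ours, and the full-map organization is just one possibility), a directory entry can hold the line's MSI state plus one presence bit per CPU, which lets the memory controller forward invalidations only where they are needed.

```cpp
#include <bitset>
#include <cstdint>

constexpr int kMaxCpus = 64;   // assumed system size for the full-map sharer vector

enum class DirState : std::uint8_t { Uncached, Shared, Modified };  // MSI state kept at the directory

// One directory entry per memory block. The memory controller consults this on
// every request instead of broadcasting: it sends invalidations only to CPUs
// whose presence bit is set, or forwards the request to the owner if Modified.
struct DirectoryEntry {
    DirState state = DirState::Uncached;
    std::bitset<kMaxCpus> sharers;   // presence bit per CPU
    std::uint8_t owner = 0;          // meaningful only when state == Modified
};

// On a write request from CPU 'writer': invalidate all other sharers, then
// record the writer as the exclusive (Modified) owner.
void handle_write_request(DirectoryEntry& e, int writer) {
    for (int cpu = 0; cpu < kMaxCpus; ++cpu) {
        if (cpu != writer && e.sharers.test(cpu)) {
            // send targeted invalidate to 'cpu' (message send omitted in this sketch)
            e.sharers.reset(cpu);
        }
    }
    e.sharers.set(writer);
    e.state = DirState::Modified;
    e.owner = static_cast<std::uint8_t>(writer);
}
```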
21. Memory Consistency
• How are memory references from different processors interleaved?
• If this is not well-specified, synchronization becomes difficult or even
impossible
– ISA must specify consistency model
• Common example using Dekker’s algorithm for synchronization
– If load reordered ahead of store (as we assume for a baseline OOO CPU)
– Both Proc0 and Proc1 enter critical section, since both observe that other’s
lock variable (A/B) is not set
• If consistency model allows loads to execute ahead of stores, Dekker’s
algorithm no longer works
– Common ISAs allow this: IA-32, PowerPC, SPARC, Alpha
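A minimal sketch of the Dekker-style pattern described above (the flag names A and B follow the slide; the code itself is ours). With relaxed ordering, modeling a machine that may reorder each thread's load ahead of its store, both threads can read 0 and both can enter the critical section.

```cpp
#include <atomic>

std::atomic<int> A{0}, B{0};  // the two lock/flag variables from the slide's description

void proc0() {
    A.store(1, std::memory_order_relaxed);           // "I want in"
    if (B.load(std::memory_order_relaxed) == 0) {
        // critical section: may wrongly run concurrently with proc1's critical section
    }
}

void proc1() {
    B.store(1, std::memory_order_relaxed);
    if (A.load(std::memory_order_relaxed) == 0) {
        // critical section
    }
}

// Under sequential consistency (memory_order_seq_cst) at least one thread must
// observe the other's store, so both cannot enter the critical section at once.
```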
22. Sequential Consistency [Lamport 1979]
• Processors treated as if they are interleaved processes on a single
time-shared CPU
• All references must fit into a total global order or interleaving that does not violate any CPU's program order
– Otherwise sequential consistency not maintained
• Now Dekker’s algorithm will work
• Appears to preclude any OOO memory references
– Hence precludes any real benefit from OOO CPUs
23. High-Performance Sequential Consistency
• Coherent caches isolate CPUs if no sharing is
occurring
– Absence of coherence activity means CPU is free to
reorder references
• Still have to order references with respect to
misses and other coherence activity (snoops)
• Key: use speculation
– Reorder references speculatively
– Track which addresses were touched speculatively
– Force replay (in order execution) of such references
that collide with coherence activity (snoops)
24. Constraint graph example - SC
[Figure: two-processor constraint graph. Proc 1 issues ST A then LD B; Proc 2 issues ST B then LD A. Program-order edges on each processor plus the inter-processor RAW and WAR dependence edges form the graph; a cycle in this graph indicates that the execution is incorrect (not sequentially consistent).]
25. Anatomy of a cycle
[Figure: the same constraint graph (ST A / LD B on Proc 1, ST B / LD A on Proc 2, with program-order, WAR, and RAW edges), annotated with the coherence events that expose the cycle: an incoming invalidate observed by one processor and a cache miss on the other.]
26. High-Performance Sequential Consistency
• Load queue records all speculative loads
• Bus writes/upgrades are checked against LQ
• Any matching load gets marked for replay
• At commit, loads are checked and replayed if necessary
– Results in machine flush, since load-dependent ops must also replay
• Practically, conflicts are rare, so expensive flush is OK
27. Relaxed Consistency Models
• Key insight: only synchronizing references need ordering
• Hence, relax memory ordering for all other references
– Enable high-performance OOO implementation
• Require programmer to label synchronization references
– Hardware must carefully order these labeled references
– All other references can be performed out of order
• Labeling schemes:
– Explicit synchronization ops (acquire/release)
– Memory fence or memory barrier ops:
• All preceding ops must finish before following ones begin
• Often: fence ops cause pipeline drain in modern OOO
machine
• More: ECE/CS 757
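To make the labeling idea concrete, here is a small sketch using C++11 acquire/release annotations on the synchronizing reference while leaving the data reference unlabeled; the variable names and scenario are ours, not from the slides. A full memory fence (std::atomic_thread_fence) would be the heavier-weight alternative mentioned above.

```cpp
#include <atomic>

int payload = 0;                   // ordinary data reference, no per-access ordering needed
std::atomic<bool> ready{false};    // labeled synchronization reference

void producer() {
    payload = 42;                                   // plain store
    ready.store(true, std::memory_order_release);   // release: prior writes become visible first
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // acquire: everything written before the matching release is now visible
    int v = payload;                                // guaranteed to observe 42
    (void)v;
}
```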
29. Split Transaction Bus
• “Packet switched” vs. “circuit switched”
• Release bus after request issued
• Allow multiple concurrent requests to overlap memory latency
• Complicates control, arbitration, and coherence protocol
– Transient states for pending blocks (e.g. “req. issued but not completed”)
33. Multithreaded Cores
• 1990’s: Memory wall and multithreading
– Processor-DRAM speed mismatch:
• nanosecond to fractions of a microsecond (1:500)
– H/W task switch used to bring in other useful
work while waiting for cache miss
– Cost of context switch must be much less than
cache miss latency
• Very attractive for applications with
abundant thread-level parallelism
– Commercial multi-user workloads
34. Approaches to Multithreading
• Fine-grain multithreading
– Switch contexts at fixed fine-grain interval (e.g. every
cycle)
– Need enough thread contexts to cover stalls
– Example: Tera MTA, 128 contexts, no data caches
• Benefits:
– Conceptually simple, high throughput, deterministic
behavior
• Drawback:
– Very poor single-thread performance
35. Approaches to Multithreading
• Coarse-grain multithreading
– Switch contexts on long-latency events (e.g. cache
misses)
– Need a handful of contexts (2-4) for most benefit
• Example: IBM RS64-IV (Northstar), 2 contexts
• Benefits:
– Simple, improved throughput (~30%), low cost
– Thread priorities mostly avoid single-thread
slowdown
• Drawback:
– Nondeterministic, conflicts in shared caches
36. Approaches to Multithreading
• Simultaneous multithreading
– Multiple concurrent active threads (no notion of thread
switching)
– Need a handful of contexts for most benefit (2-8)
• Example: Intel Pentium 4/Nehalem/Sandybridge, IBM
Power 5/6/7, Alpha EV8/21464
• Benefits:
– Natural fit for OOO superscalar
– Improved throughput
– Low incremental cost
• Drawbacks:
– Additional complexity over OOO superscalar
– Cache conflicts
37. Approaches to Multithreading
• Chip Multiprocessors (CMP): cores per chip, multithreading, and resources shared:
– IBM Power 4: 2 cores, not multithreaded; shares L2/L3, system interface
– IBM Power 7: 8 cores, multithreaded (4T); shares core, L2/L3, DRAM, system interface
– Sun Ultrasparc: 2 cores, not multithreaded; shares system interface
– Sun Niagara: 8 cores, multithreaded (4T); shares everything
– Intel Pentium D: 2 cores, multithreaded (2T); shares core, nothing else
– Intel Core i7: 4 cores, multithreaded (2T); shares L3, DRAM, system interface
– AMD Opteron: 2, 4, 6, or 12 cores, not multithreaded; shares system interface (socket), L3
38. Approaches to Multithreading
• Chip Multithreading (CMT)
– Similar to CMP
• Share something in the core:
– Expensive resource, e.g. floating-point unit (FPU)
– Also share L2, system interconnect (memory and I/O bus)
• Examples:
– Sun Niagara, 8 cores per die, one FPU
– AMD Bulldozer: one FP cluster for every two INT clusters
• Benefits:
– Same as CMP
– Further: amortize cost of expensive resource over multiple cores
• Drawbacks:
– Shared resource may become bottleneck
– 2nd generation (Niagara 2) does not share FPU
39. Multithreaded/Multicore Processors
• Many approaches for executing multiple threads on a single die
– Mix-and-match: IBM Power7 CMP+SMT
MT approach, resources shared between threads, and context switch mechanism:
– None: everything shared; explicit operating system context switch
– Fine-grained: everything but the register file and control logic/state; switch every cycle
– Coarse-grained: everything but I-fetch buffers, register file, and control logic/state; switch on pipeline stall
– SMT: everything but instruction fetch buffers, return address stack, architected register file, control logic/state, reorder buffer, store queue, etc.; all contexts concurrently active, no switching
– CMT: various core components (e.g. FPU), secondary cache, system interconnect; all contexts concurrently active, no switching
– CMP: secondary cache, system interconnect; all contexts concurrently active, no switching
46. Data Parallel Execution
From [Lee et al., ISCA '11]
• SIMT [Nvidia GPUs]
– Large number of threads, MIMD programmer view
– Threads ganged into warps, executed in SIMD fashion for efficiency
– Control/data divergence causes inefficiency
– Programmer optimization required (ECE 759)
47. Multicore Interconnects
• Bus/crossbar - dismiss as short-term solutions?
• Point-to-point links, many possible topologies
– 2D (suitable for planar realization)
• Ring
• Mesh
• 2D torus
– 3D - may become more interesting with 3D packaging (chip stacks)
• Hypercube
• 3D Mesh
• 3D torus
48. On-Chip Bus/Crossbar
• Used widely (Power4/5/6/7, Piranha, Niagara, etc.)
– Assumed not scalable
– Is this really true, given on-chip characteristics?
– May scale "far enough": watch out for arguments at the limit
• Simple, straightforward, nice ordering properties
– Wiring is a nightmare (for crossbar)
– Bus bandwidth is weak (even with multiple busses)
– Workload:
• "Commercial" applications usually latency-limited
• "Scientific" applications usually bandwidth-limited
49. On-Chip Ring
• Point-to-point ring interconnect
– Simple, easy
– Nice ordering properties (unidirectional)
– Every request a broadcast (all nodes can snoop)
– Scales poorly: O(n) latency, fixed bandwidth
50. On-Chip Mesh
• Widely assumed in academic literature
• Tilera (MIT startup), Intel 80-core prototype
• Not symmetric, so have to watch out for load
imbalance on inner nodes/links
– 2D torus: wraparound links to create symmetry
• Not obviously planar
• Can be laid out in 2D but longer wires, more intersecting
links
• Latency, bandwidth scale well
• Lots of existing literature
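As a rough, back-of-the-envelope illustration of the scaling contrast between the ring and mesh slides above (these estimates are ours, not the slides'), average hop counts under uniform random traffic are approximately:

```latex
% Average source-to-destination distance for n nodes (approximate):
% unidirectional ring vs. sqrt(n) x sqrt(n) mesh.
\[
  \bar{h}_{\text{ring}} \approx \frac{n}{2}
  \qquad\text{vs.}\qquad
  \bar{h}_{\text{mesh}} \approx \frac{2}{3}\sqrt{n}
\]
% e.g. n = 64: roughly 32 hops on the ring, but only 5--6 hops on the mesh.
```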
52. Implicitly Multithreaded Processors
• Goal: speed up execution of a single thread
(latency)
• Implicitly break program up into multiple smaller
threads, execute them in parallel, e.g.:
– Parallelize loop iterations across multiple processing
units
– Usually, exploit control independence in some fashion
– Not parallelism of order 100x, more like 3-5x
• Typically, one of two goals:
– Expose more ILP for a single window, or
– Build a more scalable, partitioned execution window
• Or, try to achieve both
53. Implicitly Multithreaded Processors
• Many challenges:
– Find additional ILP, past hard-to-predict branches
• Control independence
– Maintain data dependences (RAW, WAR, WAW) for
registers
– Maintain precise state for exception handling
– Maintain memory dependences (RAW/WAR/WAW)
– Maintain memory consistency model
• Still a research topic
– Multiscalar reading provides historical context
– Lots of related work in TLS (thread-level speculation)
54. Multiscalar
• Seminal work on implicit multithreading
– Started in mid 80’s under Guri Sohi @ Wisconsin
• Solved many of the “hard” problems
• Threads or tasks identified by compiler
– Tasks look like mini-programs, can contain loops, branches
• Hardware consists of a ring of processing nodes
– Head processor executes most speculative task
– Tail processor commits and resolves
– Miss-speculation causes task and all newer tasks to get flushed
• Nodes connected to:
– Sequencing unit that dispatches tasks to each one
– Shared register file that resolves RAW/WAR/WAW
– Address Resolution Buffer: resolves memory dependences
• http://www.cs.wisc.edu/mscalar
– Publications, theses, tools, contact information
55. Niagara Case Study
• Targeted application: web servers
– Memory intensive (many cache misses)
– ILP limited by memory behavior
– TLP: Lots of available threads (one per client)
• Design goal: maximize throughput (/watt)
• Results:
– Pack many cores on die (8)
– Keep cores simple to fit 8 on a die, share FPU
– Use multithreading to cover pipeline stalls
– Modest frequency target (1.2 GHz)
60. Thermal Profile
• Low operating temp
• No hot spots
• Improved reliability
• No need for exotic
cooling
61. T2000 System Power
• 271W running SpecJBB2000
• Processor is only 25% of total
• DRAM & I/O next, then conversion losses
62. Niagara Summary
• Example of application-specific system
optimization
– Exploit application behavior (e.g. TLP, cache misses,
low ILP)
– Build very efficient solution
• Downsides
– Loss of general-purpose suitability
– E.g. poorly suited for software development
(parallel make, gcc)
– Very poor FP performance (fixed in Niagara 2)
63. Lecture Summary
• Thread-level parallelism
• Synchronization
• Multiprocessors
• Explicit multithreading
• Data parallel architectures
• Multicore interconnects
• Implicit multithreading: Multiscalar
• Niagara case study
Editor's Notes
#8: Final values of A, assuming initial value of 0: (a) 3; (b) 4; (c) 4; (d) 1