DRAM Basics by Prof. Matthew D. Sinclair

1) Die-stacked DRAM stacks multiple DRAM dies on a single logic die to increase density and bandwidth. Current examples include Hybrid Memory Cube and High Bandwidth Memory. 2) DRAM uses capacitors to store bits which requires periodic refreshing to prevent data loss from charge leakage over time. DRAM operations like reading and writing involve precharging and accessing the row buffer to avoid destructive reads. 3) DRAM has a two-level addressing scheme using row and column addresses to access a 2D array, and requires multi-step operations to perform reads and writes via the row buffer.


CS/ECE 752:

Advanced Computer Architecture I

Prof. Matthew D. Sinclair


DRAM Basics
Slide History/Attribution: UW Madison (Hill, Sohi, Smith, Wood); UPenn (Amir Roth, Milo Martin); UW Madison (Hill, Sohi, Wood, Sankaralingam, Sinclair); UCLA (Nowatzki); various universities (Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Vijaykumar)
Die-stacked DRAM (3D-DRAM)
• Die-stacked DRAM: multiple DRAM dies are stacked on top of a bottom logic die
  • Top layers store data
  • Bottom logic layer holds the various control, access, and interface circuits
• Magic: stacking means high density, so high-b/w interposer integration is not so expensive
• Current products (figure source: AMD.com):
  • Hybrid Memory Cube (Micron)
  • High Bandwidth Memory (Samsung, AMD, and Hynix)
• Tradeoffs:
  • Basically the same latency as DRAM, but much higher bandwidth
  • More expensive, so we can’t have as much memory…
  • In a GPU: need high bandwidth all the time, but don’t need that much memory, so it can serve as the main memory
  • What about the CPU? Need huge memory, so cost is critical…
2
Emerging: Hybrid Memory Cube

• Micron proposal [Pawlowski, Hot Chips 11]


• www.hybridmemorycube.org (Now a dead link :( )
3
Hybrid Memory Cube MCM

• Micron proposal [Pawlowski, Hot Chips 11]


• www.hybridmemorycube.org (Now a dead link :( )

4
Network of DRAM

• Traditional DRAM: star topology


• HMC: mesh, etc. are feasible

5
Hybrid Memory Cube

• High-speed logic segregated in chip stack


• 3D TSVs for bandwidth
6
High Bandwidth Memory (HBM)

[Shmuel Csaba Otto Traian]

• High-speed serial links vs. 2.5D silicon interposer


• Commercialized, HBM2/HBM3 …
7
Future: Resistive memory
• PCM: store bit in phase state of material
• Alternatives:
• Memristor, STT-MRAM
• Nonvolatile
• Dense: cross-point architecture (no access device)
• Relatively fast for read
• Very slow for write (also high power)
• Write endurance often limited
• Wear leveling (also done for flash)
• Avoid redundant writes (read, cmp, write)
• Fix individual bit errors (write, read, cmp, fix)
• Lots of work on using this to augment/replace main memory
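To make the last two mitigations concrete, here is a minimal sketch (illustrative names and sizes, not from the slides) of redundant-write avoidance and write-verify over a software-modeled write-limited memory:

#include <stddef.h>
#include <stdint.h>

/* Sketch: a write-limited memory modeled as a plain byte array, with per-byte
 * write counters so the effect of redundant-write avoidance is visible. */
#define NVM_SIZE 4096
static uint8_t  nvm[NVM_SIZE];
static uint32_t writes[NVM_SIZE];        /* endurance bookkeeping */

/* Avoid redundant writes: read, compare, write only the bytes that changed. */
void nvm_store(size_t addr, const uint8_t *data, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (nvm[addr + i] != data[i]) {  /* read + cmp */
            nvm[addr + i] = data[i];     /* write only if different */
            writes[addr + i]++;
        }
    }
}

/* Fix individual bit errors: write, read back, compare, rewrite bad bytes. */
void nvm_store_verified(size_t addr, const uint8_t *data, size_t len) {
    nvm_store(addr, data, len);          /* write */
    for (size_t i = 0; i < len; i++) {
        if (nvm[addr + i] != data[i]) {  /* read + cmp (a real device could have stuck bits) */
            nvm[addr + i] = data[i];     /* fix */
            writes[addr + i]++;
        }
    }
}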

8
RAM

• RAM: large storage arrays
• Basic structure
  • MxN array of bits (M N-bit words)
  • This one is 4x2
  • Bits in a word connected by a wordline
  • Bits in a position connected by a bitline
• Operation
  • Address decodes into M wordlines
  • High wordline → word on bitlines
  • Bit/bitline connection → read/write
• Access latency
  • Grows as #ports * √#bits
[Figure: 4x2 array of bit cells with an address decoder driving wordline0–wordline3, bitline0/bitline1 columns, and data outputs.]

9
SRAM

• SRAM: static RAM
• Bits as cross-coupled inverters (CCI)
  – Four transistors per bit
  – More transistors for ports
• “Static” means
  • Inverters connected to pwr/gnd
  + Bits naturally/continuously “refreshed”
• Designed for speed
[Figure: SRAM array of cross-coupled inverter cells with an address decoder and data lines.]

10
DRAM
• DRAM: dynamic RAM
• Bits as capacitors
  + Single transistors as ports
  + One transistor per bit/port
• “Dynamic” means
  • Capacitors not connected to pwr/gnd
  – Stored charge decays over time
  – Must be explicitly refreshed
• Designed for density
[Figure: DRAM bit array with an address decoder and data lines.]

11
DRAM Basics [Jacob and Wang]
• Precharge and Row Access

12
DRAM Basics, cont.
• Column Access

13
DRAM Basics, cont.
• Data Transfer

14
DRAM Operation I
• Read: similar to cache read
  • Phase I: pre-charge bitlines to 0.5 V
  • Phase II: decode address, enable wordline
    • Capacitor swings bitline voltage up (down)
    • Sense amplifier interprets swing as 1 (0)
  – Destructive read: word bits now discharged
• Write: similar to cache write
  • Phase I: decode address, enable wordline
  • Phase II: enable bitlines
    • High bitlines charge corresponding capacitors
  – What about leakage over time?
[Figure: DRAM array with address decoder, write enable, sense amps (sa), and data lines.]


15
DRAM Operation II
• Solution: add a set of D-latches (row buffer)
• Read: two steps
  • Step I: read selected word into row buffer
  • Step IIA: read row buffer out to pins
  • Step IIB: write row buffer back to selected word
  + Solves “destructive read” problem
• Write: two steps
  • Step IA: read selected word into row buffer
  • Step IB: write data into row buffer
  • Step II: write row buffer back to selected word
[Figure: DRAM array with sense amps (sa) and row-buffer D-latches (DL); the labels r-I, r/w-I, and r/w-II mark the steps on the address and data paths.]
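To make the two-step operations concrete, here is a minimal software model of one bank with a row buffer (a sketch; the array dimensions and function names are illustrative, not from the slides). The key point is that every access goes through the row buffer, and the buffer is written back to restore the destructively-read row:

#include <stdint.h>
#include <string.h>

#define ROWS 4096              /* illustrative dimensions */
#define COLS 1024              /* bytes per row           */

static uint8_t cells[ROWS][COLS];   /* the DRAM bit array         */
static uint8_t row_buf[COLS];       /* the D-latches (row buffer) */

/* Read: Step I load the row, Step IIA return the column to the pins,
 * Step IIB write the buffer back to restore the destructively-read row. */
uint8_t dram_read(int row, int col) {
    memcpy(row_buf, cells[row], COLS);   /* Step I   */
    uint8_t data = row_buf[col];         /* Step IIA */
    memcpy(cells[row], row_buf, COLS);   /* Step IIB */
    return data;
}

/* Write: Step IA load the row, Step IB merge the new data into the buffer,
 * Step II write the whole buffer back to the selected row. */
void dram_write(int row, int col, uint8_t data) {
    memcpy(row_buf, cells[row], COLS);   /* Step IA */
    row_buf[col] = data;                 /* Step IB */
    memcpy(cells[row], row_buf, COLS);   /* Step II */
}

In a real chip the restore happens electrically through the sense amplifiers while the row is open, rather than as an explicit copy, but the data movement is the same.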

16
DRAM Refresh
• DRAM periodically refreshes all contents
• Loops through all words
  • Reads word into row buffer
  • Writes row buffer back into DRAM array
• 1–2% of DRAM time occupied by refresh
[Figure: DRAM array with sense amps (sa), row-buffer latches (DL), and address/data paths.]

17
DRAM Parameters
• DRAM parameters
  • Large capacity: e.g., 64–256Mb
  • Arranged as square
    + Minimizes wire length
    + Maximizes refresh efficiency
  • Narrow data interface: 1–16 bits
    • Cheap packages → few bus pins
  • Narrow address interface: N/2 bits
    • 16Mb DRAM has a 12-bit address bus
    • How does that work?
[Figure: DRAM bit array with a row buffer and narrow address/data interfaces.]

18
Two-Level Addressing
• Two-level addressing
  • Row decoder and column muxes share the address lines
  • Two strobes (RAS, CAS) signal which part of the address is currently on the bus
[Figure: address bits [23:12] and [11:2] share the bus; RAS latches the row bits into a 12-to-4K decoder over a 4K x 4K bit array, the selected row lands in the row buffer, and CAS drives four 1K-to-1 column muxes onto the data pins.]
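A sketch of how a controller would drive the shared 12-bit address bus in two steps for the 4K x 4K example above (the pin-driving functions are placeholders, not a real API):

#include <stdint.h>

/* Placeholder pins; a real controller drives these as bus signals. */
extern void drive_address_pins(uint16_t a);   /* the shared 12-bit address bus */
extern void assert_RAS(void);
extern void assert_CAS(void);

/* Two-level addressing for the 4K x 4K example: bits [23:12] select one of
 * 4K rows, bits [11:2] select one of 1K 4-bit column groups. */
void dram_access(uint32_t addr) {
    uint16_t row = (addr >> 12) & 0xFFF;   /* [23:12], 12 bits */
    uint16_t col = (addr >> 2)  & 0x3FF;   /* [11:2],  10 bits */

    drive_address_pins(row);   /* put the row on the shared address lines */
    assert_RAS();              /* row address strobe: row -> row buffer   */

    drive_address_pins(col);   /* reuse the same lines for the column     */
    assert_CAS();              /* column address strobe: select 4 bits    */
}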

19
Access Latency and Cycle Time
• DRAM access much slower than SRAM
• More bits → longer wires
• Buffered access with two-level addressing
• SRAM access latency: <1ns
• DRAM access latency: 30–50ns

• DRAM cycle time also longer than access time


• Cycle time: time between start of consecutive accesses
• SRAM: cycle time = access time
• Begin second access as soon as first access finishes
• DRAM: cycle time = 2 * access time
• Why? Can’t begin new access while DRAM is refreshing row

20
Open v. Closed Pages
• Open Page
• Row stays active until another row needs to be accessed
• Acts as memory-level cache to reduce latency
• Variable access latency complicates memory controller
• Higher power dissipation (sense amps remain active)
• Closed Page
• Immediately deactivate row after access
• All accesses become Activate Row, Read/Write, Precharge

• Complex power v. performance trade off

21
DRAM Bandwidth
• Use multiple DRAM chips to increase bandwidth
• Recall: accesses are the same size as a second-level cache block
• Example: 16 two-byte-wide chips for a 32B access

• DRAM density increasing faster than demand


• Result: number of memory chips per system decreasing

• Need to increase the bandwidth per chip


• Especially important in game consoles
• SDRAM ➔ DDR ➔ DDR2 ➔ FBDIMM (➔ DDR3)
• Rambus - high-bandwidth memory
• Used by several game consoles

22
Synchronous DRAM (SDRAM)

RAS’

CAS’

Column add
Row add

Data Data Data


• Add Clock and Wider data!
• Also multiple transfers per RAS/CAS
23
Enhanced SDRAM & DDR
• Evolutionary Enhancements on SDRAM:
1. ESDRAM (Enhanced): Overlap row buffer access with
refresh

2. DDR (Double Data Rate): Transfer on both clock edges


3. DDR2’s small improvements
lower voltage, on-chip termination, driver calibration
prefetching, conflict buffering
4. DDR3, more small improvements
lower voltage, 2X speed, 2X prefetching,
2X banks, “fly-by topology”, automatic calibration

24
Interleaved Main Memory
• Divide memory into M banks and “interleave” addresses
across them, so word A is
• in bank (A mod M)
• at word (A div M)
         Bank 0      Bank 1      Bank 2      …    Bank n-1
         word 0      word 1      word 2      …    word n-1
         word n      word n+1    word n+2    …    word 2n-1
         word 2n     word 2n+1   word 2n+2   …    word 3n-1

PA fields: | Doubleword in bank | Bank | Word in doubleword |

Interleaved memory increases memory BW without wider bus


• Use parallelism in memory banks to hide memory latency
Copyright © 2002 Falsafi, from Hill,
Smith, Sohi, Vijaykumar, and Wood 25
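A small sketch of the interleaving function above (M = 4 here, just for illustration):

#include <stdio.h>

#define M 4   /* number of banks (illustrative) */

int main(void) {
    for (unsigned A = 0; A < 12; A++) {
        unsigned bank   = A % M;   /* word A lives in bank (A mod M)     */
        unsigned offset = A / M;   /* ... at word (A div M) in that bank */
        printf("word %2u -> bank %u, offset %u\n", A, bank, offset);
    }
    return 0;
}

In hardware M is a power of two, so “A mod M” is just the low bank bits and “A div M” is the remaining address bits.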
Block interleaved memory systems
• Cache blocks map to separate memory controllers
• Interleave across DRAMs w/i a MC
• Interleave across intra-DRAM banks w/i a DRAM
[Figure: a CPU connected over the data bus to four memory controllers (MC), each controlling multiple DRAMs; consecutive 64B blocks B, B+64, B+128, B+192 map to different MCs.]

26
Research: Processing in Memory
• Processing in memory
  • Embed some ALUs in DRAM
  • Picture is logical, not physical
• Do computation in DRAM rather than…
  • Move data from DRAM to CPU
  • Compute on CPU
  • Move data from CPU to DRAM
• Will come back to this in “vectors” unit
• E.g., IRAM: intelligent RAM
  • Berkeley research project
  • [Patterson+, ISCA’97]
• Very hot again
[Figure: DRAM bit array with row buffers and compute logic embedded next to the data path.]

27
Memory Hierarchy Review
• Storage: registers, memory, disk
• Memory is the fundamental element

• Memory component performance


• t_avg = t_hit + %_miss * t_miss
• Can’t get both low t_hit and %_miss in a single structure
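A quick worked example with illustrative numbers (not from the slides): if t_hit = 1 ns, %_miss = 5%, and t_miss = 50 ns, then

t_avg = 1 ns + 0.05 * 50 ns = 3.5 ns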

• Memory hierarchy
• Upper components: small, fast, expensive
• Lower components: big, slow, cheap
• t_avg of hierarchy is close to t_hit of upper (fastest) component
• 10/90 rule: 90% of stuff found in fastest component
• Temporal/spatial locality: automatic up-down data movement

28
Bonus

29
The DRAM Subsystem

Onur Mutlu….

Slide History/Attribution: UW Madison (Hill, Sohi, Smith, Wood); UPenn (Amir Roth, Milo Martin); UW Madison (Hill, Sohi, Wood, Sankaralingam, Sinclair); UCLA (Nowatzki); various universities (Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Vijaykumar)
DRAM Subsystem Organization

• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column

31
Page Mode DRAM
• A DRAM bank is a 2D array of cells: rows x columns
• A “DRAM row” is also called a “DRAM page”
• “Sense amplifiers” also called “row buffer”

• Each address is a <row,column> pair


• Access to a “closed row”
• Activate command opens row (placed into row buffer)
• Read/write command reads/writes column in the row buffer
• Precharge command closes the row and prepares the bank for next
access
• Access to an “open row”
• No need for activate command

32
DRAM Bank Operation
Access address sequence: (Row 0, Column 0), (Row 0, Column 1), (Row 0, Column 85), (Row 1, Column 0)
[Figure: a bank as a 2D array of rows and columns with a row decoder, a row buffer, and a column mux. The first access finds an empty row buffer and loads Row 0; the following accesses to Row 0, Columns 1 and 85, are row-buffer HITs; the access to Row 1 is a row-buffer CONFLICT.]

33
The DRAM Chip
• Consists of multiple banks (2-16)
• Banks share command/address/data buses
• The chip itself has a narrow interface (4-16 bits per read)

34
128M x 8-bit DRAM Chip

35
DRAM Rank and Module
• Rank: Multiple chips operated together to form a wide
interface
• All chips comprising a rank are controlled at the same time
• Respond to a single command
• Share address and command buses, but provide different data

• A DRAM module consists of one or more ranks


• E.g., DIMM (dual inline memory module)
• This is what you plug into your motherboard

• If we have chips with an 8-bit interface, to read 8 bytes in a single access, use 8 chips in a DIMM

36
A 64-bit Wide DIMM (One Rank)

[Figure: eight DRAM chips side by side, sharing the command bus and together driving the 64-bit data bus.]

37
A 64-bit Wide DIMM (One Rank)
• Advantages:
  • Acts like a high-capacity DRAM chip with a wide interface
  • Simplicity: the memory controller does not need to deal with individual chips
• Disadvantages:
  • Granularity: accesses cannot be smaller than the interface width

38
Multiple DIMMs
• Advantages:
  • Enables even higher capacity
• Disadvantages:
  • Interconnect complexity and energy consumption can be high

40
DRAM Channels

• 2 Independent Channels: 2 Memory Controllers (Above)


• 2 Dependent/Lockstep Channels: 1 Memory Controller with
wide interface (Not shown above)

42
Generalized Memory Structure

43
The DRAM Subsystem
The Top Down View

Onur Mutlu….

Slide History/Attribution: UW Madison (Hill, Sohi, Smith, Wood); UPenn (Amir Roth, Milo Martin); UW Madison (Hill, Sohi, Wood, Sankaralingam, Sinclair); UCLA (Nowatzki); various universities (Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Vijaykumar)
DRAM Subsystem Organization

• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column

45
The DRAM subsystem

[Figure: a processor with two memory channels, each channel connecting to a DIMM (dual in-line memory module).]

46
Breaking down a DIMM

[Figure: a DIMM (dual in-line memory module) shown in side view, front view, and back view.]

47
Breaking down a DIMM

[Figure: the same DIMM; the front side carries Rank 0 (a collection of 8 chips) and the back side carries Rank 1.]

48
Rank

[Figure: Rank 0 (front) and Rank 1 (back) each drive Data <0:63>; the two ranks share the Addr/Cmd bus on the memory channel and are selected by CS <0:1>.]
49
Breaking down a Rank

[Figure: Rank 0 is built from Chips 0–7; Chip 0 supplies Data <0:7>, Chip 1 supplies <8:15>, …, Chip 7 supplies <56:63>, together forming Data <0:63>.]

50
Breaking down a Chip

[Figure: Chip 0 contains multiple banks (Bank 0, …), all sharing the chip’s 8-bit data interface <0:7>.]

51
Breaking down a Bank
[Figure: Bank 0 is an array of 16k rows (row 0 … row 16k-1), each 2kB wide; a column is 1B. The row buffer holds one 2kB row and delivers it 1B at a time over the chip’s <0:7> interface.]

52
DRAM Subsystem Organization

• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column

53
Example: Transferring a cache block
[Figure: the physical memory space from 0x00 to 0xFFFF…F; the 64B cache block at address 0x40 maps to Channel 0, DIMM 0, Rank 0.]
54
Example: Transferring a cache block
[Figure sequence: Rank 0 with Chips 0–7, each supplying 8 bits (<0:7> … <56:63>) of Data <0:63>. The 64B cache block at physical address 0x40 maps to Row 0 in every chip; Column 0, then Column 1, and so on are read, and each column access delivers 8B onto the bus (1B per chip).]
Example: Transferring a cache block
[Figure: after eight column accesses, the eight chips of Rank 0 have delivered the full 64B cache block, 8B at a time, over Data <0:63>.]
A 64B cache block takes 8 I/O cycles to transfer.

During the process, 8 columns are read sequentially.
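A quick sketch of the byte-lane view of this transfer (the byte-to-beat assignment below is one plausible choice, not taken from the slides): with eight chips, each with an 8-bit interface, beat b of the burst carries column b, and chip c always supplies byte lane c:

#include <stdio.h>

#define CHIPS 8    /* chips in the rank, 8-bit interface each -> 64-bit bus */
#define BLOCK 64   /* cache block size in bytes */

int main(void) {
    /* 8 beats (I/O cycles); each beat reads one column from every chip. */
    for (int beat = 0; beat < BLOCK / CHIPS; beat++) {
        printf("beat %d (column %d):", beat, beat);
        for (int chip = 0; chip < CHIPS; chip++) {
            int byte = beat * CHIPS + chip;   /* which byte of the 64B block */
            printf("  chip%d->byte%2d", chip, byte);
        }
        printf("\n");
    }
    return 0;
}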


60
Latency Components: Basic DRAM Operation
• CPU → controller transfer time
• Controller latency
• Queuing & scheduling delay at the controller
• Access converted to basic commands
• Controller → DRAM transfer time
• DRAM bank latency
• Simple CAS (column address strobe) if row is “open” OR
• RAS (row address strobe) + CAS if array precharged OR
• PRE + RAS + CAS (worst case)
• DRAM → Controller transfer time
• Bus latency (BL)
• Controller to CPU transfer time

61
Multiple Banks (Interleaving) and Channels
• Multiple banks
• Enable concurrent DRAM accesses
• Bits in address determine which bank an address resides in
• Multiple independent channels serve the same purpose
• But they are even better because they have separate data buses
• Increased bus bandwidth

• Enabling more concurrency requires reducing


• Bank conflicts
• Channel conflicts
• How to select/randomize bank/channel indices in address?
• Lower order bits have more entropy
• Randomizing hash functions (XOR of different address bits)
• Pathological cases (strided accesses with long strides)

62
How Multiple Banks Help

63
Address Mapping (Single Channel)
• Single-channel system with 8-byte memory bus
• 2GB memory, 8 banks, 16K rows & 2K columns per bank

• Row interleaving
• Consecutive rows of memory in consecutive banks

| Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) |

• Accesses to consecutive cache blocks serviced in a pipelined manner

• Cache block interleaving


• Consecutive cache block addresses in consecutive banks
• 64 byte cache blocks

| Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Column (3 bits) | Byte in bus (3 bits) |
• Accesses to consecutive cache blocks can be serviced in parallel
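Both mappings can be written as simple bit-field extraction; a sketch using the field widths above (31-bit physical address for 2GB; the helper names are illustrative):

#include <stdint.h>

/* Row interleaving:  Row(14) | Bank(3) | Column(11) | Byte-in-bus(3) */
void map_row_interleaved(uint32_t pa, uint32_t *row, uint32_t *bank, uint32_t *col) {
    *col  = (pa >> 3)  & 0x7FF;    /* 11 bits */
    *bank = (pa >> 14) & 0x7;      /*  3 bits */
    *row  = (pa >> 17) & 0x3FFF;   /* 14 bits */
}

/* Cache-block interleaving:
 * Row(14) | High Column(8) | Bank(3) | Low Column(3) | Byte-in-bus(3) */
void map_block_interleaved(uint32_t pa, uint32_t *row, uint32_t *bank, uint32_t *col) {
    uint32_t lo_col = (pa >> 3) & 0x7;        /* 3 bits */
    *bank           = (pa >> 6) & 0x7;        /* 3 bits */
    uint32_t hi_col = (pa >> 9) & 0xFF;       /* 8 bits */
    *row            = (pa >> 17) & 0x3FFF;    /* 14 bits */
    *col            = (hi_col << 3) | lo_col; /* full 11-bit column */
}

With cache-block interleaving, addresses 64B apart differ in the bank bits, which is why consecutive cache blocks can be serviced in parallel.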
64
Bank Mapping Randomization
• DRAM controller can randomize the address mapping to
banks so that bank conflicts are less likely

[Figure: the 3 bank bits sitting above Column (11 bits) and Byte in bus (3 bits) are XORed with another 3-bit address field to produce the Bank index (3 bits).]
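A sketch of the XOR hash, building on the row-interleaved layout from the previous slide (exactly which higher-order bits are folded into the bank index is a design choice; the bit positions here are illustrative):

#include <stdint.h>

/* Bank index = (plain bank bits) XOR (some higher-order address bits), so
 * strided access patterns spread across banks instead of colliding. */
uint32_t bank_index(uint32_t pa) {
    uint32_t bank_bits = (pa >> 14) & 0x7;   /* the 3 bits above the column   */
    uint32_t row_bits  = (pa >> 17) & 0x7;   /* illustrative: low 3 row bits  */
    return bank_bits ^ row_bits;
}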

65
DRAM Refresh (I)
• DRAM capacitor charge leaks over time
• The memory controller needs to read each row periodically
to restore the charge
• Activate + precharge each row every N ms
• Typical N = 64 ms
• Implications on performance?
-- DRAM bank unavailable while refreshed
-- Long pause times: If we refresh all rows in burst, every 64ms the
DRAM will be unavailable until refresh ends
• Burst refresh: All rows refreshed immediately after one
another
• Distributed refresh: Each row refreshed at a different time,
at regular intervals
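A quick sanity check (assuming 8,192 rows per bank, a common figure not given on the slide): distributed refresh must issue one row refresh about every 64 ms / 8192 ≈ 7.8 µs, which matches the standard tREFI interval of DDR3/DDR4 parts.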

68
DRAM Refresh (II)

• Distributed refresh eliminates long pause times


• How else we can reduce the effect of refresh on
performance?
• Can we reduce the number of refreshes?

69
Downsides of DRAM Refresh
-- Energy consumption: Each refresh consumes energy
-- Performance degradation: DRAM rank/bank unavailable while
refreshed
-- QoS/predictability impact: (Long) pause times during refresh
-- Refresh rate limits DRAM density scaling

Liu et al., “RAIDR: Retention-aware Intelligent DRAM Refresh,” ISCA 2012.

70
Memory Controllers

Onur Mutlu….

Slide History/Attribution: UW Madison (Hill, Sohi, Smith, Wood); UPenn (Amir Roth, Milo Martin); UW Madison (Hill, Sohi, Wood, Sankaralingam, Sinclair); UCLA (Nowatzki); various universities (Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Vijaykumar)
DRAM versus Other Types of Memories
• Long latency memories have similar characteristics that
need to be controlled.

• The following discussion will use DRAM as an example, but


many issues are similar in the design of controllers for
other types of memories
• Flash memory
• Other emerging memory technologies
• Phase Change Memory
• Spin-Transfer Torque Magnetic Memory

72
DRAM Types
• DRAM has different types with different interfaces
optimized for different purposes
• Commodity: DDR, DDR2, DDR3, DDR4
• Low power (for mobile): LPDDR[1-5]
• High bandwidth (for graphics): GDDR[1-5]
• Low latency: eDRAM, RLDRAM, …
• 3D stacked: HBM, HMC,…
• Underlying microarchitecture is fundamentally the same
• A flexible memory controller can support various DRAM
types, but…
• This complicates the memory controller
• Difficult to support all types (and upgrades)

73
DRAM Controller: Functions
• Ensure correct operation of DRAM (refresh and timing)

• Service DRAM requests while obeying timing constraints of


DRAM chips
• Constraints: resource conflicts (bank, bus, channel), minimum
write-to-read delays
• Translate requests to DRAM command sequences

• Buffer and schedule requests to improve performance


• Reordering, row-buffer, bank, rank, bus management

• Manage power consumption and thermals in DRAM


• Turn on/off DRAM chips, manage power modes

74
DRAM Controller: Where to Place
• In chipset
+ More flexibility to plug different DRAM types into the system
+ Less power density in the CPU chip

• On CPU chip
+ Reduced latency for main memory access
+ Higher bandwidth between cores and controller
• More information can be communicated (e.g. request’s
importance in the processing core)

75
A Modern DRAM Controller

76
DRAM Scheduling Policies (I)
• FCFS (first come first served)
• Oldest request first

• FR-FCFS (first ready, first come first served)


1. Row-hit first
2. Oldest first
Goal: Maximize row buffer hit rate → maximize DRAM throughput

• Actually, scheduling is done at the command level


• Column commands (read/write) prioritized over row commands
(activate/precharge)
• Within each group, older commands prioritized over younger ones
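A minimal sketch of FR-FCFS request selection (the request and bank-state structures are made up for illustration; as noted above, a real controller applies the same priorities at the command level):

#include <stdint.h>

typedef struct {
    int      bank;
    int      row;
    uint64_t arrival_time;   /* smaller = older */
} Request;

/* FR-FCFS: among pending requests, prefer a row-buffer hit, breaking ties by
 * age; if there is no hit, pick the oldest request. Returns an index or -1. */
int frfcfs_pick(const Request *q, int n, const int *open_row /* per bank */) {
    int best = -1, best_hit = 0;
    for (int i = 0; i < n; i++) {
        int hit = (q[i].row == open_row[q[i].bank]);
        if (best < 0 ||
            (hit && !best_hit) ||                          /* 1. row-hit first */
            (hit == best_hit &&
             q[i].arrival_time < q[best].arrival_time)) {  /* 2. oldest first  */
            best = i;
            best_hit = hit;
        }
    }
    return best;
}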

77
DRAM Scheduling Policies (II)
• A scheduling policy is essentially a prioritization order

• Prioritization can be based on


• Request age
• Row buffer hit/miss status
• Request type (prefetch, read, write)
• Requestor type (load miss or store miss)
• Request criticality
• Oldest miss in the core?
• How many instructions in core are dependent on it?

78
Row Buffer Management Policies
• Open row
• Keep the row open after an access
+ Next access might need the same row → row hit
-- Next access might need a different row → row conflict, wasted energy

• Closed row
• Close the row after an access (if no other requests already in the request
buffer need the same row)
+ Next access might need a different row → avoid a row conflict
-- Next access might need the same row → extra activate latency

• Adaptive policies
• Predict whether or not the next access to the bank will be to the
same row

79
Open vs. Closed Row Policies

Policy      | First access | Next access                                        | Commands needed for next access
Open row    | Row 0        | Row 0 (row hit)                                    | Read
Open row    | Row 0        | Row 1 (row conflict)                               | Precharge + Activate Row 1 + Read
Closed row  | Row 0        | Row 0 – access in request buffer (row hit)         | Read
Closed row  | Row 0        | Row 0 – access not in request buffer (row closed)  | Activate Row 0 + Read + Precharge
Closed row  | Row 0        | Row 1 (row closed)                                 | Activate Row 1 + Read + Precharge
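The table maps directly onto a small decision function; a sketch with illustrative names (a real controller issues these as DRAM commands rather than printing them):

#include <stdio.h>

typedef enum { OPEN_ROW_POLICY, CLOSED_ROW_POLICY } Policy;

/* Commands the controller issues for the next access, given the policy and
 * whether the requested row is (still) in the row buffer. Mirrors the table. */
void commands_for_access(Policy p, int open_row, int req_row) {
    if (p == OPEN_ROW_POLICY) {
        if (open_row == req_row)
            printf("Read\n");                                             /* row hit      */
        else
            printf("Precharge + Activate Row %d + Read\n", req_row);      /* row conflict */
    } else {
        /* Closed-row policy: the row was precharged after the last access
         * unless a pending request to the same row kept it open. */
        if (open_row == req_row)
            printf("Read\n");                                             /* row hit      */
        else
            printf("Activate Row %d + Read + Precharge\n", req_row);      /* row closed   */
    }
}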

80
Why are DRAM Controllers Difficult to Design?
• Need to obey DRAM timing constraints for correctness
• There are many (50+) timing constraints in DRAM
• tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued (rank-level constraint – the data bus must change direction)
• tRC: Minimum number of cycles between the issuing of two
consecutive activate commands to the same bank
• …
• Need to keep track of many resources to prevent conflicts
• Channels, banks, ranks, data bus, address bus, row buffers
• Need to handle DRAM refresh
• Need to optimize for performance & QoS (in the presence of constraints)
• Reordering is not simple
• Fairness and QoS needs complicate the scheduling problem

81
Many DRAM Timing Constraints

• From Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing


Write-Caused Interference in Memory Systems,” HPS Technical Report,
April 2010.

82
2D Packaging

Board level System on Chip

[M. Maxfield, “2D vs. 2.5D vs. 3D ICs 101,” EE Times, April
2012]
Conventional packaging approaches
93
2D Packaging

[M. Maxfield, “2D vs. 2.5D vs. 3D ICs 101,” EE Times, April
2012]
Move toward System in Package (SIP)
• PCB, ceramic, semiconductor substrates
94
2.5D Packaging

2D Packaging 2.5D Packaging

[M. Maxfield, “2D vs. 2.5D vs. 3D ICs 101,” EE Times, April
2012]
2.5D uses a silicon interposer and through-silicon vias (TSVs)
95
3D Packaging

3D Homogeneous 3D Heterogeneous

[M. Maxfield, “2D vs. 2.5D vs. 3D ICs 101,” EE Times, April
2012]
3D uses through-silicon vias (TSVs) and/or an interposer
96
Packaging Discussion

• Heterogeneous integration
• RF, analog (PHY), FG/PCM/ReRAM, photonics
• Cost
• Silicon yield
• Bandwidth, esp. interposer
• Thermals
• It’s real!
• DRAM: HMC, HBM
• FPGAs
• GPUs: AMD, NVIDIA
• CPUs: AMD Zen, EPYC
97
Brief History of DRAM
• DRAM (memory): a major force behind computer industry
• Modern DRAM came with introduction of IC (1970)
• Preceded by magnetic “core” memory (1950s)
• Each cell was a small magnetic “donut”
• And by mercury delay lines before that (e.g., EDSAC)
• Re-circulating vibrations in mercury tubes

“the one single development that put computers on their feet was the
invention of a reliable form of memory, namely the core memory… Its
cost was reasonable, it was reliable, and because it was reliable it
could in due course be made large”
Maurice Wilkes
Memoirs of a Computer Pioneer, 1985

98
DRAM Basics [Jacob and Wang]
• Precharge and Row Access

99
DRAM Basics, cont.
• Column Access

100
DRAM Basics, cont.
• Data Transfer

101
Open v. Closed Pages
• Open Page
• Row stays active until another row needs to be accessed
• Acts as memory-level cache to reduce latency
• Variable access latency complicates memory controller
• Higher power dissipation (sense amps remain active)
• Closed Page
• Immediately deactivate row after access
• All accesses become Activate Row, Read/Write, Precharge

• Complex power v. performance trade off

102
DRAM Bandwidth
• Use multiple DRAM chips to increase bandwidth
• Recall: accesses are the same size as a second-level cache block
• Example: 16 two-byte-wide chips for a 32B access

• DRAM density increasing faster than demand


• Result: number of memory chips per system decreasing

• Need to increase the bandwidth per chip


• Especially important in game consoles
• SDRAM ➔ DDR ➔ DDR2 ➔ FBDIMM (➔ DDR3)
• Rambus - high-bandwidth memory
• Used by several game consoles

103
DRAM Evolution
• Survey by Cuppu et al.
1. Early Asynchronous Interface
2. Fast Page Mode/Nibble Mode/Static Column (skip)
3. Extended Data Out
4. Synchronous DRAM & Double Data Rate
5. Rambus & Direct Rambus
6. FB-DIMM

104
Old 64 Mbit DRAM Example from Micron
Clock Recovery

105
Extended Data Out (EDO)

RAS’

CAS’

Row add Column add Column add Column add

Data Data Data

• Similar to Fast Page Mode


• But overlapped Column Address assert with Data Out
106
Synchronous DRAM (SDRAM)

RAS’

CAS’

Column add
Row add

Data Data Data


• Add Clock and Wider data!
• Also multiple transfers per RAS/CAS
107
Enhanced SDRAM & DDR
• Evolutionary Enhancements on SDRAM:
1. ESDRAM (Enhanced): Overlap row buffer access with
refresh

2. DDR (Double Data Rate): Transfer on both clock edges


3. DDR2’s small improvements
lower voltage, on-chip termination, driver calibration
prefetching, conflict buffering
4. DDR3, more small improvements
lower voltage, 2X speed, 2X prefetching,
2X banks, “fly-by topology”, automatic calibration

108
Wide v. Narrow Interfaces
• High frequency ➔ short wavelength ➔ data skew issues
• Balance wire lengths

DDR-2 serpentine board routing FB-DIMM board routing

109
Rambus RDRAM
• High-frequency, narrow channel
• Time multiplexed “bus” ➔ dynamic point-to-point channels
• ~40 pins ➔ 1.6GB/s
• Proprietary solution
• Never gained industry-wide acceptance (cost and power)
• Used in some game consoles (e.g., PS2)

[Figure: a CPU/memory controller connected to a chain of RDRAM devices over a 16-bit data bus at 800 MHz, with from_clock/to_clock signals running along the channel.]

110
FB-DIMM

111
DRAM Reliability
• One last thing about DRAM technology… errors
• DRAM bits can flip from 0➔1 or 1➔0
• Small charge stored per bit
• Energetic -particle strikes disrupt stored charge
• Many more bits
• Modern DRAM systems: built-in error detection/correction
• Today all servers; most new desktop and laptops
• Key idea: checksum-style redundancy
• Main DRAM chips store data, additional chips store f(data)
• |f(data)| < |data|
• On read: re-compute f(data), compare with stored f(data)
• Different ? Error…
• Option I (detect): kill program
• Option II (correct): enough information to fix error? fix and go on

112
DRAM Error Detection and Correction
[Figure: four 4M x 2B data chips (0, 1, 2, 3) plus one 4M x 2B check chip (f), sharing address, data, and error signals.]

• Performed by memory controller (not the DRAM chip)


• Error detection/correction schemes distinguished by…
• How many (simultaneous) errors they can detect
• How many (simultaneous) errors they can correct
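A toy version of the f(data) idea, using byte-wise parity so it can only detect (not correct) errors; real DIMMs use SECDED ECC codes that can also correct single-bit errors:

#include <stdint.h>

/* f(data): byte-wise XOR parity over the 8 data bytes of a 64-bit word,
 * so |f(data)| = 8 bits < |data| = 64 bits. */
uint8_t f(uint64_t data) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (uint8_t)(data >> (8 * i));
    return p;
}

/* On read: re-compute f(data) and compare with the stored check byte.
 * Returns 1 if an error is detected (this toy code cannot locate or fix it). */
int check_on_read(uint64_t data, uint8_t stored_f) {
    return f(data) != stored_f;
}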

113
Interleaved Main Memory
• Divide memory into M banks and “interleave” addresses
across them, so word A is
• in bank (A mod M)
• at word (A div M)
         Bank 0      Bank 1      Bank 2      …    Bank n-1
         word 0      word 1      word 2      …    word n-1
         word n      word n+1    word n+2    …    word 2n-1
         word 2n     word 2n+1   word 2n+2   …    word 3n-1

PA fields: | Doubleword in bank | Bank | Word in doubleword |

Interleaved memory increases memory BW without wider bus


• Use parallelism in memory banks to hide memory latency
Copyright © 2002 Falsafi, from Hill,
Smith, Sohi, Vijaykumar, and Wood 114
Block interleaved memory systems
• Cache blocks map to separate memory controllers
• Interleave across DRAMs w/i a MC
• Interleave across intra-DRAM banks w/i a DRAM
[Figure: a CPU connected over the data bus to four memory controllers (MC), each controlling multiple DRAMs; consecutive 64B blocks B, B+64, B+128, B+192 map to different MCs.]

115
Memory Hierarchy Review
• Storage: registers, memory, disk
• Memory is the fundamental element

• Memory component performance


• t_avg = t_hit + %_miss * t_miss
• Can’t get both low t_hit and %_miss in a single structure

• Memory hierarchy
• Upper components: small, fast, expensive
• Lower components: big, slow, cheap
• t_avg of hierarchy is close to t_hit of upper (fastest) component
• 10/90 rule: 90% of stuff found in fastest component
• Temporal/spatial locality: automatic up-down data movement

116
Software Managed Memory
• Isn’t full associativity difficult to implement?
• Yes … in hardware
• Implement fully associative memory in software

• Let’s take a step back…

117
