DRAM Basics by Prof. Matthew D. Sinclair
Network of DRAM
Hybrid Memory Cube
RAM
• RAM: large storage arrays
• Basic structure
  • MxN array of bits (M N-bit words)
  • This one is 4x2
  • Bits in word connected by wordline
  • Bits in position connected by bitline
• Operation
  • Address decodes into M wordlines
  • High wordline → word on bitlines
  • Bit/bitline connection → read/write
• Access latency
  • ∝ #ports × √#bits
[Figure: 4x2 array of cells crossed by wordline0–wordline3 and bitline0/bitline1, with an address decoder and a data port]
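The decode-then-drive operation above maps directly to code. Below is a minimal C sketch of the 4x2 array: the address is decoded one-hot into wordlines, and the selected word is copied onto the bitlines. All names and cell contents are illustrative.

```c
/* A minimal sketch of the 4x2 RAM from the figure: the address decodes
 * to one of M wordlines, and the selected word drives the bitlines.
 * Names and contents are illustrative, not from a real interface. */
#include <stdio.h>

#define M 4  /* words (wordlines) */
#define N 2  /* bits per word (bitlines) */

static unsigned char cells[M][N] = {
    {0, 1}, {1, 1}, {0, 0}, {1, 0}
};

/* Read: decode 'address' into a one-hot wordline, then copy the
 * selected word onto the bitlines. */
static void ram_read(unsigned address, unsigned char bitlines[N]) {
    for (unsigned w = 0; w < M; w++) {
        int wordline = (w == address);           /* one-hot decode */
        if (wordline)
            for (unsigned b = 0; b < N; b++)
                bitlines[b] = cells[w][b];
    }
}

int main(void) {
    unsigned char data[N];
    ram_read(2, data);
    printf("word 2 = %u%u\n", data[1], data[0]); /* bit1, bit0 */
    return 0;
}
```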
SRAM
• SRAM: static RAM
• Bits as cross-coupled inverters (CCI)
  – Four transistors per bit
• “Static” means
  • Inverters connected to pwr/gnd
  + Bits naturally/continuously “refreshed”
[Figure: SRAM bit array with address and data ports]
DRAM
• DRAM: dynamic RAM
• Bits as capacitors
  + Single transistors as ports
• “Dynamic” means
  • Capacitors not connected to pwr/gnd
  – Stored charge decays over time
  – Must be explicitly refreshed
[Figure: DRAM bit array with address and data ports]
DRAM Basics [Jacob and Wang]
• Precharge and Row Access
DRAM Basics, cont.
• Column Access
DRAM Basics, cont.
• Data Transfer
DRAM Operation I
• Read: similar to cache read
  • Phase I: pre-charge bitlines to 0.5V
  • Phase II: decode address, enable wordline
• Write: two steps
  • Step IA: read selected word into row buffer
  • Step IB: write data into row buffer
  • Step II: write row buffer back to selected word
[Figure: DRAM array with sense amps (sa) and data latches (DL); r-I marks the read phase, r/w-I and r/w-II the two write steps]
DRAM Refresh
• DRAM periodically refreshes all contents
  • Loops through all words
  • Reads each word into the row buffer, then writes it back
[Figure: the same array; refresh passes each row through the sense amps (sa) and data latches (DL)]
DRAM Parameters
• DRAM parameters
  • Large capacity: e.g., 64–256Mb
  • Arranged as square
    + Minimizes wire length
    + Maximizes refresh efficiency
Two-Level Addressing
• Two-level addressing
  • Row decoder and column muxes share the address lines
  • Two strobes (RAS, CAS) signal which part of the address is currently on the bus
[Figure: address bits [23:12] feed a 12-to-4K row decoder into a 4K x 4K bit array; the selected row lands in the row buffer, and bits [11:2] drive four 1K-to-1 column muxes onto the data pins]
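A small C sketch of the address split in the figure, assuming the [23:12]/[11:2] bit ranges shown; the helper names are illustrative.

```c
/* A hedged sketch of the two-level address split from the figure:
 * bits [23:12] select one of 4K rows (sent with RAS), bits [11:2]
 * select the column (sent with CAS). Helper names are illustrative. */
#include <stdio.h>

static unsigned row_bits(unsigned addr) { return (addr >> 12) & 0xFFF; } /* [23:12]: 12 bits -> 4K rows */
static unsigned col_bits(unsigned addr) { return (addr >> 2) & 0x3FF; }  /* [11:2]: 10 bits -> 1K columns */

int main(void) {
    unsigned addr = 0x00ABC123;  /* arbitrary example address */
    /* RAS phase puts the row half on the shared lines; CAS the column half. */
    printf("row = 0x%03X, col = 0x%03X\n", row_bits(addr), col_bits(addr));
    return 0;
}
```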
Access Latency and Cycle Time
• DRAM access much slower than SRAM
• More bits → longer wires
• Buffered access with two-level addressing
• SRAM access latency: <1ns
• DRAM access latency: 30–50ns
Open v. Closed Pages
• Open Page
• Row stays active until another row needs to be accessed
• Acts as memory-level cache to reduce latency
• Variable access latency complicates memory controller
• Higher power dissipation (sense amps remain active)
• Closed Page
• Immediately deactivate row after access
• All accesses become Activate Row, Read/Write, Precharge
DRAM Bandwidth
• Use multiple DRAM chips to increase bandwidth
• Recall, accesses are the same size as second-level cache blocks
• Example: 16 2-byte-wide chips for a 32B access
Synchronous DRAM (SDRAM)
[Figure: SDRAM timing diagram; RAS’ and CAS’ strobes latch the row and column addresses on a shared bus, synchronized to a clock]
Interleaved Main Memory
• Divide memory into M banks and “interleave” addresses across them, so word A is
  • in bank (A mod M)
  • at word (A div M)

Bank 0     Bank 1      Bank 2      ...  Bank n-1
word 0     word 1      word 2      ...  word n-1
word n     word n+1    word n+2    ...  word 2n-1
word 2n    word 2n+1   word 2n+2   ...  word 3n-1

[Figure: the physical address (PA) splits into “doubleword in bank”, “bank”, and “word in doubleword” fields; the CPU’s memory controllers (MC) each drive several DRAM chips, placing consecutive blocks B, B+64, B+128, B+192 on successive controllers over the data bus]
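The two formulas above in executable form; M = 4 is an arbitrary choice.

```c
/* A minimal sketch of interleaving word addresses across M banks,
 * following the formulas above: word A lives in bank (A mod M) at
 * offset (A div M). M = 4 here is an arbitrary example. */
#include <stdio.h>

int main(void) {
    const unsigned M = 4;
    for (unsigned A = 0; A < 8; A++)
        printf("word %u -> bank %u, offset %u\n", A, A % M, A / M);
    return 0;
}
```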
Research: Processing in Memory
• Processing in memory
  • Embed some ALUs in DRAM
  • Picture is logical, not physical
  • Do computation in DRAM rather than…
    • Move data from DRAM to CPU
    • Compute on CPU
    • Move data from CPU to DRAM
  • Will come back to this in “vectors” unit
• E.g., IRAM: intelligent RAM
  • Berkeley research project [Patterson+, ISCA ’97]
  • Very hot again
[Figure: DRAM bit arrays with ALUs placed beside the row buffers; logical, not physical, layout]
Memory Hierarchy Review
• Storage: registers, memory, disk
• Memory is the fundamental element
• Memory hierarchy
• Upper components: small, fast, expensive
• Lower components: big, slow, cheap
• t_avg of hierarchy is close to t_hit of upper (fastest) component
• 10/90 rule: 90% of stuff found in fastest component
• Temporal/spatial locality: automatic up-down data movement
Bonus
The DRAM Subsystem
[Onur Mutlu]
• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column
Page Mode DRAM
• A DRAM bank is a 2D array of cells: rows x columns
• A “DRAM row” is also called a “DRAM page”
• “Sense amplifiers” also called “row buffer”
DRAM Bank Operation
Access sequence: (Row 0, Column 0), (Row 0, Column 1), (Row 0, Column 85), (Row 1, Column 0)
[Figure: a bank as a rows x columns array with a row decoder, row buffer, and column mux; the first Row 0 access loads the empty row buffer, the later Row 0 accesses are row buffer HITs, and the Row 1 access is a row buffer CONFLICT]
The DRAM Chip
• Consists of multiple banks (2-16)
• Banks share command/address/data buses
• The chip itself has a narrow interface (4-16 bits per read)
128M x 8-bit DRAM Chip
DRAM Rank and Module
• Rank: multiple chips operated together to form a wide interface
• All chips comprising a rank are controlled at the same time
  • Respond to a single command
  • Share address and command buses, but provide different data
A 64-bit Wide DIMM (One Rank)
[Figure: the command bus fans out to every chip in the rank; the chips’ data pins together form the wide data bus]
A 64-bit Wide DIMM (One Rank)
• Advantages:
  • Acts like a high-capacity DRAM chip with a wide interface
  • Simplicity: memory controller does not need to deal with individual chips
• Disadvantages:
  • Granularity: accesses cannot be smaller than the interface width
Multiple DIMMs
• Advantages:
  • Enables even higher capacity
• Disadvantages:
  • Interconnect complexity and energy consumption can be high
DRAM Channels
Generalized Memory Structure
The DRAM Subsystem
The Top Down View
[Onur Mutlu]
• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column
The DRAM subsystem
[Figure: processor connected to the DRAM subsystem over a memory channel]
Breaking down a DIMM
Side view
[Figure: side view of a DIMM; the chips on each side of the module form separate ranks]
Rank
[Figure: two ranks on the module, each presenting a 64-bit <0:63> interface to the memory channel]
Breaking down a Rank
[Figure: Rank 0 consists of Chips 0–7; Chip 0 supplies data bits <0:7>, Chip 1 supplies <8:15>, …, Chip 7 supplies <56:63>, together forming Data <0:63>]
Breaking down a Chip
[Figure: Chip 0 contains multiple banks (Bank 0, …), all sharing the chip’s 8-bit <0:7> data interface]
Breaking down a Bank
[Figure: Bank 0 holds 16k rows (row 0 … row 16k-1), each 2kB wide; the row buffer latches one full row, and 1B columns are read from it one at a time over the <0:7> interface]
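The geometry above pins down the bank’s capacity; a quick check, using the 16k x 2kB figures from the picture.

```c
/* Quick arithmetic for the bank pictured above: 16K rows of 2kB each,
 * read out through an 8-bit (<0:7>) interface one 1B column at a time. */
#include <stdio.h>

int main(void) {
    unsigned long rows = 16 * 1024;        /* 16k rows */
    unsigned long row_bytes = 2 * 1024;    /* 2kB per row (2048 1B columns) */
    unsigned long bank_bytes = rows * row_bytes;
    printf("bank capacity = %lu MB\n", bank_bytes >> 20);  /* 32 MB */
    return 0;
}
```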
DRAM Subsystem Organization
• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column
Example: Transferring a cache block
[Figure sequence: the physical memory space runs from 0x00 to 0xFFFF…F; the 64B cache block at address 0x40 maps to Channel 0, DIMM 0, Rank 0, Row 0]
[Within Rank 0 (Chips 0–7, Data <0:63>), the block occupies Row 0, Columns 0–7; each I/O cycle reads one column, with every chip supplying 8 bits, so the block moves 8B per cycle: Col 0, then Col 1, and so on]
A 64B cache block takes 8 I/O cycles to transfer.
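The same takeaway as arithmetic; the chip and bus widths come from the rank breakdown above.

```c
/* The takeaway above as arithmetic: a rank of 8 chips, each
 * contributing 8 bits, moves 8B per I/O cycle, so a 64B cache
 * block needs 64/8 = 8 cycles (one column per cycle). */
#include <stdio.h>

int main(void) {
    unsigned block_bytes = 64;
    unsigned chips = 8, bits_per_chip = 8;
    unsigned bus_bytes = chips * bits_per_chip / 8;  /* 8B per cycle */
    printf("I/O cycles = %u\n", block_bytes / bus_bytes);  /* 8 */
    return 0;
}
```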
Multiple Banks (Interleaving) and Channels
• Multiple banks
• Enable concurrent DRAM accesses
• Bits in address determine which bank an address resides in
• Multiple independent channels serve the same purpose
• But they are even better because they have separate data buses
• Increased bus bandwidth
How Multiple Banks Help
Address Mapping (Single Channel)
• Single-channel system with 8-byte memory bus
• 2GB memory, 8 banks, 16K rows & 2K columns per bank
• Row interleaving
• Consecutive rows of memory in consecutive banks
Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
• Cache block interleaving
  • Consecutive cache block addresses in consecutive banks
Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
• Accesses to consecutive cache blocks can be serviced in parallel
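A hedged C sketch of the row-interleaved layout above; the bit positions follow the 14/3/11/3 field widths of the 2GB, 8-bank, 8-byte-bus system, and the example address is arbitrary.

```c
/* A hedged sketch of the row-interleaved mapping above for a 2GB,
 * 8-bank system with an 8-byte bus: from high to low bits,
 * Row(14) | Bank(3) | Column(11) | Byte-in-bus(3) = 31 address bits. */
#include <stdio.h>
#include <stdint.h>

static unsigned field(uint32_t a, int hi, int lo) {
    return (a >> lo) & ((1u << (hi - lo + 1)) - 1);
}

int main(void) {
    uint32_t a = 0x12345678;                  /* example physical address */
    printf("row  = %u\n", field(a, 30, 17));  /* 14 bits */
    printf("bank = %u\n", field(a, 16, 14));  /* 3 bits  */
    printf("col  = %u\n", field(a, 13, 3));   /* 11 bits */
    printf("byte = %u\n", field(a, 2, 0));    /* 3 bits  */
    return 0;
}
```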
Bank Mapping Randomization
• DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely
[Figure: selected address bits are XORed together to form the 3-bit bank index]
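A minimal sketch of one such randomization, assuming the 3-bit bank field from the previous slide is XORed with three low-order row bits; the exact bit choice is an assumption, not a prescribed scheme.

```c
/* A minimal sketch of XOR-based bank index randomization: the 3-bit
 * bank field is XORed with 3 higher-order address bits so strided
 * access patterns spread across banks. Bit positions reuse the
 * row-interleaved layout above and are illustrative. */
#include <stdio.h>
#include <stdint.h>

static unsigned bank_index(uint32_t addr) {
    unsigned bank_bits = (addr >> 14) & 7;  /* original Bank(3) field */
    unsigned row_bits  = (addr >> 17) & 7;  /* low 3 bits of Row(14)  */
    return bank_bits ^ row_bits;            /* randomized bank index  */
}

int main(void) {
    printf("bank = %u\n", bank_index(0x12345678));
    return 0;
}
```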
DRAM Refresh (I)
• DRAM capacitor charge leaks over time
• The memory controller needs to read each row periodically to restore the charge
• Activate + precharge each row every N ms
• Typical N = 64 ms
• Implications on performance?
  -- DRAM bank unavailable while refreshed
  -- Long pause times: if we refresh all rows in a burst, every 64 ms the DRAM will be unavailable until refresh ends
• Burst refresh: all rows refreshed immediately after one another
• Distributed refresh: each row refreshed at a different time, at regular intervals
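To see what distributed refresh implies, a quick calculation, assuming the 16K-rows-per-bank figure from the address mapping example above.

```c
/* Distributed refresh arithmetic, assuming the 16K-rows-per-bank
 * figure from the address mapping example: refreshing every row once
 * per 64 ms means issuing one row refresh every 64ms/16384 ~ 3.9 us. */
#include <stdio.h>

int main(void) {
    double retention_ms = 64.0;
    unsigned rows = 16 * 1024;
    printf("refresh one row every %.2f us\n",
           retention_ms * 1000.0 / rows);  /* ~3.91 us */
    return 0;
}
```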
DRAM Refresh (II)
[Figure: burst vs. distributed refresh timelines]
Downsides of DRAM Refresh
-- Energy consumption: Each refresh consumes energy
-- Performance degradation: DRAM rank/bank unavailable while refreshed
-- QoS/predictability impact: (Long) pause times during refresh
-- Refresh rate limits DRAM density scaling
Memory Controllers
[Onur Mutlu]
DRAM Types
• DRAM has different types, with different interfaces optimized for different purposes
• Commodity: DDR, DDR2, DDR3, DDR4
• Low power (for mobile): LPDDR[1-5]
• High bandwidth (for graphics): GDDR[1-5]
• Low latency: eDRAM, RLDRAM, …
• 3D stacked: HBM, HMC,…
• Underlying microarchitecture is fundamentally the same
• A flexible memory controller can support various DRAM types, but…
• This complicates the memory controller
• Difficult to support all types (and upgrades)
DRAM Controller: Functions
• Ensure correct operation of DRAM (refresh and timing)
DRAM Controller: Where to Place
• In chipset
+ More flexibility to plug different DRAM types into the system
+ Less power density in the CPU chip
• On CPU chip
+ Reduced latency for main memory access
+ Higher bandwidth between cores and controller
• More information can be communicated (e.g., a request’s importance in the processing core)
A Modern DRAM Controller
DRAM Scheduling Policies (I)
• FCFS (first come first served)
• Oldest request first
DRAM Scheduling Policies (II)
• A scheduling policy is essentially a prioritization order
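A sketch of “scheduling as prioritization” in C: FCFS compares only age, while a row-hit-first variant (in the style of FR-FCFS, pairing naturally with the open-row policy on the next slide) checks row-buffer hits before age.

```c
/* A minimal sketch of scheduling as a prioritization order: given two
 * pending requests, return the index (0 or 1) to issue first. FCFS
 * compares only arrival time; the row-hit-first variant prefers
 * row-buffer hits, then age. */
#include <stdbool.h>
#include <stdio.h>

struct req { unsigned long arrival; bool row_hit; };

/* FCFS: oldest request first. */
static int fcfs_pick(struct req a, struct req b) {
    return a.arrival <= b.arrival ? 0 : 1;
}

/* Row-hit-first, then oldest (FR-FCFS-style). */
static int frfcfs_pick(struct req a, struct req b) {
    if (a.row_hit != b.row_hit) return a.row_hit ? 0 : 1;
    return fcfs_pick(a, b);
}

int main(void) {
    struct req a = {100, false}, b = {120, true};
    printf("FCFS picks %d, FR-FCFS picks %d\n",
           fcfs_pick(a, b), frfcfs_pick(a, b));  /* 0, then 1 */
    return 0;
}
```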
Row Buffer Management Policies
• Open row
• Keep the row open after an access
+ Next access might need the same row → row hit
-- Next access might need a different row → row conflict, wasted energy
• Closed row
• Close the row after an access (if no other requests already in the request buffer need the same row)
+ Next access might need a different row → avoid a row conflict
-- Next access might need the same row → extra activate latency
• Adaptive policies
• Predict whether or not the next access to the bank will be to the same row
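The three policies reduce to one decision, sketched below: after an access, should the bank precharge? The predictor in the adaptive case is a stand-in, not a real design.

```c
/* A hedged sketch of the three row-buffer management policies as a
 * single decision: after servicing an access, should the bank close
 * (precharge) the row? The adaptive predictor is a placeholder;
 * real controllers use history-based predictors. */
#include <stdbool.h>
#include <stdio.h>

enum policy { OPEN_ROW, CLOSED_ROW, ADAPTIVE };

/* Placeholder predictor: guess whether the next access hits this row. */
static bool predict_next_is_same_row(void) { return true; }

static bool should_precharge(enum policy p, bool pending_same_row_reqs) {
    switch (p) {
    case OPEN_ROW:   return false;                  /* keep row open */
    case CLOSED_ROW: return !pending_same_row_reqs; /* close unless queued hits */
    case ADAPTIVE:   return !predict_next_is_same_row();
    }
    return true;
}

int main(void) {
    printf("closed-row, no queued hits -> precharge? %d\n",
           should_precharge(CLOSED_ROW, false));    /* 1 */
    return 0;
}
```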
Open vs. Closed Row Policies
Why are DRAM Controllers Difficult to Design?
• Need to obey DRAM timing constraints for correctness
• There are many (50+) timing constraints in DRAM
• tWTR: minimum number of cycles to wait before issuing a read command after a write command is issued (rank-level constraint: the bus must change direction)
• tRC: minimum number of cycles between issuing two consecutive activate commands to the same bank
• …
• Need to keep track of many resources to prevent conflicts
• Channels, banks, ranks, data bus, address bus, row buffers
• Need to handle DRAM refresh
• Need to optimize for performance & QoS (in the presence of constraints)
• Reordering is not simple
• Fairness and QoS needs complicate the scheduling problem
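A minimal sketch of the bookkeeping these constraints force on the controller, using the two constraints named above; the cycle counts are placeholders, not datasheet values.

```c
/* A minimal sketch of timing-constraint bookkeeping for the two
 * constraints named above; the cycle values are placeholders, not
 * from a datasheet. The controller may only issue a command once
 * every constraint involving it is satisfied. */
#include <stdbool.h>
#include <stdio.h>

#define T_WTR 6   /* write-to-read gap, rank level (illustrative) */
#define T_RC  20  /* activate-to-activate, same bank (illustrative) */

struct bank_state { long last_activate; };
struct rank_state { long last_write; };

static bool can_issue_read(long now, const struct rank_state *r) {
    return now - r->last_write >= T_WTR;
}

static bool can_issue_activate(long now, const struct bank_state *b) {
    return now - b->last_activate >= T_RC;
}

int main(void) {
    struct rank_state r = { .last_write = 100 };
    struct bank_state b = { .last_activate = 95 };
    long now = 104;
    printf("read ok: %d, activate ok: %d\n",
           can_issue_read(now, &r), can_issue_activate(now, &b)); /* 0, 0 */
    return 0;
}
```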
Many DRAM Timing Constraints
2D Packaging
[M. Maxfield, “2D vs. 2.5D vs. 3D ICs 101,” EE Times, April 2012]
Conventional packaging approaches
2D Packaging
[M. Maxfield, “2D vs. 2.5D vs. 3D ICs 101,” EE Times, April 2012]
Move toward System in Package (SiP)
• PCB, ceramic, semiconductor substrates
2.5D Packaging
[M. Maxfield, “2D vs. 2.5D vs. 3D ICs 101,” EE Times, April 2012]
2.5D uses a silicon interposer and through-silicon vias (TSVs)
3D Packaging
[Figure panels: 3D homogeneous, 3D heterogeneous]
[M. Maxfield, “2D vs. 2.5D vs. 3D ICs 101,” EE Times, April 2012]
3D uses through-silicon vias (TSVs) and/or an interposer
Packaging Discussion
• Heterogeneous integration
• RF, analog (PHY), FG/PCM/ReRAM, photonics
• Cost
• Silicon yield
• Bandwidth, esp. interposer
• Thermals
• It’s real!
• DRAM: HMC, HBM
• FPGAs
• GPUs: AMD, NVIDIA
• CPUs: AMD Zen, EPYC
Brief History of DRAM
• DRAM (memory): a major force behind computer industry
• Modern DRAM came with introduction of IC (1970)
• Preceded by magnetic “core” memory (1950s)
• Each cell was a small magnetic “donut”
• And by mercury delay lines before that (e.g., EDSAC)
• Re-circulating vibrations in mercury tubes
“the one single development that put computers on their feet was the invention of a reliable form of memory, namely the core memory… Its cost was reasonable, it was reliable, and because it was reliable it could in due course be made large”
Maurice Wilkes
Memoirs of a Computer Pioneer, 1985
DRAM Evolution
• Survey by Cuppu et al.
1. Early Asynchronous Interface
2. Fast Page Mode/Nibble Mode/Static Column (skip)
3. Extended Data Out
4. Synchronous DRAM & Double Data Rate
5. Rambus & Direct Rambus
6. FB-DIMM
Old 64 Mbit DRAM Example from Micron
[Figure: Micron datasheet block diagram; note the clock recovery circuitry]
Extended Data Out (EDO)
[Figure: RAS’/CAS’ timing diagrams contrasting page mode with EDO; with EDO, data stays valid on the output while the next column address is strobed]
Wide v. Narrow Interfaces
• High frequency ➔ short wavelength ➔ data skew issues
• Balance wire lengths
Rambus RDRAM
• High-frequency, narrow channel
• Time multiplexed “bus” ➔ dynamic point-to-point channels
• ~40 pins ➔ 1.6GB/s
• Proprietary solution
• Never gained industry-wide acceptance (cost and power)
• Used in some game consoles (e.g., PS2)
FB-DIMM
DRAM Reliability
• One last thing about DRAM technology… errors
• DRAM bits can flip from 0➔1 or 1➔0
  • Small charge stored per bit
  • Energetic α-particle strikes disrupt stored charge
  • Many more bits
• Modern DRAM systems: built-in error detection/correction
  • Today: all servers and most new desktops and laptops
• Key idea: checksum-style redundancy
• Main DRAM chips store data, additional chips store f(data)
• |f(data)| < |data|
• On read: re-compute f(data), compare with stored f(data)
• Different? Error…
  • Option I (detect): kill program
  • Option II (correct): enough information to fix the error? Fix and go on
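A toy version of the checksum idea, using even parity as f(data); one parity bit detects any single-bit flip but cannot locate it, so this models Option I (detection only), while correction would need a stronger code such as SECDED.

```c
/* A minimal sketch of the checksum idea above using even parity as
 * f(data): one extra bit per 64-bit word detects any single-bit flip
 * but cannot locate it, so this models Option I (detect and kill);
 * correcting (Option II) needs a stronger code such as SECDED. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static unsigned parity(uint64_t x) {        /* f(data): 1 bit */
    unsigned p = 0;
    while (x) { p ^= (unsigned)(x & 1); x >>= 1; }
    return p;
}

int main(void) {
    uint64_t data = 0xDEADBEEF12345678ull;
    unsigned stored_f = parity(data);       /* written alongside data */

    data ^= 1ull << 17;                     /* a bit flips in DRAM */

    bool error = parity(data) != stored_f;  /* re-compute and compare */
    printf("%s\n", error ? "error detected" : "ok");
    return 0;
}
```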
DRAM Error Detection and Correction
[Figure: five 4M x 2B chips share the address and data buses; chips 0–3 store data, the fifth stores f(data), and a mismatch raises the error signal]
Software Managed Memory
• Isn’t full associativity difficult to implement?
• Yes … in hardware
• Implement fully associative memory in software