Module 2: Storing Data: Disks and Files
Module Outline

2.1 Memory hierarchy
2.2 Disk space management
2.3 Buffer manager
2.4 File and record organization
2.5 Page formats
2.6 Record formats
2.7 Addressing schemes

2.1 Memory hierarchy
[Figure: the memory hierarchy — CPU; CPU cache (L1, L2); main memory (RAM); magnetic disk; tape, CD-ROM, DVD. Storage classes: primary (cache, main memory), secondary (magnetic disk), tertiary (tape, optical media).]
- The cost of primary memory is roughly 100 × the cost of secondary storage space of the same size.
- A DBMS needs to make data persistent across DBMS (or host) shutdowns or crashes; only secondary/tertiary storage is non-volatile.
- The size of the address space in primary memory (e.g., 2^32 bytes = 4 GB) may not be sufficient to map the whole database (we might even have 2^32 records).
2.1.1 Magnetic disks

- Tapes store vast amounts of data (≈ 100 GB; more for robot tape farms), but they are sequential devices.
- Magnetic disks (hard disks) allow direct access to any desired location; hard disks dominate database system scenarios by far.

[Figure: hard disk anatomy — platters, tracks, sectors, blocks, cylinder, disk arm, disk heads, rotation, arm movement.]

1. Data on a hard disk is arranged in concentric rings (tracks) on one or more platters; tracks can be recorded on one or both surfaces of a platter.
2. Each track is divided into arc-shaped sectors (a characteristic of the disk's hardware).
3. The set of tracks with the same diameter forms a cylinder.
4. An array (the disk arm) of disk heads, one per recorded surface, is moved as a unit; a stepper motor moves the disk heads from track to track while the platters steadily rotate.
5. Data is written to and read from disk block by block (the block size is set to a multiple of the sector size when the disk is formatted); typical disk block sizes are 4 KB or 8 KB.

Data blocks can only be written and read if disk heads and platters are positioned accordingly.
Accessing a disk block thus incurs three delays:

1. the disk heads have to be moved to the target track (seek time),
2. the platter has to rotate until the target block passes under the disk head (rotational delay),
3. the disk block data has to be actually written/read (transfer time).

access time = seek time + rotational delay + transfer time
- The unit of data transfer between disk and main memory is a block; if a single item (e.g., a record or an attribute) is needed, the whole containing block must be transferred. The time for I/O operations therefore dominates the time taken for database operations.
- DBMSs take the geometry and mechanics of hard disks into account.

Example: for a disk with 9.1 ms average seek time, 4.17 ms average rotational delay, and a 13 MB/s transfer rate, accessing an 8 KB block takes

9.1 ms + 4.17 ms + 1 s / (13 MB / 8 KB) ≈ 13.87 ms.
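To make the arithmetic reproducible, here is a minimal calculator; the figures (9.1 ms seek, 4.17 ms rotational delay, 13 MB/s rate, 8 KB block) are the example values from above, everything else is an illustrative sketch.

```python
# Disk access time = seek time + rotational delay + transfer time.
def access_time_ms(seek_ms, rot_ms, block_kb, rate_mb_per_s):
    transfer_ms = block_kb / (rate_mb_per_s * 1024) * 1000  # KB / (KB/s) -> s -> ms
    return seek_ms + rot_ms + transfer_ms

print(access_time_ms(9.1, 4.17, 8, 13))  # ~13.87 ms
```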
Current disk designs can transfer a whole track in one platter revolution; the active disk head can be switched after each revolution.
2.1.2 Accelerating Disk I/O

Goals:

- intra-I/O parallelism (a single request is served by several disks in parallel),
- inter-I/O parallelism (several independent requests are served in parallel).
Principle of Operation

- Use data redundancy to be able to reconstruct lost data, e.g., compute parity information during normal operation.
- When one of the disks fails, use the parity to reconstruct the lost data during failure recovery, i.e., we only suffer from data loss if a second disk fails before the first failed disk has been replaced.

A single disk is available for the fraction MTTF / (MTTF + MTTR) of the time (MTTF: mean time to failure, MTTR: mean time to repair). For an array of N disks without redundancy, the mean time to data loss thus is

MTTDL = MTTF / N.

With parity-based redundancy, data is lost only if a second disk fails while the first one is being repaired, and we get

MTTDL = (MTTF / N) · (MTTF / ((N − 1) · MTTR)).
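A quick sanity check of the two formulas; MTTF = 100,000 h, MTTR = 24 h, and N = 10 are assumed values for illustration, not figures from the lecture:

```python
# Mean time to data loss, without and with parity-based redundancy.
MTTF, MTTR, N = 100_000.0, 24.0, 10            # hours (illustrative values)

mttdl_plain = MTTF / N                          # ~10,000 h (~1.1 years)
mttdl_parity = (MTTF / N) * (MTTF / ((N - 1) * MTTR))
print(mttdl_plain, mttdl_parity)                # parity: ~4.6e6 h (~530 years)
```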
Disk arrays typically reserve one extra disk as a hot spare to replace the failed one immediately.
- Write access: to write block number k back to disk j, we have to update the parity information, too (let p be the number of the parity disk for block k):

  write(B′_kj, disk_j);
  write(B′_kp, disk_p);

  where the new parity block B′_kp is derived from the old blocks B_kj, B_kp and the new block B′_kj.

- Reconstruction of block k on a failed disk j (let r be the number of the replacement disk):

  ∀ i ≠ j: read(B_ki, disk_i);
  write(B_kj, disk_r);

  where B_kj is recomputed from the blocks B_ki read from the intact disks.
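The parity computation itself is just XOR. A minimal sketch follows; the three-disk setup and the block contents are assumptions for illustration:

```python
# XOR-based parity: update on write, recompute on failure recovery.
def xor_blocks(*blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]         # blocks B_k1, B_k2, B_k3
parity = xor_blocks(*data)                  # B_kp, computed during normal operation

new = b"XXXX"                               # write access to block k on disk 2:
parity = xor_blocks(parity, data[1], new)   # B'_kp = B_kp xor B_k2 xor B'_k2
data[1] = new

# Disk 2 fails: reconstruct its block from the intact disks and the parity.
assert xor_blocks(data[0], data[2], parity) == new
```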
Recovery Strategies

- off-line: if a disk has failed, suspend normal I/O; reconstruct all blocks of the failed disk and write them to the replacement disk; only afterwards resume normal I/O traffic.
- on-line: resume normal I/O immediately and start reconstructing all blocks not yet reconstructed since the crash (in the background). N.B. we can even wait with all reconstruction until the first normal read access!

2.1.3 RAID-Levels

There are a number of variants (→ RAID levels) differing w.r.t. the following characteristics:

- striping unit (data interleaving),
- kind and placement of the redundant information.

Initially, 5 RAID levels were introduced; later, more levels have been defined.
Characteristics of the individual levels:

- Mirroring (RAID 1) doubles the write accesses.
- Memory-style ECC (RAID 2): failure recovery determines the lost disk by using the n − 1 extra disks and corrects (reconstructs) its contents from 1 of those.
- Bit-interleaved parity (RAID 3): one parity disk suffices, since the controller can easily identify the faulty disk! Read and write access goes to all disks; therefore, there is no inter-I/O parallelism, but high bandwidth.
- Block-interleaved, striped parity (RAID 5): like RAID 4, but distributes the parity blocks across all disks (→ load balancing); best performance for small and large read as well as large write I/Os.
- More recent levels combine aspects of the ones listed here, or add multiple parity blocks, e.g., RAID 6: two parity blocks per group.
Parity groups

Parity is not necessarily computed across all disks within an array; it is possible to define parity groups (of the same or of different sizes).

[Figure: the classical RAID levels — non-redundant (RAID-0); mirroring (RAID-1); memory-style ECC (RAID-2); bit-interleaved parity (RAID-3, data interleaving on the byte level); block-interleaved parity (RAID-4, data interleaving on the block level); block-interleaved, striped parity (RAID-5); P+Q parity, striped (RAID-6). Shading marks redundant information; parity blocks 1–5 protect parity groups 1–5 spread across disks 1–5.]
- RAID levels 2 and 4 are always inferior to levels 3 and 5, respectively. Level 3 is appropriate for workloads with large requests for contiguous blocks; it is bad for many small requests of a single block.
- RAID level 5 is a good general-purpose solution: best performance (with redundancy) for small and large read as well as large write requests.
2.2 Disk space management

- Sequences of data pages are mapped onto contiguous sequences of blocks by the disk space manager (DSM), which maintains the block-# ↔ page-# mapping.
2.2.1 Keeping track of free blocks

- During database (or table) creation it is likely that blocks can indeed be arranged contiguously on disk.
- Subsequent deallocations and new allocations, however, will in general create holes.
- To reclaim space that has been freed, the disk space manager either uses a free block list or a free block bitmap:
  1. reserve a block whose bytes are interpreted bit-wise (bit n = 0: block n is free),
  2. toggle bit n whenever block n is (de-)allocated.
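A minimal sketch of such a free block bitmap (the class name and interface are assumptions for illustration):

```python
# Free-block bitmap: bit n = 0 means block n is free.
class FreeBlockBitmap:
    def __init__(self, n_blocks):
        self.bits = bytearray((n_blocks + 7) // 8)   # all bits 0: all blocks free

    def is_free(self, n):
        return (self.bits[n // 8] >> (n % 8)) & 1 == 0

    def toggle(self, n):
        self.bits[n // 8] ^= 1 << (n % 8)            # flip bit n on (de-)allocation

bm = FreeBlockBitmap(1024)
bm.toggle(42)                                         # allocate block 42
assert not bm.is_free(42)
```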
2.3 Buffer manager

The buffer manager has two tasks:

1. bring in pages as they are needed for in-memory processing,
2. overwrite (replace) such pages when they become obsolete for query processing and new pages require in-memory space.

- The buffer manager manages a collection of pages in a designated main memory area, the buffer pool.
- Once all slots (frames) in this pool have been occupied, the buffer manager uses a replacement policy to decide which frame to overwrite when a new page needs to be brought in.
N.B. Simply overwriting a page in the buffer pool is not sufficient if this page has been modified after it has been brought in (i.e., the page is so-called dirty).

The interface of the buffer manager:

- pinPage(p): request page p; it is brought into the buffer pool if necessary, and its main memory address is returned.
- unpinPage(p, d): indicate that page p is no longer needed, as well as whether p has been modified by a transaction (d).

[Figure: the buffer pool in main memory — frames holding disk pages plus a free frame; pinPage/unpinPage calls arrive from above, pages are exchanged with the database on disk below.]
N.B.

- The pinCount of a page indicates how many users (e.g., transactions) are working with that page.
- A call to unpinPage does not trigger any I/O operation, even if the pinCount for this page goes down to 0 (the page might become a suitable victim, though).
- A database transaction is required to properly bracket any page operation using pinPage and unpinPage, i.e.:
  a ← pinPage(p);
  ...
  read data (records) on page at address a;
  ...
  unpinPage(p, false);

or

  a ← pinPage(p);
  ...
  read and modify data (records) on page at address a;
  ...
  unpinPage(p, true);

Additional complexity is introduced when we take into account that the DBMS may manage segments of different page sizes:

- one buffer pool: good space utilization, but fragmentation problem;
- many buffer pools: no fragmentation, but worse utilization; a global replacement/assignment strategy may get complicated.
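Putting the pinPage/unpinPage protocol together with the buffer pool structure from above, a minimal sketch could look as follows; the class names and the duck-typed disk interface are assumptions, not the lecture's code, and the victim choice is deliberately left abstract:

```python
# Buffer pool with pinCount and dirty flag; victim selection is stubbed out.
class Frame:
    def __init__(self, page_no, data):
        self.page_no, self.data = page_no, data
        self.pin_count, self.dirty = 0, False

class BufferPool:
    def __init__(self, n_frames, disk):
        self.n_frames, self.disk = n_frames, disk
        self.frames = {}                              # page_no -> Frame

    def pin_page(self, page_no):
        if page_no not in self.frames:
            if len(self.frames) == self.n_frames:
                self._evict()                         # replacement policy decides
            self.frames[page_no] = Frame(page_no, self.disk.read(page_no))
        frame = self.frames[page_no]
        frame.pin_count += 1
        return frame.data                             # the "address" of the page

    def unpin_page(self, page_no, dirty):
        frame = self.frames[page_no]
        frame.pin_count -= 1                          # no I/O, even at count 0
        frame.dirty = frame.dirty or dirty

    def _evict(self):
        # pick any unpinned frame (LRU, Clock, ... would decide here)
        victim = next(f for f in self.frames.values() if f.pin_count == 0)
        if victim.dirty:                              # flush a dirty victim first
            self.disk.write(victim.page_no, victim.data)
        del self.frames[victim.page_no]
```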
2.3.1 Buffer allocation

Problem: shall we allocate parts of the buffer pool to each transaction (TX), or let the replacement strategy alone decide who gets how much buffer space? Two questions have to be answered:

1. How much precious buffer space do we allocate to each of the active transactions (static vs. dynamic assignment)?
2. Which page do we replace when a new request arrives and the buffer is full (page replacement policy, see 2.3.2)?

It is also possible to apply mixed strategies, e.g., have different pools working with different approaches. This complicates matters significantly, though.
Two candidate strategies:

1. Local LRU (cf. LRU replacement, later).
2. Working Set Model (cf. operating systems' virtual memory management): keep, for each transaction Ti, its working set WS(Ti, τ) in the buffer, and allocate buffer budgets according to the ratio between those optimal sizes.

Possible implementation of the working set model: keep two counters, per TX and per page, respectively:

- Whenever Ti references page Pj: increment trc(Ti); copy trc(Ti) to lrc(Ti, Pj).
- If a page has to be replaced for Ti, select among those pages with trc(Ti) − lrc(Ti, Pj) > τ.
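A minimal sketch of this bookkeeping; the dictionaries and the window size τ = 100 are assumptions for illustration:

```python
# Working-set bookkeeping: trc counts a transaction's page references,
# lrc records the counter value at each page's last reference.
from collections import defaultdict

TAU = 100                                  # window size tau (assumed)
trc = defaultdict(int)                     # per transaction
lrc = defaultdict(dict)                    # per transaction, per page

def reference(tx, page):
    trc[tx] += 1
    lrc[tx][page] = trc[tx]

def replacement_candidates(tx):
    # pages that have fallen out of the working set WS(tx, TAU)
    return [p for p, c in lrc[tx].items() if trc[tx] - c > TAU]
```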
2.3.2 Buffer replacement policies

- The choice of the victim frame selection (or buffer replacement) policy can considerably affect DBMS performance.
- A large number of policies has been proposed for operating systems and database management systems.
[Figure: a taxonomy of replacement policies with example victim selections — Random, FIFO, LRU, CLOCK ("second chance", one "used" bit per frame), GCLOCK (V1/V2, counters possibly initialized with weights), DGCLOCK, LFU (reference counts), and LRD (V1/V2, reference counts and page age).]
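As an illustration of one of these policies, here is a minimal sketch of Clock ("second chance"); the frame bookkeeping is simplified and the interface is an assumption:

```python
# Clock: a hand sweeps the frames; a set "used" bit buys a second chance.
class Clock:
    def __init__(self, n_frames):
        self.pages = [None] * n_frames
        self.used = [False] * n_frames
        self.hand = 0

    def access(self, page):
        if page in self.pages:                       # hit: set the "used" bit
            self.used[self.pages.index(page)] = True
            return
        while True:                                  # miss: find a victim
            if self.used[self.hand]:
                self.used[self.hand] = False         # second chance given
                self.hand = (self.hand + 1) % len(self.pages)
            else:                                    # bit clear: evict here
                self.pages[self.hand] = page
                self.used[self.hand] = True
                self.hand = (self.hand + 1) % len(self.pages)
                return
```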
Two prominent heuristics are

1. LRU (least recently used),
2. Clock ("second chance").

N.B. LRU as well as Clock are heuristics only. Any heuristic can fail miserably in certain scenarios. Assume a buffer pool of 10 frames and a query that repeatedly scans relation R sequentially:

1. Let the size of R be 10 pages. How many page I/O operations do you expect?
2. Let the size of relation R be 11 pages. What about the number of I/O operations now?

- Query compiler/optimizer . . .
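The effect behind question 2 is easy to reproduce; the following simulation (a sketch — the 10-frame LRU buffer and the scan count are assumptions) shows "sequential flooding", where LRU evicts exactly the page that will be needed next:

```python
# Repeated sequential scans of an n-page relation through an LRU buffer.
from collections import OrderedDict

def misses(n_pages, n_frames, n_scans):
    buf, miss = OrderedDict(), 0
    for _ in range(n_scans):
        for p in range(n_pages):
            if p in buf:
                buf.move_to_end(p)            # LRU hit
            else:
                miss += 1
                if len(buf) == n_frames:
                    buf.popitem(last=False)   # evict least recently used
                buf[p] = True
    return miss

print(misses(10, 10, 5))  # 10 misses: only the first scan faults
print(misses(11, 10, 5))  # 55 misses: every single access faults
```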
LRD (least reference density)

Strategy for victim selection: choose the page with the least reference density

  rd(p, t) = rc(p) / (trc(t) − age(p)),

where rc(p) is the reference count of page p, age(p) the time of p's first reference, and trc(t) the global reference counter at time t. There are many variants, e.g., for gradually disregarding old references.
Example: the Hot Set model.

Buffer allocation:

1. Queries allocate their budget stepwise, up to the size of their Hot Set.
2. Queries with higher demands have to wait until their budget becomes available; only those queries are activated whose Hot Set buffer budget can be satisfied immediately.
3. Within its own buffer budget, each transaction applies a local LRU policy; local LRU stacks are used for replacement.

Replacement:

1. Request for a page p: (i) if p is found in the requesting transaction's own LRU-stack, access it there; (ii) if p is found in another transaction's LRU-stack, access the page, but don't update the other LRU-stack.
2. unpinPage: push the page onto the freelist-stack.
3. Empty buffer frames are filled from the bottom of the freelist-stack.

N.B.

- As long as a page is in a local LRU-stack, it cannot be replaced.
- If a page drops out of a local LRU-stack, it is pushed onto the freelist-stack.
- A page is replaced only if it reaches the bottom of the freelist-stack before some transaction pins it again.
Priority Hints

- Idea: with unpinPage, a transaction gives one of two possible indications to the buffer manager: the page is ordinary, or the page is preferred.
- To find a victim,
  1. try to replace an ordinary page from the global partition using LRU;
  2. otherwise, replace a preferred page of the requesting TX according to MRU.
- Advantages: similar performance . . .

Prefetching

When the buffer manager receives requests for (single) pages, it may decide to (asynchronously) read ahead.
2.3.3 Double Buffering

Buffer management for a DBMS curiously tastes like the virtual memory concept of modern operating systems: both techniques provide access to more data than will fit into primary memory. So: why don't we use the OS virtual memory facilities to implement a DBMS?

- A DBMS can predict certain reference patterns for pages in a buffer a lot better than a general-purpose OS.
- This is mainly because page references in a DBMS are initiated by higher-level operations (sequential scans, relational operators) of the DBMS itself; nested-loop joins, for example, call for page fixing and hating.
- Concurrency control protocols often prescribe the order in which pages are written back to disk. Operating systems usually do not provide hooks for that.

If the DBMS uses its own buffer manager (within the virtual memory of the DBMS server process), independently of the OS VM manager, we may experience the following:

- Virtual page fault: the page resides in the DBMS buffer, but its frame has been swapped out of physical memory by the OS VM manager. An I/O operation is necessary that is not visible to the DBMS.
- Buffer fault: the page does not reside in the DBMS buffer, but its frame is in physical memory. Regular DBMS page replacement, requiring an I/O operation.
- Double page fault: the page does not reside in the DBMS buffer, and its frame has been swapped out of physical memory by the OS VM manager; two I/O operations are needed. (N.B. the OS VM does not know the DBMS's dirty flags and hence brings in frame contents that could simply be overwritten.)
2.4 File and record organization

2.4.1 Heap files

- The simplest file structure is the heap file, which represents an unordered collection of records. (More precisely, "table" actually means bag here: a set of elements with multiplicity ≥ 0.)
- As in any file structure, each record has a unique record identifier (rid).
- N.B. record ids (rids) are used like record addresses (or pointers). Internally, the heap file structure must be able to map a given rid to the page containing the record.
- To support openScan(f), the heap file structure has to keep track of all pages in file f; to support insertRecord(f, r) efficiently, we need to keep track of all pages with free space in file f.
- Let us have a look at two simple structures which can offer this support.
2.4.2 Linked list of pages

1. The DBMS allocates a free page (the file header) and writes an appropriate entry for it into the system catalog.
2. The header page points to two linked lists of data pages: a list of full pages and a list of pages with free space.
3. Initially, both lists are empty.

[Figure: header page pointing to a linked list of full data pages and a linked list of data pages with free space.]

Remarks:

- For insertRecord(f, r),
  1. try to find a page p in the free list with free space > |r|; should this fail, ask the disk space manager to allocate a new page p;
  2. record r is written to page p;
  3. since generally |r| ≪ |p|, p will belong to the list of pages with free space;
  4. a unique rid for r is computed and returned to the caller.
- For openScan(f), both page lists have to be traversed.
2.4.3 Directory of pages

An alternative is a directory of pages: a chain of directory pages whose entries point to the data pages of the file; free space information can be kept with each entry.

[Figure: header page followed by directory pages; each directory entry points to a data page.]

Exercise: for a file of 10000 pages, give lower and upper bounds for the number of page I/O operations during an insertRecord(f, r) call for a heap file organized using

1. a linked list of pages,
2. a directory of pages (1000 directory entries/page).

Linked list:
- lower bound: header page + first page in free list + write r = 3 page I/Os,
- upper bound: header page + 10000 pages in free list + write r = 10002 page I/Os.
2.5 Page formats

- Locating the containing data page for a given rid is not the whole story when the DBMS needs to access a specific record: the internal structure of pages plays a crucial role.
- For the following discussion, we consider a page to consist of a sequence of slots, each of which contains a record.
- A complete record identifier then has the unique form ⟨page addr, nslot⟩, where nslot denotes the slot number on the page.
2.5.1 Fixed-length records

Life is particularly easy if all records on the page (in the file) are of the same size s:

- getRecord(f, ⟨p, n⟩): given the rid ⟨p, n⟩, we know that the record is to be found at (byte) offset n · s on page p.
- deleteRecord(f, ⟨p, n⟩): copy the bytes of the last occupied slot on page p to offset n · s and mark the last slot as free; all occupied slots thus appear together at the start of the page (the page is packed).
- insertRecord(f, r): find a page p with free space ≥ s (see the previous section); copy r to the first free slot on p, mark the slot as occupied.
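A minimal sketch of such a packed page (class and method names are assumptions for illustration):

```python
# Packed page of fixed-length records: slot n lives at byte offset n * s.
class PackedPage:
    def __init__(self, page_size, s):
        self.s, self.data = s, bytearray(page_size)
        self.n_records = 0

    def get_record(self, n):
        return bytes(self.data[n * self.s:(n + 1) * self.s])

    def insert_record(self, r):
        n = self.n_records                     # first free slot
        self.data[n * self.s:(n + 1) * self.s] = r
        self.n_records += 1
        return n                               # slot number for the rid <p, n>

    def delete_record(self, n):
        last = self.n_records - 1              # keep the page packed:
        self.data[n * self.s:(n + 1) * self.s] = self.get_record(last)
        self.n_records -= 1                    # N.B. the moved record's rid changes!
```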
- To avoid record copying (and thus rid modifications), we could simply use a free slot bitmap on each page. Calling deleteRecord(f, ⟨p, n⟩) then merely means to set bit n in the bitmap to 0; no other rids are affected.

[Figure: packed page (N records, page header stores the count N) vs. unpacked page with bitmap (M slots, page header stores the bitmap and M).]
2.5.2 Variable-length records

If the records on a page are of varying size (cf. the SQL datatype VARCHAR(n)), we have to deal with page fragmentation:

- In insertRecord(f, r) we have to find an empty slot of size ≥ |r|; at the same time, we want to minimize waste.
- To get rid of the holes produced by deleteRecord(f, rid), compact the remaining records to maintain a contiguous area of free space on the page.

A solution is to maintain a slot directory on each page (compare this with a heap file directory!):

[Figure: page p with records addressed via a slot directory; each directory entry holds ⟨offset, length⟩ (e.g., offsets 16 and 24, length 24 bytes), next to a pointer to the start of the free space and the number N of directory entries; the records carry rids ⟨p, 0⟩, ⟨p, 1⟩, ..., ⟨p, N−1⟩.]
Remarks:

- The slot directory contains entries ⟨offset, length⟩, where offset is measured in bytes from the data page start.
- In deleteRecord(f, ⟨p, n⟩), simply set the offset of directory entry n to −1; such an entry can be reused during subsequent insertRecord(f, r) calls which hit page p.
- Directory compaction is not allowed in this scheme (again, this would modify the rids of all records ⟨p, n′⟩ with n′ > n)! If insertions are much more common than deletions, the directory size will nevertheless stay close to the actual number of records stored on the page.
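A minimal sketch of such a slotted page (the layout details and names are assumptions; compaction of the data area is deliberately omitted):

```python
# Slot directory for variable-length records: entries are (offset, length);
# deletion just sets the offset to -1 so all other rids stay valid.
class SlottedPage:
    def __init__(self, size):
        self.data = bytearray(size)
        self.free = 0                      # pointer to start of free space
        self.slots = []                    # directory of (offset, length)

    def insert_record(self, r):
        off = self.free
        self.data[off:off + len(r)] = r
        self.free += len(r)
        for n, (o, _) in enumerate(self.slots):
            if o == -1:                    # reuse a deleted directory entry
                self.slots[n] = (off, len(r))
                return n
        self.slots.append((off, len(r)))
        return len(self.slots) - 1         # slot number for the rid <p, n>

    def delete_record(self, n):
        self.slots[n] = (-1, 0)            # no other rids are affected

    def get_record(self, n):
        off, length = self.slots[n]
        assert off != -1, "record was deleted"
        return bytes(self.data[off:off + length])
```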
2.6 Record formats

- This section zooms into the record internals themselves, discussing access to single record fields (conceptually: attributes). Attribute values are considered atomic by an RDBMS.
- Depending on the field types, we are dealing with fixed-length or variable-length fields in a record, e.g.:

  SQL datatype    length (# of bytes)          fixed-length?
  INTEGER         4                            yes
  BIGINT          8                            yes
  CHAR(n)         n, 1 ≤ n ≤ 254               yes
  VARCHAR(n)      1 . . . n, 1 ≤ n ≤ 32672     no
  CLOB(n)         1 . . . n, 1 ≤ n ≤ 2^31      no
  DATE            4                            yes
  ...             ...                          ...
2.6.1 Fixed-length fields

If all fields in a record are of fixed length, the offsets for field access can simply be read off the DBMS system catalog (field fi of size li):

[Figure: record with consecutive fields f1 f2 f3 f4 of sizes l1 l2 l3 l4 and base address b; field f3 is found at address b + l1 + l2.]

- The DBMS computes and then saves the field size information for the records of a file in the system catalog when it processes the corresponding CREATE TABLE . . . command.
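A minimal sketch of this offset computation; the field sizes stand in for catalog entries and are assumed example values:

```python
# Fixed-length fields: field f_i starts at b + l1 + ... + l_{i-1}.
FIELD_SIZES = [4, 8, 4, 20]               # l1..l4 from the system catalog

def field_address(base, i):
    # address of field f_i (1-based)
    return base + sum(FIELD_SIZES[:i - 1])

assert field_address(0, 3) == 4 + 8       # f3 lives at b + l1 + l2
```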
2.6.2 Variable-length fields

Two encodings are common:

1. separate the fields by special delimiter symbols ($):

   f1 $ f2 $ $ f4

2. for a record of n fields, use an array of n + 1 offsets pointing into the record (the last array entry marks the end of field fn).

Final remarks:

Variable-length record formats seem more complicated but, among other advantages, they allow for a compact representation of SQL NULL values (field f3 is NULL above: an empty span between its delimiters, or a zero-width offset range).

Growing a record

- Consider an update on a field (e.g., of type VARCHAR(n)) which lets the record grow beyond the size of the free space on its containing page. How could the DBMS handle this situation efficiently?
- For fields of type VARCHAR(n) or CLOB(n) with n > |page size|, we are in trouble whenever a record actually grows beyond the page size (the record won't fit on any one page).
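A minimal sketch of encoding 2; the byte layout, the 16-bit offsets, and the helper names are assumptions (note that this sketch does not distinguish NULL from an empty string):

```python
# Offset-array record format: n + 1 offsets precede the data;
# field i spans [off[i], off[i+1]), so a NULL field costs no data bytes.
import struct

def encode(fields):                        # fields: list of bytes or None
    body, offsets = b"", [0]
    for f in fields:
        body += f or b""
        offsets.append(len(body))
    header = struct.pack(f"<{len(offsets)}H", *offsets)
    return header + body

def decode(buf, n_fields):
    offs = struct.unpack_from(f"<{n_fields + 1}H", buf)
    base = struct.calcsize(f"<{n_fields + 1}H")
    return [buf[base + offs[i]:base + offs[i + 1]] for i in range(n_fields)]

rec = encode([b"f1", b"field2", None, b"f4"])  # f3 is NULL
print(decode(rec, 4))                          # [b'f1', b'field2', b'', b'f4']
```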
2.7 Addressing schemes

- Consider the fact that rids are used as persistent pointers in a DBMS (indexes, directories, implementation of CODASYL sets, . . .).
- Given a rid, it should ideally not take more than 1 page I/O to get to the record itself.

Conflicting goals! Efficiency calls for a direct disk address, while stability calls for some kind of indirection.

Direct addressing

- RBA (relative byte address): consider the disk file as a persistent virtual address space and use the byte offset as rid.
- PP (page pointers): use disk page numbers as rid.
  - pro: very efficient access to the page; locating the record within the page is cheap (an in-memory operation)
  - con: stable w.r.t. moving records within a page, but not when moving records across pages
Indirect addressing

Assign logical numbers (LSNs) to records; an address translation table maps them to PPs (or even RBAs).

- pro: full stability w.r.t. all relocations of records
- con: an extra page I/O on the translation table for each record access

[Figure (labels translated from German): address translation table mapping DB keys to physical addresses, pointing into the data block.]

Probable page pointers (PPPs): try to avoid the extra I/O by adding a probable PP to the LSNs. The PPP is the PP at the time of insertion into the database. If a record is moved across pages, the PPPs referring to it are not updated!

- pro: full stability w.r.t. all record relocations; the PPP can save the extra I/O for the translation table, iff it is still correct
- con: 2 additional page I/Os in case the PPP is no longer valid: one on the old page to notice that the record has moved, a second on the translation table to look up the new page number
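A minimal sketch of the PPP lookup logic; the dictionaries standing in for pages and the translation table are assumptions for illustration:

```python
# Follow the probable page pointer first; fall back to the translation
# table only if the record is no longer on that page.
translation = {7: 255}                     # LSN -> current page number
pages = {23: {}, 255: {7: "record!"}}      # page number -> {LSN: record}

def lookup(lsn, ppp):
    page = pages[ppp]                      # 1st page I/O: probable page
    if lsn in page:
        return page[lsn]                   # PPP still correct: done
    current = translation[lsn]             # extra I/O: translation table
    return pages[current][lsn]             # extra I/O: actual page

print(lookup(7, 23))                       # moved record: 3 page I/Os
```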
TID addressing

A TID (tuple identifier) combines a page number and an index into that page's slot directory, e.g., ⟨23, 0⟩. If a record has to move to another page, the original slot is overwritten with a "forward" TID to the record's new location, so the original TID stays valid.

[Figure: TID (23, 0) addressing slot 0 on page 23, which forwards to the record's new location on page 255.]

- pro: full stability w.r.t. all relocations of records; no extra I/O due to indirection (at most one extra page I/O if the record has been moved)