
Module 2: Storing Data: Disks and Files

Module Outline

2.1  Memory hierarchy
2.2  Disk space management
2.3  Buffer manager
2.4  File and record organization
2.5  Page formats
2.6  Record formats
2.7  Addressing schemes

[Figure: DBMS architecture -- web forms/applications/SQL interface, query processor (parser, optimizer, plan executor, operator evaluator), files and index structures, buffer manager, disk space manager, transaction/lock/recovery managers, system catalog, index and data files; this module covers the disk space manager and buffer manager layers]

2.1  Memory hierarchy

I Memory in off-the-shelf computer systems is arranged in a hierarchy:

      Memory level           Storage class
      CPU                    primary
      CPU cache (L1, L2)     primary
      Main memory (RAM)      primary
      Magnetic disk          secondary
      Tape, CD-ROM, DVD      tertiary

I Cost of primary memory ≈ 100 × cost of secondary storage space of the same size.

I The DBMS needs to make data persistent across DBMS (or host) shutdowns or crashes; only secondary/tertiary storage is nonvolatile.

I The size of the address space in primary memory (e.g., 2^32 bytes = 4 GB) may not be sufficient to map the whole database (we might even have ≫ 2^32 records).

⇒ The DBMS needs to bring in data from lower levels of the memory hierarchy as needed for processing.

2.1.1  Magnetic disks

I Tapes store vast amounts of data (≈ 100 GB; more for robotic tape farms), but they are sequential devices.

I Magnetic disks (hard disks) allow direct access to any desired location; hard disks dominate database system scenarios by far.

[Figure: disk assembly -- rotating platters with concentric tracks divided into sectors; blocks; cylinders; a disk arm with one disk head per recorded surface; arm movement]

1 Data on a hard disk is arranged in concentric rings (tracks) on one or more platters,

2 tracks can be recorded on one or both surfaces of a platter,

3 the set of tracks with the same diameter forms a cylinder,

4 an array (disk arm) of disk heads, one per recorded surface, is moved as a unit,

5 a stepper motor moves the disk heads from track to track; the platters steadily rotate.

1 Each track is divided into arc-shaped sectors (a characteristic of the disk's hardware),

2 data is written to and read from disk block by block (the block size is set to a multiple of the sector size when the disk is formatted),

3 typical disk block sizes are 4 KB or 8 KB.

Data blocks can only be written and read if disk heads and platters are positioned accordingly.

I This has implications on the disk access time:

1 Disk heads have to be moved to the desired track (seek time),

2 the disk controller waits for the desired block to rotate under the disk head (rotational delay),

3 the disk block data has to be actually written/read (transfer time).

      access time = seek time + rotational delay + transfer time

I The unit of data transfer between disk and main memory is a block,

I if a single item (e.g., record, attribute) is needed, the whole containing block must be transferred:

  Reading or writing a disk block is called an I/O operation.

  The time for I/O operations dominates the time taken for database operations.

I DBMSs take the geometry and mechanics of hard disks into account.

I Access time for the IBM Deskstar 14GPX:

  I 3.5 inch hard disk, 14.4 GB capacity
  I 5 platters of 3.35 GB of user data each, platters rotate at 7200/min
  I average seek time 9.1 ms (min: 2.2 ms [track-to-track], max: 15.5 ms)
  I average rotational delay 4.17 ms
  I data transfer rate 13 MB/s

      access time for an 8 KB block ≈ 9.1 ms + 4.17 ms + 8 KB / (13 MB/s) ≈ 13.87 ms

I Current disk designs can transfer a whole track in one platter revolution; the active disk head can be switched after each revolution.

  This implies a closeness measure for data records r1, r2 on disk:

  1 Place r1 and r2 inside the same block (single I/O operation!),

  2 place r2 inside a block adjacent to r1's block on the same track,

  3 place r2 in a block somewhere on r1's track,

  4 place r2 in a track of the same cylinder as r1's track,

  5 place r2 in a cylinder adjacent to r1's cylinder.

N.B. Accessing a main memory location typically takes < 60 ns.
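To make the Deskstar arithmetic concrete, here is a small Python sketch (not part of the original material; the constant names are ours) that plugs the drive parameters into the access time formula above:

    AVG_SEEK_MS = 9.1          # IBM Deskstar 14GPX average seek time
    AVG_ROT_DELAY_MS = 4.17    # average rotational delay (half a revolution at 7200/min)
    TRANSFER_RATE_MB_S = 13.0  # sustained data transfer rate

    def access_time_ms(block_kb: float) -> float:
        """access time = seek time + rotational delay + transfer time"""
        transfer_ms = block_kb / (TRANSFER_RATE_MB_S * 1024) * 1000
        return AVG_SEEK_MS + AVG_ROT_DELAY_MS + transfer_ms

    print(round(access_time_ms(8), 2))   # ~13.87 ms for one 8 KB block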

2.1.2  Accelerating Disk I/O

Goals

I reduce the number of I/Os:

  DBMS buffer, physical DB design

I reduce the duration of I/Os:

  access neighboring disk blocks (clustering) and bulk I/O:
  . advantage: optimized seek time, optimized rotational delay, minimized overhead (e.g., interrupt handling)
  . disadvantage: I/O path busy for a long time (concurrency!)
  . bulk I/Os can be implemented on top of or inside the disk controller
  . . . . used for mid-sized data access (prefetching, sector buffering)

  different I/O paths (declustering) with parallel access:
  . advantage: parallel I/Os, transfer time minimized by the multiplied bandwidth
  . disadvantage: average seek time and rotational delay increase, more hardware needed, blocking of parallel transactions
  . advanced hardware or disk arrays (RAID systems)
  . . . . used for large-size data access

Performance gains with parallel I/Os

Partition files into equally sized areas of consecutive blocks (striping).

[Figure: blocks of a file striped across several disks -- access parallelism (intra-I/O parallelism) vs. request parallelism (inter-I/O parallelism)]

The striping unit (# of logically consecutive bytes on one disk) determines the degree of parallelism for a single I/O and the degree of parallelism between different I/Os (see the sketch below):

I small chunks: high intra-access parallelism, but many devices busy ⇒ not many I/Os in parallel

I large chunks: low intra-access parallelism, but many I/Os in parallel
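A minimal sketch of one possible striping layout (this round-robin mapping and the helper names are assumptions, not taken from the original material):

    def locate(logical_block: int, n_disks: int, unit_blocks: int):
        """Map a logical block number to (disk, block on that disk) under round-robin striping."""
        stripe = logical_block // unit_blocks        # which chunk of the file
        offset = logical_block % unit_blocks         # position inside the chunk
        disk = stripe % n_disks                      # chunks are dealt out round-robin
        block_on_disk = (stripe // n_disks) * unit_blocks + offset
        return disk, block_on_disk

    # small striping unit: one request fans out over many disks (intra-I/O parallelism)
    print([locate(b, n_disks=4, unit_blocks=1) for b in range(4)])
    # large striping unit: one request stays on one disk, independent requests run in parallel
    print([locate(b, n_disks=4, unit_blocks=4) for b in range(4)])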

RAID-Systems: Improving Availability

I Goal: maximize availability

      availability = MTTF / (MTTF + MTTR)

  where MTTF = mean time to failure, MTTR = mean time to repair.

I Problem: with N disks, we have an N times higher probability for problems!

      Thus: MTTDL = MTTF / N

  where MTTDL = mean time to data loss.

I Solution: RAID = Redundant Array of Inexpensive (Independent) Disks.

      Now we get MTTDL = (MTTF / N) · (MTTF / ((N − 1) · MTTR))

  i.e., we only suffer from data loss if a second disk fails before the first failed disk has been replaced.

Principle of Operation

Use data redundancy to be able to reconstruct lost data, e.g., compute parity information during normal operation:

      B_parity = B_1 ⊕ B_2 ⊕ · · · ⊕ B_N

  . . . here ⊕ denotes logical xor (exclusive or).

When one of the disks fails, use the parity to reconstruct the lost data during failure recovery:

      B_i = B_parity ⊕ B_1 ⊕ · · · ⊕ B_{i−1} ⊕ B_{i+1} ⊕ · · · ⊕ B_N

. . . typically, one extra disk is reserved as a hot spare to replace the failed one immediately.
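A toy Python illustration of the xor parity idea (the block contents and sizes here are made up; a real array computes this per sector in the controller):

    def parity(blocks):
        """Bytewise xor of equally sized blocks."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"disk0...", b"disk1...", b"disk2..."]   # three data blocks
    p = parity(data)                                 # computed during normal operation

    # disk 1 fails: xor the parity block with the surviving data blocks
    assert parity([p, data[0], data[2]]) == data[1]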

Executing I/Os from/to a RAID System

I Read Access: to read block number k from disk j, execute a read(B_k, disk_j) operation.

I Write Access: to write block number k back to disk j, we have to update the parity information, too (let p be the number of the parity disk for block k):

      ∀ i ≠ j: read(B_k^i, disk_i);
      compute the new parity block B_k^p from the contents of all B_k^i;
      write(B_k^j, disk_j);
      write(B_k^p, disk_p);

  we can do better (i.e., be more efficient), though:

      read(B_k^p, disk_p);
      compute the new parity block B_k^p′ := B_k^p ⊕ B_k^j ⊕ B_k^j′;
      write(B_k^j′, disk_j);
      write(B_k^p′, disk_p);

I Write Access to block k on all disks i ≠ p:

      compute the new parity block B_k^p from the contents of all B_k^i, i ≠ p;
      ∀ i: write(B_k^i, disk_i);

I Reconstruction of block k on a failed disk j (let r be the number of the replacement disk):

      ∀ i ≠ j: read(B_k^i, disk_i);
      reconstruct B_k^j as the parity of all B_k^i, i ≠ j;
      write(B_k^j, disk_r);
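The "we can do better" path above is the classic small-write optimization. A hedged Python sketch (read_block/write_block are hypothetical I/O helpers; unlike the terse pseudocode above, the sketch also re-reads the old contents of the data block, which the new parity computation needs):

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def small_write(read_block, write_block, k, j, p, new_data):
        """Write block k on disk j; disk p holds the parity block for block k."""
        old_data = read_block(j, k)                  # old B_k^j
        old_parity = read_block(p, k)                # old B_k^p
        new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
        write_block(j, k, new_data)                  # B_k^j'
        write_block(p, k, new_parity)                # B_k^p'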

Recovery Strategies

I off-line: if a disk has failed, suspend normal I/O; reconstruct all blocks of the failed disk and write them to the replacement disk; only afterwards resume normal I/O traffic.

I on-line (needs a hot spare disk): resume normal I/O immediately;

  start reconstructing all blocks not yet reconstructed since the crash (in the background);

  allow parallel normal writes:
  . write the block to the replacement disk and update the parity;

  allow parallel normal read I/O:
  . if the block has not yet been repaired: reconstruct the block
  . if the block has already been reconstructed: read the block from the replacement disk or reconstruct it (load balancing decision)

N.B. We can even wait with all reconstruction until the first normal read access!

2.1.3  RAID-Levels

There are a number of variants (→ RAID levels) differing w.r.t. the following characteristics:

I striping unit (data interleaving)

  how to scatter (primary) data across disks?
  fine (bits/bytes) or coarse (blocks) grain?

I how to compute and distribute redundant information?

  what kind of redundant information (parity, ECCs)?
  where to allocate redundant information (separate/few disks, or all disks of the array)?

5 RAID levels were introduced originally; later, more levels have been defined.

I RAID Level 0: no redundancy, just striping

  least storage overhead
  no extra effort for write access
  not the best read performance!

I RAID Level 1: mirroring

  doubles the necessary storage space
  doubles write accesses
  optimized read performance due to the alternative I/O path

I RAID Level 2: memory-style ECC

  compute error-correcting codes for the data of n disks
  store them onto n − 1 additional disks
  failure recovery: determine the lost disk by using the n − 1 extra disks; correct (reconstruct) its contents from 1 of those

I RAID Level 3: bit-interleaved parity

  one parity disk suffices, since the controller can easily identify the faulty disk!
  distribute (primary) data bit-wise onto the data disks
  read and write accesses go to all disks; therefore, no inter-I/O parallelism, but high bandwidth

I RAID Level 4: block-interleaved parity

  like RAID 3, but distribute data block-wise (variable block size)
  a small read I/O goes to only one disk
  bottleneck: all write I/Os go to the one parity disk

I RAID Level 5: block-interleaved striped parity

  like RAID 4, but distribute the parity blocks across all disks → load balancing
  best performance for small and large reads as well as large writes
  variants w.r.t. the distribution of blocks

More recent levels combine aspects of the ones listed here, or add multiple parity blocks, e.g., RAID 6: two parity blocks per group.

Parity groups

Parity is not necessarily computed across all disks within an array; it is possible to define parity groups (of the same or of different sizes).

[Figure: block layouts on disks 1-5 for the different RAID levels -- non-redundant (RAID-0); mirroring (RAID-1); memory-style ECC (RAID-2), data interleaving on the byte level; bit-interleaved parity (RAID-3); block-interleaved parity (RAID-4), data interleaving on the block level; block-interleaved, striped parity (RAID-5); P+Q parity, striped (RAID-6); shaded blocks hold redundant information, parity block i covers the blocks of group i]

Selecting RAID levels

I RAID level 0: improves overall performance at lowest cost; no provision against data loss; best write performance, since there is no redundancy.

I RAID levels 0+1 (a.k.a. level 10): superior to level 1; the main application area is small storage subsystems, sometimes also write-intensive applications.

I RAID level 1: the most expensive version; typically the two necessary I/Os for a write are serialized to avoid data loss in case of power failures, etc.

I RAID levels 2 and 4: are always inferior to levels 3 and 5, respectively. Level 3 is appropriate for workloads with large requests for contiguous blocks; it is bad for many small requests of a single block.

I RAID level 5: is a good general-purpose solution. Best performance (with redundancy) for small and large read as well as large write requests.

I RAID level 6: the choice for a higher level of reliability.

RAID logic can be implemented inside the disk subsystem/controller (hardware RAID) or in the OS (software RAID).

2.2  Disk space management

[Figure: DBMS architecture -- you are here: Disk Space Manager]

I The disk space manager (DSM) encapsulates the gory details of hard disk access for the DBMS,

I the DSM talks to the disk controller and initiates I/O operations,

I once a block has been brought in from disk it is referred to as a page,^a

I sequences of data pages are mapped onto contiguous sequences of blocks by the DSM.

I The DBMS issues allocate/deallocate and read/write commands to the DSM,

I which, internally, uses a mapping block-# ↔ page-# to keep track of page locations and block usage.

^a Disk blocks and pages are of the same size.

2.2.1  Keeping track of free blocks

I During database (or table) creation it is likely that blocks can indeed be arranged contiguously on disk.

I Subsequent deallocations and new allocations, however, will in general create holes.

I To reclaim space that has been freed, the disk space manager uses either

  a free block list:

  1 keep a pointer to the first free block in a known location on disk,

  2 when a block is no longer needed, append/prepend this block to the free block list for future use,

  3 next pointers may be stored in the disk blocks themselves,

  or a free block bitmap:

  1 reserve a block whose bytes are interpreted bit-wise (bit n = 0: block n is free),

  2 toggle bit n whenever block n is (de-)allocated.

  Free block bitmaps allow for fast identification of contiguous sequences of free blocks.
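A toy Python sketch of the bitmap variant (the class and its methods are ours, not the DSM's actual interface); bit n = 0 marks block n as free, exactly as described above:

    class FreeBlockBitmap:
        def __init__(self, n_blocks: int):
            self.bits = bytearray((n_blocks + 7) // 8)    # all bits 0: every block free

        def set_allocated(self, n: int, allocated: bool):
            if allocated:
                self.bits[n // 8] |= 1 << (n % 8)
            else:
                self.bits[n // 8] &= ~(1 << (n % 8))

        def is_free(self, n: int) -> bool:
            return not (self.bits[n // 8] >> (n % 8)) & 1

        def find_contiguous_free(self, count: int) -> int:
            """Scan for a run of `count` free blocks; return its start or -1."""
            run, start = 0, 0
            for n in range(len(self.bits) * 8):
                if self.is_free(n):
                    if run == 0:
                        start = n
                    run += 1
                    if run == count:
                        return start
                else:
                    run = 0
            return -1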


2.3  Buffer manager

[Figure: DBMS architecture -- you are here: Buffer Manager]

I To scan all the pages of a 20 GB table (SELECT * FROM . . . ), the DBMS needs to

  1 bring in pages as they are needed for in-memory processing,

  2 overwrite (replace) such pages when they become obsolete for query processing and new pages require in-memory space.

I The buffer manager manages a collection of pages in a designated main memory area, the buffer pool.

I Once all slots (frames) in this pool have been occupied, the buffer manager uses a replacement policy to decide which frame to overwrite when a new page needs to be brought in.

  Size of the database on secondary storage ≫ size of available primary memory to hold user data.

N.B. Simply overwriting a page in the buffer pool is not sufficient if this page has been modified after it has been brought in (i.e., if the page is so-called dirty).

[Figure: buffer pool of frames in main memory; disk pages are brought in via pinPage and released via unpinPage; free frames hold no disk page]

Simple interface for a typical buffer manager

Indicate that page p is needed for further processing:

    function pinPage(p):
      if buffer pool contains p already then
        pinCount(p) ← pinCount(p) + 1;
        return address of frame for p;
      select a victim frame p′ to be replaced using the replacement policy;
      if dirty(p′) then
        write p′ to disk;
      read page p from disk into the selected frame;
      pinCount(p) ← 1;
      dirty(p) ← false;
      return address of frame for p;

Indicate that page p is no longer needed, as well as whether p has been modified by a transaction (d):

    function unpinPage(p, d):
      pinCount(p) ← pinCount(p) − 1;
      dirty(p) ← d;
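For concreteness, here is a toy Python rendering of the pinPage/unpinPage protocol (the frame table, the victim choice, and the read_page/write_page I/O callbacks are stand-ins, not a real DBMS API):

    class Frame:
        def __init__(self, page_no, data):
            self.page_no, self.data = page_no, data
            self.pin_count, self.dirty = 0, False

    class BufferManager:
        def __init__(self, capacity, read_page, write_page):
            self.capacity = capacity
            self.read_page, self.write_page = read_page, write_page
            self.frames = {}                          # page_no -> Frame

        def pin_page(self, p):
            if p in self.frames:                      # page already buffered
                frame = self.frames[p]
                frame.pin_count += 1
                return frame
            if len(self.frames) >= self.capacity:     # need to evict a victim
                victim = next((f for f in self.frames.values() if f.pin_count == 0), None)
                if victim is None:
                    raise RuntimeError("all frames are pinned")
                if victim.dirty:
                    self.write_page(victim.page_no, victim.data)   # write back dirty victim
                del self.frames[victim.page_no]
            frame = Frame(p, self.read_page(p))       # read requested page from disk
            frame.pin_count = 1
            self.frames[p] = frame
            return frame

        def unpin_page(self, p, dirty):
            frame = self.frames[p]
            frame.pin_count -= 1
            frame.dirty = frame.dirty or dirty        # the pseudocode above simply assigns d

The victim choice here (any unpinned frame) is deliberately naive; the replacement policies of Section 2.3.2 would plug in at exactly that point.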

N.B.

I The pinCount of a page indicates how many users (e.g., transactions) are working with that page,

I clean victim pages are not written back to disk,

I a call to unpinPage does not trigger any I/O operation, even if the pinCount for this page goes down to 0 (the page might become a suitable victim, though),

I a database transaction is required to properly bracket any page operation using pinPage and unpinPage, i.e.

      a ← pinPage(p);                        a ← pinPage(p);
      . . .                                  . . .
      read data (records) on the      or     read and modify data (records)
      page at address a;                     on the page at address a;
      . . .                                  . . .
      unpinPage(p, false);                   unpinPage(p, true);

I A buffer manager typically offers at least one more interface call: flushPage(p), to force page p (synchronously) back to disk (for transaction management purposes).

Two strategic questions

1 How much precious buffer space to allocate to each of the active transactions (Buffer Allocation Problem)? Two principal approaches:

  static assignment
  dynamic assignment

2 Which page to replace when a new request arrives and the buffer is full (Page Replacement Problem)? Again, two approaches can be followed:

  decide without knowledge of the reference pattern
  presume knowledge of the (expected) reference pattern

Additional complexity is introduced when we take into account that the DBMS may manage segments of different page sizes:

I one buffer pool: good space utilization, but fragmentation problem

I many buffer pools: no fragmentation, but worse utilization; a global replacement/assignment strategy may get complicated

  A possible solution could be to allow for set-oriented pinPages({p}) calls.

2.3.1  Buffer allocation policies

Problem: shall we allocate parts of the buffer pool to each transaction (TX), or let the replacement strategy alone decide who gets how much buffer space?

Typical allocation strategies include:

I global: one buffer pool for all transactions

I local: based on different kinds of data (e.g., catalog, index, data, . . . )

I local: each transaction gets a certain fraction of the buffer pool:

  static partitioning: assign a buffer budget once for each TX

  dynamic partitioning: adjust a TX's buffer budget according to
  . its past reference pattern
  . some kind of semantic information

It is also possible to apply mixed strategies, e.g., have different pools working with different approaches. This complicates matters significantly, though.

Problem with a global policy:

  Consider a TX executing a sequential read on a huge relation:

  all page accesses are references to newly loaded pages;

  hence, almost all other pages are likely to be replaced (following a standard replacement strategy);

  other TXs cannot proceed without loading their pages in again (external page thrashing).

Properties of a local policy:

  one TX cannot hurt others

  TXs are treated equally

  possibly bad overall utilization of buffer space

  some TXs may have vast amounts of buffer space occupied by old pages, while others experience internal page thrashing, i.e., suffer from too little space

Examples of dynamic allocation strategies

1 Local LRU (cf. LRU replacement, later)

  I keep a separate LRU stack for each active TX, and
  I a global freelist for pages not pinned by any TX

  Strategy (for picking a victim):

  i. replace a page from the freelist
  ii. replace a page from the LRU stack of the requesting TX
  iii. replace a page from the TX with the largest LRU stack

2 Working Set Model (cf. operating systems' virtual memory management)

  Goal: avoid thrashing by allocating just enough buffer space to each TX

  Approach: observe the number of different page requests by each TX within a certain interval of time (window size τ)

  I deduce the optimal buffer budget from this observation,
  I allocate buffer budgets according to the ratio between those optimal sizes

Implementation of the Working Set Model

Let WS(T, τ) be the working set of TX T for window size τ, i.e.,

      WS(T, τ) = {pages referenced by T in the interval [now − τ, now]}.

The strategy is to keep, for each transaction Ti, its working set WS(Ti, τ) in the buffer.

Possible implementation: keep two counters, per TX and per page, respectively:

I trc(Ti) . . . TX-specific reference counter,

I lrc(Ti, Pj) . . . TX-specific last reference counter for each referenced page Pj.

Idea of the algorithm:

I Whenever Ti references Pj:

  increment trc(Ti);
  copy trc(Ti) to lrc(Ti, Pj).

I If a page has to be replaced for Ti, select among those pages Pj with

      trc(Ti) − lrc(Ti, Pj) ≥ τ.
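A small Python sketch of this bookkeeping (the class is ours; the candidate test trc(Ti) − lrc(Ti, Pj) ≥ τ is our reading of the partly garbled condition above):

    class WorkingSetTracker:
        def __init__(self, tau: int):
            self.tau = tau
            self.trc = {}            # transaction -> total reference count
            self.lrc = {}            # (transaction, page) -> trc value at last reference

        def reference(self, tx, page):
            self.trc[tx] = self.trc.get(tx, 0) + 1
            self.lrc[(tx, page)] = self.trc[tx]

        def replacement_candidates(self, tx, buffered_pages):
            """Pages of tx that have fallen out of its working set."""
            return [p for p in buffered_pages
                    if self.trc.get(tx, 0) - self.lrc.get((tx, p), 0) >= self.tau]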

2.3.2  Buffer replacement policies

I The choice of the victim frame selection (or buffer replacement) policy can considerably affect DBMS performance.

I There is a large number of policies in operating systems and database management systems.

Criteria for victim selection used in some strategies: the age of a page in the buffer (not considered, age since the last reference, or total age in the buffer) and the references to the page that are taken into account (none, only the last reference, or all references). Strategies covered by this scheme include Random, FIFO, LRU, CLOCK, GCLOCK (V1/V2), DGCLOCK, LFU, and LRD (V1/V2).

[Figure: schematic overview of buffer replacement policies -- FIFO queue, LRU stack, LFU reference counts, CLOCK ("second chance") with a "used" bit per frame, GCLOCK with reference counters possibly initialized with weights, LRD(V1) with per-page reference counts and ages]

I Two policies found in a number of DBMSs:

1 LRU (least recently used)

  Keep a queue (often described as a stack) of pointers to frames.

  In unpinPage(p, d), append p to the tail of the queue if pinCount(p) is decremented to 0.

  To find the next victim, search through the queue from its head and find the first page p with pinCount(p) = 0.

2 Clock ("second chance")

  Number the N frames in the buffer pool 0 . . . N − 1, initialize a counter current ← 0, and maintain a bit array referenced[0 . . . N − 1], initialized to all 0.

  In pinPage(p), do referenced[p] ← 1.

  To find the next victim, consider frame current.
  If pinCount(current) = 0 and referenced[current] = 0, current is the victim.
  Otherwise, referenced[current] ← 0, current ← (current + 1) mod N, repeat.

I Generalization: LRU(k) takes the timestamps of the last k references into account. Standard LRU is LRU(1).

N.B. LRU as well as Clock are heuristics only. Any heuristic can fail miserably in certain scenarios:

A challenge for LRU

  A number of transactions want to scan the same sequence of pages (e.g., SELECT * FROM R), one after the other. Assume a buffer pool with a capacity of 10 pages.

  1 Let the size of relation R be 10 pages or less. How many I/Os do you expect?

  2 Let the size of relation R be 11 pages. What about the number of I/O operations in this case?

Other well-known replacement policies are, e.g.,

I FIFO (first in, first out), LIFO (last in, first out),
I LFU (least frequently used), MRU (most recently used),
I GCLOCK (generalized clock), DGCLOCK (dynamic GCLOCK),
I LRD (least reference density),
I WS, HS (working set, hot set) -- see above,
I Random.
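A direct Python transcription of the Clock victim search described above (the pin_count and referenced arrays would normally live in the buffer manager's frame table; the function assumes at least one frame is unpinned):

    def clock_victim(pin_count, referenced, current, n_frames):
        """Return (victim frame, new value of current)."""
        while True:
            if pin_count[current] == 0 and referenced[current] == 0:
                return current, (current + 1) % n_frames
            referenced[current] = 0                   # take away the "second chance"
            current = (current + 1) % n_frames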

LRD (least reference density)

Record the following three parameters:

I trc(t) . . . total reference count of transaction t,

I age(p) . . . value of trc(t) at the time of loading p into the buffer,

I rc(p) . . . reference count of page p.

Update these parameters during a transaction's page references (pinPage calls).

From those, compute the mean reference density of a page p at time t as:

      rd(p, t) := rc(p) / (trc(t) − age(p))

  . . . where trc(t) ≥ rc(p) ≥ 1.

Strategy for victim selection: choose the page with the least reference density rd(p, t).

. . . many variants exist, e.g., for gradually disregarding old references.

Exploiting semantic knowledge

I The query compiler/optimizer . . .

  selects the access plan, e.g., sequential scan vs. index,

  estimates the number of page I/Os for cost-based optimization.

I Idea: use this information to determine a query-specific, optimal buffer budget

  ⇒ Query Hot Set model.

Goals:

I optimize overall system throughput;

I avoiding thrashing is the most important goal.
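A sketch of LRD victim selection with these parameters (the dictionary layout is ours; a buffered page has been referenced at least once since loading, so the denominator is at least 1):

    def reference_density(rc_p, trc_t, age_p):
        return rc_p / (trc_t - age_p)         # rd(p, t) = rc(p) / (trc(t) - age(p))

    def lrd_victim(pages, trc_t):
        """pages maps a page id to (rc, age); pick the page with the least density."""
        return min(pages, key=lambda p: reference_density(pages[p][0], trc_t, pages[p][1]))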

Hot Set with disjoint page sets

1 Only those queries are activated whose Hot Set buffer budget can be satisfied immediately.

2 Queries with higher demands have to wait until their budget becomes available.

3 Within its own buffer budget, each transaction applies a local LRU policy.

Properties:

  No sharing of buffered pages between transactions.

  Risk of internal thrashing when the Hot Set estimates are wrong.

  Queries with large Hot Sets block subsequent small queries.
  (Or, if bypassing is permitted, many small queries can lead to starvation of large ones.)

Hot Set with non-disjoint page sets

1 Queries allocate their budget stepwise, up to the size of their Hot Set.

2 Local LRU stacks are used for replacement.

3 Request for a page p:

  (i) If found in the TX's own LRU stack: update that LRU stack.

  (ii) If found in another transaction's LRU stack: access the page, but don't update the other LRU stack.

  (iii) If found in the freelist: push the page onto the TX's own LRU stack.

4 unpinPage: push the page onto the freelist stack.

5 Filling empty buffer frames: frames are taken from the bottom of the freelist stack.

N.B.

I As long as a page is in a local LRU stack, it cannot be replaced.

I If a page drops out of a local LRU stack, it is pushed onto the freelist stack.

I A page is replaced only if it reaches the bottom of the freelist stack before some transaction pins it again.

Priority Hints

I Idea: with unpinPage, a transaction gives one of two possible indications to the buffer manager:

  preferred page . . . such pages are managed in a TX-local partition,

  ordinary page . . . managed in a global partition.

I Strategy: when a page needs to be replaced,

  1. try to replace an ordinary page from the global partition using LRU;
  2. otherwise, replace a preferred page of the requesting TX according to MRU.

I Advantages:

  much simpler than DBMIN (Hot Set),

  similar performance,

  easy to deal with partitions that are too small.

Prefetching

. . . when the buffer manager receives requests for (single) pages, it may decide to (asynchronously) read ahead:

I on-demand, asynchronous read-ahead:

  e.g., when traversing the sequence set of an index, during a sequential scan of a relation, . . .

I heuristic (speculative) prefetching:

  e.g., sequential n-block lookahead (cf. drive or controller buffers in hard disks), semantically determined supersets, index prefetch, . . .

2.3.3  Buffer management in DBMSs vs. OSs

Buffer management for a DBMS curiously tastes like the virtual memory¹ concept of modern operating systems. Both techniques provide access to more data than will fit into primary memory.

So: why don't we use the OS virtual memory facilities to implement DBMSs?

I A DBMS can predict certain reference patterns for pages in a buffer a lot better than a general-purpose OS.

I This is mainly because page references in a DBMS are initiated by higher-level operations (sequential scans, relational operators) of the DBMS itself.

  Reference pattern examples in a DBMS:

  1 Sequential scans call for prefetching.

  2 Nested-loop joins call for page fixing and hating.

I Concurrency control protocols often prescribe the order in which pages are written back to disk. Operating systems usually do not provide hooks for that.

¹ Generally implemented using a hardware interrupt mechanism called page faulting.

Double Buffering

If the DBMS uses its own buffer manager (within the virtual memory of the DBMS server process), independently from the OS VM manager, we may experience the following:

I Virtual page fault: the page resides in the DBMS buffer, but its frame has been swapped out of physical memory by the OS VM manager.

  An I/O operation is necessary that is not visible to the DBMS.

I Buffer fault: the page does not reside in the DBMS buffer; the frame is in physical memory.

  Regular DBMS page replacement, requiring an I/O operation.

I Double page fault: the page does not reside in the DBMS buffer, and the frame has been swapped out of physical memory by the OS VM manager.

  Two I/O operations are necessary: one to bring in the frame (OS)², another one to replace the page in that frame (DBMS).

² The OS VM does not know the DBMS's dirty flags and hence brings in pages that could simply be overwritten.

⇒ The DBMS buffer needs to be memory resident in the OS.

2.4  File and record organization

[Figure: DBMS architecture -- you are here: Files and Index Structures]

I We will now turn away from page management and instead focus on page usage in a DBMS.

I On the conceptual level, a relational DBMS manages tables of tuples³, e.g.

      A     B      C
      ...   ...    ...
      42    true   foo
      ...   ...    ...

I On the physical level, such tables are represented as files of records (tuple = record); each page holds one or more records (in general, |record| ≪ |page|).

I A file is a collection of records that may reside on several pages.

³ More precisely, table actually means bag here (a set of elements with multiplicity ≥ 0).

2.4.1  Heap files

I The simplest file structure is the heap file, which represents an unordered collection of records.

I As in any file structure, each record has a unique record identifier (rid).

I A typical heap file interface supports the following operations:

  create/destroy a heap file f named n:            createFile(n) / deleteFile(f)
  insert record r and return its rid:              insertRecord(f, r)
  delete a record with a given rid:                deleteRecord(f, rid)
  get a record with a given rid:                   getRecord(f, rid)
  initiate a sequential scan over the whole file:  openScan(f)

I N.B. Record ids (rids) are used like record addresses (or pointers). Internally, the heap file structure must be able to map a given rid to the page containing the record.

I To support openScan(f), the heap file structure has to keep track of all pages in file f; to support insertRecord(f, r) efficiently, we need to keep track of all pages with free space in file f.

I Let us have a look at two simple structures which can offer this support.

2.4.2  Linked list of pages

I When createFile(n) is called,

  1 the DBMS allocates a free page (the file header) and writes an appropriate entry ⟨n, header page⟩ to a known location on disk;

  2 the header page is initialized to point to two doubly linked lists of pages:

      [Figure: header page pointing to a linked list of data pages with free space and a linked list of full data pages]

  3 Initially, both lists are empty.

Remarks:

I For insertRecord(f, r),

  1 try to find a page p in the free list with free space > |r|; should this fail, ask the disk space manager to allocate a new page p;

  2 record r is written to page p;

  3 since generally |r| ≪ |p|, p will belong to the list of pages with free space;

  4 a unique rid for r is computed and returned to the caller.

  (A minimal sketch of these steps follows below.)

I For openScan(f),

  1 both page lists have to be traversed.

I A call to deleteRecord(f, rid)

  1 may result in moving the containing page from the full to the free page list,

  2 or even lead to page deallocation if the page is completely free after the deletion.

Finding a page with sufficient free space

  . . . is an important problem to solve inside insertRecord(f, r). How does the heap file structure support this operation? (How many pages of a file do you expect to be in the list of free pages?)
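A minimal sketch of insertRecord over the two page lists (the header/page objects and the allocate_page callback are hypothetical stand-ins for the on-disk structures):

    def insert_record(header, r, allocate_page):
        # 1. find a page with enough free space, or allocate a fresh one
        page = next((p for p in header.free_list if p.free_space() > len(r)), None)
        if page is None:
            page = allocate_page()
            header.free_list.append(page)
        # 2. write the record into the page
        slot = page.add(r)
        # 3. move the page to the full list only if no free space remains
        if page.free_space() == 0:
            header.free_list.remove(page)
            header.full_list.append(page)
        # 4. the rid is computed from the page number and the slot
        return (page.page_no, slot)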

2.4.3  Directory of pages

I An alternative to the linked list approach is to maintain a directory of pages in a file.

I The header page contains the first page of a chain of directory pages; each entry in a directory page identifies a page of the file:

      [Figure: header page → chain of directory pages; each directory entry points to one data page]

I |page directory| ≪ |data pages|

I Free space management is also done via the directory:

  each directory entry is actually of the form ⟨page addr p, nfree⟩, where nfree indicates the actual amount of free space (e.g., in bytes) on page p.

I/O operations and free space management

  For a file of 10000 pages, give lower and upper bounds for the number of page I/O operations during an insertRecord(f, r) call for a heap file organized using

  1 a linked list of pages,

  2 a directory of pages (1000 directory entries/page).

  linked list:
    lower bound: header page + first page in free list + write r = 3 page I/Os
    upper bound: header page + 10000 pages in free list + write r = 10002 page I/Os

  directory (1000 entries/page):
    lower bound: directory header page + write r = 2 page I/Os
    upper bound: 10 directory pages + write r = 11 page I/Os

2.5  Page formats

I Locating the containing data page for a given rid is not the whole story when the DBMS needs to access a specific record: the internal structure of pages plays a crucial role.

I For the following discussion, we consider a page to consist of a sequence of slots, each of which contains a record.

I A complete record identifier then has the unique form

      ⟨page addr, nslot⟩

  where nslot denotes the slot number on the page.

2.5.1  Fixed-length records

Life is particularly easy if all records on the page (in the file) are of the same size s:

I getRecord(f, ⟨p, n⟩):
  given the rid ⟨p, n⟩, we know that the record is to be found at (byte) offset n · s on page p.

I deleteRecord(f, ⟨p, n⟩):
  copy the bytes of the last occupied slot on page p to offset n · s, mark the last slot as free;
  all occupied slots thus appear together at the start of the page (the page is packed).

I insertRecord(f, r):
  find a page p with free space ≥ s (see the previous section); copy r to the first free slot on p, mark the slot as occupied.

Packed pages and deletions

  One problem with packed pages remains, though:
  I calling deleteRecord(f, ⟨p, n⟩) modifies the rid of a different record ⟨p, n′⟩ on the same page.
  I If any external reference to this record exists, we need to chase the whole database and update the rid references ⟨p, n′⟩ → ⟨p, n⟩. Bad!

I To avoid record copying (and thus rid modifications), we could simply use a free slot bitmap on each page:

      [Figure: packed page -- slots 0 .. N−1 followed by free space, page header storing the number of records N; unpacked page with bitmap -- slots 0 .. M−1 interleaved with free slots, page header storing a slot bitmap and the number of slots M]

  Calling deleteRecord(f, ⟨p, n⟩) simply means setting bit n in the bitmap to 0; no other rids are affected.

2.5.2  Variable-length records

If records on a page are of varying size (cf. the SQL datatype VARCHAR(n)), we have to deal with page fragmentation:

I In insertRecord(f, r) we have to find an empty slot of size ≥ |r|; at the same time we want to minimize waste.

I To get rid of the holes produced by deleteRecord(f, rid), compact the remaining records to maintain a contiguous area of free space on the page.

A solution is to maintain a slot directory on each page (compare this with a heap file directory!):

      [Figure: page p with variable-length records addressed via a slot directory stored at the end of the page; each directory entry holds the offset of its record from the start of the data area (e.g., rid = ⟨p, 0⟩ → offset 24 bytes); the page trailer also stores the number of entries N in the slot directory and a pointer to the start of the free space]

Page header or trailer?

  In both page organization schemes we have positioned the page header at the end of its page. How would you justify this design decision?
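A toy Python model of such a slotted page (a sketch of the layout in the figure, not an on-disk format): deletion marks a slot as unused and compacts the data area, while slot numbers -- and thus rids -- never change.

    class SlottedPage:
        def __init__(self):
            self.data = bytearray()          # contiguous data area
            self.slots = []                  # slot n -> (offset, length); offset -1 = unused

        def insert(self, record: bytes) -> int:
            for n, (off, _) in enumerate(self.slots):
                if off == -1:                               # reuse a deleted slot entry
                    self.slots[n] = (len(self.data), len(record))
                    self.data += record
                    return n
            self.slots.append((len(self.data), len(record)))
            self.data += record
            return len(self.slots) - 1

        def get(self, n: int) -> bytes:
            off, length = self.slots[n]
            return bytes(self.data[off:off + length])

        def delete(self, n: int):
            off, length = self.slots[n]
            del self.data[off:off + length]                 # compact the data area
            self.slots[n] = (-1, 0)
            # records stored behind the deleted one moved forward by `length` bytes
            self.slots = [(o - length, l) if o > off else (o, l) for (o, l) in self.slots]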

Remarks:

I The slot directory contains entries ⟨offset, length⟩, where offset is measured in bytes from the data page start.

I In deleteRecord(f, ⟨p, n⟩), simply set the offset of directory entry n to −1; such an entry can be reused during subsequent insertRecord(f, r) calls which hit page p.

Directory compaction

  . . . is not allowed in this scheme (again, this would modify the rids of all records ⟨p, n′⟩, n′ > n)!
  If insertions are much more common than deletions, the directory size will nevertheless be close to the actual number of records stored on the page.

N.B. Record compaction (defragmentation), on the other hand, is performed, of course.

2.6  Record formats

I This section zooms into the record internals themselves, thus discussing access to single record fields (conceptually: attributes). Attribute values are considered atomic by an RDBMS.

I Depending on the field types, we are dealing with fixed-length or variable-length fields in a record, e.g.

      SQL datatype    length (# of bytes)⁴          fixed-length?
      INTEGER         4                             yes
      BIGINT          8                             yes
      CHAR(n)         n, 1 ≤ n ≤ 254                yes
      VARCHAR(n)      1 . . . n, 1 ≤ n ≤ 32672      no
      CLOB(n)         1 . . . n, 1 ≤ n ≤ 2^31       no
      DATE            4                             yes
      ...             ...                           ...

⁴ Datatype lengths valid for DB2 V7.1.

2.6.1  Fixed-length fields

If all fields in a record are of fixed length, the offsets for field access can simply be read off the DBMS system catalog (field f_i of size l_i):

      [Figure: record with fields f1 f2 f3 f4 of sizes l1 l2 l3 l4, stored at base address b; the address of field f3 is b + l1 + l2]

I The DBMS computes and then saves the field size information for the records of a file in the system catalog when it processes the corresponding CREATE TABLE . . . command.

2.6.2  Variable-length fields

If a record contains one or more variable-length fields, other record representations are needed:

1 Use a special delimiter symbol ($) to separate the record fields. Accessing field f_i then means scanning over the bytes of fields f_1 . . . f_{i−1}:

      [Figure: f1 $ f2 $ f3 $ f4]

2 For a record of n fields, use an array of n + 1 offsets pointing into the record (the last array entry marks the end of field f_n):

      [Figure: offset array with n + 1 entries, followed by the field bytes of f1 f2 f3 f4]
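A small sketch of variant 2 (the encoding details -- 2-byte offsets, byte-string fields -- are assumptions, not the slides' format). A NULL field simply repeats the previous offset, which is the compact representation mentioned in the final remarks below; note the toy code cannot distinguish NULL from an empty string.

    import struct

    def encode(fields):
        """Record = array of n+1 offsets, followed by the concatenated field bytes."""
        header_len = 2 * (len(fields) + 1)
        offsets, body = [header_len], b""
        for f in fields:
            body += b"" if f is None else f          # a NULL field contributes no bytes
            offsets.append(header_len + len(body))
        return struct.pack(f"<{len(offsets)}H", *offsets) + body

    def decode_field(record, n_fields, i):
        offsets = struct.unpack_from(f"<{n_fields + 1}H", record)
        start, end = offsets[i], offsets[i + 1]
        return None if start == end else record[start:end]

    r = encode([b"42", b"true", None, b"foo"])       # field f3 is NULL
    assert decode_field(r, 4, 2) is None and decode_field(r, 4, 3) == b"foo"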

Final remarks:

Variable-length record formats seem to be more complicated, but, among other advantages, they allow for a compact representation of SQL NULL values (field f3 is NULL below):

      [Figure: record with fields f1, f2, f4 and an offset array in which the two offsets delimiting field f3 coincide]

Growing a record

  Consider an update on a field (e.g., of type VARCHAR(n)) which lets the record grow beyond the size of the free space on its containing page.
  How could the DBMS handle this situation efficiently?

Really growing a record

  For fields of type VARCHAR(n) or CLOB(n) with n > |page size| we are in trouble whenever a record actually grows beyond the page size (the record won't fit on any one page).
  How could the DBMS file manager cope with this?

2.7  Addressing schemes

What makes a good record ID (rid)?

I Given a rid, it should ideally not take more than 1 page I/O to get to the record itself.

I rids should be stable under all circumstances, such as

  a record being moved within a page,

  a record being moved across pages.

Why are these goals important to achieve?

  Consider the fact that rids are used as persistent pointers in a DBMS (indexes, directories, implementation of CODASYL sets, . . . ).

Conflicting goals! Efficiency calls for a direct disk address, while stability calls for some kind of indirection.

Direct addressing

I RBA -- relative byte address:

  Consider the disk file as a persistent virtual address space and use the byte offset as the rid.

  pro: very efficient access to the page and to the record within the page

  con: no stability at all w.r.t. moving a record

I PP -- page pointers:

  Use disk page numbers as the rid.

  pro: very efficient access to the page; locating the record within the page is cheap (an in-memory operation)

  con: stable w.r.t. moving records within a page, but not when moving across pages

Indirect addressing

I LSN -- logical sequence numbers:

  Assign logical numbers to records. An address translation table maps them to PPs (or even RBAs).

  pro: full stability w.r.t. all relocations of records

  con: additional I/O to the translation table (which is, however, often in the buffer)

  CODASYL systems call this a DBTT (database key translation table):

      [Figure: DB key → translation table entry → physical address of the data block holding the record]

Indirect addressing -- fancy variant

I LSN/PPP -- LSN with probable page pointers:

  Try to avoid the extra I/O by adding a probable PP (PPP) to LSNs. The PPP is the PP at the time of insertion into the database. If the record is moved across pages, PPPs are not updated!

  pro: full stability w.r.t. all record relocations; the PPP can save the extra I/O to the translation table, iff it is still correct

  con: 2 additional page I/Os in case the PPP is no longer valid: one to the old page to notice that the record has moved, a second I/O to the translation table to look up the new page number

TID addressing

I TID -- tuple identifier with forwarding:

  Use a ⟨PP, slot#⟩ pair as the rid (see above). To guarantee stability, leave a forward address on the original page if the record has to be moved across pages.

  For example, access the record with the given rid = ⟨17, 2⟩:

      [Figure: page 17, slot 2 holds the forward address ⟨23, 0⟩; page 23, slot 0 holds the record itself]

  Avoid chains of forward addresses!

  When the record has to be moved again: do not leave another forward address; rather, update the forward on the original page!

  pro: full stability w.r.t. all relocations of records; no extra I/O due to indirection

  con: 1 additional page I/O in case of a forward pointer on the original page
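A short Python sketch of TID lookup with at most one forward hop (the page/slot containers and the read_page callback are hypothetical): a slot either holds the record itself or the forward address left behind when the record was moved.

    def get_record(read_page, rid):
        page_no, slot_no = rid
        entry = read_page(page_no).slots[slot_no]         # first (and usually only) page I/O
        if entry.is_forward:                              # record was moved across pages
            fwd_page, fwd_slot = entry.forward
            entry = read_page(fwd_page).slots[fwd_slot]   # one additional page I/O, never more
        return entry.record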


