Basic Merge of Two Sorted Lists
Phase 1
Fill available main memory with blocks from the relation to be sorted.
Sort the records in main memory (using, e.g., quicksort).
Write the sorted records from main memory to "new" blocks of disk.
This yields one sorted sublist.
Each block of the relation is read and written once in Phase 1.
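A minimal sketch of Phase 1 (run generation), assuming the relation is a flat binary file of fixed-size 100-byte records ordered by their raw bytes; the file names and the in-memory capacity are illustrative only.

```python
RECORD_SIZE = 100            # bytes per tuple, as in the example below
MEMORY_RECORDS = 500_000     # records that fit in main memory (illustrative)

def phase1_make_runs(input_path):
    """Fill memory with records, sort them, and write each batch as one sorted sublist."""
    run_paths = []
    with open(input_path, "rb") as f:
        while True:
            chunk = f.read(RECORD_SIZE * MEMORY_RECORDS)    # fill available memory
            if not chunk:
                break
            records = [chunk[i:i + RECORD_SIZE]
                       for i in range(0, len(chunk), RECORD_SIZE)]
            records.sort()                                  # in-memory sort (e.g., quicksort)
            run_path = f"run_{len(run_paths)}.bin"
            with open(run_path, "wb") as out:
                out.writelines(records)                     # one sorted sublist on disk
            run_paths.append(run_path)
    return run_paths
```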
Example: 2PMMS

R has 10,000,000 tuples of 100 bytes each
Suppose block size = 4 KB = 4096 bytes
Each block can fit 40 tuples
R has 10^7/40 = 250,000 blocks
Suppose main memory is 50 MB (50 × 2^20 B)
➨ 50 × 2^20 / 2^12 = 50 × 2^8 = 12,800 blocks can fit into main memory.
Assume the data blocks are placed on the disk randomly
Phase 1: fill main memory 250,000/12,800 ≈ 20 times (250,000 = 19 × 12,800 + 6,800)
One fill: read and write 12,800 blocks
Total average I/O time for Phase 1: 2 × 250,000 × 10.76 ms ≈ 89.7 minutes

Example: 2PMMS (Cont'd)

Phase 2 (merging): read and write each block once
The same calculation as for Phase 1: 89.7 minutes
Total time to sort R using a 20-way merge-sort method = Phase 1 + Phase 2 = 179.4 minutes ≈ 3 hours
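The figures above can be checked with a few lines of Python; the 10.76 ms constant is the average random block access time used on these slides.

```python
tuples, tuple_bytes, block_bytes = 10_000_000, 100, 4096
tuples_per_block = block_bytes // tuple_bytes         # 40 tuples per block
blocks = tuples // tuples_per_block                   # 250,000 blocks in R
mem_blocks = (50 * 2**20) // block_bytes              # 12,800 blocks fit in 50 MB
fills = -(-blocks // mem_blocks)                      # ceiling division: 20 fills

avg_access_ms = 10.76                                 # avg. random block access time (ms)
phase_minutes = 2 * blocks * avg_access_ms / 60_000   # each block is read and written once
print(fills)                                          # 20
print(round(phase_minutes, 1))                        # 89.7 minutes per phase
print(2 * round(phase_minutes, 1))                    # 179.4 minutes ≈ 3 hours in total
```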
Sometimes we need many passes in Phase 2. How many passes are needed?
• Seek time is about half the average access time! Can we do something about it?!
• Store the relation on one cylinder, or several consecutive cylinders.
• To read the entire relation, we then only need one seek and one rotational delay.
• In the sorting example, we had 10.75 ms = 6.46 ms (seek time) + 4.16 ms (rotational latency) + 0.13 ms (block transfer time) to access a random block.

• Suppose the capacity of 1 cylinder is 1 MB
• Size of R = 10^7 tuples of 100 B each = 10^9 B
• We need 10^9/10^6 = 1000 cylinders
• In Phase 1 of 2PMMS, we fill main memory 20 times (main memory = 50 MB = 50 cylinders)
➨ one fill = read 50 cylinders = 6.46 ms for one avg. seek + 49 × 1.00025 ms for 49 one-cylinder moves + 50 × (16 × 8.33) ms (50 × time to read 16 tracks from the same cylinder) ≈ 6.72 seconds
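A quick check of the per-fill figure, using the slide's timing parameters (1.00025 ms per one-cylinder move, 8.33 ms per rotation, 16 tracks per cylinder):

```python
avg_seek_ms    = 6.46        # one average seek to reach the first cylinder of the fill
cyl_to_cyl_ms  = 1.00025     # moving the heads to an adjacent cylinder
rotation_ms    = 8.33        # one rotation reads one full track
tracks_per_cyl = 16
cylinders      = 50          # one fill of 50 MB = 50 one-MB cylinders

fill_ms = (avg_seek_ms
           + (cylinders - 1) * cyl_to_cyl_ms            # 49 one-cylinder moves
           + cylinders * tracks_per_cyl * rotation_ms)  # read 16 tracks per cylinder
print(round(fill_ms / 1000, 2))                         # ≈ 6.72 seconds per fill
```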
20 fills = 20 × 6.72 sec = 134.4 seconds
Write the sorted runs to 1000 consecutive cylinders ≈ 134.4 sec
Phase 1 ≈ 268.8 seconds ≈ 4.5 minutes
Compare this with blocks of R placed randomly, which took about 90 minutes in Phase 1.

Does "cylindrification" help in Phase 2?
No, since blocks are read from the fronts of the sorted lists in an order that depends on
(1) the data, and
(2) which list has its current block exhausted (see the merge sketch below).
o That is, output blocks are written one at a time, interspersed with block reads.
Thus, the 2nd phase will still take 89.7 minutes (as computed above), and hence the total time to sort is 268.8/60 + 89.7 ≈ 94.18 min.
We cut the sorting time (179.4 min) by about 50%, but can't do much better by cylindrification alone.
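A minimal sketch of the Phase 2 merge, assuming runs like the ones produced by the Phase 1 sketch above; it shows why the next block needed comes from whichever run happens to supply the smallest front record, which depends entirely on the data.

```python
import heapq

def phase2_merge(run_paths, output_path, record_size=100):
    """K-way merge: repeatedly emit the smallest record among the fronts of all runs."""
    runs = [open(p, "rb") for p in run_paths]
    heap = []
    for i, f in enumerate(runs):
        first = f.read(record_size)
        if first:
            heapq.heappush(heap, (first, i))
    with open(output_path, "wb") as out:
        while heap:
            record, i = heapq.heappop(heap)
            out.write(record)                     # output grows one record at a time
            nxt = runs[i].read(record_size)       # refill only the run that was consumed
            if nxt:
                heapq.heappush(heap, (nxt, i))
    for f in runs:
        f.close()
```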
Scheduling: Elevator algorithm

Useful when there are many block requests to choose from (not in our merge-sort example)
Floors = cylinders
Block requests = elevator calls
Avg. travel time to a cylinder = 1 + (# of cylinders moved)/4000 ms
Avg. rotational delay = 4.16 ms and BTT = 0.13 ms
Suppose the heads are on cylinder 8000
Example below:

Scheduling: Elevator/FIFO algorithms

    Arrival times          Elevator                 FIFO
    Cyl. req.   Time       Cyl. req.   Finished     Cyl. req.   Finished
    8000        0          8000        4.3          8000        4.3
    24000       0          24000       13.6         24000       13.6
    56000       0          56000       26.9         56000       26.9
    16000       10         64000       34.2         16000       42.2
    64000       20         40000       45.5         64000       59.5
    40000       30         16000       56.8         40000       70.8

(All times in ms.)
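The finish times in the table can be reproduced with the slide's cost model (travel = 1 + distance/4000 ms, plus 4.16 + 0.13 ≈ 4.3 ms per access, rounded as on the slide); the two service orders below are simply the ones the elevator and FIFO policies produce for these arrivals.

```python
SEEK_PER_CYL = 1 / 4000     # ms of head travel per cylinder crossed
SEEK_START   = 1.0          # fixed ms charged whenever the heads move at all
ACCESS_MS    = 4.3          # rotational delay + block transfer, rounded as on the slide

def finish_times(start_cyl, service_order):
    """Finish time of each request, given the order in which cylinders are visited."""
    t, pos, result = 0.0, start_cyl, []
    for cyl in service_order:
        if cyl != pos:
            t += SEEK_START + abs(cyl - pos) * SEEK_PER_CYL
        t += ACCESS_MS
        pos = cyl
        result.append((cyl, round(t, 1)))
    return result

print(finish_times(8000, [8000, 24000, 56000, 64000, 40000, 16000]))  # elevator sweep
print(finish_times(8000, [8000, 24000, 56000, 16000, 64000, 40000]))  # FIFO (arrival order)
```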
With 50 MB main memory, we can afford 2 track-sized buffers for each sublist and 2 for the output (1 track = 128 KB) → 42 tracks ≈ 5 MB.
Consume one track for each sublist while the other is being loaded.
Similarly, write one output track while the other is being constructed.
Effectively, this eliminates in-memory processing time.
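A minimal sketch of this double-buffering idea, assuming a background thread plays the role of the controller filling one buffer while the caller consumes the other; the buffer size and helper name are illustrative.

```python
import queue
import threading

def double_buffered_tracks(path, track_bytes=128 * 1024):
    """Yield track-sized chunks while the next chunk is being read in the background."""
    buffers = queue.Queue(maxsize=1)     # at most one prefetched track in flight

    def reader():
        with open(path, "rb") as f:
            while True:
                chunk = f.read(track_bytes)
                buffers.put(chunk)       # blocks until the consumer frees the slot
                if not chunk:
                    return               # empty chunk signals end of file

    threading.Thread(target=reader, daemon=True).start()
    while True:
        chunk = buffers.get()
        if not chunk:
            break
        yield chunk                      # caller processes this track while the next loads
```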
A related issue here is block size selection (trade-off?): a big block size amortizes I/O cost, but space is wasted if the block is not fully used.

• So far, we saw how disk access time (performance) may be improved depending on (1) the operation at hand and (2) the way the disk works.
• Next, we will look at ways to mitigate disk failures and hence improve disk reliability.
• Failures should be detected and recovered from, if possible.
• This ability is essential for continuous operation.
Performance/Reliability of Disk Systems

A useful model of a disk read: the reading function returns (w, s), where w is the data in the sector that is read and s is the status bit.
How does s get its "good" or "bad" value? Easy: each sector has additional bits, called the checksum (written by the disk controller).
A simple form of checksum is the parity bit (8 data bits followed by their parity bit):
  011010001
  111011100
Even parity: the number of 1's in the data bits plus their parity bit is always even.

Disk failures – A classification

The function Read(w, s) returns the value "good" for s if w has an even number of 1's; otherwise s = "bad".
It is possible that more than one bit in a sector is corrupted, and hence an error may not be detected.
Suppose bits are flipped randomly: the probability that an error goes undetected (i.e., the number of 1's stays even) is thus 50%. (Why?)
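A small sketch of the even-parity convention above; the bit strings are the slide's examples, and the helper names are mine.

```python
def with_parity(data_bits):
    """Append the even-parity bit so the total number of 1's is even."""
    return data_bits + [sum(data_bits) % 2]

def status(sector_bits):
    """Read(w, s): s is 'good' iff the stored sector has an even number of 1's."""
    return "good" if sum(sector_bits) % 2 == 0 else "bad"

sector = with_parity([0, 1, 1, 0, 1, 0, 0, 0])   # 011010001, as on the slide
print(status(sector))                            # good
sector[2] ^= 1                                   # one corrupted bit: parity is now odd
print(status(sector))                            # bad  -> error detected
sector[0] ^= 1                                   # a second flipped bit restores even parity
print(status(sector))                            # good -> error goes undetected
```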
• Suppose we have 8 parity bits:
    01110110  Byte 1
    11001101  Byte 2
    00001111  Byte 3
    10110100  Byte of parity bits
• With n parity bits, the probability of an undetected error = 1/2^n.
• Checksum/parity may help detect, but not correct, errors.

Use data and/or disk redundancy to protect against permanently destroyed disks.
Mean time to failure = when 50% of the disks have crashed, e.g., 10 years.
Simplified (assuming failures are spread linearly over time):
o In the 1st year, 5% of the disks fail, …
o In the 2nd year, 5% of the disks fail, …
However, the mean time to a disk crash doesn't have to be the same as the mean time to data loss; there are solutions.
Redundant Array of Independent Disks, RAID

RAID 1: Mirror each disk (data/redundant disks).
If a disk fails, restore it using the mirror.
Probability of the mirror disk crashing during restoration:
  Suppose each disk lasts 10 years, on average.
  Assume 3 hrs (i.e., 1/2920 year) to replace a disk.
  Probability the mirror disk fails during copying is: 1/10 × 1/2920 = 1/29,200.
If 1 disk fails on average every 10 years
➨ a disk or its mirror fails on average every 5 years.
So, it takes on average 5 × 29,200 = 146,000 years for a non-recoverable error to occur.

RAID 4

• Problem with RAID 1 (also called mirroring): n data disks & n redundant disks.
• RAID 4: one redundant disk only (dedicated parity).
• x⊕y is the modulo-2 sum of x and y (XOR)
• 11110000 ⊕ 10101010 = 01011010
• For any n: we have n data disks & 1 redundant disk.
• Each block in the redundant disk has the parity bits for the corresponding blocks in the data disks (block-interleaved parity).
• Number the blocks (on each disk): 1, 2, 3, …, k
    i-th block of data disk 1:        11110000
    i-th block of data disk 2:        10101010
    i-th block of data disk 3:        00111000
    i-th block of the redundant disk: 01100010
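A small sketch of the block-interleaved parity computation, using the slide's i-th blocks; xor_bits is a helper name introduced here.

```python
def xor_bits(*blocks):
    """Bitwise modulo-2 sum (XOR) of equal-length bit strings such as '11110000'."""
    width = len(blocks[0])
    acc = 0
    for b in blocks:
        acc ^= int(b, 2)
    return format(acc, f"0{width}b")

d1, d2, d3 = "11110000", "10101010", "00111000"   # i-th blocks of data disks 1-3
print(xor_bits(d1, d2))                           # 01011010, the XOR example above
print(xor_bits(d1, d2, d3))                       # 01100010, the redundant disk's i-th block
```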
Commutative: x⊕y = y⊕x
Associative: x⊕(y⊕z) = (x⊕y)⊕z
Identity: x⊕0 = 0⊕x = x (0 is the all-zeros vector)
Self-inverse: x⊕x = 0
o As a useful consequence, if x⊕y = z, then we can "add" x to both sides and get y = x⊕z.

• Reading: as usual.
  o Interesting possibility: if we want to read a block from disk i, but it is busy and all other disks are free, then instead we can read the corresponding blocks from all the other disks and compute their modulo-2 sum.
• Writing:
  o Write the block to disk i.
  o Update the corresponding block on the redundant disk.
  o This also means the number of writes to the redundant disk is n times the average number of writes to any one data disk.
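Two consequences of the rules above, sketched with the xor_bits helper from the earlier RAID 4 snippet; reading a busy disk's block via the other disks is the trick just described, and updating the parity by XOR-ing old and new contents is one standard way to do the write, consistent with the self-inverse property. The new-block value 00001111 is the one used in the RAID 6 example later.

```python
d1, d2, d3 = "11110000", "10101010", "00111000"
redundant = xor_bits(d1, d2, d3)                  # 01100010

# Reading a block of a busy disk: XOR the corresponding blocks of all other disks.
print(xor_bits(d1, d3, redundant))                # 10101010 = the block of disk 2

# Writing: when a data block changes, the parity changes by (old XOR new),
# because x XOR x = 0 cancels the old contribution.
new_d2 = "00001111"
redundant = xor_bits(redundant, d2, new_d2)
print(redundant, xor_bits(d1, new_d2, d3))        # both give 11000111
```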
Failure recovery in RAID 4
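The worked diagrams from the original slides are not reproduced here; the recovery rule itself follows from the parity relation already stated: a crashed disk's block is the modulo-2 sum of the corresponding blocks on all surviving disks (data and redundant alike). A minimal sketch, reusing xor_bits from above:

```python
d1, d2, d3 = "11110000", "10101010", "00111000"
redundant = xor_bits(d1, d2, d3)

# Suppose data disk 3 crashes: rebuild its i-th block from all surviving disks.
rebuilt_d3 = xor_bits(d1, d2, redundant)
print(rebuilt_d3 == d3)                   # True

# The redundant disk itself is rebuilt the same way, from the data disks.
print(xor_bits(d1, d2, d3) == redundant)  # True
```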
RAID 6 (Cont'd)

How do we find a redundancy bit pattern?
o Columns should be different.
o We have all the combinations of bits in the columns except the all-0's column.

• Reading is as in RAID 4, i.e., read from the disk containing the data.
• How about writing? Suppose we rewrite the first block of disk 2 to be 00001111.
  We then compute the change: 00001111 ⊕ 10101010 = 10100101, and add this change (modulo 2) into the corresponding blocks of the redundant disks whose columns check disk 2.
RAID 6 (Cont'd)

• Suppose we have four disks: 1 and 2 are data disks, 3 and 4 are redundant.
• Disk 3 is a mirror of disk 1. Disk 4 holds parity check bits for disks 2 and 3.

    Disk:      1   2   3   4
    Contents:  x   z   x   z⊕x

• Which combinations of simultaneous 2-disk failures can we recover from?
• Disk pairs to consider:
  1. {1,2}
  2. {1,3}
  3. {1,4}
  4. {2,3}
  5. {2,4}
  6. {3,4}
• We can recover from all the above crash-pairs except the 5th.
• Can't recover if 2 & 4 crash. Why? (See the check below.)
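A brute-force check of the claim, treating each disk as holding a single bit; the layout (disk 3 mirrors disk 1, disk 4 = parity of disks 2 and 3) is the one in the table above, and the helper names are mine. A failure pair is recoverable exactly when the surviving disks' contents determine the data bits uniquely.

```python
from itertools import combinations, product

def disk_contents(x, z):
    """Bit on each disk: 1 and 2 hold data x and z; 3 mirrors 1; 4 is parity of 2 and 3."""
    return {1: x, 2: z, 3: x, 4: z ^ x}

for failed in combinations((1, 2, 3, 4), 2):
    survivors = [d for d in (1, 2, 3, 4) if d not in failed]
    seen = {}
    recoverable = True
    for x, z in product((0, 1), repeat=2):
        visible = tuple(disk_contents(x, z)[d] for d in survivors)
        if visible in seen and seen[visible] != (x, z):
            recoverable = False        # two different data values are indistinguishable
        seen[visible] = (x, z)
    print(set(failed), "recoverable" if recoverable else "NOT recoverable")
```

Running this prints "NOT recoverable" only for {2, 4}: with disks 2 and 4 gone, the survivors hold x twice and say nothing about z.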