
Two-phase Multiway Merge-sort (2PMMS)

Basic merge of two sorted lists:
 Phase 1: Sort main-memory-sized partitions to produce the sorted sublists, also called runs.
 Phase 2: Merge the sorted sublists into a single sorted list.

Phase 1
 Fill available main memory with blocks from the relation to be sorted.
 Sort the records in main memory (using, e.g., quicksort).
 Write the sorted records from main memory to "new" blocks of disk.
This yields one sorted sublist.
 Each block of the relation is read and written once in Phase 1.

Phase 2
 Each block is read and written once in each round/step of Phase 2.
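
As a concrete illustration, here is a minimal Python sketch of the two phases (hypothetical file names and a one-record-per-line format; a real DBMS would operate on blocks of fixed-length records):

import heapq, os, tempfile

def two_phase_merge_sort(input_path, output_path, memory_records=100_000):
    # Phase 1: sort memory-sized fills into sorted sublists (runs) on disk.
    run_paths = []
    with open(input_path) as f:
        while True:
            fill = [line for _, line in zip(range(memory_records), f)]
            if not fill:
                break
            fill.sort()                      # sort one fill in main memory
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(fill)         # write one sorted sublist
            run_paths.append(path)

    # Phase 2: a single multiway merge of all the runs
    # (assumes the number of runs does not exceed the available buffers).
    runs = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*runs))
    for r, p in zip(runs, run_paths):
        r.close()
        os.remove(p)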

Example: 2PMMS
 R has 10,000,000 tuples of 100 bytes each.
 Suppose block size = 4 KB = 4096 bytes.
 Each block can fit 40 tuples.
 R has 10^7/40 = 250,000 blocks.
 Suppose main memory is 50 MB (50 × 2^20 B)
➨ 50 × 2^20 / 2^12 = 50 × 2^8 = 12,800 blocks can fit into main memory.
 Assume the data blocks are placed on the disk randomly.
 Phase 1: fill main memory 250,000/12,800 = 20 times (250,000 = 19 × 12,800 + 6,800).
 One fill: read and write 12,800 blocks.
 Total average I/O time for Phase 1: 2 × 250,000 × 10.76 ms = 89.7 minutes.

Example: 2PMMS (Cont'd)
 Phase 2 (merging): read and write each block once.
 The same calculation as for Phase 1 ➨ 89.7 minutes.
 Total time to sort R using a 20-way merge-sort = Phase 1 + Phase 2 = 179.4 minutes ≈ 3 hours.
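
The example's arithmetic is easy to re-check; a quick script using the slide's 10.76 ms average random-block access time:

tuples, tuple_bytes = 10_000_000, 100
block_bytes, mem_bytes = 4096, 50 * 2**20
access_ms = 10.76                                    # avg. random-block access

tuples_per_block = block_bytes // tuple_bytes        # 40
blocks = tuples // tuples_per_block                  # 250,000
mem_blocks = mem_bytes // block_bytes                # 12,800
fills = -(-blocks // mem_blocks)                     # 20 (ceiling division)
phase1_minutes = 2 * blocks * access_ms / 1000 / 60  # read + write each block
print(fills, round(phase1_minutes, 1))               # 20 89.7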
Sometimes we need many passes in Phase 2
 R = 12 tuples, tuple = 1 byte, block size = 1 tuple, main memory input buffers = 3 blocks.
(A toy example to illustrate the point.)
 # of runs > # of main memory blocks (buffers); here: 4 > 3.

How many passes are needed?
 Block size B bytes, main memory M bytes, tuple R bytes, relation N tuples.
 File (relation) size = N×R bytes.
 # of main memory buffers = M/B blocks.
 We need one output buffer, so we can actually use (M/B)−1 input buffers.
 Phase 1 gives r = ⌈NR/M⌉ runs (sorted sublists).
 We can merge (M/B)−1 runs at a time.
 After pass 1, we have r/((M/B)−1) runs.
 After pass 2, we have r/((M/B)−1)^2 runs.
 After pass k, we have r/((M/B)−1)^k runs.
• We are done when there is only a single run left, i.e.,
r/((M/B)−1)^k = 1, that is, NR/M = ((M/B)−1)^k,
which gives k = ⌈log_{(M/B)−1}(NR/M)⌉.
• In our toy example: k = ⌈log_2(12/3)⌉ = 2.
• Total number of disk I/O's = (2k+1)(NR/B)
(we don't count the last write, since it will often be the input to another operation).
• In our example, total # of disk I/O's = 5 × 12 = 60.
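
A quick sanity check of the pass-count recurrence (a sketch; the run count and the merge order (M/B)−1 are taken from the toy example):

import math

def passes(runs, merge_order):
    # count Phase 2 passes until a single run remains
    k = 0
    while runs > 1:
        runs = math.ceil(runs / merge_order)
        k += 1
    return k

print(passes(runs=4, merge_order=2))   # 2 passes, as in the toy example
# total disk I/O's = (2k + 1)(NR/B) = 5 * 12 = 60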
Ways to improve disk access time
• Group blocks by cylinders ("cylindrification").
• Multiple disks (e.g., use multiple smaller disks instead of a single large disk).
• Mirroring disks.
• Use a good scheduling algorithm.
• Use pre-fetching and large buffers (bring the expected next blocks into main memory in advance).
Details follow.

Organize data by cylinders
• Seek time is about half of the average access time! Can we do something about it?
• Store the relation on one cylinder, or on several consecutive cylinders.
• To read the entire relation, we then need only one seek and one rotational delay.
• In the sorting example, we had 10.75 ms = 6.46 ms (seek time) + 4.16 ms (rotational latency) + 0.13 ms (block transfer time) to access a random block.

• Suppose the capacity of 1 cylinder is 1 MB.
• Size of R = 10^7 tuples of 100 B each = 10^9 B.
• We need 10^9/10^6 = 1000 cylinders.
• In Phase 1 of 2PMMS, we fill main memory 20 times (main memory = 50 MB = 50 cylinders)
➨ one fill = read 50 cylinders =
6.46 ms for one average seek
+ 49 × 1.00025 ms for the 49 one-cylinder moves
+ 50 × (16 × 8.33) ms (50 × the time to read the 16 tracks of a cylinder)
≈ 6.72 seconds.
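
A quick check of that fill-time estimate, using the slide's timing parameters:

avg_seek_ms, one_cyl_move_ms, track_ms = 6.46, 1.00025, 8.33
cylinders, tracks_per_cylinder = 50, 16

fill_ms = (avg_seek_ms                                    # one average seek
           + 49 * one_cyl_move_ms                         # 49 one-cylinder moves
           + cylinders * tracks_per_cylinder * track_ms)  # read all the tracks
print(round(fill_ms / 1000, 2))   # ~6.72 s per fill, so 20 fills ~ 134.4 s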
 20 fills = 20 × 6.72 s = 134.4 seconds.
 Write the sorted runs to 1000 consecutive cylinders ≈ 134.4 s.
 Phase 1 ≈ 268.8 seconds ≈ 4.5 minutes.
 Compare this with blocks of R placed randomly, which took about 90 minutes in Phase 1.

 Does "cylindrification" help in Phase 2?
 No, since blocks are read from the fronts of the sorted lists in an order that depends on
(1) the data, and
(2) which list has exhausted its current block.
o That is, output blocks are written one at a time, interspersed with block reads.
 Thus, the 2nd phase will still take 89.7 minutes (as computed earlier), and hence the total time to sort is 268.8/60 + 89.7 = 94.18 min.
 We cut the sorting time (179.4 min) by about 50%, but can't do much better by cylindrification alone.
Use multiple disks
 Use several disks with independent heads.
 Example: instead of one large disk, we use 4 (smaller) disks.
o We divide (stripe) the records of R among the disks; this occupies 1000 adjacent cylinders on each disk.
 One cylinder now stores 1/4 × 1 MB = 256 KB.
 Phase 1 of 2PMMS: load main memory from all 4 smaller disks in "parallel."
Time to read 1 MB (one cylinder from each disk, in parallel): 6.46 + (4 × 8.33) = 39.8 ms.
Compare this to 6.46 + (16 × 8.33) = 139 ms on the single large disk; in general we can expect a speedup of about 3 (with 4 smaller disks).
 Once a fill is sorted in main memory, we write the sorted blocks back onto the disks.
Note: splitting the output among several disks is not easy, as it requires careful programming to preserve the order among the sorted blocks.
[Figure: blocks 1-8 striped round-robin across the 4 disks.]
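
A minimal sketch of the round-robin striping just described (the disk numbering is illustrative):

def stripe(block_no, n_disks=4):
    # round-robin: block i lives on disk i mod n, at offset i div n
    return block_no % n_disks, block_no // n_disks

for b in range(8):
    disk, offset = stripe(b)
    print(f"block {b} -> disk {disk}, offset {offset}")
# blocks 0-3 go to disks 0-3; blocks 4-7 wrap around, as in the figure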
Mirroring disks
 Mirror disk = an identical copy of a disk.
 Improves reliability (when one disk crashes) at the expense of extra disk copies.
 With n copies we can handle n reads in the time of 1 read.
 A read can be done from the disk with the shortest seek time (i.e., the closest head).
 Writing: no speedup and no slowdown, compared to a single disk. (Why?)

Mirroring disks (Cont'd)
Writing: no speedup, no slowdown.
o Because whenever we write a block, we write it on all the disks having a copy of it.
o Since the writes can take place in parallel, the elapsed time is "about" the same as for writing to a single disk.
Obvious minus: the cost of the extra disks.
Useful (even essential) for some applications, such as banking, airline reservations, etc.
Scheduling: Elevator algorithm
 Useful when there are many block requests to choose from (not in our merge-sort example).
 Floors = cylinders; block requests = elevator calls.
 Example on the next slide, with:
avg. travel to a cylinder = 1 + (# of cylinders traveled)/4000 ms,
avg. rotational delay = 4.16 ms, and BTT = 0.13 ms.
Suppose the heads start on cylinder 8000.

Scheduling: Elevator/FIFO algorithms

Arrival times            Elevator                 FIFO
Cyl.      Request        Cyl.      Time           Cyl.      Time
Request   Time           Request   Finished       Request   Finished
8000      0              8000      4.3            8000      4.3
24000     0              24000     13.6           24000     13.6
56000     0              56000     26.9           56000     26.9
16000     10             64000     34.2           16000     42.2
64000     20             40000     45.5           64000     59.5
40000     30             16000     56.8           40000     70.8
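
The table can be reproduced with a small simulation; a sketch under the slide's assumptions (head starting at cylinder 8000; because the slide rounds each access to 0.1 ms before accumulating, the last digit of some finish times may differ by 0.1):

ROT_PLUS_BTT = 4.16 + 0.13          # rotational delay + block transfer (ms)

def access_ms(head, cyl):
    seek = 0.0 if cyl == head else 1 + abs(cyl - head) / 4000
    return seek + ROT_PLUS_BTT

def fifo(requests, head=8000):
    t, done = 0.0, []
    for cyl, arrival in requests:                # serve in arrival order
        t = max(t, arrival) + access_ms(head, cyl)
        head = cyl
        done.append((cyl, round(t, 1)))
    return done

def elevator(requests, head=8000):
    t, done, pending, up = 0.0, [], [], True
    todo = sorted(requests, key=lambda r: r[1])  # queue by arrival time
    while todo or pending:
        while todo and todo[0][1] <= t:          # admit arrived requests
            pending.append(todo.pop(0)[0])
        if not pending:
            t = todo[0][1]                       # idle until next arrival
            continue
        ahead = [c for c in pending if (c >= head if up else c <= head)]
        if not ahead:
            up = not up                          # reverse sweep direction
            continue
        cyl = min(ahead, key=lambda c: abs(c - head))
        pending.remove(cyl)
        t += access_ms(head, cyl)
        head = cyl
        done.append((cyl, round(t, 1)))
    return done

reqs = [(8000, 0), (24000, 0), (56000, 0), (16000, 10), (64000, 20), (40000, 30)]
print(fifo(reqs))       # serves 8000, 24000, 56000, 16000, 64000, 40000
print(elevator(reqs))   # the sweep serves 64000 before turning back to 16000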
Prefetching (Double Buffering)
• If we can anticipate the next needed blocks, we can pre-fetch them into main memory.
• Example:
File: B1 B2 B3 …
Program: Process B1; Process B2; Process B3; …
• Single buffer: Read B1; Proc B1; Read B2; Proc B2; …
• Time: #OfBlocks × (time(Read1Block) + time(Proc1Block)).
• Can use this method in combination with the elevator algorithm or the cylinder-based strategies to improve the speedup.

Double Buffering
[Figure: the process consumes one memory buffer while the disk fills the other; blocks A, B, C, … stream in from disk.]
Time: Proc1Block + #OfBlocks × Read1Block (i.e., P + nR) when time(Read1Block) > time(Proc1Block); otherwise R + nP, when time(Proc1Block) > time(Read1Block).
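
A minimal double-buffering sketch (read_block and process_block are hypothetical stand-ins): a background thread fetches block i+1 while the main thread processes block i, so reading and processing overlap.

import threading

def read_block(i):            # stand-in for a disk read
    return f"block-{i}"

def process_block(block):     # stand-in for the in-memory processing
    pass

def run(n_blocks):
    current = read_block(0)                  # fill the first buffer
    for i in range(n_blocks):
        prefetched, t = {}, None
        if i + 1 < n_blocks:                 # start filling the other buffer
            t = threading.Thread(
                target=lambda: prefetched.setdefault("buf", read_block(i + 1)))
            t.start()
        process_block(current)               # consume the full buffer
        if t:
            t.join()
            current = prefetched["buf"]      # swap buffers

run(8)   # total time ~ P + nR when reads dominate, R + nP otherwise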
Prefetching and large buffers
 Example: Phase 2 of 2PMMS (1000 cylinders on 8 surfaces).
 With 50 MB of main memory, we can afford 2 track-sized buffers for each sublist and 2 for the output (1 track = 128 KB) --> 42 tracks ≈ 5 MB.
 Consume one track for each sublist while the other is being loaded.
 Similarly, write one output track while the other is being constructed.
 Effectively, this hides the in-memory processing time behind the I/O.
 A related issue here is block-size selection (a trade-off): a big block size amortizes the I/O cost, but space is wasted if a block is not fully used.

So far, we saw how disk access time (performance) may be improved depending on
(1) the operation at hand, and
(2) the way the disk works.
Next, we will look at ways to mitigate disk failures and hence improve disk reliability. Failures should be detected and recovered from, if possible; this ability is essential for continuous operation.
Performance/Reliability of Disk Systems
• We looked at ways to improve the performance of disk systems.
• Now we will look at ways to improve their reliability.
• What is reliability? Availability of the data when there is a disk "failure" of some sort.
• This is achieved at the cost of some redundancy (of data and/or disks).

Disk failures – A classification
 An attempt to access the disk to read or write may fail. A possible classification:
 A temporary read problem → intermittent failure.
 If the problem is local to a sector:
o a problem in reading → media decay;
o a problem in writing → write failure.
 If the entire disk fails suddenly and permanently → disk crash.
Parity checks, stable storage, and RAID are techniques to "cope" with these failures, respectively.
Checksums for failure detection
 A useful model of a disk read: the reading function returns (w, s), where w is the data in the sector that is read and s is a status bit.
 How does s get its "good" or "bad" value? Easy! Each sector has additional bits, called the checksum (written by the disk controller).
 A simple form of checksum is the parity bit (the last bit below):
011010001
111011100
 Even parity: the number of 1's in the data bits and their parity bit together is always even.

Detecting errors by checksums
 The function Read(w, s) returns the value "good" for s if w has an even number of 1's; otherwise, s = "bad".
 It is possible that more than one bit in a sector gets corrupted, and hence an error may go undetected.
 Suppose bits get corrupted randomly: the probability of an undetected error (i.e., the number of 1's staying even) is then 50%. (Why?)
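
A small sketch of this parity scheme, reproducing the slide's first example sector:

def parity_bit(data_bits):
    # the bit that makes the total number of 1's even
    return str(data_bits.count("1") % 2)

def read(sector):
    # model of Read: sector = data bits + parity bit; s is the status
    status = "good" if sector.count("1") % 2 == 0 else "bad"
    return sector[:-1], status

sector = "01101000" + parity_bit("01101000")   # "011010001", as above
print(read(sector))                            # ('01101000', 'good')
print(read("111010001"))                       # one bit flipped -> 'bad'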
(Interleaved) Parity bits
• Suppose we have 8 parity bits, one per bit position:
01110110   Byte 1
11001101   Byte 2
00001111   Byte 3
10110100   Byte of parity bits
• With n parity bits, the probability of an undetected error = 1/2^n.
• Checksums/parity may help detect, but not correct, errors.

Recovery from disk/head crashes
 Use data and/or disk redundancy to protect against permanently destroyed disks.
 Mean time to failure = the time when 50% of the disks have crashed, e.g., 10 years.
 Simplified (assuming crashes occur linearly):
o in the 1st year, 5% of the disks fail, …
o in the 2nd year, another 5%, …
 However, the mean time to a disk crash doesn't have to be the same as the mean time to data loss; there are solutions.
Redundant Array of Independent Disks (RAID)
 RAID 1: mirror each disk (data disks / redundant disks).
 If a disk fails, restore it using the mirror.
 Probability of the mirror disk crashing while restoring:
Suppose each disk lasts 10 years on average, and assume 3 hrs (i.e., 1/2920 of a year) to replace a disk.
The probability that the mirror disk fails during the copying is 1/10 × 1/2920 = 1/29,200.
If 1 disk fails on average every 10 years
➨ a disk or its mirror fails on average every 5 years.
So it takes on average 5 × 29,200 = 146,000 years for a non-recoverable error to occur.

RAID 4
• Problem with RAID 1 (also called mirroring): n data disks require n redundant disks.
• RAID 4: one redundant disk only (dedicated parity).
• x ⊕ y = the modulo-2 sum of x and y (XOR), e.g., 11110000 ⊕ 10101010 = 01011010.
• For any n: we have n data disks and 1 redundant disk.
• Each block on the redundant disk holds the parity bits for the corresponding blocks on the data disks (block-interleaved parity).
• Number the blocks on each disk 1, 2, 3, …, k. Example:
i-th block of data disk 1:    11110000
i-th block of data disk 2:    10101010
i-th block of data disk 3:    00111000
i-th block of redundant disk: 01100010
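
A quick check of the example: the redundant block is the bitwise XOR of the corresponding data blocks.

from functools import reduce

def xor(a, b):
    # bitwise XOR of two equal-length bit strings
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

data_blocks = ["11110000", "10101010", "00111000"]
print(reduce(xor, data_blocks))    # 01100010, as on the slide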
Properties of XOR (⊕)
 Commutative: x ⊕ y = y ⊕ x
 Associative: x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ z
 Identity: x ⊕ 0 = 0 ⊕ x = x (0 is the all-zeros vector)
 Self-inverse: x ⊕ x = 0
o As a useful consequence, if x ⊕ y = z, then we can "add" x to both sides and get y = x ⊕ z.

RAID 4 (Cont'd)
• Reading: as usual.
o Interesting possibility: if we want to read a block from disk i, but it is busy and all the other disks are free, then we can instead read the corresponding blocks from all the other disks and compute their modulo-2 sum.
• Writing:
o Write the block to disk i.
o Update the corresponding block on the redundant disk.
o This also means the # of writes to the redundant disk is n times the average number of writes to any one data disk.
How do we get the value for the redundant block after a write?
 Naively: read the n−1 other data blocks
➨ n+1 disk I/O's (n−1 block reads, 1 data-block write, 1 redundant-block write).
 Can we do better? How?

 Better to do: suppose the new value is v. To write v at block j of data disk i:
1. Read the old value of block j, say o.
2. Read the j-th block of the redundant disk, say r.
3. Compute w = v ⊕ o ⊕ r.
4. Write v in block j of data disk i.
5. Write w in block j of the redundant disk.
 Total: 4 disk I/O's, for ANY n (# of data disks).
 Why does this work?
o Idea: v ⊕ o is the "change" to the parity, and hence the redundant disk must change by that amount to compensate.
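
A sketch of this 4-I/O write (bit strings stand in for disk blocks, and the "I/O's" are just list accesses here):

def xor(a, b):
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

def raid4_write(data, parity, i, v):
    o = data[i]               # I/O 1: read the old data block
    r = parity                # I/O 2: read the old parity block
    w = xor(xor(v, o), r)     # v XOR o is the change; fold it into the parity
    data[i] = v               # I/O 3: write the new data block
    return data, w            # I/O 4: write the new parity block

data, parity = ["11110000", "10101010", "00111000"], "01100010"
data, parity = raid4_write(data, parity, 1, "00001111")
print(parity)                 # 11000111 = 11110000 XOR 00001111 XOR 00111000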
Failure recovery in RAID 4
 When there is a disk crash:
if it is the redundant disk, then replace it with a new disk and re-compute the redundant blocks;
else (it is a data disk), replace it with a new disk and re-compute its data blocks from all the other disks.
 The rule to re-compute any missing data is simple, and it is the same for ANY disk, data or redundant. Use the equation
0 = x_1 ⊕ … ⊕ x_n ⊕ x_red
which gives
x_j = x_1 ⊕ … ⊕ x_(j-1) ⊕ x_(j+1) ⊕ … ⊕ x_n ⊕ x_red.
 Example:
i-th block of disk 1:    11110000
i-th block of disk 2:    10101010
i-th block of disk 3:    00111000
i-th block of red. disk: 01100010
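
The recovery rule in code, recomputing disk 2's block of the example from the surviving disks:

from functools import reduce

def xor(a, b):
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

survivors = ["11110000", "00111000", "01100010"]   # disks 1, 3, redundant
print(reduce(xor, survivors))                      # 10101010 = lost disk 2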
RAID 5 (block-interleaved distributed parity)
• RAID 4 is cheaper than mirroring but still has problems:
o Problem 1: the redundant disk is involved in every write.
o Problem 2: it does not survive 2 simultaneous crashes!
• RAID 5 fixes Problem 1 by distributing the parity among all n+1 disks (numbered 0, …, n).
• Cylinder j on disk i is treated as the parity cylinder iff i = j mod (n+1).
• Example: suppose n = 3, so we have 4 disks.
o Disk 0 is redundant for blocks 0, 4, 8, 12, etc. (because they leave remainder 0 when divided by 4).
o Disk 1 is redundant for blocks 1, 5, 9, etc.

RAID 5 (Cont'd)
 The reading/writing load for each disk is the same.
 Problem: in one block write, what's the probability that a given disk is involved?
o Each disk has probability 1/4 of holding the block.
o If not (probability 3/4), it has a 1/3 chance of being the redundant disk for that block.
o So each of the four disks is involved in 1/4 + (3/4) × (1/3) = 1/2 of the writes.
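
The placement rule in a few lines of Python, for the n = 3 example:

n = 3
for j in range(9):
    # the parity copy of block j lives on disk j mod (n + 1)
    print(f"block {j}: parity on disk {j % (n + 1)}")
# disk 0 gets blocks 0, 4, 8, ...; disk 1 gets 1, 5, ...; and so on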
RAID 6
 Allows recovery from multiple simultaneous failures.
 Essentially, applies RAID 4 to groups of disks.
 Disks are grouped, for applying RAID 4, using a redundancy bit pattern (one column per disk, one row per group).

RAID 6 (Cont'd)
Example: how to recover from 2 simultaneous failures.
[Figure: the redundancy bit pattern.]
Line 2 of the redundancy bit pattern says: apply RAID 4 to disks 1, 2, 4, and 6.
RAID 6 (Cont’d) RAID 6 (Cont’d)
44
How do we find a redundancy bit pattern? • Reading is as in RAID 4, e.g., read from the disk
o Columns should be different. containing the data.
• How about writing?

o We have all the combinations of bits in the Suppose we rewrite the first block of
columns but all 0’s. disk 2 to be 00001111.
We then compute the change:
00001111 ⊕ 10101010 = 10100101

Since disk 2 has 1’s on rows 1 and 2, we see


redundant disks 5 and 6 are relevant. Thus the
change has to propagate to disks 5 and 6, changing
them to 11000111 and 10111110, resp.
43
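
A sketch of this write, assuming the classic 7-disk redundancy bit pattern in which column i is the 3-bit binary representation of disk number i (the pattern figure itself is not reproduced in these notes); disk 6's old first block is the value implied by the slide's result:

from functools import reduce

def xor(a, b):
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

# pattern[r][i-1] == "1"  <=>  disk i is in the RAID 4 group of row r+1
pattern = ["1110100",    # row 1: data disks 1,2,3 + redundant disk 5
           "1101010",    # row 2: data disks 1,2,4 + redundant disk 6
           "1011001"]    # row 3: data disks 1,3,4 + redundant disk 7

block = {2: "10101010", 5: "01100010", 6: "00011011"}   # first blocks

new = "00001111"
change = xor(new, block[2])            # 10100101, as on the slide
for r, row in enumerate(pattern):
    if row[2 - 1] == "1":              # disk 2 belongs to this row's group
        red_disk = 5 + r               # redundant disk of row r+1
        block[red_disk] = xor(block[red_disk], change)

print(block[5], block[6])              # 11000111 10111110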
RAID 6 (Cont’d)
45

 Suppose disks a and b fail. How to recover? Example:


 Find a row r where a and b differ.
Assume in row r , a is 1 and b is 0.
 Since all columns (of the pattern matrix) are
different, we must find some row r in which the
columns for a and b are different.
 Recover disk a according to row r
 Recover disk b according to any row that has 1 for
Before failure After failure
b

46

In-class exercise
• Suppose we have four disks: 1 and 2 are data disks; 3 and 4 are redundant.
• Disk 3 is a mirror of disk 1. Disk 4 holds the parity check bits for disks 2 and 3.
• Which combinations of simultaneous 2-disk failures can we recover from?
• Disk pairs to consider:
1. {1,2}
2. {1,3}
3. {1,4}
4. {2,3}
5. {2,4}
6. {3,4}
• We can recover from all the above crash pairs except the 5th: we can't recover if disks 2 & 4 crash. Why? With x on disk 1 and z on disk 2, the four disks hold:
1   2   3   4
==============
x   z   x   z⊕x
If 2 and 4 both crash, only the two copies of x survive, and z cannot be reconstructed.
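
A small sketch that checks all six pairs with concrete bit patterns (the two recovery rules: disk 3 mirrors disk 1, and any two of disks 2, 3, 4 determine the third via parity):

x, z = 0b11110000, 0b10101010
disks = {1: x, 2: z, 3: x, 4: z ^ x}

def recoverable(a, b):
    # start from the two survivors and apply the rules until stuck
    known = {d: v for d, v in disks.items() if d not in (a, b)}
    for _ in range(2):
        if 1 in known: known.setdefault(3, known[1])    # mirror rule
        if 3 in known: known.setdefault(1, known[3])
        have = [d for d in (2, 3, 4) if d in known]
        if len(have) == 2:                              # parity rule
            missing = ({2, 3, 4} - set(have)).pop()
            known.setdefault(missing, known[have[0]] ^ known[have[1]])
    return len(known) == 4

for pair in [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]:
    print(pair, recoverable(*pair))    # True for all pairs except (2, 4)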