
Two-phase Multiway Merge-sort (2PMMS)

Basic merge of two sorted lists:
 Phase 1: Sort main-memory-sized partitions to produce the sorted sublists, also called runs.
 Phase 2: Merge the sorted sublists into a single sorted list.

Phase 1
 Fill available main memory with blocks from the relation to be sorted.
 Sort the records in main memory (using, e.g., quicksort).
 Write the sorted records from main memory to "new" blocks of disk.
This yields one sorted sublist.
 Each block of the relation is read and written once in Phase 1.

Phase 2
 Each block is read and written once in each round/step of Phase 2.
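
As a concrete illustration, here is a minimal Python sketch of the two phases (hypothetical file names and a one-record-per-line format; a real DBMS would operate on blocks of fixed-length records):

import heapq, os, tempfile

def two_phase_merge_sort(input_path, output_path, memory_records=100_000):
    # Phase 1: sort memory-sized fills into sorted sublists (runs) on disk.
    run_paths = []
    with open(input_path) as f:
        while True:
            fill = [line for _, line in zip(range(memory_records), f)]
            if not fill:
                break
            fill.sort()                      # sort one fill in main memory
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(fill)         # write one sorted sublist
            run_paths.append(path)

    # Phase 2: a single multiway merge of all the runs
    # (assumes the number of runs does not exceed the available buffers).
    runs = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*runs))
    for r, p in zip(runs, run_paths):
        r.close()
        os.remove(p)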

Example: 2PMMS
 R has 10,000,000 tuples of 100 bytes each.
 Suppose block size = 4 KB = 4096 bytes.
 Each block can fit 40 tuples.
 R has 10^7/40 = 250,000 blocks.
 Suppose main memory is 50 MB (50 × 2^20 B)
➨ 50 × 2^20 / 2^12 = 50 × 2^8 = 12,800 blocks can fit into main memory.
 Assume the data blocks are placed on the disk randomly.
 Phase 1: fill main memory 250,000/12,800 = 20 times (250,000 = 19 × 12,800 + 6,800).
 One fill: read and write 12,800 blocks.
 Total average I/O time for Phase 1: 2 × 250,000 × 10.76 ms = 89.7 minutes.

Example: 2PMMS (Cont'd)
 Phase 2 (merging): read and write each block once.
 The same calculation as for Phase 1 ➨ 89.7 minutes.
 Total time to sort R using a 20-way merge-sort = Phase 1 + Phase 2 = 179.4 minutes ≈ 3 hours.
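
The example's arithmetic is easy to re-check; a quick script using the slide's 10.76 ms average random-block access time:

tuples, tuple_bytes = 10_000_000, 100
block_bytes, mem_bytes = 4096, 50 * 2**20
access_ms = 10.76                                    # avg. random-block access

tuples_per_block = block_bytes // tuple_bytes        # 40
blocks = tuples // tuples_per_block                  # 250,000
mem_blocks = mem_bytes // block_bytes                # 12,800
fills = -(-blocks // mem_blocks)                     # 20 (ceiling division)
phase1_minutes = 2 * blocks * access_ms / 1000 / 60  # read + write each block
print(fills, round(phase1_minutes, 1))               # 20 89.7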
Sometimes we need many passes in Phase 2
 R = 12 tuples, tuple = 1 byte, block size = 1 tuple, main memory input buffers = 3 blocks.
(A toy example to illustrate the point.)
 # of runs > # of main memory blocks (buffers); here: 4 > 3.

How many passes are needed?
 Block size B bytes, main memory M bytes, tuple R bytes, relation N tuples.
 File (relation) size = N×R bytes.
 # of main memory buffers = M/B blocks.
 We need one output buffer, so we can actually use (M/B)−1 input buffers.
 Phase 1 gives r = ⌈NR/M⌉ runs (sorted sublists).
 We can merge (M/B)−1 runs at a time.
 After pass 1, we have r/((M/B)−1) runs.
 After pass 2, we have r/((M/B)−1)^2 runs.
 After pass k, we have r/((M/B)−1)^k runs.
• We are done when there is only a single run left, i.e.,
r/((M/B)−1)^k = 1, that is, NR/M = ((M/B)−1)^k,
which gives k = ⌈log_{(M/B)−1}(NR/M)⌉.
• In our toy example: k = ⌈log_2(12/3)⌉ = 2.
• Total number of disk I/O's = (2k+1)(NR/B)
(we don't count the last write, since it will often be the input to another operation).
• In our example, total # of disk I/O's = 5 × 12 = 60.
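
A quick sanity check of the pass-count recurrence (a sketch; the run count and the merge order (M/B)−1 are taken from the toy example):

import math

def passes(runs, merge_order):
    # count Phase 2 passes until a single run remains
    k = 0
    while runs > 1:
        runs = math.ceil(runs / merge_order)
        k += 1
    return k

print(passes(runs=4, merge_order=2))   # 2 passes, as in the toy example
# total disk I/O's = (2k + 1)(NR/B) = 5 * 12 = 60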
Ways to improve disk access time
• Group blocks by cylinders ("cylindrification").
• Multiple disks (e.g., use multiple smaller disks instead of a single large disk).
• Mirroring disks.
• Use a good scheduling algorithm.
• Use pre-fetching and large buffers (bring the expected next blocks into main memory in advance).
Details follow.

Organize data by cylinders
• Seek time is about half of the average access time! Can we do something about it?
• Store the relation on one cylinder, or on several consecutive cylinders.
• To read the entire relation, we then need only one seek and one rotational delay.
• In the sorting example, we had 10.75 ms = 6.46 ms (seek time) + 4.16 ms (rotational latency) + 0.13 ms (block transfer time) to access a random block.

• Suppose the capacity of 1 cylinder is 1 MB.
• Size of R = 10^7 tuples of 100 B each = 10^9 B.
• We need 10^9/10^6 = 1000 cylinders.
• In Phase 1 of 2PMMS, we fill main memory 20 times (main memory = 50 MB = 50 cylinders)
➨ one fill = read 50 cylinders =
6.46 ms for one average seek
+ 49 × 1.00025 ms for the 49 one-cylinder moves
+ 50 × (16 × 8.33) ms (50 × the time to read the 16 tracks of a cylinder)
≈ 6.72 seconds.
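
A quick check of that fill-time estimate, using the slide's timing parameters:

avg_seek_ms, one_cyl_move_ms, track_ms = 6.46, 1.00025, 8.33
cylinders, tracks_per_cylinder = 50, 16

fill_ms = (avg_seek_ms                                    # one average seek
           + 49 * one_cyl_move_ms                         # 49 one-cylinder moves
           + cylinders * tracks_per_cylinder * track_ms)  # read all the tracks
print(round(fill_ms / 1000, 2))   # ~6.72 s per fill, so 20 fills ~ 134.4 s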
 20 fills = 20 × 6.72 s = 134.4 seconds.
 Write the sorted runs to 1000 consecutive cylinders ≈ 134.4 s.
 Phase 1 ≈ 268.8 seconds ≈ 4.5 minutes.
 Compare this with blocks of R placed randomly, which took about 90 minutes in Phase 1.

 Does "cylindrification" help in Phase 2?
 No, since blocks are read from the fronts of the sorted lists in an order that depends on
(1) the data, and
(2) which list has exhausted its current block.
o That is, output blocks are written one at a time, interspersed with block reads.
 Thus, the 2nd phase will still take 89.7 minutes (as computed earlier), and hence the total time to sort is 268.8/60 + 89.7 = 94.18 min.
 We cut the sorting time (179.4 min) by about 50%, but can't do much better by cylindrification alone.
Use multiple disks
 Use several disks with independent heads.
 Example: instead of one large disk, we use 4 (smaller) disks.
o We divide (stripe) the records of R among the disks; this occupies 1000 adjacent cylinders on each disk.
 One cylinder now stores 1/4 × 1 MB = 256 KB.
 Phase 1 of 2PMMS: load main memory from all 4 smaller disks in "parallel."
Time to read 1 MB (one cylinder from each disk, in parallel): 6.46 + (4 × 8.33) = 39.8 ms.
Compare this to 6.46 + (16 × 8.33) = 139 ms on the single large disk; in general we can expect a speedup of about 3 (with 4 smaller disks).
 Once a fill is sorted in main memory, we write the sorted blocks back onto the disks.
Note: splitting the output among several disks is not easy, as it requires careful programming to preserve the order among the sorted blocks.
[Figure: blocks 1-8 striped round-robin across the 4 disks.]
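
A minimal sketch of the round-robin striping just described (the disk numbering is illustrative):

def stripe(block_no, n_disks=4):
    # round-robin: block i lives on disk i mod n, at offset i div n
    return block_no % n_disks, block_no // n_disks

for b in range(8):
    disk, offset = stripe(b)
    print(f"block {b} -> disk {disk}, offset {offset}")
# blocks 0-3 go to disks 0-3; blocks 4-7 wrap around, as in the figure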
Mirroring disks
 Mirror disk = an identical copy of a disk.
 Improves reliability (when one disk crashes) at the expense of extra disk copies.
 With n copies we can handle n reads in the time of 1 read.
 A read can be done from the disk with the shortest seek time (i.e., the closest head).
 Writing: no speedup and no slowdown, compared to a single disk. (Why?)

Mirroring disks (Cont'd)
Writing: no speedup, no slowdown.
o Because whenever we write a block, we write it on all the disks having a copy of it.
o Since the writes can take place in parallel, the elapsed time is "about" the same as for writing to a single disk.
Obvious minus: the cost of the extra disks.
Useful (even essential) for some applications, such as banking, airline reservations, etc.
Scheduling: Elevator algorithm
 Useful when there are many block requests to choose from (not in our merge-sort example).
 Floors = cylinders; block requests = elevator calls.
 Example on the next slide, with:
avg. travel to a cylinder = 1 + (# of cylinders traveled)/4000 ms,
avg. rotational delay = 4.16 ms, and BTT = 0.13 ms.
Suppose the heads start on cylinder 8000.

Scheduling: Elevator/FIFO algorithms

Arrival times            Elevator                 FIFO
Cyl.      Request        Cyl.      Time           Cyl.      Time
Request   Time           Request   Finished       Request   Finished
8000      0              8000      4.3            8000      4.3
24000     0              24000     13.6           24000     13.6
56000     0              56000     26.9           56000     26.9
16000     10             64000     34.2           16000     42.2
64000     20             40000     45.5           64000     59.5
40000     30             16000     56.8           40000     70.8
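
The table can be reproduced with a small simulation; a sketch under the slide's assumptions (head starting at cylinder 8000; because the slide rounds each access to 0.1 ms before accumulating, the last digit of some finish times may differ by 0.1):

ROT_PLUS_BTT = 4.16 + 0.13          # rotational delay + block transfer (ms)

def access_ms(head, cyl):
    seek = 0.0 if cyl == head else 1 + abs(cyl - head) / 4000
    return seek + ROT_PLUS_BTT

def fifo(requests, head=8000):
    t, done = 0.0, []
    for cyl, arrival in requests:                # serve in arrival order
        t = max(t, arrival) + access_ms(head, cyl)
        head = cyl
        done.append((cyl, round(t, 1)))
    return done

def elevator(requests, head=8000):
    t, done, pending, up = 0.0, [], [], True
    todo = sorted(requests, key=lambda r: r[1])  # queue by arrival time
    while todo or pending:
        while todo and todo[0][1] <= t:          # admit arrived requests
            pending.append(todo.pop(0)[0])
        if not pending:
            t = todo[0][1]                       # idle until next arrival
            continue
        ahead = [c for c in pending if (c >= head if up else c <= head)]
        if not ahead:
            up = not up                          # reverse sweep direction
            continue
        cyl = min(ahead, key=lambda c: abs(c - head))
        pending.remove(cyl)
        t += access_ms(head, cyl)
        head = cyl
        done.append((cyl, round(t, 1)))
    return done

reqs = [(8000, 0), (24000, 0), (56000, 0), (16000, 10), (64000, 20), (40000, 30)]
print(fifo(reqs))       # serves 8000, 24000, 56000, 16000, 64000, 40000
print(elevator(reqs))   # the sweep serves 64000 before turning back to 16000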
Prefetching (Double Buffering)
• If we can anticipate the next needed blocks, we can pre-fetch them into main memory.
• Example:
File: B1 B2 B3 …
Program: Process B1; Process B2; Process B3; …
• Single buffer: Read B1; Proc B1; Read B2; Proc B2; …
• Time: #OfBlocks × (time(Read1Block) + time(Proc1Block)).
• Can use this method in combination with the elevator algorithm or the cylinder-based strategies to improve the speedup.

Double Buffering
[Figure: the process consumes one memory buffer while the disk fills the other; blocks A, B, C, … stream in from disk.]
Time: Proc1Block + #OfBlocks × Read1Block (i.e., P + nR) when time(Read1Block) > time(Proc1Block); otherwise R + nP, when time(Proc1Block) > time(Read1Block).
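
A minimal double-buffering sketch (read_block and process_block are hypothetical stand-ins): a background thread fetches block i+1 while the main thread processes block i, so reading and processing overlap.

import threading

def read_block(i):            # stand-in for a disk read
    return f"block-{i}"

def process_block(block):     # stand-in for the in-memory processing
    pass

def run(n_blocks):
    current = read_block(0)                  # fill the first buffer
    for i in range(n_blocks):
        prefetched, t = {}, None
        if i + 1 < n_blocks:                 # start filling the other buffer
            t = threading.Thread(
                target=lambda: prefetched.setdefault("buf", read_block(i + 1)))
            t.start()
        process_block(current)               # consume the full buffer
        if t:
            t.join()
            current = prefetched["buf"]      # swap buffers

run(8)   # total time ~ P + nR when reads dominate, R + nP otherwise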
Prefetching and large buffers
 Example: Phase 2 of 2PMMS (1000 cylinders on 8 surfaces).
 With 50 MB of main memory, we can afford 2 track-sized buffers for each sublist and 2 for the output (1 track = 128 KB) --> 42 tracks ≈ 5 MB.
 Consume one track for each sublist while the other is being loaded.
 Similarly, write one output track while the other is being constructed.
 Effectively, this hides the in-memory processing time behind the I/O.
 A related issue here is block-size selection (a trade-off): a big block size amortizes the I/O cost, but space is wasted if a block is not fully used.

So far, we saw how disk access time (performance) may be improved depending on
(1) the operation at hand, and
(2) the way the disk works.
Next, we will look at ways to mitigate disk failures and hence improve disk reliability. Failures should be detected and recovered from, if possible; this ability is essential for continuous operation.
Performance/Reliability of Disk Systems
• We looked at ways to improve the performance of disk systems.
• Now we will look at ways to improve their reliability.
• What is reliability? Availability of the data when there is a disk "failure" of some sort.
• This is achieved at the cost of some redundancy (of data and/or disks).

Disk failures – A classification
 An attempt to access the disk to read or write may fail. A possible classification:
 A temporary read problem → intermittent failure.
 If the problem is local to a sector:
o a problem in reading → media decay;
o a problem in writing → write failure.
 If the entire disk fails suddenly and permanently → disk crash.
Parity checks, stable storage, and RAID are techniques to "cope" with these failures, respectively.
Checksums for failure detection
 A useful model of a disk read: the reading function returns (w, s), where w is the data in the sector that is read and s is a status bit.
 How does s get its "good" or "bad" value? Easy! Each sector has additional bits, called the checksum (written by the disk controller).
 A simple form of checksum is the parity bit (the last bit below):
011010001
111011100
 Even parity: the number of 1's in the data bits and their parity bit together is always even.

Detecting errors by checksums
 The function Read(w, s) returns the value "good" for s if w has an even number of 1's; otherwise, s = "bad".
 It is possible that more than one bit in a sector gets corrupted, and hence an error may go undetected.
 Suppose bits get corrupted randomly: the probability of an undetected error (i.e., the number of 1's staying even) is then 50%. (Why?)
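
A small sketch of this parity scheme, reproducing the slide's first example sector:

def parity_bit(data_bits):
    # the bit that makes the total number of 1's even
    return str(data_bits.count("1") % 2)

def read(sector):
    # model of Read: sector = data bits + parity bit; s is the status
    status = "good" if sector.count("1") % 2 == 0 else "bad"
    return sector[:-1], status

sector = "01101000" + parity_bit("01101000")   # "011010001", as above
print(read(sector))                            # ('01101000', 'good')
print(read("111010001"))                       # one bit flipped -> 'bad'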
(Interleaved) Parity bits
• Suppose we have 8 parity bits, one per bit position:
01110110   Byte 1
11001101   Byte 2
00001111   Byte 3
10110100   Byte of parity bits
• With n parity bits, the probability of an undetected error = 1/2^n.
• Checksums/parity may help detect, but not correct, errors.

Recovery from disk/head crashes
 Use data and/or disk redundancy to protect against permanently destroyed disks.
 Mean time to failure = the time when 50% of the disks have crashed, e.g., 10 years.
 Simplified (assuming crashes occur linearly):
o in the 1st year, 5% of the disks fail, …
o in the 2nd year, another 5%, …
 However, the mean time to a disk crash doesn't have to be the same as the mean time to data loss; there are solutions.
Redundant Array of Independent Disks (RAID)
 RAID 1: mirror each disk (data disks / redundant disks).
 If a disk fails, restore it using the mirror.
 Probability of the mirror disk crashing while restoring:
Suppose each disk lasts 10 years on average, and assume 3 hrs (i.e., 1/2920 of a year) to replace a disk.
The probability that the mirror disk fails during the copying is 1/10 × 1/2920 = 1/29,200.
If 1 disk fails on average every 10 years
➨ a disk or its mirror fails on average every 5 years.
So it takes on average 5 × 29,200 = 146,000 years for a non-recoverable error to occur.

RAID 4
• Problem with RAID 1 (also called mirroring): n data disks require n redundant disks.
• RAID 4: one redundant disk only (dedicated parity).
• x ⊕ y = the modulo-2 sum of x and y (XOR), e.g., 11110000 ⊕ 10101010 = 01011010.
• For any n: we have n data disks and 1 redundant disk.
• Each block on the redundant disk holds the parity bits for the corresponding blocks on the data disks (block-interleaved parity).
• Number the blocks on each disk 1, 2, 3, …, k. Example:
i-th block of data disk 1:    11110000
i-th block of data disk 2:    10101010
i-th block of data disk 3:    00111000
i-th block of redundant disk: 01100010
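
A quick check of the example: the redundant block is the bitwise XOR of the corresponding data blocks.

from functools import reduce

def xor(a, b):
    # bitwise XOR of two equal-length bit strings
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

data_blocks = ["11110000", "10101010", "00111000"]
print(reduce(xor, data_blocks))    # 01100010, as on the slide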
Properties of XOR (⊕)
 Commutative: x ⊕ y = y ⊕ x
 Associative: x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ z
 Identity: x ⊕ 0 = 0 ⊕ x = x (0 is the all-zeros vector)
 Self-inverse: x ⊕ x = 0
o As a useful consequence, if x ⊕ y = z, then we can "add" x to both sides and get y = x ⊕ z.

RAID 4 (Cont'd)
• Reading: as usual.
o Interesting possibility: if we want to read a block from disk i, but it is busy and all the other disks are free, then we can instead read the corresponding blocks from all the other disks and compute their modulo-2 sum.
• Writing:
o Write the block to disk i.
o Update the corresponding block on the redundant disk.
o This also means the # of writes to the redundant disk is n times the average number of writes to any one data disk.
How do we get the value for the redundant block after a write?
 Naively: read the n−1 other data blocks
➨ n+1 disk I/O's (n−1 block reads, 1 data-block write, 1 redundant-block write).
 Can we do better? How?

 Better to do: suppose the new value is v. To write v at block j of data disk i:
1. Read the old value of block j, say o.
2. Read the j-th block of the redundant disk, say r.
3. Compute w = v ⊕ o ⊕ r.
4. Write v in block j of data disk i.
5. Write w in block j of the redundant disk.
 Total: 4 disk I/O's, for ANY n (# of data disks).
 Why does this work?
o Idea: v ⊕ o is the "change" to the parity, and hence the redundant disk must change by that amount to compensate.
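
A sketch of this 4-I/O write (bit strings stand in for disk blocks, and the "I/O's" are just list accesses here):

def xor(a, b):
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

def raid4_write(data, parity, i, v):
    o = data[i]               # I/O 1: read the old data block
    r = parity                # I/O 2: read the old parity block
    w = xor(xor(v, o), r)     # v XOR o is the change; fold it into the parity
    data[i] = v               # I/O 3: write the new data block
    return data, w            # I/O 4: write the new parity block

data, parity = ["11110000", "10101010", "00111000"], "01100010"
data, parity = raid4_write(data, parity, 1, "00001111")
print(parity)                 # 11000111 = 11110000 XOR 00001111 XOR 00111000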
Failure recovery in RAID 4
 When there is a disk crash:
if it is the redundant disk, then replace it with a new disk and re-compute the redundant blocks;
else (it is a data disk), replace it with a new disk and re-compute its data blocks from all the other disks.
 The rule to re-compute any missing data is simple, and it is the same for ANY disk, data or redundant. Use the equation
0 = x_1 ⊕ … ⊕ x_n ⊕ x_red
which gives
x_j = x_1 ⊕ … ⊕ x_(j-1) ⊕ x_(j+1) ⊕ … ⊕ x_n ⊕ x_red.
 Example:
i-th block of disk 1:    11110000
i-th block of disk 2:    10101010
i-th block of disk 3:    00111000
i-th block of red. disk: 01100010
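
The recovery rule in code, recomputing disk 2's block of the example from the surviving disks:

from functools import reduce

def xor(a, b):
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

survivors = ["11110000", "00111000", "01100010"]   # disks 1, 3, redundant
print(reduce(xor, survivors))                      # 10101010 = lost disk 2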
RAID 5 (block-interleaved distributed parity)
• RAID 4 is cheaper than mirroring but still has problems:
o Problem 1: the redundant disk is involved in every write.
o Problem 2: it does not survive 2 simultaneous crashes!
• RAID 5 fixes Problem 1 by distributing the parity among all n+1 disks (numbered 0, …, n).
• Cylinder j on disk i is treated as the parity cylinder iff i = j mod (n+1).
• Example: suppose n = 3, so we have 4 disks.
o Disk 0 is redundant for blocks 0, 4, 8, 12, etc. (because they leave remainder 0 when divided by 4).
o Disk 1 is redundant for blocks 1, 5, 9, etc.

RAID 5 (Cont'd)
 The reading/writing load for each disk is the same.
 Problem: in one block write, what's the probability that a given disk is involved?
o Each disk has probability 1/4 of holding the block.
o If not (probability 3/4), it has a 1/3 chance of being the redundant disk for that block.
o So each of the four disks is involved in 1/4 + (3/4) × (1/3) = 1/2 of the writes.
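
The placement rule in a few lines of Python, for the n = 3 example:

n = 3
for j in range(9):
    # the parity copy of block j lives on disk j mod (n + 1)
    print(f"block {j}: parity on disk {j % (n + 1)}")
# disk 0 gets blocks 0, 4, 8, ...; disk 1 gets 1, 5, ...; and so on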
RAID 6
 Allows recovery from multiple simultaneous failures.
 Essentially, applies RAID 4 to groups of disks.
 Disks are grouped, for applying RAID 4, using a redundancy bit pattern (one column per disk, one row per group).

RAID 6 (Cont'd)
Example: how to recover from 2 simultaneous failures.
[Figure: the redundancy bit pattern.]
Line 2 of the redundancy bit pattern says: apply RAID 4 to disks 1, 2, 4, and 6.
RAID 6 (Cont’d) RAID 6 (Cont’d)
44
How do we find a redundancy bit pattern? • Reading is as in RAID 4, e.g., read from the disk
o Columns should be different. containing the data.
• How about writing?

o We have all the combinations of bits in the Suppose we rewrite the first block of
columns but all 0’s. disk 2 to be 00001111.
We then compute the change:
00001111 ⊕ 10101010 = 10100101

Since disk 2 has 1’s on rows 1 and 2, we see


redundant disks 5 and 6 are relevant. Thus the
change has to propagate to disks 5 and 6, changing
them to 11000111 and 10111110, resp.
43
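
A sketch of this write, assuming the classic 7-disk redundancy bit pattern in which column i is the 3-bit binary representation of disk number i (the pattern figure itself is not reproduced in these notes); disk 6's old first block is the value implied by the slide's result:

from functools import reduce

def xor(a, b):
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

# pattern[r][i-1] == "1"  <=>  disk i is in the RAID 4 group of row r+1
pattern = ["1110100",    # row 1: data disks 1,2,3 + redundant disk 5
           "1101010",    # row 2: data disks 1,2,4 + redundant disk 6
           "1011001"]    # row 3: data disks 1,3,4 + redundant disk 7

block = {2: "10101010", 5: "01100010", 6: "00011011"}   # first blocks

new = "00001111"
change = xor(new, block[2])            # 10100101, as on the slide
for r, row in enumerate(pattern):
    if row[2 - 1] == "1":              # disk 2 belongs to this row's group
        red_disk = 5 + r               # redundant disk of row r+1
        block[red_disk] = xor(block[red_disk], change)

print(block[5], block[6])              # 11000111 10111110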
RAID 6 (Cont’d)
45

 Suppose disks a and b fail. How to recover? Example:


 Find a row r where a and b differ.
Assume in row r , a is 1 and b is 0.
 Since all columns (of the pattern matrix) are
different, we must find some row r in which the
columns for a and b are different.
 Recover disk a according to row r
 Recover disk b according to any row that has 1 for
Before failure After failure
b

46

In-class exercise
• Suppose we have four disks: 1 and 2 are data disks; 3 and 4 are redundant.
• Disk 3 is a mirror of disk 1. Disk 4 holds the parity check bits for disks 2 and 3.
• Which combinations of simultaneous 2-disk failures can we recover from?
• Disk pairs to consider:
1. {1,2}
2. {1,3}
3. {1,4}
4. {2,3}
5. {2,4}
6. {3,4}
• We can recover from all the above crash pairs except the 5th: we can't recover if disks 2 & 4 crash. Why? With x on disk 1 and z on disk 2, the four disks hold:
1   2   3   4
==============
x   z   x   z⊕x
If 2 and 4 both crash, only the two copies of x survive, and z cannot be reconstructed.
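
A small sketch that checks all six pairs with concrete bit patterns (the two recovery rules: disk 3 mirrors disk 1, and any two of disks 2, 3, 4 determine the third via parity):

x, z = 0b11110000, 0b10101010
disks = {1: x, 2: z, 3: x, 4: z ^ x}

def recoverable(a, b):
    # start from the two survivors and apply the rules until stuck
    known = {d: v for d, v in disks.items() if d not in (a, b)}
    for _ in range(2):
        if 1 in known: known.setdefault(3, known[1])    # mirror rule
        if 3 in known: known.setdefault(1, known[3])
        have = [d for d in (2, 3, 4) if d in known]
        if len(have) == 2:                              # parity rule
            missing = ({2, 3, 4} - set(have)).pop()
            known.setdefault(missing, known[have[0]] ^ known[have[1]])
    return len(known) == 4

for pair in [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]:
    print(pair, recoverable(*pair))    # True for all pairs except (2, 4)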