Sorting and Hashing: Why Sort?

The document discusses sorting large datasets that exceed available RAM size. It describes: 1) Reasons for sorting include eliminating duplicates, grouping for summarization, and ordering results. 2) The challenge of sorting 100GB of data with only 1GB of RAM, and why virtual memory is not a solution. 3) Out-of-core algorithms that perform single-pass streaming of data through RAM in chunks to minimize RAM usage and I/O calls.

Sorting and Hashing
See R&G Chapters: 9.1, 13.1-13.3, 13.4.2

Why Sort?

• "Rendezvous"
  – Eliminating duplicates (DISTINCT)
  – Grouping for summarization (GROUP BY)
  – Upcoming sort-merge join algorithm
• Ordering
  – Sometimes, output must be ordered (ORDER BY)
    • e.g., return results ranked in decreasing order of relevance
  – First step in bulk-loading tree indexes
• Problem: sort 100GB of data with 1GB of RAM.
  – Why not virtual memory?

Out-of-Core Algorithms

Two themes:
1. Single-pass streaming data through RAM
2. Divide (into RAM-sized chunks) and Conquer

Single-pass Streaming

• Simple case: "Map".
  – Goal: Compute f(x) for each record, write out the result
  – Challenge: minimize RAM, call read/write rarely
• Approach (see the sketch below)
  – Read a chunk from INPUT to an Input Buffer
  – Write f(x) for each item to an Output Buffer
  – When Input Buffer is consumed, read another chunk
  – When Output Buffer fills, write it to OUTPUT

[Diagram: INPUT → Input Buffer → f(x) → Output Buffer → OUTPUT, with both buffers held in RAM]
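As a concrete illustration, here is a minimal single-pass streaming sketch in Python. It is not from the slides: the function name, chunk sizes, and one-record-per-line format are assumptions chosen to keep the example self-contained.

```python
def stream_map(f, input_path, output_path, out_buffer_records=1000):
    """Apply f to every record using one input buffer and one output buffer."""
    out_buf = []
    with open(input_path) as inp, open(output_path, "w") as out:
        while True:
            in_buf = inp.readlines(64 * 1024)       # read one chunk from INPUT
            if not in_buf:                          # input consumed: done
                break
            for record in in_buf:
                out_buf.append(f(record))           # compute f(x) into the Output Buffer
                if len(out_buf) >= out_buffer_records:
                    out.writelines(out_buf)         # Output Buffer full: spill to OUTPUT
                    out_buf.clear()
        out.writelines(out_buf)                     # flush the last partial buffer

# Example: upper-case every record
# stream_map(str.upper, "input.txt", "output.txt")
```

RAM use stays fixed at roughly one input chunk plus one output buffer, no matter how large the input file is.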

Better: Double Buffering

• Main thread runs f(x) on one pair of I/O buffers
• A 2nd I/O thread drains/fills the unused I/O buffers in parallel
  – Why is parallelism available?
  – Theme: I/O handling usually deserves its own thread
• Main thread ready for a new buffer? Swap! (see the sketch below)
• Usable in any of the subsequent discussion
  – Assuming you have RAM buffers to spare!
  – But for simplicity we won't bring this up again.

[Diagram: INPUT and OUTPUT each get a pair of buffers in RAM; f(x) works on one Input/Output Buffer pair while the I/O thread services the spare I/O Buffers]
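A rough sketch of the same idea with dedicated I/O threads (split into a reader and a writer for simplicity) follows. Bounded queues of size two stand in for the two buffer pairs; this simplifies the explicit buffer swapping described above, and all names and sizes are illustrative.

```python
import queue
import threading

def double_buffered_map(f, input_path, output_path, chunk_bytes=64 * 1024):
    in_q = queue.Queue(maxsize=2)    # stands in for the input buffer pair
    out_q = queue.Queue(maxsize=2)   # stands in for the output buffer pair

    def io_reader():                 # I/O thread: keep input buffers full
        with open(input_path) as inp:
            while True:
                chunk = inp.readlines(chunk_bytes)
                in_q.put(chunk)      # an empty chunk signals end of input
                if not chunk:
                    break

    def io_writer():                 # I/O thread: drain output buffers
        with open(output_path, "w") as out:
            while True:
                chunk = out_q.get()
                if chunk is None:    # sentinel: computation finished
                    break
                out.writelines(chunk)

    threading.Thread(target=io_reader, daemon=True).start()
    writer = threading.Thread(target=io_writer)
    writer.start()

    while True:                      # main thread does CPU work only
        chunk = in_q.get()
        if not chunk:
            break
        out_q.put([f(x) for x in chunk])
    out_q.put(None)
    writer.join()
```

While the main thread computes f(x) on one chunk, the I/O threads are already reading the next chunk and writing the previous results, which is exactly the latency hiding the slide is after.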
Quick Check

• T/F: Single-pass streaming with separate input and output disks is nearly all sequential I/O.
• T/F: Single-pass streaming requires only a fixed amount of RAM.
• T/F: Double buffering reduces the number of I/Os performed.
• T/F: Double buffering gets disks to work in parallel with the CPU.

Sorting & Hashing: Formal Specs

• Given:
  – A file F:
    • containing a multiset of records R
    • consuming N blocks of storage
  – Two "scratch" disks
    • each with >> N blocks of free storage
  – A fixed amount of space in RAM
    • memory capacity equivalent to B blocks of disk
• Sorting
  – Produce an output file FS
    • with contents R stored in order by a given sorting criterion
• Hashing
  – Produce an output file FH
    • with contents R, arranged on disk so that no 2 records that are incomparable (i.e. "equal" in sort order) are separated by a greater or smaller record
    • I.e. matching records are always "stored consecutively" in FH

Sorting: 2-Way (a strawman)

• Pass 0 (conquer a batch):
  – read a page, sort it, write it.
  – only one buffer page is used
  – a repeated "batch job"
• Pass 1, 2, 3, …, etc. (merge via streaming):
  – requires 3 buffer pages
    • note: this has nothing to do with double buffering!
  – merge pairs of runs into runs twice as long
  – a streaming algorithm, as in the previous slide! (a toy version is sketched below)
    • Drain/fill buffers as the data streams through them

[Diagrams: Pass 0 sorts each page in place through a single buffer; merge passes use Input Buffers 1 and 2 plus an Output Buffer in RAM]
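Here is a toy version of the 2-way strawman that works on an in-memory list of "pages"; real passes would of course read and write disk pages, and the names are illustrative.

```python
from heapq import merge

def two_way_sort(pages):
    # Pass 0: read a page, sort it, write it -> one-page runs
    runs = [sorted(p) for p in pages]
    passes = 1
    # Pass 1, 2, ...: merge pairs of runs into runs twice as long
    while len(runs) > 1:
        runs = [list(merge(runs[i], runs[i + 1])) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
        passes += 1
    return (runs[0] if runs else []), passes

# The example file from the next slide: 7 pages of up to 2 records each
# two_way_sort([[3,4], [6,2], [9,4], [8,7], [5,6], [3,1], [2]])
# -> ([1,2,2,3,3,4,4,5,6,6,7,8,9], 4)    # 4 passes = 1 + ceil(log2(7))
```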

Two-Way External Merge Sort

• Conquer and Merge: sort subfiles and merge
• Each pass we read + write each page in the file (2N I/Os)
• N pages in the file, so the number of passes is ⌈log₂ N⌉ + 1
• So total cost is: 2N · (⌈log₂ N⌉ + 1)

[Diagram: example input file 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2; Pass 0 yields 1-page runs, Pass 1 yields 2-page runs, Pass 2 yields 4-page runs, Pass 3 yields the final 8-page sorted run]

General External Merge Sort

• More than 3 buffer pages. How can we utilize them?
• Conquer and Merge:
  – Big batches in pass 0, many streams in merge passes
• To sort a file with N pages using B buffer pages (see the sketch below):
  – Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
  – Pass 1, 2, …, etc.: merge B-1 runs at a time.

[Diagram: Pass 0 conquers the input into ⌈N/B⌉ sorted runs of length B; each subsequent pass merges B-1 runs at a time into sorted runs of length B(B-1), and so on]
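A sketch of general external merge sort over a text file with one record per line follows. The buffer-page bookkeeping (B buffers of page_size lines), the temp-file handling, and the names are illustrative assumptions, not a production implementation.

```python
import heapq
import itertools
import tempfile

def read_pages(path, page_size):
    """Yield the file one 'page' (list of page_size lines) at a time."""
    with open(path) as f:
        while True:
            page = list(itertools.islice(f, page_size))
            if not page:
                return
            yield page

def write_run(records):
    """Write one sorted run to a scratch file and rewind it for later merging."""
    run = tempfile.TemporaryFile("w+")
    run.writelines(records)
    run.seek(0)
    return run

def external_merge_sort(path, out_path, B, page_size=1000):
    # Pass 0: fill all B buffers, sort, and spill ceil(N/B) runs of B pages each
    runs, buf = [], []
    for page in read_pages(path, page_size):
        buf.extend(page)
        if len(buf) >= B * page_size:
            runs.append(write_run(sorted(buf)))
            buf = []
    if buf:
        runs.append(write_run(sorted(buf)))
    # Pass 1, 2, ...: merge B-1 runs at a time (B-1 input buffers, 1 output buffer)
    while len(runs) > 1:
        runs = [write_run(heapq.merge(*runs[i:i + B - 1]))
                for i in range(0, len(runs), B - 1)]
    with open(out_path, "w") as out:
        out.writelines(runs[0] if runs else [])
```

heapq.merge streams from the B-1 run files lazily, so each merge pass touches only a bounded amount of memory, mirroring the buffer picture above.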
Cost of External Merge Sort

• Number of passes: 1 + ⌈log_{B-1} ⌈N/B⌉⌉ (checked programmatically below)
• Cost = 2N * (# of passes)
• E.g., with 5 buffer pages, to sort a 108-page file:
  – Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages)
  – Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages)
  – Pass 2: 2 sorted runs, 80 pages and 28 pages
  – Pass 3: Sorted file of 108 pages
  Formula check: 1 + ⌈log₄ 22⌉ = 1 + 3 = 4 passes ✓

# of Passes of External Sort
(I/O cost is 2N times number of passes)

N              B=3   B=5   B=9   B=17  B=129  B=257
100              7     4     3     2      1      1
1,000           10     5     4     3      2      2
10,000          13     7     5     4      2      2
100,000         17     9     6     5      3      3
1,000,000       20    10     7     5      3      3
10,000,000      23    12     8     6      4      3
100,000,000     26    14     9     7      4      4
1,000,000,000   30    15    10     8      5      4
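The pass counts above can be reproduced directly from the formula; here is a small helper (names are mine) that avoids floating-point logarithms by simply simulating the merge passes.

```python
from math import ceil

def n_passes(N, B):
    """Passes of external merge sort: 1 + ceil(log_{B-1}(ceil(N/B)))."""
    runs = ceil(N / B)          # sorted runs after Pass 0
    passes = 1
    while runs > 1:             # each merge pass cuts the run count by a factor of B-1
        runs = ceil(runs / (B - 1))
        passes += 1
    return passes

print(n_passes(108, 5))          # 4, matching the worked example
print(n_passes(1_000_000, 17))   # 5, matching the table
```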

Memory Requirement for External Sorting

• How big of a table can we sort in two passes?
  – Each "sorted run" after Phase 0 is of size B
  – Can merge up to B-1 sorted runs in Phase 1
• Answer: B(B-1).
  – Sort N pages of data in about B = √N space

[Diagram: Pass 0 conquers ⌈N/B⌉ sorted runs of length B; Pass 1, … merges B-1 runs at a time into sorted runs of length B(B-1)]

Quick Check

• T/F: Two-way external sort is a good choice for a real system.
• Given B buffers in memory, external merge sort can be done in 1 pass if the file is less than ______ big:
  1. B   2. Sqrt(B)   3. B(B-1)
• Given B buffers in memory, external merge sort can be done in 2 passes if the file is less than ______ big:
  1. B   2. Sqrt(B)   3. B(B-1)
• T/F: external merge sort divides the problem during Pass 0, conquering subproblems
• T/F: external merge sort makes use of single-pass streaming during merge passes

Alternative: Hashing

• Idea:
  – Many times we don't require order
  – E.g.: removing duplicates
  – E.g.: forming groups
• Often just need to rendezvous matches
• Hashing does this
  – And may be cheaper than sorting.
  – But how to do it out-of-core??

Divide

• Streaming Partition (divide):
  Use a hash f'n hp to stream records to disk partitions
  – All matches rendezvous in the same partition.
  – Each partition a mix of values
  – Streaming alg to create partitions on disk:
    • "Spill" partitions to disk via output buffers
Divide & Conquer

• Streaming Partition (divide):
  Use a hash f'n hp to stream records to disk partitions
  – All matches rendezvous in the same partition.
  – Each hp partition a big mix of values
  – Streaming alg to create partitions on disk:
    • "Spill" partitions to disk via output buffers
• ReHash (conquer):
  Read partitions into a RAM hash table one at a time, using hash f'n hr
  – Each hr partition has a small number of values
  – Can completely hash each partition before writing out
    • All duplicate values contiguous

Two Phases

• Partition (Divide), sketched below together with the Rehash phase:

[Diagram: Original Relation → INPUT buffer → hash function hp → B-1 output buffers → Partitions 1, 2, …, B-1 on disk, using B main memory buffers]
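A minimal sketch of the two phases on a file of one-record-per-line text is below. The salted calls to Python's built-in hash stand in for hp, the dict plays the role of the RAM hash table built with hr, and the file names and B-1 fan-out are illustrative.

```python
from collections import defaultdict

def partition(input_path, num_partitions):
    """Divide: stream records into num_partitions (= B-1) disk partitions via hp."""
    paths = [f"part_{i}.tmp" for i in range(num_partitions)]
    outs = [open(p, "w") for p in paths]
    with open(input_path) as inp:
        for record in inp:
            outs[hash(("hp", record)) % num_partitions].write(record)  # matches rendezvous
    for f in outs:
        f.close()
    return paths

def rehash(partition_paths, output_path):
    """Conquer: read each partition into a RAM hash table, then write it out."""
    with open(output_path, "w") as out:
        for path in partition_paths:
            table = defaultdict(list)              # the in-RAM hash table ("hr")
            with open(path) as part:
                for record in part:
                    table[record].append(record)   # equal records land in one bucket
            for bucket in table.values():          # so duplicates come out contiguous
                out.writelines(bucket)

# e.g. with B buffer pages: rehash(partition("input.txt", B - 1), "hashed.txt")
```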

Two Phases (continued)

• Rehash (Conquer):

[Diagram: each hp partition (size ~N/(B-1)) is read one at a time into a RAM hash table for partition Ri (k <= B pages) built with hash function hr, using B main memory buffers, then written to the Output Relation; the hr partitions come out fully hashed]

Cost of External Hashing

[Diagram: Divide (hp) turns the original relation into hash partitions of size ~N/(B-1); Conquer (hr) reads each one and writes it out fully hashed]

cost = 2*N*(# passes) = 4*N I/Os
(includes initial read, final write)

Memory Requirement

• How big of a table can we hash in two passes?
  – B-1 "partitions" result from Pass 1
  – Each should be no more than B pages in size
  – Answer: B(B-1).
• We can hash a table of size N pages in about √N space
  – Note: assumes hash function distributes records evenly!
• Have a bigger table? Recursive partitioning!

[Diagram: Divide (hp) into B-1 partitions, then Conquer (hr)]

Recursive Partitioning

[Diagrams: if a Divide (hp) partition is still too large to fit in a RAM hash table, it is re-partitioned with a fresh hash function hp1 before the Conquer (hr) pass; a sketch follows below]

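Here is a sketch of recursive partitioning, reusing the streaming-partition idea with a hash function salted per level. The page size, file naming, and size check are illustrative assumptions.

```python
import os

PAGE_BYTES = 8192   # assumed page size for the size check

def partition_file(path, fanout, level):
    """Stream path into fanout partitions using a hash function salted by level (hp, hp1, ...)."""
    names = [f"{path}.{level}.{i}" for i in range(fanout)]
    outs = [open(n, "w") for n in names]
    with open(path) as inp:
        for record in inp:
            outs[hash((level, record)) % fanout].write(record)
    for f in outs:
        f.close()
    return names

def recursive_partition(path, B, level=0):
    """Return partitions small enough for the rehash (conquer) pass."""
    ready = []
    for p in partition_file(path, B - 1, level):
        if os.path.getsize(p) > B * PAGE_BYTES:        # too big for a B-page RAM hash table
            ready.extend(recursive_partition(p, B, level + 1))
        else:
            ready.append(p)                            # small enough to conquer
    return ready
```

Note that if one key dominates a partition, re-partitioning never shrinks it, which is exactly the wrinkle discussed next.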
A Wrinkle: Duplicates

• Consider a dataset with a very frequent key
  – E.g. in a big table, consider the gender column
• What happens during recursive partitioning?

[Diagram: Divide (hp) splits the data into partitions M, F, other; the huge M partition stays just as big after Divide (hp1)]

Quick Check

• Given B buffers in memory, external hashing can be done in 1 pass if the file is less than ______ big:
  1. B   2. Sqrt(B)   3. B(B-1)
• Given B buffers in memory, external hashing can be done in 2 passes if the file is less than ______ big:
  1. B   2. Sqrt(B)   3. B(B-1)
• T/F: external hashing works regardless of key values
• T/F: external hashing divides the problem during the initial (partitioning) passes
• T/F: external hashing conquers during the final (rehash) pass

Cost of External Hashing vs. Cost of External Sorting

[Diagrams: hashing = Divide then Conquer; sorting = Conquer then Merge]

How does external hashing compare with external sorting?

cost = 4*N I/Os in both cases (including initial read, final write)
Parallelize me! Hashing

• Phase 1: shuffle data across machines (hn) (see the toy sketch below)
  – streaming out to network as it is scanned
  – which machine for this record?
    use (yet another) independent hash function hn
• Receivers proceed with phase 1 as data streams in
  – from local disk and network

[Diagram: each machine applies hn to route records across the network, then runs the usual hp and hr phases locally]
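A toy sketch of the shuffle step, with the "network" faked as in-memory lists and hn simulated by salting Python's built-in hash; all names are illustrative.

```python
def shuffle(records, num_machines):
    """Route each record to a machine using the independent hash function hn."""
    per_machine = [[] for _ in range(num_machines)]
    for r in records:
        per_machine[hash(("hn", r)) % num_machines].append(r)
    return per_machine   # each receiver then runs its local hp / hr phases

# shuffle(["a", "b", "a", "c"], num_machines=2)
```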

Parallelize me! Sorting

• Pass 0: shuffle data across machines
  – streaming out to network as it is scanned
  – which machine for this record?
    Split on value range (e.g. [-∞,10], [11,100], [101,∞]).
• Receivers proceed with pass 0 as the data streams in
• A Wrinkle: How to ensure ranges are the same # of pages?!
  – i.e. avoid data skew? (one common approach is sketched below)

[Diagram: each machine routes records to the appropriate value range across the network, then runs its local pass 0]
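The slide leaves the skew question open; one common approach (an assumption here, not taken from the slides) is to sample the input and pick the range boundaries from the sample, so that each machine receives roughly the same amount of data.

```python
import bisect
import random

def pick_splitters(records, num_machines, sample_size=1000):
    """Choose num_machines-1 range boundaries from a random sample of the data."""
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    if not sample:
        return []
    return [sample[i * len(sample) // num_machines] for i in range(1, num_machines)]

def machine_for(record, splitters):
    """Which value range (and hence which machine) a record belongs to."""
    return bisect.bisect_right(splitters, record)

# splitters = pick_splitters(data, num_machines=3); machine_for(x, splitters)
```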

So which is better??

• Simplest analysis:
  – Same memory requirement for 2 passes
  – Same I/O cost
  – But we can dig a bit deeper…
• Sorting pros:
  – Great if input already sorted
  – Great if we need output to be sorted anyway
  – Not sensitive to duplicates or "bad" hash functions
• Hashing pros:
  – For duplicate elimination, scales with # of values
    • Delete dups in first pass while partitioning on hp
    • Vs. sort, which scales with # of items!
  – Easy to shuffle equally in parallel case

Summary

• Sort/Hash Duality
  – Hashing is Divide & Conquer
  – Sorting is Conquer & Merge
• Sorting is overkill for rendezvous
  – But sometimes a win anyhow
• Sorting sensitive to internal sort alg
  – Quicksort vs. HeapSort
  – In practice, QuickSort tends to win
• Don't forget double buffering
  – Can "hide" the latency of I/O behind CPU work
