Sorting and Hashing: Why Sort?

The document discusses sorting large datasets that exceed available RAM size. It describes: 1) Reasons for sorting include eliminating duplicates, grouping for summarization, and ordering results. 2) The challenge of sorting 100GB of data with only 1GB of RAM, and why virtual memory is not a solution. 3) Out-of-core algorithms that perform single-pass streaming of data through RAM in chunks to minimize RAM usage and I/O calls.

Sorting and Hashing
See R&G Chapters: 9.1, 13.1-13.3, 13.4.2

Why Sort?

• "Rendezvous"
  – Eliminating duplicates (DISTINCT)
  – Grouping for summarization (GROUP BY)
  – Upcoming sort-merge join algorithm
• Ordering
  – Sometimes, output must be ordered (ORDER BY)
    • e.g., return results ranked in decreasing order of relevance
  – First step in bulk-loading tree indexes
• Problem: sort 100GB of data with 1GB of RAM.
  – Why not virtual memory?

Out-of-Core Algorithms

Two themes:
1. Single-pass streaming data through RAM
2. Divide (into RAM-sized chunks) and Conquer

Single-pass Streaming

• Simple case: "Map".
  – Goal: Compute f(x) for each record, write out the result
  – Challenge: minimize RAM, call read/write rarely
• Approach (see the sketch below)
  – Read a chunk from INPUT to an Input Buffer
  – Write f(x) for each item to an Output Buffer
  – When Input Buffer is consumed, read another chunk
  – When Output Buffer fills, write it to OUTPUT

[Diagram: INPUT → Input Buffer → f(x) → Output Buffer → OUTPUT, with both buffers held in RAM]
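As a concrete illustration, here is a minimal single-pass streaming sketch in Python. It is not from the slides: the function name, chunk sizes, and one-record-per-line format are assumptions chosen to keep the example self-contained.

```python
def stream_map(f, input_path, output_path, out_buffer_records=1000):
    """Apply f to every record using one input buffer and one output buffer."""
    out_buf = []
    with open(input_path) as inp, open(output_path, "w") as out:
        while True:
            in_buf = inp.readlines(64 * 1024)       # read one chunk from INPUT
            if not in_buf:                          # input consumed: done
                break
            for record in in_buf:
                out_buf.append(f(record))           # compute f(x) into the Output Buffer
                if len(out_buf) >= out_buffer_records:
                    out.writelines(out_buf)         # Output Buffer full: spill to OUTPUT
                    out_buf.clear()
        out.writelines(out_buf)                     # flush the last partial buffer

# Example: upper-case every record
# stream_map(str.upper, "input.txt", "output.txt")
```

RAM use stays fixed at roughly one input chunk plus one output buffer, no matter how large the input file is.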

Better: Double Buffering

• Main thread runs f(x) on one pair of I/O buffers
• A 2nd I/O thread drains/fills the unused I/O buffers in parallel
  – Why is parallelism available?
  – Theme: I/O handling usually deserves its own thread
• Main thread ready for a new buffer? Swap! (see the sketch below)
• Usable in any of the subsequent discussion
  – Assuming you have RAM buffers to spare!
  – But for simplicity we won't bring this up again.

[Diagram: INPUT and OUTPUT each get a pair of buffers in RAM; f(x) works on one Input/Output Buffer pair while the I/O thread services the spare I/O Buffers]
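A rough sketch of the same idea with dedicated I/O threads (split into a reader and a writer for simplicity) follows. Bounded queues of size two stand in for the two buffer pairs; this simplifies the explicit buffer swapping described above, and all names and sizes are illustrative.

```python
import queue
import threading

def double_buffered_map(f, input_path, output_path, chunk_bytes=64 * 1024):
    in_q = queue.Queue(maxsize=2)    # stands in for the input buffer pair
    out_q = queue.Queue(maxsize=2)   # stands in for the output buffer pair

    def io_reader():                 # I/O thread: keep input buffers full
        with open(input_path) as inp:
            while True:
                chunk = inp.readlines(chunk_bytes)
                in_q.put(chunk)      # an empty chunk signals end of input
                if not chunk:
                    break

    def io_writer():                 # I/O thread: drain output buffers
        with open(output_path, "w") as out:
            while True:
                chunk = out_q.get()
                if chunk is None:    # sentinel: computation finished
                    break
                out.writelines(chunk)

    threading.Thread(target=io_reader, daemon=True).start()
    writer = threading.Thread(target=io_writer)
    writer.start()

    while True:                      # main thread does CPU work only
        chunk = in_q.get()
        if not chunk:
            break
        out_q.put([f(x) for x in chunk])
    out_q.put(None)
    writer.join()
```

While the main thread computes f(x) on one chunk, the I/O threads are already reading the next chunk and writing the previous results, which is exactly the latency hiding the slide is after.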
Quick Check

• T/F: Single-pass streaming with separate input and output disks is nearly all sequential I/O.
• T/F: Single-pass streaming requires only a fixed amount of RAM.
• T/F: Double buffering reduces the number of I/Os performed.
• T/F: Double buffering gets disks to work in parallel with the CPU.

Sorting & Hashing: Formal Specs

• Given:
  – A file F:
    • containing a multiset of records R
    • consuming N blocks of storage
  – Two "scratch" disks
    • each with >> N blocks of free storage
  – A fixed amount of space in RAM
    • memory capacity equivalent to B blocks of disk
• Sorting
  – Produce an output file FS
    • with contents R stored in order by a given sorting criterion
• Hashing
  – Produce an output file FH
    • with contents R, arranged on disk so that no 2 records that are incomparable (i.e. "equal" in sort order) are separated by a greater or smaller record
    • I.e. matching records are always "stored consecutively" in FH

Sorting: 2-Way (a strawman)

• Pass 0 (conquer a batch):
  – read a page, sort it, write it.
  – only one buffer page is used
  – a repeated "batch job"
• Pass 1, 2, 3, …, etc. (merge via streaming):
  – requires 3 buffer pages
    • note: this has nothing to do with double buffering!
  – merge pairs of runs into runs twice as long
  – a streaming algorithm, as in the previous slide! (a toy version is sketched below)
    • Drain/fill buffers as the data streams through them

[Diagrams: Pass 0 sorts each page in place through a single buffer; merge passes use Input Buffers 1 and 2 plus an Output Buffer in RAM]
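Here is a toy version of the 2-way strawman that works on an in-memory list of "pages"; real passes would of course read and write disk pages, and the names are illustrative.

```python
from heapq import merge

def two_way_sort(pages):
    # Pass 0: read a page, sort it, write it -> one-page runs
    runs = [sorted(p) for p in pages]
    passes = 1
    # Pass 1, 2, ...: merge pairs of runs into runs twice as long
    while len(runs) > 1:
        runs = [list(merge(runs[i], runs[i + 1])) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
        passes += 1
    return (runs[0] if runs else []), passes

# The example file from the next slide: 7 pages of up to 2 records each
# two_way_sort([[3,4], [6,2], [9,4], [8,7], [5,6], [3,1], [2]])
# -> ([1,2,2,3,3,4,4,5,6,6,7,8,9], 4)    # 4 passes = 1 + ceil(log2(7))
```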

Two-Way External Merge Sort

• Conquer and Merge: sort subfiles and merge
• Each pass we read + write each page in the file (2N I/Os)
• N pages in the file, so the number of passes is ⌈log₂ N⌉ + 1
• So total cost is: 2N · (⌈log₂ N⌉ + 1)

[Diagram: example input file 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2; Pass 0 yields 1-page runs, Pass 1 yields 2-page runs, Pass 2 yields 4-page runs, Pass 3 yields the final 8-page sorted run]

General External Merge Sort

• More than 3 buffer pages. How can we utilize them?
• Conquer and Merge:
  – Big batches in pass 0, many streams in merge passes
• To sort a file with N pages using B buffer pages (see the sketch below):
  – Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
  – Pass 1, 2, …, etc.: merge B-1 runs at a time.

[Diagram: Pass 0 conquers the input into ⌈N/B⌉ sorted runs of length B; each subsequent pass merges B-1 runs at a time into sorted runs of length B(B-1), and so on]
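A sketch of general external merge sort over a text file with one record per line follows. The buffer-page bookkeeping (B buffers of page_size lines), the temp-file handling, and the names are illustrative assumptions, not a production implementation.

```python
import heapq
import itertools
import tempfile

def read_pages(path, page_size):
    """Yield the file one 'page' (list of page_size lines) at a time."""
    with open(path) as f:
        while True:
            page = list(itertools.islice(f, page_size))
            if not page:
                return
            yield page

def write_run(records):
    """Write one sorted run to a scratch file and rewind it for later merging."""
    run = tempfile.TemporaryFile("w+")
    run.writelines(records)
    run.seek(0)
    return run

def external_merge_sort(path, out_path, B, page_size=1000):
    # Pass 0: fill all B buffers, sort, and spill ceil(N/B) runs of B pages each
    runs, buf = [], []
    for page in read_pages(path, page_size):
        buf.extend(page)
        if len(buf) >= B * page_size:
            runs.append(write_run(sorted(buf)))
            buf = []
    if buf:
        runs.append(write_run(sorted(buf)))
    # Pass 1, 2, ...: merge B-1 runs at a time (B-1 input buffers, 1 output buffer)
    while len(runs) > 1:
        runs = [write_run(heapq.merge(*runs[i:i + B - 1]))
                for i in range(0, len(runs), B - 1)]
    with open(out_path, "w") as out:
        out.writelines(runs[0] if runs else [])
```

heapq.merge streams from the B-1 run files lazily, so each merge pass touches only a bounded amount of memory, mirroring the buffer picture above.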
Cost of External Merge Sort

• Number of passes: 1 + ⌈log_{B-1} ⌈N/B⌉⌉ (checked programmatically below)
• Cost = 2N * (# of passes)
• E.g., with 5 buffer pages, to sort a 108-page file:
  – Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages)
  – Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages)
  – Pass 2: 2 sorted runs, 80 pages and 28 pages
  – Pass 3: Sorted file of 108 pages
  Formula check: 1 + ⌈log₄ 22⌉ = 1 + 3 = 4 passes ✓

# of Passes of External Sort
(I/O cost is 2N times number of passes)

N              B=3   B=5   B=9   B=17  B=129  B=257
100              7     4     3     2      1      1
1,000           10     5     4     3      2      2
10,000          13     7     5     4      2      2
100,000         17     9     6     5      3      3
1,000,000       20    10     7     5      3      3
10,000,000      23    12     8     6      4      3
100,000,000     26    14     9     7      4      4
1,000,000,000   30    15    10     8      5      4
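The pass counts above can be reproduced directly from the formula; here is a small helper (names are mine) that avoids floating-point logarithms by simply simulating the merge passes.

```python
from math import ceil

def n_passes(N, B):
    """Passes of external merge sort: 1 + ceil(log_{B-1}(ceil(N/B)))."""
    runs = ceil(N / B)          # sorted runs after Pass 0
    passes = 1
    while runs > 1:             # each merge pass cuts the run count by a factor of B-1
        runs = ceil(runs / (B - 1))
        passes += 1
    return passes

print(n_passes(108, 5))          # 4, matching the worked example
print(n_passes(1_000_000, 17))   # 5, matching the table
```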

Memory Requirement for External Sorting

• How big of a table can we sort in two passes?
  – Each "sorted run" after Phase 0 is of size B
  – Can merge up to B-1 sorted runs in Phase 1
• Answer: B(B-1).
  – Sort N pages of data in about B = √N space

[Diagram: Pass 0 conquers ⌈N/B⌉ sorted runs of length B; Pass 1, … merges B-1 runs at a time into sorted runs of length B(B-1)]

Quick Check

• T/F: Two-way external sort is a good choice for a real system.
• Given B buffers in memory, external merge sort can be done in 1 pass if the file is less than ______ big:
  1. B   2. Sqrt(B)   3. B(B-1)
• Given B buffers in memory, external merge sort can be done in 2 passes if the file is less than ______ big:
  1. B   2. Sqrt(B)   3. B(B-1)
• T/F: external merge sort divides the problem during Pass 0, conquering subproblems
• T/F: external merge sort makes use of single-pass streaming during merge passes

Alternative: Hashing

• Idea:
  – Many times we don't require order
  – E.g.: removing duplicates
  – E.g.: forming groups
• Often just need to rendezvous matches
• Hashing does this
  – And may be cheaper than sorting.
  – But how to do it out-of-core??

Divide

• Streaming Partition (divide):
  Use a hash f'n hp to stream records to disk partitions
  – All matches rendezvous in the same partition.
  – Each partition a mix of values
  – Streaming alg to create partitions on disk:
    • "Spill" partitions to disk via output buffers
Divide & Conquer

• Streaming Partition (divide):
  Use a hash f'n hp to stream records to disk partitions
  – All matches rendezvous in the same partition.
  – Each hp partition a big mix of values
  – Streaming alg to create partitions on disk:
    • "Spill" partitions to disk via output buffers
• ReHash (conquer):
  Read partitions into a RAM hash table one at a time, using hash f'n hr
  – Each hr partition has a small number of values
  – Can completely hash each partition before writing out
    • All duplicate values contiguous

Two Phases

• Partition (Divide), sketched below together with the Rehash phase:

[Diagram: Original Relation → INPUT buffer → hash function hp → B-1 output buffers → Partitions 1, 2, …, B-1 on disk, using B main memory buffers]
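A minimal sketch of the two phases on a file of one-record-per-line text is below. The salted calls to Python's built-in hash stand in for hp, the dict plays the role of the RAM hash table built with hr, and the file names and B-1 fan-out are illustrative.

```python
from collections import defaultdict

def partition(input_path, num_partitions):
    """Divide: stream records into num_partitions (= B-1) disk partitions via hp."""
    paths = [f"part_{i}.tmp" for i in range(num_partitions)]
    outs = [open(p, "w") for p in paths]
    with open(input_path) as inp:
        for record in inp:
            outs[hash(("hp", record)) % num_partitions].write(record)  # matches rendezvous
    for f in outs:
        f.close()
    return paths

def rehash(partition_paths, output_path):
    """Conquer: read each partition into a RAM hash table, then write it out."""
    with open(output_path, "w") as out:
        for path in partition_paths:
            table = defaultdict(list)              # the in-RAM hash table ("hr")
            with open(path) as part:
                for record in part:
                    table[record].append(record)   # equal records land in one bucket
            for bucket in table.values():          # so duplicates come out contiguous
                out.writelines(bucket)

# e.g. with B buffer pages: rehash(partition("input.txt", B - 1), "hashed.txt")
```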

Two Phases (continued)

• Rehash (Conquer):

[Diagram: each hp partition (size ~N/(B-1)) is read one at a time into a RAM hash table for partition Ri (k <= B pages) built with hash function hr, using B main memory buffers, then written to the Output Relation; the hr partitions come out fully hashed]

Cost of External Hashing

[Diagram: Divide (hp) turns the original relation into hash partitions of size ~N/(B-1); Conquer (hr) reads each one and writes it out fully hashed]

cost = 2*N*(# passes) = 4*N I/Os
(includes initial read, final write)

Memory Requirement

• How big of a table can we hash in two passes?
  – B-1 "partitions" result from Pass 1
  – Each should be no more than B pages in size
  – Answer: B(B-1).
• We can hash a table of size N pages in about √N space
  – Note: assumes hash function distributes records evenly!
• Have a bigger table? Recursive partitioning!

[Diagram: Divide (hp) into B-1 partitions, then Conquer (hr)]

Recursive Partitioning

[Diagrams: if a Divide (hp) partition is still too large to fit in a RAM hash table, it is re-partitioned with a fresh hash function hp1 before the Conquer (hr) pass; a sketch follows below]

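Here is a sketch of recursive partitioning, reusing the streaming-partition idea with a hash function salted per level. The page size, file naming, and size check are illustrative assumptions.

```python
import os

PAGE_BYTES = 8192   # assumed page size for the size check

def partition_file(path, fanout, level):
    """Stream path into fanout partitions using a hash function salted by level (hp, hp1, ...)."""
    names = [f"{path}.{level}.{i}" for i in range(fanout)]
    outs = [open(n, "w") for n in names]
    with open(path) as inp:
        for record in inp:
            outs[hash((level, record)) % fanout].write(record)
    for f in outs:
        f.close()
    return names

def recursive_partition(path, B, level=0):
    """Return partitions small enough for the rehash (conquer) pass."""
    ready = []
    for p in partition_file(path, B - 1, level):
        if os.path.getsize(p) > B * PAGE_BYTES:        # too big for a B-page RAM hash table
            ready.extend(recursive_partition(p, B, level + 1))
        else:
            ready.append(p)                            # small enough to conquer
    return ready
```

Note that if one key dominates a partition, re-partitioning never shrinks it, which is exactly the wrinkle discussed next.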
A Wrinkle: Duplicates

• Consider a dataset with a very frequent key
  – E.g. in a big table, consider the gender column
• What happens during recursive partitioning?

[Diagram: Divide (hp) splits the data into partitions M, F, other; the huge M partition stays just as big after Divide (hp1)]

Quick Check

• Given B buffers in memory, external hashing can be done in 1 pass if the file is less than ______ big:
  1. B   2. Sqrt(B)   3. B(B-1)
• Given B buffers in memory, external hashing can be done in 2 passes if the file is less than ______ big:
  1. B   2. Sqrt(B)   3. B(B-1)
• T/F: external hashing works regardless of key values
• T/F: external hashing divides the problem during the initial (partitioning) passes
• T/F: external hashing conquers during the final (rehash) pass

Cost of External Hashing vs. Cost of External Sorting

[Diagrams: hashing = Divide then Conquer; sorting = Conquer then Merge]

How does external hashing compare with external sorting?

cost = 4*N I/Os in both cases (including initial read, final write)
Parallelize me! Hashing

• Phase 1: shuffle data across machines (hn) (see the toy sketch below)
  – streaming out to network as it is scanned
  – which machine for this record?
    use (yet another) independent hash function hn
• Receivers proceed with phase 1 as data streams in
  – from local disk and network

[Diagram: each machine applies hn to route records across the network, then runs the usual hp and hr phases locally]
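A toy sketch of the shuffle step, with the "network" faked as in-memory lists and hn simulated by salting Python's built-in hash; all names are illustrative.

```python
def shuffle(records, num_machines):
    """Route each record to a machine using the independent hash function hn."""
    per_machine = [[] for _ in range(num_machines)]
    for r in records:
        per_machine[hash(("hn", r)) % num_machines].append(r)
    return per_machine   # each receiver then runs its local hp / hr phases

# shuffle(["a", "b", "a", "c"], num_machines=2)
```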

Parallelize me! Sorting

• Pass 0: shuffle data across machines
  – streaming out to network as it is scanned
  – which machine for this record?
    Split on value range (e.g. [-∞,10], [11,100], [101,∞]).
• Receivers proceed with pass 0 as the data streams in
• A Wrinkle: How to ensure ranges are the same # of pages?!
  – i.e. avoid data skew? (one common approach is sketched below)

[Diagram: each machine routes records to the appropriate value range across the network, then runs its local pass 0]
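The slide leaves the skew question open; one common approach (an assumption here, not taken from the slides) is to sample the input and pick the range boundaries from the sample, so that each machine receives roughly the same amount of data.

```python
import bisect
import random

def pick_splitters(records, num_machines, sample_size=1000):
    """Choose num_machines-1 range boundaries from a random sample of the data."""
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    if not sample:
        return []
    return [sample[i * len(sample) // num_machines] for i in range(1, num_machines)]

def machine_for(record, splitters):
    """Which value range (and hence which machine) a record belongs to."""
    return bisect.bisect_right(splitters, record)

# splitters = pick_splitters(data, num_machines=3); machine_for(x, splitters)
```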

So which is better??

• Simplest analysis:
  – Same memory requirement for 2 passes
  – Same I/O cost
  – But we can dig a bit deeper…
• Sorting pros:
  – Great if input already sorted
  – Great if we need output to be sorted anyway
  – Not sensitive to duplicates or "bad" hash functions
• Hashing pros:
  – For duplicate elimination, scales with # of values
    • Delete dups in first pass while partitioning on hp
    • Vs. sort, which scales with # of items!
  – Easy to shuffle equally in parallel case

Summary

• Sort/Hash Duality
  – Hashing is Divide & Conquer
  – Sorting is Conquer & Merge
• Sorting is overkill for rendezvous
  – But sometimes a win anyhow
• Sorting sensitive to internal sort alg
  – Quicksort vs. HeapSort
  – In practice, QuickSort tends to win
• Don't forget double buffering
  – Can "hide" the latency of I/O behind CPU work
