W5 Storage Files Indexing pt1
(R&G Chapters 8, 9)
The BIG picture
[Figure: two paths through a database system.
Want to store data: Conceptual Design (ER Models), Logical Design (ER to Relational, Relational Model, Schema Refinement), Physical Design (Indexing, Database Storage).
Want to access data: Relational Algebra and SQL, Query Optimization and Execution, Relational Operators, Access Methods (Files), Buffer Management, Disk Space Management, Storage, then the Result.]
File and access layer
Database as a "file of records"
• Create/delete files
• insert/delete/modify record
• retrieve one particular record (point access)
– specified using record id
• retrieve range of records (range access)
– satisfying some conditions
• retrieve all records (scan)
File and access methods layer
• File Organization
– How is data organized in files?
• Indexing
– How to make data access efficient?
• Storage
– How data is physically stored on disk?
File Organization & Indexing
• File Organization
• Indexing
• Meta-data
– System Catalog
File Organization
Students
sid    name   login     age  gpa
50000  Dave   dave@cs   19   3.3
53666  Jones  jones@cs  18   3.4
53688  Smith  smit@ee   18   3.2
…      …      …         …    …
[Figure: fields make up records, records are grouped into pages, pages make up a file, and files reside in storage.]
Page Format (N-ary Storage Model)
• Page = collection of slots
• Each slot stores one record
– Record identifier (rid): <page_id, slot_number>
– Option 2: a unique id that maps to the slot, <uniq> -> <page_id, slot_number>
• Page format should support
– Fast searching, insertion, deletion
• Page format depends on record format
– Fixed-Length
– Variable-Length
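The slot idea can be sketched as follows. This is a minimal illustration, assuming a Python list stands in for the page; the class `Page` and its methods are hypothetical names, not part of any real system. It shows why a rid of <page_id, slot_number> stays valid when another record is deleted: the slot is emptied, and the remaining records do not move.

```python
class Page:
    """Toy page: a fixed number of slots, each holding one record or None."""

    def __init__(self, page_id, num_slots=4):
        self.page_id = page_id
        self.slots = [None] * num_slots

    def insert(self, record):
        """Store the record in the first free slot and return its rid."""
        for slot_number, content in enumerate(self.slots):
            if content is None:
                self.slots[slot_number] = record
                return (self.page_id, slot_number)
        raise ValueError("page is full")

    def get(self, slot_number):
        return self.slots[slot_number]

    def delete(self, slot_number):
        # Empty the slot; other records keep their slot numbers (and rids).
        self.slots[slot_number] = None


page = Page(page_id=7)
rid_a = page.insert(b"Dave")
rid_b = page.insert(b"Jones")
page.delete(rid_a[1])
rid_c = page.insert(b"Smith")  # reuses the slot freed by the delete
```

Because the freed slot is reused, an old rid can later refer to a different record; the <uniq> indirection of Option 2 is one way to avoid that.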
Record Formats: Fixed-Length
[Figure: a record with fields F1, F2, F3, F4 of fixed lengths L1, L2, L3, L4.]
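With fixed lengths L1..L4, the offset of any field or record follows from arithmetic alone. A minimal sketch using Python's `struct` module; the four-field layout (sid, name, age, gpa) and the helper names are assumptions chosen for illustration, not taken from the slides.

```python
import struct

# Assumed layout: sid (4-byte int), name (10 bytes), age (4-byte int),
# gpa (8-byte float). "<" means little-endian with no padding.
RECORD_FORMAT = "<i10sid"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # L1 + L2 + L3 + L4


def pack_student(sid, name, age, gpa):
    # struct pads a short name with zero bytes up to 10 bytes.
    return struct.pack(RECORD_FORMAT, sid, name.encode(), age, gpa)


def read_record(page_bytes, slot_number):
    # The record's position is computed, never searched for.
    offset = slot_number * RECORD_SIZE
    sid, name, age, gpa = struct.unpack_from(RECORD_FORMAT, page_bytes, offset)
    return sid, name.rstrip(b"\x00").decode(), age, gpa


page_bytes = pack_student(50000, "Dave", 19, 3.3) + pack_student(53666, "Jones", 18, 3.4)
second = read_record(page_bytes, 1)
```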
Record Formats: Variable-Length
• Fields Delimited by Special Symbols: F1 $ F2 $ F3 $ F4 $
• Slot Array: an array of slots precedes the fields F1 F2 F3 F4
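Both variable-length formats can be sketched in a few lines. The encoding details here are assumptions: the delimiter is "$" and must never occur inside a field, and the slot array is taken to hold two-byte offsets to the start of each field plus one final offset for the end of the record.

```python
import struct

DELIM = b"$"


def encode_delimited(fields):
    # Each field is followed by the delimiter symbol.
    return b"".join(field + DELIM for field in fields)


def decode_delimited(record):
    # Finding field i requires scanning past all earlier delimiters.
    return record.split(DELIM)[:-1]


def encode_with_offsets(fields):
    # Header: len(fields) + 1 offsets; offset i is where field i starts.
    header_size = 2 * (len(fields) + 1)
    offsets = [header_size]
    for field in fields:
        offsets.append(offsets[-1] + len(field))
    header = struct.pack(f"<{len(offsets)}H", *offsets)
    return header + b"".join(fields)


def get_field(record, i):
    # Direct access: read offsets i and i + 1, then slice.
    start, end = struct.unpack_from("<2H", record, 2 * i)
    return record[start:end]


fields = [b"53666", b"Jones", b"jones@cs", b"18"]
delimited = encode_delimited(fields)
with_offsets = encode_with_offsets(fields)
```

The offset form costs a few header bytes per record but reaches any field without scanning the ones before it.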
Alternate Page Formats: Column store
Page Format for Column Store
Decomposition Storage Model (DSM)
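The difference between NSM and DSM can be sketched with plain Python lists standing in for pages. The rows come from the Students example; the variable and function names are hypothetical.

```python
students = [
    (50000, "Dave", "dave@cs", 19, 3.3),
    (53666, "Jones", "jones@cs", 18, 3.4),
    (53688, "Smith", "smit@ee", 18, 3.2),
]
columns = ("sid", "name", "login", "age", "gpa")

# NSM: a page holds whole records, one after another.
nsm_page = list(students)

# DSM: one page per attribute; position i in every page belongs to record i.
dsm_pages = {col: [rec[i] for rec in students] for i, col in enumerate(columns)}


def average_gpa_nsm(page):
    # Every full record is read even though only gpa is needed.
    return sum(rec[4] for rec in page) / len(page)


def average_gpa_dsm(pages):
    # Only the gpa page is read.
    gpas = pages["gpa"]
    return sum(gpas) / len(gpas)


def reconstruct_record_dsm(pages, i):
    # Rebuilding one record touches every column page.
    return tuple(pages[col][i] for col in columns)
```

A query over one attribute favors DSM, while fetching a whole record favors NSM, which is the tradeoff between the two layouts.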
Alternative File Organizations
Many alternatives exist, each good for some
situations, and not so good in others:
• Heap files: Suitable when typical access is a file
scan retrieving all records.
• Sorted Files: Best for retrieval in some order,
or for retrieving a range of records.
• Index File Organizations: (covered shortly)
Heap (Unordered) Files
• Simplest file structure
– contains records in no particular order
– Need to be able to scan, search based on rid
• As file grows and shrinks, disk pages are
allocated and de-allocated.
– Need to manage free space
Heap File Implemented Using Lists
[Figure: a header page points to two linked lists of data pages: the full pages and the pages with free space.]
[Figure: heap file using a page directory: a DIRECTORY whose entries point to data pages 1 through N.]
• The directory is a collection of pages
– linked list implementation is just one alternative.
• The entry for a page can include the number of free
bytes on the page.
– Much smaller than linked list of all HF pages!
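A directory with a free-byte count per page can be sketched as below. This is an illustration under stated assumptions: pages are Python lists, `PAGE_SIZE` is tiny on purpose, every record fits in an empty page, and the class name `HeapFile` is hypothetical.

```python
PAGE_SIZE = 100  # bytes per page; deliberately small for the example


class HeapFile:
    def __init__(self):
        self.pages = []      # page_id -> list of records
        self.directory = []  # page_id -> number of free bytes on that page

    def insert(self, record):
        """Place the record on the first page with enough free space."""
        for page_id, free_bytes in enumerate(self.directory):
            if free_bytes >= len(record):
                break
        else:
            # No page has room: allocate a new page and register it.
            self.pages.append([])
            self.directory.append(PAGE_SIZE)
            page_id = len(self.pages) - 1
        self.pages[page_id].append(record)
        self.directory[page_id] -= len(record)
        return (page_id, len(self.pages[page_id]) - 1)  # rid


heap_file = HeapFile()
rid1 = heap_file.insert(b"x" * 60)
rid2 = heap_file.insert(b"y" * 60)  # does not fit on page 0
rid3 = heap_file.insert(b"z" * 30)  # fits in the space left on page 0
```

Only the directory is consulted to find space, so the data pages themselves are not read during the search.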
Heap File vs. Sorted File
• Which is better?
– Let us design a cost model to find out
• Simplified cost model:
– Based only on IO cost
• IO is the dominating cost
• Ignore CPU and other overheads
• Ignore effect of prefetching and sequential access
– Consider only average case
More Assumptions…
• Single record insert and delete.
• Equality search - exactly one match (e.g.,
search on key)
– Question: what if more or fewer?
• Heap Files:
– Insert always appends to end of file.
• Sorted Files:
– Files compacted after deletions.
– Search done on file-ordering attribute.
Cost of Operations (in # of I/Os)
B: Number of data pages

                   Heap File   Sorted File                  Notes
Scan all records   B           B
Equality Search    0.5B        log2 B                       assumes exactly one match!
Range Search       B           (log2 B) + (#match pages)
Insert             2           (log2 B) + 2*(B/2)           must R & W
Delete             0.5B + 1    (log2 B) + 2*(B/2)           must R & W
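The table can be turned directly into code to compare the two organizations for a given file size. A small sketch; the function names and the choice of B = 1024 with 10 matching pages are illustrative assumptions.

```python
import math


def heap_costs(B):
    """I/O costs for a heap file with B data pages."""
    return {
        "scan": B,
        "equality": 0.5 * B,    # on average half the file is read
        "range": B,             # matches can be anywhere
        "insert": 2,            # read the last page, write it back
        "delete": 0.5 * B + 1,  # find the record, write the page back
    }


def sorted_costs(B, match_pages):
    """I/O costs for a sorted file with B data pages."""
    search = math.log2(B)       # binary search over the pages
    return {
        "scan": B,
        "equality": search,
        "range": search + match_pages,
        "insert": search + 2 * (B / 2),  # read and write half the file
        "delete": search + 2 * (B / 2),
    }


heap = heap_costs(1024)
sorted_file = sorted_costs(1024, match_pages=10)
```

For B = 1024 the equality search drops from 512 I/Os to 10, while an insert rises from 2 to 1034, which is the tradeoff the table summarizes.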
Indexing
Employee(SSN, Name, Age, Salary)
• Display employees in increasing order of age
• age>30
• Updates?
• Increasing order of salary?
• Salary>120K?
• Indexing
– Multiple efficient access paths
[Figure: the Employee file (first row: 111, John, 28, 100K; Age column: 28, 28, 29, 30, 31, 34, …) with an Index on SSN and an Index on Salary.]
[Figure: an index file whose index entries (2, 2.5, 3, 3.5) direct the search to the data entries, which point into the data file.]
Data entries:
1.2* 1.7* 1.8* 1.9* 2.2* 2.4* 2.7* 2.7* 2.9* 3.2* 3.3* 3.3* 3.6* 3.8* 3.9* 4.0*
Alternative 1
Actual data record (with key value k)
• The index structure becomes the file organization.
– Similar to heap files or sorted files.
• At most one index can use Alt. 1.
• Efficient but can be expensive to maintain.
– Insertions and deletions modify the data file.
Alternative 1: Example
Alternatives 2 & 3
Alt. 2: <k, rid of matching data record>
Alt. 3: <k, list of rids of matching data records>
• Easier to maintain than Alt. 1.
• If a file has several indexes, only one can use Alt. 1;
the others must use Alt. 2 or 3.
• Alt. 3 is more compact than Alt. 2, but leads to
variable sized data entries.
– even for search keys of fixed length.
• Data entries with long rid lists may span many pages!
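The two data-entry formats can be compared on a four-record file. A sketch under stated assumptions: the rids are made-up <page_id, slot_number> pairs, the search key is age, and the names `alt2` and `alt3` are hypothetical.

```python
from collections import defaultdict

# Data file: rid -> record. The rids are invented for the example.
data_file = {
    (1, 0): {"name": "bob", "age": 12},
    (1, 1): {"name": "cal", "age": 11},
    (2, 0): {"name": "joe", "age": 12},
    (2, 1): {"name": "sue", "age": 13},
}

# Alt. 2: one <k, rid> data entry per record, sorted by search key.
alt2 = sorted((rec["age"], rid) for rid, rec in data_file.items())

# Alt. 3: one <k, list of rids> data entry per distinct key value.
rids_by_key = defaultdict(list)
for key, rid in alt2:
    rids_by_key[key].append(rid)
alt3 = sorted(rids_by_key.items())
```

Alt. 3 stores the key 12 once instead of twice, but its entries no longer have a fixed size, even though the key itself does.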
Alternative 2: Example
File Organization & Indexing
• File Organization
– Heap & Sorted Files
• Indexing
– Index Classification: Clustered/Unclustered, Sparse/Dense
– Types of Indexes: Primary, Clustering, Secondary Key, Secondary Non-Key
– Indexing techniques: Hash vs tree
– Choosing search key
• Meta-data
– System Catalog
Index Classification
Some of the material for this
topic is taken from "Database
Systems", Elmasri, Navathe.
• Clustered vs Unclustered
• Dense vs Sparse
• Indexing field (2 choices)
– Key
– Non-Key
• Physical ordering of the file (2 choices)
– Ordered on indexing field
– Not ordered on indexing field
• 2 x 2 = 4 types of index

                  Physical Ordering on Indexing Field
Indexing Field    Ordered            Not Ordered
Key               Primary Index      Secondary Index (Key)
Non-Key           Clustering Index   Secondary Index (Non-Key)
Index Classification - Clustering
• Clustered vs. unclustered: If the order of the
data records is the same as, or close to, the order of the index
data entries, then the index is clustered; otherwise it is unclustered.
[Figure: CLUSTERED vs. UNCLUSTERED. In each panel, index entries direct the search for data entries (index file), which point to the data records (data file).]
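Why clustering matters for range queries can be sketched by counting the data pages a range touches. The two page layouts below are invented for illustration: the same nine keys, stored once in key order and once scattered.

```python
def pages_touched(data_pages, lo, hi):
    """Number of data pages holding at least one key in [lo, hi]."""
    return sum(1 for page in data_pages if any(lo <= key <= hi for key in page))


# Same keys, three per page. Clustered: the file follows the key order.
clustered_file = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Unclustered: neighboring keys sit on different pages.
unclustered_file = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

clustered_io = pages_touched(clustered_file, 1, 3)
unclustered_io = pages_touched(unclustered_file, 1, 3)
```

The range 1..3 is answered from one page in the clustered layout and from three in the unclustered one; in the worst case an unclustered index costs one page read per matching record.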
Primary Index
• Indexing Field = Key
• File is physically sorted
on indexing field
• One index entry per
block
• Index pointers can be
block pointers (anchors)
• Sparse Index
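A sparse primary index with one entry per block can be sketched with `bisect`. The blocks, keys, and the function name `lookup` are assumptions for illustration; each index entry is the key of the first record in its block (the block anchor).

```python
import bisect

# Data file physically sorted on the key; three records per block.
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Sparse index: one entry per block, holding the block's anchor key.
index_keys = [block[0][0] for block in blocks]


def lookup(key):
    # Find the last block whose anchor key is <= the search key.
    position = bisect.bisect_right(index_keys, key) - 1
    if position < 0:
        return None  # smaller than every key in the file
    # Only that one block has to be read and searched.
    for record_key, payload in blocks[position]:
        if record_key == key:
            return payload
    return None
```

The index has 3 entries for 9 records, which is only possible because the file is sorted on the indexing field.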
Clustering Index
• Indexing Field = Non-Key
• File is physically sorted
on indexing field
• One index entry per
distinct value
• Index pointer is block
pointer to first block
with the value
• Sparse Index
Clustering Index - Alternative
Secondary Key Index
• Indexing Field = Key
• File is NOT physically
sorted on indexing field
• One index entry per
record
• Index pointer is record
pointer
• Dense Index
Secondary Non-Key Index
• Indexing Field = Non-Key
• File is NOT physically
sorted on indexing field
• One index entry per
record
• Index pointer
‒ One per record (dense)
‒ One per value (sparse)
o Variable-length record
o Extra level of indirection
Index Classification: Summary
Type of Index       Indexing Field   File physically sorted on indexing field?   Index Entries          Index Pointers                                    Sparse or Dense?
Primary             Key              Yes                                         One per block          Block anchor                                      Sparse
Clustering          Non-Key          Yes                                         One per value          Block pointer                                     Sparse
Secondary Key       Key              No                                          One per record         Record pointer                                    Dense
Secondary Non-Key   Non-Key          No                                          One per record/value   Record pointer / Variable length / indirection    Sparse or Dense
Hash-based Index
• Good for equality selections.
– File = a collection of buckets. Bucket = primary page plus 0 or
more overflow pages.
– Hash function h: h(r.search_key) = bucket for record r.
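The bucket structure can be sketched as follows, assuming integer search keys, a modulo hash function, and pages that hold two data entries each; all names and constants are illustrative.

```python
NUM_BUCKETS = 4
PAGE_CAPACITY = 2  # data entries per page; deliberately small


def h(search_key):
    # Assumed hash function: integer key modulo the number of buckets.
    return search_key % NUM_BUCKETS


# Each bucket is a list of pages: pages[0] is the primary page,
# any further pages are overflow pages.
buckets = [[[]] for _ in range(NUM_BUCKETS)]


def insert(search_key, rid):
    pages = buckets[h(search_key)]
    if len(pages[-1]) == PAGE_CAPACITY:
        pages.append([])  # last page is full: chain an overflow page
    pages[-1].append((search_key, rid))


def equality_search(search_key):
    # Only the pages of one bucket are examined.
    return [rid for page in buckets[h(search_key)]
            for key, rid in page if key == search_key]


for key in (1, 5, 9):  # all three keys hash to bucket 1
    insert(key, f"rid-{key}")
```

Keys that are adjacent in value land in unrelated buckets, which is why this structure helps equality selections but not range selections.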
Tree-based Index
• Good for range selections.
– Leaves contain data entries sorted by search key value
– B+ tree: all root->leaf paths have equal length (height)
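Why sorted leaves suit range selections can be sketched with a flat sorted list standing in for the leaf level. This is not a B+ tree: binary search via `bisect` replaces the root-to-leaf descent, and the key values are sample data entries chosen for the example.

```python
import bisect

# Leaf-level data entries, sorted by search key value.
data_entries = [1.2, 1.7, 1.8, 1.9, 2.2, 2.4, 2.7, 2.7,
                2.9, 3.2, 3.3, 3.3, 3.6, 3.8, 3.9, 4.0]


def range_search(lo, hi):
    """Return all data entries with lo <= key <= hi."""
    # Locate the first qualifying entry (the tree descent in a real index).
    start = bisect.bisect_left(data_entries, lo)
    # Then read sequentially until the first key beyond hi.
    end = bisect.bisect_right(data_entries, hi)
    return data_entries[start:end]


matches = range_search(2.5, 3.3)
```

All matches are contiguous in the sorted order, so the cost is one search plus a sequential read of the qualifying entries.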
Index, index, everywhere
Select E.dno
From Employees E
Where E.age > 40
Choosing Search Key
Select sID
From Student
Where sName = 'Mary' And GPA > 3.9
– Range query: Some field value is not a constant. E.g.:
• age = 12; or age = 12 and sal > 20
• Data entries in index sorted by search key for range queries.
– Lexicographic order.

Data records sorted by name (name, age, sal):
bob  12  10
cal  11  80
joe  12  20
sue  13  75

Data entries of each index, in search key order:
<age, sal>:  11,80  12,10  12,20  13,75
<age>:       11  12  12  13
<sal, age>:  10,12  20,12  75,13  80,11
<sal>:       10  20  75  80
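Lexicographic order on a composite key matches Python's tuple ordering, so the example indexes can be reproduced directly. A sketch assuming integer ages; the function name is hypothetical.

```python
import bisect

# (name, age, sal) data records from the example.
records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries of the two composite indexes, in lexicographic order.
age_sal = sorted((age, sal) for _, age, sal in records)
sal_age = sorted((sal, age) for _, age, sal in records)


def age_equals_and_sal_above(age, min_sal):
    """Answer 'age = X and sal > Y' from the <age, sal> index."""
    # Entries for one age are contiguous, ordered by sal within the age.
    start = bisect.bisect_right(age_sal, (age, min_sal))
    end = bisect.bisect_left(age_sal, (age + 1,))  # assumes integer ages
    return age_sal[start:end]


result = age_equals_and_sal_above(12, 10)
```

In the <sal, age> index the entries with age = 12 are not guaranteed to be adjacent, so the same query cannot be answered there by reading one contiguous run.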
Composite Search Key - Tradeoffs
Select AVG(E.sal)
From Employees E
Where E.age = 25
AND E.sal BETWEEN 3000 AND 5000
System Catalogs
• For each relation:
– name, file name, file structure (e.g., Heap file)
– attribute name and type, for each attribute
– index name, for each index
– integrity constraints
• For each index:
– structure (e.g., B+ tree) and search key fields
• For each view:
– view name and definition
• Plus stats, authorization, buffer pool size, etc.
Catalogs are themselves stored as relations!
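That catalogs are ordinary relations can be seen by querying one. The sketch below swaps in SQLite, because it ships with Python, rather than Oracle; SQLite keeps its catalog in the table `sqlite_master`. The table and index names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students "
    "(sid INTEGER PRIMARY KEY, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.execute("CREATE INDEX idx_students_age ON Students (age)")

# The catalog is queried with plain SQL, like any other relation.
catalog_rows = conn.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY type, name"
).fetchall()

# The stored definition of each object is a column of the catalog too.
definition = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'idx_students_age'"
).fetchone()[0]
```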
System Catalog in Oracle
> desc user_tables
Name Null? Type
----------------------------------------- -------- ----------------------------
TABLE_NAME NOT NULL VARCHAR2(128)
TABLESPACE_NAME VARCHAR2(30)
CLUSTER_NAME VARCHAR2(128)
IOT_NAME VARCHAR2(128)
STATUS VARCHAR2(8)
PCT_FREE NUMBER
PCT_USED NUMBER
...
System Catalog in Oracle
> select * from student;
Summary
• Database organized as a collection of files
– Several file organizations (heap, sorted, …) with tradeoffs
• Files are a collection of pages
– Several page layouts (NSM, DSM, …) with tradeoffs
• Pages contain a collection of records
– Several record formats (fixed, variable length…) with
tradeoffs
• Index is a quick way to find records
– Several index types with tradeoffs