CS 345: Topics in Data Warehousing: Thursday, October 21, 2004
B-Tree Index
[Figure: B-tree with search tree node (2, 5) above the sorted leaf entries 2, 4, 4, 5, 7, 8]
• By far the most common type of index
• Sorted index with search tree
• Good for point queries and range queries
– Point query: A = 5
– Range query: A BETWEEN 5 AND 10
• Search tree nodes are page-sized
– Contain <Value, Pointer> pairs
– Each Pointer is to a node of the level below
• Trade-off in choosing index page sizes
– Larger pages → fewer search tree levels → fewer
page reads
– Larger pages → each page read takes longer
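The first half of the page-size trade-off can be sketched numerically. This is a rough estimate only: the function name, the 16-byte size of a <Value, Pointer> pair, and the 100 million entry count are illustrative assumptions, not figures from the lecture.

```python
def btree_levels(num_entries: int, page_bytes: int, entry_bytes: int = 16) -> int:
    """Estimate how many levels a B-tree needs when every node is one page.

    entry_bytes is an assumed size for one <Value, Pointer> pair.
    """
    fanout = page_bytes // entry_bytes  # pairs that fit in one page
    levels = 1
    capacity = fanout  # entries reachable through a tree of this height
    while capacity < num_entries:
        capacity *= fanout
        levels += 1
    return levels


# Assumed workload: 100 million index entries.
for page_bytes in (1024, 4096, 65536):
    print(page_bytes, btree_levels(100_000_000, page_bytes))
```

Larger pages reduce the number of levels (page reads per lookup), but the sketch says nothing about the second half of the trade-off: each of those reads transfers more bytes and so takes longer.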
Hash Indexes
• Useful for point queries
– Slightly better performance than B-Trees
– Not useful for range queries
• Less widely supported than B-Trees
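A minimal in-memory sketch of why hash indexes serve point queries but not range queries. A Python dict stands in for the hash index and a sorted list stands in for the B-tree leaf level; the data and all function names are hypothetical.

```python
import bisect
from collections import defaultdict

# Hypothetical relation: (RID, A) pairs.
rows = [(0, 7), (1, 2), (2, 5), (3, 4), (4, 8), (5, 4)]

# Hash index: value -> list of RIDs. Entries are in no useful order.
hash_index = defaultdict(list)
for rid, a in rows:
    hash_index[a].append(rid)

# Sorted index: <Value, RID> pairs in value order, like B-tree leaves.
sorted_index = sorted((a, rid) for rid, a in rows)


def point_query_hash(value):
    """A = value: a single hash probe."""
    return list(hash_index.get(value, []))


def range_query_sorted(lo, hi):
    """A BETWEEN lo AND hi: locate both ends, read the entries in between."""
    start = bisect.bisect_left(sorted_index, (lo,))
    end = bisect.bisect_right(sorted_index, (hi, float("inf")))
    return [rid for _, rid in sorted_index[start:end]]


def range_query_hash(lo, hi):
    """The hash index has no order, so every key must be examined."""
    return [rid for value, rids in hash_index.items()
            if lo <= value <= hi for rid in rids]


print(point_query_hash(4))        # RIDs with A = 4
print(range_query_sorted(5, 10))  # RIDs with A BETWEEN 5 AND 10
```

The sorted index answers the range query by touching only the matching entries, while `range_query_hash` has to visit every key in the index.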
Alternate B-Tree Organization
• Many records with the same search key cause
redundancy
– <Stanford,RID1>,<Stanford,RID2>,
<Stanford,RID3>,<Stanford,RID4>
• Can store RID-lists instead
– <Stanford, (RID1,RID2,RID3,RID4)>
– Each value occurs once in the index
– Index entry is <Value,RID-list> instead of
<Value,RID>
– Saves space when search key has many repeated
values
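The two organizations can be compared with a small sketch. The entries and the helper name `to_rid_lists` are made up for illustration; a real index stores these structures in pages, not Python lists.

```python
from itertools import groupby

# <Value, RID> organization: the value is repeated for every record.
entries = [
    ("MIT", "RID5"),
    ("Stanford", "RID1"),
    ("Stanford", "RID2"),
    ("Stanford", "RID3"),
    ("Stanford", "RID4"),
]


def to_rid_lists(sorted_entries):
    """Collapse sorted <Value, RID> pairs into <Value, RID-list> entries."""
    return [
        (value, [rid for _, rid in group])
        for value, group in groupby(sorted_entries, key=lambda e: e[0])
    ]


rid_lists = to_rid_lists(entries)
print(rid_lists)
print(len(entries), "stored values vs.", len(rid_lists))
```

Each distinct value is stored once, so the saving grows with the number of repeats per value.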
Clustered Indexes
• An index is clustered (or “clustering”) if records in the
relation are organized based on index search key
• Clustered indexes are good because:
– Records satisfying a range query are packed onto a small number
of consecutive pages
• In unclustered indexes, by contrast:
– Records satisfying a range query are spread across a large
number of random pages
– Commingled with other records that do not satisfy the query
• Only one clustered index allowed per relation
– A relation can’t be simultaneously sorted by 2 different attributes
– (Unless there are multiple copies of the relation)
Clustered vs. Unclustered
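[Figure: Clustered index: leaf entries 2 4 4 5 7 8 point in order to data pages (2 4 4) and (5 7 8), giving sequential reads. Unclustered index: the same leaf entries point to data pages (4 7 5) and (2 4 8), giving random reads.]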
Comparing Access Plans
• Consider query “SELECT * FROM R WHERE A=5”
• Three query plans:
– Scan relation R
• Sequential read of all pages in R
• Regardless of how many tuples have A=5
– Use clustered index on A
• Sequential read of relevant pages in R
• Num. relevant pages = (# of tuples with A=5) / (# of tuples per page)
• Plus overhead of accessing index pages
– Use unclustered index on A
• Random read of relevant pages in R
• Number of relevant pages ≤ (# of tuples with A=5); worst case is one page read per matching tuple
– Less if A is highly correlated with sort order of relation
• Plus overhead of accessing index pages
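The three plans can be compared with a back-of-the-envelope cost model. The cost units are assumptions chosen for illustration: a sequential page read costs 1, a random page read costs 10, and each index lookup touches 3 index pages, all read randomly. The unclustered plan uses the worst case of one page read per matching tuple.

```python
import math


def scan_cost(total_tuples, tuples_per_page, seq_cost=1.0):
    """Sequential read of every page in R."""
    return math.ceil(total_tuples / tuples_per_page) * seq_cost


def clustered_cost(matching, tuples_per_page, index_pages=3,
                   seq_cost=1.0, rand_cost=10.0):
    """Index overhead plus a sequential read of the relevant pages."""
    relevant_pages = math.ceil(matching / tuples_per_page)
    return index_pages * rand_cost + relevant_pages * seq_cost


def unclustered_cost(matching, index_pages=3, rand_cost=10.0):
    """Index overhead plus one random page read per matching tuple."""
    return index_pages * rand_cost + matching * rand_cost


# Assumed relation: 1,000,000 tuples, 100 per page, 1,000 with A=5.
print(scan_cost(1_000_000, 100))
print(clustered_cost(1_000, 100))
print(unclustered_cost(1_000))
```

Under these assumed costs the unclustered plan is already slightly worse than a full scan when only 0.1% of the tuples match.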
Comparing Access Plans
• Clustered index is always best
– Unless all tuples are being returned (then use scan)
– But clustered index may not be available
• Unclustered index beats scan when fraction of
tuples returned is small
– Depends on these factors:
• % of tuples being returned
• Cost ratio of random I/O vs. sequential I/O
• # of tuples per page
– Query returns >10% of rows → scan is almost
certainly faster
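How the three factors combine can be sketched as a break-even fraction. Assume a scan costs (N / t) sequential reads and the unclustered index costs about one random read per returned tuple, ignoring index-page overhead. The two are equal when the fraction returned is 1 / (t × r), where t is tuples per page and r is the random-to-sequential cost ratio. The function name and sample values are illustrative.

```python
def break_even_fraction(tuples_per_page, random_to_sequential_ratio):
    """Fraction of tuples returned at which scan and unclustered index tie.

    Simplifying assumptions: one random page read per returned tuple,
    no index-page overhead, no benefit from caching.
    """
    return 1.0 / (tuples_per_page * random_to_sequential_ratio)


# Assumed settings: (tuples per page, random/sequential cost ratio).
for t, r in [(100, 10), (20, 5), (2, 5)]:
    print(t, r, break_even_fraction(t, r))
```

Below the break-even fraction the unclustered index wins; above it the scan wins. Under this model the break-even point only reaches 10% with very wide tuples or a very low cost ratio, which fits the rule of thumb above.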
Covering Indexes
• Example using index in a book:
– “What does this book say about fact tables?”
• Look up “fact tables” in the index
• Turn to each page that is listed
• Read that page and see what it says
– “Which of these topics are discussed in this
book: fact tables, bridge tables, B-trees?”
• Look up the three topics in the index
• See how many of them appear
• Don’t need to read any of the actual book
Covering Indexes
• Sometimes an index has all the data you need
– Allows index-only query plan
– Not necessary to access the actual tuples
– Such an index is called a covering index
• SELECT COUNT(*) FROM R WHERE A=5
– Use index on A
– Count number of <5,RID> entries
– No need to look up records referenced by RIDs
• An index is a “thin” copy of a relation
– Not all columns from the relation are included
– The index is sorted in a particular way
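A minimal sketch of the index-only plan for the COUNT(*) query, with a sorted Python list standing in for the leaf level of an index on A. The entries and the function name are hypothetical.

```python
import bisect

# Sorted <A, RID> index entries; the relation itself is never touched.
index_on_a = [(2, 10), (4, 11), (4, 12), (5, 13), (5, 14), (7, 15)]


def count_where_a_equals(index, value):
    """SELECT COUNT(*) FROM R WHERE A = value, answered from the index."""
    first = bisect.bisect_left(index, (value,))
    last = bisect.bisect_right(index, (value, float("inf")))
    return last - first  # number of <value, RID> entries


print(count_where_a_equals(index_on_a, 5))
```

The RIDs are never dereferenced; only the positions of the matching entries matter.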
Multi-Column Indexes
• Multi-column indexes are very useful in data
warehousing
– We say such an index has a composite key
• Example: B-Tree index on (A,B)
– Search key is (A,B) combination
– Index entries sorted by A value
– Entries with same A value are sorted by B value
– Called a lexicographic sort
• SELECT SUM(B) FROM R WHERE A=5
– Our (A,B) index covers this query!
• Coverage vs. size trade-off
– More attributes in search key → index covers more queries
– More attributes in search key → index takes up more disk space
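The lexicographic sort and the covered SUM(B) query can be sketched as follows. Python tuples already compare lexicographically, so sorting (A, B) pairs gives the index order described above. The rows and the function name are made up for illustration.

```python
import bisect

# Hypothetical relation R with columns (A, B).
rows = [(5, 30), (2, 10), (5, 10), (7, 40), (5, 20), (2, 50)]

# Composite-key index: sorted by A, then by B within equal A values.
index_ab = sorted(rows)


def sum_b_where_a_equals(index, a):
    """SELECT SUM(B) FROM R WHERE A = a, using only the (A, B) index."""
    first = bisect.bisect_left(index, (a,))
    last = bisect.bisect_right(index, (a, float("inf")))
    return sum(b for _, b in index[first:last])


print(index_ab)
print(sum_b_where_a_equals(index_ab, 5))
```

All entries with A=5 are contiguous in the index and each carries its B value, so the relation is not needed. An index on A alone would have to follow every RID to fetch B.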
Fact and Dimension Indexes
• Dimension table index
– Narrow version of table with only frequently-queried
attributes
– Always include dimension key!
– Improve performance on large dimension tables