Chapter 4: Spatial Storage and Indexing
Physical Model in 3 Level Design?
Recall 3 levels of database design
Conceptual model: high level abstract description
Logical model: description of a concrete realization
Physical model: implementation using basic components
Analogy with vehicles
Conceptual model: mechanisms to move, turn, stop, ...
Logical models:
Car: accelerator pedal, steering wheel, brake pedal, ...
Bicycle: pedal forward to move, turn handle, pull brakes on handle
Physical models:
Car: engine, transmission, master cylinder, brake lines, brake pads, ...
Bicycle: chain from pedal to wheels, gears, wire from handle to brake pads
We now go, so to speak, under the hood
What is a Physical Data Model?
What is a physical data model of a database?
Concepts to implement logical data model
Using current components, e.g. computer hardware, operating systems
In an efficient and fault-tolerant manner
Why learn physical data model concepts?
To be able to choose between DBMS brand names
some brand names do not have spatial indices!
To be able to use DBMS facilities for performance tuning
For example, if a query is running slow,
one may create an index to speed it up
For example, if loading a large number of tuples takes forever,
one may drop indices on the table before the inserts
and recreate the indices after the inserts are done!
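The drop-and-recreate tuning step above can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in DBMS; the table name `city`, the index name `idx_city_name`, and the helper `bulk_load` are hypothetical, and whether the trick actually pays off depends on the DBMS and the data volume.

```python
import sqlite3

def bulk_load(conn, rows):
    """Drop the index, insert all rows, then rebuild the index once."""
    cur = conn.cursor()
    cur.execute("DROP INDEX IF EXISTS idx_city_name")   # avoid per-row index upkeep
    cur.executemany("INSERT INTO city(name, country) VALUES (?, ?)", rows)
    cur.execute("CREATE INDEX idx_city_name ON city(name)")  # rebuild in one pass
    conn.commit()

# Hypothetical table and index, created in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city(name TEXT, country TEXT)")
conn.execute("CREATE INDEX idx_city_name ON city(name)")

bulk_load(conn, [("Havana", "Cuba"), ("Ottawa", "Canada"), ("Toronto", "Canada")])
row_count = conn.execute("SELECT COUNT(*) FROM city").fetchone()[0]
```

After the load, queries on `name` can use the rebuilt index again.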
Concepts in a Physical Data Model
Database concepts
Conceptual data model - entity, (multi-valued) attributes, relationship, ...
Logical model - relations, atomic attributes, primary and foreign keys
Physical model - secondary storage hardware, file structures, indices, ...
Examples of physical model concepts from relational DBMS
Secondary storage hardware: disk drives
File structures - sorted
Auxiliary search structure - search trees (hierarchical collections of one-dimensional ranges)
An Interesting Fact about Physical Data Model
Physical data model design is a trade-off between
Efficient support for a small set of basic operations on a few data types
Simplicity of overall system
Each DBMS physical model
Chooses a few physical DM techniques
Choice depends on chosen sets of operations and data types
Relational DBMS physical model
Data types: numbers, strings, date, currency
one-dimensional, totally ordered
Operations:
search on one-dimensional totally ordered data types
insert, delete, ...
Physical Data Model for SDBMS
Is relational DBMS physical data model suitable for spatial data?
Relational DBMS has simple values like numbers
Sorting, search trees are efficient for numbers
These concepts are not natural for spatial data (e.g. points in a plane)
Reusing relational physical data model concepts
Space filling curves define a total order for points
This total order helps in using ordered files, search trees
But may lead to computational inefficiency!
New spatial techniques
Spatial indices, e.g. grids, hierarchical collection of rectangles
Provide better computational performance
Common Assumptions for SDBMS Physical Model
Spatial data
Dimensionality of space is low, e.g. 2 or 3
Data types: OGIS data types
Approximations for extended objects (e.g. linestrings, polygons)
Minimum Orthogonal Bounding Rectangle (MOBR or MBR)
MBR(O) is the smallest axis-parallel rectangle enclosing an object O
Supports filter and refine processing of queries
Spatial operations
OGIS operations, e.g. topological, spatial analysis
Many topological operations are approximated by Overlap
Common spatial queries - listed in next slide
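The MBR approximation and the filter step can be sketched in a few lines. This is a minimal illustration, assuming objects are given as lists of (x, y) vertices; the names `mbr`, `overlaps`, `filter_step` and the sample objects are hypothetical. The refine step, which tests exact geometry, is deliberately left out.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def mbr(points: List[Point]) -> Rect:
    """Smallest axis-parallel rectangle enclosing all vertices of an object."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def overlaps(a: Rect, b: Rect) -> bool:
    """True if two axis-parallel rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def filter_step(objects: Dict[str, List[Point]], query: Rect) -> List[str]:
    """Cheap filter: keep objects whose MBR overlaps the query rectangle."""
    return [oid for oid, pts in objects.items() if overlaps(mbr(pts), query)]

# Hypothetical data: a triangle and a short line segment.
objects = {
    "triangle": [(0, 0), (4, 0), (0, 4)],
    "segment": [(10, 10), (12, 11)],
}
candidates = filter_step(objects, (3, 3, 5, 5))
```

Here the triangle survives the filter because its MBR (0, 0, 4, 4) overlaps the query, although the triangle itself does not reach the query rectangle. The refine step exists to remove such false candidates.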
Common Spatial Queries and Operations
Physical model provides simpler operations needed by spatial queries!
Common Queries
Point query: Find all rectangles containing a given point
Range query: Find all objects within a query rectangle
Nearest neighbor: Find the point closest to a query point
Intersection query: Find all the rectangles intersecting a query rectangle
Common operations across spatial queries
find : retrieve records satisfying a condition on attribute(s)
findnext : retrieve next record in a dataset with total order
after the last one retrieved via previous find or findnext
nearest neighbor of a given object in a spatial dataset
Scope of Discussion
Learn basic concepts in physical data model of SDBMS
Review related concepts from physical DM of relational DBMS
Reusing relational physical data model concepts
Space filling curves define a total order for points
This total order helps in using ordered files, search trees
But may lead to computational inefficiency!
New techniques
Spatial indices, e.g. grids, hierarchical collection of rectangles
Provide better computational performance
Storage Hierarchy in Computers
Computers have several components
Central Processing Unit (CPU)
Input, output devices, e.g. mouse, keyboard, monitors, printers
Communication mechanisms, e.g. internal bus, network card, modem
Storage Hierarchy
Types of storage Devices
Main memories - fast but content is lost when power is off
Secondary storage - slower, retains content without power
Tertiary storage - very slow, retains content, very large capacity
DBMSs usually manage data
On secondary storage, e.g. disks
Use main memory to improve performance
Use tertiary storage (e.g. tapes) for backup, archival etc.
Secondary Storage Hardware: Disk Drives
Disk concepts
Circular platters with magnetic storage medium
Multiple platters are mounted on a spindle
Platters are divided into concentric tracks
A cylinder is a collection of tracks across platters with a common radius
Tracks are divided into sectors
A sector size may be a few kilobytes
Disk drive concepts
Disk heads to read and write
There is a disk head for each platter (recording surface)
A head assembly moves all the heads together in the radial direction
Spindle rotates at a high speed, e.g. thousands of revolutions per minute
Accessing a sector has three major steps:
Seek: Move head assembly to relevant track
Latency: Wait for spindle to rotate relevant sector under disk head
Transfer: Read or write the sector
Other steps involve communication between disk controller and CPU
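The three steps add up to a simple cost estimate. The sketch below is illustrative only: the function names are hypothetical and the numbers (8 ms seek, 7200 rpm, 4096-byte sector, 100 MB/s transfer rate) are assumed values, not figures from the text. It uses the standard approximation that the average rotational wait is half a revolution.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational wait: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm

def transfer_ms(sector_bytes: int, transfer_mb_per_s: float) -> float:
    """Time to read or write one sector at a given transfer rate."""
    return sector_bytes / (transfer_mb_per_s * 1_000_000) * 1000.0

def access_time_ms(seek_ms: float, rpm: float,
                   sector_bytes: int, transfer_mb_per_s: float) -> float:
    """Seek + latency + transfer for a single sector access."""
    return (seek_ms
            + rotational_latency_ms(rpm)
            + transfer_ms(sector_bytes, transfer_mb_per_s))

# Assumed, illustrative drive parameters.
latency = rotational_latency_ms(7200)
total = access_time_ms(seek_ms=8.0, rpm=7200,
                       sector_bytes=4096, transfer_mb_per_s=100.0)
```

With these assumed values, the seek term is the largest and the transfer term is the smallest of the three.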
Using Disk Hardware Efficiently
Disk access costs are affected by
Placement of data on the disk
Fact that seek cost > latency cost > transfer cost (See Table 4.2, pp.86)
A few common observations follow
Size of sectors
Larger sectors provide faster transfer of large data sets
But waste storage space inside sectors for small data sets
Placement of most frequently accessed data items
On middle tracks rather than innermost or outermost tracks
Reason: minimize average seek time
Placement of items in a large data set requiring many sectors
Choose sectors from a single cylinder
Reason: Minimize seek cost in scanning the entire data set.
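The middle-track observation can be checked with a small calculation. This sketch rests on a simplifying assumption that is not from the text: before each access to the frequently used item, the head rests on a track chosen uniformly at random, and seek cost grows with the number of tracks crossed. The function name and the track count of 1001 are hypothetical.

```python
def average_seek_distance(home_track: int, num_tracks: int) -> float:
    """Mean number of tracks crossed to reach home_track from a random track."""
    return sum(abs(t - home_track) for t in range(num_tracks)) / num_tracks

num_tracks = 1001                      # tracks numbered 0 (outermost) to 1000
middle = average_seek_distance(500, num_tracks)
edge = average_seek_distance(0, num_tracks)
```

Under this model the middle track needs about half the average head travel of an edge track.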
Software View of Disks: Fields, Records and File
Views of secondary storage (e.g. disks)
Hardware views - discussed in last few slides
Software views - Data on disks is organized into fields, records, files
Concepts
A field represents a property or attribute of a relation or an entity
A record represents a row in a relational table
collection of fields for attributes in relational schema of the table
Files are collections of records
homogeneous collection of records may represent a relation
heterogeneous collections may be a union of related relations
Mapping Records and Files to Disk
Records
Often smaller than a sector
Figure example: a sector holding 6 records, 80 bytes each
Buffer Management
Motivation
Accessing a sector on disk is much slower than accessing main memory
Idea: keep repeatedly accessed data in main memory buffers
to improve the completion time of queries
reducing load on disk drive
Buffer Manager software module decides
Which sectors stay in main memory buffers?
Which sector is moved out if we run out of memory buffer space?
When to pre-fetch sector before access request from users?
These decisions are based on the disk access patterns of queries!
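One possible replacement policy is least-recently-used (LRU). The sketch below assumes LRU purely for illustration, since the slide does not name a policy; the class `BufferManager`, the `read_sector` callback, and the toy `disk` dictionary are hypothetical. Pre-fetching is not modeled.

```python
from collections import OrderedDict

class BufferManager:
    """Keeps up to `capacity` sectors in memory, evicting the least recently used."""

    def __init__(self, capacity, read_sector):
        self.capacity = capacity
        self.read_sector = read_sector   # function: sector id -> sector contents
        self.frames = OrderedDict()      # sector id -> contents, oldest first
        self.disk_reads = 0

    def get(self, sector_id):
        if sector_id in self.frames:             # buffer hit: no disk access
            self.frames.move_to_end(sector_id)
            return self.frames[sector_id]
        self.disk_reads += 1                     # buffer miss: go to disk
        data = self.read_sector(sector_id)
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)      # evict least recently used
        self.frames[sector_id] = data
        return data

# Toy "disk" of ten sectors and a two-frame buffer.
disk = {i: f"sector-{i}" for i in range(10)}
buf = BufferManager(capacity=2, read_sector=disk.__getitem__)
for sid in [1, 2, 1, 3, 1, 2]:
    buf.get(sid)
```

In this access pattern, sector 1 is requested repeatedly and stays in the buffer, so six requests cost four disk reads.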
File Structures
What is a file structure?
A method of organizing records in a file
For efficient implementation of common file operations on disks
Example: ordered files
Measure of efficiency
I/O cost: Number of disk sectors retrieved from secondary storage
CPU cost: Number of CPU instructions used
See Table 4.1 for relative importance of cost components
Total cost = sum of I/O cost and CPU cost
File Structures: Selected File Operations
Common file operations
Find: key value --> record matching key values
Findnext --> Return next record after find if records were sorted
Insert --> Add a new record to file without changing file-structure
Nearest neighbor of an object in a spatial dataset
Examples using Figure 4.1, pp.88
find(Name=Canada) on Country table returns record about Canada
findnext() on Country table returns record about Cuba
since Cuba is next value after Canada in sorted order of Name
insert(record about Panama) into Country table
adds a new record
location of record in Country file depends on file-structure
Nearest neighbor of Argentina in Country table is Brazil
Common File Structures
Common file structures
Heap or unordered or unstructured
Ordered
Hashed
Clustered
Descriptions follow
Basic comparison of common File Structures
Heap file is efficient for inserts and used for logfiles
but find, findnext, etc. are very slow
Hashed files are efficient for find, insert, delete etc.
but findnext is very slow
Ordered files are very fast for findnext
and pretty competent for find, insert, etc.
File Structures: Heap, Ordered
Heap
Records are in no particular order (example: Figure 4.1)
Insert can simply add record to the last sector
find, findnext, nearest neighbor scan the entire file
Ordered
Records are sorted by a selected field (example: Figure 4.3 below)
findnext can simply pick up physically next record
find, insert, delete may use binary search, which is very efficient
nearest neighbor processed as a range query (see pp.95 for details)
Figure 4.3: Ordered file for City table (73-byte records)
Sector 1 (7 records): Brasilia, Buenos Aires, Havana, Mexico City, Monterrey, Ottawa, Rosario
Sector 2 (2 records): Toronto, Washington DC
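The ordered-file operations can be sketched with an in-memory sorted list standing in for the file. This is a minimal illustration using the city names from Figure 4.3 as keys; the class `OrderedFile` is hypothetical, and sectors are not modeled, so it shows only the logic of find by binary search and findnext by stepping to the physically next record.

```python
from bisect import bisect_left, insort

class OrderedFile:
    """Records kept sorted by key; a cursor remembers the last record found."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.cursor = None

    def find(self, key):
        i = bisect_left(self.keys, key)          # binary search
        if i < len(self.keys) and self.keys[i] == key:
            self.cursor = i
            return self.keys[i]
        self.cursor = None
        return None

    def findnext(self):
        """Return the record after the one returned by the last find/findnext."""
        if self.cursor is None or self.cursor + 1 >= len(self.keys):
            return None
        self.cursor += 1
        return self.keys[self.cursor]

    def insert(self, key):
        insort(self.keys, key)                   # keeps the sorted order
        self.cursor = None

cities = OrderedFile([
    "Brasilia", "Buenos Aires", "Havana", "Mexico City", "Monterrey",
    "Ottawa", "Rosario", "Toronto", "Washington DC",
])
found = cities.find("Havana")
after = cities.findnext()
```

Note that on disk an insert into the middle of an ordered file may require shifting records, which the in-memory list hides.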
File Structure: Hash
Components of a hash file structure (Figure 4.2)
A set of buckets (sectors)
Hash function: key value --> bucket
Hash directory: bucket --> sector
Operations
find, insert, delete are fast
compute hash function
lookup directory
fetch relevant sector
findnext, nearest neighbor are slow
no order among records
Figure 4.2: hash file with 4 buckets, 9 records (sectors hold 2, 2, 3 and 2 records; City records Havana, Ottawa, Rosario, Toronto, Brasilia, Monterrey, Buenos Aires, Mexico City, Washington DC)
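A hash file can be sketched as a list of buckets plus a hash function. This is a minimal illustration: the class `HashFile` and its byte-sum hash function are hypothetical choices, not the ones behind Figure 4.2, and the list index plays the role of the hash directory (bucket --> sector).

```python
class HashFile:
    """Buckets of (key, record) pairs; the list index acts as the directory."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Assumed toy hash function: sum of the key's bytes modulo bucket count.
        return sum(key.encode("utf-8")) % len(self.buckets)

    def insert(self, key, record):
        self.buckets[self._bucket(key)].append((key, record))

    def find(self, key):
        """Hash, look up the bucket, scan only that bucket."""
        for k, rec in self.buckets[self._bucket(key)]:
            if k == key:
                return rec
        return None

    def findnext(self, key):
        """No order among records: every bucket must be scanned."""
        later = [k for bucket in self.buckets for k, _ in bucket if k > key]
        return min(later) if later else None

hf = HashFile(num_buckets=4)
for name in ["Havana", "Ottawa", "Rosario", "Toronto", "Brasilia",
             "Monterrey", "Buenos Aires", "Mexico City", "Washington DC"]:
    hf.insert(name, {"city": name})
```

The contrast between `find`, which touches one bucket, and `findnext`, which touches all of them, is the point of the slide.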
Spatial File Structures: Clustering
Motivation:
Ordered files are not natural for spatial data
Clustering records in sector by space filling curve is an alternative
In general, clustering groups records
accessed by common queries
into common disk sectors
to reduce I/O costs for selected queries
Clustering using Space filling curves
Z-curve
Hilbert-curve
Details on following 3 slides
Z-Curve
What is a Z-curve?
A space filling curve
Generated from interleaving bits of x, y coordinates
see Figure 4.6
Alternative generation method
see Figure 4.5
looks like Ns or Zs
Implementing file operations
Figure 4.4: curves for n=0, n=1, n=2, n=3
Figure 4.6: x = 0010, y = 0100, i.e. point (2,4); interleaved bits 00011000 = (24)
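Bit interleaving is short enough to write out. The sketch below follows the bit order used in Figure 4.6, where each x bit precedes the corresponding y bit; the function name `z_value` and its `bits` parameter (bits per coordinate) are hypothetical.

```python
def z_value(x: int, y: int, bits: int) -> int:
    """Interleave the bits of x and y, most significant first, x bit before y bit."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)   # next bit of x
        z = (z << 1) | ((y >> i) & 1)   # next bit of y
    return z

# Point (2, 4) with 4 bits per coordinate, as in Figure 4.6.
z = z_value(2, 4, bits=4)
```

Sorting points by `z_value` gives the total order that lets ordered files and search trees be reused for point data.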
Example of Z-values
Figure 4.7
Left part shows a map with spatial objects A, B, C
Right part and left bottom part show Z-values within A, B and C
Note C gets z-values of 2 and 8, which are not close
Exercise: Compute z-values for B.
Object  Point  x   y   interleave  z-value
A       1      00  11  0101        5
B       1-4    (left as the exercise)
C       1      01  00  0010        2
C       2      10  00  1000        8
Map axes: X and Y, each labeled 00, 01, 10, 11
Bottom scale of z-values: 0 4 7 8 12 16
Hilbert Curve
A space filling curve
Example: Figure 4.5
More complex to generate
Due to rotations
See details on pp.92-93
Illustration on next slide!
Implementing file operations
Similar to ordered files
Figure 4.5: curves for n=0, n=1, n=2, n=3; axes labeled 00, 01, 10, 11
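Calculating Hilbert Values
Figure 4.8, panels (a)-(d): each panel is a 4 x 4 grid with x and y labeled 00, 01, 10, 11
Panel (c): Hilbert values as two-digit codes (base-4 form of panel (d), e.g. 32 = 14)
y\x   00  01  10  11
00    00  01  32  33
01    03  02  31  30
10    10  13  20  23
11    11  12  21  22
Panel (d): Hilbert values in decimal
y\x   00  01  10  11
00     0   1  14  15
01     3   2  13  12
10     4   7   8  11
11     5   6   9  10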
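The decimal values in panel (d) can be reproduced in code. The sketch below uses a widely known iterative formulation (rotate or flip the quadrant at each level), which is not necessarily the procedure described on pp.92-93; the function name `hilbert_value` and the `order` parameter (bits per coordinate) are hypothetical.

```python
def hilbert_value(x: int, y: int, order: int) -> int:
    """Position of cell (x, y) along a Hilbert curve on a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)      # which quadrant, in curve order
        if ry == 0:                       # rotate/flip so the sub-curve lines up
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Rebuild the 4 x 4 grid of panel (d): rows are y = 0..3, columns are x = 0..3.
grid = [[hilbert_value(x, y, order=2) for x in range(4)] for y in range(4)]
```

The rotation step is what makes Hilbert values more complex to generate than z-values.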
Handling Regions with Z-curve
Figure 4.9, panels (a) and (b): 4 x 4 grids with axes Sx and Sy, each labeled 00, 01, 10, 11
What is an Index?
Concept of an index
Auxiliary file to search a data file
Example: Figure 4.10
Index records have
Key value
Address of relevant data sector
see arrows in Figure 4.10
Index records are ordered
find, findnext, insert are fast
Note assumption of total order
Figure 4.10: Secondary index on city name
Index file: 9 records, 38 bytes each, ordered by key (Brasilia, Buenos Aires, Havana, Mexico City, Monterrey, Ottawa, Rosario, ...)
Data file: 73-byte records, 7 in the first sector (Havana, Washington DC, Monterrey, Toronto, Brasilia, Rosario, Ottawa)
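A secondary index can be sketched as a sorted list of (key, sector address) pairs over an unordered data file. This is a minimal illustration: the three-sector layout in `data_sectors` is hypothetical and does not reproduce Figure 4.10, and the helper names are assumptions.

```python
from bisect import bisect_left

# Hypothetical layout: an unordered (heap) data file split into three sectors.
data_sectors = [
    ["Havana", "Washington DC", "Monterrey"],
    ["Toronto", "Brasilia", "Rosario"],
    ["Ottawa", "Buenos Aires", "Mexico City"],
]

def build_secondary_index(sectors):
    """One (key, sector address) entry per data record, sorted by key."""
    return sorted(
        (name, address)
        for address, sector in enumerate(sectors)
        for name in sector
    )

def find(index, key):
    """Binary-search the index; return the data sector address or None."""
    i = bisect_left(index, (key,))      # (key,) sorts just before (key, address)
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None

def findnext(index, key):
    """Return the index entry that follows `key` in key order, or None."""
    i = bisect_left(index, (key,))
    if i < len(index) and index[i][0] == key:
        i += 1
    return index[i] if i < len(index) else None

index = build_secondary_index(data_sectors)
```

Both operations rely on the keys having a total order, which is the assumption the slide points out.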
Classifying Indexes
Classification criteria
Data-file-structure
Key data type
Others
Primary index
Data file ordered by indexed attribute
1 index record / data sector
Example: Figure 4.11
Q? A table can have at most ...
Figure 4.11: Ordered file for city table
Sector 1: 7 records, 73 bytes (Brasilia, Buenos Aires, Havana, ...)
Sector 2: 2 records, 73 bytes (Toronto, Washington DC)
Ideas Behind Grid Files
Basic idea - Divide space into cells by a grid
Figures 4.12 and 4.14
Grid Files
Grid file components
Linear scale - row/column boundaries
Grid directory: cell --> disk sector address
Operation implementation
Figure 4.13
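The two grid file components can be sketched as follows. This is a minimal illustration under simplifying assumptions: fixed linear scales, one sector per non-empty cell, and no splitting when a sector fills up. The class `GridFile` and the boundary values are hypothetical.

```python
from bisect import bisect_right

class GridFile:
    """Linear scales locate a cell; the grid directory maps the cell to a sector."""

    def __init__(self, x_scale, y_scale):
        self.x_scale = x_scale    # sorted interior column boundaries
        self.y_scale = y_scale    # sorted interior row boundaries
        self.directory = {}       # (column, row) -> sector id
        self.sectors = {}         # sector id -> list of points

    def cell(self, x, y):
        """Binary-search each linear scale to find the cell containing (x, y)."""
        return (bisect_right(self.x_scale, x), bisect_right(self.y_scale, y))

    def insert(self, x, y):
        sector_id = self.directory.setdefault(self.cell(x, y), len(self.sectors))
        self.sectors.setdefault(sector_id, []).append((x, y))

    def find(self, x, y):
        """Point lookup: scales -> directory -> a single data sector."""
        sector_id = self.directory.get(self.cell(x, y))
        return sector_id is not None and (x, y) in self.sectors[sector_id]

grid = GridFile(x_scale=[100, 200], y_scale=[100])
for point in [(50, 50), (150, 50), (250, 150)]:
    grid.insert(*point)
```

A point lookup touches the scales, one directory entry, and one data sector, regardless of how many points are stored.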
R-Tree Family
Basic Idea
Use a hierarchical collection of rectangles to organize spatial data
Generalizes B-tree to spatial data sets
Classifying members of R-tree family
Handling of large spatial objects
allow rectangles to overlap - R-tree
duplicate objects but keep interior node rectangles disjoint - R+tree
Selection of rectangles for interior nodes
greedy procedures - R-tree, R+tree
procedure to minimize coverage, overlap - packed R-tree
Other criteria exist
Scope of our discussion
Basics of R-tree and R+tree
Focus on concepts not procedures!
Spatial Objects with R-Tree
Properties of R-trees
Balanced
Nodes are rectangles
possible overlap among rectangles!
Other properties in section 4.2.2
Properties of R+trees
Balanced
Interior nodes are rectangles
disjoint rectangles
Leaf nodes - MOBR of polygons or lines
leaf rectangles overlap with parents
Data objects may be duplicated across leaves
Other properties in section 4.2.2
find operation - same as R-tree
Search either b or c to find object 5
Figure 4.18: tree with root R, nodes a-g and objects 1-16 plotted on x and y axes; leaf entries [1,2,3] [5,6,7] [4,5] [8,9] [10,11] [12,13,14] [15,16] (object 5 appears in two leaves)
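The find operation on a hierarchy of rectangles can be sketched as a recursive descent that prunes subtrees whose rectangle misses the query. This is a minimal illustration of search only: the `Node` class, the hand-built two-level tree, and all coordinates are hypothetical, and insertion and node splitting are not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)
    object_id: Optional[int] = None       # set only on leaf entries

def search(node: Node, query: Rect, hits: Optional[List[int]] = None) -> List[int]:
    """Collect ids of leaf entries whose rectangle overlaps the query."""
    if hits is None:
        hits = []
    if not overlaps(node.mbr, query):
        return hits                       # prune this whole subtree
    if node.object_id is not None:
        hits.append(node.object_id)
    for child in node.children:
        search(child, query, hits)
    return hits

# Hypothetical tree: interior rectangles a and b overlap, as an R-tree allows.
a = Node((0, 0, 5, 5), [Node((0, 0, 2, 2), object_id=1),
                        Node((3, 3, 5, 5), object_id=2)])
b = Node((4, 4, 9, 9), [Node((4, 4, 6, 6), object_id=3),
                        Node((8, 8, 9, 9), object_id=4)])
root = Node((0, 0, 9, 9), [a, b])

both_subtrees = search(root, (4.5, 4.5, 4.8, 4.8))
one_subtree = search(root, (0.5, 0.5, 1.0, 1.0))
```

The first query falls in the region where a and b overlap, so both subtrees are visited; the second descends into a only. An R+tree avoids the first situation for interior nodes by keeping their rectangles disjoint, at the price of duplicating objects.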
Trends
New developments in physical model
Use of intra-object indexes
Support for multiple concurrent operations
Index to support spatial join operations
Use of intra-object indexes
Motivation: large objects (e.g. polygon boundary of USA has 1000s of edges)
Algorithms for OGIS operations (e.g. touch, crosses)
often need to check only a few edges of the polygon
relevant edges can be identified by spatial index on edges
example: Figure 4.19, pp.105, section 4.3.1
Uniqueness
intra-object index organizes components within a large spatial object
traditional index organizes a collection of spatial objects
Trends: Concurrency Support
Why support concurrent operations?
SDBMS is shared among many users and applications
Simultaneous requests from multiple users on a spatial table
serial processing of requests is not acceptable for performance
uncoordinated concurrent updates and finds can produce incorrect results
Concurrency control idea for R-tree index
R-link tree: Add links to chain nodes at each level
Use links to ensure correct answer from find operations
Use locks on nodes to coordinate conflicting updates
Details in section 4.3.2 and Figure 4.20, pp.107
Trends: Join Index
Spatial join is a common operation. Expensive to compute using traditional indexes
Spatial join index pre-computes and stores id-pairs of matched rows across tables
Example in Figure 4.21
Speeds up computation of spatial join
details in section 4.3.3
Figures 4.22 and 4.23: R relation (Facility Location), S relation (Forest-Stand Boundary) and their join-index sorted by join-attribute; join-index, PCG and adjacency matrix over pages of relations R and S, with a band-diagonalize (re-order) step
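The idea of a join index can be sketched as a precomputed, sorted list of matching id-pairs. This is a minimal illustration: the tables, ids, coordinates, and the point-in-rectangle predicate are all hypothetical stand-ins for the facility and forest-stand relations, and the page-level organization shown in the figures is not modeled.

```python
def point_in_rect(point, rect):
    """Assumed join predicate: the facility point lies inside the stand rectangle."""
    x, y = point
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax

def build_join_index(r_rows, s_rows, predicate):
    """Pre-compute id-pairs of matching rows (done once, ahead of queries)."""
    return sorted(
        (rid, sid)
        for rid, r in r_rows.items()
        for sid, s in s_rows.items()
        if predicate(r, s)
    )

def spatial_join(r_rows, s_rows, join_index):
    """Answer the join by following the stored id-pairs, with no geometry tests."""
    return [(r_rows[rid], s_rows[sid]) for rid, sid in join_index]

# Hypothetical relations: facility points (R) and forest-stand rectangles (S).
facilities = {1: (1, 1), 2: (5, 5), 3: (9, 9)}
stands = {10: (0, 0, 2, 2), 20: (4, 4, 6, 6), 30: (4, 0, 10, 6)}

join_index = build_join_index(facilities, stands, point_in_rect)
joined = spatial_join(facilities, stands, join_index)
```

The expensive geometric comparisons happen once when the index is built; later joins only follow id-pairs, at the cost of keeping the index up to date when either table changes.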
Summary
Physical DM efficiently implements logical DM on computer hardware
Physical DM has file structures, indexes
Classical methods were designed for data with total ordering
Fall short in handling spatial data
Because spatial data is multi-dimensional
Two approaches to support spatial data and queries
Reuse classical methods
use space filling curves to impose a total order on multi-dimensional data
Use new methods
R-trees, Grid files