File Organization

The document discusses file organizations and indexing in databases. It describes different file organizations like sorted files and heap files. It also explains primary and secondary indexing as well as different index structures like B-trees and hash indexes. The purpose of indexing is to improve efficiency of retrieving records from files.

Uploaded by

yashnaik7664

Overview of Storage and Indexing

 Data in a DBMS is a collection of records, or a file, and each file
consists of one or more pages.
 Understanding how records are organized is essential to using a
database system effectively, and it is the main topic of this
chapter.

 A file organization is a method of arranging the records in a
file when the file is stored on disk. Each file organization makes
certain operations efficient but other operations expensive.
Example: a file of employee records, each containing age, name,
and sal fields.

1) Retrieving employee records in order of increasing age
 Sorting the file by age is a good file organization.
 The sort order is expensive to maintain if the file is frequently modified.

2) Other operations on the same collection of records
 Ex: retrieve all employees who make more than $5000.
 With the file sorted by age, we have to scan the entire file to find such employee records.
 A technique called indexing can help when we have to access a
collection of records in multiple ways, in addition to efficiently
supporting various kinds of selection.
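The trade-off in the employee example can be sketched in a few lines of Python. This is only an illustration: the records, the tuple layout (age, name, sal), and the function names are assumptions, and an in-memory list stands in for a file on disk.

```python
import bisect

# Hypothetical employee records as (age, name, sal) tuples, kept sorted by age.
employees = [(22, "Ashby", 3000), (25, "Basu", 6500), (33, "Bristow", 4000), (44, "Cass", 7000)]

def insert_sorted(records, record):
    """Insert while preserving age order; shifting later records is the maintenance cost."""
    bisect.insort(records, record)

def in_age_order(records):
    """The file is already sorted by age, so one pass returns records in the desired order."""
    return list(records)

def earning_more_than(records, threshold):
    """The age ordering does not help a condition on sal: every record is examined."""
    return [r for r in records if r[2] > threshold]

insert_sorted(employees, (29, "Daniels", 5500))
high_paid = earning_more_than(employees, 5000)
```

Retrieval by age is cheap here because the order is maintained on every insert, while the condition on sal still touches every record.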
DATA ON EXTERNAL STORAGE
 A DBMS stores vast quantities of data, and the data must persist
across program executions. Therefore, data is stored on external
storage devices such as disks and tapes, and fetched into main
memory as needed for processing.

 The unit of information read from or written to disk is a page.
The size of a page is a DBMS parameter, and typical values are
4KB or 8KB.
 Data is read into memory for processing, and written to disk
for persistent storage.

 The cost of page I/O (input from disk to main memory and
output from memory to disk) dominates the cost of typical
database operations, and database systems are carefully
optimized to minimize this cost.
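Because cost is counted in page I/Os, a rough estimate of a full scan is simply the number of pages in the file. The sketch below assumes fixed-length records packed whole into pages with no page header; the record count and record size are made-up values.

```python
import math

PAGE_SIZE = 4096  # bytes; 4KB is one of the typical values mentioned above

def pages_needed(num_records, record_size, page_size=PAGE_SIZE):
    """Pages required when whole records are packed into fixed-size pages (no header assumed)."""
    records_per_page = page_size // record_size
    return math.ceil(num_records / records_per_page)

# Hypothetical file: 1,000,000 records of 100 bytes each.
scan_cost_4k = pages_needed(1_000_000, 100)        # page reads for a full scan with 4KB pages
scan_cost_8k = pages_needed(1_000_000, 100, 8192)  # the same file with 8KB pages
```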
FILE ORGANIZATIONS AND INDEXING
 The file of records is an important abstraction in a DBMS
 A file can be created, destroyed, and have records inserted into and
deleted from it. It also supports scans; a scan operation allows us to
step through all the records in the file one at a time.
 A relation is typically stored as a file of records.
 The file layer stores the records in a file in a collection of disk
pages. It keeps track of pages allocated to each file, and as records
are inserted into and deleted from the file, it also tracks available
space within pages allocated to the file.
 The simplest file structure is an unordered file, or heap file.
Records in a heap file are stored in random order across the pages
of the file.
 An index is a data structure that organizes data records on
disk to optimize certain kinds of retrieval operations.

 An index allows us to efficiently retrieve all records that satisfy
search conditions on the search key fields of the index.
 We can also create additional indexes on a given collection of
data records, each with a different search key, to speed up
search operations that are not efficiently supported by the file
organization used to store the data records.
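As a rough sketch of the ideas in this section, the Python below models a heap file as a list of pages and builds an index over it. The class and function names, the two-records-per-page capacity, and the (page number, slot) form of a record id are assumptions for illustration; deletion and free-space tracking are left out.

```python
from collections import defaultdict

class HeapFile:
    """Toy heap file: a list of pages, each page a list of records (assumed layout)."""

    def __init__(self, records_per_page=2):
        self.records_per_page = records_per_page
        self.pages = []

    def insert(self, record):
        """Place the record in any page with free space and return its rid (page_no, slot)."""
        for page_no, page in enumerate(self.pages):
            if len(page) < self.records_per_page:
                page.append(record)
                return (page_no, len(page) - 1)
        self.pages.append([record])
        return (len(self.pages) - 1, 0)

    def scan(self):
        """Step through all records one at a time, yielding (rid, record)."""
        for page_no, page in enumerate(self.pages):
            for slot, record in enumerate(page):
                yield (page_no, slot), record

    def fetch(self, rid):
        page_no, slot = rid
        return self.pages[page_no][slot]

def build_index(heap_file, key_of):
    """Map each search key value to the rids of the records that carry it."""
    index = defaultdict(list)
    for rid, record in heap_file.scan():
        index[key_of(record)].append(rid)
    return index

emp = HeapFile()
for rec in [(44, "Cass", 7000), (22, "Ashby", 3000), (25, "Basu", 6500)]:
    emp.insert(rec)
age_index = build_index(emp, key_of=lambda r: r[0])
matches = [emp.fetch(rid) for rid in age_index[25]]
```

A second call to `build_index` with a different `key_of` gives an additional index on the same records without changing how the heap file stores them.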
Clustered Indexes
 When a file is organized so that the ordering of data records
is the same as or close to the ordering of data entries in some
index, we say that the index is clustered; otherwise, it is an
unclustered index.
Primary and Secondary Indexes
Primary Index

 An index whose search key specifies the sequential order of
the data file is a primary index.
 One index record appears for every search-key value in the
file.
 The index record contains the search-key value and a
pointer to the block containing the first data record with
that search key value
Secondary Index
 An index whose search key does not specify the sequential
order of the data file is a secondary index.
 Secondary indices use an extra level of indirection.
 The pointers do not point directly to the file. Instead, each
pointer points to a bucket that contains pointers to the
blocks containing the records with that search key value.
 Why is searching on an index faster than searching without
an index?
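The bucket indirection of a secondary index, and the reason an index lookup beats a full scan, can be sketched as follows. The data, the two-records-per-block layout, and the function names are assumptions; block numbers stand in for disk pointers.

```python
# Data file stored in name order (its primary ordering); two records per block (assumed).
blocks = [
    [("Ashby", 3000), ("Basu", 6500)],
    [("Bristow", 3000), ("Cass", 7000)],
    [("Daniels", 6500), ("Jones", 3000)],
]

def build_secondary_index(blocks, key_of):
    """Secondary index: search key value -> bucket of block numbers holding matching records."""
    index = {}
    for block_no, block in enumerate(blocks):
        for record in block:
            bucket = index.setdefault(key_of(record), [])
            if block_no not in bucket:
                bucket.append(block_no)
    return index

def lookup(blocks, index, key, key_of):
    """Follow the bucket's pointers and read only the blocks it names."""
    results = []
    for block_no in index.get(key, []):
        results.extend(r for r in blocks[block_no] if key_of(r) == key)
    return results

sal_of = lambda record: record[1]
sal_index = build_secondary_index(blocks, sal_of)
found = lookup(blocks, sal_index, 6500, sal_of)
```

For sal = 6500 the lookup reads two of the three blocks, and for a value that is absent it reads none, whereas a search without the index would read every block.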
Dense Index

 There is an index record for every search key value in the data file
 This record contains the search key and also a reference to the first
data record with that search key value.
Sparse Index
 The index record appears only for a few items in the data file.
Each item points to a block as shown.
 To locate a record, we find the index record with the largest
search key value less than or equal to the search key value we
are looking for.
 We start at the record pointed to by that index record and
follow the pointers in the file (that is, proceed
sequentially) until we find the desired record.
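The sparse-index lookup described above can be sketched like this. It assumes one index entry per block holding that block's first key, and uses plain integers as both search keys and records; the data is hypothetical.

```python
import bisect

# Sorted data file split into blocks; values are search keys (hypothetical data).
blocks = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

# Sparse index: one entry per block, holding the first key stored in that block.
sparse_keys = [block[0] for block in blocks]

def sparse_lookup(key):
    """Return (block_no, slot) of key, or None if it is absent."""
    # Largest index entry whose key is <= the key we are looking for.
    pos = bisect.bisect_right(sparse_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    # Proceed sequentially from the start of that block.
    for block_no in range(pos, len(blocks)):
        for slot, value in enumerate(blocks[block_no]):
            if value == key:
                return (block_no, slot)
            if value > key:
                return None  # passed the position where key would be
    return None
```

The binary search runs over three index entries instead of nine records, at the price of a short sequential scan inside the chosen block.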
Dense versus Sparse Index
 Dense indexes require more maintenance than sparse indexes at
write-time. Since every row must have an entry, the database must
maintain the index on inserts, updates, and deletes.
 Having an entry for every row also means that dense indexes will
require more memory.
 The benefit of a dense index is that values can be quickly found with
just a binary search.

 Sparse indexes require less maintenance than dense indexes at
write-time since they only contain a subset of the values. This lighter
maintenance burden means that inserts, updates, and deletes will be
faster.
 Having fewer entries also means that the index will use less memory.
 Finding data is slower since a scan across the page typically follows the
binary search. Sparse indexes are also only an option when working
with ordered data.
INDEX DATA STRUCTURES

 One way to organize data entries is to build a tree-like data
structure that directs a search for data entries.
 Another way to organize data entries is to hash data entries on
the search key.
Tree-Based Indexing
 In tree-based indexing, records are organized using a tree-like data
structure. The data entries are arranged in sorted order by search
key value, and a hierarchical search data structure is maintained
that directs searches to the correct page of data entries.
 The figure shows the employee records organized in a tree-structured
index with search key age. Each node in this figure is a physical
page, and retrieving a node involves a disk I/O.
 The lowest level of the tree, called the leaf level, contains the
data entries; in our example, these are employee records.
 This structure allows us to efficiently locate all data entries with
a particular search key value
 All searches begin at the topmost node, called the root, and the
contents of pages in non-leaf levels direct searches to the correct
leaf page.
 Non-leaf pages contain node pointers separated by search key
values.
 The node pointer to the left of a key value k points to a
subtree that contains only data entries less than k.
 The node pointer to the right of a key value k points to a
subtree that contains only data entries greater than or equal
to k.
Example

1. Find all employees who are 25 years of age.

2. Find all employees having 24 < age < 50.
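Both example queries can be traced on a small tree. The sketch below is a simplified model, not the structure from the figure: the ages are invented, data entries are bare integers, and the `next` link between leaves is an assumption borrowed from B+-trees so that a range scan can move from one leaf page to the next.

```python
import bisect

class Leaf:
    def __init__(self, entries):
        self.entries = entries  # sorted data entries (here, just ages)
        self.next = None        # assumed link to the next leaf, used for range scans

class NonLeaf:
    def __init__(self, keys, children):
        self.keys = keys          # key values separating the node pointers
        self.children = children  # len(children) == len(keys) + 1

def find_leaf(node, key):
    """Descend from the root; return the leaf and the number of pages read."""
    pages_read = 1
    while isinstance(node, NonLeaf):
        # Entries < k lie to the left of k; entries >= k lie to the right of k.
        node = node.children[bisect.bisect_right(node.keys, key)]
        pages_read += 1
    return node, pages_read

def range_search(root, low, high):
    """All data entries e with low <= e <= high, in sorted order."""
    leaf, _ = find_leaf(root, low)
    out = []
    while leaf is not None:
        for e in leaf.entries:
            if e > high:
                return out
            if e >= low:
                out.append(e)
        leaf = leaf.next
    return out

leaves = [Leaf([22, 24]), Leaf([25, 25, 29]), Leaf([33, 40]), Leaf([44, 50, 55])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = NonLeaf([33], [NonLeaf([25], leaves[:2]), NonLeaf([44], leaves[2:])])

age_25 = range_search(root, 25, 25)        # query 1
age_25_to_49 = range_search(root, 25, 49)  # query 2, for integer ages
```

Query 1 reads one page per level to reach the leaf. Query 2 starts at the same leaf and then follows the leaf links until it passes the upper bound.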
 Thus, the number of disk I/Os incurred during a search is equal
to the length of a path from the root to a leaf, plus the number
of leaf pages with qualifying data entries.
 Finding the correct leaf page is faster than binary search of the
pages in a sorted file because each non-leaf node can
accommodate a very large number of node pointers, and the
height of the tree is rarely more than three or four in practice.
 The height of a balanced tree is the length of a path from root
to leaf; in the figure, the height is three. The number of I/Os to
retrieve a desired leaf page is four, including the root and the
leaf page.
 The average number of children for a non-leaf node is called the
fan-out of the tree, denoted F.
 In practice, F is at least 100, which means a tree of height four
contains 100 million leaf pages. Thus, we can search a file with
100 million leaf pages and find the page we want using four
I/Os ($\log_{F} 100{,}000{,}000 = 4$ for $F = 100$).
 In contrast, binary search of the same file would take
$\log_{2} 100{,}000{,}000$ (over 25) I/Os.
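The arithmetic behind this comparison is easy to check. The helper names below are illustrative, and the model is the same simplified one as in the text: every non-leaf node has exactly F children.

```python
import math

def leaf_pages(fan_out, height):
    """A tree whose non-leaf nodes each have fan_out children has fan_out**height leaf pages."""
    return fan_out ** height

def binary_search_ios(num_pages):
    """Worst-case page reads for binary search over a sorted file of num_pages pages."""
    return math.ceil(math.log2(num_pages))

pages = leaf_pages(100, 4)                # 100^4 leaf pages
sorted_file_cost = binary_search_ios(pages)
```

With F = 100 and height four, `pages` is 100,000,000, and binary search over that many pages needs 27 page reads in the worst case, consistent with the "over 25" figure.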
Hash-Based Indexing

 We can organize records using a technique called hashing to
quickly find records that have a given search key value.
 In this approach, the records in a file are grouped in buckets,
where a bucket consists of a primary page and, possibly,
additional pages linked in a chain.
 The bucket to which a record belongs can be determined by
applying a special function, called a hash function, to the
search key.
 Given a bucket number, a hash-based index structure allows us
to retrieve the primary page for the bucket in one or two disk
I/Os.
 Hash indexing is illustrated in Figure 8.2, where the data is
stored in a file that is hashed on age; the data entries in this first
index file are the actual data records.
 Applying the hash function to the age field identifies the page
that the record belongs to.
 The hash function h for this example is quite simple; it converts
the search key value to its binary representation and uses the
two least significant bits as the bucket identifier.
 To search for a record with a given search key value, we apply
the hash function to identify the bucket to which such records
belong and look at all pages in that bucket.
 On inserts, the record is inserted into the appropriate bucket,
with 'overflow' pages allocated as necessary.
 If we do not have the search key value for the record (for
example, the file is hashed on age and we want records with a
given sal value), we have to scan all pages in the file.
 Figure 8.2 also shows an index with search key sal that contains
(sal, rid) pairs as data entries.
 The rid (short for record id) component of a data entry in this
second index is a pointer to a record with search key value sal.
 The file of employee records is hashed on age. The second
index, on sal, also uses hashing to locate data entries, which are
now (sal, rid of employee record) pairs.
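The two structures described above, a file hashed on age and a second hash index on sal holding (sal, rid) pairs, can be sketched together. This is a toy model under stated assumptions: the page capacity of two, the (bucket, page, slot) form of a rid, the class name, and the employee data are all invented, and deletion is not handled.

```python
NUM_BUCKETS = 4
PAGE_CAPACITY = 2  # records per page; an assumed, deliberately tiny value

def h(key):
    """Two least significant bits of the key's binary representation."""
    return key & 0b11

class HashedFile:
    def __init__(self):
        # Each bucket is a chain of pages: the primary page, then any overflow pages.
        self.buckets = [[[]] for _ in range(NUM_BUCKETS)]

    def insert(self, key, record):
        """Insert into the bucket for key; return a rid (bucket, page, slot)."""
        b = h(key)
        chain = self.buckets[b]
        if len(chain[-1]) == PAGE_CAPACITY:
            chain.append([])  # allocate an overflow page
        chain[-1].append((key, record))
        return (b, len(chain) - 1, len(chain[-1]) - 1)

    def search(self, key):
        """Look at all pages of one bucket only."""
        return [rec for page in self.buckets[h(key)] for k, rec in page if k == key]

    def fetch(self, rid):
        b, page, slot = rid
        return self.buckets[b][page][slot][1]

# Employee file hashed on age; a second hashed file indexes sal with (sal, rid) entries.
emp_file = HashedFile()
sal_index = HashedFile()
for name, age, sal in [("Ashby", 25, 3000), ("Basu", 33, 4003),
                       ("Cass", 29, 3000), ("Daniels", 21, 6003)]:
    rid = emp_file.insert(age, (name, age, sal))
    sal_index.insert(sal, rid)

by_age = emp_file.search(29)
by_sal = [emp_file.fetch(rid) for rid in sal_index.search(3000)]
```

All four ages end in the bits 01, so they land in the same bucket and force an overflow page. The sal lookup goes through the second index and then follows each rid into the employee file instead of scanning it.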
Organization of Records in Files
 A relation is a set of records. Given a set of records, the next
question is how to organize them in a file.
 Several of the possible ways of organizing records in files are:
 Heap file organization: Any record can be placed anywhere in
the file where there is space for the record. There is no ordering of
records. Typically, there is either a single file or a set of files for
each relation.
 Sequential file organization: Records are stored in sequential
order, according to the value of a “search key” of each record.
 Multitable clustering file organization: Generally, a
separate file or set of files is used to store the records of each
relation. However, in a multitable clustering file organization, records
of several different relations are stored in the same file, and in fact
in the same block within a file, to reduce the cost of certain join
operations.
 B+-tree file organization: The traditional sequential file
organization does support ordered access even if there are
insert, delete, and update operations, which may change the
ordering of records. However, in the face of a large number
of such operations, efficiency of ordered access suffers.
 We study another way of organizing records, called the
B+-tree file organization. The B+-tree file organization is related to
the B+-tree index structure and can provide efficient
ordered access to records even if there are a large number of
insert, delete, or update operations. Further, it supports very
efficient access to specific records, based on the search key.
 Hashing file organization: A hash function is computed
on some attribute of each record. The result of the hash
function specifies in which block of the file the record should
be placed.
Data-Dictionary Storage
 A relational database system needs to maintain data about the
relations, such as the schema of the relations. In general, such
“data about data” are referred to as metadata.
 Relational schemas and other metadata about relations are
stored in a structure called the data dictionary or system catalog.
Among the types of information that the system must store are
these:
 Names of the relations
 Names of the attributes of each relation
 Domains and lengths of attributes
 Names of views defined on the database, and definitions of those views
 Integrity constraints (e.g., key constraints)
 Further, the database may store statistical and descriptive
data about the relations and attributes, such as the number of
tuples in each relation, or the number of distinct values for
each attribute.
 The data dictionary may also note the storage organization
(heap, sequential, hash, etc.) of relations, and the location
where each relation is stored:
 If relations are stored in operating system files, the dictionary would note
the names of the file (or files) containing each relation.
 If the database stores all relations in a single file, the dictionary may note
the blocks containing records of each relation in a data structure such as a
linked list.
 There is also a need to store information about each index on
each of the relations:
 Name of the index
 Name of the relation being indexed
 Attributes on which the index is defined
 Type of index formed
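One way to picture the data dictionary is as a handful of record types, one per kind of metadata listed above. The sketch below is an assumption-laden illustration, not how any particular DBMS lays out its catalog: the class names, field names, and the sample employee relation are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AttributeInfo:
    name: str
    domain: str   # e.g. "integer", "varchar"
    length: int   # length in bytes (assumed unit)

@dataclass
class IndexInfo:
    name: str
    relation: str     # name of the relation being indexed
    attributes: list  # attributes on which the index is defined
    index_type: str   # e.g. "B+-tree", "hash"

@dataclass
class RelationInfo:
    name: str
    attributes: list
    storage: str         # "heap", "sequential", "hash", ...
    location: str        # file name, or a handle to the relation's blocks
    num_tuples: int = 0  # statistical data
    constraints: list = field(default_factory=list)

@dataclass
class Catalog:
    relations: dict = field(default_factory=dict)
    indexes: dict = field(default_factory=dict)

    def add_relation(self, rel):
        self.relations[rel.name] = rel

    def add_index(self, idx):
        if idx.relation not in self.relations:
            raise KeyError(f"unknown relation: {idx.relation}")
        self.indexes[idx.name] = idx

    def indexes_on(self, relation):
        return [i for i in self.indexes.values() if i.relation == relation]

catalog = Catalog()
catalog.add_relation(RelationInfo(
    name="employee",
    attributes=[AttributeInfo("name", "varchar", 20),
                AttributeInfo("age", "integer", 4),
                AttributeInfo("sal", "integer", 4)],
    storage="hash",
    location="employee.dat",
    constraints=["primary key (name)"]))
catalog.add_index(IndexInfo("emp_sal_idx", "employee", ["sal"], "hash"))
```

In a real system this metadata is itself usually stored in relations, so the catalog can be queried like any other data.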
References

 Chapter 8: Overview of Storage and Indexing
Database Management Systems, Raghu Ramakrishnan, Johannes
Gehrke, Tata McGraw-Hill, 3rd Edition, 2003.
 Chapter 13: Data Storage Structures
Database System Concepts, A. Silberschatz, H. F. Korth,
S. Sudarshan, McGraw-Hill, 6th Edition, 2006.
