File Organizations
(Chapters 13 and 14 of the textbook)
V Kumar
School of Computing and Engineering
University of Missouri-Kansas City
File Organizations
A file is a collection or set (ordered or unordered) of data elements stored on a storage medium.
A system software module is responsible for managing (reading, processing, deleting, etc.) a file.
Let us see how much work goes into a simple file operation. The system performs the following
steps to write a character to a sequential file.
Example: write (f1, c); where f1 is a text file name and c is a char variable containing the value 'P'.
Record: A set of logically related fields. The cardinality of this set may be fixed or variable,
i.e., a record size may be fixed or variable. A file, therefore, is a collection of
logically related records.
[Figure: a field (e.g., SSN), a record (SSN, Name, Age, Phone #), and a file of such records.]
A fixed record size may be larger than, equal to, or smaller than a disk block. For storage
purposes a unit called the blocking factor (Bfr) is defined, which is the number of records in a block
and is computed as follows:
Blocking factor Bfr (number of records in a disk block) = ⌊B/R⌋, where B is the disk block size in
bytes and R is the file record size in bytes. Thus, unused space in a block = B − (Bfr × R) bytes.
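The blocking-factor arithmetic above can be sketched in a few lines of Python (a minimal sketch; the function names are mine):

```python
# Blocking factor: how many whole fixed-size records fit in one disk block.
def blocking_factor(block_size: int, record_size: int) -> int:
    return block_size // record_size   # floor of B/R

# Unused (wasted) space per block: B - (Bfr * R) bytes.
def unused_space(block_size: int, record_size: int) -> int:
    return block_size - blocking_factor(block_size, record_size) * record_size

print(blocking_factor(2400, 400))  # 6 records per 2400-byte block
print(unused_space(2400, 350))     # 300 bytes wasted with 350-byte records
```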
Disk parameters: Seek time, rotational latency time and block transfer time.
Seek time: The time it takes to move the read/write arm to the correct cylinder. It is the largest
cost component. The average seek time is the time it takes to traverse one third of the cylinders.
Rotational latency time: The time the disk unit takes to position the read/write head at the
beginning of the sector where the file records are stored.
Block transfer time: Time for the read/write head to pass over a disk block. During the block
transfer time, the bytes of data can be transferred between the disk and main memory.
Average rotational latency (r) = 1/2 of one disk revolution. Let rpm = number of rotations per minute. Then:
Time for 1 revolution = 1/rpm min = (60 × 1000)/rpm ms. Time for half a revolution = (0.5 × 60 × 1000)/rpm ms.
A common speed for disk rotation is 3600 rpm, so the average latency time = (0.5 × 60 × 1000)/3600 ms = 8.3 ms.
Let data transfer speed = t bytes/ms and block size = B, then the Block transfer time (btt) = B/t
ms.
In formatted data on the disk there is a gap between two consecutive blocks. Some control
information (information about the following block) is stored in this gap. When reading a set of
blocks these gaps are also read, so we need to account for the time to read them in the
computation of block transfer time.
Suppose t' = data transfer speed for formatted data (block and the gap). So the effective
block transfer time Ebt = B/t'. We use IBM 3380 disk in our example to compute various
parameter values. The total capacity of the disk is over 2 billion bytes. Table 1 lists the values
of some useful parameters of this disk.
Table 1
Parameter   Definition                               Value
B           Block size                               2400 bytes
t           Data transfer speed                      3000 bytes/ms
btt         Block transfer time (B/t)                0.8 ms
t'          Formatted-data (interblock gap) speed    2857 bytes/ms
ebt         Effective block transfer time (B/t')     0.84 ms
N           No. of cylinders per spindle             885
r           Average rotational latency               8.3 ms
s           Average seek time                        16 ms
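The timing values in Table 1 can be rederived from the formulas above (a sketch; the variable names are mine):

```python
# Rederive Table 1's timing values (all times in ms).
rpm = 3600        # rotational speed
B = 2400          # block size in bytes
t = 3000          # raw data transfer speed, bytes/ms
t_prime = 2857    # transfer speed over formatted data (block + gap), bytes/ms

r = 0.5 * 60 * 1000 / rpm   # average rotational latency: half a revolution
btt = B / t                 # block transfer time
ebt = B / t_prime           # effective block transfer time

print(round(r, 1), round(btt, 2), round(ebt, 2))  # 8.3 0.8 0.84
```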
Consider reading 10 consecutive blocks. The total time is the sum of the average seek time, the
average rotational latency time, and 10 effective block transfer times. The effective block
transfer time must be used, since the head must also move over the interblock gaps. So reading
time = s + r + 10 × ebt = 16 + 8.3 + 8.4 = 32.7 ms. Note that the seek time and the rotational
latency time are significant with respect to the transfer time.
For reading 10 random blocks, an individual seek and rotational latency are required for each
block. So reading time = 10 × (s + r + ebt) = 10 × (16 + 8.3 + 0.84) = 10 × 25.14 = 251.4 ms. The
random time for reading 10 blocks is higher than the sequential time by a factor of about 8.
Likewise, the time for reading 100 sequential blocks = s + r + 100 × ebt = 108.3 ms ≈ 0.108 sec,
while the time for reading 100 random blocks = 100 × 25.14 ms ≈ 2.514 sec.
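The sequential-versus-random comparison generalizes to any block count; a small sketch using the IBM 3380 values from Table 1 (the helper names are mine):

```python
s, r, ebt = 16.0, 8.3, 0.84   # avg seek, avg latency, effective transfer (ms)

def sequential_read_ms(n_blocks: int) -> float:
    # One seek and one latency, then n contiguous effective block transfers.
    return s + r + n_blocks * ebt

def random_read_ms(n_blocks: int) -> float:
    # Every block pays its own seek, latency, and transfer.
    return n_blocks * (s + r + ebt)

print(round(sequential_read_ms(10), 1))   # 32.7 ms
print(round(random_read_ms(10), 1))       # 251.4 ms, about 8x slower
print(round(sequential_read_ms(100), 1))  # 108.3 ms
print(round(random_read_ms(100), 1))      # 2514.0 ms
```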
File Organizations
One of the main objectives of file organization is to speed up the file retrieval time, that is, to
reduce the I/O time. This can be done by improving disk access speed but that has a limit
because of the hardware constraints. The other way is to store file records in such a way that
records can be accessed concurrently or in parallel. A hardware solution to this is using RAID
structure. A RAID is an array of several disks which can be accessed in parallel. A software
solution is through some file organization. We discuss a number of file organizations. Note that
no single file organization is the most efficient for all types of data access. For this reason a
database management system uses a number of different file organizations for storing the same
set of data.
There are basically three categories of file organizations: (a) sequential organization, (b)
organization using an index, and (c) random organization. We begin with the sequential category;
files in this category are called sequential files.
We proceed as follows: (a) understand the file organization, (b) understand how a set of
operations is performed, and (c) analyze the time required to perform each operation. This
will help us identify which file is suitable for what kind of data processing.
Sequential file
In this organization records are written consecutively when the file is created. Records in a
sequential file can be stored in two ways. Each way identifies a file organization.
• Pile file: Records are placed one after another as they arrive (no sorting of any kind).
• Sorted file: Records are placed in ascending or descending values of the primary key.
Pile file: The total time to fetch (read) a record from a pile file requires seek time (s), rotational
latency time (r), Block transfer time (btt), and number of blocks in the file (B). We also need a
key field of the record (K). In pile file we do not know the address of a record, so we search
sequentially. The possibilities are (a) the desired record may be in the first block or (b) it may be
in some middle block or (c) it may be in the last block.
So the average number of blocks read = (1/B) × Σ(i = 1 to B) i = (1/B) × (B(B + 1)/2) ≈ B/2.
Thus the time to find (read) a record in a pile file is approximately TF = (B/2) × ebt. The
fetching (searching) process is shown in the flow diagram (Figure 1) and the example
illustrates the time involved.
File Reorganization: In file reorganization all records that are marked as deleted are
removed, and all inserted records are moved to their correct (sorted) place. The file
reorganization steps are:
• read the entire file (all blocks) in RAM.
• remove all the deleted records.
• write all the modified blocks at a different place on the disk.
Time to read the entire old file = B (number of blocks in the file) × ebt.
Time to write out the new file of n records = (n/Bfr) × ebt.
Total reorganization time TR = (B + n/Bfr) × ebt.
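A quick sketch of the reorganization-time formula (the names are mine; ebt defaults to the IBM 3380 value):

```python
import math

def reorganization_time_ms(old_blocks: int, live_records: int,
                           bfr: int, ebt: float = 0.84) -> float:
    """TR = (B + n/Bfr) * ebt: read every old block, then write back
    only the blocks needed for the n surviving records."""
    new_blocks = math.ceil(live_records / bfr)
    return (old_blocks + new_blocks) * ebt

# A 16,667-block file with 90,000 surviving records, 6 records per block:
print(round(reorganization_time_ms(16667, 90000, 6)))  # 26600 ms, about 27 s
```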
Example
Environment: Hospital
Record size: 400 bytes.
File size: 100,000 records (40 megabytes).
Storage disk: IBM 3380.
Block size: 2400 bytes.
Number of blocks in the file (B): 40 MB/2400 = 16,667.
The average time to fetch a record is TF = (B/2) × ebt = 8333 × 0.84 ms ≈ 7 secs.
This time is tolerable for infrequent reads. However, if the frequency of reads is large, then this
time is not acceptable. Suppose we want to look up 10^4 names of patients. The total search
time is 10^4 × 7000 ms ≈ 19.4 hrs, and the time to read through the entire file with independent
fetches is TX = 100,000 × 7000 ms > 8 days.
Missouri has about 5 million personal income tax records of 400 bytes each. This will fit on
one IBM 3380 drive. The time to read this file with independent searches is:
TX = (5,000,000 × 7000) / (1000 × 60 × 60 × 24 × 365) ≈ 1.1 years.
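The hospital-file numbers above can be reproduced directly (a sketch; the variable names are mine):

```python
import math

record_size = 400          # bytes
n_records = 100_000
block_size = 2400
ebt = 0.84                 # ms, effective block transfer time

bfr = block_size // record_size        # 6 records per block
n_blocks = math.ceil(n_records / bfr)  # 16,667 blocks (~40 MB / 2400)

tf_ms = (n_blocks / 2) * ebt           # average fetch reads half the blocks
print(round(tf_ms / 1000))             # ~7 seconds per record fetch

hours_for_10k = 10_000 * tf_ms / (1000 * 3600)
print(round(hours_for_10k, 1))         # ~19.4 hours for 10^4 independent fetches

days_for_all = n_records * tf_ms / (1000 * 3600 * 24)
print(round(days_for_all, 1))          # ~8.1 days to fetch every record independently
```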
Retrieval: Records are retrieved in a consecutive order. The order of record storage determines
order of retrieval. During retrieval several required operations (partial result output etc.) can be
performed simultaneously.
Insert, delete and modify (Update): Sequential files are usually not updated in place. Each
operation regenerates a new version of the file. The new version contains the up-to-date
information and the old version is usually saved for recovery. The new version becomes the old
version and the next update to the file uses this old version to generate the next new version. The
intensity or frequency of use of a sequential file is characterized by a parameter called the “hit
ratio”, which is defined as follows:
Hit ratio = (number of records accessed in responding to a query) / (total number of records in the file).
Desirable: a high hit ratio value. This means a large fraction of the records is accessed in
responding to a query. Interactive transactions have a very low hit ratio.
Advantages of sequential file
• Good for batch transactions.
• Simple to implement.
Indexed Sequential File (ISAM)
Example file (four key/record pairs per track):
Track 1: 10 rec10   14 rec14   18 rec18   20 rec20
Track 2: 26 rec26   34 rec34   41 rec41   60 rec60
Track 3: 72 rec72   73 rec73   74 rec74   75 rec75
Track 4: 77 rec77   78 rec78   79 rec79   82 rec82
Track 5: 89 rec89   91 rec91   93 rec93   98 rec98
Track Index
Track No. Highest key on the track
1 20
2 60
3 75
4 82
5 98
Operation on the file: Read: record 79.
Sequential access: Search sequentially from track 1, record 10. Very slow.
Semi-random access: (a) Search the track index; (b) find the first track whose highest key is
greater than or equal to 79. This takes us to track 4. Search track 4 sequentially to find record 79.
Problem: For a large file track index would be large. A large index is not convenient to manage.
Solution: Create Cylinder index to manage track index.
Cylinder index
Cylinder Highest key
1 98
2 184
3 278
Track index for cylinder 1    Track index for cylinder 2    Track index for cylinder 3
Track  Highest key            Track  Highest key            Track  Highest key
1      20                     1      107                    1      201
2      60                     2      122                    2      210
3      75                     3      148                    3      223
4      82                     4      163                    4      259
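The two-level lookup (cylinder index in memory, then a track index) can be sketched with the cylinder-1 data above (the function and variable names are mine; only cylinder 1's full five-track index is reproduced here):

```python
import bisect

cylinder_index = [98, 184, 278]           # highest key on cylinders 1..3
track_index_cyl1 = [20, 60, 75, 82, 98]   # highest key on tracks 1..5 of cylinder 1

def locate(key: int):
    """Return the (cylinder, track) whose key range must contain `key`."""
    # First cylinder whose highest key is >= key:
    cyl = bisect.bisect_left(cylinder_index, key) + 1
    if cyl != 1:
        raise NotImplementedError("only cylinder 1's track index is shown here")
    track = bisect.bisect_left(track_index_cyl1, key) + 1
    return cyl, track

print(locate(79))  # (1, 4): search track 4 of cylinder 1 sequentially
```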
File Processing
Deletion: Simple. Find the desired record. Put a special marker into the record to indicate that it
has been deleted.
Insertion: Cumbersome.
Example: Insert record 55.
Destination: Track 2. Track 2 is full. Move record 60 to the overflow area. Modify the track
index.
Prime area                   Track index (track, highest key in prime area,
                             highest key in overflow, overflow pointer)
Track 1: 10 14 18 20         1  20  20  Null
Track 2: 26 34 41 55         2  55  60  Address of rec 60
Track 3: 72 73 74 75         3  75  75  Null
Track 4: 77 78 79 82         4  82  82  Null
Track 5: 89 91 93 98
Overflow area: 60
Now add record 42: record 55 is moved to the overflow area and the track index is modified
again, giving the prime area and track index below. Other records can be inserted similarly.
When the cylinder's overflow area is full, records are inserted in an independent overflow
area elsewhere on the disk; however, this seldom happens.
Prime area                   Track index
Track 1: 10 14 18 20         1  20  20  Null
Track 2: 26 34 41 42         2  42  60  Address of rec 55
Track 3: 72 73 74 75         3  75  75  Null
Track 4: 77 78 79 82         4  82  82  Null
Track 5: 89 91 93 98
Overflow area: 55 → 60
Implementation: The cylinder index is usually kept in the memory. The cylinder index gives the
address of the correct track index, which is located on the disk. The track index is fetched and the
address of the record is found.
As new records are added, an ISAM file degrades in performance: it becomes slow and often
has to be reorganized at high cost. This is why ISAM is not the indexing method of choice for
new systems. Suppose that all overflow records are on the same cylinder. Then to look up a
record, we consult the cylinder and track indexes and search the prime area. If the record is
not there, we must follow the chain of pointers through the records in the overflow area of the
cylinder. If this overflow area does not contain the record, then a second overflow area on
another cylinder (not necessarily an adjacent one) must be searched. So a search for a record
may require several seeks. Similarly, reading the file in order after many insertions becomes
quite complicated: each linked list must be followed, record by record, to read sequentially,
which can require a large number of seeks. ISAM is not used any more; it has been replaced by
the B+-tree structure.
In an index set (block) the index records are maintained physically in ascending sequence by
primary key value. The index records in the index set are nondense, meaning that there is only
one index record for each lower-level index block.
Index Sequence Set: Nondense. Each record in this set points to one Control Interval.
Control Interval: Set of data records, which are physically stored in ascending key
values.
Control Area: An ordered set of control intervals and Free control intervals for a file.
Distributed free space: Set of free control intervals in a control area.
The sequence set contains index records that point to every control interval allocated in the
file. The following diagram illustrates Index sequence set, Control intervals, Distributed free
space and Control area. The size of control area and the number of control intervals may be
predefined. At the end of file initialization the unfilled control intervals are set aside as
distributed free space.
[Figure: inserting C into a control interval with room. Before: A D F. After: A C D F.]
Not enough room in the target control interval: The available room in the control area is
insufficient to accommodate the new record. In order to insert C we must get a new control
interval. If there is a free control interval in the control area then the records with higher key
values from the control interval are copied into the new control interval. If there is no free
control interval in a control area then a new control area is allocated and initialized. The
initialization process copies all the control intervals with records having value smaller than the
new record. The new record is then copied into the proper place and the rest of the records are
then copied. The old control area is discarded, but this seldom happens.
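A simplified, list-based sketch of a control-interval split (not VSAM's actual on-disk format; the function names are mine, and the split point is chosen to match the A B D F example below):

```python
def insert_with_split(intervals, free_intervals, record, capacity):
    """Insert `record` into the sorted control interval that should hold it;
    split into a free control interval when the target is full."""
    # Target: first interval whose highest key >= record, else the last one.
    target = next((i for i, ci in enumerate(intervals)
                   if ci and record <= ci[-1]), len(intervals) - 1)
    ci = intervals[target]
    ci.append(record)
    ci.sort()
    if len(ci) <= capacity:
        return
    if not free_intervals:
        raise RuntimeError("control area full: a new control area is needed")
    # Move the records with higher key values into a free control interval.
    new_ci = free_intervals.pop()
    split_at = (len(ci) + 1) // 2
    new_ci.extend(ci[split_at:])
    del ci[split_at:]
    intervals.insert(target + 1, new_ci)

control_area = [list("ABDF")]   # one full control interval (capacity 4)
free_space = [[]]               # one distributed-free-space interval
insert_with_split(control_area, free_space, "C", capacity=4)
print(control_area)             # [['A', 'B', 'C'], ['D', 'F']]
```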
[Figure: control-interval split. Before: the index sequence set points to one control interval
A B D F. After inserting C: the modified index sequence set points to interval A B C and
interval D F.]
The control area is completely full: In this situation an entire new control area is created,
added to the file, and then populated.
File Size and File creation: User may specify the number and size of control intervals and
control areas. If not specified, VSAM will use some default value. The size of control interval
dictates the size of the unit of I/O data transfer. It is important for performance reasons to make
the control interval a multiple of the actual physical blocks on the disk itself.
File name: Given by the user. This name is stored in the master catalogue.
Freespace: The user has the capability of allocating distributed free space throughout the file.
This space will not be used when the file is populated; however, it will be available for control
interval splitting.
Record size: Greater flexibility. The user must specify not only the maximum record length but
also the average record size. VSAM can use the average size to perform calculations as to how
large, for example, a control interval must be.
File population: One of the main ways is to sort all data records by their key field and present
this to the system. This is called mass insert and can optimize the process of how to move
records within control intervals.
The following diagram illustrates this process:
Empty file:         (no records)
Original file:      A D K Z
Add record L:       A D K L Z
After mass insert:  A D K L M N O P
Direct File
Indexing provides the most powerful and flexible tools for organizing files. However, (a)
indexes take up space and time to keep them organized and (b) the amount of I/O increases with
[Figure: collision repair by progressive overflow. H(K) gives home address 5 (full, holding N);
successive tries at 6 (R), 7 (J), and 8 (M) are also full, so K is placed on the fourth try.
H(T) gives a home address near the end of the file, and the probe wraps around past the end of
the file.]
Problems
• What happens if there is a search for a record, but the record was never placed in the file?
• Access record Green: (1) load the bucket into memory; (2) search the bucket serially.
Solution: Mark deleted records with a tombstone. A search for a record will not stop at a
tombstone. This solution makes insertion difficult.
The program gets the home address (5) and probes for an empty slot. During the search it
encounters a tombstone and the record Smith is inserted there; but Smith may already exist
further along the probe chain. Solution: the program must first search the entire file and only
then go back to the first tombstone and insert the record.
Another problem with tombstones is that one should not be inserted when the last record of a
probe chain, or any record that has no successor, is deleted.
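The tombstone rules above can be sketched as a small progressive-overflow file (the class and method names are mine; keys are hashed with Python's built-in hash, and probes wrap past the end of the file):

```python
TOMBSTONE = object()   # marker left in a slot by a deletion

class HashFile:
    """Progressive-overflow (linear-probe) file with tombstones."""
    def __init__(self, n_slots: int):
        self.slots = [None] * n_slots

    def _home(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key):
        # Probe the whole chain first: a tombstone may be reused, but only
        # after confirming the key is not already stored further along.
        first_grave = None
        i = self._home(key)
        for _ in range(len(self.slots)):
            s = self.slots[i]
            if s is None:                     # a never-used slot ends the chain
                self.slots[i if first_grave is None else first_grave] = key
                return
            if s is TOMBSTONE:
                if first_grave is None:
                    first_grave = i
            elif s == key:                    # already present
                return
            i = (i + 1) % len(self.slots)     # wrap around past the end
        if first_grave is None:
            raise RuntimeError("file full")
        self.slots[first_grave] = key

    def search(self, key):
        i = self._home(key)
        for _ in range(len(self.slots)):
            s = self.slots[i]
            if s is None:
                return None                   # search stops at an empty slot...
            if s is not TOMBSTONE and s == key:
                return i
            i = (i + 1) % len(self.slots)     # ...but not at a tombstone
        return None

    def delete(self, key):
        i = self.search(key)
        if i is not None:
            self.slots[i] = TOMBSTONE
```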
The following diagram shows the state of the file after the distribution of the four old records
and the insertion of a new record with key 1208.
Buckets after rehashing (address = key mod 2N = key mod 200):
Bucket 8:   208, 408, 1208
Bucket 108: 508, 17308
Suppose the record 1608 is now inserted into bucket 8 and then record 5808 arrives. We need to
split bucket 8 a second time, and on this occasion we use: address = key mod 4N (key mod 400).
Bucket 208 is allocated to the file. The file after the second split:
Bucket 8:   408, 1208, 1608
Bucket 108: 508, 17308
Bucket 208: 208, 5808
In general, if we are splitting a bucket that has been involved as the destination of s previous
splits, we use the function: address = key mod (2^(s+1) × N).
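With N = 100 initial buckets (matching the mod 200 / mod 400 example above), the split-aware address function can be sketched as follows (the function name is mine):

```python
def split_address(key: int, s: int, N: int = 100) -> int:
    """Address of `key` in a bucket that has been the destination
    of s previous splits: key mod (2^(s+1) * N)."""
    return key % (2 ** (s + 1) * N)

print(split_address(17308, 0))  # 108: first split of bucket 8, key mod 200
print(split_address(5808, 1))   # 208: second split, key mod 400
print(split_address(1208, 1))   # 8:   1208 stays in bucket 8 after the second split
```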
Disadvantages
• Waste of space (unused buckets after each split).
• If two buckets n and n+j are in use then all buckets between n and n+j must be available
to the algorithm for inserting records.
• The history of insertion must be saved to access a record. This is because many related
hashing functions are used in the search operation.
Dynamic Hashing
In this scheme there is no relationship between the buckets: the position of one bucket is not
related to the position of any other bucket, which was the case in virtual hashing. Initially the
file is organized into N buckets. These buckets reside on the disk, but each of them is pointed to
by a cell in memory, so the bucket address is not as important as in virtual hashing. Suppose N = 3:
[Figure: three level-0 index cells in memory, each pointing to one of the buckets Buc. 1,
Buc. 2, Buc. 3 on secondary storage.]
A hash function transforms a key into an address of a level 0 index element (1 through 3 in this
case). The appropriate bucket is accessed by the pointer.
Insertion: The hashing function gives the address of an index element and the record is inserted
in the bucket pointed to by that node. If the target bucket (of capacity C) is full, a new bucket
is allocated to the file and the C + 1 records are redistributed between the two buckets. These
two buckets are then pointed to by the children of the original index node, as shown in the
following diagram:
[Figure: hash file after a bucket split. Level-0 index nodes 1, 2, 3 in RAM; the split node now
has two level-1 children, each pointing to its own bucket on disk.]
Two questions
• When redistributing records between two buckets (bucket split), how do we decide which
record goes to which bucket?
• A record search begins from the Index level 0. When searching a record, how do we find
which of the two buckets pointed to by the index block contains the record?
Solution: To solve the above problems, a second function is used. This function, given a key
(usually the primary key), returns a binary string (1s and 0s) of a predefined fixed length, as
shown below. This string is referred to here as B(Key).
Key    H(Key)   B(Key)
157    2        10100...
95     1        00011...
88     1        01100...
205    2        10010...
13     1        10111...
125    1        10001...
6      1        01000...
301    1        00110...
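Using the B(Key) values from the table (truncated to five bits here), the routing rule can be checked in a few lines (the function name is mine):

```python
# B(Key) values copied from the table above, truncated to 5 bits.
B = {157: "10100", 95: "00011", 88: "01100", 205: "10010",
     13: "10111", 125: "10001", 6: "01000", 301: "00110"}

def destination(key: int, level: int) -> str:
    """When a bucket pointed to from index level `level` splits, the
    (level+1)-st bit of B(key) picks the bucket: 0 = left (old), 1 = right (new)."""
    return "left" if B[key][level] == "0" else "right"

# Level-0 split triggered by inserting 125 into the bucket holding 95, 88, 13:
for k in (95, 88, 13, 125):
    print(k, destination(k, 0))   # 95 and 88 stay left; 13 and 125 go right
```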
How is B(Key) used?: The bits of B are used to identify the right bucket when inserting and when
searching for the desired record. If the bucket being split is pointed to from a node at level I in
the tree, then we use the value of the (I + 1)st digit of B(Key) to determine whether a record
goes to the left (old) or to the right (new) bucket. The following diagram shows the insertions
of the first five records.
[Figure: file after 5 insertions. Level-0 index nodes 1 and 2 in RAM point to two buckets on
disk. Bucket 1: 95, 88, 13. Bucket 2: 157, 205.]
Insert 125 and 6
The hashing on key 125 gives 1, so it should be inserted in the bucket pointed to by node 1 of
level 0. That bucket is full, so it splits and a new bucket is allocated to the file. The full
bucket was pointed to by a node at level 0, so we use the first digit of the B(Key) string to
decide the destination bucket of each record. Records 95 and 88 go to the left bucket, and
records 13 and 125 go to the right bucket. Record 6 then goes to bucket 1.
[Figure: split after inserting 125; no split when inserting 6. Level-0 index nodes 1 and 2 in
RAM; node 1 now has level-1 children 0 and 1. Bucket 1: 95, 88, 6. Bucket 2: 157, 205.
Bucket 3: 13, 125.]
Insert record 301
H(301) = 1, so its B(Key) takes it to bucket 1. Bucket 1 is full, so a split occurs. The bucket
to be split is pointed to by a leaf at index level 1, so in distributing the old records and
inserting the new record (301), the 2nd bit (index level 1 + 1 = 2) of B(Key) is used. The
structure of the file after inserting 301 is:
[Figure: file after inserting 301. Node 1 of level 0 has level-1 children 0 and 1; the left
level-1 node has level-2 children 0 and 1. Bucket 1: 95, 301. Bucket 2: 157, 205.
Bucket 3: 88, 6. Bucket 4: 13, 125.]
[Figure: disk blocks 10–19, with logical blocks LB1, LB2, LB4 of a file scattered among them
(noncontiguous allocation).]
Noncontiguous allocation: Indexing, based on the index sequential file. The first disk block of
the file is reserved as an index block, and the address of the index block is stored in the file
directory. The following figure illustrates the allocation.