File Organization (1)
2
Introduction
Databases are stored physically as files of records, which are
typically stored on magnetic disks
The collection of data that makes up a computerized database must
be stored physically on some computer storage medium
DBMS software can then retrieve, update, and process this data as
needed
Two main categories of storage media
Primary storage
Secondary storage
3
Primary vs Secondary Storage
4
Storage Hierarchy
Figure: the storage hierarchy, from volatile cache and main memory (primary storage) at the top to cheaper, higher-capacity media below; unit price per byte decreases down the hierarchy
5
Storage Hierarchy
At the primary storage level
Cache memory
Cache memory is typically used by the CPU to speed up execution of
program instructions using techniques such as prefetching and pipelining
It stores the program segments that are frequently accessed by the
processor
Main memory
Provides the main work area for the CPU for keeping program
instructions and data
It is less expensive than cache memory and therefore larger in size
The drawback is its volatility and lower speed compared with cache
memory
6
Storage Hierarchy
At the secondary and tertiary storage level
The hierarchy includes magnetic disks, as well as mass storage in the form
of CD-ROM (Compact Disk–Read-Only Memory) and DVD (Digital Video
Disk or Digital Versatile Disk) devices, and finally tapes at the least
expensive end of the hierarchy
The storage capacity is measured in kilobytes (Kbyte or 1000 bytes),
megabytes (MB or 1 million bytes), gigabytes (GB or 1 billion bytes), and
even terabytes (1000 GB)
The word petabyte is now becoming relevant in the context of very large
repositories of data
Magnetic tapes are used for archiving and backup storage of data
7
Storage of Databases
DBMS stores information on (‘hard’) disks
This has major implications for DBMS design!
– READ: transfer data from disk to main memory (RAM)
– WRITE: transfer data from RAM to disk
– Both are high-cost operations, relative to in-memory operations,
so must be planned carefully!
Why not store everything in main memory?
Costs too much
Main memory is volatile. We want data to be saved between runs
Typical storage hierarchy
Main memory (RAM) for currently used data
Disk for the main database (secondary storage)
Tapes for archiving older versions of the data (tertiary storage)
8
Secondary Storage Devices
Magnetic disks are used for storing large amounts of data
The most basic unit of data on the disk is a single bit of information.
To code information, bits are grouped into bytes
Capacity of a disk is the number of bytes it can store, which is
usually very large
9
Disk Storage Devices
A track is divided into smaller blocks or sectors
because it usually contains a large amount of information
Whole blocks are transferred between disk and main memory for
processing
10
Disk Storage Devices
11
Disk Storage Devices
A read-write head moves to the track that contains the block to be
transferred
Disk rotation moves the block under the read-write head for reading or
writing
A physical disk block (hardware) address consists of
a cylinder number (an imaginary collection of tracks of the same radius from all recorded surfaces)
the track number or surface number (within the cylinder)
and the block number (within the track)
Figure
(a) A single-sided disk with read/write hardware
(b) A disk pack with read/write hardware
13
Components of a Disk
The platters spin (say, 90 rps)
Read-write head
Positioned very close to the platter surface (almost touching it)
Reads or writes magnetically encoded information
Only one head reads/writes at any one time
14
Physical Characteristics of Disks
Track
an information storage circle on the surface of a disk.
Over 16,000 tracks per platter
each track can store between 4KB and 50KB of data.
Each track is divided into sectors.
Tracks under heads make a cylinder (imaginary!)
Cylinder
the tracks with the same diameter on all surfaces of a disk pack.
Cylinder i consists of the i-th track of all the platters
Sector
a part of a track with fixed size
separated by fixed-size interblock gaps
Typical sectors per track
200 (on inner tracks) to 400 (on outer tracks)
15
Pages and Blocks
Data files are decomposed into pages (blocks)
a fixed-size piece of contiguous information in the file
sizes range from 512 bytes to several kilobytes
16
Pages and Blocks
Figure: a track divided into sectors (blocks) separated by interblock gaps
Page I/O --- one page I/O is the cost (or time needed) to transfer
one page of data between the memory and the disk.
The cost of a (random) page I/O =
seek time + rotational delay + block transfer time
Seek time
time needed to position the read/write head on the correct track
Rotational delay
time needed for the desired block to rotate under the read/write head
Block transfer time
time needed to transfer the data in the page/block
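As a rough illustration of this formula, the sketch below simply adds up the three components; the 8 ms seek, 4 ms rotational delay, and 0.1 ms transfer time are assumed example values, not figures from the slides.

#include <stdio.h>

int main(void) {
    /* Example parameters (assumed values for illustration only) */
    double seek_ms     = 8.0;   /* average seek time */
    double rotation_ms = 4.0;   /* average rotational delay (about half a revolution) */
    double transfer_ms = 0.1;   /* time to transfer one block */

    double page_io_ms = seek_ms + rotation_ms + transfer_ms;
    printf("One random page I/O takes about %.1f ms\n", page_io_ms);
    return 0;
}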
18
19
Magnetic Tape Storage Devices
Disks are random access secondary storage devices because an
arbitrary disk block may be accessed at random once we specify
its address
Magnetic tapes are sequential access devices; to access the nth
block on tape, first we must scan the preceding n–1 blocks
Data is stored on reels of high-capacity magnetic tape, somewhat
similar to audiotapes or videotapes
A read/write head is used to read or write data on tape.
Data records on tape are also stored in blocks—although the blocks
may be substantially larger than those for disks
Tapes serve a very important function: backing up the database
One reason for backup is to keep copies of disk files in case the
data is lost due to a disk crash
20
Buffering of Blocks
When several blocks need to be transferred from disk to main
memory and all the block addresses are known, several buffers can
be reserved in main memory to speed up the transfer
While one buffer is being read or written, the CPU can process data
in the other buffer because an independent disk I/O processor
(controller) exists that, once started, can proceed to transfer a data
block between memory and disk independent of and in parallel
to CPU processing
21
Buffering of Blocks
22
Placing File Records on Disk
Records
Data is usually stored in the form of records
Each record consists of a collection of related data values or items, where each value corresponds to a particular field of the record
For example, an EMPLOYEE record represents an employee entity, and each field value specifies some attribute of that employee, such as Name, Ssn, or Salary
23
Placing File Records on Disk
Record Types
A collection of field names and their corresponding data types
constitutes a record type or record format definition.
A data type, associated with each field, specifies the types of
values a field can take
For example, an EMPLOYEE record type may be defined, using C programming language notation, as the following structure:
struct employee {
    char name[30];
    char ssn[9];
    int salary;
    int job_code;
    char department[20];
};
24
Placing File Records on Disk
Files, Fixed-Length Records, and Variable-Length Records
File - sequence of records
Fixed-Length Records - If every record in the file has exactly the
same size (in bytes)
Variable-length records - If different records in the file have different
sizes
26
Placing File Records on Disk
Space is wasted when certain records do not have values for all the
physical spaces provided in each record
27
Placing File Records on Disk
Variable-Length Records
For variable-length fields, each record has a value for each field, but
we do not know the exact length of some field values.
To determine the bytes within a particular record that represent each field, we can use special separator characters (such as ?, %, or $), which do not appear in any field value, to terminate variable-length fields
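A minimal sketch of how such separator characters could be used to split one variable-length record back into its fields; the '$' separator, the record contents, and the use of strtok are illustrative assumptions, not the slides' own format.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* One variable-length record with '$' terminating each field (assumed format) */
    char record[] = "Smith, John$123456789$Research$";

    char *field = strtok(record, "$");     /* cut at each separator */
    int i = 1;
    while (field != NULL) {
        printf("field %d: %s\n", i++, field);
        field = strtok(NULL, "$");
    }
    return 0;
}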
28
Placing File Records on Disk
Variable-Length Records
A file of records with optional fields can be formatted in different ways.
If the total number of fields for the record type is large, but the number
of fields that actually appear in a typical record is small, we can
include in each record a sequence of
<field-name, field-value> pairs
rather than just the field values
29
Record Blocking and Spanned versus Unspanned Records
The records of a file must be allocated to disk blocks because a block is the unit of data transfer between disk and memory
When the block size > the record size, each block will contain
numerous records, although some files may have unusually large
records that cannot fit in one block
Example:
Record size R = 100 bytes
Block Size B = 2,000 bytes
Thus the blocking factor bfr = floor(2000/100) = 20
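The same blocking-factor arithmetic in a short C sketch, including the space left unused in each block and the number of blocks needed; the number of records r = 30,000 is an assumed example value added for illustration.

#include <stdio.h>

int main(void) {
    int B = 2000;            /* block size in bytes */
    int R = 100;             /* record size in bytes */
    long r = 30000;          /* number of records (assumed example) */

    int bfr = B / R;                     /* blocking factor: floor(B/R) = 20 */
    int unused = B - bfr * R;            /* bytes wasted per block (0 here) */
    long b = (r + bfr - 1) / bfr;        /* number of blocks: ceil(r/bfr) */

    printf("bfr = %d, unused bytes per block = %d, blocks needed = %ld\n",
           bfr, unused, b);
    return 0;
}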
31
Record Blocking and Spanned versus Unspanned Records
Spanned organization of records
To utilize the space left unused at the end of each block, we can store part of a record on one block and the rest on another.
A pointer at the end of the first block points to the block containing the
remainder of the record in case it is not the next consecutive block on
disk
Spanned - records can span more than one block
Whenever a record is larger than a block - use spanned organization
33
Allocating File Blocks on Disk
Several standard techniques for allocating the blocks of a file on
disk
Contiguous allocation
Linked allocation
Indexed allocation
35
Allocating File Blocks on Disk
Linked Allocation
Disk files can be stored as linked lists, at the expense of the storage space consumed by each link
Linked allocation involves no wasted space, does not require the file size to be known in advance, and allows files to grow dynamically at any time
36
Allocating File Blocks on Disk
Indexed Allocation
One or more index blocks contain pointers to the actual file blocks
Supports direct access to the blocks occupied by the file and therefore
provides fast access to the file blocks
37
Operations on Files
DBMS software programs access records by using the following commands
Open - Prepares the file for reading or writing. Allocates
appropriate buffers (typically at least two) to hold file blocks from
disk, and retrieves the file header. Sets the file pointer to the
beginning of the file
Reset - Sets the file pointer of an open file to the beginning of the
file
Find (or Locate)- Searches for the first record that satisfies a
search condition. Transfers the block containing that record into a
main memory buffer (if it is not already there). The file pointer points
to the record in the buffer and it becomes the current record
38
Operations on Files
Read (or Get) - Copies the current record from the buffer to a program variable in
the user program. This command may also advance the current record pointer to
the next record in the file, which may necessitate reading the next file block from
disk
FindNext - Searches for the next record in the file that satisfies
the search condition. Transfers the block containing that record into a main
memory buffer (if it is not already there)
Delete - Deletes the current record and (eventually) updates the file on disk to
reflect the deletion
Modify - Modifies some field values for the current record and (eventually) updates
the file on disk to reflect the modification
Insert - Inserts a new record in the file by locating the block where the record is to
be inserted, transferring that block into a main memory buffer (if it is not already
there), writing the record into the buffer, and (eventually) writing the buffer to disk
to reflect the insertion
Close - Completes the file access by releasing the buffers and performing any
other needed cleanup operations
39
Operations on Files
The preceding (except for Open and Close) are called record-at-a-
time operations because each operation applies to a single record
40
File Organization vs. Access Method
File organization
Refers to the organization of the data of a file into records,
blocks, and access structures; this includes the way records and
blocks are placed on the storage medium and interlinked
Access method
On the other hand, provides a group of operations that can be
applied to a file
41
Methods for Organizing Records of a File on Disk
Heap file
Sorted file
Hash file
RAID
42
Heap Files
Files of Unordered Records (Heap Files)
Also called a heap or a pile file
New records are inserted at the end of the file
Record insertion is quite efficient
A linear search through the file records is necessary to search for a
record
This requires reading and searching half the file blocks on the average, and
is hence quite expensive
For a file of b blocks, this requires searching (b/2) blocks, on average (see the sketch after this list). If no record or several records satisfy the search condition, the program must read and search all b blocks in the file
Reading the records in order of a particular field requires sorting the file
records
This organization is often used with additional access paths, such as secondary indexes
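A sketch of the linear search described above, simulated with an in-memory array standing in for the file blocks; the record layout, blocking factor, and number of blocks are assumed example values, and the block counter stands in for actual disk transfers.

#include <stdio.h>

struct record { int ssn; };

#define BFR 4                      /* blocking factor (records per block), assumed */
#define NBLOCKS 5                  /* b = number of blocks in the file, assumed */

struct record file[NBLOCKS][BFR];  /* stand-in for the blocks on disk */

int heap_search(int key, int *blocks_read) {
    for (int blk = 0; blk < NBLOCKS; blk++) {
        (*blocks_read)++;                       /* one block transfer from "disk" */
        for (int i = 0; i < BFR; i++)
            if (file[blk][i].ssn == key)
                return blk;                     /* found in this block */
    }
    return -1;                                  /* all b blocks were scanned */
}

int main(void) {
    for (int blk = 0; blk < NBLOCKS; blk++)     /* fill with sample records */
        for (int i = 0; i < BFR; i++)
            file[blk][i].ssn = blk * BFR + i;

    int reads = 0;
    int blk = heap_search(13, &reads);
    printf("record found in block %d after reading %d blocks\n", blk, reads);
    return 0;
}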
43
Heap Files
To delete a record, a program must first find its block, copy the
block into a buffer, delete the record from the buffer, and finally
rewrite the block back to the disk. This leaves unused space in
the disk block.
Deleting a large number of records in this way results in wasted
storage space.
Another technique used for record deletion is to have an extra
byte or bit, called a deletion marker, stored with each record
Spanned or unspanned organization can be used, with either fixed-length or variable-length records
Modifying a variable-length record may require deleting the old record and inserting a modified record, because the modified record may not fit in its old space on disk
44
Heap File Organization
Records are placed in the file in the order in which they
are inserted. Such an organization is called a heap file
Insertion is at the end
takes constant time O(1) (very efficient)
Searching
requires a linear search (expensive)
Deleting
requires a search, then delete
46
Sorted Files
Files of Ordered Records (Sorted Files)
We can physically order the records of a file on disk based on the
values of one of their fields—called the ordering field
This leads to an ordered or sequential file
If the ordering field is also a key field of the file (a field guaranteed to have a unique value in each record), it is called the ordering key for the file
Some advantages of ordered files over unordered files
Reading the records in order of the ordering key values becomes
extremely efficient because no sorting is required
Finding the next record from the current one in order of the ordering key usually requires no additional block accesses, because the next record is in the same block as the current one
Using a search condition based on the value of an ordering key field
results in faster access when the binary search technique is used
47
Sorted Files
Figure
Some blocks of an ordered
(sequential) file of EMPLOYEE
records with Name as the
ordering key field
49
Sorted Files
50
Sorted Files
Inserting and deleting records are expensive operations for an
ordered file because the records must remain physically
ordered
Insert
To insert a record, we must find its correct position in the file,
based on its ordering field value, and then make space in the file
to insert the record in that position.
For a large file this can be very time consuming because, on average, half the records of the file must be moved to make space for the new record, which means that half the file blocks must be read and rewritten
51
Sorted Files
Modification
Modifying a field value of a record depends on two factors: the search
condition to locate the record and the field to be modified
Search Condition
If the search condition involves the ordering key field, binary search can be used to locate the record; otherwise, a linear search is needed
Field to be Modified
Modifying the ordering key field means that the record can change its position in the file. This requires deletion of the old record followed by insertion of the modified record
52
Sequential File Organization
Insertion is expensive
records must be inserted in the correct order
locate the position where the record is to be inserted
if there is free space insert there
if no free space insert the record in an overflow block
In either case, pointer chain must be updated
Insert takes log2(b) block accesses to locate the position, plus the time to reorganize records
b is the number of blocks
Deletion
use pointer chains
Searching
very efficient (binary search)
This requires log2(b) block accesses on average
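A sketch of binary search over the blocks of a sorted file, again simulated with an in-memory array; the key layout, blocking factor, and block count are assumed example values.

#include <stdio.h>

#define BFR 4                      /* records per block (assumed) */
#define NBLOCKS 8                  /* b = number of blocks (assumed) */

int file[NBLOCKS][BFR];            /* sorted file: keys ordered across all blocks */

/* Returns the block holding the key, reading about log2(b) blocks. */
int binary_search_blocks(int key) {
    int lo = 0, hi = NBLOCKS - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;               /* "read" block mid */
        if (key < file[mid][0])
            hi = mid - 1;                      /* key is in an earlier block */
        else if (key > file[mid][BFR - 1])
            lo = mid + 1;                      /* key is in a later block */
        else
            return mid;                        /* key falls inside this block */
    }
    return -1;
}

int main(void) {
    for (int b = 0; b < NBLOCKS; b++)
        for (int i = 0; i < BFR; i++)
            file[b][i] = b * BFR + i;          /* keys 0..31 in sorted order */
    printf("key 21 is in block %d\n", binary_search_blocks(21));
    return 0;
}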
53
Sorted Files
54
Hash Files
Another type of primary file organization is based on hashing -
provides very fast access to records under certain search
conditions
The search condition must be an equality condition on a single field,
called the hash field
In most cases, the hash field is also a key field of the file, in which
case it is called the hash key
Idea - to provide a function h, called a hash function or
randomizing function, which is applied to the hash field value of a
record and yields the address of the disk block in which the record is
stored
A search for the record within the block can be carried out in a main
memory buffer. For most records, we need only a single-block
access to retrieve that record
55
Hash Files
Internal Hashing
For internal files, hashing is typically implemented as a hash
table through the use of an array of records. Suppose that the
array index range is from 0 to M – 1
we have M slots whose addresses correspond to the array
indexes.
Choose a hash function that transforms the hash field value into
an integer between 0 and M − 1.
One common hash function is the h(K) = K mod M function -
which returns the remainder of an integer hash field value K after
division by M; this value is then used for the record address
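A minimal sketch of the h(K) = K mod M hash function; the table size M and the sample key values are assumed examples.

#include <stdio.h>

#define M 7                        /* number of slots (assumed example) */

int h(int K) { return K % M; }     /* remainder after division by M */

int main(void) {
    int keys[] = {2901, 1055, 9634};
    for (int i = 0; i < 3; i++)
        printf("h(%d) = %d\n", keys[i], h(keys[i]));
    return 0;
}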
56
Hash Files
57
Hash Files
Other hashing functions can be used
Folding - applies an arithmetic function (such as addition) or a logical function (such as exclusive OR) to different portions of the hash field value
58
Hash Files
A problem with most hashing functions is that they do not guarantee that distinct values will hash to distinct addresses
Hash collision
Occurs when the hash field value of a record that is being
inserted hashes to an address that already contains a different
record
In this situation, we must insert the new record in some other
position, since its hash address is occupied
The process of finding another position is called collision
resolution
59
Hash Files
Methods for collision resolution
Open addressing
Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found
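A sketch of open addressing with linear probing under the assumptions of a small in-memory table: M slots, a sentinel value marking empty slots, and made-up keys chosen so that they all collide.

#include <stdio.h>

#define M 7
#define EMPTY -1

int table[M];

int h(int K) { return K % M; }

/* Insert K using linear probing; returns the slot used, or -1 if the table is full. */
int insert_open_addressing(int K) {
    int pos = h(K);
    for (int i = 0; i < M; i++) {
        int slot = (pos + i) % M;              /* probe the next slot in order */
        if (table[slot] == EMPTY) {
            table[slot] = K;
            return slot;
        }
    }
    return -1;                                 /* no free slot: table is full */
}

int main(void) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
    int keys[] = {10, 17, 24};                 /* all hash to slot 3, forcing collisions */
    for (int i = 0; i < 3; i++)
        printf("key %d placed in slot %d\n", keys[i], insert_open_addressing(keys[i]));
    return 0;
}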
60
Hash Files
Hash Collision - Open addressing
61
Hash Files
Hashing with Chains
When a collision occurs, elements with the same hash key will be chained together.
A chain is simply a linked list of all the elements with the same hash key.
62
Hash Files
63
Figure: Collision resolution by chaining records
Hash Files
External Hashing for Disk Files
Hashing for disk files is called external hashing. The target address space consists of buckets, each of which holds multiple records; a bucket is typically one disk block or a cluster of contiguous blocks
64
Hash Files
66
Hash Files
67
Hashing Techniques
The hashing scheme is called static hashing if a fixed
number of buckets is allocated
68
Hashing for Dynamic File Organization
Dynamic Files
Files where record insertions and deletions take place frequently
69
Dynamic and Extendible Hashed Files
Dynamic and Extendible Hashing Techniques
Hashing techniques are adapted to allow the dynamic growth and
shrinking of the number of file records
These techniques include the following: dynamic hashing, extendible
hashing, and linear hashing
Both dynamic and extendible hashing use the binary
representation of the hash value h(K) in order to access a
directory
In dynamic hashing, the directory is a binary tree
In extendible hashing, the directory is an array of size 2^d, where d is called the global depth
The value of d can be increased or decreased by one at a time, thus
doubling or halving the number of entries in the directory array
Doubling is needed if a bucket whose local depth is equal to the global depth d overflows
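A small sketch of how the leading d bits of h(K) can be used to index a directory of size 2^d; the hash function, bit width, and keys here are illustrative assumptions rather than a full extendible-hashing implementation.

#include <stdio.h>

#define HASH_BITS 8                /* pretend h(K) produces an 8-bit value */

unsigned h(unsigned K) { return K % 256; }   /* toy hash function (assumed) */

/* Directory index = the d leading bits of the hash value (global depth d). */
unsigned dir_index(unsigned K, int d) {
    return h(K) >> (HASH_BITS - d);
}

int main(void) {
    int d = 3;                                /* global depth: directory has 2^3 = 8 entries */
    unsigned keys[] = {25, 130, 200};
    for (int i = 0; i < 3; i++)
        printf("h(%u) = %u -> directory entry %u\n",
               keys[i], h(keys[i]), dir_index(keys[i], d));
    return 0;
}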
70
Dynamic and Extendible Hashed Files
The directories can be stored on disk, and they expand or shrink
dynamically
Directory entries point to the disk blocks that contain the
stored records
An insertion in a disk block that is full causes the block to split into
two blocks and the records are redistributed among the two blocks
The directory is updated appropriately
Dynamic and extendible hashing do not require an overflow area.
Linear hashing does require an overflow area but does not use a
directory
Blocks are split in linear order as the file expands
71
Insertion in Extendible Hashing Scheme
2-bit sequence for the record to be inserted
72
Insertion in Extendible Hashing Scheme
73
Deletion in Extendible Hashing Scheme
74
Extendible Hashing
75
Dynamic Hashing
A precursor to extendible hashing was dynamic hashing
The storage of records in buckets for dynamic hashing is somewhat
similar to extendible hashing.
The major difference is in the organization of the directory
Dynamic hashing maintains a tree-structured directory with two
types of nodes:
Internal nodes that have two pointers: the left pointer corresponding to a hash bit value of 0 and the right pointer corresponding to a hash bit value of 1
Leaf nodes that hold a pointer to a bucket of records
76
Dynamic Hashing
77
Linear Hashing
Idea - is to allow a hash file to expand and shrink its number of
buckets dynamically without needing a directory
78
Linear Hashing
A key property of the two hash functions h_i and h_{i+1} is that any records that hashed to bucket 0 based on h_i will hash to either bucket 0 or bucket M based on h_{i+1}; this is necessary for linear hashing to work
As further collisions lead to overflow records, additional buckets are split in the linear order 1, 2, 3, .... If enough overflows occur, all the original file buckets 0, 1, ..., M − 1 will have been split, so the file now has 2M instead of M buckets, and all buckets use the hash function h_{i+1}
Hence, the records in overflow are eventually redistributed into regular buckets, using the function h_{i+1} via a delayed split of their buckets
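A sketch of how linear hashing chooses a bucket using h_i(K) = K mod M, h_{i+1}(K) = K mod 2M, and a split pointer n that records how many of the original buckets have already been split; M, n, and the keys are assumed example values.

#include <stdio.h>

#define M 4                        /* initial number of buckets (assumed) */

int h_i(int K)       { return K % M; }        /* h_i(K)     = K mod M  */
int h_i_plus1(int K) { return K % (2 * M); }  /* h_{i+1}(K) = K mod 2M */

/* n = split pointer: buckets 0..n-1 have already been split. */
int bucket_for(int K, int n) {
    int b = h_i(K);
    if (b < n)                     /* this bucket was split: rehash with h_{i+1} */
        b = h_i_plus1(K);          /* yields either b or b + M */
    return b;
}

int main(void) {
    int n = 2;                     /* buckets 0 and 1 already split (assumed) */
    int keys[] = {8, 9, 6, 13};
    for (int i = 0; i < 4; i++)
        printf("key %d -> bucket %d\n", keys[i], bucket_for(keys[i], n));
    return 0;
}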
79
Insertion
80
Linear Hashing
Advantages
Directory is not needed
Simple to implement
http://queper.in/drupal/blogs/dbsys/linear_hashing
81
Parallelizing Disk Access Using RAID Technology
Secondary storage technology must take steps to keep up in
performance and reliability with processor technology
The main goal of RAID is to even out the widely different rates
of performance improvement of disks against those in
memory and microprocessors
82
RAID Technology
A natural solution is a large array of small independent
(inexpensive) disks acting as a single higher-performance
logical disk
A concept called data striping is used, which utilizes
parallelism to improve disk performance
Data striping distributes data transparently over multiple disks
to make them appear as a single large, fast disk
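A short sketch of block-level striping: under the usual round-robin layout, logical block i goes to disk (i mod N) at stripe (i / N). The number of disks and blocks are assumed example values.

#include <stdio.h>

int main(void) {
    int num_disks = 4;                         /* disks in the array (assumed) */
    for (int block = 0; block < 8; block++) {
        int disk   = block % num_disks;        /* round-robin across the disks */
        int stripe = block / num_disks;        /* position of the block on that disk */
        printf("logical block %d -> disk %d, stripe %d\n", block, disk, stripe);
    }
    return 0;
}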
83
RAID Technology
Provides
Increased performance
Fault Tolerance
Redundancy
RAID Levels
Level 0
Level 1
Level 2
Level 3
Level 4
Level 5
Level 6
Level 10 (1+0)
84
RAID Technology
RAID Level 0
Minimum number of drives required - 2
A RAID Level 0 system uses data striping - dividing data
evenly across two or more storage devices
No redundant information is maintained
Purpose - speed up performance as organizing data in such a
way allows faster reading and writing of files
Not fault-tolerant; should not be used for critical data
Simple and easy to implement
85
RAID Technology
Data striping means breaking up contiguous data that would
normally go on a single disk
The data is distributed to many disks, either by byte (a) or by
block (b)
86
RAID Technology
RAID Level 1 – minimum no. of drives required - 2
Disk Mirroring - is fault-tolerant as it duplicates data by
simultaneously writing on two storage devices
Therefore, each disk has an exact copy on another disk
RAID 1 - ensures protection against data loss. If a problem arises
with one disk, the copy provides the data needed
Writing takes more time because every write has to be performed twice, once on each disk
Disadvantages
Uses only half of the storage capacity
More expensive
87
RAID Technology
RAID Level 2
Bit-level striping means that the file is broken into “bit-sized
pieces”.
It uses a Hamming code for error correction
Theoretical performance is very high, but it is too expensive to implement
88
RAID Technology
RAID Level 3
Requires a minimum of 3 drives to implement
Byte-level striping means that the file is broken into "byte-sized pieces".
Written in parallel on two or more drives
An additional drive stores parity information
89
RAID Technology
RAID Level 4
Minimum number of drives required: 3 (2 disks for data and 1 for parity)
Level 4 provides block-level striping (like Level 0) with a parity
disk
If a data disk fails, the parity data is used to create a replacement
disk
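A sketch of how block parity (as in RAID levels 4 and 5) lets a failed disk be rebuilt: the parity byte is the XOR of the corresponding data bytes, so XOR-ing the parity with the surviving disks recovers the missing byte. The byte values are made up for illustration.

#include <stdio.h>

int main(void) {
    unsigned char d0 = 0x3C, d1 = 0xA5, d2 = 0x0F;   /* data bytes on three data disks */
    unsigned char parity = d0 ^ d1 ^ d2;             /* parity byte on the parity disk */

    /* Suppose disk 1 fails: rebuild d1 from the parity and the surviving disks. */
    unsigned char rebuilt = parity ^ d0 ^ d2;
    printf("original d1 = 0x%02X, rebuilt d1 = 0x%02X\n", d1, rebuilt);
    return 0;
}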
90
RAID Technology
RAID Level 5
Most common secure RAID level
Instead of a dedicated parity disk, parity information is spread
across all the drives
91
RAID Technology
RAID Level 6
The parity data are written to two drives
The chances that two drives break down at exactly the same
moment are of course very small
Advantages
Read data transactions are very fast
RAID 6 is more secure than RAID 5
92
RAID Technology
RAID level 10 – combining RAID 1 & RAID 0
Combine the advantages of RAID 0 and RAID 1 in one single system
Provides security by mirroring all data on secondary drives while using
striping across each set of drives to speed up data transfers
Advantage
If something goes wrong with one of the disks, the rebuild time is very fast since
all that is needed is copying all the data from the surviving mirror to a new drive
Disadvantage
Half of the storage capacity goes to mirroring, which is an expensive way to have redundancy.
93