
MODULE-1

INTRODUCTION TO FILE STRUCTURES


File structures:
 Here we study about
o Storage of data
o Data storage organization
o Access to data
o Processing of data
 File structures build on the knowledge of data structures. Data structures deal with organizing data in main memory; file structures deal with organizing data in secondary storage, as files.

The heart of file structure design (why study file structures?)


 Computer data can be stored in three kinds of locations:
o Primary Storage ==> Memory [Computer Memory]
o Secondary Storage [Online Disk/ Tape/ CDRom that can be accessed by the computer]
o Tertiary Storage ==> Archival Data [Offline Disk/Tape/ CDRom not directly available to the
computer.]
 Disks are very slow, even though they are technological marvels: one can pack thousands of megabytes on a disk that fits into a notebook computer.
 Typical time to get information from main memory (RAM) is about 120 nanoseconds (120 x 10^-9 seconds).
 Getting the same information from a typical disk takes about 30 milliseconds (30 x 10^-3 seconds).
 Hence disks are very slow compared to main memory. But disks provide enormous capacity at much less cost than main memory.
 The tension between disks' slow access time and enormous capacity is the driving force behind file structure design.
 Good file structure design gives us access to all of that capacity without making our application spend a lot of time waiting for the disk. A file is a data structure on secondary storage that acts as a container for data.
 A file structure is a combination of representations for data in files and of operations for accessing the data. A file structure allows applications to read, write, and modify data.
 It also supports finding the data that matches some search criteria, or reading through the data in some particular order. An improvement in file structure design can make an application hundreds of times faster.
 Good file structure design
o Should give access to all disk capacity without making our application spend a lot of time waiting for data from the disk
o Can make an application hundreds of times faster
o Is situation-specific: what is best for one situation may be terrible for another

Overview of File Structure Design
I. General Goals
o Get the information we need with one access to the disk.
o If that’s not possible, then get the information with as few accesses as possible.
o Group information so that we are likely to get everything we need with only one trip to the disk.
II. Fixed versus Dynamic Files
o It is relatively easy to come up with file structure designs that meet the general goals when the
files never change.
o When files grow or shrink as information is added and deleted, it is much more difficult.

Short history of FS design

 Sequential File Structures:


o Early work with files presumed that files were stored on tape.
o Access was sequential, and the cost of access grew in direct proportion to the size of the file.
 Indexed sequential file structures:
o As files grew very large, sequential file structures were no longer a good solution.
o Sequential access gave way to direct access, and tape storage devices gave way to disk storage devices.
o Indexes were invented: a list of keys and pointers stored in a small file that could be searched very quickly. An index is great as long as it fits into main memory, but as the file grows, the index eventually runs into the same problem.
 Trees
o Tree structures emerged in the 1960s. Because indexes themselves are sequential, they also became difficult to manage as they grew, so the idea of using tree structures to manage the index emerged.
o In trees, one can easily insert and delete records.
 AVL Trees
o In 1963, researchers came up with balanced, self-adjusting trees, e.g. the AVL tree.
o AVL trees, however, did not apply well to files: they work well for the dozens or hundreds of records held in memory, not for the much larger collections stored in files.
 B-Trees
o It took nearly ten more years of design work before a solution emerged in the form of the B-tree.
o AVL trees grow from the top down as records are added, whereas B-trees grow from the bottom up.
o B-trees provide excellent access performance, but at a cost: a file could no longer be accessed sequentially with efficiency.
 B+ Trees
o The combination of a B-tree and a sequential linked list is called a B+ tree. Over the next ten years, B-trees and B+ trees became the basis for many commercial file systems, since they provide both sequential access and direct access.
 Hashing
o Being able to retrieve information with just three or four accesses is pretty good. An approach called hashing is a good way to do that with files that do not change size greatly over time.

A Conceptual Toolkit: File structure Literacy


 As we move through the history of file structure design, it addresses dynamic files first sequentially, then through tree structures, and finally through direct access.
 We decrease the number of disk accesses by collecting data into buffers, blocks, or buckets.
 We manage the growth of these collections by splitting them, and so on.
 Progress takes the form of finding new ways to combine these basic tools of file design.
 We think of these tools as conceptual tools. They are methods of framing and addressing design problems.
 An object-oriented toolkit: making file structures usable
o Making file structures usable in application development requires turning this conceptual toolkit into an application programming interface: collections of data types and operations that can be used in applications.
o Here we choose an object-oriented approach, where data types and operations are presented in a unified fashion as class definitions.

Fundamental file processing operations:


Physical and Logical files:

Difference between physical and logical files:

a. Physical files
 A physical file is a collection of bytes stored on a disk or tape.
 It is a file residing on a secondary storage device and managed by the operating system.
 A file, when the word is used in this sense, physically exists.
 A disk drive might contain thousands of these physical files.
 The physical file is very much hardware and OS dependent.
 The computer considers all kinds of files as streams of bytes.
 The operating system acts as the manager of these files.

b. Logical files

 A logical file is a channel that connects the program to a physical file. Programs read and write data through the logical file. Before a logical file can be used, it must be associated with a physical file.
 This act of connection is called opening the file. Data in a physical file is persistent; data in a logical file is temporary.
 A logical file is identified by a program variable or constant. The program sends (or receives) bytes to (or from) a file through the logical file.

 The program knows nothing about where the bytes go or where they come from. The OS is responsible for associating a logical file in a program with a physical file on disk or tape.
 Writing to or reading from a file in a program is done through the OS.
Opening Files
 To associate a logical program file with a physical system file, we have two options:
1) Open an existing file
2) Create a new file, deleting any existing contents in the physical file
 Opening a file makes it ready for use by the program. The open function (declared in fcntl.h) is used to open a file.
 Function to open a file:
fd = open(filename, flags, [pmode]);
Argument   Type   Explanation
fd         int    The file descriptor, used to refer to the file within the program. If there is an error opening the file, this value is negative.
filename   char*  A character string containing the physical file name.
flags      int    The flags argument controls the operation of the open function, determining, for example, whether the file is opened for reading, for writing, or for both. The value of flags is set by performing a bitwise OR of the following values:
                  O_RDONLY : Open the file for read only
                  O_WRONLY : Open the file for write only
                  O_RDWR   : Open the file for read or write
                  O_CREAT  : Create the file if it does not exist
                  O_APPEND : Append every write operation to the end of the file
                  O_TRUNC  : Delete any prior file contents
pmode      int    If O_CREAT is specified, pmode is required. This integer argument specifies the protection mode for the file. The pmode is a 3-digit octal number, e.g. 0751, that indicates how the file can be used by the owner (1st digit), by members of the owner's group (2nd digit), and by everyone else (3rd digit). The 1st bit of each octal digit indicates read permission, the 2nd write permission, and the 3rd execute permission.

Ex: fd = open(fname, O_RDWR|O_CREAT,0751)


The above function call opens an existing file for reading and writing or creates
a new one if necessary
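
As a minimal sketch (assuming a POSIX system, where open is declared in fcntl.h as noted above; the file name myfile.dat is hypothetical), the call might appear in a program like this:

#include <fcntl.h>     // open() and the O_* flags
#include <unistd.h>    // close()
#include <cstdio>      // perror()

int main() {
    // Open for reading and writing; create with mode 0751 if it does not exist.
    int fd = open("myfile.dat", O_RDWR | O_CREAT, 0751);
    if (fd < 0) {               // a negative descriptor signals an error
        perror("open");
        return 1;
    }
    // ... read from / write to the file here ...
    close(fd);                  // free the descriptor for reuse
    return 0;
}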

How to do it in C++?
 Standard C++ stream classes are defined in iostream.h and fstream.h.
 We can create a file object in one statement and open it in another using the open( ) function, which is a member of the fstream class. In the open( ) function we include several mode bits to specify certain aspects of the file object.
 Ex: fstream file;
file.open("myfile.txt", ios::out);
Mode bits for open( ) function
Member constant   Stands for   Access
ios::in input File open for reading: the internal stream buffer supports input operations.
ios::out output File open for writing: the internal stream buffer supports output operations.
ios::binary binary Operations are performed in binary mode rather than text.
ios::app     append     All output operations happen at the end of the file, appending to its existing contents.
ios::trunc truncate Any contents that existed in the file before it is open are discarded.
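
A small sketch tying these mode bits together (the file name myfile.txt follows the example above; the program simply appends one line):

#include <fstream>
#include <iostream>

int main() {
    std::fstream file;
    // Open for writing, appending to any existing contents.
    file.open("myfile.txt", std::ios::out | std::ios::app);
    if (!file.is_open()) {
        std::cerr << "could not open myfile.txt\n";
        return 1;
    }
    file << "a new line of data\n";
    file.close();   // release the logical file for reuse
    return 0;
}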

Closing Files
 close(fd); (fd: file descriptor). Closing a file is like hanging up a phone.
 When you hang up the phone, the phone line is available for taking or placing another call.
 When you close a file, the logical file name or file descriptor is available for use with another file.
 Files are usually closed automatically by the OS when programs terminate normally.
 The execution of a close statement within a program is needed only to protect against data loss and to free up logical file names for reuse. In C++: file.close( );
Reading and Writing
 Reading and writing are fundamental to file processing. They are the actions that make file processing an input/output operation.
a. Read function
 It requires 3 pieces of information: Read(Source_file, Destination_address, Size);
Source_file          The Read call must know where it is to read from. We specify the source by giving the logical file name through which the data is received.
Destination_address  Read must know where to place the information it reads from the input file. We specify the destination by giving the 1st address of the memory block where we want to store the data.
Size                 Read must know how much information to bring in from the file. Here the argument is supplied as a byte count.

b. Write function
 The Write function is used to write data from a variable inside the program into the file.
 It is similar to the Read function but moves data in the other direction.
Write(Destination_file, Source_address, Size);
Destination_file  The logical file name that is used for sending the data.
Source_address    Write must know where to find the information it will send. We provide this specification as the 1st address of the memory block where the data is stored.
Size              The number of bytes to be written must be supplied.
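
A sketch of the two calls in POSIX form (assuming the descriptors were obtained with open as shown earlier; copy_bytes is a hypothetical helper name):

#include <unistd.h>   // read(), write()

// Copy up to 'size' bytes from one open file to another.
// Returns the number of bytes copied, or -1 on error.
ssize_t copy_bytes(int src_fd, int dst_fd, char *buf, size_t size) {
    ssize_t n = read(src_fd, buf, size);   // Read(source_file, destination_address, size)
    if (n <= 0) return n;
    return write(dst_fd, buf, n);          // Write(destination_file, source_address, size)
}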
Seeking
 In the previous examples we read the file sequentially, reading one byte after another until we reach the end of the file. Sometimes we want to read or write without going through every byte sequentially.
 Perhaps we know that the next piece of information resides ten thousand bytes away, so we want to jump there. Or perhaps we need to jump to the end of the file so we can add new information there.
 To satisfy these needs we must be able to control the movement of the read/write pointer.
 The action of moving directly to a certain position in a file is often called seeking.
 A seek requires 2 pieces of information: Seek(Source_file, Offset);
 Source_file is the logical file name in which the seek will occur.
 Offset is the number of positions the pointer is to be moved from the start of the file.
 Ex: Seek(data, 373);
This moves the pointer directly from the origin to the 373rd position in a file called data.
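
In POSIX terms the generic Seek(data, 373) corresponds to lseek (a sketch; SEEK_SET measures the offset from the start of the file):

#include <unistd.h>   // lseek()

// Move the read/write pointer of an open file to 'offset' bytes
// from the start of the file; returns the new position, or -1 on error.
off_t seek_from_start(int fd, off_t offset) {
    return lseek(fd, offset, SEEK_SET);   // e.g. seek_from_start(fd, 373)
}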

Unix directory structure


(Figure: a sample Unix directory tree. The root directory / contains the directories bin, usr, usr6, and dev; /usr6/mydir holds the user files addr and df, and /dev holds the device files console, kbd, and TAPE.)

 No matter what computer system you have, even if it is a small PC, there may be thousands of files you have access to. To keep track of them, the computer has some method for organizing its files; in Unix it is called the file system.
 The Unix file system is a tree-structured organization of directories, with the root of the tree signified by the character '/'. All directories, including the root, contain two kinds of files: regular files with programs and data, and directories.

 The diagram above shows a sample Unix directory structure. Since every file in a Unix system is part of the file system that begins with the root, any file can be uniquely identified by giving its absolute pathname.
 Ex: the true, unambiguous name for the file addr is /usr6/mydir/addr.
Physical and Logical files in UNIX
 It is easy to think of a magnetic disk as the source of a file because we are used to the idea of storing such things on disks.
 But in Unix, devices like the keyboard and the console are also files (as shown in the diagram above): /dev/kbd and /dev/console respectively.
 The keyboard produces a sequence of bytes that are sent to the computer when keys are pressed.
 The console accepts a sequence of bytes and displays the symbols on screen. A Unix file is represented logically by an integer: the file descriptor.
 This integer is an index into an array of more complete information about the file.
 A keyboard, a disk file, and a magnetic tape are all represented by integers.
 Once the integer that describes the file is identified, a program can access that file.
 If it knows the logical name of a file, a program can access that file without knowing whether the file comes from a disk, a tape, or a connection to another computer.
 This view of a file in Unix makes it possible to do I/O with very few operations compared to other operating systems.

File related header files


Header files relevant to files are: iostream.h, fstream.h, fcntl.h and file.h

Unix file system commands


Unix provides many commands for manipulating files, such as cat, cp, mv, rm, ls, and chmod.

Secondary Storage and System Software


Disks
 Compared with the time it takes to access an item in memory, disk accesses are always expensive
 But not all disk accesses are equally expensive. This has to do with the way a disk drive works
 Disk drives belong to a class of devices known as Direct Access Storage Devices (DASDs) because they make it possible to access data directly. DASDs are contrasted with serial devices.
 Serial devices use media, such as magnetic tape, that permit only serial access.
 Magnetic disks come in many forms:
o Hard disks – offer high capacity and low cost per bit
o Floppy disks – inexpensive, but slow, and hold relatively little data; good for backup and for transporting small amounts of data
Organization of disks
 The information stored on a disk is stored on the surface of one or more platters.
 The arrangement is such that the information is stored in successive tracks on the surface of the disk.
 Each track is often divided into a number of sectors.
 A sector is the smallest addressable portion of a disk.
 When a read statement calls for a particular byte from a disk file, the OS finds the correct surface, track, and sector, reads the entire sector into a special area in memory called a buffer, and then finds the requested byte within that buffer.
 Disk drives typically have a number of platters.
 The tracks that are directly above and below one another form a cylinder.
 The significance of the cylinder is that all of the information on a single cylinder can be accessed without moving the arm that holds the read/write heads.
 Moving this arm is called seeking; it is the slowest part of reading information from a disk.

Estimating capacities and space needs


 In a typical disk, each platter has 2 surfaces, so the number of tracks per cylinder is twice the number
of platters

 The number of cylinders is the same as the number of tracks on a single surface, and each track has the same capacity.
No. of cylinders = no. of tracks on a single surface
 The amount of data that can be held on a track and the number of tracks on a surface depend on how densely bits can be stored on the disk surface.
 A cylinder contains a group of tracks; a track contains a group of sectors; a sector contains a group of bytes.
Track capacity = no. of sectors per track X bytes per sector
Cylinder capacity = no. of tracks per cylinder X track capacity
Drive capacity = no. of cylinders X cylinder capacity
 Ex: we want to store a file with 50,000 fixed-length data records on a typical 2.1-gigabyte small computer disk with the following characteristics: no. of bytes per sector = 512, no. of sectors per track = 63, no. of tracks per cylinder = 16, no. of cylinders = 4092. How many cylinders are needed?
Soln:
Assuming 256-byte records, there will be 2 records per sector.
So no. of records per track = 2 X no. of sectors per track
                            = 2 X 63 = 126 records
Given that no. of tracks per cylinder = 16,
no. of records per cylinder = 16 X 126 = 2016 records
The file holds 50,000 fixed-length data records.
So, no. of cylinders = 50000 / 2016 = 24.8, i.e. about 25 cylinders
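
The same arithmetic as a C++ sketch (drive parameters from the example above; the 256-byte record size is the assumption noted in the solution):

#include <iostream>

int main() {
    const int bytes_per_sector    = 512;
    const int sectors_per_track   = 63;
    const int tracks_per_cylinder = 16;
    const long records            = 50000;
    const int record_size         = 256;   // assumed: two records per 512-byte sector

    int records_per_sector   = bytes_per_sector / record_size;           // 2
    int records_per_track    = records_per_sector * sectors_per_track;   // 126
    int records_per_cylinder = records_per_track * tracks_per_cylinder;  // 2016

    std::cout << "cylinders needed: "
              << double(records) / records_per_cylinder << '\n';         // ~24.8
    return 0;
}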

Organizing tracks by sector


 There are 2 basic ways to organize data on a disk
o By sector
o By user defined block
a. The physical placement of sectors
 A hard disk consists of tracks, which are further divided into sectors. Data is always stored in sectors.
 Physically adjacent sectors need not be logically consecutive.

 So there are 2 cases to study:
o Data stored in consecutive or adjacent sectors
o Data spread across non-adjacent sectors
 If data is stored in adjacent sectors, it should be easy to access them.

 In the figure above, adjacent sectors are used to store data. Initially the R/W head is positioned at sector 1.
 There is a delay in transferring data from a sector to main memory.
 By the time the data from sector 1 has been transferred, the disk will have moved some distance in its circular motion.
 Therefore, when the R/W head is ready to read again, the sector under it is not sector 2 but some other sector, say sector 4.
 So the disk has to complete almost a full rotation in order to read sector 2. This delay is called rotational delay.
 Together with the transfer delay, the total time taken to read the track will be much longer.
 The figure below uses non-adjacent sectors to store data.

 Here an interleaving factor of 2 is used. While the R/W head is reading sector 1, the disk rotates, and by the time the transfer is complete the R/W head is ready to read the next logical sector, sector 2.
 Thus, when the sector interleaving factor is chosen intelligently, the entire track can be read in a few rotations.
b. Clusters
 Another view of sector organization, also designed to improve performance, is the view maintained by the part of the computer's OS that we call the file manager.
 When a program accesses a file, it is the file manager's job to map the logical parts of the file to their corresponding physical locations. It does this by viewing the file as a series of clusters of sectors.
 A cluster is a fixed number of contiguous sectors. Once a given cluster has been found on a disk, all sectors in that cluster can be accessed without requiring an additional seek.
 To view a file as a series of clusters and still maintain the sectored view, the file manager ties logical sectors to the physical clusters they belong to by using a file allocation table (FAT).
 The FAT contains a list of all the clusters in a file.

c. Extents
 Clusters may or may not be contiguous (share a common boundary) on a disk.
 Cluster sizes may range from 1 to 65,535 blocks.
 Generally, a system manager assigns a small cluster size to a disk with a relatively small number of
blocks. Relatively larger disks are assigned a larger cluster size to minimize the overhead for disk space
allocation.
 An extent is one or more adjacent clusters allocated to a file or to a portion of a file.
 If enough contiguous disk space is available, the entire file is allocated as a single extent.
 Conversely, if there is not enough contiguous disk space, the file is allocated using several extents,
which may be scattered physically on the disk.
 Figure below shows how a single file (File A) may be stored as a single extent or as multiple extents.

d. Fragmentation
 If, for example, the size of a sector is 512 bytes and the size of every record in the file is 300 bytes, there is no convenient fit between records and sectors.
 There are 2 ways to deal with this situation:
o Store only one record per sector
o Allow records to span sectors, so the beginning of a record might be found in one sector and the end of it in another
 The 1st option has the advantage that any record can be retrieved by retrieving just one sector.
 But it has the disadvantage that it might leave an enormous amount of unused space within each sector (here, 212 of every 512 bytes). This loss of space is called internal fragmentation.
 The 2nd option has the advantage that it loses no space to internal fragmentation, but it has the disadvantage that some records can be retrieved only by accessing two sectors.

Organizing tracks by block


 Sometimes disk tracks are divided not into sectors but into integral numbers of user-defined blocks, whose size can vary depending on the requirements of the file designer and the capabilities of the OS.
 Blocks are referred to as physical records: the smallest unit of data that the OS supports on a particular drive.
 The figure below shows the difference between one view of data on a sectored track and one on a blocked track.

 A block organization does not present the sector-spanning and fragmentation problems, because blocks can vary in size to fit the logical organization of the data. A block is usually organized to hold an integral number of logical records.
 The term blocking factor is used to indicate the number of records that are to be stored in each block in a file.
 If we use block organization, no space is lost to internal fragmentation, and there is no need to load 2 blocks to retrieve one record.
 In block addressing schemes, each block of data is usually accompanied by one or more subblocks containing extra information about the data block.

 There is a count subblock that contains the number of bytes in the accompanying data block.
 There is a key subblock that contains the key of the last record in the data block. Using this key subblock, a program can ask the disk drive to search among all the blocks on a track for a block with the desired key.

Nondata overhead
 Both blocks and sectors require that a certain amount of space be taken up on the disk in the form of nondata overhead.
 Some of the overhead consists of information that is stored on the disk during preformatting, which is done before the disk can be used.
 On a sector-addressable disk, preformatting involves
o Storing, at the beginning of each sector, information such as the sector address, track address, and condition (whether the sector is usable or defective)
o Placing gaps and synchronization marks between fields of information to help the read/write mechanism
 Ex: Suppose we have a block-addressable disk drive with 20,000 bytes per track, and the amount of space taken up by subblocks and interblock gaps is equivalent to 300 bytes per block. We want to store a file containing 100-byte records on the disk. How many records can be stored per track if the blocking factor is 10? If it is 60?
o There are 100 bytes per record.
If the blocking factor = 10,
data in each block = 100 X 10 = 1000 bytes
Space taken up by subblocks and interblock gaps = 300 bytes per block
So total space per block = 1000 + 300 = 1300 bytes
So, no. of blocks that can be stored on a 20,000-byte track = 20000 / 1300 = 15 blocks
So, no. of records that can be stored per track if the blocking factor is 10 = 15 X 10 = 150 records
o If the blocking factor = 60,
data in each block = 100 X 60 = 6000 bytes
So total space per block, including overhead = 6000 + 300 = 6300 bytes
So, no. of blocks that can be stored on a 20,000-byte track = 20000 / 6300 = 3 blocks
So, no. of records that can be stored per track if the blocking factor is 60 = 3 X 60 = 180 records
 So a larger blocking factor leads to more efficient use of storage.
 When blocks are larger, fewer blocks are required to hold a file
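
The calculation above as a small C++ sketch (20,000-byte track and 300 bytes of overhead per block, as in the example):

#include <iostream>

// Records per track for a block-addressable drive.
int records_per_track(int track_bytes, int record_size,
                      int blocking_factor, int overhead_per_block) {
    int block_bytes = record_size * blocking_factor + overhead_per_block;
    int whole_blocks = track_bytes / block_bytes;   // partial blocks don't count
    return whole_blocks * blocking_factor;
}

int main() {
    std::cout << records_per_track(20000, 100, 10, 300) << '\n';  // 150
    std::cout << records_per_track(20000, 100, 60, 300) << '\n';  // 180
    return 0;
}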
The cost of a disk access
 Disk access can be divided into 3 physical operations:
o Seek time
o Rotational delay
o Transfer time
a. Seek time
 It is the time required to move the access arm to the correct cylinder
 The amount of time spent seeking during a disk access depends on how far the arm has to move.
 If we are accessing a file sequentially and the file is packed into several consecutive cylinders, seeking needs to be done only after all the tracks on a cylinder have been processed.
 If we are alternately accessing sectors from 2 files that are stored at opposite extremes on a disk (one at the innermost cylinder, the other at the outermost), seeking is very expensive.
 So system designers often try to minimize seeking. We usually try to determine the average seek time required for a particular operation. Most hard disks available today have an average seek time of less than 10 milliseconds.
 High-performance disks have an average seek time of less than 7.5 milliseconds.
b. Rotational delay
 It is the time it takes for the disk to rotate so the sector we want is under the R/W head
 It is also referred to as latency.
 Hard disks usually rotate at about 5000 rpm, i.e. one rotation per 12 milliseconds.
 Floppy disks rotate at about 360 rpm, i.e. one rotation per 166.7 milliseconds.
 In many cases the rotational delay can be much less than the average.
 The rotational delay is inversely proportional to the rotational speed of the drive.
 The average rotational delay is the time it takes the disk to rotate 180° (half a revolution).
c. Transfer time
 Once the data we want is under the R/W head, it can be transferred.
 Transfer time is given by
Transfer time = (no. of bytes transferred / no. of bytes on a track) X rotation time

 If the drive is sectored, the transfer time for one sector depends on the number of sectors on a track.
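
Putting the three components together (a sketch; the drive numbers are illustrative assumptions, not a specific product):

#include <iostream>

int main() {
    double avg_seek_ms    = 8.0;                 // assumed average seek time
    double rotation_ms    = 12.0;                // 5000 rpm -> 12 ms per rotation
    double avg_rot_delay  = rotation_ms / 2.0;   // half a rotation on average
    double bytes_on_track = 63.0 * 512.0;        // sectors/track x bytes/sector
    double bytes_moved    = 512.0;               // transfer one sector

    double transfer_ms = (bytes_moved / bytes_on_track) * rotation_ms;
    std::cout << "expected access time: "
              << avg_seek_ms + avg_rot_delay + transfer_ms << " ms\n";
    return 0;
}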

Disk as bottleneck
 Disk performance is increasing, but disks are still slow!
 Even a high-performance network is faster than a disk.
 A process is called disk bound when the network and the computer's CPU sometimes have to wait for the disk to transmit data. A number of techniques are used to solve this problem.
 One is multiprogramming: the CPU works on other jobs while waiting for the data to arrive.
 Another is striping: splitting parts of a file across several different drives, then letting the separate drives deliver parts of the file to the network simultaneously.
 Striping embodies an important concept called parallelism: whenever there is a bottleneck at some point in the system, duplicate the source of the bottleneck and configure the system so that several of them operate in parallel.

Magnetic tape
 Tape belongs to a class of devices that provide no direct accessing facility but can provide very rapid sequential access to data.
 Tapes are compact, stand up well under different environmental conditions, are easy to store and transport, and are less expensive than disks.
 Years ago tape systems were widely used to store application data
 An application that needed data from a specific tape would issue a request for the tape, which would
be mounted by an operator onto a tape drive.
 The application could then directly read and write on the tape
 The tremendous reduction in the cost of disk systems has changed the way tapes are used.

Organization of data on Nine-Track tapes


 Since tapes are accessed sequentially, there is no need for addresses to identify the locations of data on a tape.
 On a tape, the logical position of a byte within a file corresponds directly to its physical position relative to the start of the file.
 The surface of a typical tape can be seen as a set of parallel tracks, each of which is a sequence of bits. These bits correspond to 1 byte plus a parity bit; one byte is thus a one-bit-wide slice of tape called a frame.

 In odd parity, the parity bit is set to make the number of 1 bits in the frame odd.
 This is done to check the validity of the data.
 Frames are organized into data blocks of variable size separated by interblock gaps, which contain no information and are long enough to permit stopping and starting.
 Tape drives come in many shapes, sizes and speeds
 Performance is measured using 3 quantities:
o Tape density – commonly 800, 1600, or 6250 bits per inch (bpi) per track [more recently, 30,000 bpi]
o Tape speed – commonly 30 to 200 inches per second (ips)
o Size of interblock gap – commonly between 0.3 inch and 0.75 inch
Estimating tape length requirements
(i) Suppose we want to store a backup copy of a large mailing-list file with one million 100-byte records. If we want to store the file on a 6250 bpi tape that has an interblock gap of 0.3 inches, how much tape is needed?
Ans:
There are mainly 2 things that take up space on a tape:
o Interblock gaps
o Data blocks
Let b = physical length of a data block
    g = length of an interblock gap
    n = no. of data blocks
Then the space required for storing the file is
s = n X (b + g)
We have g = 0.3 inch
n = 1 million = 1,000,000 (one record per block)
Bytes per block = 100
Tape density = 6250 bytes per inch
b = block size (bytes per block) / tape density (bytes per inch)
So, b = 100 / 6250 = 0.016 inch
So s = 1000000 X (0.016 + 0.3)
     = 316000 inches
     = 316000 / 12 feet = 26,333 feet of tape are needed to store the records
(ii) If, for the same problem, the blocking factor is 50, show that only one tape reel is required to back up the file.
Ans:
Blocking factor = 50
So no. of blocks n = 1000000 / 50 = 20000
Each block now holds 50 X 100 = 5000 bytes, so b = 5000 / 6250 = 0.8 inch
And s = 20000 X (0.8 + 0.3) = 22000 inches = 1833 feet
Since a standard reel holds 2400 feet of tape, only one reel is required to back up the file.
(iii) Suppose we want to store a backup copy of a large mailing-list file with 350,000 records of 80 bytes each. If we want to store the file on a 6250 bpi tape that has an interblock gap of 0.3 inches and the blocking factor is 50, how much tape is needed?
Ans:
Blocking factor = 50
So no. of blocks n = 350000 / 50 = 7000
Bytes per block = 80 X 50 = 4000
Tape density = 6250 bytes per inch
b = block size (bytes per block) / tape density (bytes per inch)
So, b = 4000 / 6250 = 0.64 inch
So s = n X (b + g)
where n = 7000, b = 0.64, g = 0.3
So, s = 7000 X (0.64 + 0.3)
      = 6580 inches
      = 6580 / 12 feet ≈ 548 feet of tape are needed to store the records
(For comparison, unblocked storage would need 350000 X (80/6250 + 0.3) = 109,480 inches ≈ 9123 feet.)
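
The estimate s = n X (b + g) as a C++ sketch (parameters from the examples above; tape_inches is a hypothetical helper name):

#include <iostream>

// Tape length in inches needed to store 'records' fixed-length records.
double tape_inches(long records, int record_size, int blocking_factor,
                   double density_bpi, double gap_inches) {
    long blocks = records / blocking_factor;
    double block_inches = double(record_size) * blocking_factor / density_bpi;
    return blocks * (block_inches + gap_inches);
}

int main() {
    // One million 100-byte records, unblocked: 316,000 inches.
    std::cout << tape_inches(1000000, 100, 1, 6250, 0.3) << " in\n";
    // The same file with a blocking factor of 50: 22,000 inches.
    std::cout << tape_inches(1000000, 100, 50, 6250, 0.3) << " in\n";
    return 0;
}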
Estimating data transmission time
Nominal data transmission rate = tape density (bpi) X tape speed (ips)
Hence a 6250 bpi, 200 ips tape has a nominal transmission rate of
nominal transmission rate = 6250 X 200 = 1,250,000 bytes/sec
                          = 1250 kilobytes/sec
Once our data gets dispersed by interblock gaps, the effective transmission rate certainly suffers. Suppose, for the previous problem, the blocking factor is 1 (one million 100-byte records and an interblock gap of 0.3 inches); then
Effective recording density = no. of bytes per block / no. of inches required to store a block
                            = 100 bytes / 0.316 inches = 316.4 bpi
If the tape is moving at a rate of 200 ips,
then effective transmission rate = 316.4 X 200
                                 = 63,280 bytes/sec
                                 = 63.3 kilobytes/sec
which is much less than the nominal rate. A larger blocking factor improves the result.
So, some factors that influence the performance are:
 Block size
 Gap size
 Tape speed
 Recording density
 Time it takes to start and stop the tape

Disk versus Tape
Disk
 Excellent for random access and for storage of files for which immediate access is desired
 Can be shared by several processes
Tape
 Ideal for processing data sequentially and for long-term storage of files; dedicated to one process at a time

Introduction to CD-ROM
 CD-ROM – Compact Disc Read-Only Memory
 It can hold a lot of data and can be reproduced cheaply: a single disc can hold more than 600 megabytes of data.
 CD-ROM is read-only (or write-once) in the same sense as a CD audio disc: once it has been recorded, it cannot be changed.
 It is a publishing medium, used for distributing information to many users, rather than a data storage and retrieval medium like magnetic disks.
 It is used for distribution of
o All types of software
o Codes
o Textual data
o Digitized images
o Video information
o Digital audio, etc.
A Short history of CD-ROM
 The research behind CD-ROM began in the late 1960s and early 1970s, when the goal was to store movies on disc.
 The consumer products industry spent a great deal of money developing several competing technologies, and then spent years fighting over which approach should become the standard.
 The surviving format is one called LaserVision. Competitors of this format lost money and important market opportunities, and these hard lessons drove the later development of CD audio and CD-ROM.
 One reason LaserVision technology prevailed is that it supports recording in both Constant Linear Velocity (CLV) and Constant Angular Velocity (CAV) formats, the latter enabling fast seek performance.
 In the early 1980s, a number of firms began looking at the possibility of storing digital, textual information on LaserVision discs. LaserVision stores data in an analog form; it is, after all, storing an analog video signal.
 Philips and Sony began work on a way to store music on optical discs. Rather than storing the music in the kind of analog form used on video discs, they developed a digital data format.
 They had learned hard lessons from the expensive standards battles over video discs.
 This time they worked with the other players in the consumer products industry to develop a licensing system that resulted in the emergence of CD audio as a broadly accepted standard format as soon as the first discs and players were introduced. CD audio appeared in the US in early 1984.
 CD-ROM is a digital data format built on top of the CD audio standards.
 The first commercially available CD-ROM drives appeared in 1985.
 Various large and small firms worked out the main features of a file system standard by the early summer of 1986. That work has become an official international standard for organizing files on CD-ROM.
 The newest technology for CDs is the DVD, which stands for Digital Video Disc.
 The Sony Corporation developed DVD for the video market, especially for the new high-definition TVs, but DVD is also available for storing files.

Physical organization of CD-ROM


 CD-ROM is the child of CD audio.
 Audio discs are designed to play music, not to provide fast, random access to data
Reading pits and lands
 CD-ROMs are stamped from a master disc.
 The master is formed by using the digital data we want to encode to turn a powerful laser on and off very quickly. The master disc, which is made of glass, has a coating that is changed by the laser beam.
 When the coating is developed, the areas hit by the laser beam turn into pits along the track followed by the beam. The smooth, unchanged areas between pits are called lands.

 The pits scatter the light, but the lands reflect most of it back to the pickup.
 This alternating pattern of high- and low-intensity reflected light is the signal used to reconstruct the original digital information.
 The encoding scheme used for this signal is not simply a matter of calling a pit a 1 and a land a 0.
 Instead, the 1s are represented by the transitions from pit to land and back again.
 Every time the light intensity changes, we get a 1. The 0s are represented by the amount of time between transitions: the longer between transitions, the more 0s we have.
 It is not possible to have 2 adjacent 1s; 1s are always separated by 0s.
 In fact, due to the limits of the resolution of the optical pickup, there must be at least two 0s between any pair of 1s. This means that the raw pattern of 1s and 0s has to be translated to get the 8-bit patterns of 1s and 0s that form the bytes of the original data.
 This translation scheme, which is done through a lookup table, turns the original 8 bits of data into 14 expanded bits that can be represented as pits and lands on the disc.

 The reading process reverses this translation.
CLV instead of CAV
 Data on a CD-ROM is stored in a single spiral track that winds for almost 3 miles from the centre to the outer edge of the disc.
 A sector towards the outer edge of the disc takes the same amount of space as a sector towards the centre of the disc.
 This means that we can write all of the sectors at the maximum density permitted by the storage medium.
 Since reading the data requires that it pass under the optical pickup device at a constant rate, the constant data density implies that the disc has to spin more slowly when we are reading towards the outer edge than when we are reading towards the centre.

 This is why the spiral is a Constant Linear Velocity (CLV) format: as we seek from the centre to the edge, we change the rate of rotation of the disc so the linear speed of the spiral past the pickup device stays the same.
 CAV, with its concentric tracks and pie-shaped sectors, writes data less densely in the outer tracks than in the tracks towards the centre.
 We waste storage capacity in the outer tracks but have the advantage of being able to spin the disc at the same speed for all positions of the read head.
 Given the sector arrangement shown in the figure, one rotation reads 8 sectors no matter where we are on the disc. A timing mark placed on the disc makes it easy to find the start of a sector.
 The CLV format is responsible for the poor seeking performance of CD-ROM drives.
 The CAV format provides definite track boundaries and a timing mark to find the start of a sector.
 But the CLV format provides no straightforward way to jump to a specific location.

Addressing
 CD-ROM uses a sector-addressing scheme that is related to the CD-ROM's roots as an audio playback device.
 Each second of playing time on a CD is divided into 75 sectors, each of which holds 2 KB of data.
 According to the original Philips and Sony standards, a CD, whether used for audio or CD-ROM, contains at least one hour of playing time. That means the disc is capable of holding at least 540,000 KB of data (60 minutes X 60 seconds X 75 sectors X 2 KB).
CD-ROM strengths and weaknesses
a. Seek performance
 The chief weakness of CD-ROM is its random-access performance.
 Current magnetic disk technology is such that the average time for a random data access, combining seek time and rotational delay, is about 30 msec. On a CD-ROM this average access takes about 500 msec.
 Our file design strategies must avoid seeks to an even greater extent than on magnetic disks.
b. Data transfer rate
 A CD-ROM drive reads 75 sectors, or 150 KB of data, per second.
 This data transfer rate is part of the fundamental definition of CD-ROM.
 It can't be changed without leaving behind the commercial advantages of the CD audio standard.
 It is a modest transfer rate, about 5 times faster than the transfer rate of a floppy disk.
c. Storage capacity
 A CD-ROM holds more than 600 MB of data.
 Although it is possible to use up this storage area very quickly, particularly if you are storing raster images, 600 MB is big when it comes to text applications.
d. Read-only access
 From a design standpoint, the fact that CD-ROM is a publishing medium, a storage device that cannot be changed after manufacture, provides significant advantages. We never have to worry about updating.
 This not only simplifies some of the file structures but also means that it is worthwhile to optimize our index structures and other aspects of file organization.
e. Asymmetric writing and reading
 For most media, files are written and read using the same computer system.
 Often reading and writing are both interactive and are therefore constrained by the need to provide quick response to the user. CD-ROM is different:
 We create the files to be placed on the disc once; then we distribute the disc, and it is accessed thousands of times.

A Journey of a Byte
 What happens when a program writes a byte to a file on a disk? What happens between the program and the disk? Let us trace the journey of one byte.
 Suppose we want to append a byte representing the character 'P', stored in a character variable ch, to a file named in the variable textfile, stored somewhere on a disk.
 From the programmer's point of view, the entire journey of the byte is represented by the statement write(textfile, ch, 1); but the actual journey is much longer than this.
 The write statement results in a call to the computer's OS, which has the task of seeing that the rest of the journey is completed successfully.
 The write statement tells the operating system to send one character to disk and gives the OS the location of the character.
 The OS takes over the job of writing and then returns control to the calling program.
 Once the OS has taken over the job, the rest of the journey is largely beyond the program's control.
a. File manager
 The OS is not a single program but a collection of programs, each designed to manage a different part of the computer's resources. Among these programs, the one that deals with file-related matters is the file manager. It operates in several layers, as below:
1. The program asks the OS to write the contents of the variable ch to the next available logical position in TEXT.
2. The OS passes the job on to the file manager.
3. The file manager looks up TEXT in a table containing information about the file, such as whether the file is open and available for use, what types of access are allowed, if any, and what physical file the logical name TEXT corresponds to.
4. The file manager searches a file allocation table for the physical location of the sector that is to contain the byte.
5. The file manager makes sure that the last sector in the file has been stored in a system I/O buffer in RAM, then deposits the 'P' into its proper position in the buffer.
6. The file manager gives instructions to the I/O processor about where the byte is stored in RAM and where it needs to be sent on the disk.
7. The I/O processor finds a time when the drive is available to receive the data and puts the data in the proper format for the disk. It may also buffer the data to send it out in chunks of the proper size for the disk.
8. The I/O processor sends the data to the disk controller.
9. The controller instructs the drive to move the R/W head to the proper track, waits for the desired sector to come under the R/W head, then sends the byte to the drive to be deposited, bit by bit, on the surface of the disk.
b. I/O Buffer

 Next, the file manager determines whether the sector that is to contain 'P' is already in memory or needs to be loaded into memory. If the sector needs to be loaded, the file manager must find available system I/O buffer space for it and then read it from the disk.
 Once it has the sector in a buffer in memory, the file manager can deposit the 'P' into its proper position in the buffer. The file manager moves 'P' from the program's data area to a system output buffer, where it may join other bytes headed for the same place on the disk.
 If necessary, the file manager may have to load the corresponding sector from the disk into the system output buffer. The system I/O buffer allows the file manager to read and write data in sector-sized or block-sized units.
c. The byte leaves memory: the I/O processor and disk controller

 So far, the byte has travelled along data paths that are designed to be very fast and are relatively expensive. Now it is time for the byte to travel along a data path that is likely to be narrower than the one in primary memory.
 Because of the bottlenecks created by these differences in speed and data-path width, our byte and its companions might have to wait for an external path to become available.

 The processes of disassembling and assembling groups of bytes for transmission to and from external devices are so specialized that it is unreasonable to ask an expensive general-purpose CPU to spend its valuable time doing I/O when a simpler device could do the job and free the CPU for other work.
 Such a special-purpose device is called an I/O processor.
 An I/O processor may be anything from a simple chip capable of taking a byte and passing it along, to a powerful small computer capable of executing very sophisticated programs and communicating with many devices simultaneously.
 The I/O processor takes its instructions from the OS, but once it begins processing I/O, it runs independently, relieving the OS.
 In a typical computer, the file manager might now tell the I/O processor that there is data in the buffer to be transmitted to the disk, how much data there is, and where it is to go on the disk.
 This information might come in the form of a little program that the OS constructs and the I/O processor executes.
 The job of controlling the operation of the disk is done by a device called the disk controller.
 The I/O processor asks the disk controller if the disk drive is available for writing.
 If there is much I/O processing, there is a good chance that the drive will not be available and our byte will have to wait in its buffer until the drive becomes available.
 Then the disk drive is instructed to move its R/W head to the track and sector on the drive where our byte and its companions are to be stored.
 The R/W head must seek to the proper track and then wait until the disk has spun around so that the desired sector is under the head.
 Once the track and sector are located, the I/O processor can send out the bytes, one at a time, to the drive, where each byte is probably stored in a little 1-byte buffer while it waits to be deposited on the disk.
 Finally, as the disk spins under the R/W head, the 8 bits of our byte are deposited, one at a time, on the surface of the disk.

Buffer management
 A buffer is a part of main memory available for storage of copies of disk blocks.
 Buffering involves working with large chunks of data in memory, so the number of accesses to secondary storage can be reduced. But the use of buffers within programs can also affect performance.
a. Buffer bottlenecks
 The file manager allocates I/O buffers that are big enough to hold incoming data.
 It is common for the file manager to allocate several buffers for performing I/O.
 Consider what happens if a program is performing I/O one character at a time and only one I/O buffer is available.
 When the program asks for its 1st character, the I/O buffer is loaded with the sector containing that character, and the character is transmitted to the program.
 If the program then decides to output a character, the I/O buffer is filled with the sector into which the output character needs to go, destroying its original contents.
 Then, when the next input character is needed, the buffer contents have to be written to disk to make room for the original sector containing the 2nd input character, and so on.
 That is why I/O systems use at least 2 buffers: one for input and one for output.
 A program that reads many sectors from a file might have to spend much of its time waiting for the I/O system to fill its buffer every time a read operation is performed, before it can begin processing.
 When this happens, the program is said to be I/O bound: the CPU spends much of its time just waiting for I/O to be performed. The solution to this problem is to use more than one buffer.
b. Buffering strategies - Multiple buffering
 Suppose that a program is only writing to a disk and that it is I/O bound.
 We want the CPU to be filling a buffer at the same time that I/O is being performed.
 If 2 buffers are used and I/O-CPU overlapping is permitted, the CPU can be filling one buffer while the contents of the other are being transmitted to disk.
 When both tasks are finished, the roles of the buffers can be exchanged.
 This method is called double buffering.

 In general, any number of buffers can be used, and they can be organized in a variety of ways.
 Some file systems use a buffering scheme called buffer pooling:
 When a system buffer is needed, it is taken from a pool of available buffers and used.
 When the system receives a request to read a certain sector or block, it looks to see whether one of its buffers already contains that sector or block.
 If no buffer contains it, the system finds a buffer from the pool that is not currently in use and loads the sector or block into it. Different schemes are used to decide which buffer to take from the pool. One general strategy is to take the buffer that is least recently used (LRU).

 When a buffer is accessed, it is put at the back of the LRU queue, so it retains its data until all the other buffers in the pool have been accessed.
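
A minimal sketch of an LRU buffer pool (the 512-byte block size and the zero-fill standing in for a disk read are assumptions; a real file manager also tracks dirty buffers, pinning, and so on):

#include <list>
#include <unordered_map>
#include <vector>
#include <cstddef>

// A toy LRU buffer pool mapping block numbers to in-memory buffers.
class BufferPool {
    static const std::size_t kBlockSize = 512;    // assumed block size
    std::size_t capacity_;
    std::list<long> lru_;                         // front = most recently used
    std::unordered_map<long,
        std::pair<std::list<long>::iterator, std::vector<char>>> map_;
public:
    explicit BufferPool(std::size_t capacity) : capacity_(capacity) {}

    // Return the buffer holding 'block', loading it on a miss and
    // evicting the least recently used buffer if the pool is full.
    std::vector<char>& get(long block) {
        auto it = map_.find(block);
        if (it != map_.end()) {                   // hit: move to front of queue
            lru_.erase(it->second.first);
            lru_.push_front(block);
            it->second.first = lru_.begin();
            return it->second.second;
        }
        if (map_.size() == capacity_) {           // miss with full pool: evict LRU
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(block);
        auto& entry = map_[block];
        entry.first = lru_.begin();
        entry.second.assign(kBlockSize, 0);       // stand-in for reading the block
        return entry.second;
    }
};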
c. Move mode and locate mode
 Move mode is a way of handling buffered data that involves moving chunks of data from one place in memory to another before they can be accessed; this is time consuming.
 There are 2 ways to avoid move mode:
o If the file manager can perform I/O directly between secondary storage and the program's data area, no extra move is necessary.
o The file manager could use system buffers to handle the I/O and provide the program with the locations of the data, using pointers.
 Both techniques are examples of locate mode.

d. Scatter / Gather I/O

 To read a file with many blocks, where each block consists of a header followed by data, we would like to put the header in one buffer and the data in a different buffer, so the data can be processed as a single entity.

 To do this, we could
o Read the whole block into a single big buffer
o Move the different parts to their own positions
 Sometimes we can avoid this 2-step process using a technique called scatter input: a single read call identifies not one but a collection of buffers into which data from a single block is to be scattered.
 The converse of scatter input is gather output: several buffers can be gathered and written with a single write call. This avoids the need to copy them to a single output buffer.
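
On Unix, these operations correspond to the readv and writev system calls (a sketch; the 16-byte header size is an assumption):

#include <sys/types.h>
#include <sys/uio.h>   // readv(), struct iovec

// Scatter one block into a header buffer and a data buffer
// with a single read call.
ssize_t read_block(int fd, char *header, char *data, size_t data_len) {
    struct iovec iov[2];
    iov[0].iov_base = header;    // first 16 bytes of the block land here
    iov[0].iov_len  = 16;        // assumed header size
    iov[1].iov_base = data;      // the rest of the block lands here
    iov[1].iov_len  = data_len;
    return readv(fd, iov, 2);
}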

I/O in UNIX
a. The kernel
 The process of transmitting data from a program to an external device can be described as proceeding through a series of layers.
 The topmost layer deals with data in logical, structural terms.
 We store in a file a name, a body of text, an image, an array of numbers, or some other logical entity.
 The layers that follow collectively carry out the task of turning the logical object into a collection of bits on a physical device.

 The topmost I/O layer in Unix consists of processes that impose certain logical views on files.
 These processes include shell routines like cat and tail, user programs that operate on files, and library routines like scanf and fread that are called from programs to read strings, numbers, and so on.
 Below this layer is the Unix kernel.
 The components of the kernel are shown in the diagram above.
 It views all I/O as operating on a sequence of bytes.
 Once control passes to the kernel, all assumptions about the logical view of the file are gone.
 This makes all operations below the top layer independent of an application's logical view of a file.
Journey of a byte through the kernel
 When a program executes a system call such as
write(fd, &ch, 1);
the kernel is invoked immediately.
 The routines that let processes communicate directly with the kernel make up the system call interface.
 Here, the system call instructs the kernel to write a character to a file.
 The kernel I/O system begins by connecting the file descriptor in the program to some file or device in the file system.
 It does this by proceeding through a series of 4 tables that enable the kernel to find its way from the process to the places on the disk that hold the file.
 The 4 tables are:
o File descriptor table
o Open file table – with information about open files
o File allocation table – part of a structure called the index node
o Table of index nodes – one entry for each file in use
 These tables are managed by the I/O system and owned by different parts of the system:
o The file descriptor table is owned by the process (your program)
o The open file table and index node table are owned by the kernel
o The index node is part of the file system
 The 4 tables are invoked in turn by the kernel to get the information it needs.
File descriptor table
 This is a simple table that associates each file descriptor used by a process with an entry in the open file table.
 Every process has its own descriptor table, which includes entries for all the files it has opened.

Open file table


 It contains an entry for every open file.
 Every time a file is opened or created, a new entry is added to the open file table.

 These entries are called file structures, and they contain important information about
o How the corresponding file is to be used, such as the read/write mode used when it was opened
o The number of processes currently using it
o The offset within the file to be used for the next read or write
o An array of pointers to generic functions that can be used to operate on the file
 In general, the open file table tells the kernel what it can do with a file that has been opened in a certain way and provides information on how it can operate on the file.
 The kernel still needs more information about the file itself, such as where the file is stored on disk, how big the file is, and who owns it. This information is found in the index node (or inode) table.
Index node (inode) table
 The inode is a more permanent structure than the open file table's file structure.
 An inode exists as long as its corresponding file exists. When a file is opened, a copy of its inode is usually loaded into memory, where it is added to the inode table for rapid access.
 The most important component of the inode is a list, or index, of the disk blocks that make up the file.
 Once the kernel I/O system has the inode information, it knows all it needs to know about the file.
 It then invokes the I/O processor program that is appropriate for the
o type of data
o type of operation
o type of device that is to be written to
 In Unix this program is called a device driver; it sees that your data is moved from its buffer to its proper place on disk.

a. linking file names to files
 All references to files begin with a directory, for it is in directories that file names are kept
 Directory is just a small file that contains, for each file, a file name together with a pointer to the file’s
inode on disk. This pointer from a directory to the inode of a file is called a hard link
 It provides a direct reference from a file name to all other information about the file
 When a file is opened, this hard link is used to bring the inode into memory and to set up the
corresponding entry in the open file table. A field in the inode tells how many hard links there are to
the inode
 There is another kind of link – the soft link or symbolic link – which links a file name to another file
name rather than to an actual file; a soft link is thus the path name of some file
 Soft links are not supported by all UNIX systems
b. Normal files, special files, sockets
 The kernel distinguishes files into 3 categories:
o Normal files – text files
o Special files – represent streams of characters and control signals that drive some devices like
printers
o Sockets – abstractions that serve as endpoints for interprocess communication
Block I/O system
 It is concerned with how to transmit normal file data, viewed by the user as a sequence of bytes, to
block-oriented devices such as disks and tapes. Originally, all blocks were 512 bytes in size
Device drivers
 For each peripheral device, there is a separate set of routines called device drivers
 The device driver is the I/O processor program in the journey of a byte
 Its job is to take a block from a buffer and see that the bytes get deposited in the proper physical place
on the device
Kernel and I/O system
 UNIX file system is the collection of files together with secondary information about the files in the
system
 File system contains
o Directory structure
o Directories
o Ordinary files
o Inodes that describe files
 All these parts reside on disk and are brought into memory by the kernel as needed
Fundamentals of file structure concepts
Field and Record organization
 When we build file structures, we make it possible for data to be persistent: one program can create
data in memory and store it in a file, and another program can read the file and recreate the data in its
memory.
 The basic unit of data is field, in which we have a single data value
 Fields are organized into aggregates, either as an array or as a record. When a record is stored in
memory, we refer to it as an object and refer to its fields as members.
 When that object is stored in a file, we call it a record.
A stream file
 Suppose we need to store name and address information about a collection of people. We can define a
class Person in C++ and use objects of that class to store information about individuals.
 Ex: the sketch below gives a C++ function to write the fields of a person to a file as a stream of bytes
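The original listing is not reproduced in these notes; the following is a minimal sketch in the same spirit (the class layout and member names are assumptions):

    #include <iostream>
    using namespace std;

    class Person {
    public:
        char LastName[11], FirstName[11], Address[16];
        char City[16], State[3], ZipCode[10];
    };

    // Write the fields one after the other, with nothing to separate them
    ostream & operator << (ostream & stream, Person & p)
    {
        stream << p.LastName << p.FirstName << p.Address
               << p.City << p.State << p.ZipCode;
        return stream;
    }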
 The following names and addresses are used as input to the program:
Mary Ames, 123 Maple, Stillwater, OK 74075
Alan Mason, 90 Eastgate, Ada, OK 74820
 When we list the output file on our terminal screen, we see:
AmesMary123 MapleStillwaterOK74075MasonAlan90 EastgateAdaOK74820
 Here all the fundamental units such as Mary Ames and 123 Maple are called fields; a field is the
smallest logically meaningful unit of information in a file. Note that the output carries no information
about where one field ends and the next begins.
Field structure
 There are many ways of adding structure to files to maintain the identity of fields. Four of the most
common methods are:
 Force the fields into a predictable length.
 Begin each field with a length indicator.
 Place a delimiter at the end of each field to separate it from the next field.
 Use a "keyword = value" expression to identify each field and its contents.
 Method 1: Fix the Length of Fields
 The fields in our sample file vary in their length. If we force the fields into predictable lengths, then
we can pull them back out of the file simply by counting our way to the end of the field.
 We can define a struct in C or a class in C++ to hold these fixed-length fields. The size of each array is
one larger than the longest string it can hold, because strings in C and C++ are stored with a
terminating 0 byte.
 But a fixed-size field in a file doesn’t need this extra character, so an object of class Person can be
stored in 10+10+15+15+2+9 = 61 bytes.
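A sketch of such a declaration, with array sizes matching the 61-byte layout above (names are illustrative):

    struct Person {
        char LastName[11];   // 10 file bytes + 1 for the terminating 0 in memory
        char FirstName[11];  // 10 file bytes
        char Address[16];    // 15 file bytes
        char City[16];       // 15 file bytes
        char State[3];       //  2 file bytes
        char ZipCode[10];    //  9 file bytes
    };                       // file record: 10+10+15+15+2+9 = 61 bytes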
 Using this structure, each field in the output is padded with blanks to its full fixed length (Ames
followed by six blanks, and so on).
 Drawback is wastage of space (instead of using 4 bytes to store Ames, we use 10, and so on)
 Method 2: Begin Each Field with a Length Indicator
 Another way to make it possible to count to the end of a field involves storing the field length just
ahead of the field, as illustrated in Fig below.
 If the fields are not too long (length less than 256 bytes), it is possible to store the length in a single
byte at the start of each field.
 Method 3: Separate the Fields with Delimiters
 We can also preserve the identity of fields by separating them with delimiters.
 All we need to do is choose some special character or sequence of characters that will not appear
within a field and then insert that delimiter into the file after writing each field.
 The choice of a delimiter character can be very important since it must be a character that does not
get in the way of processing.
 In many instances white-space characters (blank, new line, tab) make excellent delimiters because
they provide a clean separation between fields when we list them on the console.
 Also, most programming languages include I/O statements that, by default, assume that fields are
separated by white space.
 Unfortunately, white space would be a poor choice for our file since blanks often occur as legitimate
characters within an address field.
 Therefore, instead of white space we use the vertical bar character as our delimiter, so our file
appears as in Fig. below.
Ames|Mary|123 Maple|Stillwater|OK|74075|Mason|Alan|90 Eastgate|Ada|OK|74820|
 Method 4: Use a "Keyword = Value" Expression to Identify Fields
 This option, illustrated in Fig. below, has an advantage that the others do not: It is the first structure
in which a field provides information about itself.
last=Ames|first=Mary|address=123 Maple|city=Stillwater|state=OK|zip=74075|
 Such self-describing structures can be very useful tools for organizing files in many applications. It is
easy to tell what fields are contained in a file, even if we don't know ahead of time what fields the
file is supposed to contain.
 It is also a good format for dealing with missing fields. If a field is missing, this format makes it
obvious, because the keyword is simply not there.
 This format is typically used in combination with another one; here, a delimiter separates the fields.
Reading a stream of fields
 We can write a function that overloads the extraction operator (>>) to read the stream of bytes back
in, breaking the stream into fields and storing them in a Person object.
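A minimal sketch, assuming the Person class and the '|' delimiter introduced above:

    istream & operator >> (istream & stream, Person & p)
    {
        // read each field up to (and consuming) the next '|' delimiter
        stream.getline(p.LastName,  sizeof(p.LastName),  '|');
        stream.getline(p.FirstName, sizeof(p.FirstName), '|');
        stream.getline(p.Address,   sizeof(p.Address),   '|');
        stream.getline(p.City,      sizeof(p.City),      '|');
        stream.getline(p.State,     sizeof(p.State),     '|');
        stream.getline(p.ZipCode,   sizeof(p.ZipCode),   '|');
        return stream;
    }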
 Extensive use is made of the istream method getline. Arguments to getline are a character array to hold
the string, a maximum length and a delimiter.
 getline reads upto 1st occurrence of delimiter or end of line whichever comes first
 When this program is executed, it displays the fields of each person, showing that the field structure
has been recovered from the byte stream.
Record structure
 A record can be defined as a set of fields that belong together when the file is viewed in terms of a
higher level of organization.
 Like the notion of a field, a record is another conceptual tool. It is another level of organization that we
impose on the data to preserve meaning.
 Records do not necessarily exist in the file in any physical sense, yet they are an important logical notion
included in the file structure.
 Here are some of the most often used methods for organizing a file into records:
 Require that the records be a predictable number of bytes in length.
 Require that the records be a predictable number of fields in length.
 Begin each record with a length indicator consisting of a count of the number of bytes that the record
contains.
 Use a second file to keep track of the beginning byte address for each record.
 Place a delimiter at the end of each record to separate it from the next record.
 Method 1: Make Records a Predictable Number of Bytes (Fixed-length Records)
 A fixed-length record file is one in which each record contains the same number of bytes.
 We have a fixed number of fields, each with a predetermined length, which combine to make a
fixed-length record.
 It is important to realize, however, that fixing the number of bytes in a record does not imply that the
sizes or number of fields in the record must be fixed.
 Fixed-length records are frequently used as containers to hold variable numbers of variable length
fields. It is also possible to mix fixed and variable-length fields within a record.
 Variable-length fields might, for example, be placed in a fixed-length record, with the unused bytes
left as padding.
 Method 2: Make Records a Predictable Number of Fields
 Rather than specifying that each record in a file contain some fixed number of bytes, we can specify
that it will contain a fixed number of fields.
 This is a good way to organize the records in the name and address file we have been looking at.
 The writstrm program asks for six pieces of information for every person, so there are six contiguous
fields in the file for each record
 Method 3: Begin Each Record with a Length Indicator
 We can communicate the length of records by beginning each record with a field containing an
integer that indicates how many bytes there are in the rest of the record.
 This is a commonly used method for handling variable-length records.
 Method 4: Use an Index to Keep Track of Addresses
 We can use an index to keep a byte offset for each record in the original file. The byte offsets allow
us to find the beginning of each successive record and also let us compute the length of each record.
 We look up the position of a record in the index and then seek to the record in the data file; the index
file and the data file together form a two-file mechanism.
 Method 5: Place a Delimiter at the End of Each Record
 This option, at a record level, is exactly analogous to the solution we used to keep the fields distinct
in the sample program we developed.
 As with fields, the delimiter character must not get in the way of processing.
 Because we often want to read files directly at our console, a common choice of a record delimiter
for files that contain readable text is the end-of-line character
 For example, using a '#' character as the record delimiter, the file appears as:
Ames|Mary|123 Maple|Stillwater|OK|74075|#Mason|Alan|90 Eastgate|Ada|OK|74820|#
A Record Structure That Uses a Length Indicator
 Selection of a method for record organization depends on the nature of the data and on what you
need to do with it.
 Writing the Variable-length Records to the File
 The buffer can simply be a character array into which we place the fields and field delimiters as we
collect them.
 A sketch of a C++ function WritePerson, written using C string functions, is given below:
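The original listing is not reproduced; the following is a minimal sketch under stated assumptions (the Person class and '|' delimiter from earlier; the buffer size and the two-byte length prefix are illustrative):

    #include <cstring>
    #include <iostream>
    using namespace std;

    void WritePerson (ostream & stream, Person & p)
    {
        char buffer[300];   // large enough to hold one packed record
        strcpy(buffer, p.LastName);  strcat(buffer, "|");
        strcat(buffer, p.FirstName); strcat(buffer, "|");
        strcat(buffer, p.Address);   strcat(buffer, "|");
        strcat(buffer, p.City);      strcat(buffer, "|");
        strcat(buffer, p.State);     strcat(buffer, "|");
        strcat(buffer, p.ZipCode);   strcat(buffer, "|");
        short length = (short) strlen(buffer);
        // write the length in front of the record, then the record itself
        stream.write((char *)&length, sizeof(length));
        stream.write(buffer, length);
    }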
 Representing the Record Length
Option-1
 Write the length in the form of a two-byte binary integer before each record. This is a natural solution in
C, since it does not require us to go to the trouble of converting the record length into character form.
Option-2
 Convert the length into a character string using formatted output.
 With C stream, we use fprintf. With C++ stream class we use overloaded insertion operator (<<).
 Example:
fprintf(file, "%d ", length); // with C streams
stream << length << ' '; // with C++ stream classes
 Each of these lines inserts the length as a decimal string followed by a single blank that functions as
delimiter
 Output from an implementation with text length fields is given by:
40 Ames|Mary|123 Maple|Stillwater|OK|74075| 36 Mason|Alan|90 Eastgate|Ada|OK|74820|
 Each record has its record length preceding the data fields, delimited by a blank
 The 1st record contains the 40 characters starting from Ames through the final delimiter after 74075,
so the number 40 is placed before the record, followed by a blank space.
 Reading variable-length records from the file
 The program must read the length of the record, move the characters of the record into a buffer, and
then break the record into fields. A sketch is given below:
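A minimal sketch of the reading side, assuming the same two-byte binary length prefix and reusing the extraction operator from earlier to break the buffer into fields:

    #include <sstream>

    int ReadVariablePerson (istream & stream, Person & p)
    {
        short length;
        // read the two-byte length that precedes the record
        if (!stream.read((char *)&length, sizeof(length))) return 0;
        char buffer[300];
        stream.read(buffer, length);   // move the record into the buffer
        buffer[length] = 0;            // terminate the string
        istringstream strbuf (buffer); // treat the buffer as an input stream
        strbuf >> p;                   // reuse operator >> to break out the fields
        return 1;
    }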
Using Classes to Manipulate Buffers
 We can use C++ classes to encapsulate, pack, unpack, read and write operations of buffer objects.
 An object of one of these buffer classes can be used for output as follows:
 Start with empty buffer
 Pack field values into object one by one
 Write buffer contents to output stream
 For input:
 Initialize the buffer object by reading a record from the input stream
 Extract the field values one by one
 There are 3 classes
 Delimited field
 Length based field
 Fixed length field
Buffer class for delimited text fields
 DelimTextBuf supports variable-length buffers whose fields are represented as delimited text
 Operations on the buffer include a constructor, read, write, pack and unpack.
 The following sketch declares an object of class Person; an object of class DelimTextBuf packs the
person into the buffer and writes the buffer to a file:
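A minimal usage sketch (the function wrapper and the assumption that p's fields are already filled in are illustrative):

    void StorePerson (Person & p, ostream & stream)
    {
        DelimTextBuf buffer;
        buffer.pack(p.LastName);
        buffer.pack(p.FirstName);
        buffer.pack(p.Address);
        buffer.pack(p.City);
        buffer.pack(p.State);
        buffer.pack(p.ZipCode);
        buffer.write(stream);   // length-prefixed buffer contents go to the file
    }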
 The pack function: int pack(const char*str, int size=-1);
 It copies the characters of its argument str to the buffer, then adds the delimiter character.
 If size is -1, the C function strlen is used to determine the no. of characters to write; otherwise, size
specifies the no. of characters to be written.
 The unpack function: int unpack(char*str);
 It doesn’t need a size, since the field that is being unpacked consists of all of the characters up to the
next instance of the delimiter.
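A sketch of how these two methods might be implemented, assuming internal members Buffer, BufferSize, MaxBytes, a read position NextByte and a delimiter character Delim (all names are assumptions):

    #include <cstring>

    int DelimTextBuf::pack (const char * str, int size)
    {
        int len = (size == -1) ? (int) strlen(str) : size;
        if (BufferSize + len + 1 > MaxBytes) return 0;  // no room left
        memcpy(&Buffer[BufferSize], str, len);
        Buffer[BufferSize + len] = Delim;   // delimiter marks the end of the field
        BufferSize += len + 1;
        return 1;
    }

    int DelimTextBuf::unpack (char * str)
    {
        int start = NextByte;               // first byte of the field
        while (NextByte < BufferSize && Buffer[NextByte] != Delim)
            NextByte++;
        if (NextByte >= BufferSize) return 0;  // no delimiter found
        int len = NextByte - start;
        memcpy(str, &Buffer[start], len);
        str[len] = 0;                       // terminate the extracted string
        NextByte++;                         // skip past the delimiter
        return 1;
    }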
 The read and write functions
 It uses the variable length strategy.
 A binary value is used to represent the length of the record
 Write inserts the current buffer size, then the characters of the buffer.
 Read clears the current buffer contents, extracts the record size, reads the proper no. of bytes into the
buffer and sets the buffer size.
 Extending class person with buffer operations
 Buffer classes can pack any number and type of values, but they do not record how these values are
combined to make objects
 Pack operation
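The original listing is not reproduced; a sketch of a Pack method for class Person, assuming a buffer with pack and clear methods as above:

    int Person::Pack (DelimTextBuf & buffer) const
    {
        int result = 1;
        buffer.clear();                 // assumed method: empty the buffer first
        result = result && buffer.pack(LastName);
        result = result && buffer.pack(FirstName);
        result = result && buffer.pack(Address);
        result = result && buffer.pack(City);
        result = result && buffer.pack(State);
        result = result && buffer.pack(ZipCode);
        return result;
    }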
 Buffer classes for length based & fixed length fields
 The main members and methods of classes LengthTextBuf and FixedTextBuf are sketched below:
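The original declarations are not reproduced; this is a minimal sketch of what they might look like (member names and signatures are assumptions consistent with the surrounding discussion):

    #include <iostream>
    using namespace std;

    class LengthTextBuf          // each field preceded by a length indicator
    {
    public:
        int pack (const char * str, int size = -1); // append length, then chars
        int unpack (char * str);                    // read length, then chars
        int read (istream &);
        int write (ostream &) const;
    protected:
        char * Buffer;      // packed characters of the record
        int BufferSize;     // bytes currently used
        int MaxBytes;       // capacity of the buffer
    };

    class FixedTextBuf           // fixed-length fields, fixed-length records
    {
    public:
        int addField (int fieldSize);  // declare the size of the next field
        int pack (const char * str);   // pad or truncate str to the field size
        int unpack (char * str);
        int read (istream &);
        int write (ostream &) const;
    protected:
        char * Buffer;
        int BufferSize;     // total record size, fixed by the addField calls
        int NumFields;      // number of fields declared so far
    };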
 The method addField is included to support the specification of fields and their sizes
 A buffer for objects of class Person is initialized by the new method InitBuffer of class Person
Using Inheritance for Record Buffer Classes
 Inheritance in C++ stream classes
 C++ incorporates inheritance to allow multiple classes to share members and methods
 One or more base classes define members and methods which are then used by subclasses
 Stream classes are defined in such a hierarchy
 fstream is embedded in a class hierarchy that contains many other classes (read operations, including
the extraction operator, are defined in class istream; write operations are defined in class ostream)
 Class fstream inherits these operations from its parent class iostream, which in turn inherits from
classes istream and ostream
Simplified hierarchy: ios is the base of istream and ostream; istream and ostream combine into iostream;
iostream and fstreambase combine into fstream
 There are 2 base classes, ios and fstreambase, which provide common declarations and basic stream
operations (ios) and access to OS file operations (fstreambase)
 Multiple inheritance is used in these classes (classes can have more than one base class)
 The keyword virtual is used to ensure that class ios is included only once in the ancestry of any of
these classes
 Objects of a class are also objects of their base classes and include the members and methods of those
base classes.
 Ex: an fstream object is also an object of classes fstreambase, iostream, istream, ostream and ios, and
includes all of the members and methods of those base classes
 Hence the read method and extraction operations (>>) defined in istream are also available in
iostream, ifstream and fstream.
 Open and close operations of class fstreambase are also members of class fstream
 Benefits of inheritance is that operations that work on base class object also work on derived class
object.
 Class hierarchy for record buffer object
 Characteristics of 3 buffer classes can be combined into a single class hierarchy as in the figure
below.
 The members and methods common to all the 3 buffer classes are included in the basic class
IOBuffer
 Other methods are in the classes VariableLengthBuffer and FixedLengthBuffer, which support read
and write operations for different types of records.
 LengthFieldBuffer, DelimFieldBuffer and FixedFieldBuffer have the pack and unpack methods for
specific field representations.
Hierarchy: IOBuffer is the base; VariableLengthBuffer (parent of DelimFieldBuffer and LengthFieldBuffer)
and FixedLengthBuffer (parent of FixedFieldBuffer) derive from it
 The members common to all of the buffer classes – Buffer, BufferSize and MaxBytes – are declared as
protected members
 Protected members of the class can be used by methods of the class and by methods of class derived
from the class
 Protected members of IOBuffer can be used by methods in all of the classes in this hierarchy
 Protected members of VariableLengthBuffer can be used in its subclasses but not in classes IOBuffer
or FixedLengthBuffer
 The constructor for IOBuffer has a single parameter, which specifies the maximum number of bytes
 Methods are declared for reading, writing, packing and unpacking.
 IOBuffer defines these methods as virtual to allow each subclass to define its own implementation
 The ‘=0’ marks a method as pure virtual, making IOBuffer an abstract class
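A sketch of such an abstract class (names follow the hierarchy described above; signatures are assumptions):

    class IOBuffer
    {
    public:
        IOBuffer (int maxBytes = 1000);          // maximum bytes in the buffer
        virtual int Read (istream &) = 0;        // read a buffer from the stream
        virtual int Write (ostream &) const = 0; // write the buffer to the stream
        virtual int Pack (const char * field, int size = -1) = 0;
        virtual int Unpack (char * field, int maxBytes = -1) = 0;
    protected:
        char * Buffer;       // the packed characters
        int BufferSize;      // current size of the packed record
        int MaxBytes;        // maximum number of characters allowed
    };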
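The listing itself is not reproduced in these notes; the following sketch is consistent with the description that follows (a two-byte length prefix, and the record's address returned to the caller):

    int VariableLengthBuffer::Write (ostream & stream) const
    {
        // remember where the record starts so its address can be returned
        long recaddr = stream.tellp();
        unsigned short bufferSize = BufferSize;
        // write the record length, then the packed characters
        stream.write((char *)&bufferSize, sizeof(bufferSize));
        if (!stream.good()) return -1;
        stream.write(Buffer, BufferSize);
        if (!stream.good()) return -1;
        return recaddr;
    }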
 The above sketch shows the Write method for VariableLengthBuffer
 !stream and !stream.good() are two ways of testing whether the stream has experienced an error
 Write returns the address in the stream where the record was written, which is determined by calling
stream.tellp()
 Class VariableLengthBuffer (and class FixedLengthBuffer) have functions: read, write and
SizeofBuffer.
 Class DelimFieldBuffer (LengthFieldBuffer and FixedFieldBuffer) have functions: pack and unpack.
Managing Fixed Length, Fixed Field Buffers
 FixedFieldBuffer is the subclass of IOBuffer that supports reading and writing of fixed-length records
 The Write method writes a fixed-size record, and the Read method must know that size in order to
read the record properly
 AddField is used to specify field sizes
 The InitBuffer method is used to initialize the buffer, as sketched below:
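A sketch, assuming an AddField method and the 10/10/15/15/2/9 field sizes used earlier:

    int Person::InitBuffer (FixedFieldBuffer & buffer)
    {
        int result = 1;
        result = result && buffer.AddField(10);  // LastName
        result = result && buffer.AddField(10);  // FirstName
        result = result && buffer.AddField(15);  // Address
        result = result && buffer.AddField(15);  // City
        result = result && buffer.AddField(2);   // State
        result = result && buffer.AddField(9);   // ZipCode
        return result;
    }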
 The Unpack function:
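A sketch of the Unpack method, assuming the buffer records each field's size and tracks the next field to extract (the member names are assumptions):

    #include <cstring>

    int FixedFieldBuffer::Unpack (char * str)
    {
        if (NextField >= NumFields) return 0;  // no more fields to extract
        int len = FieldSize[NextField];        // size of this field
        memcpy(str, &Buffer[NextByte], len);   // copy the field out of the buffer
        str[len] = 0;                          // terminate the string
        NextByte += len;                       // advance past this field
        NextField++;
        return 1;
    }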
An object oriented class for record files
 Class BufferFile supports manipulation of files that are tied to specific buffer types
 An object of class BufferFile is created from a specific buffer object and can be used to open and
create files and to read and write records
 Encapsulation in classes like BufferFile adds safety to our file operations
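A minimal usage sketch (method names such as Open and Read follow the pattern of the notes; the exact signatures and the file name are assumptions):

    #include <fstream>
    using namespace std;

    void ListPersons ()
    {
        DelimFieldBuffer buffer;           // choose a buffer type
        BufferFile file (buffer);          // tie the file to that buffer
        file.Open("person.dat", ios::in);  // open an existing file
        while (file.Read() != -1)          // read the next record into the buffer
        {
            Person p;
            p.Unpack(buffer);              // extract the fields from the buffer
            // ... process p ...
        }
    }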
Managing files of records
 Record access
 When looking for an individual record, it is convenient to identify the record with a key based on the
record’s content (e.g., the Ames record).
 Keys should uniquely define a record and be unchanging.
 Records can also be searched based on a secondary key; secondary keys do not typically identify a
record uniquely
 Ideally a key is dataless (it carries no real data) and has a canonical form (i.e. there are restrictions on
the values that the key may take)
 A primary key should be unchanging. It is the key that is used to identify a record uniquely.
 In general, not every field is a key; keys correspond to fields, or combinations of fields, that may be
used in a search.
 Sequential search
 Evaluating performance
 Sequential search is one of the simplest forms of file searching
 The file is searched one record at a time, until a record is found with a particular key
 Sequential search is slow:
o If there are n records in the file, you may have to look at all of them before you find the one you
want
o If the key you are looking for is in the file, on average you will need to look through n/2 records
before finding it
 Sequential search is said to be O(n), because the time it takes is proportional to n.
 Although sequential search is slow, its performance is not as bad as it might seem, because sequential
search always looks at the adjacent record in the file next
 Therefore, it makes good use of buffering: not every read of the file results in a disk access
 A big chunk of the file is read into a buffer in main memory
 So most reads of the file will not actually result in disk accesses
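A minimal sketch of such a search over the variable-length records used earlier (the key comparison on LastName is illustrative):

    #include <cstring>

    // Sequentially scan the stream for a record whose LastName matches key.
    int SearchByLastName (istream & stream, const char * key, Person & found)
    {
        while (ReadVariablePerson(stream, found))   // read the next record
            if (strcmp(found.LastName, key) == 0)
                return 1;                           // match found
        return 0;                                   // end of file, no match
    }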
 Improving Sequential search performance with record blocking
 We grouped bytes into fields, fields into records, and now records into blocks. Blocking is done strictly
as a performance measure.
 Although blocking can result in substantial performance improvement, it doesn’t change the order of
the sequential search operation. The cost of searching is still O(n).
 The improvement reflects the difference between memory access speed and the cost of accessing
secondary storage
 Blocking saves time because it decreases the amount of seeking
 Blocking doesn’t change the no. of comparisons that must be done in memory, and it probably
increases the amount of data transferred between disk and memory (we always read a whole block,
even if the record we are seeking is the 1st one in the block)
 When is sequential search good?
 Sequential search is usually considered an expensive method, but it is extremely easy to program and
it requires the simplest of file structures.
 There are many situations in which you can use sequential search:
o Your collection of elements is not sorted/cannot be sorted.
o Your collection of elements is very small
o When the number of searches you will perform on the data is low. (That binary search requires
sorted data is a drawback only if the data does not need to be searched many times. If you have to
perform multiple searches, it is worth sorting it once and using binary search rather than
searching in a linear fashion every time.)
 Unix tools for sequential Processing
 The most common file structure in Unix is an ASCII file with the newline character as the record
delimiter and white space as the field delimiter. Such files are simple and easy to process.
 Since records in this kind of file structure are variable in length, they are processed sequentially.
 Among the Unix tools for sequential processing are cat (concatenate and print files), wc (count lines,
words and characters) and grep (search for a pattern).
 We can also combine tools to create, on the fly, some very powerful file processing software. For
example, to find the number of words in all records containing the word Ada:
grep Ada myfile | wc -w (assuming the records are in a file named myfile)
 Direct access
 The most radical alternative to searching sequentially through a file for a record is a retrieval
mechanism known as direct access.
 We have direct access to a record when we can seek directly to the beginning of the record and read it
in. Whereas sequential searching is an O(n) operation, direct access is O(1): no matter how large the
file is, we can still get to the record we want with a single seek.
 Class IOBuffer includes DRead and DWrite functions for direct read and direct write operations,
sketched below:
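A sketch of DRead consistent with the description that follows; DWrite is symmetric (the details are assumptions):

    int IOBuffer::DRead (istream & stream, int recref)
    {
        // seek directly to the requested record address
        stream.seekg(recref, ios::beg);
        if (stream.tellg() != recref) return -1;  // seek failed (past end of file)
        return Read(stream);   // the subclass's Read does the actual reading
    }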
 DRead begins by seeking to the requested spot
 If the request is beyond the end of the file, the function fails
 When the seek succeeds, the Read method of the appropriate buffer subclass is called to read the
record
 Direct access is predicated on knowing where the beginning of the required record is.
 Sometimes this information about record location is carried in a separate index file. But, for the
moment, we assume that we do not have an index.
 We assume, instead, that we know the relative record number (RRN) of the record that we want.
 The idea of an RRN is an important concept that emerges from viewing a file as a collection of
records rather than a collection of bytes.
 If a file is a sequence of records, then the RRN of a record gives its position relative to the beginning
of the file. The first record in a file has RRN 0, the next has RRN 1, and so forth.
 The RRN tells us the relative position of the record we want in the sequence of records, but with
variable-length records we still have to read sequentially through the file, counting records as we go,
so looking for a particular RRN is still an O(n) process.
 With fixed-length records, however, an RRN can be converted directly into a byte offset. For instance,
if we are interested in the record with an RRN of 546 and our file has a fixed-length record size of 128
bytes per record, the byte offset is 546 × 128 = 69,888.
 In general, given a fixed-length record file where the record size is r, the byte offset of a record with
an RRN of n is
byte offset = n × r
More about Record Structures
 Choosing a Record Structure and Record Length
 Once we decide to fix the length of our records so we can use the RRN to give us direct access to a
record, we have to decide on a record length.
 Clearly, this decision is related to the size of the fields we want to store in the record.
 When the sizes of all the fields are fixed, the choice of record length is easy.
 The choice of a record length is more complicated when the lengths of the fields can vary.
 If we choose a record length that is the sum of our estimates of the largest possible values for all the
fields, we can be reasonably sure that we have enough space for everything, but we also waste a lot
of space.
 If, on the other hand, we are conservative in our use of space and fix the lengths of fields at smaller
values, we may have to leave information out of a field.
 Fortunately, we can avoid this problem to some degree through appropriate design of the field
structure within a record.
 Header Records
 It is often necessary or useful to keep track of some general information about a file to assist in
future use of the file.
 A header record is often placed at the beginning of the file to hold this kind of information.
 One simple solution to this problem is to keep a count of the number of records in the file and to
store that count somewhere.
 We might also find it useful to include information such as the length of the data records, the date
and time of the file's most recent update, and so on.
 Header records can help make a file a self-describing object, freeing the software that accesses the
file from having to know a priori everything about its structure, and hence making the file-access
software able to deal with more variation in file structures.
 The header record usually has a different structure than the data records in the file.
 Furthermore, the data records created by a program such as update.c contain only character data,
whereas the header record contains an integer that tells how many data records are in the file
 Adding headers to the C++ buffer classes
o Class IOBuffer includes ReadHeader and WriteHeader methods for this purpose
o The WriteHeader method adds a header to the file and returns the no. of bytes in the header
o The ReadHeader method reads the header back and checks it for consistency, as sketched below:
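A sketch of what these methods might look like; here the header is simply a short string identifying the buffer type (the format is an assumption):

    #include <cstring>

    // Write a simple header identifying the buffer type; return its size in bytes.
    int IOBuffer::WriteHeader (ostream & stream) const
    {
        stream.seekp(0, ios::beg);
        stream.write("IOBuffer", 8);
        if (!stream.good()) return -1;
        return 8;
    }

    // Read the header back and check it for consistency.
    int IOBuffer::ReadHeader (istream & stream)
    {
        char str[9];
        stream.seekg(0, ios::beg);
        stream.read(str, 8);
        if (!stream.good()) return -1;
        return (strncmp(str, "IOBuffer", 8) == 0) ? 8 : -1;
    }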
File Access and File Organization
 We have studied 2 types of file access and file organization
 File Access Method
 The way by which information/data can be retrieved
 There are two methods of file access:
1. Direct Access
2. Sequential Access
Direct Access
 In this access method, the information/data stored on a device can be accessed randomly and
immediately, irrespective of the order in which it was stored
 Data access with this method is quicker than with sequential access; it is also known as the random
access method
 Examples: hard disk, flash memory
Sequential Access
 In this access method, the information/data stored on a device is accessed in the exact order in which
it was stored
 Sequential access is seen in older storage devices such as magnetic tape
 File Organization Method
 The process that determines how data/information is stored so that file access can be as easy and
quick as possible. There are three main ways of organizing a file:
1. Sequential
2. Index-Sequential
3. Random
Sequential file organization
 All records are stored in some sort of order (ascending, descending, alphabetical)
 The order is based on a field in the record
 For example, in a file holding records of employee ID, date of birth and address, the employee ID may
be used as the key, and the records are grouped accordingly (ascending/descending)
 Can be used with both direct and sequential access
Index-Sequential organization
 The records are stored in some order, but there is a second file, called the index file, that indicates
where records with certain key values are located
 The index must be consulted directly, so this organization is not suited to purely sequential access
methods
Random file organization
 The records are stored randomly, but each record has its own specific position on the disk (its address)
 With this method, no time is wasted searching through the file; instead, we jump to the exact position
and access the data/information
 Can only be used with the direct access method
Question Bank
1. Define file structures. Why to study file structures design? What is the driving force behind FS
design?
2. Explain overview of file structure design / explain goals of good FS design
3. Explain history of file structure design
4. Explain the functions of READ and WRITE with parameters
5. With a neat sketch, explain UNIX directory structure
6. Differentiate between physical and logical files
7. Discuss about fundamental file processing operations
8. What is a file? Explain briefly the evolution of file structure design.
9. Discuss about the fundamental File processing Operations.
10. Explain briefly about I/O Redirection and pipes
11. List and explain different Unix file system commands
12. Explain seeking with C and C++ streams.
13. Explain sector based data organization in magnetic disk.
14. Explain the different costs of disk access
15. How the data is physically stored on a CDROM?
16. Differentiate between CLV and CAV
17. What are the different buffering strategies? Explain briefly.
18. Write a note on buffer management.
19. What is seeking and how it is supported in C++ streams.
20. What do you mean by file structure? Explain in brief a short history of file structure design.
21. Briefly discuss the evolution of file structure.
22. What are file structures? What is the driving force behind the file structure design?
23. What is seeking and how it is supported in C Streams and C++ Streams.
24. Explain the following :
 Physical file,
 Logical file,
 Open function,
 Close function, and
 Reading and writing file.
25. Bring out the differences between physical files and logical files.
26. Describe the relation between physical file and the logical file.
27. With a neat sketch, explain UNIX directory structure.
28. Discuss about the Fundamental File processing operations.
29. Explain the functions OPEN, READ, and WRITE with parameters.
30. Explain the following functions:
 Open a file, and
 Close a file
31. Explain the strengths & weaknesses of CD-ROM.
32. Define the following terms :
 Seek time,
 Rotational Delay, and
 Transfer time
33. What are the two basic ways to address data on disks?
34. What are the different buffering strategies? Explain briefly.
35. Write a note on organization of CD-ROM.
36. How the data is physically stored on a CD-ROM? List the major strengths and weaknesses of
CDROMs.
37. Write a note on disk organisation.
38. Explain sector based data organisation in magnetic disk with a neat diagram.
39. Suppose that we want to store a file with 60,000 fixed-length data records, where each record requires
80 bytes and records are not allowed to span two sectors. Given sectors per track = 63, bytes per sector
= 512, tracks per cylinder = 16 and average rotational delay = 6 ms, how many cylinders are required
for the file?
40. Briefly explain the different basic ways to organize the data on a disk.
41. Explain the organization of data on Tapes with neat diagram. With an example estimate the tape
length requirements.
42. Write short notes on Magnetic tapes.
43. Explain the organization of data on tapes, with a neat diagram. Estimate the tape length
requirements, with a suitable example.
44. Calculate the space required on tape to store one million 100-byte records on a 7250-bpi tape that has
an interblock gap of 0.2 inches, using a blocking factor of 60.
45. Explain the different costs of the disk access.
46. What are the three distinct operations that contribute to the total cost of access on disk?
47. Briefly explain the organization of data on Nine-Track tapes with a neat diagram.