File Structures UNIT 1 Notes
UNIT -1
INTRODUCTION TO FILE STRUCTURES
Lecture notes
1.1 The heart of file structure design
DISKs
o have enormous storage capacity
o are non volatile
o cost less than memory
o but are very slow when compared to memory
ANALOGY
o RAM access time: about 120 ns
o Disk access time: about 30 ms, roughly 250,000 times slower
o Suppose finding something in the book in your hand takes 20 sec. If the information is not in
the book and must be looked up in the library instead, then at the same ratio as memory access
to disk access the lookup would take 5 million sec, or almost 58 days.
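The arithmetic behind the analogy can be checked with a few lines of C++. This is only a sketch: the function name scaledSeconds is made up for illustration, and the figures are the ones quoted in the analogy above.

```cpp
#include <cassert>
#include <cstdint>

// Scale a "fast" lookup time by the disk-to-RAM access-time ratio.
// scaledSeconds is an illustrative name, not something from the notes.
std::int64_t scaledSeconds(std::int64_t fastSeconds,
                           std::int64_t ramAccessNs,
                           std::int64_t diskAccessNs)
{
    assert(ramAccessNs > 0);
    // Integer division; exact for 30 ms / 120 ns = 30,000,000 / 120 = 250,000
    std::int64_t ratio = diskAccessNs / ramAccessNs;
    return fastSeconds * ratio;
}
```

With the values from the notes, scaledSeconds(20, 120, 30000000) gives 5,000,000 seconds; dividing by 86,400 seconds per day gives a little under 58 days.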
A disk's relatively slow access time, combined with its enormous nonvolatile capacity, is the
driving force behind FILE STRUCTURE design!
FS should give access to all the capacity without making the application spend a lot of time
waiting for the disk.
FS is a combination of representation for data in files and of operations for accessing the data.
o It allows applications to read, write and modify data
o It also allows applications to find the data
o and to read the data in a particular order
The efficiency of an FS design for a particular application depends on:
o Details of the representation of the data
o Implementation of the operations
The large variety in the types of data and in the needs of applications makes FS design important.
What is best for one situation may be terrible for another.
All of this is easy to achieve if the files do not change, grow or shrink. When information is
added or deleted it is much more difficult.
Initially the storage device was tape,
o Access was sequential
o Accessing cost was directly proportional to the size of the file.
Then came disk drives
o Indexes were added to files
o A list of keys and pointers was kept in a smaller, easily searchable file
o The file could be accessed directly even if it was very large.
o As the indexes grew, they too became difficult to manage.
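The idea of an index as a small list of keys and pointers can be sketched in C++ as follows. The names IndexEntry and findOffset, and the use of a byte offset as the "pointer", are assumptions made for illustration; the notes do not prescribe a layout.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One index entry: a key and the byte offset of its record in the data file.
// (Illustrative layout; the notes do not define one.)
using IndexEntry = std::pair<std::string, long>;

// Binary-search an index that is kept sorted by key.
// Returns the record's byte offset, or -1 if the key is not present.
long findOffset(const std::vector<IndexEntry>& index, const std::string& key)
{
    assert(std::is_sorted(index.begin(), index.end()));
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& e, const std::string& k) { return e.first < k; });
    if (it != index.end() && it->first == key)
        return it->second;
    return -1;
}
```

Because the index is much smaller than the data file, it can be searched quickly, and the offset it returns lets the program seek directly to the record.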
Early 1960s
o Idea of applying tree structures emerged.
o But trees can grow unevenly as records are added or deleted,
o resulting in long searches that require many disk accesses to find a record.
In 1963
o The AVL tree was developed: a self-adjusting binary tree structure for data in
memory.
o Some researchers implemented the AVL tree structure for files.
o The problem was that dozens of accesses were required to find a record in even
moderate-sized files.
o A method was required to keep a tree balanced when each node of the tree was not a
single record, as in a binary tree, but a file block containing dozens or hundreds of
records.
B-tree
o The B-tree emerged after about ten years of design work
o An AVL tree grows top-down, whereas a B-tree grows bottom-up
o It provides excellent access performance
o Sequential access, however, is not efficient in a B-tree
B+ trees
o Solved the problem of sequential access in B-tree
o Added a linked list at the bottom level of the B-tree.
B-trees and B+ trees became the basis for many commercial file systems.
They provide access times that grow in proportion to log_k N (the logarithm of N to base k), where
o N is the number of entries in the file
o k is the number of entries indexed in a single block of the B-tree structure.
Practically, you can find one file entry among millions of others with only three or four trips to
the disk.
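The "three or four trips to the disk" claim follows from the log_k N growth rate. The sketch below counts how many levels are needed before k to that power reaches N; the function name levelsNeeded is illustrative, and the loop ignores overflow for extremely large N.

```cpp
#include <cassert>
#include <cstdint>

// Smallest number of levels h such that k^h >= n, i.e. ceil(log_k n).
// Each level corresponds to one block read from disk.
int levelsNeeded(std::uint64_t n, std::uint64_t k)
{
    assert(k >= 2);
    int levels = 0;
    std::uint64_t reachable = 1;  // entries distinguishable after `levels` reads
    while (reachable < n) {
        reachable *= k;
        ++levels;
    }
    return levels;
}
```

For example, with 100 entries per block, one million entries need 3 levels and one hundred million need 4, whereas a binary tree (k = 2) over one million entries needs 20.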
B-trees guarantee that performance stays about the same even as entries are added or deleted.
Hashing is a good way to get what we want with a single request, provided the file does not
change size.
In the early days, hashed indexes were used to provide fast access to files.
Extendible (dynamic) hashing retrieves the information with one or, at most, two disk accesses
no matter how big the file becomes.
The receptionist does not say, "You have a call from 814-789-1903."
I need to have the call identified logically, not physically.
Question.
o What does the above program do?
How to open a file in C++?
o ofstream outClientFile( "clients.dat", ios::out );
OR
o ofstream outClientFile;
o outClientFile.open( "clients.dat", ios::out );
File Open Modes
o ios:: app - (append) write all output to the end of file
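The effect of an open mode is easiest to see by comparing two of them. The sketch below contrasts ios::app (append) with ios::trunc (discard existing contents); both are standard C++ open modes, alongside ios::in, ios::out, ios::ate and ios::binary. The helper names appendLine, overwriteWith and countLines are made up for this example.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Append one line; ios::app sends every write to the end of the file,
// so the existing contents are kept.
bool appendLine(const std::string& path, const std::string& line)
{
    assert(!path.empty());
    std::ofstream out(path, std::ios::out | std::ios::app);
    if (!out) return false;
    out << line << '\n';
    return static_cast<bool>(out);
}

// Replace the file's contents; ios::trunc discards whatever was there.
bool overwriteWith(const std::string& path, const std::string& line)
{
    assert(!path.empty());
    std::ofstream out(path, std::ios::out | std::ios::trunc);
    if (!out) return false;
    out << line << '\n';
    return static_cast<bool>(out);
}

// Count the lines in a text file (0 if it cannot be opened).
int countLines(const std::string& path)
{
    std::ifstream in(path, std::ios::in);
    int count = 0;
    std::string line;
    while (std::getline(in, line)) ++count;
    return count;
}
```

Calling overwriteWith and then appendLine on the same file leaves two lines in it; calling overwriteWith again leaves one.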
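      // Position the put pointer at the slot for this account number
      outCredit.seekp( ( client.accountNumber - 1 ) * sizeof( clientData ) );
      // Write the whole record to the file as raw bytes
      outCredit.write( reinterpret_cast<const char *>( &client ), sizeof( clientData ) );
      // Prompt for the next account number
      cout << "Enter account number\n? ";
      cin >> client.accountNumber;
   }
   return 0;
}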
The istream member function read inputs a specified number of bytes (here sizeof( clientData ))
from the current position of the specified stream into an object.
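The write and read calls described above can be put together as a pair of helpers. This is a sketch under assumptions: the notes do not show the fields of clientData, so ClientRecord below is a hypothetical stand-in, and writeRecord and readRecord are illustrative names. Record n is stored in slot n - 1, matching the seekp expression in the fragment above.

```cpp
#include <cassert>
#include <fstream>

// Hypothetical fixed-length record (the notes do not show clientData's fields).
struct ClientRecord {
    int accountNumber;
    char lastName[15];
    double balance;
};

// Write the record into slot (accountNumber - 1) of the file.
bool writeRecord(std::fstream& file, const ClientRecord& rec)
{
    assert(rec.accountNumber >= 1);
    file.clear();  // drop any earlier eof/fail state
    std::streamoff offset = static_cast<std::streamoff>(rec.accountNumber - 1)
                          * static_cast<std::streamoff>(sizeof(ClientRecord));
    file.seekp(offset, std::ios::beg);
    file.write(reinterpret_cast<const char*>(&rec), sizeof(ClientRecord));
    file.flush();
    return static_cast<bool>(file);
}

// Read the record stored in slot (accountNumber - 1).
// Returns false if a full record could not be read.
bool readRecord(std::fstream& file, int accountNumber, ClientRecord& rec)
{
    assert(accountNumber >= 1);
    file.clear();
    std::streamoff offset = static_cast<std::streamoff>(accountNumber - 1)
                          * static_cast<std::streamoff>(sizeof(ClientRecord));
    file.seekg(offset, std::ios::beg);
    file.read(reinterpret_cast<char*>(&rec), sizeof(ClientRecord));
    return file.gcount() == static_cast<std::streamsize>(sizeof(ClientRecord));
}
```

The file should be opened in binary mode for both reading and writing, for example with ios::in | ios::out | ios::binary. Writing raw bytes this way ties the file to one compiler and machine layout, which is acceptable for a classroom example but not for portable data.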
1.7 Seeking
Reading and printing a sequential file
// Reading and printing a sequential file
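#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>
#include <string>
using namespace std;
void outputLine( int, const char *, double );

int main()
{
   // ifstream constructor opens the file
   ifstream inClientFile( "clients.dat", ios::in );
   if ( !inClientFile ) {
      cerr << "File could not be opened\n";
      exit( 1 );
   }
   // Assumed record layout (this part is missing from the notes): account, name, balance
   int account; string name; double balance;
   while ( inClientFile >> account >> name >> balance )
      outputLine( account, name.c_str(), balance );
   return 0;
}

// Print one record as a left-aligned line
void outputLine( int account, const char *name, double balance )
{
   cout << left << setw( 10 ) << account << setw( 13 ) << name << balance << '\n';
}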
File position pointer
o The istream and ostream classes provide member functions for repositioning the file
position pointer (the byte number of the next byte in the file to be read or written).
o These member functions are:
seekg (seek get) for the istream class
seekp (seek put) for the ostream class
Examples of moving a file pointer
o inClientFile.seekg( 0 ) - repositions the file get pointer to the beginning of the file
o inClientFile.seekg( n, ios::beg ) - repositions the file get pointer to the n-th byte of the file
o inClientFile.seekg( m, ios::end ) - repositions the file get pointer m bytes from the end of
the file (m must be zero or negative to stay within the file)
o inClientFile.seekg( 0, ios::end ) - repositions the file get pointer to the end of the file
o The same operations can be performed with the ostream member function seekp.
Member functions tellg() and tellp().
o Member functions tellg and tellp are provided to return the current locations of the get
and put pointers, respectively.
o long location = inClientFile.tellg();
o To move the pointer relative to the current location use ios::cur
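Two common uses of seekg and tellg are finding the size of a file and reading its last few bytes. The sketch below shows both; the helper names fileSize and lastBytes are made up for this example.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Size of a file in bytes, or -1 if it cannot be opened.
std::streamoff fileSize(const std::string& path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) return -1;
    in.seekg(0, std::ios::end);  // move the get pointer to the end
    return in.tellg();           // its position is the file size
}

// The last n bytes of a file (fewer if the file is shorter than n).
std::string lastBytes(const std::string& path, std::streamoff n)
{
    assert(n >= 0);
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) return "";
    in.seekg(0, std::ios::end);
    std::streamoff size = in.tellg();
    if (n > size) n = size;
    in.seekg(-n, std::ios::end);  // negative offset: n bytes back from the end
    std::string buffer(static_cast<std::size_t>(n), '\0');
    in.read(&buffer[0], n);
    return buffer;
}
```

Note that the offset passed with ios::end is negative, because the get pointer is being moved backwards from the end of the file.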
Problems with sequential files
o If we want to modify a record, the new data may be longer than the old data and could
overwrite part of the record that follows it.
o Sequential files are inappropriate for so-called instant access applications, in which a
particular record of information must be located immediately.
o These applications include banking systems, point-of-sale systems, airline reservation
systems (or any database system).
Random access files
o Instant access is possible with random access files.
o Individual records of a random access file can be accessed directly (and quickly) without
searching many other records.
1.8 Special Characters in Files
All computer systems have reserved a number of characters for specific system functions.
Examples:
o Control-Z often indicates end-of-file in MS-DOS programs
o Control-D often indicates end-of-file in Unix programs
o CR (Carriage Return) and LF (Line Feed) characters together indicate end-of-line in MS-DOS
text files (Unix uses LF alone)
1.9 Directory Structures
Files are stored in directories. Thus directories are collections of files
Most modern systems maintain a tree directory structure.
1.10 Physical Devices and Logical Files
I/O Redirection
o I/O redirection allows the source of input to be changed so that it comes from a file instead
of the keyboard:
program < file /* program reads input from a file instead of the keyboard */
o I/O redirection also allows output to be directed to a file instead of the screen:
program > file /* program writes to a file instead of the screen */
1.12 Pipes
o The output of one program can be used as the input to another program by using pipes:
o Example:
program1 | program2
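Redirection and pipes work with any program that reads standard input and writes standard output. The sketch below is one such filter, written against generic streams so the same function works with cin and cout or with in-memory strings. The names upcaseFilter and upcaseText are invented for the example, and converting to upper case is an arbitrary choice of transformation.

```cpp
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

// Copy every line from `in` to `out`, converted to upper case.
// Returns the number of lines processed.
int upcaseFilter(std::istream& in, std::ostream& out)
{
    int lines = 0;
    std::string line;
    while (std::getline(in, line)) {
        for (char& c : line)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        out << line << '\n';
        ++lines;
    }
    return lines;
}

// Convenience wrapper for trying the filter on an in-memory string.
std::string upcaseText(const std::string& text)
{
    std::istringstream in(text);
    std::ostringstream out;
    upcaseFilter(in, out);
    return out.str();
}
```

A main function that simply calls upcaseFilter( std::cin, std::cout ) produces a program that can be run as program < file, program > file, or placed on either side of a pipe, without any change to the code.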
1.11 Secondary Storage Management
Secondary storage devices:
o have much longer access time than main memory
o have access times that vary from one access to another (some accesses are relatively
fast and other accesses are slower on the same device)
o have much more storage than main memory
o have storage that is non-volatile
Disks
o In the early 1990s, controller speeds improved so that disks can now offer non-interleaving
(also known as 1:1 interleaving).
Clusters
o A cluster is a fixed number of consecutive (logical) disk sectors.
o Some operating systems view each file as a series of clusters.
o Clusters are designed to improve performance since all sectors in one cluster can be
accessed without an additional seek.
o Extents
Extents of a file are those parts of the file which are stored in contiguous
clusters.
It is very beneficial to store the whole file in one extent (seek time is minimized).
Fragmentation
o Fragmentation is the disk space wasted because the smallest organizational unit of a disk
is one sector.
o If the sector size is 512 bytes, then even if we need to store only one byte we have to
allocate one whole sector to it. Thus 511 bytes are wasted.
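The wasted space in the example above can be computed for any file size. A minimal sketch, with illustrative function names sectorsNeeded and wastedBytes:

```cpp
#include <cassert>
#include <cstdint>

// Number of whole sectors needed to hold fileBytes bytes.
std::uint64_t sectorsNeeded(std::uint64_t fileBytes, std::uint64_t sectorBytes)
{
    assert(sectorBytes > 0);
    return (fileBytes + sectorBytes - 1) / sectorBytes;  // round up
}

// Bytes allocated to the file but left unused in its last sector.
std::uint64_t wastedBytes(std::uint64_t fileBytes, std::uint64_t sectorBytes)
{
    return sectorsNeeded(fileBytes, sectorBytes) * sectorBytes - fileBytes;
}
```

With 512-byte sectors, a 1-byte file wastes 511 bytes, a 512-byte file wastes none, and a 1000-byte file occupies 2 sectors and wastes 24 bytes.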
Blocks
o Some disks allow data to be stored in user-defined blocks instead of sectors.
o When the data on a disk is organized in blocks, this usually means that the amount of
data transferred in a single I/O operation can vary.
o Blocks can be either variable or fixed length.
o Block organization can be more efficient than sector organization but it is much more
complex.
Non-data Overhead
o Non-data overhead stored at the beginning of each sector includes:
sector address
track address
sector usability
The Cost of Disk Access
o Seek time
the time required to move the r/w head to the correct cylinder
o Rotational delay
the time required to rotate the disk so that the correct sector is positioned
under the r/w head
o Transfer time
the time required to transfer the data:
transfer time = (number of bytes transferred / number of bytes on a track) x rotation time
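The three components above add up to the cost of one disk access. The sketch below estimates that cost, taking the transfer time as the fraction of a track being transferred multiplied by the rotation time. It assumes, as is usual for estimates of this kind, that the average rotational delay is half a revolution. DiskParams and accessTimeMs are illustrative names, and the caller supplies the drive's figures.

```cpp
#include <cassert>

// Drive characteristics supplied by the caller (no real drive is implied).
struct DiskParams {
    double avgSeekMs;      // average seek time
    double rotationMs;     // time for one full revolution
    double bytesPerTrack;  // capacity of one track
};

// Estimated time in milliseconds to read `bytes` bytes in one access.
double accessTimeMs(const DiskParams& d, double bytes)
{
    assert(d.bytesPerTrack > 0);
    // Assumption: on average the desired sector is half a revolution away.
    double rotationalDelay = d.rotationMs / 2.0;
    // Transfer time = (bytes transferred / bytes on a track) x rotation time
    double transfer = (bytes / d.bytesPerTrack) * d.rotationMs;
    return d.avgSeekMs + rotationalDelay + transfer;
}
```

For a hypothetical drive with an 8 ms average seek, a 10 ms rotation and 40,000 bytes per track, reading 10,000 bytes costs about 8 + 5 + 2.5 = 15.5 ms. The seek and rotational delay dominate, which is why file structures try to get as much useful data as possible out of each access.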
Disks as Bottlenecks
o Disk speeds lag far behind
CPU
main memory
local network
o Computer programs spend most of their time waiting for data from the disk
Improving Disk Performance
o Disk striping
splitting the parts of a single file on several drives
o RAID
Redundant Array of Inexpensive Disks
o RAM disk
o Disk caching
o Buffering
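As an illustration of disk striping from the list above, the sketch below maps a file's logical block numbers onto several drives in round-robin order. Round-robin placement is one common scheme and is assumed here; the notes do not say which layout is used. StripeLocation and locate are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// Where a logical block of a striped file ends up.
struct StripeLocation {
    std::uint64_t drive;         // which drive holds the block
    std::uint64_t blockOnDrive;  // position of the block on that drive
};

// Round-robin striping: block 0 goes to drive 0, block 1 to drive 1, and so on,
// wrapping around after the last drive.
StripeLocation locate(std::uint64_t logicalBlock, std::uint64_t driveCount)
{
    assert(driveCount > 0);
    return { logicalBlock % driveCount, logicalBlock / driveCount };
}
```

With 4 drives, logical blocks 0 to 3 land on drives 0 to 3, and block 5 is the second block on drive 1. Consecutive blocks sit on different drives, so they can be transferred in parallel.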
Buffering of disk data in main memory reduces seek time; thus disks are commonly used for
sequential file processing too.
Tapes are still the most common long-term archival storage.
1.14 CD-ROM