File Organization Methods

The document discusses different methods for organizing files, including sequential, random, serial, and indexed-sequential organization. It provides details on each method, such as how records are stored and accessed, advantages and disadvantages. Sequential organization stores records in a sorted order, allowing for faster searching methods like binary search. Random organization stores records randomly but allows direct access via a record key. Serial organization stores records sequentially without sorting, mainly used for magnetic tapes. Indexed-sequential organization stores records sequentially but uses an index to enable direct access to individual records.

Uploaded by

Aastha Chauhan
Copyright
© All Rights Reserved

File organization methods

File organization refers to the way data is stored in a file. File organization is very important because
it determines the methods of access, efficiency, flexibility and storage devices to use. There are four
methods of organizing files on a storage medium. These are:

 sequential,
 random,
 serial and
 indexed-sequential

Sequential file organization

 Records are stored and accessed in a particular order, sorted using a key field.
 Basic retrieval requires searching sequentially through the file, record by record, until the
record is found or the end of the file is reached.
 Because the records in a file are sorted in a particular order, better searching methods such as
the binary search technique can be used to reduce the time spent searching a file.
 Since the records are sorted, it is possible to know in which half of the file a particular record
being searched for is located. Hence this method repeatedly divides the set of records in the file
into two halves and searches only the half in which the record is found.
 For example, if the file has records with key fields 20, 30, 40, 50, 60 and the computer is
searching for a record with key field 50, it starts at 40 and searches upwards, ignoring the first
half of the set.
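
The halving idea described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the key fields are already held in a sorted list in memory; the function name binary_search is chosen here for illustration and does not come from the original notes.

```python
def binary_search(keys, target):
    """Return the position of target in the sorted list keys, or -1 if absent."""
    low, high = 0, len(keys) - 1
    while low <= high:
        mid = (low + high) // 2
        if keys[mid] == target:
            return mid
        if keys[mid] < target:
            low = mid + 1    # the target can only be in the upper half
        else:
            high = mid - 1   # the target can only be in the lower half
    return -1


keys = [20, 30, 40, 50, 60]
print(binary_search(keys, 50))  # 3 (the first probe is 40, then the upper half is searched)
```

Each pass through the loop discards half of the remaining keys, which is why the records must be sorted before this technique can be used.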

Advantages of sequential file organization

 The sorting makes it easy to access records.
 The binary chop (binary search) technique can be used to reduce record search time, since each
comparison discards half of the records still to be searched.

Disadvantages of sequential file organization

 The sorting does not remove the need to access other records as the search looks for
particular records.
 Sequential records cannot support modern technologies that require fast access to stored
records.
 The requirement that all records be of the same size is sometimes difficult to enforce.

Random or direct file organization

 Records are stored randomly but accessed directly.
 To access a record in a file stored randomly, a record key is used to determine where the record
is stored on the storage media.

Magnetic and optical disks allow data to be stored and accessed randomly.

Advantages of random file access

 Quick retrieval of records.


 The records can be of different sizes.

Serial file organization


 Records in a file are stored and accessed one after another.
 The records are not sorted in any way on the storage medium. This type of organization is
mainly used on magnetic tapes.

Advantages of serial file organization

 It is simple
 It is cheap

Disadvantages of serial file organization

 It is cumbersome to access because you have to access all preceding records before
retrieving the one being searched for.
 Space on the medium is wasted in the form of inter-record gaps.
 It cannot support modern high-speed requirements for quick record access.

Indexed-sequential file organization method

This method is similar to the sequential method, except that an index is used to enable the computer
to locate individual records on the storage media. For example, on a magnetic drum, records are stored
sequentially on the tracks. However, each record is assigned an index that can be used to access it
directly.

Indexing

Indexing is a data structure technique that helps to speed up data retrieval. Because it lets us quickly
locate and access data in a database, it is a must-know technique for database optimization.
Indexing minimizes the number of disk accesses required when a query is processed.
An index is made up of two columns:

 The first column is the search key. It contains a copy of the primary key or candidate key of
the table. The values of this column may be sorted or not, but if the values are sorted,
the corresponding data can be accessed more easily.
 The second column is the data reference or pointer. It contains the address of the disk
block where the corresponding key value can be found.

Types of indexing

Primary Indexing

Primary indexing only has two columns. First column has the primary key values which are the search
keys. The second column has the pointers which contain the address to the corresponding data block
of the search key value. The table should be ordered and there is a one-to-one relationship between
the records in the index file and the data blocks. This is a more traditional yet a fast mechanism.

 Dense Index

There is an index record that contains a search key and pointer for every search key value in the data
file. Though the Dense index is a fast method it requires more memory to store index records for each
key value.

 Sparse Index

There are index records for only some of the search key values. To find a record, we locate the index
record with the largest search key that does not exceed the key we want, follow its pointer into the
data file, and then search sequentially from that location until the actual record is found. Though
sparse indexing is more time-consuming, it requires less memory to store index records as it has
fewer of them.
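
The difference between the two index types can be sketched as follows. This is a hedged illustration only: the sample records, the block size of three, and the names dense_lookup and sparse_lookup are assumptions made for the example, and the "data file" is simply a list of blocks held in memory.

```python
from bisect import bisect_right

# Data file: records sorted by key and grouped into blocks (block size 3 is an assumption).
blocks = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H")],
]

# Dense index: one entry for every search key -> (block number, position in block).
dense_index = {
    key: (b, i)
    for b, block in enumerate(blocks)
    for i, (key, _) in enumerate(block)
}

# Sparse index: one entry per block, holding only the first key of that block.
sparse_keys = [block[0][0] for block in blocks]  # [10, 40, 70]


def dense_lookup(key):
    """Go straight to the record using the dense index."""
    position = dense_index.get(key)
    if position is None:
        return None
    b, i = position
    return blocks[b][i][1]


def sparse_lookup(key):
    """Find the last index entry <= key, then scan that block sequentially."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for record_key, value in blocks[b]:
        if record_key == key:
            return value
    return None


print(dense_lookup(50), sparse_lookup(50))  # E E
```

The dense index holds eight entries here while the sparse index holds three, which mirrors the trade-off described above: fewer index records, at the cost of a short sequential scan.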

Secondary Indexing (Non clustered Indexing)

The columns in secondary indexing hold the values of the candidate key along with the respective
pointer, which contains the address of the location of the values. The index does not point to the
data directly; the two are linked through an intermediate node.

Clustered Indexing

The table is ordered in clustered indexing. When an index is created on non-primary-key columns,
two or more columns are combined to obtain values that identify the data uniquely, and these values
are used to create the index.

The pointers point to the cluster as a whole.


Multilevel Indexing

If the primary index does not fit in memory, multilevel indexing is used. As the database grows,
its indices grow as well, and a single-level index can become too big to store in main
memory. In multilevel indexing, the index is broken down into smaller blocks so that the
outermost level is small enough to be stored in main memory.

 B+ Tree Indexing
 B-Tree Indexing

Hashing in Data Structure: What, Types, and Functions


Hashing in data structures is a technique used to quickly locate a specific value within a given
collection. A hash function computes a hash code for each element, and that code determines where
the element is stored. This allows for quick look-up when searching for a specific value, as well as
easy identification of any duplicates.

A hash value can also be used to validate an imported file. In a data structure, using an item's hash
key speeds up the search, improving both search efficiency and retrieval effectiveness. Hashing is an
important tool to have in your arsenal when constructing data structures and can be used in many
different ways.

What is Hashing?

Hashing involves changing one value into another based on a specified key or string of characters.
The original string is often represented with a smaller, fixed-length value or key, which makes it
simpler to locate or use.

Implementing hash tables is the most popular use of hashing. A hash table stores key and value pairs
in a list that is accessed by index. Since the number of possible keys is unlimited while the table has
a fixed size, the hash function maps each key to an index within the table size. The result of applying
the hash function to a key is called its hash value.

A hash value, or simply a hash, is a value generated by a mathematical hashing algorithm. A good
hash uses a one-way hashing algorithm to prevent the hash from being converted back into its
original key.

Use cases of Hashing

1. Data Retrieval

Using algorithms or functions, hashing converts object data into a useful integer value. Once objects
have been placed in a map according to these values, a hash can be used to narrow down a search
for them.

For instance, developers store data, such as a customer record, as key and value pairs in hash tables.
Keys are used to identify the data and are the input to the hash function, while the resulting hash
codes are mapped to integers of a predetermined size.

2. Digital Signatures

In addition to allowing quick data retrieval, hashing is used in digital signatures, which verify the
sender of a message. In this case, a hash function reduces the message to a hashed value, or message
digest, and the digest is signed; the message and the signature are then transmitted to the recipient.

When the message is received, the same hash function is applied to it to create a message digest,
which is then compared to the digest recovered from the signature to make sure they are
identical. The hash function indexes the original value or key in a one-way operation and so gives
access to the data linked to that value or key.

What is Hashing in Data Structure?

Hashing in data structure refers to a method of mapping a huge amount of data into smaller
tables. Also known as the message digest function, it is a method for distinguishing one distinct
object from a group of related ones. By condensing the original input strings and data assets to short
alphanumeric hash keys, developers and programmers can save both time and file space.

Hashing helps to focus a search for a particular item on a data map. Hash codes create an index
to hold values in this scenario. Hashing is therefore employed to index and retrieve data from a
database, since it speeds up the process; it is much simpler to find an item using a short hashed key
than using its original value.

Hash tables store data in an array format, in which each value has its own index number. The hash
table generates these index numbers for the stored values using a method called the hash technique.

Some real-life examples of hashing in data structure include -


 In Libraries - A library contains a very large number of books. The librarian assigns each book a
unique number. This distinctive number aids in locating the exact position of the book on
the bookshelf.

 In Schools - Each student in a class is assigned a unique roll number for easy identification.
The school authority, later on, uses the unique roll number to retrieve relevant information
about the particular student.

 In diagnostic centers and laboratories - Medical laboratories use unique serial numbers to
identify and distinguish patient samples and patient information.

 Security Purposes - When users first visit a website that requests authentication through "Sign
Up" and submit login information to access their personal accounts, precautions must be taken
so that the account does not end up in the wrong hands. For this reason, the database stores
the entered password as a hash rather than in its original form.

How does Hashing in Data Structure Work?

In data structures, hashing is a technique used to store and retrieve data. It is fundamental to many
data structures, such as hash tables and hash trees. Hashing involves mapping data to a value called
a hash code. The hash code is then used to index into an array, where the data is stored. To retrieve
the data, the hash code is simply re-computed and used to index into the array.

Hashing is an efficient way to store and retrieve data because it avoids most comparisons between
elements: the hash code leads directly to the place where an element is stored. Different keys can
produce the same hash code, however, so a practical hash table also needs a way to handle such
collisions. Hashes are typically generated using a hashing algorithm, which takes an
input and produces a hash code.

Let's look at a hashing example to get a better understanding of how it works -

Say we want to map a list of string keys to a list of string values, for example countries to their
capital cities. Let's say we wish to store the information in Table 1.

Table 1

Key            Value
Italy          Rome
France         Paris
England        London
Australia      Canberra
Switzerland    Berne

Now, assuming that the hash function simply returns the length of the string, the hash
table will look like this -

Position (hash = key length)    Key            Value
2
5                               Italy          Rome
6                               France         Paris
7                               England        London
9                               Australia      Canberra
10
11                              Switzerland    Berne

As Italy's hash code (length) is 5, we place Italy in the 5th position of the Keys array and
Rome at the 5th index of the Values array, and so on.
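
The country and capital example can be sketched in Python as below. This is only an illustration under stated assumptions: the names length_hash, table and lookup are invented for the sketch, and the table is given 12 positions so that index 11 exists.

```python
capitals = {
    "Italy": "Rome",
    "France": "Paris",
    "England": "London",
    "Australia": "Canberra",
    "Switzerland": "Berne",
}


def length_hash(key):
    """Toy hash function from the example: the hash is the length of the key."""
    return len(key)


table = [None] * 12  # positions 0..11; unused positions stay empty
for country, capital in capitals.items():
    table[length_hash(country)] = (country, capital)


def lookup(country):
    """Recompute the hash and read that position directly."""
    entry = table[length_hash(country)]
    if entry is not None and entry[0] == country:
        return entry[1]
    return None


print(lookup("Italy"))   # Rome
print(lookup("Spain"))   # None ("Spain" also hashes to 5, but that slot holds Italy)
```

The second lookup shows the weakness of this toy hash function: any two keys of the same length land on the same position, which is the collision problem treated later in these notes.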

What is the 'Key' in Hashing?

The hashing key is the raw data that is to be hashed in a hash table. The hashing algorithm
carries out a function to translate the key into the hash value. The hash value is simply the
outcome of feeding the key through the hashing algorithm.

Hash Key = Key Value % Number of Slots in the Table

Keys also play a role in cryptography, where hashing is often used. The common types of
cryptographic keys are -

 Public key - A public key is one half of an 'asymmetric' key pair and is used to
encrypt data. Asymmetric mechanisms are relatively slow compared with symmetric ones.
The common uses of public keys include the functions of cryptography, the
transfer of bitcoins, and securing online sessions. Because a public key works
together with a matching private key, publishing it does not compromise overall security.
 Private key - In 'symmetric' encryption, a single secret key is employed in both
encryption and decryption, and each party that sends or receives the encrypted
information shares that key. In asymmetric encryption, the private key is the half of the
key pair that is kept secret by its owner. A private key is typically a long,
impossible-to-guess string of bits generated randomly or pseudo-randomly.
 SSH public key - Secure Shell (SSH) uses both a public and a private key. An SSH key
pair can be used to authenticate a remote connection. The public key is made available
to the remote servers, while the private key stays with its owner.

What are Hash Function and Hash Table?

What is a Hash Table?

The hash table is an associative data structure that stores data as key-value pairs. An array
format is used to store hash table data, where each value is assigned its own index. By
knowing the index of the desired data, we can access it very quickly.

As a result, inserting and searching data is very fast on average, regardless of data size. In a
hash table, elements are stored in an array, and the index of each element is generated using
a hashing technique.

What is Hash Function?

A hash function transforms a key into a hash key through a fixed process. Hash
values are fixed-length values derived from a key. The hash value is usually shorter than
the original, yet it still depends on the entire original sequence of characters. In digital
signatures, for example, the recipient receives both the message and a hash value protected
by the signature. The receiver applies the same hash algorithm to the message and compares
the hash value it generates with the one received along with the message. If the two hash
values match exactly, the message arrived without alteration.
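
The comparison of hash values on the receiving side can be sketched with Python's standard hashlib module. The choice of SHA-256 and the names digest and verify are assumptions made for this illustration; the notes do not name a particular algorithm, and a real digital signature would additionally sign the digest with a private key.

```python
import hashlib


def digest(message: bytes) -> str:
    """Return a fixed-length hash value (SHA-256, hex-encoded) for the message."""
    return hashlib.sha256(message).hexdigest()


def verify(received_message: bytes, received_digest: str) -> bool:
    """Recompute the hash on the receiving side and compare it with the one sent."""
    return digest(received_message) == received_digest


message = b"transfer 100 to account 42"
sent_digest = digest(message)

print(len(sent_digest))                                    # 64 hex characters, whatever the message length
print(verify(message, sent_digest))                        # True
print(verify(b"transfer 900 to account 42", sent_digest))  # False
```

Changing a single character of the message produces a completely different digest, so the comparison fails for the altered message.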

The characteristics of the hash function are -

 There is no limit to the length of a message
 Message digests are generated with a fixed length
 A message digest can be computed quickly (and easily)
 The original message cannot be recovered from its digest - hashing is irreversible
 The message digest changes dramatically when small changes are made to the message
 The hash is collision-resistant: it should be infeasible to find two different messages that
result in the same hash value

Some important characteristics of the hash function are described below -

 Collision free - A hash function is regarded as collision-free when it has two key features.
First, the function should map inputs evenly across all feasible outputs. Second, the
function must be deterministic, guaranteeing that it will always yield the same result
when given the same input. With a collision-free hash function, it should be practically
impossible to find two inputs that map to the same output hash.
 Property to be hidden - A hash function is used to map data of arbitrary size to
fixed-size data. One key characteristic of a good hash function is that it should be hard
to guess the input value from its output. The output should also be evenly
distributed across all possible values, so that the outputs are spread throughout the
range of possible values.
 Puzzle friendly - A hash function ought to be suitable for puzzles: choosing
an input that yields a predetermined result ought to be challenging. For this to hold,
the input should be chosen from a range that is as diverse as feasible.

Types of Hash functions

Many hash functions use alphanumeric or numeric keys. The main hash functions cover -

 Division Method.
 Mid Square Method.
 Folding Method.
 Multiplication Method.
Let's examine these methods in more detail.

1. Division Method

The division method is the simplest and easiest method used to generate a hash value. In
this hash function, the key k is divided by M and the remainder is used as the hash value.

Formula - h(k) = k mod M

(where k = key value and M = the size of the hash table)

Advantages -

 This method works well for any value of M


 The division approach is extremely quick because it only calls for one operation.
Disadvantages -

 This may lead to poor performance as consecutive keys are mapped to consecutive
hash values in the hash table
 There are situations when choosing the value of M requires particular caution.
Example -

 k = 1320
 M = 11
 h(1320) = 1320 mod 11 = 0
2. Mid Square Method

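The steps involved in computing this hash method include the following -

1. Square the value of k (that is, compute k × k).
2. Extract the middle r digits of the result as the hash value.

Formula - h(k) = middle r digits of (k × k)

(where k = key value and r = number of digits extracted, chosen according to the table size)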

Advantages -

 Since most or all of the key value's digits contribute to the outcome, this strategy
performs well. The middle digits of the squared result are produced by a process in
which all of the essential digits participate.
 The top or bottom digits of the original key value do not predominate in the outcome.
Disadvantages -

 One of this method's constraints is the size of the key; if the key is large, its square
will have twice as many digits.
 Chance of repeated collisions.
Example -

Let's take a hash table with 200 memory locations and r = 2, chosen according to the size of the
table.

 k = 50
 k × k = 50 × 50 = 2500
 The middle 2 digits of 2500 are 50
 Thus, h(50) = 50
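
The mid-square steps can be sketched as below. The function name mid_square_hash is an assumption, and so is the rule for picking the "middle" when the square has an odd number of spare digits (this sketch leans to the left); the notes do not specify that case.

```python
def mid_square_hash(k, r):
    """Mid-square method: square the key and keep the middle r digits."""
    squared = str(k * k)
    start = (len(squared) - r) // 2   # leans left when the middle is ambiguous (assumption)
    return int(squared[start:start + r])


print(mid_square_hash(50, 2))  # 50, because 50 * 50 = 2500 and the middle two digits are "50"
```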
3. Folding Method

There are two steps in this method -

1. Divide the key value k into a number of parts, k1, k2, k3, ..., kn, each having the same
number of digits except the final part, which may have fewer digits than the others.
2. Add the parts together. The last carry, if any, is disregarded to determine the
hash value.

Formula - k = k1, k2, k3, k4, ..., kn

s = k1 + k2 + k3 + k4 + ... + kn

h(k) = s

(where s = the sum of the parts of key k)

Advantages -

 Breaks up the key value into precise equal-sized segments for an easy hash value
 Independent of distribution in a hash table
Disadvantages -

 Sometimes inefficient if there are too many collisions


Example -

 k = 54321
 k1 = 54 ; k2 = 32 ; k3 = 1
 s = k1 + k2 + k3 = 54 + 32 + 1 = 87
 Thus, h(54321) = 87
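
The folding steps can be sketched as follows. The function name folding_hash and the part_len parameter are assumptions made for the example; dropping the final carry is implemented here by keeping only the last part_len digits of the sum.

```python
def folding_hash(k, part_len):
    """Folding method: split the key's digits into parts, add them, drop the final carry."""
    digits = str(k)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    total = sum(parts)
    return total % (10 ** part_len)   # discards a carry beyond part_len digits


print(folding_hash(54321, 2))  # 87  (54 + 32 + 1)
print(folding_hash(9876, 2))   # 74  (98 + 76 = 174, carry dropped)
```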
4. Multiplication Method

Steps to follow -

1. Pick a constant value A (where 0 < A < 1).
2. Multiply the key value k by A.
3. Take the fractional part of kA.
4. Multiply the result of the previous step by the size of the hash table, M, and take the
floor of the product.

Formula - h(k) = floor (M (kA mod 1))

(where M = size of the hash table, k = key value and A = constant value)

Advantages -

 It can be applied with any value of A between 0 and 1, although some values tend to
produce better results than others.

Disadvantages -

 The multiplication technique is mostly suitable when the table size is a power of two,
because that is when the index can be computed from the key quickly.

Example -

 k = 1234
 A = 0.35784
 M = 100
 h(1234) = floor [ 100 (1234 × 0.35784 mod 1) ]
 = floor [ 100 (441.57456 mod 1) ]
 = floor [ 100 (0.57456) ]
 = floor [ 57.456 ]
 = 57
 Thus, h(1234) = 57
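
The same calculation can be sketched in Python. The function name multiplication_hash is chosen for illustration, and floating-point arithmetic is assumed to be accurate enough for the table sizes involved.

```python
import math


def multiplication_hash(k, a, m):
    """Multiplication method: h(k) = floor(m * (k * a mod 1)), with 0 < a < 1."""
    fractional = (k * a) % 1          # fractional part of k * a
    return math.floor(m * fractional)


print(multiplication_hash(1234, 0.35784, 100))  # 57
```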
Choosing a good hash function

 An effective hash function should distribute the index values of added items
evenly across the table.
 It should be quick and easy to compute for the requirements at hand.
 An approach for resolving collisions is essential, since the hash function may
generate an index for a key that corresponds to an already occupied spot.

Hashing - Open Addressing for Collision Handling

So far we have seen that:

o Hashing is a well-known search method.
o When the new key's hash value matches an already-occupied bucket in the hash
table, there is a collision.

What is a Hash Collision?

o A hash collision happens when a hash algorithm produces the same hash value for two
different input values. It is important to point out that collisions are not a defect;
they are a fundamental aspect of hashing algorithms.
o Collisions occur because hashing techniques convert every input into a fixed-length
code, regardless of its length. Since there is an endless number of possible inputs and a
limited number of outputs, a hashing algorithm will eventually produce repeated hashes.

Open Addressing for Collision Handling

Like separate chaining, open addressing is a technique for dealing with
collisions. In open addressing, all of the elements are stored in the hash table itself. The
size of the table must therefore always be greater than or equal to the total number of
keys (note that we can increase the table size by copying the old data if needed).
This strategy is often referred to as closed hashing. The foundation of the entire
process is probing. Several forms of probing are discussed later.

o Insert (k): Keep probing until an empty slot is found. Put k in the first empty slot you
find.
o Search (k): Keep probing until either the slot's key equals k or an empty slot is
reached.
o Delete (k): Deletion is the interesting operation. If we simply empty the slot of a removed
key, a later search may stop too early and fail. Therefore, the slots of deleted keys are
specially marked as "deleted."

An item can be inserted into a slot marked "deleted", but a search does not stop at
such a slot:

o The "deleted" buckets are handled the same as any other empty buckets during
insertion.
o When searching, the search does not stop when it comes across a "deleted" bucket.
o The search comes to an end only when the required key or an empty bucket is
found.
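
The three operations and the "deleted" marker can be sketched as a small Python class. This is an illustration under stated assumptions: the class name OpenAddressingTable, the EMPTY and DELETED sentinels, and the use of linear probing as the probe sequence are all choices made for the sketch, and insert does not check for duplicate keys.

```python
EMPTY = object()     # slot that has never held a key
DELETED = object()   # slot whose key was removed (the "deleted" marker)


class OpenAddressingTable:
    def __init__(self, size):
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield slot indices to try, one step at a time (linear probing assumed)."""
        start = hash(key) % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def insert(self, key):
        for idx in self._probe(key):
            # "deleted" slots are reused exactly like empty ones.
            if self.slots[idx] is EMPTY or self.slots[idx] is DELETED:
                self.slots[idx] = key
                return idx
        raise RuntimeError("table is full")

    def search(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                return None            # a truly empty slot ends the search
            if slot is not DELETED and slot == key:
                return idx             # found; "deleted" slots are skipped over
        return None

    def delete(self, key):
        idx = self.search(key)
        if idx is not None:
            self.slots[idx] = DELETED  # mark the slot instead of emptying it
        return idx


t = OpenAddressingTable(7)
for key in (50, 85, 92):       # all three hash to slot 1
    t.insert(key)
t.delete(85)
print(t.search(92))            # 3: the search passes over the "deleted" slot 2
```

Had the slot of 85 simply been emptied, the search for 92 would have stopped at slot 2 and reported the key as missing.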

Open Addressing

In open addressing:

o All the keys are kept inside the hash table, unlike separate chaining.
o The hash table contains only the key information.

The following techniques are used for open addressing:

o Linear Probing
o Quadratic Probing
o Double Hashing

(a) Linear probing

In linear probing, the hash table is searched sequentially, beginning at the original
hash location. If the location we get is already occupied, we check the next one.

The rehashing function is as follows: rehash(key) = (n + 1) % table-size. As may be seen
in the sample below, the usual gap between two probes is 1.

Let S be the size of the table and let hash(x) be the slot index calculated using a hash
function.

1. If slot hash(x) % S is full, then we try (hash(x) + 1) % S
2. If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
3. If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
4. And so on, until an empty slot is found.

Let's use "key mod 7" as a simple hash function with the following keys: 50, 700, 76,
85, 92, 73, 101.
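
Inserting these keys with linear probing can be traced with the short sketch below. The function name linear_probe_insert and the use of None for an empty slot are assumptions made for the illustration.

```python
def linear_probe_insert(table, key):
    """Insert key using hash = key mod table size, stepping one slot at a time."""
    size = len(table)
    home = key % size
    for i in range(size):
        idx = (home + i) % size
        if table[idx] is None:     # first free slot along the probe sequence
            table[idx] = key
            return idx
    raise RuntimeError("table is full")


table = [None] * 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    linear_probe_insert(table, key)

print(table)  # [700, 50, 85, 92, 73, 101, 76]
```

Keys 50, 85 and 92 all hash to slot 1, so 85 and 92 slide to slots 2 and 3; 73 and 101 hash to slot 3 and are pushed on to slots 4 and 5. The run of occupied slots from 0 to 6 is an example of the clustering discussed next.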

Linear probing problems:

o Primary Clustering: Primary clustering is one of the issues with linear probing. Many
successive items form clusters, making it difficult to locate a free slot or to search for
an element.
o Secondary Clustering: Secondary clustering is less severe, and two records can only
share a collision chain (also known as a probe sequence) if they start out in the same
location.

Advantage-

o It is simple to calculate.

Disadvantage-

o Clustering is the fundamental issue with linear probing.
o Many adjacent elements join together to form groups.
o As a result, searching for an element or an empty bucket takes more time.

Time Complexity:

The worst-case time to search for an element in linear probing is O(table size). This can happen

o even if only one element is present and all other elements have been deleted,
o because the "deleted" markers in the hash table then force a search of the full table.

(b) Quadratic probing

In this technique the interval between probes grows with each attempt. The clustering
issue discussed above can be reduced with the aid of the quadratic probing technique.
The mid-square method is sometimes used as another name for this approach. In the i-th
iteration we look at the slot i² positions away from the original hash location.
We always begin at the original hash location, and we check the other slots only if
that location is taken.

1. Let hash(x) be the slot index computed using the hash function, and let S be the table size.
2. If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
3. If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
4. If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
5. And so on, until an empty slot is found.
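
The quadratic probe sequence can be sketched as follows. The function name quadratic_probe_insert is an assumption, and the keys and "key mod 7" hash are reused from the linear probing example purely for comparison.

```python
def quadratic_probe_insert(table, key):
    """Insert key, trying slots (hash + i*i) % S for i = 0, 1, 2, ..."""
    size = len(table)
    home = key % size
    for i in range(size):
        idx = (home + i * i) % size
        if table[idx] is None:
            table[idx] = key
            return idx
    # Quadratic probing may miss free slots, so this can happen before the table is full.
    raise RuntimeError("no free slot found along the probe sequence")


table = [None] * 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    quadratic_probe_insert(table, key)

print(table)  # [700, 50, 85, 73, 101, 92, 76]
```

Compared with linear probing, key 92 jumps from slot 1 to slot 5 (1 + 2*2) instead of sliding to the next free neighbour, which is how the growing interval breaks up clusters.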

(c) Double hashing

A second hash function calculates the gaps between the probes. Clustering
is greatly reduced by the use of double hashing, because the increments for the probing
sequence come from a different hash function. In the i-th iteration we look at the slot
i*hash2(x) positions away from the original hash location, where hash2(x) is the second
hash function.

1. Let hash(x) be the slot index computed using the hash function, and let S be the table size.
2. If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x)) % S
3. If (hash(x) + 1*hash2(x)) % S is also full, then we try (hash(x) + 2*hash2(x)) % S
4. If (hash(x) + 2*hash2(x)) % S is also full, then we try (hash(x) + 3*hash2(x)) % S
5. And so on, until an empty slot is found.
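
Double hashing can be sketched in the same style. The second hash function used here, hash2(x) = 5 - (x mod 5), is a hypothetical choice made for the example (it never returns 0); the notes do not prescribe one. The keys and table size are again borrowed from the linear probing example.

```python
def hash2(key):
    """Hypothetical second hash function: a step size from 1 to 5, never 0."""
    return 5 - (key % 5)


def double_hash_insert(table, key):
    """Insert key, trying slots (hash + i * hash2) % S for i = 0, 1, 2, ..."""
    size = len(table)
    home = key % size
    step = hash2(key)
    for i in range(size):
        idx = (home + i * step) % size
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no free slot found along the probe sequence")


table = [None] * 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    double_hash_insert(table, key)

print(table)  # [700, 50, 101, 92, 85, 73, 76]
```

Keys 85 and 92 both start at slot 1 but use different step sizes (5 and 3), so they follow different probe sequences. A prime table size such as 7 lets every non-zero step reach all slots.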

Comparing the three techniques:

o The best cache performance is provided by linear probing, although clustering is a
problem. Linear probing also has the benefit of being simple to compute.
o Quadratic probing lies between the other two in terms of clustering and cache performance.
o Double hashing avoids clustering, but its cache performance is poor. Because two hash
functions must be computed, double hashing also takes longer to compute.

Separate Chaining compared with Open Addressing

1. Separate Chaining: Chaining is easier to put into practice.
   Open Addressing: Open addressing requires more computation.

2. Separate Chaining: Hash tables never run out of space when chaining, since we can always add new elements.
   Open Addressing: The table may fill up.

3. Separate Chaining: Chaining is less sensitive to the load factor or the hash function.
   Open Addressing: Open addressing calls for extra care to avoid clustering and a high load factor.

4. Separate Chaining: Chaining is typically used when it is unclear how many keys might be added or removed, or how frequently.
   Open Addressing: Open addressing is employed when the frequency and quantity of keys are known.

5. Separate Chaining: Chaining's cache performance is poor since keys are stored in linked lists.
   Open Addressing: Since everything is stored in the same table, open addressing gives better cache performance.

6. Separate Chaining: Space is wasted (some parts of the hash table are never used).
   Open Addressing: A slot can be used even if no input maps to it.

7. Separate Chaining: Chaining requires additional space for links.
   Open Addressing: There are no links in open addressing.

Because we traverse a linked list by essentially jumping from one node to the next
throughout the computer's memory, chaining's cache efficiency is poor. The CPU cannot
cache nodes that have not been visited yet, which hurts performance. With open
addressing, however, the data is not dispersed, so the CPU can cache it for speedy
access when it notices that a particular area of memory is frequently accessed.

Performance of Open Addressing: As with chaining, the performance of hashing
can be assessed under the assumption that each key is equally likely to be hashed to
any slot of the table (simple uniform hashing).

m = number of slots in the hash table
n = number of keys to be inserted in the hash table
Load factor α = n/m ( < 1 )
Expected time to search/insert/delete < 1 / (1 - α)

So search, insert and delete take O(1 / (1 - α)) time.
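
The effect of the load factor on this bound can be seen with a few numbers. The sketch below simply evaluates 1 / (1 - α) for a table of 100 slots; the function name expected_probes_bound and the sample sizes are assumptions made for the illustration.

```python
def expected_probes_bound(n, m):
    """Upper bound 1 / (1 - alpha) on expected probes, where alpha = n / m < 1."""
    alpha = n / m
    if not 0 <= alpha < 1:
        raise ValueError("load factor must satisfy 0 <= alpha < 1")
    return 1 / (1 - alpha)


for n in (25, 50, 75, 90):
    print(n, round(expected_probes_bound(n, 100), 2))
# 25 keys in 100 slots gives about 1.33 probes, 50 gives 2.0, 75 gives 4.0, 90 gives about 10
```

The bound grows slowly while the table is sparsely filled and sharply as α approaches 1, which is why open-addressed tables are usually resized before they get close to full.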

Load Factor (α)-

Load factor (α) is defined as the number of keys stored in the hash table divided by the
number of slots in the table, α = n / m.

The load factor value in open addressing is always between 0 and 1. This is because

o In open addressing, the hash table contains all of the keys.
o As a result, the table's size is always greater than or at least equal to the number of keys
it stores.

Conclusions-

o The best cache performance is achieved via linear probing, although clustering is a
problem.
o Quadratic probing lies between the other two in terms of clustering and cache performance.
o Although clustering is absent, double hashing has poor cache performance.
