File Organization Methods
File organization refers to the way data is stored in a file. File organization is very important because
it determines the methods of access, the efficiency, the flexibility and the storage devices to use. There are four
methods of organizing files on a storage medium. These include:
sequential,
random,
serial and
indexed-sequential
Sequential file organization
Records are stored and accessed in a particular order, sorted using a key field.
In its simplest form, retrieval requires searching sequentially through the file, record by record,
until the record is found or the end of the file is reached.
Because the records in a file are sorted in a particular order, better file searching methods like
the binary search technique can be used to reduce the time used for searching a file.
Since the records are sorted, it is possible to know in which half of the file a particular record
being searched for is located. Hence this method repeatedly divides the set of records in the file
into two halves and searches only the half in which the record can be found.
For example, if the file has records with key fields 20, 30, 40, 50, 60 and the computer is
searching for a record with key field 50, it starts at the middle key 40 and searches upwards,
ignoring the first half of the set.
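The halving search described above can be sketched in Python as follows. This is a minimal sketch that assumes the sorted key fields are held in a list in memory; the function name binary_search is illustrative only.

```python
def binary_search(keys, target):
    """Return the position of target in the sorted list keys, or -1 if absent."""
    low, high = 0, len(keys) - 1
    while low <= high:
        mid = (low + high) // 2
        if keys[mid] == target:
            return mid
        if keys[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return -1


keys = [20, 30, 40, 50, 60]
position = binary_search(keys, 50)  # first compares with 40, then with 50
```

Each comparison discards half of the remaining records, so the number of records examined grows with the logarithm of the file size rather than with the file size itself.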
Disadvantages -
The sorting does not remove the need to access other records as the search looks for
particular records.
Sequential files cannot support modern technologies that require fast access to stored
records.
The requirement that all records be of the same size is sometimes difficult to enforce.
Random file organization
Magnetic and optical disks allow data to be stored and accessed randomly.
Serial file organization
Advantages -
It is simple.
It is cheap.
Disadvantages -
It is cumbersome to access because you have to access all preceding records before
retrieving the one being searched for.
Space on the medium is wasted in the form of inter-record gaps.
It cannot support modern high-speed requirements for quick record access.
Indexed-sequential file organization
This method is almost similar to the sequential method, except that an index is used to enable the
computer to locate individual records on the storage media. For example, on a magnetic drum, records
are stored sequentially on the tracks. However, each record is assigned an index that can be used to
access it directly.
Indexing
Indexing is a data structure technique that helps to speed up data retrieval. Because it lets us quickly
locate and access data in a database, it is a must-know technique for database optimization.
Indexing minimizes the number of disk accesses required when a query is processed.
An index is built from two columns.
The first column is the search key. It contains a copy of the primary key or candidate key of
the table. The values of this column may be sorted or not, but if they are sorted,
the corresponding data can be accessed more easily.
The second column is the data reference or pointer. It contains the address of the disk
block where the corresponding key value can be found.
Types of indexing
Primary Indexing
Primary indexing only has two columns. The first column has the primary key values, which are the
search keys. The second column has the pointers, which contain the address of the data block
corresponding to each search key value. The table should be ordered, and there is a one-to-one
relationship between the records in the index file and the data blocks. This is a more traditional yet
fast mechanism.
Dense Index
There is an index record that contains a search key and a pointer for every search key value in the
data file. Though the dense index is fast, it requires more memory to store an index record for each
key value.
Sparse Index
Index records exist for only some of the search key values. To find a record, we locate the index
record with the largest search key that is not greater than the key being sought, follow its pointer
into the data file, and search sequentially from there until the actual location of the search key
value is found. Though a sparse index lookup is more time-consuming, it requires less memory
to store index records as it has fewer of them.
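The difference between the two index types can be sketched in Python. This is only an illustration under stated assumptions: the data file is modelled as a hypothetical list of sorted blocks held in memory, the sparse index keeps the first key of each block, and all names (data_blocks, dense_lookup, sparse_lookup) are invented for the example.

```python
from bisect import bisect_right

# Hypothetical data file: blocks of (search key, record) pairs, sorted by key.
data_blocks = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H"), (90, "I")],
]

# Dense index: one entry per search key -> (block number, offset in block).
dense_index = {
    key: (b, i)
    for b, block in enumerate(data_blocks)
    for i, (key, _) in enumerate(block)
}

# Sparse index: only the first key of each block; position = block number.
sparse_keys = [block[0][0] for block in data_blocks]


def dense_lookup(key):
    """Follow the pointer stored for this exact key."""
    if key not in dense_index:
        return None
    b, i = dense_index[key]
    return data_blocks[b][i][1]


def sparse_lookup(key):
    """Find the last index entry <= key, then scan that block sequentially."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, record in data_blocks[b]:
        if k == key:
            return record
    return None
```

In this sketch the dense index holds nine entries and the sparse index only three, at the cost of a short sequential scan inside the chosen block.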
Secondary Indexing
The columns in secondary indexing hold the values of the candidate key along with the respective
pointer, which has the address of the location of the values. The index and the data are linked to
each other through an intermediate node.
Clustered Indexing
The table is ordered in clustered indexing. When the index is created on non-primary key columns,
we combine two or more columns to get values that identify the data uniquely and use them
to create the index.
Multilevel Indexing
If the primary index does not fit in memory, multilevel indexing is used. As the database
grows, its indices grow as well. A single-level index can be too big to store in main
memory. In multilevel indexing, the index is broken down into smaller blocks that can be
stored in main memory.
B+ Tree Indexing
B-Tree Indexing
Hashing
A hash function produces a hash value that can be used to validate data, such as an imported file,
and to locate an item quickly by its hash key. It improves search efficiency and retrieval
effectiveness. Hashing is an important tool to have in your arsenal when constructing data structures
and can be used in many different ways.
What is Hashing?
Hashing involves changing one value into another based on a specified key or string of characters.
The original string is often represented with a smaller, fixed-length value or key, which makes it
simpler to locate or use.
Implementing hash tables is the most popular use of hashing. A hash table stores key and value pairs
in a list that is accessed by index. Because the number of possible keys is unlimited while the table
has a fixed size, the hash function maps each key to an index within the table size. That index is the
hash value of the key.
A hash value, or simply a hash, is a value generated by a mathematical hashing algorithm. A good
cryptographic hash uses a one-way hashing algorithm to prevent the hash from being converted back
into its original key.
1. Data Retrieval
Using algorithms or functions, hashing converts object data into a useful integer value. A hash can
then be used to narrow down searches when locating these items on an object data map.
For instance, developers store data, such as a customer record, as key and value pairs in hash tables.
Keys identify the data and are the input to the hash function, which maps them to hash codes,
that is, integers of a predetermined size.
2. Digital Signatures
In addition to allowing quick data retrieval, hashing supports the creation and verification of digital
signatures, which are used to authenticate message senders. In this case, a hash function first reduces
the message to a hashed value, or message digest. The sender signs the digest, and the signature is
transmitted to the recipient along with the message.
When the message is received, the same hash function is applied to it to create a message
digest, which is then compared to the digest recovered from the signature to make sure they are
identical. Because hashing is a one-way operation, the digest identifies the message without
allowing the message to be reconstructed from it.
Hashing in data structures is a method of breaking up a huge amount of data into smaller
tables. A hash function, also known as a message digest function, provides a way of distinguishing
one object from a group of similar ones. By condensing the original input strings and data assets to
short alphanumeric hash keys, developers and programmers can save both time and storage space.
Hashing helps to narrow down a search for a particular item on a data map. Hash codes create an index
to hold values in this scenario. Hashing is therefore employed to index and retrieve data from a
database since it speeds up the process; it is much simpler to find an item using a smaller hashed key
than using its original value.
Hash tables store data in an array format. Each value within the array has its own index
number. Hash tables compute the index number for each stored value using a technique called
hashing.
Everyday examples of the same idea include:
In schools - Each student in a class is assigned a unique roll number for easy identification.
The school authority later uses the unique roll number to retrieve relevant information
about the particular student.
In diagnostic centers and laboratories - Medical laboratories use unique serial numbers to
identify and distinguish patient samples and patient information.
Security purposes - When users first visit a website that requests authentication through "Sign
Up" and submit login information to access their personal accounts, precautions must be taken
to ensure that the account does not end up in the wrong hands. As a
result, the database stores the entered password as a hash rather than as plain text.
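Storing a password as a hash can be sketched with Python's standard library. The text above does not name an algorithm, so the choices here are assumptions made for illustration: PBKDF2 with SHA-256, a random 16-byte salt, and 100,000 iterations. The function names are hypothetical.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, digest); these are stored instead of the plain password."""
    if salt is None:
        salt = os.urandom(16)  # random salt so equal passwords hash differently
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def verify_password(password, salt, stored_digest, iterations=100_000):
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, stored_digest)


salt, stored = hash_password("correct horse")
```

At login the site recomputes the hash from the submitted password and compares it with the stored value, so the plain password never needs to be kept in the database.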
When it comes to data structures, hashing is a technique used to store and retrieve data in a
database. It is fundamental to many data structures, such as hash tables and hash trees. Hashing
involves mapping data to a value, called a hash code, that is ideally unique. The hash code is then
used to index into an array, where the data is stored. To retrieve the data, the hash code is simply
re-computed and used to index into the array.
Hashing is an efficient way to store and retrieve data in a data structure because it avoids the need
to compare the search key against every stored element. Duplicate values can be stored under
different keys, because only the key is hashed. Hashes are typically generated using a hashing
algorithm, which takes an input and produces a hash code.
For example, suppose we want to store the following countries (keys) and their capitals (values) -
Italy Rome
France Paris
England London
Australia Canberra
Switzerland Berne
Now, assuming that the hash function simply returns the length of the key string, the hash
table will look like this -
Position (hash = key length)   Key           Value
5                              Italy         Rome
6                              France        Paris
7                              England       London
9                              Australia     Canberra
10
11                             Switzerland   Berne
As Italy's hash code (length) is 5, we placed Italy in the 5th position in the Keys array and
Rome at the 5th index of the Values array, and so on.
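The country and capital example above can be reproduced with a few lines of Python. This is a toy sketch: the key length is used as the hash purely because the example does so, and the names length_hash and lookup are illustrative.

```python
pairs = [
    ("Italy", "Rome"),
    ("France", "Paris"),
    ("England", "London"),
    ("Australia", "Canberra"),
    ("Switzerland", "Berne"),
]


def length_hash(key):
    """Toy hash function from the example: the hash is the key's length."""
    return len(key)


# Positions 0..11 are enough for the longest key ("Switzerland", 11 letters).
table = [None] * 12
for key, value in pairs:
    table[length_hash(key)] = (key, value)


def lookup(key):
    """Recompute the hash and read the slot directly, without scanning."""
    entry = table[length_hash(key)]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None
```

A key length is a poor hash in practice: two keys of the same length, such as "Italy" and "Spain", would land in the same position.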
A hash key is the raw data that has to be hashed in a hash table. The hashing algorithm
applies a function that translates the hash key into the hash value. The outcome of feeding
the hash key through the hashing algorithm is what is known as the hash value.
Public key - A public key is one half of an 'asymmetric' key pair and is used to
encrypt data or to verify signatures. It can be shared openly. Asymmetric mechanisms are
relatively slower than symmetric ones. The common uses of public keys include cryptography, the
transfer of bitcoins, and securing online sessions. A public key always functions
along with a matching private key, and thus the overall security is not compromised.
Private key - In symmetric cryptography, a single secret key is employed in both encryption
and decryption, and each party that sends or receives the encrypted information shares that
key. In asymmetric cryptography, by contrast, the private key is kept by its owner only and is
paired with a public key. A private key is typically a long, impossible-to-guess string of bits
generated randomly or pseudo-randomly.
SSH public key - Secure Shell (SSH) uses both a public and a private key. The key pair
is used to authenticate and secure communication with a remote machine.
Both the remote servers and the users may have access to the public key.
What are Hash Function and Hash Table?
What is a Hash Table?
The hash table is an associative data structure that stores data as key and value pairs. An array
format is used to store hash table data, where each value is assigned its own index. By
knowing the index of the desired data, we can access it very quickly.
As a result, inserting and searching data is very fast, regardless of data size. In a hash
table, elements are stored in an array, and the index of each element is generated using a
hashing technique.
What is a Hash Function?
A hash function transforms a key into a hash value through a fixed process. Hash
values are length-restricted values that are derived from a key. The hash value is usually
shorter than the original, yet it still depends on the entire original sequence of characters.
When a digitally signed message is transferred, the recipient
receives both the message and the digital signature, which carries the sender's hash value. The
receiver generates a hash value from the message using the same hash algorithm and compares
it with the one received. The message has arrived without error if the hash values match
exactly.
Collision free - A hash function is considered collision-free when it has two key features.
First, the function should map equally likely inputs evenly across all feasible
outputs. Second, the function must be deterministic, guaranteeing that it will always
yield the same result when given the same input. With a collision-free hash function it is
infeasible in practice to find two different inputs that map to the same output hash.
Hiding property - A hash function maps data of arbitrary size to
fixed-size data. One key characteristic of a good hash function is that it should be hard
to guess the input value from its output. The output should also be evenly
distributed across all possible values, so that the outputs are spread throughout the range of
possible values and reveal nothing about the input.
Puzzle friendly - A hash function ought to be puzzle friendly. Choosing
an input that yields a predetermined result ought to be difficult. As a result, it is
best to choose the input from a range that is as diverse as feasible.
Types of Hash functions
Many hash functions use alphanumeric or numeric keys. The main hash functions cover -
Division Method.
Mid Square Method.
Folding Method.
Multiplication Method.
Let's examine these methods in more detail.
1. Division Method
The division method is the simplest and easiest method used to generate a hash value. In
this hash function, the key value k is divided by M and the remainder is used as the hash value.
Formula - h(k) = k mod M
(Where, k = key value and M = size of the hash table)
Disadvantages -
This may lead to poor performance as consecutive keys are mapped to consecutive
hash values in the hash table.
There are situations when choosing the value of M requires particular caution.
Example -
k = 1320
M = 11
h(1320) = 1320 mod 11
= 0
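The division method maps directly onto the remainder operator. The sketch below reproduces the worked example; the function name division_hash is illustrative.

```python
def division_hash(k, m):
    """Division method: the hash is the remainder of k divided by m."""
    return k % m


h = division_hash(1320, 11)

# Consecutive keys land in consecutive slots, which is the weakness noted above.
consecutive = [division_hash(k, 11) for k in (1320, 1321, 1322)]
```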
2. Mid Square Method
The steps involved in computing this hash method include the following -
1. Square the key value k, that is, compute k x k.
2. Take the middle r digits of the result as the hash value.
Advantages -
Since most or all of the key value's digits contribute to the outcome, this strategy
performs well. The middle digits of the squared result are produced by a process in
which all of the key's digits participate.
The top or bottom digits of the original key value do not dominate the outcome.
Disadvantages -
One of this method's constraints is the size of the key; if the key is large, its square
will have twice as many digits.
Chance of repeated collisions.
Example -
Let's take the hash table with 200 memory locations and r = 2, as decided by the size of the
mapping in the table.
k = 50
Therefore,
k x k = 50 x 50
= 2500
Thus, taking the middle two digits of 2500,
h(50) = 50
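The worked example above squares the key and keeps the middle r digits of the result. A minimal sketch follows; it assumes that when the middle cannot be centred exactly, the window is shifted one digit to the left, a tie-breaking choice the text does not specify.

```python
def mid_square_hash(k, r):
    """Mid-square method: square k and keep the middle r digits."""
    squared = str(k * k)
    # Start of the middle window (assumed left-biased when not exactly centred).
    start = max((len(squared) - r) // 2, 0)
    return int(squared[start:start + r])


h = mid_square_hash(50, 2)  # 50 x 50 = 2500, middle two digits "50"
```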
3. Folding Method
1. Divide the key value k into a number of parts k1, k2,
k3, ..., kn, each having the same number of digits except for the final part,
which may have fewer digits than the others.
2. Add the parts together. The final carry, if any, is discarded to obtain the
hash value.
Formula - k = k1, k2, k3, k4, ..., kn
s = k1 + k2 + k3 + k4 + ... + kn
h(k) = s
Advantages -
Breaks up the key value into precise equal-sized segments for an easy hash value
Independent of distribution in a hash table
Disadvantages -
Example -
k = 54321
k1 = 54 ; k2 = 32 ; k3 = 1
Therefore,
s = k1 + k2 + k3
= 54 + 32+ 1
= 87
Thus,
h (k) = 87
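The two folding steps can be sketched as below. The part length of two digits follows the worked example; treating "disregard the last carry" as keeping only the lowest part_len digits of the sum is an interpretation, and the function name is illustrative.

```python
def folding_hash(k, part_len=2):
    """Folding method: split the key's digits into parts and add them."""
    digits = str(k)
    # Parts of part_len digits each; the final part may be shorter.
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    s = sum(parts)
    # Discard any carry beyond part_len digits.
    return s % (10 ** part_len)


h = folding_hash(54321)  # 54 + 32 + 1
```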
4. Multiplication Method
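Steps to follow -
1. Pick a constant value A such that 0 < A < 1.
2. Multiply the key value k by A.
3. Take the fractional part of kA.
4. Multiply the result by M and take the floor to obtain the hash value.
Formula - h(k) = floor [ M (kA mod 1) ]
(Where, M = size of the hash table, k = key value and A = constant value)
Advantages -
It works with any constant A between 0 and 1, although some values tend to
produce better results than others.
Disadvantages -
The multiplication technique is typically best suited to table sizes that are a power of
two, since multiplication hashing then makes it possible to compute the index from the
key quickly.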
Example -
k = 1234
A = 0.35784
M = 100
So,
h(1234) = floor [ 100 (1234 x 0.35784 mod 1) ]
= floor [ 100 (441.57456 mod 1) ]
= floor [ 100 (0.57456) ]
= floor [ 57.456 ]
= 57
Thus,
h(1234) = 57
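The calculation above translates into a short function. This sketch uses ordinary floating-point arithmetic, which is accurate enough for the example values; the function name is illustrative.

```python
import math


def multiplication_hash(k, a, m):
    """Multiplication method: floor(m * fractional part of k * a)."""
    fractional = (k * a) % 1.0  # keep only the fractional part of k * a
    return math.floor(m * fractional)


h = multiplication_hash(1234, 0.35784, 100)
```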
Choosing a good hash function
Creating an effective hash function that distributes the index values of added items
evenly across the table is important.
It should be quick and easy to compute according to the requirements.
An approach to successfully resolving collisions in hash tables is essential for
handling a key whose hash index corresponds to a slot that is already occupied.
o A hash collision happens when the same hash value is produced for two different
input values by a hash algorithm. But it's important to point out that collisions
are not a flaw; they're a fundamental aspect of hashing algorithms.
o Collisions occur because hashing techniques in data structures convert
every input into a fixed-length code, regardless of its length. Since there are an
endless number of inputs and a limited number of outputs, the hashing algorithms
will eventually produce repeating hashes.
An item can be inserted into a deleted slot, but a search does not stop at a slot that
has been deleted.
o The "deleted" buckets are handled the same as any other empty buckets during
insertion.
o When searching, the search does not stop when it comes across a "deleted" bucket.
o Only when the required key or an empty bucket is found does the search
come to an end.
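The rules above for "deleted" buckets can be sketched as follows. The sketch assumes integer keys, "key mod table size" as the hash function, and stepping to the next slot on a collision; the class and marker names are hypothetical, and duplicate keys are not checked for.

```python
EMPTY = object()    # slot has never held a key
DELETED = object()  # slot held a key that was later removed


class ProbingTable:
    def __init__(self, size):
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield slot indexes starting at the hashed slot, wrapping around."""
        start = key % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def insert(self, key):
        for idx in self._probe(key):
            # A "deleted" slot is reused exactly like an empty one.
            if self.slots[idx] is EMPTY or self.slots[idx] is DELETED:
                self.slots[idx] = key
                return idx
        raise RuntimeError("table is full")

    def search(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY:
                return None  # a truly empty slot ends the search
            if self.slots[idx] is not DELETED and self.slots[idx] == key:
                return idx
            # "deleted" slots are skipped and the search continues
        return None

    def delete(self, key):
        idx = self.search(key)
        if idx is not None:
            self.slots[idx] = DELETED  # mark the slot, do not empty it
        return idx


table = ProbingTable(7)
for key in (50, 85, 92):  # all three hash to slot 1
    table.insert(key)
table.delete(85)          # slot 2 becomes "deleted"
```

After the deletion, searching for 92 still succeeds because the search passes over the "deleted" slot instead of stopping there.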
Open Addressing
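Open addressing is when
o All the keys are kept inside the hash table, unlike separate chaining.
o The hash table itself contains all the key information.
Open addressing can be carried out in the following ways -
o Linear Probing
o Quadratic Probing
o Double Hashing
a) Linear Probing
In linear probing, the table is searched sequentially, starting from the hashed slot, until a
free slot is found.
Let's use "key mod 7" as a simple hash function with the following keys: 50, 700, 76,
85, 92, 73, 101.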
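The keys above can be inserted with linear probing, that is, by stepping to the next slot whenever the hashed slot is already taken. A minimal sketch, assuming a table of 7 slots to match "key mod 7" (the function name is illustrative):

```python
def linear_probing_insert(table, key):
    """Insert key at key mod size, stepping forward one slot on each collision."""
    size = len(table)
    home = key % size
    for i in range(size):
        idx = (home + i) % size
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("table is full")


table = [None] * 7
for key in (50, 700, 76, 85, 92, 73, 101):
    linear_probing_insert(table, key)
```

The keys 50, 85 and 92 all hash to slot 1 and end up in slots 1, 2 and 3, which then pushes 73 and 101 (both hashing to slot 3) further along. This is the cluster formation that linear probing is prone to.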
o Primary Clustering: Primary clustering is one of the issues with linear probing. Many
successive items form clusters, making it difficult to locate a free slot or to search for
an element.
o Secondary Clustering: Secondary clustering is less severe, and two records can only
share a collision chain (also known as a probe sequence) if they start out in the same
location.
Advantage-
Disadvantage-
Time Complexity:
The worst-case time to search for an element in linear probing is O(table size). This can
happen even if only one element is present and all other elements are absent, because
the hash table's "deleted" markers then force a full table search.
b) Quadratic Probing
In quadratic probing, the i-th probe looks i*i slots away from the original slot. S is the size
of the hash table.
1. Let hash (x) be the slot index computed using the hash function.
2. If slot hash (x) % S is full, then we try ( hash (x) + 1*1 ) % S
3. If ( hash (x) + 1*1 ) % S is also full, then we try ( hash (x) + 2*2 ) % S
4. If ( hash (x) + 2*2 ) % S is also full, then we try ( hash (x) + 3*3 ) % S
5. This continues with ( hash (x) + i*i ) % S for i = 4, 5, and so on, until a free slot is found.
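The probing steps above can be sketched in Python. The table size of 7, the hash function "key mod 7" and the sample keys are assumptions chosen for illustration, and the function gives up after S probes because quadratic probing is not guaranteed to visit every slot.

```python
def quadratic_probing_insert(table, key):
    """Try slots (hash + i*i) % S for i = 0, 1, 2, ... until one is free."""
    size = len(table)
    home = key % size
    for i in range(size):
        idx = (home + i * i) % size
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no free slot found on the probe sequence")


table = [None] * 7
for key in (50, 700, 76, 85, 92):
    quadratic_probing_insert(table, key)
```

Here 85 collides at slot 1 and moves to slot 2, while 92 collides at slots 1 and 2 and then jumps to slot (1 + 2*2) % 7 = 5.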
c) Double Hashing
A second hash function calculates the gaps between the probes. Clustering
is greatly reduced by the use of double hashing. This method uses a different hash
function to generate the increments for the probing sequence. In the i-th probe we try the slot
( hash (x) + i*hash2(x) ) % S, where hash2(x) is the second hash function.
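Double hashing can be sketched as below. Everything concrete here is an assumption made for illustration: the table size of 7, hash (x) = x mod 7, the second hash hash2(x) = 5 - (x mod 5), and the common formulation in which the i-th probe is (hash (x) + i*hash2(x)) % S.

```python
def hash1(key, size):
    return key % size


def hash2(key):
    # Assumed second hash: always between 1 and 5, so the step is never zero.
    return 5 - (key % 5)


def double_hashing_insert(table, key):
    """Try slots (hash1 + i * hash2) % S for i = 0, 1, 2, ..."""
    size = len(table)
    home = hash1(key, size)
    step = hash2(key)
    for i in range(size):
        idx = (home + i * step) % size
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("no free slot found on the probe sequence")


table = [None] * 7
for key in (50, 700, 76, 85, 92):
    double_hashing_insert(table, key)
```

The keys 85 and 92 both hash to slot 1, but their different step sizes (5 and 3) send them along different probe sequences, so they do not pile up next to each other.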
Separate chaining compared with open addressing -
1. Separate chaining: Chaining is easier to put into practice.
   Open addressing: Open addressing calls for increased processing power.
2. Separate chaining: Hash tables never run out of space when chaining, since we can always add new elements.
   Open addressing: The table may fill up when addressing in open fashion.
3. Separate chaining: Chaining is less susceptible to load or the hash function.
   Open addressing: To prevent clustering and a high load factor, open addressing calls for extra caution.
4. Separate chaining: When it is unclear how many or how frequently keys might be added or removed, chaining is typically utilised.
   Open addressing: When the frequency and quantity of keys are known, open addressing is employed.
5. Separate chaining: Chaining's cache performance is poor since keys are stored in linked lists.
   Open addressing: Since everything is stored in the same table, open addressing improves cache speed.
6. Separate chaining: Space is wasted (some parts of the hash table in chaining are never used).
   Open addressing: A slot can be used in open addressing even if an input doesn't map to it.
7. Separate chaining: Chaining requires additional room for links.
   Open addressing: Links are absent in open addressing.
Because we traverse a linked list by essentially jumping from one node to the next
throughout the computer's memory, chaining's cache efficiency is poor. The CPU
cannot usefully cache nodes that have not been visited yet, which hurts performance.
With open addressing, however, the data is not dispersed, so the CPU can
cache information for speedy access if it notices that a particular area of memory is
frequently accessed.
The load factor value in open addressing is always between 0 and 1. This is because all keys
are stored inside the table itself, so the number of keys can never exceed the number of slots.
Conclusions -
o The best cache performance is achieved via linear probing, although clustering is a
problem.
o Quadratic probing lies between the two in terms of clustering and cache performance.
o Although clustering is absent, double hashing has poor cache performance.