
Hashing

Hashing is a technique that is used to map large data sets to smaller subsets. It works by applying a hash function to an input data element which returns an index value that corresponds to the subset the data element should be placed in. Collisions occur when two different inputs map to the same index, and resolution methods like open addressing or chaining are used to handle collisions. Common hash functions include division, folding, and mid-square methods. Hashing provides fast search and insertion times by partitioning data into buckets.

Uploaded by

Thz Esyy

♯ Introduction
Why is hashing needed?

Structure    Array   Linked list   BST
Add first    O(n)    O(1)          O(height)
Add last     O(1)    O(1)          O(height)
Search       O(n)    O(n)          O(height) -> O(log n) when balanced
Remove       O(n)    O(1)          O(height)

What should we do if search and remove operations must be more efficient?

- Searching in a smaller set is more efficient than searching in a larger set → Hashing
- The "divide and conquer" principle is applied.

♯ Introduction…
Hashing: partitioning a large set into subsets

♯ Learning Outcomes

LO7.1 Explain the concept of "hash". Define the concepts of hash function and hash table and their applications.
LO7.2 Demonstrate the types of hash functions: division, folding,...
LO7.3 Explain collisions and collision handling.
LO7.4 Explain the open addressing method for collision resolution: linear and quadratic probing.
LO7.5 Explain the chaining method for collision resolution: separate chaining and coalesced chaining.
LO7.6 Define perfect hash function and extendible hashing.

♯ Contents
1- Basics of Hashing
2- Common Hash Functions
3- Data storage of a hash structure
4- Common methods of a hash table
5- Collision Resolution
6- Load Factors, Rehashing, and Efficiency
7- Deletion
Introduction:
8- Perfect Hash Functions - Definition
9- External Hashing - Hash Functions for Extendible (extensible) Files
10- Hashing in java.util
Your work: re-implement the given project

♯ 1- Basics of Hashing
• What is hashing? → A process in which a large data set is partitioned into smaller data subsets.
• What is the tool for hashing? → A hash function.
• What does a hash function do? → It is written by the implementer, accepts input data (the whole data item, a chunk of it, or its memory address), and outputs an index that identifies one subset.
• Who implements the hash function? → The implementer of the hash structure.
• The data storage used in hashing is called a hash table. The number of subsets is called the table size.
• What is hashing used for?
– Storing a group of objects so that searching is efficient.
– Adding security to content in cryptography (self-study).

♯ 1- Basics of Hashing …
[Figure: example of a hash function]

♯ 2- Common Hash Functions

• An object is usually identified by a key (ID), which is a number or a string.
• A hash function (h) transforms a particular key (K) into the index of a subset.

Object's key → h → integer (index of a subset)

• If the hash function gives the same index for different objects, they belong to the same subset.
• If h transforms different keys into different numbers, it is called a perfect hash function → each subset contains ONE element only → ideal case.
• To create a perfect hash function, the table has to contain at least as many positions as there are elements being hashed.

♯ 2- Common Hash Functions…

• To get an integer as the output of a hash function, an integer expression must be used. Among the integer operators, modulo (%) is the usual choice.

Object's key → hash → integer:
1. Convert the string key to an integer, N
2. Return N % tableSize
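
The two steps above can be sketched in Java as follows. This is only a minimal illustration: the class name `StringKeyHash` and the choice of summing character codes to obtain N are assumptions made for the example, not something the slide prescribes.

```java
class StringKeyHash {
    // Step 1: convert the string key to an integer N (here: sum of character codes).
    // Step 2: return N % tableSize, kept non-negative with Math.floorMod.
    static int hash(String key, int tableSize) {
        int n = 0;
        for (int i = 0; i < key.length(); i++) {
            n += key.charAt(i);
        }
        return Math.floorMod(n, tableSize);
    }

    public static void main(String[] args) {
        // 'A' = 65 and 'B' = 66, so N = 131 and 131 % 10 = 1
        System.out.println(hash("AB", 10));
        // "BA" has the same character sum, so it lands in the same subset
        System.out.println(hash("BA", 10));
    }
}
```

Summing character codes is simple but maps every permutation of the same letters to the same index, which is one reason real libraries use more elaborate string hashes.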

♯ 2- Common Hash Functions…

All operations on the hash storage rely on the hash function.

Object's key → hash → integer: O(1)

Cost of an operation = O(1) + cost of the operation on the selected subset.

♯ 2- Common Hash Functions…
• Division method (the preferred choice, using %, modulo):
h(K) = K mod TSize (K % TSize), where TSize = sizeof(table)
• Folding method: h(K) = (sum of the parts of K) mod TSize
Ex: K = 123-45-6789 → sum of 3 parts: 123 + 45 + 6789 = 6957 → h(K) = 6957 mod TSize
Ex: K = 123-45-6789 → sum of 5 parts: 12 + 34 + 56 + 78 + 9 = 189 → h(K) = 189 mod TSize
• Mid-square method: the key is squared and the middle part of the result is used.
Ex: TSize = 1024 = 2^10, K = 3121 → K^2 = 9 740 641
K^2 = 1001010 0101000010 1100001 (binary) → h(K) = 0101000010 (binary) = 322
• Extraction method: only a part of the key is used to compute the address.
Ex: K = 123-45-6789, h(K) = 1289 mod TSize
SE123456 → 123456, SE56 → number?
• Radix transformation method: the key is rewritten in another number base.
Ex: K = 345 (base 10) = 423 (base 9) → h(K) = 423 mod TSize (= 23 if TSize = 100)

The hash table designer decides which approach to use for the hash function.
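
The worked examples on this slide can be reproduced with a short Java sketch. The class and method names are hypothetical, and two details are assumptions made for the example: the mid-square method drops an equal number of low-order bits to reach the middle of the square, and the extraction method takes the first two and last two digits (which yields the 1289 shown above).

```java
class HashFunctions {
    // Division: h(K) = K mod TSize
    static int division(long key, int tSize) {
        return (int) Math.floorMod(key, (long) tSize);
    }

    // Folding: split the key at the separator, add the parts, then reduce mod TSize
    static int folding(String key, String separator, int tSize) {
        long sum = 0;
        for (String part : key.split(separator)) {
            sum += Long.parseLong(part);
        }
        return (int) (sum % tSize);
    }

    // Mid-square: square the key and keep the middle `bits` bits (TSize = 2^bits)
    static int midSquare(long key, int bits) {
        long square = key * key;
        int length = 64 - Long.numberOfLeadingZeros(square); // bit length of the square
        int drop = Math.max(0, (length - bits) / 2);         // low-order bits to discard
        return (int) ((square >>> drop) & ((1L << bits) - 1));
    }

    // Extraction: use only the first two and last two digits of the key
    static int extraction(String digits, int tSize) {
        String part = digits.substring(0, 2) + digits.substring(digits.length() - 2);
        return Integer.parseInt(part) % tSize;
    }

    // Radix transformation: write the key in another base (at most 10), read it as decimal
    static int radix(int key, int base, int tSize) {
        return Integer.parseInt(Integer.toString(key, base)) % tSize;
    }

    public static void main(String[] args) {
        System.out.println(division(1025, 100));              // 25
        System.out.println(folding("123-45-6789", "-", 1000)); // 6957 % 1000 = 957
        System.out.println(midSquare(3121, 10));               // 322
        System.out.println(extraction("123456789", 1000));     // 1289 % 1000 = 289
        System.out.println(radix(345, 9, 100));                // 423 % 100 = 23
    }
}
```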

♯ 3- Data Storage of a Hash Structure

- Central storage using an array
  - All data objects are stored in one array, which serves all subsets
  - A table entry for ONE data object
  - A table entry for several objects → bucket hashing
- Separate chaining: each subset has its own individual storage → separate chaining hashing
- Central storage using a file on external disk
  - All data objects are stored in a file → external/extendible hashing

♯ 3- Data Storage of a Hash Structure…
- Common arrays and hash tables
- In a common array, all data objects are identified by unique indices and stored in consecutive blocks.
- In a hash table, the hash function determines the index of each stored data object. There can be empty entries in the hash table.
- A hash table entry contains the following information:

Used? | Data object | Link to the next object

♯ 3- Data Storage of a Hash Structure…
[Figure: the hash function h maps each data pair (K1, val1), (K2, val2), (K3, val3), …, (Kn, valn) to an index in a storage array with entries 0 to m-1.]

Structure of a hash table entry:

Used? | Data object | Link to the next object

♯ 4- Common methods of a hash table

Method             Purpose
get(key)           Get the value of a given key
put(key, value)    Add a data object to the hash table
remove(key)        Remove a data object
size()             Number of stored data objects
isEmpty()          Check whether the hash table is empty
keySet()           Get the set of keys
values()           Get the collection of values
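
These methods match the `java.util.Map` interface, so a brief usage sketch with `java.util.HashMap` may help. The class name `MapMethodsDemo` and the sample marks are illustrative assumptions; only the method calls themselves come from the standard library.

```java
import java.util.HashMap;
import java.util.Map;

class MapMethodsDemo {
    static Map<String, Integer> build() {
        Map<String, Integer> marks = new HashMap<>();
        marks.put("SE140606", 7);
        marks.put("SE141127", 4);
        marks.put("SE141127", 5);   // same key again: the old value is replaced
        marks.remove("SE140606");   // remove one data object by key
        return marks;
    }

    public static void main(String[] args) {
        Map<String, Integer> marks = build();
        System.out.println(marks.get("SE141127")); // value stored for the key
        System.out.println(marks.size());          // number of stored data objects
        System.out.println(marks.isEmpty());
        System.out.println(marks.keySet());
        System.out.println(marks.values());
    }
}
```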

♯ 5- Collision Resolution

Collision: a situation in which two distinct inputs produce the same hash function output → same position.
• Ex:
K1 = 1025 → h(K1) = 1025 % 100 = 25
K2 = 125 → h(K2) = 125 % 100 = 25
K3 = 25 → h(K3) = 25 % 100 = 25

Common methods used as solutions:

- Open addressing method: probe for a nearby free position
- Chaining method / coalesced chaining: hash into groups
- Bucket addressing: one table entry holds several objects

♯ 5- Collision Resolution…
Case: Central Storage

Open addressing method: when a key collides with another key, the collision is resolved by finding an available table entry other than the position (address) to which the colliding key was originally hashed. Common approach:

(1) If there is no collision, use h(k).
(2) If there is a collision, use h'(k) = h(k) + f(i), where i takes the values 1, 2, 3, 4, 5, … until an empty position is found.

f(i): the probing function. It can be linear (first degree, linear probing, the simplest method) or quadratic (second degree, quadratic probing).

♯ 5- Collision Resolution…
Case: Central Storage
Open addressing method:
Linear probing: p(i) = i, h'(K) = (h(K) + i) mod TSize, i = 1, 2, 3, …

[Figure: Resolving collisions with the linear probing method. Subscripts indicate the home positions of the keys being hashed.]
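
Linear probing as defined by h'(K) = (h(K) + i) mod TSize can be sketched as below. The class `LinearProbingTable` is hypothetical and assumes integer keys with h(K) = K mod TSize; it is meant as an illustration rather than a complete table (no deletion, no resizing).

```java
class LinearProbingTable {
    private final Integer[] slots;

    LinearProbingTable(int tSize) {
        slots = new Integer[tSize];
    }

    // Tries h(K), h(K)+1, h(K)+2, ... (mod TSize); returns the slot used, or -1 if full.
    int insert(int key) {
        int home = Math.floorMod(key, slots.length);
        for (int i = 0; i < slots.length; i++) {
            int index = (home + i) % slots.length;
            if (slots[index] == null) {
                slots[index] = key;
                return index;
            }
        }
        return -1;
    }

    // Follows the same probe path; an empty slot means the key is absent.
    int indexOf(int key) {
        int home = Math.floorMod(key, slots.length);
        for (int i = 0; i < slots.length; i++) {
            int index = (home + i) % slots.length;
            if (slots[index] == null) return -1;
            if (slots[index] == key) return index;
        }
        return -1;
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable(100);
        System.out.println(table.insert(1025)); // home position 25
        System.out.println(table.insert(125));  // 25 is taken, so 26
        System.out.println(table.insert(25));   // 25 and 26 are taken, so 27
    }
}
```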

♯ 5- Collision Resolution…
Case: Central Storage
Open addressing method:
Quadratic probing: p(i) = ±i^2, h'(K) = (h(K) ± i^2) mod TSize

Insert B5 → collision
→ probe: i=1 → h(5+1)=h(6)=6 → OK
Insert B2 → collision
→ probe: i=1 → h(2+1)=h(3)=3 → not OK
→ h(2-1)=h(1)=1 → OK
Insert B9 → collision
→ probe: i=1 → h(9+1)=h(10)=0 → OK
Insert C2 → collision
→ probe: i=1 → h(2+1)=h(3)=3 → not OK
→ h(2-1)=h(1)=1 → not OK
→ probe: i=2, i^2=4 → h(2+4)=6 → not OK
→ h(2-4), -2<0 → not OK
→ probe: i=3, i^2=9 → h(2+9)=1 → not OK
→ h(2-9), 2-9<0 → not OK
→ probe: i=4, i^2=16 → h(2+16)=8 → OK

[Figure: Using quadratic probing for collision resolution: h'(K) = (h(K) ± i^2) mod 10]
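
The probe order h(K) + 1, h(K) - 1, h(K) + 4, h(K) - 4, … can be generated with a small helper. The class `QuadraticProbing` is hypothetical. One assumption differs from the trace above: the sketch wraps negative positions around with `Math.floorMod`, whereas the trace simply rejects them.

```java
class QuadraticProbing {
    // Returns home, home+1, home-1, home+4, home-4, ..., up to i = maxI, all mod tSize.
    static int[] probeSequence(int home, int tSize, int maxI) {
        int[] sequence = new int[1 + 2 * maxI];
        sequence[0] = Math.floorMod(home, tSize);
        for (int i = 1; i <= maxI; i++) {
            sequence[2 * i - 1] = Math.floorMod(home + i * i, tSize);
            sequence[2 * i] = Math.floorMod(home - i * i, tSize); // wraps instead of rejecting
        }
        return sequence;
    }

    public static void main(String[] args) {
        // Home position 2 in a table of size 10, as for key C2 in the trace
        for (int position : probeSequence(2, 10, 4)) {
            System.out.print(position + " "); // 2 3 1 6 8 1 3 8 6
        }
        System.out.println();
    }
}
```

With wrapping, position 8 is already reached at i = 2 (from 2 - 4), while the trace above only reaches it at i = 4 (from 2 + 16).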

♯ 5- Collision Resolution…
Case: Central Storage

Open addressing method: evaluation

[Figure: Formulas approximating, for different hashing methods, the average numbers of trials for successful and unsuccessful searches (Knuth, 1998)]

♯ 5- Collision Resolution…

Chaining Method
• Keys do not have to be stored in the table itself. Each position of the table is associated with a linked list, or chain, of structures whose info fields store keys or references to keys.
• This method is called separate chaining, and a table of references (pointers) is called a scatter table.

♯ 5- Collision Resolution…
Separate Chaining Method
- h(K) → index of a linked list of elements that have the same hash function value.
- Search: K → h(K) → index → traverse the appropriate list to find the element having this key.
- N objects are partitioned into M subsets → average size of a subset: N/M. A subset is a list → the search operation has complexity O(N/M).

In chaining, colliding keys are put on the same linked list (the most flexible hash format).
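
A minimal separate-chaining sketch follows. The class `ChainedHashTable` is hypothetical and assumes integer keys with h(K) = K mod M; each of the M table positions owns one `java.util.LinkedList`.

```java
import java.util.LinkedList;

class ChainedHashTable {
    private final LinkedList<Integer>[] chains; // one list per subset
    private int size;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int m) {
        chains = new LinkedList[m];
        for (int i = 0; i < m; i++) {
            chains[i] = new LinkedList<>();
        }
    }

    private int indexFor(int key) {
        return Math.floorMod(key, chains.length);
    }

    // Colliding keys simply go onto the same list.
    boolean add(int key) {
        LinkedList<Integer> chain = chains[indexFor(key)];
        if (chain.contains(key)) return false;
        chain.add(key);
        size++;
        return true;
    }

    // Only one chain is traversed, so the average cost is O(N/M).
    boolean contains(int key) {
        return chains[indexFor(key)].contains(key);
    }

    boolean remove(int key) {
        boolean removed = chains[indexFor(key)].remove(Integer.valueOf(key));
        if (removed) size--;
        return removed;
    }

    int size() {
        return size;
    }

    int chainLength(int index) {
        return chains[index].size();
    }
}
```

For example, adding 1025, 125 and 25 to a table with M = 100 puts all three keys on the list at index 25.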

♯ 5- Collision Resolution…

Coalesced chaining - central array

• A version of chaining called coalesced hashing (or coalesced chaining) combines linear probing with chaining.
• An overflow area known as a cellar can be allocated to store keys for which there is no room in the table.

♯ 5- Collision Resolution…
Coalesced chaining

[Figure: each table entry stores the index of the next element in the same group; the default value is -1.]

Coalesced hashing puts a colliding key in the last available position of the table.
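
The idea of storing a "next" index in each entry and placing a colliding key in the last available position can be sketched as follows. The class `CoalescedTable` is hypothetical, assumes integer keys with h(K) = K mod TSize, uses no cellar, and does not check for duplicate keys.

```java
import java.util.Arrays;

class CoalescedTable {
    private final Integer[] keys;
    private final int[] next;  // index of the next key in the same chain, -1 if none
    private int lastFree;      // scans downward to find the last available position

    CoalescedTable(int tSize) {
        keys = new Integer[tSize];
        next = new int[tSize];
        Arrays.fill(next, -1);
        lastFree = tSize - 1;
    }

    // Returns the index where the key was stored, or -1 if the table is full.
    int insert(int key) {
        int index = Math.floorMod(key, keys.length);
        if (keys[index] == null) {
            keys[index] = key;
            return index;
        }
        while (next[index] != -1) {
            index = next[index];           // walk to the end of the chain
        }
        while (lastFree >= 0 && keys[lastFree] != null) {
            lastFree--;                    // find the last available position
        }
        if (lastFree < 0) return -1;
        keys[lastFree] = key;
        next[index] = lastFree;            // link the new entry to the chain
        return lastFree;
    }

    // Follows the chain that starts at the key's home position.
    int indexOf(int key) {
        int index = Math.floorMod(key, keys.length);
        while (index != -1 && keys[index] != null) {
            if (keys[index] == key) return index;
            index = next[index];
        }
        return -1;
    }
}
```

In a table of size 10, inserting 12, 22, 32 and then 9 stores them at positions 2, 9, 8 and 7: key 9 finds its home position occupied by 22, so it joins the same chain, which is how chains coalesce.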

♯ 5- Collision Resolution…
Coalesced chaining

- Main area
- Cellar: overflow area, filled bottom-up
- When the cellar is full, an inserted element is put in the main region.

[Figure: Coalesced hashing that uses a cellar]

♯ 5- Collision Resolution…

Bucket Addressing Method

• Colliding elements can be stored in the same position in the table by associating a bucket with each address.
• A bucket is a block of space large enough to store multiple items.

♯ 5- Collision Resolution…

Bucket Addressing

Insert C2 → collision
→ use linear probing
→ bucket 3 has a free space
→ insert C2 into bucket 3

[Figure: Collision resolution with buckets (bucket size = 2) and the linear probing method]

♯ 5- Collision Resolution…
Bucket Addressing

Each bucket holds a reference to a separate overflow area.

[Figure: Collision resolution with buckets and overflow area]

♯ 6- Load Factors, Rehashing, and Efficiency
Array-based hash table

• Load factor: λ = n/N, where n is the number of stored data objects and N is the table size.
  Recommendation: λ < 0.9. Default in Java: λ = 0.75.
  The higher the load factor, the higher the overhead for collision resolution.
  When λ exceeds a predefined threshold, the hash table is considered FULL.

• Rehashing: creating a new hash table from a full hash table.
  • A new, larger storage is allocated → λ decreases
  • Collisions become less frequent

• Efficiency:

(Figure 10.2 in the textbook)
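
The load-factor check and rehashing step might look like the sketch below. The class `RehashingTable` is hypothetical; it assumes integer keys, linear probing, a threshold of 0.75, an initial capacity of at least 1, and a doubling growth policy, all of which are choices made for this example.

```java
class RehashingTable {
    private static final double MAX_LOAD = 0.75; // assumed threshold
    private Integer[] slots;
    private int count;

    RehashingTable(int initialCapacity) {
        slots = new Integer[initialCapacity];
    }

    void add(int key) {
        if (contains(key)) return;
        // Rehash first if storing one more key would push lambda above the threshold.
        if ((count + 1) / (double) slots.length > MAX_LOAD) {
            rehash();
        }
        place(slots, key);
        count++;
    }

    boolean contains(int key) {
        int home = Math.floorMod(key, slots.length);
        for (int i = 0; i < slots.length; i++) {
            Integer stored = slots[(home + i) % slots.length];
            if (stored == null) return false;
            if (stored == key) return true;
        }
        return false;
    }

    // Allocates a larger array and re-inserts every key using the new table size.
    private void rehash() {
        Integer[] bigger = new Integer[slots.length * 2];
        for (Integer key : slots) {
            if (key != null) place(bigger, key);
        }
        slots = bigger;
    }

    private static void place(Integer[] table, int key) {
        int index = Math.floorMod(key, table.length);
        while (table[index] != null) {
            index = (index + 1) % table.length; // linear probing
        }
        table[index] = key;
    }

    double loadFactor() {
        return count / (double) slots.length;
    }

    int capacity() {
        return slots.length;
    }
}
```

Starting with capacity 4, the first three keys fit (λ = 0.75); adding a fourth triggers a rehash to capacity 8, after which λ = 0.5.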

♯ 7- Deletion
[Flowchart: Begin → key k to delete → h(k) → index → search the group that contains k and delete k → End]

♯ 7- Deletion…
• The structure and the collision resolution method of the hash table determine how its elements are deleted. The options are:
– Linear search for deletion.
– Linear search to locate the linked list of the subgroup, then delete an element in this linked list.
– Linear search to locate the subgroup, then delete an element in this subgroup and update the references to the next elements in the same subgroup.

♯ 7- Deletion…
H'(k) = H(k) + i
Delete A4 → update locations

[Figure: Linear search in the situation where both insertion and deletion of keys are permitted]
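
Deleting from an open-addressing table needs care, because emptying a slot can cut the probe path to keys stored after it. The slide updates the locations of later keys; a common alternative, sketched here, is lazy deletion with a marker ("tombstone"). The class `LazyDeletionTable` is hypothetical, assumes integer keys with linear probing, and is not the method shown in the slide's figure.

```java
class LazyDeletionTable {
    private static final Object DELETED = new Object(); // tombstone marker
    private final Object[] slots;

    LazyDeletionTable(int tSize) {
        slots = new Object[tSize];
    }

    private int home(int key) {
        return Math.floorMod(key, slots.length);
    }

    // Stores the key in the first empty or deleted slot on its probe path; -1 if none.
    int insert(int key) {
        for (int i = 0; i < slots.length; i++) {
            int index = (home(key) + i) % slots.length;
            if (slots[index] == null || slots[index] == DELETED) {
                slots[index] = key;
                return index;
            }
        }
        return -1;
    }

    // Skips tombstones; only a never-used slot ends the search early.
    int indexOf(int key) {
        for (int i = 0; i < slots.length; i++) {
            int index = (home(key) + i) % slots.length;
            if (slots[index] == null) return -1;
            if (Integer.valueOf(key).equals(slots[index])) return index;
        }
        return -1;
    }

    // Marks the slot as deleted instead of emptying it.
    boolean remove(int key) {
        int index = indexOf(key);
        if (index < 0) return false;
        slots[index] = DELETED;
        return true;
    }
}
```

With keys 4, 14 and 24 in a table of size 10 (slots 4, 5, 6), removing 14 leaves a tombstone in slot 5, so a later search for 24 still probes past it to slot 6.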

♯ 8- Perfect Hash Functions

• A hash function is called perfect when it transforms different keys into different numbers → each subset contains ONE element only → ideal case.
• Expectation:
• If a function requires only as many cells in the table as there are data items, so that no empty cell remains after hashing is completed, it is called a minimal perfect hash function → fewest collisions.
• Methods (self-study):
• Richard J. Cichelli's method
• The FHCD method

♯ 9- External Hashing

• Hash functions for extendible files
• File = table.
• Expandable hashing, dynamic hashing, and extendible hashing methods distribute keys among buckets in a similar fashion.
• Data → h(Data) → index → buckets[index]
• The main difference is the structure of the index (directory).
• In expandable hashing and dynamic hashing, a binary tree is used as an index of buckets.
• In extendible hashing, a directory of records is kept in a table.

♯ 9- External Hashing …

• Extendible hashing accesses the data stored in buckets indirectly through an index that is dynamically adjusted to reflect changes in the file.
• Extendible hashing allows the file to expand without reorganizing it, but requires storage space for an index.
• Values returned by such a hash function are called pseudokeys.

♯ 9- External Hashing…

• It is commonly used in database files: file = hash table.
• Record = <key, value>
• A bucket contains several records. Each bucket has a unique index, and new buckets can be created.
• Bucket indexes are stored in a distinct area called the directory: file = {directory, bucket 1, bucket 2, …}
• Multi-level / extendible hashing can be used.
• Data → h(key) → index → access directory → file position of buckets[index]
• Cluster = 4KB

♯ 9- External Hashing…

• With this method (linear hashing), no index is necessary because new buckets generated by splitting existing buckets are always added in the same linear way, so there is no need to retain indexes.
• A bucket is full when its load factor exceeds a certain level. This bucket will be split.
• A reference, split, indicates which bucket is to be split next.
• After the bucket is divided, the keys in this bucket are distributed between this bucket and the newly created bucket, which is added to the end of the table.

♯ 10- Hashing in the java.util Package

• Main classes that implement the hashing technique:
• The HashMap class
• The HashSet class
• The Hashtable class

♯ The java.util.HashMap class

• HashMap is an implementation of the interface Map.
• A map is a collection that holds pairs (key, value), or entries.
• A hash map is a collection of singly linked lists (buckets); that is, chaining is used as a collision resolution technique.
• In a hash map, both null values and null keys are permitted.
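
The last point can be checked directly with a few lines of Java. The class name `HashMapNullDemo` is only for illustration.

```java
import java.util.HashMap;

class HashMapNullDemo {
    static HashMap<String, Integer> build() {
        HashMap<String, Integer> map = new HashMap<>();
        map.put(null, 0);    // a null key is accepted
        map.put("x", null);  // a null value is accepted
        return map;
    }

    public static void main(String[] args) {
        HashMap<String, Integer> map = build();
        System.out.println(map.containsKey(null)); // true
        System.out.println(map.get("x"));          // null, although the key is present
        System.out.println(map.containsKey("x"));  // true
    }
}
```

Because `get` returns null both for a missing key and for a key mapped to null, `containsKey` is the reliable way to tell the two cases apart.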

♯ The java.util.HashMap class…

[Table: Methods in class HashMap, including three inherited methods]


♯ The java.util.HashMap class…

This example (in the textbook) demonstrates how to use the HashMap class to manage a list of persons <name, age, hashCode>, in which the hashCode is the sum of the character codes in the field name.

Figure 10-17 Demonstrating the operation of the methods in class HashMap


♯ The java.util.HashSet class

• HashSet is another implementation of a set (an object that stores unique elements).
• The class hierarchy in java.util for HashSet is:
Object → AbstractCollection → AbstractSet → HashSet
• HashSet is implemented in terms of HashMap:
public HashSet() {
map = new HashMap();
}
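
A short usage sketch shows the "unique elements" property in practice. The class `HashSetDemo` and its `uniqueWords` helper are hypothetical names introduced for this example.

```java
import java.util.HashSet;
import java.util.Set;

class HashSetDemo {
    // Collects the distinct words of a text; adding a duplicate has no effect.
    static Set<String> uniqueWords(String text) {
        Set<String> words = new HashSet<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> words = uniqueWords("to be or not to be");
        System.out.println(words.size());           // 4 distinct words
        System.out.println(words.contains("not"));  // true
    }
}
```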

♯ The java.util.HashSet class…

[Table: Methods in class HashSet, including some inherited methods]

♯ The java.util.Hashtable class

• A Hashtable is roughly equivalent to a HashMap, except that it is synchronized and does not permit null keys or null values.
• The class Hashtable is considered a legacy class, just like the class Vector.
• The class hierarchy in java.util is:
Object → Dictionary → Hashtable
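
The difference from HashMap regarding null can be observed with a small check. The class `HashtableNullDemo` is a hypothetical name; `java.util.Hashtable.put` throws a `NullPointerException` when the key or the value is null.

```java
import java.util.Hashtable;

class HashtableNullDemo {
    static boolean rejectsNullValue() {
        Hashtable<String, Integer> table = new Hashtable<>();
        try {
            table.put("k", null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    static boolean rejectsNullKey() {
        Hashtable<String, Integer> table = new Hashtable<>();
        try {
            table.put(null, 1);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectsNullValue()); // true
        System.out.println(rejectsNullKey());   // true
    }
}
```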

♯ The java.util.Hashtable class…

[Figure 10-20: Methods of the class Hashtable, including three inherited methods]

♯ Summary: Learning Outcomes
LO7.1 Explain the concept of "hash". Define the concepts of hash function and hash table and their applications.
LO7.2 Demonstrate the types of hash functions: division, folding,...
LO7.3 Explain collisions and collision handling.
LO7.4 Explain the open addressing method for collision resolution: linear and quadratic probing.
LO7.5 Explain the chaining method for collision resolution: separate chaining and coalesced chaining.
LO7.6 Define perfect hash function and extendible hashing.

♯ Summary

• Common hash functions include the division, folding, mid-square, extraction, and radix transformation methods.
• Collision resolution includes the open addressing, chaining, and bucket addressing methods.
• Cichelli's method is an algorithm to construct a minimal perfect hash function.

♯ Summary (continued)

• The FHCD algorithm searches for a minimal perfect hash function of the form [formula missing in the source] (modulo TSize), where g is the function to be determined by the algorithm.
• In expandable hashing and dynamic hashing, a binary tree is used as an index of buckets.
• In extendible hashing, a directory of records is kept in a table.

♯ Summary (continued)

• A hash map is a collection of singly linked lists (buckets); that is, chaining is used as a collision resolution technique.
• HashSet is another implementation of a set (an object that stores unique elements).
• A Hashtable is roughly equivalent to a HashMap, except that it is synchronized and does not permit null keys or null values.

♯ Notes about hash tables
• When should hash tables be used?
– When the elements in a group are distinct and insertion and search are the main operations.
• What should be decided before a hash table is implemented?
– The key for each element: number or string?
– The hash function
– The collision resolution method
These choices affect the algorithms used in the hash table.

♯ Lab 1: Using HashMap to compute probabilities of characters in a text file
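
The lab's own code is not part of this text, so the following is only a possible starting point. The class `CharProbabilities` is hypothetical, and it works on a `String` that is assumed to hold the file's contents (reading the file is left out).

```java
import java.util.HashMap;
import java.util.Map;

class CharProbabilities {
    // Probability of a character = number of occurrences / total number of characters.
    static Map<Character, Double> probabilities(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum); // insert 1 or add 1 to the current count
        }
        Map<Character, Double> result = new HashMap<>();
        for (Map.Entry<Character, Integer> entry : counts.entrySet()) {
            result.put(entry.getKey(), entry.getValue() / (double) text.length());
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a' occurs 2 times out of 3, 'b' occurs 1 time out of 3
        System.out.println(probabilities("aab"));
    }
}
```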

♯ Lab 2: Using Hashtable to manage a student list
SE140606,NGUYỄN TRỌNG HẢI,7
SE141127,VÕ TRỌNG ĐẠT,4
SE140913,TRẦN MINH HIẾU,7
SE62440,ĐOÀN LƯƠNG PHÚ,6
SE141153,THÁI ĐỨC THẢO,5
SE140244,PHẠM NHẬT TÂN,8
SE140861,PHẠM ĐĂNG HẢI,5
SE140929,NGUYỄN LÊ ANH LONG,9
SE140755,LÊ ANH DUY,8
SE140618,LÝ GIA HUY,8
SE63394,VŨ VĂN KHẢI,9
SE63391,BÙI LÊ QUỐC THẮNG,4
SE140367,CAO DUY QUANG,9
SE140130,TRẦN VĂN TÂM,4
SE140923,NGUYỄN VĂN TÂN,5
SE130182,DIỆP MINH THÔNG,6
SE140877,NGUYỄN HỒNG SƠN,6
SE140813,NGUYỄN ĐĂNG HUY,6
SE140503,LÊ VĨNH HƯNG,3
SE140874,LÊ HỮU HIẾU,6
SE141086,NGUYỄN MẠNH LỰC,9
SE140873,TÔN THẤT BẢO,4
SE140067,NGUYỄN TRẦN HOÀNG LONG,5
SE140855,TRẦN HOÀNG HẢI DUY,5
SE140885,CAO HOÀNG QUY,7
SE140203,HÀ GIA PHƯỚC,3
SE130610,THÁI TIẾN ĐẠT,7
SE151525,TẠ MINH TIẾN,3
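
Since the lab's code is not included in this text, here is one way the list above could be loaded into a `java.util.Hashtable`. The class `StudentList` and its `Student` type are hypothetical, and treating the third field as an integer score is an assumption; the source does not say what the number means.

```java
import java.util.Hashtable;

class StudentList {
    static final class Student {
        final String id;
        final String name;
        final int score; // assumption: the third field is an integer score

        Student(String id, String name, int score) {
            this.id = id;
            this.name = name;
            this.score = score;
        }
    }

    // Parses lines of the form "ID,NAME,NUMBER" and stores each student under its ID.
    static Hashtable<String, Student> parse(String[] lines) {
        Hashtable<String, Student> table = new Hashtable<>();
        for (String line : lines) {
            String[] fields = line.trim().split(",");
            if (fields.length != 3) {
                continue; // skip malformed lines
            }
            int score = Integer.parseInt(fields[2].trim());
            table.put(fields[0], new Student(fields[0], fields[1], score));
        }
        return table;
    }
}
```

After `parse` has run on the lines of the file, `table.get("SE140606")` returns the matching student in expected constant time, and `table.remove(id)` deletes one.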

Use Notepad++, WordPad, Notepad, or the NetBeans editor to edit the Unicode text.

Writing: char (2 bytes in memory) → UTF-8 encoding → byte sequence → file
Reading: char (2 bytes in memory) ← UTF-8 decoding ← byte sequence ← file

♯ Bonus: Hashing in Cryptography
Two families of algorithms:
SHA: Secure Hash Algorithms
MDA: Message Digest Algorithms

Content → hash function → digest (a bit string of fixed or variable length)

- The transformations are deliberately elaborate so that they are hard to break. Extra data (called salt) may be mixed in.
- Digest: a summary of the content, or a digital signature.
- A good hash function → few collisions.

Sender:
1. Create a digest from the content.
2. Send (content + digest).

Receiver:
1. Receive (content + digest).
2. Create digest2 from the content.
3. If (digest2 == digest), the content is trustworthy.

Applications:
(1) Protecting passwords in databases so that even the system administrator cannot read them.
(2) Establishing trust in transaction data (blockchain technology).
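
The sender and receiver steps can be sketched with the standard `java.security.MessageDigest` class. The class `DigestCheck` and its method names are hypothetical, and SHA-256 is chosen here as one member of the SHA family; the slide does not name a specific algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestCheck {
    // Sender side: create a digest from the content.
    static byte[] digest(String content) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            return sha256.digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Receiver side: recompute the digest and compare it with the received one.
    static boolean isTrusted(String content, byte[] receivedDigest) {
        return MessageDigest.isEqual(digest(content), receivedDigest);
    }

    public static void main(String[] args) {
        byte[] sent = digest("transfer 100");
        System.out.println(isTrusted("transfer 100", sent)); // unchanged content
        System.out.println(isTrusted("transfer 900", sent)); // altered content
    }
}
```

A plain digest only detects accidental or unsophisticated changes, since anyone who alters the content can also recompute the digest; real protocols combine it with a key or a signature.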

Thank you.