How A Search Engine Works - Slide
Search Engine
Guided By :- XxX
5,981,044    65.5%
1,294,261    14.1%
1,142,364    12.5%
  206,969     2.3%
  175,074     1.9%
   91,288     1.0%
   55,122     0.6%
   27,002     0.3%
   26,462     0.3%
   24,681     0.3%
Finding documents:
The required documents may be distributed over
tens of thousands of servers and must first be located.
Formulating queries:
The user needs to express exactly what kind of
information is to be retrieved.
Determining relevance:
The system must determine whether a document
contains the required information or not.
Types of Search Engine
Heap:
• It is a large unstructured chunk of virtual
memory where strings can be appended.
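The heap described above can be sketched as an append-only byte buffer. This is a minimal illustration, not the crawler's actual implementation: the class name `StringHeap`, its methods, and the zero-byte terminator are all assumptions made for the example.

```python
class StringHeap:
    """Hypothetical append-only string storage standing in for the heap."""

    def __init__(self):
        # One growing, unstructured chunk of memory.
        self._data = bytearray()

    def append(self, text: str) -> int:
        """Append a string and return the offset where it starts."""
        offset = len(self._data)
        # A zero byte marks the end of each string (an assumption of this sketch).
        self._data += text.encode("utf-8") + b"\0"
        return offset

    def read(self, offset: int) -> str:
        """Read back the string that starts at the given offset."""
        end = self._data.index(b"\0", offset)
        return self._data[offset:end].decode("utf-8")


heap = StringHeap()
url_offset = heap.append("http://example.com/")
title_offset = heap.append("Example page")
print(heap.read(url_offset), "|", heap.read(title_offset))
```

The returned offsets play the role of pointers: another structure can store them instead of the strings themselves.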
Hash table:
• It is the third data structure, with ‘n’ entries.
• Any URL can be run through a hash
function to produce a nonnegative
integer less than ‘n’.
• All URLs that hash to the value ‘k’ are
hooked together on a linked list.
• Every entry in url_table is also entered
into the hash table.
• The main use of the hash table is to start with a
URL and quickly determine
whether it is already present in url_table.
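The lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the table size `N`, the helper names `hash_url`, `lookup`, and `add_url`, and the use of CRC32 as the hash function are all hypothetical, and a Python list stands in for each linked overflow chain.

```python
import zlib

N = 8  # number of hash table entries ('n' in the text); kept small for illustration

url_table = []                        # records of (url, title)
hash_table = [[] for _ in range(N)]   # slot k chains the url_table indices that hash to k


def hash_url(url: str) -> int:
    """Map a URL to a nonnegative integer less than N."""
    return zlib.crc32(url.encode("utf-8")) % N


def lookup(url: str):
    """Return the url_table index of a URL, or None if it is not present."""
    for index in hash_table[hash_url(url)]:
        if url_table[index][0] == url:
            return index
    return None


def add_url(url: str, title: str = "") -> int:
    """Enter a URL into url_table and the hash table unless it is already there."""
    existing = lookup(url)
    if existing is not None:
        return existing
    url_table.append((url, title))
    index = len(url_table) - 1
    hash_table[hash_url(url)].append(index)
    return index


first = add_url("http://example.com/", "Example")
again = add_url("http://example.com/")   # already present, so no new entry is made
print(first == again, len(url_table))
```

Only the chain for one hash value is scanned, so the check for an already-seen URL does not have to walk the whole of url_table.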
Data structure for crawler
[Figure: data structures for the crawler. Labels: pointers to URL, pointers to title, hash code (0 to n), overflow chains, string storage (URL, Title), heap.]
Term Frequency
Where,
• | D | : total number of documents in the corpus
• |{d : t_i ∈ d}| : number of documents where the
term t_i appears (that is, where its count is nonzero).
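The quantities above are commonly combined as follows. This is a sketch in the usual textbook notation, not necessarily the exact form used on the slide: the symbol n_{i,j}, the count of term t_i in document d_j, is an assumption introduced here, and the logarithmic IDF is only one of several variants.

```latex
\[
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad
\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}, \qquad
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i
\]
```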
Inverse Document Frequency
There are different ways of calculating the IDF.
E.g.
The TF-IDF score for "computer" in the
collection would be:
1) TF-IDF = 0.03 / 0.0001 = 300, using the first
formula of IDF.
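The arithmetic above can be reproduced with a short sketch. Several things here are assumptions rather than facts from the slide: 0.03 is read as the term frequency, 0.0001 as the fraction of documents containing the term, the corpus sizes are illustrative values chosen only to match that fraction, and the "first formula" is taken to be the plain ratio of total documents to documents containing the term. The logarithmic variant is included for comparison.

```python
import math

tf = 0.03                  # assumed term frequency of "computer" in the document
num_docs = 10_000_000      # illustrative corpus size (assumption)
docs_with_term = 1_000     # illustrative count, so docs_with_term / num_docs = 0.0001


def idf_ratio(total: int, containing: int) -> float:
    """IDF as a plain ratio (assumed to be the 'first formula')."""
    return total / containing


def idf_log(total: int, containing: int) -> float:
    """IDF as the natural logarithm of the same ratio (a common variant)."""
    return math.log(total / containing)


score_ratio = tf * idf_ratio(num_docs, docs_with_term)
score_log = tf * idf_log(num_docs, docs_with_term)

print(f"ratio IDF: {score_ratio:.2f}")
print(f"log IDF:   {score_log:.4f}")
```

Multiplying by the ratio is the same as dividing 0.03 by 0.0001, which gives the score of 300. The logarithmic form gives a much smaller score for the same counts because it damps the effect of very rare terms.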