Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval


Java VSR Implementation

A simple vector-space retrieval (VSR) system written in Java.
Code is in package ir.vsr.
All code is in /u/mooney/ir-code.
VSR code is in /u/mooney/ir-code/ir/vsr.
Handles HTML and generic ASCII documents, where each document is a file.
For now, ignore anything about feedback.

Simple Tokenizing
Analyze text into a sequence of discrete tokens (words). Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token. However, usually they are not. Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
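
The simplest approach described above can be sketched in a few lines of Java. This is only an illustration, not the tokenizer in ir.vsr; the class name `SimpleTokenizer` and the method `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of ir.vsr): lowercase alphabetic tokens only. */
class SimpleTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Any run of non-letters (digits, punctuation, whitespace) is a separator.
        for (String piece : text.split("[^A-Za-z]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

With this sketch, "e-mail" splits into two tokens and "1999" disappears, which is exactly the information loss the slide warns about.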

Tokenizing HTML
Should text in HTML commands not typically seen by the user be included as tokens?
Words appearing in URLs. Words appearing in meta text of images.

Simplest approach used in VSR is to exclude all HTML tag information from tokenization.
Parses HTML using utilities in Java Swing package, and collects all raw text.
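
The slides say only that VSR uses utilities from the Java Swing package. One plausible way to do this, shown here as an assumption and not as VSR's actual code, is the Swing HTML parser with a callback that keeps text and ignores tags and attributes. `HtmlTextExtractor` and `extractText` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: collects the raw text of an HTML string, skipping tag information. */
class HtmlTextExtractor {
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for text content only; URLs and image attributes never arrive here.
                text.append(data).append(' ');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }
}
```

The returned string can then be passed to the same tokenizer that is used for plain ASCII files.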

Documents in VSR
Document
    TextStringDocument (used for typed queries)
    FileDocument
        TextFileDocument (used for ASCII files)
        HTMLFileDocument (used for HTML files)

Stopwords
It is typical to exclude high-frequency words (e.g. function words: a, the, in, to; pronouns: I, he, she, it). Stopwords are language dependent. VSR uses a standard set of about 500 for English. For efficiency, store strings for stopwords in a hashtable to recognize them in constant time.
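
A minimal sketch of the constant-time lookup, assuming tokens are already lowercased. The eight words below are just the examples named on this slide, not VSR's list of about 500, and `StopwordFilter` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: drops stopwords using a hash-based set. */
class StopwordFilter {
    // Tiny illustrative list; a real English list is far longer.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "the", "in", "to", "i", "he", "she", "it"));

    static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // HashSet.contains is expected constant time, regardless of list size.
            if (!STOPWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```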

Stemming
Reduce tokens to root form of words to recognize morphological variation.
computer, computational, computation all reduced to same token compute

Correct morphological analysis is language specific and can be complex. Stemming blindly strips off known affixes (prefixes and suffixes) in an iterative fashion.
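
To make "blindly strips off known affixes in an iterative fashion" concrete, here is a toy suffix stripper. It is not the Porter stemmer: the suffix list and the three-character minimum are invented for illustration, and `NaiveSuffixStripper` is a hypothetical name.

```java
/** Toy illustration of iterative suffix stripping; NOT the Porter algorithm. */
class NaiveSuffixStripper {
    // Invented suffix list, longest first so "ational" wins over "ation".
    private static final String[] SUFFIXES = {"ational", "ation", "ing", "er", "s"};

    static String stem(String word) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String suffix : SUFFIXES) {
                // Keep at least three characters so short words survive intact.
                if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                    word = word.substring(0, word.length() - suffix.length());
                    changed = true;
                    break; // restart from the longest suffix on the shortened word
                }
            }
        }
        return word;
    }
}
```

Under these made-up rules, computer, computers, computation, and computational all end up as the same token. Because no dictionary is consulted, the result need not be an English word.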

Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary. Can produce unusual stems that are not English words:
computer, computational, computation all reduced to same token comput

May conflate (reduce to the same token) words that are actually distinct. May not recognize all morphological derivations.

Porter Stemmer Errors

Errors of commission:
organization, organ -> organ
police, policy -> polic
arm, army -> arm

Errors of omission:
cylinder, cylindrical
create, creation
Europe, European

Sparse Vectors
Vocabulary and therefore dimensionality of vectors can be very large, ~10^4. However, most documents and queries do not contain most words, so vectors are sparse (i.e. most entries are 0). Need efficient methods for storing and computing with sparse vectors.

Sparse Vectors as Lists

Store vectors as linked lists of non-zero-weight tokens, each paired with a weight.
Space proportional to number of unique tokens (n) in document.
Requires linear search of the list to find (or change) the weight of a specific token.
Requires quadratic time in worst case to compute vector for a document:

$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} = O(n^2)$

Sparse Vectors as Trees

Index tokens in a document in a balanced binary tree or trie with weights stored with tokens at the leaves.
[Figure: Balanced Binary Tree. Internal nodes compare against memory (root), film, and variable; the leaves hold the token weights bit: 2, film: 1, memory: 1, variable: 2.]

Sparse Vectors as Trees (cont.)

Space overhead for tree structure: ~2n nodes. O(log n) time to find or update weight of a specific token. O(n log n) time to construct vector. Need software package to support such data structures.

Sparse Vectors as HashTables

Store tokens in hashtable, with token string as key and weight as value.
Storage overhead for hashtable ~1.5n. Table must fit in main memory. Constant time to find or update weight of a specific token (ignoring collisions). O(n) time to construct vector (ignoring collisions).
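
A minimal sketch of the hashtable approach, using raw term counts as the weights. It is not VSR's HashMapVector class; `SparseVectorSketch` and `termCounts` are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: builds a sparse term-count vector in one pass over the tokens. */
class SparseVectorSketch {
    static Map<String, Double> termCounts(List<String> tokens) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : tokens) {
            // Expected constant-time find-or-update per token, so O(n) overall.
            vector.merge(token, 1.0, Double::sum);
        }
        return vector;
    }
}
```

Only tokens that actually occur get an entry, so every absent vocabulary word is implicitly weight 0.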

Sparse Vectors in VSR

Uses the hashtable approach, in a class called HashMapVector. The hashMapVector() method of a Document computes and returns a HashMapVector for the document. hashMapVector() only works once after initial Document creation (i.e. the Document object does not store the vector internally for later reuse).

Implementation Based on Inverted Files

In practice, document vectors are not stored directly; an inverted organization provides much better efficiency. The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree). Critical issue is logarithmic or constant-time access to token information.

Inverted Index

Index terms    df    Postings (Dj, tfj)
computer       3     D7, 4
database       2     D1, 3
...
science        4     D2, 4
system         1     D5, 2

The index file holds the index terms and their df values; each term points to its postings list.

VSR Inverted Index

tokenHash (HashMap): String token -> TokenInfo

TokenInfo
    double idf
    ArrayList occList (one TokenOccurence per document containing the token)

TokenOccurence
    DocumentReference docRef
    int count

DocumentReference
    File file
    double length

Creating an Inverted Index

Create an empty HashMap, H;
For each document, D (i.e. file in an input directory):
    Create a HashMapVector, V, for D;
    For each (non-zero) token, T, in V:
        If T is not already in H, create an empty TokenInfo for T and insert it into H;
        Create a TokenOccurence for T in D and add it to the occList in the TokenInfo for T;
Compute IDF for all tokens in H;
Compute vector lengths for all documents in H;
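
The indexing loop can be sketched as follows, with the IDF and length steps left out. The nested classes are simplified stand-ins that reuse the slide's names (including its spelling of TokenOccurence) but are not the real ir.vsr classes. `IndexBuildSketch`, `buildIndex`, and the use of a document name instead of a File are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-ins for the structures named on the slides. */
class IndexBuildSketch {
    static class DocumentReference {
        final String name;     // stand-in for the File in the real structure
        double length = 0.0;   // filled in later, once IDFs are known
        DocumentReference(String name) { this.name = name; }
    }

    static class TokenOccurence {
        final DocumentReference docRef;
        final int count;
        TokenOccurence(DocumentReference docRef, int count) {
            this.docRef = docRef;
            this.count = count;
        }
    }

    static class TokenInfo {
        double idf = 0.0;      // filled in later, after all documents are indexed
        final List<TokenOccurence> occList = new ArrayList<>();
    }

    /** docs maps a document name to its token counts (its sparse vector). */
    static Map<String, TokenInfo> buildIndex(Map<String, Map<String, Integer>> docs) {
        Map<String, TokenInfo> tokenHash = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : docs.entrySet()) {
            DocumentReference ref = new DocumentReference(doc.getKey());
            for (Map.Entry<String, Integer> e : doc.getValue().entrySet()) {
                if (e.getValue() == 0) {
                    continue; // only non-zero tokens are indexed
                }
                // Create the TokenInfo on first sight of the token, then add a posting.
                TokenInfo info = tokenHash.computeIfAbsent(e.getKey(), k -> new TokenInfo());
                info.occList.add(new TokenOccurence(ref, e.getValue()));
            }
        }
        return tokenHash;
    }
}
```

Each document contributes one posting per distinct token, so the size of a token's occList is the number of documents containing it.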

Computing IDF
Let N be the total number of documents;
For each token, T, in H:
    Determine the total number of documents, M, in which T occurs (the length of T's occList);
    Set the IDF for T to log(N/M);
Note that this requires a second pass through all the tokens after all documents have been indexed.
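
A small sketch of the IDF pass, assuming the document frequency M of each token is already known. The slide does not state the base of the logarithm, so natural log is used here as an assumption. `IdfSketch` and `computeIdf` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: IDF = log(N / M) for every token. */
class IdfSketch {
    static Map<String, Double> computeIdf(Map<String, Integer> documentFrequency,
                                          int totalDocuments) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : documentFrequency.entrySet()) {
            // Cast to double first so N / M is not truncated by integer division.
            double ratio = (double) totalDocuments / e.getValue();
            idf.put(e.getKey(), Math.log(ratio)); // natural log: an assumption
        }
        return idf;
    }
}
```

A token that occurs in every document gets IDF 0 and so contributes nothing to any score. Changing the log base rescales every weight by the same constant, which leaves cosine rankings unchanged.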

Document Vector Length

Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens. Remember that the weight of a token is TF * IDF. Therefore, we must wait until IDFs are known (and therefore until all documents are indexed) before document lengths can be determined.

Computing Document Lengths

Assume the lengths of all document vectors (stored in the DocumentReference) are initialized to 0.0;
For each token, T, in H:
    Let I be the IDF weight of T;
    For each TokenOccurence of T in document D:
        Let C be the count of T in D;
        Increment the length of D by (I*C)^2;
For each document D in H:
    Set the length of D to be the square root of the currently stored length;
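
The two-phase length computation can be sketched like this, using plain maps in place of VSR's TokenInfo and DocumentReference objects. The class name, method name, and map-based layout (token -> document name -> count) are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: accumulates squared TF*IDF weights, then takes square roots. */
class DocumentLengthSketch {
    static Map<String, Double> lengths(Map<String, Map<String, Integer>> postings,
                                       Map<String, Double> idf) {
        Map<String, Double> length = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> token : postings.entrySet()) {
            double i = idf.get(token.getKey());
            for (Map.Entry<String, Integer> occ : token.getValue().entrySet()) {
                double weight = i * occ.getValue();      // I * C
                // Missing entries start at 0.0 implicitly via merge.
                length.merge(occ.getKey(), weight * weight, Double::sum);
            }
        }
        // Second phase: replace each sum of squares by its square root.
        length.replaceAll((doc, sumOfSquares) -> Math.sqrt(sumOfSquares));
        return length;
    }
}
```

For example, a document with weights 3 and 4 on its two tokens gets length 5.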

Minimizing Iterations Through Tokens

To avoid iterating through all tokens twice (after all documents are already indexed), the computation of IDFs and of vector lengths is combined into one iteration in VSR.

Time Complexity of Indexing

Complexity of creating a vector and indexing a document of n tokens is O(n). So indexing m such documents is O(m n). Computing token IDFs for a vocabulary V is O(|V|). Computing vector lengths is also O(m n). Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.

Retrieval with an Inverted Index

Tokens that are not in both the query and the document do not affect the dot product in the cosine similarity.
Their product of token weights is zero, so they contribute nothing to the sum.

Usually the query is fairly short, and therefore its vector is extremely sparse. Use inverted index to find the limited set of documents that contain at least one of the query words.

Inverted Query Retrieval Efficiency

Assume that, on average, a query word appears in B documents:
Q = q1 ... qn
q1 -> D11 ... D1B
q2 -> D21 ... D2B
...
qn -> Dn1 ... DnB

Then retrieval time is O(|Q| B), which is typically much better than naive retrieval that examines all N documents, O(|V| N), because |Q| << |V| and B << N.

Processing the Query

Incrementally compute cosine similarity of each indexed document as query words are processed one by one. To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, where DocumentReference is the key and the partial accumulated score is the value.

Inverted-Index Retrieval Algorithm

Create a HashMapVector, Q, for the query.
Create an empty HashMap, R, to store retrieved documents with scores.
For each token, T, in Q:
    Let I be the IDF of T, and K be the count of T in Q;
    Set the weight of T in Q: W = K * I;
    Let L be the list of TokenOccurences of T from H;
    For each TokenOccurence, O, in L:
        Let D be the document of O, and C be the count of O (tf of T in D);
        If D is not already in R (D was not previously retrieved)
            Then add D to R and initialize score to 0.0;
        Increment D's score by W * I * C; (product of T-weight in Q and D)

Retrieval Algorithm (cont)

Compute the length, L, of the vector Q (square root of the sum of the squares of its weights).
For each retrieved document D in R:
    Let S be the current accumulated score of D; (S is the dot product of D and Q)
    Let Y be the length of D as stored in its DocumentReference;
    Normalize D's final score to S/(L * Y);
Sort retrieved documents in R by final score and return results in an array.
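
Both parts of the retrieval algorithm can be sketched in one method. Plain maps stand in for VSR's index structures, and the query length is accumulated inside the token loop. Skipping query tokens that were never indexed is an assumption, as are the names `RetrievalSketch` and `retrieve`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: cosine-ranked retrieval over a map-based inverted index. */
class RetrievalSketch {
    static List<Map.Entry<String, Double>> retrieve(
            Map<String, Integer> queryCounts,               // Q: token -> count K
            Map<String, Map<String, Integer>> postings,     // H: token -> (doc -> count C)
            Map<String, Double> idf,                        // token -> I
            Map<String, Double> docLength) {                // doc -> Y
        Map<String, Double> scores = new HashMap<>();       // R
        double querySumOfSquares = 0.0;
        for (Map.Entry<String, Integer> q : queryCounts.entrySet()) {
            Double i = idf.get(q.getKey());
            if (i == null) {
                continue; // assumption: tokens absent from the index are ignored
            }
            double w = q.getValue() * i;                    // W = K * I
            querySumOfSquares += w * w;
            Map<String, Integer> occList =
                    postings.getOrDefault(q.getKey(), Collections.emptyMap());
            for (Map.Entry<String, Integer> occ : occList.entrySet()) {
                // Adds W * (I * C): query weight times document weight for this token.
                scores.merge(occ.getKey(), w * i * occ.getValue(), Double::sum);
            }
        }
        double queryLength = Math.sqrt(querySumOfSquares);  // L
        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (Map.Entry<String, Double> s : scores.entrySet()) {
            double norm = queryLength * docLength.get(s.getKey());
            double score = (norm == 0.0) ? 0.0 : s.getValue() / norm;  // S / (L * Y)
            ranked.add(new AbstractMap.SimpleEntry<String, Double>(s.getKey(), score));
        }
        // Highest cosine similarity first.
        ranked.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        return ranked;
    }
}
```

Only documents that appear in the postings list of some query token are ever touched, which is where the O(|Q| B) retrieval cost comes from.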

Efficiency Note
To save computation and an extra iteration through the tokens in the query, in VSR, the computation of the length of the query vector is integrated with the processing of query tokens during retrieval.

User Interface
Until user terminates with an empty query:
    Prompt user to type a query, Q.
    Compute the ranked array of retrievals R for Q;
    Print the names of the top N documents in R;
    Until user terminates with an empty command:
        Prompt user for a command for this query result:
            1) Show next N retrievals;
            2) Show the Mth retrieved document (document shown in Firefox window);

Running VSR
Invoke the system using the main method of InvertedIndex.
java ir.vsr.InvertedIndex <corpus-directory>
Make sure your CLASSPATH has /u/mooney/ir-code.

Will index all files in a directory and then process queries interactively. Optional flags include:
-html: Strips HTML tags from files
-stem: Stems tokens with Porter stemmer

Sample Document Corpus

900 science pages from the web.
300 random samples each from the Yahoo indices for biology, physics, and chemistry.
In /u/mooney/ir-code/corpora/yahoo-science/
Probably best to use the -html flag.
Sample trace with this corpus at:
http://www.cs.utexas.edu/users/mooney/ir-course/proj1/sample-trace
