CS8080 Unit 3: Text Classification and Clustering
DEPARTMENT OF CSE
CS8080 INFORMATION RETRIEVAL TECHNIQUES
Machine Learning
Algorithms that learn patterns in the data; the learned patterns allow making predictions about new data.
Learning algorithms use training data and can be of three types:
supervised learning
unsupervised learning
semi-supervised learning
Ex: Supervised Learning Dataset
In the example, there are 2 classes, c1 and c2, assigned to the data items. Using these labeled examples, a new data item X can be classified as c1 or c2.
The Text Classification Problem
D: a collection of documents
C = {c1, c2, . . . , cL}: a set of L classes
Classification function F : D × C → [0, 1], with F(dj, cp) = 1 if dj is a member of class cp, and 0 otherwise.
Supervised algorithms depend on a training set:
a set of classes with example documents for each class, the examples determined by human specialists
the training set is used to learn a classification function
The larger the number of training examples, the better the fine tuning of the classifier.
Overfitting: the classifier becomes too specific to the training examples.
To evaluate the classifier, use a set of unseen objects, commonly referred to as the test set.
Unsupervised learning: the input data is a set of documents to classify; not even class labels are provided.
Task of the classifier: separate the documents into subsets (clusters) automatically; this separating procedure is called clustering.
Example: clustering.
Class labels can be generated automatically, but they are different from labels specified by humans and are usually of much lower quality.
Thus, solving the whole classification problem with no human intervention is hard. If class labels are provided, clustering is more effective.
K-means Clustering
Document representations in clustering
Vector space model
As in vector space classification, we measure relatedness between vectors by Euclidean distance, which is almost equivalent to cosine similarity.
Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid.
Recall the definition of the centroid: the centroid of a cluster ω is the average of its vectors, μ(ω) = (1/|ω|) Σ_{x∈ω} x.
We try to find the minimum average squared difference by iterating two steps:
reassignment: assign each vector to its closest centroid
recomputation: recompute each centroid as the average of the vectors
that were assigned to it in reassignment
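A minimal Python sketch of these two steps (plain Python, Euclidean distance, a fixed number of iterations, and a few illustrative 2-D points instead of document vectors):

    # Minimal K-means sketch: iterate reassignment and recomputation.
    import math, random

    def kmeans(points, k, iterations=10):
        centroids = random.sample(points, k)      # pick K initial centroids
        for _ in range(iterations):
            # Reassignment: assign each vector to its closest centroid
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Recomputation: each centroid becomes the average of its vectors
            for i, members in enumerate(clusters):
                if members:
                    centroids[i] = tuple(sum(x) / len(members) for x in zip(*members))
        return centroids, clusters

    centroids, clusters = kmeans([(1, 1), (2, 1), (4, 3), (5, 4)], k=2)
    print(centroids)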
Output: a set of K clusters. Steps: iterate reassignment and recomputation, as in the example below.
Example:
Suppose we have several objects (4 types of medicines), and each object has 2 attributes or features, as shown below. Our goal is to group these objects into k = 2 groups of medicine based on the features (pH and Weight Index).
Iteration 0 (K = 2):
The initial centroids are chosen from the data points (e.g., B(2, 1)).
Reassign:
Calculate the distance matrix D0 (the distance from each object to each centroid).
Take the minimum distance: in the distance matrix D0, for each column, put 1 in the group matrix for the row with the minimum value.
Recompute:
Calculate the new centroids c1 and c2:
New c1 = centroid of {A}
New c2 = centroid of {B, C, D}
Iteration 1:
Recompute
Iteration 2:
Reassign
Hierarchical Clustering
There are two types of hierarchical clustering, Divisive and Agglomerative.
Hierarchical
o Agglomerative
o Divisive
The method used for computing cluster distances defines three variants of the algorithm:
1. single-linkage
2. complete-linkage
3. average-linkage
Methods to find closest pair of clusters:
Single Linkage
In single-linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points, one in each cluster. For example, the distance between clusters "r" and "s" is equal to the length of the arrow between their two closest points.
Complete Linkage
In complete-linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points, one in each cluster. For example, the distance between clusters "r" and "s" is equal to the length of the arrow between their two furthest points.
Average linkage
In average-linkage clustering, the distance between two clusters is defined as the average distance between all pairs of points, one from each cluster.
Example:
Let's now see a simple example: a hierarchical clustering of distances in kilometers between some Italian cities. The method used is single-linkage.
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
The nearest pair of cities is MI and TO, at distance 138. These are merged
into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO)
= 138 and the new sequence number is m=1
Then we compute the distance from this new compound object to all other
objects. In single link clustering the rule is that the distance from the
compound object to another object is equal to the shortest distance from
any member of the cluster to the outside object. So the distance from
"MI/TO" to RM is chosen to be 564, which is the distance from MI to RM,
and so on.
After merging MI with TO we obtain:
BA FI MI/TO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MI/TO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
Dendrogram:
min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m=2
After merging NA and RM we obtain:
Dendrogram:
min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m=3
min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m=4
After merging
BA/NA/RM and FI
Dendrogram:
Finally, we merge the last two clusters at level 295.
The process is summarized by the following dendrogram.
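The merge sequence above can be reproduced with a short script. The sketch below assumes SciPy is available and feeds the condensed distance matrix of the six cities to single-linkage clustering; the printed levels are 138, 219, 255, 268 and 295, as in the dendrogram.

    # Single-linkage hierarchical clustering of the Italian-cities distance matrix.
    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    #               BA   FI   MI   NA   RM   TO
    D = np.array([[  0, 662, 877, 255, 412, 996],
                  [662,   0, 295, 468, 268, 400],
                  [877, 295,   0, 754, 564, 138],
                  [255, 468, 754,   0, 219, 869],
                  [412, 268, 564, 219,   0, 669],
                  [996, 400, 138, 869, 669,   0]], dtype=float)

    # linkage() expects a condensed (upper-triangular) distance vector
    Z = linkage(squareform(D), method="single")
    for left, right, level, size in Z:
        print(f"merge clusters {int(left)} and {int(right)} at level {level:.0f} (size {int(size)})")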
Naive Classification
Classes and their labels are given as input, with no training examples.
Input:
collection D of documents
set C = {c1, c2, . . . , cL} of L classes and their labels
Algorithm: associate one or more classes of C with each doc in D
match document terms to class labels; permit partial matches
improve coverage by defining alternative class labels (i.e., synonyms), as sketched below
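A minimal sketch of this naive matching in Python (the class names and synonym lists below are just placeholders):

    # Naive classification: assign the classes whose label (or a synonym) occurs in the document.
    def naive_classify(document, class_labels):
        """class_labels maps a class name to a set of alternative labels (synonyms)."""
        terms = set(document.lower().split())
        assigned = []
        for cls, labels in class_labels.items():
            if terms & {label.lower() for label in labels}:   # match document terms to class labels
                assigned.append(cls)
        return assigned

    classes = {"sports": {"sports", "football", "match"},
               "economy": {"economy", "market", "finance"}}
    print(naive_classify("The football match ended in a draw", classes))   # ['sports']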
The training phase of a classifier
3.5. Decision Tree
Process Involved
1) Construction of Decision Tree
2) Classification of Query instance
A Decision Tree (DT) allows predicting the value of a given attribute.
Example: a DT to predict the value of the attribute Play, given Outlook, Humidity, and Windy.
Classification of Documents
For document classification
with each internal node associate an index term
with each leaf associate a document class
with the edges associate binary predicates that indicate presence/absence of
index term
Decision tree model for class cp can be built using a recursive splitting
strategy
first step: associate all documents with the root
second step: select index terms that provide a good separation of the
documents
third step: repeat until the tree is complete (see the sketch below)
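As an illustration of this splitting strategy, the sketch below (assuming scikit-learn is installed; the tiny term-presence matrix and class labels are invented for the example) builds a decision tree whose internal nodes test the presence or absence of index terms:

    # Decision tree over binary index-term features (presence/absence).
    from sklearn.tree import DecisionTreeClassifier, export_text

    terms = ["ball", "election", "goal"]            # index terms (illustrative)
    X = [[1, 0, 1],                                 # doc 1: ball, goal       -> sports
         [1, 0, 0],                                 # doc 2: ball             -> sports
         [0, 1, 0],                                 # doc 3: election         -> politics
         [0, 1, 1]]                                 # doc 4: election, goal   -> politics
    y = ["sports", "sports", "politics", "politics"]

    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(export_text(tree, feature_names=terms))   # internal nodes test index terms
    print(tree.predict([[0, 0, 1]]))                # classify a new document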
3.6. k-NN Classifier
a. 1-Nearest Neighbor
b. K-Nearest Neighbor using a majority voting scheme
c. K-NN using a weighted-sum voting Scheme
kNN: How to Choose k?
In theory, if an infinite number of samples is available, the larger k is, the better the classification.
The limitation is that all k neighbors have to be close:
possible when an infinite number of samples is available
impossible in practice, since the number of samples is finite
k = 1 is often used for efficiency, but it is sensitive to "noise"
With one choice of k, every example in the blue shaded area will be misclassified as the blue class (rectangle); with the other, every example in the blue shaded area will be classified correctly as the red class (circle).
It should also be noted that for categorical variables the Hamming distance must be used. This also raises the issue of standardizing the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.
Simple kNN algorithm:
Training: for each training example, add the example to the list training_examples.
Classification: given a query instance, find its k nearest neighbors in training_examples and return the majority class among them (see the sketch below).
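A minimal sketch of the algorithm (plain Python, Euclidean distance, majority vote); the small training list is illustrative and is not the credit-default table used in the example that follows:

    # Simple k-NN with majority voting.
    import math
    from collections import Counter

    def knn_classify(training_examples, query, k=3):
        """training_examples: list of (feature_vector, label) pairs."""
        # rank the training examples by Euclidean distance to the query
        ranked = sorted(training_examples, key=lambda ex: math.dist(ex[0], query))
        top_k_labels = [label for _, label in ranked[:k]]
        return Counter(top_k_labels).most_common(1)[0][0]   # majority class

    train = [((25, 40000), "Y"), ((35, 60000), "N"), ((45, 80000), "N"), ((20, 20000), "Y")]
    print(knn_classify(train, query=(33, 150000), k=3))     # -> 'N' for this toy data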
Example:
Consider the following data concerning credit default. Age and Loan are two numerical variables (predictors) and Default is the target.
Data to classify:
We want to classify an unknown case (Age = 48 and Loan = $142,000) using Euclidean distance.
Step 3: sort the distances (refer to the diagram above) and mark the entries up to the k-th rank, i.e., 1 to 3.
With K = 3, there are two Default = Y and one Default = N among the three closest neighbors. The prediction for the unknown case is Default = Y.
Advantages
• Can be applied to data from any distribution
• for example, the data does not have to be separable by a linear boundary
• Very simple and intuitive
• Good classification if the number of samples is large enough
Disadvantages
• Choosing k may be tricky
• Test stage is computationally expensive
• No training stage, all the work is done during the test stage
• This is actually the opposite of what we want. Usually we can afford a training step that takes a long time, but we want a fast test step
• Need large number of samples for accuracy
Our example in a 2-dimensional system of coordinates
Classification of Documents
Another solution
consider binary classifier for each pair of classes cp and cq
all training documents of one class are the positive examples; all documents from the other class are the negative examples
Feature Selection
Large feature space
might render document classifiers impractical
Classic solution
select a subset of all features to represent the documents
called feature selection
reduces the dimensionality of the document representation
reduces overfitting
That is,

MI(k_i, C) = \sum_{p=1}^{L} P(c_p) \, I(k_i, c_p)
           = \sum_{p=1}^{L} \frac{n_p}{N_t} \log \frac{n_{i,p}/N_t}{(n_i/N_t) \times (n_p/N_t)}

IG(k_i, C) = - \sum_{p=1}^{L} P(c_p) \log P(c_p)
             + \sum_{p=1}^{L} P(k_i, c_p) \log P(c_p | k_i)
             + \sum_{p=1}^{L} P(\bar{k}_i, c_p) \log P(c_p | \bar{k}_i)
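A small sketch of how these quantities can be computed from raw counts, reading the symbols as n_{i,p} = docs of class c_p containing k_i, n_i = docs containing k_i, n_p = docs in c_p and N_t = total training docs (base-2 logs; the counts at the bottom are invented):

    # Mutual information and information gain of a term k_i with respect to the classes.
    import math

    def mi_and_ig(n_ip, n_i, n_p, N_t):
        """n_ip[p], n_p[p]: per-class counts; n_i: docs containing k_i; N_t: total docs."""
        mi = ig = 0.0
        for nip, np_ in zip(n_ip, n_p):
            P_cp = np_ / N_t
            ig -= P_cp * math.log2(P_cp)                      # - sum P(c_p) log P(c_p)
            if nip > 0:
                mi += P_cp * math.log2((nip / N_t) / ((n_i / N_t) * P_cp))
                ig += (nip / N_t) * math.log2(nip / n_i)      # + P(k_i, c_p) log P(c_p | k_i)
            rest = np_ - nip                                  # docs of c_p without k_i
            if rest > 0:
                ig += (rest / N_t) * math.log2(rest / (N_t - n_i))
        return mi, ig

    print(mi_and_ig(n_ip=[40, 10], n_i=50, n_p=[60, 140], N_t=200))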
Further let
T : Dt × C → [0, 1]: training set function
nt : number of docs from training set Dt in class cp
F : D × C → [0, 1]: text classifier function
nf : number of docs from training set assigned to class cp by the
classifier
nf,t : number of docs that both the training and classifier functions
assigned to class cp
nt − nf,t : number of training docs in class cp that were misclassified
The remaining quantities are calculated analogously
F_1(c_p) = \frac{2 P(c_p) R(c_p)}{P(c_p) + R(c_p)}
                 T(dj, cp) = 1    T(dj, cp) = 0    total
F(dj, cp) = 1          10                0            10
F(dj, cp) = 0          10              980           990
all docs               20              980         1,000
micF_1 = \frac{2PR}{P + R}, \quad
P = \frac{\sum_{c_p \in C} n_{f,t}}{\sum_{c_p \in C} n_f}, \quad
R = \frac{\sum_{c_p \in C} n_{f,t}}{\sum_{c_p \in C} n_t}
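For the contingency table above these quantities follow directly from the counts; a small sketch (the second class used in the micro-average call is invented):

    # Per-class precision, recall, F1, and micro-averaged F1 from (nf,t, nf, nt) counts.
    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    def per_class(nft, nf, nt):
        """nft: docs assigned to c_p by both classifier and training set;
           nf: docs the classifier assigns to c_p; nt: training docs in c_p."""
        P, R = nft / nf, nft / nt
        return P, R, f1(P, R)

    print(per_class(nft=10, nf=10, nt=20))        # -> (1.0, 0.5, 0.666...)

    def micro_f1(per_class_counts):               # list of (nft, nf, nt) tuples
        sft, sf, st = map(sum, zip(*per_class_counts))
        return f1(sft / sf, sft / st)

    print(micro_f1([(10, 10, 20), (30, 40, 40)]))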
classifier Ψi
training, or tuning, done on Dt minus the ith fold
testing done on the ith fold
Chapter 9
Indexing and Searching
with Gonzalo Navarro
Introduction
Inverted Indexes
Signature Files
Suffix Trees and Suffix Arrays
Sequential Searching
Multi-dimensional Indexing
(Slides: Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010)
Introduction
Although efficiency might seem a secondary issue
compared to effectiveness, it can rarely be neglected
in the design of an IR system
Efficiency in IR systems: to process user queries with
minimal requirements of computational resources
As we move to larger-scale applications, efficiency
becomes more and more important
For example, in Web search engines that index terabytes of data
and serve hundreds or thousands of queries per second
The four example documents:
d1: To do is to be. To be is to do.
d2: To be or not to be. I am what I am.
d3: I think therefore I am. Do be do be do.
d4: Do do do, da da da. Let it be, let it be.

Term-document frequencies (ni = number of documents containing the term):

Vocabulary   ni   d1   d2   d3   d4
to            2    4    2    -    -
do            3    2    -    3    3
is            1    2    -    -    -
be            4    2    2    2    2
or            1    -    1    -    -
not           1    -    1    -    -
I             2    -    2    2    -
am            2    -    2    1    -
what          1    -    1    -    -
think         1    -    -    1    -
therefore     1    -    -    1    -
da            1    -    -    -    3
let           1    -    -    -    2
it            1    -    -    -    2
Example text (the numbers give the character position where each word begins):
In=1, theory=4, there=12, is=18, no=21, difference=24, between=35, theory=43, and=50, practice=54, In=64, practice=67, there=77, is=83

Text: In theory, there is no difference between theory and practice. In practice, there is.

Vocabulary    Occurrences
between       35
difference    24
practice      54, 67
theory        4, 43
Text: This is a text. A text has many words. Words are made from letters.

Inverted index:
Vocabulary    Occurrences
letters       4 ...
made          4 ...
many          2 ...
text          1, 2 ...
words         3 ...
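A minimal sketch of building such an inverted index in Python (here the occurrences are word positions rather than the character or block numbers shown in the figures; the stopword list is illustrative):

    # Build a simple inverted index: term -> list of word positions in the text.
    import re
    from collections import defaultdict

    def build_index(text, stopwords=()):
        index = defaultdict(list)
        for position, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
            if word not in stopwords:
                index[word].append(position)
        return dict(index)

    text = "This is a text. A text has many words. Words are made from letters."
    print(build_index(text, stopwords={"this", "is", "a", "has", "are", "from"}))
    # {'text': [4, 6], 'many': [8], 'words': [9, 10], 'made': [12], 'letters': [14]}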
Sizes of an inverted index, as approximate percentages of the text size, for different addressing granularities (two values per collection):

                        Small collection   Medium collection   Large collection
Addressing documents      19%    26%          18%    32%          26%    47%
Addressing 64K blocks     27%    41%          18%    32%           5%     9%
Addressing 256 blocks     18%    25%          1.7%   2.4%         0.5%   0.7%
[Figure: (a) a query syntax tree combining the terms 'syntax', 'translation', and 'syntactic' with AND/OR operators, with the inverted lists attached to the leaves, and (b) its bottom-up evaluation over those lists.]
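The Boolean operators in such a tree boil down to merging sorted inverted lists. A small sketch of AND (intersection) and OR (union) over sorted document-id lists; the three lists and the query shape, syntax AND (translation OR syntactic), are illustrative:

    # Merge sorted posting lists: AND = intersection, OR = union.
    def and_merge(a, b):
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    def or_merge(a, b):
        return sorted(set(a) | set(b))

    syntax, translation, syntactic = [1, 4, 6], [2, 3, 7], [4, 6]   # illustrative lists
    print(and_merge(syntax, or_merge(translation, syntactic)))      # [4, 6]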
As our collection is very small, let us assume that we
are interested in the top-2 ranked documents
We can use the following heuristic:
we process terms in idf order (shorter lists first), and
each term is processed in tf order (simple ranking order)
Text
letters: 60
"l"
"d" made: 50
"m" "a"
1 between
4:4 2:24 1:35 4:43 3:54 3:67
2 difference
3 practice sort
4 theory
1:35 2:24 3:54 3:67 4:4 4:43
Vocabulary
identify headers
Occurrences 35 24 54 67 4 43
A non-trivial issue is how the memory for the many lists of occurrences should be allocated.
A classical list in which each element is allocated individually
wastes too much space
Instead, a scheme where a list of blocks is allocated, each block
holding several entries, is preferable
[Figure: hierarchical merging of partial indexes: the initial dumps I_1 ... I_8 (level 1) are merged pairwise, level by level, until the final index I_1..8 is obtained at level 4; the numbers indicate the order of the merges.]
Simple Strings: Brute Force
The brute force algorithm tries out all the possible pattern positions in the text and checks them one by one.
Simple Strings: Brute Force
Example: searching a sample text for the pattern 'abracadabra' by brute force.
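A direct Python rendering of the brute force idea (check every window of the text against the pattern; the sample text is illustrative):

    # Brute force string matching: test every possible window position.
    def brute_force(text, pattern):
        m = len(pattern)
        return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

    print(brute_force("abraca bracadabra abracadabra", "abracadabra"))   # [18]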
Simple Strings: Horspool
Horspool’s algorithm is in the fortunate position of
being very simple to understand and program
It is the fastest algorithm in many situations, especially
when searching natural language texts
Horspool’s algorithm uses the previous idea to shift the
window in a smarter way
A table d indexed by the characters of the alphabet is
precomputed:
d[c] tells how many positions the window can be shifted if the final character of the window is c
In other words, d[c] is the distance from the end of the
pattern to the last occurrence of c in P , excluding the
occurrence of pm
Simple Strings: Horspool
The Figure repeats the previous example, now also
applying Horspool’s shift
[Figure: the same text ('abraca bracadabra ...') and pattern ('abracadabra'), now shifting the window with Horspool's d table; fewer window positions are tried.]
Simple Strings: Horspool
Pseudocode for Horspool’s string matching algorithm
Horspool (T = t1 t2 . . . tn , P = p1 p2 . . . pm )
(1) for c ∈ Σ do d[c] ← m
(2) for j ← 1 . . . m − 1 do d[pj ] ← m − j
(3) i←0
(4) while i ≤ n − m do
(5) j←1
(6) while j ≤ m ∧ ti+j = pj do j ← j + 1
(7) if j > m then report an occurrence at text position i + 1
(8) i ← i + d[ti+m ]
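The pseudocode translates almost line by line into Python; a sketch using 0-based indexing instead of the 1-based indexing above:

    # Horspool string matching: shift the window using its last character.
    def horspool(text, pattern):
        n, m = len(text), len(pattern)
        d = {}                                     # default shift (missing chars) is m
        for j in range(m - 1):                     # d[p_j] = m - j, excluding the last char
            d[pattern[j]] = m - 1 - j
        occurrences, i = [], 0
        while i <= n - m:
            if text[i:i + m] == pattern:
                occurrences.append(i)              # occurrence at text position i (0-based)
            i += d.get(text[i + m - 1], m)         # shift by the final window character
        return occurrences

    print(horspool("abraca bracadabra abracadabra", "abracadabra"))   # [18]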
Small alphabets and long patterns
When searching for long patterns over small alphabets
Horspool’s algorithm does not perform well
Imagine a computational biology application where strings of 300
nucleotides over the four-letter alphabet {A, C, G, T} are sought
Small alphabets and long patterns
In general we can shift using q characters at the end of
the window: which is the best value for q ?
We cannot shift by more than m, and thus σ^q ≤ m seems to be a natural limit.
If we set q = log_σ m, the average search time will be O(n log_σ(m) / m).
Small alphabets and long patterns
This technique is used in the agrep software
A hash function is chosen to map q -grams (strings of
length q ) onto an integer range
Then the distance from each q -gram of P to the end of
P is recorded in the hash table
For the q -grams that do not exist in P , distance
m − q + 1 is used
Small alphabets and long patterns
Pseudocode for the agrep’s algorithm to match long
patterns over small alphabets (simplified)
Agrep (T = t1 t2 . . . tn , P = p1 p2 . . . pm , q, h( ), N )
(1) for i ∈ [1, N ] do d[i] ← m − q + 1
(2) for j ← 0 . . . m − q do d[h(pj+1 pj+2 . . . pj+q )] ← m − q − j
(3) i←0
(4) while i ≤ n − m do
(5) s ← d[h(ti+m−q+1 ti+m−q+2 . . . ti+m )]
(6) if s > 0 then i ← i + s
(7) else
(8) j←1
(9) while j ≤ m ∧ ti+j = pj do j ← j + 1
(10) if j > m then report an occurrence at text position i + 1
(11) i←i+1
Automata and Bit-Parallelism
Horspool’s algorithm, as well as most classical
algorithms, does not adapt well to complex patterns
We now show how automata and bit-parallelism
permit handling many complex patterns
Automata
Figure below shows, on top, a NFA to search for the
pattern P = abracadabra
The initial self-loop matches any character
Each table column corresponds to an edge of the automaton
[Figure: NFA to search for the pattern 'abracadabra'; the table of bit masks below has one column per automaton edge.]

        a b r a c a d a b r a
B[a] =  0 1 1 0 1 0 1 0 1 1 0
B[b] =  1 0 1 1 1 1 1 1 0 1 1
B[r] =  1 1 0 1 1 1 1 1 1 0 1
B[c] =  1 1 1 1 0 1 1 1 1 1 1
B[d] =  1 1 1 1 1 1 0 1 1 1 1
B[*] =  1 1 1 1 1 1 1 1 1 1 1
Automata
It can be seen that the NFA in the previous Figure
accepts any string that finishes with P =
‘abracadabra’
The initial state is always active because of the self-loop
that can be traversed by any character
Note that several states can be simultaneously active
For example, after reading ‘abra’, NFA states 0, 1, and 4 will be
active
Bit-parallelism and Shift-And
Bit-parallelism takes advantage of the intrinsic
parallelism of bit operations
Bit masks are read right to left, so that the first bit of
bm . . . b1 is b1
Bit masks are handled with operations like:
| to denote the bit-wise or
& to denote the bit-wise and, and
^ to denote the bit-wise xor
Bit-parallelism and Shift-And
In addition:
mask << i means shifting all the bits in mask by i positions to
the left, entering zero bits from the right
mask >> i is analogous
Bit-parallelism and Shift-And
The simplest bit-parallel algorithm permits matching
single strings, and it is called Shift-And
The algorithm builds a table B which, for each
character, stores a bit mask bm . . . b1
The mask in B[c] has the i-th bit set if and only if pi = c
Bit-parallelism and Shift-And
Pseudocode for the Shift-And algorithm
Shift-And (T = t1 t2 . . . tn , P = p1 p2 . . . pm )
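A compact Python sketch of Shift-And as just described: build the table B of bit masks, then update a state register D once per text character.

    # Shift-And: bit-parallel simulation of the NFA for a single pattern.
    def shift_and(text, pattern):
        m = len(pattern)
        B = {}
        for i, c in enumerate(pattern):            # bit i of B[c] set iff pattern[i] == c
            B[c] = B.get(c, 0) | (1 << i)
        D, occurrences = 0, []
        for j, c in enumerate(text):
            D = ((D << 1) | 1) & B.get(c, 0)       # extend every active prefix with c
            if D & (1 << (m - 1)):                 # highest bit set: the whole pattern matched
                occurrences.append(j - m + 1)      # occurrence starting here (0-based)
        return occurrences

    print(shift_and("abracadabra abracadabra", "abracadabra"))   # [0, 12]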
[Figure: NFA with states 0-9 for the pattern 'neighbour'; an ε-transition makes one position optional, so that 'neighbor' is also matched.]
Extending Shift-And
Another feature in complex patterns is the use of wild
cards, or more generally repeatable characters
Those are pattern positions that can appear once or more times,
consecutively, in the text
Extending Shift-And
As another example, we might look for well known,
yet there might be a hyphen or one or more spaces
For instance ‘well known’, ‘well known’, ‘well-known’,
‘well - known’, ‘well \n known’, and so on
[Figure: NFA with states 0-10 for the pattern 'well known', where the separator position (sep) is repeatable.]
Extending Shift-And
Figure below shows pseudocode for a Shift-And
extension that handles all these cases
Shift-And-Extended (T = t1 t2 . . . tn , m, B[ ], A, S)
Faster Bit-Parallel Algorithms
There exist some algorithms that can handle complex
patterns and still skip text characters (like Horspool)
For instance, Suffix Automata and Interlaced Shift-And algorithms
Suffix Automata
The suffix automaton of a pattern P is an automaton
that recognizes all the suffixes of P
Below we present a non-deterministic suffix automaton
for P = ‘abracadabra’
[Figure: non-deterministic suffix automaton for P = 'abracadabra': an initial state I with ε-transitions into states 0-11, which spell out a b r a c a d a b r a.]
Suffix Automata
To search for pattern P , the suffix automaton of P rev
(the reversed pattern) is built
The algorithm scans the text window backwards and
feeds the characters into the suffix automaton of P rev
If the automaton runs out of active states after scanning
ti+m ti+m−1 . . . ti+j , this means that ti+j ti+j+1 . . . ti+m is
not a substring of P
Thus, no occurrence of P can contain this substring,
and the window can be safely shifted past ti+j
If, instead, we reach the beginning of the window and
the automaton still has active states, this means that
the window is equal to the pattern
Suffix Automata
The need to implement the suffix automaton and make
it deterministic makes the algorithm more complex
An attractive variant, called BNDM, implements the
suffix automaton using bit-parallelism
It achieves improved performance when the pattern is
not very long
say, at most twice the number of bits in the computer word
Suffix Automata
Pseudocode for BNDM algorithm:
BNDM (T = t1 t2 . . . tn , P = p1 p2 . . . pm )
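As a rough illustration of the idea, a bit-parallel BNDM sketch in Python: the masks are built over the reversed pattern, each window is scanned backwards, and the shift is given by the longest pattern prefix recognized inside the window.

    # BNDM: backward window scan with a bit-parallel suffix automaton.
    def bndm(text, pattern):
        n, m = len(text), len(pattern)
        B = {}
        for i, c in enumerate(reversed(pattern)):      # bit i of B[c]: pattern[m-1-i] == c
            B[c] = B.get(c, 0) | (1 << i)
        full, occurrences, pos = (1 << m) - 1, [], 0
        while pos <= n - m:
            j, last, D = m, m, full
            while D:
                D &= B.get(text[pos + j - 1], 0)       # read the window right to left
                j -= 1
                if D & (1 << (m - 1)):                 # a pattern prefix ends here
                    if j > 0:
                        last = j                       # remember it: it gives the next shift
                    else:
                        occurrences.append(pos)        # the whole window equals the pattern
                        break
                D = (D << 1) & full
            pos += last
        return occurrences

    print(bndm("abraca bracadabra abracadabra", "abracadabra"))   # [18]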
Interlaced Shift-And
Another idea to achieve optimal average search time is
to read one text character out of q
To fix ideas, assume P = neighborhood and q = 3
If we read one text position out of 3, and P occurs at
some text window ti+1 ti+2 . . . ti+m then we will read
either ‘ngoo’, ‘ehro’, or ‘ibhd’ at the window
Therefore, it is sufficient to search simultaneously for
the three subsequences of P
Interlaced Shift-And
Now the initial state can activate the first q positions of
P , and the bit-parallel shifts are by q positions
A non-deterministic suffix automaton for interlaced
searching of P = ‘neighborhood’ with q = 3 is:
[Figure: non-deterministic automaton with states 0-12 for interlaced searching of P = 'neighborhood' (n e i g h b o r h o o d) with q = 3.]
Interlaced Shift-And
Pseudocode for Interlaced Shift-And algorithm with
sampling step q (simplified):
Interlaced-Shift-And (T = t1 t2 . . . tn , P = p1 p2 . . . pm , q)
Regular Expressions
The first part in processing a regular expression is to
build an NFA from it
There are different NFA construction methods
Regular Expressions
Recursive Thompson’s construction of an NFA from a
regular expression
[Figure: Thompson's recursive construction: Th(ε) and Th(a) are single transitions labeled ε and a; Th(E . E') chains Th(E) into Th(E'); Th(E | E') places Th(E) and Th(E') in parallel between new initial and final states linked by ε-transitions; Th(E*) surrounds Th(E) with ε-transitions that allow skipping or repeating it.]
Regular Expressions
Once the NFA is built we add a self-loop (traversable by
any character) at the initial state
Another alternative is to make the NFA deterministic,
converting it into a DFA
However, the number of states can grow non-linearly, even exponentially in the worst case.
Multiple Patterns
Several of the algorithms for single string matching can
be extended to handle multiple strings
P = {P1 , P2 , . . . , Pr }
For example, we can extend Horspool so that d[c] is the minimum
over the di [c] values of the individual patterns Pi
Approximate Searching
A simple string matching problem where not only a
string P must be reported, but also text positions where
P occurs with at most k ‘errors’
Different definitions of what is an error can be adopted
The simplest definition is the Hamming distance that allows just
substitutions of characters
A very popular one corresponds to the so-called
Levenshtein or edit distance:
An error is the deletion, insertion, or substitution of a single character
C[0, j] = 0
C[i, 0] = i
C[i, j] = C[i − 1, j − 1]                                       if p_i = t_j
C[i, j] = 1 + min(C[i − 1, j], C[i, j − 1], C[i − 1, j − 1])    otherwise
Dynamic programming matrix for searching 'colour' in the text 'kolorama' (the starred entry signals a match with 2 errors ending at that text position):

        k  o  l  o  r  a  m  a
     0  0  0  0  0  0  0  0  0
c    1  1  1  1  1  1  1  1  1
o    2  2  1  2  1  2  2  2  2
l    3  3  2  1  2  2  3  3  3
o    4  4  3  2  1  2  3  4  4
u    5  5  4  3  2  2  3  4  5
r    6  6  5  4  3  2* 3  4  5
Dynamic Programming
Figure below gives the pseudocode for this variant
Approximate-DP (T = t1 t2 . . . tn , P = p1 p2 . . . pm , k)
(1) for i ← 0 . . . m do C[i] ← i
(2) last ← k + 1
(3) for j ← 1 . . . n do
(4) pC, nC ← 0
(5) for i ← 1 . . . last do
(6) if pi = tj then nC ← pC
(7) else
(8) if pC < nC then nC ← pC
(9) if C[i] < nC then nC ← C[i]
(10) nC ← nC + 1
(11) pC ← C[i]
(12) C[i] ← nC
(13) if nC ≤ k
(14) then if last = m then report an occurrence ending at position j
(15) else last ← last + 1
(16) else while C[last − 1] > k do last ← last − 1
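A plain Python version of the same column-wise dynamic programming (without the 'last' cut-off optimization used in the pseudocode), reporting the text positions where the pattern ends with at most k errors:

    # Approximate string matching by dynamic programming (edit distance).
    def approx_search(text, pattern, k):
        m = len(pattern)
        C = list(range(m + 1))                 # C[i]: best distance for pattern[:i] so far
        ends = []
        for j, tc in enumerate(text, start=1):
            diag = 0                           # C[i-1] value from the previous text position
            for i in range(1, m + 1):
                old = C[i]
                if pattern[i - 1] == tc:
                    C[i] = diag
                else:
                    C[i] = 1 + min(diag, C[i - 1], C[i])
                diag = old
            if C[m] <= k:
                ends.append(j)                 # occurrence ending at text position j (1-based)
        return ends

    print(approx_search("kolorama", "colour", 2))   # [5]: 'kolor' matches with 2 errors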
Automata and Bit-parallelism
Approximate string matching can also be expressed as
an NFA search
Figure below depicts an NFA for approximate string
matching for the pattern ‘colour’ with two errors
[Figure: NFA for approximate matching of 'colour' with two errors: three rows of states (no errors, 1 error, 2 errors), with ε, vertical, and diagonal transitions between rows accounting for the allowed errors.]
Automata and Bit-parallelism
Although the search phase is O(n), the NFA tends to be large (it has O(mk) states)
A better solution, based on bit-parallelism, is an
extension of Shift-And
We can simulate k + 1 Shift-And processes while taking
care of vertical and diagonal arrows as well
Automata and Bit-parallelism
Pseudocode for approximate string matching using the
Shift-And algorithm
Approximate-Shift-And (T = t1 t2 . . . tn , P = p1 p2 . . . pm , k)
(1) for c ∈ Σ do B[c] ← 0
(2) for j ← 1 . . . m do B[pj ] ← B[pj ] | (1 << (j − 1))
(3) for i ← 0 . . . k do Di ← (1 << i) − 1
(4) for j ← 1 . . . n do
(5) pD ← D0
(6) nD, D0 ← ((D0 << 1) | 1) & B[tj ]
(7) for i ← 1 . . . k do
(8) nD ← ((Di << 1) & B[tj ]) | pD | ((pD | nD) << 1) | 1
(9) pD ← Di , Di ← nD
(10) if nD & (1 << (m − 1)) ≠ 0
(11) then report an occurrence ending at position j
Filtration
Frequently it is easier to tell that a text position cannot
match than to ensure that it matches with k errors
Filtration is based on applying a fast filter over the text,
which hopefully discards most of the text positions
Then we can apply an approximate search algorithm
over the areas that could not be discarded
A simple and fast filter:
Split the pattern into k + 1 pieces of about the same length
Then we can run a multi-pattern search algorithm for the pieces
If piece p_j . . . p_j′ appears in t_i . . . t_i′ , then we run an approximate string matching algorithm over t_{i−j+1−k} . . . t_{i−j+m+k}
Searching Compressed Text
An extension of traditional compression mechanisms
gives a very powerful way of matching much more
complex patterns
Let us start with phrase queries that can be searched
for by
compressing each of its words and
searching the compressed text for the concatenated string of
target symbols
Searching Compressed Text
A more robust search mechanism is based on word patterns
For example, we may wish to search for:
Any word matching ‘United’ in case-insensitive form and
permitting two errors
Then a separator
And then any word matching ‘States’ in case-insensitive form
and permitting two errors
Searching Compressed Text
Let C be the set of different codewords created by the
compressor
We can take C as an alphabet and see the compressed
text as a sequence of atomic symbols over C
Our pattern has three positions, each denoting a class
of characters:
The first is the set of codewords corresponding to words that
match ‘United’ in case-insensitive form and allowing two errors
The second is the set of codewords for separators and is an
optional class
The third is like the first but for the word ‘States’
Searching Compressed Text
The Figure below illustrates the previous example
, \n 010
\n 010
States 001
any
UNITED 100
state 001
unates 101
unite 100
Vocabulary B[ ] table
(alphabet)
Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 – p. 145
Searching Compressed Text
This process can be used to search for much more
complex patterns
Assume that we wish to search for ‘the number of
elements successfully classified’, or
something alike
Many other phrases can actually mean more or less the
same, for example:
the number of elements classified with success
the elements successfully classified
the number of elements we successfully classified
the number of elements that were successfully classified
the number of elements correctly classified
the number of elements we could correctly classify
...
Searching Compressed Text
To recover from linguistic variants as shown above we
must resort to word-level approximate string matching
In this model, we permit a limited number of missing,
extra, or substituted words
For example, with 3 word-level errors we can recover from all the
variants in the example above
Multi-dimensional Indexing
In multimedia data, we can represent every object by
several numerical features
For example, imagine an image from where we can
extract a color histogram, edge positions, etc
One way to search in this case is to map these object
features into points in a multi-dimensional space
Another approach is to have a distance function for
objects and then use a distance based index
The main mapping methods form three main classes:
R∗ -trees and the rest of the R-tree family,
linear quadtrees,
grid-files
Multi-dimensional Indexing
The R-tree-based methods seem to be most robust for
higher dimensions
The R-tree represents a spatial object by its minimum
bounding rectangle (MBR)
Data rectangles are grouped to form parent nodes,
which are recursively grouped, to form grandparent
nodes and, eventually, a tree hierarchy
Disk pages are consecutive byte positions on the
surface of the disk that are fetched with one disk access
The goal of the insertion, split, and deletion routines is
to give trees that will have good clustering
Multi-dimensional Indexing
Figure below illustrates data rectangles (in black),
organized in an R-tree with fanout 3
Multi-dimensional Search
A range query specifies a region of interest, requiring all
the data regions that intersect it
To answer this query, we first retrieve a superset of the
qualifying data regions:
We compute the MBR of the query region, and then we
recursively descend the R-tree, excluding the branches whose
MBRs do not intersect the query MBR
Thus, the R-tree will give us quickly the data regions whose MBR
intersects the MBR of the query region
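A toy sketch of that recursive descent. The Node structure is hypothetical (not the R-tree layout of any particular library); only the MBR-intersection pruning described above is implemented, and rectangles are (xmin, ymin, xmax, ymax) tuples.

    # Range query over an R-tree-like hierarchy: prune branches whose MBR
    # does not intersect the query MBR.
    def intersects(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    class Node:
        def __init__(self, mbr, children=None, data=None):
            self.mbr, self.children, self.data = mbr, children or [], data

    def range_query(node, query_mbr, result):
        if not intersects(node.mbr, query_mbr):
            return                                   # prune this branch of the tree
        if node.data is not None:
            result.append(node.data)                 # candidate data rectangle
        for child in node.children:
            range_query(child, query_mbr, result)

    leaf1 = Node((0, 0, 2, 2), data="r1")
    leaf2 = Node((5, 5, 7, 7), data="r2")
    root = Node((0, 0, 7, 7), children=[leaf1, leaf2])
    out = []
    range_query(root, (1, 1, 3, 3), out)
    print(out)                                       # ['r1']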
Multi-dimensional Search
The data structure of the R-tree for the previous figure
is (fanout = 3)