
PANIMALAR ENGINEERING COLLEGE

DEPARTMENT OF CSE
CS8080 INFORMATION RETRIEVAL TECHNIQUES

UNIT III TEXT CLASSIFICATION AND CLUSTERING


(Chapters 8 and 9 - Ricardo Baeza-Yates, Modern Information Retrieval)
Characterization of Text Classification – Unsupervised Algorithms: Clustering
– Naïve Text Classification – Supervised Algorithms – Decision Tree – k-NN
Classifier – SVM Classifier – Feature Selection or Dimensionality Reduction –
Evaluation metrics – Accuracy and Error – Organizing the classes – Indexing
and Searching – Inverted Indexes – Sequential Searching – Multi-dimensional
Indexing

3.1. Characterization of Text Classification

Introduction to Text Classification

 Text classification is the process of associating documents with classes. If the
   classes are referred to as categories, the process is called text categorization;
   we consider classification and categorization to be the same process.
 Related problem: partition the documents into subsets with no labels. Since each
   subset has no label, it is not a class; instead, each subset is called a cluster and
   the partitioning process is called clustering. We consider clustering a simpler
   variant of text classification.
 Text classification is a means to organize information.

Example
 Consider a large engineering company in which thousands of documents are
   produced. If properly organized, they can be used for business decisions. Text
   classification is used to organize large document collections and is a key
   technology in modern enterprises.

Machine Learning
 Machine learning algorithms learn patterns in the data; the patterns learned allow
   making predictions about new data.
 Learning algorithms use training data and can be of three types:
    supervised learning
    unsupervised learning
    semi-supervised learning

1. Supervised learning - training data is provided as input. The training data
   specifies the classes of the input documents.

Ex: Supervised Learning Dataset

In the above example, there are two classes c1 and c2 assigned to the data. Using
that, new data X can be classified as c1 or c2.

2. Unsupervised learning - no training data is provided. Examples: neural network
   models, independent component analysis, clustering.

Ex: Unsupervised Learning Dataset before applying the learning algorithm

Ex: Unsupervised Learning Dataset after applying the learning algorithm

3. Semi-supervised learning - a small amount of training data is combined with a
   larger amount of unlabeled data.

The Text Classification Problem

A classifier can be formally defined as follows.

D: a collection of documents

C = {c1, c2, . . . , cL}: a set of L classes with their respective labels

A text classifier is a binary function F : D × C → {0, 1}, which assigns to each pair
[dj, cp], dj ∈ D and cp ∈ C, a value of
 1, if dj is a member of class cp
 0, if dj is not a member of class cp

This is a broad definition that admits both supervised and unsupervised algorithms;
for high accuracy, a supervised algorithm should be used.

 multi-label: one or more labels are assigned to each document


 single-label: a single class is assigned to each document

Classification function F

 defined as a binary function of the document-class pair [dj, cp]
 can be modified to compute the degree of membership of dj in cp
 documents are then ranked as candidates for membership in class cp
 candidates are sorted by decreasing values of F(dj, cp)

Text Classification Algorithms


Text classification algorithms fall into two groups: unsupervised algorithms and
supervised algorithms.

 Supervised algorithms depend on a training set.
 A training set is a set of classes with example documents for each class; the
   examples are determined by human specialists.
 The training set is used to learn a classification function.
 The larger the number of training examples, the better the fine tuning of the
   classifier.
 Overfitting: the classifier becomes too specific to the training examples.
 To evaluate the classifier, use a set of unseen objects, commonly referred to as the
   test set.

3.2. Unsupervised Algorithms: Clustering

Input data: a set of documents to classify; not even class labels are provided.
Task of the classifier: separate the documents into subsets (clusters) automatically.
This separating procedure is called clustering.

Example

Clustering
 Class labels can be generated automatically,
 but they are different from labels specified by humans
 and are usually of much lower quality.
 Thus, solving the whole classification problem with no human intervention is hard;
   if class labels are provided, clustering is more effective.

K-means Clustering

 Input: the number K of clusters to be generated
 Each cluster is represented by the centroid of its documents
 K-means algorithm:
    partition the documents among the K clusters
    assign each document to the cluster with the closest centroid
    recompute the centroids
    repeat the process until the centroids do not change

Document representations in clustering
 Vector space model
 As in vector space classification, we measure relatedness between vectors by
   Euclidean distance, which is almost equivalent to cosine similarity.
 Each cluster in K-means is defined by a centroid.
 Objective/partitioning criterion: minimize the average squared difference from
   the centroid.
 Recall the definition of the centroid:

      μ(ω) = (1/|ω|) Σ_{x ∈ ω} x

   where we use ω to denote a cluster.

 We try to find the minimum average squared difference by iterating two steps:
 reassignment: assign each vector to its closest centroid
 recomputation: recompute each centroid as the average of the vectors
that were assigned to it in reassignment

 K-means starts by selecting K randomly chosen objects as initial cluster centers,
   namely the seeds. It then moves the cluster centers around in space in order to
   minimize the RSS. (A measure of how well the centroids represent the members
   of their clusters is the Residual Sum of Squares: the squared distance of each
   vector from its centroid, summed over all vectors.) This is done iteratively by
   repeating the two steps (reassignment, recomputation) until a stopping criterion
   is met.
 We can use one of the following stopping conditions as the stopping criterion:
    A fixed number of iterations I has been completed.
    Centroids μi do not change between iterations.
    Terminate when the RSS falls below a pre-established threshold.

Algorithm
Input:
K: number of clusters
D: data set containing n objects

Output: a set of K clusters

Steps:

1. Arbitrarily choose k objects from D as the initial cluster centers


2. Repeat
3. Reassign each object to the cluster to which the object is the most similar
based on the distance measure
4. Recompute the centroid for newly formed cluster
5. Until no change

Example :
Suppose we have several objects (4 types of medicines) and each object has 2
attributes or features as shown below. Our goal is to group these objects into k = 2
groups of medicine based on the features (pH and Weight Index).

Object        Attribute 1 (X)   Attribute 2 (Y)
              Weight Index      pH
Medicine A    1                 1
Medicine B    2                 1
Medicine C    4                 3
Medicine D    5                 4

Iteration 0:
K = 2

Initially, Centroid c1 : A(1, 1), Centroid c2 : B(2, 1)

Reassign:
Calculate the distance matrix.

Distance from C(4, 3) to c1(1, 1) is sqrt((4-1)^2 + (3-1)^2) = 3.61

Calculate the distances for all the other objects in the same way.

Take the minimum distance (in the distance matrix D0, take each column and put 1
for the minimum value in the group matrix).

Recompute:
Calculate new c1 and new c2:
New c1 = centroid of {A} = (1, 1)
New c2 = centroid of {B, C, D} = (11/3, 8/3)

Iteration 1:

c1(1,1) , c2(11/3, 8/3)


Reassign
Calculate Distance matrix

Take the minimum distance

Recompute

Iteration 2:

Reassign

We obtain the result G2 = G1. Comparing with the groups of the last iteration, this
iteration does not move any object between groups, so the algorithm stops. Final
result of k-means clustering with 2 clusters:
Object        Attribute 1 (X)   Attribute 2 (Y)   Group (result)
              Weight Index      pH
Medicine A    1                 1                 1
Medicine B    2                 1                 1
Medicine C    4                 3                 2
Medicine D    5                 4                 2
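The iterations above can be reproduced with a short script. A minimal sketch using
plain NumPy (hand-rolled rather than a library routine; the variable names are ours),
seeded with A and B as in iteration 0:

import numpy as np

# Medicine data: (Weight Index, pH) for A, B, C, D
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
centroids = X[:2].copy()          # seeds: A(1,1) and B(2,1), as in iteration 0

while True:
    # Reassign: each object goes to the cluster with the closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    groups = dists.argmin(axis=1)
    # Recompute: each centroid becomes the mean of its assigned objects
    new_centroids = np.array([X[groups == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):   # stop: centroids do not change
        break
    centroids = new_centroids

print(groups + 1)      # expected: [1 1 2 2], matching the table above
print(centroids)       # final centroids: (1.5, 1.0) and (4.5, 3.5)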

Hierarchical Clustering

Goal: to create a hierarchy of clusters by either

 decomposing a large cluster into smaller ones, or
 agglomerating previously defined clusters into larger ones.

The tree-based hierarchical taxonomy built from a set of documents is called a
dendrogram.

There are two types of hierarchical clustering: divisive and agglomerative.

Hierarchical
o Agglomerative
o Divisive

General hierarchical clustering algorithm


1. Input
a set of N documents to be clustered
an N × N similarity (distance) matrix
2. Assign each document to its own cluster
N clusters are produced, containing one document each
3. Find the two closest clusters
merge them into a single cluster
number of clusters reduced to N − 1
4. Recompute distances between new cluster and each old cluster
5. Repeat steps 3 and 4 until one single cluster of size N is produced

Step 4 introduces notion of similarity or distance between two clusters

Method used for computing cluster distances defines three variants of the
algorithm

1. single-linkage
2. complete-linkage
3. average-linkage
Methods to find closest pair of clusters:
Single Linkage
In single linkage hierarchical clustering, the distance between two clusters
is defined as the shortest distance between two points in each cluster. For
example, the distance between clusters “r” and “s” to the left is equal to the
length of the arrow between their two closest points.

Complete Linkage
In complete linkage hierarchical clustering, the distance between two
clusters is defined as the longest distance between two points in each
cluster. For example, the distance between clusters “r” and “s” to the left is
equal to the length of the arrow between their two furthest points.
Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is
defined as the average distance between each point in one cluster and every point
in the other cluster. For example, the distance between clusters "r" and "s" is equal
to the average length of the arrows connecting the points of one cluster to the
points of the other.

Example :
Let us now see a simple example: a hierarchical clustering of distances in
kilometers between some Italian cities. The method used is single-linkage.

Input distance matrix (L = 0 for all the clusters):

BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0

The nearest pair of cities is MI and TO, at distance 138. These are merged
into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO)
= 138 and the new sequence number is m=1
Then we compute the distance from this new compound object to all other
objects. In single link clustering the rule is that the distance from the
compound object to another object is equal to the shortest distance from
any member of the cluster to the outside object. So the distance from
"MI/TO" to RM is chosen to be 564, which is the distance from MI to RM,
and so on.

After merging MI with TO we obtain:

BA FI MI/TO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MI/TO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
Dendrogram:

min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m = 2
After merging NA and RM we obtain:
Dendrogram:

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called
BA/NA/RM
L(BA/NA/RM) = 255
m = 3

After merging BA and NA/RM:


Dendrogram:

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster
called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4

After merging BA/NA/RM and FI
Dendrogram:

Finally, we merge the last two clusters at level 295.
The process is summarized by the following dendrogram.
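The whole single-linkage merge sequence can be reproduced with SciPy. A minimal
sketch, assuming SciPy and NumPy are available (city order and distances as in the
matrix above):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

# Single-linkage agglomerative clustering on the condensed distance matrix
Z = linkage(squareform(D), method="single")
for left, right, dist, size in Z:
    print(int(left), int(right), dist, int(size))
# Merge levels printed: 138 (MI/TO), 219 (NA/RM), 255 (BA joins NA/RM),
# 268 (FI joins), 295 (final merge) - the same sequence as the dendrogram above.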

3.3. Naïve Text Classification

Classes and their labels are given as input, but no training examples are provided.

Naive Classification
Input:
 collection D of documents
 set C = {c1, c2, . . . , cL} of L classes and their labels
Algorithm: associate one or more classes of C with each document in D (a small
sketch follows below):
 match document terms to class labels; permit partial matches
 improve coverage by defining alternative class labels, i.e., synonyms
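A minimal sketch of this label-matching idea; the documents, labels, and synonym
table below are invented for illustration:

# Naive classification: assign the classes whose label (or a synonym of it)
# appears among the document's terms.
synonyms = {                       # hypothetical alternative class labels
    "sports": {"sports", "football", "cricket"},
    "finance": {"finance", "stocks", "banking"},
}

def naive_classify(doc, classes):
    terms = set(doc.lower().split())
    assigned = []
    for label in classes:
        # match: any label word or synonym occurring among the document terms
        if terms & synonyms.get(label, {label}):
            assigned.append(label)
    return assigned

print(naive_classify("Cricket scores and banking news", ["sports", "finance"]))
# -> ['sports', 'finance']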

3.4. Supervised Algorithms

 Depend on a training set


 Dt ⊂ D: subset of training documents
 T : Dt × C → {0, 1}: training set function
 Assigns to each pair [dj, cp], dj ∈ Dt and cp ∈ C, a value of
    1, if dj ∈ cp, according to the judgement of human specialists
    0, if dj ∉ cp, according to the judgement of human specialists
 The training set function T is used to fine tune the classifier

The training phase of a classifier

To evaluate the classifier, use a test set:
 a subset of documents with no intersection with the training set
 the classes of the test documents are determined by human specialists
Evaluation is done in a two-step process:
 use the classifier to assign classes to the documents in the test set
 compare the classes assigned by the classifier with those specified by the human
   specialists

Classification and evaluation processes

Once the classifier has been trained and validated,
 it can be used to classify new and unseen documents
 if the classifier is well tuned, classification is highly effective

3.5. Decision Tree

A decision tree is a tree with the following properties:

 An inner node represents an attribute.
 An edge represents a test on the attribute of the parent node.
 A leaf represents one of the classes.

Process Involved
1) Construction of Decision Tree
2) Classification of Query instance
Classification of Query instance

Training set used to build classification rules


 organized as paths in a tree
 tree paths used to classify documents outside training set
 rules, amenable to human interpretation, facilitate interpretation of
results
Consider the small relational database below

A Decision Tree (DT) allows predicting the values of a given attribute.
DT to predict the values of attribute Play
Given: Outlook, Humidity, Windy

Internal nodes → attribute names
Edges → attribute values

Traversal of the DT → value for attribute "Play".
(Outlook = sunny) ∧ (Humidity = high) → (Play = no)

 Predictions based on seen instances


 New instance that violates seen patterns will lead to erroneous
prediction
 Example database works as training set for building the decision
tree

Construction of Decision Tree

 DT for a database can be built using recursive splitting strategy


 Goal: build DT for attribute Play
 select one of the attributes, other than Play, as root
 use attribute values to split tuples into subsets
 for each subset of tuples, select a second splitting attribute
 repeat

Classification of Documents
For document classification:
 with each internal node, associate an index term
 with each leaf, associate a document class
 with the edges, associate binary predicates that indicate the presence/absence of
   the index term

A decision tree model for class cp can be built using a recursive splitting strategy:
 first step: associate all documents with the root
 second step: select index terms that provide a good separation of the documents
 third step: repeat until the tree is complete

To select splitting terms, use information gain or entropy (a small sketch follows
below). Selection of terms with high information gain tends to
 increase the number of branches at a given level, and
 reduce the number of documents in each resultant subset
 yield smaller and less complex decision trees
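A minimal sketch of entropy-based splitting for document classification; the tiny
document/label arrays are invented for illustration:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Gain from splitting the documents on presence/absence of `term`."""
    with_t  = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    remainder = (len(with_t) / n) * entropy(with_t) + \
                (len(without) / n) * entropy(without)
    return entropy(labels) - remainder

docs   = [{"ball", "goal"}, {"ball", "match"}, {"stock", "bank"}, {"bank", "loan"}]
labels = ["sports", "sports", "finance", "finance"]
# The candidate term with the highest gain is chosen as the splitting term
print(max({"ball", "goal", "loan"}, key=lambda t: information_gain(docs, labels, t)))
# -> 'ball'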

3.6. k-NN Classifier

 K-nearest neighbors is a simple algorithm that stores all available cases and
   classifies new cases based on a similarity measure (e.g., distance functions).
   K represents the number of nearest neighbors.
 It classifies an unknown example with the most common class among the k
   closest examples.
 KNN is based on "tell me who your neighbors are, and I'll tell you who you are".
Example

 If K = 5, then in this case query instance xq will be classified as


negative since three of its nearest neighbors are classified as
negative.
Different Schemes of KNN

a. 1-Nearest Neighbor
b. K-Nearest Neighbor using a majority voting scheme
c. K-NN using a weighted-sum voting Scheme

kNN: How to Choose k?
 In theory, if an infinite number of samples is available, the larger k is, the better
   the classification.
 The limitation is that all k neighbors have to be close:
    possible when an infinite number of samples is available
    impossible in practice, since the number of samples is finite
 k = 1 is often used for efficiency, but it is sensitive to noise.

(Left figure, k too small) every example in the blue shaded area will be misclassified
as the blue class (rectangle); (right figure, larger k) every example in the blue
shaded area will be classified correctly as the red class (circle).

 Larger k gives smoother boundaries, which is better for generalization, but only if
   locality is preserved. Locality is not preserved if we end up looking at samples
   that are too far away and not from the same class.
 Interesting theoretical properties hold if k < sqrt(n), where n is the number of
   examples.
 Find a heuristically optimal number k of nearest neighbors based on the RMSE
   (root-mean-square error). This is done using cross-validation.
 Cross-validation is another way to retrospectively determine a good K value, by
   using an independent dataset to validate the K value. Historically, the optimal K
   for most datasets has been between 3 and 10. That produces much better results
   than 1-NN.

Distance Measures in KNN

Three distance measures are valid for continuous variables.

In the case of categorical variables, the Hamming distance must be used. This also
brings up the issue of standardizing the numerical variables between 0 and 1 when
there is a mixture of numerical and categorical variables in the dataset.

Simple KNN Algorithm:
For each training example, add the example to the list training_examples.

Given a query instance xq to be classified:
 Let x1, x2, ..., xk denote the k instances from training_examples that are nearest
   to xq.
 Return the class that represents the majority of the k instances.

Steps:
1. Determine the parameter k = number of nearest neighbors
2. Calculate the distance between the query instance and all the training samples
3. Sort the distances and determine the nearest neighbors based on the k-th
   minimum distance
4. Gather the categories of the nearest neighbors
5. Use the simple majority of the categories of the nearest neighbors as the
   prediction value of the query instance

Example:
Consider the following data concerning credit default. Age and Loan are two
numerical variables (predictors) and Default is the target.

Given training data set:

Data to classify:
We want to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean
distance.

Step1: Determine parameter k


K=3

Step 2: Calculate the distance


D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y

Step 3: Sort the distances (refer to the diagram above) and mark up to the k-th rank,
i.e., 1 to 3.

Step 4: Gather the category of the nearest neighbors


Age Loan Default Distance
33 $150000 Y 8000
35 $120000 N 22000
60 $100000 Y 42000

With K=3, there are two Default=Y and one Default=N out of the three closest
neighbors. The prediction for the unknown case is therefore Default=Y.
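A minimal sketch that reproduces this prediction. Only the three nearest rows of
the training table are listed (the full training set was shown only as a figure), so
treat the data literal as illustrative:

import math
from collections import Counter

# Rows taken from the nearest-neighbor table above: (Age, Loan, Default)
training = [(33, 150000, "Y"), (35, 120000, "N"), (60, 100000, "Y")]

def knn_predict(query, examples, k=3):
    # Euclidean distance on the raw (unstandardized) features, as in the example
    ranked = sorted(examples,
                    key=lambda r: math.hypot(r[0] - query[0], r[1] - query[1]))
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((48, 142000), training, k=3))   # -> 'Y'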

Advantages
• Can be applied to the data from any distribution
• for example, data does not have to be separable with a linear
boundary
• Very simple and intuitive
• Good classification if the number of samples is large enough

Disadvantages
• Choosing k may be tricky
• Test stage is computationally expensive

• No training stage; all the work is done during the test stage.
• This is actually the opposite of what we want: usually we can afford the
  training step to take a long time, but we want a fast test step.
• A large number of samples is needed for accuracy.

3.7. SVM Classifier

 a vector space method for binary classification problems
 documents are represented in a t-dimensional space
 find a decision surface (hyperplane) that best separates the documents of the two
   classes
 a new document is classified by its position relative to the hyperplane

Simple 2D example: training documents linearly separable

 Line s—The Decision Hyperplane


o maximizes distances to closest docs of each class
o it is the best separating hyperplane
 Delimiting Hyperplanes
o parallel dashed lines that delimit region where to look for a
solution

 Lines that cross the delimiting hyperplanes


o candidates to be selected as the decision hyperplane
o lines that are parallel to delimiting hyperplanes: best candidates
 Support vectors: documents that belong to, and define, the
delimiting hyperplanes

Our example in a 2-dimensional system of coordinates

SVM Technique – Formalization


Let,
Hw : a hyperplane that separates docs in classes ca and cb
ma: distance of Hw to the closest document in class ca
mb: distance of Hw to the closest document in class cb
ma + mb: margin m of the SVM
The decision hyperplane maximizes the margin m

Classification of Documents

SVM with Multiple Classes

 SVMs can only take binary decisions: a document belongs, or does not belong, to
   a given class.
 With multiple classes, reduce the multi-class problem to binary classification.
   The natural way is one binary classification problem per class.
 To classify a new document dj:
    run the classification for each class
    each class cp is paired against all others
    the classes assigned to dj are those with the largest margins
 Another solution: consider a binary classifier for each pair of classes cp and cq,
   where all training documents of one class are positive examples and all
   documents of the other class are negative examples. (A one-classifier-per-class
   sketch follows below.)
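A minimal sketch of the one-binary-classifier-per-class idea using scikit-learn,
assuming scikit-learn is available; the toy corpus and labels are invented for
illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["the striker scored a goal",          # toy training corpus
        "the match ended in a draw",
        "stocks fell sharply today",
        "the bank raised interest rates"]
labels = ["sports", "sports", "finance", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                   # documents in t-dimensional space

# LinearSVC trains one binary SVM per class (one-vs-rest) for multiclass labels
clf = LinearSVC().fit(X, labels)

print(clf.predict(vec.transform(["interest rates and stocks"])))  # -> ['finance']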

Feature Selection

 A large feature space might render document classifiers impractical.
 Classic solution: select a subset of all features to represent the documents.
    This is called feature selection.
    It reduces the dimensionality of the document representation.
    It reduces overfitting.

Term-Class Incidence Table

Feature selection depends on statistics of term occurrences inside documents and
classes.

Let
 Dt : the subset composed of all training documents
 Nt : the number of documents in Dt
 ni : the number of documents from Dt that contain term ki
 C = {c1 , c2 , . . . , cL }: the set of all L classes
 T : Dt × C → [0, 1]: a training set function

The term-class incidence table:

Case                           Docs in cp     Docs not in cp           Total
Docs that contain ki           ni,p           ni − ni,p                ni
Docs that do not contain ki    np − ni,p      Nt − ni − (np − ni,p)    Nt − ni
All docs                       np             Nt − np                  Nt

 ni,p : number of docs that contain ki and are classified in cp
 ni − ni,p : number of docs that contain ki but are not in class cp
 np : total number of training docs in class cp
 np − ni,p : number of docs from cp that do not contain ki

Given the term-class incidence table above, define:

 Probability that ki ∈ dj :               P(ki) = ni / Nt
 Probability that ki ∉ dj :               P(k̄i) = (Nt − ni) / Nt
 Probability that dj ∈ cp :               P(cp) = np / Nt
 Probability that dj ∉ cp :               P(c̄p) = (Nt − np) / Nt
 Probability that ki ∈ dj and dj ∈ cp :   P(ki, cp) = ni,p / Nt
 Probability that ki ∉ dj and dj ∈ cp :   P(k̄i, cp) = (np − ni,p) / Nt
 Probability that ki ∈ dj and dj ∉ cp :   P(ki, c̄p) = (ni − ni,p) / Nt
 Probability that ki ∉ dj and dj ∉ cp :   P(k̄i, c̄p) = (Nt − ni − (np − ni,p)) / Nt


Feature Selection by Document Frequency

Let Kth be a threshold on term document frequencies.

Feature selection by term document frequency:
 retain all terms ki for which ni ≥ Kth
 discard all others
 recompute the document representations to consider only the terms retained

Even though it is simple, this method allows reducing the dimensionality of the
space with basically no loss in effectiveness.
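A minimal sketch of document-frequency feature selection; the corpus and
threshold are invented for illustration:

from collections import Counter

docs = [{"learning", "text", "svm"},          # toy documents as term sets
        {"text", "classification", "svm"},
        {"svm", "kernel"},
        {"text", "clustering"}]
K_th = 2                                      # document-frequency threshold

# n_i: number of documents that contain term k_i
n = Counter(term for d in docs for term in d)

retained = {term for term, df in n.items() if df >= K_th}
reduced_docs = [d & retained for d in docs]   # recompute doc representations

print(retained)        # -> {'text', 'svm'} (set order may vary)
print(reduced_docs)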


Feature Selection by TF-IDF Weights

 wi,j : the tf-idf weight associated with the pair [ki , dj ]
 Kth : a threshold on tf-idf weights

Feature selection by tf-idf weights:
 retain all terms ki for which wi,j ≥ Kth
 discard all others
 recompute the document representations to consider only the terms retained

Experiments suggest that this feature selection allows reducing the dimensionality
of the space by a factor of 10 with no loss in effectiveness.


Feature Selection by Mutual Information

Mutual information:
 the relative entropy between the distributions of two random variables
 if the variables are independent, the mutual information is zero: knowledge of one
   of the variables does not allow inferring anything about the other variable


Mutual Information

The mutual information between term ki and class cp is

$$I(k_i, c_p) = \log \frac{P(k_i, c_p)}{P(k_i)\,P(c_p)} = \log \frac{n_{i,p}/N_t}{(n_i/N_t) \times (n_p/N_t)}$$

Mutual information across all classes:

$$MI(k_i, C) = \sum_{p=1}^{L} P(c_p)\, I(k_i, c_p) = \sum_{p=1}^{L} \frac{n_p}{N_t} \log \frac{n_{i,p}/N_t}{(n_i/N_t) \times (n_p/N_t)}$$


Alternative: use the maximum term information over all classes,

$$I_{max}(k_i, C) = \max_{p=1}^{L} I(k_i, c_p) = \max_{p=1}^{L} \log \frac{n_{i,p}/N_t}{(n_i/N_t) \times (n_p/N_t)}$$

Let Kth be a threshold on entropy.

Feature selection by entropy:
 retain all terms ki for which MI(ki , C) ≥ Kth
 discard all others
 recompute the document representations to consider only the terms retained


Feature Selection by Information Gain

Mutual information uses the probabilities associated with the occurrence of terms
in documents.

Information gain:
 a complementary metric
 considers the probabilities associated with the absence of terms in documents
 balances the effects of term/document occurrences with the effects of
   term/document absences


The information gain of term ki over the set C of all classes is

$$IG(k_i, C) = H(C) - H(C|k_i) - H(C|\neg k_i)$$

 H(C): entropy of the set of classes C
 H(C|ki): conditional entropy of C in the presence of term ki
 H(C|¬ki): conditional entropy of C in the absence of term ki
 IG(ki, C): amount of knowledge gained about C due to the fact that ki is known


Recalling the expression for entropy, we can write

$$IG(k_i, C) = -\sum_{p=1}^{L} P(c_p)\log P(c_p)
             - \left(-\sum_{p=1}^{L} P(k_i, c_p)\log P(c_p|k_i)\right)
             - \left(-\sum_{p=1}^{L} P(\overline{k}_i, c_p)\log P(c_p|\overline{k}_i)\right)$$


Applying Bayes' rule:

$$IG(k_i, C) = -\sum_{p=1}^{L}\left( P(c_p)\log P(c_p)
             - P(k_i,c_p)\log\frac{P(k_i,c_p)}{P(k_i)}
             - P(\overline{k}_i,c_p)\log\frac{P(\overline{k}_i,c_p)}{P(\overline{k}_i)}\right)$$

Substituting the previous probability definitions:

$$IG(k_i, C) = -\sum_{p=1}^{L}\left( \frac{n_p}{N_t}\log\frac{n_p}{N_t}
             - \frac{n_{i,p}}{N_t}\log\frac{n_{i,p}}{n_i}
             - \frac{n_p - n_{i,p}}{N_t}\log\frac{n_p - n_{i,p}}{N_t - n_i}\right)$$


Let Kth be a threshold on information gain.

Feature selection by information gain:
 retain all terms ki for which IG(ki , C) ≥ Kth
 discard all others
 recompute the document representations to consider only the terms retained


Feature Selection using Chi Square

The chi-square statistic is defined as

$$\chi^2(k_i, c_p) = \frac{N_t \left( P(k_i,c_p)P(\neg k_i,\neg c_p) - P(k_i,\neg c_p)P(\neg k_i,c_p) \right)^2}{P(k_i)\,P(\neg k_i)\,P(c_p)\,P(\neg c_p)}$$

It quantifies the lack of independence between ki and cp.

Using the probabilities previously defined:

$$\chi^2(k_i, c_p) = \frac{N_t \left( n_{i,p}(N_t - n_i - n_p + n_{i,p}) - (n_i - n_{i,p})(n_p - n_{i,p}) \right)^2}{n_p (N_t - n_p)\, n_i (N_t - n_i)}
                   = \frac{N_t \left( N_t\, n_{i,p} - n_p\, n_i \right)^2}{n_p\, n_i\, (N_t - n_p)(N_t - n_i)}$$


Compute either the average or the maximum chi square:

$$\chi^2_{avg}(k_i) = \sum_{p=1}^{L} P(c_p)\, \chi^2(k_i, c_p) \qquad
  \chi^2_{max}(k_i) = \max_{p=1}^{L} \chi^2(k_i, c_p)$$

Let Kth be a threshold on chi square.

Feature selection by chi square (a small sketch follows below):
 retain all terms ki for which χ²avg(ki) ≥ Kth
 discard all others
 recompute the document representations to consider only the terms retained
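A minimal sketch of the chi-square computation from the counts in the term-class
incidence table; the counts used below are invented for illustration:

def chi_square(n_ip, n_i, n_p, N_t):
    """Chi-square between term k_i and class c_p from incidence-table counts."""
    num = N_t * (N_t * n_ip - n_p * n_i) ** 2
    den = n_p * n_i * (N_t - n_p) * (N_t - n_i)
    return num / den

# Hypothetical counts: 100 training docs, 30 in class c_p,
# 40 contain k_i, and 25 contain k_i and belong to c_p
print(chi_square(n_ip=25, n_i=40, n_p=30, N_t=100))   # ~ 33.5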


Evaluation Metrics

Evaluation is important for any text classification method; it is the key step to
validate a newly proposed classification method.


Contingency Table

Let
 D: the collection of documents
 Dt : the subset composed of training documents
 Nt : the number of documents in Dt
 C = {c1 , c2 , . . . , cL }: the set of all L classes

Further, let
 T : Dt × C → [0, 1]: the training set function
 nt : the number of docs from the training set Dt in class cp
 F : D × C → [0, 1]: the text classifier function
 nf : the number of docs from the training set assigned to class cp by the classifier


Apply the classifier to all documents in the training set. The contingency table is
given by:

Case              T(dj, cp) = 1    T(dj, cp) = 0            Total
F(dj, cp) = 1     nf,t             nf − nf,t                nf
F(dj, cp) = 0     nt − nf,t        Nt − nf − nt + nf,t      Nt − nf
All docs          nt               Nt − nt                  Nt

 nf,t : number of docs that both the training and classifier functions assigned to
   class cp
 nt − nf,t : number of training docs in class cp that were misclassified
 The remaining quantities are calculated analogously.


Accuracy and Error

Accuracy and error metrics, relative to a given class cp:

$$Acc(c_p) = \frac{n_{f,t} + (N_t - n_f - n_t + n_{f,t})}{N_t} \qquad
  Err(c_p) = \frac{(n_f - n_{f,t}) + (n_t - n_{f,t})}{N_t} \qquad
  Acc(c_p) + Err(c_p) = 1$$

These metrics are commonly used for evaluating classifiers.


Accuracy and error have disadvantages:
 consider a classification with only two categories, cp and cr
 assume that, out of 1,000 docs, 20 are in class cp
 a classifier that simply guesses that no doc is in class cp achieves
   accuracy = 98% and error = 2%,
 which erroneously suggests a very good classifier


Consider now a second classifier that correctly predicts 50% of the documents in cp:

                  T(dj, cp) = 1    T(dj, cp) = 0    Total
F(dj, cp) = 1     10               0                10
F(dj, cp) = 0     10               980              990
All docs          20               980              1,000

In this case, accuracy and error are given by

$$Acc(c_p) = \frac{10 + 980}{1{,}000} = 99\% \qquad Err(c_p) = \frac{10 + 0}{1{,}000} = 1\%$$


This classifier is much better than one that guesses that no document is in class cp.
However, its accuracy is just 1% higher (it increased from 98% to 99%), which
suggests that the two classifiers are almost equivalent; that is not the case.


Precision and Recall

Variants of the precision and recall metrics used in IR can be applied. Precision P
and recall R relative to a class cp are

$$P(c_p) = \frac{n_{f,t}}{n_f} \qquad R(c_p) = \frac{n_{f,t}}{n_t}$$

 Precision is the fraction of all docs assigned to class cp by the classifier that
   really belong to class cp.
 Recall is the fraction of all docs that belong to class cp that were correctly
   assigned to class cp.


Consider again the classifier illustrated below:

                  T(dj, cp) = 1    T(dj, cp) = 0    Total
F(dj, cp) = 1     10               0                10
F(dj, cp) = 0     10               980              990
All docs          20               980              1,000

The precision and recall figures are given by

$$P(c_p) = \frac{10}{10} = 100\% \qquad R(c_p) = \frac{10}{20} = 50\%$$


Precision and recall are computed for every category in the set C. This yields a
great number of values, which makes the tasks of comparing and evaluating
algorithms more difficult. It is often convenient to combine precision and recall
into a single quality measure; one of the most commonly used such metrics is the
F-measure.


F-measure

The F-measure is defined as

$$F_\alpha(c_p) = \frac{(\alpha^2 + 1)\, P(c_p)\, R(c_p)}{\alpha^2\, P(c_p) + R(c_p)}$$

where α determines the relative importance of precision and recall:
 when α = 0, only precision is considered
 when α = ∞, only recall is considered
 when α = 0.5, recall is half as important as precision
 when α = 1, we obtain the commonly used F1-measure

$$F_1(c_p) = \frac{2\, P(c_p)\, R(c_p)}{P(c_p) + R(c_p)}$$


Consider again the classifier illustrated below:

                  T(dj, cp) = 1    T(dj, cp) = 0    Total
F(dj, cp) = 1     10               0                10
F(dj, cp) = 0     10               980              990
All docs          20               980              1,000

For this example, we write

$$F_1(c_p) = \frac{2 \times 1 \times 0.5}{1 + 0.5} \approx 67\%$$


F1 Macro and Micro Averages

It is also common to derive a unique F1 value as an average of F1 across all
individual categories. There are two main average functions:
 Micro-average F1, or micF1
 Macro-average F1, or macF1


Macro-average F1 across all categories:

$$macF_1 = \frac{\sum_{p=1}^{|C|} F_1(c_p)}{|C|}$$

Micro-average F1 across all categories:

$$micF_1 = \frac{2PR}{P + R}, \qquad
  P = \frac{\sum_{c_p \in C} n_{f,t}}{\sum_{c_p \in C} n_f}, \qquad
  R = \frac{\sum_{c_p \in C} n_{f,t}}{\sum_{c_p \in C} n_t}$$


 In micro-average F1, every single document is given the same importance.
 In macro-average F1, every single category is given the same importance; it
   captures the ability of the classifier to perform well for many classes.
 Whenever the distribution of classes is skewed, both average metrics should be
   considered. (A small sketch of both computations follows below.)
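A minimal sketch of per-class precision, recall, F1 and the two averages, starting
from (nf,t, nf, nt) counts per class; the counts are invented for illustration:

# Hypothetical per-class counts: (n_ft, n_f, n_t) = (correctly assigned,
# assigned by the classifier, truly in the class)
counts = {"c1": (10, 10, 20), "c2": (40, 60, 50)}

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

per_class = {}
for name, (n_ft, n_f, n_t) in counts.items():
    p, r = n_ft / n_f, n_ft / n_t
    per_class[name] = f1(p, r)

mac_f1 = sum(per_class.values()) / len(per_class)
P = sum(v[0] for v in counts.values()) / sum(v[1] for v in counts.values())
R = sum(v[0] for v in counts.values()) / sum(v[2] for v in counts.values())
mic_f1 = f1(P, R)

print(per_class)            # {'c1': 0.667, 'c2': 0.727} approximately
print(mac_f1, mic_f1)       # ~0.697 (macro) vs ~0.714 (micro)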


Cross-Validation

Cross-validation is a standard method to guarantee statistical validation of results:
 build k different classifiers Ψ1 , Ψ2 , . . . , Ψk
 for this, divide the training set Dt into k disjoint sets (folds) of sizes
   Nt1 , Nt2 , . . . , Ntk
 for classifier Ψi, training (or tuning) is done on Dt minus the i-th fold, and testing
   is done on the i-th fold


Each classifier is evaluated independently using precision-recall or F1 figures.
Cross-validation is done by computing the average of the k measures. The most
commonly adopted value of k is 10, in which case the method is called ten-fold
cross-validation. (A small sketch follows below.)
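A minimal sketch of ten-fold cross-validation using scikit-learn, assuming
scikit-learn is available; the 20-document toy corpus is invented (real evaluations
need far more data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy corpus: 10 "sports" and 10 "finance" documents (invented for illustration)
docs = [f"match goal score team game {i}" for i in range(10)] + \
       [f"bank stock market interest rate {i}" for i in range(10)]
labels = ["sports"] * 10 + ["finance"] * 10

X = TfidfVectorizer().fit_transform(docs)

# Ten-fold cross-validation: train on 9 folds, test on the held-out fold, 10 times
scores = cross_val_score(LinearSVC(), X, labels, cv=10, scoring="f1_macro")
print(scores.mean())     # average of the 10 F1 measures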


Organizing the Classes: Taxonomies

 Labels provide information on the semantics of each class.
 A lack of organization of the classes restricts comprehension and reasoning.
 Hierarchical organization of classes:
    the most appealing to humans
    hierarchies allow reasoning with more generic concepts
    they also provide for specialization, which allows breaking up a larger set of
      entities into subsets


To organize classes hierarchically, use
 specialization
 generalization
 sibling relations

Classes organized hierarchically compose a taxonomy:
 relations among classes can be used to fine tune the classifier
 taxonomies make more sense when built for a specific domain of knowledge

(Example figure: a geo-referenced taxonomy of hotels in Hawaii.)


Taxonomies are built manually or semi-automatically.

 Manual taxonomies tend to be of superior quality and better reflect the
   information needs of the users.
 Automatic construction of taxonomies needs more research and development.
 Once a taxonomy has been built, documents can be classified according to its
   concepts; this can be done manually or automatically, and automatic
   classification is advanced enough to work well in practice.


Indexing and Searching

(Chapter 9 of Modern Information Retrieval, with Gonzalo Navarro)

Topics: Introduction, Inverted Indexes, Signature Files, Suffix Trees and Suffix
Arrays, Sequential Searching, Multi-dimensional Indexing

Introduction
Although efficiency might seem a secondary issue
compared to effectiveness, it can rarely be neglected
in the design of an IR system
Efficiency in IR systems: to process user queries with
minimal requirements of computational resources
As we move to larger-scale applications, efficiency
becomes more and more important
For example, in Web search engines that index terabytes of data
and serve hundreds or thousands of queries per second

Index: a data structure built from the text to speed up
the searches
In the context of an IR system that uses an index, the
efficiency of the system can be measured by:
Indexing time: Time needed to build the index
Indexing space: Space used during the generation of the index
Index storage: Space required to store the index
Query latency: Time interval between the arrival of the query
and the generation of the answer
Query throughput: Average number of queries processed per
second

When a text is updated, any index built on it must be
updated as well
Current indexing technology is not well prepared to
support very frequent changes to the text collection
Semi-static collections: collections which are updated
at reasonable regular intervals (say, daily)
Most real text collections, including the Web, are indeed
semi-static
For example, although the Web changes very fast, the crawls of a
search engine are relatively slow

For maintaining freshness, incremental indexing is used



Inverted Indexes



Basic Concepts
Inverted index: a word-oriented mechanism for
indexing a text collection to speed up the searching task
The inverted index structure is composed of two
elements: the vocabulary and the occurrences
The vocabulary is the set of all different words in the text
For each word in the vocabulary the index stores the
documents which contain that word (inverted index)

Term-document matrix: the simplest way to represent the documents that contain
each word of the vocabulary.

The example collection:
 d1: To do is to be. To be is to do.
 d2: To be or not to be. I am what I am.
 d3: I think therefore I am. Do be do be do.
 d4: Do do do, da da da. Let it be, let it be.

Vocabulary    ni    d1    d2    d3    d4
to            2     4     2     -     -
do            3     2     -     3     3
is            1     2     -     -     -
be            4     2     2     2     2
or            1     -     1     -     -
not           1     -     1     -     -
I             2     -     2     2     -
am            2     -     2     1     -
what          1     -     1     -     -
think         1     -     -     1     -
therefore     1     -     -     1     -
da            1     -     -     -     3
let           1     -     -     -     2
it            1     -     -     -     2


Basic Concepts
The main problem of this simple solution is that it
requires too much space
As this is a sparse matrix, the solution is to associate a
list of documents with each word
The set of all those lists is called the occurrences

Basic inverted index for the same collection (each list entry is [document, term
frequency]):

Vocabulary    ni    Occurrences as inverted lists
to            2     [1,4], [2,2]
do            3     [1,2], [3,3], [4,3]
is            1     [1,2]
be            4     [1,2], [2,2], [3,2], [4,2]
or            1     [2,1]
not           1     [2,1]
I             2     [2,2], [3,2]
am            2     [2,2], [3,1]
what          1     [2,1]
think         1     [3,1]
therefore     1     [3,1]
da            1     [4,3]
let           1     [4,2]
it            1     [4,2]
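A minimal sketch that builds this kind of index from the four example documents
(list entries as [document id, term frequency]; tokenization is simplified):

from collections import Counter, defaultdict

docs = {1: "To do is to be. To be is to do.",
        2: "To be or not to be. I am what I am.",
        3: "I think therefore I am. Do be do be do.",
        4: "Do do do, da da da. Let it be, let it be."}

index = defaultdict(list)          # word -> list of [doc id, term frequency]
for doc_id, text in docs.items():
    terms = text.lower().replace(".", " ").replace(",", " ").split()
    for term, tf in Counter(terms).items():
        index[term].append([doc_id, tf])

print(index["do"])                 # -> [[1, 2], [3, 3], [4, 3]]
print(len(index["be"]))            # n_i for "be" -> 4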




Full Inverted Indexes
The basic index is not suitable for answering phrase or proximity queries. Hence,
we need to add the positions of each word in each document to the index (full
inverted index).

Example text, with the character position of each word:

  In theory, there is no difference between theory and practice. In practice, there is.
  (positions: In=1, theory=4, there=12, is=18, no=21, difference=24, between=35,
   theory=43, and=50, practice=54, In=64, practice=67, there=77, is=83)

Vocabulary    Occurrences
between       35
difference    24
practice      54, 67
theory        4, 43


In the case of multiple documents, we need to store one occurrence list per
term-document pair (each entry is [document, frequency, [word positions]]):

Vocabulary    ni    Occurrences as full inverted lists
to            2     [1,4,[1,4,6,9]], [2,2,[1,5]]
do            3     [1,2,[2,10]], [3,3,[6,8,10]], [4,3,[1,2,3]]
is            1     [1,2,[3,8]]
be            4     [1,2,[5,7]], [2,2,[2,6]], [3,2,[7,9]], [4,2,[9,12]]
or            1     [2,1,[3]]
not           1     [2,1,[4]]
I             2     [2,2,[7,10]], [3,2,[1,4]]
am            2     [2,2,[8,11]], [3,1,[5]]
what          1     [2,1,[9]]
think         1     [3,1,[2]]
therefore     1     [3,1,[3]]
da            1     [4,3,[4,5,6]]
let           1     [4,2,[7,10]]
it            1     [4,2,[8,11]]


The space required for the vocabulary is rather small. According to Heaps' law, the
vocabulary grows as O(n^β), where
 n is the collection size
 β is a collection-dependent constant between 0.4 and 0.6

For instance, in the TREC-3 collection, the vocabulary of 1 gigabyte of text
occupies only 5 megabytes. This may be further reduced by stemming and other
normalization techniques.

The occurrences demand much more space
The extra space will be O(n) and is around
40% of the text size if stopwords are omitted
80% when stopwords are indexed

Document-addressing indexes are smaller, because


only one occurrence per file must be recorded, for a
given word
Depending on the document (file) size,
document-addressing indexes typically require 20% to
40% of the text size

To reduce space requirements, a technique called
block addressing is used
The documents are divided into blocks, and the
occurrences point to the blocks where the word appears
Example: the text "This is a text. A text has many words. Words are made from
letters." is divided into four blocks, and the occurrences point to block numbers
rather than to exact positions:

Vocabulary    Occurrences
letters       4 ...
made          4 ...
many          2 ...
text          1, 2 ...
words         3 ...


The table below presents the projected space taken by inverted indexes for texts of
different sizes (two figures per cell, as in the original table):

Index granularity       Single document    Small collection    Medium collection
                        (1 MB)             (200 MB)            (2 GB)
Addressing words        45%   73%          36%   64%           35%   63%
Addressing documents    19%   26%          18%   32%           26%   47%
Addressing 64K blocks   27%   41%          18%   32%           5%    9%
Addressing 256 blocks   18%   25%          1.7%  2.4%          0.5%  0.7%


The blocks can be of fixed size or they can be defined
using the division of the text collection into documents
The division into blocks of fixed size improves efficiency
at retrieval time
This is because larger blocks match queries more frequently and
are more expensive to traverse

This technique also profits from locality of reference


That is, the same word will be used many times in the same
context and all the references to that word will be collapsed in just
one reference



Single Word Queries
The simplest type of search is that for the occurrences
of a single word
The vocabulary search can be carried out using any
suitable data structure
Ex: hashing, tries, or B-trees

The first two provide O(m) search cost, where m is the


length of the query
We note that the vocabulary is in most cases sufficiently
small so as to stay in main memory
The occurrence lists, on the other hand, are usually
fetched from disk



Multiple Word Queries
If the query has more than one word, we have to
consider two cases:
conjunctive (AND operator) queries
disjunctive (OR operator) queries

Conjunctive queries imply to search for all the words


in the query, obtaining one inverted list for each word
Following, we have to intersect all the inverted lists to
obtain the documents that contain all these words
For disjunctive queries the lists must be merged
The first case is popular in the Web due to the size of
the document collection



List Intersection
The most time-demanding operation on inverted
indexes is the merging of the lists of occurrences
Thus, it is important to optimize it

Consider one pair of lists of sizes m and n respectively,


stored in consecutive memory, that needs to be
intersected
If m is much smaller than n, it is better to do m binary
searches in the larger list to do the intersection
If m and n are comparable, Baeza-Yates devised a
double binary search algorithm
It is O(log n) if the intersection is trivially empty
It requires less than m + n comparisons on average

When there are more than two lists, there are several
possible heuristics depending on the list sizes
If intersecting the two shortest lists gives a very small
answer, might be better to intersect that to the next
shortest list, and so on
The algorithms are more complicated if lists are stored
non-contiguously and/or compressed

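A minimal sketch of intersecting two sorted postings lists, using the
small-versus-large binary-search strategy described above (the doc-id lists are
invented):

import bisect

def intersect(small, large):
    """Intersect two sorted lists of doc ids; binary-search the larger one."""
    if len(small) > len(large):
        small, large = large, small
    result = []
    for doc in small:
        pos = bisect.bisect_left(large, doc)   # O(log n) lookup per element
        if pos < len(large) and large[pos] == doc:
            result.append(doc)
    return result

# Hypothetical postings lists for a two-word conjunctive (AND) query
print(intersect([2, 4, 7], [1, 2, 3, 4, 6, 7, 9, 12]))   # -> [2, 4, 7]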


Phrase and Proximity Queries
Context queries are more difficult to solve with inverted
indexes
The lists of all elements must be traversed to find
places where
all the words appear in sequence (for a phrase), or
appear close enough (for proximity)
these algorithms are similar to a list intersection algorithm

Another solution for phrase queries is based on


indexing two-word phrases and using similar algorithms
over pairs of words
however the index will be much larger as the number of word
pairs is not linear



More Complex Queries
Prefix and range queries are basically (larger)
disjunctive queries
In these queries there are usually several words that
match the pattern
Thus, we end up again with several inverted lists and we can use
the algorithms for list intersection

To search for regular expressions the data structures
built over the vocabulary are rarely useful
The solution is then to sequentially traverse the
vocabulary, to spot all the words that match the pattern
Such a sequential traversal is not prohibitively costly
because it is carried out only on the vocabulary



Boolean Queries
In boolean queries, a query syntax tree is naturally defined. For example, the query
"translation AND (syntax OR syntactic)" corresponds to a tree with AND at the root,
the term "translation" as one child, and an OR node over "syntax" and "syntactic"
as the other child.

Normally, for boolean queries, the search proceeds in


three phases:
the first phase determines which documents to match
the second determines the likelihood of relevance of the
documents matched
the final phase retrieves the exact positions of the matches to
allow highlighting them during browsing, if required
Once the leaves of the query syntax tree find the
classifying sets of documents, these sets are further
operated by the internal nodes of the tree
Under this scheme, it is possible to evaluate the syntax
tree in full or lazy form
In the full evaluation form, both operands are first completely
obtained and then the complete result is generated
In lazy evaluation, the partial results from operands are delivered
only when required, and then the final result is recursively
generated

Processing the internal nodes of the query syntax tree
In (a) full evaluation is used
In (b) we show lazy evaluation in more detail

AND AND 46
a)
146 OR 146 23467

246 237

b)
AND 4 AND 6
AND AND AND AND

1 OR 2 4 OR 2 4 OR 3 4 OR 4 6 OR 6 OR 7

4 3 4 3 4 7 6 7 7

Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 – p. 27


Searching


Ranking
How to find the top-k documents and return them to the
user when we have weight-sorted inverted lists?
If we have a single word query, the answer is trivial as
the list can be already sorted by the desired ranking
For other queries, we need to merge the lists

Suppose that we are searching the disjunctive query "to do" on the example
collection (d1: "To do is to be. To be is to do.", d2: "To be or not to be. I am what
I am.", d3: "I think therefore I am. Do be do be do.", d4: "Do do do, da da da. Let it
be, let it be.").

As our collection is very small, let us assume that we are interested in the top-2
ranked documents. We can use the following heuristic:
 we process terms in idf order (shorter lists first), and
 each term is processed in tf order (simple ranking order)


Ranking in the vector model (query terms t) - a variant of Persin's algorithm.
We use a priority queue P of C document candidates, in which partial similarities
are accumulated (Pd holds the candidate documents and Pw their partial weights).

01  Create P as C candidate similarities, initialized to (Pd, Pw) = (0, 0)
02  Sort the query terms t by decreasing weight
03  c ← 1
04  for each sorted term t in the query do
05      Compute the value of the threshold tadd
06      Retrieve the inverted list for t, Lt
07      for each document d in Lt do
08          if wd,t < tadd then break
09          psim ← wd,t × wq,t / Wd
10          if d ∈ Pd(i) then
11              Pw(i) ← Pw(i) + psim
12          elif psim > minj(Pw(j)) then
13              n ← minj(Pw(j))
14          elif c ≤ C then
15              n ← c
16              c ← c + 1
17          if n ≤ C then P(n) ← (d, psim)
18  return the top-k documents according to Pw


Internal Algorithms
Building an index in internal memory is a relatively
simple and low-cost task
A dynamic data structure to hold the vocabulary (B-tree,
hash table, etc.) is created empty
Then, the text is scanned and each consecutive word is
searched for in the vocabulary
If it is a new word, it is inserted in the vocabulary before
proceeding

A large array is allocated where the identifier of each consecutive text word is
stored.

Example: a full-text inverted index for a sample text built with the incremental
algorithm. The text is "This is a text. A text has many words. Words are made from
letters." (word positions: This=1, is=6, a=9, text=11, A=17, text=19, has=24,
many=28, words=33, Words=40, are=46, made=50, from=55, letters=60). The
vocabulary is kept in a trie, and each word holds its list of positions:

 letters: 60
 made: 50
 many: 28
 text: 11, 19
 words: 33, 40


Example: a full-text inverted index for a sample text built with a sorting algorithm.
The text is "In theory, there is no difference between theory and practice. In
practice, there is." and the indexed vocabulary is 1 = between, 2 = difference,
3 = practice, 4 = theory.

1. Collect (term id : position) pairs while scanning the text:
   4:4  2:24  1:35  4:43  3:54  3:67
2. Sort the pairs by term identifier:
   1:35  2:24  3:54  3:67  4:4  4:43
3. Identify the header of each list; the occurrences become:
   between: 35   difference: 24   practice: 54, 67   theory: 4, 43


Internal Algorithms
An alternative to avoid this sorting is to separate the
lists from the beginning
In this case, each vocabulary word will hold a pointer to its own
array (list) of occurrences, initially empty

A non trivial issue is how the memory for the many lists
of occurrences should be allocated
A classical list in which each element is allocated individually
wastes too much space
Instead, a scheme where a list of blocks is allocated, each block
holding several entries, is preferable

Once the process is completed, the vocabulary and the
lists of occurrences are written on two distinct disk files
The vocabulary contains, for each word, a pointer to the
position of the inverted list of the word
This allows the vocabulary to be kept in main memory
at search time in most cases



External Algorithms
All the previous algorithms can be extended by using
them until the main memory is exhausted
At this point, the partial index Ii obtained up to now is
written to disk and erased from main memory
These indexes are then merged in a hierarchical
fashion

Merging the partial indexes in a binary fashion (rectangles in the original figure
represent partial indexes, rounded rectangles represent merging operations):

 Level 1 (initial dumps): I1, I2, I3, I4, I5, I6, I7, I8
 Level 2: I1..2, I3..4, I5..6, I7..8
 Level 3: I1..4, I5..8
 Level 4 (final index): I1..8


External Algorithms
In general, maintaining an inverted index can be done
in three different ways:
Rebuild
If the text is not that large, rebuilding the index is the simplest
solution
Incremental updates
We can amortize the cost of updates while we search
That is, we only modify an inverted list when needed
Intermittent merge
New documents are indexed and the resultant partial index is
merged with the large index
This in general is the best solution

Compressed Inverted Indexes
It is possible to combine index compression and text
compression without any complication
In fact, in all the construction algorithms mentioned, compression
can be added as a final step

In a full-text inverted index, the lists of text positions or


file identifiers are in ascending order
Therefore, they can be represented as sequences of
gaps between consecutive numbers
Notice that these gaps are small for frequent words and large for
infrequent words
Thus, compression can be obtained by encoding small values
with shorter codes

A coding scheme for this case is the unary code
In this method, each integer x > 0 is coded as (x − 1) 1-bits
followed by a 0-bit

A better scheme is the Elias-γ code, which represents a number x > 0 by a
concatenation of two parts:
1. a unary code for 1 + ⌊log₂ x⌋
2. a code of ⌊log₂ x⌋ bits that represents the number x − 2^⌊log₂ x⌋ in binary

Another coding scheme is the Elias-δ code
Elias-δ concatenates parts (1) and (2) as above, yet part (1) is not
represented in unary but using Elias-γ instead
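
The following Python sketch implements the three codes exactly as defined above (bit strings are returned for readability; the helper names are ours, not the book's):

def unary(x):
    # (x - 1) 1-bits followed by a 0-bit
    return "1" * (x - 1) + "0"

def elias_gamma(x):
    # unary code for 1 + floor(log2 x), then x - 2^floor(log2 x) in floor(log2 x) bits
    b = x.bit_length() - 1
    suffix = format(x - (1 << b), "b").zfill(b) if b else ""
    return unary(1 + b) + suffix

def elias_delta(x):
    # like Elias-gamma, but part (1) is itself coded with Elias-gamma
    b = x.bit_length() - 1
    suffix = format(x - (1 << b), "b").zfill(b) if b else ""
    return elias_gamma(1 + b) + suffix

for x in (1, 4, 10):
    print(x, unary(x), elias_gamma(x), elias_delta(x))
# 1 0 0 0
# 4 1110 11000 10100
# 10 1111111110 1110010 11000010

The printed codes coincide with the rows of the table below.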


Compressed Inverted Indexes
Example codes for integers

Gap x    Unary         Elias-γ    Elias-δ      Golomb (b = 3)
1 0 0 0 00
2 10 100 1000 010
3 110 101 1001 011
4 1110 11000 10100 100
5 11110 11001 10101 1010
6 111110 11010 10110 1011
7 1111110 11011 10111 1100
8 11111110 1110000 11000000 11010
9 111111110 1110001 11000001 11011
10 1111111110 1110010 11000010 11100

Note: Golomb codes will be explained later
Compressed Inverted Indexes
In general,
Elias-γ for an arbitrary integer x > 0 requires 1 + 2⌊log₂ x⌋ bits
Elias-δ requires 1 + 2⌊log₂ log₂ 2x⌋ + ⌊log₂ x⌋ bits

For small values of x Elias-γ codes are shorter than Elias-δ codes, and the
situation is reversed as x grows
Thus the choice depends on which values we expect to
encode



Compressed Inverted Indexes
Golomb presented another coding method that can be
parametrized to fit smaller or larger gaps
For some parameter b, let q and r be the quotient and
remainder, respectively, of dividing x − 1 by b
I.e., q = ⌊(x − 1)/b⌋ and r = (x − 1) − q · b

Then x is coded by concatenating
the unary representation of q + 1
the binary representation of r, using either ⌊log₂ b⌋ or ⌈log₂ b⌉ bits



Compressed Inverted Indexes
If r < 2^(⌊log₂ b⌋ − 1), then r is coded in ⌊log₂ b⌋ bits, and the
representation always starts with a 0-bit
Otherwise it uses ⌈log₂ b⌉ bits, where the first bit is 1 and the
remaining bits encode the value r − 2^(⌊log₂ b⌋ − 1) in
⌊log₂ b⌋ binary digits
For example,
For b = 3 there are three possible remainders, and those are
coded as 0, 10, and 11, for r = 0, r = 1, and r = 2, respectively
For b = 5 there are five possible remainders r, 0 through 4, and
these are assigned the codes 00, 01, 100, 101, and 110
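
A small sketch of this Golomb coder, following the remainder rule as stated above (illustrative only: it assumes b ≥ 2 and reproduces the b = 3 and b = 5 examples; other presentations of Golomb codes assign the long remainders slightly differently):

def golomb(x, b):
    # Golomb code of x > 0 for parameter b >= 2, following the remainder
    # rule stated above (reproduces the b = 3 and b = 5 examples)
    q, r = divmod(x - 1, b)                  # quotient and remainder of (x - 1) / b
    code = "1" * q + "0"                     # unary representation of q + 1
    fl = b.bit_length() - 1                  # floor(log2 b)
    if r < 1 << (fl - 1):                    # short remainder: fl bits, starts with 0
        code += format(r, "b").zfill(fl)
    else:                                    # long remainder: a 1 followed by fl bits
        code += "1" + format(r - (1 << (fl - 1)), "b").zfill(fl)
    return code

print([golomb(x, 3) for x in range(1, 6)])   # ['00', '010', '011', '100', '1010']
print([golomb(x, 5) for x in range(1, 6)])   # ['000', '001', '0100', '0101', '0110']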



Compressed Inverted Indexes
To encode the lists of occurrences using Golomb
codes, we must define the parameter b for each list
Golomb codes usually give better compression than
either Elias-γ or Elias-δ
However they need two passes to be generated, as well as
information on term statistics over the whole document collection
For example, in the TREC-3 collection, the average
number of bits per list entry for each method is
Golomb = 5.73
Elias-δ = 6.19
Elias-γ = 6.43

This represents a five-fold reduction in space compared to a plain
inverted index representation

Compressed Inverted Indexes
Let us now consider inverted indexes for ranked search
In this case the documents are sorted by decreasing frequency of
the term or other similar type of weight

Documents that share the same frequency can be sorted in increasing order of identifiers
This will permit the use of gap encoding to compress
most of each list



Inverted Indexes
Structural Queries



Structural Queries
Let us assume that the structure is marked in the text
using tags
The idea is to make the index take the tags as if they
were words
After this process, the inverted index contains all the
information to answer structural queries



Structural Queries
Consider the query:
select structural elements of type A that contain a structure of
type B

The query can be translated into finding <A> followed by <B> without </A> in between
The positions of those tags are obtained with the
full-text index
Many queries can be translated into a search for tags
plus validation of the sequence of occurrences
In many cases this technique is efficient and its
integration into an existing text database is simpler



Sequential Searching



Sequential Searching
In general the sequential search problem is:
Given a text T = t1 t2 . . . tn and a pattern denoting a set of strings
P, find all the occurrences of the strings of P in T

Exact string matching: the simplest case, where the pattern denotes just a single string P = p1 p2 . . . pm
This problem subsumes many of the basic queries,
such as word, prefix, suffix, and substring search
We assume that the strings are sequences of
characters drawn from an alphabet Σ of size σ



Sequential Searching
Simple Strings

Simple Strings: Brute Force
The brute force algorithm:
Try out all the possible pattern positions in the text and checks
them one by one

More precisely, the algorithm slides a window of length m across the text,
ti+1 ti+2 . . . ti+m for 0 ≤ i ≤ n − m
Each window denotes a potential pattern occurrence
that must be verified
Once verified, the algorithm slides the window to the
next position

Simple Strings: Brute Force
A sample text and pattern searched for using brute
force

The text is T = 'abraca bracadabra' and the pattern is P = 'abracadabra'.
The first text window is t1 . . . t11; after verifying that it does not
match P, the window is shifted by one position, and so on for each
successive window position.

Simple Strings: Horspool
Horspool’s algorithm is in the fortunate position of
being very simple to understand and program
It is the fastest algorithm in many situations, especially
when searching natural language texts
Horspool’s algorithm uses the previous idea to shift the
window in a smarter way
A table d indexed by the characters of the alphabet is
precomputed:
d[c] tells how many positions the window can be shifted if the final
character of the window is c
In other words, d[c] is the distance from the end of the
pattern to the last occurrence of c in P , excluding the
occurrence of pm
Simple Strings: Horspool
The Figure repeats the previous example, now also
applying Horspool’s shift
The same text T = 'abraca bracadabra' and pattern P = 'abracadabra' are shown,
comparing the window positions tried by the brute force algorithm with the
fewer window positions tried when Horspool's shift is applied.

Simple Strings: Horspool
Pseudocode for Horspool’s string matching algorithm
Horspool (T = t1 t2 . . . tn , P = p1 p2 . . . pm )
(1) for c ∈ Σ do d[c] ← m
(2) for j ← 1 . . . m − 1 do d[pj ] ← m − j
(3) i←0
(4) while i ≤ n − m do
(5) j←1
(6) while j ≤ m ∧ ti+j = pj do j ← j + 1
(7) if j > m then report an occurrence at text position i + 1
(8) i ← i + d[ti+m ]
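
For reference, a direct Python transcription of the pseudocode above (illustrative; 0-based string indexing is used internally, 1-based positions are reported, and a dictionary d plays the role of the table indexed by the alphabet):

def horspool(text, pattern):
    # returns the 1-based starting positions of pattern in text
    n, m = len(text), len(pattern)
    d = {c: m for c in set(text) | set(pattern)}       # default shift is m
    for j in range(m - 1):                             # all pattern positions but the last
        d[pattern[j]] = m - 1 - j                      # distance to the end of the pattern
    positions, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:                   # verify the current window
            positions.append(i + 1)
        i += d[text[i + m - 1]]                        # shift using the last window character
    return positions

print(horspool("abracadabra abracadabra", "cada"))     # [5, 17]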

Small alphabets and long patterns
When searching for long patterns over small alphabets
Horspool’s algorithm does not perform well
Imagine a computational biology application where strings of 300
nucleotides over the four-letter alphabet {A, C, G, T} are sought

This problem can be alleviated by considering consecutive pairs of
characters to shift the window
In other words, we can align the pattern with the last pair of
window characters, ti+m−1 ti+m

In the previous example, we would shift by 4² = 16 positions on average

Small alphabets and long patterns
In general we can shift using q characters at the end of
the window: which is the best value for q ?
We cannot shift by more than m, and thus σ^q ≤ m seems to be a
natural limit
If we set q = logσ m, the average search time will be
O(n logσ (m)/m)

Actually, this average complexity is optimal, and the choice for q
we derived is close to correct
It can be analytically shown that, by choosing
q = 2 logσ m, the average search time achieves the
optimal O(n logσ (m)/m)

Small alphabets and long patterns
This technique is used in the agrep software
A hash function is chosen to map q -grams (strings of
length q ) onto an integer range
Then the distance from each q -gram of P to the end of
P is recorded in the hash table
For the q -grams that do not exist in P , distance
m − q + 1 is used

Small alphabets and long patterns
Pseudocode for the agrep’s algorithm to match long
patterns over small alphabets (simplified)
Agrep (T = t1 t2 . . . tn , P = p1 p2 . . . pm , q, h( ), N )
(1) for i ∈ [1, N ] do d[i] ← m − q + 1
(2) for j ← 0 . . . m − q do d[h(pj+1 pj+2 . . . pj+q )] ← m − q − j
(3) i←0
(4) while i ≤ n − m do
(5) s ← d[h(ti+m−q+1 ti+m−q+2 . . . ti+m )]
(6) if s > 0 then i ← i + s
(7) else
(8) j←1
(9) while j ≤ m ∧ ti+j = pj do j ← j + 1
(10) if j > m then report an occurrence at text position i + 1
(11) i←i+1

Automata and Bit-Parallelism
Horspool’s algorithm, as well as most classical
algorithms, does not adapt well to complex patterns
We now show how automata and bit-parallelism
permit handling many complex patterns

Automata
Figure below shows, on top, an NFA to search for the
pattern P = abracadabra
The initial self-loop matches any character
Each table column corresponds to an edge of the automaton

a b r a c a d a b r a

B[a] = 0 1 1 0 1 0 1 0 1 1 0
B[b] = 1 0 1 1 1 1 1 1 0 1 1
B[r] = 1 1 0 1 1 1 1 1 1 0 1
B[c] = 1 1 1 1 0 1 1 1 1 1 1
B[d] = 1 1 1 1 1 1 0 1 1 1 1
B[*] = 1 1 1 1 1 1 1 1 1 1 1

Automata
It can be seen that the NFA in the previous Figure
accepts any string that finishes with P =
‘abracadabra’
The initial state is always active because of the self-loop
that can be traversed by any character
Note that several states can be simultaneously active
For example, after reading ‘abra’, NFA states 0, 1, and 4 will be
active

Bit-parallelism and Shift-And
Bit-parallelism takes advantage of the intrinsic
parallelism of bit operations
Bit masks are read right to left, so that the first bit of
bm . . . b1 is b1
Bit masks are handled with operations like:
| to denote the bit-wise or
& to denote the bit-wise and, and
! to denote the bit-wise xor

Unary operation ‘∼’ complements all the bits

Bit-parallelism and Shift-And
In addition:
mask << i means shifting all the bits in mask by i positions to
the left, entering zero bits from the right
mask >> i is analogous

Finally, it is possible to operate bit masks as numbers, for example
adding or subtracting them

Bit-parallelism and Shift-And
The simplest bit-parallel algorithm permits matching
single strings, and it is called Shift-And
The algorithm builds a table B which, for each
character, stores a bit mask bm . . . b1
The mask in B[c] has the i-th bit set if and only if pi = c

The state of the search is kept in a machine word D = dm . . . d1,
where di is set if the state i is active
Therefore, a match is reported whenever dm = 1

Note that state number zero is not represented in D because it is
always active and then can be left implicit

Bit-parallelism and Shift-And
Pseudocode for the Shift-And algorithm
Shift-And (T = t1 t2 . . . tn , P = p1 p2 . . . pm )

(1) for c ∈ Σ do B[c] ← 0
(2) for j ← 1 . . . m do B[pj] ← B[pj] | (1 << (j − 1))
(3) D ← 0
(4) for i ← 1 . . . n do
(5) D ← ((D << 1) | 1) & B[ti]
(6) if D & (1 << (m − 1)) ≠ 0
(7) then report an occurrence at text position i − m + 1

There must be sufficient bits in the computer word to store one bit per pattern position
For longer patterns, in practice we can search for p1 p2 . . . pw , and
directly check the occurrences of this prefix for the complete P
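
A compact Python rendering of Shift-And following the pseudocode above (illustrative; Python integers are unbounded, so the machine-word limit does not apply in this sketch):

def shift_and(text, pattern):
    # returns the 1-based starting positions of pattern in text
    m = len(pattern)
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)          # bit j of B[c] set iff pattern[j] == c
    D, positions = 0, []
    for i, c in enumerate(text):
        D = ((D << 1) | 1) & B.get(c, 0)       # advance every active state
        if D & (1 << (m - 1)):                 # final state active: a match ends here
            positions.append(i - m + 2)        # 1-based starting position
    return positions

print(shift_and("abraca bracadabra", "raca"))  # [3, 9]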
Extending Shift-And
Shift-And can deal with much more complex patterns
than Horspool
The simplest case is that of classes of characters:
This is the case, for example, when one wishes to search in
case-insensitive fashion, or one wishes to look for a whole word

Let us now consider a more complicated pattern
Imagine that we search for neighbour, but we wish the u to be
optional (accepting both English and American style)

The Figure below shows an NFA that does the task using an ε-transition:
states 0 to 9 are connected by transitions labeled n, e, i, g, h, b, o, u, r,
plus an ε-transition that skips the u

Extending Shift-And
Another feature in complex patterns is the use of wild
cards, or more generally repeatable characters
Those are pattern positions that can appear once or more times,
consecutively, in the text

For example, we might want to catch all the transfer records in a banking log

Extending Shift-And
As another example, we might look for well known,
yet there might be a hyphen or one or more spaces
For instance ‘well known’, ‘well known’, ‘well-known’,
‘well - known’, ‘well \n known’, and so on

The NFA has states 0 to 10 with transitions w, e, l, l, then a separator
class sep (with a self-loop so that it can be repeated), and then k, n, o, w, n

Extending Shift-And
Figure below shows pseudocode for a Shift-And
extension that handles all these cases
Shift-And-Extended (T = t1 t2 . . . tn , m, B[ ], A, S)

(1) I ← (A >> 1) & (A ! (A >> 1))
(2) F ← A & (A ! (A >> 1))
(3) D ← 0
(4) for i ← 1 . . . n do
(5) D ← (((D << 1) | 1) | (D & S)) & B[ti]
(6) Df ← D | F
(7) D ← D | (A & ((∼ (Df − I)) ! Df))
(8) if D & (1 << (m − 1)) ≠ 0
(9) then report an occurrence at text position i − m + 1

Faster Bit-Parallel Algorithms
There exist some algorithms that can handle complex
patterns and still skip text characters (like Horspool)
For instance, Suffix Automata and Interlaced Shift-And algorithms

Those algorithms run progressively slower as the pattern gets more complex

Suffix Automata
The suffix automaton of a pattern P is an automaton
that recognizes all the suffixes of P
Below we present a non-deterministic suffix automaton
for P = ‘abracadabra’

The automaton is a chain of states 0 to 11 whose transitions spell
a, b, r, a, c, a, d, a, b, r, a, plus an initial state I with ε-transitions
into the chain, so that every suffix of P is accepted

Suffix Automata
To search for pattern P , the suffix automaton of P rev
(the reversed pattern) is built
The algorithm scans the text window backwards and
feeds the characters into the suffix automaton of P rev
If the automaton runs out of active states after scanning
ti+m ti+m−1 . . . ti+j , this means that ti+j ti+j+1 . . . ti+m is
not a substring of P
Thus, no occurrence of P can contain this substring,
and the window can be safely shifted past ti+j
If, instead, we reach the beginning of the window and
the automaton still has active states, this means that
the window is equal to the pattern

Suffix Automata
The need to implement the suffix automaton and make
it deterministic makes the algorithm more complex
An attractive variant, called BNDM, implements the
suffix automaton using bit-parallelism
It achieves improved performance when the pattern is
not very long
say, at most twice the number of bits in the computer word

Suffix Automata
Pseudocode for BNDM algorithm:
BNDM (T = t1 t2 . . . tn , P = p1 p2 . . . pm )

(1) for c ∈ Σ do B[c] ← 0
(2) for j ← 1 . . . m do B[pj] ← B[pj] | (1 << (m − j))
(3) i ← 0
(4) while i ≤ n − m do
(5) j ← m − 1
(6) D ← B[ti+m]
(7) while j > 0 ∧ D ≠ 0 do
(8) D ← (D << 1) & B[ti+j]
(9) j ← j − 1
(10) if D ≠ 0 then report an occurrence at text position i + 1
(11) i ← i + j + 1
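
A Python transcription of this (simplified) BNDM pseudocode, for illustration only:

def bndm(text, pattern):
    # returns the 1-based starting positions of pattern in text
    n, m = len(text), len(pattern)
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - j))        # masks built over the reversed pattern
    positions, i = [], 0
    while i <= n - m:
        j, D = m - 1, B.get(text[i + m - 1], 0)        # scan the window backwards
        while j > 0 and D != 0:
            D = (D << 1) & B.get(text[i + j - 1], 0)
            j -= 1
        if D != 0:                                     # the whole window matched
            positions.append(i + 1)
        i += j + 1                                     # safe shift past the failure point
    return positions

print(bndm("abraca bracadabra", "cada"))               # [11]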

Interlaced Shift-And
Another idea to achieve optimal average search time is
to read one text character out of q
To fix ideas, assume P = neighborhood and q = 3
If we read one text position out of 3, and P occurs at
some text window ti+1 ti+2 . . . ti+m then we will read
either ‘ngoo’, ‘ehro’, or ‘ibhd’ at the window
Therefore, it is sufficient to search simultaneously for
the three subsequences of P

Interlaced Shift-And
Now the initial state can activate the first q positions of
P , and the bit-parallel shifts are by q positions
A non-deterministic suffix automaton for interlaced
searching of P = ‘neighborhood’ with q = 3 is:

The automaton is a chain of states 0 to 12 whose transitions spell
n, e, i, g, h, b, o, r, h, o, o, d, where the shifts advance q = 3 states at a time

Interlaced Shift-And
Pseudocode for Interlaced Shift-And algorithm with
sampling step q (simplified):
Interlaced-Shift-And (T = t1 t2 . . . tn , P = p1 p2 . . . pm , q)

(1) for c ∈ Σ do B[c] ← 0
(2) for j ← 1 . . . m do B[pj] ← B[pj] | (1 << (j − 1))
(3) S ← (1 << q) − 1
(4) D ← 0
(5) for i ← 1 . . . ⌊n/q⌋ do
(6) D ← ((D << q) | S) & B[tq·i]
(7) if D & (S << (⌊m/q⌋ · q − q)) ≠ 0
(8) then run Shift-And over tq·i−m+1 . . . tq·i+q−1

Regular Expressions
The first part in processing a regular expression is to
build an NFA from it
There are different NFA construction methods

We present the more traditional Thompson's technique as it is simpler to explain

Regular Expressions
Recursive Thompson’s construction of an NFA from a
regular expression
Th(ε) is a single ε-transition between two states, and Th(a) is a single
transition labeled a
Th(E . E′) is the concatenation of Th(E) and Th(E′)
Th(E | E′) adds a new initial and a new final state, connected by
ε-transitions to and from Th(E) and Th(E′)
Th(E *) surrounds Th(E) with ε-transitions that allow skipping it or
traversing it repeatedly

Regular Expressions
Once the NFA is built we add a self-loop (traversable by
any character) at the initial state
Another alternative is to make the NFA deterministic,
converting it into a DFA
However the number of states can grow non linearly,
even exponentially in the worst case

Multiple Patterns
Several of the algorithms for single string matching can
be extended to handle multiple strings
P = {P1 , P2 , . . . , Pr }
For example, we can extend Horspool so that d[c] is the minimum
over the di [c] values of the individual patterns Pi

To compute each di we must truncate Pi to the length of the shortest
pattern in P , and that length will be m
Other variants that perform well are extensions of
BNDM
Yet, bit-parallel algorithms are not useful for this case

Approximate Searching
An extension of exact string matching where not only the occurrences of
a string P must be reported, but also the text positions where
P occurs with at most k 'errors'
Different definitions of what is an error can be adopted
The simplest definition is the Hamming distance that allows just
substitutions of characters
A very popular one corresponds to the so-called
Levenshtein or edit distance:
An error is the deletion, insertion, or substitution of a single
character

This model is simple enough to permit fast searching, being useful for
most IR scenarios
This can be extended to approximate pattern matching
Dynamic Programming
The classical solution to approximate string matching is
based on dynamic programming
A matrix C[0..m, 0..n] is filled column by column, where
C[i, j] represents the minimum number of errors needed
to match p1 p2 . . . pi to some suffix of t1 t2 . . . tj
This is computed as follows:

C[0, j] = 0,
C[i, 0] = i,
C[i, j] = if (pi = tj ) then C[i − 1, j − 1]
else 1 + min(C[i − 1, j], C[i, j − 1], C[i − 1, j − 1]),

where a match is reported at text positions j such that C[m, j] ≤ k

Dynamic Programming
The dynamic programming algorithm to search for
‘colour’ in the text kolorama with k = 2 errors

k o l o r a m a
0 0 0 0 0 0 0 0 0
c 1 1 1 1 1 1 1 1 1
o 2 2 1 2 1 2 2 2 2
l 3 3 2 1 2 2 3 3 3
o 4 4 3 2 1 2 3 4 4
u 5 5 4 3 2 2 3 4 5
r 6 6 5 4 3 2* 3 4 5

The starred entry indicates a position finishing an approximate occurrence
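
A direct Python version of this column-by-column dynamic programming (illustrative; it keeps a single column and reproduces the 'colour'/'kolorama' example):

def approximate_search(text, pattern, k):
    # returns the 1-based text positions j where pattern matches a suffix
    # of text[:j] with at most k errors (edit distance)
    m = len(pattern)
    C = list(range(m + 1))                   # column j = 0: C[i, 0] = i
    matches = []
    for j, tj in enumerate(text, start=1):
        prev_diag, C[0] = C[0], 0            # C[0, j] = 0
        for i, pi in enumerate(pattern, start=1):
            new = prev_diag if pi == tj else 1 + min(C[i - 1], C[i], prev_diag)
            prev_diag, C[i] = C[i], new
        if C[m] <= k:
            matches.append(j)
    return matches

print(approximate_search("kolorama", "colour", 2))     # [5]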
Dynamic Programming
The previous algorithm requires O(mn) time
Several extensions of it have been presented that
achieve O(kn) time
A simple O(kn) algorithm is obtained by computing each column
only up to the point where one knows that all the subsequent cell
values will exceed k
The memory needed can also be reduced to O(m), since only one column of the matrix has to be stored at a time

Dynamic Programming
Figure below gives the pseudocode for this variant
Approximate-DP (T = t1 t2 . . . tn , P = p1 p2 . . . pm , k)
(1) for i ← 0 . . . m do C[i] ← i
(2) last ← k + 1
(3) for j ← 1 . . . n do
(4) pC, nC ← 0
(5) for i ← 1 . . . last do
(6) if pi = tj then nC ← pC
(7) else
(8) if pC < nC then nC ← pC
(9) if C[i] < nC then nC ← C[i]
(10) nC ← nC + 1
(11) pC ← C[i]
(12) C[i] ← nC
(13) if nC ≤ k
(14) then if last = m then report an occurrence ending at position j
(15) else last ← last + 1
(16) else while C[last − 1] > k do last ← last − 1

Automata and Bit-parallelism
Approximate string matching can also be expressed as
an NFA search
Figure below depicts an NFA for approximate string
matching for the pattern ‘colour’ with two errors
The NFA has three rows of states, one per allowed number of errors
(no errors, 1 error, 2 errors); each row spells c, o, l, o, u, r, and the
vertical, diagonal, and ε arrows between consecutive rows account for
insertions, substitutions, and deletions

Automata and Bit-parallelism
Although the search phase is O(n), the NFA tends to be
large: it has O(km) states
A better solution, based on bit-parallelism, is an
extension of Shift-And
We can simulate k + 1 Shift-And processes while taking
care of vertical and diagonal arrows as well

Automata and Bit-parallelism
Pseudocode for approximate string matching using the
Shift-And algorithm
Approximate-Shift-And (T = t1 t2 . . . tn , P = p1 p2 . . . pm , k)
(1) for c ∈ Σ do B[c] ← 0
(2) for j ← 1 . . . m do B[pj ] ← B[pj ] | (1 << (j − 1))
(3) for i ← 0 . . . k do Di ← (1 << i) − 1
(4) for j ← 1 . . . n do
(5) pD ← D0
(6) nD, D0 ← ((D0 << 1) | 1) & B[tj]
(7) for i ← 1 . . . k do
(8) nD ← ((Di << 1) & B[tj]) | pD | ((pD | nD) << 1) | 1
(9) pD ← Di , Di ← nD
(10) if nD & (1 << (m − 1)) ≠ 0
(11) then report an occurrence ending at position j
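
For illustration, a Python version of this bit-parallel simulation, keeping one mask per allowed number of errors as in the pseudocode above:

def approximate_shift_and(text, pattern, k):
    # reports the 1-based text positions where pattern ends with at most k errors
    m = len(pattern)
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    D = [(1 << i) - 1 for i in range(k + 1)]           # one mask per number of errors
    matches = []
    for pos, c in enumerate(text, start=1):
        mask = B.get(c, 0)
        prev = D[0]                                    # old D0
        new = D[0] = ((D[0] << 1) | 1) & mask
        for i in range(1, k + 1):
            new = ((D[i] << 1) & mask) | prev | ((prev | new) << 1) | 1
            prev, D[i] = D[i], new
        if new & (1 << (m - 1)):
            matches.append(pos)
    return matches

print(approximate_shift_and("kolorama", "colour", 2))  # [5], as in the DP example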

Filtration
Frequently it is easier to tell that a text position cannot
match than to ensure that it matches with k errors
Filtration is based on applying a fast filter over the text,
which hopefully discards most of the text positions
Then we can apply an approximate search algorithm
over the areas that could not be discarded
A simple and fast filter:
Split the pattern into k + 1 pieces of about the same length
Then we can run a multi-pattern search algorithm for the pieces
If a piece pj . . . pj′ appears in ti . . . ti′ , then we run an approximate
string matching algorithm over ti−j+1−k . . . ti−j+m+k
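
A rough sketch of this partition filter (illustrative only: Python's str.find stands in for a proper multi-pattern search, and the verification step reuses the approximate_search sketch shown earlier for the dynamic programming approach; it assumes m > k):

def filtered_search(text, pattern, k):
    # split the pattern into k + 1 disjoint pieces: at least one piece must
    # appear unchanged inside any occurrence with at most k errors
    m, n = len(pattern), len(text)
    step = m // (k + 1)
    pieces = [(j, pattern[j:j + step]) for j in range(0, step * (k + 1), step)]
    areas = set()
    for offset, piece in pieces:                        # str.find as a stand-in for
        start = text.find(piece)                        # a multi-pattern search
        while start != -1:
            lo = max(0, start - offset - k)
            hi = min(n, start - offset + m + k)
            areas.add((lo, hi))                         # area that must be verified
            start = text.find(piece, start + 1)
    matches = set()
    for lo, hi in areas:                                # verify with the DP sketch above
        for end in approximate_search(text[lo:hi], pattern, k):
            matches.add(lo + end)
    return sorted(matches)

print(filtered_search("kolorama", "colour", 2))         # [5]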

Searching Compressed Text
An extension of traditional compression mechanisms
gives a very powerful way of matching much more
complex patterns
Let us start with phrase queries that can be searched
for by
compressing each of its words and
searching the compressed text for the concatenated string of
target symbols

This is true as long as:
the phrase is made up of simple words, each of which can be
translated into one codeword, and
we want the separators to appear exactly as in the query

Searching Compressed Text
A more robust search mechanism is based in word
patterns
For example, we may wish to search for:
Any word matching ‘United’ in case-insensitive form and
permitting two errors
Then a separator
And then any word matching ‘States’ in case-insensitive form
and permitting two errors

This search problem can be modeled by means of an automaton over codewords

Searching Compressed Text
Let C be the set of different codewords created by the
compressor
We can take C as an alphabet and see the compressed
text as a sequence of atomic symbols over C
Our pattern has three positions, each denoting a class
of characters:
The first is the set of codewords corresponding to words that
match ‘United’ in case-insensitive form and allowing two errors
The second is the set of codewords for separators and is an
optional class
The third is like the first but for the word ‘States’

Searching Compressed Text
The Figure below illustrates the previous example
The automaton has three positions: 'United', an optional separator, and 'States'.
The vocabulary (the alphabet of codewords) contains, among other entries,
words such as 'UNITED', 'United', 'Unnited', 'unite', whose codewords are
marked for the first position (B[ ] = 100), separators such as ',' and '\n',
marked for the optional second position (B[ ] = 010), and words such as
'States' and 'state', marked for the third position (B[ ] = 001);
an entry like 'unates', which matches both word classes, gets B[ ] = 101

Searching Compressed Text
This process can be used to search for much more
complex patterns
Assume that we wish to search for ‘the number of
elements successfully classified’, or
something alike
Many other phrases can actually mean more or less the
same, for example:
the number of elements classified with success
the elements successfully classified
the number of elements we successfully classified
the number of elements that were successfully classified
the number of elements correctly classified
the number of elements we could correctly classify
...

Searching Compressed Text
To recover from linguistic variants as shown above we
must resort to word-level approximate string matching
In this model, we permit a limited number of missing,
extra, or substituted words
For example, with 3 word-level errors we can recover from all the
variants in the example above

Multi-dimensional Indexing

Multi-dimensional Indexing
In multimedia data, we can represent every object by
several numerical features
For example, imagine an image from where we can
extract a color histogram, edge positions, etc
One way to search in this case is to map these object
features into points in a multi-dimensional space
Another approach is to have a distance function for
objects and then use a distance based index
The main mapping methods form three main classes:
R∗ -trees and the rest of the R-tree family,
linear quadtrees,
grid-files

Multi-dimensional Indexing
The R-tree-based methods seem to be most robust for
higher dimensions
The R-tree represents a spatial object by its minimum
bounding rectangle (MBR)
Data rectangles are grouped to form parent nodes,
which are recursively grouped, to form grandparent
nodes and, eventually, a tree hierarchy
Disk pages are consecutive byte positions on the
surface of the disk that are fetched with one disk access
The goal of the insertion, split, and deletion routines is
to give trees that will have good clustering

Multi-dimensional Indexing
Figure below illustrates data rectangles (in black),
organized in an R-tree with fanout 3

Multi-dimensional Search
A range query specifies a region of interest, requiring all
the data regions that intersect it
To answer this query, we first retrieve a superset of the
qualifying data regions:
We compute the MBR of the query region, and then we
recursively descend the R-tree, excluding the branches whose
MBRs do not intersect the query MBR
Thus, the R-tree will give us quickly the data regions whose MBR
intersects the MBR of the query region

The retrieved data regions will be further examined for intersection with the query region

Multi-dimensional Search
The data structure of the R-tree for the previous figure
is (fanout = 3)
