processing is carried out on the document images embedded in the PDF files, including connected component detection, skew estimation and rectification (if applicable), word bounding, etc. Next, each word object is represented using the word image coding described below. For each PDF file, a feature code file is then constructed, containing the URL of the corresponding PDF file and each word’s information, such as its location and word image codes. This feature code file is stored on a server and is used instead of the original document images for document retrieval. The above processing is performed off-line, prior to the user’s online query.

On the client side, the user is prompted to input a set of query words through a web interface, specifically an Active Server Page (ASP). The user can also choose a logical operation (“AND”, “OR”, “NOT”) to apply among these query words. Once the request is submitted, the server processes each query word in turn and merges the results at the end of each step according to the logical operation the user has chosen. Finally, a temporary table that stores all the matching documents with their URLs and the normalized occurrence frequencies of the query words is returned to the user for display. The user can follow the links to read the actual documents online or download them for future reference.

Each query word is processed as follows. First, the server searches for the query word in an index table stored in an Oracle database. This index table stores information about the words that have previously been searched. If there are matches, information about the corresponding documents that contain this query word is retrieved and stored in a temporary table for subsequent merging. This information includes the documents’ URLs as well as the normalized occurrence frequency of the query word in each of these documents. Otherwise, if no matches are found in the index table, we generate the feature code string of the query word and then use the word matching algorithm to search the underlying feature code files.

To build an incremental intelligence system and speed up the search process, the results of earlier query searches are stored in the index table. If there are newly found matches, the index table is updated accordingly, i.e. the query word is added to the index table together with the current document’s name, its URL, and the normalized occurrence frequency of the query word in this document.

3. Word Image Coding

A word object extracted from the document images is represented by word image codes according to its features. The feature employed in our approach is the Left-to-Right Primitive String (LRPS), a code string sequenced from the leftmost of a word to its rightmost.
The line feature and traversal feature are used to extract the primitives of the word image.

A word printed in a document can appear in various sizes, fonts, and spacings, and feature extraction from word bitmaps has to take this into consideration. Generally speaking, it is easy to cope with different sizes. Dealing with multiple fonts and with touching characters caused by condensed spacing, however, is still a challenging issue.

For a word image in printed text, two characters are generally spaced apart by a few white columns of intercharacter spacing, as shown in Figure 2(a). But it is also common for one character to overlap another by a few columns because of kerning, as shown in Figure 2(b). Worse still, as shown in Figure 2(c), two or more adjacent characters may touch each other due to condensed spacing, which makes it difficult to separate them. We use inexact feature string matching to handle this problem.

Figure 2. Different spacing: (a) separated adjacent characters, (b) overlapped adjacent characters, (c) touched adjacent characters

3.1. LRPS Feature Representation

A word is explicitly segmented, from the leftmost to the rightmost, into discrete entities. Each entity, called a primitive here, is represented using definite attributes. A primitive p is described by a two-tuple (σ, ω), where σ is the Line-or-Traversal Attribute (LTA) of the primitive and ω is its Ascender-and-Descender Attribute (ADA). As a result, the word image is expressed as a sequence P of p_i’s:

P = <p_1 p_2 ... p_n> = <(σ_1, ω_1)(σ_2, ω_2) ... (σ_n, ω_n)>    (1)

where the ADA of a primitive is ω ∈ Ω = {‘x’, ‘a’, ‘A’, ‘D’, ‘Q’}, defined as:
‘x’: the primitive is between the x-line and the baseline.
‘a’: the primitive is between the top-boundary and the x-line.
‘A’: the primitive is between the top-boundary and the baseline.
‘D’: the primitive is between the x-line and the bottom-boundary.
‘Q’: the primitive is between the top-boundary and the bottom-boundary.

The definitions of the x-line, baseline, top-boundary and bottom-boundary may be found in Figure 3. A word bitmap extracted from a document image already contains the baseline and x-line information, which is a by-product of the text line extraction in the previous stage.

3.2. Generating Line-or-Traversal Attribute

The generation of the LTA is performed in two steps. We first extract the straight stroke line features from the word bitmap, as shown in Figure 3(a). Note that only vertical and diagonal stroke lines are extracted in this stage. Then the traversal features of the remainder part are computed. Finally, the features from these two steps are aggregated to generate the LTAs of the corresponding primitives. In other words, the LTA of a primitive is represented by either a straight line feature (if applicable) or a traversal feature.

3.2.1. Straight Stroke Line Feature. A run-length based method is used to extract the straight stroke lines from word images. We use R(a, θ) to represent a directional run, defined as the set of connected black pixels that contains pixel a along the specified direction θ. |R(a, θ)| is the run length of R(a, θ), i.e. the number of black points in the run.

The straight stroke line detection algorithm is summarized as follows:
(1) Along the middle line (between the x-line and the baseline), detect the boundary pair [A_l, A_r] of each stroke, where A_l and A_r are the left and right boundary points respectively.
(2) Detect the midpoint A_m of the line segment A_l A_r.
(3) Calculate R(A_m, θ) for different θ, and select θ_max, the direction with the longest run, as the run direction of A_m.
(4) If |R(A_m, θ_max)| is close to or larger than the x-height, the pixels containing A_m, between the boundary points A_l and A_r, along the direction θ_max, are extracted as a stroke line.

The stroke lines extracted from the word in Figure 3 are shown in Figure 3(a), while the remainder is shown in Figure 3(b). According to its direction, a line is assigned to one of three basic stroke lines: vertical stroke line, left-down diagonal stroke line and right-down diagonal stroke line. Corresponding to these types of strokes, three basic primitives are generated from the extracted stroke lines. Their ADAs are assigned based on their top-end and bottom-end positions. Their LTAs are respectively expressed as:
‘l’: vertical straight stroke line, such as that in the characters ‘l’, ‘d’, ‘p’, ‘q’, ‘D’, ‘P’, etc. For a primitive whose ADA is ‘x’ or ‘D’, we further check whether
there is a dot over the vertical stroke line. If yes, the LTA of the primitive is re-assigned as ‘i’.
‘v’: right-down diagonal straight stroke line, such as that in the characters ‘v’, ‘w’, ‘V’, ‘W’, etc.
‘w’: left-down diagonal straight stroke line, such as that in the characters ‘v’, ‘w’, ‘z’, etc. For a primitive whose ADA is ‘x’ or ‘A’, we further check whether there are two horizontal stroke lines connected with it at the top and bottom. If so, the LTA of the primitive is re-assigned as ‘z’.

Additionally, it is easy to detect primitives containing two or more straight stroke lines. They are:
‘x’: one left-down diagonal straight stroke line crosses one right-down diagonal straight stroke line.
‘y’: one left-down diagonal straight stroke line meets one right-down diagonal straight stroke line at its middle point.
‘Y’: one left-down diagonal stroke line, one right-down diagonal stroke line and one vertical stroke line cross in one point, like the character ‘Y’.
‘k’: one left-down diagonal stroke line, one right-down diagonal stroke line and one vertical stroke line meet in one point, like the characters ‘k’ and ‘K’.

Figure 3. Primitive string extraction: (a) straight stroke line features, (b) remainder part of (a), (c) traversal T_N = 2, (d) traversal T_N = 4, (e) traversal T_N = 6

3.2.2. Traversal Feature. After the primitives based on the stroke line features are extracted as described above, the primitives of the remainder part of the word image are computed based on the traversal features.

To extract the traversal feature, we scan the word image column by column, and the traversal number T_N is recorded by counting the number of transitions from black pixel to white pixel, or vice versa, along each column. This processing is not carried out on the part represented by stroke line features as described above. According to the value of T_N, different feature codes are assigned as follows.

‘&’: there is no image pixel in the column (T_N = 0). It corresponds to the blank intercharacter space. We treat intercharacter space as a special primitive. In addition, the overlap of adjacent characters caused by kerning is easily detected by analyzing the relative positions of the adjacent connected components. We can insert a space primitive between them in this case.

If T_N = 2, two parameters are used to assign a feature code. One is κ, the ratio of the number of black pixels in the column to the x-height. The other is the relative position of the stroke with respect to the x-line and the baseline, ξ = D_m / D_b, where D_m is the distance from the topmost stroke pixel in the column to the x-line and D_b is the distance from the bottommost stroke pixel to the baseline.
‘n’: κ < 0.2 and ξ < 0.3
‘u’: κ < 0.2 and ξ > 3
‘c’: κ > 0.5 and 0.5 < ξ < 1.5

If T_N ≥ 4, the feature code is assigned as:
‘o’: T_N = 4
‘e’: T_N = 6
‘g’: T_N = 8

Then identical feature codes in consecutive columns are merged and represented by one primitive. Note that a few columns may have no resultant feature code because they meet the requirements of none of the feature codes described above, which is usually caused by noise. Such columns are eliminated directly.

As illustrated in Figure 3, the word image is decomposed into one part with stroke lines (Figure 3(a)) and other parts with different traversal numbers (Figure 3(c) for T_N = 2, (d) for T_N = 4 and (e) for T_N = 6).

The number of legal combinations of a primitive’s two properties, i.e. σ and ω, is limited. For conciseness, each legal 2-tuple is replaced by a single letter, as listed in Table 1. The primitive code string is then composed by concatenating the generated primitives from the leftmost to the rightmost. The resultant primitive string of the word image in Figure 3 is <nmuomuomonomu&Odomn&ceo&oemuOd&ndoOdonomu&y>.

3.3. Postprocessing

To deal with different fonts, the primitives should be independent of typefaces. Among various fonts, a significant difference that affects the extraction of the LRPS is the serif, especially for primitives expressed by traversal features. Figure 4 gives some examples, some of which have serifs whereas others do not. It is a basic necessity to avoid the effect of serifs in the LRPS representation.
Table 1. Primitive properties and their coding representation

Primitive properties   Coding representation
(o,x)                  o
(e,x)                  e
(l,x)                  m
(c,x)                  c
(n,x)                  n
(u,x)                  u
(v,x)                  v
(w,x)                  w
(g,D)                  g
(i,A)                  i
(i,Q)                  j
(k,x)                  k
(x,x)                  x
(y,D)                  y
(z,x)                  z
(l,A)                  d
(l,D)                  q
(u,a)                  T
(c,a)                  P
(o,A)                  O
(e,A)                  E
(c,A)                  C
(v,A)                  V
(w,A)                  W
(k,A)                  K
(x,A)                  X
(Y,A)                  Y
(z,A)                  Z
(e,Q)                  Q

Figure 4. Different fonts (the word “health” rendered in Times Roman, Arial, Bookman, Helvetica, Courier, Tahoma, Century and Verdana)

Our observation shows that a primitive produced by a serif can be eliminated by analyzing its preceding and succeeding primitives. For instance, a primitive <u> in a primitive subsequence <du&> is normally generated by a right-side serif of characters such as ‘a’, ‘h’, ‘m’, ‘n’, ‘u’, etc. Therefore, we can remove the primitive <u> from a primitive subsequence <du&>. Similarly, a primitive <o> in a primitive subsequence <nom> is normally generated by a serif of characters such as ‘h’, ‘m’, ‘n’, etc., so we can directly eliminate the primitive <o> from a primitive subsequence <nom>.

More postprocessing rules can be used to eliminate the primitives caused by serifs. With this postprocessing, the primitive string of the word in Figure 3 becomes <mumuomnm&dom&ceo&oemd&ndodnm&y>.

Based on the feature extraction described above, we can give each character a standard primitive string token (PST). For example, the primitive string token of the character ‘b’ is <doc>, and that of the character ‘p’ is <qoc>. Table 2 lists the primitive string tokens of all characters. The primitive string of a word can be generated by concatenating the primitive string tokens of the characters in the word and inserting the special primitive <&> between them to mark a spacing gap.

Generally speaking, the primitive string obtained from a real image is not as perfect as that synthesized from the standard PSTs of the corresponding characters, owing to a multitude of factors such as connections between adjacent characters, noise, etc. As shown in Figure 3, the primitive substring for the character ‘h’ is changed from <dnm> to <dom> because of noise. The touching characters ‘al’ and ‘th’ have also affected the corresponding primitive substrings. Inexact matching is employed in Section 5 to make the matching robust.

We use an English dictionary of 25133 commonly used words to evaluate the validity of the proposed word image coding scheme. Each word is represented by its word primitive token (WPT), generated by concatenating the characters’ primitive string tokens described above in the order of the characters in the word, with the special primitive <&> inserted between two adjacent PSTs. For example, the WPT of the word “health” is <dnm&ceo&oem&d&ndo&dnm>.

We found that each word in the dictionary has a unique coding representation that distinguishes it from the others, although there is ambiguity at the character level, e.g. the PSTs of the characters ‘l’ and ‘I’ are the same. We conclude that our coding scheme is unambiguous at the word level for this dictionary, at the cost of a heavier computational burden than Spitz’s coding scheme.
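The two serif rules stated in Section 3.3 (drop <u> inside <du&>, drop <o> inside <nom>) can be expressed as plain substring rewrites. The sketch below is only an illustration: the rule table and function name are hypothetical, and since the text notes that more rules are used in practice, these two rules alone are not expected to reproduce the postprocessed string given for Figure 3.

```python
# (subsequence, replacement) pairs; only the two rules stated in the text.
SERIF_RULES = [
    ("du&", "d&"),  # <u> after a vertical stroke, before a space: right-side serif
    ("nom", "nm"),  # <o> between <n> and <m>: serif of 'h', 'm', 'n', ...
]


def remove_serif_primitives(primitive_string, rules=SERIF_RULES):
    """Apply the rewrite rules repeatedly until no rule matches any more.

    Every replacement shortens the string, so the loop always terminates.
    """
    changed = True
    while changed:
        changed = False
        for pattern, replacement in rules:
            if pattern in primitive_string:
                primitive_string = primitive_string.replace(pattern, replacement)
                changed = True
    return primitive_string
```

Keeping the rules in a data table rather than in code makes it straightforward to add further context-dependent rules of the same form.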
Table 2. Primitive string tokens (PST) of characters

Ch  PST     Ch  PST
a   oem     A   WV
b   doc     B   dEd
c   co      C   CO
d   cod     D   dOC
e   ceo     E   dE
f   ndT     F   dOT
g   g       G   COEO
h   dnm     H   dnd
i   i       I   d
j   j       J   ud
k   k       K   K
l   d       L   du
m   mnmnm   M   dVWd
n   mnm     N   dVd
o   coc     O   COC
p   qoc     P   dOP
q   coq     Q   COQC
r   mn      R   dOEO
s   oeo     S   OEO
t   ndo     T   TdT
u   mum     U   dud
v   vw      V   VW
w   vwvw    W   VWVW
x   x       X   X
y   y       Y   Y
z   z       Z   Z

    [FileName]
    b20169802.pdf
    [url]
    http://137.132.82.156/ASP/DataStore/p1206dpdf/b20169802.pdf
    [NumberOfPage]
    114
    [NumberOfWord]
    34591
    <WordInfo>
    ……
    [FeatureCode]
    -mnmnm*coc*cod*mum*d*oem*mn-
    [x]
    1
    [PageNo]
    12
    [Location]
    231,1058,256,1129
    ……
    [FeatureCode]
    -ceo*d*ceo*mnmnm*ceo*mnm*ndo-
    [x]
    1
    [PageNo]
    85
    [Location]
    198,1422,236,1496
    ……
    </WordInfo>

Figure 5. Code file for a PDF with imaged documents
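As a small illustration of how a word primitive token is synthesized from Table 2, the sketch below joins the per-character PSTs with a separator. The function and variable names are hypothetical. The default separator is the <&> primitive used in the text; the feature codes listed in Figure 5 appear to use ‘*’ in the same role, so the separator is left as a parameter.

```python
# Table 2 as alternating "character PST" tokens.
_TABLE_2 = """
a oem A WV      b doc B dEd     c co C CO       d cod D dOC
e ceo E dE      f ndT F dOT     g g G COEO      h dnm H dnd
i i I d         j j J ud        k k K K         l d L du
m mnmnm M dVWd  n mnm N dVd     o coc O COC     p qoc P dOP
q coq Q COQC    r mn R dOEO     s oeo S OEO     t ndo T TdT
u mum U dud     v vw V VW       w vwvw W VWVW   x x X X
y y Y Y         z z Z Z
"""
_tokens = _TABLE_2.split()
PST = dict(zip(_tokens[0::2], _tokens[1::2]))  # character -> primitive string token


def word_primitive_token(word, separator="&"):
    """Concatenate the PSTs of the characters, separated by the space primitive."""
    return separator.join(PST[ch] for ch in word)


print(word_primitive_token("health"))        # dnm&ceo&oem&d&ndo&dnm
print(word_primitive_token("modular", "*"))  # mnmnm*coc*cod*mum*d*oem*mn
```

The first output matches the WPT given in the text for “health”; the second matches the first feature code shown in Figure 5, which therefore reads as the word “modular”.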
d 0 2 1 2 1 0 2 1 4 3 2 1 4 3 2 1 0 0 0 0 2 1 0 2 3 2 2 3 6 5 4
& 0 1 1 1 1 0 1 1 3 6 5 4 3 6 5 4 3 2 1 0 1 1 3 1 2 2 1 2 5 8 7
c 0 0 1 1 1 1 0 1 1 5 4 5 4 5 8 7 6 5 4 3 2 1 2 3 2 2 1 1 4 7 6
e
e 0 0 0 1 1 1 1 0 1 4 3 4 5 4 7 10 9 8 7 6 5 4 3 2 1 2 1 1 3 6 5
o 0 0 0 0 1 3 2 1 0 3 2 5 4 4 6 9 12 11 10 9 8 7 6 5 4 3 2 1 2 5 4
& 0 0 0 0 0 2 2 1 0 2 2 4 4 6 5 8 11 14 13 12 11 10 9 8 7 6 5 4 3 4 4
o 0 0 0 0 0 2 2 2 1 1 1 4 4 5 6 7 10 13 16 15 14 13 12 11 10 9 8 7 6 5 4
e 0 0 0 0 0 1 2 2 2 1 0 3 4 4 5 8 9 12 15 18 17 16 15 14 13 12 11 10 9 8 7
a
m 0 2 1 2 1 0 3 2 4 3 2 2 5 4 4 7 8 11 14 17 20 19 18 17 16 15 14 13 12 11 10
& 0 1 1 1 1 0 2 2 3 6 5 4 4 7 6 6 7 10 15 16 19 19 21 20 19 18 17 16 15 14 13
d 0 0 0 1 0 0 1 1 2 5 8 7 6 6 5 5 6 9 14 15 18 21 20 19 22 21 20 19 18 17 16
l
& 0 0 0 0 0 0 0 0 1 4 7 7 6 8 7 6 5 8 13 14 17 20 23 22 21 20 20 19 18 20 19
n 0 0 0 0 0 0 0 2 1 3 6 7 7 7 8 7 6 7 12 13 16 19 22 25 24 23 22 22 21 20 19
t
d 0 0 0 0 0 0 0 1 2 2 5 6 7 6 7 6 5 6 11 12 15 18 21 24 27 26 25 24 23 22 21
o 0 0 0 0 0 2 1 0 1 1 4 7 6 6 6 7 8 7 10 11 14 17 20 23 26 29 28 27 26 25 24
& 0 0 0 0 0 1 1 0 0 3 3 6 6 8 7 6 7 7 9 10 13 16 19 22 25 28 28 27 26 28 27
d 0 0 0 0 0 0 1 0 0 2 5 5 6 7 6 5 6 6 8 9 12 15 18 21 24 27 30 29 28 27 26
n 0 0 0 0 0 0 0 3 2 1 4 4 5 6 7 6 5 5 7 8 11 14 17 20 23 26 29 32 31 30 29
h
m 0 2 1 2 1 0 2 1 2 1 3 3 6 5 6 7 6 5 6 7 10 13 16 19 22 25 28 31 34 33 32
image through the available hyperlinks over the document’s name for online reading or download it for future reference. The time used to process each input query is also recorded and shown at the top of the final query result for the user’s information. Intermediate processes, such as searching the index table and updating the index table, are displayed to notify the user when these operations are carried out for a given set of query inputs. Figure 7 shows an example of a user’s query result.

The Oracle database is used to store an index table that contains the corresponding information of all the words that have been queried before. The information stored for each word consists of the URLs of the matching document images and the occurrence frequency of the word in each of the corresponding documents. Moreover, the Oracle database is also used to store a temporary result table that records all the documents retrieved during the processing of each particular query word. This result table is merged with each subsequent result set and finally becomes a list of documents that match the whole input query expression. The final result is then retrieved directly from this result table and returned to the user for display.

Finally, the win2k server is used to store the feature code files generated from the original document images. The feature code files are generated off-line, with noise removal and skew rectification as preprocessing procedures. These files are stored as a database and are used during the word primitive code matching step, which happens when the query word is not found in the index table, that is, when the query word has not been queried before. In this case, we need to convert the query word to a feature code string and perform the word code matching operation among the underlying feature code files stored on the win2k server.

6.2. AND/OR/NOT Operation

The system supports AND/OR/NOT operations over a set of query words. Users are prompted through a web interface to input a set of query words separated by spaces and to choose whether to perform the AND or the OR operation on them, followed by a set of query words that must not appear in the resulting documents. The NOT operation is performed after the AND/OR operations and removes the documents that contain the words specified in the NOT query input box.

6.2.1. AND Operation. If the AND operation is chosen, the system proceeds as follows. It starts from the first word, processes it to obtain a result table and stores the table temporarily in the Oracle database. It then joins this table with the resulting record set of each subsequent round to obtain a new result table. In this manner, at the end of each round, the result table stores the documents that contain all the query words processed so far, together with the product of the normalized frequencies of these words in each document. In the end, only the documents that contain all the specified query words are left in the result table for merging with the subsequent result of the NOT operation.
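The per-word lookup and the AND/NOT merging described above can be sketched in memory as follows. This is a hedged illustration only: the real system keeps these tables in an Oracle database, whereas here the index table is a plain dictionary mapping each word to `{url: normalized frequency}`, and `match_in_code_files` is a hypothetical callback standing in for the word matching over the feature code files.

```python
def lookup(word, index_table, match_in_code_files):
    """Return {url: normalized frequency} for one query word.

    The index table is tried first; only words not queried before fall back
    to matching against the feature code files (the slow path).
    """
    hits = index_table.get(word)
    if hits is None:
        hits = match_in_code_files(word)
        if hits:  # newly found matches are added to the index table
            index_table[word] = hits
    return hits


def and_query(words, not_words, index_table, match_in_code_files):
    """AND the query words together, then apply the NOT words."""
    result = None
    for word in words:
        hits = lookup(word, index_table, match_in_code_files)
        if result is None:
            result = dict(hits)  # result table after the first round
        else:
            # Join: keep documents present in both, multiply their frequencies.
            result = {url: result[url] * hits[url]
                      for url in result.keys() & hits.keys()}
    result = result or {}
    for word in not_words:
        for url in lookup(word, index_table, match_in_code_files):
            result.pop(url, None)  # drop documents containing a NOT word
    return result
```

Because `lookup` writes newly matched words back into the index table, a repeated query for the same word is answered from the index table without touching the feature code files.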
Figure 7. Search Result
scenario is shown in Figure 11, where an AND operation is first performed on the query words “computational approach”, followed by the NOT operation on “Poincare”. Since “Poincare” was not queried before, we need to perform partial word matching among the feature code files. As a result, one document is found to contain “Poincare” and is inserted into the index table, as shown in the interface. In this case, the time needed, 22.375 seconds, is longer than when retrieving directly from the index table.

Therefore, the advantage of our system is that it is an incremental intelligence system: as more users use it, searching the index table alone will increasingly be enough to fulfill a user’s request, which makes retrieval considerably more efficient.

7. Conclusions and Future Work

been implemented, and the preliminary test results with the imaged documents of students’ theses in our digital library show that the proposed system provides an efficient and promising tool for document image retrieval.

Future work will focus on improving the experimental system to make it suitable for practical use. The setting of matching scores between different primitive codes is left to future research. How to integrate linguistic knowledge into the present system to improve retrieval performance also needs further investigation. In particular, all of the word objects are included in the feature code file in the present system, which obviously slows down processing. If meaningful keywords are extracted from them and used to generate the feature code file, not only will processing speed up greatly, but retrieval performance will also benefit.