
8. Text Processing

Data Structures and Algorithms in Java 1/47


Objectives
• Abundance of Digitized Text
• The problem of String Matching
• Brute-Force algorithm
• Knuth-Morris-Pratt Algorithm
• Data Compression
• Condition for Data Compression
• Huffman Coding Algorithm
• LZW Algorithm
• Run-length Encoding
Abundance of Digitized Text
Despite the wealth of multimedia information, text processing remains one of the
dominant functions of computers. Computers are used to edit, store, and display
documents, and to transport files over the Internet. Furthermore, digital systems
are used to archive a wide range of textual information, and new data is being
generated at a rapidly increasing pace. Common examples of digital collections
that include textual information are:
• Snapshots of the World Wide Web, as Internet document formats HTML and XML
are primarily text formats, with added tags for multimedia content.
• All documents stored locally on a user's computer
• Email archives
• Compilations of status updates on social networking sites such as Facebook
• Feeds from microblogging sites such as Twitter and Tumblr

These collections include written text from hundreds of international languages.


Furthermore, there are large data sets (such as DNA) that can be viewed
computationally as “strings” even though they are not language.
In this lesson, we explore some of the fundamental algorithms that can be used
to efficiently analyze and process large textual data sets.



The problem of String Matching

Given a string S, the problem of string matching deals with determining whether a pattern p occurs in S and, if p does occur, returning the position in S at which p occurs.


Brute-Force algorithm
One of the most obvious approaches to the string matching problem is to compare the first element of the pattern p with the first element of the string S in which p is to be located. If the first element of p matches the first element of S, compare the second element of p with the second element of S. If that matches too, proceed likewise until all of p has been matched. If a mismatch is found at any position, shift p one position to the right and repeat the comparison, beginning again from the first element of p. This algorithm is called Brute-Force. Its worst-case complexity is O(nm).
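The procedure just described can be written directly in Java; this is a minimal sketch using 0-based indices (the slides use 1-based positions), not the textbook's implementation:

```java
public class BruteForce {
    // Return the 0-based index of the first occurrence of p in s, or -1.
    static int indexOf(String s, String p) {
        int n = s.length(), m = p.length();
        for (int i = 0; i + m <= n; i++) {                 // candidate shift of p
            int j = 0;
            while (j < m && s.charAt(i + j) == p.charAt(j))
                j++;                                       // extend the match
            if (j == m) return i;                          // all of p matched
        }
        return -1;
    }

    public static void main(String[] args) {
        // S and p from the demo on the next slides: match at 0-based index 3
        System.out.println(indexOf("abcabaabcabac", "abaa")); // 3
    }
}
```

In the worst case the inner while loop rescans up to m characters for each of the n shifts, which is exactly the O(nm) behavior discussed above.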
Brute-Force algorithm demo - 1

Below is an illustration of how the previously described O(mn) approach works.

String S   a b c a b a a b c a b a c

Pattern p  a b a a


Brute-Force algorithm demo - 2
Step 1: compare p[1] with S[1]
S  a b c a b a a b c a b a c
p  a b a a

Step 2: compare p[2] with S[2]
S  a b c a b a a b c a b a c
p  a b a a
Brute-Force algorithm demo - 3
Step 3: compare p[3] with S[3]
S  a b c a b a a b c a b a c
p  a b a a
Mismatch occurs here.

Since a mismatch is detected, shift p one position to the right and repeat the matching procedure from the new position, performing steps analogous to steps 1 through 3.


Brute-Force algorithm demo - 4
S  a b c a b a a b c a b a c
p        a b a a
Finally, a match is found after shifting p three times to the right.

Drawbacks of this approach: if m is the length of pattern p and n the length of string S, the matching time is of the order O(mn). This is certainly a very slow-running algorithm.
What makes this approach so slow is the fact that elements of S that were compared earlier are involved again and again in comparisons in later iterations. For example, when a mismatch is first detected in comparing p[3] with S[3], pattern p is moved one position to the right and the matching procedure resumes from there. The first comparison that then takes place is between p[1] = a and S[2] = b. Note that S[2] = b had previously been involved in a comparison in step 2; this is a repeated use of S[2] in another comparison.
It is these repetitive comparisons that lead to the runtime of O(mn).
The Knuth-Morris-Pratt Algorithm
Knuth, Morris and Pratt proposed a linear-time algorithm for the string matching problem.
A matching time of O(n + m) is achieved by avoiding comparisons with elements of S that have previously been involved in a comparison with some element of the pattern p to be matched; i.e., backtracking on the string S never occurs.


Components of KMP algorithm
• The prefix function, Π
The prefix function Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern p. In other words, it enables avoiding backtracking on the string S.
• The KMP Matcher
With string S, pattern p and prefix function Π as inputs, it finds the occurrence of p in S and returns the number of shifts of p after which the occurrence is found.
The prefix function, Π
The following pseudocode computes the prefix function, Π:

Compute-Prefix-Function (p)
1   m ← length[p]              // p: pattern to be matched
2   Π[1] ← 0
3   k ← 0
4   for q ← 2 to m
5       do while k > 0 and p[k+1] ≠ p[q]
6              do k ← Π[k]
7          if p[k+1] = p[q]
8              then k ← k + 1
9          Π[q] ← k
10  return Π
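A Java rendering of this pseudocode — a sketch, not the textbook's code. The Π array is kept 1-based to mirror the pseudocode, while string characters are accessed 0-based (so pseudocode p[k+1] becomes p.charAt(k)):

```java
public class PrefixFunction {
    // pi[q] = length of the longest proper prefix of p[1..q]
    // that is also a suffix of p[1..q].
    static int[] computePrefix(String p) {
        int m = p.length();
        int[] pi = new int[m + 1];   // pi[1..m]; Java arrays zero-init, so pi[1] = 0
        int k = 0;
        for (int q = 2; q <= m; q++) {
            while (k > 0 && p.charAt(k) != p.charAt(q - 1))
                k = pi[k];                        // fall back on a shorter border
            if (p.charAt(k) == p.charAt(q - 1))
                k++;                              // extend the current border
            pi[q] = k;
        }
        return pi;
    }

    public static void main(String[] args) {
        // pattern from the worked example on the next slide
        System.out.println(java.util.Arrays.toString(computePrefix("ababaca")));
        // index 0 is unused: pi[1..7] = 0 0 1 2 3 0 1
    }
}
```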


Example: compute Π for the pattern p below:

q  1 2 3 4 5 6 7
p  a b a b a c a

Initially: m = length[p] = 7, Π[1] = 0, k = 0

Step 1: q = 2, k = 0  →  Π[2] = 0     Π: 0 0
Step 2: q = 3, k = 0  →  Π[3] = 1     Π: 0 0 1
Step 3: q = 4, k = 1  →  Π[4] = 2     Π: 0 0 1 2
Step 4: q = 5, k = 2  →  Π[5] = 3     Π: 0 0 1 2 3
Step 5: q = 6, k = 3  →  Π[6] = 0     Π: 0 0 1 2 3 0
        (the while loop reduces k from 3 to 0, since no prefix of p is a suffix ending at c)
Step 6: q = 7, k = 0  →  Π[7] = 1     Π: 0 0 1 2 3 0 1

After iterating 6 times, the prefix function computation is complete:

q  1 2 3 4 5 6 7
p  a b a b a c a
Π  0 0 1 2 3 0 1


The KMP Matcher
The KMP Matcher, with pattern p, string S and prefix function Π as input, finds a match of p in S.
The following pseudocode computes the matching component of the KMP algorithm:

KMP-Matcher (S, p)
1   n ← length[S]
2   m ← length[p]
3   Π ← Compute-Prefix-Function(p)
4   q ← 0                      // number of characters matched
5   for i ← 1 to n             // scan S from left to right
6       do while q > 0 and p[q+1] ≠ S[i]
7              do q ← Π[q]     // next character does not match
8          if p[q+1] = S[i]
9              then q ← q + 1  // next character matches
10         if q = m            // is all of p matched?
11             then print "Pattern occurs with shift" i − m
12                  q ← Π[q]   // look for the next match

Note: KMP finds every occurrence of p in S. That is why KMP does not terminate at step 12; rather, it searches the remainder of S for any more occurrences of p.
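The matcher can be sketched in Java alongside the prefix function (a minimal sketch; the method and class names are illustrative). It returns all 0-based shifts, matching the pseudocode's "i − m":

```java
import java.util.ArrayList;
import java.util.List;

public class KMP {
    static int[] computePrefix(String p) {
        int m = p.length();
        int[] pi = new int[m + 1];                 // 1-based: pi[1..m]
        int k = 0;
        for (int q = 2; q <= m; q++) {
            while (k > 0 && p.charAt(k) != p.charAt(q - 1)) k = pi[k];
            if (p.charAt(k) == p.charAt(q - 1)) k++;
            pi[q] = k;
        }
        return pi;
    }

    // Return all 0-based shifts at which p occurs in s.
    static List<Integer> match(String s, String p) {
        int[] pi = computePrefix(p);
        List<Integer> shifts = new ArrayList<>();
        int q = 0;                                 // number of characters matched
        for (int i = 1; i <= s.length(); i++) {    // scan S from left to right
            while (q > 0 && p.charAt(q) != s.charAt(i - 1))
                q = pi[q];                         // next character does not match
            if (p.charAt(q) == s.charAt(i - 1))
                q++;                               // next character matches
            if (q == p.length()) {                 // all of p matched
                shifts.add(i - p.length());        // shift = i - m
                q = pi[q];                         // look for the next match
            }
        }
        return shifts;
    }

    public static void main(String[] args) {
        // S and p from the illustration on the following slides: shift 6
        System.out.println(match("bacbabababacaca", "ababaca")); // [6]
    }
}
```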


Illustration: given a string S and pattern p as follows:

S: b a c b a b a b a b a c a c a

p: a b a b a c a

Let us execute the KMP algorithm to find whether p occurs in S.
For p, the prefix function Π was computed previously and is as follows:

q  1 2 3 4 5 6 7
p  a b a b a c a
Π  0 0 1 2 3 0 1
Initially: n = size of S = 15;
m = size of p = 7

Step 1: i = 1, q = 0
comparing p[1] with S[1]

S: b a c b a b a b a b a c a c a
p: a b a b a c a

p[1] does not match S[1]; p will be shifted one position to the right.

Step 2: i = 2, q = 0
comparing p[1] with S[2]

S: b a c b a b a b a b a c a c a
p:   a b a b a c a

p[1] matches S[2]. Since there is a match, p is not shifted.
Step 3: i = 3, q = 1
comparing p[2] with S[3]; p[2] does not match S[3]

S: b a c b a b a b a b a c a c a
p:   a b a b a c a

Backtracking on p, comparing p[1] and S[3]

Step 4: i = 4, q = 0
comparing p[1] with S[4]; p[1] does not match S[4]

S: b a c b a b a b a b a c a c a
p:       a b a b a c a

Step 5: i = 5, q = 0
comparing p[1] with S[5]; p[1] matches S[5]

S: b a c b a b a b a b a c a c a
p:         a b a b a c a
Step 6: i = 6, q = 1
comparing p[2] with S[6]; p[2] matches S[6]

S: b a c b a b a b a b a c a c a
p:         a b a b a c a

Step 7: i = 7, q = 2
comparing p[3] with S[7]; p[3] matches S[7]

S: b a c b a b a b a b a c a c a
p:         a b a b a c a

Step 8: i = 8, q = 3
comparing p[4] with S[8]; p[4] matches S[8]

S: b a c b a b a b a b a c a c a
p:         a b a b a c a
Step 9: i = 9, q = 4
comparing p[5] with S[9]; p[5] matches S[9]

S: b a c b a b a b a b a c a c a
p:         a b a b a c a

Step 10: i = 10, q = 5
comparing p[6] with S[10]; p[6] does not match S[10]

S: b a c b a b a b a b a c a c a
p:         a b a b a c a

Backtracking on p, comparing p[4] with S[10], because after the mismatch q = Π[5] = 3

Step 11: i = 11, q = 4
comparing p[5] with S[11]; p[5] matches S[11]

S: b a c b a b a b a b a c a c a
p:             a b a b a c a
Step 12: i = 12, q = 5
comparing p[6] with S[12]; p[6] matches S[12]

S: b a c b a b a b a b a c a c a
p:             a b a b a c a

Step 13: i = 13, q = 6
comparing p[7] with S[13]; p[7] matches S[13]

S: b a c b a b a b a b a c a c a
p:             a b a b a c a

Pattern p has been found to occur completely in string S. The total number of shifts that took place for the match to be found is: i − m = 13 − 7 = 6.


Running-time analysis
• Compute-Prefix-Function (pseudocode as given earlier): steps 1 to 3 take constant time, and the for loop from step 4 to step 10 runs m times. Hence the running time of the prefix-function computation is Θ(m).
• KMP-Matcher: the for loop beginning at step 5 runs n times, i.e., as long as the length of the string S. Since steps 1 to 4 take constant time, the running time is dominated by this for loop. Thus the running time of the matching function is Θ(n).
Together, the KMP algorithm therefore runs in Θ(m + n) time.


Data Compression - 1
• "In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use, through use of specific encoding schemes." (Wikipedia)
• Compression reduces the consumption of storage (disks) or bandwidth.
• However, it takes processing time to restore or view the compressed data.

Raw data  --encoding-->  Compressed data
Raw data  <--decoding--  Compressed data


Data Compression - 2
• Types of compression
  – Lossy: MP3, JPG
  – Lossless: ZIP, GZ
• Compression algorithms:
  – Huffman Encoding
  – Lempel-Ziv
  – RLE: Run-Length Encoding
• The performance of compression depends on the file type.
Compress data by encoding symbols contained in it (lossless compression)
• The information content of the message set, called the entropy of the source X = (x1, x2, …, xn), is defined by:

    Lave = H = P(x1)L(x1) + · · · + P(xn)L(xn)

  where L(xi) = −log2 P(xi), which is the minimum length of a codeword for symbol xi (Claude E. Shannon, 1948). Shannon's entropy represents an absolute limit on the best possible lossless compression of any communication.
• To compare the efficiency of different data compression methods when applied to the same data, the same measure is used; this measure is the compression rate:

    compression rate = (length(input) − length(output)) / length(input)

Codeword: the sequence of bits of a code corresponding to a symbol.
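These two formulas translate directly into Java; a minimal sketch (the method names are illustrative):

```java
public class Entropy {
    // H = sum over symbols of P(x) * (-log2 P(x)), in bits per symbol
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p)
            if (pi > 0)                            // 0 * log 0 is taken as 0
                h += pi * (-Math.log(pi) / Math.log(2));
        return h;
    }

    // compression rate = (length(input) - length(output)) / length(input)
    static double compressionRate(int inputLen, int outputLen) {
        return (inputLen - outputLen) / (double) inputLen;
    }

    public static void main(String[] args) {
        // the symbol distribution used in the Huffman example later in this lesson
        double[] probs = {0.40, 0.20, 0.15, 0.11, 0.09, 0.05};
        System.out.printf("H = %.2f bits/symbol%n", entropy(probs)); // H = 2.28 bits/symbol
        System.out.println(compressionRate(25, 12));                 // 0.52
    }
}
```

No lossless code for this distribution can average fewer than H bits per symbol, which is the "absolute limit" stated above.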
Uniquely Decodable Codes
A variable length code assigns a bit string (codeword)
of variable length to every message value
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence of bits
1011 ?
Is it aba, ca, or ad?
A uniquely decodable code is a variable length code in
which bit strings can always be uniquely
decomposed into its codewords.

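The ambiguity can be checked mechanically with a small backtracking decoder; this is an illustrative sketch that enumerates every way a bit string can be split into codewords:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Decodings {
    // Recursively enumerate every decomposition of 'bits' into codewords.
    static List<String> decode(String bits, Map<String, Character> code) {
        List<String> results = new ArrayList<>();
        if (bits.isEmpty()) {
            results.add("");                      // one way to decode nothing
            return results;
        }
        for (Map.Entry<String, Character> e : code.entrySet())
            if (bits.startsWith(e.getKey()))      // try this codeword first
                for (String rest : decode(bits.substring(e.getKey().length()), code))
                    results.add(e.getValue() + rest);
        return results;
    }

    public static void main(String[] args) {
        // the example code above: a = 1, b = 01, c = 101, d = 011
        Map<String, Character> code = Map.of("1", 'a', "01", 'b', "101", 'c', "011", 'd');
        System.out.println(decode("1011", code)); // three parses: aba, ca, ad (in some order)
    }
}
```

Since "1011" has three distinct parses, this code is not uniquely decodable.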


Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another codeword
e.g. a = 0, b = 110, c = 111, d = 10
It can be viewed as a binary tree with message values at the leaves and 0s or 1s on the edges:

          ( )
        0/   \1
        a    ( )
           0/   \1
           d    ( )
              0/   \1
              b     c
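Because of the prefix property, decoding needs only a single left-to-right scan with no backtracking: the first time the buffered bits equal a codeword, that must be the right symbol. A sketch using the example code above:

```java
import java.util.Map;

public class PrefixDecode {
    // Greedy scan: emit a symbol as soon as the buffer matches a codeword.
    // This is unambiguous precisely because no codeword is a prefix of another.
    static String decode(String bits, Map<String, Character> code) {
        StringBuilder out = new StringBuilder();
        StringBuilder buf = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            buf.append(bit);
            Character sym = code.get(buf.toString());
            if (sym != null) {        // buffer is a complete codeword
                out.append(sym);
                buf.setLength(0);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Character> code = Map.of("0", 'a', "110", 'b', "111", 'c', "10", 'd');
        System.out.println(decode("011011110", code)); // "abcd"
    }
}
```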
Average Length
• For a code C with associated probabilities p(c), the average length is defined as

    la(C) = Σc∈C p(c) · l(c)

• We say that a prefix code C is optimal if, for all prefix codes C′, la(C) ≤ la(C′).
• The Huffman code is known to be provably optimal under certain well-defined conditions for data compression.
Huffman Coding algorithm
Main idea: encode high-probability symbols with fewer bits
1. Make a leaf node for each code symbol. Add the generation probability or the frequency of each symbol to the leaf node (arrange the nodes from left to right in descending order of probability).
2. Take the two nodes with the smallest probability and connect them into a new node. Add 1 or 0 to each of the two branches. The probability of the new node is the sum of the probabilities of the two connected nodes.
3. If there is only one node left, the code construction is complete. If not, go back to (2).
Huffman Coding example - 1
Character (or symbol) frequencies
– A : 20% (.20) e.g., A occurs 20 times in a 100
character document, 1000
times in a 5000 character document,
etc.
– B : 9% (.09)
– C : 15% (.15)
– D : 11% (.11)
– E : 40% (.40)
– F : 5% (.05)
• Also works if you use character counts
• Must know frequency of every character in the
document



Huffman Coding example - 2
• Symbols and their associated frequencies.
• Now we combine the two least common symbols
(those with the smallest frequencies) to make a
new symbol string and corresponding frequency.

E A C D B F
.40 .20 .15 .11 .09 .05



Huffman Coding example - 3
• Here's the result of combining symbols once.
• Now repeat until you've combined all the symbols into a single string.

E A C BF D
.40 .20 .15 .14 .11

B F
.09 .05



Huffman Coding example - 4
• Here's the result of combining symbols once more.
• Now repeat until you've combined all the symbols into a single string.

E BFD A C
.40 .25 .20 .15

BF D
.14 .11

B F
.09 .05



Huffman Coding example - 5
• Here's the result of combining symbols once more.
• Now repeat until you've combined all the symbols into a single string.

E AC BFD
.40 .35 .25

A C BF D
.20 .15 .14 .11

B F
.09 .05



Huffman Coding example - 6

BFDAC E
.60 .40

AC BFD
.35 .25

A C BF D
.20 .15 .14 .11

B F
.09 .05



Huffman Coding example - 7
Codes (reading the branch labels from the root down to each leaf):
A: 000
B: 0100
C: 001
D: 011
E: 1
F: 0101

Note: none is a prefix of another.

                 1.00
               0/    \1
          BFDAC        E
           .60         .40
         0/    \1
       AC        BFD
       .35       .25
     0/  \1    0/   \1
     A    C    BF     D
    .20  .15   .14   .11
             0/  \1
             B    F
            .09  .05
Huffman Coding example - 8

Character Code Length Probability


A 000 3 .20
B 0100 4 .09
C 001 3 .15
D 011 3 .11
E 1 1 .40
F 0101 4 .05

Average Code Length:

(3 × 0.20) + (4 × 0.09) + (3 × 0.15) + (3 × 0.11) + (1 × 0.40) + (4 × 0.05)
= 2.34 bits
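As a check on this table, the tree construction (step 2 of the algorithm, realized with a priority queue) and the average code length can be sketched in Java. This is a minimal sketch; Node, buildTree, and fillDepths are illustrative names, not the textbook's API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class Huffman {
    static class Node {
        final char ch; final double p; final Node left, right;
        Node(char ch, double p) { this.ch = ch; this.p = p; left = right = null; }
        Node(Node l, Node r)    { ch = 0; p = l.p + r.p; left = l; right = r; }
        boolean isLeaf()        { return left == null; }
    }

    // Repeatedly merge the two lowest-probability nodes until one tree remains.
    static Node buildTree(char[] syms, double[] probs) {
        PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> Double.compare(a.p, b.p));
        for (int i = 0; i < syms.length; i++) pq.add(new Node(syms[i], probs[i]));
        while (pq.size() > 1) pq.add(new Node(pq.poll(), pq.poll()));
        return pq.poll();
    }

    // Code length of each symbol = depth of its leaf in the tree.
    static void fillDepths(Node x, int d, Map<Character, Integer> len) {
        if (x.isLeaf()) { len.put(x.ch, d); return; }
        fillDepths(x.left, d + 1, len);
        fillDepths(x.right, d + 1, len);
    }

    public static void main(String[] args) {
        char[] syms = {'A', 'B', 'C', 'D', 'E', 'F'};
        double[] probs = {0.20, 0.09, 0.15, 0.11, 0.40, 0.05};
        Map<Character, Integer> len = new HashMap<>();
        fillDepths(buildTree(syms, probs), 0, len);
        double avg = 0;
        for (int i = 0; i < syms.length; i++) avg += probs[i] * len.get(syms[i]);
        System.out.printf("lengths=%s avg=%.2f%n", len, avg); // avg = 2.34
    }
}
```

For this distribution all merge sums are distinct, so every correct Huffman construction yields the same code lengths (E: 1; A, C, D: 3; B, F: 4).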


Huffman Coding notes
• There is no unique Huffman code
  – Assigning 0 and 1 to the branches is arbitrary
  – If there are several nodes with the same probability, it doesn't matter how they are connected
  – However, if the probability in each node is unique and the left node's probability is always larger than the right's, then the code is unique
• Every Huffman code (for a given distribution) has the same average code length!


Some important statements in the Huffman Encoding program

static final int R = 256;

// tabulate frequency counts
int[] freq = new int[R];
for (int i = 0; i < input.length; i++)
    freq[input[i]]++;

// build Huffman tree
Node root = buildTree(freq);

// build code table
String[] st = new String[R];
buildCode(st, root, "");

// make a codeword table from symbols and their encodings
void buildCode(String[] st, Node x, String s) {
    if (!x.isLeaf()) {
        buildCode(st, x.left,  s + '0');
        buildCode(st, x.right, s + '1');
    } else {
        st[x.ch] = s;
    }
}
Lempel-Ziv Compression
• Encodes sequences of symbols with the location of the sequence in a dictionary => a dictionary coder
• Originated by Abraham Lempel and Jacob Ziv, improved by Terry Welch in 1984 (which is why it is named LZW)
• This coding method is lossless
• Algorithm versions: LZ77, LZ78 and LZW


LZW Encoding Algorithm

P is the current word; thus, at the start, it is empty.
LZW Algorithm - Encoding process demo
Contents of the dictionary at the beginning of encoding: (1)A (2)B (3)C

• The column Step indicates the number of the encoding step. Each encoding step is completed when step 3.b of the encoding algorithm is executed.
• The column Pos indicates the current position in the input data.
• The column Dictionary shows the string that has been added to the dictionary and its index number in brackets.
• The column Output shows the code word output in the corresponding encoding step.

The compressed output: (1)(2)(2)(4)(7)(3)
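The demo's table was an image in the original slides, but the process it traces can be sketched in Java. Assuming the demo's input is ABBABABAC — reconstructed here from the output codes and the initial dictionary (1)A (2)B (3)C, so treat it as illustrative:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LZWEncode {
    // The dictionary starts with the single symbols. The longest known prefix P
    // is output as its index, and P + next-char is added as a new entry.
    static List<Integer> encode(String input, String alphabet) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (int i = 0; i < alphabet.length(); i++)
            dict.put(String.valueOf(alphabet.charAt(i)), i + 1);   // (1)A (2)B (3)C ...
        List<Integer> out = new ArrayList<>();
        String p = "";                              // current word; empty at the start
        for (char c : input.toCharArray()) {
            if (dict.containsKey(p + c)) {
                p = p + c;                          // extend the current word
            } else {
                out.add(dict.get(p));               // emit the code for P
                dict.put(p + c, dict.size() + 1);   // new dictionary entry
                p = String.valueOf(c);              // restart from the mismatched char
            }
        }
        if (!p.isEmpty()) out.add(dict.get(p));     // flush the last word
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode("ABBABABAC", "ABC")); // [1, 2, 2, 4, 7, 3]
    }
}
```

The run adds AB(4), BB(5), BA(6), ABA(7), ABAC(8) to the dictionary, matching the compressed output above.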
Run-Length Encoding

Raw:         FFFFOOOOFFFOOFFFFFOOOOOOO   (25 characters)
Run lengths: 4   4   3  2  5     7
Compressed:  4F4O3F2O5F7O                (12 characters)

Compression Rate = (25 − 12)/25 = 52%
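The run-length transformation above can be sketched in a few lines (an illustrative sketch; real RLE formats must also handle digits appearing in the data):

```java
public class RLE {
    // Replace each run of identical characters with <count><char>.
    static String encode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            int j = i;
            while (j < raw.length() && raw.charAt(j) == raw.charAt(i))
                j++;                                   // end of the current run
            out.append(j - i).append(raw.charAt(i));   // run length, then symbol
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("FFFFOOOOFFFOOFFFFFOOOOOOO")); // 4F4O3F2O5F7O
    }
}
```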


Summary
• Abundance of Digitized Text
• The problem of String Matching
• Brute-Force algorithm
• Knuth-Morris-Pratt Algorithm
• Data Compression
• Condition for Data Compression
• Huffman Coding Algorithm
• LZW Algorithm
• Run-length Encoding



Reading at home
Textbook: Data Structures and Algorithms in Java

• 13 Text Processing 573


• 13.1 Abundance of Digitized Text - 574
• 13.2 Pattern-Matching Algorithms - 576
• 13.2.1 Brute Force - 576
• 13.2.3 The Knuth-Morris-Pratt Algorithm - 582
• 13.4 Text Compression and the Greedy Method - 595
• 13.4.1 The Huffman Coding Algorithm - 596



LZW Decoding Algorithm


LZW Algorithm - Decoding process demo
Contents of the dictionary at the beginning of decoding: (1)A (2)B (3)C

• Let's analyze step 4. The previous code word (2) is stored in pW, and cW is (4). The string.cW is output ("AB"). The string.pW ("B") is extended with the first character of the string.cW ("A") and the result ("BA") is added to the dictionary with the index (6).
We come to step 5. The content of cW=(4) is copied to pW, and the new value for cW is read: (7). This entry in the dictionary is empty. Thus, the string.pW ("AB") is extended with its own first character ("A") and the result ("ABA") is stored in the dictionary with the index (7). Since cW is (7) as well, this string is also sent to the output.

The decompressed output: ABBABABAC
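The decoding steps just described, including the step-5 special case of a code that is not yet in the dictionary, can be sketched as follows (an illustrative sketch, not the original slides' code):

```java
import java.util.ArrayList;
import java.util.List;

public class LZWDecode {
    // pW = previous word, cW = current code. When cW is not yet in the
    // dictionary, the entry must be pW extended by its own first character.
    static String decode(List<Integer> codes, String alphabet) {
        List<String> dict = new ArrayList<>();
        dict.add("");                                    // index 0 unused; entries are 1-based
        for (char c : alphabet.toCharArray()) dict.add(String.valueOf(c));
        StringBuilder out = new StringBuilder();
        String pW = dict.get(codes.get(0));
        out.append(pW);
        for (int i = 1; i < codes.size(); i++) {
            int cW = codes.get(i);
            String entry = (cW < dict.size())
                    ? dict.get(cW)                       // normal case: cW already known
                    : pW + pW.charAt(0);                 // special case: cW not in dictionary yet
            out.append(entry);
            dict.add(pW + entry.charAt(0));              // pW + first char of the current string
            pW = entry;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(java.util.List.of(1, 2, 2, 4, 7, 3), "ABC")); // ABBABABAC
    }
}
```

On the demo's code sequence this rebuilds the same dictionary entries AB(4), BB(5), BA(6), ABA(7), ABAC(8) as the encoder, without ever transmitting the dictionary itself.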
