
CHAPTER 1

INTRODUCTION

1.1 ARTIFICIAL NEURAL NETWORKS

Modeling systems and functions using neural network mechanisms is a relatively new and developing science in computer technologies. The area derives its basis from the way neurons interact and function in the natural animal brain, especially the human one. The animal brain is known to operate in a massively parallel manner in recognition, reasoning, reaction and damage recovery. All of these seemingly sophisticated undertakings are now understood to be attributable to aggregations of very simple algorithms of pattern storage and retrieval. Neurons in the brain communicate with one another across special electrochemical links known as synapses. At a time one neuron can be linked to as many as 10,000 others, although links as high as a hundred thousand have been observed. The typical human brain at birth is estimated to house more than one hundred billion neurons; such a combination yields on the order of 10^15 synaptic connections, which gives the brain its power in complex spatio-graphical computation.

Unlike the animal brain, the traditional computer works in serial mode, which is to say instructions are executed only one at a time, assuming a uni-processor machine. The illusion of multitasking and real-time interactivity is simulated by the use of high computation speed and process scheduling. In contrast to the natural brain, which communicates internally over electrochemical links operating at speeds in the millisecond range, a microprocessor executes instructions in the nanosecond range. A modern processor such as the Intel Pentium 4 or AMD Opteron, making use of multiple pipelines and hyper-threading technologies, can perform on the order of billions of floating-point operations per second (GFLOPS).

It is this speed advantage of artificial machines, together with the parallel capability of the natural brain, that motivated the effort to combine the two and enable the performance of complex Artificial Intelligence tasks believed impossible in the past. Although artificial neural networks are currently implemented on traditional, serially operating computers, they still exploit the parallel power of the brain in a simulated manner. Neural networks have seen an explosion of interest over the last few years and are being successfully applied across an extraordinary range of problem domains, in areas as diverse as finance, medicine, engineering, geology and physics. Indeed, anywhere there are problems of prediction, classification or control, neural networks are being introduced. This sweeping success can be attributed to a few key factors.

Power: Neural networks are very sophisticated modeling techniques capable of modeling extremely complex functions. In particular, neural networks are nonlinear. For many years linear modeling was the commonly used technique in most modeling domains, since linear models have well-known optimization strategies; where the linear approximation was not valid (which was frequently the case) the models suffered accordingly. Neural networks also keep in check the curse-of-dimensionality problem that bedevils attempts to model nonlinear functions with large numbers of variables.

Ease of use: Neural networks learn by example. The neural network user gathers representative data and then invokes training algorithms to automatically learn the structure of the data. Although the user does need some heuristic knowledge of how to select and prepare data, how to select an appropriate neural network and how to interpret the results, the level of user knowledge needed to successfully apply neural networks is much lower than would be required for, say, more traditional nonlinear statistical methods.

1.2 THE MULTI-LAYER PERCEPTRON NEURAL NETWORK MODEL

To capture the essence of biological neural systems, an artificial neuron is defined as follows:

It receives a number of inputs (either from original data, or from the output of other neurons in the neural network). Each input comes via a connection that has a strength (or weight); these weights correspond to synaptic efficacy in a biological neuron. Each neuron also has a single threshold value. The weighted sum of the inputs is formed, and the threshold subtracted, to compose the activation of the neuron (also known as the post-synaptic potential, or PSP, of the neuron). The activation signal is passed through an activation function (also known as a transfer function) to produce the output of the neuron.
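As an illustration of the neuron just described, the following minimal Python sketch (with invented names, not taken from the project's code) forms the weighted sum of the inputs, subtracts the threshold and applies an activation function:

def step(x):
    """Step activation: 0 if the activation is below zero, 1 otherwise."""
    return 1 if x >= 0 else 0

def neuron_output(inputs, weights, threshold, activation=step):
    """Weighted sum of the inputs minus the threshold, passed through an activation."""
    activation_level = sum(i * w for i, w in zip(inputs, weights)) - threshold
    return activation(activation_level)

# Example: two inputs with equal weights behave like a logical AND gate.
print(neuron_output([1, 1], [0.6, 0.6], threshold=1.0))  # -> 1
print(neuron_output([1, 0], [0.6, 0.6], threshold=1.0))  # -> 0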

If the step activation function is used (i.e., the neuron's output is 0 if the input is less than zero, and 1 if the input is greater than or equal to 0) then the neuron acts just like the biological neuron described earlier (subtracting the threshold from the weighted sum and comparing with zero is equivalent to comparing the weighted sum to the threshold). In practice, the step function is rarely used in artificial neural networks, as will be discussed. Note also that weights can be negative, which implies that the synapse has an inhibitory rather than excitatory effect on the neuron: inhibitory neurons are found in the brain.

This describes an individual neuron. The next question is: how should neurons be connected together? If a network is to be of any use, there must be inputs (which carry the values of variables of interest in the outside world) and outputs (which form predictions, or control signals). Inputs and outputs correspond to sensory and motor nerves, such as those coming from the eyes and leading to the hands. However, there can also be hidden neurons that play an internal role in the network. The input, hidden and output neurons need to be connected together. A typical feed forward network has neurons arranged in a distinct layered topology. The input layer is not really neural at all: these units simply serve to introduce the values of the input variables. The hidden and output layer neurons are each connected to all of the units in the preceding layer. Again, it is possible to define networks that are partially connected to only some units in the preceding layer; however, for most applications fully-connected networks are better.

Fig. 1.1: A Typical Feed Forward Network

The Multi-Layer Perceptron Neural Network is perhaps the most popular network architecture in use today. The units each perform a biased weighted sum of their inputs and pass this activation level through an activation function to produce their output, and the units are arranged in a layered feed forward topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds (biases) being the free parameters of the model. Such networks can model functions of almost arbitrary complexity, with the number of layers and the number of units in each layer determining the function complexity. Important issues in Multi-Layer Perceptron (MLP) design include specification of the number of hidden layers and the number of units in each layer. The most common activation functions are the logistic and hyperbolic tangent sigmoid functions. The project used the hyperbolic tangent function f(x) = tanh(x) = (e^x - e^-x) / (e^x + e^-x), whose derivative can be written in terms of its output as f'(x) = 1 - f(x)^2.
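A small sketch of one such feed-forward layer using this activation is given below; the weights and biases are arbitrary illustrative values, not the project's trained parameters:

import math

def tanh(x):
    """Hyperbolic tangent activation: f(x) = tanh(x)."""
    return math.tanh(x)

def tanh_derivative(fx):
    """Derivative expressed in terms of the output: f'(x) = 1 - f(x)^2."""
    return 1.0 - fx * fx

def layer_forward(inputs, weights, biases):
    """One feed-forward layer: biased weighted sums passed through tanh."""
    return [tanh(sum(w * x for w, x in zip(neuron_weights, inputs)) - bias)
            for neuron_weights, bias in zip(weights, biases)]

# Toy example: 3 inputs feeding 2 hidden neurons.
hidden = layer_forward([0.5, -1.0, 0.25],
                       weights=[[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]],
                       biases=[0.05, -0.1])
print(hidden)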

1.3 OPTICAL LANGUAGE SYMBOLS

Several languages are characterized by having their own written symbolic representations (characters). These characters represent a specific sound, an accent or, in some cases, whole words. In terms of structure, world language characters manifest various levels of organization, and with respect to this structure there is always a compromise between ease of construction and space conservation. Highly structured alphabets like the Latin set enable easy construction of language elements while forcing the use of additional space. Medium-structure alphabets like the Ethiopic conserve space by representing whole sounds and tones in one symbol, but dictate the necessity of having extended sets of symbols and thus are harder to use and learn. Some alphabets, notably the oriental ones, exhibit so little structuring that whole words are represented by single symbols; such languages are composed of several thousand symbols and are known to need a learning cycle spanning whole lifetimes.

Representing alphabetic symbols in the digital computer has been an issue from the beginning of the computer era. The initial encoding efforts covered the alphanumeric set of the Latin alphabet and some common mathematical and formatting symbols. It was not until the 1960s that a formal encoding standard was prepared and issued by the American standards body now known as ANSI and named the ASCII character set. ASCII is a 7-bit encoding defining 128 characters; its common 8-bit extensions allow a total of 256 unique symbols. In some cases certain combinations of codes were allowed to form 16-bit words to represent extended symbols. The final rendering of the characters on the user display was left to the application program in order to allow various fonts and styles to be implemented. At the time, the 256 encoded characters were thought to suffice for all the needs of computer usage. But with the emergence of computer markets in non-western societies and the internet era, representation of a further set of alphabets in the computer became necessary. Initial attempts to meet this requirement were based on further combinations of ASCII-encoded characters to represent the new symbols. This, however, led to chaos in rendering characters, especially in web pages, since the user had to choose the correct encoding in the browser; a further difficulty was coordinating the usage of code combinations between different implementers to ensure uniqueness.

It was in the 1990s that a lasting solution was proposed by an independent consortium: extend the basic encoding width to 16 bits and accommodate up to 65,536 unique symbols. The new encoding was named Unicode due to its ability to represent all the known symbols in a single encoding. The first 128 codes of the new set coincide with the ASCII set in order to maintain compatibility with existing systems. ASCII characters can therefore be extracted from a 16-bit Unicode word by reading its low-order byte and ignoring the rest, the byte position depending on the endianness (big or little) used.

The Unicode set is managed by the Unicode Consortium, which examines encoding requests, validates symbols and approves the final encoding with a set of unique 16-bit codes. A large portion of the set remains unassigned, waiting to accommodate upcoming requests. Ever since its founding, popular computer hardware and software manufacturers like Microsoft have accepted and supported the Unicode effort.

CHAPTER 2

HISTORY

In 1978 Kurzweil Computer Products began selling a commercial version of its optical character recognition computer program. LexisNexis was one of the first customers and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Kurzweil Computer Products became a subsidiary of Xerox known as ScanSoft, now Nuance Communications.

1992-1996: Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) conducted the authoritative Annual Test of OCR Accuracy for five consecutive years in the mid-90s. ISRI is a research and development unit of the University of Nevada, Las Vegas; it was established in 1990 with funding from the U.S. Department of Energy, and its mission is to foster the improvement of automated technologies for understanding machine-printed documents.

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

OCR systems require calibration to read a specific font; early versions needed to be programmed with images of each character, and worked on one font at a time. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are capable of reproducing formatted output that closely approximates the original scanned page including images, columns and other non-textual components.

CHAPTER 3

ABOUT THE OCR

An emerging technique in this particular application area is the use of Artificial Neural Network implementations, with networks employing specific guides (learning rules) to update the links (weights) between their nodes. Such networks can be fed data from the graphic analysis of the input picture and trained to output characters in one form or another. Specifically, some network models use a set of desired outputs to compare with the actual output and compute an error, which is then used in adjusting their weights. Such learning rules are termed Supervised Learning. One such network with a supervised learning rule is the Multi-Layer Perceptron (MLP) model. It uses the Generalized Delta Learning Rule for adjusting its weights and can be trained for a set of input/desired output values in a number of iterations. The very nature of this particular model is that it will force the output to one of the nearby trained values if it is fed a variation of an input it was not trained for, thus solving the proximity issue. Both concepts are discussed in the introduction part of this report. The project employed the MLP technique mentioned, and excellent results were obtained for a number of widely used font types. The technical approach followed in processing input images, detecting graphic symbols, analyzing and mapping the symbols, and training the network for a set of desired Unicode characters corresponding to the input images is discussed in the subsequent sections. Even though the implementation might have some limitations in terms of functionality and robustness, the researcher is confident that it fully serves the purpose of addressing the desired objectives.

CHAPTER 4

TECHNICAL OVERVIEW

The operations of the network implementation in this project can be summarized by the following steps:

Training phase:
  o Analyze image for characters
  o Convert symbols to pixel matrices
  o Retrieve corresponding desired output character and convert to Unicode
  o Linearize matrix and feed to network
  o Compute output
  o Compare output with desired output Unicode value and compute error
  o Adjust weights accordingly and repeat the process until the preset number of iterations is reached

Testing phase:
  o Analyze image for characters
  o Convert symbols to pixel matrices
  o Compute output
  o Display character representation of the Unicode output

Essential components of the implementation are:


- Formation of the network and weight initialization routine
- Pixel analysis of images for symbol detection
- Loading routines for training input images and corresponding desired output characters in special files named character trainer sets (*.cts)
- Loading and saving routines for the trained network (weight values)
- Character to binary Unicode and vice versa conversion routines
- Error, output and weight calculation routines

Fig. 4.1: The Project MLP Network

4.1 NETWORK FORMATION

The MLP network implemented for the purpose of this project is composed of 3 layers: one input, one hidden and one output. The input layer consists of 150 neurons which receive pixel binary data from a 10x15 symbol pixel matrix. The size of this matrix was decided taking into consideration the average height and width of a character image that can be mapped without introducing any significant pixel noise. The hidden layer consists of 250 neurons, a number decided on the basis of optimal results obtained by trial and error. The output layer is composed of 16 neurons corresponding to the 16 bits of Unicode encoding. To initialize the weights, a random function was used to assign an initial random number which lies between two preset integers determined by a parameter named weight_bias. The weight bias was selected by trial-and-error observation to correspond to average weights for quick convergence.
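The following sketch illustrates that topology and its initialization; interpreting weight_bias as the bound of a symmetric uniform range is an assumption on our part, since the report only states that initial weights lie between two preset integers.

import random

# 150-250-16 topology as described in section 4.1.
INPUT_NEURONS, HIDDEN_NEURONS, OUTPUT_NEURONS = 150, 250, 16
WEIGHT_BIAS = 30  # value reported in section 4.4 (assumed to bound a uniform range)

def init_layer(n_inputs, n_neurons, bound=WEIGHT_BIAS):
    """Weight matrix (n_neurons x n_inputs) with uniformly random entries."""
    return [[random.uniform(-bound, bound) for _ in range(n_inputs)]
            for _ in range(n_neurons)]

hidden_weights = init_layer(INPUT_NEURONS, HIDDEN_NEURONS)
output_weights = init_layer(HIDDEN_NEURONS, OUTPUT_NEURONS)
print(len(hidden_weights), len(hidden_weights[0]))  # 250 150
print(len(output_weights), len(output_weights[0]))  # 16 250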

4.2 SYMBOL IMAGE DETECTION

The process of image analysis to detect character symbols by examining pixels is the core part of input set preparation in both the training and testing phases. Symbolic extents are recognized in an input image file based on the color value of individual pixels, which for the limits of this project is assumed to be either black (ARGB 255,0,0,0) or white (ARGB 255,255,255,255). The input images are assumed to be in bitmap form of any resolution, which can be mapped to an internal bitmap object in the Microsoft Visual Studio environment. The procedure also assumes the input image is composed only of characters; any other type of bounding object, such as a border line, is not taken into consideration.


The procedure for analyzing images to detect characters is listed in the following algorithms:

4.2.1 Determining character lines

Enumeration of character lines in a character image (page) is essential in delimiting the bounds within which the detection can proceed. Thus detecting the next character in an image does not necessarily involve scanning the whole image all over again.

Algorithm:
1. Start at the first x and first y pixel of the image, pixel(0,0); set the number of lines to 0.
2. Scan up to the width of the image on the same y-component of the image.
   a. If a black pixel is detected, register y as the top of the first line.
   b. If not, continue to the next pixel.
   c. If no black pixel is found up to the width, increment y and reset x to scan the next horizontal line.
3. Start at the top of the line found and the first x-component, pixel(0, line_top).
4. Scan up to the width of the image on the same y-component of the image.
   a. If no black pixel is detected, register y-1 as the bottom of the first line and increment the number of lines.
   b. If a black pixel is detected, increment y and reset x to scan the next horizontal line.
5. Start below the bottom of the last line found and repeat steps 1-4 to detect subsequent lines.
6. If the bottom of the image (image height) is reached, stop.

4.2.2 Detecting individual symbols

Detection of individual symbols involves scanning character lines for orthogonally separable images composed of black pixels.

Algorithm:
1. Start at the top of the first character line and the first x-component.
2. Scan up to the image width on the same y-component.
   a. If a black pixel is detected, register y as the top of the character.
   b. If not, continue to the next pixel.
3. Start at the top of the character found and the first x-component, pixel(0, character_top).
4. Scan up to the line bottom on the same x-component.
   a. If a black pixel is found, register x as the left of the symbol.
   b. If not, continue to the next pixel.
   c. If no black pixels are found, increment x and reset y to scan the next vertical line.
5. Start at the left of the symbol found and the top of the current line, pixel(character_left, line_top).
6. Scan up to the line bottom on the same x-component.
   a. If no black pixels are found, register x-1 as the right of the symbol.
   b. If a black pixel is found, increment x and reset y to scan the next vertical line.
7. Start at the bottom of the current line and the left of the symbol, pixel(character_left, line_bottom).
8. Scan up to the right of the character on the same y-component.
   a. If a black pixel is found, register y as the bottom of the character.
   b. If no black pixels are found, decrement y and reset x to scan the next horizontal line.

Fig 4.2: Line and Character boundary detection

From the procedure followed and the above figure, it is obvious that the detected character bound might not be the actual bound for the character in question. This is an issue that arises from the height and bottom alignment irregularity that exists among printed alphabetic symbols. Thus a line top does not necessarily mean the top of all characters, and a line bottom might not mean the bottom of all characters either. Hence a confirmation of the top and bottom of the character is needed. An optional confirmation algorithm implemented in the project is:
A. Start at the top of the current line and the left of the character.
B. Scan up to the right of the character.
   1. If a black pixel is detected, register y as the confirmed top.
   2. If not, continue to the next pixel.
   3. If no black pixels are found, increment y and reset x to scan the next horizontal line.

Fig 4.3: Confirmation of Character boundaries
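The line-detection procedure of section 4.2.1 can be summarized by the following sketch, which works on a binary image (1 = black pixel) and is a simplified rendering of the algorithm rather than the project's actual routine:

def detect_lines(image):
    """Return (top, bottom) row indices for each character line in the image."""
    height = len(image)
    lines, y = [], 0
    while y < height:
        # Skip empty rows until a row containing a black pixel marks a line top.
        while y < height and not any(image[y]):
            y += 1
        if y >= height:
            break
        top = y
        # Advance while rows still contain black pixels; the previous row is the bottom.
        while y < height and any(image[y]):
            y += 1
        lines.append((top, y - 1))
    return lines

# Tiny example: two one-row "lines" separated by blank rows.
img = [[0, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [1, 0, 1]]
print(detect_lines(img))  # [(1, 1), (3, 3)]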


4.3 SYMBOL IMAGE MATRIX MAPPING

The next step is to map the symbol image into a corresponding two-dimensional binary matrix. An important issue to consider here is the size of the matrix. If all the pixels of the symbol are mapped into the matrix, one would certainly acquire all the distinguishing pixel features of the symbol and minimize overlap with other symbols. However, this strategy would imply maintaining and processing a very large matrix (up to 15,000 elements for a 100x150 pixel image). Hence a reasonable tradeoff is needed in order to minimize processing time without significantly affecting the separability of the patterns. The project employed a sampling strategy which maps the symbol image into a 10x15 binary matrix with only 150 elements. Since the height and width of individual images vary, an adaptive sampling algorithm was implemented:

Algorithm:
a. For the width (initially 20 elements wide):
   1. Map the first (0,y) and last (width,y) pixel components directly to the first (0,y) and last (20,y) elements of the matrix.
   2. Map the middle pixel component (width/2, y) to the 10th matrix element.
   3. Subdivide further divisions and map accordingly to the matrix.
b. For the height (initially 30 elements high):
   1. Map the first (x,0) and last (x,height) pixel components directly to the first (x,0) and last (x,30) elements of the matrix.
   2. Map the middle pixel component (x, height/2) to the 15th matrix element.
   3. Subdivide further divisions and map accordingly to the matrix.
c. Further reduce the matrix to 10x15 by sampling by a factor of 2 on both the width and the height.

Fig. 4.4: Mapping symbol images onto a binary matrix


In order to feed the matrix data to the network (whose input is one-dimensional), the matrix must first be linearized into a single dimension. This is accomplished with a simple routine implementing the following algorithm:
1. Start with the first matrix element (0,0).
2. Increment x, keeping y constant, up to the matrix width.
   a. Map each element to an element of a linear array (incrementing the array index).
   b. If the matrix width is reached, reset x and increment y.
3. Repeat until the matrix height is reached, i.e. (x,y) = (width, height).

The linear array is thus our input vector for the MLP network. In the training phase all such symbols from the trainer set image file are mapped into their own linear arrays and as a whole constitute the input space. The trainer set also contains a file of character strings that directly correspond to the input symbol images, to serve as the desired output of the training. A sample mini trainer set is shown below:

Fig. 4.5: Input Image and Desired output text files for the sample Mini-Tahoma trainer set
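A simplified sketch of the mapping and linearization steps is shown below; it uses plain nearest-neighbour sampling as a stand-in for the subdivision scheme described above, so it approximates rather than reproduces the project's routine:

TARGET_W, TARGET_H = 10, 15

def map_to_matrix(symbol, target_w=TARGET_W, target_h=TARGET_H):
    """Sample the symbol bitmap (list of rows of 0/1) onto a target_h x target_w matrix."""
    src_h, src_w = len(symbol), len(symbol[0])
    return [[symbol[(y * src_h) // target_h][(x * src_w) // target_w]
             for x in range(target_w)]
            for y in range(target_h)]

def linearize(matrix):
    """Row-major flattening of the matrix into a single input vector."""
    return [value for row in matrix for value in row]

# A 100x150-pixel symbol image becomes a 150-element vector for the network.
symbol_image = [[0] * 100 for _ in range(150)]
vector = linearize(map_to_matrix(symbol_image))
print(len(vector))  # 150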

4.4 TRAINING

Once the network has been initialized and the training input space prepared, the network is ready to be trained. Some issues that need to be addressed upon training the network are:

- How chaotic is the input space? A chaotic input varies randomly and in extreme range without any predictable flow among its members.
- How complex are the patterns for which we train the network? Complex patterns are usually characterized by feature overlap and high data size.
- What should be used for the values of:
   o Learning rate
   o Sigmoid slope
   o Weight bias
- How many iterations (epochs) are needed to train the network for a given number of input sets?
- What error threshold value must be used to compare against in order to prematurely stop iterations if the need arises?

Alphabetic optical symbols are one of the most chaotic input sets in pattern recognition studies. This is due to the unpredictable nature of their pictorial representation when viewed in the sequence of their order. For instance, the consecutive Latin characters A and B have little similarity in features when represented in their pictorial symbolic form. The figure below demonstrates the point with chaotic and non-chaotic sequences from the Latin and a fictitious character set:

Fig. 4.6: Example of chaotic and non-chaotic symbol sequences

Feature overlap between patterns and the large amount of data required make this one of the most complex classes of input space in pattern recognition. Apart from the known issues mentioned, the other numeric parameters of the network are determined at run time. They also vary greatly from one implementation to another according to the number of input symbols fed and the network topology. For the purpose of this project the parameters used are:

- Learning rate = 150
- Sigmoid slope = 0.014
- Weight bias = 30 (determined by trial and error)
- Number of epochs = 300-600 (depending on the complexity of the font types)
- Mean error threshold value = 0.0002 (determined by trial and error)

Algorithm: The training routine implemented the following basic algorithm:
1. Form the network according to the specified topology parameters.
2. Initialize the weights with random values within the specified weight_bias value.
3. Load the trainer set files (both input image and desired output text).
4. Analyze the input image and map all detected symbols into linear arrays.
5. Read the desired output text from file and convert each character to a binary Unicode value to store separately.
6. For each character:
   a. Calculate the output of the feed forward network.
   b. Compare with the desired output corresponding to the symbol and compute the error.
   c. Back-propagate the error across each link to adjust the weights.
7. Move to the next character and repeat step 6 until all characters are visited.
8. Compute the average error of all characters.
9. Repeat steps 6 to 8 until the specified number of epochs is reached.
   a. Is the error threshold reached? If so, abort iteration.
   b. If not, continue iterating.
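The training loop can be sketched as follows. NumPy is used for brevity, thresholds are omitted, and the update rule is the standard generalized delta rule for a one-hidden-layer tanh network; this is an interpretation of the algorithm above, not the project's code.

import numpy as np

LEARNING_RATE, SLOPE, EPOCHS, ERROR_THRESHOLD = 150, 0.014, 600, 0.0002

def forward(x, w_hidden, w_output):
    """Feed-forward pass: x is the 150-element vector, weights are (250,150) and (16,250)."""
    hidden = np.tanh(SLOPE * (w_hidden @ x))       # hidden activations
    output = np.tanh(SLOPE * (w_output @ hidden))  # 16 output activations
    return hidden, output

def train(samples, w_hidden, w_output):
    """samples: list of (input_vector, desired_16_bit_target) pairs as NumPy arrays."""
    for epoch in range(EPOCHS):
        errors = []
        for x, target in samples:
            hidden, output = forward(x, w_hidden, w_output)
            error = target - output
            errors.append(np.mean(error ** 2))
            # Back-propagation: deltas use the tanh derivative 1 - f(x)^2.
            delta_out = error * SLOPE * (1.0 - output ** 2)
            delta_hid = (w_output.T @ delta_out) * SLOPE * (1.0 - hidden ** 2)
            w_output += LEARNING_RATE * np.outer(delta_out, hidden)
            w_hidden += LEARNING_RATE * np.outer(delta_hid, x)
        if np.mean(errors) < ERROR_THRESHOLD:  # premature stop on the error threshold
            break
    return w_hidden, w_output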

Flowchart: The flowchart representation of the algorithm is illustrated below

Fig. 4.7: Flowchart representation of the algorithm

4.5 TESTING

The testing phase of the implementation is simple and straightforward. Since the program is coded in modular parts, the same routines that were used to load, analyze and compute network parameters of input vectors in the training phase can be reused in the testing phase as well. The basic steps in testing input images for characters can be summarized as follows: Algorithm:

- Load the image file.
- Analyze the image for character lines.
- For each character line, detect consecutive character symbols.
   o Analyze and process the symbol image to map it into an input vector.
   o Feed the input vector to the network and compute the output.
   o Convert the Unicode binary output to the corresponding character and render it to a text box.
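The final conversion step can be sketched as below, assuming the 16 output neurons are ordered from the most to the least significant bit of the Unicode value (an ordering the report does not state explicitly):

def outputs_to_char(outputs, threshold=0.5):
    """outputs: 16 activations ordered from most to least significant bit."""
    code = 0
    for activation in outputs:
        code = (code << 1) | (1 if activation >= threshold else 0)
    return chr(code)

# Example: the bit pattern of 0x0041 should decode to 'A'.
bits = [0.1] * 9 + [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9]
print(outputs_to_char(bits))  # A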

Flowchart:

Fig. 4.8: Flowchart of testing

4.6 RESULTS AND DISCUSSION

The network has been trained and tested for a number of widely used font types in the Latin alphabet. Since the implementation of the software is open and the program code is scalable, the inclusion of more fonts from any typed language alphabet is straightforward. The necessary steps are preparing the sequence of input symbol images in a single image file (*.bmp [bitmap] extension), typing the corresponding characters in a text file (*.cts [character trainer set] extension) and saving the two in the same folder (both must have the same file name except for their extensions). The application provides a file opener dialog for the user to locate the *.cts text file and loads the corresponding image file by itself.

Although the results listed in the subsequent tables are from a training/testing process of symbol images created with a 72 pt font size, the use of any other size is also straightforward by preparing the input/desired output set as explained. The application can be operated with symbol images as small as a 20 pt font size.

Note: Due to the random initialization of weight values, the results listed represent only typical network performance and exact reproduction might not be obtained in other trials.

Table 4.1: Results for variation in number of Epochs
(Number of characters = 90, Learning rate = 150, Sigmoid slope = 0.014; "Wrong" = number of wrongly recognized characters)

                      300 Epochs            600 Epochs            800 Epochs
Font Type             Wrong    % Error      Wrong    % Error      Wrong    % Error
Latin Arial           4        4.44         3        3.33         1        1.11
Latin Tahoma          1        1.11         0        0            0        0
Latin Times Roman     0        0            0        0            1        1.11

Table 4.2: Results for variation in number of Input characters
(Number of Epochs = 100, Learning rate = 150, Sigmoid slope = 0.014)

                      20 Characters         50 Characters         90 Characters
Font Type             Wrong    % Error      Wrong    % Error      Wrong    % Error
Latin Arial           0        0            6        12           11       12.22
Latin Tahoma          0        0            3        6            8        8.89
Latin Times Roman     0        0            2        4            9        10

Table 4.3: Results for variation in Learning rate parameter
(Number of characters = 90, Number of Epochs = 600, Sigmoid slope = 0.014)

                      Learning rate 50      Learning rate 100     Learning rate 120
Font Type             Wrong    % Error      Wrong    % Error      Wrong    % Error
Latin Arial           82       91.11        18       20           3        3.33
Latin Tahoma          56       62.22        11       12.22        1        1.11
Latin Times Roman     77       85.56        15       16.67        0        0


4.7 PERFORMANCE OBSERVATION

4.7.1 Influence of parameter variation

i. Increasing the number of iterations generally has a positive proportional relation to the performance of the network. However, in certain cases further increasing the number of epochs has the adverse effect of introducing more wrong recognitions. This can partially be attributed to the high value of the learning rate parameter: as the network approaches its optimal limits, further weight updates result in bypassing the optimal state. With further iterations the network will try to swing back to the desired state and back again continuously, with a good chance of missing the optimal state at the final epoch. This phenomenon is known as over-learning.

ii. The size of the input set is another direct factor influencing performance. Naturally, the more input symbols the network is required to be trained for, the more susceptible it is to error. Complex and large input sets usually require a large network topology with more iterations. For the above maximum set of 90 symbols the optimal topology reached was one hidden layer of 250 neurons.

iii. Variation of the learning rate parameter also affects the network performance for a given limit of iterations. The lower the value of this parameter, the smaller the updates the network applies to its weights. This intuitively implies that it will be less likely to face the over-learning difficulty discussed above, since it will be updating its links slowly and in a more refined manner; unfortunately, it also implies that more iterations are required to reach the optimal state. Thus a trade-off is needed in order to optimize the overall network performance. The optimal value decided upon for the learning rate parameter is 150.

4.7.2 Pictorial representation overlap anomalies

One can easily observe from the results listed that the entry for the Latin Arial font type has, in general, the lowest performance among its peers. This has been discovered to arise from an overlap in the pictorial representation of two of its symbols, namely the upper case letter I (I in Times Roman) and the lower case letter l (l in Times Roman), which are rendered identically in Arial.

Fig. 4.9: Matrix analysis for both lower case l (006Ch) and upper case I (0049h) of the Arial font.


This presents a logically non-separable recognition task to the network, since the training set instructs it to output one state for a symbol image at one time and another state for the same image at another time. This disturbs not only the output vectors of the two characters but also nearby states, as can be seen in the number of wrong characters. The best state the network can reach in such a case is to train itself to output one vector for both inputs, forcing a wrong output for one of the two characters. Even this optimal state can be reached only with a larger number of iterations, which for this implementation was 800. At such a high number of epochs the other symbol sets tend to drift into the over-learning states discussed above.

4.7.3 Orthogonal inseparability

Some symbol sequences are orthogonally inseparable, meaning that no vertical line can pass between the two symbols without crossing bitmap areas of either. Such images could not be processed for individual symbols within the limits of the project, since doing so requires more complex image processing algorithms. Some cases are presented below:

Fig. 4.10: Some orthogonally inseparable symbolic combinations in the Latin alphabet


CHAPTER 5

BENEFITS AND APPLICATIONS

5.1 BENEFITS

- Saves data entry costs: automatic recognition by OCR/ICR/OMR/barcode engines ensures lower manpower costs for data entry and validation.
- Lower licensing cost: since the product enables distributed capture, the licensing cost for the OCR/ICR engine is much lower. For instance, five workstations may be used for scanning and indexing while only one OCR/ICR license may be required.
- Export of the recognized data in XML or any other standard format for integration with any application or database.

5.2 APPLICATIONS

- Industries and institutions in which control of large amounts of paperwork is critical
- Banking, credit card and insurance industries
- Libraries and archives, for conservation and preservation of vulnerable documents and for the provision of access to source documents


OCR fonts are used for several purposes where automated systems need a standard character shape defined to properly read text without the use of barcodes. Some examples of OCR font implementations include bank checks, passports, serial labels and postal mail.


CHAPTER 6

SOFTWARE ARCHITECTURE

6.1 ARCHITECTURE

The overall architecture of the OCR consists of three main phases: Segmentation, Recognition and Post-processing. We explain each of these phases below.

6.1.1 Segmentation

Segmentation in the context of character recognition can be defined as the process of extracting from the preprocessed image the smallest possible character units which are suitable for recognition. It consists of the following steps:

Locate the Header Line: An image is stored in the computer in the form of a two-dimensional array, where a black pixel is represented by 1 and a white pixel by 0. The array is scanned row by row and the number of black pixels is recorded for each row, resulting in a horizontal histogram. The row with the maximum number of black pixels is the position of the header line, called the Shirorekha (a sketch of this step follows at the end of this subsection). This position is identified as hLinePos.

Separate the Character boxes: Characters are present below the header line. To identify the character boxes, we make a vertical histogram of the image starting from the hLinePos down to the boundary of the word, i.e. the row where there are no black pixels. The boundaries for characters are identified as the columns that have no black pixels.

Separate the upper modifier symbols: To identify the upper modifier symbols, we make a vertical histogram of the image starting from the top row of the image to the hLinePos.

Separate the lower modifiers: We did not attempt lower modifier separation due to lack of time.
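A minimal sketch of the header-line location step described above (assuming a 0/1 pixel array) is:

def find_header_line(image):
    """image: 2-D list of 0/1 pixels; returns the row index of the Shirorekha (hLinePos)."""
    histogram = [sum(row) for row in image]  # black pixels per row
    return max(range(len(histogram)), key=histogram.__getitem__)

word = [[0, 1, 0, 0],
        [1, 1, 1, 1],   # header line (Shirorekha)
        [0, 1, 0, 1]]
print(find_header_line(word))  # 1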

6.1.2 Feature Extraction Feature extraction refers to the process of characterizing the images generated from the segmentation procedure based on certain specific parameters. We did not explore this further.

6.1.3 Classification

Classification involves labeling each of the symbols as one of the known characters, based on the characteristics of that symbol. Thus, each character image is mapped to a textual representation.

6.1.4 Post-processing

The output of the classification process goes through an error detection and correction phase. This phase consists of the following three steps:
1) Select an appropriate partition of the dictionary based on the characteristics of the input word, and select the candidate words from that partition to match the input word against.
2) Match the input word with the selected words.
3) If the input word is found in the dictionary, no more processing is done and the word is assumed to be correct. If the word is not found, there are two options available: we can generate aliases for the input word, or restrict ourselves to an exact match.
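A rough sketch of this dictionary lookup with a fallback to near matches is given below; partitioning by word length and using difflib to generate aliases are illustrative choices, not necessarily those of the report:

import difflib

def correct(word, dictionary):
    """Return the word if it is in the chosen dictionary partition, else the closest alias."""
    partition = [w for w in dictionary if len(w) == len(word)]  # select a partition
    if word in partition:                                       # exact match: accept as-is
        return word
    candidates = difflib.get_close_matches(word, partition, n=1)
    return candidates[0] if candidates else word                # alias, or keep the input

print(correct("recognltion", ["recognition", "segmentation"]))  # recognition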


Diagrammatic presentation of the stages of OCR


Fig 6.1: Stages of OCR

6.2 SYSTEM ANALYSIS

System analysis is, by definition, a process of systematic investigation for the purpose of gathering data, interpreting the facts, diagnosing the problem, and using this information either to build a completely new system or to recommend improvements to the existing system. A satisfactory system analysis involves examining a business situation with the intent of improving it through better methods and procedures. At its core, the analysis phase defines the requirements of the system and the problems the user is trying to solve, irrespective of how the requirements will be accomplished.

6.2.1 Structured Analysis

Fig. 6.2: Structured Analysis


CHAPTER 7

FEASIBILITY STUDY

A feasibility study determines whether the proposed solution is feasible based on the priorities of the requirements of the organization. A feasibility study culminates in a feasibility report that recommends a solution; it helps to evaluate the cost-effectiveness of the proposed system. During this phase, various solutions to the existing problems were examined. For each of these solutions, the costs and benefits were the major criteria to be examined before deciding on any of the proposed systems. These solutions would provide coverage of the following:
a) Specification of the information to be made available by the system.
b) A clear-cut description of what tasks will be done manually and what needs to be handled by the automated system.
c) Specifications of the new computing equipment needed.
A system that passes the feasibility tests is considered a feasible system. Let us look at some feasibility tests for this project.


7.1 TECHNICAL FEASIBILITY

Technical feasibility relates to the software and equipment specified in the design for implementing a new system. It is a study of function, performance and constraints that may affect the ability to achieve an acceptable system. During technical analysis, the analyst evaluates the technical merits of the system, at the same time collecting additional information about performance, reliability, maintainability and productivity. Technical feasibility is frequently the most difficult area to assess.

7.1.1 Assessing System Performance
This involves ensuring that the system responds to user queries and is efficient, reliable, accurate and easy to use. Since we have an excellent network setup and a good server configuration (80 GB hard disk and 512 MB RAM), the performance requirement is satisfied. After conducting the technical analysis we found that the project fulfils all the technical prerequisites, and the network environments, where necessary, are also adaptable to the project.

7.2 ECONOMIC FEASIBILITY

This feasibility has great importance, as it can outweigh the other feasibilities because costs affect organizational decisions. The concept of economic feasibility deals with the fact that a system that is developed and installed must be profitable for the organization. The cost of conducting a full system investigation, the cost of hardware and software, and the benefits in the form of reduced expenditure are all discussed during the economic feasibility study.

7.2.1 Cost of No Change
The cost will be in terms of utilization of resources leading to the cost to the company. Since the cost of the project is our effort, which is obviously less than the long-term gain for the company, the project should be undertaken.

7.2.2 Cost-Benefit Analysis
A cost-benefit analysis is necessary to determine economic feasibility. The primary objective of the cost-benefit analysis is to find out whether it is economically worthwhile to invest in the project. If the returns on the investment are good, then the project is considered economically worthwhile. Cost-benefit analysis is performed by first listing all the costs associated with the project, which consist of both direct and indirect costs.

7.3 OPERATIONAL FEASIBILITY

Operational feasibility is a measure of how people feel about the system. Operational feasibility criteria measure the urgency of the problem or the acceptability of a solution. Operational feasibility depends upon determining the human resources for the project; it refers to projecting whether the system will operate and be used once it is installed. If the ultimate users are comfortable with the present system and they see no problem with its continuance, then resistance to its operation will be zero.

Our project is operationally feasible since there is no need for special training of staff members, and whatever little instruction on this system is required can be given quite easily and quickly. This project is being developed keeping in mind general users who may have very little knowledge of computer operation but can easily access their required database and other related information. Redundancies can be decreased to a large extent as the system will be fully automated.


CHAPTER 8

SOFTWARE ENGINEERING PARADIGM APPLIED


Software Engineering is a planned and systematic approach to the development of software. It is a discipline that consists of methods, tools and techniques used for developing and maintaining software. To solve actual problems in an industry setting, a software engineer or team of engineers must incorporate a development strategy that encompasses the process, methods and tool layers and generic phases. This strategy is often referred to as a process model or Software Engineering paradigm. For developing a software product, user requirements are identified and the design is made based on these requirements. The design is then translated into a machine executable language that can be interpreted by a computer. Finally, the software product is tested and delivered to the customer.


Fig. 8.1: Spiral Model

The Spiral model incorporates the best characteristics of both the waterfall and prototyping models. In addition, the Spiral model contains a new component called risk analysis, which is not present in the waterfall and prototyping models. In the Spiral model, the basic structure of the software product is developed first. After the basic structure is developed, new features such as the user interface and data administration are added to the existing software product. This behaviour of the Spiral model is similar to a spiral in which the circles increase in diameter: each circle represents a more complete version of the software product.


CHAPTER 9

DEVELOPMENT REQUIREMENTS

9.1 SOFTWARE REQUIREMENTS

During the solution development the following software was used:
- Microsoft Visual Studio
- MS-SQL Server 2005

9.2 HARDWARE REQUIREMENTS

During the solution development the following hardware specifications were used:
- 2.4 GHz P-IV processor
- Minimum 256 MB RAM

9.3 INPUT REQUIREMENTS

The OCR system needs a scanned textual image as input.


CHAPTER 10

SOFTWARE REQUIREMENTS SPECIFICATIONS

Fig. 10.1: OCR Software requirement specifications

A key feature in the development of any software is the analysis of the requirements that must be satisfied by the software. A thorough understanding of these requirements is essential for the successful development and implementation of software. The software requirements specification is produced at the culmination of the analysis task. The function and performance allocated to software as part of system engineering are refined by establishing a complete information description, a detailed functional and behavioral description, an indication of performance requirements and design constraints, and appropriate validation criteria. The Software Requirements Specification basically states the goals and objectives of the software and provides a detailed description of the functionality that the software must perform.


CHAPTER 11

SYSTEM DESIGN PHASE

Design is the activity of translating the specifications generated in the software requirements analysis into a specific design. It involves designing a system that satisfies the customer requirements. In order to transform requirements into a working system, we must satisfy both the customer and the system builders on the development team: the customer must understand what the system is to do, while the system builders must understand how the system is to work. For this reason, system design is really a two-part process. First, we produce a system specification that tells the customer exactly what the system will do. This specification is sometimes called a conceptual system design.

11.1 TECHNICAL DESIGN

The technical design explains the system to those hardware and software experts who will implement it. It describes the hardware configuration, the software needs, the communication interfaces, the inputs and outputs of the system and anything else that translates the requirements into a solution to the customer's problem. The design description is a technical picture of the system specification. Thus we include the following items in the technical design:


- The system architecture: a description of the major hardware components and their functions.
- The system software structure: the hierarchy and function of the software components.
- The data structure and flow through the system.

11.2 DESIGN APPROACH

A modular approach has been taken. Design is the determination of the modules and inter-modular interfaces that satisfy a specified set of requirements. A design module is a functional entity with a well-defined set of inputs and outputs. Each module can therefore be viewed as a component of the whole system, just as each room is a component of a house. A module is well defined if all the inputs to the module are essential to its function and all outputs are produced by some action of the module. Thus if one input were left out, the module would not perform its full function. There are no unnecessary inputs; every input is used in generating the output. Finally, the module is well defined only when each output is a result of the functioning of the module and when no input becomes an output without having been transformed in some way by the module.

11.2.1 Modularity

Modularity is a characteristic of good system design. High-level modules give us the opportunity to view the problem as a whole and hide details that may distract us. By being able to reach down to a lower level for more detail when we want to, modularity provides the flexibility to trace the flow of data through the system and to target pockets of complexity. The modules are interrelated with each other yet self-sufficient in themselves, and help in running the system in an efficient and complete manner.

11.2.2 Level of Abstraction

Abstraction and information hiding allow us to examine the way in which modules are related to one another in the overall design. The degree to which the modules are independent of one another is a measure of how good the system design is. Independence is desirable for two reasons. First, it is easier to understand how a module works if its function is not tied to others. Second, it is much easier to modify a module if it is independent of others. Often a change in requirements or in a design decision means that certain modules must be modified; each change affects data or function or both. If the modules depend heavily on each other, a change to one module may mean changes to every module affected by that change.

11.2.3 Coupling

Coupling is a measure of how much modules depend on each other. Two modules are highly coupled if there is a great deal of dependence between them; uncoupled modules have no interconnection at all. Coupling depends on several things:
- The references made from one module to another.
- The amount of data passed from one module to another.
- The amount of control one module has over another.
- The degree of complexity in the interface between one module and another.

Thus coupling represents a range of dependence, from complete dependence to complete independence. We want to minimize the dependence among modules for several reasons. First, if an element is affected by a system action, we always want to know which module causes that effect at a given time. Second, modularity helps in tracking the cause of system errors: if an error occurs during the performance of a particular function, independence of modules allows us to isolate the defective module more easily.

11.2.4 Cohesion

Cohesion refers to the internal glue with which a module is constructed. The more cohesive a module, the more related the internal parts of the module are to each other and to the functionality of the module. In other words, a module is cohesive if all elements of the module are directed towards, and essential for, performing the same function. For example, the various triggers written for the subscription entry form all perform the functionality of that module, such as querying the old data, saving the new data and updating records, so it is a highly cohesive module.


11.2.5 Scope of Control and Effect

Finally, we want to be sure that the modules in our design do not affect other modules over which they have no control. The modules controlled by a given module are collectively referred to as its scope of control, and the modules it affects as its scope of effect. No module should be in the scope of effect if it is not in the scope of control. Thus, in order to make the system easier to construct, test, correct and maintain, our goals have been:
- Low coupling of modules
- Highly cohesive modules
- Scope of effect of a module limited to its scope of control

It was decided to store data in different tables in SQL Server. The tables were normalized and the various modules were identified so as to store data properly; the designed reports were created and on-screen queries were written. A menu-driven (user-friendly) package has been designed containing understandable and presentable menus. Table structures are enclosed, and input and output details are enclosed herewith. The specifications in our design include:
- User interface
- Design screens and their description
- Entity Relationship Diagrams


CHAPTER 12

MODULE SPECIFICATIONS
0. MAIN
   Input: none
   Output: none
   Subordinates:
   - Choose a file
   - Loading a file
   - Line Segmentation
   - Edit line segmentation
   - Word segmentation
   - Edit word segmentation
   - Clear

1. CHOOSE_FILE
   Input event: open button click
   Output: a file is chosen and the text field is set
   Subordinates: none
   Purpose: selects a file from the given menu

2. LOAD_FILE
   Input event: file is chosen
   Output: shows the image in the panel
   Subordinates: none
   Purpose: shows the selected image file

3. LINE_SEGMENTATION
   Input event: line button click
   Output: displays the line segmentation
   Subordinates: image scan
   Purpose: performs the line segmentation of the image

4. EDIT_LINE_SEGMENTATION
   Input event: click of the mouse in white space or on some line
   Output: displays the edited line segmentation and stores the new array
   Subordinates: none
   Purpose: changes the drawn line according to the user

5. WORD_SEGMENTATION
   Input event: word button click
   Output: displays the word segmentation
   Subordinates: word segmentor
   Purpose: performs the word segmentation of the image

6. EDIT_WORD_SEGMENTATION
   Input event: click of the mouse in white space or on some line
   Output: displays the edited word segmentation and stores the new array
   Subordinates: none
   Purpose: changes the drawn line according to the user

7. CLEAR
   Input event: click on the clear button
   Subordinates: none
   Purpose: clears the panel for loading a new image

The design is flexible and accommodates other expected needs of the customer, and suitable changes can be made at a later date. After thoroughly examining the requirements, only a design that can meet the current and probable future desires of the customer has been suggested.


CHAPTER 13

TESTING (TESTING TECHNIQUES AND TESTING STRATEGIES)

Fig. 13.1: Testing Procedures

All software intended for public consumption should receive some level of testing. Without testing, there is no assurance that the software will behave as expected, and the results in a public environment can be truly embarrassing. Testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding. Testing is done throughout system development at various stages; if this is not done, a poorly tested system can fail after installation. Testing is a very important part of the SDLC and takes approximately 50% of the time.

The first step in testing is developing a test plan based on the product requirements. The test plan is usually a formal document that ensures that the product meets the following standards:
- Is thoroughly tested: untested code adds an unknown element to the product and increases the risk of product failure.
- Meets product requirements: to meet customer needs, the product must provide the features and behavior described in the product specification.
- Does not contain defects: features must work within established quality standards, and those standards should be clearly stated within the test plan.

13.1 TESTING TECHNIQUES

13.1.1 Black Box Testing

Black box testing aims to test a given program's behavior against its specification without making any reference to the internal structure of the program or the algorithms used. The source code is therefore not needed, so even purchased modules can be tested. We study the system by examining its inputs and related outputs; the key is to devise inputs that have a higher likelihood of causing outputs that reveal the presence of defects. We use experience and knowledge of the domain to identify such test cases; failing this, a systematic approach may be necessary. Equivalence partitioning exploits the fact that the input to a program falls into a number of classes (e.g. positive numbers vs. negative numbers) and that programs normally behave the same way for each member of a class. Partitions exist for both input and output, and they may be discrete or overlap. Invalid data (i.e. data outside the normal partitions) is itself a partition that should be tested. Test cases are chosen to exercise each partition, and boundary cases (atypical, extreme, zero) should also be tested, since these frequently show up defects. For completeness, all combinations of partitions can be tested. Black box testing is rarely exhaustive (because one does not test every value in an equivalence partition) and sometimes fails to reveal corruption defects caused by unusual combinations of inputs. Black box testing should not be used to try to reveal corruption defects caused, for example, by assigning a pointer to point to an object of the wrong type; static inspection (or using a better programming language) is preferred.
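As a small illustration of equivalence partitioning and boundary testing, consider a hypothetical validator for grayscale pixel values; the function is invented purely for the example and is not part of the project:

def is_valid_pixel(value):
    """Hypothetical check: grayscale pixel values must lie in the range 0-255."""
    return 0 <= value <= 255

# One representative per partition plus the boundary values.
assert is_valid_pixel(128)       # valid partition (typical case)
assert is_valid_pixel(0)         # lower boundary
assert is_valid_pixel(255)       # upper boundary
assert not is_valid_pixel(-1)    # invalid partition: below the range
assert not is_valid_pixel(256)   # invalid partition: above the range
print("all partition tests passed")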


13.1.2 White box Testing
White box testing was used as an important primary testing approach. Code is tested using code scripts, drivers, stubs, etc., which are employed to directly interface with the code and drive it. The tester can analyze the code and use knowledge about the structure of a component to derive test data. This testing is based on knowledge of the structure of the component (e.g. by looking at the source code). The advantage is that the structure of the code can be used to find out how many test cases need to be performed. Knowledge of the algorithm (examination of the code) can be used to identify the equivalence partitions. Path testing is where the tester aims to exercise every independent execution path through the component. All conditional statements are tested for both true and false cases. If a unit has n control statements, there will be up to 2^n possible paths through it. This demonstrates that it is much easier to test small program units than large ones. Flow graphs are a pictorial representation of the paths of control through a program (ignoring assignments, procedure calls and I/O statements); we use a flow graph to design test cases that execute each path. Static tools may be used to make this easier in programs that have a complex branching structure. Dynamic program analyzers instrument a program with additional code; typically this will count how many times each statement is executed and, at the end, print out a report showing which statements have and have not been executed. Possible methods: the usual method is to ensure that every line of code is executed at least once; test capabilities rather than components (e.g. concentrate on tests for data loss over tests for screen layout); test old in preference to new (users are less affected by failure of new capabilities); test typical cases rather than boundary ones (to ensure normal operation works properly).

13.1.3 Debugging
Debugging is a cycle of detection, location, repair and test. Debugging is a hypothesis-testing process. When a bug is detected, the tester must form a hypothesis about the cause and location of the bug. Further examination of the execution of the program (possibly including many reruns of it) will usually take place to confirm the hypothesis. If the hypothesis is demonstrated to be incorrect, a new hypothesis must be formed. Debugging tools that show the state of the program are useful for this, but inserting print statements is often the only approach. Experienced debuggers use their knowledge of common and/or obscure bugs to facilitate the hypothesis-testing process. After fixing a bug, the system must be retested to ensure that the fix has worked and that no other bugs have been introduced. In principle, all tests should be performed again, but this is often too expensive to do.
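To make the path-testing point above concrete (up to 2^n paths for n control statements), the following is a hedged sketch; the classify routine, its thresholds and its labels are invented for illustration and are not taken from the project.

```java
// Illustration only: a toy unit with n = 2 control statements,
// giving 2^n = 4 independent paths to exercise.
public class PathTestingDemo {

    // Label a segmented symbol by (hypothetical) width and ink density.
    static String classify(int width, double inkRatio) {
        String label = "symbol";
        if (width < 5) {              // condition 1: taken true and false
            label = "narrow " + label;
        }
        if (inkRatio > 0.6) {         // condition 2: taken true and false
            label = "dense " + label;
        }
        return label;
    }

    public static void main(String[] args) {
        // Four input pairs chosen so that each condition is exercised both ways,
        // covering all 4 paths through the unit.
        System.out.println(classify(2, 0.9));   // cond1 true,  cond2 true
        System.out.println(classify(2, 0.1));   // cond1 true,  cond2 false
        System.out.println(classify(10, 0.9));  // cond1 false, cond2 true
        System.out.println(classify(10, 0.1));  // cond1 false, cond2 false
    }
}
```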

13.2 TESTING PLANNING
Testing needs to be planned to be cost and time effective. Planning is setting out standards for tests. Test plans set the context in which individual engineers can place their own work. A typical test plan contains:
Overview of the testing process.
Recording procedures so that tests can be audited.
Hardware and software requirements.
Constraints.

13.2.1 Testing Done in our System
The best approach is to test each subsystem separately, as we have done in our project. It is best to test a system during the implementation stage in the form of small sub-steps rather than large chunks. We have tested each module separately, i.e. we completed unit testing first, and system testing was done after combining/linking all the different modules with different menus, followed by thorough testing. Once each lowest-level unit has been tested, units are combined with related units and retested in combination. This proceeds hierarchically, bottom-up, until the entire system is tested as a whole. Hence we have used the bottom-up approach for testing our system.


Typical levels of testing in our system:
Unit - procedure, function, method
Module - package, abstract data type
Sub-system - collection of related modules, method-message paths
Acceptance testing - whole system with real data (involves customer, user, etc.)
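The following is a rough sketch of what a test at the lowest (unit) level listed above might look like; LineSegmenterUnitTest and rowInkCounts are hypothetical names used only to illustrate the bottom-up order, not the project's actual classes.

```java
// Illustrative unit-level test driver (no real project classes are referenced).
public class LineSegmenterUnitTest {

    // Stand-in for the unit under test: count black (true) pixels per row.
    static int[] rowInkCounts(boolean[][] image) {
        int[] counts = new int[image.length];
        for (int r = 0; r < image.length; r++)
            for (int c = 0; c < image[r].length; c++)
                if (image[r][c]) counts[r]++;
        return counts;
    }

    public static void main(String[] args) {
        boolean[][] tiny = {
            { false, false, false },   // blank row -> 0
            { true,  true,  false },   // text row  -> 2
            { false, false, false }    // blank row -> 0
        };
        int[] expected = { 0, 2, 0 };
        int[] actual = rowInkCounts(tiny);
        System.out.println(java.util.Arrays.equals(expected, actual)
                ? "unit test PASS" : "unit test FAIL");
    }
}
```

Once such units pass, they would be combined with related units at the module and sub-system levels and retested, in line with the bottom-up order described above.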

13.2.2 Beta Testing
Beta testing is acceptance testing with a single client. It is normally preceded by alpha testing, which is conducted at the developer's site by a customer: the software is used in a natural setting, in a controlled environment, with the developer looking over the shoulder of the user and recording errors and usage problems. Alpha testing usually comes in after the completion of the basic design of the program. The project guide who looks over the program, or other knowledgeable officials, may make suggestions and give ideas to the designer for further improvement. They also report any minor or major problems, help in locating them, and may further suggest ideas to get rid of them. Naturally a number of bugs are expected after the completion of a program, and these are most likely to become known to the developers only after alpha testing. Beta testing involves distributing the system to potential customers to use and provide feedback. It is conducted at one or more customer sites by the end-user of the software. Unlike alpha testing, the developer is generally not present; therefore, the beta test is a live application of the software in an environment that cannot be controlled by the developer. The customer records all problems (real or imagined) that are encountered during beta testing and reports these to the developer at regular intervals. As a result of problems reported during the beta test, software engineers make modifications and then prepare for release of the software product to the entire customer base. In this project, this exposes the system to situations and errors that might not be anticipated by us.

13.3 IMPLEMENTATION
Implementation includes all those activities that take place to convert from the old system to the new one. The new system may be completely new. Successful implementation does not guarantee improvement in the organization using the new system, but improper installation will prevent any improvement. Implementation uses the design document to produce code. Demonstration that the program satisfies its specification validates the code. Typically, sample runs of the program demonstrating the behavior for expected data values and boundary values are required. Small programs are written using this model; it may take several iterations of the model to produce a working program. As programs get more complicated, testing and debugging alone may not be enough to produce reliable code. Instead, we have to write programs in a manner that will help ensure that errors are caught or avoided.

13.3.1 Incremental program development
As a program becomes more complex, changes have a tendency to introduce unexpected effects. Incremental programming tries to isolate the effects of changes. We add new features in preference to adding new functions, and add new functions rather than writing new programs. The program implementation model becomes (see the sketch after this list):
1. Define types / compile / fix;
2. Add load and dump functions / compile / test;
3. Add first processing function / compile / test / fix;
4. Add features / compile / test / fix;
5. Add second processing function / compile / test / fix;
6. Keep adding features / compiling / testing / fixing.
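A minimal sketch of the first three steps of this model, with invented type and function names (Page, load, dump, countBlackPixels); each stage below would be compiled and tested before the next is added.

```java
// Step 1: define types (compile / fix).
class Page {
    int width, height;
    boolean[][] pixels;          // true = black
}

public class IncrementalBuild {

    // Step 2: load and dump functions (compile / test) -- stubbed here;
    // real loading would be filled in by a later increment.
    static Page load(String path) {
        Page p = new Page();
        p.width = 0;
        p.height = 0;
        p.pixels = new boolean[0][0];
        return p;
    }

    static void dump(Page p) {
        System.out.println("page " + p.width + "x" + p.height);
    }

    // Step 3: first processing function (compile / test / fix).
    static int countBlackPixels(Page p) {
        int n = 0;
        for (boolean[] row : p.pixels)
            for (boolean b : row) if (b) n++;
        return n;
    }

    public static void main(String[] args) {
        Page p = load("sample.jpg");          // hypothetical input path
        dump(p);
        System.out.println("black pixels: " + countBlackPixels(p));
    }
}
```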


CHAPTER 14

MAINTENANCE

Fig. 14.1: OCR Maintenance

The maintenance phase starts after the final software product is delivered to the client. It identifies and implements the changes associated with the correction of errors that may arise after the customer has started using the developed software. It also handles the changes associated with changes in the software environment and in customer requirements. Once the system is live, the maintenance phase is important. Service after sale is a must, and users/clients must be helped after the system is implemented. If they face any problem in using the system, one or two trained persons from the developer's side can be deputed at the client's site, so that problems are avoided and, if a problem does occur, an immediate solution can be provided.


The maintenance provided with our system after installation is as follows. First of all there was a classification of the maintenance plan, which meant that the people involved in providing the after-sales support were divided. The main responsibility was on the shoulders of the Project Manager, who would be informed in case any bug appeared in the system or any other kind of problem arose causing a disturbance in functioning. The Project Leader in turn would approach us to solve the various problems at the technical level (e.g. the form isn't accepting data in a proper format, or it is not saving data in the database).

14.1 COST ESTIMATION
Cost estimation depends upon the following:
Project complexity
Project size
Degree of structural uncertainty
Human, technical, environmental and political factors can also affect the ultimate cost of software and the effort applied to develop it. Options for achieving reliable cost and effort estimates include:
Delay estimation until late in the project.
Base estimates on similar projects that have already been completed.
Use relatively simple decomposition techniques to generate project cost and effort estimates.
Use one or more empirical models for software cost and effort estimation.

Project complexity, project size and the degree of structural uncertainty all affect the reliability of estimates. For complex, custom systems, a large cost-estimation error can make the difference between profit and loss. A model is based on experience and takes the form D = f(vi), where D is one of a number of estimated values (e.g. effort, cost, project duration) and vi are selected independent parameters (e.g. estimated LOC (lines of code) or FP (function points)).
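As a hedged, worked illustration of a model of this form, the sketch below uses the published basic COCOMO "organic mode" coefficients; the KLOC figure is an assumed example, not an estimate actually made for this project.

```java
// Illustration of an empirical model D = f(vi): basic COCOMO, organic mode.
// Effort (person-months) = 2.4 * KLOC^1.05; duration (months) = 2.5 * effort^0.38.
public class CostEstimateDemo {
    public static void main(String[] args) {
        double kloc = 5.0;                               // assumed size estimate
        double effort = 2.4 * Math.pow(kloc, 1.05);      // person-months
        double duration = 2.5 * Math.pow(effort, 0.38);  // calendar months
        System.out.printf("KLOC=%.1f effort=%.1f person-months duration=%.1f months%n",
                kloc, effort, duration);
    }
}
```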


CHAPTER 15

ASSUMPTIONS MADE
1. The input scanned document is assumed to be only in jpg, gif or jpeg format (a validation sketch follows this list).

2. The input scanned document consists only of text in black written on a white background; it contains no graphical images.

3. After loading the image, line segmentation is performed first and only then can word segmentation be performed; that is, the Line Segmentation button has to be clicked first. Trying to do word segmentation before that will not affect the original document.

4. Lines can be dragged, dropped, added or deleted only after default line segmentation has been performed on the click of Line Segmentation button.

5. For loading another image, the clear button is pressed and then the image is loaded.
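A minimal sketch of how assumption 1 could be enforced when an image is loaded; the class and method names and the accepted-extension check are illustrative, not the project's actual loader.

```java
// Illustrative check for assumption 1: accept only jpg, jpeg or gif files.
import java.util.Locale;

public class InputFormatCheck {
    static boolean isSupported(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".jpg") || lower.endsWith(".jpeg") || lower.endsWith(".gif");
    }

    public static void main(String[] args) {
        System.out.println(isSupported("scan01.jpg"));   // true
        System.out.println(isSupported("scan01.png"));   // false -> reject before loading
    }
}
```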


CHAPTER 16

RESULTS
A sample text, its line segmentation, word segmentation and character segmentation are shown next. These are actual screen dumps.

Fig. 16.1: OCR Working


Fig. 16.2: OCR Working


Fig. 16.3: OCR Working


Fig. 16.4: OCR Working


CHAPTER 17

SUMMARY AND CONCLUSION

A document system has been developed which uses various knowledge sources to improve performance. The composite characters are first segmented into their constituent symbols, which helps in reducing the size of the symbol set, in addition to being a natural way of dealing with the script. The automated trainer makes two passes over the text image to learn the features of all the symbols of the script. A character-pair expert resolves confusion between two candidate characters. The composition processor puts the symbols back together to form the words, which are then passed through the dictionary. The dictionary corrects only those characters which cause a mismatch and have been recognized with low confidence. Preliminary results on testing of the system show a performance of more than 95% on printed texts in individual fonts. Further testing is currently underway for multi-font and hand-printed texts. Most of the errors are due to inaccurate segmentation of symbols within a word. We are using only up to word-level knowledge in our system; domain knowledge and sentence-level knowledge could be integrated to further enhance the performance, in addition to making it more robust.

The method utilizes an initial stage in which successive columns (vertical strips) of the scanned array are organized in groups of one pitch width to yield a coarse line pattern (CLP) that crudely shows the distribution of white and black along the line. The CLP is analyzed to estimate baseline and line-skew parameters by transforming the CLP by different trial line skews within a specified range. For every transformed CLP (XCLP), the number of black elements in each row is counted and the row-to-row change in this count is also calculated. The XCLP giving the maximum negative change (decrease) is assumed to have zero skew. The skew-corrected row that gives the maximum gradient serves as the estimated baseline. Successive pattern fields of the scanned array having unit pitch width are superposed (after skew correction) and summed. The resulting sum matrix tends to be sparse in the inter-character area. Thus, the column having the minimum sum is recorded as an "average", or coarse, X-direction segmentation position. Each character pattern is examined individually, with the known baseline (corrected for skew) and the average segmentation column as references. A number of neighboring columns (3 columns, for example) to the left and right of the average segmentation column are included in the view that is analyzed for full segmentation by a conventional algorithm.
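A hedged sketch of the column-sum step described above; the field array, its values and the class name are invented for illustration and are not taken from the project.

```java
// Illustration of coarse X-direction segmentation by column sums.
// 'field' stands for the superposed, skew-corrected pattern field.
public class ColumnSumSegmentation {

    // Return the index of the column with the minimum sum of black elements.
    static int coarseSegmentationColumn(int[][] field) {
        int bestCol = 0;
        long bestSum = Long.MAX_VALUE;
        int cols = field[0].length;
        for (int c = 0; c < cols; c++) {
            long sum = 0;
            for (int[] row : field) sum += row[c];
            if (sum < bestSum) {      // sparse column -> likely inter-character gap
                bestSum = sum;
                bestCol = c;
            }
        }
        return bestCol;
    }

    public static void main(String[] args) {
        int[][] field = {
            { 3, 2, 0, 4, 5 },
            { 4, 1, 0, 3, 4 },
            { 2, 2, 1, 5, 3 }
        };
        // Column sums are 9, 5, 1, 12, 12, so column 2 is reported.
        System.out.println("coarse segmentation column: "
                + coarseSegmentationColumn(field));
    }
}
```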


CHAPTER 18

REFERENCES
1. http://en.wikipedia.org/wiki/Optical_character_recognition

2. G. Nagy. At the frontiers of OCR. Proceedings of the IEEE, 80(7):1093--1100, July 1992.

3. S. Tsujimoto and H. Asada. Major components of a complete text reading system. Proceedings of the IEEE, 80(7):1133--1149, July 1992.

4. Y. Tsujimoto and H. Asada. Resolving Ambiguity in Segmenting Touching Characters. In ICDAR [ICD91], pages 701--709.

5. R. A. Wilkinson, J. Geist, S. Janet, P. J. Grother, C. J. C. Burges, R. Creecy, B. Hammond, J. J. Hull, N. J. Larsen, T. P. Vogl, and C. L. Wilson. The first census optical character recognition systems conference. Technical Report NISTIR-4912, National Institute of Standards and Technology, U.S. Department of Commerce, 1992.

