Offline Omni Font Arabic Optical Text Recognition System using Prolog Classification Technique
Rami Al-Sahhar
Ideas for today and tomorrow
Agenda
  • OCR Overview
  • The Arabic OCR Problem
  • OCR Challenges
  • Proposed Solution
  • Detailed system stages
  • Sample Run
  • Future Work
  • Demo
OCR Overview
  • Optical character recognition (OCR) is the process of converting an image of text, such as a scanned paper document, into computer-editable text
  • The ultimate goal of OCR is to simulate the human ability to read both machine-printed and handwritten text
  • Most of the work on OCR has been on Latin and Chinese characters
  • Arabic character recognition started more recently and has advanced relatively slowly because of the complexity of recognizing Arabic text, whose characters are cursive in nature
  • Arabic character recognition is still an open and challenging field of research
The Arabic OCR Problem
  • The goal is to propose a complete system that classifies and recognizes machine-printed Arabic text
  • The input to the system is a TIFF image file
  • The Arabic font size varies from 8 up to 36
  • The font type is Arabic Simplified or Traditional Arabic
  • The image is scanned at a resolution of 300 dpi
  • The output is editable text in a word processor program (MS Word)
OCR Challenges
  • Understanding the TIFF image format and pixel representation
  • Programmatically reading the TIFF image pixel by pixel from right to left
  • Feature extraction
  • Segmentation free
  • Space, word, letter and line isolation
  • Noise reduction
  • Dots and holes
  • Overlapped characters
OCR Challenges: Arabic Character Characteristics
  • Right to left
  • Always cursive
  • Change of character shape according to its location in the word
      • Four different shapes
  • 28 basic characters: 15 with dots, 13 without
  • No fixed character width and no fixed size
OCR Challenges: Arabic Character Characteristics
  [Figure: Group of Arabic character shapes]
  [Figure: A sample of written Arabic showing some of its characteristics]
The Proposed Solution
  • The proposed system starts from the document image acquisition stage and ends with recognized Arabic text in standard Simplified TrueType font format in MS Word 2007
  • We started designing our system by experimenting with prior researchers' techniques, adopting or modifying those that met our requirements and otherwise developing our own
  • Consequently, the components of our system are either the work of others, our improvements on others' work, or completely new techniques of our own
The Proposed Solution: ATR (Arabic Text Recognition) System model
  [Figure: system model pipeline]
  • C-based: TIFF image file → Preprocessing → Feature extraction → Post enhancement
  • Prolog-based: Classification and recognition → Recognized text
The Proposed Solution: Preprocessing Phase
  • Digitalization, scaling, word-level segmentation, noise removal and elimination of redundant information as far as possible
  • Image information retrieval
      • Load/read the input TIFF image file as binary
      • Retrieve the image properties (size, width, height, pixel resolution, image channels and image alignment)
      • Create memory storage for the system's intermediate processing
  • Image digitalization
      • Digitizes the TIFF image by applying fixed-level thresholding, converting the gray-scale or bitmapped image to a binary image (0's and 1's representation)
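The fixed-level thresholding step can be illustrated with a minimal Python sketch. It is not the system's C implementation: the image is assumed to be already decoded into rows of 8-bit gray values, and the threshold of 128 and the "dark pixel means ink" convention are assumptions chosen for the example.

```python
def binarize(gray_rows, threshold=128):
    """Fixed-level thresholding: 1 marks ink (dark pixel), 0 marks background.

    `threshold` is an assumed value; the slides do not state the level used.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]


# A tiny hypothetical 3x4 gray-scale patch (0 = black, 255 = white).
gray = [
    [255, 250, 30, 240],
    [245, 20, 10, 235],
    [250, 255, 40, 128],
]
binary = binarize(gray)
for row in binary:
    print(row)
```

A single global threshold is the simplest choice and suits clean 300 dpi scans of printed text; unevenly lit scans would likely need an adaptive threshold instead.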
The Proposed Solution: Text line detection and word segmentation
  • The system computes the vertical and horizontal histograms to retrieve the number of lines per page and the number of components (words) per line
  • We calculate the font baseline and size by finding the maximum of the horizontal histogram of each line on the page
  • This enables dots and other special marks such as Shadda, Madda and Tanween to be classified as upper or lower components relative to this baseline
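The histogram idea can be sketched in a few lines of Python, assuming a binary image stored as rows of 0/1 values with a top-left origin. The function names are hypothetical, and splitting a line into components at empty columns is a simplification made for this example, not necessarily how the system isolates words.

```python
def runs(counts):
    """Return (start, length) for each maximal run of non-zero counts."""
    found, start = [], None
    for i, count in enumerate(counts):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            found.append((start, i - start))
            start = None
    if start is not None:
        found.append((start, len(counts) - start))
    return found


def text_lines(binary):
    """Horizontal histogram: rows containing ink, grouped into (y, height) lines."""
    return runs([sum(row) for row in binary])


def components(binary, y, height):
    """Vertical histogram inside one line: (x, width) of each ink-bearing column run."""
    band = binary[y:y + height]
    return runs([sum(column) for column in zip(*band)])


def baseline(binary, y, height):
    """Row (in page coordinates) where the line's horizontal histogram peaks."""
    histogram = [sum(row) for row in binary[y:y + height]]
    return y + histogram.index(max(histogram))


# Hypothetical 8x6 page with two short text lines.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
for y, height in text_lines(page):
    print(y, height, baseline(page, y, height), components(page, y, height))
```

The baseline is taken as the densest row because connected Arabic script concentrates most of its ink along the line that joins the letters.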
The Proposed Solution: Preprocessing phase, Text Line Detection (sample output)
  B&W image is found in file name: [ test1.tif]
  Processing a [1615x2160] image with [1] channel(s)
  Image Origin : [Top-left Origin] , Align : [4-]
  Data Order :[Interleaved Color Channels]
  Number of Lines(s) found: [6]
  Line #0 , Y = 78 , Height = [67]
  Line #1 , Y = 185 , Height =[ 67]
  Line #2 , Y = 292 , Height = [67]
  Line #3 , Y = 399 , Height = [67]
  Line #4 , Y = 506 , Height = [67]
  Line #5 , Y = 613 , Height = [67]
  Font Baseline =[ 38 pixels]
  Number of Components found at Image Line #0 : [9]
  Number of Components found at Image Line #1 : [14]
  Number of Components found at Image Line #2 : [16]
  Number of Components found at Image Line #3 : [18]
  Number of Components found at Image Line #4 : [10]
  Number of Components found at Image Line #5 : [6]
The Proposed Solution: Preprocessing phase, Word Segmentation (sample output)
  Number of retrieved contours : [2]
  *************  Bounding Rectangle (1,22)-(72,65)  ************
  Component [1] Origin Y = 22 , Height = 43 Area = 429.000000
  Component [2] Origin Y = 47, Height = 7 Area = 19.000000
  Max Component Area = 429.000000 , Y = 22 , H = 43
The Proposed Solution: Feature Extraction Phase
  • Feature extraction is the most challenging part of character or text recognition
  • The choice of good features significantly improves the recognition rate and minimizes the error in the presence of noise
  • The main selected features are:
      • Outer contours described in Freeman chain codes
      • Contours' corners
      • Dot information
      • Estimated font size
  • All of these features are extracted for all detected components during the page scanning
The Proposed Solution: Freeman chain code
  • The chain code was introduced by Freeman as a means of representing lines or boundaries of shapes by a connected sequence of straight-line segments of specified length and direction
  [Figure: An example of the 8-connectivity chain code]
  [Figure: Chain code numbering schemes]
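As a small illustration of 8-connectivity chain coding, the Python sketch below converts an ordered list of boundary pixels into direction codes. The numbering used (0 = east, increasing counter-clockwise in 45° steps, with image y growing downward) is one common scheme and is an assumption here, since the numbering figure from the slide is not available; `chain_code` is a hypothetical helper name.

```python
# Assumed numbering: 0 = east, counter-clockwise, in image coordinates
# where x grows to the right and y grows downward.
DIRECTIONS = {
    (1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
    (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7,
}


def chain_code(points):
    """Encode consecutive 8-connected boundary points as Freeman direction codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"points {(x0, y0)} and {(x1, y1)} are not 8-neighbours")
        codes.append(DIRECTIONS[step])
    return codes


# Boundary of a 3x3 square, traced from the top-left corner and back to it.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1), (0, 0)]
print(chain_code(square))
```

Because each code stores only a direction, the representation is independent of where the shape sits on the page, which is what makes it usable as a shape feature.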
The Proposed Solution: Contour extraction process
  • This is the core process that extracts the main word-level features of the Arabic text in Freeman chain code format
  • After extracting the Freeman codes, we aggregate them into pairs (X, Y), where X is the direction (i.e. from 0 to 7) and Y is the run length in pixels
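Aggregating a chain code into (direction, length) pairs amounts to run-length encoding. The Python sketch below shows this with a hypothetical `to_pairs` helper; the input is the opening of the deck's sample chain code, and any further alignment step the system applies is not modelled.

```python
from itertools import groupby


def to_pairs(chain):
    """Run-length encode a chain code into (direction, run length) pairs."""
    return [(direction, len(list(group))) for direction, group in groupby(chain)]


# Opening of the sample contour: fourteen 2s followed by a short zig-zag.
chain = [2] * 14 + [3, 2, 3, 2, 3, 6, 5, 6, 6, 5, 7, 7, 7]
pairs = to_pairs(chain)
print(pairs)
```

The pair form is much shorter than the raw code on long straight strokes, and the run lengths always sum back to the number of contour steps.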
The Proposed Solution: Feature extraction (Freeman chain codes, pairs and corner positions)
  Contours Freeman Chain Codes
  [2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,2,3,2,3,6,5,6,6,5,7,7,7,6,7,6,6,6,6,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,4,3,1,1,1,1,2,2,2,3,4,4,3,4,4,4,4,4,4,4,4,5,4,6,5,6,6,6,1,0,0,7,7,7,5,4,5,4,4,5,4,4,4,4,4,3,2,2,2,2,2,2,2,2,2,3,2,3,2,3,6,5,6,5,7,6,7,7,6,7,6,6,6,6,5,4,4,4,4,4,4,4,4,4,4,4,4,4,3,2,2,2,2,2,2,2,3,2,2,2,3,2,3,2,3,3,3,3,4,3,6,6,6,6,7,0,7,7,7,7,6,7,6,6,6,7,6,6,6,6,6,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,6,6,6,6,7,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,7,7,7,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
  Contours Freeman Chain Code Pairs - Aligned
  Total Pairs : [95] ==> [(2,14),(3,1),(2,1),(3,1),(2,1),(3,1),(6,1),(5,1),(6,2),(5,1),(7,3),(6,1),(7,1),(6,4),(5,1),(4,19),(3,1),(4,1),(3,1),(1,4),(2,3),(3,1),(4,2),(3,1),(4,8),(5,1),(4,1),(6,1),(5,1),(6,3),(1,1),(0,2),(7,3),(5,1),(4,1),(5,1),(4,2),(5,1),(4,5),(3,1),(2,9),(3,1),(2,1),(3,1),(2,1),(3,1),(6,1),(5,1),(6,1),(5,1),(7,1),(6,1),(7,2),(6,1),(7,1),(6,4),(5,1),(4,13),(3,1),(2,7),(3,1),(2,3),(3,1),(2,1),(3,1),(2,1),(3,4),(4,1),(3,1),(6,4),(7,1),(0,1),(7,4),(6,1),(7,1),(6,3),(7,1),(6,5),(5,1),(4,14),(3,2),(6,4),(7,2),(0,38),(1,1),(0,2),(1,1),(0,1),(1,1),(0,1),(1,1),(7,3),(0,1),(7,1),(0,22)]
  Contour Corners Positions:
  [5,18,23,33,37,60,67,81,87,90,92,103,113,118,132,146,160,162,168,172,178,180,189,202,206,210,258,263,]
The Proposed Solution: Corner Detection
  • This phase detects and extracts the contour corners of each component of the text under processing
  • It is based on an implementation of contour detection and curve representation by a circular local histogram of the contour chain code, presented by [Arrebola, Camacho, Bandera, & Sandoval (1999)]
  • The corner detection phase is very important for the subsequent classification and recognition phase
  • It helps our Prolog engine to determine the unique shape of the character's features regardless of the character's orientation
  • The output of this phase is a stream of corner information that serves as input to the next phase
The Proposed Solution: Contour enhancement
  • We introduced an algorithm to remove noisy pixels that fall within any straight line, and to convert Arabic characters to approximately straight lines
  • These enhancement rules, which are derived from testing Arabic characters multiple times, reduce the time required for character recognition
The Proposed Solution: Dot Detection and Font Size Estimation
  • Dot detection is another challenging and important task
  • It helps the Prolog classification engine to recognize words that include dots, and it decides which characters are un-dotted by nature
  • Font size estimation is a critical task in omni-size character recognition systems
  • Font size estimation is usually used to find the pen width in online recognition systems
  • Approaches for estimating the font size from dots are presented in [Shirali-Shahreza, & Shirali-Shahreza (2006)]
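One way to picture dot detection is as a size-and-position test on the components of a word. The Python sketch below is an illustration under stated assumptions, not the system's algorithm: a component is treated as a dot when its area is below an assumed fraction (`dot_area_ratio`) of the largest component, and it is labelled upper or lower by comparing its vertical centre with the baseline, with y growing downward. The first two components reuse the figures from the deck's word segmentation sample and its reported baseline of 38 pixels, assuming both are measured in the same coordinate frame; the third component is invented for the example.

```python
def classify_components(components, baseline_y, dot_area_ratio=0.1):
    """Label each component as 'body', 'upper' (dot above) or 'lower' (dot below).

    components: list of dicts with keys 'y' (top edge), 'height' and 'area'.
    dot_area_ratio is an assumed cut-off, not a value taken from the slides.
    """
    main_area = max(component["area"] for component in components)
    labels = []
    for component in components:
        if component["area"] >= dot_area_ratio * main_area:
            labels.append("body")
            continue
        centre_y = component["y"] + component["height"] / 2
        # With a top-left origin, smaller y means higher on the page.
        labels.append("upper" if centre_y < baseline_y else "lower")
    return labels


word = [
    {"y": 22, "height": 43, "area": 429.0},  # main stroke from the sample output
    {"y": 47, "height": 7, "area": 19.0},    # small component from the sample output
    {"y": 5, "height": 6, "area": 15.0},     # hypothetical dot above the stroke
]
labels = classify_components(word, baseline_y=38)
print(labels)
```

The resulting labels, together with a count of dots per position, are the kind of dot information a later classification stage could consume.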
The Proposed Solution
  [Figure: Font size against calculated component's height (in pixels)]
The Proposed Solution: Definite Clause Grammar (DCG)
  • Provides a mechanism for defining the grammar rules of a language
  • These rules are automatically translated to a Prolog program which defines a parser for the language being defined
  • Grammar rules are a feature only in some Prolog systems, and are designed to facilitate the parsing of natural language
  • Using this notation, a grammar is represented as a set of logical rules
  • When the DCG rules are consulted (or optimized), they are translated into Prolog clauses
The Proposed Solution: Word-level Classification and Recognition Phase
  • This is the most critical phase in our proposed ATR system
  • It is written in Prolog, using Prolog matching, backtracking and DCG techniques
  • The input for this phase is data on two features:
      • The first input stream is the corner sequence of the word-level outer contours for each component, which represents the elevation information of the input (the upper part that holds most of the features)
      • The second input stream is the dot information found in the same component
The Proposed Solution
  • The Prolog matching and backtracking techniques use the corner sequence stream to classify the unknown inputs into character classes
  • The Prolog DCG technique then uses the dot information stream to recognize the actual Arabic letter within a particular character class
The Proposed Solution: DCG implementation
  • The DCG grammar structure and some of the character classes are described below:

    % DCG part for Arabic text recognition based on two input streams
    % usage: phrase(s(R),[m,h_c,d1,m,dc]).
    s([H|T]) --> cc(H), subs(T).   % every string is a character class followed by a sub-string
    s(R) --> cc(R).                % or a string can be simply a character class
    subs(R) --> s(R).              % a substring is nothing but a string (recursively)
    cc(R) --> ch(R).               % a character class can be a simple character
    % or character classes can belong to any of the following classes
    cc(R) --> bc(R).               % Ba class    (ba, ta, tha, ya_md)
    cc(R) --> h_c(R).              % H_ class    (h_, jeem, kha)
    cc(R) --> dc(R).               % Dal class   (dal, thal)
    cc(R) --> rc(R).               % Ra' class   (ra, zay)
    cc(R) --> sc(R).               % Seen class  (seen, sheen)
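The two-stream idea (shape picks the character class, dots pick the letter inside it) can be illustrated in Python with a plain lookup table. This is a deliberately simplified stand-in and not a DCG parser: the token format, the `LETTERS` table and the `recognize` function are assumptions made for the example, while the class names and their member letters follow the comments in the grammar excerpt and the dot patterns follow standard Arabic orthography.

```python
# (character class, number of dots, dot position) -> letter name.
LETTERS = {
    ("bc", 1, "lower"): "ba",
    ("bc", 2, "upper"): "ta",
    ("bc", 3, "upper"): "tha",
    ("bc", 2, "lower"): "ya_md",
    ("h_c", 0, None): "h_",
    ("h_c", 1, "lower"): "jeem",
    ("h_c", 1, "upper"): "kha",
    ("dc", 0, None): "dal",
    ("dc", 1, "upper"): "thal",
    ("rc", 0, None): "ra",
    ("rc", 1, "upper"): "zay",
    ("sc", 0, None): "seen",
    ("sc", 3, "upper"): "sheen",
}


def recognize(shape_stream, dot_stream):
    """Combine the class of each shape with its dot information.

    shape_stream: list of class names, e.g. ["bc", "rc"].
    dot_stream:   list of (dot count, position) tuples, one per shape.
    Unknown combinations are returned as None rather than guessed.
    """
    return [
        LETTERS.get((shape, count, position))
        for shape, (count, position) in zip(shape_stream, dot_stream)
    ]


result = recognize(["bc", "h_c", "rc"], [(1, "lower"), (1, "upper"), (0, None)])
print(result)
```

A grammar-based recognizer adds what this table lacks: backtracking over alternative class assignments when the first match leads to no valid word.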
The Proposed Solution: Microsoft Word Document Integration
  • This is the final phase of our optical text recognition system
  • It is written in Prolog and interfaces with the Microsoft Word program
  • It uses the Microsoft Word Document API to write the recognized characters into a new Word document
  • It writes the output text in the recognized font size and in a predefined font type
  • It also writes the white spaces and new lines to maintain the original text alignment and format
The Proposed Solution: detailed system stages
  [Figure: flow diagram of the system stages]
  • Document Image (TIFF File)
  • Image Information Retrieval: Height/Width, Pixel Resolution
  • Image Digitalization
  • Word-level Segmentation: Font Baseline, Font Size, Lines per Page, Words per Line, Component Coordinates (X, Y)
  • Word-level Contour Extraction: Freeman Chain Codes; Component Area, Height and Width
  • Contour Enhancement
  • Corner Detection: Character Shape Stream
  • Dots Detection: Dot Information Stream
  • Character-Shape Prolog Matching: Character Reference Database
  • DCG Engine
  • Word-level Recognition
  • MS Word Document Integration
  • Recognized Text (Word Document)
Sample Run
  [Figure: Original TIFF image with Arabic text]
  [Figure: The recognized Arabic text in MS Word 2007]
Future Work
  • Support more Arabic font types
  • Support more image types (GIF, BMP, JPEG, etc.)
  • Support different font sizes on the same page
  • Support Arabic and English fonts together, plus numeric and special characters
  • Support a spellchecker and word suggestions
  • Implement the system as an Arabic business card reader
  • Capture-and-recognize feature for iPhone
Demo
OCR Applications
  • Industries and institutions in which control of large amounts of paperwork is critical
      • Banking, credit card and insurance industries
  • The medical community
      • To capture, store and transmit radiology images
  • Libraries and archives
      • For conservation and preservation of vulnerable documents and for the provision of access to source documents
Offline Omni Font Arabic Optical Text Recognition System using Prolog Classification Technique

  • 1. Offline Omni Font Arabic Optical Text Recognition System using Prolog Classification Technique. Rami Al-Sahhar. Ideas for today and tomorrow.
  • 2. Agenda: OCR Overview; The Arabic OCR Problem; OCR Challenges; Proposed Solution; Detailed system stages; Sample Run; Future Work; Demo.
  • 3. OCR Overview: OCR is the process of converting an image of text, such as a scanned paper document, into computer-editable text. The ultimate goal of OCR is to simulate the human ability to read both machine-printed and hand-written texts. Most of the work on OCR has been on Latin and Chinese characters. Arabic character recognition started recently and has advanced relatively slowly due to the complexity of recognizing Arabic text, which has characters that are cursive in nature. Arabic character recognition is still an open and challenging field of research.
  • 4. The Arabic OCR Problem: To propose a complete system that classifies and recognizes machine-printed Arabic text. The input to the system is a TIFF image file. The Arabic font size varies from 8 up to 36. The font type is Arabic Simplified or Traditional Arabic. The image is scanned at 300 dpi (resolution). The output is editable text in a word processor program (MS Word).
  • 5. OCR Challenges: Understanding the TIFF image format and pixel representation; programmatically reading the TIFF image pixel by pixel from right to left; feature extraction; segmentation free; isolation of spaces, words, letters and lines; noise reduction; dots and holes; overlapped characters.
  • 6. OCR Challenges, Arabic Character Characteristics: Right to left; always cursive; character shape changes according to its location in the word; four different shapes; 28 basic characters: 15 with dots, 13 without; no fixed character width and no fixed size.
  • 7. OCR Challenges, Arabic Character Characteristics: [Figure: group of Arabic character shapes] [Figure: a sample of written Arabic showing some of its characteristics]
  • 8. The Proposed Solution: The proposed system starts from the document image acquisition stage and ends with recognized Arabic text in standard Simplified true type font format in MS Word 2007. We started designing our system by experimenting with prior researchers’ techniques, adopting or modifying some of them if they met our requirements, but otherwise developing our own techniques. Consequently, the components of our system are either due to the work of others, the result of our improvement of others’ work, or our own completely new techniques.
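  • 9. The Proposed Solution: ATR (Arabic Text Recognition) system model. [Diagram; stages from input to output: TIFF image file, preprocessing, feature extraction, post enhancement, classification and recognition, recognized text. The diagram also carries the labels "C-based" and "Prolog-based".]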
  • 10. The Proposed Solution, Preprocessing Phase: Digitalization, scaling, word-level segmentation, noise removal and elimination of redundant information as far as possible. Image information retrieval: load and read the input (TIFF) image file as binary; retrieve the image properties (size, width, height, pixel resolution, image channels and image alignment); and create memory storage for the system's intermediate processing. Image digitalization: digitizes the TIFF image in order to apply fixed-level thresholding and to convert the gray-scale and bitmapped image to a binary (0s and 1s) image.
  • 11. The Proposed Solution: The digitalization step computes the vertical and horizontal histograms to retrieve the number of lines per page and the number of components (words) per line. We calculate the font baseline and size by finding the maximum of the horizontal histogram of each line per page. This enables the dots or other special characters such as Shadda, Madda, and Tanween to be classified as upper or lower components relative to this baseline. [Figures: text line detection; word segmentation]
  • 12. The Proposed Solution: Preprocessing phase, text line detection (program output). B&W image is found in file name: [test1.tif]; Processing a [1615x2160] image with [1] channel(s); Image Origin: [Top-left Origin], Align: [4-]; Data Order: [Interleaved Color Channels]; Number of Lines(s) found: [6]; Line #0, Y = 78, Height = [67]; Line #1, Y = 185, Height = [67]; Line #2, Y = 292, Height = [67]; Line #3, Y = 399, Height = [67]; Line #4, Y = 506, Height = [67]; Line #5, Y = 613, Height = [67]; Font Baseline = [38 pixels]; Number of Components found at Image Line #0: [9]; Line #1: [14]; Line #2: [16]; Line #3: [18]; Line #4: [10]; Line #5: [6].
  • 13. The Proposed Solution: Preprocessing phase, word segmentation (program output). Number of retrieved contours: [2]; Bounding Rectangle (1,22)-(72,65); Component [1] Origin Y = 22, Height = 43, Area = 429.000000; Component [2] Origin Y = 47, Height = 7, Area = 19.000000; Max Component Area = 429.000000, Y = 22, H = 43.
  • 14. The Proposed Solution, Feature Extraction Phase: This is the most challenging part of character or text recognition. The choice of good features significantly improves the recognition rate and minimizes the error in case of noise. The main selected features are: outer contours described in Freeman chain codes; contours’ corners; dot information; estimated font size. All of these features are extracted for all detected components during the page scanning.
  • 15. The Proposed Solution, Freeman chain code: The chain code was introduced by Freeman as a means of representing lines or boundaries of shapes by a connected sequence of straight-line segments of specified length and direction. [Figures: an example of the 8-connectivity chain code; chain code numbering schemes]
  • 16. The Proposed Solution, Contour extraction process: This is the core process to extract the main word-level features of the Arabic text in Freeman chain code format. After extracting the Freeman codes, we aggregate those codes into pairs (X, Y), where X is the direction (i.e. from 0 to 7) and Y is the length in pixels.
  • 17. The Proposed Solution: Feature extraction: Freeman chain codes, pairs and corner positions (program output). Contours Freeman Chain Codes: [2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,2,3,2,3,6,5,6,6,5,7,7,7,6,7,6,6,6,6,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,4,3,1,1,1,1,2,2,2,3,4,4,3,4,4,4,4,4,4,4,4,5,4,6,5,6,6,6,1,0,0,7,7,7,5,4,5,4,4,5,4,4,4,4,4,3,2,2,2,2,2,2,2,2,2,3,2,3,2,3,6,5,6,5,7,6,7,7,6,7,6,6,6,6,5,4,4,4,4,4,4,4,4,4,4,4,4,4,3,2,2,2,2,2,2,2,3,2,2,2,3,2,3,2,3,3,3,3,4,3,6,6,6,6,7,0,7,7,7,7,6,7,6,6,6,7,6,6,6,6,6,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,6,6,6,6,7,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,7,7,7,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]. Contours Freeman Chain Code Pairs - Aligned, Total Pairs: [95] ==> [(2,14),(3,1),(2,1),(3,1),(2,1),(3,1),(6,1),(5,1),(6,2),(5,1),(7,3),(6,1),(7,1),(6,4),(5,1),(4,19),(3,1),(4,1),(3,1),(1,4),(2,3),(3,1),(4,2),(3,1),(4,8),(5,1),(4,1),(6,1),(5,1),(6,3),(1,1),(0,2),(7,3),(5,1),(4,1),(5,1),(4,2),(5,1),(4,5),(3,1),(2,9),(3,1),(2,1),(3,1),(2,1),(3,1),(6,1),(5,1),(6,1),(5,1),(7,1),(6,1),(7,2),(6,1),(7,1),(6,4),(5,1),(4,13),(3,1),(2,7),(3,1),(2,3),(3,1),(2,1),(3,1),(2,1),(3,4),(4,1),(3,1),(6,4),(7,1),(0,1),(7,4),(6,1),(7,1),(6,3),(7,1),(6,5),(5,1),(4,14),(3,2),(6,4),(7,2),(0,38),(1,1),(0,2),(1,1),(0,1),(1,1),(0,1),(1,1),(7,3),(0,1),(7,1),(0,22)]. Contour Corners Positions: [5,18,23,33,37,60,67,81,87,90,92,103,113,118,132,146,160,162,168,172,178,180,189,202,206,210,258,263,]
  • 18. The Proposed Solution, Corner Detection: This phase detects and extracts the component’s contour corners of the text under processing. It is based on an implementation of contour detection and curve representation by circular local histogram of contour chain code presented by [Arrebola, Camacho, Bandera, & Sandoval (1999)]. The corner detection phase is very important for the next classification and recognition phase. It helps our Prolog engine to determine the unique shape of the character’s feature regardless of the character orientation. The output of this phase is a stream of corner information to be input for the next phase.
  • 20. We introduced an algorithm to remove noisy pixels which come within any straight line, and to convert Arabic characters to approximately straight lines.
  • 21. These enhancement rules, which are derived from testing Arabic characters multiple times, reduce the time required for character recognition. The Proposed Solution, Dot Detection and Font Size Estimation: Dot detection is another challenging and important task. It helps the Prolog classification engine to recognize words that include dots, and it decides which characters are un-dotted by nature. Font size estimation is a critical task in omni-size character recognition systems. Font size estimation is usually used to find the pen width in online recognition systems. Approaches used to estimate the font size from dots are presented in [Shirali-Shahreza & Shirali-Shahreza (2006)].
  • 23. The Proposed Solution: [Figure: font size against calculated component’s height (in pixels)]
  • 24. The Proposed Solution, Definite Clause Grammar (DCG): Provides a mechanism for defining the grammar rules of a language. These rules are automatically translated to a Prolog program which defines a parser for the language being defined. Grammar rules are a feature only in some Prolog systems, and are designed to facilitate the parsing of natural language. Using this notation, a grammar is represented as a set of logical rules. When the DCG rules are consulted (or optimized), they are translated into Prolog clauses.
  • 25. The Proposed Solution, Word-level Classification and Recognition Phase: This is the most critical phase in our proposed ATR system. It is written in Prolog using Prolog matching, backtracking and DCG techniques. The input for this phase is data on two features. The first input stream is the corner sequence of the word-level outer contours for each component, which represents the elevation information of the input stream (the upper part that holds most of the features). The second input stream is the dot information found in the same component.
  • 26. The Proposed Solution: The Prolog matching and backtracking techniques use the corner sequence stream to classify the unknown inputs into character classes, while the Prolog DCG technique uses the dot information stream to recognize the actual Arabic letters of a particular character class.
  • 27. The Proposed Solution, DCG implementation: The DCG grammar structure and some of the character classes are described below:
      % DCG part for Arabic text recognition based on two input streams
      % usage: phrase(s(R),[m,h_c,d1,m,dc]).
      s([H|T]) --> cc(H), subs(T).  % every string is a character class followed by a sub-string
      s(R) --> cc(R).               % or a string can be simply a character class
      subs(R) --> s(R).             % a substring is nothing but a string (recursively)
      cc(R) --> ch(R).              % a character class can be a simple character
      % or character classes can belong to any of the following classes
      cc(R) --> bc(R).              % Ba class (ba, ta, tha, ya_md)
      cc(R) --> h_c(R).             % H_ class (h_, jeem, kha)
      cc(R) --> dc(R).              % Dal Class (dal, thal)
      cc(R) --> rc(R).              % Ra' Class (ra, zay)
      cc(R) --> sc(R).              % Seen Class (seen, sheen)
  • 28. The Proposed Solution, Microsoft Word Document Integration: This is the final phase of our optical text recognition system. It is written in Prolog and interfaces with the Microsoft Word program. It uses the Microsoft Word Document API to write the recognized characters into a new Word document. It writes the output text in the same recognized font size in a predefined font type. It also writes the white spaces and new lines to maintain the original text alignment and format.
  • 29. The Proposed Solution: [System flow diagram; labels: Document Image (TIFF File); Image Information Retrieval; Height/Width; Pixel Resolution; Image Digitalization; Font Baseline; Font Size; Word-level Segmentation; Lines per Page; Words per Line; Component Coordinates (X, Y); Word-level Contour Extraction; Freeman Chain Codes; Component Area, Height and Width; Contour Enhancement; Corner Detection; Dots Detection; Character-Shape Prolog Matching; Character Shape Stream; Character Reference Database; Dot Information Stream; DCG Engine; Word-level Recognition; MS Word Document Integration; Recognized Text (Word Document)]
  • 30. Sample Run: [Figure: original TIFF image with Arabic text] [Figure: the recognized Arabic text in MS Word 2007]
  • 31. Future Work: Support more Arabic font types; support more image types (GIF, BMP, JPEG, etc.); support different font sizes in the same page; support Arabic and English fonts together, numeric and special characters; support spellchecker and word suggestions; implement the system as an Arabic business card reader; capture-and-recognize feature for iPhone.
  • 32. Demo
  • 33. OCR Applications: Industries and institutions in which control of large amounts of paperwork is critical: banking, credit cards, insurance industries; the medical community, to capture, store and transmit radiology images; libraries and archives, for conservation and preservation of vulnerable documents and for the provision of access to source documents.
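
The pairing step described on slide 16 amounts to run-length encoding the Freeman chain. The sketch below is an illustration in Python, not the authors' C implementation; the function name `chain_to_pairs` is hypothetical. The sample input is the first 27 codes printed on slide 17, and the result matches the first 11 pairs listed there.

```python
from itertools import groupby

def chain_to_pairs(codes):
    """Collapse a Freeman chain code into (direction, run_length) pairs.

    Directions are the 8-connectivity codes 0..7; run_length is the number
    of consecutive steps taken in that direction.
    """
    pairs = []
    for direction, run in groupby(codes):
        if not 0 <= direction <= 7:
            raise ValueError(f"invalid Freeman direction: {direction}")
        pairs.append((direction, len(list(run))))
    return pairs

# First 27 codes of the chain shown on slide 17.
sample = [2] * 14 + [3, 2, 3, 2, 3, 6, 5, 6, 6, 5, 7, 7, 7]
pairs = chain_to_pairs(sample)
print(pairs)
```

The run lengths always sum to the length of the original chain, which gives a cheap sanity check on the encoding.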

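Slides 25 to 27 describe a two-stream design: the corner sequence narrows a component down to a character class, and the dot information then picks the actual letter inside that class. The sketch below illustrates only that second step, and it swaps the Prolog DCG for a plain Python lookup table. The class names (`bc`, `h_c`, `dc`, `rc`, `sc`) and letter names come from slide 27; the `(count, position)` dot tokens and the function `recognize` are assumptions made for illustration, since the deck shows only the token `d1`.

```python
# Dot token: (number of dots, "above" / "below" / None).
# This encoding is hypothetical; the deck does not define its dot tokens.
LETTERS = {
    ("bc", (1, "below")): "ba",
    ("bc", (2, "above")): "ta",
    ("bc", (3, "above")): "tha",
    ("bc", (2, "below")): "ya_md",
    ("h_c", (0, None)): "h_",
    ("h_c", (1, "below")): "jeem",   # simplification: the dot sits in or below the body
    ("h_c", (1, "above")): "kha",
    ("dc", (0, None)): "dal",
    ("dc", (1, "above")): "thal",
    ("rc", (0, None)): "ra",
    ("rc", (1, "above")): "zay",
    ("sc", (0, None)): "seen",
    ("sc", (3, "above")): "sheen",
}

def recognize(class_stream, dot_stream):
    """Pair each character class with its dot token and look up the letter."""
    if len(class_stream) != len(dot_stream):
        raise ValueError("streams must have the same length")
    letters = []
    for char_class, dots in zip(class_stream, dot_stream):
        key = (char_class, dots)
        if key not in LETTERS:
            raise ValueError(f"no letter for class {char_class!r} with dots {dots!r}")
        letters.append(LETTERS[key])
    return letters

word = recognize(["bc", "h_c", "dc"], [(2, "above"), (1, "above"), (0, None)])
print(word)
```

A table like this cannot backtrack over alternative class assignments the way the Prolog engine is described as doing; it only shows why the dot stream is enough to separate letters that share one outline.
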
Editor's Notes

  • #2: Omnifont is a term commonly used in conjunction with OCR software. Omnifont recognition refers to the capability of computer software, usually OCR software, to read (or recognize) virtually any font that maintains fairly standard character shapes.
  • #4: The field of Optical Character Recognition (OCR) is a branch of technology that deals with automatic reading of text. The ultimate goal of OCR is to simulate the human ability to read both machine-printed and hand-written texts. Currently, available systems can read faster than a human, but they cannot reliably read such a wide variety of texts. Humans are also better at reading highly distorted text or noisy (unclear) media. Therefore, a great deal of intensive research is still needed to narrow the gap between human and machine reading capabilities. OCR has been used in many practical areas that are independent of the language to which it is applied. One of the earliest and most successful applications was sorting checks in banks, where the volume of checks circulating daily proved too large to be handled by manual entry. Reading handwritten and printed postal codes, text archiving and retrieval, reading customers’ handwritten forms and helping visually impaired people to read are a few other examples. Most of the work on OCR has been on Latin and Chinese characters. Work on Arabic character recognition started only recently and has advanced relatively slowly due to the complexity of recognizing Arabic text, which has characters that are cursive in nature. Arabic character recognition is still an open and challenging field of research.
  • #5: This thesis proposes a complete system that classifies and recognizes machine-printed Arabic text. The input to the system is a clean, high-resolution Tagged Image File Format (.TIFF) image that contains the Arabic text to be recognized; the output is the recognized Arabic text saved in a Microsoft Word Document (.DOC) file. The technique is based on describing the text in terms of shape primitives derived from Freeman chain codes. A rule-based data enhancement technique is used to improve the recognized features as much as possible. The recognized features are processed by a Prolog feature-matching engine to classify character classes as well as diacritic information as three separate streams (character class stream, diacritic stream and corners information stream). In addition to these three streams, the estimated font size is provided as a fourth input. Characters are finally determined by processing a permutation of the three streams using Definite Clause Grammar (DCG).
  • #7: The extant literature includes a considerable number of studies focused on recognition of writing in languages such as Latin, Chinese and Hebrew. Unfortunately, limited research has been done on the recognition of Arabic writing. This might be attributed to the peculiar aspects of the writing of Arabic characters. For example, Arabic writing is from right to left. This may not be considered a major technical point, but it becomes an issue when an Arabic text contains some foreign text such as Latin or French, which is written from left to right, and vice versa. Arabic writing is cursive. Arabic characters are generally joined to each other. This requires a process of segmentation of the Arabic words and characters before any recognition step can be taken. In fact, segmentation and character isolation is the most difficult problem in the character recognition schemes of the Arabic language. An Arabic word might be comprised of both cursive and separated characters. Twenty-two of the 29 sets of Arabic characters assume different shapes and sizes depending on their positions within the words. Some characters such as “ع” take four shapes, which are all different from each other: “ع, ع, ع, ع”. The character “ج” takes three shapes: “ج, ج, ج”. There are only seven characters which, regardless of their positions within the words, have only one shape: “و, ذ, د, ظ, ط, ز, ر”. In general, characters, even if they are of the same group, are of different sizes and they require different rectangular boxes to enclose them.
  • #10: Having conducted a thorough literature review, we started designing our system by experimenting with prior researchers’ techniques, adopting or modifying some of them if they met our requirements, but otherwise developing our own techniques. Consequently, the components of our system are either due to the work of others, the result of our improvement of others’ work, or our own completely new techniques. During the development phase of the system, many of the investigated techniques were rejected due to their ineffectiveness. This ineffectiveness may be due to inherent deficiency or due to incomplete description specified by their text sources.
  • #11: Preprocessing Phase. This phase is implemented in the C language to perform some image processing with the help of open source image processing libraries such as LibTiff and OpenCV. In our proposed system, Arabic text images have been obtained by optical scanning of the character images on plain paper. The input data obtained by scanning printed text is almost always contaminated with noise and contains redundant information. Preprocessing includes digitalization, scaling, word-level segmentation, noise removal and elimination of redundant information as far as possible. Image information retrieval: In this process, we load and read the input Tagged Image File Format (TIFF) image file as binary; retrieve the image properties (size, width, height, pixel resolution, image channels and image alignment); and create memory storage for the system's intermediate processing. Image digitalization: The image digitalization process digitizes the TIFF image in order to apply fixed-level thresholding and to convert the gray-scale and bitmapped image to a binary (0s and 1s) image. It computes the vertical and horizontal histograms to retrieve the number of lines per page and the number of components (words) per line. In this phase, we calculate the font baseline and size. The font baseline can be calculated by finding the maximum of the horizontal histogram of each line per page. This enables the dots or other special characters such as Shadda, Madda, and Tanween to be classified as upper or lower components relative to this baseline. Text line detection: Text line detection is performed by scanning the input page image horizontally. The frequency of black pixels in each row is counted in order to construct the row histogram. The position between two consecutive lines, where the number of black pixels in a row is zero, denotes a boundary between the lines. Here it is assumed that the text block contains only a single column of text. Word segmentation: After a line has been detected, it is scanned vertically. To build the column histogram, the number of black pixels in each column is calculated. If there are n consecutive scans that find no black pixel, we take that gap to be a marker between two words. The value of n is chosen experimentally. Figure 5 and Figure 6 show the preprocessing phase.
  • #20: Our practical experimentation suggested applying some rules to the retrieved Freeman chain contours. We introduced an algorithm to remove noisy pixels which come within any straight line, and to convert Arabic characters to approximately straight lines. These enhancement rules, which are derived from testing Arabic characters multiple times, reduce the time required for character recognition. The result of the algorithm is illustrated in Figure 10.
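
The text line detection, baseline and word segmentation steps described in note #11 can be sketched with row and column histograms over a binary image. This is a minimal Python illustration under stated assumptions, not the authors' C/OpenCV code: the image is a list of rows with 1 for a black pixel, and the helper names (`runs_of_ink`, `detect_lines`, `baseline_row`, `segment_words`) and the tiny test page are hypothetical.

```python
def runs_of_ink(profile):
    """Return (start, end) ranges, end exclusive, where the histogram is non-zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def detect_lines(image):
    """Rows with zero black pixels separate consecutive text lines."""
    row_hist = [sum(row) for row in image]
    return runs_of_ink(row_hist)

def baseline_row(image, line):
    """Row with the maximum horizontal histogram inside one text line."""
    top, bottom = line
    return max(range(top, bottom), key=lambda y: sum(image[y]))

def segment_words(image, line, n):
    """Split a line into words; a gap of at least n empty columns ends a word."""
    top, bottom = line
    width = len(image[0])
    col_hist = [sum(image[y][x] for y in range(top, bottom)) for x in range(width)]
    words = []
    for start, end in runs_of_ink(col_hist):
        if words and start - words[-1][1] < n:
            words[-1] = (words[-1][0], end)  # gap shorter than n: same word
        else:
            words.append((start, end))
    return words

page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
]
lines = detect_lines(page)
words = segment_words(page, lines[0], n=3)
baseline = baseline_row(page, lines[0])
print(lines, words, baseline)
```

Column ranges are reported left to right here; for Arabic the reading order of the words would be the reverse. As in the note, the gap threshold `n` has to be tuned experimentally for a given font size and resolution.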