Simon Tanner
Blog: simon-tanner.blogspot.co.uk
Twitter: @SimonTanner


www.slideshare.net/KDCS/

King’s Digital Consultancy Services
www.digitalconsultancy.net
Deciding whether Optical Character Recognition is feasible
(PDF document) created for the Oxford University Digital
Library
www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf

Measuring Mass Text Digitization Quality and Usefulness:
Lessons Learned from Assessing the OCR Accuracy of the
British Library's 19th Century Online Newspaper Archive
www.dlib.org/dlib/july09/munoz/07munoz.html



www.impact-project.eu
Digitisation Doctor Optical Character Recognition
Uniformity
Language
Text alignment
Complexity of alignment
Lines, graphics and pictures
Handwriting
Evaluating OCR accuracy is about more than just
character-by-character accuracy rates
    Character accuracy rates are misleading
    (more later…)
It is also about assessing the functionality enabled
through the OCR’s output
    Search accuracy
    Volume of hits returned
    Ability to structure searches and results
    Accuracy of result ranking
    Amount of correction required to
    achieve the required performance
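
As a rough illustration of the difference between character-level and word-level measures, the sketch below scores a line of OCR output against a ground truth. The metric definitions are assumptions chosen for illustration (character accuracy as one minus edit distance over ground-truth length, word accuracy as the share of ground-truth words matched in order), not the method used in the studies cited here. The ground-truth string is a hypothetical correction of one OCR sample line.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Assumed definition: 1 - edit distance / ground-truth length."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - levenshtein(truth, ocr) / len(truth))


def word_accuracy(truth: str, ocr: str) -> float:
    """Assumed definition: share of ground-truth words matched in order."""
    truth_words, ocr_words = truth.split(), ocr.split()
    if not truth_words:
        return 1.0 if not ocr_words else 0.0
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


# Hypothetical ground truth for one OCR line (long s misread as "l").
truth = "There must be more Love, and Charity"
ocr = "There mult be more Love, and Charity"

print(f"character accuracy: {character_accuracy(truth, ocr):.1%}")
print(f"word accuracy:      {word_accuracy(truth, ocr):.1%}")
```

A single wrong character costs one word out of seven here, so the word-level score falls much further than the character-level score.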
Consider this scenario:
  1,000 words with 5,000 characters
  (an average of 5 per word) excluding spaces

  90% character accuracy means:
    4,500 characters correct
    Possibly a maximum 900 words correct (90%)
    Possibly a minimum 500 words correct (50%)
    Reality is somewhere in between
    Depending on the number of “significant
    words” the search results could still
    be almost 100% or near zero
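
The arithmetic in this scenario can be checked with a short sketch. The function names are hypothetical. The best case packs every wrong character into as few words as possible; the worst case puts each wrong character in a different word. The third figure is an extra illustration that assumes character errors are independent and evenly spread, which real OCR errors usually are not.

```python
import math


def word_accuracy_bounds(total_words: int, chars_per_word: int,
                         char_accuracy: float) -> tuple[int, int]:
    """Return (best case, worst case) counts of fully correct words."""
    total_chars = total_words * chars_per_word
    wrong_chars = round(total_chars * (1 - char_accuracy))
    # Best case: errors are concentrated, filling whole words.
    fewest_wrong_words = math.ceil(wrong_chars / chars_per_word)
    # Worst case: every wrong character falls in a different word.
    most_wrong_words = min(total_words, wrong_chars)
    return total_words - fewest_wrong_words, total_words - most_wrong_words


def independent_error_estimate(total_words: int, chars_per_word: int,
                               char_accuracy: float) -> int:
    """Expected correct words if each character fails independently."""
    return round(total_words * char_accuracy ** chars_per_word)


best, worst = word_accuracy_bounds(1000, 5, 0.90)
print(best, worst)                                # 900 500
print(independent_error_estimate(1000, 5, 0.90))  # 590
```

Under the independence assumption about 59% of words survive intact, which sits between the two bounds as the slide suggests.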
[Chart: percentage scale (50 to 100) plotted by year, 1801 to 1900. Series: characters; words; words with capital letter start; significant words. Each series has a polynomial (Poly.) trend line.]
OCR Results

OCR Engine    % characters correct    % words correct    No. of corrections
FineReader    91.1                    70.9               110
PrimeOCR      93.95                   79.1               79

Total number of characters = 2109
Total number of words = 379




I am petfood, God toil! uttedy-toverthroW, at feaft; $gy abafe
Men's affections tp; and seal for all Party-making Notions
amdngft CfiriftiansybefGieirie will raife his,Church to that prof-
perous, flourilhing State prophefied of, and prOmifed in the
Scrip* tures. There mult be more Love, and Charity, and
Unanimity amongft Chriftians,.
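
In the 379-word results table above, the number of corrections is close to the number of incorrect words implied by the word accuracy figure. The sketch below checks that reading. It is an inference from the published numbers, not something the slide states, and the function name is hypothetical.

```python
def estimated_word_corrections(total_words: int,
                               word_accuracy_percent: float) -> int:
    """Incorrect words implied by a word accuracy percentage.

    Assumption: one correction per incorrect word.
    """
    return round(total_words * (1 - word_accuracy_percent / 100))


# Figures taken from the 379-word sample.
for engine, word_accuracy in [("FineReader", 70.9), ("PrimeOCR", 79.1)]:
    print(engine, estimated_word_corrections(379, word_accuracy))
```

For this sample the estimates come out at 110 and 79, matching the reported corrections. Other samples may not match exactly, depending on how corrections were counted.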
OCR Results

OCR Engine    % characters correct    % words correct    No. of corrections
FineReader    73.7                    57.5               31
PrimeOCR      75.9                    62.37              28

Total number of characters = 411
Total number of words = 73




A THEATRE
erein be reprc-fented as wel the miferies &
calamities tijat foiioto tht too*
e^jr alfo the greate toyts and
plefures tobtcf) tbe fatrfc faltooenio^
An Argument both profitable and
dele&able, to all that finccrcly
loue the word of Codt'.
*Deuifedby S. hhnv&n~ derlS^oodt.
s 3^ Scene and allowed according to the order
appointed.
, ^ Imprinted at London by Henry Bynncman*
Anno Domini.
CVM PHIT