An Intuitive Explanation of Connectionist Temporal Classification
towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c
Harald Scheidl
Jun 10, 2018
If you want a computer to recognize text, neural networks (NN) are a good choice as they
outperform all other approaches at the moment. The NN for such use-cases usually consists
of convolutional layers (CNN) to extract a sequence of features and recurrent layers
(RNN) to propagate information through this sequence. It outputs character-scores for each
sequence-element, which is simply represented by a matrix. Now, there are two things we
want to do with this matrix:
1. train: calculate the loss value to train the NN
2. infer: decode the matrix to get the text contained in the input image
Both tasks are achieved by the CTC operation. An overview of a handwriting
recognition system is shown in Fig. 1.
Let’s have a closer look at the CTC operation and discuss how it works without hiding the
clever ideas it is based on behind complicated formulas. At the end, I will point you to
references where you can find Python code and the (not too complicated) formulas, if you
are interested.
Fig. 2: Annotation for each horizontal position of the image.
1. we only have to tell the CTC loss function the text that occurs in the image. Therefore,
we can ignore both the position and width of the characters in the image.
2. no further processing of the recognized text is needed.
position, which will be removed when decoding. However, we must insert a blank between
duplicate characters, as in “hello”. Further, we can repeat each character as often as we
like.
As you can see, this scheme also allows us to easily create different alignments of the same
text, e.g. “t-o” and “too” and “-to” all represent the same text (“to”), but with different
alignments to the image. The NN is trained to output the encoded text (encoded in the NN
output matrix).
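The collapse rule behind this encoding can be sketched in a few lines of Python. This is only a minimal illustration of the two-step rule (merge repeated characters, then drop blanks), with the blank written as “-” as in the text:

```python
def collapse(path, blank="-"):
    """Collapse a CTC alignment: merge repeated characters, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev:  # merge consecutive duplicates
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)

# All three alignments map to the same text:
print(collapse("t-o"), collapse("too"), collapse("-to"))  # to to to
# A blank is needed to keep real duplicates:
print(collapse("hel-lo"))  # hello
print(collapse("hello"))   # helo (the double “l” merges without a blank)
```

The last line shows why the blank is necessary: without it, a genuine double character could never be encoded.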
Loss calculation
We need to calculate the loss value for given pairs of images and GT texts to train the NN.
You already know that the NN outputs a matrix containing a score for each character at
each time-step. A minimalistic matrix is shown in Fig. 3: there are two time-steps (t0, t1) and
three characters (“a”, “b” and the blank “-”). The character-scores sum to 1 for each time-
step.
Fig. 3: Output matrix of NN. The character-probability is color-coded and is also printed next
to each matrix entry. Thin lines are paths representing the text “a”, while the thick dashed
line is the only path representing the text “”.
Further, you already know that the loss is calculated by summing up the scores of all
possible alignments of the GT text; this way, it does not matter where the text appears in the
image.
The score for one alignment (or path, as it is often called in the literature) is calculated by
multiplying the corresponding character scores together. In the example shown above, the
score for the path “aa” is 0.4·0.4=0.16 while it is 0.4·0.6=0.24 for “a-” and 0.6·0.4=0.24 for “-a”.
To get the score for a given GT text, we sum over the scores of all paths corresponding to
this text. Let’s assume the GT text is “a” in the example: we have to calculate all possible
paths of length 2 (because the matrix has 2 time-steps), which are: “aa”, “a-” and “-a”. We
already calculated the scores for these paths, so we just have to sum over them and get
0.4·0.4+0.4·0.6+0.6·0.4=0.64. If the GT text is assumed to be “”, we see that there is only one
corresponding path, namely “--”, which yields the overall score of 0.6·0.6=0.36.
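This brute-force computation can be written out directly for the toy example. The matrix values below follow the worked numbers in the text (“a” and the blank score 0.4 and 0.6 at both time-steps); “b” is assumed to score 0 so that each row sums to 1:

```python
from itertools import product

chars = "ab-"                 # "-" is the blank
mat = [[0.4, 0.0, 0.6],       # scores at t0 for "a", "b", "-"
       [0.4, 0.0, 0.6]]       # scores at t1

def collapse(path, blank="-"):
    """Merge repeated characters, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

def gt_probability(text):
    """Sum the scores of all paths (one character per time-step) that collapse to `text`."""
    total = 0.0
    for path in product(chars, repeat=len(mat)):
        if collapse(path) == text:
            score = 1.0
            for t, ch in enumerate(path):
                score *= mat[t][chars.index(ch)]
            total += score
    return total

print(gt_probability("a"))  # 0.4·0.4 + 0.4·0.6 + 0.6·0.4 ≈ 0.64
print(gt_probability(""))   # 0.6·0.6 = 0.36
```

Note that enumerating all paths is exponential in the number of time-steps; real CTC implementations compute the same sum with a dynamic-programming (forward-backward) algorithm, as described in the referenced paper.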
If you looked closely, you saw that we calculated the probability of a GT text, but not
the loss. However, the loss is simply the negative logarithm of the probability. The loss value
is back-propagated through the NN, and the parameters of the NN are updated according to
the optimizer used, which I will not discuss any further here.
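Continuing the toy example, the loss for the GT text “a” is just the negative log of the 0.64 computed above:

```python
import math

p_gt = 0.64             # probability of the GT text "a" from the example
loss = -math.log(p_gt)
print(round(loss, 4))   # ≈ 0.4463; a perfect prediction (p = 1) would give loss 0
```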
Decoding
When we have a trained NN, we usually want to use it to recognize text in previously unseen
images. Or in more technical terms: we want to calculate the most likely text given the
output matrix of the NN. You already know a method to calculate the score of a given text.
But this time we are not given any text; in fact, it is exactly this text we are looking for.
Trying every possible text would work if there are only a few time-steps and characters, but
for practical use-cases, this is not feasible.
A simple and very fast algorithm is best path decoding, which consists of two steps:
1. it calculates the best path by taking the most likely character per time-step.
2. it undoes the encoding by first removing duplicate characters and then removing all
blanks from the path. What remains represents the recognized text.
An example is shown in Fig. 4. The characters are “a”, “b” and “-” (blank). There are 5 time-
steps. Let’s apply our best path decoder to this matrix: the most likely character of t0 is “a”,
the same applies for t1 and t2. The blank character has the highest score at t3. Finally, “b” is
most likely at t4. This gives us the path “aaa-b”. Removing duplicate characters yields
“a-b”, and removing the blanks from the remaining path then gives us the text “ab”, which
we output as the recognized text.
Fig. 4: Output matrix of NN. The thick dashed line represents the best path.
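The two decoding steps can be sketched as follows. The matrix below is made up to match the walk-through (only the argmax per time-step matters here: “a” wins at t0–t2, the blank at t3, “b” at t4); it is not the exact matrix from Fig. 4:

```python
def best_path_decode(mat, chars="ab-", blank="-"):
    """Greedy CTC decoding: take the top character per time-step, then collapse."""
    path = "".join(chars[row.index(max(row))] for row in mat)
    out, prev = [], None
    for ch in path:  # remove duplicates, then blanks
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

mat = [[0.8, 0.1, 0.1],   # t0: "a" is most likely
       [0.7, 0.1, 0.2],   # t1: "a"
       [0.6, 0.3, 0.1],   # t2: "a"
       [0.1, 0.2, 0.7],   # t3: "-"
       [0.2, 0.7, 0.1]]   # t4: "b"
print(best_path_decode(mat))  # aaa-b → a-b → ab
```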
Best path decoding is, of course, only an approximation. It is easy to construct examples for
which it gives the wrong result: if you decode the matrix from Fig. 3, you get “” as the
recognized text. But we already know that the probability of “” is only 0.36 while it is 0.64 for
“a”. However, the approximation algorithm often gives good results in practical situations.
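This failure case is easy to reproduce with the Fig. 3 toy matrix (same assumed values as before, with “b” scoring 0): greedy decoding picks the blank at both time-steps, while exhaustively scoring all texts finds “a”:

```python
from itertools import product

chars = "ab-"
mat = [[0.4, 0.0, 0.6],
       [0.4, 0.0, 0.6]]

def collapse(path, blank="-"):
    """Merge repeated characters, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Best path decoding: the blank wins at both time-steps -> path "--" -> text ""
greedy = collapse("".join(chars[row.index(max(row))] for row in mat))

# Exact decoding: sum path scores per collapsed text, take the argmax
scores = {}
for path in product(chars, repeat=len(mat)):
    p = 1.0
    for t, ch in enumerate(path):
        p *= mat[t][chars.index(ch)]
    text = collapse(path)
    scores[text] = scores.get(text, 0.0) + p
best = max(scores, key=scores.get)

print(repr(greedy), scores[greedy])  # '' 0.36
print(repr(best), scores[best])      # 'a' ≈ 0.64
```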
There are more advanced decoders such as beam-search decoding, prefix-search decoding
or token passing, which also use information about language structure to improve the
results.
This should give you a good understanding of what is happening behind the scenes when
you e.g. call functions like ctc_loss or ctc_greedy_decoder in TensorFlow. However, when you
want to implement CTC yourself, you need to know some more details, especially to make it
run fast. Graves et al. [1] introduced the CTC operation; the paper also shows all the relevant
math. If you are interested in how to improve decoding, take a look at the articles about
beam search decoding [2][3]. I implemented some decoders and the loss function in Python
and C++, which you can find on github [4][5]. Finally, if you want to look at the bigger picture
of how to recognize (handwritten) text, look at my article on how to build a handwritten text
recognition system [6].