An Intuitive Explanation of Connectionist Temporal Classification
towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c
Harald Scheidl
Jun 10, 2018
If you want a computer to recognize text, neural networks (NN) are a good choice as they
outperform all other approaches at the moment. The NN for such use-cases usually consists
of convolutional layers (CNN) to extract a sequence of features and recurrent layers
(RNN) to propagate information through this sequence. It outputs character-scores for each
sequence-element, which is simply represented by a matrix. Now, there are two things we
want to do with this matrix:
1. train: calculate the loss value to train the NN
2. infer: decode the matrix to get the text contained in the input image
Both tasks are achieved by the CTC operation. An overview of a handwriting
recognition system is shown in Fig. 1.
Let’s have a closer look at the CTC operation and discuss how it works without hiding the
clever ideas it is based on behind complicated formulas. At the end, I will point you to
references where you can find Python code and the (not too complicated) formulas, if you
are interested.
Fig. 2: Annotation for each horizontal position of the image.
1. we only have to tell the CTC loss function the text that occurs in the image. Therefore,
we can ignore both the position and width of the characters in the image.
2. no further processing of the recognized text is needed.
position, which will be removed when decoding. However, we must insert a blank between
duplicate characters, as in “hello”. Further, we can repeat each character as often as we
like.
As you can see, this scheme also allows us to easily create different alignments of the same
text, e.g. “t-o” and “too” and “-to” all represent the same text (“to”), but with different
alignments to the image. The NN is trained to output the encoded text (encoded in the NN
output matrix).
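The collapse rule behind this encoding can be sketched in a few lines of Python. This is only a minimal illustration of the two-step rule (merge repeated characters, then drop blanks), with the blank written as “-” as in the text:

```python
def collapse(path, blank="-"):
    """Collapse a CTC alignment: merge repeated characters, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev:  # merge consecutive duplicates
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != blank)

# All three alignments map to the same text:
print(collapse("t-o"), collapse("too"), collapse("-to"))  # to to to
# A blank is needed to keep real duplicates:
print(collapse("hel-lo"))  # hello
print(collapse("hello"))   # helo (the double “l” merges without a blank)
```

The last line shows why the blank is necessary: without it, a genuine double character could never be encoded.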
Loss calculation
We need to calculate the loss value for given pairs of images and GT texts to train the NN.
You already know that the NN outputs a matrix containing a score for each character at
each time-step. A minimalistic matrix is shown in Fig. 3: there are two time-steps (t0, t1) and
three characters (“a”, “b” and the blank “-”). The character-scores sum to 1 for each time-
step.
Fig. 3: Output matrix of NN. The character-probability is color-coded and is also printed next
to each matrix entry. Thin lines are paths representing the text “a”, while the thick dashed
line is the only path representing the text “”.
Further, you already know that the loss is calculated by summing up the scores of all
possible alignments of the GT text; this way, it does not matter where the text appears in the
image.
The score for one alignment (or path, as it is often called in the literature) is calculated by
multiplying the corresponding character scores together. In the example shown above, the
score for the path “aa” is 0.4·0.4=0.16 while it is 0.4·0.6=0.24 for “a-” and 0.6·0.4=0.24 for “-a”.
To get the score for a given GT text, we sum over the scores of all paths corresponding to
this text. Let’s assume the GT text is “a” in the example: we have to calculate all possible
paths of length 2 (because the matrix has 2 time-steps), which are: “aa”, “a-” and “-a”. We
already calculated the scores for these paths, so we just have to sum over them and get
0.4·0.4+0.4·0.6+0.6·0.4=0.64. If the GT text is assumed to be “”, we see that there is only one
corresponding path, namely “--”, which yields the overall score of 0.6·0.6=0.36.
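This brute-force computation can be written out directly for the toy example. The matrix values below follow the worked numbers in the text (“a” and the blank score 0.4 and 0.6 at both time-steps); “b” is assumed to score 0 so that each row sums to 1:

```python
from itertools import product

chars = "ab-"                 # "-" is the blank
mat = [[0.4, 0.0, 0.6],       # scores at t0 for "a", "b", "-"
       [0.4, 0.0, 0.6]]       # scores at t1

def collapse(path, blank="-"):
    """Merge repeated characters, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

def gt_probability(text):
    """Sum the scores of all paths (one character per time-step) that collapse to `text`."""
    total = 0.0
    for path in product(chars, repeat=len(mat)):
        if collapse(path) == text:
            score = 1.0
            for t, ch in enumerate(path):
                score *= mat[t][chars.index(ch)]
            total += score
    return total

print(gt_probability("a"))  # 0.4·0.4 + 0.4·0.6 + 0.6·0.4 ≈ 0.64
print(gt_probability(""))   # 0.6·0.6 = 0.36
```

Note that enumerating all paths is exponential in the number of time-steps; real CTC implementations compute the same sum with a dynamic-programming (forward-backward) algorithm, as described in the referenced paper.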
If you looked closely, you saw that we calculated the probability of a GT text, but not
the loss. However, the loss is simply the negative logarithm of the probability. The loss value
is back-propagated through the NN, and the parameters of the NN are updated according to
the optimizer used, which I will not discuss any further here.
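Continuing the toy example, the loss for the GT text “a” is just the negative log of the 0.64 computed above:

```python
import math

p_gt = 0.64             # probability of the GT text "a" from the example
loss = -math.log(p_gt)
print(round(loss, 4))   # ≈ 0.4463; a perfect prediction (p = 1) would give loss 0
```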
Decoding
When we have a trained NN, we usually want to use it to recognize text in previously unseen
images. Or in more technical terms: we want to calculate the most likely text given the
output matrix of the NN. You already know a method to calculate the score of a given text.
But this time we are not given any text; in fact, it is exactly this text we are looking for.
Trying every possible text would work if there are only a few time-steps and characters, but
for practical use-cases, this is not feasible.
A simple and very fast algorithm is best path decoding, which consists of two steps:
1. it calculates the best path by taking the most likely character per time-step.
2. it undoes the encoding by first removing duplicate characters and then removing all
blanks from the path. What remains represents the recognized text.
An example is shown in Fig. 4. The characters are “a”, “b” and “-” (blank). There are 5 time-
steps. Let’s apply our best path decoder to this matrix: the most likely character of t0 is “a”,
the same applies for t1 and t2. The blank character has the highest score at t3. Finally, “b” is
most likely at t4. This gives us the path “aaa-b”. Removing duplicate characters yields
“a-b”, and removing the blanks from the remaining path then gives us the text “ab”, which
we output as the recognized text.
Fig. 4: Output matrix of NN. The thick dashed line represents the best path.
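The two decoding steps can be sketched as follows. The matrix below is made up to match the walk-through (only the argmax per time-step matters here: “a” wins at t0–t2, the blank at t3, “b” at t4); it is not the exact matrix from Fig. 4:

```python
def best_path_decode(mat, chars="ab-", blank="-"):
    """Greedy CTC decoding: take the top character per time-step, then collapse."""
    path = "".join(chars[row.index(max(row))] for row in mat)
    out, prev = [], None
    for ch in path:  # remove duplicates, then blanks
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

mat = [[0.8, 0.1, 0.1],   # t0: "a" is most likely
       [0.7, 0.1, 0.2],   # t1: "a"
       [0.6, 0.3, 0.1],   # t2: "a"
       [0.1, 0.2, 0.7],   # t3: "-"
       [0.2, 0.7, 0.1]]   # t4: "b"
print(best_path_decode(mat))  # aaa-b → a-b → ab
```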
Best path decoding is, of course, only an approximation. It is easy to construct examples for
which it gives the wrong result: if you decode the matrix from Fig. 3, you get “” as the
recognized text. But we already know that the probability of “” is only 0.36 while it is 0.64 for
“a”. However, the approximation algorithm often gives good results in practical situations.
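This failure case is easy to reproduce with the Fig. 3 toy matrix (same assumed values as before, with “b” scoring 0): greedy decoding picks the blank at both time-steps, while exhaustively scoring all texts finds “a”:

```python
from itertools import product

chars = "ab-"
mat = [[0.4, 0.0, 0.6],
       [0.4, 0.0, 0.6]]

def collapse(path, blank="-"):
    """Merge repeated characters, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Best path decoding: the blank wins at both time-steps -> path "--" -> text ""
greedy = collapse("".join(chars[row.index(max(row))] for row in mat))

# Exact decoding: sum path scores per collapsed text, take the argmax
scores = {}
for path in product(chars, repeat=len(mat)):
    p = 1.0
    for t, ch in enumerate(path):
        p *= mat[t][chars.index(ch)]
    text = collapse(path)
    scores[text] = scores.get(text, 0.0) + p
best = max(scores, key=scores.get)

print(repr(greedy), scores[greedy])  # '' 0.36
print(repr(best), scores[best])      # 'a' ≈ 0.64
```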
There are more advanced decoders such as beam-search decoding, prefix-search decoding
or token passing, which also use information about language structure to improve the
results.
This should give you a good understanding of what is happening behind the scenes when
you e.g. call functions like ctc_loss or ctc_greedy_decoder in TensorFlow. However, when you
want to implement CTC yourself, you need to know some more details, especially to make it
run fast. Graves et al. [1] introduced the CTC operation; the paper also shows all the relevant
math. If you are interested in how to improve decoding, take a look at the articles about
beam search decoding [2][3]. I implemented some decoders and the loss function in Python
and C++, which you can find on github [4][5]. Finally, if you want to look at the bigger picture
of how to recognize (handwritten) text, look at my article on how to build a handwritten text
recognition system [6].