RNN Approaches To Text Normalization - A Challenge
Abstract

This paper presents a challenge to the community: given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. We present a data set of general text where the normalizations were generated using an existing text normalization component of a text-to-speech system. This data set will be released open-source in the near future.

We also present our own experiments with this data set with a variety of different RNN architectures. While some of the architectures do in fact produce very good results when measured in terms of overall accuracy, the errors that are produced are problematic, since they would convey completely the wrong message if such a system were deployed in a speech application. On the other hand, we show that a simple FST-based filter can mitigate those errors, and achieve a level of accuracy not achievable by the RNN alone.

Though our conclusions are largely negative on this point, we are actually not arguing that the text normalization problem is intractable using a pure RNN approach, merely that it is not going to be something that can be solved simply by having huge amounts of annotated text data and feeding that to a general RNN model. And when we open-source our data, we will be providing a novel data set for sequence-to-sequence modeling in the hopes that the community can find better solutions.

1 Introduction

Within the last few years a major shift has taken place in speech and language technology: the field has been taken over by deep learning approaches. For example, at a recent NAACL conference well more than half the papers related in some way to word embeddings or deep or recurrent neural networks.

This change is surely justified by the impressive performance gains to be had by deep learning, something that has been demonstrated in a range of areas from image processing, handwriting recognition, acoustic modeling in automatic speech recognition (ASR), parametric speech synthesis for text-to-speech (TTS), machine translation, parsing, and Go playing, to name but a few.

While various approaches have been taken and some NN architectures have surely been carefully designed for the specific task, there is also a widespread feeling that with deep enough architectures, and enough data, one can simply feed the data to one's NN and have it learn the necessary function. For example:

    Not only do such networks require less human effort than traditional approaches, they generally deliver superior performance. This is particularly true when very large amounts of training data are available, as the benefits of holistic optimisation tend to outweigh those of prior knowledge. (Graves and Jaitly, 2014, page 1)

In this paper we present an example of an application that is unlikely to be amenable to such a “turn-the-crank” approach. The example is text normalization, specifically in the sense of a system that converts from a written representation of a text into a representation of how that text is to be read aloud. The target applications are TTS and ASR — in the latter case mostly for generating language modeling data from raw written text. This problem, while often considered mundane, is in fact very important, and a major source of degradation of perceived quality in TTS systems in particular can be traced to problems with text normalization.

We start by describing why this application area is a bit different from most other areas of NLP. We then discuss prior work in this area, including related work on applications of RNNs in text normalization more broadly. We then describe a dataset that will be made available open-source as a challenge to the community, and we go on to describe several experiments that we have conducted with this dataset, with various NN architectures.

As we show below, some of the RNNs produce very good results when measured in terms of overall accuracy, but they produce errors that would make them risky to use in a real application, since in the errorful cases, the normalization would convey completely the wrong message. As we also demonstrate, these errors can be ameliorated with a simple FST-based filter used in tandem with the RNN.

But with a pure RNN approach, we have not thus far succeeded in avoiding the above-mentioned risky errors, and it is an open question whether such errors can be avoided by such a solution. We present below a hypothesis on why the RNNs tend to make the kinds of errors below. We close the paper by proposing a challenge to the community based on the data that we plan to release.

2 Why text normalization is different

To lay the groundwork for discussion let us consider a simple example such as the following:

    A baby giraffe is 6ft tall and weighs 150lb.

If one were to ask a speaker of English to read this sentence, or if one were to feed it to an English TTS system, one would expect it to be read more or less as follows:

    A baby giraffe is six feet tall and weighs one hundred fifty pounds.

In the original written form there are two non-standard words (Sproat et al., 2001), namely the two measure expressions 6ft and 150lb. In order to read the text, each of these must be normalized into a sequence of ordinary words. In this case both examples are instances of the same semiotic class (Taylor, 2009), namely measure phrases. But in general texts may include non-standard word sequences from a variety of different semiotic classes, including measures, currency amounts, dates, times, telephone numbers, cardinal or ordinal numbers, and fractions, among many others. Each of these involves a specific function mapping between the written input form and the spoken output form.

If one were to train a deep-learning system for text normalization, one might consider presenting the system with a large number of input-output pairs as in Figure 1. Here we use a special token <self> to indicate that the input is to be left alone. In principle this seems like a reasonable approach, but there are a number of issues that need to be considered.

    a        <self>
    baby     <self>
    giraffe  <self>
    is       <self>
    6ft      six feet
    tall     <self>
    and      <self>
    weighs   <self>
    150lb    one hundred fifty pounds

Figure 1: Example input-output pairs for text normalization. In this example, the token <self> indicates that the token is to be left alone.

The first is that one desirable application of such an approach, if it can be made to work, is to develop text normalization systems for languages where we do not already have an existing system. If one could do this, one could circumvent the often quite considerable hand labor required to build systems in more traditional approaches to text normalization.[1]

[1] E.g. Ebden and Sproat, 2014.
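To make the notion of a semiotic-class-specific mapping function concrete, here is a toy sketch for a tiny fragment of the MEASURE class. The unit table, the number speller, and the function names are purely illustrative and are not part of any system described in this paper:

    import re

    # Purely illustrative unit table and number speller; a real system
    # (e.g. hand-built grammars) must cover far more cases than this.
    UNITS = {"ft": "feet", "lb": "pounds", "kg": "kilograms"}
    ONES = ["zero", "one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine"]
    TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
             "sixteen", "seventeen", "eighteen", "nineteen"]
    TENS = ["", "ten", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]

    def spell_number(n: int) -> str:
        """Spell out 0-999; just enough for the running examples."""
        if n < 10:
            return ONES[n]
        if n < 20:
            return TEENS[n - 10]
        if n < 100:
            tens, ones = divmod(n, 10)
            return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
        hundreds, rest = divmod(n, 100)
        words = ONES[hundreds] + " hundred"
        return words if rest == 0 else words + " " + spell_number(rest)

    def normalize_measure(token: str) -> str:
        """Map a written measure expression to a spoken form, else <self>."""
        m = re.fullmatch(r"(\d+)\s*(ft|lb|kg)", token)
        if not m:
            return "<self>"
        value, unit = int(m.group(1)), m.group(2)
        return f"{spell_number(value)} {UNITS[unit]}"

    assert normalize_measure("6ft") == "six feet"
    assert normalize_measure("150lb") == "one hundred fifty pounds"
    assert normalize_measure("giraffe") == "<self>"

Even this toy fragment ignores singular/plural agreement, decimals, ranges, and locale differences; handling all of that by hand is exactly the labor that a learned system would be intended to replace.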
But where would one get the necessary data? In the case of machine translation, the existence of large amounts of parallel texts in multiple languages is motivated by the fact that people want to read texts in their own languages, and therefore someone, somewhere, will often go to the trouble of writing a translation. For speech recognition acoustic model training, one could in theory use closed captioning (Bharadwaj and Medapati, 2015), which again is produced for a reason. In contrast, there is no natural economic reason to produce normalized versions of written texts: no English speaker needs to be told that 6ft is six feet or that 150lb is one hundred fifty pounds, and therefore there is no motivation for anyone to produce such a translation. The situation in text normalization is therefore more akin to the situation with parsing, where one must create treebanks for a language in order to make progress on that language; if one wants to train text normalization systems using NN approaches, one must create the training data to do so. In the case of the present paper, we were able to produce a training corpus since we already had working text normalization systems, which allowed us to produce a normalized form for the raw input. The normalization is obviously errorful (we give an estimate of the percentage of errors below), but it is good enough to serve as a test bed for deep learning approaches — and by making it public we hope to encourage more serious attention to this problem. But in any event, if one is planning to implement an RNN-based approach to text normalization, one must take into consideration the resources needed to produce the necessary training data.

A second issue is that the set of interesting cases in text normalization is usually very sparse. Most tokens, as we saw in the small example above, map to themselves, and while it is certainly important to get that right, one generally does not get any credit for doing so either. What is evaluated in text normalization systems is the interesting cases: the numbers, times, dates, measure expressions, currency amounts, and so forth, that require special treatment. Furthermore, the requirements on accuracy are rather stringent: if the text says 381 kg, then an English TTS system had better say three hundred eighty one kilograms, or maybe three hundred eighty one kilogram, but certainly not three hundred forty one kilograms. 920 might be read as nine hundred twenty, or perhaps as nine twenty, but certainly never nine hundred thirty. Oct 4 must be read as October fourth or maybe October four, but not November fourth. I mention these cases specifically, since the silly errors are errors that current neural models trained on these sorts of data will make, as we demonstrate below.

Again the situation for text normalization is different from that of, say, MT: in the case of MT, one usually must do something for any word or phrase in the source language. The closest equivalent of the <self> map in MT is probably translating a word or phrase with its most common equivalent. This will of course often be correct: most of the time it would be reasonable to translate the cat as le chat in French, and this translation would count positively towards one's BLEU score. This is to say that in MT one gets credit for the “easy” cases as well as the more interesting cases — e.g. this cat's in no hurry, where a more appropriate translation of cat might be type or gars. In text normalization one gets credit only for the relatively sparse interesting cases.

Indeed, as we shall see below, if one is allowed to count the vast majority of cases where the right answer is to leave the input token alone, some of our RNNs already perform very well. The problem is that they tend to mess up with various semiotic classes in ways that would make them unusable for any real application, since one could never be quite sure for a new example that the system would not read it completely wrongly. As we will see below, the neural models occasionally read things like £900 as nine hundred Euros — something that state-of-the-art hand-built text normalization systems would never do, brittle though such systems may be. The occasional comparable error in an MT system would be bad, but it would not contribute much to a degradation of the system's BLEU score. Such a misreading in a TTS system would be something that people would immediately notice (or, worse, not notice if they could not see the text), and would stand out precisely because a TTS system ought to get such examples right.

In this paper we try two kinds of neural models on a text normalization problem. The first is a neural equivalent of a source-channel model that uses a sequence-to-sequence LSTM that has been successfully applied to the somewhat similar problem of grapheme-to-phoneme conversion (Rao et al., 2015),
along with a standard LSTM language model architecture. The second treats the entire problem as a sequence-to-sequence task, using the same architecture that has been used for a speech-to-text conversion problem (Chan et al., 2016).

3 Prior work on text normalization

Text normalization has a long history in speech technology, dating back to the earliest work on full TTS synthesis (Allen et al., 1987). Sproat (1996) provided a unifying model for most text normalization problems in terms of weighted finite-state transducers (WFSTs). The first work to treat the problem of text normalization as essentially a language modeling problem was (Sproat et al., 2001). More recent machine learning work specifically addressed to TTS text normalization includes (Sproat, 2010; Roark and Sproat, 2014; Sproat and Hall, 2014).

In the last few years there has been a lot of work that focuses on social media (Xia et al., 2006; Choudhury et al., 2007; Kobus et al., 2008; Beaufort et al., 2010; Kaufmann, 2010; Liu et al., 2011; Pennell and Liu, 2011; Aw and Lee, 2012; Liu et al., 2012a; Liu et al., 2012b; Hassan and Menezes, 2013; Yang and Eisenstein, 2013). This work tends to focus on different problems from those of TTS: on the one hand, in social media one often has to deal with odd spellings of words such as cu l8r, cooooooooooooollll, or dat suxxx, which are less of an issue in most applications of TTS; on the other, expansion of digit sequences into words is critical for TTS text normalization, but of no interest to the normalization of social media texts.

Some previous work, also on social media normalization, that has made use of neural techniques includes (Chrupała, 2014; Min and Mott, 2015). The latter work, for example, achieved second place in the constrained track of the ACL 2015 W-NUT Normalization of Noisy Text shared task (Baldwin et al., 2015), achieving an F1 score of 81.75%. In the work we report below on TTS normalization, we achieve accuracies that are comparable or better than that result (to the extent that it makes sense to compare across such quite different tasks), but we would argue that for the intended application, such results are still not good enough.

4 Dataset

Our data consists of 1.1 billion words of English text and 290 million words of Russian text, from Wikipedia regions that could be decoded as UTF8, divided into sentences, and run through the Google TTS system's Kestrel text normalization system (Ebden and Sproat, 2014) to produce verbalizations. The format of the annotated data is as in Figure 1 above.

As described in (Ebden and Sproat, 2014), Kestrel's verbalizations are produced by first tokenizing the input and classifying the tokens, and then verbalizing each token according to its semiotic class. The majority of the rules are hand-built using the Thrax finite-state grammar development system (Roark et al., 2012). Statistical components of the system include morphosyntactic taggers for languages like Russian with complex morphology,[2] a statistical transliteration module (Jansche and Sproat, 2009), and a statistical model to determine if capitalized tokens should be read as words or letter sequences (Sproat and Hall, 2014). Most ordinary words are of course left alone (represented here as <self>), and punctuation symbols are mostly transduced to sil (for “silence”).

[2] The morphosyntactic tagger is an SVM model using hand-tuned features that classify the morphological bundle for each word independently, similar to SVMTool (Giménez and Màrquez, 2004) and MateTagger (Bohnet and Nivre, 2012).

The data were divided into 90 files (roughly 90%) for training, 5 files for online evaluation during training (the “development” set), and 5 for testing. In the test results reported below, we used the first 100K tokens of the final file (99) of the test data, including the end-of-sentence marker, working out to about 92K real tokens for English and 93K real tokens for Russian.

A manual analysis of about 1,000 examples from the test data suggests an overall error rate of approximately 0.1% for English and 2.1% for Russian. The largest category of errors for Russian involves years being read as cardinal numbers rather than the expected ordinal form.
Note that although the test data were of course taken from a different portion of the Wikipedia text than the training and development data, nonetheless a huge percentage of the individual tokens of the test data — 98.9% in the case of Russian and 99.5% in the case of English — were found in the training set. This in itself is perhaps not so surprising, but it does raise the concern that the RNN models may in fact be memorizing their results, without doing much generalization. We discuss this issue further below.

Finally, some justification of the choice of data is in order. We chose Wikipedia for two reasons. First, it is after all a reasonable application of TTS, and in fact it is used already in systems that give answers to voice queries on the Web. Second, the data are already publicly available, so there are no licensing issues.
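To make the annotated format concrete, a minimal reader for data laid out as in Figure 1 might look like the following. This is only a sketch: it assumes one tab-separated token/verbalization pair per line and a blank line between sentences, which may not exactly match the released files.

    def read_sentences(path):
        """Yield sentences as lists of (token, verbalization) pairs.

        Assumes one "token<TAB>verbalization" pair per line and a blank
        line between sentences; <self> marks tokens left unchanged and
        sil marks (mostly) punctuation mapped to silence.
        """
        sentence = []
        with open(path, encoding="utf8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                 # sentence boundary
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                token, verbalization = line.split("\t", 1)
                sentence.append((token, verbalization))
        if sentence:
            yield sentence

    def spoken_form(sentence):
        """Reconstruct the spoken string, expanding <self> and dropping sil."""
        words = []
        for token, verbalization in sentence:
            if verbalization == "<self>":
                words.append(token)
            elif verbalization != "sil":
                words.append(verbalization)
        return " ".join(words)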
5 Experiment 1: Text normalization using LSTMs

The first approach depends on the observation that text normalization can be broken down into two subproblems. For any token:

• What are the possible normalizations of that token, and
• which one is appropriate to the given context?

The first of these — the channel — can be handled in a context-independent way by enumerating the set of possible normalizations: thus 123 might be one hundred twenty three, one two three, or one twenty three. The second requires context: in 123 King Ave., the correct reading in American English would normally be one twenty three.

The first component is a string-to-string transduction problem. Furthermore, since WFSTs can be used to handle most or all of the needed transductions (Sproat, 1996), the relation between the input and output strings is regular, so that complex network architectures involving, say, stacks should not be needed. For the input, the string must be in terms of characters, since for a string like 123, one needs to see the individual digits in the sequence to know how to read it; similarly it helps to see the individual characters for a possibly OOV word such as snarky to classify it as a token to be left alone (<self>). On the other hand, since the second component is effectively a language-modeling problem, the appropriate level of representation there is words. Therefore we also want the output of the first component to be in terms of words.

5.1 LSTM architecture

We train two LSTM models, one for the channel and one for the language model. The data usage of each during training is outlined in Table 1. For the channel model, the LSTM learns to map from a sequence of characters to one or more word tokens of output. For most input tokens this will involve deciding to leave the token alone, that is, to map it to <self>, or in the case of punctuation to map it to sil, corresponding to silence. For other tokens it must decide how to verbalize them in a variety of different ways. For the language model, the system reads the words either from the input, if mapped to <self>, or else from the output, if mapped to anything else.

    John    <self>
    lives   <self>
    at      <self>
    123     one twenty three
    King    <self>
    Ave     avenue
    near    <self>
    A&P     a letter and p letter
    .       sil

Table 1: Training data format for the normalization channel and language model. The channel model is trained to map from the first to the second column, whereas the language model is trained on the underlined tokens. The notation x letter denotes a letter-by-letter reading, and sil denotes silence, which is predicted by the TTS text normalization system for most punctuation.

For the channel LSTM we used a bidirectional sequence-to-sequence model similar to that reported in (Rao et al., 2015), in two configurations: one with two forward and two backward hidden layers, henceforth the shallow model; and one with three forward and three backward hidden layers, henceforth the deep model. We kept the number of nodes in each hidden layer constant at 256.[3] The output layer is a connectionist temporal classification (CTC) (Graves et al., 2006) layer with a softmax error function.[4] Input was limited to 250 distinct characters (including the unknown token). For the output, 1,000 distinct

[3] A larger shallow model with 1024 nodes in each layer ended up severely overfitting the training data.
[4] Earlier experiments with non-CTC architectures did not produce results as good as what we obtained with the CTC layer.
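Schematically, the two models are combined at decoding time roughly as follows. This sketch is not the authors' implementation: channel_candidates and lm_logprob are hypothetical stand-ins for the trained channel LSTM and LSTM language model, and a real decoder would search over a beam rather than committing greedily at each token.

    import math

    def normalize_sentence(tokens, channel_candidates, lm_logprob, lm_weight=1.0):
        """Greedy left-to-right combination of channel and language model.

        channel_candidates(token) -> list of (verbalization, channel_logprob);
        lm_logprob(history_words, word) -> log P(word | history).
        """
        history, output = [], []
        for token in tokens:
            best_words, best_score = None, -math.inf
            for verbalization, chan_lp in channel_candidates(token):
                # <self> means "read the token as written".
                words = [token] if verbalization == "<self>" else verbalization.split()
                lm_lp = sum(lm_logprob(history + words[:i], w)
                            for i, w in enumerate(words))
                score = chan_lp + lm_weight * lm_lp
                if score > best_score:
                    best_words, best_score = words, score
            if best_words is None:          # no candidates: leave the token alone
                best_words = [token]
            history.extend(best_words)
            output.append(" ".join(best_words))
        return output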
Figure 3: LSTM for the language model.
Table 3: Accuracies for the first experiment, including overall accuracy, and accuracies on various semiotic class cate-
gories of interest. Key for non-obvious cases: ALL = all cases; PLAIN = ordinary word (<self>); PUNCT = punctu-
ation (sil); TRANS = transliteration; LETTERS = letter sequence; CARDINAL = cardinal number; VERBATIM =
verbatim reading of character sequence; ORDINAL = ordinal number; DECIMAL = decimal fraction; ELECTRONIC
= electronic address; DIGIT = digit sequence; MONEY = currency amount; FRACTION = non-decimal fraction; TIME
= time expression; ADDRESS = street address. N is sometimes slightly different for each training condition since in a
few cases the model produces no output, and we discount those cases — thus in effect giving the model the benefit of
the doubt.
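The per-class figures reported in Table 3 amount to a grouped accuracy over semiotic classes. A minimal sketch of that bookkeeping is given below; the triple format and the assumption that each test example carries its class label are illustrative conventions, not the released data format.

    from collections import defaultdict

    def per_class_accuracy(examples):
        """examples: iterable of (semiotic_class, reference, prediction) triples.

        Returns {class_name: accuracy} plus an "ALL" entry for overall accuracy.
        Examples for which the model produced no output should be filtered out
        beforehand, mirroring the accounting described in the Table 3 caption.
        """
        correct, total = defaultdict(int), defaultdict(int)
        for semiotic_class, reference, prediction in examples:
            for key in (semiotic_class, "ALL"):
                total[key] += 1
                correct[key] += int(prediction == reference)
        return {key: correct[key] / total[key] for key in total}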
Table 4: Errors from the English and Russian deep models. In light of recent events, the final English example is rather
amusing.
to read the written measure expression, not merely something from the same semantic category.

As we noted above, there was a substantial amount of overlap at the individual token level between the training and test data: could the LSTM simply have been memorizing? In the test data there were 475 unseen cases in English, of which the system got 82.9% correct (compared to 99.5% among the seen cases); for Russian there were 1,089 unseen cases, of which 83.8% were predicted correctly (compared to 98.5% among the seen cases). Some examples of the correct predictions are given in Table 5. As can be seen, these include some complicated cases, so it is fair to say that the system is not simply memorizing but does have some capability to generalize.

6 Experiment 2: Attention-based RNN sequence-to-sequence models

Our second approach involves modeling the problem entirely as a sequence-to-sequence problem. That is, rather than have a separate “channel” and language model phase, we model the whole task as one where we map a sequence of input characters to a sequence of output words. For this we use a TensorFlow (Abadi et al., 2015) model with an attention mechanism (Mnih et al., 2014). Attention models are particularly good for sequence-to-sequence problems since they are able to continuously update the decoder with information about the state of the encoder and thus attend better to the relation between the input and output sequences. The TensorFlow implementation used is essentially the same as that reported in (Chan et al., 2016).

In principle one might treat this problem in a way similar to how MT has been treated as a sequence-to-sequence problem (Cho et al., 2014), and simply pass the whole sentence to be normalized into a sequence of words. The main problem is that since we need to treat the input as a sequence of characters, as we argued above, the input layer would need to be rather large in order to cover sentences of reasonable length. We therefore took a different approach and placed each token in a window of 3 words to the left and 3 to the right, marking the to-be-normalized token with a distinctive begin and end tag <norm> ... </norm>. Thus, for example, the token 123 in the context I live at ... King Ave . would appear as

    I live at <norm> 123 </norm> King Ave .

on the input side, which would map to

    one twenty three

on the output side.

In this way we were able to limit the number of input and output nodes to something reasonable. The architecture follows closely that of (Chan et al., 2016). Specifically, we used a 4-layer bidirectional LSTM reader (but without the pyramidal structure used in Chan et al.'s task) that reads input characters, a layer of 256 attentional units, and a 2-layer decoder that produces word sequences. The reader is referred to (Chan et al., 2016) for more details of the framework.

It was noticed in early experiments with this configuration that the overabundance of <self> outputs was swamping the training and causing the system to predict <self> in too many cases. We therefore down-sampled the instances of <self> (and sil) in the training so that only roughly one in ten examples were given to the learner; among the various settings we tried, this seemed to give the best results both in terms of performance and reduced training time.

The training, development and testing data were the same as described in Section 4 above. The English RNN was trained for about five and a half days (460K steps) on 8 GPUs until the perplexity on the held-out data was 1.003; Russian was trained for five days (400K steps), reaching a perplexity of 1.002.

6.1 Results

As the results in Table 6 show, the performance is mostly better than the LSTM model described in Section 5. This suggests in turn that modeling the problem as a pure sequence-to-sequence transduction is indeed viable as an alternative to the source-channel approach we had taken previously.

Some errors are shown in Table 7. These errors are reminiscent of several of the errors of the LSTM system in Table 4, in that the wrong unit is picked. On the other hand it must be admitted that, in English, the only clear error of that type is the one example shown in Table 7. Again, as with the source-
    Input             Prediction
    13 October 1668   the thirteenth of october sixteen sixty eight
    13.1549 km²       thirteen point one five four nine square kilometers
    26 июля 1864      двадцать шестого июля тысяча восемьсот шестьдесят четвертого года
    26 July 1864      twenty sixth of July of the one thousand eight hundred sixty fourth year
    90 кв. м.         девяносто квадратных метров
    90 sq. m.         ninety square meters

Table 5: Correct output for test tokens that were never seen in the training data.
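The windowed input construction described in Section 6, together with the down-sampling of trivial cases, can be sketched as follows. The function names are illustrative, and the one-in-ten keep rate is the rough figure mentioned in the text:

    import random

    def make_window(tokens, i, size=3):
        """Wrap token i in <norm>...</norm> with `size` tokens of context per side."""
        left = tokens[max(0, i - size):i]
        right = tokens[i + 1:i + 1 + size]
        return " ".join(left + ["<norm>", tokens[i], "</norm>"] + right)

    def make_examples(sentence, keep_trivial=0.1, rng=random):
        """Yield (windowed_input, target) pairs for one annotated sentence.

        sentence: list of (token, verbalization) pairs; trivial cases
        (<self> and sil) are randomly down-sampled to roughly one in ten.
        """
        tokens = [token for token, _ in sentence]
        for i, (token, verbalization) in enumerate(sentence):
            if verbalization in ("<self>", "sil") and rng.random() > keep_trivial:
                continue
            yield make_window(tokens, i), verbalization

    pairs = [("I", "<self>"), ("live", "<self>"), ("at", "<self>"),
             ("123", "one twenty three"), ("King", "<self>"),
             ("Ave", "avenue"), (".", "sil")]
    # make_window([t for t, _ in pairs], 3) ->
    #   'I live at <norm> 123 </norm> King Ave .'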
Table 9: Accuracies for the attention-based sequence-to-sequence models for English on smaller training sets, with and without an FST filter. (Slight differences in overall counts for what is the same dataset used for the two conditions reflect the fact that a few examples are “dropped” by the way in which the decoder buffers data for the filterless condition.)

Table 10: Accuracies for the attention-based sequence-to-sequence models for English on smaller training sets, with and without an FST filter, on the MEASURE-MONEY rich dataset.

MEASURE and MONEY expressions. Specifically, we selected 1,000 sentences, each of which had one expression in that category. We then decoded with the models trained on the smaller training set, with and without the FST filter. Results are presented in Table 10. Once again, the FST filter improves the overall accuracy for MONEY and MEASURE, leaving the other categories unaffected. Some examples of the improvements in both categories are shown in Table 11. Looking more particularly at measures, where the largest differences are found, we find that the only cases where the FST filter does not help are cases where the grammar fails to match against the input and the RNN alone is used to predict the output. These cases are 1/2 cc, 30' (for thirty feet), 80', 7000 hg (which uses the unusual unit hectogram), 600 billion kWh (the measure grammar did not allow for a spelled number like billion), and the numberless “measures” per km, /m². In a couple of other cases, the FST does not constrain the RNN enough: 1 g still comes out as one grams, since the FST allows both it and the correct one gram, but this of course is an “acceptable” error since it is at least not misleading.

Finally, Table 12 shows results for the RNN with and without the FST filter on 1,000 MONEY and MEASURE expressions that have not previously been seen in the training data.[11] In this case there was no improvement for MONEY, but there was a substantial improvement for MEASURE. In most cases, the MONEY examples that failed to be improved with the FST filter were cases where the filter simply did not match the input, and thus was not used.[12]

The results of a similar experiment on Russian, using the smaller training set on a MEASURE-MONEY rich corpus where the MEASURE and MONEY tokens were previously unseen, are shown in Table 13. On the face of it, it would seem that the FST filter is actually making things worse, until one looks at the differences. Of the 50 cases where the filter made things “worse”, 34 (70%) are cases where there was an error in the data and a perfectly well formed measure was rendered with

[11] To remind the reader, all test data are of course held out from the training and development data, but it is common for the same literal expression to recur.
[12] Only in three cases involving Indian Rupees, such as Rs.149, did the filter match but still produce the wrong answer (in this case six). In that case the RNN probably simply failed to produce any paths including the right answer. In such cases the only solution is probably to override the RNN completely on a case-by-case basis.
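The effect of the filter on a single token can be sketched as follows. This is a plain-Python stand-in for the actual FST machinery: candidate_verbalizations stands in for a grammar of licensed readings, and the fallback behaviour mirrors the description above (when the grammar fails to match the input, the unconstrained RNN output is used).

    def apply_filter(token, rnn_nbest, candidate_verbalizations):
        """Constrain RNN output for one token with a grammar of allowed readings.

        rnn_nbest: RNN hypotheses for this token, best first.
        candidate_verbalizations(token) -> set of readings licensed by the
        grammar, or an empty set if the grammar does not match the input.
        """
        allowed = candidate_verbalizations(token)
        if not allowed:                       # grammar failed to match: RNN alone
            return rnn_nbest[0]
        for hypothesis in rnn_nbest:
            if hypothesis in allowed:
                return hypothesis
        return rnn_nbest[0]                   # no licensed hypothesis available

Note that a grammar that matches but licenses several readings (as with one gram versus one grams) can still let an “acceptable” error through, and a grammar that matches while the RNN offers no licensed hypothesis cannot repair the output, consistent with the cases discussed above.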
    Input            RNN                                  RNN+FST
    £5               five                                 five pounds
    11 billion AED   eleven billion danish                eleven billion dirhams
    2 mA             2 megaamperes                        2 milliamperes
    33 rpm           thirty two revolutions per minute    thirty three revolutions per minute

Table 11: Some misleading readings of the RNN that have been corrected by the FST.