
RNN Approaches to Text Normalization: A Challenge

Richard Sproat, Navdeep Jaitly


Google, Inc.
{rws,ndjaitly}@google.com

Abstract

This paper presents a challenge to the community: given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. We present a data set of general text where the normalizations were generated using an existing text normalization component of a text-to-speech system. This data set will be released open-source in the near future.

We also present our own experiments with this data set with a variety of different RNN architectures. While some of the architectures do in fact produce very good results when measured in terms of overall accuracy, the errors that are produced are problematic, since they would convey completely the wrong message if such a system were deployed in a speech application. On the other hand, we show that a simple FST-based filter can mitigate those errors, and achieve a level of accuracy not achievable by the RNN alone.

Though our conclusions are largely negative on this point, we are actually not arguing that the text normalization problem is intractable using a pure RNN approach, merely that it is not going to be something that can be solved merely by having huge amounts of annotated text data and feeding that to a general RNN model. And when we open-source our data, we will be providing a novel data set for sequence-to-sequence modeling in the hopes that the community can find better solutions.

1 Introduction

Within the last few years a major shift has taken place in speech and language technology: the field has been taken over by deep learning approaches. For example, at a recent NAACL conference well more than half the papers related in some way to word embeddings or deep or recurrent neural networks.

This change is surely justified by the impressive performance gains to be had by deep learning, something that has been demonstrated in a range of areas from image processing, handwriting recognition, acoustic modeling in automatic speech recognition (ASR), parametric speech synthesis for text-to-speech (TTS), machine translation, parsing, and go playing, to name but a few.

While various approaches have been taken and some NN architectures have surely been carefully designed for the specific task, there is also a widespread feeling that with deep enough architectures, and enough data, one can simply feed the data to one's NN and have it learn the necessary function. For example:

    Not only do such networks require less human effort than traditional approaches, they generally deliver superior performance. This is particularly true when very large amounts of training data are available, as the benefits of holistic optimisation tend to outweigh those of prior knowledge. (Graves and Jaitly, 2014, page 1)

In this paper we present an example of an application that is unlikely to be amenable to such a "turn-the-crank" approach. The example is text normalization, specifically in the sense of a system that converts from a written representation of a text into a representation of how that text is to be read aloud. The target applications are TTS and ASR — in the latter case mostly for generating language modeling data from raw written text. This problem, while often considered mundane, is in fact very important, and a major source of degradation of perceived quality in TTS systems in particular can be traced to problems with text normalization.

We start by describing why this application area is a bit different from most other areas of NLP. We then discuss prior work in this area, including related work on applications of RNNs in text normalization more broadly. We then describe a dataset that will be made available open-source as a challenge to the community, and we go on to describe several experiments that we have conducted with this dataset, with various NN architectures.

As we show below, some of the RNNs produce very good results when measured in terms of overall accuracy, but they produce errors that would make them risky to use in a real application, since in the errorful cases, the normalization would convey completely the wrong message. As we also demonstrate, these errors can be ameliorated with a simple FST-based filter used in tandem with the RNN.

But with a pure RNN approach, we have not thus far succeeded in avoiding the above-mentioned risky errors, and it is an open question whether such errors can be avoided by such a solution. We present below a hypothesis on why the RNNs tend to make the kinds of errors below. We close the paper by proposing a challenge to the community based on the data that we plan to release.

2 Why text normalization is different

To lay the groundwork for discussion let us consider a simple example such as the following:

    A baby giraffe is 6ft tall and weighs 150lb.

If one were to ask a speaker of English to read this sentence, or if one were to feed it to an English TTS system, one would expect it to be read more or less as follows:

    A baby giraffe is six feet tall and weighs one hundred fifty pounds.

In the original written form there are two non-standard words (Sproat et al., 2001), namely the two measure expressions 6ft and 150lb. In order to read the text, each of these must be normalized into a sequence of ordinary words. In this case both examples are instances of the same semiotic class (Taylor, 2009), namely measure phrases. But in general texts may include non-standard word sequences from a variety of different semiotic classes, including measures, currency amounts, dates, times, telephone numbers, cardinal or ordinal numbers, fractions, among many others. Each of these involves a specific function mapping between the written input form and the spoken output form.

If one were to train a deep-learning system for text normalization, one might consider presenting the system with a large number of input-output pairs as in Figure 1. Here we use a special token <self> to indicate that the input is to be left alone. In principle this seems like a reasonable approach, but there are a number of issues that need to be considered.

    a        <self>
    baby     <self>
    giraffe  <self>
    is       <self>
    6ft      six feet
    tall     <self>
    and      <self>
    weighs   <self>
    150lb    one hundred fifty pounds

Figure 1: Example input-output pairs for text normalization. In this example, the token <self> indicates that the token is to be left alone.

The first is that one desirable application of such an approach, if it can be made to work, is to develop text normalization systems for languages where we do not already have an existing system. If one could do this, one could circumvent the often quite considerable hand labor required to build systems in more traditional approaches to text normalization.¹

¹ E.g. Ebden and Sproat, 2014.
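To make the Figure 1 format concrete, the following minimal Python sketch shows one way such token-aligned pairs might be read. The tab-separated layout and the helper name read_pairs are our own illustrative assumptions, not a specification of the released data.

    # Minimal sketch of consuming Figure-1-style token/verbalization pairs.
    # The tab-separated layout and helper names are assumptions for illustration,
    # not a specification of the released data format.
    def read_pairs(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue  # blank line = sentence boundary
                token, verbalization = line.split("\t", 1)
                # "<self>" means: read the token as written; "sil" means silence.
                yield token, (token if verbalization == "<self>" else verbalization)

    # e.g. ("6ft", "six feet"), ("150lb", "one hundred fifty pounds"), ("tall", "tall"), ...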
But where would one get the necessary data? In the case of machine translation, the existence of large amounts of parallel texts in multiple languages is motivated by the fact that people want to read texts in their own languages, and therefore someone, somewhere, will often go to the trouble of writing a translation. For speech recognition acoustic model training, one could in theory use closed captioning (Bharadwaj and Medapati, 2015), which again is produced for a reason. In contrast, there is no natural economic reason to produce normalized versions of written texts: no English speaker needs to be told that 6ft is six feet or that 150lb is one hundred fifty pounds, and therefore there is no motivation for anyone to produce such a translation. The situation in text normalization is therefore more akin to the situation with parsing, where one must create treebanks for a language in order to make progress on that language; if one wants to train text normalization systems using NN approaches, one must create the training data to do so. In the case of the present paper, we were able to produce a training corpus since we already had working text normalization systems, which allowed us to produce a normalized form for the raw input. The normalization is obviously errorful (we give an estimate of the percentage of errors below), but it is good enough to serve as a test bed for deep learning approaches — and by making it public we hope to encourage more serious attention to this problem. But in any event, if one is planning to implement an RNN-based approach to text normalization, one must take into consideration the resources needed to produce the necessary training data.

A second issue is that the set of interesting cases in text normalization is usually very sparse. Most tokens, as we saw in the small example above, map to themselves, and while it is certainly important to get that right, one generally does not get any credit for doing so either. What is evaluated in text normalization systems is the interesting cases: the numbers, times, dates, measure expressions, currency amounts, and so forth, that require special treatment. Furthermore, the requirements on accuracy are rather stringent: if the text says 381 kg, then an English TTS system had better say three hundred eighty one kilograms, or maybe three hundred eighty one kilogram, but certainly not three hundred forty one kilograms. 920 might be read as nine hundred twenty, or perhaps as nine twenty, but certainly never nine hundred thirty. Oct 4 must be read as October fourth or maybe October four, but not November fourth. I mention these cases specifically, since the silly errors are errors that current neural models trained on these sorts of data will make, as we demonstrate below.

Again the situation for text normalization is different from that of, say, MT: in the case of MT, one usually must do something for any word or phrase in the source language. The closest equivalent of the <self> map in MT is probably translating a word or phrase with its most common equivalent. This will of course often be correct: most of the time it would be reasonable to translate the cat as le chat in French, and this translation would count positively towards one's BLEU score. This is to say that in MT one gets credit for the "easy" cases as well as the more interesting cases — e.g. this cat's in no hurry, where a more appropriate translation of cat might be type or gars. In text normalization one gets credit only for the relatively sparse interesting cases.

Indeed, as we shall see below, if one is allowed to count the vast majority of cases where the right answer is to leave the input token alone, some of our RNNs already perform very well. The problem is that they tend to mess up with various semiotic classes in ways that would make them unusable for any real application, since one could never be quite sure for a new example that the system would not read it completely wrongly. As we will see below, the neural models occasionally read things like £900 as nine hundred Euros — something that state-of-the-art hand-built text normalization systems would never do, brittle though such systems may be. The occasional comparable error in an MT system would be bad, but it would not contribute much to a degradation of the system's BLEU score. Such a misreading in a TTS system would be something that people would immediately notice (or, worse, not notice if they could not see the text), and would stand out precisely because a TTS system ought to get such examples right.

In this paper we try two kinds of neural models on a text normalization problem. The first is a neural equivalent of a source-channel model that uses a sequence-to-sequence LSTM that has been successfully applied to the somewhat similar problem of grapheme-to-phoneme conversion (Rao et al., 2015),
along with a standard LSTM language model architecture. The second treats the entire problem as a sequence-to-sequence task, using the same architecture that has been used for a speech-to-text conversion problem (Chan et al., 2016).

3 Prior work on text normalization

Text normalization has a long history in speech technology, dating back to the earliest work on full TTS synthesis (Allen et al., 1987). Sproat (1996) provided a unifying model for most text normalization problems in terms of weighted finite-state transducers (WFSTs). The first work to treat the problem of text normalization as essentially a language modeling problem was (Sproat et al., 2001). More recent machine learning work specifically addressed to TTS text normalization includes (Sproat, 2010; Roark and Sproat, 2014; Sproat and Hall, 2014).

In the last few years there has been a lot of work that focuses on social media (Xia et al., 2006; Choudhury et al., 2007; Kobus et al., 2008; Beaufort et al., 2010; Kaufmann, 2010; Liu et al., 2011; Pennell and Liu, 2011; Aw and Lee, 2012; Liu et al., 2012a; Liu et al., 2012b; Hassan and Menezes, 2013; Yang and Eisenstein, 2013). This work tends to focus on different problems from those of TTS: on the one hand, in social media one often has to deal with odd spellings of words such as cu l8r, cooooooooooooollll, or dat suxxx, which are less of an issue in most applications of TTS; on the other, expansion of digit sequences into words is critical for TTS text normalization, but of no interest to the normalization of social media texts.

Some previous work, also on social media normalization, that has made use of neural techniques includes (Chrupała, 2014; Min and Mott, 2015). The latter work, for example, achieved second place in the constrained track of the ACL 2015 W-NUT Normalization of Noisy Text task (Baldwin et al., 2015), achieving an F1 score of 81.75%. In the work we report below on TTS normalization, we achieve accuracies that are comparable or better than that result (to the extent that it makes sense to compare across such quite different tasks), but we would argue that for the intended application, such results are still not good enough.

4 Dataset

Our data consists of 1.1 billion words of English text, and 290 million words of Russian text, from Wikipedia regions that could be decoded as UTF8, divided into sentences, and run through the Google TTS system's Kestrel text normalization system (Ebden and Sproat, 2014) to produce verbalizations. The format of the annotated data is as in Figure 1 above.

As described in (Ebden and Sproat, 2014), Kestrel's verbalizations are produced by first tokenizing the input and classifying the tokens, and then verbalizing each token according to its semiotic class. The majority of the rules are hand-built using the Thrax finite-state grammar development system (Roark et al., 2012). Statistical components of the system include morphosyntactic taggers for languages like Russian with complex morphology,² a statistical transliteration module (Jansche and Sproat, 2009), and a statistical model to determine if capitalized tokens should be read as words or letter sequences (Sproat and Hall, 2014). Most ordinary words are of course left alone (represented here as <self>), and punctuation symbols are mostly transduced to sil (for "silence").

² The morphosyntactic tagger is an SVM model using hand-tuned features that classify the morphological bundle for each word independently, similar to SVMTool (Giménez and Màrquez, 2004) and MateTagger (Bohnet and Nivre, 2012).

The data were divided into 90 files (roughly 90%) for training, 5 files for online evaluation during training (the "development" set), and 5 for testing. In the test results reported below, we used the first 100K tokens of the final file (99) of the test data, including the end-of-sentence marker, working out to about 92K real tokens for English and 93K real tokens for Russian.

A manual analysis of about 1,000 examples from the test data suggests an overall error rate of approximately 0.1% for English and 2.1% for Russian. The largest category of errors for Russian involves years being read as cardinal numbers rather than the expected ordinal form.

Note that although the test data were of course taken from a different portion of the Wikipedia text than the training and development data, nonetheless a huge percentage of the individual tokens of the test data — 98.9% in the case of Russian and 99.5% in the case of English — were found in the training set.
This in itself is perhaps not so surprising but it does raise the concern that the RNN models may in fact be memorizing their results, without doing much generalization. We discuss this issue further below.

Finally some justification of the choice of data is in order. We chose Wikipedia for two reasons. First, it is after all a reasonable application of TTS, and in fact it is used already in systems that give answers to voice queries on the Web. Second, the data are already publicly available, so there are no licensing issues.

5 Experiment 1: Text normalization using LSTMs

The first approach depends on the observation that text normalization can be broken down into two subproblems. For any token:

• What are the possible normalizations of that token, and
• which one is appropriate to the given context?

The first of these — the channel — can be handled in a context-independent way by enumerating the set of possible normalizations: thus 123 might be one hundred twenty three, one two three, or one twenty three. The second requires context: in 123 King Ave., the correct reading in American English would normally be one twenty three.

The first component is a string-to-string transduction problem. Furthermore, since WFSTs can be used to handle most or all of the needed transductions (Sproat, 1996), the relation between the input and output strings is regular, so that complex network architectures involving, say, stacks should not be needed. For the input, the string must be in terms of characters, since for a string like 123, one needs to see the individual digits in the sequence to know how to read it; similarly it helps to see the individual characters for a possibly OOV word such as snarky to classify it as a token to be left alone (<self>). On the other hand, since the second component is effectively a language-modeling problem, the appropriate level of representation there is words. Therefore we also want the output of the first component to be in terms of words.

    John   <self>
    lives  <self>
    at     <self>
    123    one twenty three
    King   <self>
    Ave    avenue
    near   <self>
    A&P    a_letter and p_letter
    .      sil

Table 1: Training data format for the normalization channel, and language model. The channel model is trained to map from the first to the second column, whereas the language model is trained on the underlined tokens (the spoken-side words: the input token itself for <self> rows, and the verbalization otherwise). The notation x_letter denotes a letter-by-letter reading, and sil denotes silence, which is predicted by the TTS text normalization system for most punctuation.

5.1 LSTM architecture

We train two LSTM models, one for the channel and one for the language model. The data usage of each during training is outlined in Table 1. For the channel model, the LSTM learns to map from a sequence of characters to one or more word tokens of output. For most input tokens this will involve deciding to leave it alone, that is to map it to <self>, or in the case of punctuation to map it to sil, corresponding to silence. For other tokens it must decide to verbalize it in a variety of different ways. For the language model, the system reads the words either from the input, if mapped to <self>, or else from the output if mapped from anything else.

For the channel LSTM we used a bidirectional sequence-to-sequence model similar to that reported in (Rao et al., 2015) in two configurations: one with two forward and two backward hidden layers, henceforth the shallow model; and one with three forward and three backward hidden layers, henceforth the deep model. We kept the number of nodes in each hidden layer constant at 256.³ The output layer is a connectionist temporal classification (CTC) (Graves et al., 2006) layer with a softmax error function.⁴ Input was limited to 250 distinct characters (including the unknown token). For the output, 1,000 distinct words (including <self> and the unknown token) were allowed for English, and 2,000 for Russian, the larger number for Russian being required to allow for various inflected forms. The number of words may seem small but it is sufficient to cover the distinct words that are output by the verbalizer for the various semiotic classes where the token does not simply map to <self>. See Figure 2.

³ A larger shallow model with 1024 nodes in each layer ended up severely overfitting the training data.

⁴ Earlier experiments with non-CTC architectures did not produce results as good as what we obtained with the CTC layer.
Figure 2: LSTM architecture for the shallow and deep channel models for Russian. Purple LSTM layers perform forwards transitions and blue LSTM layers perform backwards transitions. The output is produced by a CTC layer with a softmax activation function. Input tokens are characters and output tokens are words. Numbers indicate the number of nodes in each layer.

Figure 3: LSTM for the language model.

                     # Steps   Perp.         LER
    Ru LM            20.4B     27.1 (50.0)   —
    En LM            17.4B     42.7 (64.8)   —
    Ru shal. chan.   8.7B      —             2.03%
    Ru deep chan.    6.5B      —             1.97%
    En shal. chan.   8.3B      —             1.75%
    En deep chan.    1.9B      —             1.76%

Table 2: Number of training steps, and perplexity or label error rate on held out data for the LSTMs. Note that the label error rate for the channel is calculated on the output word sequence produced by the model, including the <self> tag. For comparison the perplexities for a 5-gram WFST language model with Katz backoff trained using the toolkit reported in (Allauzen et al., 2016) on the same data and evaluated on the same held-out data are given in parentheses. Note that the language models were trained on 100 nodes for a total of 5 days and the channel models a total of 10 days.

The language model LSTM follows a standard RNN language model architecture, following the work of Mikolov et al. (2010), with an input and output layer consisting of |V| nodes (we limited |V| to 100,000), a dimensionality-reduction projection layer, a hidden LSTM layer with a feedback loop, and a hierarchical softmax output layer. During training the LSTM learns to predict the next word given the current word, but the feedback loop allows the model to build up a history of arbitrary length. See Figure 3.

The channel and language model LSTMs are trained separately. Table 2 shows the number of training steps and the final (dev) perplexity/label error rate (LER) for the LM and channel LSTMs.

5.2 Decoding

At decoding time we need to combine the outputs of the channel and language model. This is done as follows. First, for each position in the output of the channel model, we prune the predicted output symbols. If one hypothesis has a very high probability (default 0.98), we eliminate all other predictions at that position: in practice this happens in most cases, since the channel model is typically very sure of itself at most of the input positions. We also prune all hypotheses with a low probability (default 0.05). Finally we keep only the n best hypotheses at each output position: in our experiments we kept n = 5.

We then use the resulting pruned vectors to populate the corresponding positions in the input to the LM, with the channel probability for each token. This channel probability is multiplied by the LM probability times an LM weighting factor (1 in our experiments). This method of combining the channel and LM probabilities can be thought of as a poor-man's equivalent of the composition of a channel and LM weighted finite-state transducer (Mohri et al., 2002): the main difference is that there is no straightforward way to represent an arbitrary lattice in an LSTM, so that the representation is more akin to a confusion network ("sausage") at each output position.
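The pruning and score combination described above can be summarized in a few lines; the sketch below is our own paraphrase of the procedure (thresholds 0.98 and 0.05, n = 5, LM weight 1.0), not the authors' implementation.

    import math

    # Hypothetical data structure: for each output position the channel LSTM
    # gives a list of (word, probability) candidates.
    def prune_position(candidates, dominance=0.98, floor=0.05, n_best=5):
        """Prune channel hypotheses at one output position."""
        # If one hypothesis is overwhelmingly likely, keep only it.
        top = max(candidates, key=lambda wp: wp[1])
        if top[1] >= dominance:
            return [top]
        # Otherwise drop very unlikely hypotheses, then keep the n best.
        kept = [wp for wp in candidates if wp[1] >= floor]
        kept.sort(key=lambda wp: wp[1], reverse=True)
        return kept[:n_best]

    def combine_scores(channel_prob, lm_prob, lm_weight=1.0):
        """Combine channel and LM probabilities (log domain for numerical stability)."""
        return math.log(channel_prob) + lm_weight * math.log(lm_prob)

    # Example: a "sausage" with one uncertain position.
    sausage = [
        [("two", 0.99)],                  # pruned to the single dominant hypothesis
        [("hundred", 0.99)],
        prune_position([("kilograms", 0.55), ("seconds", 0.30), ("euros", 0.01)]),
    ]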
5.3 Results

The first point to observe is that the overall performance is quite good: the accuracy is about 99% for English and 98% for Russian. But nearly all of this can be attributed to the model predicting <self> for most input tokens, and sil for punctuation tokens. To be sure, these are decisions that a text normalization system must make: any such system must decide to leave an input token alone, or map it to silence. Still, as we noted in the introduction, one does not usually get credit for getting these decisions right, and when one looks at the interesting cases, the performance starts to break down, with the lowest performance predictably being found for cases such as TIME that are not very common in these data. The deep models also generally tend to be better than the shallow models, though this is not true in all cases — e.g. MONEY in English. This in itself is a useful sanity check, since we would expect improvement with deeper models.

The models are certainly able to get some quite complicated cases right. Thus for example for the Russian input 1 октября 1812 года ("1 October, 1812"), the deep model correctly predicts первого октября тысяча восемьсот двенадцатого года ("first of october of the one thousand eight hundred and twelfth year"); or in English 2008-09-30 as the thirtieth of september two thousand eight.

But quite often the prediction is off, though in ways that are themselves indicative of a deeper problem. Thus consider the examples in Table 4, all of which are taken from the "deep" models for English and Russian. In both languages we find examples where the system gets numbers wrong, evidently because it is hard for the system to learn from the training data exactly how to map from digit sequences to number names — see also (Gorman and Sproat, 2016).⁵ Other errors include reading the wrong unit, as in the last three English examples, the reading of hour rather than gigabyte in the last Russian example, or the tendency of the Russian system to output sil in a lot of examples of measure phrases.

⁵ Note that in Russian most of the number errors involve the system reading the correct number, but in the wrong case form. These sorts of errors, while not desirable, are not nearly as bad as the system reading the wrong number: if all that is wrong is the inflection, a native speaker could still recover the intended meaning.

These errors are entirely due to the channel model: there is nothing ill-formed about the sequences produced, they just happen to be wrong given the input. One way to see this is to compute the "oracle" accuracy, the proportion of the time that the correct answer is in the pseudo-lattice produced by the channel. For the English deep model the oracle accuracy is 0.998. Since the overall accuracy of the model is 0.993, this means that 27 or about 29% of the errors can be attributed to the channel model not giving the LM a choice of selecting the right answer. What about the other 71% of the cases? While it is possible that some of these could be because the LM chooses an impossible sequence, in most cases what the channel model offers are perfectly possible verbalizations of something, just not necessarily correct for the given input. If the channel model offers both twenty seconds and twenty kilograms, how is the language model to determine which is correct, since even broader context may not be enough to determine which one is more likely?

The wrong-unit readings are particularly interesting in that one thing that is known to be true of RNNs is that they are very good at learning to cluster words into semantic groups based on contextual information. Clearly the system has learned the kinds of expressions that can occur after numbers, and in some cases it substitutes one such expression for another. This property is the whole basis of word embeddings (Bengio et al., 2003) and other similar techniques. Interestingly, a recent paper (Arthur et al., 2016) discusses entirely analogous errors in neural machine translation, and attributes them to the same cause.⁶ It is quite a useful property for many applications, but in the case of text normalization it is a drawback: unless one is trying to model a patient with semantic paraphasia, one generally wants a TTS system to read the written measure expression, not merely something from the same semantic category.

⁶ Arthur et al.'s proposed solution is in some ways similar to the FST-based mechanism we propose below in Section 7.
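As a quick check of the oracle-accuracy argument above, the 29% figure follows directly from the two reported accuracies:

    # Back-of-the-envelope check of the oracle-accuracy argument above,
    # using the English deep model's reported figures.
    overall_accuracy = 0.993  # fraction of test tokens normalized correctly
    oracle_accuracy = 0.998   # fraction of tokens whose correct reading is in the channel lattice

    channel_error_rate = 1 - oracle_accuracy       # errors the LM could never fix
    total_error_rate = 1 - overall_accuracy        # all errors
    print(channel_error_rate / total_error_rate)   # ~0.29: roughly 29% of errors are channel misses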
en shallow en deep ru shallow ru deep
class N cor N cor N cor N cor
ALL 92448 0.990 92447 0.993 93194 0.982 93195 0.984
PLAIN 67954 0.997 67954 0.996 60702 0.998 60702 0.997
PUNCT 17729 1.000 17729 1.000 20264 1.000 20264 1.000
DATE 2808 0.858 2808 0.965 1495 0.886 1495 0.892
TRANS – – – – 4106 0.905 4106 0.919
LETTERS 1348 0.908 1348 0.939 1838 0.964 1838 0.971
CARDINAL 1069 0.936 1069 0.944 2388 0.771 2388 0.804
VERBATIM 1017 0.966 1017 0.969 1344 0.993 1344 0.993
MEASURE 142 0.908 142 0.944 411 0.623 411 0.645
ORDINAL 103 0.913 103 0.903 427 0.703 427 0.705
DECIMAL 89 0.989 89 0.978 60 0.617 60 0.583
ELECTRONIC 49 0.327 49 0.245 5 0.400 5 0.400
DIGIT 38 0.263 37 0.243 16 0.812 16 0.938
MONEY 37 0.946 37 0.892 18 0.500 19 0.474
FRACTION 16 0.562 16 0.562 23 0.565 23 0.478
TIME 8 0.625 8 0.750 8 0.625 8 0.875
ADDRESS 4 0.750 4 0.750 – – – –

Table 3: Accuracies for the first experiment, including overall accuracy, and accuracies on various semiotic class categories of interest. Key for non-obvious cases: ALL = all cases; PLAIN = ordinary word (<self>); PUNCT = punctuation (sil); TRANS = transliteration; LETTERS = letter sequence; CARDINAL = cardinal number; VERBATIM = verbatim reading of character sequence; ORDINAL = ordinal number; DECIMAL = decimal fraction; ELECTRONIC = electronic address; DIGIT = digit sequence; MONEY = currency amount; FRACTION = non-decimal fraction; TIME = time expression; ADDRESS = street address. N is sometimes slightly different for each training condition since in a few cases the model produces no output, and we discount those cases — thus in effect giving the model the benefit of the doubt.

Input Correct Prediction


60 sixty six
82.55 mm eighty two point five five millimeters eighty two one five five meters
2 mA two milliamperes two units
£900 million nine hundred million pounds nine hundred million euros
16 см шестнадцати сантиметров sil сантиметров
16 cm sixteen centimeters sil centimeters
-11 минус одиннадцать минус один
minus eleven minus one
100 000 сто тысяч один тысяч ноль ноль ноль ноль
one hundred thousand one thousand zero zero zero zero
16 ГБ шестнадцати гигабайтов шестнадцать часов
sixteen gigabytes sixteen hours

Table 4: Errors from the English and Russian deep models. In light of recent events, the final English example is rather
amusing.
As we noted above, there was a substantial amount of overlap at the individual token level between the training and test data: could the LSTM simply have been memorizing? In the test data there were 475 unseen cases in English, of which the system got 82.9% correct (compared to 99.5% among the seen cases); for Russian there were 1,089 unseen cases, of which 83.8% were predicted correctly (compared to 98.5% among the seen cases). Some examples of the correct predictions are given in Table 5. As can be seen, these include some complicated cases, so it is fair to say that the system is not simply memorizing but does have some capability to generalize.

6 Experiment 2: Attention-based RNN sequence-to-sequence models

Our second approach involves modeling the problem entirely as a sequence-to-sequence problem. That is, rather than have a separate "channel" and language model phase, we model the whole task as one where we map a sequence of input characters to a sequence of output words. For this we use a TensorFlow (Abadi et al., 2015) model with an attention mechanism (Mnih et al., 2014). Attention models are particularly good for sequence-to-sequence problems since they are able to continuously update the decoder with information about the state of the encoder and thus attend better to the relation between the input and output sequences. The TensorFlow implementation used is essentially the same as that reported in (Chan et al., 2016).

In principle one might treat this problem in a way similar to how MT has been treated as a sequence-to-sequence problem (Cho et al., 2014), and simply pass the whole sentence to be normalized into a sequence of words. The main problem is that, since we need to treat the input as a sequence of characters as we argued above, the input layer would need to be rather large in order to cover sentences of reasonable length. We therefore took a different approach and placed each token in a window of 3 words to the left and 3 to the right, marking the to-be-normalized token with a distinctive begin and end tag <norm> ... </norm>. Thus for example the token 123 in the context I live at ... King Ave . would appear as

    I live at <norm> 123 </norm> King Ave .

on the input side, which would map to

    one twenty three

on the output side.

In this way we were able to limit the number of input and output nodes to something reasonable. The architecture follows closely that of (Chan et al., 2016). Specifically, we used a 4-layer bidirectional LSTM reader (but without the pyramidal structure used in Chan et al.'s task) that reads input characters, a layer of 256 attentional units, and a 2-layer decoder that produces word sequences. The reader is referred to (Chan et al., 2016) for more details of the framework.

It was noticed in early experiments with this configuration that the overabundance of <self> outputs was swamping the training and causing the system to predict <self> in too many cases. We therefore down-sampled the instances of <self> (and sil) in the training so that only roughly one in ten examples were given to the learner; among the various settings we tried, this seemed to give the best results both in terms of performance and reduced training time.

The training, development and testing data were the same as described in Section 4 above. The English RNN was trained for about five and a half days (460K steps) on 8 GPUs until the perplexity on the held-out data was 1.003; Russian was trained for five days (400K steps), reaching a perplexity of 1.002.

6.1 Results

As the results in Table 6 show, the performance is mostly better than the LSTM model described in Section 5. This suggests in turn that modeling the problem as a pure sequence-to-sequence transduction is indeed viable as an alternative to the source-channel approach we had taken previously.

Some errors are shown in Table 7. These errors are reminiscent of several of the errors of the LSTM system in Table 4, in that the wrong unit is picked. On the other hand it must be admitted that, in English, the only clear error of that type is the one example shown in Table 7.
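Returning to the input representation described at the start of this section, the following sketch, our own illustration rather than the paper's code, shows how windowed <norm>...</norm> examples with down-sampling of <self> and sil might be generated.

    import random

    # Minimal sketch of building windowed training examples for the attention model:
    # 3 context words on each side, the target token wrapped in <norm>...</norm>,
    # and <self>/sil cases down-sampled to roughly one in ten.
    def windowed_examples(tokens, verbalizations, keep_trivial=0.1, window=3, rng=random):
        for i, (tok, verb) in enumerate(zip(tokens, verbalizations)):
            if verb in ("<self>", "sil") and rng.random() > keep_trivial:
                continue  # drop most of the trivial examples
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            source = " ".join(left + ["<norm>", tok, "</norm>"] + right)
            yield source, verb

    tokens = ["I", "live", "at", "123", "King", "Ave", "."]
    verbs  = ["<self>", "<self>", "<self>", "one twenty three", "<self>", "avenue", "sil"]
    for src, tgt in windowed_examples(tokens, verbs):
        print(src, "->", tgt)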
Input Prediction
13 October 1668 the thirteenth of october sixteen sixty eight
13.1549 km² thirteen point one five four nine square kilometers
26 июля 1864 двадцать шестого июля тысяча восемьсот шестьдесят четвертого года
26 July 1864 twenty sixth of July of the one thousand eight hundred sixty fourth year
90 кв. м. девяносто квадратных метров
90 sq. m. ninety square meters

Table 5: Correct output for test tokens that were never seen in the training data.

                 English          Russian
    ALL          92416  0.997     93184  0.993
    PLAIN        68029  0.998     60747  0.999
    PUNCT        17726  1.000     20263  1.000
    DATE         2808   0.999     1495   0.976
    TRANS        –      –         4103   0.921
    LETTERS      1404   0.971     1839   0.991
    CARDINAL     1067   0.989     2387   0.940
    VERBATIM     894    0.980     1298   1.000
    MEASURE      142    0.986     409    0.883
    ORDINAL      103    0.971     427    0.956
    DECIMAL      89     1.000     60     0.867
    ELECTRONIC   21     1.000     2      1.000
    DIGIT        37     0.865     16     1.000
    MONEY        36     0.972     19     0.842
    FRACTION     13     0.923     23     0.826
    TIME         8      0.750     8      0.750
    ADDRESS      3      1.000     –      –

Table 6: Accuracies for the attention-based sequence-to-sequence models. Classes are as in Table 3.

Again, as with the source-channel approach, there is evidence that while the system may be doing a lot of memorization, it is not merely memorizing. For English, 90.6% of the cases not found in the training data were correctly produced (compared to 99.8% of the seen cases); for Russian 86.7% of the unseen cases were correct (versus 99.4% of the seen cases). Complicated previously unseen cases in Russian include, for example, 9 июня 1966 г., correctly read as девятое июня тысяча девятьсот шестьдесят шестого года ('ninth of June, of the one thousand nine hundred sixty sixth year'); or 17.04.1750, correctly read as семнадцатое апреля тысяча семьсот пятидесятого года ('seventeenth of April of the one thousand seven hundred fiftieth year').

6.2 Results on "reasonable"-sized data sets

The results reported in the previous sections depended on impractically large amounts of training data. To develop a system for a new language one needs a system that could be trained on data sets of a size that one could expect a team of native speakers to hand label. Assuming one is willing to invest a few weeks of work with a small team, it is not out of the question that one could label about 10 million words of Wikipedia-style text.⁷

⁷ The author hand-labeled 1,000 tokens of English in about 7 minutes.

With this point in mind, we retrained the systems on 11.4 million tokens of English from the beginning of the original training set, and 11.9 million tokens of Russian. The system was trained for about 7 days for both languages, until it had achieved a perplexity on held-out data of 1.002 for English and 1.007 for Russian.

Results are presented in Table 8. The overall performance is not greatly different from the system trained on the larger dataset, and in some places is actually better.⁸ The test data overlapped with the training data in 96.9% of the tokens for English and 95.5% for Russian, with the accuracy of the non-overlapped tokens being 95.0% for English and 93.5% for Russian.

⁸ One might also anticipate that one could get better results for some underperforming categories by greedy selection of training text that included more examples of those categories, something that was not done in this experiment.

The errors made by the system are comparable to errors we have already seen, though in English the errors in this case seem to be more concentrated in the reading of numeric dates. Thus, to give just a few examples for English: reading 2008-07-28 as "the eighteenth of september seven thousand two", or 2009-10-02 as the ninth of october twenty thousand two.
Input Correct Prediction
2 mA two milliamperes two million liters
11/10/2008 the tenth of november two thousand eight the tenth of october two thousand eight
1/2 cc half a c c one minute c c
18:00:00Z eighteen hours zero minutes and zero seconds z eighteen hundred cubic minutes
55th fifty fifth five fifth
750 вольт семисот пятидесяти вольт семьсот пятьдесят гектаров
750 volts seven hundred fifty volts seven hundred fifty hectares
70 градусами. семьюдесятью градусами семьюдесятью граммов
70 degrees seventy degrees seventy grams
16 ГБ шестнадцати гигабайтов шестнадцати герц
16 GB sixteen gigabytes sixteen hertz

Table 7: Errors from the attention sequence-to-sequence models.

Some relatively complicated examples not seen in the training data that the English system got right included 221.049 km² as two hundred twenty one point o four nine square kilometers, 24 March 1951 as the twenty fourth of march nineteen fifty one, and $42,100 as forty two thousand one hundred dollars.

Clearly then, the attention-based models are able to achieve with reasonable-sized data performances that are close to what they achieve with the large training data set. That said, the system of course continues to produce "silly" errors, which means that it will not be sufficient on its own as the text normalization component of a TTS system.

                 English          Russian
    ALL          92416  0.996     93184  0.994
    PLAIN        68029  0.997     60747  0.999
    PUNCT        17726  1.000     20263  1.000
    DATE         2808   0.974     1495   0.977
    TRANS        –      –         4103   0.942
    LETTERS      1404   0.974     1839   0.991
    CARDINAL     1067   0.991     2387   0.954
    VERBATIM     894    0.977     1298   1.000
    MEASURE      142    0.958     409    0.927
    ORDINAL      103    0.971     427    0.981
    DECIMAL      89     1.000     60     0.917
    ELECTRONIC   21     0.952     2      1.000
    DIGIT        37     0.703     16     1.000
    MONEY        36     0.972     19     0.895
    FRACTION     13     0.846     23     0.739
    TIME         8      0.625     8      0.750
    ADDRESS      3      1.000     –      –

Table 8: Accuracies for the attention-based sequence-to-sequence models on smaller training sets. Classes are as in Table 3.

7 Finite-state filters

As we saw in the previous section, an approach that uses attention-based sequence-to-sequence models can produce extremely high accuracies, but is still prone to occasionally producing output that is completely misleading given the input. What if we apply some additional knowledge to filter the output so that it removes silly analyses?

One way to do this is to construct finite-state filters, and use them to guide the decoding. For example one can construct an FST that maps from expressions of the form <number> <measure_abbreviation> to a cardinal or decimal number and the possible verbalizations of the measure abbreviation. Thus 24.2kg might verbalize as twenty four point two kilogram or twenty four point two kilograms. The FST thus implements an overgenerating grammar that includes the correct verbalization, but allows other verbalizations as well.
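As a toy illustration of such an overgenerating verbalizer (our own sketch, far simpler than the Thrax grammar described next, and covering only a handful of units), consider:

    import re

    # Toy overgenerating verbalizer for <number><measure_abbreviation> tokens such as "24.2kg".
    # This is our own sketch, not the paper's Thrax grammar; real number-name grammars are
    # induced from data (Gorman and Sproat, 2016) and cover far more units and formats.
    UNITS = {"kg": ["kilogram", "kilograms"], "cm": ["centimeter", "centimeters"],
             "mA": ["milliampere", "milliamperes"], "lb": ["pound", "pounds"]}
    DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

    def cardinal(n):
        # Grossly simplified: integers up to 99 only.
        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
        teens = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
                 "sixteen", "seventeen", "eighteen", "nineteen"]
        if n < 10: return DIGITS[n]
        if n < 20: return teens[n - 10]
        return (tens[n // 10] + (" " + DIGITS[n % 10] if n % 10 else "")).strip()

    def number_words(num):
        intpart, _, frac = num.partition(".")
        words = cardinal(int(intpart))
        if frac:
            words += " point " + " ".join(DIGITS[int(d)] for d in frac)
        return words

    def candidate_verbalizations(token):
        m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)", token)
        if not m or m.group(2) not in UNITS:
            return []
        num = number_words(m.group(1))
        return [f"{num} {unit}" for unit in UNITS[m.group(2)]]

    print(candidate_verbalizations("24.2kg"))
    # ['twenty four point two kilogram', 'twenty four point two kilograms']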
We constructed a Thrax grammar (Roark et al., 2012) to cover MEASURE and MONEY expressions, two classes where the RNN is prone to produce silly readings. The grammar consists of about 150 lines, half of which consist of purely mechanical, language-independent rules to flip the order of currency expressions so that £5 is transformed to its reading order 5£. The Thrax grammar also incorporates English-specific lists of about 450 money and measure expressions so that, e.g., we know that kg can be kilogram or kilograms, as well as a number FST that is learned from a few hundred number names using the algorithm described in (Gorman and Sproat, 2016). Note that it is minimal effort to produce the lexical lists and the number name training data for a new language, certainly much less effort than producing a complete hand-built normalization grammar.⁹

⁹ The Thrax grammar and associated data will be released along with the main datasets.

During decoding, the FST is composed with the input token being considered by the RNN. If the composition fails — e.g. because this token is not one of the classes that the FST handles — then the decoding will proceed as it normally would via the RNN alone. If the composition succeeds, then the FST is projected to the output, and the resulting output lattice is used to restrict the possible outputs from the RNN. Since the input was a specific token — e.g. 2kg — the output lattice will include only sequences that may be verbalizations of that token. This output lattice is transformed so that all prefixes of the output are also allowed (e.g. two is allowed as well as two kilograms). This can be done simply by walking the states in the lattice, and making all non-final states final with a free exit cost (i.e., 0 in the tropical semiring). However, we wish to give a strong reward for traversing the whole lattice from the initial to an original final state, and so the exit cost for original final states is set to a very low negative value (-1000 in the current implementation).¹⁰

¹⁰ This final exit cost is actually set in the Thrax grammar itself, though it could as easily have been done dynamically at runtime.

The RNN decoder queries the lattice with a sequence of labels, the object being to find the possible transitions to the next labels and their cost. The label sequence is first transformed into a trivial acceptor, to which is concatenated an FSA that accepts any single output token, and thus has a branching factor of |V|, the size of the output vocabulary. This FSA is then composed with the lattice. For strings in the FSA that match against the lattice, the cost will be the exit cost at that state in the lattice; for strings that fail, the cost will be infinity. Suppose that the input sequence is two hundred, and that the output lattice allows two hundred kilogram or two hundred kilograms. Then the FST will return a score of infinity for the label milligram, for example. However for kilogram or kilograms it will return a non-infinite cost, and indeed, since exiting on one of these corresponds to an original final state of the grammar, it will accrue the reward discussed above. These costs will then be combined with the RNN's own scores for the sequence, and the final result computed as with the RNN alone. Note that since all prefixes of the sequences allowed by the grammar are also allowed, the RNN could, in the cited instance, produce two hundred as the output. However, it will get a substantial reward for finishing the sequence (two hundred kilogram or two hundred kilograms). As we shall see below, this is nearly always sufficient to persuade the RNN to take a more reasonable path.

We note in passing that this method is more or less the opposite approach to that of (Rastogi et al., 2016). In that work, the FST's scoring is augmented by an RNN, whereas in the present approach, the RNN's decoding is guided by the use of an FST.

Accuracies in English for the unfiltered and filtered RNN outputs, where the RNN is trained on the smaller training set described in the previous section, are given in Table 9. The MEASURE and MONEY sets show substantial improvement, while none of the other sets are affected, exactly as desired. Indeed, in this and the following tables we retain the scores for the non-MEASURE/MONEY cases in order to demonstrate that the performance on those classes is unaffected — not a given, since in principle the FSTs could overapply.

In order to focus in on the differences between the filtered and unfiltered models, we prepared a different subset of the final training file that was rich in MEASURE and MONEY expressions.
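Before turning to the results, the label-scoring scheme just described can be mimicked with ordinary data structures; in the sketch below, which is our own illustration, a set of allowed verbalizations stands in for the FST lattice, with free exits at valid prefixes and a large reward (negative cost) at original final states.

    import math

    FINAL_REWARD = -1000.0  # exit cost at an original final state of the grammar
    FREE_EXIT = 0.0         # exit cost added to every non-final lattice state
    BLOCKED = math.inf      # score for labels the lattice does not allow

    # Toy "lattice" for the token 200kg: the allowed verbalizations.
    ALLOWED = {("two", "hundred", "kilogram"), ("two", "hundred", "kilograms")}

    def next_label_cost(prefix, label):
        """Cost the filter assigns to extending `prefix` with `label`."""
        candidate = tuple(prefix) + (label,)
        if candidate in ALLOWED:
            return FINAL_REWARD   # finishing the sequence is strongly rewarded
        if any(full[:len(candidate)] == candidate for full in ALLOWED):
            return FREE_EXIT      # a valid prefix may be exited for free
        return BLOCKED            # anything else is ruled out

    print(next_label_cost(["two", "hundred"], "kilograms"))  # -1000.0
    print(next_label_cost(["two", "hundred"], "milligram"))  # inf
    print(next_label_cost(["two"], "hundred"))               # 0.0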
                 RNN              RNN+FST filter
    ALL          92416  0.998     92435  0.998
    PLAIN        68023  0.999     68038  0.999
    PUNCT        17726  1.000     17729  1.000
    DATE         2808   0.997     2808   0.997
    LETTERS      1410   0.980     1411   0.980
    CARDINAL     1067   0.995     1067   0.995
    VERBATIM     894    0.985     894    0.985
    MEASURE      142    0.972     142    0.993
    ORDINAL      103    0.990     103    0.990
    DECIMAL      89     1.000     89     1.000
    ELECTRONIC   21     1.000     21     1.000
    DIGIT        37     0.838     37     0.838
    MONEY        36     0.972     36     1.000
    FRACTION     13     0.846     13     0.846
    TIME         8      0.750     8      0.750
    ADDRESS      3      1.000     3      1.000

Table 9: Accuracies for the attention-based sequence-to-sequence models for English on smaller training sets, with and without an FST filter. (Slight differences in overall counts for what is the same dataset used for the two conditions reflect the fact that a few examples are "dropped" by the way in which the decoder buffers data for the filterless condition.)

                 RNN              RNN+FST filter
    ALL          16160  0.997     16161  0.997
    PLAIN        11224  0.999     11224  0.998
    PUNCT        2585   1.000     2586   1.000
    DATE         179    1.000     179    1.000
    LETTERS      149    0.993     149    0.993
    CARDINAL     434    0.998     434    0.998
    VERBATIM     120    0.967     120    0.967
    MEASURE      979    0.979     979    0.991
    ORDINAL      18     1.000     18     1.000
    DECIMAL      132    1.000     132    1.000
    ELECTRONIC   1      1.000     1      1.000
    DIGIT        14     0.929     14     0.929
    MONEY        312    0.965     312    0.971
    FRACTION     7      1.000     7      1.000
    TIME         1      1.000     1      1.000
    ADDRESS      2      1.000     2      1.000

Table 10: Accuracies for the attention-based sequence-to-sequence models for English on smaller training sets, with and without an FST filter, on the MEASURE-MONEY rich dataset.

Specifically, we selected 1,000 sentences, each of which had one expression in that category. We then decoded with the models trained on the smaller training set, with and without the FST filter. Results are presented in Table 10. Once again, the FST filter improves the overall accuracy for MONEY and MEASURE, leaving the other categories unaffected. Some examples of the improvements in both categories are shown in Table 11. Looking more particularly at measures, where the largest differences are found, we find that the only cases where the FST filter does not help are cases where the grammar fails to match against the input and the RNN alone is used to predict the output. These cases are 1/2 cc, 30' (for thirty feet), 80', 7000 hg (which uses the unusual unit hectogram), 600 billion kWh (the measure grammar did not allow for a spelled number like billion), and the numberless "measures" per km and /m². In a couple of other cases, the FST does not constrain the RNN enough: 1 g still comes out as one grams, since the FST allows both it and the correct one gram, but this of course is an "acceptable" error since it is at least not misleading.

Finally Table 12 shows results for the RNN with and without the FST filter on 1000 MONEY and MEASURE expressions that have not previously been seen in the training data.¹¹ In this case there was no improvement for MONEY, but there was a substantial improvement for MEASURE. In most cases, the MONEY examples that failed to be improved with the FST filter were cases where the filter simply did not match the input, and thus was not used.¹²

¹¹ To remind the reader, all test data are of course held out from the training and development data, but it is common for the same literal expression to recur.

¹² Only in three cases involving Indian Rupees, such as Rs.149, did the filter match but still the wrong answer (in this case six) was produced. In that case the RNN probably simply failed to produce any paths including the right answer. In such cases the only solution is probably to override the RNN completely on a case-by-case basis.

The results of a similar experiment on Russian, using the smaller training set on a MEASURE-MONEY rich corpus where the MEASURE and MONEY tokens were previously unseen, are shown in Table 13. On the face of it, it would seem that the FST filter is actually making things worse, until one looks at the differences. Of the 50 cases where the filter made things "worse", 34 (70%) are cases where there was an error in the data and a perfectly well formed measure was rendered with sil as the 'truth'.
Input RNN RNN+FST
£5 five five pounds
11 billion AED eleven billion danish eleven billion dirhams
2 mA 2 megaamperes 2 milliamperes
33 rpm thirty two revolutions per minute thirty three revolutions per minute

Table 11: Some misleading readings of the RNN that have been corrected by the FST.

                 RNN              RNN+FST filter
    ALL          16032  0.995     16050  0.996
    PLAIN        10859  0.999     10869  0.999
    PUNCT        2726   1.000     2730   1.000
    DATE         184    1.000     184    1.000
    LETTERS      167    0.964     168    0.964
    CARDINAL     438    0.998     439    0.998
    VERBATIM     101    0.990     101    0.990
    MEASURE      863    0.961     865    0.979
    ORDINAL      3      1.000     3      1.000
    DECIMAL      196    0.995     196    0.995
    ELECTRONIC   1      1.000     1      1.000
    DIGIT        13     1.000     13     1.000
    MONEY        471    0.955     471    0.955
    FRACTION     7      1.000     7      1.000
    TIME         1      1.000     1      1.000
    ADDRESS      1      1.000     1      1.000

Table 12: Accuracies for the attention-based sequence-to-sequence models for English on smaller training sets, with and without an FST filter, on the MEASURE-MONEY rich dataset, where in this case all measure and money phrases are previously unseen in the training. In this case the only improvement of the FST filter was to the MEASURE expressions.

                 RNN              RNN+FST filter
    ALL          13152  0.983     13177  0.980
    PLAIN        8176   0.998     8190   0.998
    PUNCT        2501   1.000     2506   1.000
    DATE         130    0.969     130    0.969
    TRANS        165    0.970     165    0.970
    LETTERS      192    0.995     192    0.995
    CARDINAL     435    0.931     437    0.931
    VERBATIM     175    1.000     175    1.000
    MEASURE      1131   0.877     1133   0.856
    ORDINAL      14     1.000     14     1.000
    DECIMAL      49     0.939     49     0.939
    DIGIT        2      1.000     2      1.000
    MONEY        155    0.832     157    0.796
    FRACTION     7      0.714     7      0.714
    TIME         5      0.800     5      0.800

Table 13: Accuracies for the attention-based sequence-to-sequence models for Russian on smaller training sets, with and without an FST filter, on the MEASURE-MONEY rich dataset, where in this case all measure and money phrases are previously unseen in the training.

In nearly all other cases, the input was actually ill formed and both Kestrel and the RNN without the FST filter 'corrected' the input. For example a Wikipedia contributor wrote 47 292 долларов '47,292 dollars', which should correctly be 47 292 доллара, since the preceding number ends in '2', and thus the word for 'dollar' should be in the genitive singular, not the genitive plural. Now, the Kestrel grammars for Russian have the property that they read measure and money expressions, among other semiotic classes, into an internal format that in some cases abstracts away from the written form. In the case at hand the written долларов gets represented internally as dollar. During the verbalization phase the verbalizer grammars translate this into the form of the word required by the grammatical context, in this case доллара.
has the (one could argue) undesirable property of
enforcing grammatical constraints on the input. The NLP applications, such as MT or parsing, where
result is that the data contains instances of these deep learning can be used more or less “out-of-the
sorts of corrections where долларов gets rendered as box”. Indeed if one were to evaluate a text normaliza-
доллара, and the RNN left to its own devices learns tion system on the basis of how well the system does
this mapping. Thus the RNN produces сорок семь overall, then the systems reported in this paper are
тысяч двести девяносто два доллара. The FST already doing very well, with accuracies over 99%.
filter, which does not allow долларов to be read as But when one drills down and looks at the interest-
доллара, verbalizes as written — arguably the right ing cases — say dates, which account for about 2%
behavior for a TTS system, which should not be in of the tokens in these data — then the performance
the business of correcting the grammar of the input is less compelling. An MT system that fails on many
text. In addition to these cases, there were 67 cases instances of a somewhat unusual construction could
where the RNN+FST was an unequivocal improve- still be a fairly decent MT system overall. A text nor-
ment over the RNN alone, as in 10 кН ‘ten kilo- malization system that reads the year 2012 as two
Newtons’, which was read by the RNN as десяти twelve is seriously problematic no matter how well it
килолитров ‘ten kiloliters’ but by the RNN+FST as does on text overall. Ultimately the difference comes
десяти килоньютонов ‘ten kilonewtons’. This is of down to different demands of the domain: the bar for
course an instance of a broad class of category errors text normalization is simply higher.
sometimes made by the RNN alone, that we have Given past experience we anticipate three main
seen many instances of. classes of responses, which we would like to briefly
All in all then, the FST-filtration approach seems address.
to be a viable way to improve the quality of the out- The first is that our characterization of what is im-
put for targeted cases where the RNN is prone to portant for a text normalization is idiosyncratic: what
make the occasional error. justification do we have for saying that, for exam-
8 Discussion and the challenge

We have presented evidence in this paper that training neural models to learn text normalization is probably not going to be reducible to simply having copious amounts of aligned written- and spoken-form text and then training a general neural system to compute the mapping. Combining the RNN with a more knowledge-based system such as an FST, as we presented in Section 7, is probably viable, but it has yet to be demonstrated that one can do it with RNNs alone.

To be sure, our RNNs were often capable of producing surprisingly good results and learning some complex mappings. Yet they sometimes also produced weird output, making them risky for use in a TTS system. Of course traditional approaches to TTS text normalization make errors, but they are not likely to make an error like reading the wrong number, or substituting hours for gigabytes, something that the RNNs are quite prone to do. The reason the FST filtering approach works, of course, is precisely that it disallows such random mappings.

Again, the situation is different from some other NLP applications, such as MT or parsing, where deep learning can be used more or less “out-of-the-box”. Indeed, if one were to evaluate a text normalization system on the basis of how well the system does overall, then the systems reported in this paper are already doing very well, with accuracies over 99%. But when one drills down and looks at the interesting cases — say dates, which account for about 2% of the tokens in these data — then the performance is less compelling. An MT system that fails on many instances of a somewhat unusual construction could still be a fairly decent MT system overall. A text normalization system that reads the year 2012 as two twelve is seriously problematic no matter how well it does on text overall. Ultimately the difference comes down to the differing demands of the two domains: the bar for text normalization is simply higher.
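A back-of-the-envelope calculation, with invented numbers, shows why overall token accuracy is such a blunt instrument here: a system that got every second date wrong would still score close to 99% overall, simply because dates are rare.

    # Invented numbers for illustration: dates are roughly 2% of tokens.
    date_fraction = 0.02
    date_accuracy = 0.50      # hopeless on dates
    other_accuracy = 0.999    # near-perfect on everything else

    overall = date_fraction * date_accuracy + (1 - date_fraction) * other_accuracy
    print(f"{overall:.4f}")   # 0.9890: still looks like a very accurate system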
Given past experience we anticipate three main classes of responses, which we would like to briefly address.

The first is that our characterization of what is important for a text normalization system is idiosyncratic: what justification do we have for saying that, for example, a text normalization system must get dates correct? But the response to that is obvious: the various semiotic classes are precisely where most of the effort has been devoted in developing traditional approaches to text normalization for TTS dating back to the 1970s (Allen et al., 1987), for the simple reason that a TTS system ought to be able to know how to read something like Sep 12, 2014.

The second is that we have set up a straw man: who ever argued that one could expect a deep learning system to learn a text normalization system from these kinds of data? It is true that nobody has specifically made that claim for text normalization, but the view is definitely one that is “in the air”: colleagues of one of the authors who work on TTS have been asked why so much hand labor goes into TTS systems. Can one not just get a huge amount of aligned text and speech and learn the mapping?

The final and perhaps most anticipated response is: “You didn’t use the right kind of models; if you had just used an X model with Y objective function, etc., then you would have solved the problems you noted.” Our response to that is that the data described in this paper will be made publicly available, and people are encouraged to try out their clever ideas for themselves.

The challenge then can be laid out simply as follows: using the data reported here,13 train a pure deep-learning-based normalization system for English and Russian that outperforms the results reported in this paper. By “outperform” here we are not primarily focusing on the overall scores, which are already very good, but rather on the scores for various of the interesting categories: can one get a system, for example, that would never read £ as dollars or euros, or make any of the other similar errors where a related but incorrect term has been substituted? (We take it as given that the same training-development-test division of the data is used, along with the same scoring scripts.) If one could train a pure deep-learning system that failed to make these sorts of silly errors and in general did better than the systems reported here on the various semiotic categories, this would represent a true advance over the state of the art reported in this paper.

13 Available at location-to-be-determined, most likely under the Creative Commons 4.0 license. The data include the training, development and test datasets with their annotated normalized outputs, as well as the scoring scripts. We will also provide a mechanism to submit corrections to the data.
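As a concrete starting point, the per-category scores that the challenge asks for can be computed with something as simple as the sketch below. It assumes a tab-separated file of (semiotic class, reference normalization, system output) triples; this is not the released scoring script, whose input format may differ.

    import collections
    import sys

    def per_class_accuracy(path):
        """Token accuracy per semiotic class from (class, reference, hypothesis) triples."""
        correct = collections.Counter()
        total = collections.Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                cls, ref, hyp = line.rstrip("\n").split("\t")
                total[cls] += 1
                correct[cls] += int(ref == hyp)
        return {cls: correct[cls] / total[cls] for cls in total}

    if __name__ == "__main__":
        for cls, acc in sorted(per_class_accuracy(sys.argv[1]).items()):
            print(f"{cls}\t{acc:.4f}")

Overall accuracy can then be reported alongside the per-class numbers, so that failures on small but important classes such as dates or money expressions are not hidden by the mass of trivial tokens.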
Acknowledgements

We thank Alexander Gutkin for preparing the original code for producing Kestrel text normalization, and Kyle Gorman for producing the data. Alexander Gutkin also checked a sample of the Russian data to estimate Kestrel’s error rate. We also thank both Alexander and Kyle, as well as Brian Roark and Suyoun Yoon, for comments on earlier versions of this paper. Finally, we thank Haşim Sak for help with the LSTM models reported in Experiment 1.
References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Cyril Allauzen, Michael Riley, and Brian Roark. 2016. Distributed representation and estimation of WFST-based n-gram models. In ACL SIGFSM Workshop on Statistical NLP and Weighted Automata.

Jonathan Allen, Sharon M. Hunnicutt, and Dennis Klatt. 1987. From Text to Speech: The MITalk System. Cambridge University Press.

Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. In EMNLP, Austin, TX.

Ai Ti Aw and Lian Hau Lee. 2012. Personalized normalization for a multilingual chat system. In ACL, pages 31–36, Jeju Island, Korea.

Timothy Baldwin, Young-Bum Kim, Marie-Catherine de Marneffe, Alan Ritter, Bo Han, and Wei Xu. 2015. Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In WNUT.

Richard Beaufort, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In ACL, pages 770–779, Uppsala, Sweden.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, March.

S. S. Bharadwaj and S. B. Medapati. 2015. Training speech recognition using captions, March 26. US Patent App. 14/037,144.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In EMNLP-CoNLL, pages 1455–1465, Jeju Island, Korea.

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP, pages 4960–4964.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078.

Monojit Choudhury, Rahul Saraf, Vijit Jain, Sudeshna Sarkar, and Anupam Basu. 2007. Investigation and modeling of the structure of texting language. International Journal of Document Analysis and Recognition, 10:157–174.

Grzegorz Chrupała. 2014. Normalizing tweets with edit scripts and recurrent neural embeddings. In ACL, Singapore.

Peter Ebden and Richard Sproat. 2014. The Kestrel TTS text normalization system. Natural Language Engineering, 21(3):1–21.

Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In LREC, Lisbon, Portugal.

Kyle Gorman and Richard Sproat. 2016. Minimally supervised models for number normalization. Transactions of the Association for Computational Linguistics.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In ICML, pages 1764–1772.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labeling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376.

Hany Hassan and Arul Menezes. 2013. Social text normalization using contextual graph random walks. In ACL, pages 1577–1586.

Martin Jansche and Richard Sproat. 2009. Named entity transcription with pair n-gram models. In NEWS ’09, pages 32–35, Singapore.

Max Kaufmann. 2010. Syntactic normalization of Twitter messages. In International Conference on NLP.

Catherine Kobus, François Yvon, and Géraldine Damnati. 2008. Normalizing SMS: are two metaphors better than one? In COLING, pages 441–448, Manchester, UK.

Fei Liu, Fuliang Weng, Bingqing Wang, and Yang Liu. 2011. Insertion, deletion, or substitution? Normalizing text messages without pre-categorization nor supervision. In ACL, pages 71–76, Portland, Oregon, USA.

Fei Liu, Fuliang Weng, and Xiao Jiang. 2012a. A broad-coverage normalization system for social media language. In ACL, pages 1035–1044, Jeju Island, Korea.

Xiaohua Liu, Ming Zhou, Xiangyang Zhou, Zhongyang Fu, and Furu Wei. 2012b. Joint inference of named entity recognition and normalization for tweets. In ACL, pages 526–535, Jeju Island, Korea.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech, volume 2, page 3.

Wookhee Min and Bradford Mott. 2015. NCSU SAS WOOKHEE: A deep contextual long-short term memory model for text normalization. In WNUT.

Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS, pages 2204–2212.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88.

Deana Pennell and Yang Liu. 2011. A character-level machine translation approach for normalization of SMS abbreviations. In IJCNLP.

Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In ICASSP, pages 4225–4229.

Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. 2016. Weighting finite-state transductions with neural context. In NAACL, pages 623–633, San Diego.

Brian Roark and Richard Sproat. 2014. Hippocratic abbreviation expansion. In ACL, pages 364–369.

Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In ACL, pages 61–66.

Richard Sproat and Keith Hall. 2014. Applications of maximum entropy rankers to problems in spoken language processing. In Interspeech, pages 761–764.

Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words. Computer Speech and Language, 15(3):287–333.

Richard Sproat. 1996. Multilingual text analysis for text-to-speech synthesis. Natural Language Engineering, 2(4):369–380.

Richard Sproat. 2010. Lightly supervised learning of text normalization: Russian number names. In IEEE SLT, pages 436–441.

Paul Taylor. 2009. Text-to-Speech Synthesis. Cambridge University Press, Cambridge.

Yunqing Xia, Kam-Fai Wong, and Wenjie Li. 2006. A phonetic-based approach to Chinese chat text normalization. In ACL, pages 993–1000, Sydney, Australia.

Yi Yang and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In EMNLP, pages 61–72.