Keyword Detection

Abstract—This paper presents our recent attempt to build a super-large-scale spoken-term detection system that can detect any keyword uttered in a 2,000-hour speech database within a few seconds. There are three problems in achieving such a system. The system must be able to detect out-of-vocabulary (OOV) terms (the OOV problem). The system has to respond to the user quickly without sacrificing search accuracy (the search speed and accuracy problem). The pre-stored index database should be sufficiently small (the index size problem). We introduced a phoneme-based search method to detect the OOV terms and combined it with the LVCSR-based method. To search for a keyword in large-scale speech databases accurately and quickly, we introduced a multistage rescoring strategy that uses several search methods to reduce the search space in a stepwise fashion. Furthermore, we constructed an out-of-vocabulary/in-vocabulary region classifier, which allows us to reduce the size of the index database for OOVs. We describe the prototype system and present some evaluation results.

I. INTRODUCTION

With the growth of high-capacity storage devices and large-scale networks, more speech databases are being accumulated and utilized. It is no longer difficult to accumulate one year of TV programs or the speech stream of an entire human life. To manage such large speech databases, techniques that can extract only useful information "quickly" and "accurately" are highly desired.

This paper presents our recent attempt to build a super-large-scale spoken-term detection system that can detect any keyword uttered in a 2,000-hour speech database within a few seconds. Although there are already many studies of spoken-term detection [1], [2], [3], there is no system that treats such large databases. There are three problems in achieving such a system. The system must be able to detect out-of-vocabulary (OOV) terms (the OOV problem). The system has to respond to the user quickly without sacrificing search accuracy (the search speed and accuracy problem). The pre-stored index database should be small enough (the index size problem).

First, we consider the OOV problem of large speech databases. One simple approach to spoken-term detection is to execute large vocabulary continuous speech recognition (LVCSR) in advance and to search for a keyword in the recognized word sequence or word lattice [4]. This approach usually uses a position table that can be consulted to find the positions of dictionary words (called a word inverted index), so it can detect a keyword very rapidly. However, this method has a serious defect in that it cannot detect the LVCSR's out-of-vocabulary (OOV) terms. Although proper nouns and newly created words have great importance in many applications, they tend to be OOVs because of their scarcity. Large-scale databases potentially contain many OOVs, so the system must be able to detect them. To detect the OOV keywords, we introduce a phoneme-based search technique and combine it with the conventional LVCSR method.

The second issue is the search speed and search accuracy problem of an open-vocabulary spoken-keyword detection system. To detect OOV keywords, conventional phoneme-based systems used a phoneme-based close-matching technique [5], [6], [7]. Its main defect is its slow search speed due to the complex close-matching procedure. Furthermore, the phonetic approach often produces many false-positive hits. To solve those problems, we introduce a rescoring strategy, which first uses very simplified phoneme matching to coarsely reduce the search space and then rescores the reduced search space with a more accurate matching technique. We propose a multistage rescoring procedure that reduces the search space in a stepwise fashion. This enables us to search for an OOV keyword quickly and accurately.

Finally, we consider the index size problem. We combined the phoneme-based method and the LVCSR-based method for fast and accurate search. However, the capacity of the index database becomes larger than that of either single method. Furthermore, we use a very accurate rescoring procedure for the detection of OOV keywords, and this procedure requires a very large index capacity (about 54 MB per hour of speech). Such a large index database is hard to store in memory when the database size becomes extremely large, and that causes a significant decrease in search speed. Therefore, we need some index-reduction technique. To solve this problem, we propose an out-of-vocabulary/in-vocabulary region classifier, which can accurately detect the regions that do not contain OOVs and remove the redundant indexes used for OOVs.

In the next section, we give an overview of our system. In section III, we explain the multistage rescoring strategy, which enables us to search for a keyword quickly and accurately. In section IV, we explain the OOV/IV region classifier, which greatly reduces the redundant phoneme indexes. Finally, we show some experimental results in section V and describe our first prototype system in section VI.
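As an illustration of the word inverted index mentioned above, the following is a minimal sketch (all names and the toy data are hypothetical, not from the paper): the LVCSR output is scanned once, and each recognized dictionary word is mapped to the positions where it occurs, so a keyword lookup is a single dictionary access.

```python
# Hypothetical sketch of a word inverted index over LVCSR output.
# Each dictionary word maps to the (utterance id, start time) positions
# where the recognizer hypothesized it.
from collections import defaultdict

def build_word_index(transcripts):
    """transcripts: {utt_id: [(word, start_sec), ...]} from LVCSR."""
    index = defaultdict(list)
    for utt_id, words in transcripts.items():
        for word, start in words:
            index[word].append((utt_id, start))
    return index

def lookup(index, keyword):
    """Average O(1) lookup; an OOV keyword simply returns no hits."""
    return index.get(keyword, [])

transcripts = {
    "utt1": [("speech", 0.5), ("retrieval", 1.2)],
    "utt2": [("speech", 3.1)],
}
index = build_word_index(transcripts)
print(lookup(index, "speech"))   # hits in both utterances
```

This illustrates both the speed of the approach and its defect: a keyword absent from the LVCSR vocabulary can never appear in the index.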
Fig. 1. Overview of Spoken-Term Detection System

Fig. 2. Overview of Out-of-vocabulary Index Search Procedure
(*OOV: out of vocabulary; *IV: in vocabulary)
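The coarse first-stage search of Fig. 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and toy phoneme data are hypothetical. Each indexed region is scored by how many of the keyword's phoneme N-grams it contains, deliberately ignoring their order.

```python
# Sketch of a first-stage phoneme N-gram search: an inverted index from
# phoneme N-grams to region ids, and a coarse scorer that counts how many
# of the keyword's N-grams each region shares (order is ignored).
from collections import defaultdict

def ngrams(phonemes, n=3):
    return {tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)}

def build_ngram_index(regions, n=3):
    """regions: {region_id: [phoneme, ...]} -> {ngram: {region_id, ...}}."""
    index = defaultdict(set)
    for rid, phs in regions.items():
        for g in ngrams(phs, n):
            index[g].add(rid)
    return index

def coarse_search(index, keyword_phonemes, n=3, top=10):
    hits = defaultdict(int)                 # region id -> matched N-gram count
    for g in ngrams(keyword_phonemes, n):
        for rid in index.get(g, ()):
            hits[rid] += 1
    # Highest-overlap regions first; these top-N candidates would be
    # handed to the more precise rescoring stages.
    return sorted(hits.items(), key=lambda kv: -kv[1])[:top]
```

Because the scorer only counts set intersections against a precomputed index, it is fast enough for a very large search space, at the cost of the accuracy that the later rescoring stages must restore.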
multistage rescoring. Our detection procedure consists of three stages; the most probable top-N detection results of each stage are sent to a more accurate next stage to correct their order (Fig. 2).

In the first stage, we have to seek a very large search space, so we construct a fast phoneme N-gram search procedure. When we search for a keyword, we first convert the keyword into phoneme sequences using conversion rules. Then we search for the regions that contain many of the keyword's phoneme N-grams by referring to the phoneme N-gram index. This search method is very fast because it does not consider the order of the phoneme N-grams in the keyword. In exchange, the detection result is not accurate.

Then, we rescore the top-N2 putative regions from the 1st stage using more precise acoustic information. Here, we used the unconstrained-endpoint word-spotting technique. To skip the computationally expensive acoustic analysis, we use the phoneme a posteriori probability table calculated in the indexing phase.

Finally, we score the top-N3 segments from the 2nd stage using a more precise utterance-verification method based on the filler model. A filler model is a simple self-loop network of various fillers such as phonemes or syllables, and we construct an HMM network like that shown in Fig. 3. The difference between the acoustic score of the upper filler-keyword-filler path and that of the lower filler-only path is regarded as the keyword existence score of the segment. As in the case of the 2nd stage, we can execute the 3rd stage efficiently using the phoneme a posteriori probability table.¹

¹ In addition, we can omit the post-filler model of the upper filler-keyword-filler path (marked as "∗" in Fig. 3) for calculation efficiency, because we use real-time end-point tracking. Our preliminary experimental results suggested that the retrieval accuracy was conserved even if we omitted that path.

Fig. 3. Network for Precise Acoustic Rescoring

IV. OOV/IV REGION CLASSIFIER TO REMOVE SUPERFLUOUS OOV INDEXES

We construct an OOV/IV region classifier to remove the superfluous OOV indexes. The OOV/IV region classifier classifies whether an interval of speech contains OOV terms of the LVCSR (an OOV region) or not (an IV region).

A. Conventional Method

There are already some OOV-region detection criteria. The first is the sentence-based criterion. The a posteriori probability that a speech X contains an OOV is expressed as follows:

P(OOV contained|X) = 1 − Σ_{s∈S} P(s|X)
                   = 1 − Σ_{s∈S} P(X|s)P(s) / P(X)
                   ≈ 1 − Σ_{s∈S_Nbest} P(X|s)P(s) / Σ_{ph∈Ph_Nbest} P(X|ph)P(ph),   (1)

where S, S_Nbest and Ph_Nbest represent all possible word sequences, the N-best word sequences from the LVCSR, and the N-best phoneme sequences from the phoneme recognizer, respectively.

The second is the word-based criterion, which uses a confidence measure from the LVCSR. The confidence measure CM(w_i) for a word w_i is represented as below:

CM(w_i) = P(X|w_i)P(w_i) / Σ_{w_j∈W} P(X|w_j)P(w_j),   (2)

where W represents the word hypotheses for the same interval. Note that the confidence measure becomes low both when there is an OOV and when the input speech is unclear or ungrammatical. However, because there is a correlation between the existence of an OOV and the ambiguity of the LVCSR results, we can judge whether there is an OOV by checking whether the confidence measure of the word in the best recognition path is lower or higher than a threshold.

B. OOV/IV Region Classifier

The main idea of our OOV/IV region classifier is to combine the above two criteria on the basis of machine learning techniques. We construct the OOV/IV region classifier function using OOV/IV-labeled training data. The input signal is segmented into regions according to the word boundaries of the LVCSR's hypothesis. Each region is labeled as either IV or OOV according to the corresponding transcription.

We used the logistic regression algorithm as the classifier. The features used for classification are the value of equation (1), the value of equation (2), Σ_{s∈S_1best} P(s|X)P(X), and Σ_{ph∈Ph_Nbest} P(X|ph)P(ph). The last two are introduced to adjust the value ranges of the numerator and denominator of equation (1), which correspond to the value ranges of the LVCSR results and the phoneme recognition results.

V. EXPERIMENTAL RESULTS

In this section, we show some experimental results for the multistage rescoring and the OOV/IV region classifier.

A. Baseline System

We evaluated the word index search system as the baseline. We used the Corpus of Spontaneous Japanese (CSJ) [9] as the evaluation data. We made a small data set with 6.3 hours of speech and a large data set with 460 hours of speech.

1) Evaluation on Small Data Set: We used 6.3 hours of conference speeches from the CSJ, selecting 30 different talkers. We collected 90 length-balanced keywords from the transcription of the speech.

We used Julius [10] for the LVCSR. The acoustic model was trained using 50 hours of speaker-open data from the CSJ. To analyze the effect of the quality of the language model, we
tested two types of language models: the in-domain language model and the out-of-domain language model. The in-domain language model was trained using the transcription of 206 hours of conference speech. The out-of-domain language model was trained using the transcription of 235 hours of monologue speech. 14 keywords (16%) were OOV for the in-domain language model, and 43 keywords (48%) were OOV for the out-of-domain language model.

To measure the search ability, we used the TopHit and the FOM (Figure of Merit). TopHit is the probability that the best candidate of the keyword detection is correct. The FOM is defined as the average of the detection/false-alarm curve taken over the range of 0–10 false alarms per hour per keyword. We defined a candidate hit as correct when the output segment correctly includes the keyword.

The results are listed in TABLE I. The left side of each cell shows the result averaged over only the IV terms, and the right side shows the result averaged over all keywords. The word index search system cannot detect OOV terms, so the results severely degrade when averaged over all keywords. The table shows that the TopHit and FOM for IV keywords were very high (97.2% TopHit and 71.0% FOM) when we used the in-domain language model. However, the TopHit and FOM degraded when using the out-of-domain language model (79.7% TopHit and 46.9% FOM).

TABLE I
TOP HIT AND FOM OF WORD INDEX SEARCH FOR SMALL DATA SET

method           | TopHit(%) IV / ALL | FOM(%) IV / ALL
In-domain LM     | 97.2 / 82.1        | 71.0 / 60.0
Out-of-domain LM | 79.7 / 41.6        | 46.9 / 24.5

2) Evaluation on Large Data Set: We used 212 hours of conference speech and 248 hours of monologue speech from the CSJ. We selected 1,000 keywords by tf/idf metrics from the transcription of the test set.

We also evaluated the phoneme N-gram search and compared it with the word index search. We trained an acoustic model using speaker-open 100-hour speech (50 hours of conference speech and 50 hours of monologue speech) from the CSJ. The language models for the LVCSR and the phoneme recognizer were trained from the transcription of the same 100-hour data. The vocabulary size of the LVCSR was 17K. We used Julius for the LVCSR and phoneme recognition.

TABLE II shows the FOM results for IV keywords, OOV keywords and all keywords.² At that time, 389 keywords were OOVs. The conventional word index search method obtained a relatively good score (47.7%) for the IV keywords. However, it cannot detect the OOV keywords. On the contrary, the phoneme N-gram search can detect the OOV words, but its accuracy is not good (17.1%). The result is reasonable because the phoneme N-gram search does not consider the order of the phoneme N-grams in the keyword (see section III). Note that the simple combination of the above two methods ("combination") already achieved a great improvement in the FOM for all keywords (29.1% to 35.8%). We aim to improve the accuracy for OOV keywords using the rescoring module, which will be evaluated next.

² In this experiment, we used the OOV detection procedure only for the compound words containing OOV terms, for computational simplicity.

TABLE II
FOM OF WORD INDEX SEARCH AND PHONEME N-GRAM SEARCH FOR LARGE DATA SET

method                | IV   | OOV  | ALL
Word Index Search     | 47.7 | 0.0  | 29.1
Phoneme N-gram Search | 29.5 | 17.1 | 24.7
Combination           | 47.7 | 17.1 | 35.8

B. Evaluation of Multistage Rescoring

We evaluated our multistage rescoring procedure. In this experiment, we used only the small data set (6.3-hour data) for detailed analysis. This time, we used a simple syllable grammar for the phoneme recognition. The acoustic model was trained using 16 hours of originally collected speech. The index database (acoustic likelihood DB and syllable recognition DB) was accumulated at the rate of 54 MB/hour. In this evaluation, we used conventional phoneme close-matching instead of the phoneme N-gram search, because the data set was small enough.

We again used the TopHit and FOM. TABLE III shows the results. The results of the conventional phoneme-based method are listed under the setting "stage 1". "Stage 1-2" and "stage 1-3" represent the systems that concatenate stage 1 with stage 2, or stage 1 with stage 3. "Stage 1-2-3" is our proposed multistage rescoring system. We adjusted each system so as to respond within 3.0 sec.

TABLE III
TOP HIT AND FOM OF OOV INDEX SEARCH

setting     | search time (sec) for 6.3-hour DB | TopHit(%) | FOM(%)
stage 1     | 0.19                              | 36.6      | 27.9
stage 1-2   | 3.0                               | 67.8      | 55.2
stage 1-3   | 3.0                               | 76.7      | 39.5
stage 1-2-3 | 3.0                               | 85.6      | 58.1

From TABLE III, the results of the conventional phoneme-based ("stage 1") approach were much worse than those of the other settings (36.6% TopHit and 27.9% FOM). The poor results are mainly due to the low syllable recognition rate in this experiment (about 64%). Both the best TopHit (85.6%) and the best FOM (58.1%) were achieved with the setting "stage 1-2-3". There is a great improvement in accuracy compared to the "stage 1" setting, without a large increase in search time (only 2.81 sec).

Compared to the word index search (TABLE I), the accuracy of our OOV index search with multistage rescoring was midway between the in-domain language model setting and the out-of-domain language model setting. Considering the fact that constructing an appropriate in-domain language model for a large-scale speech database is very difficult, this result is very encouraging.
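The FOM used throughout these evaluations can be approximated as follows. This is a simplified reading of the definition given above (the average detection rate over 1–10 false alarms per hour per keyword), not the authors' scoring code; the function name and toy data are hypothetical, and the standard wordspotting FOM additionally interpolates between operating points.

```python
# Rough Figure-of-Merit sketch: sort putative hits by score, sweep down the
# list, and average the detection rate observed at 1..10 false alarms per
# hour of speech.
def fom(hits, n_true, hours):
    """hits: [(score, is_correct), ...]; n_true: reference occurrences."""
    hits = sorted(hits, key=lambda h: -h[0])
    detected, rates = 0, []
    i = fa = 0
    for k in range(1, 11):                  # operating points: k FA/hour
        budget = k * hours
        while i < len(hits) and fa < budget:
            score, correct = hits[i]
            i += 1
            if correct:
                detected += 1
            else:
                fa += 1
        rates.append(detected / n_true)     # detection rate at this point
    return 100.0 * sum(rates) / len(rates)  # percentage, as in the tables
```

Under this reading, a system that finds every occurrence before accumulating one false alarm per hour scores 100%, which matches the intuition behind the FOM columns in the tables.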
C. Evaluation of OOV/IV Region Classifier

[Figures omitted: a histogram of the number of keywords, and the prototype system configuration — PC1 (phoneme index search and simplified acoustic rescoring; phoneme probability table in memory and on solid-state disk) connected via TCP/IP to PC2 (word index search, IV/OOV selection, and result merging; word inverted index in memory).]
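The OOV/IV region classifier of section IV combines the confidence criteria with logistic regression. The following toy sketch shows the idea with a plain gradient-descent trainer; the two features here ("sentence-level OOV score" and "word confidence") are hypothetical stand-ins for the paper's actual features (equations (1) and (2) plus the two range-adjustment sums), and all names and data are illustrative.

```python
# Toy logistic-regression OOV/IV region classifier (features are made up;
# the paper uses eq. (1), eq. (2) and two range-adjustment sums).
import math

def train_logreg(xs, ys, lr=0.5, epochs=2000):
    """xs: feature vectors per region; ys: 1 for OOV region, 0 for IV region."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))          # sigmoid
            g = p - y                               # gradient of log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def classify(w, b, x, threshold=0.5):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return "OOV" if p >= threshold else "IV"

# Toy labeled regions: [sentence-level OOV score, word confidence]
xs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
ys = [1, 1, 0, 0]
w, b = train_logreg(xs, ys)
```

Regions classified as IV can then have their phoneme-level (OOV) indexes dropped, which is the index-reduction step the classifier exists for.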