
Open-Vocabulary Keyword Detection from Super-Large Scale Speech Database

Naoyuki Kanda, Hirohiko Sagawa, Takashi Sumiyoshi and Yasunari Obuchi
Central Research Laboratory, Hitachi Ltd.
1-280, Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan
[email protected]

Abstract—This paper presents our recent attempt to make a super-large scale spoken-term detection system, which can detect any keyword uttered in a 2,000-hour speech database within a few seconds. There are three problems in achieving such a system. The system must be able to detect out-of-vocabulary (OOV) terms (the OOV problem). The system has to respond to the user quickly without sacrificing search accuracy (the search speed and accuracy problem). And the pre-stored index database should be sufficiently small (the index size problem). We introduced a phoneme-based search method to detect the OOV terms, and combined it with the LVCSR-based method. To search for a keyword in large-scale speech databases accurately and quickly, we introduced a multistage rescoring strategy, which uses several search methods to reduce the search space in a stepwise fashion. Furthermore, we constructed an out-of-vocabulary/in-vocabulary region classifier, which allows us to reduce the size of the index database for OOVs. We describe the prototype system and present some evaluation results.

I. INTRODUCTION

With the growth of high-capacity storage devices and large-scale networks, more and more speech databases are being accumulated and utilized. It is no longer difficult to accumulate a year of TV programs or the speech stream of an entire human life. To manage such large speech databases, techniques that can extract only the useful information "quickly" and "accurately" are highly desired.

This paper presents our recent attempt to make a super-large scale spoken-term detection system, which can detect any keyword uttered in a 2,000-hour speech database within a few seconds. Although there are already many studies of spoken-term detection [1], [2], [3], no existing system handles databases of this size. There are three problems in achieving such a system. The system must be able to detect out-of-vocabulary (OOV) terms (the OOV problem). The system has to respond to the user quickly without sacrificing search accuracy (the search speed and accuracy problem). And the pre-stored index database should be small enough (the index size problem).

First, we consider the OOV problem of large speech databases. One simple approach to spoken-term detection is to execute large vocabulary continuous speech recognition (LVCSR) in advance and to search for a keyword in the recognized word sequence or word lattice [4]. This approach usually uses a position table that can be consulted to find the positions of dictionary words (called a word inverted index), so it can detect a keyword very rapidly. However, this method has a serious defect in that it cannot detect the LVCSR's out-of-vocabulary (OOV) terms. Although proper nouns and newly created words have great importance in many applications, they tend to be OOVs because of their scarcity. Large-scale databases potentially contain many OOVs; therefore, the system must be able to detect them. To detect OOV keywords, we introduce a phoneme-based search technique and combine it with the conventional LVCSR method.

The second issue is the search speed and search accuracy problem of an open-vocabulary spoken-keyword detection system. To detect OOV keywords, conventional phoneme-based systems used a phoneme-based close-matching technique [5], [6], [7]. The main defect of this approach is its slow search speed, due to the complex close-matching procedure. Furthermore, the phonetic approach often produces many false positive hits. To solve those problems, we introduce a rescoring strategy, which first uses very simplified phoneme matching to coarsely reduce the search space and then rescores the reduced search space with a more accurate matching technique. We propose a multistage rescoring procedure that reduces the search space in a stepwise fashion, which enables us to search for an OOV keyword quickly and accurately.

Finally, we consider the index size problem. We combined the phoneme-based method and the LVCSR-based method for fast and accurate search. However, the capacity of the index database becomes larger than that of a single method. Furthermore, we use a very accurate rescoring procedure for the detection of OOV keywords, and this procedure requires a very large index capacity (about 54 MB per hour of speech). Such a large index database is hard to store in memory when the database size becomes extremely large, which causes a significant decrease in search speed. Therefore, we need an index-reduction technique. To solve this problem, we propose an out-of-vocabulary/in-vocabulary region classifier, which can accurately detect regions that do not contain OOVs and remove the redundant indexes used for OOVs.

In the next section, we give an overview of our system. In Section III, we explain the multistage rescoring strategy, which enables us to search for a keyword quickly and accurately. In Section IV, we explain the OOV/IV region classifier, which greatly reduces the redundant phoneme indexes. Finally, we show some experimental results in Section V and describe our first prototype system in Section VI.

978-1-4244-2295-1/08/$25.00 © 2008 IEEE 939 MMSP 2008


Fig. 1. Overview of Spoken-Term Detection System (OOV: out of vocabulary; IV: in vocabulary)

Fig. 2. Overview of Out-of-Vocabulary Index Search Procedure
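The search funnel depicted in Fig. 2, in which each stage keeps only its top-scoring hits and hands them to a costlier, more accurate scorer, can be sketched generically as follows (a toy illustration; the scoring functions here are placeholders, not the paper's actual scorers):

```python
def multistage_search(candidates, stages):
    """stages: list of (score_fn, keep_n) pairs, ordered cheap -> expensive.
    Each stage rescores the surviving candidates and keeps only the top
    keep_n, mirroring the N1 -> N2 -> N3 funnel of Fig. 2."""
    for score_fn, keep_n in stages:
        candidates = sorted(candidates, key=score_fn, reverse=True)[:keep_n]
    return candidates

# Toy example: integers stand in for hit regions; the scorers are arbitrary.
hits = list(range(100))
result = multistage_search(
    hits,
    [(lambda h: h % 10, 20),   # cheap, coarse stage (e.g. phoneme N-gram search)
     (lambda h: h % 7, 5),     # intermediate stage (simplified acoustic rescoring)
     (lambda h: h, 2)],        # most precise stage (acoustic rescoring)
)
print(result)  # [89, 69]
```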

II. SYSTEM OVERVIEW

Our spoken-term detection system consists of two phases: the indexing phase and the search phase (Fig. 1).

A. Indexing phase

The indexing phase operates when any speech data is added to the speech database. An index database is made for fast keyword search in the search phase. The indexing phase works as follows:

1) An LVCSR converts the speech waves into a word confusion network, which is a compact word lattice representation [8].
2) The word indexer makes the word inverted index from the word confusion network. Here, the word inverted index holds the word ID, start time, end time and confidence measure from the LVCSR for each word in the confusion network. We store it as the index for in-vocabulary (IV) terms (IV Index Database).
3) Next, a phoneme recognizer converts the speech waves into phoneme sequences. In parallel, the phoneme recognizer calculates an a posteriori probability for each phoneme at each time frame, and saves the P highest scores in the phoneme a posteriori probability table. This table is used to avoid the heavy pdf (probability density function) calculation in the search phase.
4) An OOV/IV region classifier detects regions that do not contain OOVs, and removes the phoneme recognition results and the phoneme a posteriori probability data corresponding to the detected regions. The details of the classifier are described in Section IV.
5) A phoneme indexer converts the remaining phoneme recognition results into a phoneme N-gram inverted index, which holds the phoneme N-gram ID, start time and end time for each phoneme N-gram in the phoneme recognition results. The phoneme indexer stores the phoneme N-gram inverted index and the phoneme a posteriori probability table as the indexes for OOV terms (OOV Index Database).

B. Search phase

The search phase operates when a keyword is entered by the user, and detects the positions at which the keyword is uttered in the database. The search phase works as follows:

1) First, the IV/OOV selection procedure checks whether the keyword is an IV term or an OOV term. IV terms are sent to the word index search module, and OOV terms are sent to the OOV index search procedure. In the case of a compound word, we split the word using a morphological analysis technique, and all terms are sent to the search procedure after IV/OOV selection.
2) The word index search module refers to the word inverted index and outputs the positions of each IV term.
3) OOV terms are sent to the OOV index search procedure, which consists of three stages: the phoneme N-gram search, simplified acoustic rescoring and acoustic rescoring. The Ni most probable results of each stage are sent to a more accurate next stage to correct their order (Fig. 2). The details are explained in Section III.
4) If the keyword is a compound word, the search results from the IV index search and the OOV index search are merged using their time positions. Only term sequences that are correctly connected are output as detection results. We connect terms A and B if |B's start time − A's end time| < 0.5 sec.

III. DETECTION OF OOV KEYWORDS USING PHONEME N-GRAM SEARCH AND MULTISTAGE RESCORING

This section describes our fast and accurate detection procedure for OOV keywords using phoneme N-gram search and multistage rescoring. As noted in Section I, many conventional studies used phoneme close-matching to search for OOV terms, and such methods are slow and inaccurate. Our idea to overcome this is stepwise search-space reduction using multistage rescoring. Our detection procedure consists of three stages; the most probable top-N detection results of each stage are sent to a more accurate next stage to correct their order (Fig. 2).

In the first stage, we have to seek a very large search space, so we construct a fast phoneme N-gram search procedure. When we search for a keyword, we first convert the keyword into phoneme sequences using conversion rules. Then we search for regions that contain many of the keyword's phoneme N-grams by referring to the phoneme N-gram index. This search method is very fast because it does not consider the order of the phoneme N-grams in the keyword; in exchange, the detection result is not accurate.

Then, we rescore the top-N2 putative regions from the 1st stage using more precise acoustic information. Here, we use the unconstrained-endpoint word-spotting technique. To skip the computationally expensive acoustic analysis calculation, we use the phoneme a posteriori probability table calculated in the indexing phase.

Finally, we score the top-N3 segments from the 2nd stage using a more precise utterance-verification method based on the filler model. A filler model is a simple self-loop network of various fillers such as phonemes or syllables, and we construct an HMM network like that shown in Fig. 3. The difference between the acoustic score of the upper filler-keyword-filler path and that of the lower filler-only path is regarded as the keyword existence score of the segment. As in the 2nd stage, we can execute the 3rd stage efficiently using the phoneme a posteriori probability table. (In addition, we can omit the post-filler model of the upper filler-keyword-filler path, marked "*" in Fig. 3, for calculation efficiency, because we use real-time end-point tracking. Our preliminary experimental results suggested that the retrieval accuracy was conserved even when we omitted that path.)

Fig. 3. Network for Precise Acoustic Rescoring

IV. OOV/IV REGION CLASSIFIER TO REMOVE SUPERFLUOUS OOV INDEXES

We construct an OOV/IV region classifier to remove the superfluous OOV indexes. The OOV/IV region classifier determines whether an interval of speech contains OOV terms of the LVCSR (OOV region) or not (IV region).

A. Conventional Method

There are already some OOV-region detection criteria. The first is the sentence-based criterion. The a posteriori probability that a speech X contains an OOV is expressed as follows:

P(OOV contained | X) = 1 - \sum_{s \in S} P(s|X)
                     = 1 - \frac{\sum_{s \in S} P(X|s)P(s)}{P(X)}
                     \approx 1 - \frac{\sum_{s \in S_{Nbest}} P(X|s)P(s)}{\sum_{ph \in Ph_{Nbest}} P(X|ph)P(ph)},   (1)

where S, S_{Nbest} and Ph_{Nbest} represent all possible word sequences, the N-best word sequences from the LVCSR and the N-best phoneme sequences from the phoneme recognizer, respectively.

The second is the word-based criterion, which uses a confidence measure from the LVCSR. The confidence measure CM(w_i) for a word w_i is represented as

CM(w_i) = \frac{P(X|w_i)P(w_i)}{\sum_{w_j \in W} P(X|w_j)P(w_j)},   (2)

where W represents the word hypotheses for the same interval. Note that the confidence measure becomes low both when there is an OOV and when the input speech is unclear or ungrammatical. However, because there is a correlation between OOV existence and the ambiguity of the LVCSR results, we can judge whether there is an OOV by checking whether the confidence measure of the word in the best recognition path is lower or higher than a threshold.

B. OOV/IV Region Classifier

The main idea of our OOV/IV region classifier is to combine the above two criteria on the basis of machine learning techniques. We construct the OOV/IV region classification function using OOV/IV-labeled training data. The input signal is segmented into regions according to the word boundaries of the LVCSR's hypothesis. Each region is labeled as either IV or OOV according to the corresponding transcription.

We used the logistic regression algorithm as the classifier. The features used for classification are the value of equation (1), the value of equation (2), \sum_{s \in S_{1best}} P(s|X)P(X), and \sum_{ph \in Ph_{Nbest}} P(X|ph)P(ph). The last two are introduced to adjust the value ranges of the numerator and denominator of equation (1), which correspond to the value ranges of the LVCSR results and the phoneme recognition results.

V. EXPERIMENTAL RESULTS

In this section, we show some experimental results for the multistage rescoring and the OOV/IV region classifier.

A. Baseline System

We evaluated the word index search system as the baseline. We used the Corpus of Spontaneous Japanese (CSJ) [9] as the evaluation data. We made a small data set with 6.3 hours of speech and a large data set with 460 hours of speech.

1) Evaluation on Small Data Set: We used 6.3 hours of conference speeches from CSJ, selecting 30 different talkers. We collected 90 length-balanced keywords from the transcription of the speech.

We used Julius [10] for LVCSR. An acoustic model was trained using 50 hours of speaker-open data from CSJ. To analyze the effect of the quality of the language model, we tested two types of language models: the in-domain language model and the out-of-domain language model. The in-domain language model was trained using the transcription of 206 hours of conference speech. The out-of-domain language model was trained using the transcription of 235 hours of monologue speech. 14 keywords (16%) were OOV for the in-domain language model, and 43 keywords (48%) were OOV for the out-of-domain language model.

To measure search ability, we used TopHit and FOM (Figure of Merit). TopHit is the probability that the best candidate of the keyword detection is correct. The FOM is defined as the average of the detection/false-alarm curve taken over the range of 0-10 false alarms per hour per keyword. We defined a candidate hit as correct when the output segment includes the keyword correctly.

The results are listed in TABLE I. The left side of each cell shows the result averaged over only IV terms, and the right side shows the result averaged over all keywords. The word index search system cannot detect OOV terms, so the results severely degrade when averaged over all keywords. The table shows that the TopHit and FOM for IV keywords were very high (97.2% TopHit and 71.0% FOM) when we used the in-domain language model. However, the TopHit and FOM degraded when using the out-of-domain language model (79.7% TopHit and 46.9% FOM).

TABLE I
TOP HIT AND FOM OF WORD INDEX SEARCH FOR SMALL DATA SET

method            | TopHit(%) IV / ALL | FOM(%) IV / ALL
In-domain LM      | 97.2 / 82.1        | 71.0 / 60.0
Out-of-domain LM  | 79.7 / 41.6        | 46.9 / 24.5

2) Evaluation on Large Data Set: We used 212 hours of conference speech and 248 hours of monologue speech from CSJ. We selected 1,000 keywords by the tf/idf metric from the transcription of the test set.

We also evaluated the phoneme N-gram search and compared it with the word index search. We trained an acoustic model using speaker-open 100-hour speech (50 hours of conference speech and 50 hours of monologue speech) from CSJ. Language models for the LVCSR and the phoneme recognizer were trained from the transcription of the same 100-hour data. The vocabulary size of the LVCSR was 17K. We used Julius for LVCSR and phoneme recognition.

TABLE II shows the FOM results for IV keywords, OOV keywords and all keywords. (In this experiment, we used the OOV detection procedure only for the compound words containing OOV terms, for computational simplicity.) In this setting, 389 keywords were OOVs. The conventional word index search method obtained a relatively good score (47.7%) for the IV keywords. However, it cannot detect the OOV keywords. On the contrary, the phoneme N-gram search can detect the OOV words, but its accuracy is not good (17.1%). This result is reasonable because the phoneme N-gram search does not consider the order of the phoneme N-grams in the keyword (see Section III). Note that the simple combination of the above two methods ("Combination") already achieved a great improvement in the FOM over all keywords (29.1% to 35.8%). We aim to improve the accuracy for OOV keywords using the rescoring module, which is evaluated next.

TABLE II
FOM OF WORD INDEX SEARCH AND PHONEME N-GRAM SEARCH FOR LARGE DATA SET

method                | IV   | OOV  | ALL
Word Index Search     | 47.7 | 0.0  | 29.1
Phoneme N-gram Search | 29.5 | 17.1 | 24.7
Combination           | 47.7 | 17.1 | 35.8

B. Evaluation of Multistage Rescoring

We evaluated our multistage rescoring procedure. In this experiment, we used only the small data set (6.3-hour data) for detailed analysis. We used a simple syllable grammar for the phoneme recognition. An acoustic model was trained using 16 hours of originally collected speech. The index database (acoustic likelihood DB and syllable recognition DB) was accumulated at the rate of 54 MB/hour. In this evaluation, we used conventional phoneme close-matching instead of the phoneme N-gram search, because the data set was small enough.

We again used TopHit and FOM. TABLE III shows the results. The results of the conventional phoneme-based method are listed under the setting "stage 1". "Stage 1-2" and "stage 1-3" represent systems that concatenate stage 1 and stage 2, or stage 1 and stage 3. "Stage 1-2-3" is our proposed multistage rescoring system. We adjusted each system so as to respond within 3.0 sec.

TABLE III
TOP HIT AND FOM OF OOV INDEX SEARCH

setting     | search time (sec) for 6.3-hour DB | TopHit (%) | FOM (%)
stage 1     | 0.19                              | 36.6       | 27.9
stage 1-2   | 3.0                               | 67.8       | 55.2
stage 1-3   | 3.0                               | 76.7       | 39.5
stage 1-2-3 | 3.0                               | 85.6       | 58.1

From TABLE III, the results of the conventional phoneme-based ("stage 1") approach were much worse than those of the other settings (36.6% TopHit and 27.9% FOM). The poor results are mainly due to the low syllable recognition rate in this experiment (about 64%). Both the best TopHit (85.6%) and the best FOM (58.1%) were achieved with the setting "stage 1-2-3". This is a great improvement in accuracy compared to the "stage 1" setting, without a large increase in search time (only 2.81 sec).

Compared to the word index search (TABLE I), the accuracy of our OOV index search with multistage rescoring was in the middle between the in-domain language model setting and the out-of-domain language model setting. Considering that constructing an appropriate in-domain language model for a large-scale speech database is very difficult, this result is very encouraging.
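The TopHit measure used in these evaluations can be sketched as follows (an illustrative reimplementation with an assumed data layout, not the authors' evaluation code):

```python
def top_hit_rate(best_hits, truth):
    """TopHit as defined above: the fraction of keywords whose single best
    detection segment includes a true occurrence of the keyword.
    best_hits: keyword -> (start, end) of its highest-scored candidate.
    truth: keyword -> list of true (start, end) occurrences.
    Both layouts are assumptions made for this sketch."""
    correct = sum(
        any(s <= t0 and t1 <= e for t0, t1 in truth.get(kw, []))
        for kw, (s, e) in best_hits.items()
    )
    return correct / len(best_hits)

# One best candidate covers a true occurrence, the other misses it entirely.
best = {"kokubunji": (10.0, 11.2), "julius": (50.0, 50.3)}
truth = {"kokubunji": [(10.2, 11.0)], "julius": [(60.0, 60.5)]}
print(top_hit_rate(best, truth))  # 0.5
```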

C. Evaluation of OOV/IV Region Classifier

We evaluated our OOV/IV region classifier. In this experiment, we used 10 hours of conference speech from CSJ. We split the data into 5 subsets and defined 1 subset as test data and the remaining 4 subsets as training data. We iterated this process 5 times and averaged the evaluation results.

First, we evaluated the ability to reduce the superfluous OOV indexes. We defined the OOV detection rate and the removed index rate as the measures of the OOV/IV region classification. They are defined as follows:

OOVDetectionRate = \frac{\text{# of morae in correctly classified OOV regions}}{\text{# of morae in OOV regions}}   (3)

RemovedIndexRate = \frac{\text{# of morae in regions classified as IV}}{\text{# of morae in all regions}}   (4)

Fig. 4. Relationship between Removed Index Rate and OOV Detection Rate

The results are shown in Fig. 4. The vertical axis indicates the OOV detection rate and the horizontal axis indicates the index reduction rate. We changed the OOV/IV misclassification cost and plotted each result on the plane³. In Fig. 4, "all features" is our proposed OOV/IV region classifier. "Sentence-based" and "word-based" show the results obtained by using only the sentence-based criterion or only the word-based criterion. We also plotted the line obtained when indexes are removed randomly ("random selection"); in this case, the OOV detection rate is inversely proportional to the removed index rate. As shown in Fig. 4, our "all features" OOV/IV region classifier was superior to the others in OOV detection rate. For example, if we keep 80% of the OOVs, we can remove about 46% of the superfluous OOV indexes, which is about 8 points higher than the "sentence-based" criterion and 18 points higher than the "word-based" criterion.

Fig. 5. Relationship between Removed Index Rate and Figure of Merit
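For concreteness, the mora-weighted rates of equations (3) and (4) can be computed from labeled regions as in this sketch (the region representation is an assumption made for illustration):

```python
def classifier_rates(regions):
    """Compute (OOV detection rate, removed index rate) per equations (3), (4).
    regions: list of (num_morae, true_label, predicted_label) tuples with
    labels 'IV' or 'OOV' -- a hypothetical layout for this illustration."""
    oov_total = sum(m for m, true, _ in regions if true == "OOV")
    oov_kept = sum(m for m, true, pred in regions if true == "OOV" and pred == "OOV")
    removed = sum(m for m, _, pred in regions if pred == "IV")  # indexes deleted
    total = sum(m for m, _, _ in regions)
    return oov_kept / oov_total, removed / total

# Toy data: 30 morae in total, 10 of them inside true OOV regions,
# half of which the classifier mislabels as IV.
regions = [(10, "IV", "IV"), (5, "OOV", "OOV"), (5, "OOV", "IV"), (10, "IV", "IV")]
detection, removed = classifier_rates(regions)
print(detection, removed)  # 0.5 and about 0.83
```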
The above results suggest the ability of the OOV/IV region classification. However, they do not show how the classification affects search accuracy. Next, we evaluated the search accuracy using the OOV/IV region classification. We used a simple phoneme close-matching technique for computational simplicity. We used Julius [10] as the phoneme recognizer, with a syllable 3-gram language model trained using 206 hours of speaker-open speech from the conference speech of CSJ. The acoustic model was trained using 50 hours of speaker-open speech from CSJ.

The results are shown in Fig. 5. The vertical axis indicates the FOM and the horizontal axis indicates the index reduction rate of the superfluous OOV indexes. We again changed the OOV/IV misclassification cost and plotted each result. We can see that the word-based criterion did not work appropriately; the FOM degraded severely even when only a small portion of the indexes was eliminated. The sentence-based criterion maintains the FOM until about 40% of the OOV indexes are eliminated. Finally, our OOV/IV region classifier ("all features") can remove about 55% of the OOV indexes with only a 4% decrease in the FOM.

³ We used the MetaCost algorithm for cost-sensitive machine learning.

VI. PROTOTYPE SYSTEM FOR 2,000-HOUR DATABASE

This section describes our prototype system, which can detect any keyword within a few seconds from our proprietary 2,031-hour speech database. The hardware architecture of the system is shown in Fig. 6. We used two Linux PCs, each with a Xeon 2.13 GHz CPU and 2 GB of memory, for the OOV and IV detection procedures. The capacities of the word inverted index database and the phoneme inverted index database were 765 and 244 MB, respectively. We used the settings N2 = 1000, N3 = 50 and P = 30. With these parameters, the capacity of the phoneme a posteriori probability table became 52.1 GB. We had not yet tried to reduce the OOV index database using the OOV/IV region classifier. For the 2,000-hour speech, we could avoid the index size problem by storing the phoneme a posteriori probability table on a solid-state disk (SSD), which gave us very rapid random access to the table. However, the problem will become serious in the near future, so we now plan to implement our index reduction procedure. We tested some IV and OOV keywords and confirmed that our system can detect them appropriately.

Fig. 6. Hardware Architecture for Spoken-Term Detection

We evaluated the search time of the OOV detection procedure. The average search time and a histogram of the search times for 1,000 randomly selected keywords are shown in Fig. 7 and Fig. 8. The average length of a keyword was 5.2 morae. Figure 7 indicates that the main computational cost (56% of the total execution time) is the SSD access time. As we showed in Section V-B, we can obtain good detection results using the 2nd and 3rd stages; however, these stages require more execution time. The 1st stage is very fast, taking only 0.147 seconds to search the 2,031-hour database. The average search time using all stages was 2.17 seconds. From Fig. 8, although the execution time has a large variation, the detection procedure usually finishes within about 3-4 seconds.

Fig. 7. Average Execution Time of OOV Detection Procedure

Fig. 8. Histogram of Execution Time of OOV Detection Procedure

Next, we evaluated the search time of the IV detection. We used the morphological analysis system ChaSen [11] to split the compound words, and removed 124 OOV terms from the keyword set. The average length of a keyword in this set was 5.1 morae, and the average number of words was 1.47. The detection time for each IV keyword was 2.92 milliseconds on average.

VII. CONCLUSION

This paper presented our recent attempt to make a super-large scale spoken-keyword detection system, which can detect any keyword uttered in a 2,000-hour speech database within a few seconds. To achieve fast and accurate open-vocabulary spoken-keyword detection, we introduced the multistage rescoring strategy, which uses several search methods to reduce the search space in a stepwise fashion. Furthermore, we constructed an OOV/IV region classifier, which allows us to reduce the huge index for OOVs. We showed some evaluation results for the multistage rescoring and the OOV/IV region classifier, and made a prototype system which can handle 2,031 hours of speech. We confirmed that the system can search for a keyword within 2.17 seconds on average even if the keyword is an OOV.

REFERENCES

[1] NIST, "The spoken term detection (STD) 2006 evaluation plan," 2006. [Online]. Available: http://www.nist.gov/speech/tests/std/docs/std06-evalplan-v10.pdf
[2] P. Yu et al., "Vocabulary-independent indexing of spontaneous speech," IEEE Trans. Speech and Audio Processing, vol. 13, no. 5, part 1, pp. 635-643, 2005.
[3] S. Dharanipragada et al., "A multistage algorithm for spotting new words in speech," IEEE Trans. Speech and Audio Processing, vol. 10, no. 8, pp. 542-550, 2002.
[4] S. Renals et al., "Indexing and retrieval of broadcast news," Speech Communication, vol. 32, no. 1/2, pp. 5-20, 2000.
[5] R. Wallace et al., "A phonetic search approach to the 2006 NIST Spoken Term Detection evaluation," in Proc. Interspeech, 2007, pp. 2385-2388.
[6] K. Iwata et al., "Open-vocabulary spoken document retrieval based on new subword models and subword phonetic similarity," in Proc. Interspeech, 2006, pp. 325-328.
[7] Y. Itoh et al., "Two-stage vocabulary-free spoken document retrieval: subword identification and re-recognition of the identified sections," in Proc. Interspeech, 2006, pp. 1161-1164.
[8] L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: word error minimization and other applications of confusion networks," arXiv preprint cs.CL/0010012, 2000.
[9] K. Maekawa, "Corpus of Spontaneous Japanese: its design and evaluation," in Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003, pp. 7-12.
[10] T. Kawahara, A. Lee, K. Takeda, K. Itou, and K. Shikano, "Recent progress of open-source LVCSR engine Julius and Japanese model repository," in Proc. ICSLP, vol. IV, 2004, pp. 3069-3072.
[11] Y. Matsumoto, A. Kitauchi, T. Yamashita, Y. Hirano, H. Matsuda, K. Takaoka, and M. Asahara, "Morphological Analysis System ChaSen version 2.29 Manual," Nara Institute of Science and Technology, 2002.
