Keyword Detection

Abstract—This paper presents our recent attempt to build a super-large-scale spoken-term detection system that can detect any keyword uttered in a 2,000-hour speech database within a few seconds. There are three problems in achieving such a system. The system must be able to detect out-of-vocabulary (OOV) terms (the OOV problem). The system has to respond to the user quickly without sacrificing search accuracy (the search speed and accuracy problem). The pre-stored index database should be sufficiently small (the index size problem). We introduced a phoneme-based search method to detect the OOV terms and combined it with the LVCSR-based method. To search for a keyword in large-scale speech databases accurately and quickly, we introduced a multistage rescoring strategy that uses several search methods to reduce the search space in a stepwise fashion. Furthermore, we constructed an out-of-vocabulary/in-vocabulary region classifier, which allows us to reduce the size of the index database for OOVs. We describe the prototype system and present some evaluation results.

I. INTRODUCTION

With the growth of high-capacity storage devices and large-scale networks, more speech databases are being accumulated and utilized. It is no longer difficult to accumulate one year of TV programs or the speech stream of an entire human life. To manage such large speech databases, techniques that can extract only useful information "quickly" and "accurately" are highly desired.

This paper presents our recent attempt to build a super-large-scale spoken-term detection system that can detect any keyword uttered in a 2,000-hour speech database within a few seconds. Although there are already many studies of spoken-term detection [1], [2], [3], there is no system that treats such large databases. There are three problems in achieving such a system. The system must be able to detect out-of-vocabulary (OOV) terms (the OOV problem). The system has to respond to the user quickly without sacrificing search accuracy (the search speed and accuracy problem). The pre-stored index database should be small enough (the index size problem).

First, we consider the OOV problem of large speech databases. One simple approach to spoken-term detection is to execute large vocabulary continuous speech recognition (LVCSR) in advance and to search for a keyword in the recognized word sequence or word lattice [4]. This approach usually uses a position table that can be consulted to find the positions of dictionary words (called a word inverted index), so it can detect a keyword very rapidly. However, this method has a serious defect in that it cannot detect the LVCSR's out-of-vocabulary (OOV) terms. Although proper nouns and newly created words have great importance in many applications, they tend to be OOVs because of their scarcity. Large-scale databases potentially contain many OOVs, so the system must be able to detect them. To detect the OOV keywords, we introduce a phoneme-based search technique and combine it with the conventional LVCSR method.

The second issue is the search speed and search accuracy problem of an open-vocabulary spoken-keyword detection system. To detect OOV keywords, conventional phoneme-based systems used a phoneme-based close-matching technique [5], [6], [7]. Its main defect is its slow search speed due to the complex close-matching procedure. Furthermore, the phonetic approach often produces many false-positive hits. To solve those problems, we introduce a rescoring strategy, which first uses very simplified phoneme matching to coarsely reduce the search space and then rescores the reduced search space with a more accurate matching technique. We propose a multistage rescoring procedure that reduces the search space in a stepwise fashion. This enables us to search for an OOV keyword quickly and accurately.

Finally, we consider the index size problem. We combined the phoneme-based method and the LVCSR-based method for fast and accurate search. However, the capacity of the index database becomes larger than that of either single method. Furthermore, we use a very accurate rescoring procedure for the detection of OOV keywords, and this procedure requires a very large index capacity (about 54 MB per hour of speech). Such a large index database is hard to store in memory when the database size becomes extremely large, and that causes a significant decrease in search speed. Therefore, we need some index-reduction technique. To solve this problem, we propose an out-of-vocabulary/in-vocabulary region classifier, which can accurately detect the regions that do not contain OOVs and remove the redundant indexes used for OOVs.

In the next section, we give an overview of our system. In section III, we explain the multistage rescoring strategy, which enables us to search for a keyword quickly and accurately. In section IV, we explain the OOV/IV region classifier, which greatly reduces the redundant phoneme indexes. Finally, we show some experimental results in section V and describe our first prototype system in section VI.
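As an illustration of the word inverted index mentioned above, the following is a minimal sketch (all names and the toy data are hypothetical, not from the paper): the LVCSR output is scanned once, and each recognized dictionary word is mapped to the positions where it occurs, so a keyword lookup is a single dictionary access.

```python
# Hypothetical sketch of a word inverted index over LVCSR output.
# Each dictionary word maps to the (utterance id, start time) positions
# where the recognizer hypothesized it.
from collections import defaultdict

def build_word_index(transcripts):
    """transcripts: {utt_id: [(word, start_sec), ...]} from LVCSR."""
    index = defaultdict(list)
    for utt_id, words in transcripts.items():
        for word, start in words:
            index[word].append((utt_id, start))
    return index

def lookup(index, keyword):
    """Average O(1) lookup; an OOV keyword simply returns no hits."""
    return index.get(keyword, [])

transcripts = {
    "utt1": [("speech", 0.5), ("retrieval", 1.2)],
    "utt2": [("speech", 3.1)],
}
index = build_word_index(transcripts)
print(lookup(index, "speech"))   # hits in both utterances
```

This illustrates both the speed of the approach and its defect: a keyword absent from the LVCSR vocabulary can never appear in the index.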
Fig. 1. Overview of Spoken-Term Detection System

Fig. 2. Overview of Out-of-vocabulary Index Search Procedure
(*OOV: out of vocabulary; *IV: in vocabulary)
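The coarse first-stage search of Fig. 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and toy phoneme data are hypothetical. Each indexed region is scored by how many of the keyword's phoneme N-grams it contains, deliberately ignoring their order.

```python
# Sketch of a first-stage phoneme N-gram search: an inverted index from
# phoneme N-grams to region ids, and a coarse scorer that counts how many
# of the keyword's N-grams each region shares (order is ignored).
from collections import defaultdict

def ngrams(phonemes, n=3):
    return {tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)}

def build_ngram_index(regions, n=3):
    """regions: {region_id: [phoneme, ...]} -> {ngram: {region_id, ...}}."""
    index = defaultdict(set)
    for rid, phs in regions.items():
        for g in ngrams(phs, n):
            index[g].add(rid)
    return index

def coarse_search(index, keyword_phonemes, n=3, top=10):
    hits = defaultdict(int)                 # region id -> matched N-gram count
    for g in ngrams(keyword_phonemes, n):
        for rid in index.get(g, ()):
            hits[rid] += 1
    # Highest-overlap regions first; these top-N candidates would be
    # handed to the more precise rescoring stages.
    return sorted(hits.items(), key=lambda kv: -kv[1])[:top]
```

Because the scorer only counts set intersections against a precomputed index, it is fast enough for a very large search space, at the cost of the accuracy that the later rescoring stages must restore.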
multistage rescoring. Our detection procedure consists of three stages; the most probable top-N detection results of each stage are sent to a more accurate next stage to correct their order (Fig. 2).

In the first stage, we have to seek a very large search space, so we construct a fast phoneme N-gram search procedure. When we search for a keyword, we first convert the keyword into phoneme sequences using conversion rules. Then we search for the regions that contain many of the keyword's phoneme N-grams by referring to the phoneme N-gram index. This search method is very fast because it does not consider the order of the phoneme N-grams in the keyword. In exchange, the detection result is not accurate.

Then, we rescore the top-N2 putative regions from the 1st stage using more precise acoustic information. Here, we used the unconstrained-endpoint word-spotting technique. To skip the computationally expensive acoustic analysis, we use the phoneme a posteriori probability table calculated in the indexing phase.

Finally, we score the top-N3 segments from the 2nd stage using a more precise utterance-verification method based on the filler model. A filler model is a simple self-loop network of various fillers such as phonemes or syllables, and we construct an HMM network like that shown in Fig. 3. The difference between the acoustic score of the upper filler-keyword-filler path and that of the lower filler-only path is regarded as the keyword existence score of the segment. As in the case of the 2nd stage, we can execute the 3rd stage efficiently using the phoneme a posteriori probability table.¹

¹ In addition, we can omit the post-filler model of the upper filler-keyword-filler path (marked as "∗" in Fig. 3) for calculation efficiency, because we use real-time end-point tracking. Our preliminary experimental results suggested that the retrieval accuracy was conserved even if we omitted that path.

Fig. 3. Network for Precise Acoustic Rescoring

IV. OOV/IV REGION CLASSIFIER TO REMOVE SUPERFLUOUS OOV INDEXES

We construct an OOV/IV region classifier to remove the superfluous OOV indexes. The OOV/IV region classifier classifies whether an interval of speech contains OOV terms of the LVCSR (an OOV region) or not (an IV region).

A. Conventional Method

There are already some OOV-region detection criteria. The first is the sentence-based criterion. The a posteriori probability that a speech X contains an OOV is expressed as follows:

P(OOV contained|X) = 1 − Σ_{s∈S} P(s|X)
                   = 1 − Σ_{s∈S} P(X|s)P(s) / P(X)
                   ≈ 1 − Σ_{s∈S_Nbest} P(X|s)P(s) / Σ_{ph∈Ph_Nbest} P(X|ph)P(ph),   (1)

where S, S_Nbest and Ph_Nbest represent all possible word sequences, the N-best word sequences from the LVCSR, and the N-best phoneme sequences from the phoneme recognizer, respectively.

The second is the word-based criterion, which uses a confidence measure from the LVCSR. The confidence measure CM(w_i) for a word w_i is represented as below:

CM(w_i) = P(X|w_i)P(w_i) / Σ_{w_j∈W} P(X|w_j)P(w_j),   (2)

where W represents the word hypotheses for the same interval. Note that the confidence measure becomes low both when there is an OOV and when the input speech is unclear or ungrammatical. However, because there is a correlation between the existence of an OOV and the ambiguity of the LVCSR results, we can judge whether there is an OOV by checking whether the confidence measure of the word in the best recognition path is lower or higher than a threshold.

B. OOV/IV Region Classifier

The main idea of our OOV/IV region classifier is to combine the above two criteria on the basis of machine learning techniques. We construct the OOV/IV region classifier function using OOV/IV-labeled training data. The input signal is segmented into regions according to the word boundaries of the LVCSR's hypothesis. Each region is labeled as either IV or OOV according to the corresponding transcription.

We used the logistic regression algorithm as the classifier. The features used for classification are the value of equation (1), the value of equation (2), Σ_{s∈S_1best} P(s|X)P(X), and Σ_{ph∈Ph_Nbest} P(X|ph)P(ph). The last two are introduced to adjust the value ranges of the numerator and denominator of equation (1), which correspond to the value ranges of the LVCSR results and the phoneme recognition results.

V. EXPERIMENTAL RESULTS

In this section, we show some experimental results for the multistage rescoring and the OOV/IV region classifier.

A. Baseline System

We evaluated the word index search system as the baseline. We used the Corpus of Spontaneous Japanese (CSJ) [9] as the evaluation data. We made a small data set with 6.3 hours of speech and a large data set with 460 hours of speech.

1) Evaluation on Small Data Set: We used 6.3 hours of conference speeches from the CSJ, selecting 30 different talkers. We collected 90 length-balanced keywords from the transcription of the speech.

We used Julius [10] for the LVCSR. The acoustic model was trained using 50 hours of speaker-open data from the CSJ. To analyze the effect of the quality of the language model, we
tested two types of language models: the in-domain language model and the out-of-domain language model. The in-domain language model was trained using the transcription of 206 hours of conference speech. The out-of-domain language model was trained using the transcription of 235 hours of monologue speech. 14 keywords (16%) were OOV for the in-domain language model, and 43 keywords (48%) were OOV for the out-of-domain language model.

To measure the search ability, we used the TopHit and the FOM (Figure of Merit). TopHit is the probability that the best candidate of the keyword detection is correct. The FOM is defined as the average of the detection/false-alarm curve taken over the range of 0–10 false alarms per hour per keyword. We defined a candidate hit as correct when the output segment correctly includes the keyword.

The results are listed in TABLE I. The left side of each cell shows the result averaged over only the IV terms, and the right side shows the result averaged over all keywords. The word index search system cannot detect OOV terms, so the results severely degrade when averaged over all keywords. The table shows that the TopHit and FOM for IV keywords were very high (97.2% TopHit and 71.0% FOM) when we used the in-domain language model. However, the TopHit and FOM degraded when using the out-of-domain language model (79.7% TopHit and 46.9% FOM).

TABLE I
TOP HIT AND FOM OF WORD INDEX SEARCH FOR SMALL DATA SET

method           | TopHit(%) IV / ALL | FOM(%) IV / ALL
In-domain LM     | 97.2 / 82.1        | 71.0 / 60.0
Out-of-domain LM | 79.7 / 41.6        | 46.9 / 24.5

2) Evaluation on Large Data Set: We used 212 hours of conference speech and 248 hours of monologue speech from the CSJ. We selected 1,000 keywords by tf/idf metrics from the transcription of the test set.

We also evaluated the phoneme N-gram search and compared it with the word index search. We trained an acoustic model using speaker-open 100-hour speech (50 hours of conference speech and 50 hours of monologue speech) from the CSJ. The language models for the LVCSR and the phoneme recognizer were trained from the transcription of the same 100-hour data. The vocabulary size of the LVCSR was 17K. We used Julius for the LVCSR and phoneme recognition.

TABLE II shows the FOM results for IV keywords, OOV keywords and all keywords.² At that time, 389 keywords were OOVs. The conventional word index search method obtained a relatively good score (47.7%) for the IV keywords. However, it cannot detect the OOV keywords. On the contrary, the phoneme N-gram search can detect the OOV words, but its accuracy is not good (17.1%). The result is reasonable because the phoneme N-gram search does not consider the order of the phoneme N-grams in the keyword (see section III). Note that the simple combination of the above two methods ("combination") already achieved a great improvement in the FOM for all keywords (29.1% to 35.8%). We aim to improve the accuracy for OOV keywords using the rescoring module, which will be evaluated next.

² In this experiment, we used the OOV detection procedure only for the compound words containing OOV terms, for computational simplicity.

TABLE II
FOM OF WORD INDEX SEARCH AND PHONEME N-GRAM SEARCH FOR LARGE DATA SET

method                | IV   | OOV  | ALL
Word Index Search     | 47.7 | 0.0  | 29.1
Phoneme N-gram Search | 29.5 | 17.1 | 24.7
Combination           | 47.7 | 17.1 | 35.8

B. Evaluation of Multistage Rescoring

We evaluated our multistage rescoring procedure. In this experiment, we used only the small data set (6.3-hour data) for detailed analysis. This time, we used a simple syllable grammar for the phoneme recognition. The acoustic model was trained using 16 hours of originally collected speech. The index database (acoustic likelihood DB and syllable recognition DB) was accumulated at the rate of 54 MB/hour. In this evaluation, we used conventional phoneme close-matching instead of the phoneme N-gram search, because the data set was small enough.

We again used the TopHit and FOM. TABLE III shows the results. The results of the conventional phoneme-based method are listed under the setting "stage 1". "Stage 1-2" and "stage 1-3" represent the systems that concatenate stage 1 with stage 2, or stage 1 with stage 3. "Stage 1-2-3" is our proposed multistage rescoring system. We adjusted each system so as to respond within 3.0 sec.

TABLE III
TOP HIT AND FOM OF OOV INDEX SEARCH

setting     | search time (sec) for 6.3-hour DB | TopHit(%) | FOM(%)
stage 1     | 0.19                              | 36.6      | 27.9
stage 1-2   | 3.0                               | 67.8      | 55.2
stage 1-3   | 3.0                               | 76.7      | 39.5
stage 1-2-3 | 3.0                               | 85.6      | 58.1

From TABLE III, the results of the conventional phoneme-based ("stage 1") approach were much worse than those of the other settings (36.6% TopHit and 27.9% FOM). The poor results are mainly due to the low syllable recognition rate in this experiment (about 64%). Both the best TopHit (85.6%) and the best FOM (58.1%) were achieved with the setting "stage 1-2-3". There is a great improvement in accuracy compared to the "stage 1" setting, without a large increase in search time (only 2.81 sec).

Compared to the word index search (TABLE I), the accuracy of our OOV index search with multistage rescoring was midway between the in-domain language model setting and the out-of-domain language model setting. Considering the fact that constructing an appropriate in-domain language model for a large-scale speech database is very difficult, this result is very encouraging.
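The FOM used throughout these evaluations can be approximated as follows. This is a simplified reading of the definition given above (the average detection rate over 1–10 false alarms per hour per keyword), not the authors' scoring code; the function name and toy data are hypothetical, and the standard wordspotting FOM additionally interpolates between operating points.

```python
# Rough Figure-of-Merit sketch: sort putative hits by score, sweep down the
# list, and average the detection rate observed at 1..10 false alarms per
# hour of speech.
def fom(hits, n_true, hours):
    """hits: [(score, is_correct), ...]; n_true: reference occurrences."""
    hits = sorted(hits, key=lambda h: -h[0])
    detected, rates = 0, []
    i = fa = 0
    for k in range(1, 11):                  # operating points: k FA/hour
        budget = k * hours
        while i < len(hits) and fa < budget:
            score, correct = hits[i]
            i += 1
            if correct:
                detected += 1
            else:
                fa += 1
        rates.append(detected / n_true)     # detection rate at this point
    return 100.0 * sum(rates) / len(rates)  # percentage, as in the tables
```

Under this reading, a system that finds every occurrence before accumulating one false alarm per hour scores 100%, which matches the intuition behind the FOM columns in the tables.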
C. Evaluation of OOV/IV Region Classifier

[Figures omitted: a histogram of the number of keywords, and the prototype system configuration — PC1 (phoneme index search and simplified acoustic rescoring; phoneme probability table in memory and on solid-state disk) connected via TCP/IP to PC2 (word index search, IV/OOV selection, and result merging; word inverted index in memory).]
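The OOV/IV region classifier of section IV combines the confidence criteria with logistic regression. The following toy sketch shows the idea with a plain gradient-descent trainer; the two features here ("sentence-level OOV score" and "word confidence") are hypothetical stand-ins for the paper's actual features (equations (1) and (2) plus the two range-adjustment sums), and all names and data are illustrative.

```python
# Toy logistic-regression OOV/IV region classifier (features are made up;
# the paper uses eq. (1), eq. (2) and two range-adjustment sums).
import math

def train_logreg(xs, ys, lr=0.5, epochs=2000):
    """xs: feature vectors per region; ys: 1 for OOV region, 0 for IV region."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))          # sigmoid
            g = p - y                               # gradient of log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def classify(w, b, x, threshold=0.5):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return "OOV" if p >= threshold else "IV"

# Toy labeled regions: [sentence-level OOV score, word confidence]
xs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
ys = [1, 1, 0, 0]
w, b = train_logreg(xs, ys)
```

Regions classified as IV can then have their phoneme-level (OOV) indexes dropped, which is the index-reduction step the classifier exists for.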