
Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

Alëna Aksënova1, Zhehuai Chen1, Chung-Cheng Chiu2, Daan van Esch1, Pavel Golik1, Wei Han2,
Levi King1, Bhuvana Ramabhadran1, Andrew Rosenberg1, Suzan Schwartz1, Gary Wang1

Google Speech1, Google Brain2


{alenaks,zhehuai,chungchengc,dvanesch,golik,weihan,
leviking,bhuv,rosenberg,suzan,wgary}@google.com

arXiv:2205.08014v1 [eess.AS] 16 May 2022

Abstract

Building inclusive speech recognition systems is a crucial step towards developing technologies that speakers of all language varieties can use. Therefore, ASR systems must work for everybody independently of the way they speak. To accomplish this goal, there should be available data sets representing language varieties, and also an understanding of model configuration that is the most helpful in achieving robust understanding of all types of speech. However, there are not enough data sets for accented speech, and for the ones that are already available, more training approaches need to be explored to improve the quality of accented speech recognition. In this paper, we discuss recent progress towards developing more inclusive ASR systems, namely, the importance of building new data sets representing linguistic diversity, and exploring novel training approaches to improve performance for all users. We address recent directions within benchmarking ASR systems for accented speech, measure the effects of wav2vec 2.0 pre-training on accented speech recognition, and highlight corpora relevant for diverse ASR evaluations.

Index Terms: speech recognition, accented speech, pre-training, benchmarking, wav2vec 2.0, linguistic diversity

1. Introduction

The goal of Automatic Speech Recognition (ASR) systems is to transcribe speech, allowing voices from speakers of differing accents and language varieties to be understood. Emerging research aims to address this topic by quantifying the quality of ASR models depending on various parameters (e.g. gender, accent), measuring bias in ASR models, identifying data sets to benchmark quality, and improving model robustness for linguistic diversity. Moreover, diverse speech recognition has increasingly become the focus of researchers and policy makers. For example, France is considering banning discrimination against accents¹; and some researchers aim to investigate how the bias affecting certain groups of speakers arises in speech recognition models, and how to mitigate this bias [1].

Literature reporting on results of various model evaluations shows that ASR models seem to frequently struggle with accented speech, be it achieving significantly higher word error rate (WER) for non-standard language varieties such as African American Vernacular English [2], code-switching [3], or observing increased WER in certain geographic areas that could be correlated with regional dialects [4]. In turn, this can make speakers of certain linguistic varieties feel excluded, and might prompt them to try "standardizing" and slowing down their speech [5]. This is why measuring bias and finding ways to improve ASR models given such factors as linguistic diversity, data unavailability, model architecture, and others is so crucial.

An increasing amount of research focuses on finding model configurations or approaches that would help to achieve comparable recognition rates for speakers of different language varieties. Accented speech is indeed a challenge for ASR, and some research questions the readiness of end-to-end models for industrial use [3]. ASR has a significant social impact, since its use cases range across multiple applications, including high stakes areas such as medical transcription [6].

In our paper, Section 2 presents an overview of the latest methods and challenges within the scope of ASR benchmarking, focusing on recognizing linguistically diverse speech. Section 3 provides a brief overview of approaches to improving ASR systems for accented speech, emphasizing experiments exploring various types of wav2vec 2.0 pre-training with different types of data, e.g. spontaneous unsupervised, synthetic, and accented. Finally, in Section 4, we outline the current state of accented data sets, discussing those publicly available, and highlighting linguistic aspects insufficiently represented.

¹ https://www.bbc.com/news/world-europe-55069048

2. Benchmarking ASR systems for accents

Understanding demographic information is crucial for benchmarking ASR quality for diverse groups of speakers, such as gender, age, speakers of sociolects or regional language varieties, non-native (L2) speakers, people experiencing speech impairments, and more [7]. Indeed, WER on a corpus of African American Vernacular English (AAVE) is sometimes observed to be as much as 85% higher than WER for a corpus of Standard American English [2]. Researchers also show that the quality of speech recognition might not be the same for various genders [8, 9]. A study exploring the link between increased ASR errors and dialectal features demonstrated that geographic areas with increased WER might significantly overlap with regions that have certain dialectal features [4]. For example, higher WER scores are observed in the Southern United States, likely due to certain linguistic features of AAVE and Southern American English spoken in those regions. This shows that metadata such as sociolect, gender, age, location, native language, and more are determining for a multifaceted analysis of diverse ASR quality.

It is important to point out that accents, dialects and other language varieties are complex linguistic phenomena that span multiple dimensions. Obviously, differences in pronunciation affect the relationship between acoustic input and written output symbols. But dialects can also vary in grammar and vocabulary (a common feature e.g. in AAVE or German as spoken outside of Germany), affecting the distribution over the target sequences p(y). This poses a challenge to the language modeling capability, whether in an external LM or one implicitly learned by a transducer model.
It is difficult to accurately isolate linguistic features from other aspects affecting ASR quality and control for all variables. Even acoustic conditions such as recording quality and background noise can correlate with geographic regions, which are in turn correlated to demographic differences between populations. Not only will the conversation style (spontaneous, prompted phone calls, scripted media, interviews, etc.) impact recognition accuracy, but also its content. A benchmark that does not address these issues explicitly is likely to confound multiple aspects and mask the real ASR accuracy.

Benchmarking ASR quality when it comes to accented speech is challenging, since multiple parameters should be accounted for: WER for different speech varieties, number of affected speakers, types and severity of misrecognitions, speed, and others. Importantly, to understand how an ASR model performs for various groups of speakers, there should be a representative data set for every group, and any overall scores should take all individual scores for speaker groups into account.

One approach to having a deeper insight into ASR quality is to consider a distribution of per-utterance WER scores for the given corpus, and determine the median and the last decile P90 of the distribution instead of a single WER score [10]. Alternatively, it is possible to use a population-weighted evaluation [7] where the overall WER is defined by all WER scores per speaker group normalized by the number of speakers. In the latter scenario, the number of speakers within that group defines the weight of the WER scores for the corresponding accents.
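
Both aggregation schemes are easy to express in code. The sketch below (plain Python; the per-utterance record fields "ref"/"hyp" and the group bookkeeping are assumptions for illustration, not an API from [7] or [10]) computes the median and last-decile P90 of per-utterance WER, and a population-weighted overall WER in which each group's score is weighted by its number of speakers.

```python
from statistics import median, quantiles

def wer(ref: str, hyp: str) -> float:
    """Word error rate of one utterance via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def per_utterance_stats(utterances):
    """Median and approximate 90th percentile (P90) of per-utterance WER,
    reported instead of a single pooled corpus score (as in [10])."""
    scores = [wer(u["ref"], u["hyp"]) for u in utterances]
    return median(scores), quantiles(scores, n=10)[-1]

def population_weighted_wer(group_wer, group_sizes):
    """Overall WER as a speaker-count-weighted average of per-group WERs (as in [7])."""
    total = sum(group_sizes.values())
    return sum(group_wer[g] * group_sizes[g] / total for g in group_wer)
```

For example, population_weighted_wer({"AAVE": 0.14, "L2": 0.05}, {"AAVE": 300, "L2": 100}) returns 0.1175, weighting the larger group three times as heavily as the smaller one.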
However, if no transcribed or supervised data for language varieties is available, an alternative approach to testing ASR system fairness would be to use available untranscribed samples. In this approach, samples from diverse speaker groups undergo various acoustic transformations (e.g. noise, frequency scaling, etc.), and the scorer checks how the resulting recognition is affected by those transformations [11]. The overall quality is likely to be lower for a certain accent if these synthetic perturbations significantly worsen the recognition rates of the corresponding accented set. In this case, the data does not necessarily need to be supervised, since it is the stability of the machine transcript that gives insight into the confusion of the model due to the introduced transformations.
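
When no reference transcripts are available, the same wer() helper can instead score how much the machine transcript drifts once the audio is perturbed. The sketch below is a minimal illustration of that idea, not the AequeVox implementation from [11]: transcribe stands in for whatever ASR system is being audited (a hypothetical callable from waveform to text), and added noise is just one example transformation.

```python
import numpy as np

def add_noise(audio: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Mix white noise into a waveform at a given signal-to-noise ratio."""
    signal_power = float(np.mean(audio ** 2)) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    return audio + np.random.randn(len(audio)) * np.sqrt(noise_power)

def transcript_drift(group_audio, transcribe, perturbations=(add_noise,)):
    """Reference-free robustness probe: per speaker group, average WER between
    the transcript of the clean audio and the transcripts of its perturbed copies.
    Reuses the wer() helper defined in the previous sketch."""
    drift = {}
    for group, audios in group_audio.items():
        scores = []
        for audio in audios:
            clean_hyp = transcribe(audio)
            for perturb in perturbations:
                scores.append(wer(clean_hyp, transcribe(perturb(audio))))
        drift[group] = sum(scores) / max(len(scores), 1)
    return drift  # consistently larger drift for one accent suggests weaker recognition
```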
3. Improving quality for accented speech

To find ways of improving ASR quality for all speakers, it is important to study the impact of the training data, training strategies, and model configurations on the evaluations that use accented test sets. In this section, we present several experiments exploring effects of wav2vec 2.0 pre-training using various types of unsupervised data, and discuss their impact on the recognition of AAVE and non-native (L2) accented speech.

3.1. Previous work

While it is possible to achieve higher accuracy with specialized models trained on various accents separately, it is usually more convenient to train a single model on multiple accents simultaneously. Such multi-dialect models tend to be easier to maintain, more robust and require no explicit dialect selection, which may be inaccurate. One prominent line of research focuses on understanding which parts of the neural networks process and encode accent information. For example, research exploring the weights of the hidden layers of an end-to-end system DeepSpeech2 for accented speech shows that the first RNN layer contains the most information about accents [12]. It suggests that this part of an end-to-end model can be adapted to learn abstract representations that are less accent specific.

Other researchers employ an accent classifier to explicitly annotate utterances with certain accent-indicating features that are in turn added to the input of ASR models [13]. Modeling approaches include accent conversion [14], where a transformation is applied to a non-native utterance to make it sound as if the speaker had a native accent. For a thorough overview of research on improving speech recognition for accents, see [15].

3.2. Improving ASR quality via pre-training

Our experiments show that pre-training of various kinds is able to improve recognition on accented data. Importantly, it does not harm performance of the same models on the standard test sets. Previous work suggests that pre-training is also helpful in settings when little data is available, e.g. for building ASR systems for Sub-Saharan African languages [16]. In this section, we consider pre-training of wav2vec 2.0 models [17] using natural and synthetic unsupervised speech. Our ASR model is a large hybrid autoregressive transducer (HAT) [18] with Conformer layers [19] in both encoder and prediction network.

3.2.1. Pre-training with unsupervised spontaneous speech

In the first set of experiments we explored the effect of wav2vec 2.0 pre-training with spontaneous untranscribed speech [17]. The settings we considered are pre-training using unsupervised Libri-Light [20] and unsupervised YouTube (YT-U) data [21]. We also compared the results with a wav2vec 2.0 model without pre-training. Additionally, we explored the effects of noisy student self-training (NST) [22] using some of the transcribed public domain YouTube data (YT-T). In all cases, we trained on the entire SpeechStew data set, totalling roughly 5k hours [23].

We report the results on AAVE data represented by the CORAAL corpus [24], and non-native (L2) accented English data represented by the GMU Accent data set [25]. AAVE data was collected in Atlanta, GA (ATL, 11 hours), Washington, DC (DCB, 34.5 hours) and Princeville, NC (PRV, 31 hours). It is important to point out that the PRV data exhibits significantly more diverse dialectal features than sets coming from the other two locations [2]. GMU Accent consists of 18 hours of speech produced by native speakers of 200 various languages.

Table 1: Results with wav2vec 2.0 pre-training and noisy student self-training (NST) on accented data (WER in %).

Pre-training   NST    CORAAL:ATL   CORAAL:DCB   CORAAL:PRV   GMU Accent
-              -      17.5         16.6         24.7         8.2
Libri-Light    -      16.8         17.0         22.0         5.4
YT-U           -      14.2         15.3         20.1         4.7
YT-U           YT-T   12.5         14.0         20.3         4.0

The results in Table 1 show that the model pre-trained on unsupervised YouTube data together with the NST component gives the overall best results. The WER on AAVE test sets decreases by 18 to 28% relative. The pre-training is even more effective on L2 data from the GMU Accent corpus, where the WER drops from 8.2 to 4.0%.
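
The models behind Table 1 are internal, but the evaluation loop itself is simple to reproduce with public checkpoints. Purely as an illustration, the sketch below scores the openly released facebook/wav2vec2-base-960h CTC checkpoint from Hugging Face (a stand-in, not the HAT/Conformer model used in this paper) against a local accented test manifest; the manifest path and format are hypothetical, and the wer() helper from Section 2 is reused.

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def transcribe_file(path: str) -> str:
    audio, sr = sf.read(path)  # the checkpoint expects 16 kHz mono audio
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0].lower()

# accented_test.tsv (hypothetical): one "<wav path>\t<reference>" line per utterance
scores = []
with open("accented_test.tsv") as f:
    for line in f:
        path, ref = line.rstrip("\n").split("\t")
        scores.append(wer(ref.lower(), transcribe_file(path)))

print(f"average per-utterance WER: {sum(scores) / len(scores):.3f}")
```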
3.2.2. Pre-training using synthetic speech

Another way of sourcing more relevant data is to use synthetic speech, e.g. with encoder pre-training following the tts4pretrain algorithm for various accents [26]. In this scenario, a text-to-speech (TTS) system is employed to augment the training data given text. This approach helps to expand the scope of the available data, enriching it with potentially novel lexical items.

For our experiment, we follow the setup laid out in [26]. We used a wav2vec 2.0 model pre-trained with 1 million hours of untranscribed speech, followed by TTS to further inject synthetic speech into the pre-training with around 100 million utterances of text, corresponding to roughly 100k hours of synthesized speech. The TTS system is trained on the LibriTTS dataset. After pre-training, we fine-tune the system on a small subset of supervised data (around 1000 hours).

Table 2: Results for adding synthetic data to wav2vec 2.0 pre-training (WER in %).

Pre-training data (unsupervised)   Fine-tuning data (supervised)   CORAAL:ATL   CORAAL:DCB   CORAAL:PRV
1M hrs YT                          1000h                           12.8         16.5         23.0
+100k hrs TTS utterances           1000h                           12.1         15.8         23.4

The results in Table 2 show that synthesizing data using TTS and adding it to pre-training decreases the WER on two out of three accented test sets from the CORAAL corpus [24]. Namely, the model that uses 100M additional synthetic utterances achieves a WER as low as 12.1% on the AAVE speech.
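
tts4pretrain itself is not public, but the data side of the recipe is easy to sketch: synthesize audio for a large pool of text and add it to the untranscribed speech used for pre-training. In the sketch below, synthesize is a hypothetical TTS callable (e.g. a model trained on LibriTTS); the actual method in [26] integrates the synthesized speech into encoder pre-training itself rather than merely writing files to disk.

```python
from pathlib import Path

import soundfile as sf

def build_synthetic_pool(text_file, synthesize, out_dir, sample_rate=16000):
    """Turn each line of text into a synthetic utterance and return a manifest
    of wav paths to be mixed with the natural untranscribed pre-training audio."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest = []
    with open(text_file) as f:
        for i, line in enumerate(f):
            text = line.strip()
            if not text:
                continue
            audio = synthesize(text)  # assumed to return a 1-D float waveform
            path = out / f"tts_{i:08d}.wav"
            sf.write(path, audio, sample_rate)
            manifest.append(str(path))
    return manifest
```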
3.2.3. Fine-tuning with real accented speech

If real accented data is available, adding it to the training can be expected to significantly improve the accuracy on accented speech test sets. We find that simply blending in a small amount of in-domain data during supervised fine-tuning can greatly reduce the WER on accented speech while maintaining performance on other test sets.

For this experiment we sampled accented data from various publicly available sources representing different dialects and language varieties; the accented corpus is roughly 45 hours long. The non-accented test set that we used for control is composed of short queries sampled from anonymized traffic data. We evaluated these test sets using pre-trained wav2vec 2.0 models, and observed that the overall quality of recognition improves when accented data is added to the training. The results in Table 3 show that while the WER on accented data decreases by 30% (from 28.8% to 20.4%), it remains unchanged on the short queries test set.

Table 3: Fine-tuning with real accented data (WER in %).

Real accented data   Accented   Short queries
-                    28.8       6.7
+                    20.4       6.7
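
The blending itself can be as simple as drawing a fixed fraction of fine-tuning examples from the accented pool. The sketch below expresses that with the Hugging Face datasets library purely as an illustration (the pipeline used in the paper is internal); the directory names and the 90/10 mixing ratio are placeholders.

```python
from datasets import interleave_datasets, load_dataset

# Hypothetical local audio folders with transcripts in metadata.csv.
main_train = load_dataset("audiofolder", data_dir="main_supervised", split="train")
accented = load_dataset("audiofolder", data_dir="accented_45h", split="train")

# Draw roughly 10% of examples from the small accented set; with
# "all_exhausted" the smaller set is revisited so the ratio is maintained.
blended = interleave_datasets(
    [main_train, accented],
    probabilities=[0.9, 0.1],
    seed=0,
    stopping_strategy="all_exhausted",
)
```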
3.3. Further sources of accented speech

As we showed in the previous subsection, adding various types of unsupervised data to the pre-training improves the recognition of accented speech. This suggests that further experiments for pre-training are promising, e.g. mixing in data from non-target languages in order to achieve better recognition for speakers with non-native accents.

Alternatively, using voice conversion (VC) to produce more training data can be helpful as well. The idea of a VC model [27] is to perform an on-the-fly conversion from any type of speech to accented speech in the audio domain. Down the line, ASR models can be jointly trained with this accented data generated by the VC model, potentially augmented with other components, such as consistency loss [28]. Such on-the-fly VC allows us to augment the training data for more accent-robust ASR models. VC is capable of manipulating phonetic and phonological aspects of the accent, and even phonotactic ones depending on the type of VC model. Overall, VC improves the quality of recognition for accented speech without requiring any additional metadata or human annotations. However, parameters such as lexical choice and syntactic differences cannot be addressed by VC directly without adding a TTS generator. Furthermore, recent studies suggest that VC is particularly helpful for improving ASR in very low-resource settings [29].

One could use the VC model in order to increase speaker diversity of the training data, allowing various accents to be added to already existing data augmentation.
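
As a rough sketch of how such conversion slots into a training pipeline, the function below re-renders a random subset of utterances with a randomly chosen target accent before feature extraction. The convert callable and the accent labels are hypothetical stand-ins for an actual VC model such as those surveyed in [27]; the jointly trained, consistency-loss variant of [28] is not shown.

```python
import random

import numpy as np

TARGET_ACCENTS = ["accent_a", "accent_b", "accent_c"]  # illustrative labels only

def vc_augment(audio: np.ndarray, convert, p: float = 0.3) -> np.ndarray:
    """With probability p, replace the waveform with an accent-converted copy;
    convert is a hypothetical callable (waveform, target_accent) -> waveform."""
    if random.random() < p:
        return convert(audio, random.choice(TARGET_ACCENTS))
    return audio
```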
4. Data sets representing linguistic diversity

Annotating speech utterances with metadata becomes more and more common for many data sets. For example, gender annotation is provided for LibriSpeech [30] speakers, and both gender and age buckets are available for Mozilla Common Voice [31] utterances, one of the biggest open-source speech data sets. In this section, we describe multiple data sets and approaches that can be used for diverse model evaluations.

4.1. Regional language data sets

TIMIT (LDC93S1) represents 7 dialects of American English, and there are two versions of the CALLFRIEND corpus of phone conversations: one for Standard American English (LDC96S46) and one for Southern American English (LDC96S47). As for dialects spoken in the UK and Ireland, they are represented within an OpenSLR data set (SLR83) [32]. Regional corpora are available for other languages as well, e.g. Swiss German [33], Galician Spanish [34] and Arabic dialects [35], along with several multilingual corpora [36]. Additionally, public challenges such as the MGB Challenge target evaluations of various ASR topics including dialect identification.

Still, there are not many data sets that represent speech on a more detailed level of geographical granularity, e.g. in various metro areas or regions for a given language. However, the practice of utterance annotation with at least country-level metadata is becoming more common. A recent study suggests ways to create such open-source data sets using public domain videos [37], showing that the construction of regional corpora can be partially automated. Other research focuses on machine learning methods to identify dialectal utterances, using the example of local accents spoken in Pakistan [27].

4.2. L2 accented sets

A number of corpora represent accented English speech, including GMU Accent [25], IDEA [38], and ALLSSTAR [39]. In the GMU Accent corpus, for example, as many as 200 native languages of the speakers are represented. Data is available in other languages as well: for example, West Point developed corpora representing both native and non-native speakers for Spanish [40], Russian [41], and Arabic [42]. A European consortium published IFCASL [43], a corpus of French and German native and non-native speech.

4.3. Sociolectal data sets

While it is often possible to find a representation of regional and accented speech in widely spoken languages, less data is available for sociolects (e.g. Chicano English or Cockney English). African American Vernacular English (AAVE), for example, is spoken by approximately 40 million speakers, and is represented within the CORAAL [24] and Voices of California corpora. CORAAL also showcases regional variation within AAVE, comparing data from six metro areas. However, more work is needed to adequately represent other sociolects.

4.4. Sets representing speech impairments

Another category of data sets that we would like to highlight exemplifies various types of speech disorders. Speech recognition for impaired speakers is a different problem than improving accuracy for various accents. However, this group of speakers might benefit from more available data and training approaches promoting accent robustness as well. For example, the Euphonia project involved collecting one million utterances representing different types and severities of speech impairments [44]. More such data is available in the Whitaker corpus [45], the Dysarthric Speech Database for Universal Access Research [46] and EasyCall [47]. A Spanish corpus published in [48] contains recordings of children with speech impairments in particular.

A growing number of researchers are working towards creating data sets representing different types of factors affecting ASR quality, e.g. accented speech, noise, various acoustic environments and other aspects. For example, ASR4REAL [10] is a combination of benchmarks assembled at FAIR that is intended to represent various accents and real-life conditions.

5. Discussion

To understand how well ASR models serve people speaking different dialects and accents, we need data representing various language varieties, and a methodology of how to approach the evaluation. In this paper, we discuss several experiments aiming to improve ASR quality for accented speakers, along with some recent results in the area of ASR benchmarking.

Indeed, a crucial research direction is to improve ASR quality for accented speakers. This goal is impossible to achieve without representative test sets and established benchmarks, since improving recognition for widely used corpora representing mainstream types of speech does not guarantee improving ASR for other groups of accented speakers. And indeed, without a proper way to measure speech recognition quality for various accents there is no way to know how certain configurations affect different groups of speakers and types of language diversity. To help researchers with the challenge of obtaining representative data samples for various regions, non-native accents, sociolects, and speech disorders, we provided a quick survey of currently available corpora suitable for evaluating ASR quality on such speech varieties.

Previous works suggested accent conversion or employing additional accent classifiers to create more in-domain data. In this paper, we showed how various types of wav2vec 2.0 pre-training and fine-tuning affect the recognition accuracy on accented test sets representing AAVE and non-native speakers. Namely, we explored the effects of pre-training using unsupervised spontaneous and synthetic speech, showing that it indeed helps to achieve higher recognition quality for accents without increasing the WER on the other sets.

To conclude, we would like to highlight the importance of collaboration in this emerging field of inclusive ASR technologies. Researchers and developers can get insights into challenging areas and language aspects from linguists and dialectologists, who, in turn, could help to analyze the patterns of recognition issues, and address the issues of missing data for some groups of speakers. Crowdsourcing and community involvement allows for scaling data collection projects, and extensive metadata annotation enables understanding of the overall recognition quality with respect to the factors of language variation. Finally, it is crucially important for the researchers to share their results and findings, drawing greater attention to this area.

6. Acknowledgments

We are grateful for the resources provided by Google, and for our collaborators from Speech and Brain teams, especially Françoise Beaufays, James Flynn, Bo Li, Pedro J. Moreno, Daniel Park, Sandy Ritchie, Pierric Sans, Benyah Shaparenko, and Yu Zhang. For critical work on data sets, our thanks go to Landis Baker, Jonathan Endale, Mandy Jordan, Prescott Nicoll, and Travis Trekell.
7. References

[1] N. Markl and S. J. McNulty, "Language technology practitioners as language managers: arbitrating data bias and predictive bias in ASR," in LREC 2022 (submitted), 2022.
[2] A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R. Rickford, D. Jurafsky, and S. Goel, "Racial disparities in automated speech recognition," Proc. of the National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689, 2020.
[3] V. Vielzeuf and G. Antipov, "Are E2E ASR models ready for an industrial usage?" 2021.
[4] A. Aksënova, A. Bruguier, A. Ritchart-Scott, and U. Mendlovic, "Algorithmic exploration of American English dialects," in Proc. IEEE ICASSP, Barcelona, Spain, 2020.
[5] C. Heldreth, M. Lahav, Z. Mengesha, J. Sublewski, and E. Tuennerman, "'I don't think these devices are very culturally sensitive.' – The impact of errors on African Americans in automated speech recognition," Frontiers in AI, vol. 26, 2021.
[6] A. Mani, S. Palaskar, and S. Konam, "Towards understanding ASR error correction for medical conversations," in Proc. ACL Workshop on NLP for Medical Conversations, 2020, pp. 7–11.
[7] A. Aksënova, D. van Esch, J. Flynn, and P. Golik, "How might we create better benchmarks for speech recognition?" in Proc. ACL Workshop on Benchmarking: Past, Present and Future, 2021, pp. 22–34.
[8] H. Liao, G. Pundak, O. Siohan, M. Carroll, N. Coccaro, Q.-M. Jiang, T. N. Sainath, A. Senior, F. Beaufays, and M. Bacchiani, "Large vocabulary automatic speech recognition for children," in Proc. Interspeech, Dresden, Germany, 2015, pp. 1611–1615.
[9] R. Tatman, "Gender and dialect bias in YouTube's automatic captions," in Proc. ACL Workshop on Ethics in Natural Language Processing, Valencia, Spain, 2017, pp. 53–59.
[10] M. Riviere, J. Copet, and G. Synnaeve, "ASR4REAL: An extended benchmark for speech models," in Proc. IEEE ICASSP (submitted), 2022.
[11] S. S. Rajan, S. Udeshi, and S. Chattopadhyay, "AequeVox: Automated fairness testing of speech recognition systems," in Int. Conf. on Fundamental Approaches to Software Engineering, 2022.
[12] A. Prasad and P. Jyothi, "How accents confound: Probing for accent information in end-to-end speech recognition systems," in Proc. of the 58th Annual ACL Meeting, 2020, pp. 3739–3753.
[13] A. Jain, M. Upreti, and P. Jyothi, "Improved accented speech recognition using accent embeddings and multi-task learning," in Proc. Interspeech, 2018, pp. 2454–2458.
[14] G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Accent conversion using phonetic posteriorgrams," in Proc. IEEE ICASSP, 2018, pp. 5314–5318.
[15] A. Hinsvark, N. Delworth, M. D. Rio, Q. McNamara, J. Dong, R. Westerman, M. Huang, J. Palakapilly, J. Drexler, I. Pirkin, N. Bhandari, and M. Jette, "Accented speech recognition: A survey," arXiv:2104.10747, 2021.
[16] S. Ritchie, Y.-C. Cheng, M. Chen, R. Mathews, D. van Esch, and B. Li, "Large vocabulary speech recognition for Sub-Saharan African languages," in Proc. Interspeech (submitted), 2022.
[17] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. NeurIPS, vol. 33, 2020, pp. 12449–12460.
[18] E. Variani, D. Rybach, C. Allauzen, and M. Riley, "Hybrid autoregressive transducer (HAT)," in Proc. IEEE ICASSP, 2020, pp. 6139–6143.
[19] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., "Conformer: Convolution-augmented transformer for speech recognition," in Proc. Interspeech, 2020.
[20] J. Kahn, M. Riviere, W. Zheng, E. Kharitonov, Q. Xu, P. Mazare, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux, "Libri-Light: A benchmark for ASR with limited or no supervision," in Proc. IEEE ICASSP, 2020.
[21] Y. Zhang, D. S. Park, W. Han, J. Qin, A. Gulati, J. Shor, A. Jansen, Y. Xu, Y. Huang, S. Wang et al., "BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition," arXiv:2109.13226, 2021.
[22] D. S. Park, Y. Zhang, Y. Jia, W. Han, C.-C. Chiu, B. Li, Y. Wu, and Q. V. Le, "Improved noisy student training for automatic speech recognition," in Proc. Interspeech, 2020, pp. 2817–2821.
[23] W. Chan, D. Park, C. Lee, Y. Zhang, Q. Le, and M. Norouzi, "SpeechStew: Simply mix all available speech recognition data to train one large neural network," in Proc. MLSLP, 2021.
[24] T. Kendall and C. Farrington, "The Corpus of Regional African American Language, version 2021.07," 2021, Eugene, OR.
[25] S. Weinberger, "Speech accent archive," 2015, GMU.
[26] Z. Chen, Y. Zhang, A. Rosenberg, B. Ramabhadran, G. Wang, and P. Moreno, "Injecting text in self-supervised speech pretraining," in Proc. IEEE ASRU, 2021, pp. 251–258.
[27] S. H. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65–82, 2017.
[28] G. Wang, A. Rosenberg, Z. Chen, Y. Zhang, B. Ramabhadran, Y. Wu, and P. Moreno, "Improving speech recognition using consistent predictions on synthesized speech," in Proc. IEEE ICASSP, 2020, pp. 7029–7033.
[29] M. Baas and H. Kamper, "Voice conversion can improve ASR in very low-resource settings," in Proc. IEEE ICASSP (submitted), 2022.
[30] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE ICASSP, 2015, pp. 5206–5210.
[31] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," in Proc. LREC, 2020, pp. 4218–4222.
[32] I. Demirsahin, O. Kjartansson, A. Gutkin, and C. Rivera, "Open-source multi-speaker corpora of the English accents in the British Isles," in Proc. LREC, ELRA, 2020, pp. 6532–6541.
[33] T. Samardžić, Y. Scherrer, and E. Glaser, "ArchiMob – a corpus of spoken Swiss German," in Proc. LREC, May 2016, pp. 4061–4066.
[34] M. Barcala, E. Domínguez, A. Fernández, R. Rivas, M. P. Santalla, V. Vázquez, and R. Villapol, "El corpus ESLORA de Español oral: diseño, desarrollo y explotación," CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos, vol. 5, no. 2, pp. 217–237, 2018.
[35] K. Lounnas, M. Abbas, and M. Lichouri, "Building a speech corpus based on Arabic podcasts for language and dialect identification," in Proc. ICNLSP, ACL, 2019, pp. 54–58.
[36] D. Imseng, H. Bourlard, H. Caesar, P. N. Garner, G. Lecorvé, and A. Nanchen, "MediaParl: Bilingual mixed language accented speech database," in Proc. of the 2012 IEEE Workshop on Spoken Language Technology, 2012.
[37] S. Coats, "A corpus of regional American language from YouTube," in Proceedings DHNB, 2019, pp. 79–91.
[38] "IDEA: International Dialects of English Archive."
[39] A. Bradlow, "ALLSSTAR: Archive of L1 and L2 scripted and spontaneous transcripts and recordings."
[40] J. Morgan, "West Point Heroico Spanish Speech LDC2006S37," 2006, Philadelphia: Linguistic Data Consortium.
[41] S. A. LaRocca and C. Tomei, "West Point Russian Speech LDC2003S05," 2003, Philadelphia: Linguistic Data Consortium.
[42] S. A. LaRocca and R. Chouairi, "West Point Arabic Speech LDC2002S02," 2002, Philadelphia: Linguistic Data Consortium.
[43] J. Trouvain, A. Bonneau, V. Colotte, C. Fauth, D. Fohr, D. Jouvet, J. Jügler, Y. Laprie, O. Mella, B. Möbius, and F. Zimmerer, "The IFCASL corpus of French and German non-native and native read speech," in Proc. LREC, 2016, pp. 1333–1338.
[44] R. L. MacDonald, P.-P. Jiang, J. Cattiau, R. Heywood, R. Cave, K. Seaver, M. A. Ladewig, J. Tobin, M. P. Brenner, P. C. Nelson, J. R. Green, and K. Tomanek, "Disordered speech data collection: Lessons learned at 1 million utterances from Project Euphonia," in Proc. Interspeech, 2021, pp. 4833–4837.
[45] J. Deller, M. Liu, L. Ferrier, and P. Robichaud, "The Whitaker database of dysarthric (cerebral palsy) speech," The Journal of the Acoustical Society of America, vol. 93, no. 6, pp. 3516–3518, 1993.
[46] H. Kim, M. Hasegawa-Johnson, A. Perlman, J. Gunderson, T. S. Huang, K. Watkin, and S. Frame, "Dysarthric speech database for universal access research," in Proc. Interspeech, 2008, pp. 1741–1744.
[47] R. Turrisi, A. Braccia, M. Emanuele, S. Giulietti, M. Pugliatti, M. Sensi, L. Fadiga, and L. Badino, "EasyCall corpus: a dysarthric speech dataset," in Proc. Interspeech, 2021, pp. 41–45.
[48] O. Saz, W. Rodríguez, E. Lleida, and C. Vaquero, "A novel corpus of children's disordered speech," in WOCCI, 2008.
