ASR course MOSIG 2024
Laurent Besacier
October 2024
Speech facts
https://notebooklm.google.com/
The speech signal
Speech-to-text
Automatic Speech Recognition (ASR)
Ideally we want a system that deals with spontaneous speech, multiple speakers, an unlimited output vocabulary, and any acoustic condition
But performance differs greatly across contexts (read vs spontaneous speech; small vs large vocabulary; quiet vs noisy; native vs non-native speech)
Speech representations
Spectrograms (< 1990 and > 2015!)
A time-frequency representation that is actually similar to a sequence of filterbank features...
... but processed as an image
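A minimal sketch of how such a log-mel spectrogram (a stack of filterbank outputs over time) is typically computed in practice; librosa, the file name, and the parameter values are assumptions for illustration, not prescribed by the course:

```python
import librosa

# Illustrative file name and parameters (16 kHz audio, 25 ms windows, 10 ms hop)
y, sr = librosa.load("utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)   # shape: (80 mel bands, n_frames)
print(log_mel.shape)                 # this time-frequency "image" can be fed to a CNN
```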
Image from Bhuvana Ramabhadran's presentation at Interspeech 2018
1990-2015: Bayes, HMMs, GMMs
Fundamental equation
Lexicons
ASR overview
HMMs
$P(W) = \prod_{k=1}^{T} P(w_k \mid h) \quad (3)$
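For completeness, the "Fundamental equation" the outline refers to is the standard Bayes decomposition of ASR decoding; the reconstruction below is hedged (the exact notation of the original slides, presumably equations (1)-(2), may differ):

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
        = \operatorname*{arg\,max}_{W} \frac{P(X \mid W)\,P(W)}{P(X)}
        = \operatorname*{arg\,max}_{W} P(X \mid W)\,P(W)
```

Here $P(X \mid W)$ is the acoustic model (HMM/GMM in the 1990-2015 era) and $P(W)$ is the language model, factorised as in equation (3) above.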
Figure: CTC Overview
2020: main ASR architectures
CTC overview
Connectionist Temporal Classification
The model learns to align the transcript itself during training (Graves et al., 2006)
Defined over a label sequence z (of length M)
A blank symbol allows the M-length target sequence to be mapped to the T-length input sequence x; the blank gives the model the ability to say that a given audio frame did not produce any character
z can be represented by the set of all possible CTC paths (sequences of labels, at frame level) that map to z
Example: M=2 (z = 'hi') and T=3 (3 frames): the possible paths are 'hhi', 'hii', '_hi', 'h_i', 'hi_', where '_' denotes the blank (see the sketch after this list)
The probability p(z|x) is evaluated as the sum of the probabilities of all possible CTC paths (using the Forward-Backward algorithm)
The model generates frame-level posteriors at decoding time
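To make the path example concrete, here is a minimal sketch (plain Python, illustrative names only) that enumerates all frame-level paths of length T collapsing to a target z; running it on the 'hi' example above yields exactly the five paths listed:

```python
from itertools import product

BLANK = "_"  # the blank symbol, written '_' in the example above

def collapse(path):
    """CTC collapse rule: merge repeated labels, then remove blanks."""
    merged = []
    for c in path:
        if not merged or c != merged[-1]:
            merged.append(c)
    return "".join(c for c in merged if c != BLANK)

def ctc_paths(z, T, alphabet):
    """All frame-level paths of length T that collapse to the target z."""
    return ["".join(p) for p in product(alphabet, repeat=T) if collapse(p) == z]

# the five paths of the slide example: M=2 (z='hi'), T=3 frames
print(ctc_paths("hi", 3, ["h", "i", BLANK]))
```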
2020: main ASR architectures
Figure: CTC Loss
CTC inference
Greedy decoding
Beam-Search decoding
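A hedged sketch of the greedy (best-path) variant just listed: take the argmax label at every frame, then apply the CTC collapse rule. Array shapes and names are assumptions, not from the slides:

```python
import numpy as np

def greedy_ctc_decode(log_probs, labels, blank_id=0):
    """log_probs: (T, V) frame-level log-posteriors; labels: output symbols,
    with the blank at index blank_id. Returns the collapsed best-path string."""
    best = log_probs.argmax(axis=-1)          # best label per frame
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank_id:   # merge repeats, drop blanks
            out.append(labels[idx])
        prev = idx
    return "".join(out)
```

Beam-search decoding keeps the N best partial hypotheses per frame instead of a single one, which allows combining the CTC scores with an external language model.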
Attention modeling
Architecture similar to neural machine translation
Speech encoder based on CNNs or pyramidal LSTMs?
Figure: attention-based encoder-decoder (context vector c2 computed over states s1..s4)
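The mechanism behind the figure can be summarised by a small dot-product attention sketch; this is a generic illustration with assumed names, not the exact scoring function of (Chorowski et al., 2015):

```python
import numpy as np

def attention_context(s_t, H):
    """s_t: (d,) decoder state; H: (T, d) encoder states. Returns context c_t: (d,)."""
    scores = H @ s_t                      # one dot-product score per encoder frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the T frames
    return weights @ H                    # context vector: weighted sum of encoder states
```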
Attention modeling
Image from https://lorenlugosch.github.io/posts/2020/11/transducer/
Transducer models
Interesting features
If the encoder is causal (not using something like a bidirectional RNN), then search can run in an online/streaming fashion
The predictor only has access to y (not x), unlike the decoder in an attention model, so we can easily pre-train the predictor on text-only data
Naturally defines an alignment between x and y (see the joint-network sketch below)
Image from https://lorenlugosch.github.io/posts/2020/11/transducer/
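As a rough illustration of how the transducer components fit together, here is a hedged PyTorch sketch of a joint network; the encoder and predictor are assumed to exist elsewhere, and the dimensions, names, and additive combination are illustrative choices, not the exact model from the slides:

```python
import torch
import torch.nn as nn

class TinyJoiner(nn.Module):
    """Joint network: combines one encoder frame f_t and one predictor state g_u
    into logits over the output vocabulary plus the blank symbol."""
    def __init__(self, enc_dim, pred_dim, hidden_dim, vocab_plus_blank):
        super().__init__()
        self.proj_enc = nn.Linear(enc_dim, hidden_dim)
        self.proj_pred = nn.Linear(pred_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_plus_blank)

    def forward(self, f_t, g_u):
        # additive combination + non-linearity, a common (but not the only) choice
        return self.out(torch.tanh(self.proj_enc(f_t) + self.proj_pred(g_u)))
```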
2021: Self-Supervised Learning (SSL) for Speech
Using huge amounts of unlabeled data for training; targets are computed from the signal itself
"learn representations using objective functions similar to those used for supervised learning, but train networks to perform pretext tasks where both the inputs and labels are derived from an unlabeled dataset" (from Chen et al. (2020))
Introduced for vision: see for instance (Chen et al., 2020)
learn representations by contrasting positive pairs against negative pairs (see the sketch after this list)
Introduced also in NLP: see for instance (Devlin et al., 2018)
learn representations by predicting tokens that were masked in an input sequence
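To illustrate the "contrasting positive pairs against negative pairs" idea, here is a generic InfoNCE-style sketch; it is an assumption for illustration, not the exact loss of any of the cited papers:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (d,) vectors; negatives: (K, d). Returns a scalar loss
    that is low when the anchor is closer to its positive than to any negative."""
    candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)   # (K+1, d)
    logits = F.cosine_similarity(anchor.unsqueeze(0), candidates) / temperature
    target = torch.zeros(1, dtype=torch.long)        # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```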
Previous works
$P(\text{sequence}) = \prod_{k=1}^{T} P(t_k \mid h) \quad (5)$
Recurrent neural network LM: $h = \mathrm{rnn\_state}(E(t_1), E(t_2), \ldots, E(t_{k-1}))$
For speech, each token $t_k$ corresponds to a frame rather than a word or character token
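Equation (5) and the recurrent state above can be sketched in a few lines of illustrative PyTorch; the module below is a generic assumption, not the exact model of (Mikolov et al., 2010):

```python
import torch
import torch.nn as nn

class TinyTokenLM(nn.Module):
    """Autoregressive model over tokens: the GRU state plays the role of h."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # E(t_k)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)    # scores for P(t_k | h)

    def forward(self, tokens):
        # tokens: (B, T). The GRU state after reading t_1 .. t_k plays the role of h;
        # the output at position k is used to predict the next token t_{k+1}.
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h).log_softmax(dim=-1)          # next-token log-probabilities
```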
2021: Self-Supervised Learning (SSL) for Speech
Speech-XLNet
SpeechT5 (Ao et al., 2021): a multimodal extension of transformer encoder-decoder models such as T5
Encodes or decodes both speech and text with a single model
Maps both acoustic and text information into a shared vector space
Used to initialize ASR (speech-to-text), TTS (text-to-speech), Voice Conversion (VC, speech-to-speech), etc.
Experiments on several downstream speech tasks (ASR, VC, TTS, speaker id.) show slightly better results than speech-only pre-training
Moshi: Architecture
Moshi: Training
Conclusion
Language coverage
Google addresses (only) 100 languages (ASR)
Language technology issues: 300 languages (95% of the population)
Language coverage / revitalisation / documentation issues: > 6000 languages!
Figure: from Laura Welcher, Big Data for Small Languages, The Rosetta Project
Future: is ASR a solved problem?
Whisper
A massively multilingual ASR system based on weakly-supervised learning
Radford et al. (2022)
trained on 680,000 hours of multilingual and multitask supervised data collected from the web
the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language
enables transcription in multiple languages, as well as translation from those languages into English
the Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer
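A minimal usage sketch with the open-source openai-whisper package, assumed to be installed; the file name and model size are illustrative. It shows the transcription and translation modes mentioned above:

```python
import whisper  # open-source openai-whisper package, assumed installed

model = whisper.load_model("base")                 # multilingual checkpoint
result = model.transcribe("audio_sample.mp3")      # language is auto-detected
print(result["text"])

# translation into English instead of transcription
result_en = model.transcribe("audio_sample.mp3", task="translate")
print(result_en["text"])
```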
Figure: Comparison of WER for two speech systems and human-level performance on read speech (from Amodei et al., 2016)
Figure: Comparison of WER for two speech systems and human-level performance on accented speech (from Amodei et al., 2016)
Future: is ASR a solved problem?
Figure: Comparison of WER for two speech systems and human-level performance on noisy speech (from Amodei et al., 2016)
The zero resource challenge: http://zerospeech.com (Dunbar et al., 2017)
Future: is ASR a solved problem?
Resources
Questions?
Thank you
References I
Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J.,
Chrzanowski, M., Coates, A., Diamos, G., Elsen, E., Engel, J., Fan, L., Fougner, C., Hannun,
A. Y., Jun, B., Han, T., LeGresley, P., Li, X., Lin, L., Narang, S., Ng, A. Y., Ozair, S.,
Prenger, R., Qian, S., Raiman, J., Satheesh, S., Seetapun, D., Sengupta, S., Wang, C.,
Wang, Y., Wang, Z., Xiao, B., Xie, Y., Yogatama, D., Zhan, J., and Zhu, Z. (2016). Deep
speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the
33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA,
June 19-24, 2016, pages 173–182.
Ao, J., Wang, R., Zhou, L., Liu, S., Ren, S., Wu, Y., Ko, T., Li, Q., Zhang, Y., Wei, Z., et al.
(2021). Speecht5: Unified-modal encoder-decoder pre-training for spoken language
processing. arXiv preprint arXiv:2110.07205.
Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., von Platen, P.,
Saraf, Y., Pino, J., Baevski, A., Conneau, A., and Auli, M. (2021). Xls-r: Self-supervised
cross-lingual speech representation learning at scale. arXiv, abs/2111.09296.
Baevski, A., Auli, M., and Mohamed, A. (2019). Effectiveness of self-supervised pre-training for
speech recognition.
References II
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for
self-supervised learning of speech representations. In Larochelle, H., Ranzato, M., Hadsell,
R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems 33:
Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020,
December 6-12, 2020, virtual.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning
to align and translate. CoRR, abs/1409.0473.
Bapna, A., Cherry, C., Zhang, Y., Jia, Y., Johnson, M., Cheng, Y., Khanuja, S., Riesa, J., and
Conneau, A. (2022). mslam: Massively multilingual joint pre-training for speech and text.
CoRR, abs/2202.01374.
Bapna, A., Chung, Y., Wu, N., Gulati, A., Jia, Y., Clark, J. H., Johnson, M., Riesa, J.,
Conneau, A., and Zhang, Y. (2021). SLAM: A unified encoder for speech and language
modeling via speech-text joint pre-training. CoRR, abs/2110.10329.
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language
model. J. Mach. Learn. Res., 3:1137–1155.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive
learning of visual representations.
References III
Chen, Z., Huang, H., Andrusenko, A., Hrinchuk, O., Puvvada, K. C., Li, J., Ghosh, S., Balam,
J., and Ginsburg, B. (2023). Salm: Speech-augmented language model with in-context
learning for speech recognition and translation.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based
models for speech recognition. CoRR, abs/1506.07503.
Chung, Y., Hsu, W., Tang, H., and Glass, J. R. (2019). An unsupervised autoregressive model
for speech representation learning. CoRR, abs/1904.03240.
Chung, Y.-A. and Glass, J. (2020). Improved speech representations with multi-target
autoregressive predictive coding.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep
bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Dunbar, E., Cao, X., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X., and
Dupoux, E. (2017). The zero resource speech challenge 2017. CoRR, abs/1712.04313.
Défossez, A., Mazaré, L., Orsini, M., Royer, A., Pérez, P., Jégou, H., Grave, E., and Zeghidour,
N. (2024). Moshi: a speech-text foundation model for real-time dialogue.
Graves, A., Fernández, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal
classification: labelling unsegmented sequence data with recurrent neural networks. In ICML,
volume 148 of ACM International Conference Proceeding Series, pages 369–376. ACM.
References IV
Gulati, A., Qin, J., Chiu, C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu,
Y., and Pang, R. (2020). Conformer: Convolution-augmented transformer for speech
recognition. In Meng, H., Xu, B., and Zheng, T. F., editors, Interspeech 2020, 21st Annual
Conference of the International Speech Communication Association, Virtual Event, Shanghai,
China, 25-29 October 2020, pages 5036–5040. ISCA.
Hinton, G. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313(5786):504 – 507.
Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.-E., Karadayi, J.,
Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A.,
Mohamed, A., and Dupoux, E. (2019). Libri-light: A benchmark for asr with limited or no
supervision.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes.
Le, H. S., Oparin, I., Allauzen, A., Gauvain, J., and Yvon, F. (2011). Structured output layer
neural network language model. In Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress
Center, Prague, Czech Republic, pages 5524–5527.
Liu, A. T., Yang, S.-w., Chi, P.-H., Hsu, P.-c., and Lee, H.-y. (2020). Mockingjay: Unsupervised
speech representation learning with deep bidirectional transformer encoders. ICASSP 2020 -
2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
References V
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent
neural network based language model. In Interspeech.
Mnih, A. and Hinton, G. (2008). A scalable hierarchical distributed language model. In NIPS.
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language model. In
AISTATS’05, pages 246–252.
Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). Librispeech: An ASR corpus
based on public domain audio books. In ICASSP, pages 5206–5210. IEEE.
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. (2019).
Specaugment: A simple data augmentation method for automatic speech recognition.
Interspeech 2019.
Pascual, S., Ravanelli, M., Serrà, J., Bonafonte, A., and Bengio, Y. (2019). Learning
problem-agnostic speech representations from multiple self-supervised tasks. CoRR,
abs/1904.03416.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M.,
Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. (2011). The
kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition
and Understanding. IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2022). Robust
speech recognition via large-scale weak supervision.
References VI
Ravanelli, M. and Bengio, Y. (2018). Speaker recognition from raw waveform with sincnet.
Ravanelli, M., Zhong, J., Pascual, S., Swietojanski, P., Monteiro, J., Trmal, J., and Bengio, Y.
(2020). Multi-task self-supervised learning for robust speech recognition.
Renals, S., Morgan, N., Bourlard, H., Cohen, M., and Franco, H. (1994). Connectionist
probability estimators in HMM speech recognition. IEEE Trans. Speech and Audio
Processing, 2(1):161–174.
Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019a). wav2vec: Unsupervised
Pre-Training for Speech Recognition. In Proc. Interspeech 2019, pages 3465–3469.
Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019b). wav2vec: Unsupervised
pre-training for speech recognition. CoRR, abs/1904.05862.
Schwenk, H. (2007). Continuous space language models. Computer Speech & Language,
21(3):492–518.
Song, X., Wang, G., Wu, Z., Huang, Y., Su, D., Yu, D., and Meng, H. (2019). Speech-xlnet:
Unsupervised acoustic model pretraining for self-attention networks.
van den Oord, A., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive
predictive coding. CoRR, abs/1807.03748.
References VII
van den Oord, A., Vinyals, O., and kavukcuoglu, k. (2017). Neural discrete representation
learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan,
S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages
6306–6315. Curran Associates, Inc.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and
Polosukhin, I. (2017). Attention is all you need. CoRR, abs/1706.03762.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing
robust features with denoising autoencoders.
Wang, W., Tang, Q., and Livescu, K. (2020). Unsupervised pre-training of bidirectional speech
encoders via masked reconstruction.
Wu, J., Gaur, Y., Chen, Z., Zhou, L., Zhu, Y., Wang, T., Li, J., Liu, S., Ren, B., Liu, L., and
Wu, Y. (2023). On decoder-only architecture for speech-to-text and large language model
integration.
Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., and Qiu, X. (2023). Speechgpt:
Empowering large language models with intrinsic cross-modal conversational abilities.