
Automatic Speech Recognition: Introduction, Current Trends and Open Problems


http://tinyurl.com/ASR-Intro-2024
http://tinyurl.com/ASR-Intro-2024-podcast

Laurent Besacier

October 2024



1 The speech signal

2 1990-2015: Bayes, HMMs, GMMs

3 2015: neural nets to the rescue

4 2020: main ASR architectures

5 2021: Self-Supervised Learning (SSL) for Speech

6 2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

7 Future: is ASR a solved problem ?





The speech signal

Speech facts

Speech generally conveys a (linguistic) message (that can be reduced to a transcript)
But not only (paralinguistics: speaker identity, mood, health condition, accent, etc.)
Variability at all levels (intra-speaker, inter-speaker, microphone, phone line, room acoustics, style)
Speech is a continuous signal (no explicit word boundaries)
It may be decomposed into elementary units of sound (phonemes) that distinguish one word from another in a given language (minimal pairs)
kill vs kiss - pat vs bat
the phoneme set is language dependent
the acoustic realization of a phoneme depends on its left and right neighbors (co-articulation)



The speech signal

(Main) Speech tasks

Speech compression (solved)
Speaker recognition (strong progress over the last 10 years, but still weak compared to other biometric modalities like fingerprint and iris)
Text-to-speech synthesis (strong progress over the last 2-3 years; do you know NotebookLM? 1)
Speech-to-text and Speech-to-Speech (this talk)
Speech paralinguistics: detection of gender, age, deception, sincerity, nativeness, emotion, sleepiness, cognitive disorders, (drug or alcohol) intoxication, pathologies, etc.
Main speech conference: Interspeech (CORE rank A, every year)

1 https://notebooklm.google.com/
The speech signal

Speech-to-text
Automatic Speech Recognition (ASR)
Ideally we want a system that deals with: spontaneous speech, multiple speakers, unlimited output vocabulary, any acoustic condition
But performance differs greatly across contexts (read vs spontaneous speech; small vs large vocabulary; quiet vs noisy; native vs non-native speech)

Figure: NIST ASR benchmark tests history (< 2015)



The speech signal

Speech-to-text
Automatic Speech Recognition (ASR)
Ideally we want a system that deals with: spontaneous speech, multiple speakers, unlimited output vocabulary, any acoustic condition
But performance differs greatly across contexts (read vs spontaneous speech; small vs large vocabulary; quiet vs noisy; native vs non-native speech)

Figure: Librispeech ASR benchmark tests history (> 2016)



The speech signal

ASR as a partial task in a larger system

ASR for spoken language processing (speech understanding, speech translation, speech summarization, etc.)
Not just a problem of noisy transcripts
No sentence boundaries, punctuation, or case
Disfluencies in spontaneous speech: false starts, fillers, repaired utterances
btw, should we keep them or remove them?
some speech tasks are ill-defined (ex: speech translation)
Time to work on end-to-end approaches from speech?



The speech signal

Speech representations

Handcrafted feature vectors
standard extraction on sliding windows of 20-30 ms at a frame rate of 10 ms (a filterbank extraction sketch follows this list)
filterbanks (signal energy in different frequency bands)
cepstral coefficients (inverse Fourier transform of the logarithm of the estimated spectrum of a signal)
linear predictive coding (a sample is predicted as a weighted sum of preceding samples, and the weights are used as features)
prosodic features (pitch, energy)
Raw waveform (> 2015)
bypass handcrafted preprocessing
preprocessing becomes part of acoustic modeling and training
introducing convolutional layers in the first stages of the NN pipeline
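A minimal sketch of the handcrafted front-end described above, assuming torchaudio is installed and a 16 kHz mono file "example.wav" (a placeholder path): 25 ms windows every 10 ms, then log-mel filterbank energies.

```python
import torch
import torchaudio

# Placeholder path; any 16 kHz mono recording works
waveform, sample_rate = torchaudio.load("example.wav")   # (channels, samples)

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms frame rate
    n_mels=80,        # 80 filterbank channels
)
log_mel = torch.log(mel(waveform) + 1e-6)   # (channels, n_mels, frames)
print(log_mel.shape)
```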



The speech signal

Speech representations
Spectrograms (< 1990 and > 2015!)
a time-frequency representation that is actually similar to a sequence of filterbanks ...
... but processed as an image

Figure: Speech signal (top) and spectrogram (bottom)

Self-supervised learnt representations (> 2020, see next sessions with me!)
The speech signal

Progress over the years

Figure: ASR Performance on English Conversational Telephony (Switchboard) 2

2 Image from Bhuvana Ramabhadran's presentation at Interspeech 2018


1990-2015: Bayes, HMMs, GMMs

Fundamental equation

x: observation (signal or features)
w: a word sequence

w* = argmax_w p(w|x) = argmax_w p(x|w) p(w)    (1)

p(x|w): acoustic model
p(w): language model
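A toy illustration of equation (1) with made-up log-scores for two candidate transcriptions of the same audio: the language model prior can overturn a small acoustic preference (all numbers here are invented for the example).

```python
# Hypothetical log p(x|w) (acoustic model) and log p(w) (language model)
log_p_x_given_w = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
log_p_w = {"recognize speech": -3.0, "wreck a nice beach": -7.0}

# argmax_w p(x|w) p(w), computed in the log domain
best = max(log_p_x_given_w, key=lambda w: log_p_x_given_w[w] + log_p_w[w])
print(best)  # "recognize speech": the LM prior outweighs the small acoustic deficit
```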



1990-2015: Bayes, HMMs, GMMs

Lexicons

For acoustic modelling in large vocabulary speech recognition, we model phones instead of full words
A pronunciation lexicon gives the decomposition of words into phonemes
Adding a new word to the output vocabulary does not require retraining the acoustic models
just add an entry to the pronunciation lexicon
cat /k a t/
Hierarchical modelling of speech (signal/phones/words/utterance)



1990-2015: Bayes, HMMs, GMMs

Hierarchical modelling of speech

Figure: From speech to utterances 3

3 Image from Steve Renals's lecture on ASR
1990-2015: Bayes, HMMs, GMMs

ASR overview

Figure: ASR Overview



1990-2015: Bayes, HMMs, GMMs

Acoustic modeling: HMM/GMM


Complex sequential patterns of speech are decomposed into piecewise stationary segments
The sequential structure of the data is described by a sequence of states
HMM (Hidden Markov Model) transitions
The local characteristics of the data are described by a distribution associated with each state
GMM (Gaussian Mixture Model) observations (outputs)

Figure: HMM/GMM approach



1990-2015: Bayes, HMMs, GMMs

HMMs

Well-known algorithms exist for
training the model parameters (Baum-Welch algorithm)
decoding the most probable hidden state sequence (Viterbi algorithm; a sketch follows this list)
evaluating the likelihood of an observation being generated by an HMM (Forward algorithm)
Phonemes are generally modeled in context (1 phoneme = N HMMs)
triphones or quinphones (model co-articulation)
state or parameter tying to reduce model complexity
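A minimal Viterbi sketch for a toy 3-state left-to-right HMM; all probabilities below are made up, and real systems use GMM or NN state likelihoods, but the log-domain dynamic programming is the same.

```python
import numpy as np

trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
log_trans = np.log(trans + 1e-12)

# log p(x_t | state) for 4 frames (rows) and 3 states (columns), toy values
log_obs = np.log(np.array([[0.8, 0.1, 0.1],
                           [0.3, 0.6, 0.1],
                           [0.1, 0.7, 0.2],
                           [0.1, 0.2, 0.7]]))

T, S = log_obs.shape
delta = np.full((T, S), -np.inf)
backptr = np.zeros((T, S), dtype=int)
delta[0] = np.log([1.0, 1e-12, 1e-12]) + log_obs[0]    # start in state 0

for t in range(1, T):
    for s in range(S):
        scores = delta[t - 1] + log_trans[:, s]
        backptr[t, s] = np.argmax(scores)
        delta[t, s] = scores.max() + log_obs[t, s]

# Backtrack the most probable hidden state sequence
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(backptr[t, path[-1]]))
print(path[::-1])   # e.g. [0, 1, 1, 2]
```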



1990-2015: Bayes, HMMs, GMMs

Language models: from N-grams to RNNs

For a sequence of T words W = w_1, w_2, ..., w_T:

P(W) = ∏_{k=1}^{T} P(w_k | w_1, w_2, ..., w_{k-1})    (2)

P(W) = ∏_{k=1}^{T} P(w_k | h)    (3)

n-gram LM: h = w_{k-n+1}, w_{k-n+2}, ..., w_{k-1}
recurrent neural network LM: h = rnn_state(E(w_1), E(w_2), ..., E(w_{k-1}))
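A toy maximum-likelihood bigram LM illustrating equations (2)-(3) with h reduced to the previous word; the tiny corpus and the absence of smoothing are assumptions made only for the example.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(word, prev):
    # Maximum-likelihood estimate of P(word | prev); real LMs add smoothing
    return bigrams[(prev, word)] / unigrams[prev]

# Product of conditional probabilities for W = "the cat sat"
words = ["the", "cat", "sat"]
p = 1.0
for prev, w in zip(words, words[1:]):
    p *= p_bigram(w, prev)
print(p)   # P(cat|the) * P(sat|cat) = 2/3 * 1/2
```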





2015: neural nets to the rescue

NNs in the 90s and 00s

Introduced to speech recognition in the 80s and 90s, but extremely slow and poor in performance compared to the state-of-the-art HMM/GMM
Several papers published by ICSI, CMU, and IDIAP several decades ago!
Pros: no assumption about a specific data distribution
Cons: slow and do not scale to large tasks



2015: neural nets to the rescue

NNs for acoustic modeling (1990-2010)


In most approaches, NNs model the posterior probability p(s|x) of an HMM state s given an acoustic observation x
Existing HMM speech recognizers can be used
This model is known as the hybrid NN-HMM and was introduced by Renals et al. (1994)

Figure: Hybrid NN-HMM



2015: neural nets to the rescue

NNs for language modeling (1990-2010)

Rescoring a lattice of output hypotheses using an NN LM instead of an N-gram
Introduced by Bengio et al. (2003)
Extended to large vocabulary speech recognition (Schwenk, 2007)
Reducing computational complexity
using a shortlist at the output layer (Schwenk, 2007)
hierarchical decomposition of output probabilities (Morin and Bengio, 2005; Mnih and Hinton, 2008; Le et al., 2011)
Recurrent neural networks were used in LM training (Mikolov et al., 2010)



2015: neural nets to the rescue

Deep learning breakthrough

Like in vision, due to
More data
ex: (2015) Librispeech (en) 1,000h (Panayotov et al., 2015)
ex: (2016) Baidu Deep Speech 2 (en) 12,000h (Amodei et al., 2016)
ex: (2017) Google Home (en) 18,000h (from a Google presentation)
ex: (2018) Google wav2words (en) >100,000h (informal discussion)
ex: (2021) Meta XLS-R >436,000h (Babu et al., 2021) (self-supervised)
ex: (2022) OpenAI Whisper model trained on 680,000 hours of multilingual speech 4 (Radford et al., 2022)
Computation (ex: GPUs)
Better optimization algorithms and training objectives
ASR toolkits (ex: Kaldi (Povey et al., 2011)) and DL frameworks (Tensorflow, Pytorch)

4 >75 years of speech!
2015: neural nets to the rescue

End-to-end ASR (get rid of HMMs)

Approaches for end-to-end ASR
Connectionist Temporal Classification (CTC)
Solves the problem of unaligned input and output sequences by marginalizing the conditional likelihood of the output sequence given the input over all possible alignments
Attention Modeling
Simultaneously optimize alignment and grapheme (or word) decoding, using attention weights (a linear combination of hidden states) to influence the generated output
Transducer-based
Decouples the acoustic model from the language model; an elegant way to leverage larger amounts of raw text (for the LM)





2020: main ASR architectures

CTC loss function

Figure: CTC Overview
2020: main ASR architectures

CTC overview
Connectionist Temporal Classification
The model learns to align the transcript itself during training (Graves et al., 2006)
Defined over a label sequence z (of length M)
a blank symbol allows an M-length target sequence to be mapped to a T-length sequence x 6
z can be represented by the set of all possible CTC paths (sequences of labels, at frame level) that map to z
ex: M=2 (z = hi) and T=3 (3 frames): possible paths are 'hhi', 'hii', '-hi', 'h-i', 'hi-' (where '-' is the blank)
The probability p(z|x) is evaluated as the sum of probabilities over all possible CTC paths (using Forward-Backward)
Generates frame posteriors at decoding time

6 gives the model the ability to say that a certain audio frame did not produce a character
2020: main ASR architectures

CTC loss function

Figure: CTC Loss 7

The CTC loss can be very expensive to compute
The problem is that there can be a massive number of alignments
We can compute the loss much faster with a dynamic programming algorithm (a sketch with PyTorch's built-in implementation follows below)

7 from https://www.assemblyai.com/blog/end-to-end-speech-recognition-pytorch
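A minimal sketch of that dynamic-programming loss using PyTorch's built-in nn.CTCLoss; the random "encoder outputs", shapes and target lengths are toy stand-ins for a real acoustic model.

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 28   # frames, batch size, characters + blank (index 0)

# Stand-in for encoder outputs: per-frame log-probabilities of shape (T, N, C)
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)   # sums over all alignments via forward-backward
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```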
2020: main ASR architectures

CTC inference
Greedy decoding (see the sketch below)

Beam-search decoding

Figure: from https://www.assemblyai.com/blog/end-to-end-speech-recognition-pytorch
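A sketch of the greedy decoding rule (best class per frame, collapse repeats, drop blanks), run here on toy posteriors over {blank, 'h', 'i'}.

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    # Best class per frame, then collapse repeats and drop blanks
    best = log_probs.argmax(dim=-1).tolist()
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out

# Toy frame posteriors over {blank, 'h', 'i'} for 5 frames
log_probs = torch.log(torch.tensor([
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]))
print(ctc_greedy_decode(log_probs))   # [1, 2] -> "hi"
```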
2020: main ASR architectures

Attention modeling
Architecture similar to neural machine translation
Speech encoder based on CNNs or pyramidal LSTMs ?

Figure: Encoder-decoder with attention (image from Alexandre Berard's thesis)
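A minimal sketch of the attention step in this architecture, on toy tensors: the context vector c_t is a weighted sum of encoder states, with weights given by a softmax over similarity scores against the current decoder state (dot-product scoring here for brevity; the papers cited below use learned additive scoring).

```python
import torch
import torch.nn.functional as F

T, d = 120, 256
h = torch.randn(T, d)      # encoder states (e.g. pyramidal LSTM outputs), toy values
s_t = torch.randn(d)       # current decoder state

scores = h @ s_t                     # (T,) similarity of each frame with s_t
alpha = F.softmax(scores, dim=0)     # attention weights over frames
c_t = alpha @ h                      # (d,) context vector fed to the decoder
print(c_t.shape, float(alpha.sum()))
```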



2020: main ASR architectures

Attention modeling

Initially proposed for (neural) machine translation (Bahdanau et al., 2014) and introduced for ASR by Chorowski et al. (2015)
A context (attention) model is a function of the encoder codes and of the previously decoded tokens
A speech encoder is defined (CNNs, pyramidal LSTMs)
While CTC generates frame-level posteriors, attention models generate L predictions until the end-of-sequence symbol (no posterior for a given frame)
A well-known issue with attention and CTC models is the thin lattices we end up with



2020: main ASR architectures

Attention modeling (different view)

Also called LAS: Listen (encode); Attend (attention); and Spell (decode)

Image from https://lorenlugosch.github.io/posts/2020/11/transducer/



2020: main ASR architectures

Attention modeling (alignment)

Allows non-monotonic alignments
As opposed to CTC (monotonic)

Image from https://lorenlugosch.github.io/posts/2020/11/transducer/



2020: main ASR architectures

CTC versus Attention

CTC provides monotonic alignments while attention allows non-monotonic alignments
Attention is more suitable for speech translation
Attention induces auto-regressive decoding (one token at a time) as attention depends on the decoder's state
slow inference
CTC does not have this constraint and is much simpler
faster inference
Can we do better?
Yes, transducer models



2020: main ASR architectures

Transducer models

Problems with CTC
The output sequence length M has to be smaller than the input sequence length T (prevents models that do a lot of input pooling)
The outputs are assumed to be independent of each other: CTC models often produce wrong outputs like "I eight food"



2020: main ASR architectures

Transducer models

Transducers solve both problems
The predictor is a language model
The joiner is a simple feed-forward network

Image from https://lorenlugosch.github.io/posts/2020/11/transducer/



2020: main ASR architectures

Transducer models
Interesting features
If the encoder is causal (not using something like a bidirectional RNN), then search can run in an online/streaming fashion
The predictor only has access to y (not x), unlike the decoder in an attention model, so we can easily pre-train the predictor on text-only data
Naturally defines an alignment between x and y (a joiner sketch follows below)

Image from https://lorenlugosch.github.io/posts/2020/11/transducer/
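A minimal sketch of a transducer joiner on toy tensors: a small feed-forward network combines the encoder output at frame t with the predictor output after u emitted labels to give logits over the vocabulary plus blank (module names and sizes are illustrative, not a specific toolkit's API).

```python
import torch
import torch.nn as nn

class Joiner(nn.Module):
    def __init__(self, enc_dim, pred_dim, hidden, vocab_plus_blank):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, hidden)
        self.out = nn.Linear(hidden, vocab_plus_blank)

    def forward(self, enc_t, pred_u):
        # Combine acoustic and label-history information, then score all symbols
        return self.out(torch.tanh(self.proj(torch.cat([enc_t, pred_u], dim=-1))))

joiner = Joiner(enc_dim=512, pred_dim=256, hidden=512, vocab_plus_blank=29)
enc_t = torch.randn(512)    # acoustic encoder output at frame t (toy)
pred_u = torch.randn(256)   # predictor (LM) output after u emitted tokens (toy)
print(joiner(enc_t, pred_u).shape)   # torch.Size([29])
```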


2021: Self-Supervised Learning (SSL) for Speech

Self supervised representation learning

Using huge amounts of unlabeled data for training; targets are computed from the signal itself
"learn representations using objective functions similar to those used for supervised learning, but train networks to perform pretext tasks where both the inputs and labels are derived from an unlabeled dataset" (from Chen et al. (2020))
Introduced for vision: see for instance (Chen et al., 2020)
learn representations by contrasting positive pairs against negative pairs
Also introduced in NLP: see for instance (Devlin et al., 2018)
learn representations by predicting tokens that were masked in an input sequence



2021: Self-Supervised Learning (SSL) for Speech

Previous works

Stacked restricted Boltzmann machines (RBM) (Hinton and Salakhutdinov, 2006)
the hidden layer extracts relevant features from the observations that serve as input to the next RBM stacked on top of it, forming a deterministic feed-forward neural network
Denoising autoencoders (AE) (Vincent et al., 2008)
networks tasked with reconstructing outputs from their (noisy) input versions
Variational autoencoders (VAE) (Kingma and Welling, 2013)
a VAE is like a traditional AE in which the encoder produces distributions over latent representations (rather than deterministic encodings) while the decoder is trained on samples from this distribution
both encoder and decoder are trained jointly
VQ-VAE (van den Oord et al., 2017) replaces continuous latent vectors with deterministically quantized versions



2021: Self-Supervised Learning (SSL) for Speech

Pre-trained language models


Leverage large amounts of freely available unlabeled text to facilitate transfer learning in NLP
Yield state-of-the-art results on a wide range of NLP tasks + save time and computational resources
Example of BERT (Devlin et al., 2018), based on the Transformer model (Vaswani et al., 2017)



2021: Self-Supervised Learning (SSL) for Speech

Self supervised representation learning from speech

Autoregressive predictive coding (APC) (Chung et al., 2019; Chung and Glass, 2020)
considers the sequential structure of speech and predicts information about a future frame
Contrastive Predictive Coding (CPC) (Baevski et al., 2019; Schneider et al., 2019a; Kahn et al., 2019)
an easier learning objective which consists in distinguishing a true future audio frame from negatives
Other approaches for feature representation learning use multiple self-supervised tasks (Pascual et al., 2019; Ravanelli et al., 2020) or bidirectional encoders (Song et al., 2019; Liu et al., 2020; Wang et al., 2020)



2021: Self-Supervised Learning (SSL) for Speech

Autoregressive predictive coding (APC)


Predicting the spectrum of a future frame (rather than a waveform sample) (Chung et al., 2019)
Somewhat inspired by language models (LMs) for text, which are typically a probability distribution over sequences of T tokens (t_1, t_2, ..., t_T):

P(sequence) = ∏_{k=1}^{T} P(t_k | t_1, t_2, ..., t_{k-1})    (4)

P(sequence) = ∏_{k=1}^{T} P(t_k | h)    (5)

Recurrent neural network LM: h = rnn_state(E(t_1), E(t_2), ..., E(t_{k-1}))
For speech, each token t_k corresponds to a frame rather than a word or character token
2021: Self-Supervised Learning (SSL) for Speech

Autoregressive predictive coding (APC)

No final set of target tokens (the softmax layer is replaced by a regression layer)
The learnable parameters in APC are the RNN parameters θ_rnn and the regression layer parameters θ_r
To encourage APC to infer more global structures rather than the local information in the signal
ask the model to predict a frame n steps ahead of the current one
The model is optimized by minimizing the L1 loss between the sequence (x_1, x_2, ..., x_T) and the predicted sequence (y_1, y_2, ..., y_T):

∑_{i=1}^{T−n} |t_i − y_i|,  with t_i = x_{i+n}    (6)
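A minimal APC-style sketch of equation (6) on toy tensors: a unidirectional LSTM reads frames x_1..x_T, a regression layer predicts the frame n steps ahead, and the model is trained with an L1 loss (dimensions and data are made up for the example).

```python
import torch
import torch.nn as nn

T, d, n = 100, 80, 3                      # frames, feature dimension, prediction step
x = torch.randn(1, T, d)                  # e.g. a sequence of log-mel frames (toy data)

lstm = nn.LSTM(d, 512, batch_first=True)  # unidirectional, as in APC
regression = nn.Linear(512, d)

hidden, _ = lstm(x)                       # (1, T, 512) hidden states
y = regression(hidden)                    # predicted frames
loss = (y[:, : T - n] - x[:, n:]).abs().mean()   # L1 between y_i and t_i = x_{i+n}
loss.backward()
print(loss.item())
```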



2021: Self-Supervised Learning (SSL) for Speech

Autoregressive predictive coding (APC)


Chung et al. (2019) model APC with a multi-layer unidirectional LSTM with residual connections
After training, the RNN hidden states are taken as the learned representations
A follow-up work (Chung and Glass, 2020) adds an auxiliary objective that serves as regularization to improve generalization



2021: Self-Supervised Learning (SSL) for Speech

Differences between speech and text SSL


Input speech representations (MFCCs for instance) are already in vector form (no embedding layer)
More uncertainty
text (discrete): finite number of possible outcomes (target tokens)
speech and video (continuous): infinite number of frames that can plausibly follow a given audio (or video) clip

Figure: Figure from https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence
2021: Self-Supervised Learning (SSL) for Speech

Contrastive Predictive Coding (CPC)


Idea proposed by van den Oord et al. (2018)
Maybe an easier learning objective (classification instead of regression)
Uses a contrastive loss that distinguishes a true future audio sample from negatives
Example: wav2vec (Schneider et al., 2019b), which relies on a fully convolutional architecture
The learned representations were applied to improve a supervised ASR system

Figure: Figure from (Schneider et al., 2019b)
2021: Self-Supervised Learning (SSL) for Speech

Contrastive Predictive Coding (CPC)


Encoder network Z = f(X); 5 (causal) convolution layers; local feature representations z_i encode 30 ms of audio every 10 ms
Context network C = g(Z); 9 (causal) convolution layers; mixes multiple z_i (receptive field of dimension v corresponding to 210 ms) into a single contextualized representation c_i
The model is trained to distinguish a sample z_{i+k} that is k steps in the future from distractor samples z̃ drawn from a proposal distribution p_n, by minimizing a contrastive loss for each step k = 1, ..., K (a sketch of this contrastive step follows below)
Negative examples are sampled by uniformly choosing distractors from each audio sequence: p_n(z) = 1/T where T is the sequence length
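A minimal sketch of one contrastive step k on toy tensors: the context vector c_i must pick the true future latent z_{i+k} out of a candidate set that also contains distractors, trained with a cross-entropy over similarity scores (wav2vec additionally uses learned step-specific transforms, omitted here).

```python
import torch
import torch.nn.functional as F

d, n_distractors = 256, 10
c_i = torch.randn(d)                      # context representation at time i (toy)
z_true = torch.randn(d)                   # true latent k steps in the future (toy)
z_neg = torch.randn(n_distractors, d)     # distractors sampled from the sequence (toy)

candidates = torch.cat([z_true.unsqueeze(0), z_neg], dim=0)   # true future is index 0
logits = candidates @ c_i                                      # similarity scores
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
print(loss.item())
```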



2021: Self-Supervised Learning (SSL) for Speech

Representation learning with multiple self-supervised tasks

Problem-agnostic speech encoder (PASE) (Pascual et al., 2019)
PASE+: robust speech recognition in noisy and reverberant environments (Ravanelli et al., 2020)



2021: Self-Supervised Learning (SSL) for Speech

Representation learning with multiple self-supervised tasks

Problem-agnostic speech encoder (PASE) (Pascual et al., 2019)


Jointly tackles multiple self-supervised tasks using an ensemble of neural networks that cooperate to discover good speech representations
The approach requires consensus across tasks, so it is more likely to learn general, robust, and transferable features
The authors find that such representations outperform more traditional hand-crafted features in different speech classification tasks such as speaker identification, emotion classification, and ASR



2021: Self-Supervised Learning (SSL) for Speech

Problem-agnostic speech encoder (PASE)


Encoder: SincNet (Ravanelli and Bengio, 2018) + Conv blocks (receptive field 150 ms)
Workers: one for each task (see next slide)

Figure: Figure from (Pascual et al., 2019)


2021: Self-Supervised Learning (SSL) for Speech

Problem-agnostic speech encoder (PASE)

Regression workers solve 7 self-supervised tasks
trained to minimize the mean squared error (MSE) between the target features and the network predictions
Waveform: learns to reconstruct waveforms
LPS: reconstructs the log power spectrum
MFCC: reconstructs mel-frequency cepstral coefficients
Prosody: predicts 4 basic prosodic features per frame
LIM (local info max): contrastive task where a positive sample is drawn from the same utterance and a negative sample from another random utterance (which likely belongs to a different speaker)
GIM (global info max): similar to LIM, using global representations (averaged over 1 s) instead of local ones
SPC (sequence predicting coding): similar to the contrastive predictive coding (CPC) introduced earlier



2021: Self-Supervised Learning (SSL) for Speech

Problem-agnostic speech encoder (PASE)


Experiments on speaker identification, emotion recognition and ASR

Figure: Table from (Pascual et al., 2019)


2021: Self-Supervised Learning (SSL) for Speech

Many follow-up approaches

Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks (Song et al., 2019)
Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders (Liu et al., 2020)
Unsupervised pre-training of bidirectional speech encoders via masked reconstruction (Wang et al., 2020)
Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (Hsu et al., 2021)



2021: Self-Supervised Learning (SSL) for Speech

Speech-XLNet

Speech-XLNet (Song et al., 2019)


Learns speech representations with self-attention networks
BERT-like autoencoding (AE) scheme to train a bidirectional speech representation model (not only left-to-right)
Masks and reconstructs speech frames rather than word tokens (regression instead of classification task)
Encourages the network to learn global structures by shuffling speech frame order (can also be seen as dynamic data augmentation)
Training uses a Mean Absolute Error (MAE) loss over several permutations of the input frames
(Unfortunately) not compared with previous APC and CPC approaches



2021: Self-Supervised Learning (SSL) for Speech

Speech-XLNet

Experiments on Hybrid and end-to-end ASR


Results of Hybrid ASR on TIMIT are reported below

Figure: Table from (Song et al., 2019)



2021: Self-Supervised Learning (SSL) for Speech

Unsupervised speech representation learning with deep bidirectional transformer encoders
Predict the current frame by jointly conditioning on both past and future contexts (Mockingjay (Liu et al., 2020))
Masked acoustic modeling task (randomly mask 15% of input frames) 8
Uses multi-layer transformer encoders and multi-head self-attention
Adds a prediction head (2 layers of feed-forward network with layer-norm) using the last encoder layer as input

Figure: Table from (Liu et al., 2020)

8 Use of additional consecutive masking where consecutive frames Cnum are masked to zero. The model is required to infer on global rather than local structure.
2021: Self-Supervised Learning (SSL) for Speech

Unsupervised speech representation learning with deep bidirectional transformer encoders
Experiments on phoneme classification
With different amounts of annotated data for training

Figure: Figure from (Liu et al., 2020)



2021: Self-Supervised Learning (SSL) for Speech

Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction
Pre-training speech representations via a masked reconstruction loss (Wang et al., 2020)
Masking in both frequency and time to encourage the model to exploit spatio-temporal information (a masking sketch follows the figure below)
An elegant extension of the data augmentation technique SpecAugment (Park et al., 2019)

Figure: Figure from (Wang et al., 2020)
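A minimal sketch of the time and frequency masking on a toy spectrogram (mask positions and widths below are arbitrary); the reconstruction loss is then computed only on the masked regions.

```python
import torch

spec = torch.randn(80, 400)          # (mel channels, frames), toy spectrogram
masked = spec.clone()

t0, t_width = int(torch.randint(0, 350, (1,))), 50
f0, f_width = int(torch.randint(0, 70, (1,))), 10
masked[:, t0 : t0 + t_width] = 0.0   # time mask (as in SpecAugment)
masked[f0 : f0 + f_width, :] = 0.0   # frequency mask

# The model would see `masked` and be trained to reconstruct `spec` on the
# masked positions, e.g. an L1 loss restricted to spec[:, t0 : t0 + t_width]
```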


2021: Self-Supervised Learning (SSL) for Speech

Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction

Figure: From (Wang et al., 2020)


2021: Self-Supervised Learning (SSL) for Speech

Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Encodes speech with CNN layers and then masks spans of the resulting latent speech representations (cf. masked LM)
Learns discrete speech units as latent representations 9
Latent representations are fed to a Transformer network to build contextualized representations
The model is trained with a contrastive task (the true latent has to be distinguished from distractors)

Figure: Figure from (Baevski et al., 2020)

9 The authors found this more effective than non-quantized targets
2021: Self-Supervised Learning (SSL) for Speech

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Similar Conv+Transformer encoder, but
uses a cross-entropy loss (as in BERT) instead of a contrastive loss
discrete targets are built through a separate clustering process
learnt discrete speech units are refined at each iteration (3 iterations for large models)
The X-LARGE version of HuBERT has 1 billion parameters
The model recently outperformed SOTA techniques for speech recognition, generation, and compression





2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

SpeechT5 (Ao et al., 2021)

A multimodal extension of transformer encoder-decoder models such as T5
Encodes or decodes both speech and text with a single model
Maps both acoustic and text information into a shared vector space
Used to initialize ASR (speech-to-text), TTS (text-to-speech), Voice Conversion (VC, speech-to-speech), etc.



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

SpeechT5 (Ao et al., 2021)

A single transformer encoder/decoder backbone
Several modality-specific pre/post nets
standard for text
the encoder pre-net for speech is similar to the CNN blocks of wav2vec 2.0
the decoder pre-net for speech is different (fully connected net + ReLU) as the model outputs slices of filterbank features (not speech directly)
a speaker embedding is concatenated to the output of the speech-decoder pre-net to support voice conversion and multi-speaker TTS



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

SpeechT5 (Ao et al., 2021)

A composite loss with multiple pre-training objectives
a masked language modeling (MLM) loss on discrete latent speech representations (à la HuBERT)
a speech reconstruction L1 loss (in the continuous filterbank space)
a cross-entropy loss specific to the prediction of the stop token
a text denoising objective (à la BART)
a cross-modal objective to better align speech and text representations (unclear in the paper)



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

SpeechT5 (Ao et al., 2021)

Experiments on several downstream speech tasks (ASR, VC, TTS, speaker id.) show slightly better results than speech-only pre-training



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

mSLAM (Bapna et al., 2022)

Extension of the SLAM (Bapna et al., 2021) architecture, where speech and text are unified in a common encoder model
Uses convolution-augmented transformer (conformer) blocks, introduced earlier for ASR (Gulati et al., 2020)
Input is speech, text, or concatenated speech-text
Speech-text pre-training is a mix of self-supervised learning objectives (rather similar to SpeechT5) and supervised cross-modal learning objectives (which leverage aligned speech-text pairs)
2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

mSLAM (Bapna et al., 2022)

SSL learning objectives for speech (à la HuBERT) and text (BERT)
Speech-text objectives
translation language modeling (TLM): predicts masked text or speech spans from a concatenated speech-text input (to encourage use of cross-modal context)
a Connectionist Temporal Classification (CTC) loss is applied on the speech part of the concatenated speech-text input, using the character transcript as target (an ASR loss to learn better speech-text alignments)
2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

mSLAM (Bapna et al., 2022)

Massively multilingual (51 languages for speech; 101 languages for text), 2B parameters
Downstream task experiments
ASR, speech translation, spoken language id., spoken intent classification and text classification
Speech-text pre-training is better than speech-only pre-training for multilingual ASR and translation
2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

mSLAM (Bapna et al., 2022)

Zero-shot cross-modal properties
Zero-shot text translation from a fine-tuned speech translation model
The model has never seen src-txt/tgt-txt parallel data, but it has seen src-speech/tgt-text + monolingual src-txt
... but a system fine-tuned only on text cannot translate speech



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

Moshi Speech-to-Speech Model (Défossez et al., 2024)

Moshi is a Multi-stream Low-latency Speech-to-Speech Dialogue Model (Défossez et al., 2024)
Multi-stream: no explicit turns
Low-latency: they train causal models, which work on windows of 160 ms of speech
Speech-to-Speech: the model receives speech as input and produces both text and speech
Dialogue Model: trained mostly on conversational data



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

Moshi: Architecture

At its core is Helium, a 7B-parameter text language model (LLM) trained on 2T tokens. The model undergoes specialized training for multistream and full-duplex capabilities.

Figure: Figure from Marcely Zanon-Boito



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

Moshi: Mimi, the Audio Encoder/Decoder

Moshi's audio encoder/decoder, Mimi, models both speech semantics and acoustics. Key features include:
Causal convolution followed by a transformer block for signal encoding.
Latent space discretization through vector quantization (VQ).
Semantic speech units distilled from the WavLM speech model.
Output: 8 discrete substreams, 7 for acoustics and 1 for semantics.



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

Moshi: Full-Duplex Communication and Inner Monologue

Moshi achieves full-duplex communication by modeling both user and system audio streams in one unified sequence. An optional inner monologue component improves speech generation by:
Predicting time-aligned text tokens as a prefix to audio tokens.
Enhancing factual accuracy and linguistic quality in generated speech.



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

Moshi: Training

Moshi is pre-trained on:
7 million hours of English speech, transcribed using Whisper.
The Fisher conversational dataset, augmented for multistream post-training.
20k hours of synthetic speech interactions.



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

Moshi: Performance and Limitations

Key performance insights:
Moshi outperforms models like SpeechGPT in spoken QA but shows knowledge degradation compared to Helium.
Inner monologue enhances real-time speech tasks, with 5.7% WER on Librispeech (clean).
Quantization reduces linguistic performance but minimally impacts audio quality.
Future challenges include managing safety and preventing toxic audio generation.



2022-2024: Multimodal Speech-{Text,Speech} Pre-trained Models

Conclusion

Training such models requires access to powerful computing platforms
Simpler and more efficient pre-training approaches are needed
Standardization of the evaluation process is also needed (need for a multimodal and multilingual GLUE)
We have only scratched the surface of the zero-shot capabilities of these models (transfer from text to speech tasks)
More research is needed on the decoder side (especially to generate expressive speech with adequate prosody)





Future: is ASR a solved problem ?

ASR Remaining Challenges in 2024

Figure: ASR challenges in 2024 according to OpenAI o1



Future: is ASR a solved problem ?

Language coverage
Google addresses (only) 100 languages (ASR)
Language technology issues: 300 languages (95% of the population)
Language coverage / revitalisation / documentation issues: > 6000 languages!

Figure: from Laura Welcher, Big Data for Small Languages, The Rosetta Project
Future: is ASR a solved problem ?

Whisper
A massively multilingual ASR system based on weakly-supervised learning (Radford et al., 2022)
trained on 680,000 hours of multilingual and multitask supervised data collected from the web
the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language
enables transcription in multiple languages, as well as translation from those languages into English
the Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer (a usage sketch follows below)
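A minimal usage sketch with the open-source openai-whisper package (pip install openai-whisper; "audio.wav" is a placeholder path).

```python
import whisper

model = whisper.load_model("small")        # multilingual checkpoint
result = model.transcribe("audio.wav")     # language is auto-detected
print(result["text"])

# The same model can also translate speech from other languages into English
result_en = model.transcribe("audio.wav", task="translate")
print(result_en["text"])
```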



Future: is ASR a solved problem ?

Low resource ASR

Figure: Performance of Whisper models on multiple languages



Future: is ASR a solved problem ?

On par with human transcription ?

Figure: Comparison of WER for two speech systems and human-level performance on read speech (from Amodei et al., 2016)

Figure: Comparison of WER for two speech systems and human-level performance on accented speech (from Amodei et al., 2016)
Future: is ASR a solved problem ?

On par with human transcription ?

Figure: Comparison of WER for two speech systems and human-level performance on noisy speech (from Amodei et al., 2016)



Future: is ASR a solved problem ?

Zero resource ASR

In an unknown language, from unannotated raw speech, discover: 10
Invariant subword units (phone units?)
Words/terms (lexicon/semantic units?)
Technological challenge
Can we build useful speech technologies without any textual resources?
Unsupervised ASR / autonomous systems
Scientific challenge
Can we build algorithms that learn languages like infants do?
Can we build algorithms that extract meaningful units from unknown languages?

10 The zero resource challenge: http://zerospeech.com (Dunbar et al., 2017)
Future: is ASR a solved problem ?

Using LLMs for ASR

Deep integration between speech models and LLMs
Architecture of prepending continuous audio embeddings to the text embeddings before feeding them to a decoder-only LLM (a sketch of this pattern follows below)
SpeechGPT (Zhang et al., 2023)
Speech-LLaMA (Wu et al., 2023)
Boosting speech applications with the in-context learning ability of LLMs (Chen et al., 2023)
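A minimal sketch of the "prepend audio embeddings" pattern on toy tensors: a linear projector maps speech-encoder outputs into the LLM embedding space, and the projected frames are concatenated in front of the text-token embeddings before the decoder-only LLM (all modules and sizes here are stand-ins, not a specific model's API).

```python
import torch
import torch.nn as nn

speech_dim, llm_dim = 1024, 4096
projector = nn.Linear(speech_dim, llm_dim)   # maps audio features to LLM space

audio_feats = torch.randn(1, 150, speech_dim)   # e.g. pooled SSL encoder outputs (toy)
text_embeds = torch.randn(1, 12, llm_dim)       # embeddings of the prompt tokens (toy)

inputs_embeds = torch.cat([projector(audio_feats), text_embeds], dim=1)
print(inputs_embeds.shape)   # (1, 162, 4096) -> fed to the decoder-only LLM
```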



Future: is ASR a solved problem ?

Resources

Follow me on Twitter/X: Go to @laurent besacie
Recent thread on Moshi: Go to Twitter thread on Moshi
Blog on Multimodal Speech-Text Models: Read the blog post
Research on Multimodal NLP for HRI: Read the Naver Labs research page on Multimodal NLP for HRI



Future: is ASR a solved problem ?

Questions?

Thank you



Future: is ASR a solved problem ?

References I

Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J.,
Chrzanowski, M., Coates, A., Diamos, G., Elsen, E., Engel, J., Fan, L., Fougner, C., Hannun,
A. Y., Jun, B., Han, T., LeGresley, P., Li, X., Lin, L., Narang, S., Ng, A. Y., Ozair, S.,
Prenger, R., Qian, S., Raiman, J., Satheesh, S., Seetapun, D., Sengupta, S., Wang, C.,
Wang, Y., Wang, Z., Xiao, B., Xie, Y., Yogatama, D., Zhan, J., and Zhu, Z. (2016). Deep
speech 2 : End-to-end speech recognition in english and mandarin. In Proceedings of the
33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA,
June 19-24, 2016, pages 173–182.
Ao, J., Wang, R., Zhou, L., Liu, S., Ren, S., Wu, Y., Ko, T., Li, Q., Zhang, Y., Wei, Z., et al.
(2021). Speecht5: Unified-modal encoder-decoder pre-training for spoken language
processing. arXiv preprint arXiv:2110.07205.
Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., von Platen, P.,
Saraf, Y., Pino, J., Baevski, A., Conneau, A., and Auli, M. (2021). Xls-r: Self-supervised
cross-lingual speech representation learning at scale. arXiv, abs/2111.09296.
Baevski, A., Auli, M., and Mohamed, A. (2019). Effectiveness of self-supervised pre-training for
speech recognition.



Future: is ASR a solved problem ?

References II

Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for
self-supervised learning of speech representations. In Larochelle, H., Ranzato, M., Hadsell,
R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems 33:
Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020,
December 6-12, 2020, virtual.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning
to align and translate. CoRR, abs/1409.0473.
Bapna, A., Cherry, C., Zhang, Y., Jia, Y., Johnson, M., Cheng, Y., Khanuja, S., Riesa, J., and
Conneau, A. (2022). mslam: Massively multilingual joint pre-training for speech and text.
CoRR, abs/2202.01374.
Bapna, A., Chung, Y., Wu, N., Gulati, A., Jia, Y., Clark, J. H., Johnson, M., Riesa, J.,
Conneau, A., and Zhang, Y. (2021). SLAM: A unified encoder for speech and language
modeling via speech-text joint pre-training. CoRR, abs/2110.10329.
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language
model. J. Mach. Learn. Res., 3:1137–1155.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive
learning of visual representations.



Future: is ASR a solved problem ?

References III

Chen, Z., Huang, H., Andrusenko, A., Hrinchuk, O., Puvvada, K. C., Li, J., Ghosh, S., Balam,
J., and Ginsburg, B. (2023). Salm: Speech-augmented language model with in-context
learning for speech recognition and translation.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based
models for speech recognition. CoRR, abs/1506.07503.
Chung, Y., Hsu, W., Tang, H., and Glass, J. R. (2019). An unsupervised autoregressive model
for speech representation learning. CoRR, abs/1904.03240.
Chung, Y.-A. and Glass, J. (2020). Improved speech representations with multi-target
autoregressive predictive coding.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep
bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Dunbar, E., Cao, X., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X., and
Dupoux, E. (2017). The zero resource speech challenge 2017. CoRR, abs/1712.04313.
Défossez, A., Mazaré, L., Orsini, M., Royer, A., Pérez, P., Jégou, H., Grave, E., and Zeghidour,
N. (2024). Moshi: a speech-text foundation model for real-time dialogue.
Graves, A., Fernández, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal
classification: labelling unsegmented sequence data with recurrent neural networks. In ICML,
volume 148 of ACM International Conference Proceeding Series, pages 369–376. ACM.



Future: is ASR a solved problem ?

References IV

Gulati, A., Qin, J., Chiu, C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu,
Y., and Pang, R. (2020). Conformer: Convolution-augmented transformer for speech
recognition. In Meng, H., Xu, B., and Zheng, T. F., editors, Interspeech 2020, 21st Annual
Conference of the International Speech Communication Association, Virtual Event, Shanghai,
China, 25-29 October 2020, pages 5036–5040. ISCA.
Hinton, G. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313(5786):504 – 507.
Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.-E., Karadayi, J.,
Liptchinsky, V., Collobert, R., Fuegen, C., Likhomanenko, T., Synnaeve, G., Joulin, A.,
Mohamed, A., and Dupoux, E. (2019). Libri-light: A benchmark for asr with limited or no
supervision.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes.
Le, H. S., Oparin, I., Allauzen, A., Gauvain, J., and Yvon, F. (2011). Structured output layer
neural network language model. In Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress
Center, Prague, Czech Republic, pages 5524–5527.
Liu, A. T., Yang, S.-w., Chi, P.-H., Hsu, P.-c., and Lee, H.-y. (2020). Mockingjay: Unsupervised
speech representation learning with deep bidirectional transformer encoders. ICASSP 2020 -
2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).



Future: is ASR a solved problem ?

References V
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent
neural network based language model. In Interspeech.
Mnih, A. and Hinton, G. (2008). A scalable hierarchical distributed language model. In In NIPS.
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language model. In
AISTATS’05, pages 246–252.
Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). Librispeech: An ASR corpus
based on public domain audio books. In ICASSP, pages 5206–5210. IEEE.
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. (2019).
Specaugment: A simple data augmentation method for automatic speech recognition.
Interspeech 2019.
Pascual, S., Ravanelli, M., Serrà, J., Bonafonte, A., and Bengio, Y. (2019). Learning
problem-agnostic speech representations from multiple self-supervised tasks. CoRR,
abs/1904.03416.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M.,
Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. (2011). The
kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition
and Understanding. IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2022). Robust
speech recognition via large-scale weak supervision.



Future: is ASR a solved problem ?

References VI

Ravanelli, M. and Bengio, Y. (2018). Speaker recognition from raw waveform with sincnet.
Ravanelli, M., Zhong, J., Pascual, S., Swietojanski, P., Monteiro, J., Trmal, J., and Bengio, Y.
(2020). Multi-task self-supervised learning for robust speech recognition.
Renals, S., Morgan, N., Bourlard, H., Cohen, M., and Franco, H. (1994). Connectionist
probability estimators in HMM speech recognition. IEEE Trans. Speech and Audio
Processing, 2(1):161–174.
Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019a). wav2vec: Unsupervised
Pre-Training for Speech Recognition. In Proc. Interspeech 2019, pages 3465–3469.
Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019b). wav2vec: Unsupervised
pre-training for speech recognition. CoRR, abs/1904.05862.
Schwenk, H. (2007). Continuous space language models. Computer Speech & Language,
21(3):492–518.
Song, X., Wang, G., Wu, Z., Huang, Y., Su, D., Yu, D., and Meng, H. (2019). Speech-xlnet:
Unsupervised acoustic model pretraining for self-attention networks.
van den Oord, A., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive
predictive coding. CoRR, abs/1807.03748.



Future: is ASR a solved problem ?

References VII

van den Oord, A., Vinyals, O., and kavukcuoglu, k. (2017). Neural discrete representation
learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan,
S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages
6306–6315. Curran Associates, Inc.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and
Polosukhin, I. (2017). Attention is all you need. CoRR, abs/1706.03762.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing
robust features with denoising autoencoders.
Wang, W., Tang, Q., and Livescu, K. (2020). Unsupervised pre-training of bidirectional speech
encoders via masked reconstruction.
Wu, J., Gaur, Y., Chen, Z., Zhou, L., Zhu, Y., Wang, T., Li, J., Liu, S., Ren, B., Liu, L., and
Wu, Y. (2023). On decoder-only architecture for speech-to-text and large language model
integration.
Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., and Qiu, X. (2023). Speechgpt:
Empowering large language models with intrinsic cross-modal conversational abilities.

