How To Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not

Uploaded by

ale3265

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views5 pages

How To Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not

Uploaded by

ale3265

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 5

How to Connect Speech Foundation Models and

Large Language Models?

What Matters and What Does Not
Francesco Verdini∗†§ , Pierfrancesco Melucci∗†‡§ , Stefano Perna∗‡§ , Francesco Cariaggi¶ ,
Marco Gaido⋆ , Sara Papi⋆ , Szymon Mazurek♦■ , Marek Kasztelnik■ ,
Luisa Bentivogli⋆ , Sébastien Bratières§¶ , Paolo Merialdo‡ , Simone Scardapane†
† La Sapienza, University of Rome, Italy ‡ Department of Engineering, Roma Tre University, Italy
§ Translated, Italy ¶ Pi School, Italy ⋆
Fondazione Bruno Kessler, Italy
arXiv:2409.17044v1 [cs.CL] 25 Sep 2024

♦ ■
AGH University of Krakow, Poland ACC Cyfronet AGH, Poland
Email: [email protected], [email protected],
[email protected],[email protected]

Abstract—The remarkable performance achieved by Large However, being a recent trend, research efforts have mostly
Language Models (LLM) has driven research efforts to leverage been devoted to demonstrating the effectiveness of this
them for a wide range of tasks and input modalities. In speech-to- paradigm over traditional methods, without delving into the
text (S2T) tasks, the emerging solution consists of projecting the
output of the encoder of a Speech Foundational Model (SFM) into single design choices. Specifically, many architectural so-
the LLM embedding space through an adapter module. However, lutions have been proposed for the length adapter, often
no work has yet investigated how much the downstream-task employed to both reduce the LLM computational costs and
performance depends on each component (SFM, adapter, LLM) the modality mismatch with the textual sequences. These
nor whether the best design of the adapter depends on the chosen methods span from fixed downsampling, obtained either with
SFM and LLM. To fill this gap, we evaluate the combination
of 5 adapter modules, 2 LLMs (Mistral and Llama), and 2 a stack of strided convolutions [9] or with window-level Q-
SFMs (Whisper and SeamlessM4T) on two widespread S2T tasks, Former [3], to modules with variable compression rates that
namely Automatic Speech Recognition and Speech Translation. reduce the input sequence based on its semantic content,
Our results demonstrate that the SFM plays a pivotal role in such as Continuous Integrate-and-Fire (CIF) [10] and CTC
downstream performance, while the adapter choice has moderate compression [11]. Nonetheless, a comprehensive study on the
impact and depends on the SFM and LLM.
Index Terms—automatic speech recognition, speech transla- length adapter choice is missing: while some comparisons are
tion, LLM, foundation models, adapters present in the literature [5], [7], these evaluations are narrow
in scope and assume the optimal solution is independent of
I. I NTRODUCTION the chosen SFM or LLM.
The success of Large Language Models (LLMs) [1] has In this work, we explore whether a one-size-fits-all design
attracted significant interest in extending their capabilities choice for the length adapter exists that maximizes ASR and
to handle various input modalities such as vision [2] and ST performance or if this depends on the selected SFM and
speech [3]. In the speech scenario, several studies [3]–[7] have LLM. Through a systematic comparison of a wide range of
proposed the integration of a pretrained Speech Foundation adapters proposed in the literature and by analyzing their
Model (SFM) encoder with a pretrained LLM through an impact in combination with widely used SFMs (Whisper [12]
adapter module, realizing the SFM+LLM new architectural and SeamlessM4T [13]) and LLMs (Llama [1] and Mistral
paradigm [8]. The adapter can be decomposed into two [14]), our contributions can be summarized as follows:
components, as shown in Fig. 1: a length adapter, which
compresses the speech sequence along the time dimension, and • We prove that performance highly varies when a different
a modality adapter, which maps the compressed input into an SFM is selected (on average, more than 2 COMET points
embedding space compatible with the LLM. The SFM+LLM for ST and 1 WER for ASR), whereas the choice of the
solution exploits, on the one hand, the SFM ability to extract LLM and length adapter has a less pronounced impact
high-quality semantic representations of the speech input and, on the final performance.
on the other, the fluency and vast linguistic knowledge of • We show that there is no one-size-fits-all solution for the
LLMs, achieving competitive scores for widespread tasks such length adapter as its choice highly depends on the selected
as Automatic Speech Recognition (ASR) [9] and Speech SFM and LLM combination.
Translation (ST) [7].
Our codebase will be released under the Apache 2.0 License
∗ Equal contibution. upon paper acceptance.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
II. M ETHODOLOGY
Fig. 1 illustrates the design of SFM+LLM solutions, high-
lighting the three key components: the SFM encoder, the
adapter, and the LLM. The input audio a is first processed
by the SFM encoder SE , whose output goes through the
adapter A and the resulting embeddings are concatenated to
the embeddings P of a textual prompt.1 The operation J,
which joins the audio and prompt embeddings, is the same
as that used in LLaVa [2] for prompting. Defining L as the
LLM, the transcription or translation y is then obtained as
follows:
y = L (J (P, A (SE (a)))) (1)
As the main rationale behind the SFM+LLM solution comes
from the possibility of training a high-quality ASR or ST Fig. 1. Schema of the SFM+LLM architecture.
system without large training datasets – thus with limited
computational costs and memory requirements – we keep L
and SE frozen, training only A. This solution is coherent with so, we follow [4], [15] that use a stack of vanilla Trans-
previous work [15], which showed that the gains obtained former [20] encoder layers with bidirectional self-attention as
by fine-tuning the whole SFM encoder and LLM do not modality adapters, and investigate different methods as length
justify the additional costs. Within this framework, we answer adapters. The adapters are trained using a cross-entropy loss
our research question on the relative importance of the three on the output of the LLM having the transcripts for ASR and
components for the downstream performance by varying each translations for ST as target, unless stated otherwise. Overall,
of them as illustrated in the following sections. we investigate the following 5 types of adapters.
Base. 4 Transformer encoder layers are used and no length
A. SFM Encoder adaptation is performed.
To investigate the impact of the SFM encoder, we use two Conv-based. 2 convolutional layers with stride 2 are intro-
widely recognized SFMs for speech representation extraction: duced after the second layer of the Base adapter. No auxiliary
Whisper [12] and SeamlessM4T [13]. In particular, we use loss is used. The final compression factor is 4.
the large version of both of them, namely Whisper large- CIF-based. Similarly to Conv-based, the adapter is extended
v3,2 and SeamlessM4T v2-large.3 While Whisper is the most by introducing a Continuous Integrate-and-Fire (CIF) [10]
popular SFM in recent works on SFM+LLM [3], [4], [16]– length adapter after the second Transformer layer. CIF is
[18], representing a natural choice, the usage of SeamlessM4T a sequence compression mechanism that accumulates input
has never been explored to the best of our knowledge. features over time and emits an output when a given integra-
Nonetheless, we opted for it not only for its recognized tion threshold is reached, enabling variable-length sequence
quality, but also because its design is very different from compression while preserving key information. To train this
Whisper. SeamlessM4T is built with a customized version of module, we add two auxiliary losses: a Connectionist Tem-
Conformer layers [19] instead of Transformer ones, and also poral Classification (CTC) loss [21] with the transcripts as
the compression factor of the input sequence is very different. target, following [22], and a quantity loss that controls the
While both process audio sequences where each vector of the compression factor. The weight associated to both auxiliary
sequence represents 10ms of audio, Whisper emits one vector losses is 0.1. On average, this corresponds to a compression
every 20ms (2× downsampling), while SeamlessM4T encoder factor of 2 with SeamlessM4T and 12 with Whisper.
returns one vector every 160ms (16× downsampling). In light CTC-based. In this case, the length adapter is a CTC-
of these peculiarities, experimenting with these two SFM based compression [11], which collapses consecutive equal
encoders lets us understand whether their behavior impacts predictions of a CTC module by averaging the corresponding
on the best adapter design (in particular, the length adapter). vectors, trained on the transcripts with an auxiliary CTC loss
as done in CIF-based. On average, this corresponds to a
B. Adapter compression factor of 1.5 for SeamlessM4T and 9 for Whisper.
As we keep the SFM and LLM frozen, we design adapters WLQ-former. This adapter performs both modality and length
with high representation capacity, allowing for an effective adaptation with a window-level Q-Former [3]. This module
mapping of the embeddings to the LLM input space. To do processes variable-length encoded speech sequences by divid-
ing them into fixed-length windows of encoded frames and
1 In our experiments, we use as textual prompt “can you transcribe L ?”
s feeding each of these non-overlapping windows to a Q-former
for ASR, where Ls is the language of the audio, and “can you translate from
Ls to Lt ?” for ST, where Lt is the desired output language.
architecture [23]. The Q-former uses a fixed and configurable
2 https://huggingface.co/openai/whisper-large-v3 number of learnable query vectors to attend to each window
3 https://huggingface.co/facebook/seamless-m4t-v2-large through cross-attention. As a result, the compression factor is
controlled by the window length and the number of queries, TABLE II
which we set to 0.33 seconds and 1 respectively as per [3], and N UMBER OF PARAMETERS FOR EACH ADAPTER .
therefore results in 2 for SeamlessM4T and 16 for Whisper. # trainable parameters (M)
Adapter
len. adapt. mod. adapt. Total
TABLE I Base 0 28.35 28.35
C OMPRESSION RATE FOR EACH CONFIGURATION OF SFM/A DAPTER . Conv-based 25.20 28.35 53.55
CIF-based 28.15 28.35 56.50
SFM Adapter Compression ratio Sampling rate (Hz) CTC-based 25.20 28.35 53.55
Base 1:1 6.25 WLQ-former 33.09
Seamless

CIF-based 3:1 2.08

Conv-based 4:1 1.56
CTC-based 2:1 3.12
WLQ-former 2:1 3.12 and 4 gradient accumulation steps, for a total batch size of
Base 1:1 50.00 160 samples, and took the last checkpoint as the final model.
CIF-based 25:1 2.00 A single training run took approximately 24 hours. We used
Whisper

Conv-based 4:1 12.50

CTC-based 13:1 3.85
AdamW as the optimizer and a cosine scheduler for the
WLQ-former 16:1 3.12 learning rate with 10−4 as the peak value and a linear warmup
of 840 steps. The hyperparameters of all the Transformer
In Table I, we report the various compression factors layers in the adapters are the same as in the BERTBASE model.6
for each adapter. We also distinguish between the SFMs The number of trainable parameters for each configuration is
(SeamlessM4T and Whisper) as their output sequence length reported in Table II.
is very different and this influences the compression factor, Results are reported as COMET7 [27] and WER8 for,
especially for the content-based adapters (CIF-based and CTC- respectively, ST and ASR on the CoVoST 2 test sets. Statistical
based). For the content-based length adapters, we present the significance is computed using bootstrap resampling [28].
estimated compression ratios on the test set, evaluated after
training. Notably, the CIF-based adapter reaches the highest IV. R ESULTS
compression ratio (25:1) with Whisper, which corresponds to In Table III, we report the ASR and ST results of the
a ∼2 Hz frequency of output vectors. The CTC-based, on SFM+LLM architectures with the various combinations of
the other hand, reduces the output sequence to ∼3 Hz using SFMs, LLMs, and length adapters introduced in Section II.
SeamlessM4T while it reduces the sequence to approximately As previous works mostly rely on BLEU [29], we highlight
4 Hz with Whisper. The corresponding textual representations that our Whisper, Base adapter, and Llama model – a basic
by the LLM tokenizers contain ∼2.7 token per second, i.e. configuration adopted by previous SFM+LLM works – scores
slightly longer than those produced by CIF-based and slightly 28.7 BLEU on CoVoST en-de (the most popular language
shorter than those produced by the CTC-based and WLQ- direction), a significantly higher result than the 25.1 BLEU
former. of Qwen-Audio [30], one of the best performing SFM+LLM
solutions. This confirms the soundness of our experimental
C. LLM Decoder settings and the competitiveness of our results.
As LLMs, we select Mistral-7B-Instruct-v0.3,4 and Llama- First, we observe that the choice of the SFM is the most
3.1-8B-Instruct5 . The former is an English-centric model while critical factor in terms of downstream performance. This
the latter is trained in a multilingual setting, covering English, is shown not only by the fact that, in each configuration,
German, French, Italian, Portuguese, Hindi, Spanish, and Thai. the version equipped with SeamlessM4T outperforms the
Moreover, Llama is the most popular choice in previous works counterpart with Whisper on both tasks (ASR and ST) on
[4], [17], [24]. Also in this case, our goal is to maximize the average, but also by the average improvement of more than
difference between the investigated LLMs. 2 COMET on ST and more than 1 WER on ASR of the
best SeamlessM4T configurations (with Llama and the WLQ-
III. E XPERIMENTAL S ETTINGS
former or Base adapters) over the best Whisper ones (with
We trained all our models on CoVoST 2 [25] using English, Llama and the Conv-based adapter for ST and Llama and Base
German, Spanish, French, and Italian as source languages and for ASR).
German and English as target languages, and on MuST-C [26] Instead, the choice of the LLM is less critical, as demon-
using German, French, Italian, Spanish as target languages. strated by the small gap (<0.2 on both ASR and ST) between
The datasets for ASR were obtained by taking the audios in the best configuration with Llama (Seamless as SFM and
the source languages and their corresponding transcriptions for WLQ-former as adapter) and that with Mistral (SeamlessM4T
CoVoST, while for MuST-C we took audios and associated as SFM and Base as adapter). Moreover, the best results on
transcriptions from the English-German partition. We trained average in ST are obtained with Llama, while the best ones
for 2 epochs, corresponding to 28k steps, on 4 NVIDIA in ASR with Mistral.
GH200 96GB GPUs, using a micro-batch size of 10 samples
6 https://huggingface.co/google-bert/bert-base-uncased
4 https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 7 Computed using https://huggingface.co/Unbabel/wmt22-comet-da.
5 https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct 8 Computed using jiwer.
TABLE III
ASR AND ST RESULTS ON C OVO ST TEST SETS . T HE BEST RESULT FOR EACH (SFM, LLM) CONFIGURATION IS UNDERLINED , WHILE THE OVERALL
BEST IS BOLDED . T HE DIFFERENCE WITH BASE IS STATISTICALLY SIGNIFICANT (p < 0.05) UNLESS FOR SCORES MARKED WITH ∗ .

ST - COMET (↑) ASR - WER (↓)

SFM LLM Adapter
en-de de-en es-en fr-en it-en avg en es fr it de avg
Base 84.94 84.75 86.65 84.71 85.42 85.29 6.48 6.56 9.69 7.8 8.36 7.78
Mistral

CIF-based 84.31 84.33 86.31 84.32 85.07 84.87 7.10 6.92 10.23 8.60 9.38 8.45
Conv-based 84.33 84.15 86.20 84.11 84.98 84.75 7.53 7.83 11.38 10.07 11.44 9.65
SeamlessM4T

CTC-based 82.95 82.48 85.20 82.85 83.57 83.41 7.94 7.90 12.51 10.31 12.29 10.19
WLQ-former 84.67 84.71∗ 86.60∗ 84.59 85.29 85.17 6.38∗ 6.80 9.83∗ 8.05 8.48∗ 7.91
Base 85.12 84.15 86.17 84.08 84.78 84.86 7.15 7.46 10.67 9.20 9.96 8.89
Llama 3.1

CIF-based 84.65 83.87 85.98 83.86 84.65 84.60 7.66 7.47∗ 12.36 10.18 10.50 9.63
Conv-based 85.42 84.42 86.43 84.31 85.17 85.15 7.16∗ 7.08 10.79∗ 8.99 9.83∗ 8.77
CTC-based 83.78 82.49 85.21 82.83 83.60 83.58 7.95 8.04 12.17 9.94 11.22 9.90
WLQ-former 85.65 84.84 86.66 84.68 85.39 85.44 6.62 6.69 9.96 7.97 8.71 7.99
Base 78.98 81.38 84.79 81.63 82.69 81.89 11.37 7.57 12.81 10.14 10.88 10.55
Mistral

CIF-based 77.79 80.35 84.11 80.79 81.83 80.99 12.57 8.45 14.24 12.32 13.09 12.13
Conv-based 78.73 81.26 84.72∗ 81.52∗ 82.58 81.76 11.78 7.60 13.23 10.67 11.52 10.96
CTC-based 75.56 76.53 81.75 78.33 78.55 78.14 14.69 10.63 17.15 15.09 16.50 14.81
WLQ-former 79.07∗ 81.44∗ 84.92 81.68∗ 82.92 82.00 11.82 8.21 13.60 15.77 12.55 12.39
Whisper

Base 80.43 82.15 85.21 82.33 83.06 82.64 9.90 6.33 11.27 8.52 9.09 9.02
Llama 3.1

CIF-based 78.32 78.94 82.51 80.09 80.27 80.02 12.82 8.53 14.31 12.80 13.53 12.40
Conv-based 80.84 82.57 85.49 82.60 83.51 83.00 9.90∗ 6.46 11.49∗ 8.75∗ 9.00∗ 9.12
CTC-based 76.47 73.80 80.16 77.19 76.59 76.84 14.02 10.98 17.55 16.29 17.21 15.21
WLQ-former 79.95 81.56 84.88 81.56 82.89 82.17 11.98 7.90 14.52 11.10 12.84 11.67

Second, results clearly show that there is no one-size-fits- models by connecting an SFM encoder and an LLM through
all solution for the length adapter. Interestingly, the LLM an adapter. To this aim, we compared all the combinations of
plays an important role in the choice of the adapter. With 2 SFMs, 2 LLMs, and 5 adapters, which mostly differ for their
Mistral, the Base adapter generally yields the best results, even length reduction module. With comprehensive experiments
though the WLQ-former is competitive, especially in ST. With covering two tasks – ASR and ST – and 5 language directions,
Llama, instead, the best adapter varies with the SFM used. our results demonstrate that the choice of the SFM is the most
While WLQ-former is the best option with SeamlessM4T, the critical factor influencing downstream performance. We also
Conv-based and Base adapters emerge with Whisper, with established that there is no one-size-fits-all solution for the
the former being the best in ST and close in ASR, where length adapter, as the optimal choice varies depending on the
differences among the two are almost always not statistically specific combination of SFM and LLM. Notably, the Base
significant. Across all SFM and LLM configurations and and WLQ-former adapters, which feature very different com-
tasks, the Base adapter always ranks first or second except pression factors, demonstrate strong performance across tasks,
for Llama+SeamlessM4T in ST, where it is third. Moreover, suggesting that reducing sequence length mismatch between
content-based length adapters consistently underperform other speech and text is less crucial than previously assumed.
strategies. Together with the observation that there is no
ACKNOWLEDGMENTS
clear trend of the results with respect to the compression
factor, these insights suggest that reducing the length mismatch This paper has received funding from the European Union’s
between textual and speech representations is not critical for Horizon research and innovation programme under grant
the quality of the outputs. However, reducing the speech agreement No 101135798, project Meetween (My Personal AI
sequence length lowers computational costs, making length Mediator for Virtual MEETtings BetWEEN People). We grate-
adapters still a useful module to consider. fully acknowledge Poland’s high-performance Infrastructure
All in all, our results demonstrate the need for experiment- PLGrid ACC Cyfronet AGH for providing computer facilities
ing in different settings in terms of SFM and LLM when and support within computational grant no PLG/2024/016971.
comparing adapter solutions, as improvements in one specific Marco Gaido was supported by the PNRR project FAIR -
scenario may not generalize. In addition, LLMs show to be Future AI Research (PE00000013), under the NRRP MUR
robust to input sequences of very different lengths, as the program funded by the NextGenerationEU. This work has
Base adapter, which does not compress the speech sequence, been carried out while Pierfrancesco Melucci was enrolled in
and the WLQ-former, which has high compression factors (16 the Italian National Doctorate on Artificial Intelligence run by
with Whisper), achieve competitive scores in most settings. Sapienza University of Rome in collaboration with Università
di Roma Tre.
V. C ONCLUSIONS
This work systematically analyzed the importance and de-
sign of the various building blocks that compose speech-to-text
R EFERENCES [19] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han,
S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-
augmented transformer for speech recognition,” in Interspeech 2020,
[1] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, 2020, pp. 5036–5040.
A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
models,” arXiv preprint arXiv:2407.21783, 2024. L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings
[2] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in Ad- of the 31st International Conference on Neural Information Processing
vances in Neural Information Processing Systems, A. Oh, T. Naumann, Systems, ser. NIPS’17, Red Hook, NY, USA, 2017, p. 6000–6010.
A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36, 2023, [21] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connection-
pp. 34 892–34 916. ist temporal classification: labelling unsegmented sequence data with
[3] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. MA, recurrent neural networks,” in Proceedings of the 23rd International
and C. Zhang, “SALMONN: Towards generic hearing abilities for large Conference on Machine Learning, ser. ICML ’06, New York, NY, USA,
language models,” in The Twelfth International Conference on Learning 2006, p. 369–376.
Representations, 2024. [22] S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-
[4] J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, to-end speech recognition using multi-task learning,” in 2017 IEEE
B. Ren, L. Liu et al., “On decoder-only architecture for speech-to-text International Conference on Acoustics, Speech and Signal Processing
and large language model integration,” in 2023 IEEE Automatic Speech (ICASSP), 2017, pp. 4835–4839.
Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. [23] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: bootstrapping language-
1–8. image pre-training with frozen image encoders and large language mod-
[5] C. Wang, M. Liao, Z. Huang, J. Lu, J. Wu, Y. Liu, C. Zong, and J. Zhang, els,” in Proceedings of the 40th International Conference on Machine
“Blsp: Bootstrapping language-speech pre-training via behavior align- Learning, ser. ICML’23, 2023.
ment of continuation writing,” arXiv preprint arXiv:2309.00916, 2023. [24] H. Zhang, N. Si, Y. Chen, W. Zhang, X. Yang, D. Qu, and X. Jiao,
[6] S. Hu, L. Zhou, S. Liu, S. Chen, H. Hao, J. Pan, X. Liu, J. Li, “Tuning large language model for end-to-end speech translation,” arXiv
S. Sivasankaran, L. Liu et al., “Wavllm: Towards robust and adaptive preprint arXiv:2310.02050, 2023.
speech large language model,” arXiv preprint arXiv:2404.00656, 2024. [25] C. Wang, A. Wu, J. Gu, and J. Pino, “Covost 2 and massively multilin-
[7] C. Wang, M. Liao, Z. Huang, and J. Zhang, “Blsp-kd: Bootstrapping gual speech translation,” in Interspeech 2021, 2021, pp. 2247–2251.
language-speech pre-training via knowledge distillation,” arXiv preprint [26] M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi,
arXiv:2405.19041, 2024. “MuST-C: a Multilingual Speech Translation Corpus,” in Proceedings of
[8] M. Gaido, S. Papi, M. Negri, and L. Bentivogli, “Speech translation with the 2019 Conference of the North American Chapter of the Association
speech foundation models and large language models: What is there for Computational Linguistics: Human Language Technologies, Volume
and what is missing?” in Proceedings of the 62nd Annual Meeting of 1 (Long and Short Papers). Minneapolis, Minnesota: Association for
the Association for Computational Linguistics (Volume 1: Long Papers), Computational Linguistics, Jun. 2019, pp. 2012–2017.
Bangkok, Thailand, Aug. 2024, pp. 14 760–14 778. [27] R. Rei, C. Stewart, A. C. Farinha, and A. Lavie, “COMET: A neural
framework for MT evaluation,” in Proceedings of the 2020 Conference
[9] W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and
on Empirical Methods in Natural Language Processing (EMNLP),
C. Zhang, “Connecting speech encoder and large language model for
Online, Nov. 2020, pp. 2685–2702.
asr,” in ICASSP 2024-2024 IEEE International Conference on Acoustics,
[28] P. Koehn, “Statistical significance tests for machine translation evalu-
Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 637–
ation,” in Proceedings of the 2004 Conference on Empirical Methods
12 641.
in Natural Language Processing, D. Lin and D. Wu, Eds., Barcelona,
[10] L. Dong and B. Xu, “Cif: Continuous integrate-and-fire for end-to- Spain, Jul. 2004, pp. 388–395.
end speech recognition,” in ICASSP 2020 - 2020 IEEE International [29] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, for automatic evaluation of machine translation,” in Proceedings of the
pp. 6079–6083. 40th Annual Meeting of the Association for Computational Linguistics,
[11] M. Gaido, M. Cettolo, M. Negri, and M. Turchi, “CTC-based compres- Philadelphia, Pennsylvania, USA, Jul. 2002, pp. 311–318.
sion for direct speech translation,” in Proceedings of the 16th Conference [30] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou,
of the European Chapter of the Association for Computational Linguis- “Qwen-audio: Advancing universal audio understanding via unified
tics: Main Volume, Online, Apr. 2021, pp. 690–696. large-scale audio-language models,” arXiv preprint arXiv:2311.07919,
[12] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and 2023.
I. Sutskever, “Robust speech recognition via large-scale weak supervi-
sion,” in Proceedings of the 40th International Conference on Machine
Learning, ser. ICML’23, 2023.
[13] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.-A.
Duquenne, H. Elsahar, H. Gong, K. Heffernan, J. Hoffman et al.,
“Seamlessm4t-massively multilingual & multimodal machine transla-
tion,” arXiv preprint arXiv:2308.11596, 2023.
[14] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot,
D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al.,
“Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.
[15] M. Wang, W. Han, I. Shafran, Z. Wu, C.-C. Chiu, Y. Cao, N. Chen,
Y. Zhang, H. Soltau, P. K. Rubenstein, L. Zilka, D. Yu, G. Pundak,
N. Siddhartha, J. Schalkwyk, and Y. Wu, “Slm: Bridge the thin gap
between speech and text foundation models,” in 2023 IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU), 2023, pp.
1–8.
[16] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv,
J. He, J. Lin et al., “Qwen2-audio technical report,” arXiv preprint
arXiv:2407.10759, 2024.
[17] X. Chen, S. Zhang, Q. Bai, K. Chen, and S. Nakamura, “LLaST:
Improved end-to-end speech translation system leveraged by large
language models,” in Findings of the Association for Computational
Linguistics ACL 2024, Bangkok, Thailand and virtual meeting, Aug.
2024, pp. 6976–6987.
[18] S. Khurana, C. Hori, A. Laurent, G. Wichern, and J. Le Roux, “Zerost:
Zero-shot speech translation,” in Interspeech 2024, 2024, pp. 392–396.

When Whisper Meets TTS Domain Adaptation Using Only Synthetic Speech Data
No ratings yet
When Whisper Meets TTS Domain Adaptation Using Only Synthetic Speech Data
13 pages
Imocha - Question2
No ratings yet
Imocha - Question2
11 pages
F. Reif - Fundamentals of Statistical and Thermal Physics PDF
No ratings yet
F. Reif - Fundamentals of Statistical and Thermal Physics PDF
333 pages
A Neural Network Approach For Roughness-Dependent Update of Tyre Friction
No ratings yet
A Neural Network Approach For Roughness-Dependent Update of Tyre Friction
18 pages
2307.11795v1
No ratings yet
2307.11795v1
9 pages
DiTTo-TTS
No ratings yet
DiTTo-TTS
34 pages
Embarrassingly Simple Llm Asr
No ratings yet
Embarrassingly Simple Llm Asr
11 pages
Trend
No ratings yet
Trend
47 pages
2207.00952v1
No ratings yet
2207.00952v1
5 pages
AudioPaLM- A Large Language Model That Can Speak and Listen
No ratings yet
AudioPaLM- A Large Language Model That Can Speak and Listen
27 pages
vocalnet
No ratings yet
vocalnet
15 pages
2407.03169v1
No ratings yet
2407.03169v1
5 pages
LL SM: L L S M: A Arge Anguage and Peech Odel
No ratings yet
LL SM: L L S M: A Arge Anguage and Peech Odel
8 pages
2410.03751v1
No ratings yet
2410.03751v1
20 pages
Thesis Attention-Based Encoder-Decoder Models for Speech Processing
No ratings yet
Thesis Attention-Based Encoder-Decoder Models for Speech Processing
219 pages
Wavllm: Towards Robust and Adaptive Speech Large Language Model
No ratings yet
Wavllm: Towards Robust and Adaptive Speech Large Language Model
21 pages
AudioChatLlama: Towards General-Purpose Speech Abilities For LLMs
No ratings yet
AudioChatLlama: Towards General-Purpose Speech Abilities For LLMs
11 pages
SALM Speech-Augmented Language Model With in-Context Learning for Speech Recognition and Translation
No ratings yet
SALM Speech-Augmented Language Model With in-Context Learning for Speech Recognition and Translation
5 pages
Text To Audio Generation Instruction LLM
No ratings yet
Text To Audio Generation Instruction LLM
15 pages
Aligning Pre-Trained Models For Spoken Language Translation.18294v1
No ratings yet
Aligning Pre-Trained Models For Spoken Language Translation.18294v1
12 pages
llasa
No ratings yet
llasa
25 pages
BLSP: B L - S P - B A C - W: Ootstrapping Anguage Peech RE Training Via Ehavior Lignment of Ontinu Ation Riting
No ratings yet
BLSP: B L - S P - B A C - W: Ootstrapping Anguage Peech RE Training Via Ehavior Lignment of Ontinu Ation Riting
11 pages
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
From Everand
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
Robert Johnson
No ratings yet
Adaptation Algorithms for Neural Network-Based Speech Recognition an Overview
No ratings yet
Adaptation Algorithms for Neural Network-Based Speech Recognition an Overview
34 pages
Domain Adap Asr 1
No ratings yet
Domain Adap Asr 1
5 pages
IJCRT2204469
No ratings yet
IJCRT2204469
5 pages
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-To-Sequence Autoencoder
No ratings yet
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-To-Sequence Autoencoder
5 pages
Incorporating Knowledge Sources Into Statistical Speech Recognition
No ratings yet
Incorporating Knowledge Sources Into Statistical Speech Recognition
20 pages
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
No ratings yet
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
10 pages
2503.04724v1
No ratings yet
2503.04724v1
12 pages
Transformer-Transducer End-to-End Speech Recognition with Self-Attention
No ratings yet
Transformer-Transducer End-to-End Speech Recognition with Self-Attention
5 pages
Context Asr With Llm
No ratings yet
Context Asr With Llm
5 pages
A 1000-Word Vocabulary, Speaker-Independent, Continuous Live-Mode Speech Recognizer Implemented in A Single FPGA
No ratings yet
A 1000-Word Vocabulary, Speaker-Independent, Continuous Live-Mode Speech Recognizer Implemented in A Single FPGA
9 pages
Iaesarticle
No ratings yet
Iaesarticle
10 pages
Synopsis
No ratings yet
Synopsis
5 pages
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
No ratings yet
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
40 pages
Paper_TTS+Conversion
No ratings yet
Paper_TTS+Conversion
13 pages
2408.16725
No ratings yet
2408.16725
10 pages
spoken language models
No ratings yet
spoken language models
40 pages
10 - Recurrent Neural Network Based Speech Emotion
No ratings yet
10 - Recurrent Neural Network Based Speech Emotion
13 pages
Acoustic Word Embeddings MDPI
No ratings yet
Acoustic Word Embeddings MDPI
9 pages
Speech Representation Models For Speech Synthesis and Multimodal Speech Recognition
No ratings yet
Speech Representation Models For Speech Synthesis and Multimodal Speech Recognition
63 pages
Thesis
No ratings yet
Thesis
37 pages
Programming with X10: Definitive Reference for Developers and Engineers
From Everand
Programming with X10: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
ssw9 PS2-13 Wu
No ratings yet
ssw9 PS2-13 Wu
6 pages
Practical Kaldi for Speech Recognition: The Complete Guide for Developers and Engineers
From Everand
Practical Kaldi for Speech Recognition: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Performance Analysis and Comparison of LLMS Based On Transformer Technology
No ratings yet
Performance Analysis and Comparison of LLMS Based On Transformer Technology
12 pages
AnyMAL - An Efficient and Scalable Any-Modality Augmented Language Model
No ratings yet
AnyMAL - An Efficient and Scalable Any-Modality Augmented Language Model
23 pages
2202.01367v1
No ratings yet
2202.01367v1
11 pages
rag-llm-asr
No ratings yet
rag-llm-asr
5 pages
(IJCST-V9I2P18) :swati, Harpreet Kaur
No ratings yet
(IJCST-V9I2P18) :swati, Harpreet Kaur
6 pages
Paper 3
No ratings yet
Paper 3
6 pages
A Smart Texting System For Android Mobile Users: Ms. Anuja Jadhav Prof. Arvind Patil
No ratings yet
A Smart Texting System For Android Mobile Users: Ms. Anuja Jadhav Prof. Arvind Patil
3 pages
BiLSTM_BPTT
No ratings yet
BiLSTM_BPTT
8 pages
BERT Foundations and Applications: Definitive Reference for Developers and Engineers
From Everand
BERT Foundations and Applications: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Survey of Deep Learning Paradigms For Speech Processing
No ratings yet
Survey of Deep Learning Paradigms For Speech Processing
37 pages
8.5 Multilingual Speech Processing
No ratings yet
8.5 Multilingual Speech Processing
24 pages
Important Questions - 22ec34e - SP - Iii Yr - Sem5
No ratings yet
Important Questions - 22ec34e - SP - Iii Yr - Sem5
4 pages
Julian David Echeverry Correa
No ratings yet
Julian David Echeverry Correa
161 pages
Speech Recognition For Mobile Systems: BY: Pratibha Channamsetty Shruthi Sambasivan
No ratings yet
Speech Recognition For Mobile Systems: BY: Pratibha Channamsetty Shruthi Sambasivan
36 pages
Conformer
No ratings yet
Conformer
5 pages
Mathematics of LLMs Part 1 (1)
No ratings yet
Mathematics of LLMs Part 1 (1)
8 pages
Sensors 20 02326 PDF
No ratings yet
Sensors 20 02326 PDF
19 pages
Automata Theory Module 1
100% (2)
Automata Theory Module 1
85 pages
Adding and Subtracting Polynomials
No ratings yet
Adding and Subtracting Polynomials
8 pages
08 - Statistical Interpretation of Entropy
No ratings yet
08 - Statistical Interpretation of Entropy
11 pages
Chap 03
No ratings yet
Chap 03
33 pages
73. Li et al., 2024, Generative AI for Self-Adaptive Systems State of the Art and Research Roadmap
No ratings yet
73. Li et al., 2024, Generative AI for Self-Adaptive Systems State of the Art and Research Roadmap
26 pages
Scilab Tools For PDE's: Application To Time-Reversal: Outlines
No ratings yet
Scilab Tools For PDE's: Application To Time-Reversal: Outlines
21 pages
Crude Column Optimization Aspen Hysys
100% (2)
Crude Column Optimization Aspen Hysys
22 pages
Stats and Ecotrix
No ratings yet
Stats and Ecotrix
194 pages
Credit Card Customer Analysis
No ratings yet
Credit Card Customer Analysis
18 pages
Application of Business Intelligence in The Tourism Industry - A Case Study of A Local Food Festival in Thailand
No ratings yet
Application of Business Intelligence in The Tourism Industry - A Case Study of A Local Food Festival in Thailand
12 pages
Pref
100% (1)
Pref
2 pages
Actuarial Exemptions 2023
No ratings yet
Actuarial Exemptions 2023
1 page
What Is EViews
No ratings yet
What Is EViews
4 pages
NX Nastran 8 Advanced Nonlinear Theory and Modeling Guide
No ratings yet
NX Nastran 8 Advanced Nonlinear Theory and Modeling Guide
395 pages
January 2022 QP
No ratings yet
January 2022 QP
32 pages
RU-CSE-IV SEM TT 2023-2024 - New - Format (2) (1) - 1
No ratings yet
RU-CSE-IV SEM TT 2023-2024 - New - Format (2) (1) - 1
22 pages
Analysis and Design of Algorithms 1st edition by Amrinder Arora ISBN 1634870212 9781634870214 - The ebook with all chapters is available with just one click
100% (9)
Analysis and Design of Algorithms 1st edition by Amrinder Arora ISBN 1634870212 9781634870214 - The ebook with all chapters is available with just one click
90 pages
DS Surprise Test
No ratings yet
DS Surprise Test
9 pages
How To Model Residual Errors To Correct Time Series Forecasts With Python
No ratings yet
How To Model Residual Errors To Correct Time Series Forecasts With Python
22 pages
ME306 Lab 4
No ratings yet
ME306 Lab 4
12 pages
Algorithm - Multiply Polynomials - Stack Overflow
No ratings yet
Algorithm - Multiply Polynomials - Stack Overflow
4 pages
Divide and Conquer Approach Complete
No ratings yet
Divide and Conquer Approach Complete
63 pages
MCQ - 31 PDF
No ratings yet
MCQ - 31 PDF
4 pages
dynamic pat
No ratings yet
dynamic pat
11 pages
Digital Assignment 4: Name-Shiladry Adak REG NO.-18BIT0492
No ratings yet
Digital Assignment 4: Name-Shiladry Adak REG NO.-18BIT0492
10 pages
Adaptive Control 2nd. Edt. by Karl.J.astrom - Solution Manuel
33% (3)
Adaptive Control 2nd. Edt. by Karl.J.astrom - Solution Manuel
46 pages
02 6
No ratings yet
02 6
13 pages