How To Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not
AGH University of Krakow, Poland
ACC Cyfronet AGH, Poland
Email: [email protected], [email protected], [email protected], [email protected]
Abstract—The remarkable performance achieved by Large Language Models (LLM) has driven research efforts to leverage them for a wide range of tasks and input modalities. In speech-to-text (S2T) tasks, the emerging solution consists of projecting the output of the encoder of a Speech Foundation Model (SFM) into the LLM embedding space through an adapter module. However, no work has yet investigated how much the downstream-task performance depends on each component (SFM, adapter, LLM) nor whether the best design of the adapter depends on the chosen SFM and LLM. To fill this gap, we evaluate the combination of 5 adapter modules, 2 LLMs (Mistral and Llama), and 2 SFMs (Whisper and SeamlessM4T) on two widespread S2T tasks, namely Automatic Speech Recognition and Speech Translation. Our results demonstrate that the SFM plays a pivotal role in downstream performance, while the adapter choice has moderate impact and depends on the SFM and LLM.

Index Terms—automatic speech recognition, speech translation, LLM, foundation models, adapters

I. INTRODUCTION

The success of Large Language Models (LLMs) [1] has attracted significant interest in extending their capabilities to handle various input modalities such as vision [2] and speech [3]. In the speech scenario, several studies [3]–[7] have proposed the integration of a pretrained Speech Foundation Model (SFM) encoder with a pretrained LLM through an adapter module, realizing the new SFM+LLM architectural paradigm [8]. The adapter can be decomposed into two components, as shown in Fig. 1: a length adapter, which compresses the speech sequence along the time dimension, and a modality adapter, which maps the compressed input into an embedding space compatible with the LLM. The SFM+LLM solution exploits, on the one hand, the ability of the SFM to extract high-quality semantic representations of the speech input and, on the other, the fluency and vast linguistic knowledge of LLMs, achieving competitive scores on widespread tasks such as Automatic Speech Recognition (ASR) [9] and Speech Translation (ST) [7].

However, being a recent trend, research efforts have mostly been devoted to demonstrating the effectiveness of this paradigm over traditional methods, without delving into the single design choices. Specifically, many architectural solutions have been proposed for the length adapter, which is often employed both to reduce the LLM computational costs and to mitigate the modality mismatch with the textual sequences. These methods span from fixed downsampling, obtained either with a stack of strided convolutions [9] or with a window-level Q-Former [3], to modules with variable compression rates that reduce the input sequence based on its semantic content, such as Continuous Integrate-and-Fire (CIF) [10] and CTC compression [11]. Nonetheless, a comprehensive study on the choice of the length adapter is missing: while some comparisons are present in the literature [5], [7], these evaluations are narrow in scope and assume that the optimal solution is independent of the chosen SFM or LLM.

In this work, we explore whether a one-size-fits-all design choice for the length adapter exists that maximizes ASR and ST performance, or whether this depends on the selected SFM and LLM. Through a systematic comparison of a wide range of adapters proposed in the literature, and by analyzing their impact in combination with widely used SFMs (Whisper [12] and SeamlessM4T [13]) and LLMs (Llama [1] and Mistral [14]), our contributions can be summarized as follows:

• We show that performance varies considerably when a different SFM is selected (on average, more than 2 COMET points for ST and 1 WER point for ASR), whereas the choice of the LLM and length adapter has a less pronounced impact on the final performance.
• We show that there is no one-size-fits-all solution for the length adapter, as its choice highly depends on the selected SFM and LLM combination.

Our codebase will be released under the Apache 2.0 License upon paper acceptance.

∗ Equal contribution.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
II. METHODOLOGY

Fig. 1. Schema of the SFM+LLM architecture.

Fig. 1 illustrates the design of SFM+LLM solutions, highlighting the three key components: the SFM encoder, the adapter, and the LLM. The input audio a is first processed by the SFM encoder S_E, whose output goes through the adapter A; the resulting embeddings are then concatenated with the embeddings P of a textual prompt.1 The operation J, which joins the audio and prompt embeddings, is the same as that used in LLaVa [2] for prompting. Defining L as the LLM, the transcription or translation y is then obtained as follows:

    y = L(J(P, A(S_E(a))))    (1)
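To make the composition in Eq. (1) concrete, the following PyTorch-style sketch chains a frozen SFM encoder, an adapter, and an LLM, prepending the prompt embeddings to the adapted speech embeddings. The module and argument names (sfm_encoder, adapter, llm, inputs_embeds) are illustrative assumptions, not the released implementation.

```python
import torch

def s2t_forward(llm, adapter, sfm_encoder, prompt_embeds, audio_feats):
    """Sketch of Eq. (1): y = L(J(P, A(S_E(a)))).

    `sfm_encoder` maps audio features to frame-level representations,
    `adapter` projects them into the LLM embedding space, and `llm` is a
    decoder-only model that accepts pre-computed input embeddings.
    """
    with torch.no_grad():                   # the SFM encoder S_E is frozen
        speech_states = sfm_encoder(audio_feats)
    speech_embeds = adapter(speech_states)  # length + modality adaptation (A)
    # J: LLaVa-style joining, i.e. concatenating the prompt embeddings P and
    # the adapted speech embeddings along the sequence dimension.
    joined = torch.cat([prompt_embeds, speech_embeds], dim=1)
    return llm(inputs_embeds=joined)        # transcription / translation y
```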
As the main rationale behind the SFM+LLM solution comes from the possibility of training a high-quality ASR or ST system without large training datasets – and thus with limited computational costs and memory requirements – we keep L and S_E frozen, training only A. This choice is consistent with previous work [15], which showed that the gains obtained by fine-tuning the whole SFM encoder and LLM do not justify the additional costs. Within this framework, we answer our research question on the relative importance of the three components for the downstream performance by varying each of them, as illustrated in the following sections.
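In practice, this training regime amounts to disabling gradients everywhere except in the adapter. A minimal PyTorch sketch, assuming generic nn.Module objects; the optimizer choice and learning rate are illustrative and not taken from the paper:

```python
import torch

def freeze(module: torch.nn.Module) -> None:
    """Disable gradient updates for a module and keep it in eval mode."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

def build_optimizer(sfm_encoder, llm, adapter, lr=1e-4):
    # sfm_encoder, llm, and adapter are placeholder nn.Module instances.
    freeze(sfm_encoder)   # S_E stays frozen
    freeze(llm)           # L stays frozen
    # Only the adapter A receives gradient updates (hypothetical lr).
    return torch.optim.AdamW(adapter.parameters(), lr=lr)
```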
A. SFM Encoder

To investigate the impact of the SFM encoder, we use two widely recognized SFMs for speech representation extraction: Whisper [12] and SeamlessM4T [13]. In particular, we use the large version of both, namely Whisper large-v3,2 and SeamlessM4T v2-large.3 While Whisper is the most popular SFM in recent works on SFM+LLM [3], [4], [16]–[18], representing a natural choice, the usage of SeamlessM4T has never been explored, to the best of our knowledge. Nonetheless, we opted for it not only for its recognized quality, but also because its design is very different from that of Whisper. SeamlessM4T is built with a customized version of Conformer layers [19] instead of Transformer ones, and the compression factor applied to the input sequence is also very different. While both process audio sequences in which each vector represents 10 ms of audio, Whisper emits one vector every 20 ms (2× downsampling), whereas the SeamlessM4T encoder returns one vector every 160 ms (16× downsampling). In light of these peculiarities, experimenting with these two SFM encoders lets us understand whether their behavior impacts the best adapter design (in particular, the length adapter).
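The practical consequence of these different frame rates can be illustrated with a quick calculation, based on the 20 ms and 160 ms figures reported above (the model identifiers below are only labels):

```python
# Frame periods stated above: Whisper emits one vector every 20 ms,
# the SeamlessM4T encoder one vector every 160 ms.
FRAME_PERIOD_MS = {"whisper-large-v3": 20, "seamless-m4t-v2-large": 160}

def encoder_seq_len(audio_seconds: float, sfm: str) -> int:
    """Number of vectors the SFM encoder produces for an utterance."""
    return int(audio_seconds * 1000 // FRAME_PERIOD_MS[sfm])

# A 10-second utterance yields 500 vectors with Whisper but only 62 with
# SeamlessM4T, which is why the useful compression factor of the length
# adapter differs so much between the two SFMs.
print(encoder_seq_len(10, "whisper-large-v3"))        # 500
print(encoder_seq_len(10, "seamless-m4t-v2-large"))   # 62
```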
B. Adapter

As we keep the SFM and LLM frozen, we design adapters with high representation capacity, allowing for an effective mapping of the embeddings to the LLM input space. To do so, we follow [4], [15], which use a stack of vanilla Transformer [20] encoder layers with bidirectional self-attention as modality adapters, and investigate different methods as length adapters. The adapters are trained using a cross-entropy loss on the output of the LLM, with the transcripts (for ASR) or the translations (for ST) as target, unless stated otherwise. Overall, we investigate the following 5 types of adapters.
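As a reference for the adapter descriptions that follow, the sketch below illustrates one plausible form of the shared modality-adapter backbone (a stack of vanilla Transformer encoder layers). The final linear projection into the LLM embedding space and all dimensions are our assumptions, not the exact configuration used in the paper.

```python
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Stack of vanilla Transformer encoder layers followed by a projection
    into the LLM embedding space. Layer count and sizes are illustrative."""

    def __init__(self, sfm_dim=1280, llm_dim=4096, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=sfm_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(sfm_dim, llm_dim)

    def forward(self, speech_states):        # (batch, frames, sfm_dim)
        return self.proj(self.encoder(speech_states))
```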
Base. 4 Transformer encoder layers are used and no length adaptation is performed.

Conv-based. 2 convolutional layers with stride 2 are introduced after the second layer of the Base adapter. No auxiliary loss is used. The final compression factor is 4.

CIF-based. Similarly to Conv-based, the adapter is extended by introducing a Continuous Integrate-and-Fire (CIF) [10] length adapter after the second Transformer layer. CIF is a sequence compression mechanism that accumulates input features over time and emits an output when a given integration threshold is reached, enabling variable-length sequence compression while preserving key information. To train this module, we add two auxiliary losses: a Connectionist Temporal Classification (CTC) loss [21] with the transcripts as target, following [22], and a quantity loss that controls the compression factor. The weight associated with both auxiliary losses is 0.1. On average, this corresponds to a compression factor of 2 with SeamlessM4T and 12 with Whisper.

CTC-based. In this case, the length adapter is a CTC-based compression [11], which collapses consecutive equal predictions of a CTC module by averaging the corresponding vectors (a sketch of this collapsing step is given below); it is trained on the transcripts with an auxiliary CTC loss, as done in CIF-based. On average, this corresponds to a compression factor of 1.5 for SeamlessM4T and 9 for Whisper.

WLQ-former. This adapter performs both modality and length adaptation with a window-level Q-Former [3]. This module processes variable-length encoded speech sequences by dividing them into fixed-length windows of encoded frames and feeding each of these non-overlapping windows to a Q-Former architecture [23]. The Q-Former uses a fixed and configurable number of learnable query vectors to attend to each window through cross-attention. As a result, the compression factor is controlled by the window length and the number of queries, which we set to 0.33 seconds and 1, respectively, as per [3]; this results in a compression factor of 2 for SeamlessM4T and 16 for Whisper.
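The collapsing step of the CTC-based length adapter, referenced above, can be sketched as follows. This is a simplified single-utterance illustration rather than the implementation of [11]: batching, padding, and blank-specific handling are omitted.

```python
import torch

def ctc_compress(states: torch.Tensor, ctc_logits: torch.Tensor) -> torch.Tensor:
    """Collapse consecutive frames with the same CTC prediction by averaging
    their vectors.

    `states` is (frames, dim) and `ctc_logits` is (frames, vocab) for a
    single utterance.
    """
    preds = ctc_logits.argmax(dim=-1)        # frame-level CTC predictions
    compressed, start = [], 0
    for t in range(1, len(preds) + 1):
        # Close the current segment when the prediction changes or at the end.
        if t == len(preds) or preds[t] != preds[start]:
            compressed.append(states[start:t].mean(dim=0))
            start = t
    return torch.stack(compressed)           # (segments, dim)
```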
1 In our experiments, we use as textual prompt "can you transcribe L_s?" for ASR, where L_s is the language of the audio, and "can you translate from L_s to L_t?" for ST, where L_t is the desired output language.
2 https://huggingface.co/openai/whisper-large-v3
3 https://huggingface.co/facebook/seamless-m4t-v2-large

TABLE I
COMPRESSION RATE FOR EACH CONFIGURATION OF SFM/ADAPTER.

SFM           Adapter      Compression ratio   Sampling rate (Hz)
SeamlessM4T   Base         1:1                 6.25
[remaining rows of Table I are not recoverable from this extraction]

TABLE II
NUMBER OF PARAMETERS FOR EACH ADAPTER.

Adapter       # trainable parameters (M)
              len. adapt.   mod. adapt.   Total
Base          0             28.35         28.35
Conv-based    25.20         28.35         53.55
CIF-based     28.15         28.35         56.50
CTC-based     25.20         28.35         53.55
WLQ-former    33.09 (joint length and modality adaptation)
[Main results table: COMET (ST) and WER (ASR) scores, per language direction and on average, for every combination of SFM (SeamlessM4T, Whisper), LLM (Llama 3.1, Mistral), and adapter (Base, Conv-based, CIF-based, CTC-based, WLQ-former); the table layout is not recoverable from this extraction.]
Second, the results clearly show that there is no one-size-fits-all solution for the length adapter. Interestingly, the LLM plays an important role in the choice of the adapter. With Mistral, the Base adapter generally yields the best results, even though the WLQ-former is competitive, especially in ST. With Llama, instead, the best adapter varies with the SFM used. While the WLQ-former is the best option with SeamlessM4T, the Conv-based and Base adapters emerge with Whisper, with the former being the best in ST and close in ASR, where the differences between the two are almost always not statistically significant. Across all SFM and LLM configurations and tasks, the Base adapter always ranks first or second, except for Llama+SeamlessM4T in ST, where it is third. Moreover, content-based length adapters consistently underperform the other strategies. Together with the observation that there is no clear trend in the results with respect to the compression factor, these insights suggest that reducing the length mismatch between textual and speech representations is not critical for the quality of the outputs. However, reducing the speech sequence length lowers computational costs, making length adapters still a useful component to consider.

All in all, our results demonstrate the need to experiment with different SFM and LLM settings when comparing adapter solutions, as improvements in one specific scenario may not generalize. In addition, LLMs prove robust to input sequences of very different lengths, as the Base adapter, which does not compress the speech sequence, and the WLQ-former, which has high compression factors (16 with Whisper), achieve competitive scores in most settings.

V. CONCLUSIONS

This work systematically analyzed the importance and design of the various building blocks that compose speech-to-text models obtained by connecting an SFM encoder and an LLM through an adapter. To this aim, we compared all the combinations of 2 SFMs, 2 LLMs, and 5 adapters, which differ mostly in their length reduction module. With comprehensive experiments covering two tasks – ASR and ST – and 5 language directions, our results demonstrate that the choice of the SFM is the most critical factor influencing downstream performance. We also established that there is no one-size-fits-all solution for the length adapter, as the optimal choice varies depending on the specific combination of SFM and LLM. Notably, the Base and WLQ-former adapters, which feature very different compression factors, demonstrate strong performance across tasks, suggesting that reducing the sequence length mismatch between speech and text is less crucial than previously assumed.

ACKNOWLEDGMENTS

This paper has received funding from the European Union's Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People). We gratefully acknowledge Poland's high-performance computing infrastructure PLGrid ACC Cyfronet AGH for providing computer facilities and support within computational grant no. PLG/2024/016971. Marco Gaido was supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU. This work has been carried out while Pierfrancesco Melucci was enrolled in the Italian National Doctorate on Artificial Intelligence run by Sapienza University of Rome in collaboration with Università di Roma Tre.
REFERENCES

[1] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.
[2] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 34892–34916.
[3] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, "SALMONN: Towards generic hearing abilities for large language models," in The Twelfth International Conference on Learning Representations, 2024.
[4] J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu et al., "On decoder-only architecture for speech-to-text and large language model integration," in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.
[5] C. Wang, M. Liao, Z. Huang, J. Lu, J. Wu, Y. Liu, C. Zong, and J. Zhang, "BLSP: Bootstrapping language-speech pre-training via behavior alignment of continuation writing," arXiv preprint arXiv:2309.00916, 2023.
[6] S. Hu, L. Zhou, S. Liu, S. Chen, H. Hao, J. Pan, X. Liu, J. Li, S. Sivasankaran, L. Liu et al., "WavLLM: Towards robust and adaptive speech large language model," arXiv preprint arXiv:2404.00656, 2024.
[7] C. Wang, M. Liao, Z. Huang, and J. Zhang, "BLSP-KD: Bootstrapping language-speech pre-training via knowledge distillation," arXiv preprint arXiv:2405.19041, 2024.
[8] M. Gaido, S. Papi, M. Negri, and L. Bentivogli, "Speech translation with speech foundation models and large language models: What is there and what is missing?" in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, Aug. 2024, pp. 14760–14778.
[9] W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, "Connecting speech encoder and large language model for ASR," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12637–12641.
[10] L. Dong and B. Xu, "CIF: Continuous integrate-and-fire for end-to-end speech recognition," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6079–6083.
[11] M. Gaido, M. Cettolo, M. Negri, and M. Turchi, "CTC-based compression for direct speech translation," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, Apr. 2021, pp. 690–696.
[12] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proceedings of the 40th International Conference on Machine Learning, ser. ICML'23, 2023.
[13] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.-A. Duquenne, H. Elsahar, H. Gong, K. Heffernan, J. Hoffman et al., "SeamlessM4T - Massively multilingual & multimodal machine translation," arXiv preprint arXiv:2308.11596, 2023.
[14] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
[15] M. Wang, W. Han, I. Shafran, Z. Wu, C.-C. Chiu, Y. Cao, N. Chen, Y. Zhang, H. Soltau, P. K. Rubenstein, L. Zilka, D. Yu, G. Pundak, N. Siddhartha, J. Schalkwyk, and Y. Wu, "SLM: Bridge the thin gap between speech and text foundation models," in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.
[16] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., "Qwen2-Audio technical report," arXiv preprint arXiv:2407.10759, 2024.
[17] X. Chen, S. Zhang, Q. Bai, K. Chen, and S. Nakamura, "LLaST: Improved end-to-end speech translation system leveraged by large language models," in Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand and virtual meeting, Aug. 2024, pp. 6976–6987.
[18] S. Khurana, C. Hori, A. Laurent, G. Wichern, and J. Le Roux, "ZeroST: Zero-shot speech translation," in Interspeech 2024, 2024, pp. 392–396.
[19] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented transformer for speech recognition," in Interspeech 2020, 2020, pp. 5036–5040.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS'17, Red Hook, NY, USA, 2017, pp. 6000–6010.
[21] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML '06, New York, NY, USA, 2006, pp. 369–376.
[22] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835–4839.
[23] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," in Proceedings of the 40th International Conference on Machine Learning, ser. ICML'23, 2023.
[24] H. Zhang, N. Si, Y. Chen, W. Zhang, X. Yang, D. Qu, and X. Jiao, "Tuning large language model for end-to-end speech translation," arXiv preprint arXiv:2310.02050, 2023.
[25] C. Wang, A. Wu, J. Gu, and J. Pino, "CoVoST 2 and massively multilingual speech translation," in Interspeech 2021, 2021, pp. 2247–2251.
[26] M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, "MuST-C: a Multilingual Speech Translation Corpus," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 2012–2017.
[27] R. Rei, C. Stewart, A. C. Farinha, and A. Lavie, "COMET: A neural framework for MT evaluation," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, Nov. 2020, pp. 2685–2702.
[28] P. Koehn, "Statistical significance tests for machine translation evaluation," in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, Jul. 2004, pp. 388–395.
[29] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, Jul. 2002, pp. 311–318.
[30] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, "Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models," arXiv preprint arXiv:2311.07919, 2023.