Lightweight On-Device Models in the SeamlessM4T Project - CSDN Blog

Article link: https://blog.csdn.net/gitblog_00415/article/details/148393746


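seamless_communication (facebookresearch/seamless_communication): a project from the Facebook AI Research team that studies and develops seamless communication technology, aiming to make language understanding and generation in human-computer interaction more natural and fluent. Project address: https://gitcode.com/gh_mirrors/se/seamless_communication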

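Project Background

SeamlessM4T is Meta's multilingual, multitask speech translation system. It supports automatic speech recognition (ASR), speech-to-text translation (S2TT), speech-to-speech translation (S2ST), and other tasks. Alongside the large (2.3B-parameter) and medium (1.2B-parameter) models, the project team also developed lightweight models aimed at mobile devices.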

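On-Device Model Characteristics

Model Specifications

Two lightweight model versions are currently available:

UnitY-Small (full version)
- Parameters: 281 million
- Disk size: 747MB
- Supported tasks: S2ST, S2TT, ASR
- Supported languages: English (eng), French (fra), Hindi (hin), Portuguese (por), Spanish (spa)

UnitY-Small-S2T (reduced version)
- Parameters: 235 million
- Disk size: 481MB
- Supported tasks: S2TT, ASR
- Supported languages: same as UnitY-Small

The reduced version drops the second-pass unit decoding stage, which cuts disk size from 747MB to 481MB (about 36%) and makes it a better fit for resource-constrained devices.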
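As a quick check of the figures above, the reduction can be computed directly from the listed parameter counts and disk sizes. This is only arithmetic on the numbers quoted in this post; the variable and function names are illustrative.

```python
# Figures quoted in the model specifications above.
full = {"params_m": 281, "disk_mb": 747}   # UnitY-Small
s2t = {"params_m": 235, "disk_mb": 481}    # UnitY-Small-S2T


def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


disk_cut = reduction_pct(full["disk_mb"], s2t["disk_mb"])
param_cut = reduction_pct(full["params_m"], s2t["params_m"])

print(f"disk size reduction: {disk_cut:.1f}%")
print(f"parameter reduction: {param_cut:.1f}%")
```

The disk size shrinks by a larger fraction than the parameter count, so the two figures should not be used interchangeably when budgeting for a device.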

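Technical Implementation

Model Export and Execution

The on-device models are exported in PyTorch Mobile format (.ptl files), which brings the following advantages:

- Minimal dependencies: running the model does not require the full seamless_communication or fairseq2 frameworks
- Cross-platform: works in both Python and C++ environments
- Efficient inference: optimized for mobile devices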
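To make the two-file setup concrete, here is a minimal sketch of a helper that picks an exported file for a task and checks the target language against the five listed codes. The function name `pick_model_file` and the lookup tables are hypothetical and not part of seamless_communication; the file names are the ones used in the usage example in this post.

```python
# Hypothetical helper, not part of seamless_communication.
SUPPORTED_LANGS = {"eng", "fra", "hin", "por", "spa"}

# Tasks each exported file supports, per the specification list above.
MODEL_TASKS = {
    "unity_on_device.ptl": {"S2ST", "S2TT", "ASR"},
    "unity_on_device_s2t.ptl": {"S2TT", "ASR"},
}


def pick_model_file(task: str, tgt_lang: str) -> str:
    """Return the smallest exported file that supports `task`."""
    if tgt_lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported target language: {tgt_lang!r}")
    # Prefer the S2T-only file: it is smaller on disk (481MB vs 747MB).
    for name in ("unity_on_device_s2t.ptl", "unity_on_device.ptl"):
        if task in MODEL_TASKS[name]:
            return name
    raise ValueError(f"unsupported task: {task!r}")


print(pick_model_file("ASR", "eng"))
print(pick_model_file("S2ST", "spa"))
```

Preferring the smaller file whenever the task allows it is a design choice of this sketch; an application that needs both text and speech output may simply ship the full model.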

Basic Usage Example

import torchaudio
import torch

# Load the audio file (the returned sample rate is discarded here)
audio_input, _ = torchaudio.load("test_audio.wav")

# Load the S2T model
s2t_model = torch.jit.load("unity_on_device_s2t.ptl")

# Run speech-to-text translation (target language set to Spanish)
with torch.no_grad():
    text = s2t_model(audio_input, tgt_lang="spa")
print(text)

# Load the full S2ST model
s2st_model = torch.jit.load("unity_on_device.ptl")

# Run speech-to-speech translation; this model also returns units and a waveform
with torch.no_grad():
    text, units, waveform = s2st_model(audio_input, tgt_lang="spa")

# Save the generated speech as a 16 kHz WAV file
torchaudio.save("output.wav", waveform.unsqueeze(0), sample_rate=16000)
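Because the example above discards the sample rate returned by `torchaudio.load`, it can help to inspect the input file first. The sketch below uses only the standard-library `wave` module to read a WAV header. It assumes the model expects 16 kHz input (the example saves its output at that rate; the input requirement is an assumption here), and `check_wav` is a hypothetical helper name.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000  # assumption: same rate the example uses when saving output


def check_wav(path: str, expected_rate: int = EXPECTED_RATE) -> dict:
    """Read a WAV header and report whether it matches the expected rate."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": frames / rate,
        "needs_resample": rate != expected_rate,
    }


# Demo: write one second of 8 kHz mono silence and inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)      # 16-bit samples
    out.setframerate(8000)
    out.writeframes(b"\x00\x00" * 8000)

info = check_wav(demo_path)
print(info)
```

If `needs_resample` is true, the waveform could be resampled before inference, for example with `torchaudio.functional.resample`.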

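Performance

Translation Quality Evaluation

Model results on the FLEURS dataset:

| Direction | S2TT BLEU | S2ST ASR-BLEU |
|-----------|-----------|---------------|
| English→Hindi | 10.43 | 15.06 |
| English→Portuguese | 21.54 | 17.35 |
| English→Russian | 7.88 | 5.11 |
| English→Spanish | 12.78 | 11.75 |
| Hindi→English | 12.92 | 10.50 |
| Portuguese→English | 22.99 | 24.81 |
| Russian→English | 18.24 | 18.24 |
| Spanish→English | 14.37 | 14.85 |

Note: ASR-BLEU is computed by transcribing the generated speech with a Whisper model and scoring the resulting transcripts with BLEU.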
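The ASR-BLEU procedure described in the note can be sketched as follows: transcribe each generated waveform, then score the transcripts against reference translations. This is a simplified illustration, not the evaluation code behind the table. `transcribe` is a hypothetical callable standing in for a Whisper model, tokenization is plain whitespace splitting, and the BLEU here is an unsmoothed corpus-level BLEU-4 with a single reference per segment.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[str], refs: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus-level BLEU (0-100), one reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matched[n - 1] += sum((h_counts & r_counts).values())
            total[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


def asr_bleu(waveforms: list, refs: List[str],
             transcribe: Callable[[object], str]) -> float:
    """Transcribe generated speech, then score the transcripts with BLEU."""
    transcripts = [transcribe(w) for w in waveforms]
    return corpus_bleu(transcripts, refs)


# Demo with a stand-in "ASR" that just looks up a fixed transcript.
fake_asr = {"wav0": "the cat sat on the mat", "wav1": "hello world again today"}
score = asr_bleu(["wav0", "wav1"],
                 ["the cat sat on the mat", "hello world again today"],
                 transcribe=fake_asr.__getitem__)
print(f"ASR-BLEU: {score:.2f}")
```

Because the score depends on the ASR system as well as the translation model, ASR-BLEU values are only comparable when the same transcription model and text normalization are used.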