Paper-Discussion:
AnyGPT: Unified Multimodal LLM with
Discrete Sequence Modeling
Paper source: https://arxiv.org/abs/2402.12226
Presenter: Haoxiang Shi
Why Multimodal?
Video Sampling
Forms of commercial applications of multimodal LLMs
• MM in – MM out for a chatbot
(Transcription: How about a serene lakeside?)
Background
LLMs excel in NLP tasks.
Straightforward approach: train a multimodal LLM end to end.
Challenges:
• High computational resource requirements.
• Need for large multimodal datasets to maintain LLM performance in language tasks.
Background
Some Explorations:
• Emu (Sun et al., 2023b)
• SEED-LLaMA (Ge et al., 2023b)
• SpeechGPT (Zhang et al., 2023a)
Limitation:
Focus on a single modality.
Background
Any-to-Any Modality:
• NExT-GPT (Wu et al., 2023)
• CoDi-2 (Tang et al., 2023a)
• Unified-IO2 (Lu et al., 2023)
These models use separately pre-trained encoders and decoders.
Drawback:
• Representational inconsistencies between inputs and outputs of the LLMs.
Background
AnyGPT: any-to-any multimodal language model
• Unified Processing: Utilizes discrete representations.
• Versatile Input: Accepts images and audio, converting them into sequences of discrete semantic tokens.
• Capabilities: Maintains perception, understanding, reasoning, and generation at the semantic level in an autoregressive manner.
Overview
MM Tokenizers + LLM + MM De-tokenizers
Advantage:
• Simple Fusion Method: Streamlined integration of modalities.
• Less Complex Structure: Easier to implement and maintain.
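To make the "tokenizers + LLM + de-tokenizers" idea concrete, here is a minimal Python sketch. All names (tokenize_image, detokenize_image, llm_generate, the special token values) are hypothetical stand-ins, not the paper's actual API; they only illustrate how every modality is reduced to one discrete token sequence that a single autoregressive LLM consumes and produces.

```python
# Hypothetical sketch of an AnyGPT-style any-to-any pipeline.
# Every modality is mapped to discrete tokens, processed by one LLM,
# and mapped back by a modality-specific de-tokenizer.

def tokenize_image(image_bytes: bytes) -> list[int]:
    """Stand-in for an image tokenizer (e.g. SEED): image -> discrete codes."""
    return [101, 102, 103]  # dummy codes for illustration

def detokenize_image(codes: list[int]) -> bytes:
    """Stand-in for the image de-tokenizer: discrete codes -> image."""
    return b"<decoded image>"

def llm_generate(prompt_tokens: list[int]) -> list[int]:
    """Stand-in for autoregressive next-token generation by the LLM."""
    return prompt_tokens + [201, 202]  # dummy continuation

# Special tokens marking a modality segment (illustrative values only).
IMG_BOS, IMG_EOS = 9001, 9002

def image_and_text_to_image(user_image: bytes, text_tokens: list[int]) -> bytes:
    # 1. Tokenize the input image into discrete semantic codes.
    image_codes = tokenize_image(user_image)
    # 2. Interleave modalities into a single token sequence.
    prompt = [IMG_BOS, *image_codes, IMG_EOS, *text_tokens]
    # 3. The LLM autoregressively continues the sequence.
    output = llm_generate(prompt)
    # 4. De-tokenize the generated segment back into an image.
    generated_codes = output[len(prompt):]
    return detokenize_image(generated_codes)
```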
Model Details - Image Tokenizer
SEED tokenizer:
• ViT: 16×16 patches
• Causal Q-Former: 32 causal embeddings
• Codebook: sequence of quantized codes
• MLP + SD decoder: restores the generation embedding to the original image
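The slide compresses several steps; the PyTorch sketch below shows only the quantization step that turns the Q-Former's continuous causal embeddings into discrete codebook indices via nearest-neighbour lookup. The shapes (32 embeddings, 768 dimensions, a codebook of 8192 entries) are illustrative assumptions, not the released SEED implementation.

```python
import torch

def quantize(embeddings: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour vector quantization.

    embeddings: (num_tokens, dim)  continuous causal embeddings from the Q-Former
    codebook:   (codebook_size, dim)
    returns:    (num_tokens,) integer codes usable as LLM tokens
    """
    # Distance between every embedding and every codebook entry.
    distances = torch.cdist(embeddings, codebook)  # (num_tokens, codebook_size)
    return distances.argmin(dim=-1)

# Illustrative shapes: 32 causal embeddings of dim 768, codebook of 8192 entries.
q_former_out = torch.randn(32, 768)
codebook = torch.randn(8192, 768)
image_codes = quantize(q_former_out, codebook)
print(image_codes.shape)  # torch.Size([32]) -> 32 discrete image tokens per image
```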
Model Details - Speech Tokenizer
SpeechTokenizer:
• Single-channel audio sequences are transformed into a discretized matrix.
• Eight hierarchical quantizers:
  • First layer: captures semantic content.
  • Layers 2 to 8: encode paralinguistic details.
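A rough sketch of the hierarchical (residual) quantization idea: each layer quantizes what the previous layers failed to capture, and only the first-layer (semantic) tokens go to the LLM while the remaining layers are left to the voice decoder. The code below is an illustrative residual VQ loop with random codebooks and made-up sizes, not the SpeechTokenizer implementation.

```python
import torch

def residual_vector_quantize(x: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Quantize frames layer by layer; each layer encodes the remaining residual.

    x: (frames, dim) continuous speech features
    codebooks: one (codebook_size, dim) tensor per quantizer layer
    returns: (frames, num_layers) integer codes
    """
    residual = x
    codes = []
    for codebook in codebooks:
        idx = torch.cdist(residual, codebook).argmin(dim=-1)  # nearest entry per frame
        residual = residual - codebook[idx]                   # subtract what was captured
        codes.append(idx)
    return torch.stack(codes, dim=-1)

# Illustrative: 8 quantizer layers, as in SpeechTokenizer.
frames, dim, layers, codebook_size = 150, 256, 8, 1024
codebooks = [torch.randn(codebook_size, dim) for _ in range(layers)]
codes = residual_vector_quantize(torch.randn(frames, dim), codebooks)

semantic_tokens = codes[:, 0]   # layer 1: semantic content, fed to the LLM
paralinguistic = codes[:, 1:]   # layers 2-8: timbre/prosody, handled by the decoder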
Model Details - Music Tokenizer
Encodec:
• Quantization using RVQ with four quantizers to efficiently encode audio.
• Music encoding: 5 seconds of music are converted into 250 latent frames.
• Output: a 250 × 4 matrix of codes.
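The numbers on this slide imply a frame rate of 50 latent frames per second (250 frames over 5 seconds) and 250 × 4 = 1000 codes per clip; the snippet below just spells out that arithmetic. The frame rate is inferred from the slide, not quoted from an Encodec configuration.

```python
# Arithmetic implied by the slide: 5 s of music -> 250 latent frames, 4 RVQ quantizers.
clip_seconds = 5
latent_frames = 250
num_quantizers = 4

frame_rate = latent_frames / clip_seconds        # 50 frames per second (inferred)
codes_per_clip = latent_frames * num_quantizers  # 250 x 4 = 1000 discrete codes
print(frame_rate, codes_per_clip)                # 50.0 1000
```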
Model Details - Vocabulary Sampling
Model Details - Backbone
Expanding Vocabulary:
• The vocabulary is expanded with new modality-specific tokens.
Unified Multimodal Language Model:
• Equipped with modality-specific tokenizers; the language model is trained with the next-token prediction loss over the expanded vocabulary.
LLM Choice:
• LLaMA-2 7B
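Expanding the vocabulary amounts to adding new rows to the LLM's token-embedding matrix (and, analogously, to the output head) for the image, speech, and music codes. Below is a minimal PyTorch sketch; the modality vocabulary sizes are illustrative assumptions, and the released AnyGPT code may initialize or tie these weights differently.

```python
import torch
import torch.nn as nn

def expand_embedding(old: nn.Embedding, num_new_tokens: int) -> nn.Embedding:
    """Return a larger embedding whose first rows copy the pretrained text vocabulary."""
    new = nn.Embedding(old.num_embeddings + num_new_tokens, old.embedding_dim)
    with torch.no_grad():
        new.weight[: old.num_embeddings] = old.weight        # keep pretrained text rows
        new.weight[old.num_embeddings:].normal_(std=0.02)    # init new modality-token rows
    return new

# Illustrative sizes: 32k text tokens plus, say, 8192 image + 1024 speech + 4096 music codes;
# hidden size 4096 matches LLaMA-2 7B.
text_embedding = nn.Embedding(32_000, 4096)
multimodal_embedding = expand_embedding(text_embedding, 8192 + 1024 + 4096)
print(multimodal_embedding.num_embeddings)  # 45312
```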
Model Details - Multimodal Generation
Generating high-quality multimodal data:
• Requires a large number of bits, i.e. long token sequences.
• Long sequences demand more computational resources.
Solution: a two-stage approach
• Stage 1: autoregressive LLM modeling for semantic alignment.
• Stage 2: non-autoregressive models generate the high-fidelity multimodal content.
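A schematic of the two-stage decode: the LLM autoregressively emits a short sequence of semantic tokens, and a separate non-autoregressive model then renders them into high-fidelity pixels or audio. Both functions below are hypothetical placeholders that only show the division of labour, not the paper's models.

```python
def stage1_generate_semantic_tokens(prompt_tokens: list[int]) -> list[int]:
    """Stage 1 (autoregressive): the LLM emits a compact sequence of semantic tokens."""
    sequence = list(prompt_tokens)
    for _ in range(32):                                  # one token at a time
        next_token = (sequence[-1] * 31 + 7) % 8192      # dummy next-token rule
        sequence.append(next_token)
    return sequence[len(prompt_tokens):]

def stage2_render(semantic_tokens: list[int]) -> bytes:
    """Stage 2 (non-autoregressive): a decoder turns semantic tokens into
    high-fidelity content in parallel, so the LLM never models raw bits."""
    return bytes(t % 256 for t in semantic_tokens)       # placeholder "rendering"

content = stage2_render(stage1_generate_semantic_tokens([1, 2, 3]))
```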
Dataset Construction - Data Sources
Language-Centric Construction Method:
• Image-Text Datasets:
• LAION-2B
• LAION-COCO
• LAION-Aesthetics
• Speech-Text Datasets:
• Gigaspeech
• Common Voice
• Multilingual LibriSpeech
• Music-Text:
• Crawled over one million music videos and formatted as JSON files.
• Used GPT-4 for caption generation.
Dataset distribution
Dataset Construction - Multimodal Interleaved Instruction Data
Generation of Text-Based Conversations
Text-to-Multimodality Conversion
After filtering, 108K data points were selected.
Experiment
• Image tasks
• Speech tasks
• Music tasks
Experiment
Image tasks:
• Image understanding (benchmark: MS-COCO Captions 2014)
Experiment
Image tasks:
• Image generation (benchmark: MS-COCO Captions 2014)
CLIP Score: measures the similarity between the generated image and the caption of its corresponding real image.
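For reference, CLIP Score reduces to a scaled cosine similarity between the CLIP embedding of the generated image and the CLIP embedding of the caption. The sketch below uses random vectors in place of real CLIP features, and a 100x scale with clipping at zero, which is one common convention; the exact scaling used in the paper's evaluation is not specified on the slide.

```python
import numpy as np

def clip_score(image_embedding: np.ndarray, text_embedding: np.ndarray) -> float:
    """Scaled cosine similarity between CLIP image and text embeddings."""
    cos = np.dot(image_embedding, text_embedding) / (
        np.linalg.norm(image_embedding) * np.linalg.norm(text_embedding)
    )
    return float(100.0 * max(cos, 0.0))

# Placeholder embeddings; in practice these come from a pretrained CLIP model.
rng = np.random.default_rng(0)
img_emb, txt_emb = rng.normal(size=512), rng.normal(size=512)
print(clip_score(img_emb, txt_emb))
```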
Experiment
Speech tasks:
• Automatic Speech Recognition (benchmark: LibriSpeech)
Experiment
Speech tasks:
• Text-to-Speech (Benchmark: VCTK dataset)
Experiment
Music tasks (MusicCaps benchmark):
Limitation and future work
Author Mentions:
• Enhancing LLMs:
  • Higher loss observed compared to unimodal training. Possible use of Mixture of Experts (MoE)?
• Better Tokenizer:
  • The tokenizer's quality sets a ceiling for the model's comprehension and generative potential.
• Longer Context:
  • The maximum length for music generation is 5 seconds.
Limitation and future work
• Performance: how does it scale with the size of the training dataset?
• Two-stage generation method: this approach may contradict their claim to unify the encoder and decoder.
• Computational complexity: can it be reduced?
Haoxiang Shi, D3 student, Sakai Lab, Waseda University
Thanks for your attention
References
1. Zhan, Jun, et al. "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling." arXiv preprint arXiv:2402.12226 (2024).
2. Sun, Quan, et al. "Generative Pretraining in Multimodality." arXiv preprint arXiv:2307.05222 (2023).
3. Ge, Yuying, et al. "Making LLaMA SEE and Draw with SEED Tokenizer." arXiv preprint arXiv:2310.01218 (2023).
4. Zhang, Dong, et al. "SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities." arXiv preprint arXiv:2305.11000 (2023).
5. Wu, Shengqiong, et al. "NExT-GPT: Any-to-Any Multimodal LLM." arXiv preprint arXiv:2309.05519 (2023).
6. Tang, Zineng, et al. "CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
7. Lu, Jiasen, et al. "Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.